Adds options for tracking text length in text vectorizers#195
…aps), and tests for each
…ut feature vector and tested this
…st loading since we changed the arguments to SmartTextVectorizerModel
Codecov Report
@@ Coverage Diff @@
## master #195 +/- ##
==========================================
- Coverage 86.39% 86.26% -0.13%
==========================================
Files 310 310
Lines 10019 10058 +39
Branches 351 548 +197
==========================================
+ Hits 8656 8677 +21
- Misses 1363 1381 +18
case (true, true) =>
  val textLengths = new TextMapLenEstimator[TextMap]().setInput(f +: others).getOutput()
  val nullIndicators = new TextMapNullEstimator[TextMap]().setInput(f +: others).getOutput()
  new VectorsCombiner().setInput(Seq(hashedFeatures, textLengths, nullIndicators): _*).getOutput()
Why not .setInput(hashedFeatures, textLengths, nullIndicators)?
Oh, no idea why I did that. Fixed.
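For readers unfamiliar with the `: _*` syntax discussed above, here is a minimal sketch of why the two call styles are interchangeable. `Combiner` is a hypothetical stand-in with a varargs setter, not the real VectorsCombiner, and the feature names are plain strings chosen for illustration.

```scala
// Toy stand-in for a stage with a varargs input setter (an assumption, not the TransmogrifAI API).
final case class Combiner(inputs: Seq[String] = Seq.empty) {
  def setInput(features: String*): Combiner = copy(inputs = features)
}

object VarargsDemo {
  val hashed = "hashedFeatures"
  val lengths = "textLengths"
  val nulls = "nullIndicators"

  // Wrapping the arguments in a Seq and expanding it again with `: _*` ...
  val viaSeq: Combiner = Combiner().setInput(Seq(hashed, lengths, nulls): _*)

  // ... passes the same elements as listing them directly.
  val direct: Combiner = Combiner().setInput(hashed, lengths, nulls)
}
```

The `Seq(...): _*` form is only needed when the arguments already live in a collection; for a fixed, known set of arguments the direct call is shorter.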
blackListKeys: Array[String] = Array.empty,
others: Array[FeatureLike[TextMap]] = Array.empty,
trackNulls: Boolean = TransmogrifierDefaults.TrackNulls,
trackTextLen: Boolean = TransmogrifierDefaults.TrackTextLen,
val textLenVector = if (args.shouldTrackLen) getLenVector(keysText, rowTextTokenized) else OPVector.empty

categoricalVector.combine(textVector, textNullIndicatorsVector: _*)
categoricalVector.combine(textVector, Seq(textLenVector, textNullIndicatorsVector): _*)
categoricalVector.combine(textVector, textLenVector, textNullIndicatorsVector)
val textColumns = if (textFeatures.nonEmpty) {
  makeVectorColumnMetadata(textFeatures, makeHashingParams()) ++ textFeatures.map(_.toColumnMetaData(isNull = true))
  if (shouldTrackLen) {
    makeVectorColumnMetadata(textFeatures, makeHashingParams()) ++
Hmmm, best I can come up with is:
val textColumns = if (textFeatures.nonEmpty) {
  makeVectorColumnMetadata(textFeatures, makeHashingParams()) ++
    (if (shouldTrackLen) textFeatures.map(_.toColumnMetaData(descriptorValue =
      OpVectorColumnMetadata.TextLenString)) else Array.empty[OpVectorColumnMetadata]) ++
    (if (shouldTrackNulls) textFeatures.map(_.toColumnMetaData(isNull = true))
      else Array.empty[OpVectorColumnMetadata])
} else Array.empty[OpVectorColumnMetadata]
which looks less readable to me...
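One possible way to keep this kind of conditional concatenation readable is to name each optional group before combining them. The following is a toy sketch under that assumption: `Column`, `textColumns`, and the descriptor strings are hypothetical and do not correspond to OpVectorColumnMetadata or any code in this PR.

```scala
object ColumnAssembly {
  // Hypothetical stand-in for a column metadata entry.
  final case class Column(feature: String, descriptor: String)

  def textColumns(features: Seq[String], trackLen: Boolean, trackNulls: Boolean): Seq[Column] =
    if (features.isEmpty) Seq.empty
    else {
      val hashed = features.map(Column(_, "hash"))
      // Each optional group is built only when its flag is set.
      val lens = if (trackLen) features.map(Column(_, "textLen")) else Seq.empty
      val nulls = if (trackNulls) features.map(Column(_, "null")) else Seq.empty
      // Same ordering as the combined vector: hashed values, then lengths, then null indicators.
      hashed ++ lens ++ nulls
    }
}
```

Whether this reads better than the nested if/else in the diff is a judgment call; it mainly trades branching for a few extra named values.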
val textLenVector = if (args.shouldTrackLen) getLenVector(textTokens) else OPVector.empty

categoricalVector.combine(textVector, textNullIndicatorsVector: _*)
categoricalVector.combine(textVector, Seq(textLenVector, textNullIndicatorsVector): _*)
Same here: categoricalVector.combine(textVector, textLenVector, textNullIndicatorsVector)
/**
 * Param that decides whether or not lengths of text are tracked during vectorization
 */
trait TrackTextLenParam extends Params {
Once this param is added, will existing models fail to load?
Yes, existing models will fail to load. That's why I had to regenerate the old model that we use to test loading.
Let's indicate this in the PR description so we won't forget to include it in our release notes.
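The thread does not show how models are serialized, so the following is only a toy illustration of the general failure mode: a loader that requires every constructor argument breaks on artifacts saved before a new argument existed. `ModelArgs`, `loadStrict`, `loadLenient`, and the map-based format are all assumptions made for this sketch.

```scala
import scala.util.Try

object LoadDemo {
  // Hypothetical model arguments; trackTextLen is the newly added field.
  final case class ModelArgs(trackNulls: Boolean, trackTextLen: Boolean)

  // Strict loader: every field must be present in the saved arguments.
  def loadStrict(saved: Map[String, Boolean]): Try[ModelArgs] =
    Try(ModelArgs(saved("trackNulls"), saved("trackTextLen")))

  // Lenient loader: falls back to a caller-supplied default for the new field.
  def loadLenient(saved: Map[String, Boolean], defaultTrackLen: Boolean): Try[ModelArgs] =
    Try(ModelArgs(saved("trackNulls"), saved.getOrElse("trackTextLen", defaultTrackLen)))

  // Arguments as they would have been saved before the new field existed.
  val oldModel: Map[String, Boolean] = Map("trackNulls" -> true)
}
```

In this sketch `loadStrict(oldModel)` fails because the key is missing, while `loadLenient` succeeds with the default. Whether a fallback like that is appropriate for real saved models is a separate design decision.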
meta.history.keys shouldBe Set(f1.name, f2.name)
meta.columns.length shouldBe 12
meta.columns.foreach { col =>
  if (col.index < 4) {
OMG, these if/else blocks are horrible. Any better ideas?! ;)
Not at the moment. An alternative would be to do explicit comparisons on all the array indices, e.g.
meta.columns(1).parentFeatureName shouldBe Seq(f1.name)
meta.columns(1).grouping shouldBe None
which I think is even worse. I don't think the if/elses are that bad - they're just checking certain ranges of the feature vector. I think those explicit comparisons need to be there regardless since it's a unit test checking the output of a specific input.
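A third option, sketched here as a possibility rather than something proposed in the thread, is to map each column index to a named kind with a pattern match and assert on that. The layout parameters (`numHashed`, `numFeatures`) and the `Kind` types are hypothetical and do not reflect the actual column layout in the test.

```scala
object ColumnKinds {
  sealed trait Kind
  case object Hashed extends Kind
  case object TextLen extends Kind
  case object NullIndicator extends Kind

  // Hypothetical layout: hashed columns first, then one length column per
  // feature, then everything after that treated as a null indicator.
  def kindOf(index: Int, numHashed: Int, numFeatures: Int): Kind = index match {
    case i if i < numHashed               => Hashed
    case i if i < numHashed + numFeatures => TextLen
    case _                                => NullIndicator
  }
}
```

A test could then group columns by `kindOf` and check each group once, which keeps the index ranges in a single place instead of repeating them across if/else branches.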
@Jauntbox are you planning to address the comments?
…rifAI into km/text-len-defaults
Related issues
N/A
Describe the proposed solution
This PR adds options for tracking text length in the relevant text vectorizers: SmartTextVectorizer, SmartTextMapVectorizer, and TextMapHashingVectorizer, as well as in the vectorize and smartVectorize shortcuts for Text, TextArea, TextMap, and TextAreaMap.
Describe alternatives you've considered
N/A
Additional context
This is the second part of #187