Transmogrify to use smart vectorizer #63
Conversation
Thanks for the contribution! It looks like @sxd929 is an internal user so signing the CLA is not required. However, we need to confirm this.
val (f, other) = castAs[TextAreaMap](g)
// Explicitly set cleanText to false here in order to match behavior of Text vectorization
f.vectorize(shouldPrependFeatureName = PrependFeatureName, cleanText = false, cleanKeys = CleanKeys,
f.smartVectorize(maxCategoricalCardinality = TextTokenizer.MaxCategoricalCardinality,
I think we should add MaxCategoricalCardinality and the rest of the missing defaults to TransmogrifierDefaults.
@sxd929 I also meant TextTokenizer.AutoDetectLanguage etc.
good catch! true, thanks, I think this arg is the only added default
talked to Leah and fixed, thanks!
@tovbinm I have reverted the changes, but it seems we could also set the defaults in transmogrify itself (for example, AutoDetectLanguage = TextTokenizer.AutoDetectLanguage) and limit their use to transmogrify and vectorize. What do you think?
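For reference, the option discussed in this thread could look roughly like the sketch below: one defaults object owns the new setting and delegates the tokenizer-related settings to the tokenizer's own defaults. The names `TextTokenizerDefaults` and `VectorizerDefaultsSketch` and all placeholder values except 30 are hypothetical stand-ins, not the actual TransmogrifAI definitions.

```scala
// Hypothetical stand-in for the tokenizer's defaults (placeholder values).
object TextTokenizerDefaults {
  val AutoDetectLanguage: Boolean = false
  val MinTokenLength: Int = 1
  val ToLowercase: Boolean = true
}

// Hypothetical central defaults used by transmogrify and vectorize.
trait VectorizerDefaultsSketch {
  // New default introduced by this PR (30, per the PR description).
  val MaxCategoricalCardinality: Int = 30
  val CleanText: Boolean = true
  // Delegate instead of copying literals, so the tokenizer stays the
  // single source of truth for these settings.
  val AutoDetectLanguage: Boolean = TextTokenizerDefaults.AutoDetectLanguage
  val MinTokenLength: Int = TextTokenizerDefaults.MinTokenLength
  val ToLowercase: Boolean = TextTokenizerDefaults.ToLowercase
}

object VectorizerDefaultsSketch extends VectorizerDefaultsSketch
```

With this layout, call sites reference only `VectorizerDefaultsSketch`, and a change to a tokenizer default propagates without touching them.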
@sxd929 I've invited you to the org: https://github.com/salesforce Once accepted, you can kick the CLA bot: https://cla.salesforce.com/status/salesforce/TransmogrifAI/pull/63
Codecov Report
@@            Coverage Diff            @@
##           master      #63     +/-   ##
=========================================
+ Coverage   85.88%   85.88%   +<.01%
=========================================
  Files         294      294
  Lines        9521     9530       +9
  Branches      320      320
=========================================
+ Hits         8177     8185       +8
- Misses       1344     1345       +1
val vectorized = Seq(textMap).transmogrify()
it should "not calculate correlations on hashed text features if asked not to (using vectorizer)" in {
val vectorized = textMap.vectorize(trackNulls = TransmogrifierDefaults.TrackNulls,
Minor: You can just do textMap.vectorize(cleanText = TransmogrifierDefaults.CleanText).
f.smartVectorize(maxCategoricalCardinality = MaxCategoricalCardinality,
  numHashes = DefaultNumOfFeatures, autoDetectLanguage = TextTokenizer.AutoDetectLanguage,
  minTokenLength = TextTokenizer.MinTokenLength, toLowercase = TextTokenizer.ToLowercase,
  prependFeatureName = PrependFeatureName, cleanText = false, cleanKeys = CleanKeys,
Is there a reason why these weren't following the defaults in TransmogrifierDefaults in the first place? CleanText is set to true there.
good point! it seems that this was an issue with vectorizer but smart vectorizer fixed it, fixed, thanks a lot!
@@ -541,7 +540,7 @@ class BadFeatureZooTest extends FlatSpec with TestSparkContext with Logging {
val retrieved = SanityCheckerSummary.fromMetadata(summary.getSummaryMetadata())
// Check that all of the hashed text columns (and the null indicator column itself) are thrown away
Can you change the comments to agree with the new behavior too? The text field is detected as categorical and pivoted now instead of being hashed.
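To illustrate the behavior change this comment refers to, here is a minimal, self-contained sketch of the pivot-or-hash decision. It is not the TransmogrifAI implementation: `SmartVectorizeSketch`, `chooseStrategy`, `pivot`, and `hash` are hypothetical names, the inclusive `<=` comparison is an assumption, and the hashing uses plain `hashCode` rather than whatever hash function the library applies.

```scala
object SmartVectorizeSketch {
  sealed trait Strategy
  case object Pivot extends Strategy
  case object Hash extends Strategy

  // Few distinct non-null values: treat the text as categorical and pivot it.
  // Otherwise fall back to the hashing trick. (Inclusive bound is assumed.)
  def chooseStrategy(values: Seq[Option[String]], maxCategoricalCardinality: Int): Strategy = {
    val distinctCount = values.flatten.distinct.size
    if (distinctCount <= maxCategoricalCardinality) Pivot else Hash
  }

  // One-hot encode each value; the last column is a null indicator.
  def pivot(values: Seq[Option[String]]): Seq[Array[Double]] = {
    val categories = values.flatten.distinct.sorted
    values.map { v =>
      val row = Array.fill(categories.size + 1)(0.0)
      v match {
        case Some(s) => row(categories.indexOf(s)) = 1.0
        case None    => row(categories.size) = 1.0
      }
      row
    }
  }

  // Count whitespace-separated tokens into a fixed number of hash buckets.
  def hash(values: Seq[Option[String]], numHashes: Int): Seq[Array[Double]] =
    values.map { v =>
      val row = Array.fill(numHashes)(0.0)
      v.toSeq.flatMap(_.split("\\s+")).filter(_.nonEmpty).foreach { token =>
        row(java.lang.Math.floorMod(token.hashCode, numHashes)) += 1.0
      }
      row
    }
}
```

A test column with only a handful of distinct values would take the `Pivot` branch under a threshold of 30, which is why comments that talk about "hashed text columns" no longer describe what the test exercises.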
@@ -575,7 +574,7 @@ class BadFeatureZooTest extends FlatSpec with TestSparkContext with Logging {
// Drop the whole hash space but not the null indicator column (it has an indicator group, so does not get
good catch! fixed! thanks!
val vectorized = Seq(textMap).transmogrify()
it should "not calculate correlations on hashed text features if asked not to (using vectorizer)" in {
val vectorized = textMap.vectorize(cleanText = TransmogrifierDefaults.CleanText)
this line seems redundant?
tovbinm left a comment
lgtm! let's merge this as it is now.
vector.v.size < TransmogrifierDefaults.DefaultNumOfFeatures + (TransmogrifierDefaults.TopK + 2) * 3 shouldBe true
vector.v.size >= TransmogrifierDefaults.DefaultNumOfFeatures + 6 shouldBe true
vector.v.size < (TransmogrifierDefaults.TopK + 2) * 5 shouldBe true
vector.v.size >= 10 shouldBe true
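The arithmetic behind these bounds can be made concrete with a small sketch. The values `TopK = 20` and `DefaultNumOfFeatures = 512` are assumptions used only for illustration, as is the reading that each pivoted feature contributes at most `TopK + 2` columns (the top K values plus an "other" column and a null indicator). `VectorSizeBounds` is a hypothetical name.

```scala
object VectorSizeBounds {
  // Assumed values for illustration; check TransmogrifierDefaults for the real ones.
  val TopK: Int = 20
  val DefaultNumOfFeatures: Int = 512

  // Assumed pivot layout: top K values + "other" + null indicator.
  val maxColsPerPivot: Int = TopK + 2

  // Bounds from the two assertions that include the hash space.
  val hashedUpper: Int = DefaultNumOfFeatures + maxColsPerPivot * 3
  val hashedLower: Int = DefaultNumOfFeatures + 6

  // Bounds from the two assertions without the hash space.
  val pivotedUpper: Int = maxColsPerPivot * 5
  val pivotedLower: Int = 10
}
```

Under these assumed values, the pair of assertions without the hash term bounds the vector size well below `DefaultNumOfFeatures`, which is consistent with the text being pivoted rather than hashed.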
Related issues
Refer to issue(s) addressed in this pull request from [Issues]
Change transmogrify to use the smart text vectorizer.
Describe the proposed solution
Change transmogrify to use the smart text vectorizer.
Add the argument cleanKeys to SmartTextMapVectorizer.
Set MaxCategoricalCardinality to 30 and keep the previous defaults for the other settings.
Fix the tests that failed due to the change.
Describe alternatives you've considered
N/A
Additional context
N/A