Specify categorical variables in metadata by michaelweilsalesforce · Pull Request #120 · salesforce/TransmogrifAI

michaelweilsalesforce · 2018-09-07T19:23:11Z

Related issues
The one-hot encoding provided by Transmogrify doesn't specify that the created columns are categorical. As a consequence, Decision Tree and Random Forest will treat these columns as numerics.

Describe the proposed solution
For each feature engineering transformation that creates categorical columns, we add the Binary attribute to their metadata.
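As a rough illustration of what carrying the Binary attribute in column metadata looks like, here is a minimal sketch using Spark ML's attribute classes. It assumes spark-mllib is on the classpath; the column names (`color_red`, `color_blue`, `age`) and the object name are made up for the example and are not taken from this PR.

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, BinaryAttribute, NumericAttribute}
import org.apache.spark.sql.types.StructField

object CategoricalMetadataSketch {
  // Hypothetical columns: two one-hot indicator columns and one real-valued column.
  val attributes: Array[Attribute] = Array[Attribute](
    BinaryAttribute.defaultAttr.withName("color_red").withIndex(0),
    BinaryAttribute.defaultAttr.withName("color_blue").withIndex(1),
    NumericAttribute.defaultAttr.withName("age").withIndex(2)
  )

  // Attach the attributes to the metadata of a vector column named "features".
  val field: StructField = new AttributeGroup("features", attributes).toStructField()

  // Read the attributes back from the schema and check which ones are nominal.
  val nominalFlags: Option[Seq[Boolean]] =
    AttributeGroup.fromStructField(field).attributes.map(_.map(_.isNominal).toSeq)
}
```

Binary attributes report themselves as nominal and numeric attributes do not, which is the distinction a downstream consumer of the metadata can rely on.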

tovbinm · 2018-09-07T22:12:24Z

core/src/test/scala/com/salesforce/op/stages/impl/feature/AttributeTestUtils.scala

+import org.scalatest.Matchers
+import org.scalatest.junit.JUnitRunner
+
+object AttributeTestUtils extends Matchers{


Better to use a trait; it will be cleaner and simpler to use.

Make the return type Assertion.
Here it is:

trait AttributeAsserts { self: Matchers =>
  final def assertNominal(schema: StructField, expectedNominal: Array[Boolean]): Assertion = ???
}

Then mix it in like this:

@RunWith(classOf[JUnitRunner])
class BinaryVectorizerTest extends OpTransformerSpec[OPVector, BinaryVectorizer] with AttributeAsserts { ... }

tovbinm · 2018-09-07T22:18:09Z

features/src/main/scala/com/salesforce/op/utils/spark/OpVectorMetadata.scala

    newColumns: Array[OpVectorColumnMetadata]
  ): OpVectorMetadata = OpVectorMetadata(name, newColumns, history)

+  val textTypes = Seq(MultiPickList, MultiPickListMap, Text, TextArea, TextAreaMap, TextMap, Binary, BinaryMap,


How are Binary and BinaryMap textTypes?

Also, please use FeatureType.shortTypeName[Text] etc. instead.

I want the package name as well, e.g. com.salesforce.op.features.types.MultiPickList

As it is specified in OpVectorColumnMetadata.parentFeatureType

Then use FeatureType.typeName
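The reason the fully qualified name matters can be sketched without TransmogrifAI at all. The helpers below (`fullName`, `shortName`) are hypothetical stand-ins built on `ClassTag`, not the `FeatureType` API, and `java.lang.String` plays the role of a feature type. The sketch assumes the recorded parent type is a fully qualified name, as the comment above describes.

```scala
import scala.reflect.ClassTag

object TypeNameSketch {
  // Hypothetical helpers, standing in for a "full" and a "short" type name lookup.
  def fullName[T](implicit ct: ClassTag[T]): String = ct.runtimeClass.getName
  def shortName[T](implicit ct: ClassTag[T]): String = ct.runtimeClass.getSimpleName

  // Stand-in for the recorded parent feature type, stored with its package name.
  val recordedParentTypes: Seq[String] = Seq("java.lang.String")

  // Only the fully qualified name matches what was recorded.
  val matchesFull: Boolean = recordedParentTypes.contains(fullName[String])
  val matchesShort: Boolean = recordedParentTypes.contains(shortName[String])
}
```

If the comparison were made against short names, a membership check against the recorded values would silently never match.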

tovbinm · 2018-09-07T22:18:28Z

features/src/main/scala/com/salesforce/op/utils/spark/OpVectorMetadata.scala

 package com.salesforce.op.utils.spark

 import com.salesforce.op.FeatureHistory
+import com.salesforce.op.features.types.{Binary, BinaryMap, MultiPickList, MultiPickListMap, Text, TextArea, TextAreaMap, TextList, TextMap}


import com.salesforce.op.features.types._

codecov · 2018-09-09T23:18:33Z

Codecov Report

Merging #120 into master will increase coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #120      +/-   ##
==========================================
+ Coverage   86.18%   86.21%   +0.02%     
==========================================
  Files         297      297              
  Lines        9670     9676       +6     
  Branches      334      539     +205     
==========================================
+ Hits         8334     8342       +8     
+ Misses       1336     1334       -2

Impacted Files	Coverage Δ
...m/salesforce/op/utils/spark/OpVectorMetadata.scala	`84.9% <100%> (+1.92%)`	⬆️
...om/salesforce/op/utils/spark/OpSparkListener.scala	`97.4% <0%> (-1.3%)`	⬇️
.../salesforce/op/features/FeatureBuilderMacros.scala	`100% <0%> (+100%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e200397...23e805d.

tovbinm · 2018-09-09T23:44:13Z

core/src/test/scala/com/salesforce/op/stages/impl/feature/AttributeAsserts.scala

+   * @param expectedNominal Expected array of booleans. True if the field is nominal, false if not.
+   */
+  final def assertNominal(schema: StructField, expectedNominal: Array[Boolean]): Assertion = {
+    val attributes = AttributeGroup.fromStructField(schema).attributes.get


Avoid .get; rather, do attributes.map(_.map(_.isNominal)) shouldBe Some(expectedNominal)
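A small sketch of the idea in plain Scala, with no Spark or ScalaTest dependency: `Attr` is a hypothetical stand-in for an ML attribute, and `Seq` is used instead of `Array` because plain `==` on arrays compares references (ScalaTest's `shouldBe` handles arrays structurally, so the suggestion above is fine as written).

```scala
object NoGetSketch {
  // Hypothetical stand-in for an ML attribute; only the nominal flag matters here.
  final case class Attr(name: String, isNominal: Boolean)

  // Map inside the Option instead of unwrapping it with .get.
  def nominalFlags(attributes: Option[Seq[Attr]]): Option[Seq[Boolean]] =
    attributes.map(_.map(_.isNominal))

  // Comparing against Some(expected) turns a missing attribute list into
  // an ordinary mismatch rather than a NoSuchElementException.
  def matches(attributes: Option[Seq[Attr]], expected: Seq[Boolean]): Boolean =
    nominalFlags(attributes) == Some(expected)
}
```

With this shape, a column that carries no attribute metadata simply fails the comparison, which gives a clearer test failure than an exception thrown from `.get`.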

leahmcguire

Looks good, that is a lot of updated tests! My only real question is whether those are the only types we want to make sure are flagged. Also, text is not actually a binary attribute: by default it is a count of occurrences of the hash. Is there another attribute group type for counts, or should they be numeric?

leahmcguire · 2018-09-10T16:45:45Z

features/src/main/scala/com/salesforce/op/utils/spark/OpVectorMetadata.scala

+  val categoricalTypes = Seq(FeatureType.typeName[MultiPickList], FeatureType.typeName[MultiPickListMap],
+    FeatureType.typeName[Text], FeatureType.typeName[TextArea], FeatureType.typeName[TextAreaMap],
+    FeatureType.typeName[TextMap], FeatureType.typeName[Binary], FeatureType.typeName[BinaryMap],
+    FeatureType.typeName[TextList])


PickList? ComboBox? Country, State, City, ID?

Oh yeah. If it is only hashing + count, let's remove all these Text types. Do we only do hashing on ComboBox, Country, State, City, ID?

We do pivot, so they should be picked up automatically. I think we also do pivot on MultiPickList. So you may want to remove the categorical types check completely and rely only on the indicatorValue.

leahmcguire · 2018-09-10T16:46:41Z

features/src/main/scala/com/salesforce/op/utils/spark/OpVectorMetadata.scala

+      .map { case (_, g) => g.head -> g.map(_.index) }
+    val colMeta = colData.map { case (c, i) =>
+      c.toMetadata(i)
+    }


Nit: unnecessary new line; a hanging => is bad.

tovbinm · 2018-09-10T16:59:27Z

features/src/main/scala/com/salesforce/op/utils/spark/OpVectorMetadata.scala

      .putMetadataArray(OpVectorMetadata.ColumnsKey, colMeta.toArray)
      .putMetadata(OpVectorMetadata.HistoryKey, FeatureHistory.toMetadata(history))
      .build()
+    val attributes = columns.map { c =>


val attributes = columns.map {
  case c if c.indicatorValue.isDefined || categoricalTypes.exists(c.parentFeatureType.contains) =>
    BinaryAttribute.defaultAttr.withName(c.makeColName()).withIndex(c.index)
  case c =>
    NumericAttribute.defaultAttr.withName(c.makeColName()).withIndex(c.index)
}
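The selection rule in that suggestion can be exercised in isolation. The sketch below is a simplified model, not the TransmogrifAI classes: `Column`, `Kind` and `kindOf` are hypothetical, and the type-name strings are only illustrative values.

```scala
object AttributeChoiceSketch {
  sealed trait Kind
  case object BinaryKind extends Kind
  case object NumericKind extends Kind

  // Hypothetical, stripped-down stand-in for a vector column's metadata.
  final case class Column(indicatorValue: Option[String], parentFeatureType: Seq[String])

  // A column is treated as binary if it has an indicator value (it came from a pivot)
  // or if one of its parent feature types is in the categorical list.
  def kindOf(c: Column, categoricalTypes: Seq[String]): Kind = c match {
    case col if col.indicatorValue.isDefined ||
      categoricalTypes.exists(t => col.parentFeatureType.contains(t)) => BinaryKind
    case _ => NumericKind
  }
}
```

Passing an empty `categoricalTypes` list reduces the rule to the indicator-value check alone, which is the simplification raised earlier in the review.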

tovbinm

lgtm

tovbinm · 2018-09-11T01:47:49Z

@michaelweilsalesforce please update the description of the PR to clarify the reasons and the solution.

salesforce-cla · 2021-04-03T15:41:49Z

Thanks for the contribution! It looks like @mweilsalesforce is an internal user so signing the CLA is not required. However, we need to confirm this.

mweilsalesforce added 11 commits September 4, 2018 17:09

Adding Attributes when converting to Metadata

12ce677

Treated Text types as special uses cases

1ac45eb

FirstTest

9c7d9ff

Tested on DateVectorizers

f7546db

Up to GeoLocationTests

96a1d8c

Up to NumericVectorizerTest

0531c68

Up to OpMapVectorizer

8c64851

Up to SmartTextVectorizer

96427de

Up to URLVectorizerTests

d52131a

fix scalastyle

e9e692c

Merge branch 'master' into mw/categorical-metadata

3c1ceeb

michaelweilsalesforce requested review from leahmcguire and tovbinm as code owners September 7, 2018 19:23

michaelweilsalesforce added the work in progress label Sep 7, 2018

michaelweilsalesforce changed the title ~~Mw/categorical metadata~~ Specify categorical variables in metadata Sep 7, 2018

tovbinm reviewed Sep 7, 2018

mweilsalesforce added 3 commits September 7, 2018 16:10

with AttributeAsserts

95d43cc

import com.salesforce.op.features.types._

186d81f

FreatureType.typeName

78edeca

tovbinm reviewed Sep 9, 2018

Avoiding .get

34f516f

leahmcguire reviewed Sep 10, 2018

tovbinm reviewed Sep 10, 2018

Addressing remaining PR comments

23e805d

tovbinm approved these changes Sep 11, 2018

tovbinm merged commit 6d69992 into master Sep 11, 2018

tovbinm deleted the mw/categorical-metadata branch September 11, 2018 01:47

tovbinm added ready for review and removed work in progress labels Sep 13, 2018

ericwayman pushed a commit that referenced this pull request Feb 8, 2019

Specify categorical variables in metadata (#120)

c7d19ac

salesforce-cla bot added the cla:missing label Apr 3, 2021

4 participants
