Specify categorical variables in metadata#120
Conversation
| import org.scalatest.Matchers | ||
| import org.scalatest.junit.JUnitRunner | ||
|
|
||
| object AttributeTestUtils extends Matchers{ |
There was a problem hiding this comment.
- better use a trait - it will be cleaner and simpler to use.
- make return type the
Assertion
here it is:
trait AttributeAsserts {
self: Matchers =>
final def assertNominal(schema: StructField, expectedNominal: Array[Boolean]): Assertion = ???
}then mixin it like this:
@RunWith(classOf[JUnitRunner])
class BinaryVectorizerTest extends OpTransformerSpec[OPVector, BinaryVectorizer] with AttributeAsserts { ... }| newColumns: Array[OpVectorColumnMetadata] | ||
| ): OpVectorMetadata = OpVectorMetadata(name, newColumns, history) | ||
|
|
||
| val textTypes = Seq(MultiPickList, MultiPickListMap, Text, TextArea, TextAreaMap, TextMap, Binary, BinaryMap, |
There was a problem hiding this comment.
- how is
BinaryandBinaryMapare testTypes? - also instead please use
FeatureType.shortTypeName[Text]etc
There was a problem hiding this comment.
I want the package name as well. E.g com.salesforce.features.types. MultiPickList
As it is specified in OpVectorColumnMetadata.parentFeatureType
| package com.salesforce.op.utils.spark | ||
|
|
||
| import com.salesforce.op.FeatureHistory | ||
| import com.salesforce.op.features.types.{Binary, BinaryMap, MultiPickList, MultiPickListMap, Text, TextArea, TextAreaMap, TextList, TextMap} |
There was a problem hiding this comment.
import com.salesforce.op.features.types._
Codecov Report
@@ Coverage Diff @@
## master #120 +/- ##
==========================================
+ Coverage 86.18% 86.21% +0.02%
==========================================
Files 297 297
Lines 9670 9676 +6
Branches 334 539 +205
==========================================
+ Hits 8334 8342 +8
+ Misses 1336 1334 -2
Continue to review full report at Codecov.
|
| * @param expectedNominal Expected array of booleans. True if the field is nominal, false if not. | ||
| */ | ||
| final def assertNominal(schema: StructField, expectedNominal: Array[Boolean]): Assertion = { | ||
| val attributes = AttributeGroup.fromStructField(schema).attributes.get |
There was a problem hiding this comment.
avoid .get, rather do attributes.map(_.map(_.isNominal)) shouldBe Some(expectedNominal)
leahmcguire
left a comment
There was a problem hiding this comment.
Looks good, that is a lot of updated tests! My only real question is if those are the only types we want to make sure are flagged. Also text is not actually a binary attribute by default it is a count of occurrences of the hash. Is there another attribute group type for counts? or should they be numeric?
| val categoricalTypes = Seq(FeatureType.typeName[MultiPickList], FeatureType.typeName[MultiPickListMap], | ||
| FeatureType.typeName[Text], FeatureType.typeName[TextArea], FeatureType.typeName[TextAreaMap], | ||
| FeatureType.typeName[TextMap], FeatureType.typeName[Binary], FeatureType.typeName[BinaryMap], | ||
| FeatureType.typeName[TextList]) |
There was a problem hiding this comment.
picklist? Combo box? country, state, city, id
There was a problem hiding this comment.
Oh yeah If it is only hashing + count, let's remove all these Text types. Do we only do hashing to Combo box, country, state, city, id?
There was a problem hiding this comment.
we do pivot - so they should be picked up automatically. I think we also do pivot on multiPickList. So you may want to remove the categorical types check completely and only rely on the indicatorValue
| .map { case (_, g) => g.head -> g.map(_.index) } | ||
| val colMeta = colData.map { case (c, i) => | ||
| c.toMetadata(i) | ||
| } |
There was a problem hiding this comment.
nit unnecessary new line hanging => is bad
| .putMetadataArray(OpVectorMetadata.ColumnsKey, colMeta.toArray) | ||
| .putMetadata(OpVectorMetadata.HistoryKey, FeatureHistory.toMetadata(history)) | ||
| .build() | ||
| val attributes = columns.map { c => |
There was a problem hiding this comment.
val attributes = columns.map {
case c if c.indicatorValue.isDefined || categoricalTypes.exists(c.parentFeatureType.contains) =>
BinaryAttribute.defaultAttr.withName(c.makeColName()).withIndex(c.index)
case c =>
NumericAttribute.defaultAttr.withName(c.makeColName()).withIndex(c.index)
}|
@michaelweilsalesforce please update the description of the PR to clarify the reasons and the solution. |
|
Thanks for the contribution! It looks like @mweilsalesforce is an internal user so signing the CLA is not required. However, we need to confirm this. |
Related issues
The one-hot encoding provided by Transmogrify doesn't specify that the created columns are categorical. As a consequence, Decision Tree and Random Forest will treat these columns as numerics.
Describe the proposed solution
For each feature engineering transformation that creates categorical columns, we add the
Binaryattribute to their metadata.