Conversation
… that inherit from splitter can use them
Thanks for the contribution! Before we can merge this, we need @AdamChit to sign the Salesforce.com Contributor License Agreement.
@TuanNguyen27 Could you review?
 *
 * @group param
 */
private[op] final val downSampleFraction = new DoubleParam(this, "downSampleFraction",
These can be `protected` instead of `private[op]`.
Also, maybe set a default value of 1.
new DataSplitter()
  .setSeed(seed)
  .setReserveTestFraction(reserveTestFraction)
  .setMaxTrainingSample(maxTrainingSample)
Is this also exposed for the DataCutter class?
Yes, I added it to SplitterParams, which DataCutter has access to (e4b8a92), so that I can use the same set/get functions across DataBalancer, DataCutter and DataSplitter.
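For context, a minimal sketch of the shared set/get pattern this comment describes. The real SplitterParams is a Spark ML Params trait; everything below (trait name, field, default) is illustrative only, not the TransmogrifAI API:

```scala
// Illustrative stand-in: the real SplitterParams uses Spark ML's
// Params/LongParam machinery. This pure-Scala trait just shows how one
// shared trait gives all three splitters the same accessors.
trait SplitterParamsSketch {
  // hypothetical default, mirroring the 1M-row limit discussed in this PR
  protected var maxTrainingSampleValue: Long = 1e6.toLong

  def setMaxTrainingSample(n: Long): this.type = {
    maxTrainingSampleValue = n
    this
  }
  def getMaxTrainingSample: Long = maxTrainingSampleValue
}

// DataBalancer, DataCutter and DataSplitter can then all mix this in.
final class DataSplitterSketch extends SplitterParamsSketch
final class DataCutterSketch extends SplitterParamsSketch
```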
modelSelector.splitter.get.getMaxTrainingSample shouldBe 1000
}

it should "set maxTrainingSample and down-sample" in {
Did you also mean to check that maxTrainingSample was set in this test as well?
I moved testing the maxTrainingSample set/get functions to here and forgot to rename this test. Great catch!
core/src/test/scala/com/salesforce/op/stages/impl/regression/RegressionModelSelectorTest.scala
summary = Option(DataSplitterSummary())
val dataSetSize = data.count().toDouble
val sampleF = getMaxTrainingSample / dataSetSize
val DownSampleFraction = if (getMaxTrainingSample < dataSetSize) sampleF else 1
Could be a `min` here too. Also: lowerCamelCase.
Suggested change:
- val DownSampleFraction = if (getMaxTrainingSample < dataSetSize) sampleF else 1
+ val downSampleFraction = math.min(sampleF, 1.0)
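The suggested `min` is easy to check in isolation. A small sketch (the standalone function name is mine, not from the PR):

```scala
// Down-sample fraction as suggested above: cap at 1.0 so datasets smaller
// than maxTrainingSample are kept whole. Pure function for illustration.
def downSampleFraction(maxTrainingSample: Long, dataSetSize: Double): Double =
  math.min(maxTrainingSample / dataSetSize, 1.0)
```

This also avoids the branch entirely: when the dataset is under the limit the raw ratio exceeds 1.0 and the `min` clamps it.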
core/src/test/scala/com/salesforce/op/stages/impl/tuning/DataSplitterTest.scala
val seed = 1234L
val dataCount = 1000
val MaxTrainingSampleDefault = 1E6.toLong
I would give this a different name so we don't confuse it with the actual default in SplitterParams.
dataSplitter.getMaxTrainingSample shouldBe maxRows
}
It's probably worth checking that the downSampleFraction param was set here as well, for completeness, so that you have checked everything in DataSplitterParams.
…AdamChit/TransmogrifAI into achit/regression-training-limit
…plitterTest.scala Co-Authored-By: Christopher Rupley <crupley@gmail.com>
…egressionModelSelectorTest.scala Co-Authored-By: Christopher Rupley <crupley@gmail.com>
Codecov Report
@@ Coverage Diff @@
## master #413 +/- ##
==========================================
+ Coverage 86.97% 86.99% +0.02%
==========================================
Files 337 337
Lines 11060 11078 +18
Branches 357 597 +240
==========================================
+ Hits 9619 9637 +18
Misses 1441 1441
Continue to review full report at Codecov.
…AdamChit/TransmogrifAI into achit/regression-training-limit
case s if s == classOf[DataSplitterSummary].getName => DataSplitterSummary(
  preSplitterDataCount = metadata.getLong(ModelSelectorNames.PreSplitterDataCount),
  downSamplingFraction = metadata.getDouble(ModelSelectorNames.DownSample)
)
Please add the downsample fraction to the DataCutter params as well.
I added the downsample fraction into the DataCutter params as part of the multi-class classification training limit changes. I'll create the PR for it today.
An unrelated test fails during the Travis CI check (https://travis-ci.com/salesforce/TransmogrifAI/jobs/243091644). I'll create a ticket to increase the tolerance on that test.
Related issues
DataBalancer for binary classification has a parameter that controls the max data passed into modeling - Regression should allow similar limits
Describe the proposed solution
Steps:
Investigate where the check should occur (somewhere in DataSplitter)
Add logic to downsample when the number of records is reached
Add downsampling information to the summary object and log
Add tests to DataSplitterTest and RegressionModelSelectorTest to cover the downsampling logic
Describe alternatives you've considered
Not having a limit for any of the model types. This was not optimal because some Spark models may have very long runtimes or bad behavior with too much data. So the default will be to downsample once we have passed 1M records, and give the user the option to set their own maxTrainingSample if they are OK with working with large datasets.
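A rough sketch of the proposed behavior, assuming the steps above. A seeded Bernoulli sample over a plain Seq stands in for Spark's `DataFrame.sample`; the function and constant names are mine, not the TransmogrifAI API:

```scala
import scala.util.Random

// Hypothetical default matching the 1M-record limit described above.
val MaxTrainingSampleDefault: Long = 1e6.toLong

// Keep the data whole when it is under the limit; otherwise down-sample to
// roughly maxTrainingSample rows. Spark's data.sample(withReplacement = false,
// fraction, seed) is approximated here with a per-row seeded coin flip.
def downSample[T](data: Seq[T], maxTrainingSample: Long, seed: Long): Seq[T] = {
  val fraction = math.min(maxTrainingSample.toDouble / data.size, 1.0)
  if (fraction >= 1.0) data
  else {
    val rng = new Random(seed)
    data.filter(_ => rng.nextDouble() < fraction)
  }
}
```

Like Spark's fraction-based sampling, the result size is only approximately `maxTrainingSample`, which is why the tests in this PR assert on the fraction rather than an exact row count.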