Local scoring (aka Sparkless) using Aardpfark by tovbinm · Pull Request #41 · salesforce/TransmogrifAI

tovbinm · 2018-08-08T01:47:25Z

Describe the proposed solution
Added a subproject that loads and scores models locally, without a Spark context, using the Aardpfark (PFA for Spark) and Hadrian libraries instead. This makes scoring orders of magnitude faster than scoring with Spark.

Describe alternatives you've considered
dbml-local, ml-local and a custom runtime.

Additional Context
Sample usage:

import com.salesforce.op.local._
val model = workflow.loadModel("/path/to/model")
val scoreFn = model.scoreFunction
val score: Map[String, Any] = scoreFn(Map("age" -> 18, "name" -> "Peter"))

Test results (single thread running on a MacBook Pro, i7 3.5 GHz):

Scored 6000000 records in 239s
Average time per record: 0.0399215ms
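
A minimal sketch of how per-record timings like the ones above could be measured. The `ScoreFunction` alias, `stubScoreFn`, and `time` are hypothetical names introduced for illustration; the stub stands in for the function returned by `model.scoreFunction` and is not the PR's implementation.

```scala
object ScoreTiming {
  // Hypothetical alias: the sample usage suggests a Map-in, Map-out function.
  type ScoreFunction = Map[String, Any] => Map[String, Any]

  // Stand-in scorer, used only to make the sketch runnable without a model.
  val stubScoreFn: ScoreFunction = record => {
    val isAdult = record.get("age").exists {
      case a: Int => a >= 18
      case _ => false
    }
    Map("score" -> (if (isAdult) 1.0 else 0.0))
  }

  /** Scores every record once; returns (total seconds, average ms per record). */
  def time(scoreFn: ScoreFunction, records: Seq[Map[String, Any]]): (Double, Double) = {
    val start = System.nanoTime()
    records.foreach(scoreFn)
    val elapsedNanos = System.nanoTime() - start
    val totalSeconds = elapsedNanos / 1e9
    val avgMillis = if (records.isEmpty) 0.0 else elapsedNanos / 1e6 / records.size
    (totalSeconds, avgMillis)
  }
}
```

For reference, 239 s over 6,000,000 records works out to roughly 0.0398 ms per record, in line with the reported average once rounding of the total time is taken into account.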

TODO:

add a type to ScoreFunction
consider using mleap.

tovbinm · 2018-08-25T20:49:04Z

local/src/main/scala/com/salesforce/op/local/OpWorkflowRunnerLocal.scala

+            }.head
+            val vector = r(inputName).asInstanceOf[Vector].toArray
+            val input = s"""{"$inputName":${vector.mkString("[", ",", "]")}}"""
+            val res = e.action(e.jsonInput(input)).toString


@MLnick is using JSON the most efficient way to call the engine action?
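
To make the question concrete, here is a small stand-alone sketch of the JSON string that the diff above builds for a vector input. `JsonInput` and `vectorToJson` are hypothetical names; the finiteness check is an added assumption (JSON has no literal for NaN or Infinity), and the field name is assumed not to need escaping.

```scala
object JsonInput {
  /** Renders a numeric vector as a one-field JSON object, e.g. {"features":[1.0,2.5]}. */
  def vectorToJson(inputName: String, vector: Array[Double]): String = {
    // JSON cannot represent NaN or Infinity, so reject them before rendering.
    require(vector.forall(v => !v.isNaN && !v.isInfinity), "vector must contain only finite values")
    // Same string-building approach as in the diff: join the values inside brackets.
    s"""{"$inputName":${vector.mkString("[", ",", "]")}}"""
  }
}
```

Every call pays for formatting the doubles as text and for parsing them again on the engine side, which is the overhead the question is about.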

codecov · 2018-08-25T21:29:09Z

Codecov Report

❗ No coverage uploaded for pull request base (master@a8eaf4b).
The diff coverage is 71.73%.

@@            Coverage Diff            @@
##             master      #41   +/-   ##
=========================================
  Coverage          ?   86.14%           
=========================================
  Files             ?      296           
  Lines             ?     9594           
  Branches          ?      319           
=========================================
  Hits              ?     8265           
  Misses            ?     1329           
  Partials          ?        0

Impacted Files	Coverage Δ
.../scala/com/salesforce/op/utils/spark/RichRow.scala	`16.66% <0%> (ø)`
...om/salesforce/op/local/OpWorkflowRunnerLocal.scala	`100% <100%> (ø)`
...com/salesforce/op/local/OpWorkflowModelLocal.scala	`81.08% <81.08%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a8eaf4b...903ad95.

manuzhang · 2018-08-27T05:06:07Z

Have you looked at https://github.com/combust/mleap ?

tovbinm · 2018-08-27T15:57:02Z

@manuzhang last time I checked, Spark support wasn't that great, but it seems better now. What's your experience with it?

tovbinm · 2018-08-27T17:21:40Z

@manuzhang let's chat over https://gitter.im ?

leahmcguire · 2018-08-29T16:17:26Z

features/src/main/scala/com/salesforce/op/utils/spark/RichRow.scala

+     *
+     * @return a [[collection.mutable.Map]] with row contents
+     */
+    def toMutableMap: collection.mutable.Map[String, Any] = {


https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala#L354

So you are saying that row.getValuesMap[Any] should work as well? Let me try.

OK, so my function is faster, because getValuesMap calls def getAs[T](fieldName: String): T = getAs[T](fieldIndex(fieldName)) for each value, while my function operates on indices.
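
The difference between the two approaches can be sketched without Spark. `SimpleRow` below is a hypothetical stand-in for a Spark `Row`, and its linear name search is only an illustration of "one name lookup per field"; it is not a claim about how Spark resolves field names.

```scala
object RowToMap {
  // Hypothetical stand-in for a Spark Row: field names plus positional values.
  final case class SimpleRow(fieldNames: Array[String], values: Array[Any]) {
    // Resolves a field name to its position (illustrative linear search).
    def fieldIndex(name: String): Int = {
      val i = fieldNames.indexOf(name)
      require(i >= 0, s"unknown field: $name")
      i
    }
    def getAs[T](name: String): T = values(fieldIndex(name)).asInstanceOf[T]
  }

  /** Name-based: performs one fieldIndex lookup for every field. */
  def toMapByName(row: SimpleRow): Map[String, Any] =
    row.fieldNames.map(n => n -> row.getAs[Any](n)).toMap

  /** Index-based: a single pass over positions with no name lookups. */
  def toMapByIndex(row: SimpleRow): Map[String, Any] = {
    val builder = Map.newBuilder[String, Any]
    var i = 0
    while (i < row.fieldNames.length) {
      builder += (row.fieldNames(i) -> row.values(i))
      i += 1
    }
    builder.result()
  }
}
```

Both functions return the same map; the index-based version simply skips the per-field name resolution.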

leahmcguire · 2018-08-29T18:02:36Z

local/src/main/scala/com/salesforce/op/local/OpWorkflowRunnerLocal.scala

+   */
+  def score(params: OpParams): ScoreFunction = {
+    require(params.modelLocation.isDefined, "Model location must be set in params")
+    val model = workflow.loadModel(params.modelLocation.get)


Will the standard load method work on Spark models that use Parquet storage without a Spark context?

None of the Spark ML readers require the context explicitly, but I will need to verify, because they might get or create a Spark context internally. Do you have a model in mind that I can check against?

maybe try PCA

oh snap, they simply create a spark context internally when loading models 🤦‍♂️ https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L212

well, we also use spark context when reading the model & stages - https://github.com/salesforce/TransmogrifAI/blob/master/core/src/main/scala/com/salesforce/op/OpWorkflowModelReader.scala#L61
and
https://github.com/salesforce/TransmogrifAI/blob/master/features/src/main/scala/com/salesforce/op/stages/OpPipelineStageReader.scala#L63

leahmcguire · 2018-08-29T18:13:53Z

local/src/test/scala/com/salesforce/op/local/OpWorkflowRunnerLocalTest.scala

+
+  // TODO: remove .map[Text] once Aardpfark supports null inputs for StringIndexer
+  val indexed = description.map[Text](v => if (v.isEmpty) Text("") else v)
+    .indexed(handleInvalid = StringIndexerHandleInvalid.Skip)


We should add a Spark model that uses complicated serialization, maybe PCA, since that uses Parquet.
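
The workaround in the diff above replaces missing text with an empty string before indexing. A minimal stand-alone sketch of that substitution, using a hypothetical `SimpleText` wrapper rather than TransmogrifAI's `Text` type:

```scala
object EmptyTextDefault {
  // Hypothetical stand-in for an optional text feature value.
  final case class SimpleText(value: Option[String]) {
    def isEmpty: Boolean = value.isEmpty
  }

  /** Replaces a missing value with an empty string; present values pass through unchanged. */
  def fillEmpty(v: SimpleText): SimpleText =
    if (v.isEmpty) SimpleText(Some("")) else v
}
```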

leahmcguire · 2018-08-30T21:08:32Z

build.gradle

        commonsIOVersion = '2.6'
        scoveragePluginVersion = '1.3.1'
+        hadrianVersion = '0.8.5'
+        aardpfarkVersion = '0.1.0-SNAPSHOT'


Why are we pulling in a snapshot?

Initial draft of PFA based scoring (aka Sparkless)

b9028c4

tovbinm requested a review from leahmcguire as a code owner August 8, 2018 01:47

tovbinm changed the title ~~Initial draft of PFA based scoring (aka Sparkless)~~ PFA based scoring (aka Sparkless) Aug 8, 2018

tovbinm added the work in progress label Aug 8, 2018

jamesward closed this Aug 8, 2018

jamesward reopened this Aug 8, 2018

tovbinm added 3 commits August 8, 2018 09:04

Merge branch 'master' into mt/pfa-local

ced9ee7

Merge branch 'master' into mt/pfa-local

9e986a7

Merge branch 'master' into mt/pfa-local

470f841

tovbinm changed the title ~~PFA based scoring (aka Sparkless)~~ PFA based local scoring (aka Sparkless) Aug 8, 2018

tovbinm added 4 commits August 9, 2018 15:08

Merge branch 'master' into mt/pfa-local

eda18ad

Merge branch 'master' into mt/pfa-local

2320bbf

Merge branch 'master' into mt/pfa-local

f7d690e

Merge branch 'master' into mt/pfa-local

3ddbfa7

tovbinm changed the title ~~PFA based local scoring (aka Sparkless)~~ Local scoring (aka Sparkless) using Aardpfark Aug 15, 2018

tovbinm and others added 7 commits August 17, 2018 16:47

Merge branch 'master' into mt/pfa-local

214c181

Merge branch 'master' into mt/pfa-local

e0617bc

Use official hadrian release

43ccba3

update

ceab612

Merge branch 'master' into mt/pfa-local

0bb6172

Merge branch 'master' into mt/pfa-local

dd45409

Merge branch 'master' into mt/pfa-local

5d1f8b7

tovbinm commented Aug 25, 2018

refactoring

2053037

tovbinm and others added 3 commits August 26, 2018 22:44

Merge branch 'master' into mt/pfa-local

4d6b711

pfa seems to work

855217b

Merge branch 'master' into mt/pfa-local

ebacf5e

tovbinm and others added 8 commits August 27, 2018 15:16

minor cleanups

fcdaae4

Merge branch 'master' into mt/pfa-local

0021d3c

use json4s cause it's faster

0808f78

cleanup

d73baba

cleanup2

74d8c26

Merge branch 'master' into mt/pfa-local

7ab91f1

Merge branch 'master' into mt/pfa-local

68135c5

Update build.gradle

b796b1e

tovbinm requested a review from gerashegalov August 29, 2018 06:21

leahmcguire reviewed Aug 29, 2018

tovbinm and others added 5 commits August 29, 2018 09:32

updated toMap function

7b2cdcf

revert

67e42e5

Merge branch 'mt/pfa-local' of github.com:salesforce/TransmogrifAI in…

aebef60

…to mt/pfa-local

nicefy

377d52f

Merge branch 'master' into mt/pfa-local

c84c85a

leahmcguire reviewed Aug 29, 2018

tovbinm and others added 4 commits August 29, 2018 20:50

Merge branch 'master' of github.com:salesforce/TransmogrifAI into mt/…

8ac9daa

…pfa-local

Merge branch 'master' into mt/pfa-local

05ea9c4

Merge branch 'master' into mt/pfa-local

4d72d05

Added comment

91e9248

leahmcguire reviewed Aug 30, 2018

leahmcguire approved these changes Aug 30, 2018

Merge branch 'master' into mt/pfa-local

903ad95

tovbinm merged commit 47e7e37 into master Aug 30, 2018

tovbinm deleted the mt/pfa-local branch August 30, 2018 21:55

albertodema mentioned this pull request Oct 1, 2018

Model Load from a brand new workflow #75

Closed

ericwayman pushed a commit that referenced this pull request Feb 8, 2019

Local scoring (aka Sparkless) using Aardpfark (#41)

83af0a0

salesforce-cla bot added the cla:signed label Jul 19, 2020

4 participants
