Skip to content

Mutable tensor for local session and web session#536

Merged
wjsi merged 20 commits intomars-project:masterfrom
sighingnow:mut-web
Jul 18, 2019
Merged

Mutable tensor for local session and web session#536
wjsi merged 20 commits intomars-project:masterfrom
sighingnow:mut-web

Conversation

@sighingnow
Copy link
Contributor

@sighingnow sighingnow commented Jul 11, 2019

What do these changes do?

Support mutable tensor in local session and web session.

  • For the LocalSession, we
    1. maintain a map of name -> mut tensor in LocalSession, and do write to the ndarray directly.
    2. when seal, we use the ndarray to construct a mt.tensor and execute it with self._executor know the tensor is executed.
  • For the WebSession, we
    1. add several necessary endpoints to web api server.
    2. when write, we forward the index and value to MutableTensorActor, the MutableTensorActor maintains the buffer, and send record chunks to corresponding workers.
    3. When seal, do the same thing with LocalClusterSession.

Some design point that needs to clarify:

  1. The index (such as (slice(1, None, None))) cannot be serialized to json directly, thus I add a helper class MutableTensor.Index and leverage the Serializable.to_json to do the serialization.
  2. In WebSession, the index calculation and chunk transfer are done in MutableTensorActor.
  3. All functionalities of mutable tensor in LocalSession work well with the WebSession (as shown by the unit test).
  4. Along this PR I also do some refactor and improvement on the previous implementation to avoid code duplication.

About the (known) failed test cases:

It seems that there is no way to obtain the information and message of exception that initially raised in server at client side. We have dump_exception and reraise,

(Edit: this exception info issue has been fixed in commit 31520ebc9 of this PR.)

    def _dump_exception(self, exc_info):
        pickled_exc = pickle.dumps(exc_info)
        self.write(json.dumps(dict(
            exc_info=base64.b64encode(pickled_exc),
        )))
        raise web.HTTPError(500, 'Internal server error')
    if resp.status_code >= 400:
        resp_json = json.loads(resp.text)
        exc_info = base64.b64decode(resp_json['exc_info'])
        six.reraise(*exc_info)

But it doesn't work because:

  1. The base64.b64encode(pickled_exc) is bytes, not str, which cannot be json.dumps.
  2. Even after fixing the problem (by .decode('ascii')), we will found that resp.text is something like
<html><title>500: Internal Server Error</title><body>500: Internal Server Error</body></html>

rather than the serialized exception information. Not sure if this limitation is a bug or just by-design.

Related issue number

#415

mars/web/api.py Outdated
register_web_handler('/api/session/(?P<session_id>[^/]+)/graph/(?P<graph_key>[^/]+)/data/(?P<tileable_key>[^/]+)',
GraphDataHandler)
register_web_handler('/api/session/(?P<session_id>[^/]+)/mutable-tensor', MutableTensorHandler)
register_web_handler('/api/session/(?P<session_id>[^/]+)/mutable-tensor/write', MutableTensorWriteHandler)
Copy link
Member

@wjsi wjsi Jul 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API design does not conform to RESTful styles. May change to ``/api/session/(?P<session_id>[^/]+)/mutable-tensor/(?P<name>[^/]+) and use GET / PUT / POST methods to handle data read / write / seal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will revise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @wjsi Mutable tensor requires four endpoints: create/get/write/seal. Is it ok to use the following mapping?

  • POST for create
  • HEAD for get (is HEAD ok here ?)
  • PUT for write
  • GET for seal

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GET shall not cause any side-effect on the storage. Therefore create and write can be merged into PUT and the implementation of API decide whether to create while POST for seal, or POST with an action indicating create or seal. If I were the writer of this PR, I would prefer the former solution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create and write can be merged into PUT and the implementation of API decide whether to create while POST for seal

Nice suggestion, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have revised the endpoint by use POST for both write and seal (easier to distinguish since write has body payload and seal doesn't have, and write's body payload is raw bytes, not json, making create and write are harder to distinguish without extra paramter).

Now all four API of mutable tensor share the same HTTP endpoint.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Customized headers can also be used to pass string meta data if you do not like query strings.

Copy link
Collaborator

@qinxuye qinxuye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, have some question about serialization a bit mentioned in comment of #540 .

_nsplits=tensor.nsplits, _key=tensor.key, _chunks=tensor.chunks))


def setitem_as_records(nsplits_acc, output_chunk, value, ts, is_scalar):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc should be updated.

@qinxuye
Copy link
Collaborator

qinxuye commented Jul 14, 2019

One more thing, really cool to see the error can be serialized and sent to client, but sadly no test is added for that, could you please try to add some ut?

@hekaisheng
Copy link
Contributor

hekaisheng commented Jul 15, 2019

We can move MutableTensor.Index out of MutableTensor as a unified way to serialize indexes. Will you do it in this PR or do it later in my PR #540 ?

@sighingnow
Copy link
Contributor Author

The exception info return by the http api of web session can be validated in the self.assertRaises part of this PR. I will add a standalone test for the exception info as well.

@sighingnow
Copy link
Contributor Author

@hekaisheng I will revise the patch and move the Index class out as soon as possible. Do you guys think is SerializableIndex a reasonable name for this class? (since we also have a Index class in dataframe part).

@sighingnow
Copy link
Contributor Author

  • Add tests for return exception info from web api.
  • Add mars.tensor.core.mutable_tensor constructor, I haven't import it in mars.tensor.__init__ since there seems cycle import, will dig into it as soon as possible.
  • I will rebase to master (to adapt the serialable Index part) after Support fetch tensor data slices from client #540 merged, will ping reviewers then.

@qinxuye
Copy link
Collaborator

qinxuye commented Jul 16, 2019

  • Add tests for return exception info from web api.
  • Add mars.tensor.core.mutable_tensor constructor, I haven't import it in mars.tensor.__init__ since there seems cycle import, will dig into it as soon as possible.
  • I will rebase to master (to adapt the serialable Index part) after Support fetch tensor data slices from client #540 merged, will ping reviewers then.

Great, I have an idea, can we add a fill_value parameter to create_mutable_tensor, so that we can initialize some value for the users?

@sighingnow
Copy link
Contributor Author

Great, I have an idea, can we add a fill_value parameter to create_mutable_tensor, so that we can initialize some value for the users?

Agree. The initial_value can a property of MutableTensorActor and will be used to create the empty numpy tensor when seal. Will do that ASAP.

@hekaisheng
Copy link
Contributor

hekaisheng commented Jul 16, 2019

#540 has been merged, you can rebase master now.

Better to rebase after #546 merged.

@sighingnow
Copy link
Contributor Author

Better to rebase after #546 merged.

Will do that.

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
@sighingnow
Copy link
Contributor Author

  • Rebase to master and delete MutableTensor.index
  • Implement fill_value for mutable tensor constructor.

qinxuye
qinxuye previously approved these changes Jul 17, 2019
Copy link
Collaborator

@qinxuye qinxuye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

ep = self.get_scheduler(chunk_key)
# register quota
quota_ref = self.ctx.actor_ref(MemQuotaActor.default_uid(), address=ep)
quota_ref.request_batch_quota({record_chunk_key: records.nbytes})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReceiverActor takes little process memory, and quota request is not needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The quota is for chunk of records (the (index, value) record of write operations) and the record chunk may be spilled, thus the quota is required, IMO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReceiverActor stores data in plasma_store or disk. This means the cost of process memory equals to zero when receiving data from other machines. What's more, we serialize with pyarrow with zero-copy and spill in small chunks (not chunks in Mars), hence the additional memory cost is no more than the size of these chunks. Therefore there is no need requesting for quotas before data transfer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. No need to requesting quota now.

return '%s_load_memory_%s' % (graph_key, chunk_key)


def put_chunk(session_id, chunk_key, data, receiver_ref):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function shall be put in transfer.py and renamed as put_remote_chunk as it is not referenced by any code in worker module.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

mars/web/api.py Outdated

def post(self, session_id, name):
try:
# If the request contains no body payload, it is seal, otherwise it is create
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it is better to use a customized header or a query string argument to define the action of POST.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. A parameter action has been added.

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
@qinxuye
Copy link
Collaborator

qinxuye commented Jul 17, 2019

Could you please confirm that the new commits have resolved your comments? @wjsi

Copy link
Member

@wjsi wjsi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wjsi wjsi merged commit 7fe0cef into mars-project:master Jul 18, 2019
@sighingnow sighingnow deleted the mut-web branch July 18, 2019 09:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants