Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A way to serialize/deserialize ndindex #86

Open
telamonian opened this issue Sep 30, 2020 · 15 comments · May be fixed by #91
Open

A way to serialize/deserialize ndindex #86

telamonian opened this issue Sep 30, 2020 · 15 comments · May be fixed by #91
Assignees

Comments

@telamonian
Copy link
Contributor

telamonian commented Sep 30, 2020

Hi all. This seems like a really neat project, and one I'll probably get a lot of use out of. Question: would a way to deserialize/serialize an ndindex be something that's within the scope of this project?

Here's my use case -> jupyterlab/jupyterlab-hdf5#4. Short version:

  • A user opens an hdf5 dataset in their browser, then specify which 1D or 2D slice of a hyperslab to view using numpy-style slice notation
  • The jupyterlab-hdf5 frontend ships said slice to its backend in the form of a string
  • On the backend, I rehydrate/deserialize/whatever the slice string and then hand it to h5py, which also uses numpy-style slices to specify hyperslabs

I'd be more than willing to pitch in and help with this in the form of code/PRs, but I'll probably need some help figuring out how to handle deserialization.

@asmeurer
Copy link
Member

What form of deserialization do you have in mind. Currently, the repr form of any ndindex expression always recreates it. I haven't tested it but I imagine pickle should work just fine as well.

@telamonian
Copy link
Contributor Author

telamonian commented Sep 30, 2020

My starting point is a string (collected as user input), so pickle isn't going to help (also no way to pickle/unpickle on the browser side, which after all is running js and not python). repr might work. I'll check into that.

To give you a concrete example, if a user opens a 4D dataset, they would then enter a 2D slab specification like:

"13, 10:1000, 0, :"

which will then get passed to the python backend server via a GET request. I've come up with some hacky ways to parse such a string into a numpy-compatible tuple of slices. My best so far:

def parseSlice(sliceStr):
    return tuple((slice(*(int(i) if i else None for i in part.strip().split(':'))) if ':' in part else int(part.strip())) for part in sliceStr.split(','))

which you can then use like

>>> parseSlice("13, 10:1000, 0, :")

(13, slice(10, 1000, None), 0, slice(None, None, None))

but I can't imagine this is a robust approach. I think I would need an actual formal parser of some flavor.

@asmeurer
Copy link
Member

I see. Parsing strings is not something I had considered, but it could be added. Probably the best way would be to do something akin to ast.literal_eval but extended to support what we need for indices. You'd also have to wrap the slice in some indexing object, but I've also thought that we should have something akin to numpy.s_ that converts __getitem__ into an ndindex object.

@telamonian
Copy link
Contributor Author

telamonian commented Sep 30, 2020

something akin to ast.literal_eval

yes, this exactly! Now if I could just use eval on raw user input, all of my problems would be solved :)

So I think a starting point for the parser could be

# minimize parsed ops
from numpy import s_

ast.parse(f's_[{sliceStr}]')

where numpy.s_ could be replaced by ndindex.s_ (or whatever you end up calling it).

Of course then we have to figure out what to do with the resulting abstract syntax tree...

@asmeurer
Copy link
Member

Yes, exactly. If you look at how ast.literal_eval works, it's pretty simple. https://github.com/python/cpython/blob/17b5be0c0a3f74141014e06a660f1b5ddb002fec/Lib/ast.py#L54. It just recursively walks the ast and checks for known OK nodes. I think we can just copy it, and add Slice nodes. Maybe we could also do some careful checking for Name nodes if we want to allow slice(1, 2) or array([1, 2]). Then we use ast.parse(f'dummy[{input_string}]').body[0].value.slice.

Support masks is harder, because usually people construct masks out of the array itself as an expression, like a[a > 0]. Do you need support for that?

@telamonian
Copy link
Contributor Author

Support masks is harder, because usually people construct masks out of the array itself as an expression, like a[a > 0]. Do you need support for that?

For my purposes? No. In fact, for my application it would be best if anything other than a comma separated sequence of literal index or literal slice caused an exception when you try to parse it. But that's just my own narrow-ish use case.

In terms of the general purpose ndindex pkg, does it make sense to support parsing mask indexes? You're prob in a better position than I am to answer said general question.

@telamonian
Copy link
Contributor Author

telamonian commented Sep 30, 2020

Here's a simplified parser implementation based on ast.parse that I hacked together:

import ast
from numpy import s_

def astParseSlice(sliceStr):
    for node in ast.walk(ast.parse(f's_[{sliceStr}]')):
        if isinstance(node, ast.ExtSlice):
            return (x for x in (iors.value.n if isinstance(iors, ast.Index) else slice(iors.lower.n if iors.lower is not None else None, iors.upper.n if iors.upper is not None else None, iors.step.n if iors.step is not None else None) if isinstance(iors, ast.Slice) else None for iors in ast.iter_child_nodes(node)) if x is not None)

Basic idea:

  • parses sliceStr
  • walks the nodes of the resulting tree (in no particular order)
  • stops at the first (only?) ExtSlice node and unpacks it into an iterator. Anything other than an Index or Slice node is ignored and then dropped from the final result

But for a real implementation, I agree that copying and modifying ast.literal_eval makes the most sense. The code for literal_eval is pleasantly concise.

edit

Looking into it further, my implementation above fails for a myriad of inputs, everything from "13, 14" to "8:9", and for interesting reasons. So yeah, we should definitely start from the existing literal_eval (not that that was really in question)

@asmeurer
Copy link
Member

ndindex is presently designed around the idea of "raw" indices, that don't have any knowledge of the array being indexed. So a boolean mask like a > 0 would be represented in ndindex as just the boolean array, without any knowledge of a. So I would omit support for it for now.

@asmeurer
Copy link
Member

I think we can copy literal_eval and

  • Make it use ast.parse(f'dummy[{input_string}]').body[0].value.slice.
  • Remove ifs for nodes that we know we won't use, like the set and dict.
  • Add elif isinstance(node, Slice) and elif isinstance(node, ExtSlice). I don't think we need to worry about the depth because we won't allow Subscript which would appear for any nested indexing.
  • Modify it so it returns ndindex types instead of built-in types.

The function doesn't need to worry about input validation because the ndindex classes already do that.

@telamonian
Copy link
Contributor Author

Nice, that makes it quite clear how to accomplish this. I'll take a stab at a PR

@scopatz
Copy link
Contributor

scopatz commented Oct 5, 2020

Thanks @telamonian!

@telamonian
Copy link
Contributor Author

telamonian commented Oct 8, 2020

Sorry, I got a little distracted by some debugging stuff I promised to add to the "how to develop a jupyter extension" jupytercon tutorial. I'll probably make the PR for this over the weekend

Edit: I've started work

@telamonian
Copy link
Contributor Author

Okay, so I ran into problems with the implementation, and after doing some digging it turns out that the ast.literal_eval function changed a bit in 3.8, and the way that slices are represented in the AST changed significantly in 3.9:

From https://docs.python.org/3/library/ast.html#node-classes:

Changed in version 3.8: Class ast.Constant is now used for all constants.

Changed in version 3.9: Simple indices are represented by their value, extended slices are represented as tuples.

Deprecated since version 3.8: Old classes ast.Num, ast.Str, ast.Bytes, ast.NameConstant and ast.Ellipsis are still available, but they will be removed in future Python releases. In the meantime, instantiating them will return an instance of a different class.

Deprecated since version 3.9: Old classes ast.Index and ast.ExtSlice are still available, but they will be removed in future Python releases. In the meantime, instantiating them will return an instance of a different class.

So, eg the section for handling ast.Num nodes was removed from ast.literal_eval in 3.8. However, the Python installed in my dev env is 3.7. This meant that the the version of ast.literal_eval that I copied-pasted from cpython master was choking on the AST generated by my dev instance of Python (3.7 AST uses ast.Num extensively)

I can definitely come up with something that will work for 3.7, but making a string->index parser that will work with all of 3.6-3.9 will take some extra work. I'll look into it further

@telamonian
Copy link
Contributor Author

turns out to have been much simpler than I initially thought to implement a literal_eval_index function that works in both cpy37 and cpy38. There's a PR up now with a working prototype.

One catch: cpy39 will probably require some more work and explicit handling, since it makes changes to the grammar of ast.Slice. First I'll need to set up pyenv (or something) so I can install cpy39 without snapping my dev env in half.

@asmeurer Currently the PR contains a literal_eval_index.py file with a bare literal_eval_index function that will output either a single bare index or a tuple (dependent on input). I'll need your input on how to actually integrate my literal_eval_index with the existing ndindex package

@telamonian
Copy link
Contributor Author

good news: it turns out the code that I already wrote essentially gets support for the new grammar in cpy39 "for free". ast in cpy39 replaced all uses of ast.Index and ast.ExtSlice with one of: ast.Constant, ast.Slice, or an ast.Tuple containing one or both. So the support I added for ast.Slice (along with the existing support for ast.Constant and ast.Tuple from the original ast.literal_eval) already covers the cpy39 use case.

@asmeurer I could still use some pointers about how to integrate literal_eval_index with the rest of the project

@ericdatakelly ericdatakelly added this to the April 2021 milestone Apr 12, 2021
@ericdatakelly ericdatakelly modified the milestones: April 2021, May 2021 May 10, 2021
@ericdatakelly ericdatakelly modified the milestones: May 2021, June 2021 Jun 7, 2021
@ericdatakelly ericdatakelly removed this from the June 2021 milestone Oct 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants