Support something like OPT_STRICT_INTEGER for deserialization #473

Closed
nwalters512 opened this issue Apr 11, 2024 · 6 comments
nwalters512 commented Apr 11, 2024

Consider this code:

import orjson
orjson.dumps(orjson.loads("2800000000000000000"), option=orjson.OPT_STRICT_INTEGER)

This will fail, which is currently the expected behavior. The value is deserialized to int(2800000000000000000), which is >= 2 ** 53, and so cannot be serialized per orjson.OPT_STRICT_INTEGER.

However, the following code succeeds:

# 2 ** 64 == 18446744073709551616
orjson.dumps(orjson.loads("18446744073709551616"), option=orjson.OPT_STRICT_INTEGER)
# b'1.8446744073709552e19'

This is because orjson parses that value as float(18446744073709551616), and floats can always be serialized back to JSON, albeit not always losslessly.
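As a stdlib-only illustration of when large integers survive conversion to float and when they do not (this is generic IEEE 754 behavior, not orjson-specific):

```python
# 2**64 is a power of two, so it is exactly representable as an IEEE 754
# double; its immediate neighbours are not, because the spacing between
# consecutive doubles near 2**64 is 2**12 = 4096.
assert float(2**64) == 2**64
assert float(2**64 + 1) == float(2**64)  # 2**64 + 1 rounds down to 2**64
print(float(2**64))  # 1.8446744073709552e+19
```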

What I'd like is to be able to opt-in to parsing any numbers that are >= 2 ** 53 as floats so that they can be round-tripped through orjson without error. We could do this via roughly the same API as orjson.dumps uses:

import orjson
orjson.loads("2800000000000000000", option=orjson.OPT_STRICT_INTEGER)

We could of course use a different constant, which might be reasonable given that the semantics are different ("strict" could imply failure, but what I really want is just stricter bounds on what is parsed as an int instead of a float).
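For anyone needing this behavior today, the same effect can be approximated with the standard library's json module via its parse_int hook (a workaround sketch, not part of orjson):

```python
import json

# Approximate the proposed opt-in behavior with the stdlib json module:
# parse any integer literal outside the 53-bit safe range as a float.
MAX_SAFE_INT = 2 ** 53  # beyond this, not every integer is exactly an f64

def safe_parse_int(s: str):
    n = int(s)
    return n if -MAX_SAFE_INT < n < MAX_SAFE_INT else float(n)

data = json.loads('{"number": 2800000000000000000}', parse_int=safe_parse_int)
print(data)  # {'number': 2.8e+18}
```

This is slower than a native option would be (the hook is a Python call per integer), which is part of the motivation for doing it inside orjson itself.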

Would you be open to a PR implementing this feature? If so, do you have thoughts on what the API should look like?

ijl commented Apr 15, 2024

Do you have a use case for strictly 53-bit numbers, or is this more that 64-bit integers, 128-bit integers, and how floats can be coerced are not handled well? Does deserializing a number that doesn't have a dot or exponent strictly as an integer address the concern? In general, stricter behavior and support for 128-bit integers and floats would be good. I think the implementation is likely on the scale of writing a yyjson-like parser in Rust, though: having the parser pass through "raw numbers" would probably be a big regression, and it is definitely more difficult to handle parsing that is fallible after the tape has been constructed.

nwalters512 commented

I do have a use case for strictly 53-bit integers: I'm interoperating with JavaScript. Specifically, I'm round-tripping JSON to/from JavaScript and Python, and I want to be able to do so losslessly. Here's a sequence where things would fail with OPT_STRICT_INTEGER and the current parsing behavior:

  1. Start with JSON serialized from Python:

     >>> orjson.dumps({ "number": 2.8e18 }, option=orjson.OPT_STRICT_INTEGER)
     b'{"number":2.8e18}'

  2. Parse that with JavaScript:

     > JSON.parse('{"number":2.8e18}')
     { number: 2800000000000000000 }

  3. Serialize back to JSON with JavaScript:

     > JSON.stringify({ number: 2800000000000000000 })
     '{"number":2800000000000000000}'

  4. Parse again with Python:

     >>> data = orjson.loads('{"number":2800000000000000000}')
     >>> data
     {'number': 2800000000000000000}

  5. Serialize again with Python:

     >>> orjson.dumps(data, option=orjson.OPT_STRICT_INTEGER)
     Traceback (most recent call last):
       File "<stdin>", line 1, in <module>
     TypeError: Integer exceeds 53-bit range

If I could add the proposed option on step 4 above, things would work perfectly: the value would be deserialized to float(2.8e18), and I could keep going endlessly through steps 1-4 above without loss of information or encountering errors.
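To illustrate why this particular round trip would be lossless, here is a stdlib-only sketch (not orjson-specific) of the relevant float behavior:

```python
# 2.8e18 is exactly representable as an IEEE 754 double (it factors as
# 5340576171875 * 2**19, and 5340576171875 < 2**53), so the float
# round-trip in steps 1-5 above loses nothing:
x = 2.8e18
assert float(int(x)) == x   # int <-> float conversion is exact for this value
assert float(repr(x)) == x  # Python's shortest-repr round-trip is exact

# By contrast, integers at or above 2**53 are not all representable,
# which is why OPT_STRICT_INTEGER draws the line there:
assert float(2**53 + 1) == float(2**53)
```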

As for an implementation, I spent some time looking through the codebase, and while I'm not particularly familiar with either Rust or yyjson, I think perhaps we could hook into things here:

pub fn parse_node(elem: *mut yyjson_val) -> NonNull<pyo3_ffi::PyObject> {
    match ElementType::from_tag(elem) {
        ElementType::String => parse_yy_string(elem),
        ElementType::Uint64 => parse_u64(unsafe { (*elem).uni.u64_ }),
        ElementType::Int64 => parse_i64(unsafe { (*elem).uni.i64_ }),
        ElementType::Double => parse_f64(unsafe { (*elem).uni.f64_ }),
        ElementType::Null => parse_none(),
        ElementType::True => parse_true(),
        ElementType::False => parse_false(),
        ElementType::Array => parse_yy_array(elem),
        ElementType::Object => parse_yy_object(elem),
    }
}

parse_i64 and parse_u64 could, in theory, take some options as another argument and internally decide to return either a PyLong or a PyFloat, depending on whether val is in the desired range. Does this sound at all feasible?
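As a rough Python model of that decision (the option name is hypothetical; the real change would live in the Rust parse_i64/parse_u64 paths):

```python
# Hypothetical model of the proposed parse_i64/parse_u64 behavior:
# with the option enabled, values outside the 53-bit safe range come
# back as floats instead of ints, so they can be re-serialized under
# OPT_STRICT_INTEGER without error.
MAX_SAFE_INT = (1 << 53) - 1  # Number.MAX_SAFE_INTEGER in JavaScript

def parse_i64(val: int, strict_integer: bool = False):
    if strict_integer and not (-MAX_SAFE_INT <= val <= MAX_SAFE_INT):
        return float(val)  # lossy in general, but round-trippable as f64
    return val

print(parse_i64(2800000000000000000, strict_integer=True))  # 2.8e+18
```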

nwalters512 commented

Bumping this so the stalebot doesn't close this out prematurely!

@github-actions github-actions bot added the Stale label Apr 30, 2024
@ijl ijl removed the Stale label Apr 30, 2024
ijl commented Apr 30, 2024

I think I follow your use case. I don't follow how adding the option is a real solution, given that you can already specify >53-bit integers as floats. I'm not a numerical computing person, and I suppose I don't see how this doesn't quickly become a much wider issue of having defined behavior for all of it, including 128-bit integers and floats. And because none of that is in the spec, it would require understanding what is done across as many libraries as possible and looking for a "most compatible without being incorrect" option.

If this is an important issue for you, I would suggest experimenting with your own fork instead of waiting on this. I'm biased toward not touching number parsing unless there's a general plan.

nwalters512 commented

orjson does currently have defined behavior for 128-bit integers and floats, as far as I can tell. The defined behavior does what I want. For instance, orjson.dumps(88291326719355847026813766449910520462) fails with OverflowError: int too big to convert, and orjson.loads("88291326719355847026813766449910520462") gives me 8.829132671935584e+37. Is there something else you meant by "having defined behavior for all of it, including 128-bit integers and floats"?

I'd be happy to work in a fork to try this out, but I'd be hesitant to maintain a fork indefinitely. Do you see any path to this landing in orjson, and if so, is there anything I can do to make that happen (further examples, more detailed descriptions, a proof-of-concept + tests, etc.)? The performance improvements from using orjson have been extremely promising in early testing, so I'm quite motivated to find a way to make this work!

@github-actions github-actions bot added the Stale label May 12, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 13, 2024
nwalters512 commented

@ijl any chance I could get a response from you so I can know if this is worth pursuing?
