Support something like OPT_STRICT_INTEGER for deserialization #473

Closed
nwalters512 opened this issue Apr 11, 2024 · 6 comments
nwalters512 commented Apr 11, 2024

Consider this code:

import orjson
orjson.dumps(orjson.loads("2800000000000000000"), option=orjson.OPT_STRICT_INTEGER)

This will fail, which is currently the expected behavior. The value is deserialized to int(2800000000000000000), which is >= 2 ** 53, and so cannot be serialized per orjson.OPT_STRICT_INTEGER.

However, the following code succeeds:

# 2 ** 64 == 18446744073709551616
orjson.dumps(orjson.loads("18446744073709551616"), option=orjson.OPT_STRICT_INTEGER)
# b'1.8446744073709552e19'

This is because orjson parses that value as float(18446744073709551616), and floats can always be serialized back to JSON, albeit not always losslessly.
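As a stdlib-only illustration of when large integers survive conversion to float and when they do not (this is generic IEEE 754 behavior, not orjson-specific):

```python
# 2**64 is a power of two, so it is exactly representable as an IEEE 754
# double; its immediate neighbours are not, because the spacing between
# consecutive doubles near 2**64 is 2**12 = 4096.
assert float(2**64) == 2**64
assert float(2**64 + 1) == float(2**64)  # 2**64 + 1 rounds down to 2**64
print(float(2**64))  # 1.8446744073709552e+19
```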

What I'd like is to be able to opt-in to parsing any numbers that are >= 2 ** 53 as floats so that they can be round-tripped through orjson without error. We could do this via roughly the same API as orjson.dumps uses:

import orjson
orjson.loads("2800000000000000000", option=orjson.OPT_STRICT_INTEGER)

We could of course use a different constant, which might be reasonable given that the semantics are different ("strict" could imply failure, but what I really want is just stricter bounds on what is parsed as an int instead of a float).
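For anyone needing this behavior today, the same effect can be approximated with the standard library's json module via its parse_int hook (a workaround sketch, not part of orjson):

```python
import json

# Approximate the proposed opt-in behavior with the stdlib json module:
# parse any integer literal outside the 53-bit safe range as a float.
MAX_SAFE_INT = 2 ** 53  # beyond this, not every integer is exactly an f64

def safe_parse_int(s: str):
    n = int(s)
    return n if -MAX_SAFE_INT < n < MAX_SAFE_INT else float(n)

data = json.loads('{"number": 2800000000000000000}', parse_int=safe_parse_int)
print(data)  # {'number': 2.8e+18}
```

This is slower than a native option would be (the hook is a Python call per integer), which is part of the motivation for doing it inside orjson itself.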

Would you be open to a PR implementing this feature? If so, do you have thoughts on what the API should look like?

ijl commented Apr 15, 2024

Do you have a use case for strictly 53-bit numbers, or is this more that 64-bit integers, 128-bit integers, and how floats can be coerced are not handled well? Does deserializing a number that doesn't have a dot or exponent strictly as an integer address the concern? In general, stricter behavior and support for 128-bit integers and floats would be good. I think the implementation is likely on the scale of writing a yyjson-like parser in Rust, though: having the parser pass through "raw numbers" would probably be a big regression, and it is definitely more difficult to handle parsing that is fallible after the tape has been constructed.

nwalters512 commented

I do have a use case for strictly 53-bit integers: I'm interoperating with JavaScript. Specifically, I'm round-tripping JSON to/from JavaScript and Python, and I want to be able to do so losslessly. Here's a sequence where things would fail with OPT_STRICT_INTEGER and the current parsing behavior:

  1. Start with JSON serialized from Python:

     >>> orjson.dumps({ "number": 2.8e18 }, option=orjson.OPT_STRICT_INTEGER)
     b'{"number":2.8e18}'

  2. Parse that with JavaScript:

     > JSON.parse('{"number":2.8e18}')
     { number: 2800000000000000000 }

  3. Serialize back to JSON with JavaScript:

     > JSON.stringify({ number: 2800000000000000000 })
     '{"number":2800000000000000000}'

  4. Parse again with Python:

     >>> data = orjson.loads('{"number":2800000000000000000}')
     >>> data
     {'number': 2800000000000000000}

  5. Serialize again with Python:

     >>> orjson.dumps(data, option=orjson.OPT_STRICT_INTEGER)
     Traceback (most recent call last):
       File "<stdin>", line 1, in <module>
     TypeError: Integer exceeds 53-bit range

If I could add the proposed option on step 4 above, things would work perfectly: the value would be deserialized to float(2.8e18), and I could keep going endlessly through steps 1-4 above without loss of information or encountering errors.
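To illustrate why this particular round trip would be lossless, here is a stdlib-only sketch (not orjson-specific) of the relevant float behavior:

```python
# 2.8e18 is exactly representable as an IEEE 754 double (it factors as
# 5340576171875 * 2**19, and 5340576171875 < 2**53), so the float
# round-trip in steps 1-5 above loses nothing:
x = 2.8e18
assert float(int(x)) == x   # int <-> float conversion is exact for this value
assert float(repr(x)) == x  # Python's shortest-repr round-trip is exact

# By contrast, integers at or above 2**53 are not all representable,
# which is why OPT_STRICT_INTEGER draws the line there:
assert float(2**53 + 1) == float(2**53)
```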

As for an implementation, I spent some time looking through the codebase, and while I'm not particularly familiar with either Rust or yyjson, I think perhaps we could hook into things here:

pub fn parse_node(elem: *mut yyjson_val) -> NonNull<pyo3_ffi::PyObject> {
    match ElementType::from_tag(elem) {
        ElementType::String => parse_yy_string(elem),
        ElementType::Uint64 => parse_u64(unsafe { (*elem).uni.u64_ }),
        ElementType::Int64 => parse_i64(unsafe { (*elem).uni.i64_ }),
        ElementType::Double => parse_f64(unsafe { (*elem).uni.f64_ }),
        ElementType::Null => parse_none(),
        ElementType::True => parse_true(),
        ElementType::False => parse_false(),
        ElementType::Array => parse_yy_array(elem),
        ElementType::Object => parse_yy_object(elem),
    }
}

parse_i64 and parse_u64 could, in theory, take some options as another argument and internally decide to return either a PyLong or a PyFloat, depending on whether val is in the desired range. Does this sound at all feasible?
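As a rough Python model of that decision (the option name is hypothetical; the real change would live in the Rust parse_i64/parse_u64 paths):

```python
# Hypothetical model of the proposed parse_i64/parse_u64 behavior:
# with the option enabled, values outside the 53-bit safe range come
# back as floats instead of ints, so they can be re-serialized under
# OPT_STRICT_INTEGER without error.
MAX_SAFE_INT = (1 << 53) - 1  # Number.MAX_SAFE_INTEGER in JavaScript

def parse_i64(val: int, strict_integer: bool = False):
    if strict_integer and not (-MAX_SAFE_INT <= val <= MAX_SAFE_INT):
        return float(val)  # lossy in general, but round-trippable as f64
    return val

print(parse_i64(2800000000000000000, strict_integer=True))  # 2.8e+18
```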

nwalters512 commented

Bumping this so the stalebot doesn't close this out prematurely!

@github-actions github-actions bot added the Stale label Apr 30, 2024
@ijl ijl removed the Stale label Apr 30, 2024
ijl commented Apr 30, 2024

I think I follow your use case. I don't follow how adding the option is a real solution, given that you can already specify >53-bit integers as floats. I'm not a numerical computing person, and I suppose I don't see how this doesn't quickly become a much wider issue of having defined behavior for all of it, including 128-bit integers and floats. And because none of that is in the spec, it would require understanding what is done across as many libraries as possible and looking for a "most compatible without being incorrect" option.

If this is an important issue for you, I would suggest experimenting with your own fork instead of waiting on this. I'm biased toward not touching number parsing unless there's a general plan.

nwalters512 commented

orjson does currently have defined behavior for 128-bit integers and floats, as far as I can tell. The defined behavior does what I want. For instance, orjson.dumps(88291326719355847026813766449910520462) fails with OverflowError: int too big to convert, and orjson.loads("88291326719355847026813766449910520462") gives me 8.829132671935584e+37. Is there something else you meant by "having defined behavior for all of it, including 128-bit integers and floats"?

I'd be happy to work in a fork to try this out, but I'd be hesitant to maintain a fork indefinitely. Do you see any path to this landing in orjson, and if so, is there anything I can do to make that happen (further examples, more detailed descriptions, a proof-of-concept + tests, etc.)? The performance improvements from using orjson have been extremely promising in early testing, so I'm quite motivated to find a way to make this work!

@github-actions github-actions bot added the Stale label May 12, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 13, 2024
nwalters512 commented

@ijl any chance I could get a response from you so I can know if this is worth pursuing?
