marshall module ideas #8

Open
pfalcon opened this issue Jan 7, 2018 · 9 comments

Comments

@pfalcon
Owner

pfalcon commented Jan 7, 2018

It seems that MsgPack is a viable choice for implementing the marshall encoding: https://github.com/msgpack/msgpack/blob/master/spec.md

Possibly, an ad hoc serialization format would be even more efficient, but at least MsgPack is able to differentiate bytes from str, etc.

@pfalcon
Owner Author

pfalcon commented Jan 7, 2018

Problems would be: no differentiation between tuple and list, dict and OrderedDict.

@pfalcon
Owner Author

pfalcon commented Jan 7, 2018

Also, there is no array encoding with an 8-bit length: it jumps from a 4-bit length (fixarray) straight to a 16-bit length (array 16). The same goes for maps.
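
To make the size jump concrete, here is a minimal sketch of the array header selection as laid out in the MsgPack spec. The function name `msgpack_array_header` is made up for illustration and is not part of any existing module.

```python
import struct


def msgpack_array_header(n):
    """Return the MsgPack header bytes for an array of n elements.

    Hypothetical helper, written from the format table in the MsgPack spec.
    """
    if n < 0:
        raise ValueError("negative length")
    if n <= 0x0F:
        # fixarray: 1001xxxx, length stored in the low 4 bits
        return bytes([0x90 | n])
    if n <= 0xFFFF:
        # array 16: 0xdc followed by a big-endian 16-bit length
        return b"\xdc" + struct.pack(">H", n)
    if n <= 0xFFFFFFFF:
        # array 32: 0xdd followed by a big-endian 32-bit length
        return b"\xdd" + struct.pack(">I", n)
    raise ValueError("array too long for MsgPack")
```

A 15-element array costs one header byte, while a 16-element array already costs three, because there is no one-byte-length form in between.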

@pfalcon
Owner Author

pfalcon commented Jan 7, 2018

There's also CBOR, and teh-drama between it and MsgPack: msgpack/msgpack#129

@pfalcon
Owner Author

pfalcon commented Jan 7, 2018

CBOR is used in CoAP, so kinda would be "more useful" than MsgPack...

@pfalcon
Owner Author

pfalcon commented Jan 7, 2018

MsgPack has a random gap here:

fixstr 101xxxxx 0xa0 - 0xbf
bin 8 11000100 0xc4

I.e., only short text strings can be encoded efficiently; byte strings always require an explicit length byte.

CBOR doesn't have that "limitation": https://tools.ietf.org/html/rfc7049#appendix-B (of course, it encodes something else less efficiently instead, as all MsgPack encoding bytes are used (well, one is reserved)).
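
As a rough illustration of the difference, the sketch below builds only the header (not the payload) for a short str or bytes value in each format. Both function names are hypothetical; the byte values come from the MsgPack spec and from the major-type layout in RFC 7049.

```python
def msgpack_short_head(value):
    """Header for a short str (< 32 bytes) or bytes (< 256 bytes) in MsgPack."""
    if isinstance(value, str):
        n = len(value.encode("utf-8"))
        if n > 31:
            raise ValueError("too long for fixstr")
        return bytes([0xA0 | n])      # fixstr: 101xxxxx, one byte total
    n = len(value)
    if n > 255:
        raise ValueError("too long for bin 8")
    return bytes([0xC4, n])           # bin 8: type byte + explicit length byte


def cbor_short_head(value):
    """Header for a short (< 24 bytes) str or bytes value in CBOR."""
    if isinstance(value, str):
        n = len(value.encode("utf-8"))
        major = 3                     # major type 3: text string
    else:
        n = len(value)
        major = 2                     # major type 2: byte string
    if n > 23:
        raise ValueError("length does not fit in the initial byte")
    return bytes([(major << 5) | n])  # length packed into the low 5 bits
```

For a 3-byte value, MsgPack needs one header byte for str but two for bytes, while CBOR needs one header byte in both cases.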

@pfalcon
Owner Author

pfalcon commented Jan 7, 2018

Note that the motivation for the marshall module is encoding data rows for a btree database. I.e. the reasoning is: "need to serialize tuples for btree db" -> "why not do that by implementing a marshall module, which can be used for many other things too".

That adds an additional requirement: being able to compare serialized arrays efficiently (i.e. without requiring full decoding).
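
One way to see why this requirement is not free: with a length-prefixed string encoding (the MsgPack fixstr layout is used here purely as an example), a plain bytewise comparison of the encoded form does not agree with Python's ordering of the original values. The helper name below is made up for the sketch.

```python
def encode_fixstr(s):
    """Encode a short str with a MsgPack-style fixstr header (hypothetical helper)."""
    data = s.encode("utf-8")
    if len(data) > 31:
        raise ValueError("too long for fixstr")
    return bytes([0xA0 | len(data)]) + data


a, b = "aa", "b"
enc_a = encode_fixstr(a)          # header 0xa2, then b"aa"
enc_b = encode_fixstr(b)          # header 0xa1, then b"b"

python_order = a < b              # ordering of the decoded values
bytewise_order = enc_a < enc_b    # memcmp-style ordering of the encoded bytes
```

Here `python_order` and `bytewise_order` disagree, because the length byte is compared before the content. A btree comparator would therefore have to at least parse the headers, or the format would need an order-preserving layout.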

@pfalcon
Owner Author

pfalcon commented Jan 7, 2018

CBOR defines encodings for bignums, for example. Looks like it's a winner.
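
For reference, a minimal sketch of integer encoding following RFC 7049: values that fit in 64 bits use major types 0 and 1, and anything larger is wrapped as a byte string under tag 2 (positive bignum) or tag 3 (negative bignum). The function names are illustrative only.

```python
def cbor_head(major, n):
    """Encode a CBOR initial byte plus argument for major type `major`."""
    if n < 24:
        return bytes([(major << 5) | n])
    for extra, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < (1 << (8 * size)):
            return bytes([(major << 5) | extra]) + n.to_bytes(size, "big")
    raise ValueError("argument does not fit in 64 bits")


def cbor_encode_int(value):
    """Encode a Python int of any size as CBOR."""
    if 0 <= value < 2 ** 64:
        return cbor_head(0, value)            # major type 0: unsigned int
    if -(2 ** 64) <= value < 0:
        return cbor_head(1, -1 - value)       # major type 1: negative int
    # Bignum: tag 2 for positive, tag 3 for negative (which stores -1 - value)
    tag, magnitude = (2, value) if value > 0 else (3, -1 - value)
    data = magnitude.to_bytes((magnitude.bit_length() + 7) // 8, "big")
    return cbor_head(6, tag) + cbor_head(2, len(data)) + data
```

With this sketch, `cbor_encode_int(2 ** 64)` produces the bytes `c2 49 01 00 00 00 00 00 00 00 00`, which matches the bignum example in Appendix A of RFC 7049.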

@hardkrash

CBOR tags are rather extensible; they are looking to incorporate fixed-point types and arrays for ADCs.
https://cbor-wg.github.io/array-tags/

@smurfix

smurfix commented May 10, 2021

> MsgPack has random gap in

Umm, no? 0xc0 through 0xc3 are None//False/True. CBOR also has gaps in it …

A more relevant advantage of CBOR is that you can prefix an item with a rather simple "use the following data as input to ‹class›.__setstate__()" tag, where the class name is encoded in the tag. To do the same thing with msgpack, you need either an in-memory copy of the object's encoded bytestring or two passes over the data structure that __getstate__() returns. Shorter tags could be used for more common distinctions, e.g. between tuple and list: just specify a "read-only hint" tag, and possibly a "the following data is ordered" tag for OrderedDict.

Another advantage would be the ability to encode indeterminate-length data (this is basically impossible with msgpack), though I have no idea whether that is actually a relevant use case for micropython/pycopy.
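
A small sketch of what indeterminate-length encoding looks like in CBOR: the array is opened with 0x9f, items follow one by one, and a 0xff "break" byte closes it, so the encoder never needs to know the element count up front. The generator name is hypothetical, and the sketch is restricted to integers 0..23 to stay short.

```python
def cbor_stream_small_ints(items):
    """Yield CBOR chunks for an indefinite-length array of ints in 0..23."""
    yield b"\x9f"                  # start of indefinite-length array
    for item in items:
        if not 0 <= item < 24:
            raise ValueError("sketch only handles ints 0..23")
        yield bytes([item])        # major type 0, value in the initial byte
    yield b"\xff"                  # "break" stop code


# The source can be any iterator, including one whose length is unknown.
encoded = b"".join(cbor_stream_small_ints(iter(range(1, 4))))
```

The result is the byte sequence `9f 01 02 03 ff`. MsgPack array headers always carry the element count, so the same data would have to be counted or buffered before the header could be written.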

pfalcon pushed a commit that referenced this issue Dec 2, 2021
asan considers that memcmp(p, q, N) is permitted to access N bytes at each
of p and q, even for values of p and q that have a difference earlier.
Accessing additional values is frequently done in practice, reading 4 or
more bytes from each input at a time for efficiency, so when completing
"non_exist<TAB>" in the repl, this causes a diagnostic:

    ==16938==ERROR: AddressSanitizer: global-buffer-overflow on
    address 0x555555cd8dc8 at pc 0x7ffff726457b bp 0x7fffffffda20 sp 0x7fff
    READ of size 9 at 0x555555cd8dc8 thread T0
        #0 0x7ffff726457a  (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xb857a)
        #1 0x555555b0e82a in mp_repl_autocomplete ../../py/repl.c:301
        #2 0x555555c89585 in readline_process_char ../../lib/mp-readline/re
        #3 0x555555c8ac6e in readline ../../lib/mp-readline/readline.c:513
        #4 0x555555b8dcbd in do_repl /home/jepler/src/micropython/ports/uni
        #5 0x555555b90859 in main_ /home/jepler/src/micropython/ports/unix/
        #6 0x555555b90a3a in main /home/jepler/src/micropython/ports/unix/m
        #7 0x7ffff619a09a in __libc_start_main ../csu/libc-start.c:308
        #8 0x55555595fd69 in _start (/home/jepler/src/micropython/ports/uni

    0x555555cd8dc8 is located 0 bytes to the right of global variable
    'import_str' defined in '../../py/repl.c:285:23' (0x555555cd8dc0) of
    size 8
      'import_str' is ascii string 'import '

Signed-off-by: Jeff Epler <jepler@gmail.com>