Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Packing/unpacking the bin format family #72

Open
rybakit opened this issue Nov 18, 2015 · 6 comments
Open

Packing/unpacking the bin format family #72

rybakit opened this issue Nov 18, 2015 · 6 comments

Comments

@rybakit
Copy link
Contributor

rybakit commented Nov 18, 2015

Packing/unpacking bin data fails for me:

var_dump(bin2hex(msgpack_pack("\x80")));

Expected: string(6) "c40180"
Actual: string(4) "a180"

var_dump(bin2hex(msgpack_unpack("\xc4\x01\x80")));

Expected: string(2) "80"
Actual: PHP Warning: [msgpack] (php_msgpack_unserialize) Parse error in ...

$ pecl list

INSTALLED PACKAGES, CHANNEL PECL.PHP.NET:
=========================================
PACKAGE VERSION STATE
msgpack 0.5.6   beta
$ php -v

PHP 5.5.9-1ubuntu4.14 (cli) (built: Oct 28 2015 01:34:46)
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.5.0, Copyright (c) 1998-2014 Zend Technologies
    with Zend OPcache v7.0.3, Copyright (c) 1999-2014, by Zend Technologies
    with Xdebug v2.2.3, Copyright (c) 2002-2013, by Derick Rethans
@Sean-Der
Copy link
Member

Hey @rybakit!

Good news first!
Using msgpack at 105961a93599a5a12f2e33eaebe7b25ac69f1984 var_dump(bin2hex(msgpack_unpack("\xc4\x01\x80"))); gives me string(2) "80"

Bad news is that we are just packing your data as string/raw this. Even though this is labled raw it was renamed to str in the new spec (but has the same leading bits)? So the a1(101) you are seeing is just a one byte string!

All I did was diagnose stuff though, didn't give you any answers. I think everything is working correctly? Feel free to close if that answer is good, or I would be surprised if I was completely confused :) (I read this through about 5 times...)

@rybakit
Copy link
Contributor Author

rybakit commented Nov 21, 2015

Hi @Sean-Der!

According to the spec:

String extending Raw type represents a UTF-8 string
Binary extending Raw type represents a byte array

So I would expect "\x80" to be packed as a byte array (bin 8), not a string. I've also tested it with a python library which returns c40180, not a180.

@Sean-Der
Copy link
Member

@rybakit

Right now we don't look at the string we are packing, so "\x80" is just handled like every other string. I will look at the Python library and see how they implemented packing (There isn't a dedicated binary type, so maybe they walk the string and decide based on its contents?)

Is this behavior breaking PHP <-> Python for you right now?

@rybakit
Copy link
Contributor Author

rybakit commented Nov 23, 2015

There isn't a dedicated binary type, so maybe they walk the string and decide based on its contents?

They check if the string is Unicode before packing it:

...
elif isinstance(obj, unicode):
    _pack_string(obj, fp)
elif isinstance(obj, str):
    _pack_binary(obj, fp)
...

Is this behavior breaking PHP <-> Python for you right now?

I'm not affected by this right now, but I saw these tickets (#36, #53) were closed and assumed that the bin family is fully supported. I also wrote a pure php implementation of the msgpack protocol and tried to run my tests against this pecl extension and found that they fail on bin and ext formats.

@Sean-Der
Copy link
Member

I really appreciate you making msgpack-php better just because! Sure needs some people who care about quality.. as I mess things up!

Let me look at what I can do with this! The only part that is rough with this is that check will increase runtime. All I have is a char * + int, so I will have to see what the right way to handle that.

thanks (also I owe you a merge and a bugfix already I should have time this week)

@rybakit
Copy link
Contributor Author

rybakit commented Aug 30, 2016

@Sean-Der Maybe this can be used to detect utf8 strings? Also, you may consider adding a flag to enable/disable utf8/bin auto-detection. That will help to keep encoding fast for the cases when the string type is known beforehand.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants