Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC 8259, bitwise checks, next siblings, doxygen comments, and more #197

Open
wants to merge 20 commits into
base: master
Choose a base branch
from

Conversation

themobiusproject
Copy link

@themobiusproject themobiusproject commented Jun 2, 2020

This PR does a number of things, but the main points are:

  • it is fully RFC-8259 compliant
  • enables strict mode by default and adds a permissive mode
  • extendeds jsmntype_t to:
    • allow bitwise checks for better json validation while parsing
    • add more information in post processing
    • define the rules for both strict and permissive modes
  • adds extra validation when tokens == NULL
    • the parser keeps track of depth and type of container down to 16 or 32 levels deep depending on the JSMN_SHORT_TOKENS flag
  • adds siblings to traverse through a single level
    • if there are more than one json string in the buffer, the first element of the first json string links to the first element of the second json string through the sibling
  • allows three different modes for treating multiple json objects in the same buffer
    • the default mode stops at the end of a json object allowing the user to parse the next json object if there is one
      • a new function is available to query if there is anything left in the buffer
    • JSMN_MULTIPLE_JSON parses multiple json objects in a single buffer
    • JSMN_MULTIPLE_JSON_FAIL parses one object will fail if there is any non-whitespace character following it in the buffer
      • this is probably the mode that follows the intent of RFC-8259 the best
  • JSMN_PERMISSIVE flags have been extended to be more modular
    • JSMN_PERMISSIVE_KEY allows PRIMITIVEs to be OBJECT KEYs
    • JSMN_PERMISSIVE_PRIMITIVE allows PRIMITIVEs to be any contiguous value
  • adds doxygen compatible comments and doxygen.cfg

Code-wise, the opening and closing of containers, colons, and commas have been pulled out of jsmn_parse and put into their own functions (jsmn_parse_ prefix with container_open, container_close, colon, and comma respectively). jsmn_parse is extremely clean. Overall, the code is quite verbose but most of it reduces to bit checks and comparisons.

For post processing, every token has multiple flags set.

  • JSMN_KEY or JSMN_VALUE
  • JSMN_OBJECT, JSMN_ARRAY, JSMN_STRING, or JSMN_PRIMITIVE

These base flags are combined for complex types such as:

  • JSMN_OBJ_VAL for an object that is also a value, or
  • JSMN_STR_KEY for a string that is also a key.

Primitives are further flagged:

  • JSMN_PRI_LITERAL for true, false, or null
  • JSMN_PRI_MINUS if it is a negative number
  • JSMN_PRI_DECIMAL if it is a float or double
  • JSMN_PRI_EXPONENT if it has an exponent

This and a lot of validation is done with an enum that can be packed into two bytes. With the enum, vanilla jsmntok_t, parent, and sibling, the final size of the jsmntok_t can be reduced to 12 bytes, or 3 ints, per token.

I have also added over 300 validation tests courtesy of JSONTestSuite. The tests are set up in three groups: the first are the tests that should pass, second are the tests that shouldn't pass, and third are the tests that may pass based on the json parser implementation.

This PR should resolve #118, resolve #131, resolve #137, resolve #158, resolve #163, resolve #177, resolve #178, and resolve #179.
I believe this can also resolve #164 by being able to traverse the tokens by sibling.

This should be able to close PR #93, PR #102, PR #108, PR #166, PR #168, PR #172, PR #192, and PR #198.

I also plan on getting utf8 into here with a flag to enable or disable it as to avoid jsmn ballooning in size for small devices.

A strong set of permissive rules would be welcome to further refine my rules.

jsmnint_t typedef'd to allow easy changing of main int type
jsmntype_t extended to allow more checks
JSMN_STRICT now has simplified format checking via parser->expected
JSMN_NEXT_SIBLING allow for easy parsing to a parents next sibling
@themobiusproject themobiusproject changed the title Bitwise checks Bitwise checks, next siblings, doxygen compatible comments, and more Jun 2, 2020
After the parse, primitive tokens can be queried with JSMN_PRI_MINUS for 
negative numbers, JSMN_PRI_DECIMAL for fractions, and JSMN_PRI_EXPONENT 
for numbers with exponents. JSMN_PRI_LITERAL denotes 'true', 'false', or 
'null'.
This undoes my previous comment preference, but I would rather keep c89 
compliance.
The Makefile not also compiles with the c89 standard. Removed some flags 
that have no use in this branch.
Updated jsmn.h to pass all available tests.
Modified testutil.h to better support jsmnint_t.
Remove a warning when compiling the tests because invalid characters are 
in the file for testing.
@themobiusproject themobiusproject mentioned this pull request Jun 10, 2020
* Updated jsmn to allow for continuation of PRIMITIVEs after a 
JSMN_ERROR_PART error. Conditionals are split between lines for coverage 
tests. jsmn_defines.h documents and sets all of the possible build 
configurations.
* jsmn_utils added for better examples. The explode example is in the 
example folder.
* tests have been updated to correctly test different configurations.
* README.md updated with c89 comments.
@themobiusproject themobiusproject changed the title Bitwise checks, next siblings, doxygen compatible comments, and more RFC 8259, bitwise checks, next siblings, doxygen comments, and more Jun 18, 2020
@atomsnc
Copy link

atomsnc commented Aug 20, 2020

Hello,

There seems to be an issue in this PR where a string like
char s[] = "{\"array\":[1,2]}";
fails, when an array is inside an object.

It can be fixed by changing
if (type & (JSMN_OBJECT | JSMN_INSD_OBJ))
to
if (type & (JSMN_OBJECT | JSMN_INSD_OBJ) && !(type & JSMN_ARRAY))
in jsmn_parse_comma

@themobiusproject
Copy link
Author

@atomsnc - Sorry, I saw your post and then life hit me for a bit.

Very good catch. I don't know how I wasn't testing for that case previously. I think your fix fixes the symptom but not the cause, so I solved the problem a little further up by ignoring JSMN_INSD_OBJ altogether.

Thank you, and I am also pleased that someone could get through my code.

@pt300
Copy link
Collaborator

pt300 commented Oct 24, 2020

A massive pr

@pt300
Copy link
Collaborator

pt300 commented Oct 24, 2020

I really like how strict mode is the default here with possibility of enabling specific relaxations. Especially ones that help cut corners just to squeeze extra performance by reducing conformance. I also think introduction of the sibling feature is a good idea, helps a lot when traversing a more complex JSON.

Myself I mostly was thinking about trying to fix issues with JSMN without introducing additional fields to parser's state. But honesty, by just saving what's expected in the future should just result in more performant code, as bitwise checks ought to be faster and cleaner than deriving what should come next from current tokens and input string.

I think I will try to mix your PR with @dominickpastore to deal with certain parts that just personally felt ugly to me. Throughout the next week or two I will try to battle with git in an attempt to do it properly and push the results onto a separate branch.

@themobiusproject
Copy link
Author

@pt300 I appreciate the comments.

A massive pr

It certainly is.

I swiped the next_sibling idea from #68 and shorter tokens and unsigned ints from #93 years ago when I first found jsmn.

Myself I mostly was thinking about trying to fix issues with JSMN without introducing additional fields to parser's state. But honesty, by just saving what's expected in the future should just result in more performant code, as bitwise checks ought to be faster and cleaner than deriving what should come next from current tokens and input string.

I made sure to keep the original types and added new types to be able to check for after parsing, and these led into adding extra states into the same enum for deeper checks by knowing what is legal as the next types. I worked hard to keep everything in a 32-bit uint.

I like the enums and bitwise checks because it is calculated at compile time but makes the code easier to read in places. expected = JSMN_PRI_INTEGER | JSMN_PRI_DECIMAL | JSMN_PRI_EXPONENT | JSMN_CLOSE; lets you be able to compare the code to the json rfc standard easily and it is all compile time code.

I think I will try to mix your PR with @dominickpastore to deal with certain parts that just personally felt ugly to me. Throughout the next week or two I will try to battle with git in an attempt to do it properly and push the results onto a separate branch.

I look forward to it. I hope you can feel the spirit of jsmn's current release in my PR because it was a great starting point and I tried to keep as much of it as I could while make some significant changes. I am curious though; what parts feel ugly to you? I know I can write some incredibly dense code that isn't the easiest to follow.

@pt300
Copy link
Collaborator

pt300 commented Oct 25, 2020

After the last sweep through your code it's been mostly few smaller things like the use of defined() in preprocessor macros instead of #ifdef and #ifndef. There was something else but I cannot remember at this time. I will be looking thoroughly once more and will let you know if I find any issues. Also, just a reminder that I am no expert. Feel free to say so if what I say feels wrong.

@atomsnc
Copy link

atomsnc commented Oct 25, 2020

@themobiusproject I feel like a worthwhile addition would be to have a coalesce strategy in the utils for open tokens when the tokens array cannot be large.

For tokens which have token.end != JSMN_NEG, they can either by dropped or moved out with a callback.

next_sibling will need to be computed again.

Something like

int coalesce_open_tokens(jsmntok_t *tokens, jsmn_parser *parser) {
    jsmnint_t i, index = 0;
    
    for (i = 0; i < parser->count; i++) {
        if (tokens[i].end == JSMN_NEG) {
            tokens[index++] = tokens[i];            
        }
#if defined(JSMN_NEXT_SIBLING)
        else {
            /* Relink siblings if any */
            for (jsmnint_t j = 0; j < i; j++) {
                if (tokens[j].next_sibling == i) {
                    tokens[j].next_sibling = tokens[i].next_sibling;
                    break;
                }
            }
        }
#endif
    }
    /* Return 0 if all the tokens are being worked on. We truly cannot add more. */ 
    if (index == parser->count) return 0;
    parser->count = index;
    parser->toknext = index;
    return 1;
}

Above code is untested.

@themobiusproject
Copy link
Author

@pt300

After the last sweep through your code it's been mostly few smaller things like the use of defined() in preprocessor macros instead of #ifdef and #ifndef. There was something else but I cannot remember at this time. I will be looking thoroughly once more and will let you know if I find any issues. Also, just a reminder that I am no expert. Feel free to say so if what I say feels wrong.

I actually also prefer #ifdef and #ifndef, but the reason I went with #if defined() and #if !defined() is because of the occasional instance of #if defined() && defined() which looked horrible when surrounded by #if{,n}defs.

@atomsnc

I feel like a worthwhile addition would be to have a coalesce strategy in the utils for open tokens when the tokens array cannot be large.

Could you better describe this use case? I am trying to figure out where it fits in so I can better understand the code snippet.

@atomsnc
Copy link

atomsnc commented Oct 26, 2020

@themobiusproject When the tokens array for storing tokens while parsing cannot be too big (especially in embedded devices). In that case when the array gets full, user can evict a few completed tokens to parse a bit more.

@pt300
Copy link
Collaborator

pt300 commented Nov 4, 2020

I have picked most of the changes you've done in this PR and those are now visible in experimental branch. For now I have left out the utilities and the example utilising them.

Definitely there are more changes coming to what is on the branch right. Mostly cleaning up the code and actually making sure there's no space for nasty surprises. Then, it will go back to adding features again.

@holgerschurig
Copy link

holgerschurig commented Nov 14, 2021

@pt300 @themobiusproject I just checked out the experimental branch that contains this patch. However, now example/simple.c doesn't parse the JSON in it anymore. This here still works:

{"user": "johndoe", "admin": false, "uid": 1000, "groups": ["users"]}

but this here won't parse anymore:

{"user": "johndoe", "admin": false, "uid": 1000, "groups": ["users", "wheel", "audio", "video"]}

This are the warnings during compilation and the (invalid) result of the run:

holger@holger:~/jsmn$ gcc -Wall -Wextra -oa example/simple.c 
example/simple.c: In function ‘jsoneq’:
example/simple.c:16:50: warning: comparison of integer expressions of different signedness: ‘int’ and ‘jsmnint’ {aka ‘unsigned int’} [-Wsign-compare]
   16 |   if (tok->type == JSMN_STRING && (int)strlen(s) == tok->end - tok->start &&
      |                                                  ^~
example/simple.c: In function ‘main’:
example/simple.c:66:21: warning: comparison of integer expressions of different signedness: ‘int’ and ‘jsmnint’ {aka ‘unsigned int’} [-Wsign-compare]
   66 |       for (j = 0; j < t[i + 1].size; j++) {
      |                     ^
holger@holger:~/jsmn$ ./a
Failed to parse JSON: -3

I also tried then PRs 194-196 on top of master. And with that, the example/simple.c still runs.

@themobiusproject
Copy link
Author

themobiusproject commented Nov 24, 2021

@holgerschurig I just added the json from the simple.c to my tests and it parses. You can check 0560339. When @pt300 merged my code, he grabbed it before 9f7e9de which did leave an error in parsing. That commit needs to be applied to the experimental branch and it should fix the problem.

As for the files in the example, I only ever touch explode.c because I wrote that one along with the jsmn_explodeJSON function associated with it. You can copy JSON_STRING from simple.c (or your unescaped version in your previous comment) into library.json in the root directory and run explode from the examples folder, it will parse and show more information than you will probably want to see.

@pt300
Copy link
Collaborator

pt300 commented Nov 24, 2021

Sorry again for the inactivity.

This looks odd, as I remember the tests were passing. I will make sure to look into this tomorrow and put the patch in.

I also really wanted to simplify some parts of the parsing, such as reparsing some tokens from the beginning instead of trying to continue parsing. Should cut down complexity without major performance hit + make the code easier to maintain and more difficult to break. But, you can see how well it's going. Hope you can forgive me.

@pt300
Copy link
Collaborator

pt300 commented Nov 25, 2021

I have applied the changes, but there seem to be still an issue with parsing. I will have to revisit it some other time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment