SEGFAULT when parsing CDATA in single threading mode. #163

Open
Jean-Daniel opened this issue Nov 12, 2018 · 2 comments
@Jean-Daniel

When using the parser in MyHTML_OPTIONS_PARSE_MODE_SINGLE mode, it is initialized in myhtml_init like this:

        case MyHTML_OPTIONS_PARSE_MODE_SINGLE:
            if((status = myhtml_create_stream_and_batch(myhtml, 0, 0)))
                return status;

Because this call requests 0 streams, myhtml->thread_stream is left initialized to NULL:

myhtml->thread_stream = NULL;

But then, when parsing CDATA (in myhtml_tokenizer_state_markup_declaration_open()), the parser calls myhtml_tree_wait_for_last_done_token(), which unconditionally accesses tree->myhtml->thread_stream->timespec. Since thread_stream is NULL, it crashes.
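The pattern can be shown in isolation with a minimal sketch. Everything in it is a hypothetical stand-in (`stream_t`, `parser_t`, `parser_init`, `parser_wait_interval`, the 1000 ns value), not the real myhtml types or functions; it only models "0 streams requested means a NULL pointer" and what a guarded access could look like.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are NOT the real myhtml structs. */
typedef struct { long sleep_ns; } stream_t;
typedef struct { stream_t *thread_stream; } parser_t;

/* Models the report: requesting 0 streams leaves thread_stream NULL. */
int parser_init(parser_t *p, size_t stream_count)
{
    assert(p != NULL);
    p->thread_stream = NULL;
    if (stream_count == 0)
        return 0;

    p->thread_stream = malloc(sizeof *p->thread_stream);
    if (p->thread_stream == NULL)
        return -1;
    p->thread_stream->sleep_ns = 1000; /* arbitrary illustrative value */
    return 0;
}

void parser_destroy(parser_t *p)
{
    free(p->thread_stream);
    p->thread_stream = NULL;
}

/* Guarded access: returns 0 and writes the interval when a stream exists,
 * -1 when there is none. An unguarded p->thread_stream->sleep_ns would
 * dereference NULL in the 0-stream case. */
int parser_wait_interval(const parser_t *p, long *out_ns)
{
    if (p->thread_stream == NULL)
        return -1;
    *out_ns = p->thread_stream->sleep_ns;
    return 0;
}
```

With `parser_init(&p, 0)`, `parser_wait_interval` reports failure instead of crashing. Whether a NULL check is the right fix for the library itself is a separate question; this is only one possible mitigation.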

Backtrace:

  myhtml_tree_wait_for_last_done_token(tree=., token_for_wait=.) at tree.c:2457
  myhtml_tokenizer_state_markup_declaration_open(tree=., token_node=., html="…", html_offset=413, html_size=378555) at tokenizer.c:943
  myhtml_tokenizer_chunk_process(tree=., html="…", html_length=378555) at tokenizer.c:88
  myhtml_tokenizer_chunk(tree=., html="…", html_length=378555) at tokenizer.c:104
  myhtml_tokenizer_begin(tree=., html="…", html_length=378555) at tokenizer.c:42
  myhtml_parse_fragment(tree=., encoding=MyENCODING_DEFAULT, html="…") at main.c

@lexborisov
Owner

Hi @Jean-Daniel
In single mode, the tokens will always be equal, so the program will not enter the loop.
Do you have an example HTML for which the program enters this loop in single mode?
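To make the "tokens are equal" argument concrete, here is a small model of a wait loop of the form `while (last_done != token_for_wait) sleep();`. The names (`token_t`, `tree_t`, `process_token_sync`, `would_wait`) are hypothetical and this is not the library's code; it only illustrates the condition being discussed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real myhtml structs. */
typedef struct { int id; } token_t;
typedef struct { const token_t *token_last_done; } tree_t;

/* Single-mode model: the token is handled synchronously, so it is
 * recorded as done before the tokenizer moves on. */
void process_token_sync(tree_t *tree, const token_t *token)
{
    assert(tree != NULL);
    tree->token_last_done = token;
}

/* The condition of the wait loop: true means the loop body (the sleep
 * that touches the thread stream) would run. */
bool would_wait(const tree_t *tree, const token_t *token_for_wait)
{
    return tree->token_last_done != token_for_wait;
}
```

Once `process_token_sync` has recorded the token, `would_wait` is false and the stream is never touched. If some code path leaves the token unrecorded, `would_wait` is true and the loop body runs; whether that happens in the reported crash is not established by this sketch.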

I found and fixed another problem. Please try the code from master.

Thanks for the report!

@Jean-Daniel
Author

Sorry, I didn't give you enough info. I'm actually using the parser to extract some data from HTML fragments (I only have the content), and I don't really need a full tree.
So I'm using the 'after token done' callback and disabling tree construction with MyHTML_TREE_PARSE_FLAGS_WITHOUT_BUILD_TREE.

A quick test reveals that it is the latter flag that triggers the bug. Without it, the parser works flawlessly, but when I set it, the parser crashes on CDATA.

#include <string.h>      // strlen
#include <myhtml/api.h>

int main(int argc, char **argv) {
  const char *bytes = "<div><![CDATA[ foo ]]></div>";
  size_t length = strlen(bytes);

  myhtml_t* myhtml = myhtml_create();
  myhtml_init(myhtml, MyHTML_OPTIONS_PARSE_MODE_SINGLE, 1, 0);

  myhtml_tree_t* tree = myhtml_tree_create();
  myhtml_tree_init(tree, myhtml);
  myhtml_tree_parse_flags_set(tree, MyHTML_TREE_PARSE_FLAGS_WITHOUT_BUILD_TREE | MyHTML_TREE_PARSE_FLAGS_SKIP_WHITESPACE_TOKEN);

  // parse html (we only have the body)
  myhtml_parse_fragment(tree, MyENCODING_UTF_8, bytes, length, MyHTML_TAG_BODY, MyHTML_NAMESPACE_HTML);
  myhtml_tree_destroy(tree);

  myhtml_destroy(myhtml);
  return 0;
}
