Tokenizer gets stuck on bad-match regex matching #14

AWe · 2012-10-10T09:48:52Z

i found a problem by using your Twokenize tokenizer.

The program stucks in the Twokenize.simpleTokenize() method in this part:
while(matches.find()){
// The spans of the "bads" should not be split.
if (matches.start() != matches.end()){ //unnecessary?
List bad = new ArrayList(1);
bad.add(splitPunctText.substring(matches.start(),matches.end()));
bads.add(bad);
badSpans.add(new Pair<Integer, Integer>(matches.start(),matches.end()));
}
}

Example string: String test = "@rkdalswl0302 완댜)물론~얼굴은아니지만....................................................................................................";

haijieg · 2012-10-19T17:57:19Z

Another input I found that causes the stuck in while-loop is here:

String test = "7h... nh? a nhìu th?t nhìu ^^~..........................................................................."

Looks like the "....." pattern is the problem?

haijieg · 2012-10-22T03:10:22Z

This is not the only bad example. Also I don't think the problem is the while loop, it might be due to some catastrophic regex matching which can take years.

brendano · 2012-10-22T03:46:59Z

Yeah it's one of the regexes for sure...

On Sun, Oct 21, 2012 at 11:10 PM, haijieg notifications@github.com wrote:

This is not the only bad example. Also I don't think the problem is the
while loop, it might be due to some catastrophic regex matching which can
take years.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/14#issuecomment-9651481.

brendano · 2012-10-22T13:28:12Z

If you could assemble as many as you can find, it would be helpful. I'm trying to create minimal test cases and it's very subtle. (The problem is in non-determinancy of the regex engine... when you start using lookaheads/lookbehinds it's no longer a DFA and horrible things can happen. I'm afraid this is looking complicated to figure out.)

brendano · 2012-10-22T13:33:44Z

actually no don't bother i think i see the problem (the eastern emoticon system)

brendano · 2012-10-22T14:44:42Z

OK, it's due to a use of a backreference (or some sort of regex pattern reference) in "basicface" when embedded inside "eastEmote". I have to consult with the author of this to see what's going on

Fixed error where ........ would tokenize to ... ... ... and the error where ~......... would hang Hopefully fixes github issue #14 (from tobiowo@471d223 but with CRLF->LF) Change analysis: sample of 1,355,000 tweets changes 30,000 of them they all look like improvements, mostly the ellipsis

brendano · 2012-10-23T04:15:49Z

Hi, I just pushed a bugfix to master that makes these examples work. Could you try it on the large dataset? Let me know if building it is annoying (see the comment I wrote on #15 )

and re:

This is not the only bad example.

Of course. Welcome to large-scale text processing: every one-out-of-a-billion bug will happen :) .... so especially in a distributed system, have to log what section of the dataset your runner is running on...

When I used to use Hadoop for NLP software that had once per 10mil or 100mil or so bugs like this, we had to run the NLP system out-of-process to monitor deadlocks like this one, so you can notice when the process isn't working and kill it. Not a fun situation, sorry :(

brendano · 2012-10-23T05:16:02Z

I just ran it on 1489999 or so tweets just now and here is the distribution of number of milliseconds per tweet. The top 8 worst ones are more than 500 ms, which isn't good (max 921 ms) ... this sounds like another regex bug, perhaps ... but the rest seem to terminate ok. Maybe if you go to a billion tweets the maxes are worse? I started a larger job but won't finish for a while. (Another caveat with this experiment, this includes JSON parsing, done in-process with --input-format json.)

haijieg · 2012-10-23T05:25:10Z

Cool. I'm going to test it on the daily.10k tonight.

On Tue, Oct 23, 2012 at 1:16 AM, brendano notifications@github.com wrote:

I just ran it on 1489999 or so tweets just now and here is the
distribution of number of milliseconds per tweet. The top 8 worst ones are
more than 500 ms, which isn't good (max 921 ms) ... this sounds like
another regex bug, perhaps ... but the rest seem to terminate ok. Maybe if
you go to a billion tweets the maxes are worse? I started a larger job but
won't finish for a while. (Another caveat with this experiment, this
includes JSON parsing, done in-process with --input-format json.)

https://a248.e.akamai.net/camo.github.com/3b827bd72554c0b981f8972d277e0656495432f6/687474703a2f2f6272656e6f636f6e2e636f6d2f53637265656e25323073686f74253230323031322d31302d32332532306174253230312e30392e3237253230414d2e706e67

https://a248.e.akamai.net/camo.github.com/1b54beac9b63cf3b5a548c6106036fcd766882a2/687474703a2f2f6272656e6f636f6e2e636f6d2f53637265656e25323073686f74253230323031322d31302d32332532306174253230312e30392e3538253230414d2e706e67

https://a248.e.akamai.net/camo.github.com/79711448c761d9b5c635ff39c1e8473ecc7f0205/687474703a2f2f6272656e6f636f6e2e636f6d2f53637265656e25323073686f74253230323031322d31302d32332532306174253230312e30382e3035253230414d2e706e67

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/14#issuecomment-9690885.

haijieg · 2012-10-23T06:10:55Z

It passed the test of daily.10k with 14720000 tweets. Will try on larger
dataset later tomorrow.

Jay

On Tue, Oct 23, 2012 at 1:25 AM, Haijie Gu gu.haijie@gmail.com wrote:

Cool. I'm going to test it on the daily.10k tonight.

On Tue, Oct 23, 2012 at 1:16 AM, brendano notifications@github.comwrote:

I just ran it on 1489999 or so tweets just now and here is the
distribution of number of milliseconds per tweet. The top 8 worst ones are
more than 500 ms, which isn't good (max 921 ms) ... this sounds like
another regex bug, perhaps ... but the rest seem to terminate ok. Maybe if
you go to a billion tweets the maxes are worse? I started a larger job but
won't finish for a while. (Another caveat with this experiment, this
includes JSON parsing, done in-process with --input-format json.)

https://a248.e.akamai.net/camo.github.com/3b827bd72554c0b981f8972d277e0656495432f6/687474703a2f2f6272656e6f636f6e2e636f6d2f53637265656e25323073686f74253230323031322d31302d32332532306174253230312e30392e3237253230414d2e706e67

https://a248.e.akamai.net/camo.github.com/1b54beac9b63cf3b5a548c6106036fcd766882a2/687474703a2f2f6272656e6f636f6e2e636f6d2f53637265656e25323073686f74253230323031322d31302d32332532306174253230312e30392e3538253230414d2e706e67

https://a248.e.akamai.net/camo.github.com/79711448c761d9b5c635ff39c1e8473ecc7f0205/687474703a2f2f6272656e6f636f6e2e636f6d2f53637265656e25323073686f74253230323031322d31302d32332532306174253230312e30382e3035253230414d2e706e67

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/14#issuecomment-9690885.

brendano · 2012-10-23T20:23:47Z

OK, from a bigger sample of 143mil tweets ("daily 100k" here) I'm seeing a similar distribution for the lower <2ms range

There are 4146 tweets that took more than 10ms, and 96 that took longer than 1000ms (max 143 seconds); those last ones should be investigated at some point.

brendano mentioned this issue Oct 23, 2012

Setup doesn't work #15

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tokenizer gets stuck on bad-match regex matching #14

Tokenizer gets stuck on bad-match regex matching #14

AWe commented Oct 10, 2012

haijieg commented Oct 19, 2012

haijieg commented Oct 22, 2012

brendano commented Oct 22, 2012

brendano commented Oct 22, 2012

brendano commented Oct 22, 2012

brendano commented Oct 22, 2012

brendano commented Oct 23, 2012

brendano commented Oct 23, 2012

haijieg commented Oct 23, 2012

haijieg commented Oct 23, 2012

brendano commented Oct 23, 2012

Tokenizer gets stuck on bad-match regex matching #14

Tokenizer gets stuck on bad-match regex matching #14

Comments

AWe commented Oct 10, 2012

haijieg commented Oct 19, 2012

haijieg commented Oct 22, 2012

brendano commented Oct 22, 2012

brendano commented Oct 22, 2012

brendano commented Oct 22, 2012

brendano commented Oct 22, 2012

brendano commented Oct 23, 2012

brendano commented Oct 23, 2012

haijieg commented Oct 23, 2012

haijieg commented Oct 23, 2012

brendano commented Oct 23, 2012