Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StackOverflowException in Xeger #5

Open
moodmosaic opened this issue Oct 21, 2012 · 7 comments
Open

StackOverflowException in Xeger #5

moodmosaic opened this issue Oct 21, 2012 · 7 comments

Comments

@moodmosaic
Copy link
Owner

The following regular expressions create a StackOverflowException:

^[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)$
(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?
^(http|https|ftp)\://([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*@)?((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.[a-zA-Z]{2,4})(\:[0-9]+)?(/[^/][a-zA-Z0-9\.\,\?\'\\/\+&%\$#\=~_\-@]*)*$
(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$

The Xeger class keep traversing the Automaton's States resulting to a StackOverflowException. This happens in the IList<Transition> GetSortedTransitions() method of the State class.

The issue appeared after this commit which addresses this issue.

@moodmosaic
Copy link
Owner Author

As @ymotton points out the following regular expression also throws:

@"(.*)\r\n\r\n"

@moodmosaic moodmosaic removed their assignment Jul 1, 2015
@edwardskrod
Copy link

@moodmosaic Did you guys ever figure this out?

@moodmosaic
Copy link
Owner Author

@edwardskrod, no we haven't. See also #13 (comment).

@trylock
Copy link

trylock commented Nov 13, 2018

There is a non-zero probability that pretty much any non-trivial regular expression will throw a StackOverflowException. This is because generally there will be cycles in its automaton. However, the expected number of steps of the Xeger algorithm is bounded (see http://web.cs.elte.hu/~lovasz/erdos.pdf) Simply rewriting the Generate method to use a for loop or something like that instead of recursion should do the trick. I can send a pull request when I have time, unless there is another problem with this I'm not aware of.

Also, fun fact about the linked issue: If we pick one of 2^16 characters at random, the mean time to get an English letter is 2^16/26 (about 2521) trials.

@moodmosaic
Copy link
Owner Author

I can send a pull request when I have time

Please do. There's no need for us to strictly follow the Java version (dk.brics.automaton and Xeger). 🚀

@trylock
Copy link

trylock commented Nov 18, 2018

I finally had a little time to think about this properly, First of all, the linked document is obviously not relevant to this discussion as it deals with undirected graphs. Under current conditions, there are some regular expressions where the expected number of steps is exponential (for example: .*some long suffix). Another disadvantage of the old algorithm is that the generated words depend heavily on the graph structure. Consider this regexp: a{1,4}, for the Xeger algorithm, P(a) = 1/2, P(aa) = 1/4, P(aaa) = 1/8, P(aaaa) = 1/8

I thought about possible alternatives and there is a lot of them. It all depends on our requirements.

  1. uniform distribution of generated words
  2. uniform/custom distribution of generated word lengths
  3. uniform/custom distribution of cycle lengths (for example in [a-z]+@[a-zA-Z0-9_-]+ we might want the part before @ to have the same expected length as the part after @)

Of course, I assume we restrict generated words to some finite subset (like words of length at most n).

These requirements go against each other. For example, in general you can't have both 1) and 2) (.* - there is a lot more longer words). It would be nice to have multiple generators.

As for the Xeger algorithm, I'd rather get rid of it completely. However, I understand that a lot of code may depend on its probability distribution. We can maybe create a modified version which will stop after a random number of steps and generate the rest of the word without using cycles. This shouldn't break the distribution too much and it will reduce the expected number of steps greatly. What do you think?

@moodmosaic
Copy link
Owner Author

I would definitely go with different generators, with an API that also supports LINQ, like this one. This one is from Hedgehog but the idea about how the API can look like is similar.

We can have a Gen type that takes care of generating the data we want. – On distribution, I'd rather have uniform distribution by default, and have some generator(s) that allow you to use a weighted distribution.

(All examples I use come from Hedgehog, the property testing tool I'm working on, but the concept of data generation applies here as well.)


At this point we might wonder if it's worth keeping the Xeger algorithm around; if I could go back in time, I wouldn't port Xeger and instead I would only port dk.brics.automaton instead of both...

For now, I would keep Xeger around, perhaps refactor it to use the (new) Gen class but I wouldn't change its public methods if possible, to avoid breaking other peoples code until the next major version bump.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants