
Performance enhancements to enable large inputs #147

Open
vitorsr opened this issue Jan 18, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

@vitorsr

vitorsr commented Jan 18, 2023

Hi,

I want to run this on a low-entropy, case-insensitive UTF-8 input of roughly a billion entries, each a variable number of characters up to approximately 256 bytes.

The data comply with a specification format, but empirically they are very low entropy: they use only a very narrow subset of the specification and a very narrow subset of UTF-8 characters.

Think of URI schemes like mailto [1]. The standard makes ample provisions for the format, but empirically only a narrow subset is used.

However, DFA construction is very slow even for very small samples. In addition, even though I am not very proficient in Rust, I cannot see how to introduce multiprocessing (MP) or MPI constructs. DFA minimization also returns impractical expressions, and I cannot see how to relax, penalize, or restrict the size of the expression; coalesce (arbitrary) character ranges or groups; include prior knowledge of the superset expression defined by the specification format; or take advantage of data preorderings.

Finding an expression that matches legal and legitimate data based on the empirical set is useful for detecting diverging data.

As an example, if some subexpression is known from the specification to be constrained to \w characters, and in the data all lowercase alphabet characters except one are present, it would be fine to merge the ranges to [a-z]. The point is that even though this subexpression may contain any \w character per the specification, only [a-z] is used in practice; if some new datum contained digits, that would be suspicious, even though the specification allows it, given that a billion samples did not.
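To make the merging idea concrete, here is a minimal sketch of threshold-based range coalescing. The function name and the 0.9 coverage threshold are hypothetical illustrations of the heuristic I have in mind, not grex behavior:

```python
def coalesce_observed(chars, threshold=0.9):
    """Collapse an observed character set into a single covering range
    when the set fills at least `threshold` of that range."""
    observed = sorted(set(chars))
    lo, hi = observed[0], observed[-1]
    span = ord(hi) - ord(lo) + 1
    if len(observed) / span >= threshold:
        return f"[{lo}-{hi}]"
    return "[" + "".join(observed) + "]"

# All lowercase letters except "q" still coalesce to [a-z]:
print(coalesce_observed("abcdefghijklmnoprstuvwxyz"))  # [a-z]
# A sparse set is left as an explicit class:
print(coalesce_observed("az"))  # [az]
```

A real implementation would additionally consult the specification's superset class (e.g. \w) before widening a range.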

The constraints imposed by the DFA solution as implemented are too strict in this regard, and I cannot see where to go from here.

To summarize, I would greatly appreciate it if you could point me toward scaling this library to work on large inputs.

I hope this tiny discussion interests you to some degree.

Vítor

[1] https://datatracker.ietf.org/doc/html/rfc6068#section-2

@pemistahl
Owner

Hi Vítor, thanks for reaching out to me. This tool is simply not meant to scale to billion-sized inputs; I would not even know how to accomplish that. For DFA generation I use a separate library, so I'm limited by it. The tool's primary purpose is to help create regular expressions with a focus on correctness (which is hard enough), not performance. I will work on a new feature that allows creating various character classes and character ranges so that more generalized expressions are possible.

Feel free to open a PR if you see a way to improve performance or other aspects. Even though I'm relatively proficient in Rust, I'm far from being an expert.

@vitorsr
Author

vitorsr commented Jan 19, 2023 via email

@pemistahl
Owner

Thank you for your suggestions. Looks interesting. I will leave this issue open and come back to it as soon as I continue working on performance improvements.

@pemistahl pemistahl added the enhancement New feature or request label Jan 20, 2023
@vitorsr vitorsr changed the title Scaling to billion-sized inputs Performance enhancements to enable large inputs Jan 20, 2023
@vitorsr
Author

vitorsr commented Jan 20, 2023

I changed the title so it looks less like clickbait to outside viewers.

I'll share a few test sets to consider and more as they come along.

  • Case 1.1
$ python3 -c 'from itertools import combinations_with_replacement;'\
'_ = [print("".join(i)) for i in combinations_with_replacement("ab", 256)]' | grex -r -
^(?:a{255}b|a{254}b{2}|a{253}b{3}|a{252}b{4}|a{251}b{5}|a{250}b{6}|a{249}b{7}|a{248}b{8}|a{247}b{9}|a{246}b{10}|a{24...

Expected output: ^[ab]{256}$.
grex does not yield the expected output. Adding more characters to the repeated alphabet or increasing the combination size chokes the program. Character set (or interval) detection is insufficient.
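The generalization I expect for this case can be sketched as a hypothetical post-processing step (the function name and heuristic are my own illustration, not grex code):

```python
def infer_uniform_class_repeat(lines):
    """If every line has the same length and draws on one small
    alphabet, collapse the set to a class-with-repetition pattern."""
    lengths = {len(line) for line in lines}
    if len(lengths) != 1:
        return None
    alphabet = "".join(sorted(set("".join(lines))))
    return f"^[{alphabet}]{{{lengths.pop()}}}$"

# The three-character analogue of Case 1.1:
print(infer_uniform_class_repeat(["aaa", "aab", "abb", "bbb"]))  # ^[ab]{3}$
```

Such a pattern over-approximates the sample set, but for anomaly detection that trade-off is exactly what I want.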

  • Case 1.2
$ python3 -c 'from itertools import combinations_with_replacement;'\
'_ = [print("".join(i)) for i in combinations_with_replacement("ab", 256)]' | grex -
^(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:...

Expected output: ^[ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab....
grex does not yield the expected output. The expression builder favors nesting to a fault when the repetition or substitution flags are absent.

  • Case 1.3
$ python3 -c 'from itertools import combinations_with_replacement;'\
'_ = [print("".join(i)) for i in combinations_with_replacement("ab", 256)]' | grex -w -r -
^\w{256}$

Expected output: ^\w{256}$.
grex yields the correct output. However, this should not take so long: after substitution, all entries are identical. Deduplication guards are needed to avoid choking the program.
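The guard I have in mind could collapse duplicates right after class substitution. A sketch of the idea with hypothetical helper names (not grex internals):

```python
def substitute_word_class(line):
    """Mirror what -w does conceptually: replace each word
    character with the \\w placeholder."""
    return "".join("\\w" if ch.isalnum() or ch == "_" else ch for ch in line)

def dedup_after_substitution(lines):
    """Collapse lines that become identical once classes are
    substituted, preserving first-seen order."""
    return list(dict.fromkeys(substitute_word_class(line) for line in lines))

# Every combination over "ab" collapses to a single candidate,
# so downstream DFA construction sees one entry instead of thousands:
print(dedup_after_substitution(["aab", "abb", "bbb"]))  # ['\\w\\w\\w']
```

With this in place, Case 1.3 would hand the DFA builder a single string regardless of how many input combinations are piped in.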

  • Case 2.1
$ grex -r '<{{ aaa }}>' '<{{ aab }}>' '<{{ aac }}>' '<{{ abb }}>' '<{{ abc }}>' '<{{ acc }}>' '<{{ bbb }}>' '<{{ bbc }}>' '<{{ bcc }}>' '<{{ ccc }}>' '</{{ aaa }}>' '</{{ aab }}>' '</{{ aac }}>' '</{{ abb }}>' '</{{ abc }}>' '</{{ acc }}>' '</{{ bbb }}>' '</{{ bbc }}>' '</{{ bcc }}>' '</{{ ccc }}>'
^<(?:/\{{2} (?:(?:ab|b{2})c|a{2}[bc]|a(?:b{2}|c{2})|bc{2}|a{3}|b{3}|c{3}) \}{2}>|\{{2} (?:(?:ab|b{2})c|a{2}[bc]|a(?:b{2}|c{2})|bc{2}|a{3}|b{3}|c{3}) \}{2}>)$

Expected output: ^</?\{{2} [abc]{3} \}{2}>$.
grex does not yield the expected output. This case illustrates insufficient common prefix and suffix detection, as well as the issues pointed out in Cases 1.1 and 1.2.
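The prefix/suffix factoring I expect can be sketched in a few lines using os.path.commonprefix (the function name is my own; this is an illustration, not grex code):

```python
import os

def common_affixes(samples):
    """Factor out the longest common prefix and suffix, leaving the
    variable middles to be generalized separately."""
    prefix = os.path.commonprefix(samples)
    # Reverse the strings to reuse commonprefix for the suffix.
    suffix = os.path.commonprefix([s[::-1] for s in samples])[::-1]
    middles = [s[len(prefix):len(s) - len(suffix)] for s in samples]
    return prefix, middles, suffix

print(common_affixes(["<{{ aaa }}>", "</{{ abc }}>"]))
# ('<', ['{{ aaa', '/{{ abc'], ' }}>')
```

Factoring the shared `<` and ` }}>` out first, then generalizing only the middles, would naturally lead toward the expected `^</?\{{2} [abc]{3} \}{2}>$` shape.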

These test cases also use only a single thread. Many operations, including (independent) iterations and sorting, could be swapped for parallel counterparts for a free win.

@scheidelerl

Hey,
so far I like the tool very much, but if I may, I would like to give more input on this topic.

I have a very simple data set: each entry is initial digits, optionally followed by one or two lowercase non-digit characters, optionally followed by a Roman numeral from I to VI in parentheses.

I would offer the following regex. Surely this could also be done more simply, but here it is:

^[0-9]+[a-z]{0,2}(\s\((IV|V?I{0,3})\))?$

regex101 analysis.

Test file: test.txt
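A quick self-check that the proposed pattern behaves as described (the sample values below are illustrative, not taken from test.txt). One caveat: the `V?I{0,3}` alternation also admits VII, VIII, and an empty "numeral" in parentheses, which may or may not matter here:

```python
import re

# The pattern proposed above.
PATTERN = re.compile(r"^[0-9]+[a-z]{0,2}(\s\((IV|V?I{0,3})\))?$")

accepted = ["136", "256j", "99ab", "136 (II)", "440 (IV)", "725 (VI)"]
rejected = ["abc", "136(II)", "12abc"]

for sample in accepted:
    assert PATTERN.match(sample), sample
for sample in rejected:
    assert not PATTERN.match(sample), sample
print("all samples behave as expected")
```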

Here's what I got with grex -f test.txt:

^(?:(?:(?:13|69)6 \(II|440 \(II|725 \(II|(?:2(?:56j|85|98)|4(?:2(?:2b|9)|1[07]|4[38])|7(?:10e|02|22)|1(?:63|75|81)|3(?:43|12)|5(?:15|83)|6(?:0[0358]|1[34])|89[08]|502) \(I|(?:13|69)6 \(V|(?:(?:783|(?:(?:23|31)|44))|883) \(I)I?\)|(?:296|409|708|821) \(I(?:II?\)|\))|9(?:(?:57i|22) \(II?\)|9(?:8 \(I(?:II?\)|\))|9a|9|[0-7])?|5(?:7[ad-fh]|[0-689])|7(?:7[a-h]|[0-689])|57?|77?|1[0-9ab]|2[013-9]|6[013-9a]|[0348][0-9]|1|2|6|[0348])?|(?:255 \(I[IV]|(?:13|69)6 \(IV|440 \(IV|725 \(IV)\)|2(?:5(?:5 \(I\)|6(?:a[a-s]|[bce-ik-oq-y])|7[a-c]|[0-489])|1[0-57-9]|4[0-9a-c]|6[0-4689]|7[1-689]|8[0-46-8]|3[0-9]|9[0-579])|(?:13|69)6 \(I\)|440 \(I\)|7(?:2(?:5 \(I\)|[1346-9])|1(?:0(?:[ab][a-z]|[df-rt-z])|[1-57-9])|10c[a-i]|7(?:4(?:a[a-i]|[b-kw-z])|[0-35-8])|3(?:9[ac-g]|[0-8])|8(?:3[ab]|[124-69])|0[013-79])|1(?:0(?:1(?:9[a-j]|[0-8])|2[01346-9]|6[4-9a-f]|7[013-9])|1(?:0[0-7]|[1-9])|3(?:6b|[0-579])|6(?:2[bc]|[014-689])|2[0-25-9]|4[013-689]|5[235-9]|7[0-46-9a-c]|8[02-9])|(?:1(?:060|54|67)|277|35[58]|7(?:16|20|8[78]))[ab]|(?:1(?:0(?:22|63|72)|50)|265|3(?:4[12]|60)|4(?:24|35)|7(?:79|80))a|1(?:0(?:19?|2|6|7)?|10?|62?|2|3|4|5|7|8)?|2(?:5(?:6a?)?|1|4|6|7|8)?|7(?:1(?:0[ab]?)?|10c|7(?:4a?)?|39?|0|2|8)?|1(?:060|54|67)|1(?:0(?:22|63|72)|50)|105[013-9]|1(?:38|47)[a-e]|3(?:2(?:4[a-eg-l]|[0-35-9])|45[a-e]|8(?:4[abd-f]|6[a-e]|[0-357-9])|0[013-9]|3[0-79]|4[046-9a-k]|5[0-49]|6[1-9]|7[0-46-9]|1[013-79])|4(?:2(?:2[ac]|[0135-8])|0[0-8]|1[1-689]|3[0-46-9ab]|4[124-79])|108[02-9]|(?:1(?:42|51)|267)[a-d]|(?:270|3(?:38|56|75))[a-c]|79(?:0[ab]|[1-9])|8(?:3(?:6[a-fh-mq-z]|[0-57-9])|4(?:5[a-d]|8[a-c]|[0-4679ab])|5(?:0[ab]|[1-9])|7(?:5[a-l]|[0-46-9])|8(?:3[a-c]|[0-24-9])|0[02-8]|1[02-9]|2[02-9]|6[0-9a]|9[1-79a]|[a-h])|(?:1(?:0[0349]|9)|2[02]|39|4[5-9]|5[2-7]|6[2-68]|7[45])[0-9]|105|1(?:38|47)|3(?:24?|45|8(?:4|6)?|0|3|4|5|6|7)?|4(?:22?|0|1|3)?|108|1(?:42|51)|267|270|3(?:38|56|75)|277|35[58]|7(?:16|20|8[78])|790?|8(?:36?|4(?:5|8)?|50?|75?|0|1|2|6|9)?|1(?:0[0349]|9)|265|3(?:4[12]|60)|4(?:24|35)|7(?:79|80)|5(?:1[01346-9]|8[0-24-9]|9[0-9a-d])|6(?:0[124679]|1[0-25-9]|7[0-9c]|9[0-57-9])|76[0-9a]|50[013-9]|5(?:1|8)?|6(?:0|1|7|9)?|76|50|2[02]|39|4[5-9]|5[2-7]|6[2-68]|7[45])$

Using the regex101 analysis, I tried to work out whether grex follows some kind of grouping rule set, hoping to provide helpful information for this enhancement, since I don't really have the skills to get into the code right now. But I love regex. The pattern grex is trying to build is too wild for me to put into words.

The regex101 analysis indicates that grex has formed one large non-capturing group with 59 alternatives.

I did play around with -r, --min-repetitions, --min-substring-length, -d, and -w.

  • With -r the result gets much wilder (here), I believe because the lines are unique.
  • I think --min-repetitions and --min-substring-length don't do anything useful in this case.
  • -d and -w would help a lot with grouping, but in this case I want to be specific, especially with the Roman numeral (here).

If this comment does not provide important or purposeful information, let me know and I will delete it. I just wanted to share my experiments.

@pemistahl
Owner

Hi @scheidelerl, thank you for your valuable input, very much appreciated. :)

I know about the shortcomings of the current algorithm; that's why issues #48 and #122 already exist. I want to work on these, I'm just lacking free time at the moment. But the project is still active and I will continue working on it. I just can't tell you when.
