Confused about the actual use case #242

Open
emrakyz opened this issue May 11, 2024 · 0 comments

First of all, thanks a lot for this tool.

The idea looks really good and promising, but I couldn't work out what the actual use case is. The documentation lacks varied, interesting examples.

To test grex, I tried to reproduce some patterns I had already written by hand.

For example, I use the pattern below to extract DOIs from various inputs and PDFs:
sed -n -E 's/.*((DOI|doi)((\.(org))?\/?|:? *))([^: ]+[^ .]).*/doi:\6/p; q'

The actual pattern is this:
.*((DOI|doi)((\.(org))?\/?|:? *))([^: ]+[^ .]).*

The part I want to capture is the 6th capture group, which is the DOI itself: ([^: ]+[^ .])

As an example, it captures this part at the end: 10.36227/techrxiv.22659061.v1
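
For readers who want to check what the hand-written pattern does, here is a minimal Python sketch of the same extraction. The helper name `extract_doi` and the sample lines around the DOI are made up for illustration; only the pattern and the DOI come from this post. It assumes Python's `re` treats this particular pattern the same way `sed -E` does, which should hold for single-line input.

```python
import re

# Same pattern as in the sed command; group 6 is the DOI itself.
DOI_PATTERN = re.compile(r".*((DOI|doi)((\.(org))?\/?|:? *))([^: ]+[^ .]).*")


def extract_doi(line):
    """Return 'doi:<address>' for a line containing a DOI, else None (hypothetical helper)."""
    match = DOI_PATTERN.match(line)
    return f"doi:{match.group(6)}" if match else None


# Made-up sample lines; only the DOI is taken from the post.
samples = [
    "https://doi.org/10.36227/techrxiv.22659061.v1",
    "DOI: 10.36227/techrxiv.22659061.v1.",
    "doi:10.36227/techrxiv.22659061.v1 some trailing text",
    "no identifier here",
]

results = [extract_doi(s) for s in samples]
for sample, result in zip(samples, results):
    print(f"{sample!r} -> {result!r}")
```

The first three lines all yield `doi:10.36227/techrxiv.22659061.v1`, and the last one yields `None`. The trailing `[^ .]` is what drops a sentence-ending period.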

I put many valid cases into a test.txt file, one per line (DOIs in the different forms covered by the pattern above).

I ran grex -r -g -c --no-start-anchor -f "test.txt". I knew it couldn't give me a pattern similar to my original one, but the output differed far more than I expected. I got an extremely long regex that also captured unwanted parts (constants that are false positives), which would break the command in my actual use case. This is understandable, but it seems impossible to avoid without a nearly unlimited set of examples that differ from each other in everything except the actual constants.

I also tried to reproduce simpler patterns I had written before, using made-up examples in the test file. The pattern below is one of them:
^ *([0-9]+).*\s{2,}(.+)$

But grex always outputs a very long pattern that looks nothing like my actual one. The output is not "wrong" in a technical sense, but it is not usable in practice. It isn't even a suitable starting point for manual editing. This was the case in my tests regardless of sample size.
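
To make the intent of the hand-written pattern `^ *([0-9]+).*\s{2,}(.+)$` concrete, here is a small Python sketch. The helper name `split_line` and the input lines are made up for illustration and are not the contents of my test file.

```python
import re

# Leading number, then anything, then a gap of 2+ whitespace characters,
# then the trailing field that runs to the end of the line.
LINE_PATTERN = re.compile(r"^ *([0-9]+).*\s{2,}(.+)$")


def split_line(line):
    """Return (number, trailing_field) or None (hypothetical helper)."""
    match = LINE_PATTERN.match(line)
    return (match.group(1), match.group(2)) if match else None


# Made-up inputs for illustration.
first = split_line("  42 foo bar   baz qux")
second = split_line("7 item    value")
third = split_line("no number here")

print(first)   # ('42', 'baz qux')
print(second)  # ('7', 'value')
print(third)   # None
```

Because `.*` is greedy, the second group starts after the last run of two or more whitespace characters on the line.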

Even for very basic cases, the output is not usable because we can't be expressive enough. Without a way to express intent, the tool can only produce usable patterns from an almost unlimited set of examples covering all possibilities.
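
The kind of over-fitting I mean can be shown with a deliberately naive sketch. This is NOT grex's algorithm. The function `naive_pattern_from_examples` is a hypothetical stand-in that just joins the escaped examples into one alternation, and the DOIs are made up. It only illustrates why a pattern derived purely from examples, with no hint about which parts are variable, matches what it has seen and nothing else.

```python
import re


def naive_pattern_from_examples(examples):
    """Hypothetical stand-in (not grex): one anchored alternation of the literal examples."""
    alternatives = sorted((re.escape(e) for e in examples), key=len, reverse=True)
    return "^(?:" + "|".join(alternatives) + ")$"


# Made-up examples for illustration.
examples = ["doi:10.1000/abc", "DOI: 10.1000/xyz"]
pattern = naive_pattern_from_examples(examples)

# Every example the pattern was built from matches ...
seen_ok = all(re.search(pattern, e) is not None for e in examples)
# ... but a new DOI of the same shape does not.
unseen_ok = re.search(pattern, "doi:10.2000/new") is not None

print(pattern)
print(seen_ok, unseen_ok)  # True False
```

A real generator is much smarter about factoring out shared prefixes and repetitions, but it still has to guess which characters are constants unless it is told.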

The problem, as far as I understand, is that we can't tell grex which parts are constants, which are variables, which parts are wanted or unwanted, and which main pattern should be captured. Without that, I could not find a proper use case, though I really want to use this tool in a real scenario. How can we be more expressive, so that we can automate creating at least a base pattern to work on?

I also tried to build the final pattern segment by segment, but that failed in the same way.

Writing regex is fairly easy for small, simple tasks. What I initially expected from this tool was help with creating regexes for very complex patterns more efficiently and more correctly. Right now I feel like we have a very powerful and robust, but useless, tool. Is it just "experimental", or a kind of base that future tools will build on?

Could you please tell me what I am doing wrong? What is the best practice for solving an actual problem with grex? What kinds of problems is grex best suited for?

I probably misunderstood the tool or made a mistake regarding the intended use case.
