Feature request: Extension of regex filtering to extract data #804

Open
f0sh opened this issue Mar 7, 2024 · 7 comments · May be fixed by #805

Comments

@f0sh

f0sh commented Mar 7, 2024

I couldn't find a way to use re.sub as a filter to extract data in urlwatch. Unless I have overlooked something and there is already a way to do it, I would like to open a feature request.

Related to #401, I would like to request an extension of the regex filtering. Instead of replacing matched strings, it would be useful to have a positive regex filter that extracts specific data. The documentation even contains an example of this kind of filtering, but it is implemented with shellpipe and grep, which turned out to be a bit unstable for me.

url: https://example.net/shellpipe-grep.txt
filter:
  - shellpipe: "grep -i -o 'price: <span>.*</span>'"

IMO a better approach would be a filter like re.findall, which would sit alongside re.sub:

url: https://example.net/pricelist.html
filter:
  - re.findall: 'price: <span>(.*)</span>'
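
For reference, here is a minimal sketch of what Python's own re.findall returns for a pattern like the one above. This is plain Python, not urlwatch code. The sample page content is invented for illustration, and the non-greedy `.*?` is my assumption, since a greedy `.*` would run across several spans that share a line.

```python
import re

# Hypothetical page content, invented for this illustration.
html = """<li>price: <span>2.39</span></li>
<li>price: <span>17.00</span></li>"""

# With exactly one capture group, re.findall returns the group contents
# of every non-overlapping match, in order of appearance.
matches = re.findall(r"price: <span>(.*?)</span>", html)
print(matches)  # ['2.39', '17.00']
```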
@Jamstah
Contributor

Jamstah commented Mar 7, 2024

grep and grepi can be used directly, so you could do something like:

filter:
  - grep: 'price: <span>.*</span>'  # grep keeps matching lines; grepi would remove them
  - re.sub:
      pattern: '^.*(price: <span>.*</span>).*$'
      repl: '\1'
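
A rough plain-Python sketch of the two steps in that workaround: keep only the matching lines, then reduce each line to the captured part with re.sub. The helper names are hypothetical and this only approximates what the urlwatch filters do. The `(?m)` flag is an assumption of this sketch so that `^` and `$` anchor on every line. Without it, the pattern can only match when the input is a single line.

```python
import re


def keep_matching_lines(text: str, pattern: str) -> str:
    """Keep only the lines that contain a match (a grep-like step)."""
    return "\n".join(line for line in text.splitlines() if re.search(pattern, line))


def extract_with_sub(text: str) -> str:
    """Reduce each remaining line to the captured part using re.sub."""
    # (?m) lets ^ and $ anchor at each line; this is an assumption of the sketch.
    return re.sub(r"(?m)^.*(price: <span>.*</span>).*$", r"\1", text)


# Hypothetical input, invented for this illustration.
page = "header\n<p>Sale! price: <span>2.39</span> today</p>\nfooter"
result = extract_with_sub(keep_matching_lines(page, r"price: <span>.*</span>"))
print(result)  # price: <span>2.39</span>
```

Note that both steps work line by line, so the wanted data and everything around it have to sit on one line.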

@Jamstah
Contributor

Jamstah commented Mar 7, 2024

findall might be easier, though. What are you thinking for the output: just put each match on a new line?

@f0sh
Author

f0sh commented Mar 7, 2024

> findall might be easier, though. What are you thinking for the output: just put each match on a new line?

Yes, that's what I was thinking too.

I haven't checked the source yet to see how it is implemented, but it felt like it could be integrated easily. Maybe the similar naming of re.sub in urlwatch and in Python's re package fooled me.

@Jamstah
Contributor

Jamstah commented Mar 7, 2024

Yes, it's not hard to add. It takes a little more than re.sub because you have to do something with the matches, whereas re.sub just gives you the string to return.

https://github.com/thp/urlwatch/blob/master/lib/urlwatch/filters.py#L831
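
To illustrate the difference in plain Python (not the urlwatch source): re.sub returns a string that can be passed on directly, while re.findall returns a list whose items are strings or tuples depending on the number of groups, so a filter has to decide how to turn that back into text. The sample text is invented for illustration.

```python
import re

# Hypothetical input, invented for this illustration.
text = "apple: 1.20 EUR, pear: 0.80 EUR"

# re.sub always hands back a single string.
replaced = re.sub(r"\d+\.\d+", "X", text)

# re.findall hands back a list; with several groups, each item is a tuple.
one_group = re.findall(r"(\d+\.\d+) EUR", text)
two_groups = re.findall(r"(\w+): (\d+\.\d+)", text)

# A filter has to turn the list back into text, e.g. one match per line.
as_text = "\n".join(one_group)

print(replaced)    # apple: X EUR, pear: X EUR
print(one_group)   # ['1.20', '0.80']
print(two_groups)  # [('apple', '1.20'), ('pear', '0.80')]
```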

Jamstah linked a pull request Mar 9, 2024 that will close this issue
Jamstah added a commit to Jamstah/urlwatch that referenced this issue Mar 9, 2024
Actually using re.finditer so we can apply a repl to the result. This
allows users to pick out matches and reformat them in one step.

Fixes thp#804

Signed-off-by: James Hewitt <james.hewitt@uk.ibm.com>
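
The idea in that commit message can be sketched in plain Python as follows. This is not the code from #805. The function name, its parameters, and the sample input are hypothetical, and it assumes the agreed output of one match per line. It iterates with re.finditer and formats each match with Match.expand, which accepts the same backreference syntax as the repl of re.sub.

```python
import re
from typing import Optional


def find_and_format(text: str, pattern: str, repl: Optional[str] = None) -> str:
    """Return every match on its own line, optionally reformatted via repl."""
    lines = []
    for match in re.finditer(pattern, text):
        if repl is None:
            # No template given: emit the whole matched text.
            lines.append(match.group(0))
        else:
            # Expand backreferences such as \1 against this match.
            lines.append(match.expand(repl))
    return "\n".join(lines)


# Hypothetical input, invented for this illustration.
page = "<li>Tea price: <span>2.39</span></li><li>Jam price: <span>4.10</span></li>"
formatted = find_and_format(page, r"(\w+) price: <span>(.*?)</span>", r"\1 = \2")
print(formatted)
# Tea = 2.39
# Jam = 4.10
```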
@Jamstah
Contributor

Jamstah commented Mar 9, 2024

I actually have a couple of places where this would simplify my filters, so I have put up an implementation in #805. See what you think.

Jamstah added a commit to Jamstah/urlwatch that referenced this issue Mar 10, 2024
@thp
Owner

thp commented Mar 12, 2024

For filtering out HTML elements, use the CSS or XPath filters. Never use regex.

@f0sh
Author

f0sh commented Mar 12, 2024

> For filtering out HTML elements, use the CSS or XPath filters. Never use regex.

That was not my intention here. It's more about extracting specific data. I only reused the example from the urlwatch docs to make my problem clearer. The actual scenario is that you have a p element and want to extract some data from it, e.g. "[...] and therefore for blablabla we set the price of 2.39€. [...]". The idea is to grab only the value 2.39€ without the surrounding text.

With re.sub you always have to build a regex that matches the whole paragraph, which is error-prone, and the grep solution only works on full lines.
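
A small plain-Python sketch of why the replacement approach is fragile, using the sentence above with invented surrounding text. The patterns are my assumptions, not something urlwatch ships. The re.sub pattern has to consume the entire paragraph, so it silently stops matching as soon as a line break appears, while an extraction pattern only describes the wanted fragment.

```python
import re

# Hypothetical paragraph text, built around the example sentence.
paragraph = "Lorem ipsum and therefore for blablabla we set the price of 2.39€. More text follows."

# Replacement approach: the pattern must consume the entire paragraph.
whole_pattern = r"^.*price of (\d+\.\d+€).*$"
via_sub = re.sub(whole_pattern, r"\1", paragraph)

# Extraction approach: the pattern only describes the wanted fragment.
via_findall = re.findall(r"\d+\.\d+€", paragraph)

# If the page later gains a line break, the replacement pattern no longer
# matches and re.sub returns the text unchanged; extraction still works.
changed = paragraph.replace(". More", ".\nMore")
sub_after_change = re.sub(whole_pattern, r"\1", changed)
findall_after_change = re.findall(r"\d+\.\d+€", changed)

print(via_sub)                      # 2.39€
print(via_findall)                  # ['2.39€']
print(sub_after_change == changed)  # True
print(findall_after_change)         # ['2.39€']
```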

I was actually inspired by changedetection.io, which I tried recently because of its GUI and which has this nice data extraction feature. However, its scripting is much more troublesome, so I would like to stick with urlwatch.

OT: It's a bit frustrating that open source projects so often reinvent the wheel instead of joining forces. It would have been amazing if changedetection.io had used urlwatch under the hood to build a more powerful solution.

Jamstah added a commit to Jamstah/urlwatch that referenced this issue Mar 12, 2024
Jamstah added a commit to Jamstah/urlwatch that referenced this issue May 5, 2024