Fixed bug with @data-testid and removed some classes #540

felipehertzer · 2024-04-03T13:31:53Z

I was looking the benchmark and I found some attributes that can be improved.

Also, I found a bug in my last pull request with "@data-testid", I found out Reuters is using it in the whole website and when I set favor_precision=True it is getting the wrong content.

It will be great if you could release a minor release with that fix.

class 'permission' remove unwanted content from The Independent
class 'lightbox' is usually means modals
class ' hide' remove content that contains the class hide in the end.

codecov · 2024-04-03T13:37:46Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.79%. Comparing base (f84e648) to head (a90a3c2).

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #540   +/-   ##
=======================================
  Coverage   97.79%   97.79%           
=======================================
  Files          23       23           
  Lines        3444     3444           
=======================================
  Hits         3368     3368           
  Misses         76       76

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

adbar · 2024-04-03T16:25:58Z

Thanks, I'll have a look at it later. I have a few fixes in the works, once a few PRs are merged I'll make a new release.

adbar · 2024-04-04T16:00:58Z

@felipehertzer I added a small script and updated the guidelines in tests/README.rst. If you want to have a look, it's how I test such cases.

felipehertzer · 2024-04-05T03:18:32Z

Hey @adbar, sorry I should have check it, this tag data-testid is new to me.

I have a few questions:

Can I add more sites to the check?
Can we have it run automatically on pull request and post a comment here?
I saw that you are testing Reuters, but the HTML is different, to the live version, and does not contain the new tag, do you have a script to download the HTML into the eval folder? or how can we stay updated and keep metrics correct?

Thanks.

adbar · 2024-04-05T10:21:26Z

No worries! I'll work on the documentation but here is how I'd answer your questions:

Yes you could add more websites if there are not already in the benchmark and not all in English (there are too many English-only benchmarks), there is interesting work on it in Added Coinbase article annotation #197 but the PR seems to be stale.
I'm not sure how to automatically run it but if you do you could add the evaluation to the CI/CD.
That's tricky, the pages change through time but it's also good to have old pages in the benchmark because people are also using the extractor on web archives. I'd say the new tag in the Reuters page is rather a case for tests/realworld_tests.py (you could add a new function and test problematic markup there).

felipehertzer · 2024-04-05T10:37:03Z

Alright, I have a few Australians and I can help with some Brazillian Portuguese as well.

I can have a look in the Github action, you may need setup a env variable, but I will test it first.

Maybe we need to start adding real world tests when working on xpaths so we can prevent this from happening again

adbar · 2024-04-05T10:49:21Z

@felipehertzer You'll see that I annotated metadata for some of the documents in the benchmark. Since you're interested in certain metadata, you could maybe add a small evaluation for it as well?

adbar · 2024-04-05T14:39:49Z

I find a slightly better recall for @data-testid alone instead of @data-testid="AuthorCard". The rest are either not visible or have negative effects.
Maybe the XPaths are not always the best option to address such cases or maybe there is a better combination to be found.

adbar · 2024-04-17T16:09:31Z

@felipehertzer Are you still working on the PR?

felipehertzer · 2024-04-22T05:02:22Z

Hey @adbar, I changed the data-testid to be a little less specific. For the other task I will open a new pull request. Thanks.

adbar · 2024-04-22T10:35:22Z

Thanks! There is something wrong with the tests but it's unrelated. I need to check it before merging.

adbar · 2024-04-23T10:55:23Z

I ran a few tests, the overall result is slightly worse on my data but it's still an acceptable change. The lightbox rules clearly harms accuracy so I removed it. Does the current form work as expected on your data?

The author XPath would be a case for AUTHOR_XPATHS or AUTHOR_DISCARD_XPATHS below, what do you think?

felipehertzer · 2024-05-03T07:59:30Z

Hey @adbar,

Sorry, I totally missed your last comment.

I'm fine with the 'lightbox' I can use the custom xpath on trafilatura.

Yes, it can be 'AUTHOR_XPATHS', in that case was a card on the bottom of the article with the author info. Or maybe AuthorURL in that case:

<span class="_1a6f5d6fad2874362309" data-testid="AuthorCard">
<span>
<strong>
<a class="d1dba0c3091a3c30ebd6" data-testid="AuthorURL" href="/by/p535y1">AUTHOR NAME</a></strong> is a senior reporter with The Australian....
</span>
</span>

Update xpaths.py

0055fef

Merge branch 'adbar:master' into patch-2

18ab957

Fixed data-testid

3ff7703

Merge branch 'master' into patch-2

440e0d8

adbar and others added 2 commits April 22, 2024 14:57

Merge branch 'master' into patch-2

87d29d1

test and amend rules

a90a3c2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixed bug with @data-testid and removed some classes #540

Fixed bug with @data-testid and removed some classes #540

felipehertzer commented Apr 3, 2024

codecov bot commented Apr 3, 2024 •

edited

adbar commented Apr 3, 2024

adbar commented Apr 4, 2024

felipehertzer commented Apr 5, 2024

adbar commented Apr 5, 2024

felipehertzer commented Apr 5, 2024

adbar commented Apr 5, 2024

adbar commented Apr 5, 2024

adbar commented Apr 17, 2024

felipehertzer commented Apr 22, 2024

adbar commented Apr 22, 2024

adbar commented Apr 23, 2024

felipehertzer commented May 3, 2024

Fixed bug with @data-testid and removed some classes #540

Are you sure you want to change the base?

Fixed bug with @data-testid and removed some classes #540

Conversation

felipehertzer commented Apr 3, 2024

codecov bot commented Apr 3, 2024 • edited

Codecov Report

adbar commented Apr 3, 2024

adbar commented Apr 4, 2024

felipehertzer commented Apr 5, 2024

adbar commented Apr 5, 2024

felipehertzer commented Apr 5, 2024

adbar commented Apr 5, 2024

adbar commented Apr 5, 2024

adbar commented Apr 17, 2024

felipehertzer commented Apr 22, 2024

adbar commented Apr 22, 2024

adbar commented Apr 23, 2024

felipehertzer commented May 3, 2024

codecov bot commented Apr 3, 2024 •

edited