Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed bug with @data-testid and removed some classes #540

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

felipehertzer
Copy link
Contributor

Hey @adbar,

I was looking the benchmark and I found some attributes that can be improved.

Also, I found a bug in my last pull request with "@data-testid", I found out Reuters is using it in the whole website and when I set favor_precision=True it is getting the wrong content.

It will be great if you could release a minor release with that fix.

  • class 'permission' remove unwanted content from The Independent
  • class 'lightbox' is usually means modals
  • class ' hide' remove content that contains the class hide in the end.

Copy link

codecov bot commented Apr 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.79%. Comparing base (f84e648) to head (a90a3c2).

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #540   +/-   ##
=======================================
  Coverage   97.79%   97.79%           
=======================================
  Files          23       23           
  Lines        3444     3444           
=======================================
  Hits         3368     3368           
  Misses         76       76           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@adbar
Copy link
Owner

adbar commented Apr 3, 2024

Thanks, I'll have a look at it later. I have a few fixes in the works, once a few PRs are merged I'll make a new release.

@adbar
Copy link
Owner

adbar commented Apr 4, 2024

@felipehertzer I added a small script and updated the guidelines in tests/README.rst. If you want to have a look, it's how I test such cases.

@felipehertzer
Copy link
Contributor Author

Hey @adbar, sorry I should have check it, this tag data-testid is new to me.

I have a few questions:

  • Can I add more sites to the check?
  • Can we have it run automatically on pull request and post a comment here?
  • I saw that you are testing Reuters, but the HTML is different, to the live version, and does not contain the new tag, do you have a script to download the HTML into the eval folder? or how can we stay updated and keep metrics correct?

Thanks.

@adbar
Copy link
Owner

adbar commented Apr 5, 2024

No worries! I'll work on the documentation but here is how I'd answer your questions:

  • Yes you could add more websites if there are not already in the benchmark and not all in English (there are too many English-only benchmarks), there is interesting work on it in Added Coinbase article annotation #197 but the PR seems to be stale.
  • I'm not sure how to automatically run it but if you do you could add the evaluation to the CI/CD.
  • That's tricky, the pages change through time but it's also good to have old pages in the benchmark because people are also using the extractor on web archives. I'd say the new tag in the Reuters page is rather a case for tests/realworld_tests.py (you could add a new function and test problematic markup there).

@felipehertzer
Copy link
Contributor Author

Alright, I have a few Australians and I can help with some Brazillian Portuguese as well.

I can have a look in the Github action, you may need setup a env variable, but I will test it first.

Maybe we need to start adding real world tests when working on xpaths so we can prevent this from happening again

@adbar
Copy link
Owner

adbar commented Apr 5, 2024

@felipehertzer You'll see that I annotated metadata for some of the documents in the benchmark. Since you're interested in certain metadata, you could maybe add a small evaluation for it as well?

@adbar
Copy link
Owner

adbar commented Apr 5, 2024

I find a slightly better recall for @data-testid alone instead of @data-testid="AuthorCard". The rest are either not visible or have negative effects.
Maybe the XPaths are not always the best option to address such cases or maybe there is a better combination to be found.

@adbar
Copy link
Owner

adbar commented Apr 17, 2024

@felipehertzer Are you still working on the PR?

@felipehertzer
Copy link
Contributor Author

Hey @adbar, I changed the data-testid to be a little less specific. For the other task I will open a new pull request. Thanks.

@adbar
Copy link
Owner

adbar commented Apr 22, 2024

Thanks! There is something wrong with the tests but it's unrelated. I need to check it before merging.

@adbar
Copy link
Owner

adbar commented Apr 23, 2024

I ran a few tests, the overall result is slightly worse on my data but it's still an acceptable change. The lightbox rules clearly harms accuracy so I removed it. Does the current form work as expected on your data?

The author XPath would be a case for AUTHOR_XPATHS or AUTHOR_DISCARD_XPATHS below, what do you think?

@felipehertzer
Copy link
Contributor Author

Hey @adbar,

Sorry, I totally missed your last comment.

I'm fine with the 'lightbox' I can use the custom xpath on trafilatura.

Yes, it can be 'AUTHOR_XPATHS', in that case was a card on the bottom of the article with the author info. Or maybe AuthorURL in that case:

<span class="_1a6f5d6fad2874362309" data-testid="AuthorCard">
<span>
<strong>
<a class="d1dba0c3091a3c30ebd6" data-testid="AuthorURL" href="/by/p535y1">AUTHOR NAME</a></strong> is a senior reporter with The Australian....
</span>
</span>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants