New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed bug with @data-testid and removed some classes #540
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #540 +/- ##
=======================================
Coverage 97.79% 97.79%
=======================================
Files 23 23
Lines 3444 3444
=======================================
Hits 3368 3368
Misses 76 76 ☔ View full report in Codecov by Sentry. |
Thanks, I'll have a look at it later. I have a few fixes in the works, once a few PRs are merged I'll make a new release. |
@felipehertzer I added a small script and updated the guidelines in tests/README.rst. If you want to have a look, it's how I test such cases. |
Hey @adbar, sorry I should have check it, this tag data-testid is new to me. I have a few questions:
Thanks. |
No worries! I'll work on the documentation but here is how I'd answer your questions:
|
Alright, I have a few Australians and I can help with some Brazillian Portuguese as well. I can have a look in the Github action, you may need setup a env variable, but I will test it first. Maybe we need to start adding real world tests when working on xpaths so we can prevent this from happening again |
@felipehertzer You'll see that I annotated metadata for some of the documents in the benchmark. Since you're interested in certain metadata, you could maybe add a small evaluation for it as well? |
I find a slightly better recall for |
@felipehertzer Are you still working on the PR? |
Hey @adbar, I changed the data-testid to be a little less specific. For the other task I will open a new pull request. Thanks. |
Thanks! There is something wrong with the tests but it's unrelated. I need to check it before merging. |
I ran a few tests, the overall result is slightly worse on my data but it's still an acceptable change. The lightbox rules clearly harms accuracy so I removed it. Does the current form work as expected on your data? The author XPath would be a case for |
Hey @adbar, Sorry, I totally missed your last comment. I'm fine with the 'lightbox' I can use the custom xpath on trafilatura. Yes, it can be 'AUTHOR_XPATHS', in that case was a card on the bottom of the article with the author info. Or maybe AuthorURL in that case: <span class="_1a6f5d6fad2874362309" data-testid="AuthorCard">
<span>
<strong>
<a class="d1dba0c3091a3c30ebd6" data-testid="AuthorURL" href="/by/p535y1">AUTHOR NAME</a></strong> is a senior reporter with The Australian....
</span>
</span> |
Hey @adbar,
I was looking the benchmark and I found some attributes that can be improved.
Also, I found a bug in my last pull request with "@data-testid", I found out Reuters is using it in the whole website and when I set
favor_precision=True
it is getting the wrong content.It will be great if you could release a minor release with that fix.