
spider fix: use internal download utilities for robots.txt #590

Merged
adbar merged 2 commits into master from fix_robots_download on May 8, 2024

Conversation

adbar (Owner) commented on May 8, 2024

This change makes the crawler more robust by using Trafilatura's own download function to fetch robots.txt instead of urllib.robotparser's default mechanism, which applies no timeout.
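The gist of the approach can be sketched as follows (a minimal sketch, not the exact patch: get_rules is a hypothetical helper name, and fetch_url is Trafilatura's public download function): fetch robots.txt with the package's downloader and feed the result to the standard parser.

```python
from urllib import robotparser

from trafilatura.downloads import fetch_url


def get_rules(base_url):
    """Fetch and parse robots.txt using Trafilatura's downloader (sketch)."""
    robots_url = base_url.rstrip("/") + "/robots.txt"
    # fetch_url uses Trafilatura's HTTP settings (including a timeout),
    # unlike RobotFileParser.read(), which relies on urllib without one
    robots_txt = fetch_url(robots_url)
    if robots_txt is None:
        return None
    rules = robotparser.RobotFileParser()
    rules.set_url(robots_url)
    rules.parse(robots_txt.splitlines())
    return rules


# hypothetical usage: skip a URL only if the rules explicitly disallow it
rules = get_rules("https://example.org")
if rules is None or rules.can_fetch("mybot", "https://example.org/page"):
    pass  # proceed with crawling
```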

adbar linked an issue on May 8, 2024 that may be closed by this pull request

codecov bot commented May 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.90%. Comparing base (efe38bb) to head (af0822c).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #590      +/-   ##
==========================================
+ Coverage   97.81%   97.90%   +0.09%     
==========================================
  Files          21       21              
  Lines        3437     3443       +6     
==========================================
+ Hits         3362     3371       +9     
+ Misses         75       72       -3     


adbar merged commit 92bdd6e into master on May 8, 2024
15 checks passed
adbar deleted the fix_robots_download branch on May 8, 2024 at 09:58
Development

Successfully merging this pull request may close these issues:

No timeout in urllib.robotparser with focused_crawler