OVERALL_DISCARD_XPATH not discarding in some cases #510

Open
felipehertzer opened this issue Feb 19, 2024 · 1 comment

Labels
question Further information is requested

felipehertzer commented Feb 19, 2024

Hi @adbar,

I encountered this problem when trying to scrape this site.

The extracted output includes the sidebar along with the body text. The sidebar is a div with the class 'l-sidebar'.

While testing, I traced the problem to this call:

tree = prune_unwanted_nodes(tree, OVERALL_DISCARD_XPATH, with_backup=True)

If I set with_backup to False, the sidebar is discarded as expected.

It may be related to this TODO:

# todo: adjust for recall and precision settings
if new_len > old_len/7:
    return tree
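
To illustrate how that check could produce the behaviour described, here is a minimal sketch using only the standard library. It is not the library's implementation: `prune_with_backup`, `text_len`, the `class_name` parameter and the sample documents are hypothetical, and a single class match stands in for the OVERALL_DISCARD_XPATH expressions. The assumption is that the backup is restored whenever the pruned tree does not keep more than 1/7 of the original text, which would explain a long sidebar surviving next to a short body.

```python
import copy
import xml.etree.ElementTree as ET


def text_len(tree):
    """Length of all text contained in the tree."""
    return len("".join(tree.itertext()))


def prune_with_backup(tree, class_name, ratio=7):
    """Hypothetical stand-in for pruning with with_backup=True."""
    backup = copy.deepcopy(tree)
    old_len = text_len(tree)
    # ElementTree has no getparent(), so build a child -> parent map first.
    parents = {child: parent for parent in tree.iter() for child in parent}
    for node in tree.findall(f".//div[@class='{class_name}']"):
        parents[node].remove(node)
    new_len = text_len(tree)
    # Same comparison as the snippet quoted above (threshold of 7 assumed).
    if new_len > old_len / ratio:
        return tree
    # Too little text is left after pruning: fall back to the untouched copy.
    return backup


def make_doc(sidebar_chars, body_chars):
    """Build a tiny sample document with a sidebar and a body paragraph."""
    return ET.fromstring(
        "<html><div class='l-sidebar'>" + "s" * sidebar_chars + "</div>"
        "<p>" + "b" * body_chars + "</p></html>"
    )


# Long sidebar, short body: 10 is not greater than 80/7, so the backup wins.
kept = prune_with_backup(make_doc(70, 10), "l-sidebar")
# Short sidebar, long body: 70 is greater than 80/7, so the pruned tree wins.
pruned = prune_with_backup(make_doc(10, 70), "l-sidebar")

sidebar_kept = kept.find(".//div[@class='l-sidebar']") is not None
sidebar_pruned = pruned.find(".//div[@class='l-sidebar']") is None
```

In this sketch the sidebar is only discarded when the remaining text is long enough relative to the original, so a page whose discarded sections hold most of the text would come back unpruned.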

example 2

@adbar adbar added the question Further information is requested label Feb 19, 2024

adbar commented Feb 19, 2024

Hi @felipehertzer, as with the other heuristics, this one is tricky: the number 7 was chosen arbitrarily. Let's see if we can find a better approach or a different threshold.
