Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed lists inside tables when include_tables=True #534

Merged
merged 6 commits into from
Apr 11, 2024

Conversation

mikhainin
Copy link
Contributor

A possible fix for #531

@adbar
Copy link
Owner

adbar commented Apr 2, 2024

@mikhainin It turns out your fix doesn't work well for nested elements in tables (see tests). Could you please have a look at it and see if you find a solution?

@mikhainin
Copy link
Contributor Author

mikhainin commented Apr 2, 2024

@adbar, From what I can see, this test-case is failing:

htmlstring = html.fromstring(
        """<html>
              <body><article>
                <table>
                  <tbody>
                    <tr>
                      <td>
                        <small>text<br></small>
                        <h4>more_text</h4>
                      </td>
                      <td><a href='link'>linktext</a></td>
                    </tr>
                  </tbody>
                </table>
              </article></body>
            </html>"""
    )
    processed = extract(
        htmlstring, no_fallback=True, output_format='xml', config=DEFAULT_CONFIG, include_links=True
    )
    result = processed.replace('\n', '').replace(' ', '')
    assert """<table><row><cell>text<head>more_text</head></cell></row></table>""" in result

New version (the current change) produces:

<doccategories=""tags=""fingerprint="576b5da16c181e08"><main>
<table>
    <row>
        <cell>text
            <head>more_text</head>
        </cell>
        <cell>
            <p>linktext</p>
        </cell>
    </row>
</table>
</main><comments/></doc>

The current master (expected behaviour I suppose):

<doccategories=""tags=""fingerprint="d7ffdfa76c785e1c"><main>
<table>
    <row>
        <cell>text
            <head>more_text</head>
        </cell>
    </row>
</table>
</main><comments/></doc>

Could you explain why <p>linktext</p> should not exist in the output, please?

@adbar
Copy link
Owner

adbar commented Apr 2, 2024

Yes, you're perfectly right, we need to change this test then. Do you want to do it? By the way, you could also add a test for the particular problem you're solving.

Your PR slightly decreases precision on the benchmark, it could have something to do with undesirable content getting added by .itertext(), maybe we need to filter the nested elements more.

@mikhainin
Copy link
Contributor Author

Your PR slightly decreases precision on the benchmark, it could have something to do with undesirable content getting added by .itertext(), maybe we need to filter the nested elements more.

I can give a try with it but I would need some guidance: I'm familiar with this library for less than a week :)

@adbar
Copy link
Owner

adbar commented Apr 3, 2024

Of course, I'll look for potential ways to fix it and give you the necessary info.

@adbar
Copy link
Owner

adbar commented Apr 3, 2024

  1. The link in the test sample shouldn't actually be in the output because it's not required in the options and links in tables are often superfluous content. So the tests are working correctly for the setting they're for.
  2. .itertext() contains element tails, setting with_tail=False makes things marginally better but doesn't solve the main problem
  3. The best solution is probably to iterate through the element's children (if len(child) >0 and then child.iterdescendants()). You can see how that's done in the handle_lists() function above and/or directly use this function if child.tag == "list".

The code is convoluted and badly needs to be simplified, however HTML documents are not as regular as they should on the web so that's why a lot of safeguards have been implemented at different levels in the course of time.

@mikhainin
Copy link
Contributor Author

Thanks - that's helpful!

I updated the implementation it pass our tests and Trafilatura's. Could you take a look once again, please?

Copy link

codecov bot commented Apr 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.49%. Comparing base (54ad86c) to head (7d9c440).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #534      +/-   ##
==========================================
- Coverage   97.58%   97.49%   -0.09%     
==========================================
  Files          23       23              
  Lines        3389     3394       +5     
==========================================
+ Hits         3307     3309       +2     
- Misses         82       85       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@adbar
Copy link
Owner

adbar commented Apr 5, 2024

The walrus operator is not available before Python 3.8 but that's a minor issue.

@mikhainin
Copy link
Contributor Author

Removed the operator

@adbar
Copy link
Owner

adbar commented Apr 5, 2024

The tests pass but on the benchmark the precision is lower. It could be that tables with too many nested elements contain undesirable text.

@mikhainin
Copy link
Contributor Author

I rebased on the latest master and run python tests/comparison_small.py
This branch:

nothing
{'true positives': 0, 'false positives': 0, 'true negatives': 2250, 'false negatives': 2236, 'time': 0}
baseline
{'true positives': 1886, 'false positives': 610, 'true negatives': 1640, 'false negatives': 350, 'time': 8.721751689910889}
precision: 0.756 recall: 0.843 accuracy: 0.786 f-score: 0.797
trafilatura
{'true positives': 1991, 'false positives': 183, 'true negatives': 2067, 'false negatives': 245, 'time': 39.170817136764526}
time diff.: 4.49
precision: 0.916 recall: 0.890 accuracy: 0.905 f-score: 0.903
trafilatura + fallback
{'true positives': 2027, 'false positives': 182, 'true negatives': 2068, 'false negatives': 209, 'time': 57.875588178634644}
time diff.: 6.64
precision: 0.918 recall: 0.907 accuracy: 0.913 f-score: 0.912

Master:

$ python tests/comparison_small.py
number of documents: 750
nothing
{'true positives': 0, 'false positives': 0, 'true negatives': 2250, 'false negatives': 2236, 'time': 0}
baseline
{'true positives': 1886, 'false positives': 610, 'true negatives': 1640, 'false negatives': 350, 'time': 4.424571752548218}
precision: 0.756 recall: 0.843 accuracy: 0.786 f-score: 0.797
trafilatura
{'true positives': 1991, 'false positives': 183, 'true negatives': 2067, 'false negatives': 245, 'time': 18.823368310928345}
time diff.: 4.25
precision: 0.916 recall: 0.890 accuracy: 0.905 f-score: 0.903
trafilatura + fallback
{'true positives': 2027, 'false positives': 182, 'true negatives': 2068, 'false negatives': 209, 'time': 26.65433382987976}
time diff.: 6.02
precision: 0.918 recall: 0.907 accuracy: 0.913 f-score: 0.912

If you tell me the right way to run the benchmark, please?

@adbar
Copy link
Owner

adbar commented Apr 5, 2024

I still see a small difference, you need to re-install the package from the branch to see it:
pip3 install --no-deps -U . ; python3 tests/comparison_small.py

It's not a big deal, in the worst case your PR would be used when favor_recall is activated.

@mikhainin
Copy link
Contributor Author

Yeah in kath.net-Menschensohn.html, master version only includes "Mehr zu" header and the current change includes all the links

@adbar
Copy link
Owner

adbar commented Apr 8, 2024

I would then change the condition to elif child.tag == "list" and favor_recall.

@adbar adbar merged commit 5ca01a8 into adbar:master Apr 11, 2024
15 of 16 checks passed
@alroythalus
Copy link

web_content = "".join(
    extract(
        web_content,
        include_formatting=True,
        include_tables=True,
        include_comments=False,
        include_links=False,
        output_format="xml",
        favor_recall=True,
        config=config,
    )
) 

hows the code preserve the indentation of the list items?
Eg site: https://www.spotify.com/in-en/legal/privacy-policy/
@adbar @mikhainin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants