Fixed lists inside tables when include_tables=True #534

mikhainin · 2024-04-02T12:53:16Z

A possible fix for #531

adbar · 2024-04-02T14:53:26Z

@mikhainin It turns out your fix doesn't work well for nested elements in tables (see tests). Could you please have a look at it and see if you find a solution?

mikhainin · 2024-04-02T17:20:15Z

@adbar, from what I can see, this test case is failing:

htmlstring = html.fromstring(
    """<html>
              <body><article>
                <table>
                  <tbody>
                    <tr>
                      <td>
                        <small>text<br></small>
                        <h4>more_text</h4>
                      </td>
                      <td><a href='link'>linktext</a></td>
                    </tr>
                  </tbody>
                </table>
              </article></body>
            </html>"""
)
processed = extract(
    htmlstring, no_fallback=True, output_format='xml', config=DEFAULT_CONFIG, include_links=True
)
result = processed.replace('\n', '').replace(' ', '')
assert """<table><row><cell>text<head>more_text</head></cell></row></table>""" in result

New version (the current change) produces:

<doccategories=""tags=""fingerprint="576b5da16c181e08"><main>
<table>
    <row>
        <cell>text
            <head>more_text</head>
        </cell>
        <cell>
            <p>linktext</p>
        </cell>
    </row>
</table>
</main><comments/></doc>

The current master (expected behaviour I suppose):

<doccategories=""tags=""fingerprint="d7ffdfa76c785e1c"><main>
<table>
    <row>
        <cell>text
            <head>more_text</head>
        </cell>
    </row>
</table>
</main><comments/></doc>

Could you explain why <p>linktext</p> should not exist in the output, please?

adbar · 2024-04-02T18:13:08Z

Yes, you're perfectly right, we need to change this test then. Do you want to do it? By the way, you could also add a test for the particular problem you're solving.

Your PR slightly decreases precision on the benchmark, it could have something to do with undesirable content getting added by .itertext(), maybe we need to filter the nested elements more.

mikhainin · 2024-04-03T08:44:26Z

> Your PR slightly decreases precision on the benchmark, it could have something to do with undesirable content getting added by .itertext(), maybe we need to filter the nested elements more.

I can give it a try, but I would need some guidance: I've been familiar with this library for less than a week :)

adbar · 2024-04-03T16:27:29Z

Of course, I'll look for potential ways to fix it and give you the necessary info.

adbar · 2024-04-03T17:03:49Z

The link in the test sample shouldn't actually be in the output because it's not required in the options, and links in tables are often superfluous content. So the tests are working correctly for the setting they're written for.
.itertext() includes element tails; setting with_tail=False makes things marginally better but doesn't solve the main problem.
The best solution is probably to iterate through the element's children (checking if len(child) > 0 and then using child.iterdescendants()). You can see how that's done in the handle_lists() function above, and/or use that function directly if child.tag == "list".

The code is convoluted and badly needs to be simplified. However, HTML documents on the web are not as regular as they should be, which is why a lot of safeguards have been implemented at different levels over time.
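
The point about tails and walking children can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra packages, and the sample markup and the helper name texts_from_children are hypothetical, not taken from Trafilatura. ElementTree has no with_tail argument and no iterdescendants(), so the sketch collects .text directly and uses iter() while skipping the element itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical table cell: a nested list plus stray tail text.
cell = ET.fromstring(
    "<cell>intro<list><item>first</item>tail A"
    "<item>second</item></list>tail B</cell>"
)

# itertext() also yields the tails of descendants ("tail A", "tail B").
with_tails = list(cell.itertext())

# Collecting only .text skips every tail.
without_tails = [el.text for el in cell.iter() if el.text]


def texts_from_children(parent):
    """Walk direct children; descend only into those that have children."""
    collected = []
    for child in parent:
        if len(child) > 0:
            # Nested structure (e.g. a list): visit each descendant explicitly.
            for desc in child.iter():
                if desc is not child and desc.text:
                    collected.append((desc.tag, desc.text))
        elif child.text:
            collected.append((child.tag, child.text))
    return collected


print(with_tails)
print(without_tails)
print(texts_from_children(cell))
```

Walking the children explicitly makes it possible to decide per tag what to keep, which a flat text iterator does not allow.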

mikhainin · 2024-04-05T11:28:32Z

Thanks - that's helpful!

I updated the implementation; it now passes our tests and Trafilatura's. Could you take a look once again, please?

codecov · 2024-04-05T11:42:02Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.49%. Comparing base (54ad86c) to head (7d9c440).

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #534      +/-   ##
==========================================
- Coverage   97.58%   97.49%   -0.09%     
==========================================
  Files          23       23              
  Lines        3389     3394       +5     
==========================================
+ Hits         3307     3309       +2     
- Misses         82       85       +3


adbar · 2024-04-05T11:50:05Z

The walrus operator is not available before Python 3.8 but that's a minor issue.
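
For illustration, an assignment expression can usually be replaced by a plain assignment on the preceding line. The function below is a hypothetical example, not code from this PR; the walrus form is shown only in a comment so the snippet also parses on older interpreters.

```python
def first_long_word(words, min_length=5):
    """Return (word, length) for the first word that is long enough, else None."""
    for word in words:
        # Python 3.8+ form: if (size := len(word)) >= min_length:
        size = len(word)
        if size >= min_length:
            return word, size
    return None


print(first_long_word(["a", "table", "list"]))
```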

mikhainin · 2024-04-05T12:26:00Z

Removed the operator

adbar · 2024-04-05T14:28:25Z

The tests pass but on the benchmark the precision is lower. It could be that tables with too many nested elements contain undesirable text.

mikhainin · 2024-04-05T16:25:32Z

I rebased on the latest master and ran python tests/comparison_small.py
This branch:

nothing
{'true positives': 0, 'false positives': 0, 'true negatives': 2250, 'false negatives': 2236, 'time': 0}
baseline
{'true positives': 1886, 'false positives': 610, 'true negatives': 1640, 'false negatives': 350, 'time': 8.721751689910889}
precision: 0.756 recall: 0.843 accuracy: 0.786 f-score: 0.797
trafilatura
{'true positives': 1991, 'false positives': 183, 'true negatives': 2067, 'false negatives': 245, 'time': 39.170817136764526}
time diff.: 4.49
precision: 0.916 recall: 0.890 accuracy: 0.905 f-score: 0.903
trafilatura + fallback
{'true positives': 2027, 'false positives': 182, 'true negatives': 2068, 'false negatives': 209, 'time': 57.875588178634644}
time diff.: 6.64
precision: 0.918 recall: 0.907 accuracy: 0.913 f-score: 0.912

Master:

$ python tests/comparison_small.py
number of documents: 750
nothing
{'true positives': 0, 'false positives': 0, 'true negatives': 2250, 'false negatives': 2236, 'time': 0}
baseline
{'true positives': 1886, 'false positives': 610, 'true negatives': 1640, 'false negatives': 350, 'time': 4.424571752548218}
precision: 0.756 recall: 0.843 accuracy: 0.786 f-score: 0.797
trafilatura
{'true positives': 1991, 'false positives': 183, 'true negatives': 2067, 'false negatives': 245, 'time': 18.823368310928345}
time diff.: 4.25
precision: 0.916 recall: 0.890 accuracy: 0.905 f-score: 0.903
trafilatura + fallback
{'true positives': 2027, 'false positives': 182, 'true negatives': 2068, 'false negatives': 209, 'time': 26.65433382987976}
time diff.: 6.02
precision: 0.918 recall: 0.907 accuracy: 0.913 f-score: 0.912
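
As a cross-check, the metric lines can be recomputed from the confusion counts. The sketch below assumes the standard definitions of precision, recall, accuracy and F1 (the benchmark script itself is not shown here, so this is an assumption); the function name scores is hypothetical. The counts are those of the "trafilatura" run, which are identical on the branch and on master.

```python
def scores(tp, fp, tn, fn):
    """Compute precision, recall, accuracy and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f_score


# Counts copied from the "trafilatura" line of the benchmark output.
p, r, a, f = scores(tp=1991, fp=183, tn=2067, fn=245)
summary = f"precision: {p:.3f} recall: {r:.3f} accuracy: {a:.3f} f-score: {f:.3f}"
print(summary)
```

Since the counts are the same in both runs, identical scores are expected, which is consistent with the two outputs differing only in timing.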

Could you tell me the right way to run the benchmark, please?

adbar · 2024-04-05T16:52:14Z

I still see a small difference, you need to re-install the package from the branch to see it:
pip3 install --no-deps -U . ; python3 tests/comparison_small.py

It's not a big deal, in the worst case your PR would be used when favor_recall is activated.

mikhainin · 2024-04-05T19:29:12Z

Yeah, in kath.net-Menschensohn.html the master version only includes the "Mehr zu" header, while the current change includes all the links.

adbar · 2024-04-08T16:00:00Z

I would then change the condition to elif child.tag == "list" and favor_recall.
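
A minimal sketch of how such a guarded condition behaves, assuming a simplified loop over a cell's children. The function kept_child_tags, the "head" branch and the sample markup are hypothetical and only stand in for the surrounding logic; this is not Trafilatura's actual implementation, and xml.etree.ElementTree is used in place of lxml to keep the snippet dependency-free.

```python
import xml.etree.ElementTree as ET


def kept_child_tags(cell, favor_recall=False):
    """Return the tags of the cell's children that would be processed."""
    kept = []
    for child in cell:
        if child.tag == "head":
            # Placeholder for whatever handling already exists for other tags.
            kept.append(child.tag)
        elif child.tag == "list" and favor_recall:
            # Nested lists are only handled when recall is preferred.
            kept.append(child.tag)
    return kept


cell = ET.fromstring("<cell><head>title</head><list><item>a</item></list></cell>")
print(kept_child_tags(cell))
print(kept_child_tags(cell, favor_recall=True))
```

With the default settings the list is skipped, so benchmark precision stays as on master; the extra content only appears when the caller opts into favor_recall.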

alroythalus · 2024-05-06T08:27:18Z

web_content = "".join(
    extract(
        web_content,
        include_formatting=True,
        include_tables=True,
        include_comments=False,
        include_links=False,
        output_format="xml",
        favor_recall=True,
        config=config,
    )
)

How does the code preserve the indentation of the list items?
Example site: https://www.spotify.com/in-en/legal/privacy-policy/
@adbar @mikhainin

mikhainin mentioned this pull request Apr 2, 2024

List element inside a table is lost #531

Open

Mikhail Galanin added 3 commits April 5, 2024 17:03

Fixed lists inside tables when include_tables=True

daa8773

Better implementation

791c55e

This operator does not exist in Python 3.8

954897d

mikhainin force-pushed the fix-lists-in-tables branch from 46a986e to 954897d Compare April 5, 2024 16:21

adbar and others added 3 commits April 11, 2024 13:27

use only if recall is preferred

f5b6045

Merge branch 'master' into fix-lists-in-tables

8a65919

update and fix variable name

7d9c440

adbar merged commit 5ca01a8 into adbar:master Apr 11, 2024
15 of 16 checks passed
