List of smaller extraction bugs (text & metadata) #4

adbar · 2020-01-09T11:37:31Z

I have mostly tested trafilatura on a set of English, German and French web pages I had run into by surfing or during web crawls. There are definitely further web pages and cases in other languages for which the extraction doesn't work so far.

Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in xpaths.py (see BODY_XPATH and COMMENTS_XPATH lists).

Thanks!


adbar · 2022-01-04T14:02:06Z

Extraction bugs in text and metadata can be listed here as in adbar/htmldate#8 where issues specifically related to dates should be reported.

For details see below.

cheezman34 · 2022-02-11T19:51:58Z

Words are getting smashed together on this page:

https://research.checkpoint.com/2021/a-deep-dive-into-doublefeature-equation-groups-post-exploitation-dashboard/

I looked into the extraction code here a bit. The date here is inside a span, which gets stripped, and then the date becomes the tail of the header. All of the whitespace (which includes a newline) gets lost, and then the tail is just directly appended to the header. I'm not sure if the best strategy to fix would be to include a space between the tail and text of the header node when they get extracted, or maybe to look for newlines in the text and somehow respect them. It looks like some of the lxml stuff just strips whitespace automatically when you access "text" and "tail" attributes.

I didn't dig into this one, but I'm guessing it's something similar to the first case. The webpage relies on whitespace that gets stripped by the extraction algorithm.
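
The effect described above can be sketched with the standard library's xml.etree.ElementTree, whose text and tail attributes behave like lxml's for this purpose. The helper strip_child and the sample markup are assumptions made for illustration, not trafilatura's actual code.

```python
import xml.etree.ElementTree as ET


def strip_child(parent, child):
    """Remove `child` but keep its text, merging it into the previous
    sibling's tail (or the parent's text if there is no previous sibling)."""
    merged = (child.text or "") + (child.tail or "")
    children = list(parent)
    idx = children.index(child)
    if idx == 0:
        parent.text = (parent.text or "") + merged
    else:
        prev = children[idx - 1]
        prev.tail = (prev.tail or "") + merged
    parent.remove(child)


# Hypothetical markup: a header followed by a date wrapped in a span.
root = ET.fromstring(
    "<div><h1>A Deep Dive</h1>\n<span>December 27, 2021</span></div>"
)
h1, span = root[0], root[1]
strip_child(root, span)  # the date is now the tail of the header

# Stripping both parts and concatenating them loses the separating newline.
naive = h1.text.strip() + h1.tail.strip()

# Joining the non-empty parts with a space keeps the words apart.
spaced = " ".join(
    part.strip() for part in (h1.text, h1.tail) if part and part.strip()
)

print(repr(naive))
print(repr(spaced))
```

Joining with a single space is only one of the two strategies mentioned; respecting the original newline would work the same way with a different separator.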

adbar · 2022-02-14T16:48:58Z

Yes, I think the issues in the document you mention are related to deleted <span> sections.

karlkovaciny · 2022-02-15T07:36:36Z

Hey, this is a great library. I was ready to subscribe to a service just to get what this does for me.

For the 30th page I extracted, https://thehill.com/homenews/senate/594044-sen-lujan-to-return-to-senate-in-time-to-vote-for-supreme-court-nominee, Trafilatura 1.0.0 returned only 150 chars of text:

© Greg Nash
Luján planning return to Senate in time to vote for Supreme Court nominee
By Olafimihan Oshin - 02/13/22 12:54 PM EST
Skip to main content

I downloaded the HTML source (lujan.txt) and confirmed it does have the article text in it (starting with "Sen. Ben Ray Luján").

I decided to try the external fallback "Readability". I started Python in my trafilatura container and ran this code:

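import lxml.etree
import lxml.html
import trafilatura.external

with open('lujan.html') as f:
    # parse the saved page into an lxml tree
    doc = lxml.html.parse(f).getroot()
    x = trafilatura.external.try_readability(doc, "file:///lujan.html")
    print(lxml.etree.tostring(x, pretty_print=True, encoding="unicode"))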

But that just gave me a bunch of XML/JavaScript that didn't even have the main text in it.

Perhaps a fallback could be added: when the extracted text is short and there are large contiguous blocks of unextracted text, grab those instead?
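
A rough sketch of what such a fallback might look like, using only the standard library's html.parser. Nothing here exists in trafilatura: BlockCollector, fallback_text and both thresholds (min_len, block_len) are hypothetical names and values chosen for illustration.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collect the visible text of block-level elements, skipping scripts."""

    SKIP = {"script", "style", "noscript"}
    BLOCKS = {"p", "div", "article", "section", "li", "td", "blockquote"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []
        self._skip_depth = 0

    def flush(self):
        # Normalize whitespace and store the block if it is not empty.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCKS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCKS:
            self.flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buffer.append(data)


def fallback_text(html, extracted, min_len=250, block_len=200):
    """Return `extracted` if it is long enough, else the long text blocks."""
    if extracted and len(extracted) >= min_len:
        return extracted
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    long_blocks = [b for b in collector.blocks if len(b) >= block_len]
    return "\n".join(long_blocks) if long_blocks else extracted
```

A heuristic like this would probably pull in boilerplate on some pages, so it would only make sense as a last resort when the regular extraction returns very little.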

adbar · 2022-02-17T18:55:39Z

Hi @karlkovaciny, the cutting-edge version from the repository is slightly better: it outputs the article but still includes garbled JavaScript. That's definitely a case to watch for.

EDIT: for the archived version of the page I now get the same problem as you.

adbar · 2022-05-13T10:04:50Z

Suggested in #208:

Text below the article: https://web.archive.org/web/20220513095359/https://www.vivereancona.it/2022/05/13/ubriaco-cammina-lungo-la-flaminia-quando-vede-arrivare-i-soccorsi-si-getta-nei-cespugli-e-fugge/2100180067/
A little text lost to false positives (other articles): https://web.archive.org/web/20220513100231/https://www.vallesabbianews.it/notizie-it/%28Ro%C3%A8-Volciano%29-Davide-Nedrotti-campione-del-mondo-60595.html
Text above and below: https://web.archive.org/web/20220513100330/https://www.quinewsfirenze.it/firenze-terremoto-chianti.htm

felipehertzer · 2022-05-23T00:11:25Z

Hey @adbar

I'm having a problem with a few publications like huffpost where the metadata is not extracted correctly.
But if I change the line below to tree = fromstring(htmlobject.encode('utf8'), parser=HTML_PARSER), it starts to work.
What do you think?

trafilatura/trafilatura/utils.py

Line 177 in 168e660

tree = fromstring(htmlobject, parser=HTML_PARSER)

Example: https://bit.ly/3PuvL26
Other example: https://bit.ly/3ai8zEf

adbar · 2022-06-01T16:21:53Z

Hi @felipehertzer, I don't think I can reproduce the bug, which metadata fields do you mean exactly?

kinoute · 2023-09-21T04:13:59Z

Hello,

URL of testing: https://orientxxi.info/fa
Trafilatura version : 1.6.2

import trafilatura
downloaded = trafilatura.fetch_url("https://orientxxi.info/fa")
trafilatura.extract(downloaded, output_format="json")

I am wondering why the title is not the one provided in the HTML element <title>? Trafilatura returns a long sentence:

{"title": "به زبانهای دیگر Yémen. Une paix qui se fait attendre Laurent Bonnefoy · 21 septembre أوسلو، نموذج للفشل دانيال ليفي · 21 أيلول (سبتمبر) موقع “أوريان 21” يدعوكم للاحتفال بعيد ميلاده العاشر! · 20 أيلول (سبتمبر) Petroleum. Turkey vs. Iraq, but the Kurds are Collateral Victims Benoît Drevet · 20 September El doble estándar de Egipto para acoger a sus “huéspedes” sudaneses Séverine Evanno · 1ro de septiembre Khaled El Qaisi, colpevole di Palestina Cecilia Dalla Negra · 18 settembre", "author": null,....

Thanks!

adbar · 2023-10-09T12:16:20Z

Hi @kinoute, I think it could be because of a tag mismatch (malformed HTML) just before the text segments: <h2 class="indication">به زبانهای دیگر</h3> It implies that all that follows is a title.

Please note that the extraction doesn't work as well on homepages in general.
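
To check a page for this kind of tag mismatch, a small sketch with the standard library's html.parser could look as follows. MismatchFinder is a hypothetical helper, not part of trafilatura, and it will also flag legitimately omitted closing tags (such as an unclosed p or li), so its output is only a hint.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class MismatchFinder(HTMLParser):
    """Record closing tags that do not match the innermost open tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.mismatches = []  # (expected, found, (line, column))

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> open and close at once.
        pass

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            expected = self.stack[-1] if self.stack else None
            self.mismatches.append((expected, tag, self.getpos()))


finder = MismatchFinder()
finder.feed('<div><h2 class="indication">heading</h3><p>text</p></div>')
for expected, found, position in finder.mismatches:
    print(f"expected </{expected}> but found </{found}> at {position}")
```

Because the unclosed h2 stays on the stack, later closing tags are reported too; the first entry is usually the one worth looking at.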

sepsi77 · 2023-10-10T11:28:59Z

Hi @adbar,

I ran into extraction issues.

URL: https://microsoft.github.io/autogen/docs/Use-Cases/enhanced_inference/

Output: ! d o c t y p e h t m l >

I also tested the html2txt feature and it didn't work any better. It gave me this output:

h t m l c l a s s = " d o c s - v e r s i o n - c u r r e n t " l a n g = " e n " d i r = " l t r " >

I scrape the HTML outside of trafilatura. I confirmed that we are getting all of the HTML, but it seems there's something in the HTML that trips up the extraction.

I used trafilatura.extract() and passed the HTML code as a string into the function. I tested different settings for the favor_recall and favor_precision arguments. They didn't change the output in any significant way. I also tested the trafilatura.baseline() function and it yielded similar results.

adbar · 2023-10-10T12:02:19Z

@sepsi77 There are LXML-related issues on MacOS M1, M2 etc. (see also #166).
Is it the platform you're using or can you provide more details?

sepsi77 · 2023-10-10T14:11:26Z

@adbar yes, I'm on M1 MacBook

adbar · 2023-10-10T14:27:31Z

Did you try building LXML from source?

sepsi77 · 2023-10-10T15:31:15Z

I can't seem to get it to work. I'm new to this level of tweaking with the system. The installation fails because of missing precompiled Cython files. Running it with the --without-cython flag doesn't work either.

RuntimeError: ERROR: Trying to build without Cython, but pre-generated 'src/lxml/etree.c' is not available (to ignore this error, pass --without-cython or set environment variable WITHOUT_CYTHON=true).

I think I'll just move the script into a Docker container and see if that helps.

kinoute · 2023-10-11T15:59:58Z

@adbar Thanks for your answer on my previous case. I have another one! Doing something like:

        trafi_extraction = trafilatura.extract(
            response.decode(errors='ignore'),
            output_format='json',
            include_images=False,
            date_extraction_params={
                'extensive_search': True,
                'original_date': True,
                'min_date': EARLIEST_VALID_DATE,
            },
            include_comments=False,
        )
        
        trafilatura_data = trafi_extraction and json.loads(trafi_extraction)

Returns

json.decoder.JSONDecodeError: Invalid \escape: line 1 column 2947 (char 2946)

For this given URL : http://sport.kurganobl.ru/8980.html

trafi_extraction contains:

{"title": null, "author": null, "hostname": null, "date": "2016-12-12", "categories": "", "tags": "", "fingerprint": "6920faf8766bf202", "id": null, "license": null, "comments": null, "raw_text": ", 8 . 350 35 . 1000 1000 , . 21 8 .  7 300 , 01:08:25, .  ̀ .  1000 1000 div>    \n      \n         \n      \n -    \n         -2016   \n        \n     \n    \n        - \n    \n          \n        \n   -      \n        \n           \n     \n        \n , , ! -      \n       \n     \r\n8\r\n\r\n\r\n1000\r\n1000\r\n      \n     \n       \n          \n        \n ,    -     \n    \n        \n     II  -    \n        \n   - \n     \n       \n    ! \n          \n           \n ++ =     \n         \n         \n       \n      \n  \r\n8\r\n \r\n\r\n1000\r\n1000\r\n     ZauraLife \n      150     \n       \n     76-        \n 3     :   \n           2015  \n   \n             \n    \n       \n      \n        \n     \n          \n       \n       \n     \n     \n   \n     -    \n          \n  \r\n8\r\n \r\n\r\n1000\r\n1000\r\n - 2016  \n           \n          \n     \n       \n             \n       \n    \n     ! \n         \n    \n     \n       \n    - 2016     \n       \n     \n     \n     ! \n     \n      \n    \n    \n     \n   \r\n8\r\n \r\n\r\n1000\r\n1000\r\n \n     \n    \n      \n    \n   - \n       \n           \n        \n   \n          - \n     -   2015 ? 
\n          \n        \n             \n        ( ) \n             \n        \n            \n         \n   \r\n8\r\n \r\n\r\n1000\r\n1000\r\n      \n       \n       \n    \n        \n -      \n          \n     \n         \n          \n    II  -   \n         -  \n       \n      \n         \n          \n    \n      \n           \n    \n          \n         \n                \n       \n           \n       \n      \n            \n       \n            \n          \n      \n        \n         \n            \n       \n          \n        \n      \n       \n      \n         \n           \n          \n           \n         \n            \n      \n     \n     \n       \n     \n           \n        \n , ,   !  \n        \n        \n     \n  ,     \n   2015       \n             \n             \n        2016      \n       \n       , ,   ! \n        \n          \n      \n          \n -     \n           \n    \n     \n     ! \n     \n      \n           \n       \n       \n    .   - . \n           \n         \n <\r\n\r\n1000\r\n1000\r\na href=\"8444.html\" title=\"  \">   \n      \n          \n         \n  -    -    2015  \n         \n         \n    \n          \n         ZauraLife \n    \n       \n      \n      \n   \n    \n    \n      \n     \n         \n XXVII      \n        \n          \n          \n !      \n    \n         \n         \n     \n XXVII       \n     \n        \n        \n      -   -    \n            \n          \n   -  \n       \n      26  \n        ? \n          \n !      2016 \n   38", "text": ", 8 . 350 35 .\n1000 1000 , . 21 8 .\n7 300 , 01:08:25, .\ǹ .\n1000 1000 div>\n-\n-2016\n-\n-\n, , ! -\n8\n1000\n1000\n, -\nII -\n-\n!\n++ =\n8\n1000\n1000\nZauraLife\n150\n76-\n3 :\n2015\n-\n8\n1000\n1000\n- 2016\n!\n- 2016\n!\n8\n1000\n1000\n-\n-\n- 2015 ?\n( )\n8\n1000\n1000\n-\nII -\n-\n, , !\n,\n2015\n2016\n, , !\n-\n!\n. - .\n<\n1000\n1000\na href=\"8444.html\" title=\" \">\n- - 2015\nZauraLife\nXXVII\n!\nXXVII\n- -\n-\n26\n?\n! 
2016\n38", "language": null, "image": null, "pagetype": null, "source": null, "source-hostname": null, "excerpt": null}

Edit: Right now I am handling this with this method:

    def fix_invalid_escapes(self, s):
        # This regex matches a backslash not followed by a valid JSON escape
        return re.sub(r'\\(?![/bfnrt"\\u])', r'\\\\', s)

But I think maybe Trafilatura could handle this natively? (I'm not even sure my fix is enough/good)
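
One possible weakness of the regex above is that it looks at each backslash in isolation, so the second half of an already valid \\ pair can be doubled again when a non-escape character follows it. A sketch of a pair-aware variant, using only the standard library, is shown here; escape_invalid_backslashes is a hypothetical name, and the function does not check that \u is followed by four hex digits.

```python
import json
import re

# Characters that may legally follow a backslash inside a JSON string.
VALID_ESCAPES = set('"\\/bfnrtu')


def escape_invalid_backslashes(raw: str) -> str:
    """Double every backslash that does not start a valid JSON escape.

    The pattern consumes the backslash together with the next character,
    so a valid pair such as two backslashes is skipped as a unit.
    """
    def repair(match):
        pair = match.group(0)
        if pair[1] in VALID_ESCAPES:
            return pair
        return "\\\\" + pair[1]

    return re.sub(r"\\.", repair, raw, flags=re.DOTALL)


# One invalid escape (\q) and one valid escaped backslash followed by x.
broken = r'{"text": "a\qb \\x"}'
data = json.loads(escape_invalid_backslashes(broken))
print(data["text"])
```

This only papers over the symptom; if the invalid escape comes from how the response is decoded, fixing the decoding step would be the cleaner route.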

adbar · 2023-10-12T12:54:35Z

Hi @kinoute, there must be something wrong in the way you encode or decode the HTML response. I cannot reproduce the bug:
trafilatura -u "http://sport.kurganobl.ru/8980.html" --json works on my computer.

adbar · 2023-10-12T15:16:07Z

@sepsi77 Please note that brew can now be used to install Trafilatura on MacOS in a seamless way:
https://formulae.brew.sh/formula/trafilatura

sepsi77 · 2023-10-19T08:06:10Z

Thanks @adbar, using brew to install trafilatura fixed the problem.

hugoobauer · 2024-01-18T10:26:10Z

Hi there, I'm not sure this is the right thread, but here's the problem I'm having. Some sites have more than one <article> node for a single article: https://conselhos-desportivos.decathlon.pt/guia-de-treino-para-gluteos

The XPath that extracts the text is (.//article)[1], so it only extracts the first paragraph. Do you have a solution in mind? Do you think modifying the XPath to retrieve all <article> elements and iterating over them to concatenate them is a good solution?

adbar · 2024-01-18T12:24:07Z

Hi @hugoobauer, this problem is also mentioned in #432. The problem with taking all article elements is that sometimes they are related content and not main content (e.g. a list of teasers at the end of a page).
IMHO this is an improper use of the <article> tag but I'm not sure what to do about it: the XPath would have to be changed or a new heuristic on content length added.

hugoobauer · 2024-01-18T13:58:08Z

Hi @adbar, I completely agree that this is a misuse of <article>. I'm looking for a way to extract all the "relevant" content from a page, even if I take a bit too much. In this case, retrieving info at the bottom of the page that's more or less related to the article bothers me less than missing the majority of an article's content.

So I made a little POC to test a solution:

change the XPath from (.//article)[1] to (.//article)
update the loop to handle the case where several nodes are returned

    for expr in BODY_XPATH:
        # select tree if the expression has been found
        try:
            subtrees = tree.xpath(expr)
            if len(subtrees) > 1:  # and favor_recall=True ?
                new_subtree = Element(subtrees[0].tag)
                for _subtree in subtrees:
                    for child in _subtree:
                        # if len(' '.join(child.itertext()).strip()) > MIN_EXTRACTED_SIZE ? 
                        new_subtree.append(child)
                subtree = new_subtree
            else:
                subtree = subtrees[0]
        except IndexError:
            continue

If there's only one item, it's the same as before. Otherwise, I create a new node of the same tag (article in this case), and I insert in it each child of each of the nodes. In addition, we could check whether the favor_recall option is enabled, so that it's not done by default. And use the MIN_EXTRACTED_SIZE value to extract only those elements that are long enough?
What do you think? I've only been studying the repository for a short time, so I may have missed something.

adbar · 2024-01-18T16:18:59Z

@hugoobauer Your idea looks good. The length heuristic would have to run on whole <article> elements and I'm not sure how.

In any case, feel free to draft a pull request for this or for another issue. You can add a test case somewhere in tests/unit_tests.py and the tests have to pass (realworld_tests.py are also relevant here). You can also check the benchmark in the tests/ folder to see if performance improves.
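
One way the length heuristic could run on whole <article> elements is sketched below with the standard library's xml.etree.ElementTree rather than lxml. The function name merge_long_articles and the min_size default of 250 are assumptions for illustration; this is not the code of the eventual pull request.

```python
import xml.etree.ElementTree as ET


def merge_long_articles(tree, min_size=250):
    """Merge the children of every <article> whose whole text is long enough.

    The length check runs on the complete text of each <article>, so short
    teaser-like articles are dropped as a unit. Text placed directly inside
    <article> (outside any child element) is not carried over in this sketch.
    """
    kept = []
    for article in tree.findall(".//article"):
        text = " ".join("".join(article.itertext()).split())
        if len(text) >= min_size:
            kept.append(article)
    if not kept:
        return None
    merged = ET.Element("article")
    for article in kept:
        merged.extend(list(article))
    return merged
```

With lxml the same idea should apply, with the caveat that appending an element there moves it out of its original parent instead of sharing it.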

hugoobauer · 2024-01-18T17:48:56Z

Okay great, I will work on a PR soon

Sang12-2017-18 · 2024-03-25T09:49:40Z

Hi @adbar

I am having an issue with this URL - https://www.energyvault.com/about#leaders. I am not able to extract the text from it. Here's the code I am using:

def get_text(url=None, html_text=None):
    from trafilatura import bare_extraction, fetch_url
    if not url and not html_text:
        raise ValueError("Either 'url' or 'html_text' must be provided")
    if html_text:
        html_string = html_text
    else:
        url_response = fetch_url(url)
        html_string = url_response
    extracted_data = bare_extraction(html_string,
                                     include_links=True,
                                     include_formatting=True,
                                     include_images=True,
                                     include_tables=True)
    doc_text = extracted_data["text"] if extracted_data else None
    return doc_text


if __name__ == "__main__":
    url = "https://www.energyvault.com/about#leaders"
    text = get_text(url=url)
    print(text)

When I debugged it a little bit, I found that it throws an exception with the following traceback:

Traceback (most recent call last):
  File "/lib/python3.11/site-packages/trafilatura/core.py", line 921, in bare_extraction
    document = extract_metadata(tree, url, date_extraction_params, no_fallback, author_blacklist)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/trafilatura/metadata.py", line 535, in extract_metadata
    metadata.date = find_date(tree, **date_config)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/htmldate/core.py", line 986, in find_date
    return converted or search_page(htmlstring, options)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/htmldate/core.py", line 724, in search_page
    dateobject = datetime(int(bestmatch[1]), int(bestmatch[2]), 1)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: month must be in 1..12

Let me know if you need any more info.
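
The last frame of the traceback builds a datetime from matched year and month strings without checking their range. A defensive variant might look like the sketch below; safe_year_month is a hypothetical helper and not necessarily how htmldate handles this case.

```python
from datetime import datetime


def safe_year_month(year_str, month_str):
    """Build a date for the first day of the month, or None if invalid."""
    try:
        return datetime(int(year_str), int(month_str), 1)
    except ValueError:
        # Raised for non-numeric input or a month outside 1..12.
        return None


print(safe_year_month("2024", "03"))
print(safe_year_month("2024", "13"))
```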

adbar · 2024-03-25T12:28:51Z

Hi @Sang12-2017-18, I cannot reproduce the bug as such but something is odd with this webpage. Do you use the latest version of the trafilatura and htmldate packages? If so, please file an issue on the htmldate repository.

Sang12-2017-18 · 2024-03-26T05:26:30Z

Hi @adbar
Thank you for the quick response. I have the latest versions of trafilatura (v1.8.0), and htmldate (v1.8.0). I'll surely file an issue in the htmldate repository. Before that, I wanted to know one thing - for my requirement, extracting the date published from the web page is not necessary. I'm quite okay if the date comes as None, but I want other fields like text, author etc. Is there any configuration option available such that we can exclude dates while extracting, but keep other metadata?

adbar · 2024-03-28T15:44:36Z

@Sang12-2017-18 So far there is no such option. I still cannot reproduce the error. How did you get the traceback?

adbar · 2024-04-15T13:17:54Z

@Sang12-2017-18 the bug is now fixed in Htmldate version 1.8.1. As for the option to bypass metadata extraction I'm going to add it to the to do list.

adbar added good first issue Good for newcomers up for grabs Good for (first) contributors labels Jan 9, 2020

adbar pinned this issue Sep 21, 2020

adbar unpinned this issue Sep 21, 2020

adbar closed this as completed Sep 21, 2020

adbar reopened this Oct 20, 2021

adbar pinned this issue Oct 20, 2021

adbar changed the title ~~Test trafilatura on further web pages and report bugs~~ List of smaller extraction bugs (text & metadata) Jan 4, 2022

adbar added a commit that referenced this issue May 11, 2022

extraction fix: div with only lb (#4)

14d6205

This was referenced May 13, 2022

Removal of useless new line and carriage return characters in Html headings and paragraphs #199

Closed

Incorrect recognition #208

Closed

hugoobauer mentioned this issue Jan 24, 2024

Merge multiple nodes returned by XPath #487

Closed
