List of smaller extraction bugs (text & metadata) #4

Open
adbar opened this issue Jan 9, 2020 · 29 comments
Labels
good first issue Good for newcomers up for grabs Good for (first) contributors

Comments

@adbar
Owner

adbar commented Jan 9, 2020

I have mostly tested trafilatura on a set of English, German and French web pages I had run into by surfing or during web crawls. There are definitely further web pages and cases in other languages for which the extraction doesn't work so far.

Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in xpaths.py (see BODY_XPATH and COMMENTS_XPATH lists).

Thanks!

@adbar adbar added good first issue Good for newcomers up for grabs Good for (first) contributors labels Jan 9, 2020
@adbar adbar pinned this issue Sep 21, 2020
@adbar adbar unpinned this issue Sep 21, 2020
@adbar adbar closed this as completed Sep 21, 2020
@adbar adbar reopened this Oct 20, 2021
@adbar adbar pinned this issue Oct 20, 2021
@adbar adbar changed the title Test trafilatura on further web pages and report bugs List of smaller extraction bugs (text & metadata) Jan 4, 2022
@cheezman34

Words are getting smashed together on this page:

https://research.checkpoint.com/2021/a-deep-dive-into-doublefeature-equation-groups-post-exploitation-dashboard/

[Screenshot, 2022-02-11 11:42:46 AM]
[Screenshot, 2022-02-11 11:44:11 AM]

I looked into the extraction code here a bit. The date here is inside a span, which gets stripped, and then the date becomes the tail of the header. All of the whitespace (which includes a newline) gets lost, and then the tail is just directly appended to the header. I'm not sure if the best strategy to fix would be to include a space between the tail and text of the header node when they get extracted, or maybe to look for newlines in the text and somehow respect them. It looks like some of the lxml stuff just strips whitespace automatically when you access "text" and "tail" attributes.

[Screenshot, 2022-02-11 11:44:51 AM]
[Screenshot, 2022-02-11 11:45:15 AM]

I didn't dig into this one, but I'm guessing it's similar to the first case: the webpage relies on whitespace that gets stripped by the extraction algorithm.
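To make the first option above concrete (inserting a space between a node's text and its tail), here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml, the sample markup is invented, and join_text_and_tail is a hypothetical helper, not a trafilatura function.

```python
import xml.etree.ElementTree as ET


def join_text_and_tail(elem):
    """Join an element's text and tail, keeping one space between them.

    Hypothetical helper for illustration; not part of trafilatura.
    """
    parts = [(elem.text or "").strip(), (elem.tail or "").strip()]
    return " ".join(part for part in parts if part)


# Invented sample: once the <span> tags are stripped, the date is the tail of <h2>.
root = ET.fromstring("<div><h2>Some headline</h2>\nDecember 27, 2021</div>")
header = root.find("h2")

# Naive concatenation loses the newline and smashes the words together.
naive = (header.text or "").strip() + (header.tail or "").strip()
print(naive)
print(join_text_and_tail(header))
```

The second option (respecting newlines in the source) would need a decision about which whitespace is significant, which is why the sketch only covers the simpler one.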

@adbar
Owner Author

adbar commented Feb 14, 2022

Yes, I think the issues in the document you mention are related to deleted <span> sections.

@karlkovaciny

Hey, this is a great library. I was ready to subscribe to a service just to get what this does for me.

For the 30th page I extracted, https://thehill.com/homenews/senate/594044-sen-lujan-to-return-to-senate-in-time-to-vote-for-supreme-court-nominee, Trafilatura 1.0.0 returned only 150 chars of text:

© Greg Nash
Luján planning return to Senate in time to vote for Supreme Court nominee
By Olafimihan Oshin - 02/13/22 12:54 PM EST
Skip to main content

I downloaded the HTML source (lujan.txt) and confirmed it does have the article text in it (starting with "Sen. Ben Ray Luján").

I decided to try the external fallback "Readability". I started Python in my trafilatura container and ran this code:

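import lxml.etree
import trafilatura.external
# assumed import: the original snippet did not show where parse comes from
from lxml.html import parse

with open('lujan.html') as f:
    doc = parse(f).getroot()
    x = trafilatura.external.try_readability(doc, "file:///lujan.html")
    print(lxml.etree.tostring(x, pretty_print=True, encoding="unicode"))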

But that just gave me a bunch of XML/JavaScript that didn't even have the main text in it.

Perhaps a fallback could be added: when the extracted text is short and there are large contiguous blocks of unextracted text, grab those instead?
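A rough sketch of what such a fallback could look like, in plain Python. The function name choose_text, the 250-character threshold and the rule "a block must be longer than everything extracted so far" are all assumptions made for illustration; nothing here reflects how trafilatura actually selects content.

```python
def choose_text(extracted, candidate_blocks, min_length=250):
    """Return the extracted text, or larger unextracted blocks if it is too short.

    Hypothetical heuristic; the threshold is arbitrary.
    """
    if len(extracted) >= min_length:
        return extracted
    # keep only blocks that are longer than the whole extraction result
    large_blocks = [
        block.strip()
        for block in candidate_blocks
        if len(block.strip()) > len(extracted)
    ]
    if large_blocks:
        return "\n".join(large_blocks)
    return extracted


# Invented sample data: a short extraction result and three text blocks.
short_result = "© Greg Nash\nSkip to main content"
blocks = ["Menu", "Sen. Ben Ray Luján " + "placeholder text " * 40, "Footer"]
print(len(choose_text(short_result, blocks)))
```

How the candidate blocks would be collected from the HTML tree is left open here; that is the harder part of the idea.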

@adbar
Owner Author

adbar commented Feb 17, 2022

Hi @karlkovaciny, the cutting-edge version from the repository is slightly better: it outputs the article but still includes garbled JavaScript. That's definitely a case to watch for.

EDIT: for the archived version of the page I now get the same problem as you.

adbar added a commit that referenced this issue May 11, 2022
@felipehertzer
Contributor

felipehertzer commented May 23, 2022

Hey @adbar

I'm having a problem with a few publications like HuffPost where the metadata is not extracted correctly.
But if I change the line below to tree = fromstring(htmlobject.encode('utf8'), parser=HTML_PARSER), it starts to work.
What do you think?

tree = fromstring(htmlobject, parser=HTML_PARSER)

Example: https://bit.ly/3PuvL26
Other example: https://bit.ly/3ai8zEf

@adbar
Owner Author

adbar commented Jun 1, 2022

Hi @felipehertzer, I don't think I can reproduce the bug. Which metadata fields do you mean exactly?

@kinoute

kinoute commented Sep 21, 2023

Hello,

URL tested: https://orientxxi.info/fa
Trafilatura version: 1.6.2

import trafilatura
downloaded = trafilatura.fetch_url("https://orientxxi.info/fa")
trafilatura.extract(downloaded, output_format="json")

I am wondering why the title is not the one provided in the HTML element <title>? Trafilatura returns a long sentence:

{"title": "به زبانهای دیگر Yémen. Une paix qui se fait attendre Laurent Bonnefoy · 21 septembre أوسلو، نموذج للفشل دانيال ليفي · 21 أيلول (سبتمبر) موقع “أوريان 21” يدعوكم للاحتفال بعيد ميلاده العاشر! · 20 أيلول (سبتمبر) Petroleum. Turkey vs. Iraq, but the Kurds are Collateral Victims Benoît Drevet · 20 September El doble estándar de Egipto para acoger a sus “huéspedes” sudaneses Séverine Evanno · 1ro de septiembre Khaled El Qaisi, colpevole di Palestina Cecilia Dalla Negra · 18 settembre", "author": null,....

Thanks!

@adbar
Owner Author

adbar commented Oct 9, 2023

Hi @kinoute, I think it could be because of a tag mismatch (malformed HTML) just before the text segments: <h2 class="indication">به زبانهای دیگر</h3>. The heading is never closed properly, which implies that all that follows is a title.

Please note that the extraction doesn't work as well on homepages in general.
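For anyone who wants to check a page for this kind of heading mismatch before filing a report, a minimal sketch with the standard library's html.parser follows. HeadingMismatchFinder is a hypothetical helper written for this example; it does not handle nested headings and is unrelated to how trafilatura or lxml repair malformed HTML.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingMismatchFinder(HTMLParser):
    """Collect (opened, closed) pairs where a heading is closed by another heading tag."""

    def __init__(self):
        super().__init__()
        self.open_heading = None
        self.mismatches = []

    def handle_starttag(self, tag, attrs):
        # remember the most recently opened heading
        if tag in HEADINGS:
            self.open_heading = tag

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            if self.open_heading is not None and tag != self.open_heading:
                self.mismatches.append((self.open_heading, tag))
            self.open_heading = None


finder = HeadingMismatchFinder()
finder.feed('<h2 class="indication">به زبانهای دیگر</h3><p>Yémen.</p>')
finder.close()
print(finder.mismatches)  # [('h2', 'h3')]
```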

@sepsi77

sepsi77 commented Oct 10, 2023

Hi @adbar,

I ran into extraction issues.

URL: https://microsoft.github.io/autogen/docs/Use-Cases/enhanced_inference/

Output: ! d o c t y p e h t m l >

I also tested the html2txt feature and it didn't work any better. It gave me this output:

h t m l c l a s s = " d o c s - v e r s i o n - c u r r e n t " l a n g = " e n " d i r = " l t r " >

I scrape the HTML outside of trafilatura. I confirmed that we are getting all of the HTML, but it seems there's something in the HTML that trips up the extraction.

I used trafilatura.extract() and passed the HTML code as a string into the function. I tested different settings for the favor_recall and favor_precision arguments. They didn't change the output in any significant way. I also tested the trafilatura.baseline() function and it yielded similar results.

@adbar
Owner Author

adbar commented Oct 10, 2023

@sepsi77 There are LXML-related issues on MacOS M1, M2 etc. (see also #166).
Is it the platform you're using or can you provide more details?

@sepsi77

sepsi77 commented Oct 10, 2023

@adbar yes, I'm on M1 MacBook

@adbar
Owner Author

adbar commented Oct 10, 2023

Did you try building LXML from source?

@sepsi77

sepsi77 commented Oct 10, 2023

I can't seem to get it to work. I'm new to this level of system tweaking. The installation fails because of missing precompiled Cython files. Running it with the --without-cython flag doesn't work either.

RuntimeError: ERROR: Trying to build without Cython, but pre-generated 'src/lxml/etree.c' is not available (to ignore this error, pass --without-cython or set environment variable WITHOUT_CYTHON=true).

I think I'll just move the script into a Docker container and see if that helps.

@kinoute

kinoute commented Oct 11, 2023

@adbar Thanks for your answer on my previous case. I have another one! Doing something like:

        trafi_extraction = trafilatura.extract(
            response.decode(errors='ignore'),
            output_format='json',
            include_images=False,
            date_extraction_params={
                'extensive_search': True,
                'original_date': True,
                'min_date': EARLIEST_VALID_DATE,
            },
            include_comments=False,
        )
        
        trafilatura_data = trafi_extraction and json.loads(trafi_extraction)

Returns

json.decoder.JSONDecodeError: Invalid \escape: line 1 column 2947 (char 2946)

for this URL: http://sport.kurganobl.ru/8980.html

trafi_extraction contains:

{"title": null, "author": null, "hostname": null, "date": "2016-12-12", "categories": "", "tags": "", "fingerprint": "6920faf8766bf202", "id": null, "license": null, "comments": null, "raw_text": ", 8 . 350 35 . 1000 1000 , . 21 8 .  7 300 , 01:08:25, .  ̀ .  1000 1000 div>    \n      \n         \n      \n -    \n         -2016   \n        \n     \n    \n        - \n    \n          \n        \n   -      \n        \n           \n     \n        \n , , ! -      \n       \n     \r\n8\r\n\r\n\r\n1000\r\n1000\r\n      \n     \n       \n          \n        \n ,    -     \n    \n        \n     II  -    \n        \n   - \n     \n       \n    ! \n          \n           \n ++ =     \n         \n         \n       \n      \n  \r\n8\r\n \r\n\r\n1000\r\n1000\r\n     ZauraLife \n      150     \n       \n     76-        \n 3     :   \n           2015  \n   \n             \n    \n       \n      \n        \n     \n          \n       \n       \n     \n     \n   \n     -    \n          \n  \r\n8\r\n \r\n\r\n1000\r\n1000\r\n - 2016  \n           \n          \n     \n       \n             \n       \n    \n     ! \n         \n    \n     \n       \n    - 2016     \n       \n     \n     \n     ! \n     \n      \n    \n    \n     \n   \r\n8\r\n \r\n\r\n1000\r\n1000\r\n \n     \n    \n      \n    \n   - \n       \n           \n        \n   \n          - \n     -   2015 ? 
\n          \n        \n             \n        ( ) \n             \n        \n            \n         \n   \r\n8\r\n \r\n\r\n1000\r\n1000\r\n      \n       \n       \n    \n        \n -      \n          \n     \n         \n          \n    II  -   \n         -  \n       \n      \n         \n          \n    \n      \n           \n    \n          \n         \n                \n       \n           \n       \n      \n            \n       \n            \n          \n      \n        \n         \n            \n       \n          \n        \n      \n       \n      \n         \n           \n          \n           \n         \n            \n      \n     \n     \n       \n     \n           \n        \n , ,   !  \n        \n        \n     \n  ,     \n   2015       \n             \n             \n        2016      \n       \n       , ,   ! \n        \n          \n      \n          \n -     \n           \n    \n     \n     ! \n     \n      \n           \n       \n       \n    .   - . \n           \n         \n <\r\n\r\n1000\r\n1000\r\na href=\"8444.html\" title=\"  \">   \n      \n          \n         \n  -    -    2015  \n         \n         \n    \n          \n         ZauraLife \n    \n       \n      \n      \n   \n    \n    \n      \n     \n         \n XXVII      \n        \n          \n          \n !      \n    \n         \n         \n     \n XXVII       \n     \n        \n        \n      -   -    \n            \n          \n   -  \n       \n      26  \n        ? \n          \n !      2016 \n   38", "text": ", 8 . 350 35 .\n1000 1000 , . 21 8 .\n7 300 , 01:08:25, .\ǹ .\n1000 1000 div>\n-\n-2016\n-\n-\n, , ! -\n8\n1000\n1000\n, -\nII -\n-\n!\n++ =\n8\n1000\n1000\nZauraLife\n150\n76-\n3 :\n2015\n-\n8\n1000\n1000\n- 2016\n!\n- 2016\n!\n8\n1000\n1000\n-\n-\n- 2015 ?\n( )\n8\n1000\n1000\n-\nII -\n-\n, , !\n,\n2015\n2016\n, , !\n-\n!\n. - .\n<\n1000\n1000\na href=\"8444.html\" title=\" \">\n- - 2015\nZauraLife\nXXVII\n!\nXXVII\n- -\n-\n26\n?\n! 
2016\n38", "language": null, "image": null, "pagetype": null, "source": null, "source-hostname": null, "excerpt": null}

Edit: Right now I am handling this with this method:

    def fix_invalid_escapes(self, s):
        # This regex matches a backslash not followed by a valid JSON escape
        return re.sub(r'\\(?![/bfnrt"\\u])', r'\\\\', s)

But I think maybe Trafilatura could handle this natively? (I'm not even sure my fix is enough/good)
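One weakness of the lookahead approach is that it inspects each backslash on its own, so the second backslash of a valid `\\` pair can be doubled again, and `\u` followed by non-hex characters slips through. A possible variant, offered as a sketch rather than a vetted fix, consumes valid escape sequences as whole units and doubles only the backslashes that are left over. The sample string is invented.

```python
import json
import re

# A backslash, optionally followed by a complete, valid JSON escape sequence.
_BACKSLASH = re.compile(r'\\(["\\/bfnrt]|u[0-9a-fA-F]{4})?')


def fix_invalid_escapes(s):
    """Double every backslash that does not start a valid JSON escape."""
    def repl(match):
        # valid escape: keep it unchanged; lone backslash: escape it
        return match.group(0) if match.group(1) else "\\\\"
    return _BACKSLASH.sub(repl, s)


# Invented sample: "\q" is an invalid escape, "\\x" is a valid escaped backslash.
broken = '{"text": "a\\qb \\\\x"}'
data = json.loads(fix_invalid_escapes(broken))
print(data["text"])
```

This only repairs the string after the fact; it does not explain why the serialized output contained an invalid escape in the first place.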

@adbar
Owner Author

adbar commented Oct 12, 2023

Hi @kinoute, there must be something wrong in the way you encode or decode the HTML response. I cannot reproduce the bug:
trafilatura -u "http://sport.kurganobl.ru/8980.html" --json works on my computer.

@adbar
Owner Author

adbar commented Oct 12, 2023

@sepsi77 Please note that brew can now be used to install Trafilatura on MacOS in a seamless way:
https://formulae.brew.sh/formula/trafilatura

@sepsi77

sepsi77 commented Oct 19, 2023

Thanks @adbar, using brew to install trafilatura fixed the problem.

@hugoobauer

Hi there, I'm not sure this is the right thread, but here's the problem I'm having. Some sites have more than one <article> node for a single article: https://conselhos-desportivos.decathlon.pt/guia-de-treino-para-gluteos

The XPath that extracts the text is (.//article)[1], so it only extracts the first paragraph. Do you have a solution in mind? Do you think modifying the XPath to retrieve all <article> elements and iterating over them to concatenate them is a good solution?

@adbar
Owner Author

adbar commented Jan 18, 2024

Hi @hugoobauer, this problem is also mentioned in #432. The problem with taking all article elements is that sometimes they are related content and not main content (e.g. a list of teasers at the end of a page).
IMHO this is an improper use of the <article> tag but I'm not sure what to do about it: the XPath would have to be changed or a new heuristic on content length added.

@hugoobauer

hugoobauer commented Jan 18, 2024

Hi @adbar, I completely agree that this is a misuse of <article>. I'm looking for a way to extract all the "relevant" content from a page, even if I take a bit too much. In this case, retrieving info at the bottom of the page that's more or less related to the article bothers me less than missing the majority of an article's content.

So I made a little POC to test a solution:

  • change the XPath from (.//article)[1] to (.//article)
  • update the loop to handle the case where several nodes are returned
    # assumes: from lxml.etree import Element
    for expr in BODY_XPATH:
        # select tree if the expression has been found
        try:
            subtrees = tree.xpath(expr)
            if len(subtrees) > 1:  # and favor_recall=True ?
                # merge the children of all matches into one new node
                new_subtree = Element(subtrees[0].tag)
                for _subtree in subtrees:
                    # iterate over a copy: append() moves the child out of _subtree
                    for child in list(_subtree):
                        # if len(' '.join(child.itertext()).strip()) > MIN_EXTRACTED_SIZE ?
                        new_subtree.append(child)
                subtree = new_subtree
            else:
                subtree = subtrees[0]
        except IndexError:
            continue

If there's only one item, the behavior is the same as before. Otherwise, I create a new node with the same tag (article in this case) and insert into it each child of each of the nodes. In addition, we could check whether the favor_recall option is enabled, so that this isn't done by default, and use the MIN_EXTRACTED_SIZE value to extract only those elements that are long enough.
What do you think? I've only been studying the repository for a short time, so I may have missed something.

@adbar
Owner Author

adbar commented Jan 18, 2024

@hugoobauer Your idea looks good. The length heuristic would have to run on whole <article> elements and I'm not sure how.

In any case, feel free to draft a pull request for this or for another issue. You can add a test case somewhere in tests/unit_tests.py and the tests have to pass (realworld_tests.py are also relevant here). You can also check the benchmark in the tests/ folder to see if performance improves.
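As a starting point for the open question of running the length heuristic on whole <article> elements, here is one possible sketch. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs on its own; merge_long_articles and the threshold of 40 characters are made up for this example and are not trafilatura code or settings.

```python
import xml.etree.ElementTree as ET

MIN_ARTICLE_LENGTH = 40  # arbitrary threshold chosen for this example


def merge_long_articles(tree, min_length=MIN_ARTICLE_LENGTH):
    """Merge the children of all <article> elements whose full text is long enough.

    Text placed directly inside <article> (outside any child) is not carried over.
    """
    merged = ET.Element("article")
    for article in tree.iter("article"):
        # measure the whole element, with whitespace normalized
        text = " ".join("".join(article.itertext()).split())
        if len(text) >= min_length:
            merged.extend(list(article))
    return merged


# Invented sample page: two long parts of the main text and one short teaser.
page = ET.fromstring(
    "<body>"
    "<article><p>First part of the main text, long enough to be kept.</p></article>"
    "<article><p>Second part of the main text, also long enough.</p></article>"
    "<article><p>Short teaser</p></article>"
    "</body>"
)
result = merge_long_articles(page)
print([p.text for p in result])
```

A fixed character threshold will not separate long related-content teasers from main text, so this only addresses the simplest version of the problem.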

@hugoobauer

Okay great, I will work on a PR soon

@Sang12-2017-18

Sang12-2017-18 commented Mar 25, 2024

Hi @adbar

I am having an issue with this URL - https://www.energyvault.com/about#leaders. I am not able to extract the text from it. Here's the code I am using:

def get_text(url=None, html_text=None):
    from trafilatura import bare_extraction, fetch_url
    if not url and not html_text:
        raise ValueError("Either 'url' or 'html_text' must be provided")
    if html_text:
        html_string = html_text
    else:
        url_response = fetch_url(url)
        html_string = url_response
    extracted_data = bare_extraction(html_string,
                                     include_links=True,
                                     include_formatting=True,
                                     include_images=True,
                                     include_tables=True)
    doc_text = extracted_data["text"] if extracted_data else None
    return doc_text


if __name__ == "__main__":
    url = "https://www.energyvault.com/about#leaders"
    text = get_text(url=url)
    print(text)

When I debugged it a little, I found that it throws an exception with the following traceback:

Traceback (most recent call last):
  File "/lib/python3.11/site-packages/trafilatura/core.py", line 921, in bare_extraction
    document = extract_metadata(tree, url, date_extraction_params, no_fallback, author_blacklist)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/trafilatura/metadata.py", line 535, in extract_metadata
    metadata.date = find_date(tree, **date_config)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/htmldate/core.py", line 986, in find_date
    return converted or search_page(htmlstring, options)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/htmldate/core.py", line 724, in search_page
    dateobject = datetime(int(bestmatch[1]), int(bestmatch[2]), 1)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: month must be in 1..12

Let me know if you need any more info.
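The last frame of the traceback builds a datetime from two matched groups without checking that the month is valid. A minimal sketch of a guarded version, assuming the groups are year and month strings; safe_first_of_month is a hypothetical helper for illustration and not htmldate's actual code.

```python
from datetime import datetime


def safe_first_of_month(year, month):
    """Return datetime(year, month, 1), or None if the values do not form a valid date."""
    try:
        return datetime(int(year), int(month), 1)
    except (TypeError, ValueError):
        # e.g. month outside 1..12, or a group that is not a number
        return None


print(safe_first_of_month("2024", "03"))  # 2024-03-01 00:00:00
print(safe_first_of_month("2024", "13"))  # None
```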

@adbar
Owner Author

adbar commented Mar 25, 2024

Hi @Sang12-2017-18, I cannot reproduce the bug as such, but something is odd with this webpage. Are you using the latest versions of the trafilatura and htmldate packages? If so, please file an issue on the htmldate repository.

@Sang12-2017-18

Sang12-2017-18 commented Mar 26, 2024

Hi @adbar
Thank you for the quick response. I have the latest versions of trafilatura (v1.8.0), and htmldate (v1.8.0). I'll surely file an issue in the htmldate repository. Before that, I wanted to know one thing - for my requirement, extracting the date published from the web page is not necessary. I'm quite okay if the date comes as None, but I want other fields like text, author etc. Is there any configuration option available such that we can exclude dates while extracting, but keep other metadata?

@adbar
Owner Author

adbar commented Mar 28, 2024

@Sang12-2017-18 So far there is no such option. I still cannot reproduce the error. How did you get the traceback?

@adbar
Owner Author

adbar commented Apr 15, 2024

@Sang12-2017-18 The bug is now fixed in htmldate version 1.8.1. As for the option to bypass metadata extraction, I'm going to add it to the to-do list.


8 participants