Simple content scoring prototype #583

zirkelc · 2024-05-03T13:15:24Z

I implemented a small prototype as discussed in #572. It's super basic and I don't expect you to add it to the library, just wanted to show you some results and hear your opinion.

The scoring.py module contains the functions to count the unique words in an HTML document and in its markdown extraction. The semantic elements (header, footer, nav) are removed, but other elements could be added based on class names or ids, or other tags like forms, inputs, etc. Links are not removed yet, but I think relative links are a good indicator of navigation elements, so they could also be removed. Absolute links, or more specifically links to other hostnames, should probably be kept.
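A minimal sketch of the counting step described above, using only the standard library. It is not the code from scoring.py: the names `SKIPPED_TAGS`, `unique_words` and `unique_words_html` are hypothetical, the tokenization (lowercased `\w+` runs) is an assumption, and skipping `script` and `style` is an addition that the description does not mention.

```python
import re
from html.parser import HTMLParser

# header/footer/nav come from the description above; script/style are an
# extra assumption of this sketch.
SKIPPED_TAGS = {"header", "footer", "nav", "script", "style"}
WORD_RE = re.compile(r"\w+")


class _TextCollector(HTMLParser):
    """Collect text that is not nested inside one of SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def unique_words(text):
    """Return the set of lowercased word tokens in a plain-text string."""
    return {word.lower() for word in WORD_RE.findall(text)}


def unique_words_html(html):
    """Return the unique words of an HTML string, ignoring skipped elements."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return unique_words(" ".join(parser.chunks))
```

The same `unique_words` helper can be applied to the markdown output, so both sides are tokenized identically. Note that an unclosed `nav` or `header` in malformed HTML would make this simple depth counter skip the rest of the page.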

The ratio unique_words_markdown / unique_words_html is then considered the content score. My assumption is the following:

a score < 0.5 is bad because a lot of content from the HTML is probably missing
a score between 0.5 and 1.0 is probably good because a lot of content was preserved
a score > 1.0 is bad because it means the extraction returned more content than the HTML, which probably means non-content elements were extracted
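The score and the three ranges can be sketched as follows. The per-file numbers posted later in this thread correspond to the extracted count divided by the HTML count (for example 44 / 57 ≈ 0.77), so that is the direction used here. The function names, the labels, the treatment of the exact boundaries 0.5 and 1.0, and returning 0 for an empty HTML count are assumptions of this sketch.

```python
def content_score(html_words, extracted_words):
    """Ratio of unique words in the extraction to unique words in the HTML."""
    if html_words == 0:
        # Assumption: empty pages get a score of 0 instead of raising.
        return 0.0
    return extracted_words / html_words


def classify(score):
    """Map a content score to one of the three ranges described above."""
    if score < 0.5:
        return "low"   # a lot of HTML content is probably missing
    if score <= 1.0:
        return "good"  # most content was preserved
    return "high"      # extraction has more unique words than the HTML
```

For instance, `classify(content_score(129, 47))` falls in the low range, while `classify(content_score(172, 181))` falls in the high range.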

I ported the Readability.js isProbablyReaderable function to compare the results. My assumption is that an HTML file with is_probably_readerable=False should also have a low content score.
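For orientation, here is a heavily simplified sketch of the scoring core behind isProbablyReaderable. It is not the port from this PR: it takes the text of candidate nodes as plain strings and leaves out node selection, visibility checks and the class/id filters. The function name is hypothetical, and the default values 140 and 20 are quoted from memory of Readability.js, so they should be checked against the upstream source.

```python
import math

MIN_CONTENT_LENGTH = 140  # assumed Readability.js default (minContentLength)
MIN_SCORE = 20            # assumed Readability.js default (minScore)


def is_probably_readerable_from_texts(
    texts, min_content_length=MIN_CONTENT_LENGTH, min_score=MIN_SCORE
):
    """Accumulate sqrt(extra length) over long-enough text blocks."""
    score = 0.0
    for text in texts:
        length = len(text.strip())
        if length < min_content_length:
            continue  # short blocks do not contribute
        score += math.sqrt(length - min_content_length)
        if score > min_score:
            return True
    return False
```

Under these assumed defaults, a single block needs more than 540 characters to pass on its own, while several medium-sized blocks can pass together.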

The scoring_small.py module is a copy of comparison_small.py. I collect all unique word counts and calculate some statistics. Here are the results so far:

All scores
=================================
Number of samples: 750

Average ratio of content score to HTML score: 0.676667469420454
Average difference between HTML score and content score: 190.01466666666667

Range           Count   Percentage
----------------------------
0.00 - 0.50     172     22.93%
0.50 - 0.75     212     28.27%
0.75 - 1.00     327     43.60%
1.00 - 1.25     29      3.87%
1.25 - 1.50     0       0.00%
1.50 - inf      2       0.27%

The range with the most examples is 0.75 - 1.0


is_probably_readerable is False
=================================
Number of samples: 49

Average ratio of content score to HTML score: 0.4560205723986458
Average difference between HTML score and content score: 213.3877551020408

Range           Count   Percentage
----------------------------
0.00 - 0.50     24      48.98%
0.50 - 0.75     10      20.41%
0.75 - 1.00     5       10.20%
1.00 - 1.25     6       12.24%
1.25 - 1.50     0       0.00%
1.50 - inf      0       0.00%

The range with the most examples is 0.0 - 0.5

The current average content score across all files is about 0.677, and for files which are not readerable (is_probably_readerable=False) it is 0.456. I assume the content score could be increased further with more cleaning of the HTML. I will have to investigate the cases with a low content score and those with a score above 1.0 to see if they are actually bad.
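Range tables like the ones above could be produced along these lines. This is a sketch, not the code from scoring_small.py: the function names are hypothetical, and treating each range as half-open (lower bound included, upper bound excluded) is an assumption, so it may not reproduce the exact counts shown.

```python
from statistics import mean

# Same range boundaries as in the tables above.
BINS = [
    (0.0, 0.5),
    (0.5, 0.75),
    (0.75, 1.0),
    (1.0, 1.25),
    (1.25, 1.5),
    (1.5, float("inf")),
]


def bucket_counts(scores):
    """Count how many scores fall into each half-open range [lo, hi)."""
    counts = [0] * len(BINS)
    for score in scores:
        for index, (lo, hi) in enumerate(BINS):
            if lo <= score < hi:
                counts[index] += 1
                break
    return counts


def summarize(scores):
    """Return the average score and (label, count, percentage) rows."""
    total = len(scores)
    rows = []
    for (lo, hi), count in zip(BINS, bucket_counts(scores)):
        percentage = 100 * count / total if total else 0.0
        rows.append((f"{lo:.2f} - {hi:.2f}", count, percentage))
    average = mean(scores) if scores else 0.0
    return average, rows
```

Calling `summarize` once on all scores and once on the subset with is_probably_readerable=False gives the two views compared in this comment.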

Do you have any remarks or ideas for improvements? Of course, these are all assumptions, so please don't hesitate to point out flaws in my logic or implementation.

zirkelc · 2024-05-03T13:50:05Z

In case you are interested, here are pages with is_probably_readerable=False:

[
    {
        "file": "die-partei.net.luebeck.html",
        "score": 0.7719298245614035,
        "html": 57,
        "trafilatura": 44,
    },
    {
        "file": "schleifen.ucoz.de.briefe.html",
        "score": 1.052325581395349,
        "html": 172,
        "trafilatura": 181,
    },
    {
        "file": "love-hina.ch.0409.html",
        "score": 0.3643410852713178,
        "html": 129,
        "trafilatura": 47,
    },
    {
        "file": "wehranlage-horka.de.887.html",
        "score": 0.5235602094240838,
        "html": 191,
        "trafilatura": 100,
    },
    {
        "file": "nextkabinett.wordpress.com.garden.html",
        "score": 0.09857482185273159,
        "html": 842,
        "trafilatura": 83,
    },
    {
        "file": "wiki.piratenpartei.de.stammtisch.html",
        "score": 0.38513513513513514,
        "html": 148,
        "trafilatura": 57,
    },
    {
        "file": "pix-bavaria.de.html",
        "score": 0.7720588235294118,
        "html": 136,
        "trafilatura": 105,
    },
    {
        "file": "lavazza.de.qualita.html",
        "score": 0.08863636363636364,
        "html": 440,
        "trafilatura": 39,
    },
    {
        "file": "gnaur.wordpress.com.moglichkeit.html",
        "score": 0.12643678160919541,
        "html": 174,
        "trafilatura": 22,
    },
    {"file": "seelenradio.de.leo.html", "score": 0.2, "html": 185, "trafilatura": 37},
    {
        "file": "ohneq.de.johannes.html",
        "score": 0.6134453781512605,
        "html": 119,
        "trafilatura": 73,
    },
    {
        "file": "xinhuanet.com.c_1125597921.html",
        "score": 0.4955357142857143,
        "html": 224,
        "trafilatura": 111,
    },
    {
        "file": "banyuetan.org.1000200033136171577956287380194268_1.html",
        "score": 0.4943820224719101,
        "html": 356,
        "trafilatura": 176,
    },
    {
        "file": "baike.baidu.com.tanya.html",
        "score": 0.6725197541703248,
        "html": 1139,
        "trafilatura": 766,
    },
    {
        "file": "scmp.com.playbook.html",
        "score": 0.21359223300970873,
        "html": 103,
        "trafilatura": 22,
    },
    {
        "file": "juliasleseblog.blogspot.com.irland.html",
        "score": 0,
        "html": 0,
        "trafilatura": 0,
    },
    {"file": "cecil.de.lieblingsfarbe.html", "score": 1.0, "html": 1, "trafilatura": 1},
    {"file": "street-one.de.blue.html", "score": 1.0, "html": 1, "trafilatura": 1},
    {
        "file": "it-for-kids.org.variables.html",
        "score": 0.9117647058823529,
        "html": 34,
        "trafilatura": 31,
    },
    {
        "file": "zahlenzauberin.wordpress.com.ferien.html",
        "score": 0.5082872928176796,
        "html": 181,
        "trafilatura": 92,
    },
    {
        "file": "rueda.wikidot.com.enchufla.html",
        "score": 0.6058091286307054,
        "html": 482,
        "trafilatura": 292,
    },
    {"file": "changenow.de.loibl.html", "score": 0, "html": 0, "trafilatura": 0},
    {
        "file": "chip.de.bestcrypt.html",
        "score": 0.4508670520231214,
        "html": 346,
        "trafilatura": 156,
    },
    {
        "file": "faz.net.leone.html",
        "score": 0.028588098016336057,
        "html": 3428,
        "trafilatura": 98,
    },
    {
        "file": "archive.ordnungsrausch.com.orga-life.html",
        "score": 0.1505016722408027,
        "html": 299,
        "trafilatura": 45,
    },
    {
        "file": "weselpower.wordpress.com.monstergesprche.html",
        "score": 0.2,
        "html": 90,
        "trafilatura": 18,
    },
    {
        "file": "0b4609a864eb4fa0bbcb2b395f6be9eb.html",
        "score": 0.18,
        "html": 200,
        "trafilatura": 36,
    },
    {
        "file": "backen.de.maulwurfkuchen.html",
        "score": 0.2687074829931973,
        "html": 294,
        "trafilatura": 79,
    },
    {"file": "thelocal.se.tattooed.html", "score": 1.0, "html": 13, "trafilatura": 13},
    {
        "file": "bundeswehrkarriere.de.Laura.html",
        "score": 0,
        "html": 0,
        "trafilatura": 0,
    },
    {
        "file": "fouryears.eu.interning.html",
        "score": 0.3142857142857143,
        "html": 175,
        "trafilatura": 55,
    },
    {"file": "wevolver.com.vehicle.html", "score": 1.0, "html": 8, "trafilatura": 8},
    {
        "file": "nhk.or.jp.k100.html",
        "score": 0.5454545454545454,
        "html": 33,
        "trafilatura": 18,
    },
    {
        "file": "bettycrocker.com.pineapple.html",
        "score": 0.18652849740932642,
        "html": 386,
        "trafilatura": 72,
    },
    {
        "file": "cybercook.com.br.sequilho.html",
        "score": 0.5740740740740741,
        "html": 162,
        "trafilatura": 93,
    },
    {"file": "workable.com.gousto.html", "score": 0, "html": 0, "trafilatura": 0},
    {
        "file": "journals.univie.ac.at.submissions.html",
        "score": 0.3556701030927835,
        "html": 388,
        "trafilatura": 138,
    },
    {
        "file": "sports.fr.lorient.html",
        "score": 0.9892665474060823,
        "html": 559,
        "trafilatura": 553,
    },
    {
        "file": "_Ziemniaki na szóstej, surówka na dziesiątej_. Jak pomagać, żeby nie zaszkodzić_ [PORADNIK W PIGUŁCE].html",
        "score": 0.053987730061349694,
        "html": 815,
        "trafilatura": 44,
    },
    {"file": "dlg.org-Preis.html", "score": 0.88, "html": 25, "trafilatura": 22},
    {
        "file": "homify.de-Tischdecke.html",
        "score": 0.625,
        "html": 32,
        "trafilatura": 20,
    },
    {
        "file": "outdoor-magazin.com-vanlife.html",
        "score": 0.08672936259143156,
        "html": 957,
        "trafilatura": 83,
    },
    {"file": "camping.info-ligurien.html", "score": 1.0, "html": 17, "trafilatura": 17},
    {
        "file": "dw.com-elephants.html",
        "score": 0.15234375,
        "html": 256,
        "trafilatura": 39,
    },
    {
        "file": "mitundvoneinander.com-Frühling.html",
        "score": 0.20689655172413793,
        "html": 232,
        "trafilatura": 48,
    },
    {
        "file": "nestle-family-com-chicken.html",
        "score": 0.32664756446991405,
        "html": 349,
        "trafilatura": 114,
    },
    {
        "file": "ekiba.de-trauer.html",
        "score": 0.686084142394822,
        "html": 309,
        "trafilatura": 212,
    },
    {
        "file": "eurosport.de-corona.html",
        "score": 0.6981981981981982,
        "html": 444,
        "trafilatura": 310,
    },
    {
        "file": "eurailpress.de-rekordniveau.html",
        "score": 0.4868421052631579,
        "html": 152,
        "trafilatura": 74,
    },
]

codecov · 2024-05-03T14:41:12Z

Codecov Report

Attention: Patch coverage is 0%, with 66 lines in your changes missing coverage. Please review.

Project coverage is 95.97%. Comparing base (3b8f2ee) to head (8e6afae).
Report is 1 commit behind head on master.

Files                               Patch %   Lines
trafilatura/readability_utils.py    0.00%     37 Missing ⚠️
trafilatura/scoring.py              0.00%     29 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #583      +/-   ##
==========================================
- Coverage   97.82%   95.97%   -1.86%     
==========================================
  Files          23       23              
  Lines        3449     3503      +54     
==========================================
- Hits         3374     3362      -12     
- Misses         75      141      +66


adbar · 2024-05-03T14:46:54Z

Hi @zirkelc, thanks for your work, here are a few comments:

- Now I get what the isProbablyReaderable function in the original JavaScript code is about! Thanks for pointing it out.
- You could actually use the heuristics present in trafilatura/readability_lxml.py and ideally integrate the new parts of your code (readability_utils.py) into this file, which you can also amend if necessary.
- Feel free to make another PR just for the integration of the new function is_probably_readerable() with its underlying components; this PR would then focus on calculating scores. Does that make sense to you?
- Side note: The evaluation code is currently being worked on to hopefully make the evaluation functions more modular and less redundant.
- As to the results, there seems to be a shift in the distribution towards 0, so isProbablyReaderable is partly working, but the results are not clear-cut, or at least it would be hard to find a precise threshold here. What do you think?

zirkelc · 2024-05-06T08:01:12Z

I opened a new PR for is_probably_readerable(). As I've described in the other PR, there is a difference in the results for the implementation of is_probably_readerable() with BeautifulSoup vs LXML. I will run the tests again when I have figured out which one is right. Also, I will include the unique words from HTML and Trafilatura and the actual extraction to better compare the results.

adbar · 2024-05-22T15:47:09Z

@zirkelc I'm not sure what to do with this pull request, do you want to keep working on it by leveraging the functionality you just introduced?

zirkelc · 2024-05-23T07:08:25Z

> @zirkelc I'm not sure what to do with this pull request, do you want to keep working on it by leveraging the functionality you just introduced?

I wanted to get back to you on that. I did some more tests and re-implemented this function in JavaScript (because that's the environment I usually work in). The results are mixed: the majority of scores for non-readable pages fall in the range 0-0.5 or are greater than 1.0 (both ranges are bad, so the result is good). However, there are certain cases where the score is in the good range 0.5-1.0 even though the pages are not readable. So I'm not sure how reliable the scores will be. Maybe a combination such as is_readable = 0.5 < score < 1.0 and is_probably_readerable() could be a useful metric to return?
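The combined check proposed here could look like the sketch below. The function name and parameters are hypothetical, and the strict inequalities are taken literally from the expression above, which means a score of exactly 1.0 (extraction identical in unique words to the cleaned HTML) would be rejected.

```python
def is_readable(score, probably_readerable, low=0.5, high=1.0):
    """Combine the content score range with the readerable heuristic.

    score: content score (extracted unique words / HTML unique words)
    probably_readerable: result of is_probably_readerable() for the page
    """
    return bool(probably_readerable) and low < score < high
```

Whether the bounds should be inclusive is an open design choice; several files in the list above have a score of exactly 1.0 with only a handful of words.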

If you think this score would be a useful addition to Trafilatura, I can develop it further and update the evaluation to use the new is_probably_readerable function. Otherwise I'm also okay with closing this PR.

adbar · 2024-05-23T10:52:56Z

I also think this reliability issue would prevent us from directly using such a metric. It's nice to have ported is_probably_readerable() though, and we can come back to it in the future.

I'd be in favor of closing this PR now and focusing on improving/porting further components of readability.js. If you'd like, you can work on a PR, or if it's easier for you in the short term, maybe list striking differences between the current version and the port in a new issue?

zirkelc · 2024-05-23T13:20:05Z

Okay, I agree. Let's close this PR and I will create a new issue to discuss the port.

zirkelc added 2 commits May 3, 2024 13:57

feat: simple content scoring

071ebd5

rename

8e6afae

zirkelc closed this May 23, 2024
