Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preprocessing of header blocks #45

Open
jojennin opened this issue Nov 9, 2022 · 3 comments
Open

Preprocessing of header blocks #45

jojennin opened this issue Nov 9, 2022 · 3 comments
Labels

Comments

@jojennin
Copy link

jojennin commented Nov 9, 2022

Thanks sharing and maintaining the repository. I find the code very readable and great for extracting text from HTML.

It seems that the current version (3.0.0) does not process header blocks as is described in the documentation:

The preprocessing looks for short header blocks which precede good blocks and at the same time there is no more than MAX_HEADING_DISTANCE characters between the header block and the good block.

The code for this appears to be implemented on line 314 of core.py and works by searching for good blocks that come after the short header block by looking at the class_type attribute. However, at this point in the code, the class_type of each paragraph has only been initialized and not copied from cf_class, so when searching for a good block within max_heading_distance characters, a good block will never be found, as class_type has not been copied for subsequent paragraphs.

Here is a simple dummy HTML file that I have created to provide an example of how a short block that comes between a short header and a good block will be classified as "bad". However, had the header block been labeled as "neargood" from the preprocessing step, the short block would have been labeled as "good".

Is this preprocessing of the header blocks as described above intended? If so, then I think the documentation should be updated to reflect this change. Otherwise, it seems a simple loop over the paragraphs copying cf_class to class_type (as was done in version 2.2.0) would suffice to fix the issue.

@miso-belica
Copy link
Owner

Hello @jojennin, thanks for your detailed report and kind words. If this worked in v2.2 and not now I had to break something :( I will quickly go through git bisect if I manage to find the commit. Thank you for the HTML to test this.

@miso-belica
Copy link
Owner

I just created a test for version in main branch and it seems it is working. Maybe I don't understand the issue here? Can you please create a test where you can demonstrate that the current version fails?

from unittest.mock import ANY


def test_heading_should_be_processed():
    html = """<html>
      <head>
        <title>Eddie Brunner - Wikipedia</title>
      </head>
      <body>
        <h1>Eddie Brunner</h1>
        <p>
        <b>Eduard "Eddie" Brunner</b> (July 19, 1912 - July 18, 1960, Zurich).
        </p>
        <p>
        Eddie Brunner was a Swiss jazz reedist and bandleader.
        Brunner learned to play clarinet, piano, and tenor and alto saxophone,
        and began playing professionally in the early 1930s with Rene Dumont,
        Jack and Louis de Vries and Marek Weber. In 1936 he moved to Paris and
        recorded under his own name as well as with the Goldene Sieben and Louis Bacon;
        he moved back to Switzerland once the
        <a href="/wiki/World_War_II" title="World War II">war</a>
        had begun. He joined Teddy Stauffer's band, and in 1941 took over leadership of 
        the group until 1947, when it dissolved. He led a new six-piece ensemble in 1948,
        and recorded for radio and television broadcasts in the 1950s.
        </p>
      </body>
    </html>"""
    paragraphs = justext.justext(html, justext.get_stoplist("English"))

    assert [p.text for p in paragraphs] == [
        "Eddie Brunner",
        'Eduard "Eddie" Brunner (July 19, 1912 - July 18, 1960, Zurich).',
        ANY,
    ]

@jojennin
Copy link
Author

jojennin commented Dec 4, 2022

Thank you @miso-belica for your response.

The issue is not in the retrieval of paragraphs, rather in the classification (specifically in the context-aware classification). Because the cf_class attribute is not copied to the class_type attribute of all paragraphs before the start of the context-aware classification, the classification of short headings is not consistent with what is mentioned in the documentation (more details on where this occurs in the code are in my original post). I provide an example below:

import justext
import justext.core as jt

html = """<html>
      <head>
        <title>Eddie Brunner - Wikipedia</title>
      </head>
      <body>
        <h1>Eddie Brunner</h1>
        <p>
        <b>Eduard "Eddie" Brunner</b> (July 19, 1912 - July 18, 1960, Zurich).
        </p>
        <p>
        Eddie Brunner was a Swiss jazz reedist and bandleader.
        Brunner learned to play clarinet, piano, and tenor and alto saxophone,
        and began playing professionally in the early 1930s with Rene Dumont,
        Jack and Louis de Vries and Marek Weber. In 1936 he moved to Paris and
        recorded under his own name as well as with the Goldene Sieben and Louis Bacon;
        he moved back to Switzerland once the
        <a href="/wiki/World_War_II" title="World War II">war</a>
        had begun. He joined Teddy Stauffer's band, and in 1941 took over leadership of
        the group until 1947, when it dissolved. He led a new six-piece ensemble in 1948,
        and recorded for radio and television broadcasts in the 1950s.
        </p>
      </body>
    </html>
"""


def justext_v3(html, stoplist):
  paragraphs = justext.justext(html, stoplist)

  return paragraphs


def justext_v220_copyclass(html, stoplist):
  dom = jt.html_to_dom(html)
  dom = jt.preprocessor(dom)

  paragraphs = jt.ParagraphMaker.make_paragraphs(dom)

  jt.classify_paragraphs(paragraphs, stoplist)

  # Copy the class_type attribute
  # as is done in v2.2.0 (Line 296 of core.py in v2.2.0)
  # Commenting out this loop will result in the same classification
  # as the current version
  for paragraph in paragraphs:
    paragraph.class_type = paragraph.cf_class

  jt.revise_paragraph_classification(paragraphs)

  return paragraphs


def main():

  stoplist = justext.get_stoplist("English")
  paragraphs = justext_v3(html, stoplist)

  paragraphs_v220_copyclass = justext_v220_copyclass(html, stoplist)

  for p3, p2 in zip(paragraphs, paragraphs_v220_copyclass):
    if p3.class_type != p2.class_type:
      print(p3.text, p3.class_type)
      print(p2.text, p2.class_type)


if __name__ == "__main__":
  main()

Running this script results in the following:

Eduard "Eddie" Brunner (July 19, 1912 - July 18, 1960, Zurich). bad
Eduard "Eddie" Brunner (July 19, 1912 - July 18, 1960, Zurich). good

If you comment out the loop that copies the cf_class to class_type attribute, the classification is the same as the current version of jusText.

The difference is due to the classification of short header blocks. If the cf_class is copied to class_type before the context-aware classification, then the first loop in the function revise_paragraph_classification (the header preprocessing loop) changes the attribute class_type of the short header block (<h1>Eddie Brunner</h1>) from short to neargood. Consequently, this changes the classfication of the paragraph <b>Eduard "Eddie" Brunner (July 19, 1912 - July 18, 1960, Zurich).</b> from bad to good which occurs in the loop that processes short paragraphs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants