Crux replaces page title with site title. #25

ciferkey · 2022-08-01T21:39:34Z

I've been running crux over several sites and noticed the following bug.

Problem

Here is an example URL that displays the problem: https://www.bbc.com/news/world-europe-61691816

Here is a test, based on the README example, that reproduces the problem:

  @Test
  fun broken() {
    val crux = Crux()

    val httpUrl = "https://www.bbc.com/news/world-europe-61691816".toHttpUrl()

    val document = Jsoup.connect(httpUrl.toString()).get()

    val resource = runBlocking {
      crux.extractFrom(httpUrl, document)
    }

    assertEquals("Ukraine anger as Macron says 'Don't humiliate Russia'", resource.fields[Fields.TITLE])
  }

The sequence of events is:

1. HtmlMetadataExtractor correctly extracts the title "Ukraine anger as Macron says 'Don't humiliate Russia' - BBC News".
2. WebAppManifestParser extracts the title "BBC".
3. The fold operation in Crux.extractFrom uses Resource.plus to merge the resources, overwriting the title with "BBC".

crux/src/main/kotlin/com/chimbori/crux/api/Resource.kt, line 51 in 3b4586c:

  fields = if (anotherResource?.fields == null) fields else fields + anotherResource.fields,
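
To illustrate why the later plugin wins: Kotlin's `+` on maps keeps the right-hand value whenever a key appears in both operands. The sketch below uses plain string-keyed maps as a stand-in for Resource.fields. The `merge` and `extractAll` helpers and the "title" key are hypothetical names for illustration only, not Crux's API.

```kotlin
// Hypothetical stand-in for the quoted line in Resource.plus:
// on a key collision, the right-hand map's value replaces the left-hand one.
fun merge(fields: Map<String, String>, other: Map<String, String>?): Map<String, String> =
  if (other == null) fields else fields + other

// Folds the plugin outputs in list order, mirroring the fold described above.
fun extractAll(pluginOutputs: List<Map<String, String>?>): Map<String, String> =
  pluginOutputs.fold(emptyMap<String, String>()) { acc, next -> merge(acc, next) }

fun main() {
  val htmlMetadata = mapOf("title" to "Ukraine anger as Macron says 'Don't humiliate Russia' - BBC News")
  val manifest = mapOf("title" to "BBC")

  // Manifest output merged last: its title overwrites the page title.
  println(extractAll(listOf(htmlMetadata, manifest))["title"]) // prints BBC

  // HTML metadata merged last: the page title survives.
  println(extractAll(listOf(manifest, htmlMetadata))["title"])
}
```

In this simplified model, whichever plugin runs last decides the value of any shared key, so the outcome depends entirely on plugin order.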

Possible solutions

If you update Crux.createDefaultPlugins to place WebAppManifestParser before HtmlMetadataExtractor like this:

public fun createDefaultPlugins(okHttpClient: OkHttpClient): List<Plugin> = listOf(
  // Static redirectors go first, to avoid getting stuck into CAPTCHAs.
  GoogleUrlRewriter(),
  FacebookUrlRewriter(),
  // Remove any tracking parameters remaining.
  TrackingParameterRemover(),
  // Prefer canonical URLs over AMP URLs.
  AmpRedirector(refetchContentFromCanonicalUrl = true, okHttpClient),
  // Fetches and parses the Web Manifest. May replace existing favicon URL with one from the manifest.json.
  WebAppManifestParser(okHttpClient),
  // Parses many standard HTML metadata attributes.
  HtmlMetadataExtractor(okHttpClient),
  // Extracts the best possible favicon from all the markup available on the page itself.
  FaviconExtractor(),
  // Parses the content of the page to remove ads, navigation, and all the other fluff.
  ArticleExtractor(okHttpClient),
)

It will produce the correct results.

This is the simplest fix. Is there a specific reason to run WebAppManifestParser after HtmlMetadataExtractor, or can the two be reordered?

If that is not possible then we might need to consider a new way to handle merging the fields.
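
One shape such a merge could take is "first writer wins": keys that already exist are kept, and only new keys are added. This is only a sketch of the idea, not something proposed in the thread. `mergeKeepingExisting` and the field names are hypothetical, and plain maps again stand in for Resource.fields.

```kotlin
// Hypothetical alternative merge: existing keys are kept, new keys are added.
// Putting `fields` on the right-hand side of `+` makes its values win on collisions.
fun mergeKeepingExisting(fields: Map<String, String>, other: Map<String, String>?): Map<String, String> =
  if (other == null) fields else other + fields

fun main() {
  val fromHtml = mapOf("title" to "Ukraine anger as Macron says 'Don't humiliate Russia' - BBC News")
  // The icon value is a made-up placeholder for illustration.
  val fromManifest = mapOf("title" to "BBC", "icon" to "placeholder-icon")

  val merged = mergeKeepingExisting(fromHtml, fromManifest)
  println(merged["title"]) // the page title is kept
  println(merged["icon"])  // the new key from the manifest is still added
}
```

The trade-off is that a later plugin could then never replace an earlier value, which would conflict with the intent noted in the plugin list that WebAppManifestParser may replace an existing favicon URL. A real fix along these lines would probably need per-field rules rather than a single global policy.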


chimbori · 2022-08-01T21:54:49Z

That sounds perfect: solving via reordering the plugins is the best solution.

I didn't envision this exact scenario when writing it, so this is a good bug report.

ciferkey mentioned this issue Aug 2, 2022

Reordered default plugins so HtmlMetadataExtractor overrides WebAppManifestParser. #26

Merged
