Ruleset testing #49

deoxykev · 2023-11-26T22:48:49Z

As per everywall/ladder-rules#3, we'll need to implement some robust testing.

The main challenge is to test after client-side JS rendering happens, which will probably mean we'll need a headless browser.

A test could look like this: https://github.com/everywall/ladder/blob/ladder_tests/tests/tests/www-wellandtribune-ca.spec.ts

And the results like this.

Perhaps we'll need some codegen in order to go from ruleset to test?

ladddder · 2023-11-27T21:25:16Z

Yes, this would be cool.

joncrangle · 2023-11-28T04:08:33Z

This seems like a very cool idea.

I played around with your example a bit, and I think we may be able to leverage Github actions to run a shell script whenever a ruleset yaml is uploaded or changed to generate and run a test. From your example, I changed await expect(page.getByText(paywallText)).toBeVisible(); to await expect(page.getByText(paywallText)).not.toBeVisible(); for the test with ladder so both tests pass and could be used for CI.

I used the following bash script to generate a Playwright test.

./generate_test.sh -i rulesets/ca/_multi-metroland-media-group.yaml > tests/_multi-metroland-media-group.spec.ts

generate_test.sh

#!/bin/bash

# Command-line argument parsing
while getopts "i:" opt; do
  case $opt in
    i)
      input_file=$OPTARG
      ;;
    \?)
      echo "Invalid option: -$OPTARG" >&2
      exit 1
      ;;
    :)
      echo "Option -$OPTARG requires an argument." >&2
      exit 1
      ;;
  esac
done

# Check if the input file is provided
if [ -z "$input_file" ]; then
  echo "Usage: $0 -i <input_yaml_file>"
  exit 1
fi

# Extract information from the "tests" section
url=$(awk '/- url:/ {sub(/- url: /, ""); sub(/^[[:space:]]*/, ""); print}' "$input_file")
domain=$(echo "$url" | awk -F/ '{print $3}')
test=$(awk '/test:/ {sub(/test: /, ""); sub(/^[[:space:]]*/, ""); print}' "$input_file")

# Generate Playwright test script
echo "import { expect, test } from '@playwright/test';"
echo
echo "test('$domain has paywall by default', async ({ page }) => {"
echo "  await page.goto('$url');"
echo "  await expect($test).toBeVisible();"
echo "});"
echo
echo "test('$domain + Ladder does not have paywall', async ({ page }) => {"
echo "  await page.goto('http://localhost:8080/$url');"
echo "  await page.waitForLoadState();"
echo "  await expect($test).not.toBeVisible();"
echo "});"

In the ruleset yaml, I put a Playwright locator in the test portion:

tests:
  - url: https://www.wellandtribune.ca/news/niagara-region/niagara-transit-commission-rejects-council-request-to-reduce-its-budget-increase/article_e9fb424c-8df5-58ae-a6c3-3648e2a9df66.html
    test: page.getByText("This article is exclusive to subscribers.")

At the moment, the bash script is pretty limited to just checking if a specified element is or is not visible. If we continue this way, we may want the script to be a bit more general so it can capture other scenarios. This may require anyone contributing a rule to be a bit more explicit in their ruleset tests section, so rather than contribute a Playwright locator they may need to provide the expectation with both a locator and an assertion:

    test: expect(page.getByText("This article is exclusive to subscribers.")).toBeVisible()

Some additional parsing in the bash script could insert a .not before the assertion for the ladder test.

deoxykev · 2023-11-28T13:59:17Z

Nice work!

I've been thinking about how to generate rules for any site, in an automated fashion. One of the main roadblocks is figuring out whether or not a site is paywalled, and to generate a test for it. I wonder if it's as simple as extracting visible text from a page, and asking an LLM whether or not it is paywall text is sufficient.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ruleset testing #49

Ruleset testing #49

deoxykev commented Nov 26, 2023

ladddder commented Nov 27, 2023

joncrangle commented Nov 28, 2023

deoxykev commented Nov 28, 2023

Ruleset testing #49

Ruleset testing #49

Comments

deoxykev commented Nov 26, 2023

ladddder commented Nov 27, 2023

joncrangle commented Nov 28, 2023

deoxykev commented Nov 28, 2023