Test lists v1.5 #1720

Open
wants to merge 11 commits into
base: master

Conversation

hellais
Collaborator

@hellais hellais commented May 7, 2024

WIP branch to come up with a nicer data format for the future test lists v2.

@hellais hellais changed the title from "Test lists v2" to "Test lists v1.5" on May 7, 2024
* `category_description` - Description of the category
* `date_added` - ISO timestamp of when it was added
* `source` - string representing the name of the person that added it
* `notes` - a JSON string representing metadata for the URL (see URL Meta below)
Collaborator Author


TODO: add a note about the quoting format and the fact that the JSON format is detected by peeking at the first byte, which should be `{`
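A minimal sketch of the detection described in this comment, not the implementation in this branch. It assumes standard CSV quoting (embedded double quotes doubled, field wrapped in quotes), which the TODO has not yet settled, and it checks the first character rather than the first raw byte. The `parse_notes` helper and the `status` key are hypothetical.

```python
import csv
import io
import json


def parse_notes(notes: str):
    """Return a dict when notes holds a JSON object, else the raw string."""
    # Peek at the first character: JSON metadata is expected to start with "{".
    if notes.startswith("{"):
        return json.loads(notes)
    # Anything else is treated as a legacy free-form note.
    return notes


# Two sample rows: one with JSON in the notes column, one with plain text.
# The JSON field is quoted, and its inner double quotes are doubled (assumed quoting).
sample = (
    "url,notes\n"
    'https://example.org/,"{""status"": ""parked""}"\n'
    "https://example.com/,free-form note\n"
)

rows = list(csv.DictReader(io.StringIO(sample)))
parsed = [parse_notes(row["notes"]) for row in rows]
```

Falling back to the raw string keeps existing free-form notes readable while new entries move to JSON.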

@bact
Contributor

bact commented May 9, 2024

For metadata in notes, please see #1723 for what the Thailand list is trying to put in there for the time being.

@bact
Contributor

bact commented May 13, 2024

Proposed Metadata - for Discussion/Comments

Some are from the discussion at the iMAP/OONI Partner Gathering 2024.

Webpage status

These are characteristics intrinsic to the webpage/URL itself.

  • Page cannot be found (removed by the site owner)
  • Parking
  • Domain no longer registered
  • Date last known updated, as checked by a human

Observation status

Data about the observation activity.

  • Date last checked by a human
    • Can be blank if no such check has been made.
    • Can be the same as or different from "Date last known updated, as checked by a human".
    • Given a fresh URL: if on 2024-05-13 a human looks at a blog and finds that the latest post is from the same day, both "Date last known updated, as checked by a human" and "Date last checked by a human" will be 2024-05-13.
    • Later, on 2024-09-01, a human looks at the same blog and finds no new post. Then "Date last checked by a human" will be 2024-09-01 and "Date last known updated, as checked by a human" will still be 2024-05-13.
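
The two-date example above can be sketched in code. This is only an illustration under assumptions: the class `ObservationDates`, its field names `last_updated` and `last_checked`, and the `record_check` method are hypothetical and not part of any proposed spec.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ObservationDates:
    # Hypothetical field names for the two dates discussed above.
    last_updated: Optional[date] = None  # newest content a human has seen
    last_checked: Optional[date] = None  # last time a human looked at the URL

    def record_check(self, checked_on: date, latest_content: Optional[date] = None) -> None:
        """Record a human check; advance last_updated only if newer content was seen."""
        self.last_checked = checked_on
        if latest_content is not None and (
            self.last_updated is None or latest_content > self.last_updated
        ):
            self.last_updated = latest_content


obs = ObservationDates()
# 2024-05-13: a human finds a post published the same day.
obs.record_check(date(2024, 5, 13), latest_content=date(2024, 5, 13))
first = (obs.last_updated, obs.last_checked)
# 2024-09-01: a human checks again and finds nothing new.
obs.record_check(date(2024, 9, 1))
second = (obs.last_updated, obs.last_checked)
```

Both fields start as `None`, which matches the case where a URL has never been checked by a human.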

Webpage category/Additional information

A category assigned by human judgement, or information that requires knowledge extrinsic to the URL.

  • Remove redundant category_description from the CSV
  • A way to say URL 1 and URL 2 are the same page or related
    • For example, two domains that are run by the same organization.
    • In addition to canonical URLs
  • Probing frequency tier ("importance")

Note: Category now works on at least 3 independent axes/dimensions, which is why the categories overlap a lot:

  • Content (topics): Environment, Human Rights Issues, LGBT, Public Health, Sex Education
  • Container (type of media/technology that holds the content): Hosting and Blogging Platforms, Media sharing, File-sharing, Social Networking
  • Creator (type of organization): Government, Intergovernmental Organizations

But rearranging the categories will break the ability to compare with measurements from projects that use v1.0 of the test list spec.
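
To make the three axes concrete, here is a rough sketch of how one URL could carry a separate label set per axis instead of a single category. The `UrlCategories` class, its field names, and the `axes_used` helper are hypothetical; nothing here is part of the v1.0 spec or of this branch.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class UrlCategories:
    # One label set per axis; all names here are illustrative assumptions.
    content: FrozenSet[str] = frozenset()    # topics
    container: FrozenSet[str] = frozenset()  # media/technology holding the content
    creator: FrozenSet[str] = frozenset()    # type of organization

    def axes_used(self) -> List[str]:
        """Return the names of the axes that have at least one label."""
        axes = [("content", self.content), ("container", self.container), ("creator", self.creator)]
        return [name for name, labels in axes if labels]


# A government health blog: labelled on all three axes without overlap between them.
entry = UrlCategories(
    content=frozenset({"Public Health"}),
    container=frozenset({"Hosting and Blogging Platforms"}),
    creator=frozenset({"Government"}),
)

# A URL known only by the platform it is hosted on.
partial = UrlCategories(container=frozenset({"Social Networking"}))
```

Keeping the axes separate avoids forcing a choice between, say, "Government" and "Public Health", though mapping back to a single v1.0 category would still need a defined rule.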

Use Cases

As discussed, use cases will be very useful for this discussion: they will tell us what kind of metadata to collect/annotate, when, and in which way.


2 participants