Generate static site content from W3ACT data #2

Open
7 tasks
anjackson opened this issue Aug 20, 2019 · 2 comments


anjackson commented Aug 20, 2019

To keep things more clearly modular, I am moving the prototype code that generates a Hugo static site from the W3ACT data out of ukwa-manage and into this repository. The current implementation has been ported over (still to be checked in), but there are also some gaps:

  • Collections and sub-collections: Only top-level collections are marked as published or not. Collections really need separating so that top-level and sub-collections are handled properly.
  • Similarly, Targets need separating into Archived Web Sites and Archived Web Pages, i.e. when a Target is a specific resource within another Target, it should be 'demoted' to being an Archived Web Page that belongs to a Collection.
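
As a rough illustration of the site/page split described above, here is a minimal sketch. It assumes that "a specific resource within another Target" can be approximated as "a non-root URL on a host that also has a host-level Target". The function names and that rule are assumptions for illustration, not the existing implementation.

```python
from urllib.parse import urlsplit


def is_host_level(url):
    """True when the URL points at the root of a host (no path, query or fragment)."""
    parts = urlsplit(url)
    return parts.path in ("", "/") and not parts.query and not parts.fragment


def split_targets(urls):
    """Partition Target URLs into (sites, pages).

    Assumed rule: a URL is demoted to a page only if it is not host-level
    AND another Target covers the root of the same host. Everything else
    stays a site.
    """
    hosts_with_site = {urlsplit(u).netloc.lower() for u in urls if is_host_level(u)}
    sites, pages = [], []
    for u in urls:
        host = urlsplit(u).netloc.lower()
        if not is_host_level(u) and host in hosts_with_site:
            pages.append(u)
        else:
            sites.append(u)
    return sites, pages
```

A real implementation would probably also need to handle Targets nested under a path prefix rather than only under a host root.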

The infrastructure for cross-referencing these things also needs consideration.

Ideally, this process should facilitate moving to mastering the data in GitHub/NetlifyCMS.

  • Support wct_at_oid for Collections and wct_id for Targets, as alias at least.
  • Support Subjects as a taxonomy.
  • Separate sites from pages.
  • Decide manageable file layout.
  • Add Nominating Organisation as a top-level entity, listing Web Sites (and Collections) for that organisation? Possibly even adding users?
  • Add full hierarchy to Collection names, as currently e.g. 'Interest Groups' show up multiple times.
  • Consider adding things like 'Interest Groups' as a distinct taxonomy rather than sub-collections?
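
For the "full hierarchy" item, a minimal sketch of how qualified Collection names could be derived, assuming each Collection record carries a name and an optional parent ID. The input shape, the " / " separator and the function name are all assumptions made for this example.

```python
def full_collection_names(collections):
    """Map each collection id to a 'Parent / Child' style name.

    `collections` maps id -> (name, parent_id or None); this shape is
    assumed for illustration and is not the W3ACT export format.
    """
    def path(cid, seen=()):
        if cid in seen:
            raise ValueError(f"cycle in collection hierarchy at {cid!r}")
        name, parent = collections[cid]
        if parent is None:
            return [name]
        return path(parent, seen + (cid,)) + [name]

    return {cid: " / ".join(path(cid)) for cid in collections}
```

With this, two sub-collections that are both called 'Interest Groups' would be listed under different qualified names, one per parent.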
@anjackson

For cross-references, Hugo's Related Content functionality is sophisticated enough to let us perform these lookups. The IDs have to be coerced to strings, but then it works fine. The main 'gotcha' was that the default threshold (80) was too high, so even exact matches didn't get picked out. It's not clear how the scoring works! The downside is that we have to use the same threshold for everything, so if we tried to use the Related Content feature for anything other than direct references, we might get too many (poor) matches that would need to be cut down.
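
To illustrate the string coercion mentioned above, here is a minimal sketch of a generator step that writes JSON front matter (one of the front matter formats Hugo accepts) with every ID converted to a string. The field names `id` and `target_ids` are hypothetical, chosen only for this example.

```python
import json


def front_matter(title, record_id, related_ids):
    """Render JSON front matter with every ID coerced to a string."""
    data = {
        "title": title,
        "id": str(record_id),  # hypothetical field name
        "target_ids": [str(i) for i in related_ids],  # hypothetical field name
    }
    return json.dumps(data, indent=2) + "\n"
```

Whichever fields hold the IDs would then need to be listed as indices in the site's Related Content configuration, alongside the lowered threshold.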

@anjackson

Well, b40406f implements the basic proof-of-concept for ukwa-site.

Not clear how best to handle identifiers. We have a LOT of records (19,516 pages, mostly host-level Targets) and looking records up is easier if the main ID is also the filename (we can support things like WCT-IDs via aliases).

Totally opaque filenames are very cumbersome to work with manually, but semantic names can be brittle over time. However, the primary URL for a record should be stable in general, so we could arrange the web site records by e.g. host (no www) or domain and creation date:

targets/gov.uk-2019-04-12.md

or perhaps just host and a version number (if needed):

targets/gov.uk-1.md

Using the host like this would help users find records. I am planning to collapse Target records down to hosts, so there would normally be just one file per host. Individual highlighted URLs within a Collection would be handled separately (although this also needs some more thought).
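
The host-based naming could be sketched as follows. This is only an illustration of the scheme proposed above: the function name, and the choice to pass either a date or a version number as a single `suffix` argument, are assumptions.

```python
from urllib.parse import urlsplit


def target_filename(url, suffix=None):
    """Build a record path such as 'targets/gov.uk-1.md' from a Target URL.

    `suffix` may be a version number or a creation date string; when it
    is None, the file is named after the host alone.
    """
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]  # drop the leading 'www.'
    if not host:
        raise ValueError(f"no host found in {url!r}")
    tail = f"-{suffix}" if suffix is not None else ""
    return f"targets/{host}{tail}.md"
```

For example, `target_filename("https://www.gov.uk/", 1)` gives `targets/gov.uk-1.md`, and passing `"2019-04-12"` as the suffix gives the dated form.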

Note also that NetlifyCMS currently doesn't support content in sub-folders, although it may do soon. Keeping everything in one folder is unlikely to perform well, but may be acceptable.
