Generate static site content from W3ACT data #2

anjackson · 2019-08-20T14:49:52Z

To keep things more clearly modular, I am moving the prototype code that generates a Hugo static site from the W3ACT data out of ukwa-manage and into here. The current implementation has been ported over (to be checked in), but there are also some gaps.

Collections and sub-collections: Only top-level collections are marked as published or not. Collections really need separating so that top-level and sub-collections are handled properly.
Similarly, Targets need separating into Archived Web Sites and Archived Web Pages, i.e. when a Target is a specific resource within another Target, it should be 'demoted' to being an Archived Web Page that belongs to a Collection.
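
As a rough illustration of the site/page split, here is a minimal Python sketch. It assumes (this is not from W3ACT itself) that a Target counts as "within another Target" when its URL has a path and some other Target covers the same host at the root. The function names are hypothetical, and Targets with a path but no host-level parent are simply left as sites.

```python
from urllib.parse import urlsplit


def normalise_host(url):
    """Return the host of a URL with any leading 'www.' removed."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_host_level(url):
    """True when the URL points at the root of a host (no path or query)."""
    parts = urlsplit(url)
    return parts.path in ("", "/") and not parts.query


def split_targets(urls):
    """Split Target URLs into (sites, pages) using the host heuristic above."""
    site_hosts = {normalise_host(u) for u in urls if is_host_level(u)}
    sites, pages = [], []
    for u in urls:
        if is_host_level(u):
            sites.append(u)
        elif normalise_host(u) in site_hosts:
            # A deeper URL on a host that already has a host-level Target.
            pages.append(u)
        else:
            # No parent Target found: keep it as a site for now.
            sites.append(u)
    return sites, pages
```

A real implementation would probably also need to compare path prefixes, since a Target can sit inside another Target that is not host-level.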

The infrastructure for cross-referencing these things also needs consideration.

Ideally, this process should facilitate moving to mastering the data in GitHub/NetlifyCMS.

Support wct_at_oid for Collections and wct_id for Targets, as aliases at least.
Support Subjects as a taxonomy.
Separate sites from pages.
Decide manageable file layout.
Add Nominating Organisation as a top-level entity, listing Web Sites (and Collections) for that organisation? Possibly even adding users?
Add full hierarchy to Collection names, as currently e.g. 'Interest Groups' show up multiple times.
Consider adding things like 'Interest Groups' as a distinct taxonomy rather than sub-collections?
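
For the "full hierarchy" item above, one possible approach is to walk each collection's parent links and join the names. This is only a sketch: the `(name, parent_id)` layout is an assumed shape for illustration, not the actual W3ACT export format.

```python
def full_collection_names(collections):
    """Map each collection ID to its full hierarchical name.

    `collections` is a dict of id -> (name, parent_id or None); this
    structure is hypothetical and stands in for the real W3ACT data.
    """
    def path(cid, seen=()):
        name, parent = collections[cid]
        # Stop at the top level, at a missing parent, or if a cycle is found.
        if parent is None or parent not in collections or parent in seen:
            return [name]
        return path(parent, seen + (cid,)) + [name]

    return {cid: " / ".join(path(cid)) for cid in collections}
```

With this, two different 'Interest Groups' sub-collections would be shown as e.g. 'Politics / Interest Groups' and 'Health / Interest Groups' (the parent names here are made up).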

anjackson · 2019-08-21T11:05:55Z

For cross-references, Hugo provides quite sophisticated Related Content functionality that allows us to perform these lookups. The IDs have to be coerced to strings, but it works fine. The main 'gotcha' was that the default threshold (80) was too high, so even exact matches didn't get picked out. It's not clear how the scoring works! The downside is that we have to use the same threshold for everything, so if we tried to use the Related Content feature for anything other than direct references, we might get too many (poor) matches that would need to be cut down.
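
To show what the string coercion might look like on the generating side, here is a small sketch that writes JSON front matter (one of the front matter formats Hugo accepts) with all IDs converted to strings. The key names `id` and `collections` are assumptions for illustration; whatever keys are used would have to match the indices configured for Related Content.

```python
import json


def front_matter(title, record_id, collection_ids):
    """Render JSON front matter with every ID coerced to a string."""
    data = {
        "title": title,
        # Numeric IDs are written as strings so they can be matched
        # as keywords by the Related Content lookup.
        "id": str(record_id),
        "collections": [str(c) for c in collection_ids],
    }
    return json.dumps(data, indent=2) + "\n"
```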

anjackson · 2019-08-21T13:37:37Z

Well, b40406f implements the basic proof-of-concept for ukwa-site.

Not clear how best to handle identifiers. We have a LOT of records (19,516 pages, mostly host-level Targets) and looking records up is easier if the main ID is also the filename (we can support things like WCT-IDs via aliases).

Totally opaque filenames are very cumbersome to work with manually, but semantic names can be brittle over time. However, the primary URL for a record should be stable in general, so we could arrange the web site records by e.g. host (no www) or domain and creation date:

targets/gov.uk-2019-04-12.md

or perhaps just host and a version number (if needed):

targets/gov.uk-1.md

Using the host like this would help users find records. I am planning to collapse Target records down to hosts, so there would normally be just one file per host. Individual highlighted URLs within a Collection would be handled separately (although this also needs some more thought).
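
A minimal sketch of the host-based naming and the collapse to one file per host, using only the Python standard library. The function names are hypothetical, and the suffix (creation date or version number) is passed in by the caller because that choice is still open.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_key(url):
    """Host of the URL, lower-cased by urlsplit, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def target_filename(url, suffix):
    """Build a record filename from the host plus a date or version suffix."""
    return f"targets/{host_key(url)}-{suffix}.md"


def collapse_by_host(urls):
    """Group Target URLs under the single file that would hold their host."""
    grouped = defaultdict(list)
    for url in urls:
        # Version 1 is assumed here purely for illustration.
        grouped[target_filename(url, 1)].append(url)
    return dict(grouped)
```

For example, `target_filename("https://www.gov.uk/", "2019-04-12")` gives `targets/gov.uk-2019-04-12.md`, matching the layout suggested above.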

Note that NetlifyCMS also currently doesn't support content in sub-folders, but may do soon. Keeping everything in one folder is likely not very performant, but may be acceptable.