Support creation of unlimited graph types #268

Open
gothub opened this issue Aug 27, 2020 · 4 comments
Assignees
Labels
metadig All issues related to metadig

Comments

@gothub
Contributor

gothub commented Aug 27, 2020

@mbjones here are proposed changes to the quality engine to support generation/retrieval of any number of assessment graph types for a set of data (portal, member node, all of DataONE).

The quality engine should allow the creation of any number of graphs for a set of metadata. For example, for the metadata associated with a DataONE portal (i.e., the collectionQuery pids), any number of different assessment graphs should be created and made available when this portal is updated. The current list of desired graphs is "monthly", "cumulative", and "check-analysis", but there could be many more.

The current REST endpoints to create and retrieve a graph for a portal are shown here with example curl commands:

curl -X POST 'https://docker-ucsb-4.dataone.org:30443/quality/scores?id=urn:uuid:06a249da-f9c8-4946-8be4-ac80883bbd22&suite=FAIR-suite-0.3.1'

curl -X GET -H "Accept: image/png"  'https://docker-ucsb-4.dataone.org:30443/quality/scores?id=urn:uuid:06a249da-f9c8-4946-8be4-ac80883bbd22&suite=FAIR-suite-0.3.1'

Note that the 'id' currently can either be a portal series id (urn:uuid:*) or a node id (e.g. urn:node:CA_OPC).
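
For clients that would rather not hand-assemble the query string, the two curl calls above could be mirrored roughly as follows. This is only a sketch: the `scores_url` helper is hypothetical, and just the base URL and the `id` and `suite` parameters are taken from the examples above.

```python
from urllib.parse import urlencode

# Base URL copied from the curl examples; a real deployment may differ.
BASE_URL = "https://docker-ucsb-4.dataone.org:30443/quality/scores"


def scores_url(identifier: str, suite: str, base_url: str = BASE_URL) -> str:
    """Build the URL used for both POST (generate) and GET (retrieve).

    `identifier` may be a portal series id (urn:uuid:*) or a node id
    (e.g. urn:node:CA_OPC).
    """
    # safe=":" leaves the colons in urn:... unescaped, as in the curl examples.
    query = urlencode({"id": identifier, "suite": suite}, safe=":")
    return f"{base_url}?{query}"


url = scores_url("urn:uuid:06a249da-f9c8-4946-8be4-ac80883bbd22", "FAIR-suite-0.3.1")
print(url)
```

The same URL would be sent with POST to request generation and with GET plus an `Accept: image/png` header to retrieve the graph.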

The quality engine and API should be extended to support these additional parameters for retrieval:

  • format=
    • for example: '&format=eml' which would filter the input data based on any EML format type
  • graphType=
    • for example: '&graphType=check-analysis'
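
A retrieval query using the two proposed parameters might be assembled as in the sketch below. The `retrieval_query` function and its argument names are hypothetical; only the parameter names `id`, `suite`, `format`, and `graphType` come from this proposal, and neither new parameter exists in the API yet.

```python
from typing import Optional
from urllib.parse import urlencode


def retrieval_query(
    identifier: str,
    suite: str,
    fmt: Optional[str] = None,
    graph_type: Optional[str] = None,
) -> str:
    """Return the query string for a retrieval (GET) request."""
    params = {"id": identifier, "suite": suite}
    # The proposed parameters are optional; omit them when not requested.
    if fmt is not None:
        params["format"] = fmt
    if graph_type is not None:
        params["graphType"] = graph_type
    # safe=":" leaves the colons in urn:... identifiers unescaped.
    return urlencode(params, safe=":")


query = retrieval_query(
    "urn:node:CA_OPC", "FAIR-suite-0.3.1", fmt="eml", graph_type="check-analysis"
)
print(query)
```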

The request to generate data and a graph should not include the type of graph to create, as all known graph types and variations should be created and made available for retrieval, based on only the id and suite.

The scripts that generate each graph type could follow a naming convention, so that the quality engine could automatically run them when they are added to the quality engine.
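
One way the naming convention could work is sketched below. The `graph_<graph-type>.<ext>` pattern and the `discover_graph_scripts` function are invented for illustration; the actual convention is still to be decided.

```python
from pathlib import PurePath
from typing import Dict, Iterable

# Hypothetical convention: a script named graph_<graph-type>.<ext>
# produces the graph type <graph-type>.
PREFIX = "graph_"


def discover_graph_scripts(filenames: Iterable[str]) -> Dict[str, str]:
    """Map each graph type to the script that follows the naming convention."""
    scripts: Dict[str, str] = {}
    for name in filenames:
        stem = PurePath(name).stem  # file name without directory or extension
        if stem.startswith(PREFIX) and len(stem) > len(PREFIX):
            scripts[stem[len(PREFIX):]] = name
    return scripts


found = discover_graph_scripts(
    ["graph_monthly.R", "graph_cumulative.R", "README.md", "graph_check-analysis.R"]
)
print(sorted(found))
```

With a scheme like this, adding a new graph type would only require dropping in a new script; the generation request itself would not need to name it.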

@gothub gothub added this to the 2.4.0 milestone Aug 27, 2020
@gothub gothub self-assigned this Aug 27, 2020
@mbjones
Member

mbjones commented Aug 27, 2020

Thanks @gothub. This is great. I think we could tweak a few details to improve it. Here are a few questions and comments:

  1. why is there a URI for generating a graph? Wouldn't all graphs be generated whenever needed, typically on first creation of a collection, or on update of a suite, or on a timed schedule via a queue? Seems like another process should control queuing up these graph generation jobs, and not a REST URI. I am also sitting here thinking about whether we should generalize it to correspond to a process/script to be run that might do various analytical tasks, and produce some sort of well-defined output like a graph, but not limited to a graph. As I think about this, the pattern converges on Clowder more and more.
  2. the format filter is a bit fuzzy. wouldn't it be best to call it formatId, and be a repeatable list of formats to include (ORed together)? Also, do we really need format at all -- wouldn't it be best to create a collection with the relevant datasets filtered (e.g., by formatId)? Then, we wouldn't have to treat format differently at all.
  3. Do we want the URL to include quality, given our discussion of how that word is loaded? Can we come up with a better service collection name? /assessments/? /runs/? /results/? something else?
  4. The graphType looks good, and will probably work. We could consider renaming it to productType, or even incorporating it directly into the resource URI, which would be the more RESTful way to encode this. It also gets rid of the content negotiation, and makes it much easier for clients to request (it's hard, for example, to set the Accept header in a browser URL bar). For example, an alternative URI form could be:
  • Overall pattern: GET /assessments/{suite}/{identifier}/{productType}
  • GET /assessments/FAIR-suite-0.3.1/urn:uuid:06a249da-f9c8-4946-8be4-ac80883bbd22/cumulative
  • GET /assessments/FAIR-suite-0.3.1/urn:uuid:06a249da-f9c8-4946-8be4-ac80883bbd22/monthly
  • GET /assessments/FAIR-suite-0.3.1/urn:uuid:06a249da-f9c8-4946-8be4-ac80883bbd22/check-analysis
  • GET /assessments/FAIR-suite-0.3.1/urn:uuid:06a249da-f9c8-4946-8be4-ac80883bbd22/score-csv
  • GET /assessments/FAIR-suite-0.3.1/urn:uuid:06a249da-f9c8-4946-8be4-ac80883bbd22/check-csv

That is a more RESTful pattern, where the suite is treated as a collection, the data collection comes second, and the product type is the resource that's available for that collection. It also unifies the graph and csv retrieval to get rid of the difficult Accept header, and opens the door to new product types that are neither graphs nor csv files, like PDFs.
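
As a rough sketch of how a client or router might build and parse this URI form: the function names below are hypothetical, and percent-encoding the segments is an assumption added here so that identifiers containing a slash (DOIs, for instance) cannot be mistaken for extra path segments.

```python
from typing import Tuple
from urllib.parse import quote, unquote


def assessment_path(suite: str, identifier: str, product_type: str) -> str:
    """Build /assessments/{suite}/{identifier}/{productType}."""
    # safe=":" leaves urn:uuid:... readable but still encodes "/" as %2F.
    segments = [quote(s, safe=":") for s in (suite, identifier, product_type)]
    return "/assessments/" + "/".join(segments)


def parse_assessment_path(path: str) -> Tuple[str, str, str]:
    """Split a path built by assessment_path back into its three parts."""
    parts = path.split("/")
    # Expected shape: ["", "assessments", suite, identifier, productType]
    if len(parts) != 5 or parts[0] != "" or parts[1] != "assessments":
        raise ValueError(f"not an assessment path: {path!r}")
    suite, identifier, product_type = (unquote(p) for p in parts[2:])
    return suite, identifier, product_type


path = assessment_path(
    "FAIR-suite-0.3.1", "urn:uuid:06a249da-f9c8-4946-8be4-ac80883bbd22", "cumulative"
)
print(path)
```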

I'd like to get feedback from @csjx and others on this as well. Let's discuss.

@gothub
Contributor Author

gothub commented Aug 28, 2020

@mbjones thx for the review - here are some thoughts on the points you raised:

  1. Using a REST endpoint to generate an assessment (graph, data) retains the possibility of having DataONE MNs/CN queue requests (via metacat) when metadata or portal documents are created/updated, which was part of the original design. I've diverged from this design a bit, as all requests (for generation or retrieval) are routed through metadig-controller.
    However, currently, the metadig-scheduler container is the only entity sending generation requests. These requests are based on the harvesting tasks, which watch for new/updated metadata and portal documents.

  2. Regarding the format filter, it seemed a bit cumbersome to have to specify every formatId for a desired graph. For example, for EML it would be '&formatId=eml://ecoinformatics.org/eml-2.0.0&formatId=eml://ecoinformatics.org/eml-2.0.1...' vs. '&format=eml' for the entire EML format family. If a requirement is to be able to retrieve different graphs based on filtering of formatId, then there needs to be a way for a client to specify what filter was applied, e.g. "give me all assessments for ISO metadata".
    Regarding the generation request, the engine could create graphs for all pre-defined filters, i.e. one graph for only EML content, one for ISO, one for no filters applied, so no filter would need to be specified for the request.
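
The format family idea could be sketched as a prefix match on formatId. The `FORMAT_FAMILIES` table and both functions are hypothetical; the only prefix shown is the EML one, taken from the formatIds quoted above, and entries for ISO or other families would need to be added from the actual format list.

```python
from typing import List, Optional

# Hypothetical table: family name -> formatId prefixes belonging to it.
FORMAT_FAMILIES = {
    "eml": ("eml://ecoinformatics.org/eml-",),
}


def in_family(format_id: str, family: Optional[str]) -> bool:
    """True if format_id belongs to the family; None means "no filter"."""
    if family is None:
        return True
    # str.startswith accepts a tuple of candidate prefixes.
    return format_id.startswith(FORMAT_FAMILIES[family])


def predefined_filters() -> List[Optional[str]]:
    """Filters the engine would generate graphs for: unfiltered plus each family."""
    return [None] + sorted(FORMAT_FAMILIES)


print(in_family("eml://ecoinformatics.org/eml-2.0.1", "eml"))
print(predefined_filters())
```

A generation request would then loop over `predefined_filters()` and produce one graph per entry, so the request itself would carry no filter.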

@mbjones
Member

mbjones commented Aug 28, 2020

Thanks @gothub let's discuss next week with @csjx

@mbjones mbjones added the metadig All issues related to metadig label Apr 29, 2021
@gothub
Contributor Author

gothub commented May 18, 2021

@mbjones @csjx when would be a good time to discuss/enumerate the range of products that need to be generated and retrievable, and how that is represented in the API?

The current potential list of product types:

  • graphs
    • scores as cumulative average
    • scores aggregated by month
    • check-analysis: all checks for a suite, with each check failure/success percent, grouped by category (i.e. F,A,I,R)
    • check-analysis: optional/required checks summarized by category
    • other graphs TBD
  • data files (CSV)
    • scores as cumulative average
    • scores aggregated by month
    • check-analysis:
      • each line contains
        • check_id, check_name, check_type, check_level, status, data_source, pid, obsoletes, obsoleted_by, sequence_id
        • e.g.: "resource.creatorIdentifier.present.1", "Resource Creator Identifier Present", "Findable", "REQUIRED", "FAILURE", "urn:node:ARCTIC", "doi:10.18739/A2RB6W25S", NA , "urn:uuid:8cdb22c6-cb33-4553-93ca-acb6f5d53ee4", "urn:uuid:8cdb22c6-cb33-4553-93ca-acb6f5d53ee4"
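
A consumer of the check-analysis CSV might read each line as sketched below. The column order follows the field list above; treating an unquoted `NA` as a missing value is an assumption, and `parse_check_rows` is a hypothetical helper.

```python
import csv
import io
from typing import Dict, List, Optional

FIELDS = [
    "check_id", "check_name", "check_type", "check_level", "status",
    "data_source", "pid", "obsoletes", "obsoleted_by", "sequence_id",
]


def parse_check_rows(text: str) -> List[Dict[str, Optional[str]]]:
    """Parse header-less check-analysis lines into dictionaries."""
    # skipinitialspace lets quoted fields follow ", " as in the example line.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for values in reader:
        if not values:
            continue  # skip blank lines
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
        row: Dict[str, Optional[str]] = {}
        for field, value in zip(FIELDS, values):
            value = value.strip()
            row[field] = None if value == "NA" else value  # assumption: NA = missing
        rows.append(row)
    return rows


sample = (
    '"resource.creatorIdentifier.present.1", "Resource Creator Identifier Present", '
    '"Findable", "REQUIRED", "FAILURE", "urn:node:ARCTIC", "doi:10.18739/A2RB6W25S", '
    'NA , "urn:uuid:8cdb22c6-cb33-4553-93ca-acb6f5d53ee4", '
    '"urn:uuid:8cdb22c6-cb33-4553-93ca-acb6f5d53ee4"'
)
rows = parse_check_rows(sample)
print(rows[0]["check_id"], rows[0]["status"])
```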

Each of these products can be generated for or filtered by the following:
  • for all of DataONE, or an MN, or a collection (portal)
  • graph including one or multiple metadata formats (EML, ISO, DataCite, schema.org)
  • for a specified assessment suite

@gothub gothub modified the milestones: 2.4.0, 3.0 May 26, 2021
@jeanetteclark jeanetteclark removed this from the 3.0 milestone Jul 7, 2023

3 participants