Validating prepublished dataset files #717

Open
2 tasks
langphil opened this issue Feb 16, 2018 · 3 comments

Comments

@langphil
Contributor

langphil commented Feb 16, 2018

Octopub currently provides CSVLint with a URL query string to validate a CSV and return a status to Octopub.

<td><a href="https://csvlint.io/?uri=<%= @dataset.gh_pages_url %>/data/<%= file.filename %>"><img src="https://csvlint.io/?uri=<%= @dataset.gh_pages_url %>/data/<%= file.filename %>&format=svg" alt="CSVlint validation result" /></a></td>

This query string points at the GitHub repository that was created during Octopub's publishing process. As part of the current development, we should instead be providing the S3 upload URL in the query string.
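
As a rough illustration of how the link and badge URLs in the template above are put together, here is a minimal sketch. The helper name `csvlint_urls` and the example source URL are hypothetical and not part of Octopub. Unlike the template, the sketch percent-encodes the source URL, which is an assumption about what is safe when file names contain spaces or other reserved characters.

```ruby
require "uri"

# Base address taken from the template above.
CSVLINT_ENDPOINT = "https://csvlint.io/"

# Hypothetical helper (not Octopub code): returns the validation page link
# and the SVG badge URL for a CSV hosted at source_url.
def csvlint_urls(source_url)
  # encode_www_form percent-encodes the value so it survives as one parameter.
  link_query  = URI.encode_www_form(uri: source_url)
  badge_query = URI.encode_www_form(uri: source_url, format: "svg")
  {
    link:  "#{CSVLINT_ENDPOINT}?#{link_query}",
    badge: "#{CSVLINT_ENDPOINT}?#{badge_query}"
  }
end

# Example source URL is made up for illustration.
urls = csvlint_urls("https://example.github.io/my-dataset/data/my file.csv")
puts urls[:link]
puts urls[:badge]
```

Whatever URL is passed as `uri` is handed to csvlint.io in full, so the same construction applies to a GitHub Pages URL or any other source.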

The issue is this: when a CSV is provided to CSVLint, it not only validates the file but also publishes it, making it available for download. This is outside the scope of the current Octopub development.

Without private validation of CSV files, the vision for Octopub as a prepublishing tool is lost.

Goals

  • Provide the S3 file to CSVLint as a query string
  • Stop prepublished files being published by CSVLint
@olivierthereaux
Contributor

I agree that the way octopub currently calls csvlint.io is not acceptable given the shift to a pre-publishing workflow.

Passing the S3 URI to csvlint.io is not acceptable either. Regardless of whatever security policy we use in S3, I do not think it would be OK to expose the security-through-obscurity URI of an as-yet unpublished resource, as csvlint would automatically publicise the URI on its "recent validations" page.

Think we've got 4 options:

  1. Kill off the "recent validations" page on csvlint.io. It's not helpful at all, and causes us recurring grief. Pros: relatively easy to do, and kills two birds with one stone. Cons: I would worry that this may still expose the S3 uri beyond what is reasonable.
  2. Spin off a separate instance of csvlint.io purely for use with octopub. Pros: fairly trivial to do. Cons: more maintenance and hosting cost, plus see above on point 1)
  3. Stop using csvlint.io and use the csvlint.rb library internally instead. Pros: would be way more secure, and probably much more efficient. Cons: would require more significant development, and may also make it harder in the long run to integrate with lintol rather than csvlint, if/when we decide to switch.
  4. Keep using csvlint.io, but instead of href-ing to it, POST the actual file payload to it. csvlint.io accepts either a GET with the URI as a query string or a POST with a file upload, and it does not list resources POSTed to it on its recent validations page. Pros: much safer, and does not preclude switching to a similar access point in Lintol in the future. Cons: similar to the solution of using csvlint.rb above.
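
For option 4, the POST could look roughly like the sketch below, which builds a multipart upload with Ruby's standard `Net::HTTP`. The endpoint `https://csvlint.example/upload`, the form field name `file`, and the helper `build_validation_request` are all placeholders; the real csvlint.io upload path and field name would need to be checked before using this. The request is only built here, not sent.

```ruby
require "net/http"
require "uri"
require "stringio"

# Hypothetical helper: builds a multipart/form-data POST carrying the CSV
# itself, so that no URI for the unpublished file is handed to the validator.
def build_validation_request(endpoint, filename, csv_io)
  request = Net::HTTP::Post.new(URI(endpoint))
  # "file" is an assumed field name, not confirmed against csvlint.io.
  form = [["file", csv_io, { filename: filename, content_type: "text/csv" }]]
  request.set_form(form, "multipart/form-data")
  request
end

csv = StringIO.new("id,name\n1,example\n")
# Placeholder endpoint, not the real csvlint.io upload address.
request = build_validation_request("https://csvlint.example/upload", "data.csv", csv)
puts request.method
puts request.path

# Sending would be something like the following, once the real endpoint is known:
# uri = URI("https://csvlint.example/upload")
# Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request(request) }
```

Because the file body travels in the request, Octopub would have to read the CSV from S3 itself before posting, which is where the extra development effort mentioned in the Cons comes from.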

My preference would be for 4) if we think it is doable, or 1) as a quick-and-dirty workaround.

@rachelwilson
Contributor

Thanks for this!

(Just double-checking I'm not missing something.) For point 4's Cons, when you say "similar to the solution of using csvlint.rb above", did you mean just the "would require more significant development" part, but not the "may also make it harder in the long run to integrate with lintol" part?

Out of interest: do we know why POSTed validations don't appear in the "recent validations" list? Was that a conscious decision or a technical quirk?

@olivierthereaux
Contributor

> For point 4's Cons, when you say "similar to the solution of using csvlint.rb above", did you mean just the "would require more significant development" part, but not the "may also make it harder in the long run to integrate with lintol" part?

Correct!

> Out of interest: do we know why POSTed validations don't appear in the "recent validations" list? Was that a conscious decision or a technical quirk?

As far as I can tell, it is because the content payload is POSTed: unless csvlint stores the payload and creates a URI for it, there is nothing to link to, so it makes no sense to add it to the recent validations.

@olivierthereaux olivierthereaux added this to Backlog in Octopub Roadmap May 8, 2018