[SVCS-353] Look for Dataverse renamed files on upload #300

AddisonSchiller · 2017-11-13T16:28:17Z

Dataverse 'ingests' certain file types. These file types get renamed.
In upload when Waterbutler tries to find the correct metadata to return,
it will 500 since it was not looking for the renamed file.

Ticket

https://openscience.atlassian.net/browse/SVCS-353

Purpose

Uploading some file types to dataverse would cause a 500 error to be thrown in waterbutler. The file would still upload, but would have a new extension.

Changes

Allow waterbutler to return the correct metadata based on the current file name, as well as a new property called file.original_name. This new property will parse DV raw metadata to figure out what the original file type was. This way WB can return the correct metadata and won't throw a 500.
added tests for new functions.

Side effects

Currently DV just deletes old files if you try to upload a conflicting file name. (unsure if this is correct functionality or not)
Because DV changes the file names when uploading certain file types (csv, xlsx, and others), it wont be able to delete the original because it will look like it does not exist. Instead DV uses their own renaming scheme , <conflicting-file-2< ...
Because of this, when a file that DV renames gets uploaded, if it got renamed with a -1, or a -2 etc, WB won't look for that. instead it will return the metadata of the non-renamed version.
This shouldn't be much of a problem, as it is really only used for updating fangorn. A page refresh negates this error

I had thought of a few ways to negate this, but decided against it as it would add quite a few more DV api calls to the process. If it is determined in CR that we want to do that, then it can be implemented

QA Notes

The effect file types are as follows: .csv, .RData, .sav, .dta, .por, .xlsx
If you upload one of these files types to DV, it will be converted into a .tab file via their ingestion process.

Steps:

there are some sample file types on the JIRA ticket to use.
do not batch upload all the files (DV will return a 400 since it will get stuck on one of the ingestion operations)
One by one, upload each file
it should complete successfully and show up in fangorn as <name.tab>.
clicking on it should go to the right file.

Steps for 'side-effects' notes:

upload a file for the second time
it will look like 2 of the same file are on the page
refresh the page and one of them should show up with a -1

Deployment Notes

coveralls · 2017-11-13T17:14:14Z

Coverage increased (+0.03%) to 89.136% when pulling 3ada43b on AddisonSchiller:feature/SVCS-353-davaverse-csv-bug into 473191c on CenterForOpenScience:develop.

TomBaxter

Found one issue, annotated over three comments.

After a little testing and reviewing of the Dataverse API I think we need to go a little further here. As you already noted, Dataverse is saving certain file types as tabular files. As this PR stands we present the '.tab" file extension and return the tabular format when downloading.

We can't allow our users data to be abused this way. Fitz has pinged product about updating the terms of service for the Dataverse provider.

Though we may have to let this aggression stand to some extent, I think we might be able to mask our users from it.

Since we have a means of identifying the original file format we can hopefully present that file extension to them. Also there is a download parameter "format=original" that we should be able to use to get the original file back. http://guides.dataverse.org/en/latest/api/dataaccess.html?highlight=download#id4

I think it is worth giving it a try. To be honest, if we can't guarantee the data integrity, I'd almost prefer to just not allow those file types to be uploaded. So horrible.

TomBaxter · 2017-11-17T16:26:09Z

tests/providers/dataverse/test_utils.py

+
+    def test_original_ext_from_raw_metadata(self, format_dict):
+        for key in format_dict:
+            assert key == dv_utils.original_ext_from_raw_metadata(format_dict[key])


waterbutler.providers.dataverse.utils.ORIGINAL_FORMATS is a dictionary, which are unordered. Therefore this test will fail roughly two out of three runs. There are three keys which will match in dv_utils.original_ext_from_raw_metadata only one of which is 'RData'.

TomBaxter · 2017-11-17T16:40:01Z

waterbutler/providers/dataverse/metadata.py

+        name of the file.
+        """
+
+        ext = dv_utils.original_ext_from_raw_metadata(self.raw)


Similar to previous dv_utils.original_ext_from_raw_metadata comment. Will return more or less at random 'RData', 'rdata', or 'Rdata' for self.raw with 'originalFormatLabel': 'R Data'

TomBaxter · 2017-11-17T16:52:51Z

waterbutler/providers/dataverse/utils.py

+
+    },
+    # Rdata can come in a few different forms, so just list all of them here
+    'RData': {


See two previous comments regarding multiple 'original_label': 'R Data'

TomBaxter · 2017-11-17T18:44:26Z

waterbutler/providers/dataverse/provider.py

@@ -178,7 +178,8 @@ def can_duplicate_names(self):
        # Find appropriate version of file
        metadata = await self._get_data('latest')
        files = metadata if isinstance(metadata, list) else []
-        file_metadata = next(file for file in files if file.name == path.name)
+        file_metadata = next(file for file in files if file.name == path.name or


does this need and extra set of prenths?

Dataverse 'ingests' certain file types. These file types get renamed. In upload when Waterbutler tries to find the correct metadata to return, it will 500 since it was not looking for the renamed file.

AddisonSchiller · 2017-11-21T18:49:40Z

Changes on last commit:

original_name -> original_names and is now a list
This gets rid of the need for "RData", "rdata" and "Rdata" etc. It now returns a list of possible names to look for metadata.

If you upload a large RData file, it takes awhile for dataverse to convert it. Any uploads during that time will give a 400 back saying ingestion is currently clogged.
the provider will now look for these messages and display a message to the user that the upload process is clogged and to wait a few seconds and try again.

To test the above upload an RData file. Once its load bar finishes, immediately upload another ingested file type. It should throw the error.

New tests for this new exception handling etc.

coveralls · 2017-11-21T21:10:15Z

Coverage increased (+0.05%) to 89.992% when pulling c803684 on AddisonSchiller:feature/SVCS-353-davaverse-csv-bug into 26bf209 on CenterForOpenScience:develop.

Johnetordoff · 2017-11-28T15:00:31Z

waterbutler/providers/dataverse/utils.py

+    if not label or not file_format or not content_type:
+        return None
+
+    for key in ORIGINAL_FORMATS:


probably better as:

for key, value in ORIGINAL_FORMATS.items():
if (label == value['original_label'] and ...

You can apply the above to TestUtils class as well.

Johnetordoff · 2017-11-28T15:03:31Z

waterbutler/providers/dataverse/provider.py

        await resp.release()

        # Find appropriate version of file
        metadata = await self._get_data('latest')
        files = metadata if isinstance(metadata, list) else []
-        file_metadata = next(file for file in files if file.name == path.name)
+        file_metadata = next(file for file in files if (file.name == path.name or


Can we not just use a list comprehension here?

coveralls · 2017-11-30T15:40:00Z

Coverage increased (+0.05%) to 89.992% when pulling 9f0cdbb on AddisonSchiller:feature/SVCS-353-davaverse-csv-bug into 26bf209 on CenterForOpenScience:develop.

AddisonSchiller force-pushed the feature/SVCS-353-davaverse-csv-bug branch from cf0afbc to 3ada43b Compare November 13, 2017 16:44

TomBaxter suggested changes Nov 17, 2017

View reviewed changes

cslzchen added the To Be Replaced label Nov 17, 2017

AddisonSchiller added 2 commits November 21, 2017 13:46

Look for Dataverse renamed files on upload

8aaafb2

Dataverse 'ingests' certain file types. These file types get renamed. In upload when Waterbutler tries to find the correct metadata to return, it will 500 since it was not looking for the renamed file.

Response to CR

c803684

AddisonSchiller force-pushed the feature/SVCS-353-davaverse-csv-bug branch from 725e635 to c803684 Compare November 21, 2017 18:46

AddisonSchiller changed the title ~~[SVCS-353]Look for Dataverse renamed files on upload~~ [SVCS-353] Look for Dataverse renamed files on upload Nov 27, 2017

Johnetordoff suggested changes Nov 28, 2017

View reviewed changes

Use .items over dict lookups

9f0cdbb

cslzchen added Code Review To Be Replaced and removed To Be Replaced Code Review labels Dec 11, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SVCS-353] Look for Dataverse renamed files on upload #300

[SVCS-353] Look for Dataverse renamed files on upload #300

AddisonSchiller commented Nov 13, 2017

coveralls commented Nov 13, 2017 •

edited

TomBaxter left a comment

TomBaxter Nov 17, 2017

TomBaxter Nov 17, 2017

TomBaxter Nov 17, 2017

TomBaxter Nov 17, 2017

AddisonSchiller commented Nov 21, 2017

coveralls commented Nov 21, 2017 •

edited

Johnetordoff Nov 28, 2017

Johnetordoff Nov 28, 2017

Johnetordoff Nov 28, 2017

coveralls commented Nov 30, 2017 •

edited

[SVCS-353] Look for Dataverse renamed files on upload #300

Are you sure you want to change the base?

[SVCS-353] Look for Dataverse renamed files on upload #300

Conversation

AddisonSchiller commented Nov 13, 2017

Ticket

Purpose

Changes

Side effects

QA Notes

Deployment Notes

coveralls commented Nov 13, 2017 • edited

TomBaxter left a comment

Choose a reason for hiding this comment

TomBaxter Nov 17, 2017

Choose a reason for hiding this comment

TomBaxter Nov 17, 2017

Choose a reason for hiding this comment

TomBaxter Nov 17, 2017

Choose a reason for hiding this comment

TomBaxter Nov 17, 2017

Choose a reason for hiding this comment

AddisonSchiller commented Nov 21, 2017

coveralls commented Nov 21, 2017 • edited

Johnetordoff Nov 28, 2017

Choose a reason for hiding this comment

Johnetordoff Nov 28, 2017

Choose a reason for hiding this comment

Johnetordoff Nov 28, 2017

Choose a reason for hiding this comment

coveralls commented Nov 30, 2017 • edited

coveralls commented Nov 13, 2017 •

edited

coveralls commented Nov 21, 2017 •

edited

coveralls commented Nov 30, 2017 •

edited