
Locality and Geocoorddetail batch import tool #4548

Open

wants to merge 67 commits into base: production
Conversation

@melton-jason (Contributor) commented Feb 15, 2024

Fixes #4319

The Locality and GeoCoordDetail batch import tool allows users to match Locality records, update their Latitude1, Longitude1, and Datum fields, and associate new GeoCoordDetails with the matched Locality records.

The primary motivation for the tool is that georeferencing for existing records in the database can take place outside of Specify (such as with CoGe), and there is currently no easy way to repatriate that data back into Specify and update those records.

This process is facilitated by requiring the guids of the Locality records in the CSV. Specify uses these guids to find the corresponding Locality record(s).
With the Locality record, the tool does the following:

  • If any latitude1, longitude1, and/or datum fields have values for the Locality record in the CSV, Specify will overwrite the current values in the database with those specified in the CSV
  • If any other column in the CSV matches a non-relationship field name in the GeoCoordDetail table, Specify will delete the previous GeoCoordDetail associated with the Locality record and create a new one populated with field values from the CSV
  • If there are any other columns in the CSV, Specify will display a warning indicating there are unknown columns but allow the user to proceed; if the user proceeds, the unknown columns are ignored.
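Concretely, the column handling described above could be sketched like this (a hypothetical Python sketch; the field lists are illustrative stand-ins, not the actual Specify schema):

```python
# Hypothetical field lists for illustration only.
LOCALITY_FIELDS = {"latitude1", "longitude1", "datum"}
GEOCOORDDETAIL_FIELDS = {"source", "georefdetdate", "georefremarks"}  # subset

def classify_headers(headers):
    """Split CSV headers into Locality updates, GeoCoordDetail fields,
    and unknown columns (warned about, then ignored)."""
    locality, detail, unknown = [], [], []
    for header in headers:
        name = header.strip().lower()
        if name == "guid":
            continue  # guid is the required match column, not an update target
        elif name in LOCALITY_FIELDS:
            locality.append(name)
        elif name in GEOCOORDDETAIL_FIELDS:
            detail.append(name)
        else:
            unknown.append(header)
    return locality, detail, unknown
```

A non-empty `unknown` list would correspond to the warning dialog described above.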

Demo

Currently, access to the tool can be found in the User Tools menu:

locality_import_button

The primary interface is very similar to that of importing a file to be used as a DataSet in the WorkBench. Specify supports different encodings and delimiters, and displays the first 100 rows of the file:

coge_preview

Pressing the Import File button initiates the parsing/upload.

If there are any unknown columns in the dataset, a warning will first be displayed to the user so they can decide how to proceed:

coge_column_warning

If any bad values are discovered while parsing, Specify will display a downloadable report stating which line each error was found on, along with information about how to resolve the parsing error:

parse_error

If the parsing and upload are successful, Specify will display a brief results page stating how many records of each table were affected, along with the option to create a RecordSet of the modified Locality records:

upload_results

Checklist

  • Self-review the PR after opening it to make sure the changes look good
    and self-explanatory (or properly documented)
  • Add automated tests
  • Add relevant issue to release milestone

Testing instructions

For testing purposes, here is an example set of data for use on the KUFish database:
kufish_coge_sample.csv

Other datasets have been made for FITZ_NHMD, fwri, KUBirds, and lsumzmammals and are available in the following folder in the Google Drive:
https://drive.google.com/drive/folders/1B0hKQMBaX82nBUOB95uEa3H0MIrGrBv7?usp=drive_link

A foundation for Datasets for use in other databases can be made by creating a Locality -> GeoCoordDetail query and exporting the results. As an example, here is the query used to populate most of kufish_coge_sample.csv:
https://kufish51623-edge.test.specifysystems.org/specify/query/191/

There should currently be no errors in the kufish_coge_sample dataset, so modifications can be made to introduce the desired errors and edge cases.

Generally, the cases which would need to be tested are:

  • Ensure an error message is displayed to the user when a guid in the dataset matches no Locality record
  • Data in unknown columns is properly ignored and not accounted for when uploading
  • If a Locality contains existing data for GeoCoordDetails, it should be deleted if there is at least one GeoCoordDetail field in the dataset
  • If there are no values in any GeoCoordDetail field for a Locality in the dataset and the Locality contains a GeoCoordDetail in Specify, the GeoCoordDetail should not be overwritten
  • Ensure that only the Locality fields which contain data in the dataset get updated for a Locality in the dataset (i.e., if longitude1 and datum have values in the dataset but latitude1 is either not a column or empty, ensure that latitude1 is never overwritten)
  • Ensure parsing error messages are intuitive enough to diagnose and resolve the problem with the dataset
    • Potential ways to introduce invalid values include:
      • Entering a value in a field which exceeds the maximum length for the field (the maximum length for a field can be found in the Schema Configuration)
      • Entering an incorrect type of value for the field (such as entering a letter into a number/integer field)
      • Not following a UIFormatter for a field
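The invalid-value cases above could, in principle, be exercised against a check along these lines (a minimal sketch; the field-metadata dict is an assumed shape, not Specify's actual schema API, and the UIFormatter case is omitted):

```python
def validate_cell(value, field):
    """Return an error message for a single CSV cell, or None if it is valid.
    `field` is a stand-in for schema metadata, e.g.
    {"name": "maxuncertaintyest", "type": "number", "max_length": None}."""
    if value == "":
        return None  # empty cells are skipped, not treated as errors
    max_length = field.get("max_length")
    if max_length is not None and len(value) > max_length:
        return f"value exceeds maximum length of {max_length}"
    if field.get("type") == "number":
        try:
            float(value)
        except ValueError:
            return f"'{value}' is not a valid number"
    return None
```

A letter in a number field or an over-long string would each produce a message, which the tool could report alongside the offending line.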

@melton-jason melton-jason added this to the 7.9.4 milestone Feb 15, 2024
@melton-jason melton-jason removed the request for review from maxpatiiuk February 21, 2024 21:13

return {
missingRequiredHeaders:
accumulator.missingRequiredHeaders.filter(
Member:
use removeItem() util?

readonly error: LocalizedString;
}): JSX.Element {
return (
<p role="alert">
Member:

(optional) would ErrorMessage component be better suited here or would it be too "loud"?

@@ -56,25 +49,43 @@ export function WbImportView(): JSX.Element {

function FilePicked({ file }: { readonly file: File }): JSX.Element {
const fileType = inferDataSetType(file);
const getSetDataSetName = useTriggerState(extractFileName(file.name));
Member:

we have a stripFileExtension util, which does something similar to extractFileName. Could you please decide if one of these functions should be removed in favor of the other?

@@ -143,6 +143,9 @@ export const headerText = createDictionary({
abgeschlossen ist.
`,
},
coGeImportDataset: {
'en-us': 'Import CoGe Dataset'
Member:

oh, so CoGe is even exposed to the user?
would they know what coge means?

@@ -292,4 +292,17 @@ export const localityText = createDictionary({
'uk-ua': 'DD MM SS.SS N/S/E/W (32 45 42.84 N)',
'de-ch': 'DD MM SS.SS N/S/O/W (32 45 42.84 N)',
},
localityImportHeaderError: {
'en-us': 'Errors Found in Column Headers',
Member:

(optional) is this better or worse?

Suggested change
'en-us': 'Errors Found in Column Headers',
'en-us': 'Errors found in column headers',

@alesan99 left a comment:

Looks like Geo Coord Details are overwritten / created when there are no Geo Coord values in the row in the CSV.

chrome_VAvikgqy6w
TO
chrome_xN4ZhSVrI5
Tested on KUBirds_8_25_23 with kubirds_coge_sample.csv (after erasing the Geo Coord Detail values from a row)

All the other tests did work well, though

@CarolineDenis CarolineDenis requested a review from a team May 16, 2024 14:18
@melton-jason (Contributor, Author):

> Looks like Geo Coord Details are overwritten / created when there are no Geo Coord values in the row in the CSV.
>
> chrome_VAvikgqy6w TO chrome_xN4ZhSVrI5 Tested on KUBirds_8_25_23 with kubirds_coge_sample.csv (after erasing the Geo Coord Detail values from a row)
>
> All the other tests did work well, though

#4548 (review)

@alesan99
I couldn't seem to recreate this issue.

To test this, I used a single Locality record on KUFish with a GUID of 72fc97f4-2926-4f35-aad3-0306b382483c and inserted some values in the GeoCoordDetail.

Then I imported the following CSV using the tool (where the only values are in the GUID, latitude1, longitude1, and datum columns):

GUID,Latitude1,Longitude1,Datum,Source,georefdetdate,georefremarks,maxuncertaintyest,maxuncertaintyestunit,namedplaceextent,errorpolygon,uncertaintypolygon
72fc97f4-2926-4f35-aad3-0306b382483c,17.0432700000,-96.6793200000,WGS84,,,,,,,,

The value in the GeoCoordDetail for the Locality remained untouched.

Are you able to provide the line of the CSV which corresponds to a Locality/GeoCoordDetails for which this occurs?

@emenslin (Collaborator) left a comment:

Testing instructions

(Quoted from the PR description above.)

Issues

The original error message that happens if there is an unknown field stays behind the loading screen the whole time and is still behind the create record set dialog at the end as well. This dialog should close once you start the upload.

cDT3fu9re.mp4

I could not consistently recreate this, but sometimes, after importing a file and getting an importing error (e.g. a field value is too long), trying to import the file again would produce a full Specify error requiring a return to the homepage.

Changes

I think that when you are importing, instead of just a loading screen for a long time, there should be something like the WorkBench has. Like this:
Screenshot 2024-05-16 131607

I also think it should auto-create a record set. With the current behavior it is very easy to accidentally click out, and then you cannot go back to create one again. At the very least, you should not be able to click out of the 'create record set' dialog unless you explicitly click 'Cancel'.

I also think the error dialog needs to include the header; just seeing the row isn't always intuitive. The column names in this dialog are also too close together, in my opinion.
Screenshot 2024-05-16 130541

@alesan99

> @alesan99 I couldn't seem to recreate this issue. […]
>
> Are you able to provide the line of the CSV which corresponds to a Locality/GeoCoordDetails for which this occurs?

Hm, after doing some more testing in KU_Birds, it looks like it only happens if I had a source defined in the source column; having only the columns you specified indeed does not cause any errors.

This was the culprit:

guid,latitude1,longitude1,datum,source,georefdetdate,georefremarks,maxuncertaintyest,maxuncertaintyestunit,namedplaceextent,errorpolygon,uncertaintypolygon
5d8f344a-7b48-11e4-8ef3-782bcb9cd5b5,,,WGS84,BioGeomancer,,,,,,,

Removing the value from source no longer overwrites the Geo Coord details.

I don't quite know how "source" relates to Geo Coord Details, but unless this is unintentional, it looks like nothing was going wrong after all.
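That behavior would be consistent with a rule that replaces the GeoCoordDetail whenever any GeoCoordDetail column, source included, is non-empty. A hypothetical sketch of such a check (the column list is an illustrative subset, not the full schema):

```python
# Illustrative subset of GeoCoordDetail columns, not the full schema.
GEOCOORDDETAIL_COLUMNS = [
    "source", "georefdetdate", "georefremarks",
    "maxuncertaintyest", "namedplaceextent",
]

def replaces_geocoorddetail(row):
    """True if the CSV row should trigger deletion and re-creation of the
    Locality's GeoCoordDetail: any GeoCoordDetail column has a value."""
    return any(row.get(col, "").strip() for col in GEOCOORDDETAIL_COLUMNS)
```

Under this rule, a row whose only GeoCoordDetail value is source would still replace the existing record, matching what was observed.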

@melton-jason (Contributor, Author):

@specify/ux-testing

Here are some changes I have made since the last review:

  • Removed all specific references to CoGe
    • The tool is source-agnostic and only requires that the data match its own specific format, so I have renamed all instances of 'CoGe Dataset' to 'Locality Repatriation Dataset'.
    • Non-usable bonus points if you can come up with a better name!
locality_import_button
  • In the Parse Error dialog, I have included the field name as a column
parse_error
  • On a successful import, a RecordSet is always created, and a link to it is provided in the import/upload results
upload_results
  • Fixed "The original error message that happens if there is an unknown field stays behind the loading screen the whole time and is still behind the create record set dialog at the end as well. This dialog should close once you start the upload"

Can these aspects be tested and feedback provided?

As far as including a progress bar instead of a loading screen is concerned, is there any additional functionality requested? We could integrate with the Notifications system, for example.

@melton-jason melton-jason requested a review from a team May 27, 2024 16:30
@emenslin (Collaborator) left a comment:

Testing instructions

(Quoted from the PR description above.)

Functionality works great!

Changes

  • I think the name should be changed; it looks kind of weird in the User Tools menu since it's so long. Maybe something like 'Import Locality Dataset' to shorten it.
    Screenshot 2024-05-28 083921

  • I think there should be some sort of notification when it finishes. Since the import continues even if you close out of Specify, the user should get a notification when the record set is created.

  • The error dialog still looks a little weird and does not behave as expected. I would recommend at least changing 'row number' to 'row' and changing how the columns move when the dialog is resized.

O9Hliui8Ep.mp4
  • There should be some way to change the record set name before you click the link to it. I know that once the record set is created you can select the pencil icon and change it there, but I don't know if users will want to go out of their way to change it. Maybe something like the WorkBench's create-record-set dialog, where the name is auto-filled but can be changed before saving.
    Screenshot 2024-05-28 085232

I like this dialog but maybe a pencil here or something so users can change the name before linking out?

  • If you upload a dataset with errors, you can still click the 'Import Dataset' button, and it'll just import the same one again. Once a dataset has been imported, that button should be grayed out until a new dataset is uploaded; otherwise, users who accidentally click it are stuck in the dialog until it finishes.
qf78Dzv5ia.mp4

Overall, functionality works great; just some UI things I'd like changed. Great work!

@Areyes42 Areyes42 self-requested a review May 28, 2024 16:48
@Areyes42 left a comment:

Testing instructions

(Quoted from the PR description above.)

Import tool looks excellent!

For the most part, I agree with everything @emenslin mentioned for changes. I also think 'Import Locality Dataset' is a more concise name than 'Import Locality Repatriation Dataset'.

I noticed one small discrepancy: the row number in the parsing-error notification does not match the one in the actual dataset. I understand this happens because the row count starts at 0 and doesn't account for the first row being selected as the header. Although it's a nitpicky suggestion, matching the row numbers would make errors more intuitive to find and resolve, since most spreadsheet editors start at 1.
Screenshot 2024-05-28 at 12 19 12 PM
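A fix for that off-by-one (or off-by-two, with a header row) could be as simple as this sketch, assuming the parser tracks a 0-based data-row index:

```python
def display_row_number(data_row_index, has_header=True):
    """Convert a 0-based data-row index into the 1-based row number a
    spreadsheet user sees, counting the header row when present."""
    return data_row_index + 1 + (1 if has_header else 0)
```

For example, the first data row under a header would be reported as row 2, matching what most spreadsheet editors display.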

Labels
None yet
Projects
Status: Dev Attention Needed
Development

Successfully merging this pull request may close these issues.

Create a tool to repatriate data from CoGe and other georeferencing utilities
6 participants