Improve the Import Process #4549

Open
kflemin opened this issue Feb 28, 2024 · 1 comment
Labels
Enhancement: Add this label if functionality was generally improved but not a full feature or maintenance.

Comments

@kflemin
Contributor

kflemin commented Feb 28, 2024

Per Peer Review feedback, upgrade the data mapping and matching processes to make the workflow more intuitive, including further integration of data types and dual-unit support. We also need to improve the performance of the import process and look at better scaling of worker nodes.

More details:

  1. Profile the import process (steps 1 through 6 in the UI): some steps seem to get skipped while others take a long time to run (see the timing sketch after this list)

  2. Investigate whether there are steps that can be skipped completely (see the skip-check sketch after this list). For example:

    • Geocoding: if the org doesn't have a mapquest key or lat/lng/UBID are not provided, skip the geocoding process
    • Linking: if there's only one cycle, can we skip the whole linking process completely?
    • Pairing: if there are no taxlots in the org and no taxlots in the import, can we skip the pairing process completely?
    • Matching: are there improvements we can make here (matching within and across cycles)?
  3. Can we improve the Progress API endpoint to be more fault tolerant (see the TTL sketch after this list)? For example:

  • frontend: keep retrying on error?
  • backend: reset the TTL of each key when it updates the progress values?
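
A few rough sketches of these ideas follow. They are illustrative only: the function names, cache key handling, and timeouts are assumptions rather than SEED's actual API.

For item 1, one way to collect per-task timings is to hook Celery's `task_prerun`/`task_postrun` signals and log wall-clock durations for every task in the import pipeline:

```python
# Hypothetical timing hook; SEED may already have its own instrumentation.
import logging
import time

from celery.signals import task_postrun, task_prerun

logger = logging.getLogger(__name__)
_starts = {}


@task_prerun.connect
def _record_start(task_id=None, **kwargs):
    _starts[task_id] = time.monotonic()


@task_postrun.connect
def _log_duration(task_id=None, task=None, **kwargs):
    start = _starts.pop(task_id, None)
    if start is not None:
        logger.info("%s finished in %.1fs", task.name, time.monotonic() - start)
```

For item 2, the skip conditions could be small predicates evaluated once before each step is dispatched (the arguments are stand-ins for whatever the org and import file actually expose):

```python
def can_skip_geocoding(has_mapquest_key: bool, has_lat_lng_or_ubid: bool) -> bool:
    # No MapQuest key, or no lat/lng/UBID provided: geocoding has nothing to do.
    return not has_mapquest_key or not has_lat_lng_or_ubid


def can_skip_linking(cycle_count: int) -> bool:
    # With a single cycle there is nothing to link across.
    return cycle_count <= 1


def can_skip_pairing(org_taxlot_count: int, import_taxlot_count: int) -> bool:
    # No taxlots in the org and none in the import makes pairing a no-op.
    return org_taxlot_count == 0 and import_taxlot_count == 0
```

For the backend half of item 3, assuming the progress keys live in a Django cache backend (e.g. Redis), rewriting the key with an explicit timeout on every update resets its TTL, so a long-running import can't outlive its own progress record:

```python
from django.core.cache import cache

# Illustrative TTL; pick something that comfortably exceeds the longest single step.
PROGRESS_TTL_SECONDS = 24 * 60 * 60


def update_progress(progress_key: str, completed: int, total: int) -> None:
    payload = {
        "completed": completed,
        "total": total,
        "progress": 100.0 * completed / total if total else 0.0,
    }
    # Setting the key with a timeout refreshes its TTL on every update.
    cache.set(progress_key, payload, timeout=PROGRESS_TTL_SECONDS)
```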

@axelstudios

@kflemin kflemin added the Enhancement label Feb 28, 2024
@kflemin kflemin changed the title from Parallelize the Import Process to Improve the Import Process on Mar 22, 2024
@axelstudios
Member

axelstudios commented May 1, 2024

The mapping step has room for improvement, but by far the biggest bottleneck is after you hit Confirm mappings & start matching:
[Screenshot: Start Matching button]

Anecdotally, when uploading a file with 145,921 rows to an existing organization with 145,921 matching rows, these are the Celery tasks and their timings:

  • 1x seed.data_importer.tasks._geocode_properties_or_tax_lots (35s)
  • 1,460x seed.data_importer.tasks._map_additional_models (1m 8s)
  • 1x seed.data_importer.match.match_and_link_incoming_properties_and_taxlots (2d 13h 55m 9s)
    • It took approximately 1h 8m to get to step 3 / 6 (Matching Data (3/6): Merging Unmatched States)
    • Step 3 / 6 took 15h 20m 36s to complete
    • Step 6 / 6 took 1d 21h 23m 30s to complete, most of which was spent inside this loop, which takes 1.1 seconds for each record in the import file (see the arithmetic below)
  • 1x seed.data_importer.tasks.finish_matching (0.3s)

Total matching time: 61.94 hours (2 days 13 hours 56 minutes 33 seconds)
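
For scale, the 1.1 s-per-record loop alone accounts for roughly 145,921 × 1.1 s ≈ 160,500 s ≈ 44.6 hours, which is nearly all of step 6's 1d 21h 23m 30s (≈ 45.4 hours), so that loop looks like the obvious first optimization target.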

Projects
Status: Prioritized Todo
Development

No branches or pull requests

2 participants