Improve the Import Process #4549

Open
kflemin opened this issue Feb 28, 2024 · 1 comment
Labels
Enhancement: Add this label if functionality was generally improved but not a full feature or maintenance.

Comments

@kflemin
Contributor

kflemin commented Feb 28, 2024

Per Peer Review feedback, upgrade the data mapping and matching processes to make the workflow more intuitive, including further integration of data types and dual-unit support. We also need to improve the performance of the import process and look at better scaling of worker nodes.

More details:

  1. Profile the import process (steps 1 through 6 in the UI): some steps seem to get skipped while others take a long time to run (see the timing sketch after this list)

  2. Investigate whether there are steps that can be skipped completely (see the skip-check sketch after this list). For example:

    • Geocoding: if the org doesn't have a mapquest key or lat/lng/UBID are not provided, skip the geocoding process
    • Linking: if there's only one cycle, can we skip the whole linking process completely?
    • Pairing: if there are no taxlots in the org and no taxlots in the import, can we skip the pairing process completely?
    • Matching: are there improvements we can make here (matching within and across cycles)?
  3. Can we improve the Progress API endpoint to be more fault tolerant (see the TTL sketch after this list)? For example:

  • frontend: keep retrying on error?
  • backend: reset the TTL of each key when it updates the progress values?
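
A few rough sketches of these ideas follow. They are illustrative only: the function names, cache key handling, and timeouts are assumptions rather than SEED's actual API.

For item 1, one way to collect per-task timings is to hook Celery's `task_prerun`/`task_postrun` signals and log wall-clock durations for every task in the import pipeline:

```python
# Hypothetical timing hook; SEED may already have its own instrumentation.
import logging
import time

from celery.signals import task_postrun, task_prerun

logger = logging.getLogger(__name__)
_starts = {}


@task_prerun.connect
def _record_start(task_id=None, **kwargs):
    _starts[task_id] = time.monotonic()


@task_postrun.connect
def _log_duration(task_id=None, task=None, **kwargs):
    start = _starts.pop(task_id, None)
    if start is not None:
        logger.info("%s finished in %.1fs", task.name, time.monotonic() - start)
```

For item 2, the skip conditions could be small predicates evaluated once before each step is dispatched (the arguments are stand-ins for whatever the org and import file actually expose):

```python
def can_skip_geocoding(has_mapquest_key: bool, has_lat_lng_or_ubid: bool) -> bool:
    # No MapQuest key, or no lat/lng/UBID provided: geocoding has nothing to do.
    return not has_mapquest_key or not has_lat_lng_or_ubid


def can_skip_linking(cycle_count: int) -> bool:
    # With a single cycle there is nothing to link across.
    return cycle_count <= 1


def can_skip_pairing(org_taxlot_count: int, import_taxlot_count: int) -> bool:
    # No taxlots in the org and none in the import makes pairing a no-op.
    return org_taxlot_count == 0 and import_taxlot_count == 0
```

For the backend half of item 3, assuming the progress keys live in a Django cache backend (e.g. Redis), rewriting the key with an explicit timeout on every update resets its TTL, so a long-running import can't outlive its own progress record:

```python
from django.core.cache import cache

# Illustrative TTL; pick something that comfortably exceeds the longest single step.
PROGRESS_TTL_SECONDS = 24 * 60 * 60


def update_progress(progress_key: str, completed: int, total: int) -> None:
    payload = {
        "completed": completed,
        "total": total,
        "progress": 100.0 * completed / total if total else 0.0,
    }
    # Setting the key with a timeout refreshes its TTL on every update.
    cache.set(progress_key, payload, timeout=PROGRESS_TTL_SECONDS)
```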

@axelstudios

@kflemin kflemin added the Enhancement label Feb 28, 2024
@kflemin kflemin changed the title from Parallelize the Import Process to Improve the Import Process on Mar 22, 2024
@axelstudios
Member

axelstudios commented May 1, 2024

The mapping step has room for improvement, but by far the biggest bottleneck is after you hit Confirm mappings & start matching:
[Screenshot: Start Matching button]

Anecdotally, when uploading a file with 145,921 rows to an existing organization with 145,921 matching rows, these are the Celery tasks and their timings:

  • 1x seed.data_importer.tasks._geocode_properties_or_tax_lots (35s)
  • 1,460x seed.data_importer.tasks._map_additional_models (1m 8s)
  • 1x seed.data_importer.match.match_and_link_incoming_properties_and_taxlots (2d 13h 55m 9s)
    • It took approximately 1h 8m to get to step 3 / 6 (Matching Data (3/6): Merging Unmatched States)
    • Step 3 / 6 took 15h 20m 36s to complete
    • Step 6 / 6 took 1d 21h 23m 30s to complete, most of which was spent inside this loop, which takes 1.1 seconds for each record in the import file (see the arithmetic below)
  • 1x seed.data_importer.tasks.finish_matching (0.3s)

Total matching time: 61.94 hours (2 days 13 hours 56 minutes 33 seconds)
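
For scale, the 1.1 s-per-record loop alone accounts for roughly 145,921 × 1.1 s ≈ 160,500 s ≈ 44.6 hours, which is nearly all of step 6's 1d 21h 23m 30s (≈ 45.4 hours), so that loop looks like the obvious first optimization target.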

Projects
Status: Prioritized Todo
Development

No branches or pull requests

2 participants