Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: refactor ingest #3009

Merged
merged 53 commits into from
May 21, 2024
Merged

feat: refactor ingest #3009

merged 53 commits into from
May 21, 2024

Conversation

rbiseck3
Copy link
Contributor

@rbiseck3 rbiseck3 commented May 13, 2024

Description

This refactors the current ingest CLI process to support better granularity in how the steps are ran

  • Both multiprocessing and async now supported. Given that a lot of the steps are IO-bound, such as downloading and uploading content, we can achieve better parallelization by using async here
  • Destination step broken up into a stager step and an upload step. This will allow for steps that require manipulation of the data between formats, such as converting the elements json into a csv format to upload for tabular destinations, to be pulled out of the step that does the actual upload.
  • The process of writing the content to a local destination was now pulled out as it's own dedicated destination connector, meaning you no longer need to persist the content locally once the process is done if the content was uploaded elsewhere.
  • Quick update to the chunker/partition step to use the python client.
  • Move the uncompress suppport as a pipeline step since this can arbitrarily apply to any concrete files that have been downloaded, regardless of where they came from.
  • Leverage last modified date to mark files to be reprocessed, even if the file already exists locally.

Callouts

Retry configs haven't been moved over yet. This is an open question because the intent was for it to wrap potential connection errors but now any of the other steps that leverage an API might run into network connection issues. Should those be isolated in each of the steps and wrapped with the same retry configs? Or do we need to expose a unique retry config for each step? This would bloat the input params even more.

Testing

  • If you want to run the new code as an SDK, there's an example file that was added to highlight how to do that: example.py
  • If you want to run the new code as an isolated CLI:
PYTHONPATH=. python unstructured/ingest/v2/main.py --help
  • If you want to see which commands have been migrated to the new version, there's now a v2 short help text next to those commands when running the current cli:
PYTHONPATH=. python unstructured/ingest/main.py --help
Usage: main.py [OPTIONS] COMMAND [ARGS]...main.py --help   

Options:
  --help  Show this message and exit.

Commands:
  airtable
  azure
  biomed
  box
  confluence
  delta-table
  discord
  dropbox
  elasticsearch
  fsspec
  gcs
  github
  gitlab
  google-drive
  hubspot
  jira
  local          v2
  mongodb
  notion
  onedrive
  opensearch
  outlook
  reddit
  s3             v2
  salesforce
  sftp
  sharepoint
  slack
  wikipedia

You can run any of the local or s3 specific ingest tests and these should now work.

@potter-potter
Copy link
Contributor

This is really cool! And its also a lot to understand. Could you add notes on how to test it out in the description?

@rbiseck3 rbiseck3 force-pushed the roman/refactor-ingest branch 2 times, most recently from 002824f to cc14e48 Compare May 15, 2024 19:03
@rbiseck3 rbiseck3 changed the title feat: refactor ingest (WIP) feat: refactor ingest May 15, 2024
@ryannikolaidis
Copy link
Contributor

ryannikolaidis commented May 16, 2024

are there doc changes that need to happen for users of the connectors via Python? guessing nothing breaks yet for them since this is only additive, but we'll just need to once we switch over

Comment on lines +22 to +25
for s in src_v2:
src_dict[s.name] = s
for d in dest_v2:
dest_dict[d.name] = d
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, so we just defer to v2 for cli if there exists a v2 command

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, at least right now as soon as the new connector exists, it replaces the old one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, this is actually pretty nice because it means that nothing breaks as far as users know. Also means we can add deprecations warnings for the Python users for whom it will break once we switch over.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth doing that for the ones that you're bumping here? And probably bump the docs for the respective connectors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we manage the docs for connectors now? I thought that changed recently and not sure if that's still being managed by this repo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryannikolaidis
Copy link
Contributor

Overall I think there are a lot of good changes in here! This does come at the cost of quite a bit of complexity (though to be fair some of this is coming back in a bit cold). My biggest concern is that it's going to be tough for an average person to walk in and debug or be effective here, and currently you have to do a lot of tracing to grok what's going on. My main ask: can we equivalently (in size and scope) push the documentation? Including a lot more comments and docstrings as well as a comprehensive breakdown of everything happening with maybe some diagrams as well?

@rbiseck3 rbiseck3 force-pushed the roman/refactor-ingest branch 2 times, most recently from 58cdb61 to 253f158 Compare May 17, 2024 15:04
ryannikolaidis and others added 13 commits May 21, 2024 10:10
This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Copy link
Contributor

@ryannikolaidis ryannikolaidis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should mention in the PR description the bug fix (which is actually what caused the PR's surface area to bloat a bunch)

Thanks for the additions so far. Just noting here from conversation, more extensive documentation (mentioned in comments) will be added before we switch over from original code and current flow. In the meantime this should be non-breaking for users.

@rbiseck3 rbiseck3 enabled auto-merge May 21, 2024 15:52
@rbiseck3 rbiseck3 added this pull request to the merge queue May 21, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 21, 2024
@rbiseck3 rbiseck3 added this pull request to the merge queue May 21, 2024
Merged via the queue into main with commit 3eaf65a May 21, 2024
46 checks passed
@rbiseck3 rbiseck3 deleted the roman/refactor-ingest branch May 21, 2024 17:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants