Update Scripts to integrate AWS S3 #807
base: main
Conversation
@jaredbradley243 is attempting to deploy a commit to the Arc53 Team on Vercel. A member of the Team first needs to authorize it.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

@@ Coverage Diff @@
##             main     #807      +/-   ##
==========================================
- Coverage   19.56%   19.55%   -0.01%
==========================================
  Files          62       72      +10
  Lines        2914     3340     +426
==========================================
+ Hits          570      653      +83
- Misses       2344     2687     +343

View full report in Codecov by Sentry.
Script Functionality

Enabling S3 Storage
To enable S3 storage, use the

Enabling Role Assumption
If accessing an S3 bucket requires assuming an IAM role (e.g., for cross-account access), the script supports this through the

This configuration allows the script to assume
To me, the following parameters look more logical to pass as command line parameters rather than as environment variables.
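The suggestion above — prefer CLI parameters, with the environment as a fallback — can be sketched with `argparse`. This is a hypothetical interface, not the script's actual one; the flag names simply mirror the parameters discussed in this PR:

```python
import argparse
import os

def parse_args(argv=None):
    """Parse S3 options from the command line, falling back to environment
    variables when a flag is not given. Illustrative sketch only."""
    parser = argparse.ArgumentParser(description="Ingest documents from S3")
    parser.add_argument("--s3-bucket", default=os.environ.get("S3_BUCKET"))
    parser.add_argument("--s3-documents-folder",
                        default=os.environ.get("S3_DOCUMENTS_FOLDER"))
    parser.add_argument("--s3-save-folder",
                        default=os.environ.get("S3_SAVE_FOLDER"))
    # Boolean flag: true if passed on the CLI, else read from the environment.
    parser.add_argument("--s3-assume-role", action="store_true",
                        default=os.environ.get("S3_ASSUME_ROLE", "").lower() == "true")
    return parser.parse_args(argv)

args = parse_args(["--s3-bucket", "my-bucket"])
print(args.s3_bucket)  # prints "my-bucket"
```

With this layout a CLI flag always wins, because the environment value is only used as the `default`.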
from boto3.s3.transfer import TransferConfig

# When override=True, .env variables take precedence over the environment. https://pypi.org/project/python-dotenv/
dotenv.load_dotenv(override=True)
Why has this changed? Default logic looks reasonable.
Without override=True, the environment takes precedence over the .env variables. If a user changes their AWS credentials in the .env file, I wanted the change to take effect immediately. 😀
Sorry, I didn't get the justification and the use case.
TL;DR: if override is not set to True, the AWS credentials in .env will not be re-read each time the script is run.
Use case: a user wants to change credentials in their .env file between script runs.
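The override semantics under discussion can be illustrated with a stdlib-only stand-in for python-dotenv's behaviour (a hypothetical helper, not the library's API — the real call is `dotenv.load_dotenv(override=...)`):

```python
import os

def load_env_text(text, override=False):
    """Minimal sketch of python-dotenv's override semantics.

    Parses KEY=VALUE lines. With override=False (dotenv's default), values
    already present in os.environ win; with override=True, the file wins.
    """
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if override or key not in os.environ:
            os.environ[key] = value

os.environ["AWS_ACCESS_KEY_ID"] = "old-key"
load_env_text("AWS_ACCESS_KEY_ID=new-key")                 # default: existing value kept
print(os.environ["AWS_ACCESS_KEY_ID"])  # prints "old-key"
load_env_text("AWS_ACCESS_KEY_ID=new-key", override=True)  # file value wins
print(os.environ["AWS_ACCESS_KEY_ID"])  # prints "new-key"
```

This is the trade-off being debated: override=True makes .env edits take effect immediately, at the cost of silently shadowing credentials already exported in the shell.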
I don't see a clear justification for this change. I trust the default behaviour is more generally acceptable, in this case as well.
Please provide a clear use case or remove this change.
Thanks for the change requests, @larinam. I'll get to these soon!
Working on these now.
I added command line parameter support as well as env variables. 😁
This version includes support for command line arguments for s3_bucket, s3_documents_folder, s3_save_folder, s3_assume_role, and aws_assume_role_profile. S3 functionality has also been refactored into two functions: one for download, located in ingest.py, and one for upload, located in open_ai_func.py.
In general, according to the ticket, support for S3 should be added to the application as well. In this PR, the changes are only in the scripts package. Please refer to the initial ticket for more information.
To avoid passing the S3 config through several calls, please consider using https://docs.pydantic.dev/latest/concepts/pydantic_settings/
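The idea behind the pydantic-settings suggestion — collect the S3 configuration once and pass it around as a single object — can be sketched with a stdlib dataclass (the actual suggestion is pydantic's BaseSettings; the field and env-var names here follow the ones mentioned in this PR and are otherwise hypothetical):

```python
import os
from dataclasses import dataclass, field

@dataclass
class S3Settings:
    """Stdlib stand-in for the suggested pydantic BaseSettings approach.

    Reads the S3 config from the environment at construction time, so the
    whole config travels as one object instead of being threaded through
    several function calls parameter by parameter.
    """
    bucket: str = field(default_factory=lambda: os.environ.get("S3_BUCKET", ""))
    documents_folder: str = field(default_factory=lambda: os.environ.get("S3_DOCUMENTS_FOLDER", ""))
    save_folder: str = field(default_factory=lambda: os.environ.get("S3_SAVE_FOLDER", ""))
    assume_role: bool = field(default_factory=lambda: os.environ.get("S3_ASSUME_ROLE", "").lower() == "true")

os.environ["S3_BUCKET"] = "docs-bucket"
settings = S3Settings()
print(settings.bucket)  # prints "docs-bucket"
```

With pydantic-settings the same shape comes with validation and .env-file loading built in, which is why the reviewer points to it.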
Please review the comments in the code and general comments.
aws_assume_role_profile = aws_assume_role_profile or os.environ.get("AWS_ASSUME_ROLE_PROFILE")

if not s3_bucket:
    print("Error: S3_BUCKET environment variable is not set.")
Please get acquainted with the concept of Errors and Exceptions in Python.
Also, the message is not completely correct in the context of the latest changes.
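The reviewer's two points — use exceptions rather than printing an error, and keep the message accurate now that the value can also come from a CLI argument — could be addressed along these lines (a hypothetical helper, not code from the PR):

```python
def require_s3_bucket(s3_bucket):
    """Validate the bucket value by raising instead of print-and-return.

    Raising ValueError lets callers catch the failure and gives a non-zero
    exit status when unhandled; the message no longer claims the value comes
    only from an environment variable.
    """
    if not s3_bucket:
        raise ValueError(
            "S3 bucket is not set; pass --s3-bucket or set the S3_BUCKET "
            "environment variable."
        )
    return s3_bucket
```

A caller can then decide how to surface the failure (catch and log, or let the script terminate) instead of continuing past a printed message.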
What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
This PR modifies the ingest and open_ai_func files to integrate AWS S3 into DocsGPT.
Why was this change needed? (You can also link to an open issue here)
🚀 Feature: Use S3 bucket as the vector store #724
Other information: