
Update Scripts to integrate AWS S3 #807

Open
wants to merge 2 commits into main

Conversation

jaredbradley243 (Collaborator)

  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
    This PR modifies the ingest and open_ai_func files to integrate AWS S3 into DocsGPT.

  • Why was this change needed? (You can also link to an open issue here)
    🚀 Feature: Use S3 bucket as the vector store #724

  • Other information:

vercel bot commented Dec 25, 2023

The latest updates on your projects:

Name | Status | Updated (UTC)
nextra-docsgpt | ✅ Ready | Dec 25, 2023 11:26pm

vercel bot commented Dec 25, 2023

@jaredbradley243 is attempting to deploy a commit to the Arc53 Team on Vercel.

A member of the Team first needs to authorize it.

codecov bot commented Dec 25, 2023

Codecov Report

Attention: Patch coverage is 0%, with 93 lines in your changes missing coverage. Please review.

Project coverage is 19.55%. Comparing base (7f79363) to head (30f2171).
Report is 180 commits behind head on main.

Files | Patch % | Lines
scripts/ingest.py | 0.00% | 59 Missing ⚠️
scripts/parser/open_ai_func.py | 0.00% | 34 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #807      +/-   ##
==========================================
- Coverage   19.56%   19.55%   -0.01%     
==========================================
  Files          62       72      +10     
  Lines        2914     3340     +426     
==========================================
+ Hits          570      653      +83     
- Misses       2344     2687     +343     


@jaredbradley243 (Collaborator Author)

Script Functionality

  1. Local Mode (default): Processes documents from local directories specified by the user.
  2. S3 Mode (--s3), as sketched below:
    • Downloads documents from an S3 bucket to temporary local storage (s3_temp_storage).
    • Processes these documents.
    • Uploads the processed documents back to the S3 bucket.
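
A rough sketch of that S3 flow using boto3; the helper names, index filenames, and flattening of nested prefixes are illustrative, not the PR's actual code:

    import os
    import boto3

    s3 = boto3.client("s3")

    def download_documents(bucket, prefix, dest="s3_temp_storage"):
        # Pull every object under `prefix` into local temp storage,
        # skipping vector files and zero-byte "folder" marker keys.
        os.makedirs(dest, exist_ok=True)
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                key = obj["Key"]
                if key.endswith((".faiss", ".pkl")) or key.endswith("/"):
                    continue
                # Flatten nested prefixes for simplicity of the sketch.
                s3.download_file(bucket, key, os.path.join(dest, os.path.basename(key)))

    def upload_vectors(bucket, save_folder, files=("index.faiss", "index.pkl")):
        # Push the generated vector files back to the bucket.
        for name in files:
            key = f"{save_folder}/{name}" if save_folder else name
            s3.upload_file(name, bucket, key)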

Enabling S3 Storage

To enable S3 storage, use the --s3 flag when running the script.

  1. Environment Variables: Set these variables in your .env file (a sample is shown below):

    • S3_BUCKET: Name of your S3 bucket.
    • S3_DOCUMENTS_FOLDER: Folder within the S3 bucket where your documents are stored. If left blank, all files in the S3 bucket will be downloaded (except .faiss and .pkl files).
    • S3_SAVE_FOLDER: Folder within the S3 bucket in which to save the vector files. Leave blank to use the root of the bucket.
  2. Running the Script:

    • python ingest.py ingest --s3
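
For example, a hypothetical .env for S3 mode (bucket and folder names are placeholders):

    S3_BUCKET=my-docsgpt-bucket
    S3_DOCUMENTS_FOLDER=documents
    S3_SAVE_FOLDER=vectors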

Enabling Role Assumption

If accessing an S3 bucket requires assuming an IAM role (e.g., for cross-account access), the script supports this through the --s3-assume flag and proper AWS configuration.

  1. Environment Variable:
  • Add AWS_ASSUME_ROLE_PROFILE to your .env file with the name of the AWS profile for role assumption. Ex: AWS_ASSUME_ROLE_PROFILE="dev"
  2. AWS Configuration:
  • Credentials File (~/.aws/credentials):
    [default]
    aws_access_key_id = YOUR_DEFAULT_ACCESS_KEY
    aws_secret_access_key = YOUR_DEFAULT_SECRET_KEY
    
    [iamadmin]
    aws_access_key_id = EXAMPLEKEY123456
    aws_secret_access_key = EXAMPLESECRETKEY123456
  • Config File (~/.aws/config):
    [default]
    region = us-west-2
    output = json
    
    [profile dev]
    region = us-west-2
    role_arn = arn:aws:iam::123456789012:role/YourRoleName
    source_profile = iamadmin
  3. Running the Script with Role Assumption:

    • python ingest.py ingest --s3 --s3-assume

This configuration allows the script to assume YourRoleName using the credentials from the iamadmin profile.
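
Under the hood, handing boto3 the profile name is enough for it to perform the role assumption defined in ~/.aws/config. A minimal sketch, assuming the example profile above and a hypothetical bucket and key:

    import boto3

    # "dev" is the profile whose role_arn/source_profile were configured above;
    # boto3 assumes YourRoleName automatically when this session is used.
    session = boto3.Session(profile_name="dev")
    s3 = session.client("s3")
    s3.download_file("my-docsgpt-bucket", "documents/guide.md",
                     "s3_temp_storage/guide.md")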

Note

  • Ensure that the IAM role (YourRoleName) has necessary permissions for accessing the specified S3 bucket.
  • The script will create temporary local storage (s3_temp_storage) for processing S3 documents, which is cleaned up after processing (see the sketch below).
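
That create-then-clean-up behaviour might look roughly like this (illustrative, not the PR's exact code):

    import shutil
    from pathlib import Path

    temp_dir = Path("s3_temp_storage")
    temp_dir.mkdir(exist_ok=True)
    try:
        pass  # download from S3 and process the documents here
    finally:
        # Remove the temporary storage even if processing fails.
        shutil.rmtree(temp_dir, ignore_errors=True)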


larinam commented Jan 21, 2024

To me, the following parameters look more logical to pass as command-line parameters rather than as environment variables:
S3_BUCKET
S3_SAVE_FOLDER
AWS_ASSUME_ROLE_PROFILE

from boto3.s3.transfer import TransferConfig

# When override=True .env variables take precedence over environment. https://pypi.org/project/python-dotenv/
dotenv.load_dotenv(override=True)
Collaborator

Why has this changed? Default logic looks reasonable.

Collaborator Author

Without override=True, the environment takes precedence over the .env variables. If a user changes their AWS credentials in the .env file, I wanted the change to take effect immediately. 😀

Collaborator

Sorry, I didn't get the justification and the use case.

Collaborator Author

TL;DR: if override is not set to True, AWS credentials changed in .env will not take effect on subsequent script runs.

Use case: a user wants to change credentials in their .env file between script runs.
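
A minimal demonstration of python-dotenv's override behaviour, using an in-memory stream instead of a real .env file:

    import os
    from io import StringIO
    import dotenv

    os.environ["AWS_ACCESS_KEY_ID"] = "key-from-shell"
    env = StringIO("AWS_ACCESS_KEY_ID=key-from-dotenv")

    dotenv.load_dotenv(stream=env)             # default override=False
    print(os.environ["AWS_ACCESS_KEY_ID"])     # key-from-shell (environment wins)

    env.seek(0)
    dotenv.load_dotenv(stream=env, override=True)
    print(os.environ["AWS_ACCESS_KEY_ID"])     # key-from-dotenv (.env wins)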

Collaborator

I don't see a clear justification for this change. I trust the default behaviour is more generally acceptable, in this case as well.
Please provide a clear use case or remove this change.

scripts/ingest.py (outdated; resolved)
scripts/ingest.py (outdated; resolved)
scripts/parser/open_ai_func.py (resolved)
@jaredbradley243 (Collaborator Author)

Thanks for the change requests, @larinam. I'll get to these soon!

@jaredbradley243 (Collaborator Author)

Working on these now.

@jaredbradley243 (Collaborator Author)

To me, the following parameters look more logical to pass as command-line parameters rather than as environment variables: S3_BUCKET, S3_SAVE_FOLDER, AWS_ASSUME_ROLE_PROFILE

I added command-line parameter support as well as env variables. 😁

This version includes support for command-line arguments for s3_bucket, s3_documents_folder, s3_save_folder, s3_assume_role, and aws_assume_role_profile (see the sketch below).

S3 functionality has also been refactored into two functions: one for download, located in ingest.py, and one for upload, located in open_ai_func.py.
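
A rough sketch of how those flags might be wired up. The PR's actual CLI layer may differ (the ingest.py ingest command suggests a subcommand framework), so this plain-argparse version is only illustrative:

    import argparse
    import os

    parser = argparse.ArgumentParser(description="Ingest documents into DocsGPT")
    parser.add_argument("--s3", action="store_true",
                        help="Read documents from and write vectors to S3")
    parser.add_argument("--s3-assume", action="store_true",
                        help="Assume an IAM role for S3 access")
    # Command-line values take precedence; environment variables are fallbacks.
    parser.add_argument("--s3-bucket", default=os.environ.get("S3_BUCKET"))
    parser.add_argument("--s3-documents-folder",
                        default=os.environ.get("S3_DOCUMENTS_FOLDER", ""))
    parser.add_argument("--s3-save-folder",
                        default=os.environ.get("S3_SAVE_FOLDER", ""))
    parser.add_argument("--aws-assume-role-profile",
                        default=os.environ.get("AWS_ASSUME_ROLE_PROFILE"))
    args = parser.parse_args()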

larinam commented Mar 30, 2024

In general, according to the ticket, support for S3 should be added to the application as well. In this PR the changes are only in the scripts package. Please refer to the initial ticket for more information.


larinam commented Mar 30, 2024

To avoid passing the S3 config through several calls, please consider using https://docs.pydantic.dev/latest/concepts/pydantic_settings/
The configuration is static and doesn't change through the lifecycle of the script, so please consider using a static settings object to manage these settings (see the sketch below).
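
Something along these lines with pydantic_settings; the field names are assumed to match the env variables discussed above:

    from pydantic_settings import BaseSettings, SettingsConfigDict

    class S3Settings(BaseSettings):
        # Populated once from the environment / .env at startup;
        # matching of env var names is case-insensitive by default.
        model_config = SettingsConfigDict(env_file=".env")

        s3_bucket: str = ""
        s3_documents_folder: str = ""
        s3_save_folder: str = ""
        aws_assume_role_profile: str = ""

    # A single module-level instance can then be imported everywhere
    # instead of threading the config through function arguments.
    settings = S3Settings()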

@larinam left a comment (Collaborator)

Please review the comments in the code and general comments.


aws_assume_role_profile = aws_assume_role_profile or os.environ.get("AWS_ASSUME_ROLE_PROFILE")

if not s3_bucket:
    print("Error: S3_BUCKET environment variable is not set.")
Collaborator

Please get acquainted with the concept of Errors and Exceptions in Python.

Collaborator

Also the message is not completely correct in the context of the latest changes.
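
For example, the check could raise instead of printing, so callers can react to the failure (the exception type and message here are illustrative):

    if not s3_bucket:
        raise ValueError(
            "No S3 bucket configured: pass --s3-bucket or set the "
            "S3_BUCKET environment variable."
        )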

Labels: scripts
Projects: Status: In Progress
Linked issues: none yet
3 participants