PGVector Support for Custom Connection Object #2566

Merged
merged 40 commits into from
May 24, 2024

Conversation

Knucklessg1
Collaborator

@Knucklessg1 Knucklessg1 commented May 1, 2024

Why are these changes needed?

This PR adds support for custom psycopg connections.

A user can define the connection object.

This is important because some environments require a highly customized connection object, so the end user should be able to supply one that suits their environment.

It also includes a fix to .gitattributes so that certain files are committed with LF line endings instead of CRLF. (CRLF endings were breaking bash scripts in the repo.)

It also includes a fix so that the psycopg[binary] dependency is installed on Windows and Mac; Linux can use the pure Python implementation, psycopg.
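
The platform-dependent dependency choice described above could be expressed in a setup script roughly as follows. This is a minimal sketch, not the code from this PR: the function name `pgvector_requirements` and the unpinned requirement strings are assumptions made for illustration.

```python
import platform


def pgvector_requirements(system=None):
    """Return the psycopg requirement for an OS name (hypothetical helper)."""
    # platform.system() returns "Windows", "Darwin" (macOS), or "Linux".
    system = system or platform.system()
    if system in ("Windows", "Darwin"):
        # Windows and Mac get the prebuilt binary package.
        return ["psycopg[binary]"]
    # Everything else, including Linux, gets the pure Python implementation.
    return ["psycopg"]


print(pgvector_requirements())
```

Taking the OS name as an optional argument keeps the choice easy to check on any machine, whatever platform the script runs on.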

import os

import psycopg
from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent

# connection_string_encoded and config_list are assumed to be defined earlier.
# Create the connection yourself, configured however your environment requires.
conn = psycopg.connect(conninfo=connection_string_encoded, autocommit=True)

ragproxyagent = RetrieveUserProxyAgent(
    name="ragproxyagent",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=1,
    retrieve_config={
        "task": "code",
        "docs_path": [
            "https://raw.githubusercontent.com/microsoft/FLAML/main/website/docs/Examples/Integrate%20-%20Spark.md",
            "https://raw.githubusercontent.com/microsoft/FLAML/main/website/docs/Research.md",
            os.path.join(os.path.abspath(""), "..", "website", "docs"),
        ],
        "custom_text_types": ["non-existent-type"],
        "chunk_token_size": 2000,
        "model": config_list[0]["model"],
        "vector_db": "pgvector",  # PGVector database
        "db_config": {
            "conn": conn,  # the user-supplied connection object
        },
        "get_or_create": True,  # set to False if you don't want to reuse an existing collection
        "overwrite": False,  # set to True if you want to overwrite an existing collection
    },
    code_execution_config=False,  # set to False if you don't want to execute the code
)

The connection object is then passed to the retrieve agent through db_config, as shown above.

This also contains a fix for the psycopg.connect() using the username field directly.
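
The example above assumes that connection_string_encoded already exists. One way to build such a string is to percent-encode the username and password so that special characters do not break the URI. This is a sketch under assumptions: the helper name `build_conninfo` and all credential values are hypothetical placeholders, not taken from this PR.

```python
from urllib.parse import quote


def build_conninfo(user, password, host, port, dbname):
    """Build a PostgreSQL connection URI with percent-encoded credentials."""
    # safe="" makes quote() encode reserved characters such as "@", ":" and "/".
    user_enc = quote(user, safe="")
    password_enc = quote(password, safe="")
    return f"postgresql://{user_enc}:{password_enc}@{host}:{port}/{dbname}"


# Placeholder values for illustration only; read real credentials from the environment.
connection_string_encoded = build_conninfo("postgres", "p@ss:word", "localhost", 5432, "postgres")
print(connection_string_encoded)
# postgresql://postgres:p%40ss%3Aword@localhost:5432/postgres
```

The resulting string can then be passed to psycopg.connect(conninfo=..., autocommit=True) as in the example above.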

Related issue number

NA

Checks


gitguardian bot commented May 1, 2024

⚠️ GitGuardian has uncovered 8 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
| --- | --- | --- | --- | --- |
| 10493810 | Triggered | Generic Password | 8d19f65 | test/agentchat/contrib/vectordb/test_pgvectordb.py |
| 10493810 | Triggered | Generic Password | 4b7ba2b | notebook/agentchat_pgvector_RetrieveChat.ipynb |
| 10493810 | Triggered | Generic Password | 4b7ba2b | notebook/agentchat_pgvector_RetrieveChat.ipynb |
| 10493810 | Triggered | Generic Password | 4b7ba2b | notebook/agentchat_pgvector_RetrieveChat.ipynb |
| 10493810 | Triggered | Generic Password | fdbc3d5 | test/agentchat/contrib/vectordb/test_pgvectordb.py |
| 10493810 | Triggered | Generic Password | 6e91d73 | notebook/agentchat_pgvector_RetrieveChat.ipynb |
| 10493810 | Triggered | Generic Password | 10e2c2e | notebook/agentchat_pgvector_RetrieveChat.ipynb |
| 10493810 | Triggered | Generic Password | 10e2c2e | notebook/agentchat_pgvector_RetrieveChat.ipynb |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely. Learn here the best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@Knucklessg1 Knucklessg1 self-assigned this May 1, 2024
@Knucklessg1 Knucklessg1 added rag retrieve-augmented generative agents vectordb Vector Databases labels May 1, 2024
@ekzhu ekzhu requested a review from thinkall May 1, 2024 23:13
@sonichi sonichi added this pull request to the merge queue May 24, 2024
Merged via the queue into main with commit 6604ca5 May 24, 2024
79 of 93 checks passed
@sonichi sonichi deleted the pyscopg_auth_fix branch May 24, 2024 18:05
jayralencar pushed a commit to jayralencar/autogen that referenced this pull request May 28, 2024
* Added fixes and tests for basic auth format

* User can provide their own connection object. Added test for it.

* Updated instructions on how to use. Fully tested all 3 authentication methods successfully.

* Get password from gitlab secrets.

* Hide passwords.

* Update notebook/agentchat_pgvector_RetrieveChat.ipynb

Co-authored-by: Li Jiang <bnujli@gmail.com>

* Hide passwords.

* Added connection_string test. 3 tests total for auth.

* Fixed quotes on db config params. No other changes found.

* Ran notebook

* Ran pre-commits and updated setup to include psycopg[binary] for windows and mac.

* Corrected list extension.

* Separate connection establishment function. Testing pending.

* Fixed pgvectordb auth

* Update agentchat_pgvector_RetrieveChat.ipynb

Added autocommit=True in example

* Rerun notebook

---------

Co-authored-by: Li Jiang <bnujli@gmail.com>
Co-authored-by: Li Jiang <lijiang1@microsoft.com>

chenyanbiao commented May 31, 2024

Hi @Knucklessg1 thanks for this awesome added feature!

Not sure if this is the right place to ask, but I would appreciate any help. Is chunk_token_size used to split docs when pgvector is the vector database? I can't find the code where it splits based on chunk token size (the normal usage for local files); instead it seems to use the model's max tokens by default for each doc (link), which means that the full local docs/files are added to the vectordb and fed into the context directly based on vector distance.

@Knucklessg1
Collaborator Author

@chenyanbiao did you take a look at the retrieve_utils.py?

This is where the logic for the split is happening. It's split the same way regardless of vectordb backend.

@chenyanbiao

@Knucklessg1 Thanks for the response. Yes, that is what I am looking at. I understand that both paths use the same splitting logic. My confusion is that the non-vectordb solution passes the chunk_token_size parameter (link), while the vectordb solution passes max_token to the split function (link), which is inconsistent.
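
To make the difference concrete, here is a toy illustration of how the token limit handed to a splitter changes the resulting chunks. It is not the logic in retrieve_utils.py: `split_by_tokens` is a hypothetical helper that counts whitespace-separated words as tokens, and the numbers are made up.

```python
def split_by_tokens(text, max_tokens):
    """Split text into chunks of at most max_tokens whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


doc = "word " * 10_000  # a document of 10,000 toy "tokens"

# A small chunk size produces several small chunks.
print(len(split_by_tokens(doc, 2000)))  # 5
# A large, model-sized limit leaves the document almost whole.
print(len(split_by_tokens(doc, 8000)))  # 2
```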

6 participants