Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Metadata filtering on Pgvector #410

Closed
wants to merge 18 commits into from
Closed

Conversation

humcqc
Copy link
Contributor

@humcqc humcqc commented Mar 26, 2024

  • Implement Metadata filtering on Pgvector with 3 implementations to store Metadata: JSON, JSONB, COLUMNS
  • Add indexes on metadata
  • use Langchain4j metadata filter test
  • Change tests to @QuarkusTest. QuarkusUnitTest extension does not support ParametrizedTest with custom classes as argument. to check with quarkus team.

Not sure if this should be in quarkus-langchain4j or in langchain4j ?
PR open for discussion.

@humcqc humcqc requested a review from a team as a code owner March 26, 2024 14:15
@geoand
Copy link
Collaborator

geoand commented Mar 26, 2024

Thanks!

I'll let @jmartisk review this

@geoand
Copy link
Collaborator

geoand commented Mar 26, 2024

We will definitely need the PR rebased onto main so we can skip the merge commits

@humcqc
Copy link
Contributor Author

humcqc commented Mar 26, 2024

yes, will cleanup after review and comments. I know i should not sync with the github site, but too late

…store Metadata: JSON, JSONB, COLUMNS

- Add indexes on metadata
- use Langchain4j metadata filter test
- Change tests to @QuarkusTest. QuarkusUnitTest extension does not support ParametrizedTest with custom classes as argument.
@jmartisk
Copy link
Collaborator

I'll try to look into it asap, I'm not sure I'll have enough time before leaving until Tuesday.
Maybe @sebastienblanc would like to have a look too, as he implemented the Pgvector store.

@jmartisk
Copy link
Collaborator

I really do like it, very nice job! But yeah a lot of this looks like it could be moved to upstream langchain4j, which currently doesn't support metadata filtering on pgvector.
@langchain4j what do you think?
It could also benefit from the strategies (json, jsonb, columns) that you introduce for storing metadata

@langchain4j
Copy link
Collaborator

Yeah, would be awesome to have Filter mapping logic in https://github.com/langchain4j/langchain4j ! And the rest probably as well :)

@humcqc
Copy link
Contributor Author

humcqc commented Mar 27, 2024

ok will move it to langchain4j, and adapt this one.

@sebastienblanc
Copy link
Contributor

Will be happy to review this PR and also the parts moving to @langchain4j

@humcqc
Copy link
Contributor Author

humcqc commented Apr 2, 2024

last commit integrates : langchain4j/langchain4j#851 , this will no build because last langchain pgvector is not yet available. just for @jmartisk review.

humcqc and others added 4 commits April 2, 2024 14:26
- Fix asciidoctor doc generation: interfaces that extends interfaces  produce duplicated parameter in doc that make the build failing.
- add integration test
- Add buildStep to include dependencies in index
- Add default no args constructor for cdi
- !!! Need to do some tests in native mode.
@jmartisk
Copy link
Collaborator

jmartisk commented Apr 3, 2024

Thanks, I'm going to look into it ASAP

@jmartisk
Copy link
Collaborator

jmartisk commented Apr 3, 2024

I can't get this to work somehow...
Does this add some new requirements on the application to specify some properties?
On startup I get:

Caused by: org.postgresql.util.PSQLException: Unknown type vector.
	at org.postgresql.jdbc.PgPreparedStatement.setPGobject(PgPreparedStatement.java:537)
	at org.postgresql.jdbc.PgPreparedStatement.setObject(PgPreparedStatement.java:1049)
	at io.agroal.pool.wrapper.PreparedStatementWrapper.setObject(PreparedStatementWrapper.java:291)
	at dev.langchain4j.store.embedding.pgvector.PgVectorEmbeddingStore.addAllInternal(PgVectorEmbeddingStore.java:324)
	... 26 more

So I tried adding

quarkus.datasource.jdbc.additional-jdbc-properties.datatype.vector=com.pgvector.PGvector

in the application's properties, but then I get

2024-04-03 14:26:27,559 WARN  [io.agr.pool] (agroal-11) Datasource '<default>': Unable to load the class com.pgvector.PGvector responsible for the datatype vector
(...)
java.lang.RuntimeException: Sql statement issue: org.postgresql.util.PSQLException: Unable to load the class com.pgvector.PGvector responsible for the datatype vector 
Last query: init
	at dev.langchain4j.store.embedding.pgvector.PgVectorEmbeddingStore.initTable(PgVectorEmbeddingStore.java:172)
	at dev.langchain4j.store.embedding.pgvector.PgVectorEmbeddingStore.<init>(PgVectorEmbeddingStore.java:81)
	at io.quarkiverse.langchain4j.pgvector.PgVectorEmbeddingStore.<init>(PgVectorEmbeddingStore.java:43)
	at io.quarkiverse.langchain4j.pgvector.runtime.PgVectorEmbeddingStoreRecorder.lambda$embeddingStoreFunction$1(PgVectorEmbeddingStoreRecorder.java:26)
	at io.quarkiverse.langchain4j.pgvector.PgVectorEmbeddingStore_NcsXENBovnGlSJZ1aaFjoohrtKM_Synthetic_Bean.createSynthetic(Unknown Source)
	... 36 more

Then I tried adding

quarkus.class-loading.parent-first-artifacts=com.pgvector:pgvector

but I am back at the first error

@jmartisk
Copy link
Collaborator

jmartisk commented Apr 3, 2024

To reproduce this problem, for example, take the samples/chatbot application, change the quarkus-langchain4j-redis dependency into quarkus-langchain4j-pgvector and add quarkus.langchain4j.pgvector.dimension=1536 into application.properties and then try mvn quarkus:dev

(I built upstream langchain4j from your commit fcce3f1f26da1f1bba78b51267de6066a260ff1b and quarkus-langchain4j from 2dc8653 before trying this)

@humcqc
Copy link
Contributor Author

humcqc commented Apr 3, 2024

interesting, does the rag-with-pgvector integration test work for you ?
I think you face the same issue when i try to use the extension in my projects, there is some issue with pgvector and classloading.
I've put some question in the review where there are possible issues.

@sebastienblanc
Copy link
Contributor

@humcqc
Copy link
Contributor Author

humcqc commented Apr 3, 2024

@sebastienblanc should be automatic now with #399

@jmartisk
Copy link
Collaborator

jmartisk commented Apr 3, 2024

Right, the integration test is failing too.
The image isn't the problem, it's using pgvector/pgvector:pg16

@humcqc
Copy link
Contributor Author

humcqc commented Apr 3, 2024

Curious they are working for me. Could you show me the error ?
And do you have some glue about extension class loading when an extension (quarkus langchain ) have a dependency ( langchain) that have a dependency (pgvector).
I think there is an issue when we want to configure the agroal datasource with pgvector vector.

@jmartisk
Copy link
Collaborator

jmartisk commented Apr 3, 2024

When running the integration tests I get (this is the relevant part)

2024-04-03 15:12:42,956 INFO  [tc.pgvector/pgvector:pg16] (build-41) Container pgvector/pgvector:pg16 is starting: f1c5cdc29c85b6a1e002dad4a4a1f1b457439d927cc514868338ce6920bfb0bf
2024-04-03 15:12:46,368 INFO  [tc.pgvector/pgvector:pg16] (build-41) Container pgvector/pgvector:pg16 started in PT3.496847295S
2024-04-03 15:12:46,369 INFO  [tc.pgvector/pgvector:pg16] (build-41) Container is started (JDBC URL: jdbc:postgresql://localhost:46373/quarkus?loggerLevel=OFF)
2024-04-03 15:12:46,369 INFO  [io.qua.dev.pos.dep.PostgresqlDevServicesProcessor] (build-41) Dev Services for PostgreSQL started.
2024-04-03 15:12:46,370 INFO  [io.qua.dat.dep.dev.DevServicesDatasourceProcessor] (build-41) Dev Services for default datasource (postgresql) started - container ID is f1c5cdc29c85
2024-04-03 15:12:51,787 INFO  [ai.djl.uti.Platform] (main) Found matching platform from: jar:file:///home/jmartisk/.m2/repository/ai/djl/huggingface/tokenizers/0.26.0/tokenizers-0.26.0.jar!/native/lib/tokenizers.properties
2024-04-03 15:12:52,054 WARN  [io.agr.pool] (main) Datasource '<default>': JDBC resources leaked: 0 ResultSet(s) and 1 Statement(s)
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 14.37 s <<< FAILURE! -- in org.acme.example.openai.RAGWithMetadataFilterTest
[ERROR] org.acme.example.openai.RAGWithMetadataFilterTest -- Time elapsed: 14.37 s <<< ERROR!
java.lang.RuntimeException: java.lang.RuntimeException: Failed to start quarkus
	at io.quarkus.test.junit.QuarkusTestExtension.throwBootFailureException(QuarkusTestExtension.java:638)
	at io.quarkus.test.junit.QuarkusTestExtension.interceptBeforeAllMethod(QuarkusTestExtension.java:706)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
Caused by: java.lang.RuntimeException: Failed to start quarkus
	at io.quarkus.runner.ApplicationImpl.doStart(Unknown Source)
	at io.quarkus.runtime.Application.start(Application.java:101)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at io.quarkus.runner.bootstrap.StartupActionImpl.run(StartupActionImpl.java:285)
	at io.quarkus.test.junit.QuarkusTestExtension.doJavaStart(QuarkusTestExtension.java:252)
	at io.quarkus.test.junit.QuarkusTestExtension.ensureStarted(QuarkusTestExtension.java:605)
	at io.quarkus.test.junit.QuarkusTestExtension.beforeAll(QuarkusTestExtension.java:655)
	... 1 more
Caused by: java.lang.RuntimeException: org.postgresql.util.PSQLException: Unknown type vector.
	at dev.langchain4j.store.embedding.pgvector.PgVectorEmbeddingStore.addAllInternal(PgVectorEmbeddingStore.java:345)
	at dev.langchain4j.store.embedding.pgvector.PgVectorEmbeddingStore.addInternal(PgVectorEmbeddingStore.java:294)
	at dev.langchain4j.store.embedding.pgvector.PgVectorEmbeddingStore.add(PgVectorEmbeddingStore.java:210)
	at dev.langchain4j.store.embedding.pgvector.PgVectorEmbeddingStore.add(PgVectorEmbeddingStore.java:36)
	at io.quarkiverse.langchain4j.pgvector.PgVectorEmbeddingStore_NcsXENBovnGlSJZ1aaFjoohrtKM_Synthetic_ClientProxy.add(Unknown Source)
	at org.acme.example.EmbeddingStoreIngestor.embed(EmbeddingStoreIngestor.java:29)
	at org.acme.example.EmbeddingStoreIngestor.ingest(EmbeddingStoreIngestor.java:22)
	at org.acme.example.EmbeddingStoreIngestor_Observer_ingest_UItmqUFF_K3-3VnNeybAPBBp55c.notify(Unknown Source)
	at io.quarkus.arc.impl.EventImpl$Notifier.notifyObservers(EventImpl.java:346)
	at io.quarkus.arc.impl.EventImpl$Notifier.notify(EventImpl.java:328)
	at io.quarkus.arc.impl.EventImpl.fire(EventImpl.java:82)
	at io.quarkus.arc.runtime.ArcRecorder.fireLifecycleEvent(ArcRecorder.java:155)
	at io.quarkus.arc.runtime.ArcRecorder.handleLifecycleEvents(ArcRecorder.java:106)
	at io.quarkus.deployment.steps.LifecycleEventsBuildStep$startupEvent1144526294.deploy_0(Unknown Source)
	at io.quarkus.deployment.steps.LifecycleEventsBuildStep$startupEvent1144526294.deploy(Unknown Source)
	... 8 more
Caused by: org.postgresql.util.PSQLException: Unknown type vector.
	at org.postgresql.jdbc.PgPreparedStatement.setPGobject(PgPreparedStatement.java:537)
	at org.postgresql.jdbc.PgPreparedStatement.setObject(PgPreparedStatement.java:1049)
	at io.agroal.pool.wrapper.PreparedStatementWrapper.setObject(PreparedStatementWrapper.java:291)
	at dev.langchain4j.store.embedding.pgvector.PgVectorEmbeddingStore.addAllInternal(PgVectorEmbeddingStore.java:324)
	... 22 more

@humcqc
Copy link
Contributor Author

humcqc commented Apr 3, 2024

weird:

2024-04-03 14:53:24,758 INFO  [tc.pgvector/pgvector:pg16] (build-86) Reusing container with ID: b263c49bad38c2cde3f87e32ece37bd12cb50e787a1e49de89433ce85a44a3be and hash: 40c16b4a38334133b9e756b168c97be7428fecb2
2024-04-03 14:53:24,759 INFO  [tc.pgvector/pgvector:pg16] (build-86) Reusing existing container (b263c49bad38c2cde3f87e32ece37bd12cb50e787a1e49de89433ce85a44a3be) and not creating a new one
2024-04-03 14:53:25,080 INFO  [tc.pgvector/pgvector:pg16] (build-86) Container pgvector/pgvector:pg16 started in PT1.305414S
2024-04-03 14:53:25,080 INFO  [tc.pgvector/pgvector:pg16] (build-86) Container is started (JDBC URL: jdbc:postgresql://localhost:57599/quarkus?loggerLevel=OFF)
2024-04-03 14:53:25,080 INFO  [io.qua.dev.pos.dep.PostgresqlDevServicesProcessor] (build-86) Dev Services for PostgreSQL started.
2024-04-03 14:53:25,081 INFO  [io.qua.dat.dep.dev.DevServicesDatasourceProcessor] (build-86) Dev Services for default datasource (postgresql) started - container ID is b263c49bad38
2024-04-03 14:53:33,574 INFO  [ai.djl.uti.Platform] (main) Found matching platform from: jar:file:///Users/humberto/.m2/repository/ai/djl/huggingface/tokenizers/0.26.0/tokenizers-0.26.0.jar!/native/lib/tokenizers.properties
2024-04-03 14:53:34,417 INFO  [io.quarkus] (main) quarkus-langchain4j-integration-test-rag-pgvector 999-SNAPSHOT on JVM (powered by Quarkus 3.8.2) started in 16.997s. Listening on: http://localhost:8081
2024-04-03 14:53:34,425 INFO  [io.quarkus] (main) Profile test activated. 
2024-04-03 14:53:34,425 INFO  [io.quarkus] (main) Installed features: [agroal, awt, cdi, jdbc-postgresql, langchain4j, langchain4j-openai, langchain4j-pgvector, narayana-jta, poi, qute, rest-client-reactive, rest-client-reactive-jackson, resteasy-reactive, resteasy-reactive-jackson, smallrye-context-propagation, vertx]
WARNING: A Java agent has been loaded dynamically (/Users/humberto/.m2/repository/net/bytebuddy/byte-buddy-agent/1.14.10/byte-buddy-agent-1.14.10.jar)
WARNING: If a serviceability tool is in use, please run with -XX:+EnableDynamicAgentLoading to hide this warning
WARNING: If a serviceability tool is not in use, please run with -Djdk.instrument.traceUsage for more information
WARNING: Dynamic loading of agents will be disallowed by default in a future release
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 24.51 s -- in org.acme.example.openai.RAGWithMetadataFilterTest
2024-04-03 14:53:36,369 INFO  [io.quarkus] (main) quarkus-langchain4j-integration-test-rag-pgvector stopped in 0.091s
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0

Perhaps link to the fact it re-use a previous pg16 container where connections have been set correctly in my side.

Let me remove the container.

By the way could you check my review question , I think it's link to the issue.

…roduce bad behavior during first getConnection

- add vector management in LangChain4jPgVectorBaseTest#clearstore in case it's the first call to the datasource.
- add pgvector artifact classloader setting in runtime pom instead of application
- rename PgVectorEmbeddingStore as QuarkusPgVectorEmbeddingStore
@humcqc
Copy link
Contributor Author

humcqc commented Apr 4, 2024

I've change the way we handle connection settings. some weird behavior at the first usage of the store in certain condition.
Should be ok.
I still need to find a better way when the db settings and the table is created by external script.

…groalpoollistener

- remove other settings.
- getConnection return just the connection
- Add an integretion test when embeddings table created by flyway.
@humcqc
Copy link
Contributor Author

humcqc commented Apr 4, 2024

@jmartisk , @sebastienblanc, should be ok. If you can give a try to validate your external project. I've added 2 integrations tests for rag integration. one with automatic table creation and another with flyway. For the connection settings I use a pool listener to configure pgvector settings. Then getConnection just give datasource connection. I've tested with internal test and external project. And with fresh pg containers. First time I have this behavior.

@jmartisk
Copy link
Collaborator

jmartisk commented Apr 5, 2024

Thanks, I will look into it today!

@jmartisk
Copy link
Collaborator

jmartisk commented Apr 5, 2024

Truly outstanding work, and it works for me now!

I have just a few comments.
On this PR, I added a few Javadoc clarification suggestions.

On the upstream PR, I commented that we need to avoid using GSON directly (langchain4j/langchain4j#851 (comment)) - after fixing that, could you please add an exclusion for gson, just like we have it in the OpenAI module (https://github.com/quarkiverse/quarkus-langchain4j/blob/0.10.3/openai/openai-common/runtime/pom.xml#L42) - we have a dependency on jackson which should be picked up instead of GSON.

@jmartisk
Copy link
Collaborator

jmartisk commented Apr 5, 2024

Works for me!
I won't give a +1 yet because we need to wait for this to be in a released version of langchain4j, but given the current status of this PR plus the current status of langchain4j/langchain4j#851, I think it's good.
Thanks so much, this is great.

@geoand
Copy link
Collaborator

geoand commented Apr 22, 2024

What's the status of this?

@langchain4j
Copy link
Collaborator

@geoand I have to re-review it. How urgent is it?

@geoand
Copy link
Collaborator

geoand commented Apr 22, 2024

Not terrible urgent from point of view, but it does seem like a very contribution

@humcqc
Copy link
Contributor Author

humcqc commented Apr 25, 2024

For this one i will create a new PR when langchain4j will be released with : langchain4j/langchain4j#851

langchain4j pushed a commit to langchain4j/langchain4j that referenced this pull request Apr 30, 2024
… JSONB and Columns (#851)

Implement metadata filtering for PgVectorEmbeddingStore with 3 implementations: COLUMN_PER_KEY, COMBINED_JSON and COMBINED_JSONB

Related: quarkiverse/quarkus-langchain4j#410
@humcqc
Copy link
Contributor Author

humcqc commented May 15, 2024

Waiting for langchain4J 0.31.0

@langchain4j
Copy link
Collaborator

@humcqc will be released next week

@humcqc
Copy link
Contributor Author

humcqc commented May 23, 2024

Too many conflicts.
Replaced by #619
This PR will be closed soon

@humcqc humcqc closed this May 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants