Fixes #3971: Check how to integrate vector databases via rest APIs #4059

vga91 · 2024-05-02T09:41:01Z

Changes

Created ad-hoc procedures for Chroma, Qdrant and Weaviate.

They emulate the Neo4j vector index commands described at https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/, as follows:

| Neo4j Vector Index | Vector database correspondent |
|---|---|
| `CREATE VECTOR INDEX` | `apoc.vectordb.qdrant.createCollection` |
| `DROP VECTOR INDEX` | `apoc.vectordb.qdrant.deleteCollection` |
| Add vector node/rel | `apoc.vectordb.qdrant.upsert` |
| `CALL db.index.vector.queryNodes` / `CALL db.index.vector.queryRelationships` | `apoc.vectordb.qdrant.get` and `apoc.vectordb.qdrant.query` |
| Delete vector node/rel | `apoc.vectordb.qdrant.delete` |

The same correspondence applies to the ChromaDB and Weaviate procedures.

NOTE: Like the apoc.ml ones, the Chroma, Qdrant and Weaviate procedures are implemented so that they share the same signature, even though under the hood they have different bodies/methods/etc.

Added 2 custom procedures, apoc.vectordb.custom.get and apoc.vectordb.custom, to handle other vector databases (like Pinecone, tested in PineconeTest).

Using the apoc.vectordb.*.get and apoc.vectordb.*.query procedures, we can auto-create neo4j vector indexes and entities, using the mapping config.

NOTE: by default, the apoc.vectordb.*.get and apoc.vectordb.*.query procedures return only score, metadata and entity. To get the other results as well, set the config allResults: true.

To evaluate

apoc.vectordb.custom could be changed to a more generic naming, e.g. apoc.restapi.custom(<conf>), since it could be used with other rest APIs
move RestAPIConfig to util package

Additional notes (after PR merge)

Open a follow-up issue:
Test / custom procedures with other databases (like Pinecone)
Added trello Core card: problem with Pinecone, create a PR after neo4j-contrib PR creation...
We cannot execute the Pinecone fetch API with method: "", because of these 2 pieces of APOC core code:
- setDoOutput(true)
- http.setChunkedStreamingMode(1024 * 1024);
  In both cases, we receive a 200 OK, but with no results.

jexp · 2024-05-15T22:09:57Z

extended/src/main/java/apoc/vectordb/ChromaDb.java

+                                    @Name(value = "configuration", defaultValue = "{}") Map<String, Object> configuration) throws Exception {
+        var config = new HashMap<>(configuration);
+
+        String qdrantUrl = getChromaUrl(hostOrKey);


copy & paste typo - not qdrant :)

jexp · 2024-05-15T22:11:03Z

extended/src/main/java/apoc/vectordb/ChromaDb.java

+            @Name(value = "configuration", defaultValue = "{}") Map<String, Object> configuration) throws Exception {
+        var config = new HashMap<>(configuration);
+
+        String qdrantUrl = getChromaUrl(hostOrKey);


jexp · 2024-05-15T22:12:31Z

extended/src/main/java/apoc/vectordb/ChromaDb.java

+    public URLAccessChecker urlAccessChecker;
+
+    @Procedure("apoc.vectordb.chroma.createCollection")
+    @Description("apoc.vectordb.chroma.createCollection(hostOrKey, collection, similarity, size, $config)")


can we have a bit better descriptions (for all the procedures), not just the signature again? otherwise the apoc.help output is not really informative if it shows the same content twice without a human description

jexp · 2024-05-15T22:17:14Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+    }
+
+    private static Entity handleMappingNode(Transaction tx, GraphDatabaseService db, VectorMappingConfig mapping, Map<String, Object> metaProps, List<Double> embedding) {
+        String query = "CREATE CONSTRAINT IF NOT EXISTS FOR (n:%s) REQUIRE n.%s IS UNIQUE"


did you test that you can run both the constraint as well as the data creation operation in the same tx?

shouldn't we leave that to the user to create the constraint, otherwise it would do it for every entity

jexp · 2024-05-15T22:18:22Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+                transaction.commit();
+            }
+
+            String setVectorQuery = "CALL db.create.setNodeVectorProperty($entity, $key, $vector)";


we can set the property to a float array ourselves, no need to call cypher here.

jexp · 2024-05-15T22:19:00Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+
+    private static Entity handleMappingRel(Transaction tx, GraphDatabaseService db, VectorMappingConfig mapping, Map<String, Object> metaProps, List<Double> embedding) {
+        try {
+            String query = "CREATE CONSTRAINT IF NOT EXISTS FOR ()-[r:%s]-() REQUIRE (r.%s) IS UNIQUE"


same as above I don't think we need to do that

jexp · 2024-05-15T22:20:05Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+            // in this case we cannot auto-create the rel, since we should have to define start and end node as well
+            Relationship rel;
+            try (Transaction transaction = db.beginTx()) {
+                Object propValue = metaProps.remove(mapping.getId());


should we really remove the mapping-id ? if we later return the metadata that's missing?

jexp · 2024-05-15T22:20:24Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+            try (Transaction transaction = db.beginTx()) {
+                Object propValue = metaProps.remove(mapping.getId());
+                rel = transaction.findRelationship(RelationshipType.withName(mapping.getType()), mapping.getProp(), propValue);
+                if (rel != null) {


should this not only happen when "create: true" is set?

jexp · 2024-05-15T22:20:49Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+                transaction.commit();
+            }
+
+            String setVectorQuery = "CALL db.create.setRelationshipVectorProperty($entity, $key, $vector)";


we can set the float array property in the same tx above

jexp · 2024-05-15T22:21:24Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+                    node = transaction.createNode(Label.label(mapping.getLabel()));
+                    node.setProperty(mapping.getProp(), propValue);
+                }
+                if (node != null) {


why do we write properties if create is not set to true? then we should just return the found node

I think we should only populate a node when create is true
alternatively we could have 3 modes (create / update / read) with read the default
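
A minimal sketch of the three-mode idea above, assuming a hypothetical `mode` key in the mapping config (neither the key name nor the `MappingMode` enum exists in the PR). It only shows how `read` could be the default and how the write paths could be gated on the mode:

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical enum for illustration; not part of the APOC code base.
enum MappingMode {
    READ, UPDATE, CREATE;

    // Reads the assumed "mode" config key; a missing key falls back to READ.
    // An unknown value makes valueOf throw IllegalArgumentException.
    static MappingMode fromConfig(Map<String, Object> config) {
        Object raw = config.get("mode");
        if (raw == null) {
            return READ;
        }
        return valueOf(raw.toString().trim().toUpperCase(Locale.ROOT));
    }

    // Only UPDATE and CREATE are allowed to touch the graph.
    boolean writesToGraph() {
        return this != READ;
    }
}
```

With something like this, the node handling could return the found node untouched unless `writesToGraph()` is true, and create a missing node only in `CREATE` mode.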

jexp · 2024-05-15T22:21:47Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+        try {
+            Node node;
+            try (Transaction transaction = db.beginTx()) {
+                Object propValue = metaProps.remove(mapping.getId());


as below we should not remove the mapping id from the metadata

jexp · 2024-05-15T22:22:57Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+        }
+
+        db.executeTransactionally(setVectorQuery,
+                Map.of("entity", Util.rebind(tx, entity), "key", mapping.getEmbeddingProp(), "vector", embedding));


make sure to turn the double list into a float array

and just set the float array directly as property

jexp · 2024-05-15T22:26:10Z

docs/asciidoc/modules/ROOT/pages/database-integration/vectordb.adoc

+
+APOC provides these set of procedures, which leverages the Rest APIs, to interact with Vector Databases:
+
+- `apoc.vectordb.qdrant.*` (to interact with https://qdrant.tech/documentation/overview/[Qdrant])


add pinecone to docs

jexp · 2024-05-15T22:28:01Z

extended/src/main/java/apoc/vectordb/VectorDbUtil.java

+     * @param entity we cannot declare entity with class Entity, 
+     *               as an error `cannot be converted to a Neo4j type: Don't know how to map `org.neo4j.graphdb.Entity` to the Neo4j Type` would be thrown
+     */
+    public record EmbeddingResult(


could we have two fields one for Node and one for Relationship
where one or the other is null?

otherwise Cypher cannot do anything with that Object result and you have to first call convert.toNode which would be really annoying.
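
One possible shape for such a result, as a sketch only. `NodeRef` and `RelRef` are stand-ins for `org.neo4j.graphdb.Node` and `org.neo4j.graphdb.Relationship` so that the example compiles on its own, and the record name, field names and factory methods are assumptions rather than the PR's actual code:

```java
import java.util.Map;

// Stand-ins for the Neo4j Node / Relationship types (hypothetical, for a self-contained sketch).
interface NodeRef {}
interface RelRef {}

// Result row with two typed entity fields; at most one of them is non-null.
record EmbeddingRow(double score, Map<String, Object> metadata, NodeRef node, RelRef rel) {

    // Compact constructor: reject rows that carry both a node and a relationship.
    EmbeddingRow {
        if (node != null && rel != null) {
            throw new IllegalArgumentException("only one of node / rel may be set");
        }
    }

    static EmbeddingRow ofNode(double score, Map<String, Object> metadata, NodeRef node) {
        return new EmbeddingRow(score, metadata, node, null);
    }

    static EmbeddingRow ofRel(double score, Map<String, Object> metadata, RelRef rel) {
        return new EmbeddingRow(score, metadata, null, rel);
    }

    // No mapping configured: both entity fields stay null.
    static EmbeddingRow withoutEntity(double score, Map<String, Object> metadata) {
        return new EmbeddingRow(score, metadata, null, null);
    }
}
```

With typed fields, a caller could yield `node` or `rel` directly instead of converting a generic entity value first.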

jexp · 2024-05-15T22:28:36Z

extended/src/main/java/apoc/vectordb/VectorEmbedding.java

+    enum Type {
+        CHROMA(new ChromaEmbeddingType()),
+        QDRANT(new QdrantEmbeddingType()),
+        WEAVIATE(new WeaviateEmbeddingType());


jexp · 2024-05-15T22:29:46Z

extended/src/test/java/apoc/vectordb/PineconeTest.java

+import static org.junit.Assert.assertTrue;
+
+/**
+ * It leverages `apoc.vectordb.custom*` procedures


shouldn't we have a dedicated pinecone procedures set?

jexp

Please see my comments

jexp · 2024-05-22T11:09:54Z

docs/asciidoc/modules/ROOT/pages/database-integration/vectordb/index.adoc

+- `apoc.vectordb.store` (to store host, credentials and mapping into the system database)
+
+All the procedures, except the `apoc.vectordb.store` one, can have, as a final parameter,
+a configuration map with these possible parameters:


possible -> optional

jexp · 2024-05-22T11:10:11Z

docs/asciidoc/modules/ROOT/pages/database-integration/vectordb/index.adoc

+| headers | additional HTTP headers
+| method | HTTP method
+| endpoint | endpoint key, 
+    can be used to override the default endpoint created via the 1st parameter of the `apoc.vectordb.qdrant.*` and `apoc.vectordb.qdrant.*`,


jexp · 2024-05-22T11:10:31Z

docs/asciidoc/modules/ROOT/pages/database-integration/vectordb/index.adoc

+    can be used to override the default endpoint created via the 1st parameter of the `apoc.vectordb.qdrant.*` and `apoc.vectordb.qdrant.*`,
+    to handle potential endpoint changes.
+| body | body HTTP request
+| jsonPath | To customize https://github.com/json-path/JsonPath[JSONPath] of the response. The default is `null`.


JSONPath parsing of the response

jexp · 2024-05-22T11:11:13Z

docs/asciidoc/modules/ROOT/pages/database-integration/vectordb/index.adoc

+    See examples below.
+|===
+
+include::./qdrand.adoc[]


qdrant file name typo?

jexp · 2024-05-22T11:11:46Z

docs/asciidoc/modules/ROOT/pages/database-integration/vectordb/index.adoc

+
+include::./custom.adoc[]
+
+== Store Vector db info (i.e. `apoc.vectordb.store`) 


Store is an implementation detail
What it's about for the user is configure?

jexp · 2024-05-22T11:13:15Z

docs/asciidoc/modules/ROOT/pages/database-integration/vectordb/index.adoc

+We can save some info in the System Database to be reused later, that is the host, login credentials, and mapping,
+to be used in `*.get` and `.*query` procedures, except for the `apoc.vectordb.custom.get` one.
+
+Therefore, to store the vector info, we can execute the `CALL apoc.vectordb.store(vectorName, host, credentialsValue, mapping)`,


shouldn't I be able to use multiple vector names for the same provider?
like "qdrant" + "books" or "qdrant" + "papers" ? we should not limit it to one only.
so that instead of host+key I would use "books" or "papers"
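
A rough sketch of what that could look like, using an in-memory map as a stand-in for the system database. `StoredVectorDb`, `VectorDbRegistry` and `resolveHost` are hypothetical names for illustration, and the host values in the usage note are made up:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stored entry: the provider it targets plus host, credentials and mapping.
record StoredVectorDb(String provider, String host, String apiKey, Map<String, Object> mapping) {}

// In-memory stand-in for the system-database storage; keys are user-chosen names.
final class VectorDbRegistry {
    private final Map<String, StoredVectorDb> entries = new HashMap<>();

    // Several names may point at the same provider, e.g. "books" and "papers" for Qdrant.
    void store(String name, StoredVectorDb entry) {
        entries.put(name, entry);
    }

    // Resolves the first procedure argument: a stored name wins, otherwise it is used as a raw host.
    String resolveHost(String hostOrKey) {
        return Optional.ofNullable(entries.get(hostOrKey))
                .map(StoredVectorDb::host)
                .orElse(hostOrKey);
    }
}
```

Storing `"books"` and `"papers"` as two Qdrant entries would then let a call pass either name as `hostOrKey`, while a plain URL would still be used unchanged.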

jexp · 2024-05-22T11:14:07Z

docs/asciidoc/modules/ROOT/pages/database-integration/vectordb/index.adoc

+Therefore, to store the vector info, we can execute the `CALL apoc.vectordb.store(vectorName, host, credentialsValue, mapping)`,
+where `vectorName` can be "QDRANT", "CHROMA" or "WEAVIATE", 
+that indicates info to be reused respectively by `apoc.vectordb.qdrant.*`, `apoc.vectordb.chroma.*` and `apoc.vectordb.weaviate.*`.
+Then `host` is the host base name, `credentialsValue` is the API key and `mapping` is a map that can be used instead of the homonym `embeddingConfig` parameter.


~~homonym~~
we should have mapping throughout when it's about the mapping
embeddingConfig is a different topic, no?

jexp · 2024-05-22T11:14:30Z

docs/asciidoc/modules/ROOT/pages/database-integration/vectordb/index.adoc

+that indicates info to be reused respectively by `apoc.vectordb.qdrant.*`, `apoc.vectordb.chroma.*` and `apoc.vectordb.weaviate.*`.
+Then `host` is the host base name, `credentialsValue` is the API key and `mapping` is a map that can be used instead of the homonym `embeddingConfig` parameter.
+
+NOTE:: this procedure is only executable by a user with admin permissions


also needs to be routed to the systemdb leader? or does that happen now automatically???

jexp · 2024-05-22T11:16:15Z

docs/asciidoc/modules/ROOT/pages/database-integration/vectordb/index.adoc

+[source,cypher]
+----
+CALL apoc.vectordb.store('QDRANT', 'custom-host-name', '<apiKey>', 
+  {embeddingProp: "vect", label: "Test", prop: "myId", id: "foo"}


I'm not 100% sure about these names, they are not that obvious:

what would be sensible and easily understandable? (throughout)

Perhaps:

embeddingKey

metadataKey

nodeLabel

nodeKey

jexp · 2024-05-22T13:18:18Z

docs/asciidoc/modules/ROOT/pages/database-integration/vectordb/qdrand.adoc

+.Create a collection (it leverages https://qdrant.github.io/qdrant/redoc/index.html#tag/collections/operation/create_collection[this API])
+[source,cypher]
+----
+CALL apoc.vectordb.qdrant.createCollection($host, 'test_collection', 'Cosine', 4, {<optional config>})


use $hostOrKey here too

jexp · 2024-05-22T13:19:02Z

docs/asciidoc/modules/ROOT/pages/database-integration/vectordb/weaviate.adoc

+
+== Weaviate
+
+Here is a list of all available Qdrant procedures:


typo Qdrant -> Weaviate

jexp · 2024-05-24T07:25:36Z

extended/src/main/java/apoc/vectordb/ChromaDb.java

+                .map(MapResult::new);
+    }
+
+    @Procedure(value = "apoc.vectordb.chroma.delete", mode = Mode.SCHEMA)


are these really mode=SCHEMA?

jexp · 2024-05-24T07:27:56Z

extended/src/main/java/apoc/vectordb/ChromaDb.java

+                .map(ListResult::new);
+    }
+
+    @Procedure(value = "apoc.vectordb.chroma.get", mode = Mode.SCHEMA)


this might be mode=WRITE if we keep the update behavior ?

I think we should move the write behavior into a separate method, like queryAndUpdate or so? or updateGraphFromQuery ? and keep the query method read-only, otherwise read-only users can't use it and accidental write behavior will be confusing.

jexp · 2024-05-24T07:30:09Z

extended/src/main/java/apoc/vectordb/ChromaDb.java

+
+    @Procedure(value = "apoc.vectordb.chroma.query", mode = Mode.SCHEMA)
+    @Description("apoc.vectordb.chroma.query(hostOrKey, collection, vector, filter, limit, $configuration) - Retrieve closest vectors the the defined `vector`, `limit` of results,  in the collection with the name specified in the 2nd parameter")
+    public Stream<EmbeddingResult> query(@Name("hostOrKey") String hostOrKey,


sorry, that comment for the query procedure was meant for here:

I think we should move the write behavior into a separate method, like queryAndUpdate or so? or updateGraphFromQuery ? and keep the query method read-only, otherwise read-only users can't use it and accidental write behavior will be confusing.

Removed Mode.SCHEMA. I had accidentally left it in from when the procedure also auto-created the vector indexes in Neo4j.
Also added queryAndUpdate procedures with WRITE mode.

jexp · 2024-05-24T07:31:22Z

extended/src/main/java/apoc/vectordb/ChromaDb.java

+                v -> listOfListsToMap((Map) v).stream());
+    }
+
+    private Map<String, Object> getVectorDbInfo(String hostOrKey, String collection, Map<String, Object> configuration, String templateUrl) {


it would be great if we could expose this getVectorDbInfo in a procedure call for each of the databases to get an overview what's in there.

added procedures with names apoc.vectordb.<type>.info

jexp · 2024-05-24T07:32:34Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+     * and mapping data to auto-create neo4j vector indexes and properties
+     */
+    @Procedure(value = "apoc.vectordb.custom.get", mode = Mode.SCHEMA)
+    @Description("apoc.vectordb.custom.get(host, $configuration) - Customizable get / query procedure")


little bit more detail in the description?

jexp · 2024-05-24T07:36:19Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+            throw new RuntimeException(embeddingErrMsg);
+        }
+
+        entity.setProperty(mapping.getEmbeddingProp(), embedding.stream()


I think we should just do a utility method that creates a float array of the list size and uses a for loop over the list to set the values. then the JVM can also optimize that to SIMD. I don't think that the streams are efficient here.
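
A minimal sketch of such a utility, assuming a hypothetical `VectorArrays.toFloatArray` helper (the class and method names are not from the PR):

```java
import java.util.List;

// Hypothetical helper class for illustration.
final class VectorArrays {
    private VectorArrays() {}

    // Copies the list into a primitive float[] of the same size with a plain indexed loop.
    // A null element would raise a NullPointerException here, so callers should validate first.
    static float[] toFloatArray(List<Double> values) {
        float[] result = new float[values.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = values.get(i).floatValue();
        }
        return result;
    }
}
```

The resulting array could then be passed straight to `entity.setProperty(mapping.getEmbeddingProp(), ...)` in place of the stream-based conversion.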

jexp · 2024-05-24T07:39:51Z

extended/src/main/java/apoc/vectordb/VectorEmbeddingHandler.java

+    // -- implementations
+    //
+
+    class QdrantEmbeddingHandler implements VectorEmbeddingHandler {


I wonder if we should move these implementations closer to where the vector databases are? either into the procedures file or an associated file? Otherwise we have to update this file whenever we add a new db?

vga91 force-pushed the issue-3971 branch from 0d89656 to 1d25ac5 Compare May 2, 2024 10:16

vga91 added extended-functionality dev labels May 2, 2024

vga91 force-pushed the issue-3971 branch 5 times, most recently from 6d57a89 to 4291f7a Compare May 8, 2024 12:26

vga91 force-pushed the issue-3971 branch from 4291f7a to 019ec21 Compare May 10, 2024 09:21

jexp reviewed May 15, 2024

jexp requested changes May 15, 2024

vga91 marked this pull request as draft May 17, 2024 16:32

vga91 force-pushed the issue-3971 branch 3 times, most recently from 9ff02b4 to 9a8f108 Compare May 20, 2024 07:13

jexp reviewed May 22, 2024

jexp reviewed May 24, 2024

vga91 force-pushed the issue-3971 branch 4 times, most recently from b4af8cb to ae0152a Compare May 24, 2024 23:30

vga91 added 7 commits May 25, 2024 01:43

Fixes #3971: Check how to integrate vector databases via rest APIs

c94e0b3

fixed CI errors and removed unused imports

532b257

Changes review: added weaviate db, removed vector idx autocreation and vector as a default result

b6c7461

code clean

634cd24

Changes review: added systemdb store, removed constraint creation

47467dc

code clean

8f691b0

2nd changes review

d075a24

vga91 force-pushed the issue-3971 branch from ae0152a to d075a24 Compare May 24, 2024 23:43

