feat: Document table for storing original loaded documents #867

Maximilian-Winter · 2024-01-20T02:28:06Z

Please describe the purpose of this pull request.
Adds the things discussed in this issue: #770

Have you tested this PR?
No, wasn't sure about how to test it.

Related issues or PRs
#770

Is your PR over 500 lines of code?
No

sarahwooders

@Maximilian-Winter could you please also add a test for the document store in test_load_archival.py (maybe around here https://github.com/cpacker/MemGPT/blob/main/tests/test_load_archival.py#L138) to ensure documents are properly inserted? I think in the test case example, we'd expect a single document (for the single file) to be inserted, and would also want to check that the passages retrieved match the document ID.

memgpt/cli/cli_load.py

Maximilian-Winter · 2024-01-22T02:54:08Z

@sarahwooders I think we should add the test to test_storage.py, because the Document store as I have implemented it is a database, not an archival memory, if I understand the term archival memory correctly. But the passage would be part of the archival test, right?

sarahwooders · 2024-01-22T02:59:59Z

> @sarahwooders I think we should add the test to test_storage.py, because the Document store as I have implemented it is a database, not an archival memory, if I understand the term archival memory correctly. But the passage would be part of the archival test, right?

Ah yeah, I think ideally we could update both tests. In test_load_archival.py, we'd just want to check that the loaded documents and passages match up with the data that's loaded. And test_storage.py would test that inserting documents works fine (similar to how passages are also tested).
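
A minimal sketch of the check described above, using in-memory stand-ins rather than MemGPT's actual storage connectors. The `Document` and `Passage` dataclasses, the `doc_id` field, and the `load_file` helper are all hypothetical names chosen for illustration; a real test would go through the loader and the storage connector instead.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Document:
    # Hypothetical stand-in for the stored original document.
    text: str
    id: uuid.UUID = field(default_factory=uuid.uuid4)


@dataclass
class Passage:
    # Hypothetical stand-in for a chunk that points back to its document.
    text: str
    doc_id: uuid.UUID


def load_file(text: str, chunk_size: int = 20) -> Tuple[List[Document], List[Passage]]:
    """Store one document for the file and split it into passages."""
    doc = Document(text=text)
    passages = [
        Passage(text=text[i : i + chunk_size], doc_id=doc.id)
        for i in range(0, len(text), chunk_size)
    ]
    return [doc], passages


def check_load(documents: List[Document], passages: List[Passage], source_text: str) -> None:
    # A single input file should produce exactly one document.
    assert len(documents) == 1
    # Every passage should reference the inserted document.
    assert all(p.doc_id == documents[0].id for p in passages)
    # The passages together should cover the loaded data.
    assert "".join(p.text for p in passages) == source_text
```

The same three assertions could be applied to whatever the real loader and connector return, assuming passages expose the ID of the document they came from.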

Maximilian-Winter · 2024-01-22T05:02:27Z

@sarahwooders I checked again and SimpleWebpageReader returns one document with the complete text when used with one page. But SimpleDirectoryReader returns a list of document chunks when used with one document. I found a workaround: creating a new LlamaIndex document with the complete text and passing that to the store_docs function. It should also work as expected with multiple documents.
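
A rough sketch of the workaround described above: group the chunks a reader returns by source file and rebuild one document per file with the complete text. The `Chunk` and `FullDocument` types and the `file_path` key are hypothetical stand-ins, not LlamaIndex classes, and joining with an empty string assumes the chunks are contiguous and do not overlap.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Chunk:
    # Hypothetical stand-in for one chunk returned by a directory reader.
    text: str
    file_path: str


@dataclass
class FullDocument:
    # Hypothetical stand-in for the document that gets stored.
    text: str
    file_path: str


def merge_chunks(chunks: List[Chunk]) -> List[FullDocument]:
    """Rebuild one complete document per source file, preserving chunk order."""
    texts_by_file: Dict[str, List[str]] = {}
    for chunk in chunks:
        texts_by_file.setdefault(chunk.file_path, []).append(chunk.text)
    # dicts keep insertion order, so files come out in first-seen order.
    return [
        FullDocument(text="".join(parts), file_path=path)
        for path, parts in texts_by_file.items()
    ]
```

Grouping by file is what would let the same code path handle multiple documents: each file still ends up as exactly one stored document.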

Maximilian-Winter · 2024-01-22T05:04:08Z

@sarahwooders I also added the necessary tests.

sarahwooders

Left a few minor comments, and we should get the tests to pass -- but should be close to merging soon!

sarahwooders · 2024-01-22T19:32:07Z

memgpt/agent_store/db.py

I think you need to add TableType.DOCUMENTS here and also for SQLLiteStorageConnector?

sarahwooders · 2024-01-22T19:32:48Z

memgpt/agent_store/db.py

This is a bug on our end (and is causing tests to fail) -- TableType.DATA_SOURCES doesn't exist anymore, so if you remove this the tests should pass.

sarahwooders · 2024-01-22T19:38:16Z

memgpt/cli/cli_load.py

Could you please leave a comment on why you're doing doc.text[2:] for future reference?

Also, do we not need to do the same thing for loading webpages?

I do this because SimpleDirectoryReader adds two newlines to the chunks.
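
For reference, a small sketch of what `doc.text[2:]` does and a more defensive alternative. It assumes, per the explanation above, that the two extra characters are newlines at the start of each chunk; `strip_reader_prefix` is a hypothetical helper, not part of this PR.

```python
def strip_reader_prefix(text: str, prefix: str = "\n\n") -> str:
    """Drop the leading prefix only when it is actually present.

    Unlike text[2:], this leaves text untouched if it does not start
    with the two newlines (for example, text from a different reader).
    """
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


chunk = "\n\nFirst paragraph of the file."
# When the prefix is present, both approaches agree.
assert strip_reader_prefix(chunk) == chunk[2:]
# When it is absent, slicing would silently drop real characters.
assert strip_reader_prefix("No prefix") == "No prefix"
```

If the webpage loader does not add the newlines, a conditional strip like this would be safe to apply there too, whereas an unconditional slice would cut real text.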

Maximilian-Winter · 2024-01-22T20:33:53Z

@sarahwooders I'm going to add the comments about the doc.text[2:] later today, and try to do the rest.

Maximilian-Winter added 4 commits January 19, 2024 05:58

Added fields to document class.

dc26de4

Changing document storage type, add adding of documents to cli and re…

dd0f0a2

…moving embedding field from documents.

Merge remote-tracking branch 'upstream/main' into DocumentTable

40c5bea

Update formatting

babdb6b

Maximilian-Winter changed the title ~~feat: Document table for storing original loaded documents~~ feat: Document table for storing original loaded documents Jan 20, 2024

Maximilian-Winter added 3 commits January 21, 2024 05:29

Merge remote-tracking branch 'upstream/main' into DocumentTable

97924ed

Implement documents completely.

4de2584

Merge branch 'cpacker:main' into DocumentTable

d63b90a

cpacker marked this pull request as ready for review January 21, 2024 09:16

cpacker requested review from cpacker and sarahwooders January 21, 2024 09:16

Merge remote-tracking branch 'upstream/main' into DocumentTable

04c6ec3

sarahwooders requested changes Jan 22, 2024

Maximilian-Winter added 2 commits January 22, 2024 04:00

Merge remote-tracking branch 'upstream/main' into DocumentTable

f832e9b

Fixed ingestion issues and added tests.

b3847c3

Update Formatting issue

645d18f

Maximilian-Winter added 2 commits January 22, 2024 06:04

Update test_storage.py

049c1f9

Update test_storage.py

fba595d

sarahwooders requested changes Jan 22, 2024

Maximilian-Winter added 2 commits January 24, 2024 02:12

Merge remote-tracking branch 'upstream/main' into DocumentTable

04fb732

Update cli_load.py

ceff92f
