
RAG - data cleaning before adding Knowledge base document #182

Open

mathieuchateau opened this issue Apr 23, 2024 · 2 comments
Comments

@mathieuchateau

When adding a PDF to the knowledge base, text extraction can go wrong, like this:

Q U E L Q U E S A P P L I C A T I O N S E N A N A L Y S E L E X I C A L E E T L I N G U I S T I Q U E

Other examples:

dis.......quels sont

http : //www .cyberbr ic

Currently, the only cleaning I can see is this in pgVector2.py:

cleaned_content = document.content.replace("\x00", "\ufffd")

I think we can increase the quality of the knowledge base by sanitizing the data:

  • Remove repeated characters
  • Remove punctuation
  • Remove single characters surrounded by spaces
  • Remove stopwords (e.g., using the nltk library)
  • Lemmatize

I am quite new to RAG, but we always do this kind of text cleaning before embedding.
I guess the worst part is that this garbage data counts against the maximum number of tokens, preventing more knowledge base articles from being added.
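
For illustration, a minimal sketch of such a cleaning function, assuming nltk is installed with its "stopwords" and "wordnet" corpora downloaded (the clean_content name and the exact steps are just one possible implementation of the list above, not something phidata provides):

import re
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_content(text: str) -> str:
    # Collapse runs of repeated characters, e.g. "dis.......quels" -> "dis.quels"
    text = re.sub(r"(.)\1{3,}", r"\1", text)
    # Strip punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Drop single characters (spaced-out PDF text like "Q U E L Q U E S"),
    # drop stopwords, and lemmatize what remains
    words = [
        LEMMATIZER.lemmatize(w)
        for w in text.split()
        if len(w) > 1 and w.lower() not in STOPWORDS
    ]
    return " ".join(words)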

@ashpreetbedi
Contributor

I completely agree @mathieuchateau

For now, a fix would be to load the Documents manually instead of using the KnowledgeBase, something like:

from typing import List

from phi.document import Document
from phi.document.reader.pdf import PDFReader

reader = PDFReader()
pdf_documents: List[Document] = reader.read(uploaded_file)
for doc in pdf_documents:
    doc.content = clean_content(doc.content)  # your own sanitization function

assistant.knowledge_base.load_documents(pdf_documents, upsert=True)

You could also do it with a WebsiteReader:

from phi.document.reader.website import WebsiteReader

scraper = WebsiteReader(max_links=2, max_depth=1)
web_documents: List[Document] = scraper.read(input_url)
for doc in web_documents:
    doc.content = clean_content(doc.content)

assistant.knowledge_base.load_documents(web_documents, upsert=True)

Example code

A longer-term approach would be to expose this as a parameter on the KnowledgeBase object that accepts a "pre_process()" function, which runs on the content to clean it.
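
For illustration only, a hypothetical shape for that parameter (the pre_process name and the simplified classes below are assumptions, not existing phidata code):

from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Document:
    # Stand-in for phidata's Document; only the content field matters here
    content: str

@dataclass
class KnowledgeBase:
    # Proposed hook: an optional callable that cleans raw text before loading
    pre_process: Optional[Callable[[str], str]] = None

    def load_documents(self, documents: List[Document], upsert: bool = False) -> None:
        for doc in documents:
            if self.pre_process is not None:
                doc.content = self.pre_process(doc.content)
        # ... existing chunking/embedding/upsert logic would run here

A user could then pass pre_process=clean_content when constructing the knowledge base.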

What do you think?

@mathieuchateau
Author

mathieuchateau commented Apr 23, 2024 via email
