RAG - data cleaning before adding Knowledge base document #182
Looks promising, I will dig into it.
I already have some functions to clean text, e.g. from tweets.
I am mass-injecting 5,000 documents, each around 80 pages on average. Maybe
that's too ambitious?
I bumped up the limits on how many documents from the knowledge base can be used.
regards,
Mathieu Chateau
On Tue, Apr 23, 2024 at 18:47, Ashpreet ***@***.***> wrote:
… I completely agree @mathieuchateau <https://github.com/mathieuchateau>
A current fix would be to load the Documents manually instead of using
the KnowledgeBase, something like:

```python
from typing import List

from phi.document import Document
from phi.document.reader.pdf import PDFReader

reader = PDFReader()
# uploaded_file and clean_content() are defined elsewhere in the app
pdf_documents: List[Document] = reader.read(uploaded_file)
for doc in pdf_documents:
    doc.content = clean_content(doc.content)
assistant.knowledge_base.load_documents(pdf_documents, upsert=True)
```
You could also do it with a Website reader:

```python
from typing import List

from phi.document import Document
from phi.document.reader.website import WebsiteReader

scraper = WebsiteReader(max_links=2, max_depth=1)
# input_url and clean_content() are defined elsewhere in the app
web_documents: List[Document] = scraper.read(input_url)
for doc in web_documents:
    doc.content = clean_content(doc.content)
assistant.knowledge_base.load_documents(web_documents, upsert=True)
```
Example code
<https://github.com/phidatahq/phidata/blob/main/cookbook/llms/groq/rag/app.py#L138-L141>
A long-term approach would be to expose this as a parameter on the
KnowledgeBase object that accepts a `pre_process()` function, which runs
on the content to clean it.
What do you think?
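To make the proposal concrete, here is a minimal sketch of what such a hook could look like. Note that `SimpleKnowledgeBase` and the `pre_process` parameter are hypothetical illustrations of the idea, not phidata's actual API:

```python
from typing import Callable, List, Optional


class Document:
    """Minimal stand-in for a knowledge-base document."""

    def __init__(self, content: str):
        self.content = content


class SimpleKnowledgeBase:
    """Hypothetical knowledge base that accepts a pre_process() hook."""

    def __init__(self, pre_process: Optional[Callable[[str], str]] = None):
        self.pre_process = pre_process
        self.documents: List[Document] = []

    def load_documents(self, documents: List[Document], upsert: bool = True):
        for doc in documents:
            # Run the user-supplied cleaner before the document is stored/embedded
            if self.pre_process is not None:
                doc.content = self.pre_process(doc.content)
            self.documents.append(doc)


def clean_content(text: str) -> str:
    # Example cleaner: collapse all runs of whitespace to single spaces
    return " ".join(text.split())


kb = SimpleKnowledgeBase(pre_process=clean_content)
kb.load_documents([Document("hello   world\n\n")])
print(kb.documents[0].content)  # hello world
```

The appeal of this shape is that every reader (PDF, website, etc.) would get cleaning for free, instead of each caller looping over documents manually.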
When adding a PDF to the knowledge base, text extraction can go wrong, like this:
Q U E L Q U E S A P P L I C A T I O N S E N A N A L Y S E L E X I C A L E E T L I N G U I S T I Q U E
Other examples:
dis.......quels sont
http : //www .cyberbr ic
Currently, the only cleaning I can see is this, in pgVector2.py:

```python
cleaned_content = document.content.replace("\x00", "\ufffd")
```
I think we can increase the quality of the knowledge base by sanitizing the data.
I am quite new to RAG, but this kind of text cleaning is standard practice before embedding.
I guess the worst part is that this garbage data counts against the maximum number of tokens, preventing more knowledge base articles from being added.
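To illustrate the kind of sanitization I mean, here is a rough sketch targeting the artifacts above. The `clean_content` name and the heuristics are my own; the letter-joining rule is crude and will merge adjacent all-caps words that are separated by only a single space:

```python
import re


def clean_content(text: str) -> str:
    """Heuristic cleanup for common PDF extraction artifacts."""
    # Collapse runs of 4+ dots left over from table-of-contents leaders,
    # e.g. "dis.......quels" -> "dis quels".
    text = re.sub(r"\.{4,}", " ", text)
    # Re-join words whose letters were extracted one per space,
    # e.g. "Q U E L Q U E S" -> "QUELQUES". Caveat: words separated by
    # only a single space will be merged together.
    text = re.sub(
        r"\b(?:[A-Za-zÀ-ÿ] ){2,}[A-Za-zÀ-ÿ]\b",
        lambda m: m.group(0).replace(" ", ""),
        text,
    )
    # Repair the scheme of URLs broken by stray spaces,
    # e.g. "http : //" -> "http://".
    text = re.sub(r"http\s*:\s*//", "http://", text)
    # Normalize any remaining whitespace.
    return " ".join(text.split())
```

Unicode normalization, de-hyphenation across line breaks, and header/footer stripping would be natural next steps, but even heuristics like these reduce wasted tokens.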