
RAG - data cleaning before adding Knowledge base document #182

Open

mathieuchateau opened this issue Apr 23, 2024 · 2 comments
Comments

@mathieuchateau

When adding a PDF to the knowledge base, text extraction can go wrong, like this:

Q U E L Q U E S A P P L I C A T I O N S E N A N A L Y S E L E X I C A L E E T L I N G U I S T I Q U E

Other examples:

dis.......quels sont

http : //www .cyberbr ic

Currently, the only cleaning I can see is this in pgVector2.py:

cleaned_content = document.content.replace("\x00", "\ufffd")

I think we can increase the quality of the knowledge base by sanitizing the data:

  • Remove repeated characters
  • Remove punctuation
  • Remove single characters surrounded by spaces
  • Remove stopwords (e.g., using the nltk library)
  • Lemmatize

I am quite new to RAG, but we always do this kind of text cleaning before embedding.
I guess the worst part is that this garbage data counts against the maximum number of tokens, preventing more knowledge base articles from being added.
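
For illustration, a minimal sketch of such a cleaning function, assuming nltk is installed with its "stopwords" and "wordnet" corpora downloaded (the clean_content name and the exact steps are just one possible implementation of the list above, not something phidata provides):

import re
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_content(text: str) -> str:
    # Collapse runs of repeated characters, e.g. "dis.......quels" -> "dis.quels"
    text = re.sub(r"(.)\1{3,}", r"\1", text)
    # Strip punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Drop single characters (spaced-out PDF text like "Q U E L Q U E S"),
    # drop stopwords, and lemmatize what remains
    words = [
        LEMMATIZER.lemmatize(w)
        for w in text.split()
        if len(w) > 1 and w.lower() not in STOPWORDS
    ]
    return " ".join(words)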

@ashpreetbedi
Contributor

I completely agree @mathieuchateau

For now, a fix would be to load the Documents manually instead of using the KnowledgeBase, something like:

from typing import List

from phi.document import Document
from phi.document.reader.pdf import PDFReader

reader = PDFReader()
pdf_documents: List[Document] = reader.read(uploaded_file)
for doc in pdf_documents:
    doc.content = clean_content(doc.content)  # your own sanitization function

assistant.knowledge_base.load_documents(pdf_documents, upsert=True)

You could also do it with a WebsiteReader:

from phi.document.reader.website import WebsiteReader

scraper = WebsiteReader(max_links=2, max_depth=1)
web_documents: List[Document] = scraper.read(input_url)
for doc in web_documents:
    doc.content = clean_content(doc.content)

assistant.knowledge_base.load_documents(web_documents, upsert=True)

Example code

A longer-term approach would be to expose this as a parameter on the KnowledgeBase object that accepts a "pre_process()" function, which runs on the content to clean it.
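
For illustration only, a hypothetical shape for that parameter (the pre_process name and the simplified classes below are assumptions, not existing phidata code):

from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Document:
    # Stand-in for phidata's Document; only the content field matters here
    content: str

@dataclass
class KnowledgeBase:
    # Proposed hook: an optional callable that cleans raw text before loading
    pre_process: Optional[Callable[[str], str]] = None

    def load_documents(self, documents: List[Document], upsert: bool = False) -> None:
        for doc in documents:
            if self.pre_process is not None:
                doc.content = self.pre_process(doc.content)
        # ... existing chunking/embedding/upsert logic would run here

A user could then pass pre_process=clean_content when constructing the knowledge base.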

What do you think?

@mathieuchateau
Author

mathieuchateau commented Apr 23, 2024 via email
