
GenAI-Document Q&A

Document Q&A is designed to respond comprehensively to questions posed about the provided document, regardless of the section from which the questions originate.

Steps to run the Streamlit app:

  1. Create a Hugging Face user access token, or use an existing one, at https://huggingface.co/settings/tokens. (The token is needed at runtime; see the note after these steps.)

  2. Create a new environment:
    conda create -n genai python=3.9 -y

  3. Activate the environment:
    conda activate genai

  4. Install the requirements:
    pip install -r requirements.txt

  5. Run the Streamlit application:
    streamlit run app.py
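
The app needs the Hugging Face token at runtime. A minimal sketch, assuming app.py relies on LangChain's Hugging Face wrapper, which reads the HUGGINGFACEHUB_API_TOKEN environment variable (the token value below is a placeholder):

    import os

    # Set this before launching Streamlit (or export it in your shell).
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = "hf_..."  # paste your own token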

Workflow:

  1. Upload one or more PDF files. Loading takes a little time: in the backend, the app reads, chunks, and indexes the PDFs.
  2. A preview of the content is shown; expand it to inspect the extracted text.
  3. Ask a question about the documents.

Quick start: https://huggingface.co/spaces/susheel-1999/documentQA

About the techniques:

Langchain is a framework for developing applications powered by language models. It enables applications that are context-aware and can reason.

  1. Chunking process - The splitter is parameterized by a list of characters and tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of keeping all paragraphs (and then sentences, and then words) together as long as possible, since these are generally the most strongly semantically related pieces of text. (An end-to-end sketch combining the pieces below appears after this list.)
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    Types of chunking:
    i) Character Text Splitter - Splits text on a single character.
    ii) Recursive Character Text Splitter - Splits text on a sequence of separators tried in order. This method is particularly effective for retaining the structure of paragraphs and sentences.
    iii) Document Based Splitter - Splits text based on the structure of the document. This approach caters to specific formats, such as Python source, HTML, Markdown, and more.
    iv) Semantic Chunking - Aims to identify points in the text where sentence similarity changes significantly (potentially using a threshold over the following sentence). These points serve as separators for creating meaningful chunks.
  2. Integration of Hugging Face Models and Embeddings - Langchain seamlessly incorporates and provides access to Hugging Face models and embeddings. Users can leverage the following functionality:
    Embeddings: from langchain_community.embeddings import HuggingFaceEmbeddings
    LLMs: from langchain_community.llms import HuggingFaceHub
  3. Integration of VectorDB - Langchain seamlessly incorporates and provides support for many vector databases (for example, FAISS).
    from langchain_community.vectorstores import FAISS
  4. Schema - A class for storing a piece of text and its associated metadata; text chunks are converted into Document objects before indexing.
    from langchain.schema import Document
  5. Prompt Template - A template of a prompt can be easily designed with the help of the PromptTemplate class.
    from langchain.prompts import PromptTemplate
  6. LLM chain - The LLMChain class is used to execute the PromptTemplate.
    from langchain.chains import LLMChain
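
Putting these pieces together, here is a minimal end-to-end sketch of the flow (the embedding model, LLM repo id, chunk sizes, and prompt wording are illustrative assumptions, not necessarily what app.py uses; HUGGINGFACEHUB_API_TOKEN must be set):

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.schema import Document
    from langchain.prompts import PromptTemplate
    from langchain.chains import LLMChain
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.llms import HuggingFaceHub
    from langchain_community.vectorstores import FAISS

    raw_pdf_text = "...text extracted from the uploaded PDFs..."  # placeholder

    # 1. Chunk the text; separators are tried in order: paragraphs, lines, words.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500, chunk_overlap=50, separators=["\n\n", "\n", " ", ""]
    )
    chunks = splitter.split_text(raw_pdf_text)

    # 2. Wrap the chunks as Documents and index them in FAISS.
    docs = [Document(page_content=c) for c in chunks]
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    db = FAISS.from_documents(docs, embeddings)

    # 3. Retrieve the most similar chunks and answer with an LLM chain.
    llm = HuggingFaceHub(repo_id="google/flan-t5-large", model_kwargs={"temperature": 0.5})
    prompt = PromptTemplate(
        input_variables=["context", "question"],
        template="Answer using only this context:\n{context}\n\nQuestion: {question}",
    )
    chain = LLMChain(llm=llm, prompt=prompt)

    question = "What is the document about?"
    relevant = db.similarity_search(question, k=4)
    context = "\n\n".join(d.page_content for d in relevant)
    print(chain.run(context=context, question=question))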

Streamlit is an open-source Python library that makes it easy to create and share beautiful, custom web apps for machine learning and data science. In just a few minutes we can build and deploy powerful data apps.

  1. Session State - Session State is a way to share variables between reruns, for each user session.
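
For example, a minimal sketch of persisting chat history across reruns (the "history" key is an illustrative choice, not necessarily what app.py uses):

    import streamlit as st

    # st.session_state keeps values for this user's session across reruns.
    if "history" not in st.session_state:
        st.session_state.history = []  # initialize once per session

    question = st.text_input("Ask a question")
    if question:
        st.session_state.history.append(question)
    st.write(st.session_state.history)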

Why a Retrieval Augmented technique for question answering (or any other task)?

  1. Technique 1: Stuff
    Uses all of the text from the documents in the prompt. This breaks down when the combined text exceeds the model's token limit, and it can trigger rate-limiting errors.
  2. Technique 2: map_reduce
    It separates texts into batches, feeds each batch with the question to LLM separately, and comes up with the final answer based on the answers from each batch.
  3. Technique 3: refine
    It separates texts into batches, feeds the first batch to LLM, and feeds the answer and the second batch to LLM. It refines the answer by going through all the batches.
  4. Technique 4: map-rerank
    It separates texts into batches, feeds each batch to LLM, returns a score of how fully it answers the question, and comes up with the final answer based on the high-scored answers from each batch.
    One issue with Techniques 1, 2, 3, and 4 is that they can be very costly: you feed more text to the LLM API (e.g., OpenAI) across multiple calls, and the API charges by the number of tokens. A better solution is RAG (Retrieval Augmented Generation), which retrieves the relevant text chunks first and passes only those to the language model (see the sketch after this list).
  5. Technique 5: RAG
    Retrieval-Augmented Generation (RAG) is the process of optimizing the output of a large language model, so it references an authoritative knowledge base outside of its training data sources before generating a response.
    Steps involved:
    i. Document Indexing into VectorDB
    ii. Data Retrieval
    iii. Data Augmentation and Prompt Engineering
    iv. Querying
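
As a sketch, LangChain exposes the first four strategies through the chain_type argument of load_qa_chain, and the RAG pattern amounts to retrieving from the vector store before running the chain (llm and db reuse the objects from the earlier sketch; the question is illustrative):

    from langchain.chains.question_answering import load_qa_chain

    # chain_type selects the strategy: "stuff", "map_reduce", "refine", or "map_rerank".
    chain = load_qa_chain(llm, chain_type="map_reduce")

    # RAG: retrieve only the relevant chunks, then run the QA chain on just those.
    question = "What does the document say about pricing?"
    relevant_docs = db.similarity_search(question, k=4)
    print(chain.run(input_documents=relevant_docs, question=question))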

Reference:

Langchain - https://python.langchain.com/docs/get_started/introduction
OpenAI - https://platform.openai.com/docs/introduction
Streamlit - https://docs.streamlit.io/library/api-reference/session-state
