ValueError: Input contains NaN. #3

Open
chenyujiang11 opened this issue Mar 6, 2024 · 7 comments

@chenyujiang11

I encountered this error while adding text. I hope to get a solution for dealing with it. Thank you very much.
Traceback (most recent call last):
File "/home/jyc23/raptor-master/demo/newdemo.py", line 123, in
RA.add_documents(text)
File "/home/jyc23/raptor-master/raptor/RetrievalAugmentation.py", line 217, in add_documents
self.tree = self.tree_builder.build_from_text(text=docs)
File "/home/jyc23/raptor-master/raptor/tree_builder.py", line 280, in build_from_text
root_nodes = self.construct_tree(all_nodes, all_nodes, layer_to_nodes)
File "/home/jyc23/raptor-master/raptor/cluster_tree_builder.py", line 102, in construct_tree
clusters = self.clustering_algorithm.perform_clustering(
File "/home/jyc23/raptor-master/raptor/cluster_utils.py", line 194, in perform_clustering
clusters = perform_clustering(
File "/home/jyc23/raptor-master/raptor/cluster_utils.py", line 120, in perform_clustering
reduced_embeddings_global = global_cluster_embeddings(embeddings, dim)
File "/home/jyc23/raptor-master/raptor/cluster_utils.py", line 32, in global_cluster_embeddings
reduced_embeddings = umap.UMAP(
File "/home/jyc23/miniconda3/envs/py38/lib/python3.8/site-packages/umap/umap_.py", line 2887, in fit_transform
self.fit(X, y, force_all_finite)
File "/home/jyc23/miniconda3/envs/py38/lib/python3.8/site-packages/umap/umap_.py", line 2354, in fit
X = check_array(X, dtype=np.float32, accept_sparse="csr", order="C", force_all_finite=force_all_finite)
File "/home/jyc23/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/validation.py", line 957, in check_array
_assert_all_finite(
File "/home/jyc23/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/validation.py", line 122, in _assert_all_finite
_assert_all_finite_element_wise(
File "/home/jyc23/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/validation.py", line 171, in _assert_all_finite_element_wise
raise ValueError(msg_err)
ValueError: Input contains NaN.
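
A quick sanity check to confirm whether the embedding vectors themselves already contain NaN before they are handed to umap.UMAP (assuming they can be collected into a list or NumPy array at the point where global_cluster_embeddings is called) might look like this:

import numpy as np

def count_nan_embeddings(embeddings):
    # Count how many embedding vectors contain at least one NaN entry.
    arr = np.asarray(embeddings, dtype=np.float32)
    nan_rows = np.isnan(arr).any(axis=1)
    return int(nan_rows.sum()), len(arr)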

@parthsarthi03 (Owner)

Hey! Can you provide some more details about the text you are adding? How many tokens is it?

@parthsarthi03 parthsarthi03 self-assigned this Mar 7, 2024
@parthsarthi03 parthsarthi03 added the bug Something isn't working label Mar 7, 2024
@chenyujiang11 (Author)

> Hey! Can you provide some more details about the text you are adding? How many tokens is it?

I have encountered this problem several times. The document being read is sample.txt from the demo. The LLM currently used is Qwen/Qwen-1_8B-Chat-Int4, and the embedding model is BAAI/bge-small-zh-v1.5. The bug also occurred when using the demo's default embedding model, multi-qa-mpnet-base-cos-v1, and the error is raised at the same place.

@ExtReMLapin (Contributor)

Reproducing example:

import os

import torch
from raptor import BaseSummarizationModel, BaseQAModel, BaseEmbeddingModel, RetrievalAugmentationConfig
from transformers import AutoTokenizer, pipeline

from huggingface_hub import login
login()
class GEMMASummarizationModel(BaseSummarizationModel):
    def __init__(self, model_name="google/gemma-2b-it"):
        # Initialize the tokenizer and the pipeline for the GEMMA model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.summarization_pipeline = pipeline(
            "text-generation",
            model=model_name,
            model_kwargs={"torch_dtype": torch.bfloat16},
            device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'),  # Use "cpu" if CUDA is not available
        )

    def summarize(self, context, max_tokens=150):
        # Format the prompt for summarization
        messages=[
            {"role": "user", "content": f"Write a summary of the following, including as many key details as possible: {context}:"}
        ]
        
        prompt = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        
        # Generate the summary using the pipeline
        outputs = self.summarization_pipeline(
            prompt,
            max_new_tokens=max_tokens,
            do_sample=True,
            temperature=0.7,
            top_k=50,
            top_p=0.95
        )
        
        # Extracting and returning the generated summary
        summary = outputs[0]["generated_text"].strip()
        return summary
    
class GEMMAQAModel(BaseQAModel):
    def __init__(self, model_name= "google/gemma-2b-it"):
        # Initialize the tokenizer and the pipeline for the model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.qa_pipeline = pipeline(
            "text-generation",
            model=model_name,
            model_kwargs={"torch_dtype": torch.bfloat16},
            device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'),
        )

    def answer_question(self, context, question):
        # Apply the chat template for the context and question
        messages=[
              {"role": "user", "content": f"Given Context: {context} Give the best full answer amongst the option to question {question}"}
        ]
        prompt = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        
        # Generate the answer using the pipeline
        outputs = self.qa_pipeline(
            prompt,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7,
            top_k=50,
            top_p=0.95
        )
        
        # Extracting and returning the generated answer
        answer = outputs[0]["generated_text"][len(prompt):]
        return answer
    
from sentence_transformers import SentenceTransformer
class SBertEmbeddingModel(BaseEmbeddingModel):
    def __init__(self, model_name="sentence-transformers/multi-qa-mpnet-base-cos-v1"):
        self.model = SentenceTransformer(model_name)

    def create_embedding(self, text):
        return self.model.encode(text)

os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

from raptor import RetrievalAugmentation, RetrievalAugmentationConfig


RAC = RetrievalAugmentationConfig(summarization_model=GEMMASummarizationModel(), qa_model=GEMMAQAModel(), embedding_model=SBertEmbeddingModel())
RA = RetrievalAugmentation(config=RAC)

with open('harry.txt', 'r', encoding="utf8") as file:
    text = file.read()
RA.add_documents(text)
 



SAVE_PATH = "demo/cinderella"
RA.save(SAVE_PATH)

#extract text from harry-potter-3-le-prisonnier-dazkaban.pdf

The txt data is linked in this message:

harry.txt

@daniyal214

@parthsarthi03 I'm facing the same issue. Any update on this?
@chenyujiang11 @ExtReMLapin were you guys able to resolve this?

@ExtReMLapin (Contributor)

Didn’t retry

@Amr-Hegazy1

I had a similar issue, and after I ran pip install -U sentence-transformers it worked fine.

@ATP-BME commented Apr 30, 2024

It seems that the error is caused by using multiprocessing when generating embeddings. Setting multiprocess=False worked fine for me.
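
For anyone who wants to catch this earlier, here is a minimal sketch of a defensive embedding model (the class name and the NaN check are illustrative, not part of raptor) that encodes each chunk sequentially, so no multiprocessing pool is involved, and raises as soon as a NaN vector is produced instead of letting it reach UMAP:

import numpy as np
from sentence_transformers import SentenceTransformer

from raptor import BaseEmbeddingModel


class SafeSBertEmbeddingModel(BaseEmbeddingModel):
    # Encodes one chunk at a time and fails fast if a NaN embedding is produced.
    def __init__(self, model_name="sentence-transformers/multi-qa-mpnet-base-cos-v1"):
        self.model = SentenceTransformer(model_name)

    def create_embedding(self, text):
        vec = self.model.encode(text)  # single-text encode, no worker pool
        if np.isnan(vec).any():
            raise ValueError(f"NaN embedding produced for chunk: {text[:80]!r}")
        return vec

It can be dropped into the reproducing example above via RetrievalAugmentationConfig(..., embedding_model=SafeSBertEmbeddingModel()), which at least pins down whether the NaN originates in the embedding step or later in clustering.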
