
[Bug]: loading 5.7 GB data to llama-index #13588

Open
yasir23 opened this issue May 20, 2024 · 3 comments
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

@yasir23

yasir23 commented May 20, 2024

Bug Description

I have loaded a 5.7 GB dataset with the loader, but it is giving this response:
Ignoring wrong pointing object 59 0 (offset 0)
Ignoring wrong pointing object 60 0 (offset 0)
Ignoring wrong pointing object 62 0 (offset 0)
Ignoring wrong pointing object 63 0 (offset 0)
The whole dataset is composed of PDF files.
[Screenshot: chatbot.ipynb open in Cursor, 5/20/2024 7:15:19 AM]

Version

latest

Steps to Reproduce

from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_dir="./dataset")
documents = reader.load_data()

Relevant Logs/Tracebacks

No response

@yasir23 yasir23 added bug Something isn't working triage Issue needs to be triaged/prioritized labels May 20, 2024

dosubot bot commented May 20, 2024

Hey there, @yasir23! 🎉 I'm Dosu, your friendly neighborhood bot, here to lend a hand with bugs, answer your queries, and guide you on your journey to becoming a contributor. While we wait for a human maintainer to swing by, know that I'm here to help out. Let me dive into your issue and I'll get back to you with a full answer shortly!

@logan-markewich
Collaborator

I've never seen this before. Pretty much impossible to debug without a way to reproduce 😅

My guess is either running out of memory, or some kind of error reading some of your pdfs (corrupted? password protected?)

You can try loading them one by one instead maybe?

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex(nodes=[], ...)

# iter_data() yields the list of documents for one file at a time
for docs in reader.iter_data():
    for doc in docs:
        index.insert(doc)
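Loading files one at a time also makes it easy to record exactly which file fails instead of aborting the whole run. Here is a minimal, generic sketch of that pattern; the `load_one` callable is a hypothetical stand-in for whatever per-file loader you use (e.g. a PDF reader):

```python
from pathlib import Path


def load_each(directory, load_one):
    """Load .pdf files one at a time, collecting failures instead of aborting.

    Returns (loaded_results, failures), where failures is a list of
    (filename, error message) pairs for the files that raised.
    """
    loaded, failed = [], []
    for path in sorted(Path(directory).glob("*.pdf")):
        try:
            loaded.append(load_one(path))
        except Exception as exc:  # record the offending file and keep going
            failed.append((path.name, str(exc)))
    return loaded, failed
```

The `failed` list then tells you which specific PDFs to inspect for corruption or password protection.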


dosubot bot commented May 20, 2024

To tackle the issue of "Ignoring wrong pointing object" warnings when loading a 5.7 GB dataset of PDF files, consider the following steps:

  1. Check for Corrupted or Password-Protected PDFs: The warnings might indicate problems with the PDF files themselves, such as corruption or password protection. Use tools like pypdf to check the integrity of your PDF files and ensure they are not password protected.

  2. Memory Management: Make sure your system has enough memory to handle the dataset. If memory is an issue, try loading smaller chunks of the dataset sequentially.

  3. Debugging with PDFReader: Use PDFReader from llama-index to load individual PDF files or smaller subsets of your dataset. This can help identify specific files that cause the warnings.

  4. Adjust Processing and Error Handling: If using parallel processing, ensure your system can handle the load. Consider setting raise_on_error to True in your SimpleDirectoryReader configuration to halt on errors, making it easier to identify problematic files.

  5. Enhance Logging: Increase logging detail within your script or the llama-index configuration to capture more information about the loading process. This can help in identifying the root cause of the warnings.

  6. Consult Documentation and Community: Look into the llama-index documentation and community forums for any known issues or tips related to loading large datasets or handling PDF files.

By following these steps, you should be able to identify and resolve the issue with the "Ignoring wrong pointing object" warnings during the dataset loading process.
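For step 1, a cheap stdlib-only sanity check can flag obviously truncated or non-PDF files before you hand them to the loader. This is only a structural heuristic, not full validation; a well-formed PDF starts with the `%PDF-` magic bytes and has an `%%EOF` marker near the end:

```python
from pathlib import Path


def looks_like_valid_pdf(path):
    """Cheap structural check on a PDF file.

    A well-formed PDF begins with the b'%PDF-' header and carries a
    b'%%EOF' trailer marker within the last kilobyte of the file.
    Passing this check does not guarantee the file parses cleanly,
    but failing it strongly suggests truncation or corruption.
    """
    data = Path(path).read_bytes()
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]
```

For password protection, a library such as pypdf can report it via `PdfReader(path).is_encrypted`; the stdlib check above deliberately avoids that dependency.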

About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

