
[Bug]: loading 5.7 GB data to llama-index #13588

Open
yasir23 opened this issue May 20, 2024 · 3 comments
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

@yasir23

yasir23 commented May 20, 2024

Bug Description

I have loaded a 5.7 GB dataset with the loader, but it is giving this response:
Ignoring wrong pointing object 59 0 (offset 0)
Ignoring wrong pointing object 60 0 (offset 0)
Ignoring wrong pointing object 62 0 (offset 0)
Ignoring wrong pointing object 63 0 (offset 0)
The whole dataset is composed of PDF files.
[Screenshot: chatbot.ipynb open in Cursor, 5/20/2024 7:15:19 AM]

Version

latest

Steps to Reproduce

from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_dir="./dataset")
documents = reader.load_data()

Relevant Logs/Tracebacks

No response

@yasir23 yasir23 added bug Something isn't working triage Issue needs to be triaged/prioritized labels May 20, 2024

dosubot bot commented May 20, 2024

Hey there, @yasir23! 🎉 I'm Dosu, your friendly neighborhood bot, here to lend a hand with bugs, answer your queries, and guide you on your journey to becoming a contributor. While we wait for a human maintainer to swing by, know that I'm here to help out. Let me dive into your issue and I'll get back to you with a full answer shortly!

@logan-markewich
Collaborator

I've never seen this before. Pretty much impossible to debug without a way to reproduce 😅

My guess is either running out of memory, or some kind of error reading some of your pdfs (corrupted? password protected?)

You can try loading them one by one instead maybe?

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex(nodes=[], ...)

# iter_data() yields the list of documents for one file at a time
for docs in reader.iter_data():
    for doc in docs:
        index.insert(doc)
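Loading files one at a time also makes it easy to record exactly which file fails instead of aborting the whole run. Here is a minimal, generic sketch of that pattern; the `load_one` callable is a hypothetical stand-in for whatever per-file loader you use (e.g. a PDF reader):

```python
from pathlib import Path


def load_each(directory, load_one):
    """Load .pdf files one at a time, collecting failures instead of aborting.

    Returns (loaded_results, failures), where failures is a list of
    (filename, error message) pairs for the files that raised.
    """
    loaded, failed = [], []
    for path in sorted(Path(directory).glob("*.pdf")):
        try:
            loaded.append(load_one(path))
        except Exception as exc:  # record the offending file and keep going
            failed.append((path.name, str(exc)))
    return loaded, failed
```

The `failed` list then tells you which specific PDFs to inspect for corruption or password protection.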


dosubot bot commented May 20, 2024

To tackle the issue of "Ignoring wrong pointing object" warnings when loading a 5.7 GB dataset of PDF files, consider the following steps:

  1. Check for Corrupted or Password-Protected PDFs: The warnings might indicate problems with the PDF files themselves, such as corruption or password protection. Use tools like pypdf to check the integrity of your PDF files and ensure they are not password protected.

  2. Memory Management: Make sure your system has enough memory to handle the dataset. If memory is an issue, try loading smaller chunks of the dataset sequentially.

  3. Debugging with PDFReader: Use PDFReader from llama-index to load individual PDF files or smaller subsets of your dataset. This can help identify specific files that cause the warnings.

  4. Adjust Processing and Error Handling: If using parallel processing, ensure your system can handle the load. Consider setting raise_on_error to True in your SimpleDirectoryReader configuration to halt on errors, making it easier to identify problematic files.

  5. Enhance Logging: Increase logging detail within your script or the llama-index configuration to capture more information about the loading process. This can help in identifying the root cause of the warnings.

  6. Consult Documentation and Community: Look into the llama-index documentation and community forums for any known issues or tips related to loading large datasets or handling PDF files.

By following these steps, you should be able to identify and resolve the issue with the "Ignoring wrong pointing object" warnings during the dataset loading process.
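For step 1, a cheap stdlib-only sanity check can flag obviously truncated or non-PDF files before you hand them to the loader. This is only a structural heuristic, not full validation; a well-formed PDF starts with the `%PDF-` magic bytes and has an `%%EOF` marker near the end:

```python
from pathlib import Path


def looks_like_valid_pdf(path):
    """Cheap structural check on a PDF file.

    A well-formed PDF begins with the b'%PDF-' header and carries a
    b'%%EOF' trailer marker within the last kilobyte of the file.
    Passing this check does not guarantee the file parses cleanly,
    but failing it strongly suggests truncation or corruption.
    """
    data = Path(path).read_bytes()
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]
```

For password protection, a library such as pypdf can report it via `PdfReader(path).is_encrypted`; the stdlib check above deliberately avoids that dependency.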

About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

