how can we get metadata information like pdf file page number etc #226

Open
gouse95 opened this issue Sep 20, 2023 · 2 comments

Comments

@gouse95

gouse95 commented Sep 20, 2023

No description provided.

@smwitkowski
Contributor

@gouse95 If you can provide more information about what you're trying to do, that would help, and if you have code you can share, even better!

With that said, I'm not sure kor is going to be the best tool to get the page number from a PDF.

If you're using pypdf to read in your PDF, you can get the number of pages, and access a page by its index, this way:

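# pypdf is the maintained successor to PyPDF2; the older
# PdfFileReader / numPages / getPage names were removed in PyPDF2 3.0
from pypdf import PdfReader

# Open the PDF file
reader = PdfReader('my_pdf.pdf')

# Get the total number of pages
total_pages = len(reader.pages)

# Get the first page (pages are zero-indexed)
first_page = reader.pages[0]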
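
To keep each page's text together with its page number, something along these lines might work. This is only a sketch: `iter_pages_with_numbers` and `load_pages` are hypothetical helper names, and it assumes a reader object that exposes `.pages` whose items have an `extract_text()` method, as pypdf's `PdfReader` does.

```python
def iter_pages_with_numbers(reader):
    """Yield (page_number, text) pairs, counting pages from 1.

    `reader` is assumed to expose `.pages`, a sequence of objects with
    an `extract_text()` method (pypdf's PdfReader fits this shape).
    """
    for index, page in enumerate(reader.pages):
        # Fall back to an empty string if no text could be extracted
        yield index + 1, page.extract_text() or ""


def load_pages(path):
    """Hypothetical convenience wrapper: read a PDF file with pypdf."""
    # Imported here so the helper above stays usable without pypdf installed
    from pypdf import PdfReader

    return list(iter_pages_with_numbers(PdfReader(path)))
```

Calling `load_pages('my_pdf.pdf')` would then give a list of `(page_number, text)` tuples. Note that the number here is the position in the file, not necessarily the number printed on the page.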

LangChain has a good example on how to use their PDF reader too, if you're planning on using that.
https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf

If your PDF has, say, a cover page, so that the printed page numbers don't start counting until after the first page of the file, then you might consider extracting the page number from the text itself. Still, I think you'd be better off using a regular expression instead of kor, given that you'd otherwise be passing the entire content of the PDF just to get the page number.

import re

# Create a sample PDF page (double quotes, because the text contains an apostrophe)
page = "This is the content of the PDF page\nHere's a new line.\n\npage: 10"

# Define the regular expression
regex = r'page:\s*(\d+)'

# Match the regular expression against the string
match = re.search(regex, page)

# Extract the page number from the match
if match:
    page_number = match.group(1)

    # Print the page number
    print(page_number)
else:
    print('Page number not found.')
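
For the cover-page case, the same regular expression could be run over every page to map each position in the file to the number printed on it. A minimal sketch, assuming each page's text has already been extracted into a list and that the footer follows the `page: N` format from the sample above; `PAGE_PATTERN`, `printed_page_numbers`, and the sample `pages` list are made up for illustration.

```python
import re

# Assumed footer format, taken from the sample above: "page: <digits>"
PAGE_PATTERN = re.compile(r'page:\s*(\d+)')


def printed_page_numbers(page_texts):
    """Return (file_index, printed_number) pairs; printed_number is None
    when no "page: N" marker is found in that page's text."""
    results = []
    for index, text in enumerate(page_texts):
        match = PAGE_PATTERN.search(text)
        printed = int(match.group(1)) if match else None
        results.append((index, printed))
    return results


# Made-up sample: a cover page without a number, then numbered pages
pages = [
    "Cover page with no number",
    "Introduction\n\npage: 1",
    "More content\n\npage: 2",
]

for index, printed in printed_page_numbers(pages):
    print(index, printed)
```

Because `search` returns the first match, a page whose body text happens to contain `page: 3` before the footer would be reported with that number, so the pattern may need tightening for real documents.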

Again, if you have some code you can share, or more info about your use case, I might be able to give you better advice.

@eyurtsev
Owner

Thanks @smwitkowski :)
