The sidebar doesn't load when a pdf has one page only #203

Open
JumanaFM opened this issue Dec 27, 2021 · 32 comments
Labels
bug Something isn't working

Comments

@JumanaFM
Member

Currently in nbclient, if a PDF has only one page, the sidebar does not load.

@JumanaFM JumanaFM added the bug Something isn't working label Dec 27, 2021
@semisenioritis

Does this only happen for single-page PDFs, or for multi-page PDFs too?

@karger
Member

karger commented Dec 29, 2022

Please provide publicly accessible test cases if possible.

@semisenioritis

semisenioritis commented Dec 30, 2022

https://home.ttic.edu/~avrim/book.pdf

This is the textbook that I am using. After some experimentation I realized that the issue is that the library you are using to make the PDF annotatable requires intensive preprocessing (around 3-4 minutes for initial setup), and until the entire PDF has been preprocessed, neither the annotation sidebar nor the annotations show up. This makes sense, since the PDF is the core and the annotations build on it, but it becomes a really big issue when that setup time is required every time the open window or tab is changed.
This makes the software unusable, as the loading time is just not bearable.

I tried the same document with nb1 and found that each page was rendered as a single image and the annotations were block-like in nature, thus losing fine control. However, the rendering of each page was initiated only as and when required, which made the process faster.
My suggestion would be to provide a highlighting ability that doesn't map directly to the threads, but instead maps to a background block-style annotation that the user can neither see nor interact with, which might make the system faster.

I'm probably missing a lot of details since I don't know the code thoroughly, but I'd be happy to help out!

@semisenioritis

Also, would it be possible to make the code of nb1 publicly available? I wasn't able to find it in the haystack repositories.

@karger
Member

karger commented Dec 30, 2022

NB1 is at https://github.com/nbproject/nbproject

@karger
Member

karger commented Dec 30, 2022

The right solution to the problem that you've identified is for NB to "process" the PDF (which nowadays means converting it to HTML for in-browser rendering) on the server once, store it there, and deliver that HTML directly to the client at time of use, instead of the current approach of shipping the PDF to each client for processing at time of use. There should be an issue for this but I can't find it; if it really isn't there we should add it, @JumanaFM.

@semisenioritis

semisenioritis commented Dec 30, 2022

Exactly, preprocessing is something that I think is happening on the client side, and if possible it should happen all at once during the PDF upload process. It will probably save a lot of resources.

On another note, I checked this same issue with Mozilla's built-in PDF viewer and hypothes.is's PDF annotator as well (both being open source), but neither of them seems to have this issue. Any idea how they manage it, and whether the same source code can be used? Mozilla's viewer doesn't have the ability to annotate and highlight, but other annotation plugins based on Mozilla's viewer work pretty smoothly too.

Also, thanks for the nb1 link!

@karger
Member

karger commented Dec 30, 2022 via email

@semisenioritis

Ahh, got it. But I briefly looked at the nb2 source code, and you also use pdf.js. Sorry in advance if it's a basic question.

@karger
Member

karger commented Dec 30, 2022

Right; we use the same canonical library as everyone else. But we're running it every time on the client, when instead it really ought to be run once on the server.

@karger
Member

karger commented Dec 30, 2022

If you are looking to contribute this would be a very nice issue to work on.

@semisenioritis

I'm planning to modify nb a bit for my own requirements, and I really need to be able to work with big files for this. I'd love to contribute!
Are there any resources for this specific issue you could point me towards?

@karger
Member

karger commented Dec 30, 2022

I'd love for you to contribute back anything you think could be helpful to others. In particular, this prerendering of large PDFs would be of great general benefit. NB1 did this (it rendered into images instead of HTML, but same idea).

I take it you've already found the client and server code. We're active on the repo discussion and happy to help out if you need help understanding or finding specific things.

@semisenioritis

Yup, I have already set up nb2 on my laptop, but my system kept crashing because of the local hosting. I think that, for some reason, nb2 does both pre-rendering and client-side rendering, as my local nb took twice as long as the hosted nb. Just a guess though.

I'll start by figuring out how nb1 rendered images so that I can use that here.

@karger
Member

karger commented Dec 30, 2022 via email

@semisenioritis

semisenioritis commented Dec 30, 2022

Totally agreed, especially with the points about PDF to images. I initially wanted to use nb1 when that was the only option available, but then I realized that without fine control over the text, the context of the related question would be lost on the readers. This wouldn't have been a very big issue and could easily have been worked around, but it made me postpone my project.

Nb2 did initially feel like more of a frontend modification at the cost of speed, but as I went deeper, I realized that a lot of features were added, making it more user-friendly.

But if I shouldn't even refer to nb1 code, where is a good place to start?

@karger
Member

karger commented Dec 30, 2022

Are you asking specifically about how to tackle server side rendering in nb2?

@semisenioritis

semisenioritis commented Dec 31, 2022

Yes. Maybe some resource or something I can look into or something that already implements this well.

@JumanaFM
Member Author

JumanaFM commented Jan 5, 2023

> Yes. Maybe some resource or something I can look into or something that already implements this well.

This is how it's done on NB currently https://github.com/haystack/nb/blob/7f0e24a07db0b5de1f54c5d4f20114a14d994f73/public/nb_viewer.html
Take a look and contribute if you can, we appreciate it!

@karger
Member

karger commented Jan 5, 2023

At present, nb_viewer fetches the target PDF from the nb server, then uses the pdf.js library to convert it to HTML that nb can annotate. We should instead be using the same pdf.js library on the server to convert the PDF to HTML there once, then save the resulting HTML in a suitable cache directory so that the HTML can be served on request.
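
The convert-once-then-cache idea described above could look roughly like the following. This is only a minimal sketch in Python, not nb's actual server code: `cached_html`, the cache directory layout, and the `convert` callable (standing in for whatever runs pdf.js or another converter) are all hypothetical names chosen for illustration.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_html(pdf_bytes: bytes, cache_dir: Path,
                convert: Callable[[bytes], str]) -> str:
    """Return HTML for a PDF, running the converter only on a cache miss.

    `convert` is a placeholder for the real PDF-to-HTML step (assumption).
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache on the file contents, so re-uploads of the same PDF hit it.
    key = hashlib.sha256(pdf_bytes).hexdigest()
    cached = cache_dir / f"{key}.html"
    if cached.exists():
        return cached.read_text(encoding="utf-8")

    html = convert(pdf_bytes)
    # Write to a temporary name and rename, so a concurrent reader
    # never sees a half-written HTML file.
    tmp = cached.with_suffix(".tmp")
    tmp.write_text(html, encoding="utf-8")
    tmp.replace(cached)
    return html
```

With something like this, the expensive conversion would happen once per distinct PDF (for example at upload time), and every later request would just read the stored HTML.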

@semisenioritis

> Yes. Maybe some resource or something I can look into or something that already implements this well.

> This is how it's done on NB currently https://github.com/haystack/nb/blob/7f0e24a07db0b5de1f54c5d4f20114a14d994f73/public/nb_viewer.html Take a look and contribute if you can, we appreciate it!

Really helpful, thanks!

@semisenioritis

Why not just save the generated HTML file on the server, deleting the original PDF?

> At present, nb_viewer fetches the target PDF from the nb server, then uses the pdf.js library to convert it to HTML that nb can annotate. We should instead be using the same pdf.js library on the server to convert the PDF to HTML there once, then save the resulting HTML in a suitable cache directory so that the HTML can be served on request.

@semisenioritis

What I'm thinking is that once the professor uploads the file to the server, the server converts it to an HTML file and saves that file for all later use.
If a student or professor wants to download the file as a PDF, we perform the same conversion in reverse on the server and provide the document.

@semisenioritis

It seems that converting PDFs to HTML documents doesn't always work out: most of the files have their own specific fonts, without which the file gets corrupted.
Also, I looked a bit deeper into the hypothesis code, and it seems that they aren't using the PDF-to-HTML system either.
Not really sure how to proceed at this point.

@karger
Member

karger commented Jan 5, 2023 via email

@karger
Member

karger commented Jan 5, 2023

PDFs that cannot be converted are just as big a problem with the current system as they would be with server-side conversion; it's the same library either way. So we're no worse off doing the conversion server side.

But such problematic PDFs are rare and getting rarer, because pdfjs is also the library that Firefox uses to render PDFs in the browser, so it gets lots of attention.

Google Chrome uses a different conversion library, pdfium, for the same purpose. We could use that library instead of pdfjs if we decided it was more robust. Pdfium would have to run in a separate process since it isn't JS-based, but we could easily have our server invoke it as needed, using for example this python wrapper.
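
Running a non-JS converter in a separate process, as suggested above, might be sketched like this. The `command` argument is a hypothetical placeholder, not a real pdfium or wrapper CLI; the sketch simply assumes some external program that takes a PDF path as its last argument and prints HTML to stdout.

```python
import subprocess
from pathlib import Path
from typing import Sequence


def convert_out_of_process(command: Sequence[str], pdf_path: Path,
                           timeout_s: float = 120.0) -> str:
    """Run an external converter on `pdf_path` and return its stdout.

    `command` is whatever launches the converter (assumption: it accepts
    the PDF path as its final argument and writes HTML to stdout).
    """
    result = subprocess.run(
        [*command, str(pdf_path)],
        capture_output=True,   # collect stdout/stderr instead of inheriting them
        text=True,             # decode output as text
        timeout=timeout_s,     # raise TimeoutExpired if conversion hangs
        check=True,            # raise CalledProcessError on non-zero exit
    )
    return result.stdout
```

Keeping the converter behind a plain process boundary like this means a crash or hang on a weird PDF surfaces as an exception in the server rather than taking the server down.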

@semisenioritis

Riiight, that makes sense. I'll try this.

@semisenioritis

semisenioritis commented Jan 7, 2023

@JumanaFM sorry for bothering you again and again, but is there any documentation for pdf.js at all? No matter where I search, I can't seem to find any. The official docs point to links that are incomplete, and the only documentation that exists is user-contributed and doesn't make a lot of sense (https://github.com/MeiKatz/pdfjs-docs/blob/master/README.md). Where did you look for the documentation?

I don't mind switching to pdfium, but if I can I'd prefer staying close to the source code.

@JumanaFM
Member Author

JumanaFM commented Jan 8, 2023

> @JumanaFM sorry for bothering you again and again, but is there any documentation for pdf.js at all? No matter where I search, I can't seem to find any. The official docs point to links that are incomplete, and the only documentation that exists is user-contributed and doesn't make a lot of sense (https://github.com/MeiKatz/pdfjs-docs/blob/master/README.md). Where did you look for the documentation?

> I don't mind switching to pdfium, but if I can I'd prefer staying close to the source code.

Not a bother, happy to help!
The best resource is the official page
https://mozilla.github.io/pdf.js/

Another resource that might be helpful is hypothesis
https://github.com/hypothesis/pdf.js-hypothes.is

@karger
Member

karger commented Jan 8, 2023 via email

@semisenioritis

> Not a bother, happy to help! The best resource is the official page https://mozilla.github.io/pdf.js/

> Another resource that might be helpful is hypothesis https://github.com/hypothesis/pdf.js-hypothes.is

Thanks a lot!! I found a few more random resources, but the best docs are in the examples on the official page itself. Not a lot to go by, but you can get a brief overview.

@semisenioritis

> It might be worth investigating online which of pdf.js and pdfium is considered most robust, able to handle the most PDF weirdness, and produces the best HTML. All we do is invoke it for conversion, so the coupling to nb is very light, and it would probably be quite easy to switch, though we would need to keep using pdfjs for the legacy documents, since we rely on the converted HTML being the same every time.

Sure, I'll look into comparing both too.
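
One way to keep legacy documents on pdfjs while newer ones use a different converter is to record, per document, which converter produced its HTML. The following is a rough sketch under that assumption; `ConverterRegistry`, the converter names, and the in-memory `_pinned` map are hypothetical, and a real server would presumably persist the pinning alongside the document record.

```python
from typing import Callable, Dict

Converter = Callable[[bytes], str]


class ConverterRegistry:
    """Pins each document to the converter that first rendered it."""

    def __init__(self, converters: Dict[str, Converter], default: str) -> None:
        if default not in converters:
            raise ValueError(f"unknown default converter: {default}")
        self._converters = dict(converters)
        self._default = default
        self._pinned: Dict[str, str] = {}  # doc_id -> converter name

    def set_default(self, name: str) -> None:
        """Change the converter used for documents not yet pinned."""
        if name not in self._converters:
            raise ValueError(f"unknown converter: {name}")
        self._default = name

    def converter_for(self, doc_id: str) -> str:
        """Return the pinned converter name, pinning the default on first use."""
        return self._pinned.setdefault(doc_id, self._default)

    def convert(self, doc_id: str, pdf_bytes: bytes) -> str:
        """Convert with the document's pinned converter so output stays stable."""
        return self._converters[self.converter_for(doc_id)](pdf_bytes)
```

Under this scheme, switching the default only affects documents converted afterwards, so existing annotations keep pointing at HTML produced by the same converter as before.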

3 participants