Small improvements to RAG #1871
base: staging
Conversation
- Toasts: this makes explicit which part of SillyTavern those messages come from.
- Science mode: when enabled, this attempts to heuristically sanitize the input text and to strip the reference list. (A minimal sketch of the idea is below.)
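To give an idea of the flavor of the heuristics, a minimal sketch (not the actual code in this PR):

```js
// Minimal sketch of the heuristic idea - not the actual implementation.
function sanitizeScientificInputSketch(text) {
    // Strip the reference list: cut at a "References"-style heading
    // that sits on a line of its own.
    const refHeading = /^\s*(references|bibliography|works cited)\s*$/im;
    const match = refHeading.exec(text);
    if (match) {
        text = text.slice(0, match.index);
    }
    // Rejoin words hyphenated across line breaks: "infor-\nmation" -> "information".
    text = text.replace(/([a-z])-\n([a-z])/gi, '$1$2');
    return text;
}
```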
Here's a reference on how to clean extracted PDF text. It's a Python package, but you can get an overall idea: https://github.com/pd3f/dehyphen

Toast fixes are okay by me.
Thanks, I wasn't aware of that package. Looks useful. Starred, bookmarked, and promptly forgotten. :P

It seems to require some kind of AI model (descriptions of the available models: [1] [2]).

To avoid duplicating existing work, I could implement a backend in Extras, calling `dehyphen` directly. An alternative would be to reimplement the functionality of `dehyphen` in JS.

What do you think?
You can run it under Extras as a separate module. I think that's fine, and better than any solution using pure regexes.
Ok, a new Extras module it is, then. Yes, quality will definitely be better with AI-based NLP, since unlike regexes, those models were actually built to do things like this. :) Expect an update to this PR, and a separate PR for the Extras side, probably some time next week. But first, I have some specific questions:
Finally, even with AI, I'd still prefer to keep the heuristics for stripping the reference list. Thing is, tagging headings is outside the scope of `dehyphen`. Ideally, we should also detect the next heading following the reference section, if any exists... but after styles are gone, I'm not sure if that's possible, beyond sending chunks of the document to the main LLM and instructing it to analyze them - which is horribly slow if the reference list is long, and probably not 100% reliable.

Often, appendices are not tagged as such - the convention is to just number the appendix headings as A, B, C, ..., where the content of each heading is arbitrary text. Fortunately, appendices are often not important for the main argument of a paper, so personally I'm fine with a solution that drops them too, at least for now. In some papers they do contain useful information, but I think we can leave that more difficult case for later.
You can have both. Just install dehyphen as a requirement for the Extras API.
I don't have any preference. But "Science mode" is not a great name, as there's no way to tell what exactly it does or why someone should use it.
Skip it.
Ok, will do.
In the general context of ST (not RAG only), true. Can be changed. I'll see what I come up with. The LLM might also have some ideas for naming the feature.
Ok.
@Cohee1207: There's one more thing I forgot to ask about: I'd like to have a progress indicator for the ingestion process, as it can still take a minute or two for an average-sized scientific paper PDF, even when using the Extras vectorizer. When I send a PDF to Vector Storage, I'm constantly glancing at my CPU usage monitor to see if the ingestion is still running.

But adding a progress indicator requires changing the API, because currently the ingestion request is a single call that only returns once the whole document has been processed. So my question is: if I change this, do we need backward compatibility?
I emphasize that at this point, this is just an idea that would, in my opinion, improve the UX. If it's easy enough to do, I might include it in this PR. If not, then it's maybe better to leave it for later.

A rough plan: this needs a very simple batch job controller, along the lines of the sketch below.
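Something like this (untested; all endpoint names and shapes are invented for illustration, and this assumes the usual Express `app` on the ST server):

```js
// Hypothetical batch job controller - all names/endpoints invented for illustration.
const crypto = require('node:crypto');
const jobs = new Map(); // jobId -> { done, total, finished }

// Start ingestion: respond immediately with a job id, process in the background.
app.post('/api/vector/insert-async', (req, res) => {
    const jobId = crypto.randomUUID();
    const chunks = req.body.items;
    const job = { done: 0, total: chunks.length, finished: false };
    jobs.set(jobId, job);
    (async () => {
        for (const chunk of chunks) {
            await insertVectorItem(chunk); // stand-in for the per-chunk insert
            job.done++;
        }
        job.finished = true;
    })();
    res.json({ jobId });
});

// Poll progress from the client, e.g. once a second, to drive a progress bar.
app.get('/api/vector/progress/:jobId', (req, res) => {
    const job = jobs.get(req.params.jobId);
    job ? res.json(job) : res.sendStatus(404);
});
```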
EDIT: terminology - in the first paragraph, I meant the Extras vectorizer, not a generic "backend".
This doesn't require backward compatibility as it's a built-in plugin using an API that nothing else is using.
Thanks! I'll go ahead, then. Updated schedule: expect something in the upcoming weeks. :) Summarizing, TODO:
Never mind me, just syncing this with the latest staging.
Also, a small status update: I installed `dehyphen` and gave it a try.

It works by scoring the candidate joinings with a user-selectable, character-based AI model. The perplexity of the original, unbroken text is obviously unknown, and varies between different texts, but since the measurement is local, this shouldn't matter. Since perplexity, roughly speaking, measures how surprising the text is, the algorithm picks the least surprising option. Most often, the least surprising option will be the correctly spelled option - but as with any spell-checker, rare words can trip it up. In practice, I think this is probably fine. Short of running an LLM-based analysis, this (or something similar in spirit) is the best an algorithm can do. (See the sketch below for the selection step in essence.)

I tried different models, and read through the source code; I noticed some behavior I didn't expect. So it seems that the text cleanup step might require a bit more work than expected. Stay tuned...
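In essence, the selection step is just an argmin over the candidate joinings (a sketch; `perplexity` stands in for the model-based scorer):

```js
// Sketch: pick the candidate joining with the lowest perplexity,
// i.e. the least surprising option according to the model.
function pickJoining(candidates, perplexity) {
    return candidates.reduce((best, c) =>
        perplexity(c) < perplexity(best) ? c : best);
}

// E.g. for "infor-\nmation", the candidates could be:
// pickJoining(['information', 'infor-mation', 'infor mation'], scorer);
```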
@Cohee1207: I had planned to do this in Extras, once I find the time to work on ST again. In the meantime, I've minimally updated the PR to resolve the merge conflicts. But I'd like to ask: now that Extras is discontinued, is there a preferred new approach? We will still need the text cleanup functionality somewhere.

Another thing: I suppose that while the existing Extras server still works in the near term, it won't remain usable indefinitely. I think we need a long-term solution for a fast local RAG embeddings provider, as well as a local websearch result parser. (At least I don't want to subscribe to SerpAPI just for that.) It's a pity to lose Talkinghead, but maybe, while a fun experiment, it's not that critical anyway.

With a recent Ooba, a 7B model with 8192 context now fully fits into 8 GB VRAM, after changing just the prompt preprocessing settings. (Also, LLaMA-3 8B fits, at the same context size. Though IMHO, Dolphin-Mistral is still the bee's knees in the 7B size class, especially the recent v2.8.)
The preferred way of adding functionality is server plugins.
Any reason why transformers.js embeddings don't work for you?
https://github.com/SillyTavern/SillyTavern-WebSearch-Selenium
Ok. Thanks for the quick response!
Speed. At least when I last tried it, ingesting a single 20-page PDF took upwards of a minute of wall time. With the Extras server, ingesting the same document completes in a few seconds. PyTorch is fast, even on CPU. I have a pile of nearly 4k scientific papers that I'm aiming to run RAG experiments on, so... yeah, I don't know if ST is built for that use case, but it's so close to an all-in-one LLM frontend that I'd like to use it for this, too.
Thank you, I didn't realize it was done already. :)
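Regarding server plugins, to check my understanding, I'm imagining something roughly like this (the exact interface the plugin loader expects is my assumption here, as are all the names):

```js
// plugins/rag-helper/index.js - rough skeleton. The interface shape
// is an assumption of what ST's plugin loader expects, not verified.
module.exports = {
    info: {
        id: 'rag-helper',
        name: 'RAG helper',
        description: 'Example endpoint for PDF text cleanup.',
    },
    // Receives an Express router to mount plugin endpoints on.
    async init(router) {
        router.post('/sanitize', (req, res) => {
            // sanitize() is hypothetical - whatever cleanup we end up with.
            res.json({ text: sanitize(req.body.text) });
        });
    },
    async exit() {},
};
```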
You can look into skipping transformers.js entirely and running ONNX models directly with CUDA acceleration somehow.
Calling into CUDA from JS? I suppose I can look into it. I'll keep you posted.
Let's first investigate the near future of transformers.js. Ideally, I'd like a multicore CPU solution, laptop VRAM sizes being what they are - so that only the LLM itself takes up GPU memory.

Some preliminary investigation: transformers.js runs on the ONNX runtime. Both onnxruntime-web and onnxruntime-node should already be running on WASM, and the precompiled WASM binaries have promising-sounding threaded variants, so multicore CPU support might be just a configuration tweak (see the sketch below).

For possibilities running on GPU: it looks like transformers.js is adding WebGPU support in v3. That could be another long-term solution here. It seems it's already possible to get Transformers models running on WebGPU with some glue code, as in this blog post by Wei Lu (2024). I'm not sure if it's worth the time investment, though, or whether it's better to just wait until we can upgrade to transformers.js v3.

The ONNX runtime itself already supports WebGPU on the web (browser) side, at least on Chrome-based browsers, on both Linux x64 and Windows x64. On the node.js (server) side, the support is still incomplete. On Linux x64, CUDA 11.8 is supported; no idea whether more recent versions are. On Windows x64, only DirectML is supported - as expected, that's Windows-specific, so not available on other OSs. On MacOS x64, ONNX currently has no GPU option, neither on the web nor on the node.js side.
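If the threaded WASM build works, the transformers.js side would look something like this (untested; the `env` knobs are documented in transformers.js/onnxruntime-web, but the thread count is illustrative, and `Xenova/all-MiniLM-L6-v2` is just an example model):

```js
// Untested sketch: asking the transformers.js WASM backend for multiple threads.
import { env, pipeline } from '@xenova/transformers';

env.backends.onnx.wasm.numThreads = 4; // worker threads for the WASM backend
env.backends.onnx.wasm.simd = true;    // prefer the SIMD build where available

const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
const embedding = await extractor('some chunk of text', { pooling: 'mean', normalize: true });
```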
Here's a PR for the client side of Vector Storage.
Science mode is a prototype. It's been in its current state for the past two weeks or so; I might as well post it now, for discussion.
It works with some documents (so it's already useful!), but it no-ops on others. This is essentially because the text content of PDF files in the wild often isn't very clean.
I quickly looked at `extractTextFromPDF` in `SillyTavern/public/scripts/utils.js`. When the function iterates over each page of the input PDF, if the "items" returned by `pdfjs` are paragraphs or something, we could improve things by separating them with `\n` instead of a single space. This would keep headings on their own line, making the job of Science mode much easier (and making it much more reliable). A sketch of what I mean is below.
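Something like this (untested; based on the shape of the text items `pdfjs` returns from `getTextContent()`):

```js
// Untested sketch: emit '\n' between pdfjs text items when the baseline Y
// changes, instead of always joining with a single space.
async function pageToText(page) {
    const content = await page.getTextContent();
    let text = '';
    let lastY = null;
    for (const item of content.items) {
        const y = item.transform[5]; // vertical position of this item's baseline
        if (lastY !== null) {
            text += (y === lastY) ? ' ' : '\n';
        }
        text += item.str;
        lastY = y;
    }
    return text;
}
```

Comparing `transform[5]` (the baseline Y) between consecutive items is a rough heuristic, but it should be enough to keep headings on their own lines.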
I'm also thinking that the new function `sanitizeScientificInput` should perhaps already be called by `extractTextFromPDF` (so it should be moved to `utils.js`, and the science mode setting moved accordingly), instead of being used only by Vector Storage.

Also, this PR is still missing one feature I want to add - as mentioned in my TODO list, I'd like Vector Storage to emit an error message if the context length is too short for the RAG results. I simply haven't had the time to trace through that part of the code yet.
So this PR is still a WIP, but I plan to finish it in the near future.
@Cohee1207: What is your opinion?