Small improvements to RAG #1871

Open
Technologicat wants to merge 9 commits into base: staging

Conversation

Technologicat
Contributor

Here's a PR for the client side of Vector Storage.

  • New setting Science mode (off by default).
    • When enabled, the ingestion process heuristically sanitizes the input text (to fix headings T hat L ook L ike T his) and attempts to find and strip the reference list, in order to improve RAG search quality. (A rough sketch of the heuristic follows this list.)
    • In particular, this prevents the RAG database from being poisoned by the high keyword concentration in the reference list. Without Science mode, it very often happens that when a scientific paper is fed into Vector Storage, almost any RAG lookup returns only snippets of the reference list and nothing else.
  • Add some toast notifications:
    • File ingestion completed successfully
    • File ingestion failed
    • Vectorize all failed
  • Reformat all existing toast notifications to have Vector Storage in the title, to make it explicit which part of ST the notification came from.
  • Add some debug console messages, to help track what happens during RAG lookup.
    • This makes it possible to notice rare, unintuitive cases, such as the file's text content coming in just slightly under the configured size limit. If the input is a PDF, the user doesn't have a copy of the exact text content as seen by ST's importer.
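
For reference, here is a minimal sketch of the kind of heuristic Science mode relies on. This is illustrative only: the function names and regexes below are made up for this example and are not the exact code in this PR.

    // Illustrative sketch, not the actual PR code.
    // A line like "T his L ooks B roken" has several isolated capitals followed
    // by lowercase runs; treat such lines as broken headings and re-join the letters.
    function looksLikeBrokenHeading(line) {
        return (line.match(/\b[A-Z] [a-z]/g) || []).length >= 2;
    }

    function fixSpacedHeading(line) {
        return looksLikeBrokenHeading(line)
            ? line.replace(/\b([A-Z]) (?=[a-z])/g, '$1')
            : line;
    }

    // Strip everything from a standalone "References" heading onward.
    function stripReferenceList(text) {
        const lines = text.split('\n').map(fixSpacedHeading);
        const idx = lines.findIndex(line => /^references$/i.test(line.trim()));
        return (idx >= 0 ? lines.slice(0, idx) : lines).join('\n');
    }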

Science mode is a prototype. It's been in its current state for the past two weeks or so, so I might as well post it now for discussion.

It works with some documents (so it's already useful!), but it no-ops on others. This is essentially because the text content of PDF files in the wild often isn't very clean.

I quickly looked at extractTextFromPDF in SillyTavern/public/scripts/utils.js. When the function iterates over each page of the input PDF, if the "items" returned by pdfjs are paragraphs or something, we could improve things by separating them by \n instead of a single space. This would keep headings on their own line, making the job of Science mode much easier (and making it much more reliable).
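
Concretely, the page loop could join the text items with newlines, along these lines (a sketch only - the real extractTextFromPDF in utils.js differs in detail, and whether pdfjs items map to semantically meaningful units still needs checking):

    // Sketch of the idea; not the actual utils.js code.
    async function extractTextFromPDFSketch(pdf) {
        const pageTexts = [];
        for (let i = 1; i <= pdf.numPages; i++) {
            const page = await pdf.getPage(i);
            const content = await page.getTextContent();
            // Joining with '\n' instead of ' ' keeps headings (and other standalone
            // items) on their own lines, which Science mode can then detect.
            pageTexts.push(content.items.map(item => item.str).join('\n'));
        }
        return pageTexts.join('\n');
    }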

I'm also thinking that the new function sanitizeScientificInput should perhaps already be called by extractTextFromPDF (which would mean moving it to utils.js, and moving the Science mode setting accordingly), instead of it being used only by Vector Storage.

Also, this PR is still missing one piece of functionality I want to add - as mentioned in my TODO list, I'd like Vector Storage to emit an error message if the context length is too short for the RAG results. I simply haven't had the time to trace through that part of the code yet.

So this PR is still a WIP, but I plan to finish it in the near future.

@Cohee1207: What is your opinion?

@Cohee1207
Member

Here's a reference on how to clean extracted PDF text. It's a Python package, but you can get an overall idea.

https://github.com/pd3f/dehyphen

Toast fixes are okay by me.

@Technologicat
Contributor Author

Thanks, I wasn't aware of that package. Looks useful. Starred, bookmarked, and promptly forgotten. :P

It seems to require some kind of AI model (flair). From the viewpoint of producing high-quality results, perplexity analysis does seem like the right way to do this.

(Description of the available models: [1] [2]. It seems flair auto-downloads from AWS.)

To avoid duplicating existing work, I could implement a backend in Extras, calling dehyphen. We already have an embeddings provider there anyway, so in my opinion, a text sanitizer would fit right in.

(flair is also an embeddings provider, so we'd then have two, unless we switch to flair only - assuming it supports the sentence embeddings Vector Storage needs. I'll need to look into this in more detail.)

An alternative would be to reimplement the functionality of dehyphen in JS, but the flair part may be a lot of work and/or unusably slow. This option doesn't seem so inviting to me.

What do you think?

@Cohee1207
Member

You can run it under Extras as a separate module. I think that's fine, and better than any pure-regex solution would be.

@Technologicat
Contributor Author

Ok, a new Extras module it is, then. Yes, quality will definitely be better with AI-based NLP, since unlike regexes, it was actually built to do things like this. :)

Expect an update to this PR, and a separate PR for the Extras side, probably some time next week.

But first, I have some specific questions:

  • Do you prefer if I replace the current sentence embedder with flair, or is it better to have both?
    • I.e., which is more important: fewer dependencies and simpler code in Extras, or getting identical embeddings from the Extras and Main vectorization sources?
  • Do you mind if I move the text sanitizer to the core, make it sanitize all PDF (and TXT?) attachments, and move the Science mode option to the general advanced User settings?
  • What should we do if Extras is not installed?
    • Skip sanitization,
    • Apply a regex-based fallback, or
    • Something else?

Finally, even with AI, I'd still prefer to use \n as a separator in extractTextFromPDF if it turns out that the items are (at least mostly) semantically meaningful units. The dehyphen library will fix spurious hyphenation and whitespaces inserted mid-word, which will help a lot, but in order to strip the reference list (which is the main point of Science mode), we also need to detect the References section heading.

The thing is, tagging headings is outside dehyphen's job description. After text extraction, styles are gone, so I think all we can do easily and cheaply is look for the literal text "References", alone, on its own line.

Ideally, we should also detect the next heading following the reference section, if one exists... but after styles are gone, I'm not sure if that's possible, beyond sending chunks of the document to the main LLM and instructing it to analyze them, which is horribly slow if the reference list is long, and probably not 100% reliable. Often, appendices are not tagged as such - the convention is to just number the appendix headings as A, B, C, ... where the content of each heading is arbitrary text.

Fortunately, appendices are often not important for the main argument of a paper, so personally I'm fine with a solution that drops them too, at least for now. In some papers they do contain useful information, but I think we can leave that more difficult case for later.

@Cohee1207
Member

Do you prefer if I replace the current sentence embedder with flair, or is it better to have both?

You can have both. Just install dehyphen as a requirement for Extras API.

Do you mind if I move the text sanitizer to the core, make it sanitize all PDF (and TXT?) attachments, and move the Science mode option to the general advanced User settings?

I don't have any preference. But also, "Science mode" is not a great name, as there is no way to tell what exactly it does and why someone should be using it.

What should we do if Extras is not installed?

Skip it.

@Technologicat
Contributor Author

You can have both. Just install dehyphen as a requirement for Extras API.

Ok, will do.

I don't have any preference. But also, "Science mode" is not a great name, as there is no way to tell what exactly it does and why someone should be using it.

In the general context of ST (not RAG only), true. Can be changed. I'll see what I come up with. The LLM might also have some ideas for naming the feature.

What should we do if Extras is not installed?

Skip it.

Ok.

@Technologicat
Contributor Author

Technologicat commented Mar 6, 2024

@Cohee1207: There's one more thing I forgot to ask about:

I'd like to have a progress indicator for the ingestion process, as it can still take a minute or two for an average-sized scientific paper PDF even when using the Extras vectorizer. When I send in a PDF to Vector Storage, I'm constantly glancing at my CPU usage monitor to see if the ingestion is still running.

But adding a progress indicator requires changing the API, because currently, the ingestion request (/api/vector/insert) returns only after the whole thing completes.

So my question is: if I change this, do we need backward compatibility?

  • Do people typically run ST fully locally? In this case, the client and server versions are guaranteed to match. OR,
  • Are distributed installations a thing we need to account for? These can obviously have different versions of ST on the client and server machines.

I emphasize that at this point, this is just an idea that in my opinion would improve the UX. If it's easy enough to do, I might include it in this PR. If not, then it's maybe better to leave it for later.


A rough plan (a code sketch follows this list). This needs a very simple batch job controller:

  • The original request (sending the document for ingestion) would return immediately, with a job ID.
  • There would be another endpoint to get the status for a given job ID.
    • On the server, the status could live in a dictionary keyed by job ID.
    • It could also contain other metadata such as the filename so that we can display and/or log it easily.
    • Job status could be:
      • Initially: queued.
      • While running: the current progress, e.g. number of completed and total batches.
      • When completed: the ok/failed status, and possible other data (results) if the client needs that.
    • Querying the status of a completed job would return the results, and also remove the job from the dictionary.
  • The server would update the status inside the loop in getBatchVector as each batch completes, and one final time when it's done (or when it catches an exception).
  • The client, while waiting for ingestion to complete, would poll the job status every few seconds, and update its progress indicator accordingly.
    • Could be implemented in insertVectorItems, where it currently just awaits on /api/vector/insert.
    • Perhaps a toast message every 10 seconds or so would be the easiest. Or a div with some kind of progressbar. I don't yet know exactly what we have easily available.
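
In code, the server side of the plan could look roughly like this. This is a sketch only: /api/vector/insert exists today, but the status endpoint, the payload shapes, and the ingestInBackground helper are hypothetical here.

    // Hypothetical sketch of the batch job controller, not actual ST code.
    const { randomUUID } = require('crypto');
    const jobs = new Map(); // jobId -> { status, filename, completed, total, error? }

    router.post('/api/vector/insert', (req, res) => {
        const jobId = randomUUID();
        jobs.set(jobId, { status: 'queued', filename: req.body.name, completed: 0, total: 0 });
        // Kick off ingestion asynchronously; getBatchVector would update the
        // job entry as each batch completes, and once more when done.
        ingestInBackground(jobId, req.body).catch(error => {
            jobs.set(jobId, { status: 'failed', error: String(error) });
        });
        return res.json({ jobId });
    });

    router.get('/api/vector/status/:jobId', (req, res) => {
        const job = jobs.get(req.params.jobId);
        if (!job) return res.sendStatus(404);
        // Querying a finished job returns its final status and removes it.
        if (job.status === 'done' || job.status === 'failed') jobs.delete(req.params.jobId);
        return res.json(job);
    });

And on the client, insertVectorItems would poll instead of awaiting a single long request, e.g.:

    // Hypothetical polling loop for the client side.
    async function waitForIngestion(jobId) {
        for (;;) {
            const job = await (await fetch(`/api/vector/status/${jobId}`)).json();
            if (job.status === 'done' || job.status === 'failed') return job;
            toastr.info(`Vector Storage: ingesting ${job.filename}, batch ${job.completed}/${job.total}`);
            await new Promise(resolve => setTimeout(resolve, 5000));
        }
    }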

EDIT: Terminology: in the first paragraph, meant the Extras vectorizer, not generic "backend".

@Cohee1207
Member

This doesn't require backward compatibility as it's a built-in plugin using an API that nothing else is using.

@Technologicat
Contributor Author

Thanks! I'll go ahead, then.

Updated schedule, expect something in the upcoming weeks. :)

Summarizing, TODO:

  • Extras:
    • Add text cleanup endpoint (/api/sanitize or something) to Extras, using dehyphen as the backend.
  • Core:
    • Sanitization, feature 1:
      • When a PDF or TXT file is attached, if Extras is available, call the new endpoint to clean up the text (see the sketch after this list).
        • For all PDF/TXT attachments, not only in RAG!
      • See if we can separate section headings to their own line in extractTextFromPDF.
        • This is needed to detect the reference list in RAG, but is outside the job description of dehyphen.
        • Better formatting won't hurt elsewhere, either, so I think this still belongs to the same sanitization feature.
      • Invent a reasonable name for the feature and make it optional, with a toggle in Advanced User Settings.
        • Maybe "Clean up text of attached files (requires Extras)", or something?
        • Explain in the tooltip that this uses an AI model to fix spurious hyphenation and whitespaces, as is commonly encountered in text extracted from PDF files. Fixing broken formatting makes the text easier to analyze for the LLM.
        • I think this should default to on (and just let it error out if Extras is not available or is too old) - the toggle is mainly there in case this breaks someone's use case.
    • Sanitization, feature 2:
      • Modify the prototype "science mode" to only strip the reference list, not attempt any other sanitization.
      • Rename to "Strip reference list from attached files".
      • Move this feature to the core, too. No reason to not be able to strip reference lists from regular file attachments even when not using RAG.
      • Explain in the tooltip that when using RAG on a scientific paper, this mitigates database poisoning due to high concentration of search terms in the reference list. When not using RAG, omitting the reference list from the text content of the file attachment can save lots of tokens that would just fill up the context while not being very useful.
      • Warn that for most documents that have a reference list, this will strip appendices, too.
      • For best results, broken input files also require the sanitizer, to improve the chances of successful reference list detection. Mention this somewhere?
  • Vector Storage:
    • Add a progress indicator to file ingestion, as per the rough plan in the previous comment above.
    • Debug the silent failure when the context window is too small; make it emit an error message.
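
For the "skip it if Extras is not available" behavior of feature 1, the call site could look roughly like this. A sketch only: the /api/sanitize endpoint doesn't exist yet and its name is a placeholder as noted above, and I'm recalling ST's existing Extras helpers (getApiUrl, doExtrasFetch, modules) from memory.

    // Hypothetical sketch; the 'sanitize' module and endpoint are placeholders.
    async function maybeSanitizeAttachmentText(text) {
        if (!modules.includes('sanitize')) {
            return text; // Extras not connected, or module not loaded: skip sanitization.
        }
        try {
            const url = new URL(getApiUrl());
            url.pathname = '/api/sanitize';
            const result = await doExtrasFetch(url, {
                method: 'POST',
                headers: { 'Content-Type': 'application/json' },
                body: JSON.stringify({ text }),
            });
            const data = await result.json();
            return data.text ?? text;
        } catch (error) {
            console.warn('Vector Storage: text sanitization failed, using original text', error);
            return text;
        }
    }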

@Technologicat
Contributor Author

Never mind me, just syncing this with the latest staging.

@Technologicat
Contributor Author

Also, small status update:

I installed dehyphen into my extras venv, played around with it, and investigated it in more detail.

A user-selectable, character-based AI model from flair is used to score the perplexity of the different options for what the text could have been before the formatting broke it. The options themselves are generated by dehyphen using simple rules. Then dehyphen picks the best-scoring option.

The perplexity of the original, unbroken text is obviously unknown, and varies between different texts, but since the measurement is local, this shouldn't matter.

Since perplexity, roughly speaking, measures how surprising the text is, the algorithm picks the least surprising option. Most often, the least surprising option will be the correctly spelled option - but as with any spell-checker, rare words can trip it up.

In practice, I think this is probably fine. Short of running an LLM-based analysis, this (or something similar in spirit) is the best an algorithm can do.

The flair AI model can run on the GPU. Here's how to select the device. We can have this controlled by the GPU option of Extras, or add a separate option just for this model. I'm thinking a separate option could be useful, since this might be performance-critical, and doesn't take much VRAM, so laptop users will likely want to run the text sanitizer on the GPU.

I tried different flair models (multi, en, news), and didn't get any of them to recognize paragraph breaks correctly. They all want to turn my test paragraph sequence (semi-randomly picked from Brown et al., 2020) into one giant paragraph. dehyphen uses the last and first lines (across the paragraph break) for scoring - one possible idea would be to try a few more lines.

Also, reading the source code, I noticed that dehyphen doesn't even attempt to clean up spurious whitespace inserted into words. Usually, this is an issue only for headings in PDFs, but we'll need to clean up headings to be able to detect the References section for that other sanitization feature. I think I can generate the options via regex, and then score the perplexities using flair to pick the most likely one, similarly to how dehyphen already works.

So it seems that the text cleanup step might require a bit more work than expected. Stay tuned...

github-actions bot added the 🚫 Merge Conflicts label on Apr 25, 2024
github-actions bot removed the 🚫 Merge Conflicts label on May 3, 2024
@Technologicat
Contributor Author

@Cohee1207: I had planned to do this in Extras, once I find the time to work on ST again.

In the meantime, I've minimally updated the PR to resolve the merge conflicts.

But I'd like to ask, now that Extras is discontinued, is there a preferred new approach? We will need the flair model or something similar to compute the perplexity scores.

Another thing: I suppose that while the existing Extras server still works in the near term, it won't remain usable indefinitely.

I think we need a long-term solution for a fast local RAG embeddings provider, as well as a local websearch result parser. (At least I don't want to subscribe to SerpAPI just for that.)

It's a pity to lose Talkinghead, but while it was a fun experiment, maybe it's not that critical anyway. With a recent Ooba, a 7B model with 8192 context now fully fits into 8 GB VRAM, changing just the prompt processing setting to n_batch=128. I'm getting 30 tokens/sec in ST. Now that's what I call fast enough.

(Also LLaMA-3 8B fits, at the same context size. Though IMHO, Dolphin-Mistral is still the bee's knees in the 7B size class, especially the recent v2.8.)

@Cohee1207
Member

The preferred way of adding functionality is server plugins.

fast local RAG embeddings provider

Any reason why transformer.js embeddings do not work for you?

local websearch result parser

https://github.com/SillyTavern/SillyTavern-WebSearch-Selenium

@Technologicat
Contributor Author

The preferred way of adding functionality is server plugins.

Ok. Thanks for the quick response!

fast local RAG embeddings provider

Any reason why transformer.js embeddings do not work for you?

Speed.

At least when I last tried it, ingesting a single 20-page PDF took upwards of a minute of wall time. With the Extras server, ingesting the same document completes in a few seconds. PyTorch is fast, even on CPU.

I have a pile of nearly 4k scientific papers I'm aiming to experiment on with RAG, so...

...yeah, I don't know if ST is built for that use case, but it's so close to an all-in-one LLM frontend that I'd like to use it for this, too.

local websearch result parser

https://github.com/SillyTavern/SillyTavern-WebSearch-Selenium

Thank you, I didn't realize it was done already. :)

@Cohee1207
Member

At least when I last tried it, ingesting a single 20-page PDF took upwards of a minute of wall time. With the Extras server, ingesting the same document completes in a few seconds. PyTorch is fast, even on CPU.

You can look into skipping transformers.js entirely and running ONNX models directly with CUDA acceleration somehow.

@Technologicat
Contributor Author

Calling into CUDA from JS? I suppose I can look into it. I'll keep you posted.

@Technologicat
Contributor Author

Let's first investigate the near future of transformers.js.

Ideally, I'd like a multicore CPU solution, laptop VRAM sizes being what they are - so that only the LLM itself takes up GPU memory.

Some preliminary investigation: transformers.js runs on the ONNX Runtime. Both onnxruntime-web and onnxruntime-node should already be running on WASM. The precompiled WASM binaries have promising-sounding -threaded and -simd-threaded variants, but I haven't yet found how to enable them. Maybe something to look into next. If that's good enough, there's no point in going further.
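
If those threaded builds can indeed be enabled, I'd expect it to happen via the runtime's env flags, something along these lines (untested sketch; I haven't verified that these are the right knobs, or that they take effect in ST's server-side context):

    // Untested sketch; flag and model names as I understand the transformers.js
    // docs, not verified here.
    import { env, pipeline } from '@xenova/transformers';

    // Ask the ONNX WASM backend to use multiple threads (the -threaded build).
    env.backends.onnx.wasm.numThreads = 4;

    const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
    const output = await extractor('Hello world', { pooling: 'mean', normalize: true });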

For possibilities running on GPU, it looks like transformers.js is adding WebGPU support in v3. That could be another long-term solution here.

It seems it's already possible to get Transformers models running on WebGPU with some glue code, as in this blog post by Wei Lu (2024). I'm not sure if it's worth the time investment, though, or if it's better to just wait until we can upgrade to transformers.js v3.

The ONNX runtime itself already supports WebGPU on the web (browser) side, at least on Chrome-based browsers, on both Linux x64 and Windows x64.

On the node.js (server) side, the support is still incomplete. On Linux x64, CUDA 11.8 is supported. No idea whether more recent versions are. On Windows x64, only DirectML is supported. As expected, that's Windows-specific, so not available for other OSs.

On macOS x64, ONNX currently has no GPU option, on either the web or the node.js side.
