
[FEAT] Code Interpreter for higher quality response and output, while summarizing the files and documents #856

Open
haseeb-heaven opened this issue Mar 4, 2024 · 9 comments
Labels
enhancement (New feature or request), feature request

Comments

@haseeb-heaven

How are you running AnythingLLM?

AnythingLLM desktop app

What happened?

I was trying to chat with my documents. It was a basic document with employee data (names and IDs), and it wasn't able to generate a high-quality response. Most of the time it gave me random data, and the output did not match the output of the tools that are already available for this.

I have tested:
1. Claude 2.1
2. Gemini Pro
3. Local models

I'm not trying to promote my product, but I compared the results with tools that are already available: code interpreters that can generate code and analyze all the files on the local system.

Link: Code-Interpreter. This is my tool; I tested it with the same models, such as Claude 2.1, and got better, more accurate results with it, as well as with other tools already available that are called code interpreters.

Are there known steps to reproduce?

You can try with a very basic file and ask about the data. It will try to generate something like a table, but the data will not be accurate all the time, even if you use the same models in the other code interpreter software that is available.

@haseeb-heaven haseeb-heaven added the possible bug Bug was reported but is not confirmed or is unable to be replicated. label Mar 4, 2024
@haseeb-heaven haseeb-heaven changed the title [BUG]: The accuracy is totally very low and output is not a high quality output. [BUG]: low quality response and output, while summarizing the files and documents Mar 4, 2024
@tylerfeldstein

I'm having the same issue. I've offloaded the embedding to Ollama using nomic-embed-text to see if that was the issue, since I saw the local embedder has this.embeddingMaxChunkLength = 1_00;, but I'm having the same results. I have also pushed the LLM to LM Studio to see the logs in the API call and noticed the context is VERY limited. Limited enough to make the LLM hallucinate quite often.
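If you want to poke at that value yourself, a minimal sketch of the one-line experiment I mean is below (untested; the class and field names are as I read them in server/utils/EmbeddingEngines/native/index.js, and 1_000 is just a guess to tune):

// server/utils/EmbeddingEngines/native/index.js (sketch, untested)
// Raise the hard-coded chunk length so each chunk carries more context.
class NativeEmbedder {
  constructor() {
    // was: this.embeddingMaxChunkLength = 1_00;
    this.embeddingMaxChunkLength = 1_000; // assumption: larger chunks reduce truncation
  }
}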

@haseeb-heaven
Author

Yes, we need to improve this quality and reduce hallucination.

@tylerfeldstein

Try to jam the BERT tokenizer over the existing one and see what you get.
In /collector/utils/tokenizer/index.js add:

// Importing the BertTokenizer class from bert-tokenizer module.
const { BertTokenizer } = require("bert-tokenizer");

// Instantiate the BERT tokenizer.
const tokenizer = new BertTokenizer();

// A function that tokenizes a string using the BERT's text encoding.
// If no string is provided, it defaults to an empty string.
function tokenizeString(input = "") {
  try {
    // Tokenize the input string and return the tokens.
    return tokenizer.tokenize(input);
  } catch (e) {
    // If an error occurs, log a message to the console and return an empty array.
    console.error("Could not tokenize string!");
    return [];
  }
}

// Export the tokenizeString function.
module.exports = {
  tokenizeString,
};

You will want to run yarn add bert-tokenizer from the /collector/ directory.

With some PDFs I'm able to get better context. Still testing on my side; a quick sanity-check script is sketched below.
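If it helps, a throwaway script like this (hypothetical, not part of the repo; run with node from /collector/) is how I sanity-check that the swapped tokenizer loads and produces tokens:

// check-tokenizer.js -- hypothetical one-off script for local testing
const { tokenizeString } = require("./utils/tokenizer");

const sample = "Employee records: Jane Doe, ID 1042; John Smith, ID 1043.";
const tokens = tokenizeString(sample);
console.log(`Token count: ${tokens.length}`);
console.log(tokens.slice(0, 10)); // peek at the first few tokens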

@tylerfeldstein

tylerfeldstein commented Mar 5, 2024

I should add:

I'm using

    this.model = "NomicAi/nomic-embed-text-v1_5";
    this.cacheDir = path.resolve(
      process.env.STORAGE_DIR
        ? path.resolve(process.env.STORAGE_DIR, `models`)
        : path.resolve(__dirname, `../../../storage/models`)
    );
    // this.modelPath = path.resolve(this.cacheDir, "Xenova", "all-MiniLM-L6-v2"); // DEFAULT
    // this.modelPath = path.resolve(this.cacheDir, "BAAI", "bge-small-en-v1_5"); // Test 1
    this.modelPath = path.resolve(
      this.cacheDir,
      "NomicAi",
      "nomic-embed-text-v1_5"
    );

as the embedder in server/utils/EmbeddingEngines/native/index.js. I will be jumping over to BERT embedding in a second to get it all to match (you will have to copy the files into the STORAGE_DIR models folder if you are going to use this), then Qdrant as the DB, and Mixtral 8x7b as the LLM in LM Studio.
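To be clear on where the files have to go, this little sketch just mirrors the path resolution in the snippet above (illustrative only; adjust to wherever your STORAGE_DIR points):

// Sketch: resolves the same model directory the constructor snippet builds.
const path = require("path");

const cacheDir = process.env.STORAGE_DIR
  ? path.resolve(process.env.STORAGE_DIR, "models")
  : path.resolve(__dirname, "../../../storage/models");
const modelPath = path.resolve(cacheDir, "NomicAi", "nomic-embed-text-v1_5");
console.log("Copy the downloaded model files into:", modelPath);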

@timothycarambat timothycarambat added enhancement New feature or request feature request and removed possible bug Bug was reported but is not confirmed or is unable to be replicated. labels Mar 6, 2024
@timothycarambat timothycarambat changed the title [BUG]: low quality response and output, while summarizing the files and documents [FEAT] Code Interperter for higher quality response and output, while summarizing the files and documents Mar 6, 2024
@timothycarambat timothycarambat changed the title [FEAT] Code Interperter for higher quality response and output, while summarizing the files and documents [FEAT] Code Interpreter for higher quality response and output, while summarizing the files and documents Mar 6, 2024
@tylerfeldstein

tylerfeldstein commented Mar 6, 2024

Still chugging along on this.

Things I've learned

  • bge-small-en-v1_5 embedder did decently. Still wanting better responses.
  • nomic-embed-text-v1_5 performed better, but the container gets shut down when embedding a large doc. Could probably slow it down based on doc size so this doesn't happen?
  • It can still talk gibberish and give back less-than-desirable responses. Would like to get the document TITLE sent in the context for referencing later.

Best Results So Far

  • BERT tokenizer from above
  • Ollama running nomic for embedding at a 2048 chunk size (can probably be done on LocalAI too); see the rough .env sketch after this list
  • LM Studio running Mixtral 8x7b
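Roughly the embedding-related settings I mean, expressed as a .env sketch (untested; the variable names are from memory and are assumptions to double-check against the repo's .env.example):

# .env sketch (untested; confirm key names against .env.example)
EMBEDDING_ENGINE='ollama'                    # offload embedding to Ollama
EMBEDDING_BASE_PATH='http://127.0.0.1:11434' # local Ollama endpoint
EMBEDDING_MODEL_PREF='nomic-embed-text'      # nomic embedder
EMBEDDING_MODEL_MAX_CHUNK_LENGTH=2048        # the 2048 chunk size from above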

@Tiberius1313

Step by step instructions would be much appreciated 🙏
Issues:

  • When I use Mixtral 8x7b I get this during embedding: Error: "1 document failed to add. Could not embed document chunks! This document will not be recorded."
    And in LM Studio I get: [ERROR] Unexpected endpoint or method. (POST /v1/embeddings). Returning 200 anyway
  • I cannot find a path that fits: "/collector/utils/tokenizer/"

@tylerfeldstein

tylerfeldstein commented Mar 13, 2024

Update, v2. Results have been subpar with different embedding models. The setup I am currently running is:

  • BERT Tokenizer from above
  • Embedding: nomic-embed-text
  • Vector storage: Qdrant
  • LLM: TheBloke/dolphin-2.7-mixtral-8x7b.Q6_K.gguf

I have been messing with the chunking, though, and have been getting better results. I am going to try different chunking and text-splitting methods and overlaps. I read that a parent/child method works pretty well and will give that a try at some point (rough sketch below).

I believe chunking occurs in ./server/utils/vectorDbProviders/<your db>/index.js, in the section below.

      const textSplitter = new RecursiveCharacterTextSplitter({
        chunkSize:
          getEmbeddingEngineSelection()?.embeddingMaxChunkLength || 1_000,
        chunkOverlap: 20,
      });
      const textChunks = await textSplitter.splitText(pageContent);

Even by just changing the overlap manually to 100 or so, I feel like I get better results. I also tried this with 400 and 40 just to see what it would do, and it performed alright. This wouldn't be great for lots of hits, though, because it would clutter the context in the LLM, and there's a hard-coded max that will be hit.
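In case anyone else wants to try the parent/child idea, here is the rough shape I have in mind, reusing the same splitter (untested sketch; the chunk sizes and the helper name are placeholders, and the import path may differ by langchain version):

// Untested sketch of parent/child chunking. Children are what get embedded
// and searched; each child keeps a pointer to its larger parent so the full
// passage can be handed to the LLM at query time.
const { RecursiveCharacterTextSplitter } = require("langchain/text_splitter");

async function parentChildChunks(pageContent) {
  const parentSplitter = new RecursiveCharacterTextSplitter({
    chunkSize: 2_000, // wide "parent" chunks for context (placeholder value)
    chunkOverlap: 100,
  });
  const childSplitter = new RecursiveCharacterTextSplitter({
    chunkSize: 400, // small "child" chunks for retrieval (placeholder value)
    chunkOverlap: 40,
  });

  const chunks = [];
  const parents = await parentSplitter.splitText(pageContent);
  for (const [parentId, parent] of parents.entries()) {
    for (const child of await childSplitter.splitText(parent)) {
      chunks.push({ parentId, parent, child });
    }
  }
  return chunks;
}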

@timothycarambat
Member

@tylerfeldstein fyi, Related issue! #490

We can prioritize this so you can mess with it more easily. Are you using Docker, Desktop, or local dev?

@tylerfeldstein

ACK. I'll jump over to that one.
I've switched to a local dev environment so I can test it in real time.
