clarify splitting in documentation #42
FastEmbed will not do the splitting for you. Our default embedding model expects at most 512 tokens, and note that these tokens are different from OpenAI tokens!
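In practice that means chunks have to be prepared upstream and then passed to fastembed. A minimal sketch, assuming a recent fastembed release where `TextEmbedding` is the entry point and the chunks have already been produced by an external splitter:

```python
from fastembed import TextEmbedding

# Chunks are assumed to come from an external splitter; fastembed itself does not split.
chunks = [
    "First chunk of the paper, kept within the model's 512-token limit ...",
    "Second chunk ...",
]

model = TextEmbedding()  # defaults to a small BGE model; pass model_name=... to choose another
embeddings = list(model.embed(chunks))  # lazy generator of numpy vectors, one per chunk

print(len(embeddings), embeddings[0].shape)
```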
Of course, I have a custom splitter (in my case https://github.com/longevity-genie/indexpaper/blob/main/indexpaper/splitting.py#L119 ) that counts tokens with the selected HuggingFace model and splits accordingly. The problem is that with this approach I have to run the embedding tokenizer once more just for splitting, so I do not save much time. If fastembed had a token-aware splitter built in, it would save a lot of computation.
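For reference, a minimal sketch of that kind of token-aware splitting, assuming the `transformers` tokenizer for BAAI/bge-small-en-v1.5 (fastembed's default model in recent versions); the greedy window logic below is illustrative, not part of fastembed or of the indexpaper splitter linked above:

```python
from transformers import AutoTokenizer

# Assumed model: swap in whichever model you actually embed with.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")

def split_by_tokens(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Greedy token-window splitter: tokenize once, then decode fixed-size windows.

    max_tokens stays below 512 to leave headroom for the special tokens
    the model adds around each chunk.
    """
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(ids), step):
        window = ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window, skip_special_tokens=True))
        if start + max_tokens >= len(ids):
            break
    return chunks

chunks = split_by_tokens("Full text of the paper ...")
```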
@antonkulaga Yes, I thought the same. Is it possible to split texts using fastembed?
I think the close is premature. You have to measure the number of tokens to split the text, and for that you need to run the embedding tokenizer one more time. Since fastembed does not have proper splitting, I will have to use the much slower LangChain implementation, which reduces the benefit of fastembed.
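For anyone going the LangChain route in the meantime, a rough sketch: only the HuggingFace tokenizer needs to be loaded for splitting, not the full embedding model (this assumes `langchain` and `transformers` are installed; in newer LangChain versions the splitter lives in `langchain_text_splitters`):

```python
from transformers import AutoTokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Only the tokenizer is loaded for splitting; the embedding model stays out of this step.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")

splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    chunk_size=500,   # measured in tokens of this tokenizer, with headroom under 512
    chunk_overlap=50,
)

chunks = splitter.split_text("Full text of the paper ...")
```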
@antonkulaga OK, thanks for the answer.
Work in progress here: we're adding a token-based recursive splitter (based on LangChain, but with no dependency on it): #80. Would appreciate folks sharing any feedback!
@NirantK cool
I am using embeddings to embed scientific papers. Usually, I use LangChain splitters to split a paper into multiple chunks. However, it is not clear to me whether fastembed will do the splitting for me or whether I have to split everything myself (in which case I will have to run the embedding tokenizer to count tokens for each paragraph).