Adding semantic HTML chunking using Unstructured.io #415

Draft
wants to merge 21 commits into
base: development

tarockey
Collaborator

@tarockey tarockey commented Mar 22, 2024

Still in progress: this PR adds semantic chunking for HTML documents.

The same strategy should carry over to the other document chunkers.

Overall method:

  1. Use Unstructured to chunk the HTML by title.
  2. Use an embedding model to embed each resulting chunk.
    NOTE: this currently uses spaCy by default; adding a configurable embedding model is an open task.
  3. Compare the chunks and combine those that are semantically similar.
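
To make step 3 concrete, here is a minimal sketch of the merge pass. It is not the code in this PR. It assumes that only adjacent chunks are compared, and it swaps in a toy bag-of-words vector for the spaCy embedding so that it runs with the standard library alone. The names `embed`, `cosine`, and `merge_similar_chunks` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector. Stand-in (assumption) for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_similar_chunks(chunks: list[str], threshold: float) -> list[str]:
    """Fold each chunk into the previous one when their similarity meets the threshold.

    Assumption: only neighbouring chunks are compared, so document order is kept.
    """
    merged: list[str] = []
    for chunk in chunks:
        if merged and cosine(embed(merged[-1]), embed(chunk)) >= threshold:
            merged[-1] = merged[-1] + "\n" + chunk
        else:
            merged.append(chunk)
    return merged
```

For example, `merge_similar_chunks(["Install the package with pip", "Install the package with conda", "Billing is handled monthly"], 0.5)` joins the first two chunks and leaves the third on its own. Raising the threshold makes merging stricter.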

Changes:

  • Added SEMANTIC as a chunking strategy in the configuration.
  • Added SEMANTIC_SIMILARITY_THRESHOLD to the configuration; it defines how semantically similar two chunks must be before they are combined.
  • Updated load_documents to accept a config as a parameter, so these settings can be passed without overloading the load_documents signature.
  • Updated html_loader to include the semantic chunking logic.
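
As an illustration of how the two new settings could travel together in one config object, here is a hedged sketch. Only the names SEMANTIC and SEMANTIC_SIMILARITY_THRESHOLD come from this PR. The `ChunkingStrategy` enum, the `BASIC` member, the `ChunkingConfig` class, the dictionary keys, and the default threshold are assumptions made for the example, not the project's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class ChunkingStrategy(Enum):
    BASIC = "basic"        # hypothetical pre-existing strategy
    SEMANTIC = "semantic"  # strategy named in this PR


@dataclass
class ChunkingConfig:
    strategy: ChunkingStrategy = ChunkingStrategy.BASIC
    # Hypothetical default; the PR does not state one.
    semantic_similarity_threshold: float = 0.9

    @classmethod
    def from_dict(cls, raw: dict) -> "ChunkingConfig":
        """Build a config from a plain dict, e.g. one parsed from a JSON file."""
        defaults = cls()
        return cls(
            strategy=ChunkingStrategy(
                str(raw.get("chunking_strategy", defaults.strategy.value)).lower()
            ),
            semantic_similarity_threshold=float(
                raw.get(
                    "semantic_similarity_threshold",
                    defaults.semantic_similarity_threshold,
                )
            ),
        )


def uses_semantic_chunking(config: ChunkingConfig) -> bool:
    """A loader can branch on this instead of taking extra keyword arguments."""
    return config.strategy is ChunkingStrategy.SEMANTIC
```

Passing one object like this keeps a loader's signature stable as more chunking options are added, which is the motivation given for the load_documents change.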

@tarockey tarockey self-assigned this Mar 25, 2024
@tarockey tarockey added the enhancement New feature or request label Mar 25, 2024
@tarockey tarockey marked this pull request as ready for review April 2, 2024 16:24
@tarockey tarockey changed the title DRAFT: Adding semantic HTML chunking using Unstructured.io Adding semantic HTML chunking using Unstructured.io Apr 2, 2024
@tarockey tarockey marked this pull request as draft April 2, 2024 17:44
Labels
enhancement New feature or request

1 participant