Curate scraped HTML for large language models. Build more robust generative AI applications. Convert HTML to Markdown using Regex, BeautifulSoup4, and filter useless content with Jina Embeddings.

Daethyra/context-converter

Convert and Format HTML to Markdown

Description

Extracts HTML content from a JSON file and converts it into a Markdown file. A similarity threshold is used to remove redundant content.
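The JSON-to-Markdown step can be pictured with a minimal sketch. It is not the project's implementation: it uses only the standard library (regex, no BeautifulSoup4), and the record layout (a list of objects with an `html` key) and the function names `html_to_markdown` and `convert_records` are assumptions made for illustration.

```python
import json
import re
from html import unescape


def html_to_markdown(html: str) -> str:
    """Tiny regex-based converter: headings and paragraphs only."""
    # Drop script/style blocks entirely, including their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", html)

    # <h1>..<h6> become Markdown headings with a matching number of '#'.
    def heading(match: re.Match) -> str:
        level = int(match.group(1))
        return "\n" + "#" * level + " " + match.group(2).strip() + "\n"

    text = re.sub(r"(?is)<h([1-6])[^>]*>(.*?)</h\1>", heading, text)

    # <p> blocks become paragraphs separated by blank lines.
    text = re.sub(
        r"(?is)<p\b[^>]*>(.*?)</p>",
        lambda m: "\n" + m.group(1).strip() + "\n",
        text,
    )

    # Strip any remaining tags, decode entities, collapse extra blank lines.
    text = re.sub(r"(?s)<[^>]+>", "", text)
    text = unescape(text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def convert_records(json_text: str, html_key: str = "html") -> str:
    """Convert every record's HTML (hypothetical 'html' key) and join the results."""
    records = json.loads(json_text)
    return "\n\n".join(html_to_markdown(record[html_key]) for record in records)


sample = json.dumps(
    [{"html": "<h1>Title</h1><p>Hello &amp; welcome</p><script>x()</script>"}]
)
markdown = convert_records(sample)
print(markdown)
```

A real input file may use a different key or nesting, so the `html_key` argument would need to match whatever the crawler produced.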

Problem to Solve

Building retrieval-augmented generation (RAG) applications can be a lengthy process. Web crawlers can collect content, but post-processing that content is equally important for accurate and helpful generation.

This library was built specifically to augment the context-curator project by further automating the document creation process.

Quick Start | Getting Started

  1. Installation
    • To work with the package in your local environment (your working directory), clone the repository using git: git clone https://github.com/daethyra/context-converter.git
    • To install via pip, run: pip install context-converter

  Optional: Run jina_embeddings.py to download the embeddings model ahead of time.

  2. Navigate into the context-converter folder: cd context-converter

  3. Place a JSON file of HTML content into the same folder.

  4. Run python3 main.py

Your output file will be created in the same folder.

Configuration

You can tune the similarity threshold and other parameters to control which content is kept.

i. In main.py, you can set the following parameters to optimize your results:

  • main.py
    • chunk_size: The size of the chunk to be processed. The default value is 256.
    • You can find speed tests here.

ii. In converter.py, you can set the following parameters to optimize your results:

  • converter.py
    • similarity.item(): The similarity threshold. The default value is 0.868899. Content is removed only when its similarity value is above the threshold, so a higher threshold removes less content and a lower threshold removes more.
    • batch_size: Process embeddings for the given lines in batches. The default value is 16, which has proved to be faster than higher values, up to 256. Speed test results.
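To make the effect of the threshold and batch size concrete, here is a minimal sketch under stated assumptions. It uses hand-written toy vectors in place of Jina Embeddings, plain-Python cosine similarity, and it assumes each line is compared against the lines already kept; the actual comparison strategy in converter.py may differ. The names `cosine_similarity`, `batched`, and `filter_redundant` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def batched(items, batch_size=16):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def filter_redundant(lines, vectors, threshold=0.868899):
    """Keep a line unless it is more similar than `threshold` to a line already kept."""
    kept_lines, kept_vectors = [], []
    for line, vector in zip(lines, vectors):
        if any(cosine_similarity(vector, kept) > threshold for kept in kept_vectors):
            continue  # similarity above the threshold: treated as redundant
        kept_lines.append(line)
        kept_vectors.append(vector)
    return kept_lines


lines = ["Home", "Home page", "Pricing"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy stand-ins for embeddings

default_result = filter_redundant(lines, vectors)                  # default threshold
strict_result = filter_redundant(lines, vectors, threshold=0.999)  # higher threshold
batches = list(batched(list(range(5)), batch_size=2))

print(default_result)
print(strict_result)
print(batches)
```

With the default threshold the near-duplicate "Home page" line is dropped, while raising the threshold to 0.999 keeps all three lines, which matches the rule that a higher threshold removes less content. In a real run, each batch from `batched` would be passed to the embedding model rather than printed.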

License

MIT
