Convert and Format HTML to Markdown

Table of Contents:

Description
Problem to Solve
Quick Start | Getting Started
Configuration
License

Description

Extracts HTML content from a JSON file and converts it into a Markdown file. A similarity threshold is used to detect and remove redundant content.

Problem to Solve

Building retrieval-augmented generation (RAG) AI applications can be a lengthy process. While there are web crawlers to collect content, post-processing that content is equally important for accurate and helpful generation.

This library was built specifically to augment the context-curator project by further automating the document creation process.

Quick Start | Getting Started

Installation

To use the package in your local environment (your working directory), clone the repository using git: git clone https://github.com/daethyra/context-converter.git
Alternatively, to install via pip, run: pip install context-converter

Optional: Run jina_embeddings.py to preemptively download the embeddings model.

Navigate into the context-converter folder: cd context-converter
Place a JSON file of HTML content into the same folder.
Run python3 main.py

Your output file will be created in the same folder.
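The layout of the input JSON is not described here, so the following is only a minimal sketch of the general idea, not this package's implementation. It assumes a hypothetical input shape (a JSON list of objects with an "html" key) and uses only the Python standard library; the names TextExtractor and html_to_markdown are made up for illustration.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and turns <h1>..<h6> into Markdown headings."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._prefix = ""

    def handle_starttag(self, tag, attrs):
        # h1..h6 become "# ".."###### "; other tags add no prefix.
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self._prefix = "#" * int(tag[1]) + " "

    def handle_endtag(self, tag):
        # Any closing tag ends the current heading prefix.
        self._prefix = ""

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append(self._prefix + text)


def html_to_markdown(html):
    """Return a very rough Markdown rendering of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.lines)


# Hypothetical input shape: a JSON list of objects with an "html" key.
raw = '[{"html": "<h1>Title</h1><p>Hello world.</p>"}]'
records = json.loads(raw)
markdown = "\n\n".join(html_to_markdown(r["html"]) for r in records)
print(markdown)
```

A real conversion would also need to handle links, lists, code blocks, and tables, which this sketch ignores.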

Configuration

You can tune the similarity threshold and other parameters to control what ends up in the output.

i. In main.py, you can set the following parameters to optimize your results:

main.py
- chunk_size: The size of the chunk to be processed. The default value is 256.
- You can find speed tests here.
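The description of chunk_size is brief, so the sketch below only illustrates one plausible reading: that the input is processed in consecutive groups of at most chunk_size items. The helper name iter_chunks and the assumption that a chunk is counted in lines are hypothetical and not taken from main.py.

```python
def iter_chunks(items, chunk_size=256):
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


# 600 lines split with the default of 256 give two full chunks and a remainder.
lines = [f"line {i}" for i in range(600)]
sizes = [len(chunk) for chunk in iter_chunks(lines, chunk_size=256)]
print(sizes)  # [256, 256, 88]
```

Under this reading, a smaller chunk_size means more, smaller units of work.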

ii. In converter.py, you can set the following parameters to optimize your results:

converter.py
- similarity.item(): The similarity value that is compared against the threshold. The default threshold is 0.868899. Content whose similarity exceeds the threshold is removed, so a higher threshold removes less content and a lower threshold removes more.
- batch_size: The number of lines whose embeddings are processed per batch. The default value is 16, which has proved to be faster than higher values, up to 256. Speed test results.
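To make the effect of the threshold concrete, here is a minimal sketch of threshold-based redundancy removal. It is not the code in converter.py: it assumes cosine similarity as the measure, uses hand-written toy vectors in place of real model embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drop_redundant(lines, embeddings, threshold=0.868899):
    """Keep a line only if it is not too similar to any line already kept."""
    kept_lines, kept_vecs = [], []
    for line, vec in zip(lines, embeddings):
        # A similarity above the threshold marks the line as redundant.
        if any(cosine_similarity(vec, k) > threshold for k in kept_vecs):
            continue
        kept_lines.append(line)
        kept_vecs.append(vec)
    return kept_lines


lines = ["Install the package.", "Install the package now.", "Run the tests."]
# Toy 2-D vectors standing in for real embeddings (an assumption).
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]

print(drop_redundant(lines, embeddings))
# ['Install the package.', 'Run the tests.']
print(drop_redundant(lines, embeddings, threshold=0.999))
# ['Install the package.', 'Install the package now.', 'Run the tests.']
```

The second call shows the behaviour described above: raising the threshold removes less content, because fewer pairs count as near-duplicates.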

License

MIT
