Nextjs Doc Scraper

Scrape data from the Next.js docs:

This scrapes the Next.js documentation with Playwright. Cheerio then transforms and cleans the data and adds wrappers that help an AI make sense of it. Finally, the result is saved as separate files in the data/nextjs folder.

npm run scrap
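
As a rough illustration of the "wrap and save" part of this step, the sketch below derives an output path under data/nextjs from a page URL and wraps the cleaned text with labels. The helper names (outputPathFor, wrapPage), the label format and the file naming are assumptions made for this example; the actual logic in nextjs.dev.ts may differ.

```typescript
// Hypothetical helpers: names, wrapper format and file naming are assumptions,
// not taken from nextjs.dev.ts.

interface ScrapedPage {
  url: string;   // address of the documentation page
  title: string; // page title after cleaning
  body: string;  // cleaned text content
}

// Turn a documentation URL into a flat file path under data/nextjs.
function outputPathFor(url: string): string {
  const withoutOrigin = url.replace(/^https?:\/\/[^\/]+/, "");
  // Drop any query string or fragment, keep only the path.
  const pathOnly = withoutOrigin.split(/[?#]/)[0] ?? "";
  const slug = pathOnly
    .split("/")
    .filter((part) => part.length > 0)
    .join("_");
  return `data/nextjs/${slug.length > 0 ? slug : "index"}.txt`;
}

// Wrap the cleaned text with labels so a model can tell title, source and content apart.
function wrapPage(page: ScrapedPage): string {
  return [
    `Title: ${page.title}`,
    `Source: ${page.url}`,
    "Content:",
    page.body.trim(),
  ].join("\n");
}
```

Keeping one file per page makes it easy to trace an embedded chunk back to its source page later on.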

Link to Playwright

Link to npm Cheerio

Scrape stats:

To get statistics on the scraped data, run this command:

  npm run scrapstat

Create a database to store the embedding data:

On Neon.tech, create a database (Neon is used because it supports vector data) and create a collection to store the data.
Add the connection string to DATABASE_URL in .env. Be sure to fill in userName and replace ******* with your password.
Create the tables with the SQL commands in database.sql:

-- Warning: this removes every existing object in the public schema.
DROP SCHEMA public CASCADE;

CREATE SCHEMA public;

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS documents (text text, n_tokens integer, file_path text, embeddings vector(1536));

CREATE INDEX ON documents USING ivfflat (embeddings vector_cosine_ops);

CREATE TABLE IF NOT EXISTS openai_ft_data (
  id SERIAL PRIMARY KEY,
  query TEXT NOT NULL,
  answer TEXT NOT NULL,
  suggested_answer TEXT,
  user_feedback BOOLEAN
);

CREATE TABLE IF NOT EXISTS usage (
  id SERIAL PRIMARY KEY,
  ip_address TEXT NOT NULL,
  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
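
To make the role of the vector_cosine_ops index more concrete, here is a minimal in-memory sketch of what a nearest-neighbour lookup on the documents table computes. pgvector's `<=>` operator returns the cosine distance (1 minus the cosine similarity). The names cosineDistance, StoredChunk and nearestChunks are hypothetical and exist only for this illustration.

```typescript
// Cosine distance as used by pgvector's <=> operator: 1 - cosine similarity.
// Assumes non-zero vectors of equal length.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical in-memory stand-in for a row of the documents table.
interface StoredChunk {
  text: string;
  embeddings: number[];
}

// Roughly what this query asks the database to do:
//   SELECT text FROM documents ORDER BY embeddings <=> $1 LIMIT n;
// The ivfflat index answers it approximately; this version is exact.
function nearestChunks(query: number[], chunks: StoredChunk[], limit: number): StoredChunk[] {
  return [...chunks]
    .sort((x, y) => cosineDistance(query, x.embeddings) - cosineDistance(query, y.embeddings))
    .slice(0, limit);
}
```

In the real table the vectors have 1536 dimensions, which matches the default output size of text-embedding-3-small.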

Link to Neon
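
A leftover placeholder in DATABASE_URL is an easy mistake to make, so a small check before connecting can help. The sketch below is one possible way to do it; the function name checkDatabaseUrl is hypothetical, the placeholder spellings follow the instructions above, and the exact content of .env.exemple may differ.

```typescript
// Returns a list of problems found in a Postgres connection string.
// The placeholder spellings ("userName", "*******") are assumptions based on
// the setup instructions, not on the actual .env.exemple file.
function checkDatabaseUrl(databaseUrl: string): string[] {
  const pattern = /^postgres(?:ql)?:\/\/([^:@\/]+):([^@]+)@([^\/]+)\/(.+)$/;
  const match = pattern.exec(databaseUrl);
  if (match === null) {
    return ["not in the form postgresql://user:password@host/database"];
  }
  const user = match[1] ?? "";
  const password = match[2] ?? "";
  const problems: string[] = [];
  if (user === "userName") {
    problems.push("userName placeholder not replaced");
  }
  if (/^\*+$/.test(password)) {
    problems.push("password placeholder not replaced");
  }
  return problems;
}
```

An empty result means the string has the expected shape and no obvious placeholder; it does not prove that the credentials are valid.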

OpenAI key:

Add your OpenAI key to .env so the API can be used to embed the data.

Link to OpenAI

Embed the data:

 npm run embedding

This command performs the following actions:

Create an array of objects containing each text and its file name, and save it to a JSON file (texts.json)
Tokenize all texts with tiktoken to count their tokens, and save the result to a JSON file (textsTokens.json)
Split texts longer than 1500 tokens. A text that must be split is split at its subtitles (h2 tags), and the result is saved to a JSON file (textsTokensSplited.json)
Embed all split texts with text-embedding-3-small from OpenAI and save the result to a JSON file (textsTokensSplitedEmbedding.json)
Save the embedding data to the database
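
The splitting rule in the steps above can be sketched as follows. This is not the code from embedding.ts: the function name splitByH2 is hypothetical, the "## " prefix used to recognise an h2 subtitle is an assumption about how headings appear in the scraped text, and the token counter is injected so that a tiktoken-based counter can be plugged in. The stand-in counter below counts whitespace-separated words, which is not the same as tiktoken tokens.

```typescript
type TokenCounter = (text: string) => number;

const MAX_TOKENS = 1500;

// Stand-in counter: whitespace-separated words, NOT real tiktoken tokens.
const roughTokenCount: TokenCounter = (text) =>
  text.split(/\s+/).filter((word) => word.length > 0).length;

// Leave short texts untouched; split long ones before each assumed h2 marker.
function splitByH2(
  text: string,
  countTokens: TokenCounter,
  maxTokens: number = MAX_TOKENS
): string[] {
  if (countTokens(text) <= maxTokens) {
    return [text];
  }
  // The lookahead keeps each "## " heading attached to the section it opens.
  return text
    .split(/\n(?=## )/)
    .map((section) => section.trim())
    .filter((section) => section.length > 0);
}

// Usage: a tiny limit forces a split in this short sample.
const sample = "intro\n## A\nbody a\n## B\nbody b";
const parts = splitByH2(sample, roughTokenCount, 3);
```

A section that is still above the limit after this split would need further handling, which this sketch does not attempt.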

The tiktoken library converts text into tokens. It is used here to count the tokens in each text and decide whether the text must be split before it can be embedded with OpenAI.

⏳ Link to npm tiktoken / Link to the tiktoken GitHub

You can uncomment the displayTokenLengthStats function to check the statistics of the tokens to be sent before running saveToDatabase. In that case, don't forget to comment out the saveToDatabase function.
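
For readers who want an idea of what such a check might report, here is a small sketch that summarises a list of per-chunk token counts. It is not the implementation of displayTokenLengthStats; the name tokenLengthStats and the choice of fields are assumptions made for this example.

```typescript
interface TokenStats {
  count: number;     // number of chunks
  total: number;     // sum of all token counts
  min: number;       // smallest chunk
  max: number;       // largest chunk
  mean: number;      // average tokens per chunk
  overLimit: number; // chunks above the given limit
}

// Summarise per-chunk token counts against a limit (1500 in this project).
function tokenLengthStats(tokenCounts: number[], limit: number): TokenStats {
  if (tokenCounts.length === 0) {
    return { count: 0, total: 0, min: 0, max: 0, mean: 0, overLimit: 0 };
  }
  const total = tokenCounts.reduce((sum, n) => sum + n, 0);
  return {
    count: tokenCounts.length,
    total,
    min: Math.min(...tokenCounts),
    max: Math.max(...tokenCounts),
    mean: total / tokenCounts.length,
    overLimit: tokenCounts.filter((n) => n > limit).length,
  };
}

// Usage: three chunks, one of them above the 1500-token limit.
const stats = tokenLengthStats([100, 1500, 2000], 1500);
```

A non-zero overLimit value is a sign that some chunks were not split enough before embedding.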

HenriTeinturier/ScrapEmbeddingNextjsDoc
