🔍 Ambar: Document Search Engine - EC2 Version

A fork of the PascalHonegger/ambar fork, modified to run on EC2.

See the EC2 Directory for instructions

Note: The original .env file has been moved to .env.local, and .env has been added to the .gitignore file. This is because the .env file is now used for production and may contain sensitive values. To run locally, use the command docker-compose --env-file .env.local up --build
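
The local run described in the note can be wrapped in a small helper that checks for the env file before starting. This is only a minimal sketch, assuming it is run from the project root; the function name start_local and the COMPOSE override variable are illustrative and not part of the repository.

```shell
# Sketch: start Ambar locally with the development env file.
# start_local and COMPOSE are hypothetical names, not part of the repository.
start_local() {
  local env_file="${1:-.env.local}"

  # Refuse to start if the env file is missing (e.g. wrong working directory).
  if [ ! -f "$env_file" ]; then
    echo "Missing $env_file; run this from the project root." >&2
    return 1
  fi

  # COMPOSE defaults to the command from the note; it is left unquoted on
  # purpose so that a two-word value such as "docker compose" also works.
  ${COMPOSE:-docker-compose} --env-file "$env_file" up --build
}
```

Calling start_local with no arguments uses .env.local; passing a path selects a different env file.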

Ambar is an open-source document search engine with automated crawling, OCR, tagging and instant full-text search.

Ambar defines a new way to implement full-text document search into your workflow.

Easily deploy Ambar with a single docker-compose file
Perform Google-like search through your documents and contents of your images
Tag your documents
Use a simple REST API to integrate Ambar into your workflow

Features

Search

Tutorial: Mastering Ambar Search Queries

Fuzzy Search (John~3)
Phrase Search ("John Smith")
Search By Author (author:John)
Search By File Path (filename:*.txt)
Search By Date (when: yesterday, today, lastweek, etc)
Search By Size (size>1M)
Search By Tags (tags:ocr)
Search As You Type
Supported language analyzers: English ambar_en, Russian ambar_ru, German ambar_de, Italian ambar_it, Polish ambar_pl, Chinese ambar_cn, CJK ambar_cjk

Crawling

Ambar only supports local file system crawling. If you need to crawl an SMB share or an FTP location, just mount it using standard Linux tools. Crawling is automatic; no schedule is needed because the crawlers monitor file system events and automatically process new, changed and removed files.
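
As an illustration of mounting a remote location with standard Linux tools, the sketch below mounts an SMB share read-only using mount -t cifs. The share name, mount point, credentials file, function name and the ELEVATE variable are all hypothetical examples, and the host is assumed to have the cifs-utils package installed.

```shell
# Sketch: mount an SMB share read-only at a local path the crawler can watch.
# The share, mount point and credentials file are hypothetical examples.
# Requires the cifs-utils package (mount.cifs) on the host.
mount_smb_readonly() {
  local share="$1" target="$2" creds="$3"

  # ELEVATE defaults to sudo because mounting normally needs root.
  ${ELEVATE:-sudo} mkdir -p "$target"
  ${ELEVATE:-sudo} mount -t cifs "$share" "$target" -o "ro,credentials=$creds"
}

# Example call (commented out so sourcing this file has no side effects):
# mount_smb_readonly //fileserver/docs /mnt/ambar-docs /etc/ambar-smb.cred
```

How the mounted directory is then handed to the crawler depends on the volume configuration in docker-compose.yml, which is not shown here.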

Content Extraction

Ambar supports large files (>30MB)

Supported file types:

ZIP archives
Mail archives (PST)
MS Office documents (Word, Excel, PowerPoint, Visio, Publisher)
OCR over images
Email messages with attachments
Adobe PDF (with OCR)
OCR languages: Eng, Deu, Fra, Por
OpenOffice documents
RTF, Plaintext
HTML / XHTML
Multithreaded processing

Build & Run

Notice: Ambar requires Docker to run

If you want to see how Ambar works without installing it, try our live demo. No signup required.

All the images required to run Ambar can be built locally. In general, each image can be built by navigating into the directory of the component in question, performing the required compilation steps, and building the image. To build all images and start Ambar in one step, run:

# From project root
docker compose up --build
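
When only one component has changed, a per-service variant of the command above may be quicker than rebuilding everything. The following is a sketch, assuming Docker Compose v2 and that it is run from the project root; the function names, the COMPOSE variable and the service name webapi are placeholders, and the real service names are whatever docker-compose.yml defines.

```shell
# Sketch: rebuild and restart one service instead of the whole stack.
# Service names come from docker-compose.yml; "webapi" below is a placeholder.
list_services() {
  ${COMPOSE:-docker compose} config --services
}

rebuild_service() {
  local service="$1"
  ${COMPOSE:-docker compose} build "$service" &&
    ${COMPOSE:-docker compose} up -d "$service"
}

# Example (commented out so sourcing this file has no side effects):
# list_services
# rebuild_service webapi
```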

Architecture

Hint: Run plantuml to generate the updated PNG (or an online tool like PlantText).

FAQ

Is it open-source?

Yes, it's fully open-source.

Is it free?

Yes, it is forever free and open-source.

Does it perform OCR?

Yes, it performs OCR on images (jpg, tiff, bmp, etc.) and PDFs. OCR is performed by the well-known open-source library Tesseract. We tuned it to achieve the best performance and quality on scanned documents. You can easily find all files on which OCR was performed with the tags:ocr query.

Which languages are supported for OCR?

Supported languages: Eng, Rus, Ita, Deu, Fra, Spa, Pl, Nld. See this commit for an example of how to add new languages.

Does it support tagging?

Yes!

What about searching in PDF?

Yes, it can search through any PDF, even badly encoded ones or those with scans inside. We did our best to make search over any kind of PDF document smooth.

What is the maximum file size it can handle?

It's limited by the amount of RAM on your machine; typically it's 500MB. It's an awesome result, as typical document management systems limit processed files to 30MB.

Privacy Policy

License

MIT License
