Search engine components and architecture

Markus Mandalka

Open source search engine architecture (components and modules) and processing (data integration, data analysis and data enrichment)

Architecture overview

Overview of services and main components

The relations in this chart show dependencies and connections between services and main components, which point in different directions than the data flow (see the separate flowchart of document processing and data flow).

flowchart TB


subgraph CONTAINER_UI [User interface]

  direction TB

  subgraph COMPONENT_WEBSERVER [Apache webserver]

    direction TB

  
    subgraph COMPONENT_DJANGO[Python Django]

      COMPONENT_APPS[Web apps]

      COMPONENT_DJANGO_DB[(Django DB)]
      COMPONENT_APPS ----> COMPONENT_DJANGO_DB
    end

    subgraph COMPONENT_PHP[PHP]
      COMPONENT_SEARCH_UI[Solr-PHP-UI]
    end
  
  end
end


subgraph CONTAINER_SOLR [Apache Solr]
  direction TB

  COMPONENT_SOLR[Solr Server]
  click COMPONENT_SOLR "../../solr"
  COMPONENT_SOLR --> COMPONENT_SOLR_DOCUMENT_INDEX
  COMPONENT_SOLR --> COMPONENT_SOLR_ENTITIES_INDEX

  COMPONENT_SOLR_DOCUMENT_INDEX[(Document index)]
  COMPONENT_SOLR_ENTITIES_INDEX[(Entities index)]

end

subgraph CONTAINER_ETL [Open Semantic ETL]

  direction TB

  COMPONENT_OPENSEMANTICETL_FILECRAWLER[File crawler]
  COMPONENT_OPENSEMANTICETL_FILECRAWLER --> COMPONENT_CELERY

  COMPONENT_OPENSEMANTICETL_WORKER[Open Semantic ETL worker]

  COMPONENT_OPENSEMANTICETL_PLUGINS[ETL plugins]
  COMPONENT_OPENSEMANTICETL_WORKER --> COMPONENT_OPENSEMANTICETL_PLUGINS
  

  COMPONENT_CELERY[Celery task manager]
  click COMPONENT_CELERY "../admin/queue/"

  COMPONENT_OPENSEMANTICETL_WORKER --> COMPONENT_CELERY

end


subgraph CONTAINER_QUEUE [RabbitMQ]

  direction TB

  COMPONENT_RABBITMQ[RabbitMQ]
  click COMPONENT_RABBITMQ "../admin/queue/"
  COMPONENT_RABBITMQ --> COMPONENT_RABBITMQ_DATA

  COMPONENT_RABBITMQ_DATA[(Task queue)]
  click COMPONENT_RABBITMQ_DATA "../admin/queue/"

end


subgraph CONTAINER_TIKA [Apache Tika]

  direction TB

  COMPONENT_TIKA_SERVER[Tika Server]
  click COMPONENT_TIKA_SERVER "https://github.com/opensemanticsearch/tika-server.deb"
  
  COMPONENT_TIKA_SERVER --> COMPONENT_OCR_CACHE
  
  COMPONENT_OCR_CACHE[Tesseract OCR Cache]
  COMPONENT_OCR_CACHE --> COMPONENT_OCR
  COMPONENT_OCR_CACHE_DATA[(OCR cache)]
  COMPONENT_OCR_CACHE ----> COMPONENT_OCR_CACHE_DATA
  
  COMPONENT_OCR[Tesseract]
  click COMPONENT_OCR "https://github.com/opensemanticsearch/tesseract-ocr-cache"


end

subgraph CONTAINER_NEO4J [Neo4j]
  direction TB

  COMPONENT_NEO4J[Neo4J]
  click COMPONENT_NEO4J "https://github.com/opensemanticsearch/open-semantic-etl/blob/master/src/opensemanticetl/export_neo4j.py"
  COMPONENT_NEO4J --> COMPONENT_NEO4J_DATA
  COMPONENT_NEO4J_DATA[(Graph Database)]
  
end

subgraph CONTAINER_NER [SpaCy NLP]

  direction TB
  
  COMPONENT_NER[spacy-services]

  COMPONENT_NER_MODELS[(ML models)]
  COMPONENT_NER --> COMPONENT_NER_MODELS

end

  COMPONENT_EL[Open Semantic Entity Search API]
  click COMPONENT_EL "https://github.com/opensemanticsearch/open-semantic-entity-search-api"

COMPONENT_OPENSEMANTICETL_PLUGINS -->|Get tags and annotations| COMPONENT_APPS
COMPONENT_OPENSEMANTICETL_PLUGINS -->|Entity extraction by thesaurus and ontologies| COMPONENT_EL
COMPONENT_OPENSEMANTICETL_PLUGINS ---->|Metadata and text extraction| COMPONENT_TIKA_SERVER
COMPONENT_OPENSEMANTICETL_PLUGINS ---->|Named entity recognition| COMPONENT_NER
COMPONENT_OPENSEMANTICETL_PLUGINS ------>|Index data| COMPONENT_SOLR
COMPONENT_CELERY ------>|Read and write task queue| COMPONENT_RABBITMQ
COMPONENT_OPENSEMANTICETL_PLUGINS ------>|Index data| COMPONENT_NEO4J



COMPONENT_EL -->|Extract entities in entities index from full text| COMPONENT_SOLR
COMPONENT_SEARCH_UI -->|Search queries| COMPONENT_SOLR
COMPONENT_APPS -->|Read search queries| COMPONENT_SOLR
COMPONENT_APPS -->|Write entities managed by thesaurus or ontologies| COMPONENT_SOLR
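
The dependency arrows in the chart above can also be written down as a plain adjacency mapping. The following is a minimal sketch, not part of the project: the short keys are illustrative labels chosen here (not real package or unit names), and using the mapping to derive a start order is only one possible use of it. It relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Each component maps to the components it depends on, read off the chart.
# The keys are hypothetical short labels, not real service or package names.
DEPENDS_ON = {
    "etl_worker": {"celery", "etl_plugins"},
    "file_crawler": {"celery"},
    "celery": {"rabbitmq"},
    "etl_plugins": {"web_apps", "entity_search_api", "tika",
                    "spacy_services", "solr", "neo4j"},
    "entity_search_api": {"solr"},
    "web_apps": {"solr", "django_db"},
    "search_ui": {"solr"},
    "tika": {"ocr_cache"},
    "ocr_cache": {"tesseract"},
}


def start_order(graph):
    """Return an order in which every component follows its dependencies."""
    return list(TopologicalSorter(graph).static_order())


order = start_order(DEPENDS_ON)
print(order)
```

Components without dependencies, such as the Solr server or RabbitMQ, come first in the result, and the ETL worker comes after everything it connects to.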

Flowchart of document processing and data flow

flowchart TD

FILEMONITORING[Filesystem monitoring]
click FILEMONITORING "../../trigger/filemonitoring/"

FILEMONITORING-->|Immediately add task for new or changed file| CELERY

SCHEDULER[Cron scheduler]
click SCHEDULER "https://github.com/opensemanticsearch/open-semantic-search-apps/blob/master/etc/cron.d/open-semantic-search"

SCHEDULER -->|Regularly start crawler| FILECRAWLER

FILECRAWLER[File directory crawler]
click FILECRAWLER "https://github.com/opensemanticsearch/open-semantic-etl/blob/master/src/opensemanticetl/etl_filedirectory.py"

FILECRAWLER -->|Add task for each new or changed file in crawled directory| CELERY

CELERY[Celery task manager]
click CELERY "../admin/queue/"

CELERY -->|Parallel processing of files by multiple ETL workers| ETL_WORKER
CELERY --> RABBITMQ

RABBITMQ[(RabbitMQ task queue)]
click RABBITMQ "../admin/queue/"

RABBITMQ --> CELERY

ETL_WORKER[Open Semantic ETL worker]
click ETL_WORKER "https://github.com/opensemanticsearch/open-semantic-etl/blob/master/src/opensemanticetl/tasks.py"

ETL_WORKER -->|Running configured plugins one by one| TIKA

subgraph TIKA [Apache Tika for text extraction and metadata extraction]
  direction LR

  TIKA_PLUGIN[ETL plugin calling Tika]
  click TIKA_PLUGIN "https://github.com/opensemanticsearch/open-semantic-etl/blob/master/src/opensemanticetl/enhance_extract_text_tika_server.py"
  
  TIKA_PLUGIN -->|Document file| TIKA_SERVER

  TIKA_SERVER[Apache Tika Server]
  click TIKA_SERVER "https://github.com/opensemanticsearch/tika-server.deb"
  
  TIKA_SERVER -->|Image files or images in PDF|OCR
  TIKA_SERVER -->|Extracted text| TIKA_PLUGIN
  
  OCR[Tesseract OCR]
  click OCR "https://github.com/opensemanticsearch/tesseract-ocr-cache"

  OCR-->|Recognized plain text| TIKA_SERVER

end

TIKA -->|Extracted text and metadata| EntitySearchAPI

subgraph EntitySearchAPI [Named Entity Extraction by lists of names, thesaurus and ontologies]
  direction LR
  
  EL_PLUGIN[ETL plugin for entity extraction]
  click EL_PLUGIN "https://github.com/opensemanticsearch/open-semantic-etl/blob/master/src/opensemanticetl/enhance_entity_linking.py"
  
  EL_PLUGIN -->|Plain text| EL
  
  EL[Open Semantic Entity Search API]
  click EL "https://github.com/opensemanticsearch/open-semantic-entity-search-api"

  EL -->|Extracted entities| EL_PLUGIN

  THESAURUS[(Thesaurus)]
  click THESAURUS "https://github.com/opensemanticsearch/open-semantic-search-apps/blob/master/src/thesaurus/models.py"
  
  THESAURUS -->|SKOS| EL

  ONTOLOGIES[(Ontologies)]
  click ONTOLOGIES "https://github.com/opensemanticsearch/open-semantic-search-apps/blob/master/src/ontologies/models.py"
  
  ONTOLOGIES -->|RDF| EL
end

EntitySearchAPI -->|Added extracted named entities by lists of names, thesaurus and ontologies| NER

NER[ETL plugin for spaCy Named Entity Recognition by Machine Learning]
click NER "https://github.com/opensemanticsearch/open-semantic-etl/blob/master/src/opensemanticetl/enhance_ner_spacy.py"

NER -->|Added recognized named entities| ANNOTATIONS

subgraph ANNOTATIONS [Get tags and annotations made by humans for this document]
  direction RL

  ANNOTATIONS_DB[(DB with tags and annotations)]
  click ANNOTATIONS_DB "https://github.com/opensemanticsearch/open-semantic-search-apps/blob/master/src/annotate/models.py"
  
  ANNOTATIONS_DB --> ANNOTATIONS_PLUGIN

  ANNOTATIONS_PLUGIN[ETL enrichment plugin getting tags and annotations]
  click ANNOTATIONS_PLUGIN "https://github.com/opensemanticsearch/open-semantic-etl/blob/master/src/opensemanticetl/enhance_annotations.py"
end

ANNOTATIONS -->|Added tags and annotations| ANALYSIS_PLUGIN

ANALYSIS_PLUGIN[ETL data analysis plugin, e.g. extraction of amounts of money]
ANALYSIS_PLUGIN -->|Added extracted amounts of money| OTHER_PLUGINS

OTHER_PLUGINS[Other configured ETL Plugins]
OTHER_PLUGINS -->|Plain text and structured data| EXPORTER

EXPORTER[Exporter plugins]
EXPORTER -->|Index data for full text search and faceting| SOLR
EXPORTER -->|Index data for full text search and faceting| ELASTICSEARCH
EXPORTER -->|Index linked data for graph search| NEO4J

SOLR[(Apache Solr document index)]
click SOLR "../../solr"

SOLR -->|Search results| UI

UI[Web user interface for search]
UI -->|Solr search query| SOLR

ELASTICSEARCH[(Alternate Elastic Search)]
click ELASTICSEARCH "https://github.com/opensemanticsearch/open-semantic-etl/blob/master/src/opensemanticetl/export_elasticsearch.py"

NEO4J[(Neo4J Graph Database)]
click NEO4J "https://github.com/opensemanticsearch/open-semantic-etl/blob/master/src/opensemanticetl/export_neo4j.py"
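
The first steps of this flowchart (the crawler adds a task for each new or changed file, and several workers process the tasks in parallel) can be sketched with the Python standard library. This is only an illustration under stated assumptions: `queue.Queue` and threads stand in for RabbitMQ, Celery and the ETL workers, the word count stands in for the plugin chain, and the names `crawl`, `worker` and `run` are hypothetical and not taken from Open Semantic ETL.

```python
import os
import queue
import tempfile
import threading


def crawl(directory, last_indexed):
    """Yield paths that are new or whose modification time has changed."""
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            path = os.path.join(root, name)
            mtime = os.path.getmtime(path)
            if last_indexed.get(path) != mtime:
                last_indexed[path] = mtime
                yield path


def worker(tasks, results, lock):
    """Take file paths from the queue until a None sentinel arrives."""
    while True:
        path = tasks.get()
        if path is None:
            break
        with open(path, encoding="utf-8") as handle:
            text = handle.read()
        with lock:
            results[path] = len(text.split())  # stand-in for the plugin chain


def run(directory, last_indexed, n_workers=2):
    """Crawl the directory and let n_workers threads process the tasks."""
    tasks, results, lock = queue.Queue(), {}, threading.Lock()
    threads = [threading.Thread(target=worker, args=(tasks, results, lock))
               for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for path in crawl(directory, last_indexed):
        tasks.put(path)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        with open(os.path.join(directory, "a.txt"), "w", encoding="utf-8") as handle:
            handle.write("hello search engine")
        state = {}
        print(run(directory, state))  # first crawl: one task
        print(run(directory, state))  # unchanged file: no new task
```

Keeping the modification times in `last_indexed` is what lets a repeated crawl skip unchanged files; a real deployment would persist that state instead of holding it in a dictionary.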

Components and Modules

User Interface: Client and user interface
- Search query forms: Search query form for full text search
- Explorer and navigator: Run full text searches and navigate the index or the search results with interactive filters (facets) for exploratory search
  - Viewers: Parts of the UI to show different views (e.g. analytics like word clouds or trend charts) and previews for special formats (e.g. photos, documents, email ...)
  - Code: /solr-php-ui/templates/
- Annotators: Web apps for tagging documents, or a CMS with forms and fields, to manage metadata like tags or annotations
- Search Apps: Applications and user interfaces for search, such as the search with lists tool or the named entities manager
Index and search server (Solr or Elastic Search): Search server managing the index (indexer) and running search queries (query handler)
- Datamodel/Schema: src/solr.deb/var/solr/data/opensemanticsearch/conf/managed-schema
- Storage: /var/solr/data
- Log: /var/solr/logs/
Open Semantic ETL: Framework for data integration, data analysis, data enrichment and ETL (Extract, transform, load) pipelines or chains
- Connectors, importers, ingestors or crawlers: Import data from a data source (e.g. file system, file directory, file share, website or newsfeed)
- Parsers: Apache Tika to extract text and metadata from different file formats and document formats
- Entity extraction and entity linking: Open Semantic Entity Search API
- Data enrichment plugins and enhancers: Enhance content with additional data like metadata (e.g. tagging or annotations) or analytics (e.g. OCR)
- ETL Exporter or Loader for Solr or Elastic Search: Indexing the data to search index
Trigger: Your CMS or your file system (file system monitoring) notifies the web service (API) when there is new data or when content has changed, so you don't have to burn resources on frequent recrawls to find new or changed content quickly
Web services (REST-API): Available via the standard network protocol HTTP, waiting until you (e.g. using the web admin interface) or another service (e.g. using the REST-API) requests an action like crawling a directory or a web page, and then starting that action
Queue manager (Celery on RabbitMQ): Manages the task queue and starts text extraction, analysis, data enrichment and indexing jobs with the right balance of parallel workers
Scheduler: Manages the start of scheduled indexing jobs. This can be a crontab for Cron starting the command line tools. Config: /etc/cron.d/open-semantic-search
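
To make the entity extraction component more concrete, here is a minimal sketch of extraction by lists of names. It is a stand-in, not the project's method: according to the architecture chart, the Open Semantic Entity Search API works against the entities index in Solr, whereas this example swaps in a plain regular expression. The thesaurus content and the function names `build_matcher` and `extract_entities` are hypothetical.

```python
import re


def build_matcher(labels_by_concept):
    """Compile one case-insensitive pattern from all labels, longest first."""
    label_to_concept = {}
    for concept, labels in labels_by_concept.items():
        for label in labels:
            label_to_concept[label.lower()] = concept
    ordered = sorted(label_to_concept, key=len, reverse=True)
    alternatives = "|".join(re.escape(label) for label in ordered)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return pattern, label_to_concept


def extract_entities(text, matcher):
    """Return the preferred names of all concepts mentioned in the text."""
    pattern, label_to_concept = matcher
    found = []
    for match in pattern.finditer(text):
        concept = label_to_concept[match.group(0).lower()]
        if concept not in found:
            found.append(concept)
    return found


# Hypothetical thesaurus: preferred label -> all labels, including aliases
THESAURUS = {
    "Apache Solr": ["Apache Solr", "Solr"],
    "Apache Tika": ["Apache Tika", "Tika"],
}

matcher = build_matcher(THESAURUS)
print(extract_entities("Tika extracts text, then solr indexes it.", matcher))
# → ['Apache Tika', 'Apache Solr']
```

Mapping every alias back to one preferred label is what makes the extracted values usable as facets: documents mentioning "Solr" and documents mentioning "Apache Solr" end up under the same entity.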

Document processing, extract, transform, load (ETL) and enhancing by data enrichment and data analysis

How (new) data is handled by these components through ETL (extract, transform, load), document processing, data analysis and data enrichment:

A user manually, or a Cron daemon automatically from time to time, starts a command
The command line tool or the web API receiving this command starts an ETL (extract, transform, load), data analysis and data enrichment chain to import, analyze and index the data
An input plugin or connector (e.g. the connector for the file system or the connector for a website) reads from its data source
The connector, an Apache Tika parser, or a file-format-specific data converter or extractor extracts data from the given document or file format
The ETL framework calls all configured enhancer plugins for data enrichment to add further analysis of the data or annotations to this data from a CMS
The output storage plugin or indexer indexes the text and metadata to the Solr index or to the Elastic Search index, so all other tools can search this data
The user searches with a user interface like the search user interface, the search apps or other tools that build on the search API of this index
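
The steps above can be sketched as a chain of plugins that each receive and return the document data. This is a simplified illustration, not the Open Semantic ETL code: the class names, the `process` signature, the field names `id`, `raw`, `content` and `word_count`, and the in-memory dictionary that stands in for the Solr or Elastic Search index are all assumptions made for this example.

```python
class ExtractText:
    """Extract step: turn raw bytes into plain text (stand-in for Tika)."""

    def process(self, parameters, data):
        data["content"] = parameters["raw"].decode("utf-8", errors="replace")
        return parameters, data


class CountWords:
    """Enrichment step: add a simple analysis result to the document."""

    def process(self, parameters, data):
        data["word_count"] = len(data.get("content", "").split())
        return parameters, data


class ExportToMemoryIndex:
    """Load step: store the document in a dict instead of a search server."""

    def __init__(self, index):
        self.index = index

    def process(self, parameters, data):
        self.index[parameters["id"]] = dict(data)
        return parameters, data


def run_chain(plugins, parameters):
    """Call the configured plugins one by one on the same document."""
    data = {}
    for plugin in plugins:
        parameters, data = plugin.process(parameters, data)
    return data


def search(index, term):
    """Return the ids of all documents whose text contains the term."""
    term = term.lower()
    return sorted(doc_id for doc_id, doc in index.items()
                  if term in doc.get("content", "").lower())


index = {}
chain = [ExtractText(), CountWords(), ExportToMemoryIndex(index)]
run_chain(chain, {"id": "file:///tmp/a.txt", "raw": b"Hello search engine"})
print(search(index, "engine"))  # → ['file:///tmp/a.txt']
```

Because every plugin has the same interface, the order and selection of plugins can be changed through configuration without touching the chain runner.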

Services and Microservices

Linux services:

- tika: Text extraction and OCR
- tika-fake-ocr: Text extraction without OCR
- solr: Search index
- spacy-services: spaCy NLP
- opensemanticetl: ETL workers
- rabbitmq-server: Task queue
- flower: Task queue monitoring user interface
- apache2: Search UI, search apps (e.g. thesaurus app or config UI) and Entity Search API

User Interface and search applications

Solr-PHP-UI

User Interface (supports responsive design for mobiles and tablets) for search, faceted search, preview, different views and visualizations.

Based on the Solr client solr-php-client (pure vanilla PHP), standard user interface technologies (HTML5 and CSS with Zurb Foundation) and visualization libraries (D3.js), so you can install and run it on standard PHP webspace without effort and without special PHP modules that are often not available.
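
For readers who want to see what a faceted search request to Solr looks like, the following sketch builds such a query URL. It is written in Python for illustration and is not the code of Solr-PHP-UI. `q`, `rows`, `wt`, `fq`, `facet` and `facet.field` are standard Solr request parameters; the base URL (Solr's default port 8983 and a core named after the schema path `opensemanticsearch`) and the field name `author_ss` are assumptions, and `solr_select_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def solr_select_url(base_url, query, facet_fields=(), filters=(), rows=10):
    """Build a Solr /select URL with optional facets and filter queries."""
    params = [("q", query), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    # Each active facet selection in the UI becomes one filter query (fq)
    params.extend(("fq", filter_query) for filter_query in filters)
    return base_url.rstrip("/") + "/select?" + urlencode(params)


url = solr_select_url(
    "http://localhost:8983/solr/opensemanticsearch",  # assumed host and core
    "invoice",
    facet_fields=["author_ss"],          # hypothetical facet field
    filters=['author_ss:"Jane Doe"'],    # hypothetical selected facet value
)
print(url)
print(parse_qs(urlsplit(url).query)["fq"])
```

Sending the user's facet selections as `fq` parameters instead of appending them to `q` keeps the full text query separate from the filters.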
