
Querying While Indexing in the Wikipedia Search Example

About this example:
Learnings	How to configure Jina for querying while indexing
Used for indexing	Text data
Used for querying	Text data
Dataset used	Wikipedia dataset from Kaggle
Model used	flair-text

This is an example of using Jina to support both querying and indexing simultaneously in our Wikipedia sentence search example.

Prerequisites

Run and understand our Wikipedia sentence search example

What is querying while indexing?

Querying while indexing means you can still query your data while new data is being inserted (or existing data is being updated or deleted). Jina achieves this with its dump-reload feature.

Configuration changes

This feature requires splitting the Flow in two, one for indexing (plus updates and deletes) and one for querying, and running both at the same time. You also need to replace the indexers in both Flows: the Index Flow (also referred to as the Storage Flow) requires a Storage Indexer, while the Query Flow requires a Compound Searcher.

In our case, we use:

LMDBStorage, which uses LMDB, a disk-based key-value store, as its storage engine.
FaissLMDBSearcher, which uses the Faiss library for fast vector search and LMDB to retrieve the metadata.

🐍 Build the app with Python

These instructions explain how to run the example yourself and deploy it with Python.

🗝️ Requirements

Have a working Python 3.7 or 3.8 environment.
We recommend creating a new Python virtual environment to have a clean installation of Jina and prevent dependency conflicts.
Install Docker Engine.
Have at least 5 GB of free space on your hard drive.

Running the example

👾 Step 1. Clone the repo and install Jina

Begin by cloning the repo so you can get the required files and datasets. (If you already have the examples repository on your machine, make sure to fetch the most recent version.)

git clone https://github.com/jina-ai/examples
cd examples/wikipedia-sentences-query-while-indexing

Let's install Jina and the other required libraries. For further information on installing Jina, check out our documentation.

pip install -r requirements.txt

To run the example, complete the following steps:

📥 Step 2. Download your data to search (Optional)

The repo includes a small subset of the Wikipedia dataset, for quick testing. You can just use that.

If you want to use the entire dataset, run bash get_data.sh and then modify the DATA_FILE constant (in app.py) to point to that file.

🏃 Step 3. Running the Flows

In this example, we use JinaD to serve the two Flows (Index and Query) and listen to incoming requests.

Start the JinaD server with the following command:

docker run --add-host host.docker.internal:host-gateway \
        -v /var/run/docker.sock:/var/run/docker.sock \
        -v /tmp/jinad:/tmp/jinad \
        -p 8000:8000 \
        --name jinad \
        -d jinaai/jina:2.1.0-daemon
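
Before running app.py, it can help to confirm that something is listening on port 8000, the port published by `-p 8000:8000` above. The following is a minimal sketch that uses only the Python standard library; `wait_for_port` is a hypothetical helper and not part of Jina or JinaD. An open port only shows that a listener exists, not that JinaD has finished starting.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing accepts connections
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For example, `wait_for_port("localhost", 8000)` returns True as soon as the published port accepts a connection.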

Run python app.py -t flows

This creates the two Flows and then repeats the following cycle every 10 seconds (the same requests can also be sent from any other REST client):
1. Index 5 Documents.
2. Send a DUMP request to the Storage (Index) Flow to dump its data to a specific location.
3. Send a ROLLING_UPDATE request to the Query Flow to take down its Indexers and start them again with the new data located at that path.
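
The cycle above can be sketched in plain Python to show why queries keep working between dumps. This is a simulation only: `StorageFlow`, `QueryFlow`, and `Replica` are hypothetical stand-ins that keep data in memory, and none of them is Jina's actual API. A real dump writes to disk, and app.py sends the requests to JinaD.

```python
from typing import Dict, List


class StorageFlow:
    """Stand-in for the Storage (Index) Flow: accepts writes and can dump a snapshot."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def index(self, docs: Dict[str, str]) -> None:
        self._docs.update(docs)

    def dump(self) -> Dict[str, str]:
        # A real dump writes to a path on disk; here the snapshot is an in-memory copy.
        return dict(self._docs)


class Replica:
    """Stand-in for one replica of the Query Flow's indexer."""

    def __init__(self) -> None:
        self.snapshot: Dict[str, str] = {}
        self.online = True


class QueryFlow:
    """Stand-in for the Query Flow: serves reads from whichever replica is online."""

    def __init__(self, replicas: int = 2) -> None:
        self.replicas = [Replica() for _ in range(replicas)]

    def rolling_update(self, snapshot: Dict[str, str]) -> None:
        # Reload one replica at a time so the others can keep serving queries.
        for replica in self.replicas:
            replica.online = False
            replica.snapshot = snapshot
            replica.online = True

    def search(self, word: str) -> List[str]:
        online = [r for r in self.replicas if r.online]
        if not online:
            raise RuntimeError("no replica is online")
        docs = online[0].snapshot
        return sorted(doc_id for doc_id, text in docs.items() if word in text)


storage = StorageFlow()
query = QueryFlow(replicas=2)

storage.index({"0": "Jina is a neural search framework"})
print(query.search("Jina"))  # [] because nothing has been dumped yet

query.rolling_update(storage.dump())
print(query.search("Jina"))  # ['0']
```

Newly indexed Documents become searchable only after the next dump and rolling update, which is why the example repeats the cycle on a timer.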

🔎 Step 4: Query your data

Finally, in a second terminal, run python app.py -t client

This will prompt you for a query, send the query to the Query Flow, and then show you the results.

Because the Flows use the HTTP protocol, you can also query the REST API with the Client provided by Jina, or with cURL, Postman, or the Swagger UI that ships with Jina.
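
As a rough illustration of a custom REST client, the sketch below builds a search request with the Python standard library. The `/search` path and the `{"data": [{"text": ...}]}` payload are assumptions about how the Query Flow's HTTP gateway is exposed, and `build_search_request` is a hypothetical helper; check app.py for the request the example actually sends. The host and port are left as parameters because they depend on your setup.

```python
import json
from urllib import request


def build_search_request(query: str, host: str, port: int) -> request.Request:
    """Build (but do not send) a POST request carrying one text query."""
    # Assumed payload shape: a list of Documents, each with a `text` field.
    payload = {"data": [{"text": query}]}
    return request.Request(
        url=f"http://{host}:{port}/search",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the returned object to `request.urlopen(...)` while the Query Flow is running, and decode the JSON response body.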

Cleanup

JinaD creates several containers during this process. To remove them once you are done with the example, run:

docker stop $(docker ps -a -q)
docker rm $(docker ps -a -q)

Note that these commands stop and remove every container on the machine, not only the ones JinaD created.

Flow diagrams

Below you can see a graphical representation of the Flow pipeline:

Storage Flow

Query Flow

Notice the following:

The encoder has the same configuration in both Flows.
The Query Flow uses replicas. One replica continues to serve requests while the other is being reloaded.
The Indexer in the Query Flow is actually made up of two Indexers: one for vectors and one for Document metadata. In the Storage Flow, this data is stored in a single Storage Indexer.
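
The split between a vector Indexer and a metadata Indexer can be sketched as follows. This is an illustration under stated assumptions, not the FaissLMDBSearcher implementation: a brute-force cosine search stands in for Faiss, an in-memory dict stands in for LMDB, and all class names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VectorSearcher:
    """Stand-in for the vector Indexer (Faiss in the example); brute force here."""

    def __init__(self, vectors: Dict[str, Vector]) -> None:
        self._vectors = vectors

    def search(self, query: Vector, top_k: int) -> List[Tuple[str, float]]:
        scored = [(doc_id, cosine(query, vec)) for doc_id, vec in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


class MetadataStore:
    """Stand-in for the key-value Indexer (LMDB in the example)."""

    def __init__(self, metadata: Dict[str, str]) -> None:
        self._metadata = metadata

    def get(self, doc_id: str) -> str:
        return self._metadata[doc_id]


class CompoundSearcher:
    """Loads one dump into two parts: vectors for matching, metadata for display."""

    def __init__(self, dump: Dict[str, Tuple[Vector, str]]) -> None:
        self.vectors = VectorSearcher({k: vec for k, (vec, _) in dump.items()})
        self.metadata = MetadataStore({k: text for k, (_, text) in dump.items()})

    def search(self, query: Vector, top_k: int = 1) -> List[Tuple[str, float]]:
        # Step 1: find the nearest ids by vector. Step 2: look up their metadata.
        hits = self.vectors.search(query, top_k)
        return [(self.metadata.get(doc_id), score) for doc_id, score in hits]


# The storage side keeps vector and text together under one key.
dump = {
    "a": ([1.0, 0.0], "Paris is the capital of France"),
    "b": ([0.0, 1.0], "Python is a programming language"),
}
searcher = CompoundSearcher(dump)
print(searcher.search([0.9, 0.1], top_k=1)[0][0])  # Paris is the capital of France
```

Keeping the two parts separate lets the vector index hold only what it needs for matching, while the full Document content is fetched by id after the nearest neighbours are known.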

🔮 Overview of the files

File or folder	Contents
📂 `data/`	Folder where the data files are stored
📂 `flows/`	Folder to store Flow configuration
--- 📃 `storage.yml`	YAML file to configure Storage (Index) Flow
--- 📃 `query.yml`	YAML file to configure Querying Flow
🐍 `app.py`	Code file for the example

⏭️ Next steps

Did you like this example and are you interested in building your own? For a detailed tutorial on how to build your Jina app, check out the How to Build Your First Jina App guide in our documentation.

If you have any issues following this guide, you can always get support from our Slack community.

👩‍👩‍👧‍👦 Community

Slack channel - a communication platform for developers to discuss Jina.
LinkedIn - get to know Jina AI as a company and find job opportunities.
Twitter - follow us and interact with us using hashtag #JinaSearch.
Company - know more about our company. We are fully committed to open-source!

🦄 License

Jina is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.
