Elasticlunr

Description

Elasticlunr is a small, full-text search library for use in the Elixir environment. It indexes JSON documents and provides a friendly search interface to retrieve documents.

The library is built for web applications that do not need the deployment complexity of popular search engines, while taking advantage of the capabilities of the BEAM.

Consider what is gained when the search functionality of your application lives in the same environment (the BEAM VM) as your business logic: searches resolve faster, and there are fewer services (Elasticsearch, Solr, and so on) to monitor.

Getting Started

Mix.install([
  {:kino, "~> 0.4"},
  {:elasticlunr, "~> 0.6"}
])

What's an Index?

An index is a collection of structured data that is referred to when looking for results that are relevant to a specific query.

An index can be likened to a table in an RDBMS: you can store, update, delete, and search documents in it. The difference is that an index has a pipeline that every JSON document passes through before it becomes searchable.

alias Elasticlunr.{Index, Pipeline}

# the library comes with a default set of pipeline functions
pipeline = Pipeline.new(Pipeline.default_runners())

index = Index.new(pipeline: pipeline)

The above code block creates a new index with a pipeline of default functions that work with the English language.
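
To make the idea of a pipeline concrete, here is a toy sketch in plain Elixir. It is not Elasticlunr's implementation: the module name ToyPipeline, its functions, and the short stop-word list are all hypothetical. It only illustrates how text might pass through a series of functions before it is indexed.

```elixir
# Toy illustration only; NOT how Elasticlunr implements its pipeline.
defmodule ToyPipeline do
  # A tiny, hypothetical stop-word list for English.
  @stop_words ~w(a an and the of to in is)

  # Split text into lowercase word tokens.
  def tokenize(text) do
    text
    |> String.downcase()
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
  end

  # Drop very common words that carry little meaning for search.
  def remove_stop_words(tokens), do: Enum.reject(tokens, &(&1 in @stop_words))

  # Run every stage in order, feeding each stage's output to the next.
  def run(text, stages \\ [&tokenize/1, &remove_stop_words/1]) do
    Enum.reduce(stages, text, fn stage, acc -> stage.(acc) end)
  end
end

ToyPipeline.run("Saving and Restoring LiveView State using the Browser")
# => ["saving", "restoring", "liveview", "state", "using", "browser"]
```

The default runners shipped with the library play a similar role, although the exact steps they perform may differ from this sketch.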

The new index does not define the expected structure of the JSON documents to be indexed. To fix this, let's assume we are building an index of blog posts, and each post consists of the author, content, category, and title attributes.

index =
  index
  |> Index.add_field("title")
  |> Index.add_field("author")
  |> Index.add_field("content")
  |> Index.add_field("category")

Indexing Documents

Continuing the blog example above, we make the posts searchable by adding them to the index, where they are analyzed and transformed appropriately.

documents = [
  %{
    "id" => 1,
    "author" => "Mark Ericksen",
    "title" => "Saving and Restoring LiveView State using the Browser",
    "category" => "elixir liveview browser",
    "content" =>
      "There are multiple ways to save and restore state for your LiveView processes. You can use an external cache like Redis, your database, or even the browser itself. Sometimes there are situations where you either can’t or don’t want to store the state on the server. In situations like that, you do have the option of storing the state in the user’s browser. This post explains how you use the browser to store state and how your LiveView process can get it back later. We’ll go through the code so you can add something similar to your own project. We cover what data to store, how to do it securely, and restoring the state on demand."
  },
  %{
    "id" => 2,
    "author" => "Mika Kalathil",
    "title" => "Creating Reusable Ecto Code",
    "category" => "elixir ecto sql",
    "content" =>
      "Creating a highly reusable Ecto API is one of the ways we can create long-term sustainable code for ourselves, while growing it with our application to allow for infinite combination possibilities and high code reusability. If we write our Ecto code correctly, we can not only have a very well defined split between query definition and combination/execution using our context but also have the ability to re-use the queries we design individually, together with others to create larger complex queries."
  },
  %{
    "id" => 3,
    "author" => "Mark Ericksen",
    "title" => "ThinkingElixir 079: Collaborative Music in LiveView with Nathan Willson",
    "category" => "elixir podcast liveview",
    "content" =>
      "In episode 79 of Thinking Elixir, we talk with Nathan Willson about GEMS, his collaborative music generator written in LiveView. He explains how it’s built, the JS sound library integrations, what could be done by Phoenix and what is done in the browser. Nathan shares how he deployed it globally to 10 regions using Fly.io. We go over some of the challenges he overcame creating an audio focused web application. It’s a fun open-source project that pushes the boundaries of what we think LiveView apps can do!"
  },
  %{
    "id" => 4,
    "title" => "ThinkingElixir 078: Logflare with Chase Granberry",
    "author" => "Mark Ericksen",
    "category" => "elixir podcast logging logflare",
    "content" =>
      "In episode 78 of Thinking Elixir, we talk with Chase Granberry about Logflare. We learn why Chase started the company, what Logflare does, how it’s built on Elixir, about their custom Elixir logger, where the data is stored, how it’s queried, and more! We talk about dealing with the constant stream of log data, how Logflare is collecting and displaying metrics, and talk more about Supabase acquiring the company!"
  }
]

index = Index.add_documents(index, documents)

Search Index

The search result is a list of maps, and each map contains the keys matched, positions, ref, and score, defined below:

  • matched: the number of attributes in which the given query matches
  • score: how well the document ranks compared to the other documents
  • ref: the document id
  • positions: a map of the positions of the matching words in the document

search_query = Kino.Input.text("Search", default: "elixir")
search_query = Kino.Input.read(search_query)
results = Index.search(index, search_query)

NB: Don't forget to fiddle with the search input.
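
Because the result is an ordinary list of maps, it can be processed with the standard Enum functions. The sketch below uses hand-written sample data rather than real search output; the scores are made up, and the use of atom keys is an assumption about the shape of the maps.

```elixir
# Hypothetical results shaped like the description above.
# The values are invented and atom keys are an assumption.
sample_results = [
  %{ref: 3, score: 0.21, matched: 1, positions: %{}},
  %{ref: 1, score: 0.87, matched: 2, positions: %{}},
  %{ref: 4, score: 0.34, matched: 1, positions: %{}}
]

# Ids of the two best-scoring documents, highest score first.
top_refs =
  sample_results
  |> Enum.sort_by(& &1.score, :desc)
  |> Enum.take(2)
  |> Enum.map(& &1.ref)

# top_refs is now [1, 4]
```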

Nested Document Attributes

None of the documents indexed in the earlier example had nested attributes. But imagine a situation where your data source returns documents with nested attributes and you want to search by them. This is possible with Elasticlunr by specifying the top-level attribute.

Let's say our data source returns a list of users, each with an address that is itself an object, and you want to index this information so that you can query it.

# the library comes with a default set of pipeline functions
pipeline = Pipeline.new(Pipeline.default_runners())

users_index =
  Index.new(pipeline: pipeline)
  |> Index.add_field("name")
  |> Index.add_field("address")
  |> Index.add_field("education")

Elasticlunr automatically flattens the nested attributes, so that when using the advanced query DSL you can use dot notation to filter the search results. Now, let's add a few user objects to the index:

documents = [
  %{
    "id" => 1,
    "name" => "rose mary",
    "education" => "BSc.",
    "address" => %{
      "line1" => "Brooklyn Street",
      "line2" => "4181",
      "city" => "Portland",
      "state" => "Oregon",
      "country" => "USA"
    }
  },
  %{
    "id" => 2,
    "name" => "jason richard",
    "education" => "MSc.",
    "address" => %{
      "line1" => "Crown Street",
      "line2" => "2057",
      "city" => "St Malo",
      "state" => "Quebec",
      "country" => "CA"
    }
  },
  %{
    "id" => 3,
    "name" => "peters book",
    "education" => "BSc.",
    "address" => %{
      "line1" => "Murry Street",
      "line2" => "2285",
      "city" => "Norfolk",
      "state" => "Virginia",
      "country" => "USA"
    }
  },
  %{
    "id" => 4,
    "name" => "jason mount",
    "education" => "Highschool",
    "address" => %{
      "line1" => "Aspen Court",
      "line2" => "2057",
      "city" => "Boston",
      "state" => "Massachusetts",
      "country" => "USA"
    }
  }
]

users_index = Index.add_documents(users_index, documents)
search_query = Kino.Input.text("Search users", default: "jason murry")
search_query = Kino.Input.read(search_query)
Index.search(users_index, search_query)
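
To picture what flattening to dot notation means, here is a minimal sketch in plain Elixir. It is not the library's code: the DotFlatten module is hypothetical, and the exact key format Elasticlunr produces internally (for example "address.city") is an assumption based on the description above.

```elixir
# Hypothetical helper, for illustration only.
defmodule DotFlatten do
  # Flatten nested maps into a single-level map whose keys use dot notation.
  def flatten(map, prefix \\ nil) when is_map(map) do
    Enum.reduce(map, %{}, fn {key, value}, acc ->
      path = if prefix, do: "#{prefix}.#{key}", else: to_string(key)

      if is_map(value) do
        # Recurse into nested maps, carrying the path built so far.
        Map.merge(acc, flatten(value, path))
      else
        Map.put(acc, path, value)
      end
    end)
  end
end

DotFlatten.flatten(%{
  "name" => "rose mary",
  "address" => %{"city" => "Portland", "country" => "USA"}
})
# => %{"address.city" => "Portland", "address.country" => "USA", "name" => "rose mary"}
```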

Index Manager

The manager provides CRUD functions that help you keep track of your indexes after their state has changed. First, let's list the indexes the manager currently has loaded:

alias Elasticlunr.IndexManager

IndexManager.loaded_indices()

As seen above, the list is empty. Now, let's add an index:

IndexManager.save(users_index)

IndexManager.loaded_indices()
|> Enum.any?(&(&1 == users_index.name))
|> IO.inspect(label: :users_index_exists)

IndexManager.loaded_indices()

The manager now has the users_index in memory for access.

Query DSL

As with other search engines, you can make more advanced search queries depending on your requirements, and Elasticlunr has not left out such capabilities. The following parts of this document highlight the query types provided by the library and how you can use them.

It's important to note that Elasticlunr tries to replicate the popular Query DSL (Domain Specific Language) of Elasticsearch, with the same behavior, which means the learning curve is gentler if you have experience with that search engine. Elasticlunr provides the bool, match, match_all, not, and terms query types for retrieving insights about an index.

Bool

The bool query combines other queries to retrieve documents that match boolean combinations of clauses. Think of these clauses as the conditions that follow a SELECT statement in a relational database.

The bool query is built from one or more clauses, and each clause has its own type, described below:

| Clause | Description |
| --- | --- |
| must | The clause must appear in the matching documents, and this affects the document's score. |
| must_not | The clause must not appear in the matching document. Scoring is ignored because the clause is executed in the filter context. |
| filter | Like must, the clause must appear in the matching documents but scoring is ignored for the query. |
| should | The clause should appear in the matching document. |

It's important to note that only scores from the must and should clauses contribute to the final score of the matching document.

Index.search(index, %{
  "query" => %{
    "bool" => %{
      "must" => %{
        "terms" => %{"content" => "use"}
      },
      "should" => %{
        "terms" => %{"category" => "elixir"}
      },
      "filter" => %{
        "match" => %{
          "id" => 3
        }
      },
      "must_not" => %{
        "match" => %{
          "author" => "mika"
        }
      },
      "minimum_should_match" => 1
    }
  }
})

You can use the minimum_should_match parameter to specify the number or percentage of should clauses that a returned document must match. If the bool query includes at least one should clause and no must or filter clauses, the default value is 1. Otherwise, the default value is 0.
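
The default rule described above can be written down as a small function. This is only a sketch of the stated rule in plain Elixir, not code taken from the library; the BoolDefaults module and its function name are hypothetical.

```elixir
# Hypothetical helper that encodes the default rule stated above.
defmodule BoolDefaults do
  # Returns the default minimum_should_match for the body of a bool query.
  def default_minimum_should_match(bool) when is_map(bool) do
    has_should? = Map.has_key?(bool, "should")
    has_must_or_filter? = Map.has_key?(bool, "must") or Map.has_key?(bool, "filter")

    # Only a bool query made of should clauses alone defaults to 1.
    if has_should? and not has_must_or_filter?, do: 1, else: 0
  end
end

BoolDefaults.default_minimum_should_match(%{
  "should" => %{"terms" => %{"category" => "elixir"}}
})
# => 1

BoolDefaults.default_minimum_should_match(%{
  "must" => %{"terms" => %{"content" => "use"}},
  "should" => %{"terms" => %{"category" => "elixir"}}
})
# => 0
```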

Match

The match query is the standard query used for full-text search, including support for fuzzy matching. The provided text is analyzed before matching it against documents.

Index.search(index, %{
  "query" => %{
    "match" => %{
      "content" => %{
        "query" => "liveview browser"
      }
    }
  }
})

A match query accepts one or more top-level fields you wish to search; in the example above, it's the content field. Note that when you have more than one top-level field, the match query is rewritten to a bool query internally by the library. The match query accepts the following parameters:

| Parameter | Description |
| --- | --- |
| query | String you wish to find in the provided field. |
| expand | Increases token recall; see token expansion. |
| fuzziness | Maximum edit distance allowed for matching. |
| boost | Floating point number used to decrease or increase the relevance scores of a query. Defaults to 1.0. |
| operator | The boolean operator used to interpret the query value. Available values are or and and. Defaults to or. |
| minimum_should_match | Minimum number of clauses that a document must match for it to be returned. |
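
The difference between the two operator values can be pictured with a toy example in plain Elixir. This is not how the library evaluates a match query; the OperatorSketch module is hypothetical, and it assumes both the document and the query have already been reduced to lists of tokens.

```elixir
# Toy illustration of "or" versus "and"; not the library's matching code.
defmodule OperatorSketch do
  # True when the document tokens satisfy the query tokens under the operator.
  def matches?(doc_tokens, query_tokens, operator \\ "or") do
    hit? = &(&1 in doc_tokens)

    case operator do
      # "or": at least one query token must be present.
      "or" -> Enum.any?(query_tokens, hit?)
      # "and": every query token must be present.
      "and" -> Enum.all?(query_tokens, hit?)
    end
  end
end

doc = ~w(liveview state browser)

OperatorSketch.matches?(doc, ~w(liveview ecto))
# => true

OperatorSketch.matches?(doc, ~w(liveview ecto), "and")
# => false
```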

Match All

The simplest query: it matches all documents and gives each of them a score of 1.0.

| Parameter | Description |
| --- | --- |
| boost | Floating point number used to decrease or increase the relevance scores of a query. Defaults to 1.0. |

Index.search(index, %{
  "query" => %{
    "match_all" => %{}
  }
})

Not

The not query inverts the result of the nested query, giving each of the matched documents a score of 1.0.

Index.search(index, %{
  "query" => %{
    "not" => %{
      "match" => %{
        "content" => %{
          "query" => "elixir"
        }
      }
    }
  }
})

Terms

The terms query returns documents that contain the exact terms in a given field. Use it to find documents based on a precise value such as a price, a product ID, or a username.

Index.search(index, %{
  "query" => %{
    "terms" => %{
      "content" => %{
        "value" => "think"
      }
    }
  }
})

A terms query accepts one or more top-level fields you wish to search; in the example above, it's the content field. Note that when you have more than one top-level field, the terms query is rewritten to a bool query internally by the library. The terms query accepts the following parameters:

| Parameter | Description |
| --- | --- |
| value | A term you wish to find in the provided field. The term must match exactly the field value to return a document. |
| boost | Floating point number used to decrease or increase the relevance scores of a query. Defaults to 1.0. |