Indexing logic improvements #6

karol-t-wilk · 2024-04-09T11:11:13Z

Sorry for the late and big PR, I will make sure the next ones are more manageable 😅

There are several improvements to the architecture of the app:

Multiple embedding models can be used at the same time
The app is now aware of packages and their releases
The app has support for future fragmentation of documentation
Adding a package to the app and embedding its docs are now separate actions, implemented as mix tasks
The demo page now lets you choose the embedding model used to perform the search
The Hex client uses only repo.hex.pm and does not touch the REST API

josevalim · 2024-04-09T12:29:49Z

lib/search/packages.ex

+  alias Search.{HexClient, ExDocParser}
+
+  @doc """
+  If given a package name, adds the latest version of the package to the app. If given a `#{HexClient.Release}` adds


Suggested change

If given a package name, adds the latest version of the package to the app. If given a `#{HexClient.Release}` adds

If given a package name, adds the latest version of the package to the app. If given a `%HexClient.Release{}` adds

wojtekmach · 2024-04-09T11:18:09Z

lib/mix/tasks/search/add.ex

@@ -0,0 +1,37 @@
+defmodule Mix.Tasks.Search.Add do


nitpick: let’s call this file search.add.ex as its more common.

wojtekmach · 2024-04-09T11:19:05Z

lib/mix/tasks/search/embed.ex

+  @moduledoc """
+  Usage: mix #{Mix.Task.task_name(__MODULE__)} <MODULE>
+
+  Embeds the unembedded docs using the given model given as `Search.Embeddings.<MODULE>`


Suggested change

Embeds the unembedded docs using the given model given as `Search.Embeddings.<MODULE>`

Embeds the unembedded docs using the model given as `Search.Embeddings.<MODULE>`

wojtekmach · 2024-04-09T12:09:34Z

lib/mix/tasks/search/embed.ex

+    module =
+      Enum.find(Search.Application.embedding_models(), fn mod ->
+        "Elixir.Search.Embeddings.#{module_str}" == "#{mod}"
+      end)
+
+    if module do
+      {:ok, _} = module.embed(&callback/1)
+      Mix.shell().info("Done.")
+    else
+      Mix.shell().error("Could not find embedding module #{module_str}.")
+    end


Suggested change

module =

Enum.find(Search.Application.embedding_models(), fn mod ->

"Elixir.Search.Embeddings.#{module_str}" == "#{mod}"

end)

if module do

{:ok, _} = module.embed(&callback/1)

Mix.shell().info("Done.")

else

Mix.shell().error("Could not find embedding module #{module_str}.")

end

module = Module.concat(Search.Embeddings, mod)

module.embed(&callback/1)

Mix.shell().info("Done.")

wojtekmach · 2024-04-09T12:22:48Z

lib/search/embeddings/embedding.ex

+      Embeds any doc fragments which do not have an embedding yet. Recieves an optional callback,
+      which is called to notify about the embedding progress with the tuple {total, done} as its argument.
+      """
+      def embed(progress_callback \\ &Function.identity/1) do


this function is pretty big and we want to avoid generating such in macros for two reasons:

code inside quote is less straightforward to follow then code outside, you need to pay special attention

it increases compilation time

instead we generate small functions that delegate to regular functions on the module, for example:

quote do def embed(progress_callback \\ & &1) do Search.Embeddings.Embedding.__embed__(__MODULE__, ...) end end

All this being said, at the moment we only have two modules that use this use. I'd wait. I think it's fine to duplicate Ecto schema definition, changeset function, and a couple functions that just delegate to a shared module. WDYT?

Thanks for the tip regarding macros - I did read the best metaprogramming practices, but to use them is practice is another thing.

As for whether the macro is even needed here, I opted to create one because AFAIK we plan on benchmarking many models and further on experimenting with finetuning etc. I thought providing a framework for that early on would be a good idea, but after reading your comments I'm not so sure anymore. Could you give some more of your thoughts on this?

The modules differ only in configuration, so what we could do is have a single implementation that accepts config. We could define a behaviour to make the contract more explicit.

So the idea is that in the config we would have roughly this:

config :search, embedding_providers: [ albert_small: {Search.Embeddings.BumblebeeProvider, model: {:hf, "sentence-transformers/paraphrase-albert-small-v2"}, embedding_size: 768, serving_opts: [ compile: [batch_size: 16, sequence_length: 100], defn_options: [compiler: EXLA] ]}, paraphrase_l3: {Search.Embeddings.BumblebeeProvider, model: {:hf, "sentence-transformers/paraphrase-MiniLM-L3-v2"}, embedding_size: 384, serving_opts: [ compile: [batch_size: 16, sequence_length: 512], defn_options: [compiler: EXLA] ]} ]

And then

defmodule Search.Embeddings.BumblebeeProvider do @behaviour Search.Embeddings.Provider @impl true def embed(..., config) @impl true def child_spec(opts, config) end

And finally the API would be like Search.Embeddings.embed(:albert_small, ...), it would fetch the config and call the module.

The main question is how does this relate to the database. Could be a single table with model key, or perhaps we could use a single schema and but use different table name on the fly, based on the provider key.
// EDIT: actually we need multiple tables since the embedding column has different size. I believe we can have a single schema and specify the table name explicitly in queries like from e in {"TABLE_NAME", Embedding}), ....

The benefit of such explicit contract is that plugging something else than Bumblebee would be easy, and this is a good flexibility to have for ML models.

This is just a rough sketch, happy to hear what you guys think!

Yup, sounds good to me @jonatanklosko!

wojtekmach · 2024-04-09T12:28:24Z

lib/search/packages.ex

+        else
+          {:error, err} ->
+            Repo.rollback(err)
+        end


could you copy Repo.transaction_with/2 from https://github.com/livebook-teams/teams/blob/46646c3c5a180ef550ab418c1c13e4fcc58902e2/lib/teams/repo.ex#L11?

wojtekmach · 2024-04-09T12:56:12Z

lib/search/hex_client.ex

  @repo_url "https://repo.hex.pm"

  alias Search.HexClient

  def get_releases(package_name) when is_binary(package_name) do
-    case get("#{@api_url}/packages/#{package_name}") do
-      {:ok, %{status: 200, body: %{"releases" => releases}}} ->
+    case get("packages/#{package_name}", plugins: [ReqHex]) do


could you move this to the underlying get/1 function? Also please use ReqHex.attach/1, the :plugins option is strictly speaking undocumented at the moment because I still haven't made up my mind about it

Suggested change

case get("packages/#{package_name}", plugins: [ReqHex]) do

case get("packages/#{package_name}") do

Regarding this, ReqHex fails in zlib when getting a tarball of the docs from the repo too IIRC. I do not know if this is a bug or just a case of out-of-contract use. I'll provide repro steps for this and the other ReqHex issue as soon as I can.

wojtekmach · 2024-04-09T12:57:33Z

lib/search/hex_client.ex

+      |> Req.new()
+      |> Req.Request.prepend_response_steps(
+        handle_errors: fn {req, res} ->
+          # Looks like ReqHex fails in zlib on non-200 codes, so these are handled here


if it's easy to reproduce could you open up an issue on req_hex repo? Or just DM me if you prefer.

Will open issue if the problem is in fact reproducible

wojtekmach · 2024-04-09T13:02:17Z

lib/search/hex_client.ex

+      |> Req.Request.run_request()
+
+    case res do
+      %RuntimeError{message: message} -> {:error, message}
+      other -> {:ok, other}
+    end


if there's a network error we'd return {:ok, %Mint.TransportError{}} (Req.TransportError on upcoming Req v0.5) which I think is gonna be unexpected. My suggestion is to instead of handling the error message here, make the request using Req.request, and handle the error at the call site, call Exception.message(e).

Suggested change

|> Req.Request.run_request()

case res do

%RuntimeError{message: message} -> {:error, message}

other -> {:ok, other}

end

|> Req.request()

WDYT?

wojtekmach · 2024-04-09T13:03:32Z

lib/search/packages.ex

+  @doc """
+  Returns the list of packages.
+
+  ## Examples
+
+      iex> list_packages()
+      [%Package{}, ...]
+
+  """


Let's keep the docs super focused, these examples are not useful. I wouldn't even document this function but if we document one, might as well document all public ones.

Suggested change

@doc """

Returns the list of packages.

## Examples

iex> list_packages()

[%Package{}, ...]

"""

@doc """

Returns the list of packages.

"""

wojtekmach · 2024-04-09T13:04:29Z

lib/search/packages.ex

+  alias Search.{HexClient, ExDocParser}
+
+  @doc """
+  If given a package name, adds the latest version of the package to the app. If given a `#{HexClient.Release}` adds


Just one short sentence in the first paragraph please!

wojtekmach · 2024-04-09T13:08:59Z

lib/search/packages.ex

+  def add_package(package_name) when is_binary(package_name) do
+    case HexClient.get_releases(package_name) do
+      {:ok, releases} ->
+        latest = Enum.max_by(releases, & &1.version, Version)


the "latest version" is a bit more subtle, it is this if the only versions are pre-releases, but otherwise it is the latest non-pre-release version.

We have a Postgrex v1.0.0-rc.1 released, checks notes, 6 years ago, and that is not considered "latest version".

See https://github.com/hexpm/hexdocs/blob/20a09a705c312aaa2c22a894eaf018468eca50bb/lib/hexdocs/utils.ex#L33:L54

jonatanklosko · 2024-04-10T06:06:44Z

lib/search_web/controllers/page_controller.ex

+      Enum.find(Search.Application.embedding_models(), fn mod ->
+        "Elixir.Search.Embeddings.#{embedding_model}" == "#{mod}"
+      end)


We should decouple the model name from the implementation (module and hierarchy). If we go with the config approach, then we can use the keys, otherwise each module should expose a name or something.

jonatanklosko · 2024-04-10T06:22:48Z

lib/search/embeddings/embedding.ex

+              case embeddings do
+                %{embedding: embedding} ->
+                  embedding
+
+                _ when is_list(embeddings) ->
+                  Stream.map(embeddings, & &1.embedding)
+              end


To avoid two code paths, for the singular case you could wrap each text in a list. That said, I would just require the batch_size to be specified in the config (with an aggressive Keyword.fetch!/2).

jonatanklosko · 2024-04-10T06:25:30Z

lib/search/embeddings/paraphrase_albert_small.ex

+    serving_opts: [
+      compile: [batch_size: 16, sequence_length: 100],
+      defn_options: [compiler: EXLA]
+    ]


Suggested change

]

],

load_model_opts: [backend: EXLA.Backend]

(now the default backend is {EXLA.Backend, client: :host}, but the parameters we want to load onto the preferred client, that is, GPU if available)

karol-t-wilk added 18 commits March 18, 2024 14:24

Create packages context

77c07ae

Add DocItem schema

611b51e

Add DocFragment schema

eaae03c

Add default host to config

d1abf65

Add Embeddings context

833cf56

Add paraphrase-L3 as an embedding model

37a3fb8

Add more options for the embedding macro

3fbcdce

Add tests, clean up old unused modules

c6a6de9

Set up tests to handle multiple embedding models

0bcdff0

Add another embedding model to the app

31e4f10

Add the embed function

5f4b3ff

Rework HexClient to only use hex repo

6b442cb

Add the function to add packages

f24496c

Add tests and bugfixes for the items and fragments helper functions

499cda4

Add mix task for adding documentation

6d3df9c

Add mix task for embeddings

9cbc01f

Update the frontend to work with new changes

ac96408

Formatting

5236045

josevalim reviewed Apr 9, 2024

View reviewed changes

wojtekmach reviewed Apr 9, 2024

View reviewed changes

jonatanklosko reviewed Apr 10, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Indexing logic improvements #6

Indexing logic improvements #6

karol-t-wilk commented Apr 9, 2024

josevalim Apr 9, 2024

wojtekmach Apr 9, 2024

wojtekmach Apr 9, 2024

wojtekmach Apr 9, 2024

wojtekmach Apr 9, 2024

karol-t-wilk Apr 9, 2024

jonatanklosko Apr 9, 2024 •

edited

wojtekmach Apr 10, 2024

wojtekmach Apr 9, 2024

wojtekmach Apr 9, 2024

karol-t-wilk Apr 9, 2024

wojtekmach Apr 9, 2024

karol-t-wilk Apr 9, 2024

wojtekmach Apr 9, 2024

wojtekmach Apr 9, 2024

wojtekmach Apr 9, 2024

wojtekmach Apr 9, 2024

jonatanklosko Apr 10, 2024

jonatanklosko Apr 10, 2024

jonatanklosko Apr 10, 2024 •

edited

	If given a package name, adds the latest version of the package to the app. If given a `#{HexClient.Release}` adds
	If given a package name, adds the latest version of the package to the app. If given a `%HexClient.Release{}` adds

	Embeds the unembedded docs using the given model given as `Search.Embeddings.<MODULE>`
	Embeds the unembedded docs using the model given as `Search.Embeddings.<MODULE>`

	case get("packages/#{package_name}", plugins: [ReqHex]) do
	case get("packages/#{package_name}") do

Indexing logic improvements #6

Are you sure you want to change the base?

Indexing logic improvements #6

Conversation

karol-t-wilk commented Apr 9, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jonatanklosko Apr 9, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jonatanklosko Apr 10, 2024 • edited

Choose a reason for hiding this comment

jonatanklosko Apr 9, 2024 •

edited

jonatanklosko Apr 10, 2024 •

edited