Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexing logic improvements #6

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

karol-t-wilk
Copy link
Collaborator

Sorry for the late and big PR, I will make sure the next ones are more manageable 😅

There are several improvements to the architecture of the app:

  • Multiple embedding models can be used at the same time
  • The app is now aware of packages and their releases
  • The app has support for future fragmentation of documentation
  • Adding a package to the app and embedding its docs are now separate actions, implemented as mix tasks
  • The demo page now lets you choose the embedding model used to perform the search
  • The Hex client uses only repo.hex.pm and does not touch the REST API

alias Search.{HexClient, ExDocParser}

@doc """
If given a package name, adds the latest version of the package to the app. If given a `#{HexClient.Release}` adds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If given a package name, adds the latest version of the package to the app. If given a `#{HexClient.Release}` adds
If given a package name, adds the latest version of the package to the app. If given a `%HexClient.Release{}` adds

@@ -0,0 +1,37 @@
defmodule Mix.Tasks.Search.Add do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: let’s call this file search.add.ex as its more common.

@moduledoc """
Usage: mix #{Mix.Task.task_name(__MODULE__)} <MODULE>

Embeds the unembedded docs using the given model given as `Search.Embeddings.<MODULE>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Embeds the unembedded docs using the given model given as `Search.Embeddings.<MODULE>`
Embeds the unembedded docs using the model given as `Search.Embeddings.<MODULE>`

Comment on lines +19 to +29
module =
Enum.find(Search.Application.embedding_models(), fn mod ->
"Elixir.Search.Embeddings.#{module_str}" == "#{mod}"
end)

if module do
{:ok, _} = module.embed(&callback/1)
Mix.shell().info("Done.")
else
Mix.shell().error("Could not find embedding module #{module_str}.")
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
module =
Enum.find(Search.Application.embedding_models(), fn mod ->
"Elixir.Search.Embeddings.#{module_str}" == "#{mod}"
end)
if module do
{:ok, _} = module.embed(&callback/1)
Mix.shell().info("Done.")
else
Mix.shell().error("Could not find embedding module #{module_str}.")
end
module = Module.concat(Search.Embeddings, mod)
module.embed(&callback/1)
Mix.shell().info("Done.")

Embeds any doc fragments which do not have an embedding yet. Recieves an optional callback,
which is called to notify about the embedding progress with the tuple {total, done} as its argument.
"""
def embed(progress_callback \\ &Function.identity/1) do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function is pretty big and we want to avoid generating such in macros for two reasons:

  1. code inside quote is less straightforward to follow then code outside, you need to pay special attention
  2. it increases compilation time

instead we generate small functions that delegate to regular functions on the module, for example:

quote do
  def embed(progress_callback \\ & &1) do
    Search.Embeddings.Embedding.__embed__(__MODULE__, ...)
  end
end

All this being said, at the moment we only have two modules that use this use. I'd wait. I think it's fine to duplicate Ecto schema definition, changeset function, and a couple functions that just delegate to a shared module. WDYT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tip regarding macros - I did read the best metaprogramming practices, but to use them is practice is another thing.

As for whether the macro is even needed here, I opted to create one because AFAIK we plan on benchmarking many models and further on experimenting with finetuning etc. I thought providing a framework for that early on would be a good idea, but after reading your comments I'm not so sure anymore. Could you give some more of your thoughts on this?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The modules differ only in configuration, so what we could do is have a single implementation that accepts config. We could define a behaviour to make the contract more explicit.

So the idea is that in the config we would have roughly this:

config :search,
  embedding_providers: [
    albert_small:
      {Search.Embeddings.BumblebeeProvider,
        model: {:hf, "sentence-transformers/paraphrase-albert-small-v2"},
        embedding_size: 768,
        serving_opts: [
          compile: [batch_size: 16, sequence_length: 100],
          defn_options: [compiler: EXLA]
        ]},
    paraphrase_l3:
      {Search.Embeddings.BumblebeeProvider,
        model: {:hf, "sentence-transformers/paraphrase-MiniLM-L3-v2"},
        embedding_size: 384,
        serving_opts: [
          compile: [batch_size: 16, sequence_length: 512],
          defn_options: [compiler: EXLA]
        ]}
  ]

And then

defmodule Search.Embeddings.BumblebeeProvider do
  @behaviour Search.Embeddings.Provider

  @impl true
  def embed(..., config)

  @impl true
  def child_spec(opts, config)
end

And finally the API would be like Search.Embeddings.embed(:albert_small, ...), it would fetch the config and call the module.

The main question is how does this relate to the database. Could be a single table with model key, or perhaps we could use a single schema and but use different table name on the fly, based on the provider key.
// EDIT: actually we need multiple tables since the embedding column has different size. I believe we can have a single schema and specify the table name explicitly in queries like from e in {"TABLE_NAME", Embedding}), ....

The benefit of such explicit contract is that plugging something else than Bumblebee would be easy, and this is a good flexibility to have for ML models.

This is just a rough sketch, happy to hear what you guys think!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, sounds good to me @jonatanklosko!

Comment on lines +51 to +54
else
{:error, err} ->
Repo.rollback(err)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@repo_url "https://repo.hex.pm"

alias Search.HexClient

def get_releases(package_name) when is_binary(package_name) do
case get("#{@api_url}/packages/#{package_name}") do
{:ok, %{status: 200, body: %{"releases" => releases}}} ->
case get("packages/#{package_name}", plugins: [ReqHex]) do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you move this to the underlying get/1 function? Also please use ReqHex.attach/1, the :plugins option is strictly speaking undocumented at the moment because I still haven't made up my mind about it

Suggested change
case get("packages/#{package_name}", plugins: [ReqHex]) do
case get("packages/#{package_name}") do

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding this, ReqHex fails in zlib when getting a tarball of the docs from the repo too IIRC. I do not know if this is a bug or just a case of out-of-contract use. I'll provide repro steps for this and the other ReqHex issue as soon as I can.

|> Req.new()
|> Req.Request.prepend_response_steps(
handle_errors: fn {req, res} ->
# Looks like ReqHex fails in zlib on non-200 codes, so these are handled here
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it's easy to reproduce could you open up an issue on req_hex repo? Or just DM me if you prefer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will open issue if the problem is in fact reproducible

Comment on lines +52 to +57
|> Req.Request.run_request()

case res do
%RuntimeError{message: message} -> {:error, message}
other -> {:ok, other}
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there's a network error we'd return {:ok, %Mint.TransportError{}} (Req.TransportError on upcoming Req v0.5) which I think is gonna be unexpected. My suggestion is to instead of handling the error message here, make the request using Req.request, and handle the error at the call site, call Exception.message(e).

Suggested change
|> Req.Request.run_request()
case res do
%RuntimeError{message: message} -> {:error, message}
other -> {:ok, other}
end
|> Req.request()

WDYT?

Comment on lines +96 to +104
@doc """
Returns the list of packages.

## Examples

iex> list_packages()
[%Package{}, ...]

"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the docs super focused, these examples are not useful. I wouldn't even document this function but if we document one, might as well document all public ones.

Suggested change
@doc """
Returns the list of packages.
## Examples
iex> list_packages()
[%Package{}, ...]
"""
@doc """
Returns the list of packages.
"""

alias Search.{HexClient, ExDocParser}

@doc """
If given a package name, adds the latest version of the package to the app. If given a `#{HexClient.Release}` adds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one short sentence in the first paragraph please!

def add_package(package_name) when is_binary(package_name) do
case HexClient.get_releases(package_name) do
{:ok, releases} ->
latest = Enum.max_by(releases, & &1.version, Version)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the "latest version" is a bit more subtle, it is this if the only versions are pre-releases, but otherwise it is the latest non-pre-release version.

We have a Postgrex v1.0.0-rc.1 released, checks notes, 6 years ago, and that is not considered "latest version".

See https://github.com/hexpm/hexdocs/blob/20a09a705c312aaa2c22a894eaf018468eca50bb/lib/hexdocs/utils.ex#L33:L54

Comment on lines +24 to +26
Enum.find(Search.Application.embedding_models(), fn mod ->
"Elixir.Search.Embeddings.#{embedding_model}" == "#{mod}"
end)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should decouple the model name from the implementation (module and hierarchy). If we go with the config approach, then we can use the keys, otherwise each module should expose a name or something.

Comment on lines +103 to +109
case embeddings do
%{embedding: embedding} ->
embedding

_ when is_list(embeddings) ->
Stream.map(embeddings, & &1.embedding)
end

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid two code paths, for the singular case you could wrap each text in a list. That said, I would just require the batch_size to be specified in the config (with an aggressive Keyword.fetch!/2).

serving_opts: [
compile: [batch_size: 16, sequence_length: 100],
defn_options: [compiler: EXLA]
]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
]
],
load_model_opts: [backend: EXLA.Backend]

(now the default backend is {EXLA.Backend, client: :host}, but the parameters we want to load onto the preferred client, that is, GPU if available)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants