-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Indexing logic improvements #6
base: main
Are you sure you want to change the base?
Conversation
alias Search.{HexClient, ExDocParser} | ||
|
||
@doc """ | ||
If given a package name, adds the latest version of the package to the app. If given a `#{HexClient.Release}` adds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If given a package name, adds the latest version of the package to the app. If given a `#{HexClient.Release}` adds | |
If given a package name, adds the latest version of the package to the app. If given a `%HexClient.Release{}` adds |
@@ -0,0 +1,37 @@ | |||
defmodule Mix.Tasks.Search.Add do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: let’s call this file search.add.ex as its more common.
@moduledoc """ | ||
Usage: mix #{Mix.Task.task_name(__MODULE__)} <MODULE> | ||
|
||
Embeds the unembedded docs using the given model given as `Search.Embeddings.<MODULE>` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Embeds the unembedded docs using the given model given as `Search.Embeddings.<MODULE>` | |
Embeds the unembedded docs using the model given as `Search.Embeddings.<MODULE>` |
module = | ||
Enum.find(Search.Application.embedding_models(), fn mod -> | ||
"Elixir.Search.Embeddings.#{module_str}" == "#{mod}" | ||
end) | ||
|
||
if module do | ||
{:ok, _} = module.embed(&callback/1) | ||
Mix.shell().info("Done.") | ||
else | ||
Mix.shell().error("Could not find embedding module #{module_str}.") | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
module = | |
Enum.find(Search.Application.embedding_models(), fn mod -> | |
"Elixir.Search.Embeddings.#{module_str}" == "#{mod}" | |
end) | |
if module do | |
{:ok, _} = module.embed(&callback/1) | |
Mix.shell().info("Done.") | |
else | |
Mix.shell().error("Could not find embedding module #{module_str}.") | |
end | |
module = Module.concat(Search.Embeddings, mod) | |
module.embed(&callback/1) | |
Mix.shell().info("Done.") |
Embeds any doc fragments which do not have an embedding yet. Recieves an optional callback, | ||
which is called to notify about the embedding progress with the tuple {total, done} as its argument. | ||
""" | ||
def embed(progress_callback \\ &Function.identity/1) do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function is pretty big and we want to avoid generating such in macros for two reasons:
- code inside quote is less straightforward to follow then code outside, you need to pay special attention
- it increases compilation time
instead we generate small functions that delegate to regular functions on the module, for example:
quote do
def embed(progress_callback \\ & &1) do
Search.Embeddings.Embedding.__embed__(__MODULE__, ...)
end
end
All this being said, at the moment we only have two modules that use this use. I'd wait. I think it's fine to duplicate Ecto schema definition, changeset function, and a couple functions that just delegate to a shared module. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the tip regarding macros - I did read the best metaprogramming practices, but to use them is practice is another thing.
As for whether the macro is even needed here, I opted to create one because AFAIK we plan on benchmarking many models and further on experimenting with finetuning etc. I thought providing a framework for that early on would be a good idea, but after reading your comments I'm not so sure anymore. Could you give some more of your thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The modules differ only in configuration, so what we could do is have a single implementation that accepts config. We could define a behaviour to make the contract more explicit.
So the idea is that in the config we would have roughly this:
config :search,
embedding_providers: [
albert_small:
{Search.Embeddings.BumblebeeProvider,
model: {:hf, "sentence-transformers/paraphrase-albert-small-v2"},
embedding_size: 768,
serving_opts: [
compile: [batch_size: 16, sequence_length: 100],
defn_options: [compiler: EXLA]
]},
paraphrase_l3:
{Search.Embeddings.BumblebeeProvider,
model: {:hf, "sentence-transformers/paraphrase-MiniLM-L3-v2"},
embedding_size: 384,
serving_opts: [
compile: [batch_size: 16, sequence_length: 512],
defn_options: [compiler: EXLA]
]}
]
And then
defmodule Search.Embeddings.BumblebeeProvider do
@behaviour Search.Embeddings.Provider
@impl true
def embed(..., config)
@impl true
def child_spec(opts, config)
end
And finally the API would be like Search.Embeddings.embed(:albert_small, ...)
, it would fetch the config and call the module.
The main question is how does this relate to the database. Could be a single table with model key, or perhaps we could use a single schema and but use different table name on the fly, based on the provider key.
// EDIT: actually we need multiple tables since the embedding column has different size. I believe we can have a single schema and specify the table name explicitly in queries like from e in {"TABLE_NAME", Embedding}), ...
.
The benefit of such explicit contract is that plugging something else than Bumblebee would be easy, and this is a good flexibility to have for ML models.
This is just a rough sketch, happy to hear what you guys think!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, sounds good to me @jonatanklosko!
else | ||
{:error, err} -> | ||
Repo.rollback(err) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you copy Repo.transaction_with/2
from https://github.com/livebook-teams/teams/blob/46646c3c5a180ef550ab418c1c13e4fcc58902e2/lib/teams/repo.ex#L11?
@repo_url "https://repo.hex.pm" | ||
|
||
alias Search.HexClient | ||
|
||
def get_releases(package_name) when is_binary(package_name) do | ||
case get("#{@api_url}/packages/#{package_name}") do | ||
{:ok, %{status: 200, body: %{"releases" => releases}}} -> | ||
case get("packages/#{package_name}", plugins: [ReqHex]) do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you move this to the underlying get/1 function? Also please use ReqHex.attach/1
, the :plugins
option is strictly speaking undocumented at the moment because I still haven't made up my mind about it
case get("packages/#{package_name}", plugins: [ReqHex]) do | |
case get("packages/#{package_name}") do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding this, ReqHex fails in zlib when getting a tarball of the docs from the repo too IIRC. I do not know if this is a bug or just a case of out-of-contract use. I'll provide repro steps for this and the other ReqHex issue as soon as I can.
|> Req.new() | ||
|> Req.Request.prepend_response_steps( | ||
handle_errors: fn {req, res} -> | ||
# Looks like ReqHex fails in zlib on non-200 codes, so these are handled here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it's easy to reproduce could you open up an issue on req_hex repo? Or just DM me if you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will open issue if the problem is in fact reproducible
|> Req.Request.run_request() | ||
|
||
case res do | ||
%RuntimeError{message: message} -> {:error, message} | ||
other -> {:ok, other} | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there's a network error we'd return {:ok, %Mint.TransportError{}}
(Req.TransportError on upcoming Req v0.5) which I think is gonna be unexpected. My suggestion is to instead of handling the error message here, make the request using Req.request, and handle the error at the call site, call Exception.message(e).
|> Req.Request.run_request() | |
case res do | |
%RuntimeError{message: message} -> {:error, message} | |
other -> {:ok, other} | |
end | |
|> Req.request() |
WDYT?
@doc """ | ||
Returns the list of packages. | ||
|
||
## Examples | ||
|
||
iex> list_packages() | ||
[%Package{}, ...] | ||
|
||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the docs super focused, these examples are not useful. I wouldn't even document this function but if we document one, might as well document all public ones.
@doc """ | |
Returns the list of packages. | |
## Examples | |
iex> list_packages() | |
[%Package{}, ...] | |
""" | |
@doc """ | |
Returns the list of packages. | |
""" |
alias Search.{HexClient, ExDocParser} | ||
|
||
@doc """ | ||
If given a package name, adds the latest version of the package to the app. If given a `#{HexClient.Release}` adds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one short sentence in the first paragraph please!
def add_package(package_name) when is_binary(package_name) do | ||
case HexClient.get_releases(package_name) do | ||
{:ok, releases} -> | ||
latest = Enum.max_by(releases, & &1.version, Version) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the "latest version" is a bit more subtle, it is this if the only versions are pre-releases, but otherwise it is the latest non-pre-release version.
We have a Postgrex v1.0.0-rc.1 released, checks notes, 6 years ago, and that is not considered "latest version".
Enum.find(Search.Application.embedding_models(), fn mod -> | ||
"Elixir.Search.Embeddings.#{embedding_model}" == "#{mod}" | ||
end) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should decouple the model name from the implementation (module and hierarchy). If we go with the config approach, then we can use the keys, otherwise each module should expose a name or something.
case embeddings do | ||
%{embedding: embedding} -> | ||
embedding | ||
|
||
_ when is_list(embeddings) -> | ||
Stream.map(embeddings, & &1.embedding) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid two code paths, for the singular case you could wrap each text in a list. That said, I would just require the batch_size
to be specified in the config (with an aggressive Keyword.fetch!/2
).
serving_opts: [ | ||
compile: [batch_size: 16, sequence_length: 100], | ||
defn_options: [compiler: EXLA] | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
] | |
], | |
load_model_opts: [backend: EXLA.Backend] |
(now the default backend is {EXLA.Backend, client: :host}
, but the parameters we want to load onto the preferred client, that is, GPU if available)
Sorry for the late and big PR, I will make sure the next ones are more manageable 😅
There are several improvements to the architecture of the app: