bug(python): Embeddings with lots of zeros cause numeric instability when indexing. #1222
LanceDB version

v0.6.8

What happened?

I have a tabular dataset and I wanted to see if I could use LanceDB to create a KNearestNeighborsRegressor that uses Lance as a backend instead of numpy. The initial results were solid: out of the box it is faster than the scikit-learn implementation. But this was without an index, so I wondered if the index might improve things.

I'm about to show a bunch of warnings, but for context it might help to explain how this "embedding" is created. I am using this Kaggle dataset.

To encode this, I'm using the scikit-learn pipeline shown in the comments below. I think it's important to observe that the `OneHotEncoder` here will introduce a vector with lots (and lots!) of zeros. I think this is messing up the indexing, because when I index I see red lines appear in the notebook with warnings. I'm only showing a subset of the logs here, but after about a minute and a half it reports that an index has indeed been calculated. However, if I now search ... I get these warnings again. Eventually it dies with a panic.

Are there known steps to reproduce?

I can send the full notebook, but I hope the above description is sufficient.
Comments
I figured I might try using PCA to limit the zeros.

```python
from sklearn.decomposition import PCA

pipe = make_union(
    SelectCols(["mmr"]),
    make_pipeline(
        SelectCols(["year", "condition", "odometer"]),
        StandardScaler()
    ),
    make_pipeline(
        SelectCols(["make", "model", "body", "transmission", "color"]),
        OneHotEncoder(sparse_output=False),
        PCA(n_components=100)
    )
)

X_demo = pipe.fit_transform(df)
```

But this does not resolve the issue. Makes sense too: I think you'd really have to reduce it immensely in order for the "zero-effect" to disappear.
Just for the heck of it I figured I might try PCA with 10 components. This should really lose a lot of information, but ... I hit another issue while building an index this time.

This is interesting, because the smallest absolute number in my data (…
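For what it's worth, one way to make the "zero-effect" concrete is to measure how sparse the encoded matrix actually is. A minimal sketch, assuming the `X_demo` produced by the pipeline above:

```python
import numpy as np

# Hypothetical sanity check, not from the original notebook: the fraction of
# exact zeros in the encoded matrix. With a wide one-hot block this tends to
# sit close to 1.0, which is what seems to trip up the k-means training.
print(f"zero fraction: {np.mean(X_demo == 0):.3f}")
```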
Hi @koaning, I tried to reproduce your first panic by creating an index with vectors containing lots of zeros.
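A sketch of what such a reproduction might look like (my reconstruction; the table name, dimensionality, and sparsity level are illustrative, not from the report):

```python
import numpy as np
import lancedb

# Synthetic vectors where ~95% of the entries are exact zeros, mimicking
# the output of a wide one-hot encoder.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 128)).astype("float32")
X[rng.random(X.shape) < 0.95] = 0.0

db = lancedb.connect("./.lancedb")
tbl = db.create_table("zeros-repro", data=[{"vector": v} for v in X])
tbl.create_index(metric="L2", num_partitions=256, num_sub_vectors=8)
```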
Here's the relevant code from the notebook.

```python
import pandas as pd

df = pd.read_csv("car_prices.csv").dropna()
df.head(3)
```

```python
from sklearn.pipeline import make_union, make_pipeline
from sklearn.compose import make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer
from skrub import SelectCols

pipe = make_union(
    SelectCols(["mmr"]),
    make_pipeline(
        SelectCols(["year", "condition", "odometer"]),
        StandardScaler()
    ),
    make_pipeline(
        SelectCols(["make", "model", "body", "transmission", "color"]),
        OneHotEncoder(sparse_output=False)
    )
)

X_demo = pipe.fit_transform(df)
```

```python
from lancedb.pydantic import Vector, LanceModel
import lancedb

db = lancedb.connect("./.lancedb")

class CarVector(LanceModel):
    vector: Vector(X_demo.shape[1])
    id: int
    sellingprice: int

batch = [{"vector": v, "id": idx, "sellingprice": p}
         for idx, (p, v) in enumerate(zip(df['sellingprice'], X_demo))]

tbl = db.create_table(
    "orig-model",
    schema=CarVector,
    on_bad_vectors='drop',  # This is an interesting one by the way!
    data=batch
)
```

Here's where the warnings come in.

```python
%%time
tbl.create_index()
```

```python
%%time
tbl.search(X_demo[0]).limit(20).to_pandas()
```

I ran this on both my M1 MacBook Air as well as my M1 Mac mini and saw the same panic in both cases. Could be that this is Mac specific, but not 100% sure.
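As an aside, the KNN-regressor idea from the issue description could be sketched on top of this table like so (my illustration; the function name and `k=20` are arbitrary):

```python
# Hypothetical sketch, not from the notebook: predict a selling price as the
# mean price of the k nearest neighbours returned by LanceDB.
def knn_predict(query_vector, k=20):
    hits = tbl.search(query_vector).limit(k).to_pandas()
    return hits["sellingprice"].mean()

print(knn_predict(X_demo[0]))
```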
Hi @koaning, I will add some checks to report meaningful errors for these cases.
The reasoning sure seems sound, but when I change the hyperparameters on my machine I still get the same error. This surprised me, but I think I'm also not able to set the number of clusters that it'll fit? I can imagine that with fewer clusters we might get out of the weeds here. Numerically, I can imagine that there are way too many clusters for this dataset and that it's pretty easy to end up with clusters that can't reach any points. This is merely brain-farting though ... I may need to think it over. Curious to hear other thoughts.
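One way to probe that intuition outside of Lance (my own sketch, using scikit-learn's k-means rather than Lance's implementation):

```python
# Illustrative only: cluster the encoded matrix with scikit-learn's k-means
# and inspect the smallest cluster sizes. Many tiny clusters would support
# the "too many clusters for this dataset" intuition above.
from collections import Counter
from sklearn.cluster import KMeans

km = KMeans(n_clusters=256, n_init=1, random_state=0).fit(X_demo)
sizes = Counter(km.labels_)
print(sorted(sizes.values())[:10])  # the ten smallest cluster sizes
```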
When I set `num_partitions` and `num_sub_vectors` to 1 I don't see any errors anymore, so that may also just be the remedy for this dataset.
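For reference, that remedy as a call (the values come straight from this thread):

```python
# Workaround discussed above: a single partition and a single sub-vector
# sidestep the k-means instability for this dataset.
tbl.create_index(metric="L2", num_partitions=1, num_sub_vectors=1)
```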
For the warning logs: … For the panic: …
I may be mistaken, but I think that I'm not able to set the number of clusters; this is the signature of `tbl.create_index`:

```python
tbl.create_index(
    metric='L2',
    num_partitions=256,
    num_sub_vectors=96,
    vector_column_name='vector',
    replace: 'bool' = True,
    accelerator: 'Optional[str]' = None,
    index_cache_size: 'Optional[int]' = None,
)
```
Oh, you are right... lancedb doesn't expose this param. Setting …
Gotcha. Being able to set the number of clusters somehow does feel like a valid feature; I can see how I may want to tune that param. But as far as this issue goes, I guess better error messages would be fine. I also understand that my use-case is a bit out of the ordinary, and there are also things I could do to make these embeddings "better" with regards to the retrieval engine.
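One concrete "embedding trick" along those lines might be a random projection of the one-hot block (my suggestion, not something tried in this thread; the component count is arbitrary):

```python
# A Gaussian random projection turns the sparse one-hot block into a dense
# vector with essentially no exact zeros, which should be friendlier to the
# IVF/PQ k-means training. Illustrative sketch only.
from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.random_projection import GaussianRandomProjection
from skrub import SelectCols

dense_pipe = make_union(
    SelectCols(["mmr"]),
    make_pipeline(
        SelectCols(["year", "condition", "odometer"]),
        StandardScaler()
    ),
    make_pipeline(
        SelectCols(["make", "model", "body", "transmission", "color"]),
        OneHotEncoder(sparse_output=False),
        GaussianRandomProjection(n_components=64, random_state=0)
    )
)

X_dense = dense_pipe.fit_transform(df)
```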