
bug(python): Embeddings with lots of zeros cause numeric instability when indexing. #1222

Open · koaning opened this issue Apr 15, 2024 · 11 comments
Labels: bug (Something isn't working)

koaning (Contributor) commented Apr 15, 2024

LanceDB version

v0.6.8

What happened?

I have a tabular dataset and I wanted to see if I could use LanceDB to create a KNearestNeighborsRegressor that uses Lance as a backend instead of numpy. The initial results were solid: out of the box it was faster than the scikit-learn implementation. But this was without an index, so I wondered whether an index might improve things further.
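Roughly, the idea is the following (a minimal sketch, not my actual code; it assumes a LanceDB table `tbl` whose rows carry a `sellingprice` column, like the one created further down in this thread):

def knn_predict(query_vector, k=20):
    # Fetch the k nearest rows via LanceDB's vector search ...
    neighbors = tbl.search(query_vector).limit(k).to_pandas()
    # ... and predict by averaging their target values.
    return neighbors["sellingprice"].mean()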

I'm about to show a bunch of warnings, but for context it might help to explain how this "embedding" is created. I am using this Kaggle dataset; here's a screenshot of the table.

(screenshot of the dataset's table omitted)

To encode this, I'm using this scikit-learn pipeline.

from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from skrub import SelectCols

pipe = make_union(
    # Pass the `mmr` column through unchanged.
    SelectCols(["mmr"]),
    # Scale the numeric columns.
    make_pipeline(
        SelectCols(["year", "condition", "odometer"]),
        StandardScaler()
    ),
    # One-hot encode the categorical columns (this is where all the zeros come from).
    make_pipeline(
        SelectCols(["make", "model", "body", "transmission", "color"]),
        OneHotEncoder(sparse_output=False)
    )
)

# `df` is the Kaggle car-prices DataFrame (loaded in the reproduction further down).
X_demo = pipe.fit_transform(df)

I think it's important to observe that the OneHotEncoder here introduces a vector with lots (and lots!) of zeros, and I think this is what messes up the indexing.
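A quick sanity check on just how sparse this is (using the X_demo from above; the exact numbers will vary):

import numpy as np

# Most entries are zeros coming from the one-hot block.
print(X_demo.shape)
print("fraction of zeros:", np.mean(X_demo == 0))

When I build the index, I see red lines appear in the notebook with warnings that look like this: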

[2024-04-15T16:55:53Z WARN  lance_linalg::kmeans] KMeans: more than 10% of clusters are empty: 74 of 256.
    Help: this could mean your dataset is too small to have a meaningful index (less than 5000 vectors) or has many duplicate vectors.
[2024-04-15T16:55:53Z WARN  lance_linalg::kmeans] KMeans: more than 10% of clusters are empty: 95 of 256.
    Help: this could mean your dataset is too small to have a meaningful index (less than 5000 vectors) or has many duplicate vectors.
[2024-04-15T16:55:53Z WARN  lance_linalg::kmeans] KMeans: more than 10% of clusters are empty: 70 of 256.
    Help: this could mean your dataset is too small to have a meaningful index (less than 5000 vectors) or has many duplicate vectors.
[2024-04-15T16:55:53Z WARN  lance_linalg::kmeans] KMeans: more than 10% of clusters are empty: 99 of 256.
    Help: this could mean your dataset is too small to have a meaningful index (less than 5000 vectors) or has many duplicate vectors.
[2024-04-15T16:55:53Z WARN  lance_linalg::kmeans] KMeans: more than 10% of clusters are empty: 78 of 256.
    Help: this could mean your dataset is too small to have a meaningful index (less than 5000 vectors) or has many duplicate vectors.
[2024-04-15T16:55:53Z WARN  lance_linalg::kmeans] KMeans: more than 10% of clusters are empty: 109 of 256.
    Help: this could mean your dataset is too small to have a meaningful index (less than 5000 vectors) or has many duplicate vectors.
[2024-04-15T16:55:53Z WARN  lance_linalg::kmeans] KMeans: more than 10% of clusters are empty: 66 of 256.
    Help: this could mean your dataset is too small to have a meaningful index (less than 5000 vectors) or has many duplicate vectors.
[2024-04-15T16:55:53Z WARN  lance_linalg::kmeans] KMeans: more than 10% of clusters are empty: 63 of 256.
    Help: this could mean your dataset is too small to have a meaningful index (less than 5000 vectors) or has many duplicate vectors.
[2024-04-15T16:55:53Z WARN  lance_linalg::kmeans] KMeans: more than 10% of clusters are empty: 63 of 256.

I'm only showing a subset of the logs here, but after about a minute and a half it reports that an index has indeed been calculated. However, if I now search ...

thread 'lance-cpu' panicked at /Users/runner/work/lance/lance/rust/lance-index/src/vector/pq/utils.rs:72:14:
range end index 221184 out of range for slice of length 220884
thread 'lance-cpu' panicked at /Users/runner/work/lance/lance/rust/lance-index/src/vector/pq/utils.rs:72:14:
range end index 221184 out of range for slice of length 220884
(the same panic repeats a few more times)

... I get these panics. Eventually it dies with this message.

thread 'lance_background_thread' panicked at /Users/runner/work/lance/lance/rust/lance/src/utils/tokio.rs:34:24:
called `Result::unwrap()` on an `Err` value: RecvError(())
thread 'lance-cpu' panicked at /Users/runner/work/lance/lance/rust/lance-index/src/vector/pq/utils.rs:72:14:
range end index 221184 out of range for slice of length 220884
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
File <timed eval>:1

File ~/Development/knn-lance/venv/lib/python3.11/site-packages/lancedb/query.py:262, in LanceQueryBuilder.to_pandas(self, flatten)
    247 def to_pandas(self, flatten: Optional[Union[int, bool]] = None) -> "pd.DataFrame":
    248     """
    249     Execute the query and return the results as a pandas DataFrame.
    250     In addition to the selected columns, LanceDB also returns a vector
   (...)
    260         If unspecified, do not flatten the nested columns.
    261     """
--> 262     tbl = self.to_arrow()
    263     if flatten is True:
    264         while True:

File ~/Development/knn-lance/venv/lib/python3.11/site-packages/lancedb/query.py:527, in LanceVectorQueryBuilder.to_arrow(self)
    518 def to_arrow(self) -> pa.Table:
    519     """
    520     Execute the query and return the results as an
    521     [Apache Arrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).
   (...)
    525     vector and the returned vectors.
    526     """
--> 527     return self.to_batches().read_all()

File ~/Development/knn-lance/venv/lib/python3.11/site-packages/pyarrow/ipc.pxi:757, in pyarrow.lib.RecordBatchReader.read_all()

File ~/Development/knn-lance/venv/lib/python3.11/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

OSError: Io error: Execution error: External error: Execution error: ExecNode(Take): thread panicked: task 532288 panicked

I'm guessing this is all related to the many zeros in the vector; I can totally see how that messes up KMeans. In my case I can either choose to not use the index or go for some embedding tricks to reduce the zeros. But I figured I'd report this one, since it may be a relevant edge case for other folks as well.

Are there known steps to reproduce?

I can send the full notebook, but I hope the above description is sufficient.

koaning added the bug label Apr 15, 2024

koaning (Contributor, Author) commented Apr 15, 2024

I figured I might try using PCA to limit the zeros.

from sklearn.decomposition import PCA

pipe = make_union(
    SelectCols(["mmr"]),
    make_pipeline(
        SelectCols(["year", "condition", "odometer"]),
        StandardScaler()
    ),
    make_pipeline(
        SelectCols(["make", "model", "body", "transmission", "color"]),
        OneHotEncoder(sparse_output=False),
        # Project the one-hot block down to 100 dense components.
        PCA(n_components=100)
    )
)

X_demo = pipe.fit_transform(df)

But this does not resolve the issue. That makes sense too; I think you'd really have to reduce the dimensionality immensely for the "zero effect" to disappear.

koaning (Contributor, Author) commented Apr 15, 2024

Just for the heck of it I figured I might try PCA with 10 components. This should really lose a lot of information, but ... I hit another issue, this time while building the index.

thread '<unnamed>' panicked at /Users/runner/work/lance/lance/rust/lance-index/src/vector/kmeans.rs:41:20:
attempt to divide by zero
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
File <timed eval>:1

File ~/Development/knn-lance/venv/lib/python3.11/site-packages/lancedb/table.py:1145, in LanceTable.create_index(self, metric, num_partitions, num_sub_vectors, vector_column_name, replace, accelerator, index_cache_size)
   1134 def create_index(
   1135     self,
   1136     metric="L2",
   (...)
   1142     index_cache_size: Optional[int] = None,
   1143 ):
   1144     """Create an index on the table."""
-> 1145     self._dataset_mut.create_index(
   1146         column=vector_column_name,
   1147         index_type="IVF_PQ",
   1148         metric=metric,
   1149         num_partitions=num_partitions,
   1150         num_sub_vectors=num_sub_vectors,
   1151         replace=replace,
   1152         accelerator=accelerator,
   1153         index_cache_size=index_cache_size,
   1154     )

File ~/Development/knn-lance/venv/lib/python3.11/site-packages/lance/dataset.py:1492, in LanceDataset.create_index(self, column, index_type, name, metric, replace, num_partitions, ivf_centroids, pq_codebook, num_sub_vectors, accelerator, index_cache_size, shuffle_partition_batches, shuffle_partition_concurrency, **kwargs)
   1489 if shuffle_partition_concurrency is not None:
   1490     kwargs["shuffle_partition_concurrency"] = shuffle_partition_concurrency
-> 1492 self._ds.create_index(column, index_type, name, replace, kwargs)
   1493 return LanceDataset(self.uri, index_cache_size=index_cache_size)

PanicException: attempt to divide by zero

This is interesting, because the smallest absolute number in my data (np.min(np.abs(emb))) is 5.159067705838442e-06. It's a small number, sure, but before I had actual zeros. So this issue may be coming from within Lance itself and not be related to my data.

BubbleCal (Contributor) commented

Hi @koaning, I tried to reproduce your first panic by creating an index with vectors containing lots of zeros. I got the same warning logs as you, but I didn't get a panic when searching. Could you share more info about your index params, or your notebook?

koaning (Contributor, Author) commented Apr 19, 2024

Here's the relevant code from the notebook.

import pandas as pd

df = pd.read_csv("car_prices.csv").dropna()
df.head(3)

from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from skrub import SelectCols

pipe = make_union(
    SelectCols(["mmr"]),
    make_pipeline(
        SelectCols(["year", "condition", "odometer"]),
        StandardScaler()
    ),
    make_pipeline(
        SelectCols(["make", "model", "body", "transmission", "color"]),
        OneHotEncoder(sparse_output=False)
    )
)

X_demo = pipe.fit_transform(df)

from lancedb.pydantic import Vector, LanceModel
import lancedb

db = lancedb.connect("./.lancedb")

class CarVector(LanceModel):
    vector: Vector(X_demo.shape[1])
    id: int
    sellingprice: int

batch = [{"vector": v, "id": idx, "sellingprice": p}
         for idx, (p, v) in enumerate(zip(df['sellingprice'], X_demo))]

tbl = db.create_table(
    "orig-model",
    schema=CarVector,
    on_bad_vectors='drop',  # This is an interesting one by the way!
    data=batch
)

Here's where the warnings come in.

%%time

tbl.create_index()

And here's the search that then panics:

%%time

tbl.search(X_demo[0]).limit(20).to_pandas()

I ran this on both my M1 MacBook Air and my M1 Mac mini and saw the same panic in both cases. It could be that this is Mac-specific, but I'm not 100% sure.

BubbleCal (Contributor) commented Apr 19, 2024

Hi @koaning,
TL;DR: you can resolve the issue by creating the index with the params below (see the sketch after the details):

  • num_partitions should be about num_rows / 1,000,000 or sqrt(num_rows), but at least 1. The default value is 256, which is too large for your dataset.
  • num_sub_vectors should evenly divide the dimension of the vector; according to your code it can be 1 or 5 (your vectors have 5 dimensions; OneHotEncoder produces one dimension per field, if I understand correctly).

Details:
The IVF_PQ index divides the dataset into num_partitions partitions; each partition should contain enough rows to be meaningful.
The PQ step transforms each vector into a uint8 array of length num_sub_vectors by splitting the vector into chunks of equal size dimension / num_sub_vectors.
If num_sub_vectors doesn't evenly divide the vector dimension, it reads the wrong amount of data; that's the cause of the first panic.
If num_sub_vectors is greater than the vector dimension, it ends up with a zero-length uint8 array; that's the cause of the second panic.
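Something like this (a sketch only; the values are illustrative and reuse the df, X_demo, and tbl from the reproduction above):

import math

num_rows = len(df)       # rows in the table
dim = X_demo.shape[1]    # vector dimension

tbl.create_index(
    metric="L2",
    # At least 1; sqrt(num_rows) is a reasonable starting point here.
    num_partitions=max(1, int(math.sqrt(num_rows))),
    # Must evenly divide `dim`; 1 always does.
    num_sub_vectors=1,
)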

I will add some checks to report meaningful errors for these cases.

koaning (Contributor, Author) commented Apr 19, 2024

The reasoning sure seems sound; however, when I change the hyperparams on my machine I still get the same error. This surprised me, but I think I'm also not able to set the number of clusters that it will fit?

(screenshot of the create_index call and the resulting error omitted)

I can imagine that with fewer clusters we might also get out of the weeds here. Numerically, I can imagine that there are way too many clusters for this dataset and that it's pretty easy to end up with clusters that can't reach any points. This is merely brain-farting though ... I may need to think it over. Curious to hear other thoughts though.

koaning (Contributor, Author) commented Apr 19, 2024

When I set num_partitions and num_sub_vectors to 1, I don't see any errors anymore, so that may also just be the remedy for this dataset.
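For reference, that boils down to:

# Both set to 1; effectively a single-partition index for this dataset.
tbl.create_index(num_partitions=1, num_sub_vectors=1)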

BubbleCal (Contributor) commented Apr 19, 2024

For the warning logs:
the PQ training also divides the data into partitions; the number of partitions (centroids) is pow(2, num_bits). By default num_bits=8, so that's 2^8 = 256 centroids. Try to create the index with the additional param num_bits=1.

For the panic:
could you check the dimension of your vectors, to make sure it's actually 5? I noticed it still reads the wrong amount of data; setting num_sub_vectors to 1 should also work.
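Something along these lines, using the X_demo from the reproduction:

# The second entry is the vector dimension that num_sub_vectors must divide.
print(X_demo.shape)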

koaning (Contributor, Author) commented Apr 19, 2024

I may be mistaken, but I don't think I'm able to set num_bits. This is the signature of tbl.create_index:

tbl.create_index(
    metric='L2',
    num_partitions=256,
    num_sub_vectors=96,
    vector_column_name='vector',
    replace: 'bool' = True,
    accelerator: 'Optional[str]' = None,
    index_cache_size: 'Optional[int]' = None,
)

BubbleCal (Contributor) commented

Oh, you are right... lancedb doesn't expose this param.

Setting num_sub_vectors=1 should work if you have enough rows.

koaning (Contributor, Author) commented Apr 19, 2024

Gotcha. Being able to set the number of clusters somehow does feel like a valid feature; I can see how I might want to tune that param. But as far as this issue goes, I guess better error messages would be fine. I also understand that my use case is a bit out of the ordinary; there are also things I could do to make these embeddings "better" with regards to the retrieval engine.

BubbleCal added a commit to lancedb/lance that referenced this issue Apr 24, 2024