[BUG] sample_by on view does not work #2023

Open
daniel-falk opened this issue Nov 23, 2022 · 3 comments
Labels
bug Something isn't working

Comments

@daniel-falk
Contributor

🐛🐛 Bug Report

Hi, when I run a sample by query on the result of another query (a view), it fails with an exception.

ds = deeplake.load("hub://activeloop/mnist-train")
ds2 = ds.query("select * limit 1000")
ds2.query("select * sample by max_weight(contains(labels, '1'): 2, true: 1) limit 10").labels.numpy()

This fails with the following exception:

IndexError                                Traceback (most recent call last)
<ipython-input-147-431c4577b4bb> in <cell line: 1>()
----> 1 ds2.query("select * sample by max_weight(contains(labels, '1'): 2, true: 1) limit 10").labels.numpy()

~/src/Hub/deeplake/core/dataset/dataset.py in query(self, query_string)
   1709         from deeplake.enterprise import query
   1710 
-> 1711         return query(self, query_string)
   1712 
   1713     def sample_by(

~/.pyenv/versions/3.10.0/lib/python3.10/site-packages/humbug/report.py in wrapped_callable(*args, **kwargs)
    443             self.feature_report(callable.__name__, parameters)
    444 
--> 445             return callable(*args, **kwargs)
    446 
    447         return wrapped_callable

~/src/Hub/deeplake/enterprise/libdeeplake_query.py in query(dataset, query_string)
     39     dsv = ds.query(query_string)
     40     indexes = dsv.indexes
---> 41     return dataset[indexes]
     42 
     43 

~/src/Hub/deeplake/core/dataset/dataset.py in __getitem__(self, item, is_iteration)
    456                 ret = self.__class__(
    457                     storage=self.storage,
--> 458                     index=self.index[item],
    459                     group_index=self.group_index,
    460                     read_only=self._read_only,

~/src/Hub/deeplake/core/index/index.py in __getitem__(self, item)
    374             return new_index
    375         elif isinstance(item, list):
--> 376             return self[(tuple(item),)]  # type: ignore
    377         elif isinstance(item, Index):
    378             return self[tuple(v.value for v in item.values)]  # type: ignore

~/src/Hub/deeplake/core/index/index.py in __getitem__(self, item)
    371             for idx, sub_item in enumerate(item):
    372                 ax = new_index.find_axis(offset=idx)
--> 373                 new_index = new_index.compose_at(sub_item, ax)
    374             return new_index
    375         elif isinstance(item, list):

~/src/Hub/deeplake/core/index/index.py in compose_at(self, item, i)
    330             return Index(self.values + [IndexEntry(item)])
    331         else:
--> 332             new_values = self.values[:i] + [self.values[i][item]] + self.values[i + 1 :]
    333             return Index(new_values)
    334 

~/src/Hub/deeplake/core/index/index.py in __getitem__(self, item)
    189                 return IndexEntry(self.value[item])
    190             elif isinstance(item, (tuple, list)):
--> 191                 new_value = tuple(self.value[idx] for idx in item)
    192                 return IndexEntry(new_value)
    193 

~/src/Hub/deeplake/core/index/index.py in <genexpr>(.0)
    189                 return IndexEntry(self.value[item])
    190             elif isinstance(item, (tuple, list)):
--> 191                 new_value = tuple(self.value[idx] for idx in item)
    192                 return IndexEntry(new_value)
    193 

IndexError: tuple index out of range
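For readers following the traceback, here is a minimal plain-Python sketch of the last frame. It only mirrors the expression `tuple(self.value[idx] for idx in item)` from `index.py`; the helper name `compose` and the sample values are hypothetical. It is an assumption, not something the traceback confirms, that the indexes returned by the query fall outside the 1000-row view.

```python
def compose(view_positions, item):
    """Pick entries of a tuple-backed view by position.

    Mirrors the failing expression from the traceback:
        tuple(self.value[idx] for idx in item)
    """
    return tuple(view_positions[idx] for idx in item)


# Hypothetical view holding the first 1000 rows of a larger dataset.
view_positions = tuple(range(1000))

# Positions inside the view compose fine.
print(compose(view_positions, (3, 10, 999)))  # → (3, 10, 999)

# Any position >= 1000 (assumed here, e.g. an index that refers to the
# full dataset rather than the view) raises the same error as in the report.
try:
    compose(view_positions, (3, 1500))
except IndexError as exc:
    print("IndexError:", exc)  # → IndexError: tuple index out of range
```

If that assumption holds, the error would depend on which positions the sampler happens to return, not on the query text itself.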

Using versions:

deeplake==3.1.0
libdeeplake==0.0.29

It would also be nice if I could automatically sample to get a uniform class distribution instead of specifying weights by hand, because right now I need to do this in several steps:

  • Filter on any metadata that I am interested in
  • Calculate the class imbalance
  • Sample by the inverse of the class imbalance
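
The manual steps listed above can be sketched outside the query engine with NumPy. This is only an illustration of inverse-frequency weighting on an array of labels (for example the output of `.labels.numpy()` on an already filtered view); the function names `inverse_frequency_weights` and `balanced_sample` are hypothetical and not part of deeplake.

```python
import numpy as np


def inverse_frequency_weights(labels):
    """Return one sampling probability per sample, proportional to 1 / class count."""
    labels = np.asarray(labels).ravel()  # accept shape (N,) or (N, 1)
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    weights = 1.0 / counts[inverse]      # rare classes get larger weights
    return weights / weights.sum()       # normalize to a probability vector


def balanced_sample(labels, n, seed=0):
    """Draw n sample positions (with replacement) so classes are equally likely."""
    rng = np.random.default_rng(seed)
    p = inverse_frequency_weights(labels)
    return rng.choice(len(p), size=n, replace=True, p=p)


# Toy example: 90 samples of class 0 and 10 samples of class 1.
labels = np.array([0] * 90 + [1] * 10)
p = inverse_frequency_weights(labels)
print(p[:90].sum(), p[90:].sum())  # each class carries half of the total probability
positions = balanced_sample(labels, n=10)
```

With these weights every class has the same total probability of being drawn, which is the uniform distribution the request asks for; a built-in option would remove the need to compute the counts by hand.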
@daniel-falk daniel-falk added the bug Something isn't working label Nov 23, 2022
@AbhinavTuli
Contributor

Hey @daniel-falk, can you try this on main? PR #2018 should have fixed the issue.

@daniel-falk
Contributor Author

It does not seem to work for me on main either:

In [6]: deeplake.__version__
Out[6]: '3.1.1'

In [8]: deeplake.__file__
Out[8]: '/home/daniel/src/Hub/deeplake/__init__.py'

In [10]: !cd /home/daniel/src/Hub/deeplake/ && git log -n1
commit c2c64607a42e3135923bb529e729118c4d4cdf2a (HEAD -> main, origin/main, origin/HEAD)
Author: Abhinav Tuli <42538472+AbhinavTuli@users.noreply.github.com>
Date:   Wed Nov 23 18:45:55 2022 +0530

    Handle repeated samples in shuffle (#2018)
    
    * fix
    
    * Fix.
    
    * fix for fix
    
    Co-authored-by: Sasun Hambardzumyan <xustup@gmail.com>

In [11]: ds = deeplake.load("hub://activeloop/mnist-train")
hub://activeloop/mnist-train loaded successfully.
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/activeloop/mnist-train

In [12]: ds2 = ds.query("select * limit 1000")

In [13]: ds2.query("select * sample by max_weight(contains(labels, '1'): 2, true: 1) limit 10").labels.numpy()
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-13-431c4577b4bb> in <cell line: 1>()
----> 1 ds2.query("select * sample by max_weight(contains(labels, '1'): 2, true: 1) limit 10").labels.numpy()

~/src/Hub/deeplake/core/dataset/dataset.py in query(self, query_string)
   1702         from deeplake.enterprise import query
   1703 
-> 1704         return query(self, query_string)
   1705 
   1706     def sample_by(

~/.pyenv/versions/3.10.0/lib/python3.10/site-packages/humbug/report.py in wrapped_callable(*args, **kwargs)
    443             self.feature_report(callable.__name__, parameters)
    444 
--> 445             return callable(*args, **kwargs)
    446 
    447         return wrapped_callable

~/src/Hub/deeplake/enterprise/libdeeplake_query.py in query(dataset, query_string)
     39     dsv = ds.query(query_string)
     40     indexes = dsv.indexes
---> 41     return dataset[indexes]
     42 
     43 

~/src/Hub/deeplake/core/dataset/dataset.py in __getitem__(self, item, is_iteration)
    456                 ret = self.__class__(
    457                     storage=self.storage,
--> 458                     index=self.index[item],
    459                     group_index=self.group_index,
    460                     read_only=self._read_only,

~/src/Hub/deeplake/core/index/index.py in __getitem__(self, item)
    374             return new_index
    375         elif isinstance(item, list):
--> 376             return self[(tuple(item),)]  # type: ignore
    377         elif isinstance(item, Index):
    378             return self[tuple(v.value for v in item.values)]  # type: ignore

~/src/Hub/deeplake/core/index/index.py in __getitem__(self, item)
    371             for idx, sub_item in enumerate(item):
    372                 ax = new_index.find_axis(offset=idx)
--> 373                 new_index = new_index.compose_at(sub_item, ax)
    374             return new_index
    375         elif isinstance(item, list):

~/src/Hub/deeplake/core/index/index.py in compose_at(self, item, i)
    330             return Index(self.values + [IndexEntry(item)])
    331         else:
--> 332             new_values = self.values[:i] + [self.values[i][item]] + self.values[i + 1 :]
    333             return Index(new_values)
    334 

~/src/Hub/deeplake/core/index/index.py in __getitem__(self, item)
    189                 return IndexEntry(self.value[item])
    190             elif isinstance(item, (tuple, list)):
--> 191                 new_value = tuple(self.value[idx] for idx in item)
    192                 return IndexEntry(new_value)
    193 

~/src/Hub/deeplake/core/index/index.py in <genexpr>(.0)
    189                 return IndexEntry(self.value[item])
    190             elif isinstance(item, (tuple, list)):
--> 191                 new_value = tuple(self.value[idx] for idx in item)
    192                 return IndexEntry(new_value)
    193 

IndexError: tuple index out of range

@AbhinavTuli
Contributor

Thanks for pointing it out, Daniel. This seems to be a different problem from the one addressed in the PR mentioned above. @khustup is working on it, and we should have a fix soon.
