Error reading zarr file #48

Open
mumichae opened this issue Aug 15, 2023 · 1 comment · Fixed by #49
Labels
bug Something isn't working

mumichae commented Aug 15, 2023

Describe the bug
I get the following error when trying to read a zarr file:

/hpc/projects/hca_integration/workspace/hca_pipelines/workflow/metrics/.snakemake/conda/152006085908115996d5d904bffd05fa_/lib/python3.9/site-packages/mudata/_core/mudata.py:491: UserWarning: Cannot join columns with the same name because var_names are intersecting.                                             
  warnings.warn(
/hpc/projects/hca_integration/workspace/hca_pipelines/workflow/metrics/.snakemake/conda/152006085908115996d5d904bffd05fa_/lib/python3.9/site-packages/mudata/_core/mudata.py:491: UserWarning: Cannot join columns with the same name because var_names are intersecting.                                             
  warnings.warn(
Traceback (most recent call last):                                                                                                                                                                                                                                                                                    
  File "/hpc/projects/hca_integration/workspace/hca_pipelines/workflow/metrics/.snakemake/conda/152006085908115996d5d904bffd05fa_/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc                                                                                                     
Traceback (most recent call last):                                                                                                                                                                                                                                                                                    
  File "/hpc/projects/hca_integration/workspace/hca_pipelines/workflow/metrics/.snakemake/conda/152006085908115996d5d904bffd05fa_/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc                                                                                                     
    return self._engine.get_loc(casted_key)                                                                                                                                                                                                                                                                           
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
    return self._engine.get_loc(casted_key)                                                                                                                                                                                                                                                                           
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc                                                                                                                                                                                                                                  
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc                                                                                                                                                                                                                                  
  File "pandas/_libs/index.pyx", line 203, in pandas._libs.index.IndexEngine._get_loc_duplicates                                                                                                                                                                                                                      
  File "pandas/_libs/index.pyx", line 203, in pandas._libs.index.IndexEngine._get_loc_duplicates
  File "pandas/_libs/index.pyx", line 211, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer                                                                                                                                                                                                                  
  File "pandas/_libs/index.pyx", line 211, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index._unpack_bool_indexer                                                                                                                                                                                                                                 
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index._unpack_bool_indexer
KeyError: 'ABHD17A' 

The key 'ABHD17A' exists in the dataset and, judging by the code path that runs, seems to be duplicated (although I couldn't confirm this in the anndata objects that I used to create the mudata zarr file).

To Reproduce
example.h5mu.zarr.zip

unzip example.h5mu.zarr.zip
import mudata as mu
mu.read_zarr('example.h5mu.zarr')

Expected behaviour
No error when reading the file

System

  • OS: Rocky Linux release 8.7 (Green Obsidian) x86_64
  • Python version 3.9
  • mudata 0.2.3
  • muon 0.1.5
  • anndata 0.9.2
  • scanpy 1.9.3
  • pandas 1.5.3
  • zarr 2.16.0

Additional context
After some exploration, I found that making copies of the assigned values solved the problem for me

prev_values = prev_index.values
now_values = now_index.values

I made those changes in PR #49

mumichae added the bug label Aug 15, 2023

gtca (Collaborator) commented Sep 10, 2023

Thanks a lot, @mumichae!

The main issue here is actually that read_zarr ignores the mod-order attribute.
A proper fix is coming with v0.3.


More clarifications:

"although I couldn't confirm this in the anndata objects"

The gene name ABHD17A is present in all three modalities and is therefore repeated three times in the global annotation.
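
To illustrate why the duplication is not visible in any single anndata object, here is a minimal pure-Python sketch. The modality names and all gene lists are hypothetical toy data, not taken from the attached file; in practice the lists would come from each modality's var_names.

```python
from collections import Counter

# Hypothetical per-modality feature names (toy data, assumed for illustration).
mod_var_names = {
    "mod1": ["ABHD17A", "A1BG"],
    "mod2": ["ABHD17A", "A2M"],
    "mod3": ["ABHD17A", "AAAS"],
}

# Names are unique within every single modality ...
unique_within = {
    mod: len(set(names)) == len(names) for mod, names in mod_var_names.items()
}

# ... but counting over all modalities reveals names that occur more than once.
counts = Counter(name for names in mod_var_names.values() for name in names)
shared = {name: n for name, n in counts.items() if n > 1}

print(unique_within)
print(shared)
```

Each modality passes a uniqueness check on its own; only the count across modalities shows the name occurring three times.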

Because many gene names are duplicated across modalities, the order of the global .var annotation (if there were any) would have been mixed up, as there would be no way to know which modality a duplicated feature really came from.
With the correct mod-order, concatenating the var_names of the individual modalities in that order matches the global var_names.
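
The role of the modality order can be sketched in plain Python. This is only an illustration under assumptions: the modality names, gene lists, and the helper concat_var_names are hypothetical, and the alphabetical ordering stands in for any order other than the stored one; it is not how mudata is implemented.

```python
# Hypothetical toy modalities that all contain the gene name "ABHD17A".
mod_var_names = {
    "rna": ["ABHD17A", "GENE1"],
    "atac": ["ABHD17A", "GENE2"],
    "prot": ["ABHD17A", "GENE3"],
}


def concat_var_names(mods, order):
    """Concatenate per-modality feature names following the given modality order."""
    return [name for mod in order for name in mods[mod]]


# Assumed order in which the modalities were written (what mod-order would record).
mod_order = ["rna", "atac", "prot"]
global_var_names = concat_var_names(mod_var_names, mod_order)

# Rebuilding with the recorded order reproduces the global names ...
matches_with_mod_order = (
    concat_var_names(mod_var_names, mod_order) == global_var_names
)

# ... while a different order (here: alphabetical) does not.
matches_alphabetical = (
    concat_var_names(mod_var_names, sorted(mod_var_names)) == global_var_names
)

# The duplicated name sits at several positions, so a lookup by name alone
# cannot tell which modality a given row belongs to.
dup_positions = [i for i, n in enumerate(global_var_names) if n == "ABHD17A"]
print(matches_with_mod_order, matches_alphabetical, dup_positions)
```

With unique feature names the rows could be matched by name regardless of order; with duplicates, position is the only remaining information, which is why the stored order matters.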

By the way, duplicated feature names are something that MuData's design advises against. Is this multimodal data at all? Could the axes interface be useful here?

gtca added a commit that referenced this issue Sep 10, 2023
gtca added this to the v0.3.0 milestone Sep 10, 2023