
csr_matrix with the .get method will return "Last value of index pointer should be less than the size of index and data arrays" #94

Open
linmuchuiyang opened this issue Jun 16, 2022 · 3 comments

Comments

@linmuchuiyang

import scanpy as sc
import anndata
import time
import os,wget
import cudf
import cupy as cp
from cuml.decomposition import PCA
from cuml.manifold import TSNE
from cuml.cluster import KMeans
from cuml.preprocessing import StandardScaler
from matplotlib import pyplot as plt
import warnings
warnings.filterwarnings('ignore', 'Expected ')
warnings.simplefilter('ignore')
import rmm

rmm.reinitialize(
    managed_memory=True, # Allows oversubscription
    pool_allocator=False, # default is False
    devices=0, # GPU device IDs to register. By default registers only GPU 0.
)
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

MT_GENE_PREFIX = "MT-" # Prefix for mitochondria genes to regress out
markers = ["ACE2", "TMPRSS2", "EPCAM"] # Marker genes for visualization
# filtering cells
min_genes_per_cell = 1 # Filter out cells with fewer genes than this expressed
max_genes_per_cell = 100000 # Filter out cells with more genes than this expressed
pt_max = 1
# filtering genes
min_cells_per_gene = 1 # Filter out genes expressed in fewer cells than this
n_top_genes = 2000 # Number of highly variable genes to retain
# PCA
n_components = 50 # Number of principal components to compute
# t-SNE
tsne_n_pcs = 20 # Number of principal components to use for t-SNE
# KNN
n_neighbors = 15 # Number of nearest neighbors for KNN graph
knn_n_pcs = 50 # Number of principal components to use for finding nearest neighbors
# UMAP
umap_min_dist = 0.3
umap_spread = 1.0
# Gene ranking
ranking_n_top_genes = 50

adata = sc.read('/rapids_clara/c952.diff_PRO.h5ad')

genes = cudf.Series(adata.var_names)
sparse_gpu_array = cp.sparse.csr_matrix(adata.raw.X)
sparse_gpu_array[1350000:1360000].get()

When I call .get() on a slice of the csr_matrix (sparse_gpu_array), the following error is raised:

ValueError                                Traceback (most recent call last)
Input In [94], in <cell line: 1>()
----> 1 sparse_gpu_array[1350000:1360000].get()

File /opt/conda/envs/rapids/lib/python3.9/site-packages/cupyx/scipy/sparse/csr.py:73, in csr_matrix.get(self, stream)
     71 indices = self.indices.get(stream)
     72 indptr = self.indptr.get(stream)
---> 73 return scipy.sparse.csr_matrix(
     74     (data, indices, indptr), shape=self._shape)

File /opt/conda/envs/rapids/lib/python3.9/site-packages/scipy/sparse/_compressed.py:106, in _cs_matrix.__init__(self, arg1, shape, dtype, copy)
    103 if dtype is not None:
    104     self.data = self.data.astype(dtype, copy=False)
--> 106 self.check_format(full_check=False)

File /opt/conda/envs/rapids/lib/python3.9/site-packages/scipy/sparse/_compressed.py:178, in _cs_matrix.check_format(self, full_check)
    176     raise ValueError("indices and data should have the same size")
    177 if (self.indptr[-1] > len(self.indices)):
--> 178     raise ValueError("Last value of index pointer should be less than "
    179                      "the size of index and data arrays")
    181 self.prune()
    183 if full_check:
    184     # check format validity (more expensive)

ValueError: Last value of index pointer should be less than the size of index and data arrays

However, if I set the interval to [1000:1360000], no error is raised. The original h5ad file is about 20 GiB, and the shape of sparse_gpu_array is (1462703, 27610). Why does the error appear only for this particular interval?
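For reference, the ValueError comes from SciPy's CSR validation, which fires whenever the last indptr entry exceeds the length of the indices/data arrays. A minimal, self-contained reproduction with synthetic data (not the original matrix) shows the same failure:

```python
import numpy as np
import scipy.sparse as sp

data = np.ones(3)
indices = np.array([0, 1, 2])
indptr = np.array([0, 2, 5])  # last entry (5) > len(indices) (3): inconsistent

try:
    sp.csr_matrix((data, indices, indptr), shape=(2, 3))
except ValueError as e:
    print(e)  # same "Last value of index pointer..." message as in the traceback
```

So the question is how the sliced GPU matrix ends up with an indptr whose last entry no longer matches the size of its data/indices arrays.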

@cjnolet
Member

cjnolet commented Jul 13, 2022

@linmuchuiyang thanks for opening an issue on this. Is there any chance that dataset is publicly available? If not, have you been able to reproduce this behavior on a dataset that is publicly available, or with generated data?

This behavior does seem pretty weird, but it's not the first time I've seen strange behavior like this. Does it also fail if you select a few entries closer to the beginning of the matrix? Something like sparse_gpu_array[:10]?

However, if I set the interval with [1000:1360000], it will not feedback with any error.

When it doesn’t feedback with any error, are you saying it succeeds or that it seems to fail silently without any error?

In the meantime, I’ll try loading up a couple of the datasets we use in the examples and see if I can reproduce the behavior.

@linmuchuiyang
Author

@cjnolet, I've already passed the original data file to an NVIDIA SA. He said he had sent you an email with the data. Has there been any feedback so far?

@zacharylau10

I encountered this error too, when the matrix has more than one million cells and ~20k genes. I think the error is produced by vstack when chunks are merged: negative indptr values appear in the CSR/CSC matrix, so the matrix may be exceeding the limit of the CSR/CSC format (fewer than 2^31 nonzero elements with 32-bit indices?). But I don't have any idea how to fix it.
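The negative-indptr hypothesis is easy to illustrate: if the cumulative nonzero count is stored in 32-bit integers, any value past 2^31 - 1 wraps negative, and a negative last indptr entry would then fail SciPy's `indptr[-1] > len(indices)` comparison in unexpected ways. A sketch of the wraparound with synthetic counts (assuming 32-bit index arrays, which is a hypothesis here, not confirmed from the original data):

```python
import numpy as np

# Per-chunk nonzero counts whose running total crosses 2**31 - 1.
counts = np.array([2**31 - 2, 3], dtype=np.int64)

indptr64 = np.cumsum(counts)          # correct in 64-bit: [2147483646, 2147483649]
indptr32 = indptr64.astype(np.int32)  # second entry wraps negative
print(indptr32)                       # [ 2147483646 -2147483647]
```

If this is what happens during the chunked vstack, casting the index arrays to int64 before concatenation would be one way to sidestep the overflow.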
