
csr_matrix with the .get method will return "Last value of index pointer should be less than the size of index and data arrays" #94

Open
linmuchuiyang opened this issue Jun 16, 2022 · 3 comments

Comments

@linmuchuiyang

import scanpy as sc
import anndata
import time
import os,wget
import cudf
import cupy as cp
from cuml.decomposition import PCA
from cuml.manifold import TSNE
from cuml.cluster import KMeans
from cuml.preprocessing import StandardScaler
from matplotlib import pyplot as plt
import warnings
warnings.filterwarnings('ignore', 'Expected ')
warnings.simplefilter('ignore')
import rmm

rmm.reinitialize(
    managed_memory=True, # Allows oversubscription
    pool_allocator=False, # default is False
    devices=0, # GPU device IDs to register. By default registers only GPU 0.
)
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

MT_GENE_PREFIX = "MT-" # Prefix for mitochondria genes to regress out
markers = ["ACE2", "TMPRSS2", "EPCAM"] # Marker genes for visualization
# filtering cells
min_genes_per_cell = 1 # Filter out cells with fewer genes than this expressed
max_genes_per_cell = 100000 # Filter out cells with more genes than this expressed
pt_max = 1
# filtering genes
min_cells_per_gene = 1 # Filter out genes expressed in fewer cells than this
n_top_genes = 2000 # Number of highly variable genes to retain
# PCA
n_components = 50 # Number of principal components to compute
# t-SNE
tsne_n_pcs = 20 # Number of principal components to use for t-SNE
# KNN
n_neighbors = 15 # Number of nearest neighbors for KNN graph
knn_n_pcs = 50 # Number of principal components to use for finding nearest neighbors
# UMAP
umap_min_dist = 0.3
umap_spread = 1.0
# Gene ranking
ranking_n_top_genes = 50

adata = sc.read('/rapids_clara/c952.diff_PRO.h5ad')

genes = cudf.Series(adata.var_names)
sparse_gpu_array = cp.sparse.csr_matrix(adata.raw.X)
sparse_gpu_array[1350000:1360000].get()

When I call .get() on a slice of the csr_matrix (sparse_gpu_array), the following error is raised:

ValueError                                Traceback (most recent call last)
Input In [94], in <cell line: 1>()
----> 1 sparse_gpu_array[1350000:1360000].get()

File /opt/conda/envs/rapids/lib/python3.9/site-packages/cupyx/scipy/sparse/csr.py:73, in csr_matrix.get(self, stream)
     71 indices = self.indices.get(stream)
     72 indptr = self.indptr.get(stream)
---> 73 return scipy.sparse.csr_matrix(
     74     (data, indices, indptr), shape=self._shape)

File /opt/conda/envs/rapids/lib/python3.9/site-packages/scipy/sparse/_compressed.py:106, in _cs_matrix.__init__(self, arg1, shape, dtype, copy)
    103 if dtype is not None:
    104     self.data = self.data.astype(dtype, copy=False)
--> 106 self.check_format(full_check=False)

File /opt/conda/envs/rapids/lib/python3.9/site-packages/scipy/sparse/_compressed.py:178, in _cs_matrix.check_format(self, full_check)
    176     raise ValueError("indices and data should have the same size")
    177 if (self.indptr[-1] > len(self.indices)):
--> 178     raise ValueError("Last value of index pointer should be less than "
    179                      "the size of index and data arrays")
    181 self.prune()
    183 if full_check:
    184     # check format validity (more expensive)

ValueError: Last value of index pointer should be less than the size of index and data arrays

However, if I set the interval to [1000:1360000], no error is raised. The original h5ad file is about 20 GiB, and the shape of sparse_gpu_array is (1462703, 27610). Why does the error appear only for this particular interval?
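For reference, the ValueError comes from SciPy's CSR validation, which fires whenever the last indptr entry exceeds the length of the indices/data arrays. A minimal, self-contained reproduction with synthetic data (not the original matrix) shows the same failure:

```python
import numpy as np
import scipy.sparse as sp

data = np.ones(3)
indices = np.array([0, 1, 2])
indptr = np.array([0, 2, 5])  # last entry (5) > len(indices) (3): inconsistent

try:
    sp.csr_matrix((data, indices, indptr), shape=(2, 3))
except ValueError as e:
    print(e)  # same "Last value of index pointer..." message as in the traceback
```

So the question is how the sliced GPU matrix ends up with an indptr whose last entry no longer matches the size of its data/indices arrays.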

@cjnolet
Member

cjnolet commented Jul 13, 2022

@linmuchuiyang thanks for opening an issue on this. Is there any chance that dataset is publicly available? If not, have you been able to reproduce this behavior on a dataset that is publicly available, or with generated data?

This behavior does seem pretty weird, but it's not the first time I've seen strange behavior like this. Does it also fail if you select a few entries closer to the beginning of the matrix? Something like sparse_gpu_array[:10]?

However, if I set the interval with [1000:1360000], it will not feedback with any error.

When it doesn’t feedback with any error, are you saying it succeeds or that it seems to fail silently without any error?

In the meantime, I’ll try loading up a couple of the datasets we use in the examples and see if I can reproduce the behavior.

@linmuchuiyang
Author

@cjnolet, I've already passed the original data file to an NVIDIA SA. He said he had sent you an email with the data. Has there been any feedback so far?

@zacharylau10

I encountered this error too, when the matrix has more than one million cells and ~20k genes. I think the error is produced by vstack when chunks are merged: negative indptr values appear in the CSR/CSC matrix, so the matrix may be exceeding the limit of the CSR/CSC format (fewer than 2^31 nonzero elements with 32-bit indices?). But I don't have any idea how to fix it.
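The negative-indptr hypothesis is easy to illustrate: if the cumulative nonzero count is stored in 32-bit integers, any value past 2^31 - 1 wraps negative, and a negative last indptr entry would then fail SciPy's `indptr[-1] > len(indices)` comparison in unexpected ways. A sketch of the wraparound with synthetic counts (assuming 32-bit index arrays, which is a hypothesis here, not confirmed from the original data):

```python
import numpy as np

# Per-chunk nonzero counts whose running total crosses 2**31 - 1.
counts = np.array([2**31 - 2, 3], dtype=np.int64)

indptr64 = np.cumsum(counts)          # correct in 64-bit: [2147483646, 2147483649]
indptr32 = indptr64.astype(np.int32)  # second entry wraps negative
print(indptr32)                       # [ 2147483646 -2147483647]
```

If this is what happens during the chunked vstack, casting the index arrays to int64 before concatenation would be one way to sidestep the overflow.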
