Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert Seurat object to h5ad file #59

Open
uqzqiao opened this issue May 8, 2023 · 10 comments
Open

Convert Seurat object to h5ad file #59

uqzqiao opened this issue May 8, 2023 · 10 comments

Comments

@uqzqiao
Copy link

uqzqiao commented May 8, 2023

Hi!

Our scRNA-seq data was processed using the Seurat package. In order to run scDRS with our scRNA-seq data, I first converted the Seurat object saved as an RDS file to an h5ad file using the following script,


seurat_counts <- seurat_obj[["RNA"]]@counts
  seurat_metadata <- seurat_obj@meta.data
  rownames(seurat_metadata) <- colnames(seurat_counts)

  adata <- anndata$AnnData(
    X = as.matrix(t(as.matrix(seurat_counts))),
    obs = as.data.frame(seurat_metadata)
  )
 return(adata)
}

adata <- seurat_to_anndata(df)
write_h5ad(adata, "raw.h5ad")

However, when I tried to use this h5ad file for the compute-score step, I encountered the following error,

Traceback (most recent call last):
  File "$HOME/miniconda3/envs/scRNA-seq/bin/scdrs", line 740, in <module>
    fire.Fire()
  File "$HOME/miniconda3/envs/scRNA-seq/lib/python3.11/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "$HOME/miniconda3/envs/scRNA-seq/lib/python3.11/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "$HOME/miniconda3/envs/scRNA-seq/lib/python3.11/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "$HOME/miniconda3/envs/scRNA-seq/bin/scdrs", line 211, in compute_score
    scdrs.preprocess(
  File "~/softwares/scDRS/scdrs/pp.py", line 200, in preprocess
    v_resid = reg_out(np.ones(n_cell), df_cov.values)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/softwares/scDRS/scdrs/pp.py", line 453, in reg_out
    mat_coef = np.linalg.solve(mat_xtx, mat_xty)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<__array_function__ internals>", line 200, in solve
  File "$HOME/miniconda3/envs/scRNA-seq/lib/python3.11/site-packages/numpy/linalg/linalg.py", line 386, in solve
    r = gufunc(a, b, signature=signature, extobj=extobj)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
numpy.core._exceptions._UFuncInputCastingError: Cannot cast ufunc 'solve' input 0 from dtype('O') to dtype('float64') with casting rule 'same_kind'

I'm more experienced with R than Python, so I was hoping you could help me suggest a better way to convert the Seurat object to an h5ad file. Thanks so much!

@uqzqiao
Copy link
Author

uqzqiao commented May 9, 2023

Update:

I've attempted to transform the Seurat object to an h5ad file using another R package, but unfortunately encountered the same error. This leads me to believe that the problem may be due to the existence of missing values (NA) in the expression data. However, after checking the data, I found that there are no NAs. Can you please provide some suggestions on other potential issues that could be causing this problem? Thank you very much!

# Load packages
library(Seurat)
library(SeuratData)
library(SeuratDisk)

# df is the seurat object
test=UpdateSeuratObject(df)
SaveH5Seurat(test, filename="raw.h5Seurat")
Convert("raw.h5Seurat", dest="h5ad")

@martinjzhang
Copy link
Owner

Hi @uqzqiao

Thank you for reporting the issue. It seems your .h5ad file is fine and the error is with the preprocessing step within scdrs. The problem might be that your --cov file has a different data type, hence raising the Cannot cast ufunc 'solve' error.

I couldn't locate the exact issue. Being more careful with input data types on your side or forcing the data to have the same type on the software side may solve the problem.

If you can provide us with a small set of input files that can reproduce this error, I can look into the issue and get back to you in a few days.

Best,
Martin

@uqzqiao
Copy link
Author

uqzqiao commented May 10, 2023

Hi @martinjzhang,

Thank you so much for your response! I really appreciate your help! I noticed that the error I encountered was due to NAs in the covariate file, which was an oversight in my preparation process. I am sorry for any inconvenience this may have caused.

I do have a couple more questions regarding the software (I apologize for posting my additional questions in this thread, as they may be better suited for a separate discussion),

  • Firstly, I came across a closed issue where you mentioned that you removed the MHC region (CHR6 25-34MB) and restricted the summary statistics to the HAPMAP3 SNPs for the manuscript. I am wondering about the impact of this step on the results, since the disease I am studying may have disease-associated genes in the HLA region. Additionally, I would like to know if restricting to HapMap3 SNPs is for computational benefits or if it affects the accuracy of generating disease-specific gene sets.

  • Secondly, our preliminary analysis suggests that adjusting for pool can better account for batch effect. Thus, I plan to include pool as a categorical variable in the covariate file. However, we have over 70 pools from 2 batches. Would you recommend adjusting for pool or batch in this case?

Thank you again for your time and help!

Best regards,
Zhen

@uqzqiao
Copy link
Author

uqzqiao commented May 10, 2023

Hi Martin,

I apologize for asking so many questions, but I'm looking for some advice on how to proceed if no cells are significant at certain FDR thresholds (0.1 or 0.2). I'm aware that FDR correction may not effectively retain true positives if there are a large number of tests needed to be corrected for (e.g., I got 1.26m cells). In this case, would it be appropriate to compute a score for each cell type to identify any cells associated with disease?
Additionally, if no cells are significant at certain FDR thresholds, would it still be worthwhile to proceed with downstream analysis? Thank you again for your help!

Best wishes,
Zhen

@martinjzhang
Copy link
Owner

Hi @uqzqiao

Great that you can successfully run the software. Please see my suggestions below.

Firstly, I came across a closed issue where you mentioned that you removed the MHC region (CHR6 25-34MB) and restricted the summary statistics to the HAPMAP3 SNPs for the manuscript. I am wondering about the impact of this step on the results, since the disease I am studying may have disease-associated genes in the HLA region. Additionally, I would like to know if restricting to HapMap3 SNPs is for computational benefits or if it affects the accuracy of generating disease-specific gene sets.

The MHC region has a complex LD structure. That's why we excluded this region. scDRS is compatible with any disease gene set. So if you can curate a set of disease genes using other software, that would work too.

If you focus on signals of common SNPs (e.g., >5%), HapMap3 SNPs should capture most of the GWAS information. I am not sure about rare SNPs. It may be helpful to include other SNPs if you care about rare variant signals.

Secondly, our preliminary analysis suggests that adjusting for pool can better account for batch effect. Thus, I plan to include pool as a categorical variable in the covariate file. However, we have over 70 pools from 2 batches. Would you recommend adjusting for pool or batch in this case?

You can encode these covariates as dummy variables.

I apologize for asking so many questions, but I'm looking for some advice on how to proceed if no cells are significant at certain FDR thresholds (0.1 or 0.2). I'm aware that FDR correction may not effectively retain true positives if there are a large number of tests needed to be corrected for (e.g., I got 1.26m cells). In this case, would it be appropriate to compute a score for each cell type to identify any cells associated with disease? Additionally, if no cells are significant at certain FDR thresholds, would it still be worthwhile to proceed with downstream analysis? Thank you again for your help!

Even if no cells are significant individually, they may yield significant results in the group-level analysis. I suggest creating a UMAP visualization with color representing the normalized scDRS disease score. If you can notice any pattern, such as certain groups of cells having much greater scDRS disease scores, then it indicates that scDRS is likely capturing some disease signals.

Please feel free to let us know if you have more questions.

Best,
Martin

@uqzqiao
Copy link
Author

uqzqiao commented May 12, 2023

Hi Martin,

Thanks so much for your reply! It has been incredibly helpful!! I've got some interesting results and I'm currently working on understanding and interpreting them. I have a couple more questions:

Could you please explain the methods used to correct for multiple testing (regarding the individual cell disease relevance scores)? If it is BH, would it be appropriate for me to recalculate the number of significant cells with alternative methods, using the scDRS p-values (pval) from the score file?

In the downstream analyses (--corr-analysis) at the individual cell level, would it still be beneficial to perform this analysis if the number of significant cells is limited? Additionally, if I only have individual cell-level variables for a subset of all cells (with the rest being NA), will it still be possible to calculate the correlation parameter (na.rm=T)?

I am particularly interested in the prioritized disease-relevant genes obtained from the --gene-analysis. However, I only have data for one specific trait/disease, and it is challenging to curate a reference gene set. Do you have any suggestions on how to interpret the prioritized genes in this context?

Thank you very much for your help!

Best regards,
Zhen

@martinjzhang
Copy link
Owner

Could you please explain the methods used to correct for multiple testing (regarding the individual cell disease relevance scores)? If it is BH, would it be appropriate for me to recalculate the number of significant cells with alternative methods, using the scDRS p-values (pval) from the score file?

Yes, we used BH. The pval in the score files are raw p-values. You can apply any multiple testing methods on them.

In the downstream analyses (--corr-analysis) at the individual cell level, would it still be beneficial to perform this analysis if the number of significant cells is limited?

Yes. Similar to the group-level analysis, the correlation analysis may be more powerful than the individual cell-level association analysis by pooling information across cells.

Additionally, if I only have individual cell-level variables for a subset of all cells (with the rest being NA), will it still be possible to calculate the correlation parameter (na.rm=T)?

scDRS may give you NA results because it computes the correlation using all overlapping cells between adata and the score file. One way to get around this is to create smaller score files restricted to only cells with the cell-level variable values.

I am particularly interested in the prioritized disease-relevant genes obtained from the --gene-analysis. However, I only have data for one specific trait/disease, and it is challenging to curate a reference gene set. Do you have any suggestions on how to interpret the prioritized genes in this context?

Have you looked at the drug targets in https://www.opentargets.org/

@jjzixue
Copy link

jjzixue commented Aug 29, 2023

Hi,@martinjzhang
I encountered difficulties while converting Seurat to the h5ad file format. Due to my limited familiarity with Python, I couldn't resolve the issue on my own. I would greatly appreciate your assistance.

example data from zeisel_2015/expr.h5ad for the analysis.
After splitting and reorganizing the expr.h5ad file into merged_data.h5ad using R and Python code, the subsequent scRDS analysis with the split file is not yielding the expected results.

##################
#R code:
##################
a1 <- ReadH5AD("expr.h5ad")

umap_coords <- Embeddings(a1, "umap")
write.csv(umap_coords, "umap_coords.csv")

seurat_object <- a1
# Convert Seurat object to SingleCellExperiment object
sce <- as.SingleCellExperiment(seurat_object)

# Export UMAP and PCA coordinates
umap_coords <- Embeddings(seurat_object, "umap")
pca_coords <- Embeddings(seurat_object, "pca")
write.csv(umap_coords, "umap_coords.csv")
write.csv(pca_coords, "pca_coords.csv")

# Export metadata
metadata <- seurat_object@meta.data
write.csv(metadata, "metadata.csv")

# Export assay data (replace "assay_name" with the name of the assay you want to export)
assay_name <- "RNA"
assay_data <- seurat_object@assays$RNA@counts  # Replace "RNA" with the correct assay name

write.csv(assay_data, paste0(assay_name, "_data.csv”))

##################
#python code:
##################
import pandas as pd
import anndata
import os
os.chdir('~/scDRS')
os.getcwd() 
# Load UMAP and PCA coordinates from CSV files
umap_coords = pd.read_csv("umap_coords.csv", index_col=0)
pca_coords = pd.read_csv("pca_coords.csv", index_col=0)
# Load metadata from CSV file
metadata = pd.read_csv("metadata.csv", index_col=0)

# Load assays data from CSV file (replace "assay_data.csv" with the actual file name)
assay_data = pd.read_csv("RNA_data.csv", index_col=0)

# Create an AnnData object
adata = anndata.AnnData(X=assay_data.values.T, obs=metadata)
adata.obsm["X_umap"] = umap_coords.values
adata.obsm["X_pca"] = pca_coords.values

# Save the AnnData object as h5ad file
output_h5ad_path = "merged_data.h5ad"
adata.write(output_h5ad_path)

print("AnnData object saved as h5ad format.”)

##################
#merged_data.h5ad
##################
scdrs compute-score \
  --h5ad-file ~/scDRS/data/merged_data.h5ad \
  --h5ad-species mouse \
  --gs-file ~/scDRS/data/processed_geneset.gs \
  --gs-species mouse \
  --cov-file ~/scDRS/data/cov.tsv \
  --flag-filter-data True \
  --flag-raw-count True \
  --flag-return-ctrl-raw-score False \
  --flag-return-ctrl-norm-score True \
  --out-folder ~/scDRS/data/
******************************************************************************
* Single-cell disease relevance score (scDRS)
* Version 1.0.3
* Martin Jinye Zhang and Kangcheng Hou
* HSPH / Broad Institute / UCLA
* MIT License
******************************************************************************
Call: scdrs compute-score \
--h5ad-file ~/scDRS/data/merged_data.h5ad \
--h5ad-species mouse \
--cov-file ~/scDRS/data/cov.tsv \
--gs-file ~/scDRS/data/processed_geneset.gs \
--gs-species mouse \
--ctrl-match-opt mean_var \
--weight-opt vs \
--adj-prop None \
--flag-filter-data True \
--flag-raw-count True \
--n-ctrl 1000 \
--min-genes 250 \
--min-cells 50 \
--flag-return-ctrl-raw-score False \
--flag-return-ctrl-norm-score True \
--out-folder ~/scDRS/data/

Loading data:
--h5ad-file loaded: n_cell=3005, n_gene=13572 (sys_time=1.7s)
First 3 cells: ['1772071015_C02', '1772071017_G12', '1772071017_A05']
First 5 genes: ['0', '1', '2', '3', '4']     ########################## different to expr.h5ad
--cov-file loaded: covariates=['const', 'n_genes'] (sys_time=1.7s)
n_cell=3005 (3005 in .h5ad)
First 3 cells: ['1772071015_C02', '1772071017_G12', '1772071017_A05']
First 5 values for 'const': [1, 1, 1, 1, 1]
First 5 values for 'n_genes': [4848, 4685, 6028, 5824, 4701]
--gs-file loaded: n_trait=3 (sys_time=1.7s)
Print info for first 3 traits:
First 3 elements for 'SCZ': [], []    ########################## different to expr.h5ad
First 3 elements for 'Dorsal': [], [] ########################## different to expr.h5ad
First 3 elements for 'Height': [], [] ########################## different to expr.h5ad

Preprocessing:

Computing scDRS score:
trait=SCZ: skipped due to small size (n_gene=0, sys_time=3.8s)
trait=Dorsal: skipped due to small size (n_gene=0, sys_time=3.8s)
trait=Height: skipped due to small size (n_gene=0, sys_time=3.8s)


However, during the scRDS analysis, only the expr.h5ad file is able to complete successfully.

##################
#expr.h5ad
##################
scdrs compute-score \
  --h5ad-file ~/scDRS/data/expr.h5ad \       
  --h5ad-species mouse \
  --gs-file ~/scDRS/data/processed_geneset.gs \
  --gs-species mouse \
  --cov-file ~/scDRS/data/cov.tsv \
  --flag-filter-data True \
  --flag-raw-count True \
  --flag-return-ctrl-raw-score False \
  --flag-return-ctrl-norm-score True \
  --out-folder ~/scDRS/data/

******************************************************************************
* Single-cell disease relevance score (scDRS)
* Version 1.0.3
* Martin Jinye Zhang and Kangcheng Hou
* HSPH / Broad Institute / UCLA
* MIT License
******************************************************************************
Call: scdrs compute-score \
--h5ad-file ~/scDRS/data/expr.h5ad \
--h5ad-species mouse \
--cov-file ~/scDRS/data/cov.tsv \
--gs-file ~/scDRS/data/processed_geneset.gs \
--gs-species mouse \
--ctrl-match-opt mean_var \
--weight-opt vs \
--adj-prop None \
--flag-filter-data True \
--flag-raw-count True \
--n-ctrl 1000 \
--min-genes 250 \
--min-cells 50 \
--flag-return-ctrl-raw-score False \
--flag-return-ctrl-norm-score True \
--out-folder ~/scDRS/data/

Loading data:
--h5ad-file loaded: n_cell=3005, n_gene=13572 (sys_time=1.9s)
First 3 cells: ['1772071015_C02', '1772071017_G12', '1772071017_A05']
First 5 genes: ['Tspan12', 'Tshz1', 'Fnbp1l', 'Adamts15', 'Cldn12']    ########################## run  normal
--cov-file loaded: covariates=['const', 'n_genes'] (sys_time=1.9s)
n_cell=3005 (3005 in .h5ad)
First 3 cells: ['1772071015_C02', '1772071017_G12', '1772071017_A05']
First 5 values for 'const': [1, 1, 1, 1, 1]
First 5 values for 'n_genes': [4848, 4685, 6028, 5824, 4701]
--gs-file loaded: n_trait=3 (sys_time=1.9s)
Print info for first 3 traits:
########################## run normal
First 3 elements for 'SCZ': ['Dpyd', 'Cacna1i', 'Rbfox1'], [7.6519, 7.4766, 7.3247]  
First 3 elements for 'Dorsal': ['Ckb', 'Fndc5', 'Lin7b'], [1.0, 1.0, 1.0]
First 3 elements for 'Height': ['Wwox', 'Gmds', 'Lpp'], [10.0, 10.0, 9.9916]

Preprocessing:

Computing scDRS score:
Computing control scores: 100%|█████████████████████████████████████| 1000/1000 [00:47<00:00, 20.95it/s]
Trait=SCZ, n_gene=756: 178/3005 FDR<0.1 cells, 320/3005 FDR<0.2 cells (sys_time=83.9s)
Computing control scores: 100%|█████████████████████████████████████| 1000/1000 [00:17<00:00, 58.77it/s]
Trait=Dorsal, n_gene=155: 549/3005 FDR<0.1 cells, 641/3005 FDR<0.2 cells (sys_time=123.1s)
Computing control scores: 100%|█████████████████████████████████████| 1000/1000 [00:37<00:00, 26.75it/s]
Trait=Height, n_gene=671: 0/3005 FDR<0.1 cells, 6/3005 FDR<0.2 cells (sys_time=184.7s)

@martinjzhang
Copy link
Owner

Hi @jjzixue ,

In your new data file, gene names become numbers. As a result, scDRS can not match genes in the .gs file with genes in the .h5ad file.

First 5 genes: ['0', '1', '2', '3', '4']     ########################## different to expr.h5ad

Let me know if you need further help.

@jjzixue
Copy link

jjzixue commented Aug 30, 2023

Hi @martinjzhang ,
Thank you very much for your prompt response. After reading the files using R, I discovered that the gene names are missing. I'm unsure which part of the code during merging might be causing this issue. Could you help me review and identify any potential bugs in the code?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants