
Processing vizgen data #7080

Open
nzhang89 opened this issue Mar 29, 2023 · 62 comments

@nzhang89

Thank you for your continuous and dedicated efforts in incorporating more and more useful features into Seurat. We recently received our first batch of Vizgen data, and I would like to process and visualize it using Seurat.

However, the folder structure and files we received are a little different from the public Vizgen data. We do not have a cell_boundaries folder with a bunch of h5 files; instead, we have a single cell_boundaries.parquet file. In this situation, can I still use the cell boundary/segmentation information?

Thank you for your help.

@AustinHartman
Contributor

AustinHartman commented Mar 29, 2023

@nzhang89 thanks for posting an issue about this - I heard Vizgen was planning to modify the output segmentation format, but I have not yet seen what these files look like. Any chance you could share some data with me in this format? My email is ahartman@nygenome.org. I can then go ahead and update the LoadVizgen function to support cell boundaries in parquet format.

In the meantime, the steps for creating a Segmentation object, which stores the boundary information in a Seurat object, are relatively straightforward. You'll want to construct a data.frame with x, y, and cell columns. Each row represents a vertex in the segmentation for a given cell, so you'll have many rows per cell. You can then run segs <- CreateSegmentation(df) to generate a Segmentation object. Please let me know if you have any questions!
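In code, a rough sketch of those steps (assuming the boundaries come as a parquet file with EntityID, ZIndex and an sf Geometry column, as shown later in this thread; the file and column names are assumptions):

# Rough sketch, not the eventual LoadVizgen() implementation.
library(sf)
library(dplyr)
library(Seurat)

bounds <- sfarrow::st_read_parquet("cell_boundaries.parquet") %>%
  filter(ZIndex == 3L) # keep a single z-plane

# one row per polygon vertex, with x, y and the cell it belongs to
df <- do.call(rbind, lapply(seq_len(nrow(bounds)), function(i) {
  xy <- st_coordinates(bounds$Geometry[[i]]) # matrix with "X"/"Y" columns
  data.frame(x = xy[, "X"], y = xy[, "Y"],
             cell = as.character(bounds$EntityID[i]))
}))

segs <- CreateSegmentation(df)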

@andrewjkwok

Thanks for addressing this - having exactly the same issue here.

Looking at Vizgen's latest user guide:
https://vizgen.com/resources/merscope-instrument/

it seems that with MERSCOPE instrument software v232 or later, they've moved to parquet files for the cell boundaries instead of the hdf5 format.

@AustinHartman
Contributor

Hey @andrewjkwok, thank you for your patience and pointing me to the user guide. I've reached out to Vizgen for additional information, and I'll update you here once I have a solution

@alikhuseynov

Vizgen now stores the Cellpose segmentation output as .parquet, so I had to modify ReadVizgen() and LoadVizgen().
See this script and this example for making a Seurat object from Vizgen data.
@AustinHartman I should probably submit a PR at some point, unless you are already implementing something similar?

Additionally, there were some issues using subset() (#6683 and #6767).
As an intermediate solution, I implemented this script.
Hope this helps.
Best

@AustinHartman
Contributor

@alikhuseynov thank you so much! A PR to handle parquet files would be much appreciated

@alikhuseynov

@AustinHartman sure, I like the Seurat package and it's always nice to contribute. I will try to work on that.

@alikhuseynov

PR submitted #7190

@andrewjkwok

Thanks @alikhuseynov for the update. I installed the branch that you updated, but unfortunately encounter an error saying:

Error: file must be a "InputStream"

Do you have any idea what the issue might be? When I look at the traceback, this is what I get:

8: stop(assertError(attr(res, "msg")))
7: assert_that(inherits(object, class), msg = msg)
6: assert_is(file, "InputStream")
5: make_readable_file(file, mmap)
4: arrow::ParquetFileReader$create(dsn, ...)
3: sfarrow::st_read_parquet(file2read)
2: ReadVizgen(data.dir = data.dir, ...)
1: LoadVizgen(data.dir = "/mnt/z/Users/Andrew_K/Projects/spatial_transcriptome/Data/Test_MERFISH_run_16032023/202303171059_20230317_VMSC04301_firstcopy/region_0/")

@alikhuseynov

What do the contents of these directories look like?

  • your ./region_0/ directory
  • directory containing Cellpose output, ie .parquet files

Could you post your LoadVizgen() run/script?
thanks

@andrewjkwok

My ./region_0/ directory:

202303171059_20230317_VMSC04301_firstcopy_region_0.vzg
cell_boundaries.parquet
cell_by_gene.csv
cell_metadata.csv
cell_metadata_truncated.csv
detected_transcripts.csv
images
summary_20230317.png

I'm not sure about the directory containing the Cellpose output - the .parquet file seems to be in the ./region_0/ directory?

My script currently involves nothing other than loading the libraries and running the command!

options(Seurat.object.assay.version = 'v5')

library(Seurat)
library(future)
library(ggplot2)
library(magrittr)
plan("multisession", workers = 10)

vizgen.obj <- LoadVizgen(data.dir = "/mnt/z/Users/Andrew_K/Projects/spatial_transcriptome/Data/Test_MERFISH_run_16032023/202303171059_20230317_VMSC04301_firstcopy/region_0/")

@alikhuseynov

in our case the Vizgen directory content looks like this:

region_0/
├── xxx_region_0.vzg
├── Cellpose_polyT_CB3
├── detected_transcripts.csv
├── images
└── summary.png

region_0/Cellpose_polyT_CB3/
├── xxx_region_0_cellpose.vzg
├── cellpose_cell_by_gene.csv
├── cellpose_cell_metadata.csv
├── cellpose_micron_space.parquet
└── cellpose_mosaic_space.parquet

I would assume that your cell_boundaries.parquet is the same as cellpose_micron_space.parquet ?

Since I don't have access to the test data, could you run the following and see if the content of your cell_boundaries.parquet is more or less similar?

file2read <- "./region_0/cell_boundaries.parquet"
test_parq <- sfarrow::st_read_parquet(file2read)
# in my case the output looks like that:
test_parq %>% str
Classes ‘sf’ and 'data.frame':	559790 obs. of  10 variables:
 $ ID                : int  0 312 104 156 260 208 52 105 209 53 ...
 $ EntityID          :integer64 112833000000100001 112833000000100001 112833000000100001 112833000000100001 112833000000100001 112833000000100001 112833000000100001 112833000000100002 ... 
 $ ZIndex            : int  3 6 4 1 0 5 2 4 5 2 ...
 $ Type              : chr  "cell" "cell" "cell" "cell" ...
 $ ZLevel            : num  6 10.5 7.5 3 1.5 9 4.5 7.5 9 4.5 ...
 $ Name              : 'vctrs_unspecified' logi  NA NA NA NA NA NA ...
 $ ParentID          : 'vctrs_unspecified' logi  NA NA NA NA NA NA ...
 $ ParentType        : 'vctrs_unspecified' logi  NA NA NA NA NA NA ...
 $ X__index_level_0__: int  0 348 116 174 290 232 58 117 233 59 ...
 $ Geometry          :sfc_MULTIPOLYGON of length 559790; first list element: List of 1
  ..$ :List of 1
  .. ..$ : num [1:12, 1:2] 6805 6807 6808 6809 6809 ...
  ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
 - attr(*, "sf_column")= chr "Geometry"
 - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA
  ..- attr(*, "names")= chr [1:9] "ID" "EntityID" "ZIndex" "Type" ...
# then filter for mid Z plane
test_parq %>% filter(ZIndex == 3) %>% pull(Geometry) %>% str
sfc_MULTIPOLYGON of length 79970; first list element: List of 1
 $ :List of 1
  ..$ : num [1:12, 1:2] 6805 6807 6808 6809 6809 ...
 - attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"

@andrewjkwok

Huh this is strange. Sorry for the poor formatting earlier. This is the directory structure:

region_0/
├── 202303171059_KoLab20230317_VMSC04301_region_0.vzg
├── cell_boundaries.parquet
├── cell_by_gene.csv
├── cell_metadata.csv
├── cell_metadata_truncated.csv
├── detected_transcripts.csv
├── images
│   ├── manifest.json
│   ├── micron_to_mosaic_pixel_transform.csv
│   ├── mosaic_DAPI_z0.tif
│   ├── mosaic_DAPI_z1.tif
│   ├── mosaic_DAPI_z2.tif
│   ├── mosaic_DAPI_z3.tif
│   ├── mosaic_DAPI_z4.tif
│   ├── mosaic_DAPI_z5.tif
│   ├── mosaic_DAPI_z6.tif
│   ├── mosaic_PolyT_z0.tif
│   ├── mosaic_PolyT_z1.tif
│   ├── mosaic_PolyT_z2.tif
│   ├── mosaic_PolyT_z3.tif
│   ├── mosaic_PolyT_z4.tif
│   ├── mosaic_PolyT_z5.tif
│   └── mosaic_PolyT_z6.tif
└── summary_20230317.png

1 directory, 23 files

So it seems our directory structures are a bit different, as the parquet file is directly in my ./region_0/ directory?

The content of the parquet file seems to be more or less similar:

> test_parq %>% str
Classes ‘sf’ and 'data.frame':	1018750 obs. of  10 variables:
 $ ID                : int  0 1 2 3 4 5 6 7 8 9 ...
 $ EntityID          :integer64 1771792400012100003 1771792400012100004 1771792400012100006 1771792400012100008 1771792400012100009 1771792400012100015 1771792400012100016 1771792400012100018 ... 
 $ ZIndex            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Type              : chr  "cell" "cell" "cell" "cell" ...
 $ ZLevel            : num  1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 ...
 $ Name              : 'vctrs_unspecified' logi  NA NA NA NA NA NA ...
 $ ParentID          : 'vctrs_unspecified' logi  NA NA NA NA NA NA ...
 $ ParentType        : 'vctrs_unspecified' logi  NA NA NA NA NA NA ...
 $ X__index_level_0__: int  0 1 2 3 4 5 6 7 8 9 ...
 $ Geometry          :sfc_MULTIPOLYGON of length 1018750; first list element: List of 1
  ..$ :List of 1
  .. ..$ : num [1:98, 1:2] 1501 1501 1502 1506 1507 ...
  ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
 - attr(*, "sf_column")= chr "Geometry"
 - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA
  ..- attr(*, "names")= chr [1:9] "ID" "EntityID" "ZIndex" "Type" ...
> test_parq %>% filter(ZIndex == 3) %>% pull(Geometry) %>% str
sfc_MULTIPOLYGON of length 152345; first list element: List of 2
 $ :List of 1
  ..$ : num [1:5, 1:2] 1539 1539 1539 1540 1539 ...
 $ :List of 1
  ..$ : num [1:89, 1:2] 1501 1502 1503 1506 1509 ...
 - attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"

@alikhuseynov

No worries, sounds good. Vizgen has made updates to the MERSCOPE output, as far as we know.
Also, note that Vizgen uses only one z-plane, the middle plane of the tissue, for Cellpose segmentation. So the cell boundaries in all $ZIndex are the same.

I will add support for reading any single .parquet file in the given directory. For now, if you want, just rename the cell_boundaries.parquet to cellpose_micron_space.parquet.
Then run this chunk to load data using BiocParallel instead of future

### make sure all libs are installed a priori!
## load libs ----
suppressPackageStartupMessages({
  library(ggplot2)
  library(Seurat)
  library(dplyr)
  library(magrittr)
  library(BiocParallel)
})

# set params
mol.type <- "microns"
coord.space <- "micron"
z.stack <- 3L # z stack to use
dir_use <- "./region_0/"

# load object
obj.vz <-
  LoadVizgen(data.dir = dir_use,
             fov = "test.sample", 
             assay = "Vizgen",
             use.cellpose.out = TRUE, # if TRUE and ./Cellpose dir exists
             metadata = c("volume", "fov"), # add cell volume info
             mol.type = mol.type, # molecule coords in µm space
             coord.space = coord.space, 
             type = c("segmentations", "centroids", "boxes"), # type of cell spatial coord matrices
             z = z.stack,
             add.zIndex = TRUE, # add z slice section to the object
             update.object = TRUE,
             use.BiocParallel = TRUE, # using `BiocParallel` instead of `future`
             workers.total = 10, # cores for `BiocParallel` processing
             DTthreads.pct = NULL # percentage of total threads to use for `data.table::fread`
  )

@alikhuseynov

Just added the support in #7190, seems to work.
Just re-install: remotes::install_github(repo = 'alikhuseynov/seurat', ref = 'feat/vizgen')

@andrewjkwok

Thanks for such quick responses.

I've reinstalled as you've instructed. The issue coming up now is:


Error in is.empty(.) : could not find function "is.empty"
> traceback()
3: files2scan %>% grep(coord.space, ., value = TRUE) %>% {
       if (is.empty(.)) {
           files2scan %>% grep(".parquet$", ., value = TRUE)
       }
       else {
           (.)
       }
   }
2: ReadVizgen(data.dir = data.dir, ...)
1: LoadVizgen(data.dir = dir_use, fov = "test.sample", assay = "Vizgen", 
       use.cellpose.out = TRUE, metadata = c("volume", "fov"), mol.type = mol.type, 
       coord.space = coord.space, type = c("segmentations", "centroids", 
           "boxes"), z = z.stack, add.zIndex = TRUE, update.object = TRUE, 
       use.BiocParallel = TRUE, workers.total = 10, DTthreads.pct = NULL)
The current directory structure is:

region_0/
├── 202303171059_KoLab20230317_VMSC04301_region_0.vzg
├── cell_by_gene.csv
├── cell_metadata.csv
├── cell_metadata_truncated.csv
├── cellpose_micron_space.parquet
├── detected_transcripts.csv
├── images
│   ├── manifest.json
│   ├── micron_to_mosaic_pixel_transform.csv
│   ├── mosaic_DAPI_z0.tif
│   ├── mosaic_DAPI_z1.tif
│   ├── mosaic_DAPI_z2.tif
│   ├── mosaic_DAPI_z3.tif
│   ├── mosaic_DAPI_z4.tif
│   ├── mosaic_DAPI_z5.tif
│   ├── mosaic_DAPI_z6.tif
│   ├── mosaic_PolyT_z0.tif
│   ├── mosaic_PolyT_z1.tif
│   ├── mosaic_PolyT_z2.tif
│   ├── mosaic_PolyT_z3.tif
│   ├── mosaic_PolyT_z4.tif
│   ├── mosaic_PolyT_z5.tif
│   └── mosaic_PolyT_z6.tif
└── summary_20230317.png

1 directory, 23 files

i.e. I've renamed my parquet file as you suggested?

@alikhuseynov

Ok, given the last fix, you don't need to rename the .parquet file; it can be either cell_boundaries.parquet or cellpose_micron_space.parquet.
is.empty() is a useful function from the spatstat.geom package; in my case (spatstat.geom_3.0-3) it is loaded but not attached (see session info below) once I load the Seurat package. Try installing spatstat.geom first, then clean your session, load the libraries, and run the script. Then it should work.
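A quick way to check that the dependency is in place (a sketch):

# make sure spatstat.geom is installed so that is.empty() can be found
if (!requireNamespace("spatstat.geom", quietly = TRUE)) {
  install.packages("spatstat.geom")
}
library(spatstat.geom)
is.empty(character(0)) # should return TRUE for a zero-length vector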

below is my session info, just in case.

sessionInfo()
​
R version 4.2.1 (2022-06-23)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /media/../yyy/lib/libopenblasp-r0.3.21.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] SeuratObject_4.1.3 Seurat_4.3.0.9002 

loaded via a namespace (and not attached):
  [1] nlme_3.1-159           spatstat.sparse_3.0-0  matrixStats_0.62.0    
  [4] RcppAnnoy_0.0.19       RColorBrewer_1.1-3     httr_1.4.4            
  [7] repr_1.1.4             sctransform_0.3.5      tools_4.2.1           
 [10] utf8_1.2.2             R6_2.5.1               irlba_2.3.5           
 [13] KernSmooth_2.23-20     uwot_0.1.14            lazyeval_0.2.2        
 [16] colorspace_2.0-3       sp_1.5-0               gridExtra_2.3         
 [19] tidyselect_1.2.0       compiler_4.2.1         progressr_0.11.0      
 [22] cli_3.4.1              spatstat.explore_3.0-5 plotly_4.10.0         
 [25] scales_1.2.1           spatstat.data_3.0-0    lmtest_0.9-40         
 [28] ggridges_0.5.4         pbapply_1.7-0          goftest_1.2-3         
 [31] stringr_1.4.1          pbdZMQ_0.3-7           digest_0.6.29         
 [34] spatstat.utils_3.0-1   base64enc_0.1-3        pkgconfig_2.0.3       
 [37] htmltools_0.5.3        parallelly_1.32.1      fastmap_1.1.0         
 [40] htmlwidgets_1.5.4      rlang_1.1.1            shiny_1.7.2           
 [43] generics_0.1.3         zoo_1.8-10             jsonlite_1.8.0        
 [46] spatstat.random_3.0-1  ica_1.0-3              dplyr_1.1.2           
 [49] magrittr_2.0.3         patchwork_1.1.2        Matrix_1.5-1          
 [52] Rcpp_1.0.9             IRkernel_1.3           munsell_0.5.0         
 [55] fansi_1.0.3            abind_1.4-5            reticulate_1.26       
 [58] lifecycle_1.0.3        stringi_1.7.8          MASS_7.3-57           
 [61] Rtsne_0.16             plyr_1.8.7             grid_4.2.1            
 [64] parallel_4.2.1         listenv_0.8.0          promises_1.2.0.1      
 [67] ggrepel_0.9.2          crayon_1.5.2           deldir_1.0-6          
 [70] miniUI_0.1.1.1         lattice_0.20-45        IRdisplay_1.1         
 [73] cowplot_1.1.1          splines_4.2.1          tensor_1.5            
 [76] pillar_1.9.0           igraph_1.3.5           uuid_1.1-0            
 [79] spatstat.geom_3.0-3    future.apply_1.9.0     reshape2_1.4.4        
 [82] codetools_0.2-18       leiden_0.4.2           glue_1.6.2            
 [85] evaluate_0.16          data.table_1.14.6      png_0.1-7             
 [88] vctrs_0.6.2            httpuv_1.6.5           polyclip_1.10-0       
 [91] gtable_0.3.1           RANN_2.6.1             purrr_0.3.4           
 [94] tidyr_1.2.0            scattermore_0.8        future_1.28.0         
 [97] ggplot2_3.4.0          mime_0.12              xtable_1.8-4          
[100] later_1.3.0            survival_3.4-0         viridisLite_0.4.1     
[103] tibble_3.2.1           cluster_2.1.4          globals_0.16.1        
[106] fitdistrplus_1.1-8     ellipsis_0.3.2         ROCR_1.0-11 

@andrewjkwok

Very sorry but it still throws an error, this time like this...

Error in segs[[i]][[1]] : subscript out of bounds
> traceback()
5: segs[[i]][[1]] %>% length
4: FUN(X[[i]], ...)
3: lapply(segs %>% seq, function(i) segs[[i]][[1]] %>% length)
2: ReadVizgen(data.dir = data.dir, ...)
1: LoadVizgen(data.dir = dir_use, fov = "test.sample", assay = "Vizgen", 
       use.cellpose.out = TRUE, metadata = c("volume", "fov"), mol.type = mol.type, 
       coord.space = coord.space, type = c("segmentations", "centroids", 
           "boxes"), z = z.stack, add.zIndex = TRUE, update.object = TRUE, 
       use.BiocParallel = TRUE, workers.total = 10, DTthreads.pct = NULL)

Very happy to share the data if that would help you troubleshoot.

@alikhuseynov

Yes, that would make it easier to debug.
If you can, upload & share via Google Drive (alikhuseyno/at/gmail.com) or any link where I can download these files:

detected_transcripts.csv
cell_by_gene.csv
cell_metadata.csv
cell_boundaries.parquet

there will be more updates on that PR #7190
Thanks

@alikhuseynov

I think the problem might come from your .parquet file, that is, it gives 2 lists in the Geometry variable, where it should be 1 list.

> test_parq %>% filter(ZIndex == 3) %>% pull(Geometry) %>% str
sfc_MULTIPOLYGON of length 152345; first list element: List of 2
 $ :List of 1
  ..$ : num [1:5, 1:2] 1539 1539 1539 1540 1539 ...
 $ :List of 1
  ..$ : num [1:89, 1:2] 1501 1502 1503 1506 1509 ...
 - attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
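You could also count how many cells have more than one polygon, e.g. (a sketch reusing the same test_parq):

# number of polygons per MULTIPOLYGON geometry
library(dplyr)
n_polys <- test_parq %>%
  filter(ZIndex == 3) %>%
  pull(Geometry) %>%
  vapply(length, integer(1))

table(n_polys)   # how many cells have 1, 2, ... polygons
sum(n_polys > 1) # cells that would need cleaning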

@andrewjkwok

Shared - please let me know if there are issues accessing the data.

@alikhuseynov

got it!
As far as I can see, there are a number of segmentation artifacts and empty polygon entries. This is a very useful test case, since I have never seen such output from Vizgen.
I have to implement a few sanity/cleaning steps before extracting the segmented cell boundaries, then it should work.
will let you know ;)
thanks

@cnk113

cnk113 commented May 29, 2023

I'm also getting the same issue as above:
Error in segs[[i]][[1]] : subscript out of bounds

@alikhuseynov

alikhuseynov commented May 30, 2023

Yes, I'm working on the fix, which will be available asap.
The problem is that the .parquet file in this case has "artifacts": a number of empty and nested lists for single-cell segmentation/polygon boundaries.

Ideally, one should have a single polygon for a single segmented cell ID:

List of 1
 $ :List of 1
  ..$ : num [1:26, 1:2] 2751 2753 2754 2757 2759 ...
 - attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"

However, some cells have these, i.e. multiple polygons for a single cell:

List of 4
 $ :List of 1
  ..$ : num [1:7, 1:2] 1513 1513 1512 1511 1512 ...
 $ :List of 1
  ..$ : num [1:5, 1:2] 1525 1525 1526 1526 1525 ...
 $ :List of 2
  ..$ : num [1:82, 1:2] 1495 1496 1497 1497 1499 ...
  ..$ : num [1:17, 1:2] 1507 1510 1512 1514 1516 ...
 $ :List of 1
  ..$ : num [1:6, 1:2] 1498 1497 1497 1498 1498 ...
 - attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"

#..or this
List of 2
 $ :List of 1
  ..$ : num [1:5, 1:2] 1539 1539 1539 1540 1539 ...
 $ :List of 1
  ..$ : num [1:89, 1:2] 1501 1502 1503 1506 1509 ...
 - attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"

It's hard to tell which of those multiple polygons is the actual segmented cell. Right now, I'm keeping the single polygon with the maximum number of points for each cell. Suggestions on which polygon to choose are welcome.
Thanks
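For illustration, a sketch of that heuristic on a single MULTIPOLYGON geometry g (each geometry is a list of polygons, each polygon a list of coordinate-matrix rings); this is just an illustration, not the exact code in the PR:

# keep only the polygon whose outer ring has the most vertices
keep_largest_polygon <- function(g) {
  if (length(g) <= 1) return(g) # already a single polygon
  n_pts <- vapply(g, function(p) nrow(p[[1]]), integer(1))
  structure(unclass(g)[which.max(n_pts)], class = class(g)) # keep the MULTIPOLYGON class
}

It could then be applied over the Geometry column (e.g. with lapply()) before the boundaries are converted to a Segmentation.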

@cnk113

cnk113 commented May 30, 2023

Yah I think max points would be fine, would love to try the fix!

Thanks,
Chang

@alikhuseynov

The fix is ready, try it out. Re-install with remotes::install_github(repo = 'alikhuseynov/seurat', ref = 'feat/vizgen').
I will be adding support according to the suggestions in PR #7190.

These packages, pkgs <- c("data.table", "arrow", "sfarrow", "tidyverse", "furrr", "BiocParallel"), should be installed; otherwise the function will ask you to install them.

The test Vizgen data is from @andrewjkwok; this part will generate the object:

## make sure all libs are installed a priori!
# load libs ----
suppressPackageStartupMessages({
  library(ggplot2)
  library(Seurat)
  library(dplyr)
  library(magrittr)
  library(BiocParallel)
  library(progressr)
  library(spatstat)  
})

# helper function to return Seurat object metadata
callmeta <- function (object = NULL) { 
  return(object@meta.data) 
  }

# set params
mol.type <- "microns"
coord.space <- "micron"
z.stack <- 3L # z stack to use
dir_use <- "./merfish_seurat_debug/" # or something like "./region_0/"

##
start.time <- Sys.time()
obj <- 
  LoadVizgen(data.dir = dir_use,
             fov = "merfish.test", 
             assay = "Vizgen",
             metadata = c("volume", "fov"), # add cell volume info
             mol.type = mol.type, # molecule coords in µm space
             coord.space = coord.space, 
             type = c("segmentations", "centroids"), # type of cell spatial coord matrices
             z = z.stack,
             add.zIndex = TRUE, # add z slice section to a cell
             update.object = TRUE,
             use.BiocParallel = TRUE,
             workers.MulticoreParam = 14, # for `BiocParallel` processing
             use.furrr = TRUE, # when working with list use furrr or purrr
             )
end.time <- Sys.time()
message("Time taken to load object = ", 
        round(end.time - start.time, digits = 2), " ", 
        attr(c(end.time - start.time), "units"))

Result: it took ~5 min with workers.MulticoreParam = 14 (Jupyter session RAM: 180 GB).

obj
obj %>% callmeta %>% str
An object of class Seurat 
140 features across 232695 samples within 1 assay 
Active assay: Vizgen (140 features, 0 variable features)
 1 spatial field of view present: merfish.test
'data.frame':	232695 obs. of  6 variables:
 $ orig.ident     : Factor w/ 1 level "SeuratProject": 1 1 1 1 1 1 1 1 1 1 ...
 $ nCount_Vizgen  : num  11 69 73 9 14 1 23 3 2 26 ...
 $ nFeature_Vizgen: int  7 17 25 6 13 1 12 3 2 6 ...
 $ z              : int  3 3 3 3 3 3 3 3 3 3 ...
 $ volume         : num  9961 10654 3499 4381 2164 ...
 $ fov            : int  217 217 217 217 217 217 217 217 217 217 ...

[image: test_merfish]

@cnk113

cnk113 commented May 31, 2023

Hmm, I'm getting
>>> using segmentations
Error in molecules %iff% list(molecules = CreateMolecules(coords = molecules, : object 'mol.type' not found
even though I set it to 'microns'.

@alikhuseynov

strange, could you please list the data.dir (eg "./region_0/"), to see what files are in there?

@cnk113

cnk113 commented May 31, 2023

202304040410_gw23-121422-thalamusap0903-verymedial_VMSC01801_region_0.vzg
cell_boundaries.parquet
cell_by_gene.csv
cell_metadata.csv
detected_transcripts.csv
images
summary.png

@alikhuseynov

ok, great!

@andrewjkwok

@alikhuseynov Thank you so much for the support. I can confirm that it's working now for me with the parquet file.

However, I am now running into a problem trying to load data in the old format - I wanted to test my data against some of Vizgen's released data (https://info.vizgen.com/mouse-brain-map), and with the current LoadVizgen function, I get the following error:

Error in sfarrow::st_read_parquet(file2read) : 
  object 'file2read' not found
> traceback()
4: make_readable_file(file, mmap)
3: arrow::ParquetFileReader$create(dsn, ...)
2: sfarrow::st_read_parquet(file2read)
1: ReadVizgen(data.dir = "/mnt/z/Users/Andrew_K/Projects/NTS_projection_transcriptome_2023/Data/BrainReceptorShowcase/Slice1/Replicate1/")

Is it possible for the read function to support backward compatibility?

@alikhuseynov

Thanks, glad it works. Your shared data was very useful for tests.
I haven't yet tested the older Vizgen data (i.e. .hdf5 cell boundaries). The new fix should be ready next week; along with some optimization, it will also handle data where no .parquet file exists.
For now, set use.cellpose.out = FALSE in LoadVizgen() and try to load.

@andrewjkwok

Great, thanks. I tried setting use.cellpose.out = FALSE as suggested - it only works if I also set use.BiocParallel = FALSE, which unfortunately means it takes quite a while to load the dataset in, but it does work!

@dfhannum

dfhannum commented Jun 5, 2023

Error in sfarrow::st_read_parquet(file2read) :
  object 'file2read' not found

I also got this error when there were multiple parquet files in the folder.

@dfhannum

dfhannum commented Jun 5, 2023

I was able to get it to work with hdf5 files by just using use.cellpose.out = FALSE. I did get the warning Warning: MulticoreParam() not supported on Windows, use SnowParam() when loading each file, but everything loaded fine in the end.

Hmm so it actually works fine now, IDK why it was failing earlier. Thanks for the quick support!

I had this same issue, and eventually realized I had mol.type = 'microns' as an environment variable when testing later on, which fixed the LoadVizgen() function.

@alikhuseynov

I'm going to push a fix later today that will solve the above-mentioned issues.
The use.cellpose.out arg will be removed completely. If no .parquet file(s) are present, then the .hdf5 files are loaded from the ./cell_boundaries directory (if present), using use.BiocParallel = TRUE by default.

Regarding "MulticoreParam() not supported on Windows, use SnowParam()": are you running things on a Windows laptop?
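In case it helps, a sketch of picking a BiocParallel backend that also works on Windows (whether the loader picks up a registered backend depends on the implementation):

# fall back to SnowParam() on Windows, where MulticoreParam() is unavailable
library(BiocParallel)
bp <- if (.Platform$OS.type == "windows") {
  SnowParam(workers = 4, progressbar = TRUE)
} else {
  MulticoreParam(workers = 4, progressbar = TRUE)
}
register(bp)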

For multiple .parquet files, you mean something like this?

cellpose_micron_space.parquet
cellpose_mosaic_space.parquet

if yes, that will be solved too.

..will do a few tests on my side before committing
Thanks

@dfhannum

dfhannum commented Jun 5, 2023

Yes, I was running it on a Windows laptop.

For the multiple .parquet files, I had a subset of the data in a subfolder of the directory for testing. This led to multiple file2read matches and an error when the file wasn't located in the directory itself (but in a folder within the directory). I only wanted to read the parquet file in the current directory.

@alikhuseynov

I see, yeah.. that was done for the situation when one has .parquet file(s) in a sub-directory of ./region_0, e.g. Cellpose_polyT_CB3:

├── xxx_region_0.vzg
├── Cellpose_polyT_CB3
├── detected_transcripts.csv
├── images
└── summary.png

I will try to add support along the lines of: "look for the parquet file only in the current dir; if not found, look in the sub-dirs". Something like the sketch below.
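(find_parquet() here is a hypothetical helper, not the actual ReadVizgen() code.)

# look in the top-level directory first; only search sub-directories if nothing is found
find_parquet <- function(data.dir) {
  f <- list.files(data.dir, pattern = "\\.parquet$", full.names = TRUE)
  if (length(f) == 0) {
    f <- list.files(data.dir, pattern = "\\.parquet$",
                    full.names = TRUE, recursive = TRUE)
  }
  f
}
find_parquet("./region_0/")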

@alikhuseynov

I just pushed the last fix, please try it out. It worked on my side for old data (i.e. .hdf5 files) with optimized BiocParallel, and also on newer Vizgen outputs. It should also support parallelization on Windows machines now.
If there are any bugs or something I missed, let me know ;)

Test run would be this

## test =================================================
## make sure all libs are installed a priori!
# load libs ----
suppressPackageStartupMessages({
  library(ggplot2)
  library(Seurat)
  library(dplyr)
  library(magrittr)
  library(BiocParallel)
  library(progressr)
  library(spatstat)    
})

# helper function to return Seurat object metadata
callmeta <- function (object = NULL) { 
  return(object@meta.data) 
  }

# set directory to read from
dir_use <- "./merfish_seurat_debug/" # or something like "./region_0/"

##
start.time <- Sys.time()
obj <-
  LoadVizgen(data.dir = dir_use,  
             fov = "merfish.test", 
             assay = "Vizgen",
             metadata = c("volume", "fov"), # add cell volume info
             type = c("segmentations", "centroids"), # type of cell spatial coord matrices
             z = 3L,
             add.zIndex = TRUE, # add z slice section to a cell
             update.object = TRUE,
             use.BiocParallel = TRUE,
             workers.MulticoreParam = 14, # for `BiocParallel` processing
             verbose = TRUE
             )
end.time <- Sys.time()
message("Time taken to load object = ", 
        round(end.time - start.time, digits = 2), " ", 
        attr(c(end.time - start.time), "units"))
obj
obj %>% callmeta %>% str
sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /xxx/envs/spatial_develop/lib/libopenblasp-r0.3.21.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
[9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] spatstat_2.3-4        spatstat.linnet_2.3-2 spatstat.core_2.4-4  
[4] rpart_4.1.16          nlme_3.1-159          spatstat.random_3.0-1
[7] spatstat.geom_3.0-3   spatstat.data_3.0-0   progressr_0.11.0     
[10] BiocParallel_1.30.4   magrittr_2.0.3        dplyr_1.1.2          
[13] SeuratObject_4.1.3    Seurat_4.3.0.9002     ggplot2_3.4.0        

loaded via a namespace (and not attached):
[1] readxl_1.4.1           uuid_1.1-0             backports_1.4.1       
[4] plyr_1.8.7             igraph_1.3.5           repr_1.1.4            
[7] lazyeval_0.2.2         sp_1.5-0               splines_4.2.1         
[10] listenv_0.8.0          scattermore_0.8        digest_0.6.29         
[13] htmltools_0.5.3        fansi_1.0.3            tensor_1.5            
[16] googlesheets4_1.0.1    cluster_2.1.4          ROCR_1.0-11           
[19] tzdb_0.3.0             readr_2.1.3            globals_0.16.1        
[22] modelr_0.1.9           matrixStats_0.62.0     spatstat.sparse_3.0-0 
[25] colorspace_2.0-3       rvest_1.0.3            ggrepel_0.9.2         
[28] haven_2.5.1            crayon_1.5.2           jsonlite_1.8.0        
[31] survival_3.4-0         zoo_1.8-10             glue_1.6.2            
[34] polyclip_1.10-0        gtable_0.3.1           gargle_1.2.0          
[37] leiden_0.4.2           future.apply_1.9.0     abind_1.4-5           
[40] scales_1.2.1           DBI_1.1.2              miniUI_0.1.1.1        
[43] Rcpp_1.0.9             viridisLite_0.4.1      xtable_1.8-4          
[46] reticulate_1.26        bit_4.0.4              htmlwidgets_1.5.4     
[49] httr_1.4.4             RColorBrewer_1.1-3     ellipsis_0.3.2        
[52] ica_1.0-3              pkgconfig_2.0.3        uwot_0.1.14           
[55] dbplyr_2.2.1           deldir_1.0-6           utf8_1.2.2            
[58] tidyselect_1.2.0       rlang_1.1.1            reshape2_1.4.4        
[61] later_1.3.0            munsell_0.5.0          cellranger_1.1.0      
[64] tools_4.2.1            cli_3.4.1              generics_0.1.3        
[67] broom_1.0.1            ggridges_0.5.4         evaluate_0.16         
[70] stringr_1.4.1          fastmap_1.1.0          goftest_1.2-3         
[73] bit64_4.0.5            fs_1.5.2               fitdistrplus_1.1-8    
[76] purrr_1.0.1            RANN_2.6.1             pbapply_1.7-0         
[79] future_1.28.0          mime_0.12              xml2_1.3.3            
[82] arrow_9.0.0            hdf5r_1.3.5            compiler_4.2.1        
[85] plotly_4.10.0          png_0.1-7              spatstat.utils_3.0-1  
[88] reprex_2.0.2           tibble_3.2.1           stringi_1.7.8         
[91] forcats_0.5.2          lattice_0.20-45        IRdisplay_1.1         
[94] Matrix_1.5-1           vctrs_0.6.2            furrr_0.3.1           
[97] pillar_1.9.0           lifecycle_1.0.3        lmtest_0.9-40         
[100] RcppAnnoy_0.0.19       data.table_1.14.6      cowplot_1.1.1         
[103] irlba_2.3.5            httpuv_1.6.5           patchwork_1.1.2       
[106] R6_2.5.1               promises_1.2.0.1       KernSmooth_2.23-20    
[109] gridExtra_2.3          parallelly_1.32.1      codetools_0.2-18      
[112] MASS_7.3-57            assertthat_0.2.1       withr_2.5.0           
[115] sctransform_0.3.5      hms_1.1.2              mgcv_1.8-40           
[118] parallel_4.2.1         sfarrow_0.4.1          grid_4.2.1            
[121] tidyverse_1.3.2        IRkernel_1.3           tidyr_1.2.0           
[124] googledrive_2.0.0      Cairo_1.6-0            Rtsne_0.16            
[127] pbdZMQ_0.3-7           spatstat.explore_3.0-5 lubridate_1.8.0       
[130] shiny_1.7.2            base64enc_0.1-3 

@andrewjkwok

Thanks for the update.

I installed the latest version with remotes::install_github(repo = 'alikhuseynov/seurat', ref = 'feat/vizgen').

This somehow now throws up an error:

Cell segmentations are found in `.parquet` file
>>> using micron space coordinates
Error in files[["spatial"]] : subscript out of bounds
> traceback()
2: ReadVizgen(data.dir = data.dir, mol.type = mol.type, filter = filter, 
       z = z, verbose = verbose, ...)
1: LoadVizgen(data.dir = dir_use, fov = "merfish.test", assay = "Vizgen", 
       metadata = c("volume", "fov"), mol.type = mol.type, type = c("segmentations", 
           "centroids"), z = z.stack, add.zIndex = TRUE, update.object = TRUE, 
       use.BiocParallel = TRUE, workers.MulticoreParam = 14, use.furrr = TRUE, 
       verbose = TRUE)

Incidentally, I also keep getting an error saying that there's no default argument for verbose, but that's minor. Thanks again in advance for supporting this!

@alikhuseynov

You are welcome, I want to make sure it's robust and always works.

Does your dir_use have these files?
If it's another dataset, those files should be present there.

merfish_seurat_debug/
├── cell_boundaries.parquet
├── cell_by_gene.csv
├── cell_metadata.csv
└── detected_transcripts.csv

files[["spatial"]] is basically the cell_metadata.csv. Error in files[["spatial"]] : subscript out of bounds would mean that it could not find the cell_metadata.csv file, or that multiple cell_metadata.csv files exist inside dir_use.
Also, you don't have to provide mol.type (it's set to "microns" by default) or z as environment variables or inside LoadVizgen(), unless you want to use a slice other than z = 3L.
It works on my side for a number of runs/datasets, and there is no warning for verbose, so I'm not sure yet why that happens.
Try to run as in my comment above and let me know.
Thanks

@alikhuseynov

..for a sanity check, could you run the following and see if all 3 files are present?

# sanity check
data.dir <- dir_use # your directory with Vizgen data
# run this
files <- c(transcripts = NULL %||% 
               list.files(data.dir, 
                          pattern = "cell_by_gene",
                          full.names = TRUE) %>%
               { if (is.empty(.)) { 
                 list.files(data.dir,
                            pattern = "cell_by_gene",
                            full.names = TRUE,
                            recursive = TRUE)
               } else { (.) }} %>% 
               { if (length(.) > 1) { 
                 message("There are > 1 `cell_by_gene` files", 
                         "\n", "make sure only 1 file is read")
                 stop()
               } else if (!length(.)) {
                 message("No `cell_by_gene` file is available")
                 stop()
               } else { (.) }},
             
             spatial = NULL %||% 
               list.files(data.dir, 
                          pattern = "cell_metadata",
                          full.names = TRUE) %>%
               { if (is.empty(.)) { 
                 list.files(data.dir,
                            pattern = "cell_metadata",
                            full.names = TRUE,
                            recursive = TRUE)
               } else { (.) }} %>% 
               { if (length(.) > 1) { 
                 message("There are > 1 `cell_metadata` files", 
                         "\n", "make sure only 1 file is read")
                 stop()    
               } else if (!length(.)) {
                 message("No `cell_metadata` file is available")
                 stop()
               } else { (.) }},
             
             molecules = NULL %||% 
               list.files(data.dir, 
                          pattern = "detected_transcripts",
                          full.names = TRUE) %>%
               { if (is.empty(.)) { 
                 list.files(data.dir,
                            pattern = "detected_transcripts",
                            full.names = TRUE,
                            recursive = TRUE)
               } else { (.) }} %>%
               { if(length(.) > 1) { 
                 message("There are > 1 `detected_transcripts` files", 
                         "\n", "make sure only 1 file is read")
                 stop()    
               } else if (!length(.)) {
                 message("No `detected_transcripts` file is available")
                 stop()
               } else { (.) }}
  )

output should look something like this:

files
transcripts: ‘/xxx/vizgen_test_repo/merfish_seurat_debug//cell_by_gene.csv’
spatial: ‘/xxx/vizgen_test_repo/merfish_seurat_debug//cell_metadata.csv’
molecules: ‘/xxx/vizgen_test_repo/merfish_seurat_debug//detected_transcripts.csv’

@andrewjkwok

Got it - it turns out, exactly as you predicted, that there was an extra cell metadata file (I had tried to edit a version) - it works now after deleting that file!

@munizmom

munizmom commented Jun 21, 2023

Hi Alik, thanks a lot for this implementation! Currently the LoadVizgen function is only able to load the segmentation coming from one z-layer. Vizgen has made a post-processing package available on GitHub with which we can redo the segmentation and include multiple z-layers, so the parquet files now contain the segmentation of multiple layers and can identify new cells in different layers. To combine all the info into one Seurat object, a preliminary way seems to be running LoadVizgen() per layer and then merging the Seurat objects.

seurat.objF<- merge(seurat_z1.obj, y = c(seurat_z2.obj, seurat_z3.obj,seurat_z4.obj, seurat_z5.obj,seurat_z6.obj), add.cell.ids = c("1z", "2z", "3z","4z", "5z", "6z"), project = "test")

However, some cells are observed in several z-layers, so after merging the objects we would also need to merge the same cells by ID so they are not counted as different cells... Could it be simpler to adapt the LoadVizgen() function to be able to use information from more than one z-layer?

@alikhuseynov

Hey Mar, thanks. Support for multiple segmentations is on the "todo" list. I need to take a look at the Vizgen repo.

  • Does the post-processing also generate new cell_by_gene.csv and cell_metadata.csv per z-plane?

  • Are all of the unique cell IDs from the count matrix (cell_by_gene.csv) also found in each of the z-plane segmentations?

One could generate a single dataframe of all (or selected) z-planes with unique cell IDs (also found in the count matrix) and use it as the final "segmentation" FOV; see the sketch below.
If you can share that test data for 7 z-planes, I could take a look.
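A rough, untested sketch of that idea, assuming the parquet layout posted earlier in this thread (EntityID, ZIndex, Geometry) and keeping, per cell, the z-plane whose polygon has the most vertices (just one possible choice):

library(sf)
library(dplyr)

bounds <- sfarrow::st_read_parquet("cell_boundaries.parquet") %>%
  mutate(cell  = as.character(EntityID),
         n_pts = vapply(Geometry,
                        function(g) if (length(g)) nrow(g[[1]][[1]]) else 0L,
                        integer(1)))

# one boundary per unique cell ID across all z-planes
one_per_cell <- bounds %>%
  group_by(cell) %>%
  slice_max(n_pts, n = 1, with_ties = FALSE) %>%
  ungroup()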

An alternative post-processing approach could be a Maximum Intensity Projection (MIP) over all z-planes, then segmentation on that, using the output as a single z-plane.

@AustinHartman, do you have any suggestion?

Thanks

@dfhannum

For some current (and potentially future) versions of the Vizgen segmentation, the segmentations for a cell won't be the same on all planes. From my knowledge, the newly generated cell_by_gene.csv and cell_metadata.csv take this segmentation into account and create singular versions of these files.

Though based on @munizmom's comment, it seems like there are multiple output files.

My two cents on adding in multiple layers:

While I think it would be interesting to view multiple cell segmentations, I'm not sure what the benefit would be doing this in Seurat. The segmentations take up a ton of space in the Seurat object and dealing with the complete set of segmentations would be a slog. Being able to view the segmentations in Seurat is very useful for image generation and some small exploratory analysis; but for me Seurat isn't meant to be an exploratory image analysis tool.

@munizmom

Hi Alik, thanks for the super prompt answer. To answer your questions:

  • “Does the post-processing also generate new cell_by_gene.csv and cell_metadata.csv per z-plane?” MM: it still generates a single cell_by_gene.csv and cell_metadata.csv. cell_metadata.csv still contains only the x and y centre coordinates (not z), and the FOV information is always empty.
  • “Are all of the unique cell IDs from the count matrix (cell_by_gene.csv) also found in each of the z-plane segmentations?” MM: the cell IDs are completely numeric and are unique when using Python libraries that handle int64 numbers. Importing those IDs into R loses the int64 precision, and then the IDs are no longer unique; but in a post-processing step you can change both cell_by_gene.csv and cell_metadata.csv to contain the real cell ID plus, for example, a row number to make them truly unique, while staying compatible with bringing the information back to the Vizgen visualiser, since you still maintain the original cell ID (see the sketch after this comment).

“One could generate single dataframe of all (or selected) z-planes with unique cell IDs (also found in the count matrix) and use it as final "segmentation" FOV.” MM: I am going to test this and come back to you. Thanks!

I will send you an email, Alik, with access to the multi-layer segmentation and all the output needed for loading the experiment. Give me just a day to put it together :). Thanks for the feedback and help from the community! If we manage to keep more z-layer info, it will at least enable better niche and 3D analyses.
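On the int64 point, one way to keep the IDs exact in R is to read the int64 columns as character (a sketch; that the ID is the first column in both files is an assumption):

library(data.table)
meta   <- fread("cell_metadata.csv", integer64 = "character")
counts <- fread("cell_by_gene.csv",  integer64 = "character")
# IDs can then be matched safely as strings, e.g.:
all(counts[[1]] %in% meta[[1]])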

@alikhuseynov

@dfhannum Thanks! Sounds good; I see the point and agree that it would be a huge/heavy object with all the z-plane segmentations.
I won't include multi-plane support in the current PR #7190, but I will make another branch and test/see how things look.

@alikhuseynov

@munizmom I'm happy that the current implementation works and is useful for users ;)
Thanks for your answers; let me know once the test data is ready.

@heilandd

heilandd commented Aug 2, 2023

Hi, I tested the function on our latest Vizgen data and the following error appeared:

Using parallelization with: `future`
NOTE: set workers for parallelization, eg: `future::plan('multisession', workers = 10)`
Reading data from:
~/Desktop/MERFISH/NP_Beta10/region_0/Cellpose_DAPI_CB2
Cell segmentations are found in `.parquet` file
>>> using micron space coordinates
>>> filtering `cell_metadata` - keep cells with `transcript_count` > 0
>>> filtering `cell_by_gene` - keep cells with counts > 0
Sanity checks on cell segmentation polygons:
>>> found 16 cells with > 1 (nested) polygon lists:
>>> flattening polygon lists

Found 18 cells with > 1 polygon artifacts:
>>> removing artifacts
>>> keeping cell boundary with maximum coords

Additionally found 1 cells with > 1 polygons (identical length):
>>> only the 1st polygon boundary will be kept

Error in segs[[segs.art.index[i]]] <- segs[[segs.art.index[i]]][1] : 
  attempt to select more than one element in integerOneIndex

Can you help to avoid this problem?

@alikhuseynov

Hi, that has to do with the segmentation output; I did implement a few sanity checks to deal with nested polygon lists.
If you can share your .parquet file, I could then try to make a fix.

Alternatively, if you don't need segmentations for your preliminary analysis, try to load without segmentations first, eg:
type = c("centroids") or type = c("boxes", "centroids").

Also, try to use parallelization with BiocParallel which is usually a bit more efficient, eg: use.BiocParallel = TRUE and workers.MulticoreParam = 10

Hope this helps.
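For reference, a minimal call along those lines (a sketch using the arguments from this branch; the path is a placeholder):

obj <- LoadVizgen(
  data.dir = "./region_0/", # placeholder
  fov = "merfish.test",
  assay = "Vizgen",
  type = c("boxes", "centroids"), # skip "segmentations" for a quick first pass
  use.BiocParallel = TRUE,
  workers.MulticoreParam = 10
)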

@alikhuseynov

@heilandd could you please share your .parquet file so I can reproduce the error and debug?
Thanks

@alikhuseynov

@heilandd ..my latest commit is in:

just re-install remotes::install_github(repo = 'alikhuseynov/seurat', ref = 'feat/vizgen')

@LilianGomes

LilianGomes commented May 13, 2024

Hi All this is very helpful, thank you very much for the updates!

I have been trying to load a Vizgen run with .parquet cell boundaries. I used Alik's updates to LoadVizgen. However, I get the following warnings:

Warning messages:
1: Not validating FOV objects 
2: Not validating Centroids objects
9: Not validating Seurat objects 

I went ahead with data processing. The UMAP looks good but I get the following when I try to plot the ImageDimPlot:

No compatible spatial coordinates present

Does anyone know what the issue is?

@alikhuseynov

It's difficult to tell what your object looks like. Try checking which spatial FOVs are present:

obj[[Images(obj)[1]]] %>% names() # for 1st section
[1] "centroids"    "segmentation" "molecules"

When does that warning come?
Try remotes::install_github(repo = 'alikhuseynov/seurat', ref = 'vizgen_seurat5'), making sure that SeuratObject is v5 as well.
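A quick sanity check before building the object (sketch):

packageVersion("SeuratObject") # should be >= 5.0.0
packageVersion("Seurat")
options(Seurat.object.assay.version = "v5") # as used earlier in this thread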
