fix multiple cell ids match with the same dataset id #140

myushen · 2024-05-07T00:57:04Z

…th the same dataset id

multimeric · 2024-05-08T01:12:17Z

R/import_metadata_and_counts.R


  # Generate cpm from counts
-  cli_alert_info("Generating cpm from {.path {metadata_tbl$file_id_db}}. ")
+  unique_file_ids <- unique(metadata_tbl$file_id_db)
+  cli_alert_info("Generating cpm from {.path {unique_file_ids}}. ")
  get_counts_per_million(input_sce_obj = sce_obj, output_dir = counts_path$cpm_path, hd5_file_dir = counts_path$original_path)


At this point, counts_path$cpm_path seems like it's a vector of filepaths, one for each file_id_db in the metadata. However, get_counts_per_million expects a single file path. Does this work in practice?
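
To illustrate the concern, here is a minimal sketch using only base R. The values of `file_id_db` and `cache_dir` are made up for the example and are not taken from the package. Because `file.path()` is vectorised, passing it a column with one entry per cell gives one path per cell, even when every entry is the same.

```r
# Hypothetical values: one file_id_db copy per cell, all identical.
file_id_db <- rep("0123abcd", times = 3)
cache_dir <- "cache"

# file.path() is vectorised, so this yields three identical paths.
cpm_paths <- file.path(cache_dir, "cpm", file_id_db)
length(cpm_paths)

# De-duplicating first yields a single path, which is the shape a
# function expecting one output directory would need.
cpm_path <- file.path(cache_dir, "cpm", unique(file_id_db))
length(cpm_path)
```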

I think this API requires users to import a SingleCellExperiment object one at a time, and each object should have the same file_id_db derived from dataset_id.

Ah yes, I see that file_id_db = .data$dataset_id |> openssl::md5() |> as.character(), so there will only be one file_id_db. To make this clearer, you could create file_id_db as a separate variable that you add to the metadata data frame and also use to generate the paths. That way you won't have to filter out the redundant copies at all.

Thanks for that, @multimeric. I discussed this with @myushen and we concluded that an additional argument is not needed at this stage, just good messaging and documentation in case the user supplies multiple datasets in a single SingleCellExperiment object.

I wasn't suggesting adding a new argument. I'm just saying that we have a single file ID, which gets duplicated as a result of putting it into the data frame, and then we filter it back down again to a single value using distinct. This can be avoided by using a separate variable:

file_id_db <- .data$dataset_id |> openssl::md5() |> as.character()
...
metadata_tbl <- metadata_tbl |> mutate(file_id_db = file_id_db)
...
original_path <- file.path(original_dir, basename(file_id_db))
cpm_path <- file.path(cache_dir, "cpm", basename(file_id_db))

But the redundant file_id_db column will still be carried in the metadata, since that is what we use for future queries. So the redundancy will remain in the metadata.

Or am I missing something?

Yes, I realise it will be like that in the metadata. My suggestion is just to make the code simpler and clearer.
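
A runnable sketch of this suggestion, under stated assumptions: `dataset_id`, `original_dir`, `cache_dir` and the example `metadata_tbl` are hypothetical stand-ins, and the `openssl` package is assumed to be installed. It is not the code that was merged. It only shows that hashing the single dataset ID once gives a scalar that can both fill the metadata column and build the paths, with no de-duplication step.

```r
# Hypothetical inputs, for illustration only.
dataset_id <- "example-dataset"
original_dir <- file.path(tempdir(), "original")
cache_dir <- tempdir()
metadata_tbl <- data.frame(
  cell_id = c("cell_1", "cell_2", "cell_3"),
  stringsAsFactors = FALSE
)

# Hash the single dataset_id once; the result is a length-one character vector.
file_id_db <- dataset_id |> openssl::md5() |> as.character()

# The scalar is recycled across rows, so the metadata still carries the
# column needed for later queries.
metadata_tbl$file_id_db <- file_id_db

# The paths are built from the scalar, so each is a single path.
original_path <- file.path(original_dir, file_id_db)
cpm_path <- file.path(cache_dir, "cpm", file_id_db)
```

The metadata column stays redundant, as noted above, but the path-building code never sees more than one value.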

I am not sure if the additional argument would be in the front end (cost of each additional argument is $1M) or in the back end (cost $1)

Sure, in the back end. @myushen, please follow @multimeric's lead.

I think @myushen intended that as modifying the front end (code interface to the user)

I will add a checkpoint at the front end and rename the function for clarity.
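
One possible shape for such a checkpoint, sketched in base R. The function name `check_single_dataset_id` and the error message are hypothetical, and the check that was actually merged may differ. The idea is simply to stop early when the metadata of one SingleCellExperiment object holds more than one `dataset_id`.

```r
# Hypothetical helper: fail early unless the metadata has exactly one dataset_id.
check_single_dataset_id <- function(metadata_tbl) {
  dataset_ids <- unique(metadata_tbl$dataset_id)
  if (length(dataset_ids) != 1) {
    stop(
      "Expected exactly one dataset_id per SingleCellExperiment object, found ",
      length(dataset_ids), ": ", paste(dataset_ids, collapse = ", "),
      call. = FALSE
    )
  }
  # Return the single ID invisibly so callers can reuse it.
  invisible(dataset_ids)
}

# Example usage with made-up metadata.
ok_metadata <- data.frame(
  cell_id = c("cell_1", "cell_2"),
  dataset_id = c("d1", "d1"),
  stringsAsFactors = FALSE
)
check_single_dataset_id(ok_metadata)
```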

fix the issue when a sce metadata contains multiple cell ids match wi…

b44411e

…th the same dataset id

stemangiola requested a review from multimeric May 7, 2024 23:02

multimeric reviewed May 8, 2024

myushen and others added 4 commits May 8, 2024 11:32

Update R/import_metadata_and_counts.R

aaaa6a9

code optimisation Co-authored-by: Michael Milton <ttmigueltt@gmail.com>

simplify code. add checkpoint to interact with users. rename function…

ecc68b4

… for clarification

Merge branch 'stemangiola:master' into import-api-debug

a20443e

sample data

2d3e2f9

myushen requested a review from multimeric May 21, 2024 04:57

multimeric approved these changes May 22, 2024

stemangiola merged commit 76e5b2d into stemangiola:master May 22, 2024
3 of 4 checks passed
