Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Identifier replacement method for cytoframe #344

Closed
DillonHammill opened this issue Aug 30, 2020 · 20 comments
Closed

Identifier replacement method for cytoframe #344

DillonHammill opened this issue Aug 30, 2020 · 20 comments

Comments

@DillonHammill
Copy link
Contributor

@mikejiang, is there an identifier replacement method for cytoframes? Unfortunately flowCore::identifier<- doesn't work for cytoframes:

library(CytoExploreRData)
library(flowWorkspace)

fs <- Activation
cs <- flowSet_to_cytoset(fs)
identifier(cs[[1]])
identifier(cs[[1]]) <- "TEST"
identifier(cs[[1]])

Sorry for bugging you with all of these issues, I am going through all the code in CytoExploreR to ensure that it works with all classes of cytometry objects - so I am finding a few things that are not working as I had expected based on previous flowCore functionality.

Thanks for your help!
Dillon

@DillonHammill
Copy link
Contributor Author

DillonHammill commented Aug 30, 2020

Oh, I also noticed that updating the sampleNames of a cytoset doesn't alter the underlying cytoframe identifiers either:

sampleNames(cs)[1] <- "TEST"
identifier(cs[[1]]) # should match new sampleNames

This is related to #335.

@mikejiang
Copy link
Member

again, avoid the legacy identifier, Can you tell me a compelling reason that you must use it?

@DillonHammill
Copy link
Contributor Author

DillonHammill commented Aug 30, 2020

Just about all the code in CytoExploreR performs operations at the cytoframe level. So it is important (for me anyways) that I can extract the sample name information from each cytoframe and this should match that of the cytoset. Perhaps, there could be a simpler API where identifier is not used but we could call sampleNames() on a cytoframe. I think it makes sense to keep the replacement restricted to cytosets but the changes should trickle down to the cytoframes in some way.

@mikejiang
Copy link
Member

Again, sampleNames belongs to cytoset, cytoframe doesn't have/store this information.

@DillonHammill
Copy link
Contributor Author

I think I am starting to understand this. I will post some examples tomorrow for you to look at.

@DillonHammill
Copy link
Contributor Author

@mikejiang, where does the sampleNames() information come from when the samples are loaded? Presumably from the FIL or FILENAME keyword? I am wondering if I should instead extract the names from this slot instead of using sampleNames(), because users may update the keyword and it won't be updated in the sampleNames().

I am now switching to using a new cyto_apply() function to loop over cytosets etc. but this means that operations are performed solely at the cytoframe level - so if the result requires annotation with the sample name it should come from the cytoframe. Sure, I could replace this after the fact with the sampleNames() but this gets messy.

I often convert cytosets to lists of cytoframes and I have considered storing the sampleNames in the names of the list to carry this information with each cytoframe, but there are so many operations performed on these lists that it is painful to keep checking that the names are retained.

I guess I am just after a way of extracting the most up-to-date name of the sample from a cytoframe.

@mikejiang
Copy link
Member

mikejiang commented Sep 1, 2020

Because this sampleNames serves as sample GUIDs for cytoframes within a specific cytoset context and is only relevant to a specific cytoset. So once you go down to the cytoframe level, it is no longer relevant.

When you load_cytoset_from_fcs(), it uses filename as guid(or sampleNames), for cytoset() constructor, it takes the names of the input list (of cytoframes) as sampleNames.

For looping through cs, the cytoframe level operation should not be dependent on the higher-level info (i.e. sampleName) that is only stored at cs. Again, duplicating this information at each individual cytoframe is a REALLY bad idea/design.

@mikejiang
Copy link
Member

My major concern is the data integrity issue as you keep the same info copies at two different places. Once cytoframe is extracted from cytoset, we no longer can keep it in sync (say if you change sampleName at cytoframe)

@DillonHammill
Copy link
Contributor Author

DillonHammill commented Sep 1, 2020

Is there a technical reason for storing this information at the cytoset level or is it merely used for getting the sample names? It seems to me that this information is only important when the file are read in. If this is the case and users can alter the keywords, why not have an API that extracts/replaces the keyword directly from the cytoframes (i.e. users will not use sampleNames())? I know you don't like identifier() but a function that simply returns the sample names could like something like this:

sample_names <- function(x) {
  UseMethod("sample_names")
}
sample_names.flowFrame <- function(x) {
  # some way to check and extract relevant keyword - identifier is currently the best bet
  identifier(x)
}
sample_names.flowSet <- function(x) {
  fsApply(x, "sample_names")
}
sample_names.GatingHierarchy <- function(x) {
  sample_names(gs_cyto_data(x))
}
sample_names.GatingSet <- function(x) {
  sample_names(gs_cyto_data(x))
}

At least this way all objects are giving the same information to the user and then you don't have to worry about duplicating this information? Similarly, there could be a replacement method that replaces the relevant keyword instead of altering sampleNames()?

Just an idea...

I had a look through my code and it would require substantial changes to fix this and I probably don't have the time to tackle this now. So I think for now I might add a function similar to the one above and add a replacement method as well - this should force users to change the keyword and not sampleNames(). That way whenever I need the sample names I can extract them directly from the cytoframes and this information will always be up to date.

@mikejiang
Copy link
Member

sampleNames of cs is used as the key of the hash map for fast indexing, so it can't be stored at cytoframe level.

@DillonHammill
Copy link
Contributor Author

I see, so you would want the sampleNames() to be up to date then. So subsetting a cytoset by name uses the sampleNames() and not keywords?

@jacobpwagner
Copy link
Member

Late to the discussion, but I think I mostly agree with Mike on this one. sampleNames is meant to just be a set of identifiers for the constituent cytoframes within the context of the cytoset. I don't think those identifiers should be stored within each cytoframe for the same reason the last 4 lines of this snippet look pretty absurd.

> list_of_numbers <- list(1,2,3,4)
> names(list_of_numbers) <- c("a","b","c","d")
> 
> # The names are only relevant in the context of the collection
> attr(list_of_numbers, "names")
[1] "a" "b" "c" "d"
> attr(list_of_numbers[[1]], "name")
NULL
> 
> # You can do this, but if you rely on these you
> # need to make sure they are constantly in sync
> # with the collection names
> attr(list_of_numbers[[1]], "name") <- "a"
> attr(list_of_numbers[[2]], "name") <- "b"
> attr(list_of_numbers[[3]], "name") <- "c"
> attr(list_of_numbers[[4]], "name") <- "d"

Nothing is stopping you from accessing those names from the collection in parallel for each item from within a lapply (or within cyto_apply). If you are iterating over the collection, you must have access to its names.

> data("GvHD")
> cs <- flowSet_to_cytoset(GvHD[1:4])
> info <- lapply(seq_along(cs), function(idx){
+     paste0("The cytoframe named ", sampleNames(cs)[[idx]], " has this many cells: ", nrow(exprs(cs[[idx]])))
+ })
> info
[[1]]
[1] "The cytoframe named s5a01 has this many cells: 3420"

[[2]]
[1] "The cytoframe named s5a02 has this many cells: 3405"

[[3]]
[1] "The cytoframe named s5a03 has this many cells: 3435"

[[4]]
[1] "The cytoframe named s5a04 has this many cells: 8550"

Of course, if you just want identifying information about each cytoframe, as discussed earlier you could explicitly use something like keyword(cf, "$FIL").

Your sample_names accessors are fine, but the challenge here comes in maintaining synchronization under the mutators. Even if you only store sample_name at the cytoframe level, you run in to issues. As a quick example, suppose you have a cytoset with 3 sample cytoframes named "a", "b", and "c". If you call a sample_names<- mutator on cytoframe "b" to change its sample_name to "a", now the sample_name identifiers are not unique and effectively broken for most indexing purposes for the cytoset. I suppose you could check for uniqueness in the parent cytoset before replacement, but cytoframes need not even exist in a cytoset, so this is counter to the modular design of the class.

Of course, you could still store some kind of identifier at the cytoframe level (by an attribute or a keyword) for the sake of annotating things, but if you want to use it for indexing in the aggregate, you will run in to the uniqueness issue above and need to add careful checks during mutation. Furthermore, I'd advise against calling it sample_names just to avoid user confusion with sampleNames...

@DillonHammill
Copy link
Contributor Author

DillonHammill commented Sep 2, 2020

Thanks @jacobpwagner, if I do implement this is I make sure to check for uniqueness and it will be called cyto_names() to be consistent with the rest of CytoExploreR and to avoid confusion with sampleNames(). The problem I have is a lot of the code I have written works with cytoframe lists which can have very complex structure and I don't have access to the cytoset to pull out the sampleNames() - I can try to maintain these in the names of the lists when they are first generated but it is really messy to try and retain this information after all the operations performed on these lists.

@jacobpwagner
Copy link
Member

jacobpwagner commented Sep 2, 2020

I see, so you would want the sampleNames() to be up to date then. So subsetting a cytoset by name uses the sampleNames() and not keywords?

Precisely, albeit at the cytolib (C++) level. At that level a cytoset is actually represented as a GatingSet (just without gates). But anyway, as Mike mentioned, the sampleNames are used as a key for the hash map that speeds up lookup.

I can try to maintain these in the names of the lists when they are first generated but it is really messy to try and retain this information after all the operations performed on these lists.

I feel your pain and totally understand, particularly if you are dealing with arbitrarily-nested lists of cytoframes. But yeah, these are the sort of mutation/synchronization issues that have to be managed if annotating the same information in multiple places. That's more or less the same fundamental argument for not adding the sampleName annotation to the cytoframe class, except instead of nested R-level lists, the sampleNames of a cytoset touch a lot of interrelated moving parts at the C++ level that need to be carefully managed/synchronized.

@DillonHammill
Copy link
Contributor Author

DillonHammill commented Sep 9, 2020

@jacobpwagner & @mikejiang, I have checked through the CytoExploreR code and a patch job is definitely not going to work. I think I am going to have to take the time to refactor the code to prevent cytoframe/cytoset synchronization issues. Do you mind if I run my plan past you before I dive in? I want to avoid making drastic changes that I may regret in a couple of weeks time.

The current design involves writing methods for cytoframe objects and then looping through these methods for cytosets etc. As I described earlier, this means that any information in the output comes directly from the cytoframe and not the cytoset.

After our discussions, I am thinking that I should instead treat cytoframe class simply as a container for the data and not extract any other information from them. In fact, I would like to force users to only interact with cytosets when dealing with the low-level data - as not all changes to cytoframe will synchronize with cytoset. To accomplish this I am planning on removing many of the cytoframe methods that I have written and instead only write methods down to the cytoset level. This will prevent users from making changes to individual cytoframes and if they need to work with individual samples they can just use cs[1]. I will also move away from using cytoframe lists in favour of cytoset lists to retain sampleNames information without having to store this in the names of the list - i.e. list(cs[[1]], cs[[2]]) will be replaced by list(cs[1], cs[2]). For consistency, this also means that any coercion methods that I use will return a merged cytoset instead of cytoframe - i.e. cytoset(list("merge" = as(cs[1:4], "cytoframe"))) - once #348 is addressed. Doing things this way, I will also be able to retain support for GatingHierarchy and GatingSet classes, as I can simply extract the relevant cytoset from these objects. In essence, where I used to work with cs[[x]] I will now work with cs[x] instead.

This is going to be a lot of work, so I just wanted to confirm whether this approach is more consistent with the way that you intend for users to interact with these classes? Happy to take any suggestions.

@mikejiang
Copy link
Member

as not all changes to cytoframe will synchronize with cytoset

What do you mean by that?

@DillonHammill
Copy link
Contributor Author

DillonHammill commented Sep 9, 2020

I was just referring to name synchronisation/duplication problems that we discussed previously. I think this issue can be bypassed if I simply restrict operations to cytoset instead of cytoframe. Doing things this way, I can ditch identifier() altogether and use sampleNames() instead. It also means that I won't have to rely on any keywords matching the sampleNames() so that if users do fiddle with the identifier related keywords manually it won't affect other operations within CytoExploreR - as I will always use sampleNames().

@mikejiang
Copy link
Member

Then your statement was not true. cytoframe is a reference content of cytoset, all changes to cytoframe should be in place and thus synchronized to cytoset.

I just want to remind you once again that the sampleNames that you were concerned about is just not part of cytoframe at all. Instead it is exclusively belonging to cytoset, If your goal is to retain and mutate the sampleName, yes [ subsetting is the way to go instead of [[ extraction operation.

@DillonHammill
Copy link
Contributor Author

DillonHammill commented Sep 9, 2020

Yes of course. I just feel that if I switch over to using cytosets more often I won't have to worry about retaining/passing on these names. Here is a simple example that hopefully illustrates the changes I am planning on making a bit clearer. Here I am using cyto_plot() to plot the data and include the name of the sample in the title but it could be replaced with any function that annotates the result with the sample name. This is just for illustrative purposes.

# current code uses cytoframe lists
cf_list <- lapply(sampleNames(cs), function(x) {
  cs[[x]]
})
names(cf_list) <- sampleNames(cs)  # sampleNames not stored in cytoframe - carry in list names

# current usage is to extract identifier when required
res <- lapply(cf_list, function(x) {
  cyto_plot(x,
            channels = c("FSC-A", "SSC-A"),
            title = identifier(x))  # may be out of sync with sampleNames(cs)
})

# option 1 - carry the sampleNames in the names of list where they can be accessed
res <- lapply(seq_along(cf_list), function(x) {
  cyto_plot(cf_list[[x]],
            channels = c("FSC-A", "SSC-A"),
            title = names(cf_list)[x])  # matches sampleNames(cs)
})

# option 2 - switch to using cytoset lists
cs_list <- lapply(sampleNames(cs), function(x){
  cs[x]
})
names(cs_list) <- sampleNames(cs) # not required - information stored in each cytoset

# now the name can be extracted from sampleNames
res <- lapply(cs_list, function(x) {
  cyto_plot(x,
            channels = c("FSC-A", "SSC-A"),
            title = sampleNames(x))  # matches sampleNames(cs)
})

I have some complex list structures in CytoExploreR so it will be difficult to try and carry the names for the cytoframe lists, it would be much simpler if I use cytoset list instead as this information can be extracted directly from each object.

@DillonHammill
Copy link
Contributor Author

Thanks so much for your help @mikejiang and @jacobpwagner. I will start working on this today, it will take some time but I think it will be worth it in the long run. Once complete I will no longer require the use of identifier() so I am happy to close this issue now. I think we can close #335 as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants