Color compensation #238

Open
castillohair opened this issue Mar 7, 2017 · 3 comments

@castillohair
Collaborator

Color compensation is necessary for applications in which multiple fluorophores are used in the same cell (e.g. sfGFP and mCherry). Currently, FlowCal does not perform color compensation automatically. This issue will fix that, and the solution will eventually become part of a relatively major revision. This requires, as far as I can tell right now, the resolution of three aspects: developing the mathematical foundations, programming the API methods, and expanding the Excel UI so that non-programmers can use compensation easily.

The math that I would be using can be found here. It was derived independently, but it is equivalent to http://www.drmr.com/compensation/, plus autofluorescence. It is also equivalent to what is performed in the TASBE method, as far as I can tell. Roughly, the method requires calculating one matrix of spillover coefficients and one vector of autofluorescence values. These can be calculated from a sample with no fluorescence and from samples with strong expression of a single fluorophore each.

With regard to the API, I propose the creation of a new module, called compensation. This module will contain the functions necessary to calculate the compensation parameters and to apply compensation to new samples. From an API perspective, compensation should be performed in a way similar to how MEF is performed:

  1. Compensation of an FCSData object will be performed via a transformation function: s_compensated = compensate_fxn(s_uncompensated). I think it would be cool if the channel names are changed to reflect the fact that the event list now contains fluorophore values instead of channel signals (e.g. instead of FL1, the channel would be called sfGFP).
  2. The transformation function will be generated by a function in the compensation module, get_compensation_fxn(), which will receive as parameters all the appropriate controls and the channels to compensate.
  3. The compensation module will contain any other auxiliary functions that are necessary for compensation.
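
A minimal sketch of how points 1 and 2 could fit together, under several assumptions that are not part of the proposal: plain NumPy arrays of shape (n_events, n_channels) stand in for FCSData objects, the parameter names nfc_sample and sfc_samples are hypothetical, the controls are summarized by their arithmetic means, and each spillover column is normalized so that a fluorophore's own channel has coefficient 1. Channel selection and renaming are left out.

```python
import numpy as np

def get_compensation_fxn(nfc_sample, sfc_samples):
    """Build a compensation function from control samples (illustrative only).

    nfc_sample  : (n_events, n_channels) array, no-fluorophore control.
    sfc_samples : list of n_channels arrays, one single-fluorophore control
                  per channel, in the same order as the columns.
    """
    # Autofluorescence vector a0: mean signal of the no-fluorophore control.
    a0 = nfc_sample.mean(axis=0)

    # Spillover matrix A: column j is the mean background-subtracted signal
    # of control j, scaled so that its own channel has coefficient 1.
    n = len(sfc_samples)
    A = np.empty((n, n))
    for j, sfc in enumerate(sfc_samples):
        col = sfc.mean(axis=0) - a0
        A[:, j] = col / col[j]
    A_inv = np.linalg.inv(A)

    def compensate_fxn(s_uncompensated):
        # f = A^-1 * (s - a0), applied to every event (row) at once.
        return (s_uncompensated - a0) @ A_inv.T

    return compensate_fxn
```

Usage would then mirror the proposal: compensate_fxn = get_compensation_fxn(nfc, [sfc_fl1, sfc_fl2]), followed by s_compensated = compensate_fxn(s_uncompensated).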

The Excel UI aspect needs more thinking. So far, I think that a new sheet can be included, in which the user can point to all the necessary controls, the names of the channels to compensate, and the names of the fluorophores used. In the Samples sheet, the user can still get statistics on uncompensated values by using the name of the channel (e.g. FL1 Median) or on compensated values, by using a fluorophore's name (e.g. sfGFP Median). A priority in this case should be to preserve FlowCal's current behavior.

@JS3xton
Contributor

JS3xton commented May 22, 2017

> I think it would be cool if the channel names are changed to reflect the fact that the event list now contains fluorophore values instead of channel signals (e.g. instead of FL1, the channel would be called sfGFP).

I was initially on board with this idea, but upon reflection and after discussions with @thoreusc, I think we have to be careful about autofluorescence and its implications on the reported statistics.

Specifically, to expand on your math (s = a0 + A*f which implies f = A^-1 * (s - a0)):

  • I think we typically model f, a0, and s with uncertainty (whereas A, the spillover matrix, is modeled deterministically).
    • Modeling a0 deterministically would greatly simplify our model, but I think doing so is unrealistic at low sfGFP expression levels where the distribution of the measured signal closely resembles the distribution of white cells.
  • On a per event basis (e.g. for an event list or for the contents of an FCSData object):
    • We do not have a direct measure of a0. I think it's that simple; we cannot precisely report f without a measure of both s and a0.
    • Given that we cannot precisely report f for every event, there are still some reasonable calculations that one could perform (e.g. f = A^-1 * (s_FL1 - a0_mean)), but I think it would be incorrect to imply that f_sfGFP has been precisely calculated. I think something like "Compensated FL1" is more appropriate.
  • In aggregate, i.e. when discussing distributions:
    • We can model a0 by measuring many white cell events, postulating an appropriate distribution, and inferring the model parameters of that distribution from the white cell data. We can model s via the same procedure. However, calculating (s - a0) (i.e. deconvolving or unmixing the s and a0 distributions) is an underconstrained problem, so we cannot precisely determine the distribution of (s - a0) (or of f = A^-1 * (s - a0), which relies on (s - a0)).
    • Independent of the forms of the distributions for s and a0, the Expected Value (a.k.a. the first moment, a.k.a. the Arithmetic Mean) of their difference is simply the difference of their Expected Values: E[s - a0] = E[s] - E[a0] = s_mean - a0_mean. Note that this does not require s and a0 to be independent. I believe it therefore follows that you can report the Expected Value (a.k.a. Arithmetic Mean) of the "sfGFP" distribution via f_sfGFP_mean = A^-1 * (s_FL1_mean - a0_FL1_mean). Or, stated another way, you can claim that the "Compensated FL1" Expected Value (a.k.a. Arithmetic Mean) equals the "sfGFP" Expected Value (a.k.a. Arithmetic Mean). I do not think this equality holds for the parameterizing statistics of other distributions, though (e.g. the Geometric Mean; Geomean(s - a0) != Geomean(s) - Geomean(a0)), nor do I think it necessarily holds for other characteristics of a distribution, e.g. the median or mode (I don't have proofs either way for anything other than the arithmetic mean, though).
    • To calculate the variance (or the standard deviation) of the "sfGFP" distribution, you need to have information about how s and a0 covary (specifically, Variance(X+Y) = Variance(X) + Variance(Y) + 2*Covariance(X,Y), where Covariance(X,Y) captures that covariance information), which we typically don't have in our flow cytometry measurements. So you could perhaps still report the sample variance or the sample standard deviation of the "Compensated FL1" distribution, but I think it's incorrect to claim that they are equal to the variance or standard deviation of the "sfGFP" distribution.
    • Given FlowCal.plot.hist1d(s_compensated, channel='sfGFP'), I think it would be incorrect to render a histogram and label it as the "sfGFP" distribution, because we don't have enough information to determine the true sfGFP distribution.
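
The points above about summary statistics can be checked numerically. What follows is a minimal sketch under a hypothetical single-channel model (A = 1) in which f and a0 are drawn independently from made-up log-normal and gamma distributions. None of these choices come from FlowCal or from real data; they only mimic a low-expression sample whose signal is dominated by autofluorescence.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical single-channel model with A = 1: s = a0 + f.
f = rng.lognormal(mean=np.log(20.0), sigma=0.5, size=n)  # true fluorophore signal
a0 = rng.gamma(shape=2.0, scale=25.0, size=n)            # per-event autofluorescence
s = a0 + f                                               # measured signal

# Independent "white cell" control, used only to estimate the mean of a0.
a0_mean = rng.gamma(shape=2.0, scale=25.0, size=n).mean()

# Per-event compensation with the mean autofluorescence ("Compensated FL1").
compensated = s - a0_mean

for name, stat in [("mean", np.mean), ("median", np.median), ("std", np.std)]:
    print(f"{name:>6}: true f = {stat(f):8.2f}   compensated = {stat(compensated):8.2f}")
```

Under these assumptions the two means should agree to within sampling error, while the median and the standard deviation of the compensated values should differ visibly from those of f, because the spread of a0 is still present in every compensated event.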

@castillohair
Collaborator Author

castillohair commented Jun 10, 2017

Note: What I'm proposing is an implementation of the existing widely-used compensation method, which is based on "average" statistics. If we want to talk about a new technique that deconvolves the true fluorophore distribution and takes into account uncertainties, we should do that in another issue. But we know that @thoreusc has spent a lot of time on this and not arrived at something satisfactory, so it would likely be a complicated rabbit hole. My intention with this is to implement what I think is the biggest omission in FlowCal right now, and I think delaying this until we develop a superior technique is not a good approach.

> So you could perhaps still report the sample variance or the sample standard deviation of the "Compensated FL1" distribution, but I think it's incorrect to claim that they are equal to the variance or standard deviation of the "sfGFP" distribution.

This is a good point. Higher moments are not corrected by this compensation method, and changing the channel name to the fluorophore name misleadingly implies that they are. I'm gonna go with "Compensated [channel_name]" then.

@JS3xton
Contributor

JS3xton commented Jun 12, 2017

I'm on board with adding an existing, widely-used compensation method, and I agree that we shouldn't wait for (or maybe even hope for) a more sophisticated technique that more appropriately deconvolves the true fluorophore distribution.

I think we need to be precise with that implementation, though, and I think referring to the compensated data as "Compensated [channel name]", both to describe the distribution and its summary statistics, absolves you of any misdirection. I think it needs to be easy to figure out what the compensation formula was from the documentation, though (while widely used, I still consider it to be arbitrary). It might also be nice to allow the user to specify their own compensation function via the Python API, but I may be getting ahead of myself.
