
Update on non-UMI data #94

Open
JuliaP138 opened this issue Feb 5, 2021 · 7 comments


@JuliaP138

Hi there,

I know this has been asked before, but I wanted to check whether there is any update on using sctransform with non-UMI data.

@RobertWhitener

I'm also interested in any updates. My group is currently deciding which normalization method to use for our non-UMI dataset. We are leaning towards log-normalization because we have non-UMI data, but SCTransform does look very attractive, as our dataset is highly heterogeneous (we are doing an organ atlas project).

Curious whether applying SCTransform to non-UMI data might also have downstream effects on differential expression analysis and other steps.

Thanks!

@ChristophH
Collaborator

Hi @JuliaP138 and @RobertWhitener

I know from anecdotal experience that sctransform works well with non-UMI data. However, I have not done any formal testing, i.e. comparisons to other normalization methods using a diverse set of data (including non-heterogeneous control data).
I'd love to explore this more. If you point me to studies that are relevant to your research, or even better, provide me with the actual expression matrices, I would be happy to run some tests and share my findings here.

When you say non-UMI, which technologies are we talking about? Which upstream pipeline generates the expression matrix? Is the data still integer counts, or continuous?

@RobertWhitener

RobertWhitener commented Feb 10, 2021

Hi Christoph,

Thanks for the offer! In our case, we are using Smart-Seq2 based full length read counts. Our upstream pipeline is very similar to the Tabula Muris dataset, which may be a good example dataset that is available for download already.

https://www.nature.com/articles/s41586-018-0590-4

https://github.com/czbiohub/tabula-muris

The main difference is we are doing only the plate-based Illumina sequencing, not the 10X sequencing as well.

For initial filtering when we make the Seurat Object, we used min.cells = 10, min.features = 5.
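For reference, that filtering step can be sketched in Seurat as follows. The count matrix here is simulated and stands in for a real Smart-Seq2 genes-by-cells read-count matrix; the `min.cells`/`min.features` thresholds match the ones mentioned above.

```r
library(Seurat)

# Placeholder genes x cells matrix of simulated read counts,
# standing in for real Smart-Seq2 data.
set.seed(1)
counts_mat <- matrix(rnbinom(2000 * 100, mu = 5, size = 0.5),
                     nrow = 2000, ncol = 100,
                     dimnames = list(paste0("gene", 1:2000),
                                     paste0("cell", 1:100)))

# Keep genes detected in >= 10 cells and cells with >= 5 detected genes.
seu <- CreateSeuratObject(counts = counts_mat,
                          min.cells = 10, min.features = 5)
```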

Best,
Robert

@ChristophH
Collaborator

I have had a look at the Smart-Seq2 data and whether sctransform would be appropriate for normalization. The short answer is that it is not. When looking at it in detail, it becomes clear that the count distribution for Smart-Seq2 data looks quite different from UMI-based data. Even low counts are amplified such that there is a clear distinction between drop-outs (gene not detected) and low expression. In contrast, for UMI data this is a continuous gradient. The regression with a Negative Binomial model works, but it does not fit the data very well.
In the case of Smart-Seq2 data, I don't think the sctransform workflow provides any benefits compared to the ‘standard’ log-normalization workflow currently used by Seurat.
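For anyone comparing, the 'standard' log-normalization workflow referred to here is roughly the following; a minimal sketch assuming `seu` is a Seurat object built from Smart-Seq2 read counts:

```r
library(Seurat)

# Per-cell depth normalization, log1p transform with a pseudocount,
# then variable-feature selection and scaling before PCA.
seu <- NormalizeData(seu, normalization.method = "LogNormalize",
                     scale.factor = 10000)
seu <- FindVariableFeatures(seu, selection.method = "vst", nfeatures = 2000)
seu <- ScaleData(seu)
seu <- RunPCA(seu)
```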

@satijalab
Collaborator

Thanks Christoph for taking a look at this. I thought I'd mention some of my thoughts as well as we get this question quite often.

There are definitely differences in the distributions of UMI and non-UMI data, most importantly, the 'gap' between 0 expression and low expression as Christoph mentioned. The standard NB distribution does not model this 'gap', and therefore represents an imperfect fit to non-UMI data.

Having said that, we still observe good performance when applying the sctransform workflow to non-UMI data. As an example, we typically use sctransform normalization to map SS2 datasets to 10x references using Azimuth (see for example: https://app.azimuth.hubmapconsortium.org/app/human-motorcortex), and this works quite well. You can see that the method performs well in Christoph's notebook as well on the CZI data. One of the reasons for this is that we learn parameters directly from the data, even if the distribution itself is imperfect.
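For context, applying the sctransform workflow is a drop-in replacement for the standard normalization steps; a minimal sketch, assuming `seu` is a Seurat object holding raw counts (read counts in the SS2 case):

```r
library(Seurat)

# SCTransform replaces NormalizeData / FindVariableFeatures / ScaleData
# in one step; it fits a regularized negative-binomial model per gene
# and stores Pearson residuals for downstream dimensional reduction.
seu <- SCTransform(seu, verbose = FALSE)
seu <- RunPCA(seu)
```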

I agree that sctransform makes assumptions about the statistical nature of the data that aren't the best fit for non-UMI data, and you could therefore make the argument that you should use log-normalization for non-UMI datasets. However, log-normalization incorporates suboptimal assumptions/steps as well (in particular, assuming that all cells have the same total number of molecules, the log-transformation and pseudocount addition, and giving each gene equal weight in downstream analyses).

One option would be to modify the sctransform noise model to represent a mixture, and perhaps better account for the 'gap' in non-UMI data (we have not done this). Another option would be to apply some recently developed methods (https://www.nature.com/articles/nmeth.4150; https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02078-0), to convert read counts to 'quasi-UMIs', after which one could apply sctransform (or GLM-PCA).

The quasi-UMI transformation can be a bit slow, but if you'd like to use sctransform or a similar approach on SS2 data and are worried about the theoretical considerations, then I would recommend this solution.
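As a sketch of that route: the quasi-UMI transformation from the second paper above is implemented in the `quminorm` R package. The function name and the `shape` parameter below are assumptions based on my reading of that package, so check its documentation before relying on this.

```r
library(quminorm)  # quasi-UMI normalization (Townes & Irizarry)
library(Seurat)

# `read_counts` stands in for a genes x cells matrix of Smart-Seq2
# read counts. quminorm() maps read counts onto a quasi-UMI scale by
# fitting a power-law; `shape` is the assumed shape parameter.
qumi <- quminorm(read_counts, shape = 2.0)

# The quasi-UMI counts can then go through the usual sctransform workflow.
seu <- CreateSeuratObject(counts = qumi)
seu <- SCTransform(seu)
```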

We have tried this on a number of datasets. In general, we get quite similar results whether we run sctransform on read-level counts or quasi-UMI counts, though you can certainly try this yourself as well, and it may not be true in all cases.

Our conclusion from all of this is that while sctransform is best-suited to UMI datasets, it can be applied successfully in diverse contexts.

@RobertWhitener

Hi Christoph and Colleagues,

Thank you very much for looking into this. As with the Tabula Muris dataset, our samples comprise a very diverse set of cells, including epithelial, endothelial, neurons, immune, etc.

The discussion above gives us a lot to think about. It seems that unless we commit to techniques like the "quasi-UMIs", we should stick with the standard Seurat log-normalization method.

Best
Robert

@BiotechPedro

Dear all,

this is a super helpful and useful issue! I learnt a lot while reading it 😀

It has been a year since the last comment, and a new version (v2) of SCTransform has been released. How would you answer the questions above now? Also, there may be new methods out there to include in the discussion of how to 'correctly' model non-UMI data such as Smart-seq2. Would you mind pointing out some of them?

Thank you a lot!

Pedro
