Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for gene-level estimation and DE, along with documentation #4

Open
magnusrattray opened this issue Sep 30, 2014 · 6 comments
Open

Comments

@magnusrattray
Copy link
Member

No description provided.

@messersc
Copy link

I would like that, too.

Until then, could somebody sketch what an analysis at the gene level could look like?
Any help would be much appreciated.

@magnusrattray
Copy link
Member Author

If you are just estimating gene expression (stage 1) then you can sum up the transcript expression levels in FPKM or TPM units (not counts units - since length normalisation is important). In that case you can also just use our fast VB algorithm rather than MCMC is timing is an issue. If you are doing differential expression (stage 2) you have to do this sum for each MCMC sample in the original BitSeq.

@messersc
Copy link

messersc commented Feb 2, 2015

To recap and as a reference for everybody else (correct me if I'm wrong), I will put this here.

For the moment, I am interested in computing logFC to compare to edgeR/limma etc and also qPCR. Differential gene expression might be on the list later, though. Further, I already have done the analysis for transcripts in R, writing (most of?) the files to my working directory (so no temp files).
Further, I have compiled BitSeq for use with the command line.

Stage1: Just estimating gene expression, I can work with the results files from getExpression(). I already have transcript estimates from an earlier run (MCMC), but if I was to start from scratch the new VB method would be fine for computing means and logFC.

I have a per-replicate .mean file from my analysis, from reading the documentation I assume this is generated by getExpression() and is equal to the $exp bit of the resulting list.
From here, I un-log my rpkm means, sum up the replicates for my 2 conditions and compute the logFC. Right?

Stage2: "Summing each MCMC sample in the original Bitseq" is what getGeneExpression() does, I assume? That gives me a new table with estimates on gene level (~30000 genes x 1000 samples). How do I proceed from there?

@mqbssppe
Copy link
Member

mqbssppe commented Feb 2, 2015

Hi Clemens,

just to make sure: have you already used a *.tr file with proper genes-transcripts grouping and then called the getGeneExpression function or not?

Regarding you question 'How do I proceed from there?' I understand that you just want to compute the logFC on the gene-level and not to perform DE analysis. In this case, you could run the 'getVariance' function over the replicates of a given condition (each file should have ~30000 genes x 1000 samples as you described). This will produce a file <*.Lmean> containing the averaged expression (in the first column) across all replicates of the specific condition. Then you will have to do this for the replicates of the second condition. Then you can compute the logFC.

@messersc
Copy link

messersc commented Feb 2, 2015

Hello Panagiotis,

yes, I have a trInfo DataFrame/File with proper groupings (queried biomaRt for missing gene names) and I also have succesfully called getGeneExpression(), which gives me the mentioned files (without any file extension, though. ~30000 genes x 1000 samples down from ~80000 transcripts).
I will try 'getVariance', thanks for the advice.

Further questions that came up:

  • What is the difference between *.Lmean and *.mean files?
    (I have a .Lmean file from getDE(), but this is for all replicates without grouping.)
  • Where do the *.mean files come from? See my post above.
  • Is there a "mapping" between the commands for R and the shell version? It would be handy to know which shell command outputs which files, especially file extensions. Looking into my working directory, it's a jungle right now and is is hard to know which files to use, when I take off on the shell where I left with R.

Thanks for your help!

@petog
Copy link
Contributor

petog commented Feb 3, 2015

Hi,
the L in *.Lmean should stand for logged. So it should be the mean of log expression. *.mean files will most likely be just the mean expressions for each replicate.

Yes, there is a mapping. I think you should be able to run all R methods with pretend=TRUE option, in such case, the only output will be the actual shell commands that can be used to get the same results. If you run the methods with same arguments as before, you should see exactly which command produced which files (and also what were the command line arguments).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants