
Spare not needed groupBy when calling toFragments() on AlignmentDataset #2281

Open
benraha opened this issue Nov 7, 2020 · 4 comments

benraha (Contributor) commented Nov 7, 2020

Hi!

I'm running a process that pre-processes a bunch of reads before aligning them with Bowtie. Most of them are unpaired, so when I call toFragments(), the reads go through a groupBy() for no actual benefit. Is there a way to skip this groupBy?
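The cost being described can be illustrated with a pure-Scala sketch (the `Read` and `Frag` types here are stand-ins, not ADAM's classes, and the real toFragments implementation differs): grouping reads by name puts mates into the same fragment, but when every read is unpaired each group has size 1, so the shuffle buys nothing over a plain 1:1 map.

```scala
// Stand-in record types for illustration only (not ADAM's Alignment/Fragment).
case class Read(name: String, sequence: String)
case class Frag(name: String, reads: Seq[Read])

// What toFragments conceptually does: group reads by name so mates land in
// the same fragment. On a Spark RDD this groupBy implies a shuffle.
def toFragmentsViaGroupBy(reads: Seq[Read]): Seq[Frag] =
  reads.groupBy(_.name).map { case (n, rs) => Frag(n, rs) }.toSeq

// When the reads are known to be unpaired, a 1:1 map is enough: no grouping,
// and on Spark no shuffle.
def toFragmentsUnpaired(reads: Seq[Read]): Seq[Frag] =
  reads.map(r => Frag(r.name, Seq(r)))
```

For unpaired input the two produce the same fragments, which is why the groupBy feels wasted.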

Looking at the code, I think we can add a flag that signals when we know for sure that the reads are unpaired. When we are unsure, we'd do the groupBy anyway (or let the user tell us via a parameter on loadAlignments).

I'd love to implement it.

WDYT?
Ben

@benraha benraha changed the title Save not needed groupBy when calling toFragments() on AlignmentDataset Spare not needed groupBy when calling toFragments() on AlignmentDataset Nov 7, 2020
benraha (Contributor, Author) commented Nov 10, 2020

@heuermh Would love your thoughts on that before I implement it.

heuermh (Member) commented Nov 10, 2020

If all you want to do is a straight conversion 1:1 of Alignment to Fragment, there are the transmute/transmuteDataFrame/transmuteDataset APIs, e.g.

https://javadoc.io/static/org.bdgenomics.adam/adam-core-spark3_2.12/0.32.0/org/bdgenomics/adam/rdd/read/AlignmentDataset.html#transmute[X,Y%3C:Product,Z%3C:org.bdgenomics.adam.rdd.GenomicDataset[X,Y,Z]](tFn:org.apache.spark.api.java.function.Function[org.apache.spark.api.java.JavaRDD[T],org.apache.spark.api.java.JavaRDD[X]],convFn:org.apache.spark.api.java.function.Function2[V,org.apache.spark.rdd.RDD[X],Z]):Z

Examples of this can be found in the unit tests:
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentDatasetSuite.scala#L126
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentDatasetSuite.scala#L1543
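The transmute pattern can be sketched with a toy analogue (these are not ADAM's actual classes or signatures; see the javadoc linked above for the real API): a dataset wrapper applies a user-supplied 1:1 function over its underlying records and re-wraps the result, with no grouping step involved.

```scala
// Toy stand-ins for illustration only, not ADAM's Alignment/Fragment/dataset types.
case class Alignment(readName: String, sequence: String)
case class Fragment(name: String, alignments: Seq[Alignment])

// A minimal dataset wrapper mimicking the shape of the transmute API: apply a
// record-level transform function and re-wrap the output as a new dataset.
case class Dataset[T](records: Seq[T]) {
  def transmute[X](tFn: Seq[T] => Seq[X]): Dataset[X] = Dataset(tFn(records))
}

val reads = Dataset(Seq(Alignment("r1", "ACGT")))

// Straight 1:1 conversion of each Alignment into a single-read Fragment.
val frags = reads.transmute[Fragment](_.map(a => Fragment(a.readName, Seq(a))))
```

The point is that the transform is per-record, so an unpaired Alignment-to-Fragment conversion needs no groupBy at all.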

I think a new method toUnpairedFragments() that leaves out the groupBy might be ok.

Then, for calling Bowtie in Cannoli, we have bowtie2, a function FragmentDataset → AlignmentDataset, and singleEndBowtie2, a function AlignmentDataset → AlignmentDataset. If starting from a mixed set of reads, you could filter out the unpaired reads and run them separately through singleEndBowtie2, so as not to incur the cost of toFragments, and then union the results together.

There isn't currently a singleEndBowtie in Cannoli but I doubt it would be difficult to add one.
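The split-and-union idea above can be sketched in pure Scala (the `Read` type and the aligner function parameters are stand-ins, not Cannoli's API): partition the mixed input into paired and unpaired reads, send each partition through the appropriate entry point, then union the results.

```scala
// Stand-in read type; `paired` mimics the SAM paired flag.
case class Read(name: String, paired: Boolean)

// pairedPath stands in for the toFragments + bowtie2 route, singleEndPath for
// a singleEndBowtie2-style route that avoids toFragments entirely.
def alignMixed(reads: Seq[Read],
               pairedPath: Seq[Read] => Seq[Read],
               singleEndPath: Seq[Read] => Seq[Read]): Seq[Read] = {
  val (paired, unpaired) = reads.partition(_.paired)
  pairedPath(paired) ++ singleEndPath(unpaired)
}
```

Only the paired subset pays the fragment-grouping cost; the unpaired subset bypasses it.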

benraha (Contributor, Author) commented Nov 10, 2020

These are good options, but I'd like to use the knowledge ADAM already has about the data instead of relying on the user to supply it. Or is there some problem with that approach I'm not aware of?

Something like this (based on loadAlignments):

BAM -> unpaired
Interleaved FASTQ -> paired
FASTQ -> paired / unpaired, as ADAM works today
FASTA -> unpaired?
Parquet -> can be paired
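The mapping above could be sketched as a simple dispatch (the names here are hypothetical, not ADAM API): infer pairedness from the input format at load time, so toFragments can skip the groupBy whenever the reads are known to be unpaired.

```scala
// Hypothetical pairedness marker, not part of ADAM.
sealed trait Pairedness
case object Paired extends Pairedness
case object Unpaired extends Pairedness
case object Unknown extends Pairedness // fall back to the groupBy

// Mirrors the format list proposed above; see the caveat that follows about
// BAM files in practice containing mixed paired/unpaired reads.
def pairednessFor(format: String): Pairedness = format match {
  case "BAM"              => Unpaired
  case "InterleavedFastQ" => Paired
  case "FASTQ"            => Unknown // paired or unpaired, as today
  case "FASTA"            => Unpaired
  case "Parquet"          => Unknown // can be paired
  case _                  => Unknown
}
```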

heuermh (Member) commented Nov 10, 2020

Those assumptions can fall apart, though: in my experience, BAM/CRAM/SAM files can contain paired reads, unpaired reads, aligned reads, and unaligned reads. It is common to use unaligned BAM (uBAM) in workflows instead of FASTQ because it compresses better.

We would of course encourage the use of Parquet, because it compresses better, doesn't have problems with split guessing, can take advantage of push-down predicates and column projection, and can be read and written concurrently in distributed fashion across a cluster. 😉

That said, please feel free to suggest changes!
