
Missing Fastq reads #2385

Open
SidWeng opened this issue Mar 24, 2023 · 1 comment

Comments

@SidWeng

SidWeng commented Mar 24, 2023

adam-core version: 0.33.0
Spark version: 3.3.0
Scala version: 2.12

I read a FASTQ BGZ file with the following code:

spark.sparkContext.newAPIHadoopFile(url, classOf[SingleFastqInputFormat], classOf[Void], classOf[Text], conf)

It works fine if the file is about 70 GB.
However, when the file size is about 170 GB, some reads are missing (the missing reads are well-formed).
The missing reads can be found if I read the file line by line:

spark.sparkContext.newAPIHadoopFile(url, classOf[TextInputFormat], classOf[Void], classOf[Text], conf)

Is there any limitation of SingleFastqInputFormat, or any advice that could help me debug this issue?

@heuermh

heuermh commented Apr 3, 2023

Hello @SidWeng!

I have occasionally seen issues with gzipped/BGZF FASTQ input before, although typically with paired reads, where ADAM complains that the two files do not contain the same number of reads. If you know of any publicly available datasets that demonstrate this issue, I can dig into it deeper.

As a workaround, you may be able to convert to unaligned BAM format first and then read into ADAM.

Another workaround would be to convert the FASTQ into CSV or tab-delimited format, read the text file with Spark SQL, and convert it into ADAM format, something like

import org.bdgenomics.adam.ds.ADAMContext._

val sql = """
SELECT
  _c0 AS name,
  CAST(NULL AS STRING) AS description,
  'DNA' AS alphabet,
  upper(_c1) AS sequence,
  length(_c1) AS length,
  _c2 AS qualityScores,
  CAST(NULL AS STRING) AS sampleId,
  CAST(NULL AS MAP<STRING,STRING>) AS attributes
FROM
  reads
"""

val df = spark.read.option("delimiter", "\t").csv(inputPath)
df.createOrReplaceTempView("reads")
val readsDf = spark.sql(sql)
val reads = sc.loadReads(readsDf)
reads.saveAsParquet(outputPath)
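The FASTQ-to-tab-delimited conversion step itself is left implicit above. A minimal sketch of it in plain Scala (the `fastqToTsv` helper is hypothetical, not an ADAM API) could be:

```scala
// Hypothetical helper (not part of ADAM): turns four-line FASTQ records into
// the tab-delimited name/sequence/qualityScores lines that the Spark SQL
// query above expects as _c0, _c1, _c2.
def fastqToTsv(lines: Iterator[String]): Iterator[String] =
  lines.grouped(4).collect {
    // A well-formed record is: @name, sequence, '+' separator line, qualities.
    case Seq(name, sequence, plus, quality)
        if name.startsWith("@") && plus.startsWith("+") =>
      s"${name.drop(1)}\t$sequence\t$quality"
  }
```

For a 170 GB input this would run as a Spark job rather than a single iterator, and some care is needed to keep each four-line record inside one partition; the per-record transformation is the same, though.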
