Input File Format

Sam Minot edited this page Feb 5, 2020 · 2 revisions

You can find details on the content of the input files for Geneshot on the Running Geneshot page. This page covers the format of those files. The formatting requirements are minimal, but a few of them cause frequent problems.

Manifest - CSV

The manifest file (which describes which FASTQ files and metadata are assigned to which biological specimen) needs to be formatted as a CSV, which has just a couple of surprising gotchas. A Comma-Separated Values file is a text file where every record is on its own line, and the fields of each record are separated by commas. Excel can export this format easily, and it's also easy to make by directly editing a text file. However, there is one big issue that frequently arises: carriage returns.

If you've never heard of the carriage return, then it makes for minimally entertaining reading (why have two different characters to indicate a newline?) but there's really only one thing that you need to worry about:

When you make a CSV using Excel, it will sometimes (depending on your operating system) use a carriage return instead of (or in addition to) a newline character. This is extremely easy to fix with the dos2unix utility, which is available for Mac and Linux systems but may need to be installed first through a package manager.

tl;dr: If you have an error that looks like it's being caused by your manifest, run the dos2unix utility on your manifest CSV and try it again.
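If dos2unix is not at hand, the same cleanup can be approximated in a few lines of Python. This is only a rough stand-in, not part of Geneshot: the function names are made up for illustration, and the sketch converts both carriage-return-plus-newline pairs and lone carriage returns to plain newlines.

```python
def normalize_line_endings(data: bytes) -> bytes:
    """Convert CRLF and lone CR line endings to Unix newlines (LF)."""
    # Replace CRLF first so the lone-CR pass does not double the newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def fix_manifest(path: str) -> bool:
    """Rewrite the file in place; return True if anything was changed.

    Hypothetical helper: 'path' is whatever manifest CSV you pass in.
    """
    with open(path, "rb") as handle:
        original = handle.read()
    fixed = normalize_line_endings(original)
    if fixed != original:
        with open(path, "wb") as handle:
            handle.write(fixed)
    return fixed != original
```

Calling fix_manifest on your manifest CSV and checking the return value also tells you whether carriage returns were present in the first place.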

Other tips and tricks, mostly for Excel:

  • Don't include any blank lines above the start of your manifest
  • Don't include any blank lines in the middle of your manifest
  • Don't use any characters aside from A-Z, a-z, 0-9, _, or : (only in URLs)
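The blank-line tips above can be checked before running anything. The sketch below is an illustration rather than a Geneshot feature: the function name is hypothetical, and it treats a row made up only of commas as blank, on the assumption that an empty spreadsheet row may be exported that way.

```python
from typing import List


def find_blank_rows(text: str) -> List[int]:
    """Return the 1-based line numbers of rows that contain no values."""
    blank = []
    for number, line in enumerate(text.splitlines(), start=1):
        # An empty spreadsheet row may be written as "" or as ",,,".
        if line.replace(",", "").strip() == "":
            blank.append(number)
    return blank
```

An empty list means no blank rows were found; any numbers returned point at the lines to delete.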

WGS Reads - FASTQ

While FASTQ is an extremely widely used format, it is surprisingly devoid of standards. There are a couple of things that Geneshot expects, which might be helpful to consult if you run into any odd errors (such as Geneshot reporting that your input files contain no data).

For reference, a FASTQ file is made up of records (or reads), and each record consists of a header line, a nucleotide sequence line, a spacer (+), and a quality sequence line.

  • You need two FASTQ files, one for the forward reads (R1) and one for the reverse reads (R2)
  • The headers of the R1 and R2 files must match such that everything before the first whitespace is identical for the paired reads
  • Nucleotide sequences must match quality sequences in length
  • Every record in R1 must have a matching record in R2
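The length and record-count rules in this list can be checked with a short script. What follows is a minimal sketch, not Geneshot's own validation: the function names are hypothetical, and it assumes uncompressed FASTQ text in which each record takes exactly four lines (no wrapped sequences).

```python
from typing import Iterator, List, Tuple


def read_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    lines = [line.rstrip("\r\n") for line in lines]
    # Ignore empty lines at the very end of the file.
    while lines and lines[-1] == "":
        lines.pop()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ line count is not a multiple of four")
    for i in range(0, len(lines), 4):
        header, seq, spacer, qual = lines[i:i + 4]
        if not header.startswith("@") or not spacer.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        yield header, seq, qual


def check_pair(r1_lines: List[str], r2_lines: List[str]) -> List[str]:
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    r1 = list(read_fastq(r1_lines))
    r2 = list(read_fastq(r2_lines))
    # Every record in R1 needs a partner in R2, so the counts must agree.
    if len(r1) != len(r2):
        problems.append(f"R1 has {len(r1)} records but R2 has {len(r2)}")
    for name, records in (("R1", r1), ("R2", r2)):
        for n, (_header, seq, qual) in enumerate(records, start=1):
            if len(seq) != len(qual):
                problems.append(
                    f"{name} record {n}: sequence and quality lengths differ"
                )
    return problems
```

The functions take lists of lines, so they can be fed the result of readlines() on each opened file. Loading everything into memory is fine for a quick check on a small file but not for full-size sequencing runs.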

To explain the header issue a bit more, the following format is NOT accepted:

R1.fastq:

@HEADER_READ1/1
ATCGATCGATCGATCGATCG
+
IIIIIGIAGIIIHHIIIIII

R2.fastq:

@HEADER_READ1/2
GATCGATCGATCGATCGATC
+
IIIIIGIAGIIIHHIIIIII

While the following format IS accepted:

R1.fastq:

@HEADER_READ1 1
ATCGATCGATCGATCGATCG
+
IIIIIGIAGIIIHHIIIIII

R2.fastq:

@HEADER_READ1 2 it really doesn't matter what I put after the whitespace
GATCGATCGATCGATCGATC
+
IIIIIGIAGIIIHHIIIIII
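The difference between the two examples comes down to what sits before the first whitespace in each header. As a small illustration (the function names are hypothetical and this is not Geneshot's own code), the comparison can be sketched as:

```python
def read_id(header: str) -> str:
    """Return the part of a FASTQ header before the first whitespace."""
    parts = header.split(maxsplit=1)
    return parts[0] if parts else ""


def headers_match(r1_header: str, r2_header: str) -> bool:
    """True if both headers share the same text before the first whitespace."""
    return read_id(r1_header) == read_id(r2_header)
```

With the headers above, "@HEADER_READ1/1" and "@HEADER_READ1/2" contain no whitespace, so the whole header is compared and the pair does not match. In the accepted form, both headers reduce to "@HEADER_READ1" and the pair matches.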