genome-sequence-io

Reads and writes several bioinformatics sequence formats: currently BED, GFF3 (plus GTF and GVF), FASTA, UCSC chain (genome alignment), and pre-MAKEPED (pedigree). VCF readers and writers live at https://github.com/PharmGKB/vcf-parser instead.

This project has moderately high test coverage and is quite usable. However, it's incomplete, so subsequent versions may break backwards-compatibility.

Build instructions

The project is not currently on Maven Central. To JAR all subprojects, run gradle jarAll. To build a single subproject, run gradle :xxx:jar, where xxx is the name of the subproject (for example, gradle :gff:jar).

You can also run tests with gradle :xxx:test and compile (without JARing) using gradle :xxx:build. Note that running gradle :gff:test will only run tests for gff and core.

Examples

// Store GFF3 (or GVF, or GTF) features into a list
List<Gff3Feature> features = new Gff3Parser().collectAll(inputFile);
features.get(0).getType(); // the parser unescaped this string

// Now write the lines:
new Gff3Writer().writeToFile(outputFile); 
// The writer percent-encodes GFF3 fields as necessary
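
To make the escaping concrete, here is a minimal sketch of percent-encoding for a GFF3 attribute value. It is not the library's code: the class Gff3EscapeSketch and its encode and decode methods are hypothetical names, and the set of escaped characters (control characters, '%', and the column-9 separators ';', '=', '&', ',') is an assumption based on the GFF3 specification. The sketch handles only single-byte escapes and does not validate malformed %XX sequences.

```java
public class Gff3EscapeSketch {

    // Assumed reserved set: '%' itself plus the characters that separate
    // attributes and values in GFF3 column 9.
    private static final String RESERVED = ";=&,%";

    /** Percent-encodes control and reserved characters; everything else passes through. */
    public static String encode(String raw) {
        StringBuilder sb = new StringBuilder();
        for (char c : raw.toCharArray()) {
            if (c < 0x20 || c == 0x7F || RESERVED.indexOf(c) >= 0) {
                sb.append('%').append(String.format("%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Reverses encode(): each %XX triplet becomes the character with that hex code. */
    public static String decode(String encoded) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%' && i + 2 < encoded.length()) {
                // Throws NumberFormatException on non-hex digits; a real parser
                // would report that as a format error instead.
                sb.append((char) Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String encoded = encode("a;b=c");
        System.out.println(encoded);         // prints a%3Bb%3Dc
        System.out.println(decode(encoded)); // prints a;b=c
    }
}
```

java.net.URLDecoder is deliberately avoided in this sketch because it also turns '+' into a space, which is form-encoding behavior rather than plain percent-decoding.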

// From a BED file, get distinct chromosome names that start with "chr", in parallel
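List<String> chromosomes = Files.lines(file).map(new BedParser())
     .parallel()
     .map(BedFeature::getChromosome).distinct()
     .filter(chr -> chr.startsWith("chr"))
     .collect(Collectors.toList()); // a terminal operation is needed to run the stream
// You can also use new BedParser().parseAll(file)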

// From a pre-MAKEPED file, who are Harry Johnson's children?
Pedigree pedigree = new PedigreeParser.Builder().build().apply(Files.lines(file));
NavigableSet<Individual> children = pedigree.getFamily("Johnsons")
                                            .find("Harry Johnson")
                                            .getChildren();

// Traverse through a family pedigree in topological order
Pedigree pedigree = new PedigreeParser.Builder().build().apply(Files.lines(file));
Stream<Individual> individuals = pedigree.getFamily("Johnsons")
                                         .topologicalOrderStream();
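
For readers unfamiliar with the term, a topological order of a pedigree lists every individual after both of their parents. The following is a rough sketch of one standard way to compute such an order (Kahn's algorithm). It is not how the library implements topologicalOrderStream; the class PedigreeOrderSketch, the method topologicalOrder, and the map-of-IDs representation are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PedigreeOrderSketch {

    /**
     * Kahn's algorithm. parentsOf maps each individual's ID to the IDs of their
     * parents (an empty list for founders). Every parent must also be a key.
     */
    public static List<String> topologicalOrder(Map<String, List<String>> parentsOf) {
        Map<String, Integer> unplacedParents = new HashMap<>();
        Map<String, List<String>> childrenOf = new HashMap<>();
        for (Map.Entry<String, List<String>> e : parentsOf.entrySet()) {
            unplacedParents.put(e.getKey(), e.getValue().size());
            for (String parent : e.getValue()) {
                childrenOf.computeIfAbsent(parent, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        // Founders have no parents, so they can be emitted immediately.
        Deque<String> ready = new ArrayDeque<>();
        for (String id : parentsOf.keySet()) {
            if (unplacedParents.get(id) == 0) {
                ready.add(id);
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String next = ready.remove();
            order.add(next);
            // A child becomes ready once all of its parents have been emitted.
            for (String child : childrenOf.getOrDefault(next, Collections.emptyList())) {
                if (unplacedParents.merge(child, -1, Integer::sum) == 0) {
                    ready.add(child);
                }
            }
        }
        if (order.size() != parentsOf.size()) {
            throw new IllegalStateException("Pedigree has a cycle or a parent with no entry");
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> parentsOf = new LinkedHashMap<>();
        parentsOf.put("kid", Arrays.asList("mom", "dad"));
        parentsOf.put("mom", Arrays.asList("gran"));
        parentsOf.put("dad", Collections.emptyList());
        parentsOf.put("gran", Collections.emptyList());
        // "gran" appears before "mom", and both parents appear before "kid".
        System.out.println(topologicalOrder(parentsOf));
    }
}
```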

// "Lift over" coordinates using a UCSC chain file
// Filter out those that couldn't be lifted over
GenomeChain chain = new GenomeChainParser().apply(Files.lines(hg19ToGrch38ChainFile));
List<Locus> liftedOver = lociList.parallelStream()
                                 .map(chain)
                                 .filter(Optional::isPresent)
                                 .map(Optional::get) // unwrap so the result is a List<Locus>
                                 .collect(Collectors.toList());
// You can also use new GenomeChainParser().parse(hg19ToGrch38ChainFile)

// Read FASTA bases with a buffered random-access reader
RandomAccessFastaStream stream = new RandomAccessFastaStream.Builder(file)
                                 .setnCharsInBuffer(4096)
                                 .build();
char base = stream.read("gene_1", 58523);

// Suppose you have a 2GB FASTA file and a method smithWaterman that returns AlignmentResults
// Align each sequence and get the top 10 results, in parallel
try (FastaSequenceReader reader = new FastaSequenceReader.Builder(file).allowComments().build()) {
    List<AlignmentResult> topScores = reader.read()
        .parallel()
        .peek(sequence -> logger.info("Read {}", sequence.getHeader()))
        .map(sequence -> smithWaterman(sequence.getSequence(), reference))
        .sorted() // assuming AlignmentResult implements Comparable
        .limit(10)
        .collect(Collectors.toList());
}

Guiding principles

Where possible, a parser is a Function<String, R> or Function<Stream<String>, R>, and a writer is a Function<R, String> or Function<R, Stream<String>>. Callers are therefore expected to use Java 8 Streams.
Null values are generally banned from public methods in favor of Optional. See http://www.oracle.com/technetwork/articles/java/java8-optional-2175753.html for more information.
Most operations are thread-safe. Thread safety is annotated using javax.annotation.concurrent.
Top-level data classes are immutable, as annotated by javax.annotation.concurrent.Immutable.
The builder pattern is used for non-trivial classes. Each builder has a copy constructor.
Links to specifications are provided. Any interpretation used for an ambiguous specification is documented.
Parsing and writing are moderately strict. Severe violations throw a BadDataFormatException, and milder violations are logged as warnings using SLF4J. Not every aspect of a specification is validated.
For specification-mandated escape sequences, encoding and decoding are automatic.
Coordinates are always 0-based, even for 1-based formats. This is to ensure consistency as well as arithmetic simplicity.
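
Several of these principles can be illustrated together in a small sketch: a line parser that is a Function<String, R>, an immutable data class, Optional in place of null on the public API, and conversion of a 1-based input position to a 0-based one. None of the names here (LineParserSketch, Site, SiteParser) or the three-column input format come from genome-sequence-io; they are assumptions made for the example.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LineParserSketch {

    /** Hypothetical immutable data class; all fields are final and set once. */
    public static final class Site {
        private final String chromosome;
        private final long position; // always 0-based, whatever the file format used
        private final String name;   // may be null internally, never exposed as null

        Site(String chromosome, long position, String name) {
            this.chromosome = chromosome;
            this.position = position;
            this.name = name;
        }

        public String getChromosome() { return chromosome; }
        public long getPosition() { return position; }
        public Optional<String> getName() { return Optional.ofNullable(name); }
    }

    /**
     * Hypothetical parser for lines of the form chromosome, 1-based position,
     * name (or "." when absent), separated by tabs.
     */
    public static final class SiteParser implements Function<String, Site> {
        @Override
        public Site apply(String line) {
            String[] columns = line.split("\t");
            if (columns.length != 3) {
                throw new IllegalArgumentException("Expected 3 tab-separated columns: " + line);
            }
            long oneBased = Long.parseLong(columns[1]);
            String name = columns[2].equals(".") ? null : columns[2];
            return new Site(columns[0], oneBased - 1, name); // shift to 0-based
        }
    }

    public static void main(String[] args) {
        // Because the parser is a Function, it plugs straight into Stream.map.
        List<Site> sites = Stream.of("chr1\t100\trs1", "chr2\t1\t.")
                .map(new SiteParser())
                .collect(Collectors.toList());
        for (Site site : sites) {
            System.out.println(site.getChromosome() + " " + site.getPosition()
                    + " " + site.getName().orElse("(unnamed)"));
        }
    }
}
```

Keeping the parser a plain Function means the same object works on a single line, inside Stream.map, or composed with andThen, without any parser-specific plumbing.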
