bioio

Efficient, high-quality streaming parsers and writers for 16 text-based formats used in bioinformatics.

The goal is to have the best possible parsers for the most hated and problematic formats.

Supported formats:

VCF (4.2)
VFF
GenBank
BED
GFF2, GFF3, GTF, and GVF
FASTA
FASTA alignment
FASTQ
UCSC liftOver format
pre-MAKEPED LINKAGE
BGEE expression format
Turtle and RDF
Delimited text (e.g., CSV)

Features & choices:

Reads and writes Java Streams, keeping only essential metadata in memory.
Parses every part of a format, leaving nothing as text unnecessarily.
Has a consistent API. Coordinates are always 0-indexed and text is always escaped as per the specification.
Immutable, thread-safe, null-pointer-safe (Optional<>), and arbitrary-precision.
All methods are in interfaces, records, enums, or final classes.
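
As a rough illustration of what specification-mandated escaping involves, here is a minimal sketch of GFF3-style percent-encoding. It is independent of bioio: the class Gff3EscapeSketch and its methods are hypothetical names made up for this example, and the reserved set is the one the GFF3 specification lists for the attributes column (control characters, %, ;, =, & and ,).

```java
/** Hypothetical helper, not part of bioio: percent-encoding for GFF3 attribute values. */
final class Gff3EscapeSketch {

	/** Encodes control characters and the characters reserved in GFF3 column 9. */
	static String encode(String text) {
		StringBuilder sb = new StringBuilder();
		for (char c : text.toCharArray()) {
			boolean reserved = c < 0x20 || c == 0x7F || "%;=&,".indexOf(c) >= 0;
			if (reserved) {
				// "%%" is a literal percent sign; "%02X" is two uppercase hex digits
				sb.append(String.format("%%%02X", (int) c));
			} else {
				sb.append(c);
			}
		}
		return sb.toString();
	}

	/** Reverses encode(); rejects truncated or non-hexadecimal escapes. */
	static String decode(String text) {
		StringBuilder sb = new StringBuilder();
		for (int i = 0; i < text.length(); i++) {
			char c = text.charAt(i);
			if (c != '%') {
				sb.append(c);
				continue;
			}
			if (i + 2 >= text.length()) {
				throw new IllegalArgumentException("Truncated escape at index " + i);
			}
			int hi = Character.digit(text.charAt(i + 1), 16);
			int lo = Character.digit(text.charAt(i + 2), 16);
			if (hi < 0 || lo < 0) {
				throw new IllegalArgumentException("Invalid escape at index " + i);
			}
			sb.append((char) (hi * 16 + lo));
			i += 2; // skip the two hex digits just consumed
		}
		return sb.toString();
	}

	public static void main(String[] args) {
		String raw = "a;b=c\t100%";
		String encoded = encode(raw);
		System.out.println(encoded);
		System.out.println(decode(encoded).equals(raw));
	}
}
```

A parser would apply something like decode when reading a field and a writer something like encode when writing it, so calling code only ever sees the unescaped text.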

Example:

This example reads, filters, and writes a VCF file.

import org.pharmgkb.parsers.vcf.*;
import org.pharmgkb.parsers.vcf.model.*;

Stream<VcfPosition> mitochondrialCalls = new VcfDataParser().parseFile(path)
	.filter(p -> p.chromosome().isMitochondial());

new VcfDataWriter().writeToFile(mitochondrialCalls, filteredPath);

Build/install

Compatible with Java 21 LTS and higher. You can get the artifacts from Maven Central.

Maven

<dependency>
    <groupId>org.pharmgkb</groupId>
    <artifactId>bioio</artifactId>
    <version>0.3.0</version>
</dependency>

Gradle

implementation 'org.pharmgkb:bioio:0.3.0'

SBT

"org.pharmgkb" % "bioio" % "0.3.0"

Pre-built JAR

Releases contain both fat JARs (containing dependencies) and thin JARs (without dependencies), independently for each subproject (e.g. bioio-vcf for VCF, or bioio-gff for GFF/GTF/GVF).

You can build artifacts from a source checkout using Gradle:

To JAR all subprojects, run gradle jarAll
To build a single subproject (e.g. VCF), run gradle :vcf:jar

Examples

This long list of examples showcases many of the parsers. For added flavor, they also use various methods for IO (parseAll, etc.) and various Stream functions (parallel(), collect, flatMap, etc.).

// Store GFF3 (or GVF, or GTF) features into a list
List<Gff3Feature> features = new GffParser.Builder().build().collectAll(inputFile);
features.get(0).type(); // the parser unescaped this string

// Now write the lines:
new Gff3Writer.Builder().build().writeToFile(features.stream(), outputFile);
// The writer percent-encodes GFF3 fields as necessary

// From a BED file, get distinct chromosome names that start with "chr", in parallel
Files.lines(file)
	.map(new BedParser())
	.parallel()
	.map(BedFeature::chromosome)
	.distinct()
	.filter(chr -> chr.startsWith("chr"));
// You can also use new BedParser().parseAll(file)

// From a pre-MAKEPED file, who are Harry Johnson's children?
Pedigree pedigree = new PedigreeParser.Builder().build().apply(Files.lines(file));
NavigableSet<Individual> children = pedigree.family("Johnsons")
	.find("Harry Johnson")
	.children();

// Traverse through a family pedigree in topological order
Pedigree pedigree = new PedigreeParser.Builder().build().apply(Files.lines(file));
Stream<Individual> individuals = pedigree.family("Johnsons")
	.topologicalOrder();

// "Lift over" coordinates using a UCSC chain file
// Filter out those that couldn't be lifted over
GenomeChain chain = new GenomeChainParser().apply(Files.lines(hg19ToGrch38ChainFile));
List<Locus> liftedOver = lociList.parallelStream()
	.map(chain)
	.flatMap(Optional::stream) // drops loci that could not be lifted over
	.toList();
// You can also use new GenomeChainParser().parse(hg19ToGrch38ChainFile)

// Print formal species names from a GenBank file
Path input = Paths.get("plasmid.genbank");
new GenbankParser().parseAll(input)
	.filter(record -> record instanceof SourceAnnotation)
	.map(record -> ((SourceAnnotation) record).formalName())
	.forEach(System.out::println);

// Parse a GenBank file
// Get the set of "color" properties of features on the complement starting before the sequence
Set<String> properties = new GenbankParser().parseAll(input)
	.filter(record -> record instanceof FeaturesAnnotation)
	.flatMap(record -> ((FeaturesAnnotation) record).features())
	.filter(feature -> feature.range().isComplement())
	.filter(feature -> feature.range().start() < 0)
	.flatMap(feature -> feature.properties().entrySet().stream())
	.filter(prop -> prop.getKey().equals("color"))
	.map(prop -> prop.getValue())
	.collect(Collectors.toSet());

// Read FASTA bases with a buffered random-access reader
RandomAccessFastaStream stream = new RandomAccessFastaStream.Builder(file)
	.setnCharsInBuffer(4096)
	.build();
char base = stream.read("gene_1", 58523);

// Suppose you have a 2GB FASTA file
// and a method smithWaterman that returns AlignmentResults
// Align each sequence and get the top 10 results, in parallel
MultilineFastaSequenceParser parser = new MultilineFastaSequenceParser.Builder().build();
List<AlignmentResult> topScores = parser.parseAll(Files.lines(fastaFile))
	.parallel()
	.peek(sequence -> logger.info("Aligning {}", sequence.header()))
	.map(sequence -> smithWaterman(sequence.sequence(), reference))
	.sorted() // assuming AlignmentResult implements Comparable and sorts best-first
	.limit(10)
	.toList();

// Stream Triples in Turtle format from a URL
/*
@prefix myPrefix: <https://abc#owner> .
<https://abc#cat> "belongsTo" @myPrefix ;
	"hasSynonym" <https://abc#feline> .
 */
// usePrefixes=true will replace prefixes
TripleParser parser = new TripleParser(true);
List<Triple> triples;
try (
  BufferedReader reader = new BufferedReader(
    new InputStreamReader(((HttpURLConnection) myUrl.openConnection()).getInputStream())
  )
) {
	// Consume the lazy stream here, before the reader is closed
	triples = reader.lines().map(parser).toList();
}
// contains:  List[ https://abc#cat belongsTo https://abc#owner , \
// https://abc#cat hasSynonym https://abc#feline ]
List<Prefix> prefixes = parser.prefixes();

// Parse VCF, validate it,
// and write a new VCF file containing only positions whose QUAL field
// is at least 10, each with its FILTER field cleared
// short-circuits during read:
VcfMetadataCollection metadata = new VcfMetadataParser().parse(input);
Stream<VcfPosition> data = new VcfDataParser().parseAll(input)
	.filter(p -> p.quality().stream().anyMatch(q -> q.greaterThanOrEqual("10")))
	.map(p -> new VcfPosition.Builder(p).clearFilters().build())
	// verify consistent with metadata:
	.peek(new VcfValidator.Builder(metadata).warnOnly().build());
new VcfMetadataWriter().writeToFile(metadata.lines(), output);
new VcfDataWriter().appendToFile(data, output);

// From a VCF file, associate every GT with its number of occurrences, in parallel
Map<String, Long> genotypeCounts = new VcfDataParser().parseAll(input)
	.parallel()
	.flatMap(p -> p.samples().stream())
	.filter(s -> s.containsKey(ReservedFormatProperty.Genotype))
	.map(s -> s.get(ReservedFormatProperty.Genotype).get())
	.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

// Parse a tab-delimited matrix into arbitrary-precision numbers
Stream<GeneralizedBigDecimal> values = MatrixParserI.tabs().parseAll(file).map(GeneralizedBigDecimal::new);

Principles

Where possible, a parser is a Function<String, R> or Function<Stream<String>, R>, and a writer is a Function<R, String> or Function<R, Stream<String>>. Java 8+ Streams are expected to be used.
Null values are banned from public methods in favor of Optional. See https://www.oracle.com/technetwork/articles/java/java8-optional-2175753.html for more information.
Most operations are thread-safe. Thread safety is annotated using javax.annotation.concurrent.
Top-level data classes are immutable, as annotated by javax.annotation.concurrent.Immutable.
The builder pattern is used for non-trivial classes. Each builder has a copy constructor.
Links to specifications are provided. Any choice made in an ambiguous specification is documented.
Parsing and writing are moderately strict. Severe violations throw a BadDataFormatException, and milder violations are logged as SLF4J warnings. Not every aspect of a specification is validated.
For specification-mandated escape sequences, encoding and decoding are automatic.
Coordinates are always 0-based, even for 1-based formats. This is to ensure consistency and arithmetic simplicity.
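
The first and last principles combine naturally. The sketch below is not bioio code: Interval and OneBasedIntervalParser are hypothetical names, and the half-open end convention is an assumption of the sketch rather than a statement about the library. It shows a line parser that is a Function<String, R>, so it plugs straight into Stream.map, and that shifts a 1-based start to a 0-based one.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Stream;

final class ParserAsFunctionDemo {

	/** Because the parser is a Function, it can be passed directly to Stream.map. */
	static List<Interval> parse(Stream<String> lines) {
		return lines.map(new OneBasedIntervalParser()).toList();
	}

	public static void main(String[] args) {
		parse(Stream.of("chr1\t1\t100", "chrM\t5\t9")).forEach(System.out::println);
	}
}

/** Hypothetical immutable data class; coordinates are 0-based and half-open (assumed). */
record Interval(String chromosome, long start, long end) {}

/** Hypothetical parser for a tab-delimited line with 1-based, closed coordinates. */
final class OneBasedIntervalParser implements Function<String, Interval> {

	@Override
	public Interval apply(String line) {
		String[] parts = line.split("\t");
		if (parts.length != 3) {
			throw new IllegalArgumentException("Expected 3 columns but got " + parts.length);
		}
		long start = Long.parseLong(parts[1]);
		long end = Long.parseLong(parts[2]);
		// 1-based closed [start, end] becomes 0-based half-open [start - 1, end)
		return new Interval(parts[0], start - 1, end);
	}
}
```

With every format normalized this way, the length of an interval is always end minus start, whichever format it was read from.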

Pitfalls

Never reuse a parser for a new stream. Some parsers need to track some metadata on the stream. For example, the multiline FASTQ parser needs to know the length of the last sequence. (Otherwise, it’s impossible to know where a score ends and a new header begins!)
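
To make the pitfall concrete, here is a simplified, hypothetical multiline FASTQ parser (FastqSketchParser and Read are invented for this example and are not the bioio classes). It has to remember the sequence length to decide when the quality string is complete, so any state left over from one stream leaks into the next.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.stream.Stream;

final class StatefulParserDemo {

	/** Parses a whole stream with the given parser; the stream must be sequential. */
	static List<Read> parse(FastqSketchParser parser, Stream<String> lines) {
		return lines.map(parser).flatMap(Optional::stream).toList();
	}

	public static void main(String[] args) {
		// The second quality line starts with '@', yet it is not a header
		Stream<String> lines = Stream.of("@r1", "AC", "GT", "+", "II", "@I");
		// Correct usage: a new parser for every stream
		System.out.println(parse(new FastqSketchParser(), lines));
	}
}

/** Hypothetical record type for this sketch. */
record Read(String header, String sequence, String quality) {}

/** Hypothetical stateful line parser: returns a Read only once a record is complete. */
final class FastqSketchParser implements Function<String, Optional<Read>> {

	private String header = null; // null means "expecting a header line"
	private final StringBuilder sequence = new StringBuilder();
	private final StringBuilder quality = new StringBuilder();
	private boolean inQuality = false;

	@Override
	public Optional<Read> apply(String line) {
		if (header == null) {
			if (!line.startsWith("@")) {
				throw new IllegalArgumentException("Expected a header but got: " + line);
			}
			header = line.substring(1);
			return Optional.empty();
		}
		if (!inQuality) {
			if (line.startsWith("+")) {
				inQuality = true;
			} else {
				sequence.append(line);
			}
			return Optional.empty();
		}
		quality.append(line);
		if (quality.length() < sequence.length()) {
			return Optional.empty(); // more quality lines still to come
		}
		Read read = new Read(header, sequence.toString(), quality.toString());
		// Reset the state for the next record
		header = null;
		sequence.setLength(0);
		quality.setLength(0);
		inQuality = false;
		return Optional.of(read);
	}
}
```

If one instance were reused after a truncated stream, the stale header and partial quality string would be combined with the first lines of the next stream. Creating a new parser per stream avoids this.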

License, authors, & contributing

Licensed under the Mozilla Public License, version 2.0.

Please refer to the contributing guide.

Credits:

Douglas Myers-Turnbull (design and parsers)
Mark Woon (bug fixes and code review)
the Stanford University School of Medicine
the Pharmacogenomics Knowledge Base at Stanford
the University of California, San Francisco (UCSF)
