hts-rdf

Author: Pierre Lindenbaum PhD.

Here are a few notes about Managing sequencing data with RDF. I want to keep track of the samples, BAMs, references, diseases etc.. used in my lab.

This document is auto-generated using a Makefile. Do not edit it.
I don't want to use a SQL database.
I don't want to join too many tab delimited files.
I want to use a controlled vocabulary to define things like diseases, organims, etc...
This document is NOT a tutorial for RDF or SPARQL.
I use the RDF+XML notation because I 'm used to work with XML.
I created a namespace for my lab: https://umr1087.univ-nantes.fr/rdf/ and a XML entity for this namespace: &u1087;.
I tried to reuse existing ontologies (e.g. foaf:Person for samples) as much as I can, but sometimes I created my own classes and properties.
I'm not an expert of SPARQL or RDF
Required tools are (jena)[https://jena.apache.org/download/], bcftools (for VCFs), samtools (for BAMs), awk.

Building the RDF GRAPH

Species

I manually wrote data/species.rdf defining the species used in my lab. We will use rdf:subClassOf to find organisms that are a sub-species of a taxon in the NCBI taxonomy tree.

(...)

<u:Taxon rdf:about="http://purl.uniprot.org/taxonomy/9606">
	<rdfs:subClassOf rdf:resource="http://purl.uniprot.org/taxonomy/9605"/>
	<dc:title>Homo Sapiens</dc:title>
</u:Taxon>


<u:Taxon rdf:about="http://purl.uniprot.org/taxonomy/9605">
	<rdfs:subClassOf rdf:resource="http://purl.uniprot.org/taxonomy/9604"/>
	<dc:title>Homo</dc:title>
</u:Taxon>

<u:Taxon rdf:about="http://purl.uniprot.org/taxonomy/9604">
	<rdfs:subClassOf rdf:resource="http://purl.uniprot.org/taxonomy/all"/>
	<dc:title>Hominidae</dc:title>
</u:Taxon>


<u:Taxon rdf:about="http://purl.uniprot.org/taxonomy/63221">
	<rdfs:subClassOf rdf:resource="http://purl.uniprot.org/taxonomy/9606"/>
        <dc:title>Homo sapiens neanderthalensis</dc:title>
</u:Taxon>


<u:Taxon rdf:about="http://purl.uniprot.org/taxonomy/10912">
	<rdfs:subClassOf rdf:resource="http://purl.uniprot.org/taxonomy/all"/>
	<dc:title>Rotavirus</dc:title>
</u:Taxon>

<u:Taxon rdf:about="http://purl.uniprot.org/taxonomy/all">
	<dc:title>ROOT</dc:title>
</u:Taxon>


</rdf:RDF>

Diseases / Phenotypes

I manually wrote data/diseases.rdf defining the diseases used in my lab. We will use rdf:subClassOf to find diseases that are a sub-disease in a disease ontology tree.

(...)
    <owl:Class rdf:about="http://purl.obolibrary.org/obo/DOID_0080599">
        <rdfs:label>Coronavirus infectious disease</rdfs:label>
    </owl:Class>

    <owl:Class rdf:about="http://purl.obolibrary.org/obo/DOID_0080600">
        <rdfs:label>COVID-19</rdfs:label>
	<rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/DOID_0080599"/>
    </owl:Class>

    <owl:Class rdf:about="http://purl.obolibrary.org/obo/DOID_0081013">
	<rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/DOID_0080600"/>
        <rdfs:label>Severe COVID-19</rdfs:label>
    </owl:Class>

    <owl:Class rdf:about="http://purl.obolibrary.org/obo/DOID_3491">
        <rdfs:label>Turner Syndrome</rdfs:label>
    </owl:Class>

</rdf:RDF>

References / FASTA / Genomes

I manually wrote data/references.tsv a tab delimited text file defining each FASTA reference genome available on my cluster. The taxon id will be used to retrive the species associated to a FASTA file.

#path                 genomeId   ucsc  taxid
data/hg19.fasta       grch37     hg19  9606
data/hg38.fasta       grch38     hg38  9606
data/rotavirus_rf.fa  rotavirus        10912

The table is transformed into RDF using awk:

tail -n+2 data/references.tsv |\
	awk -F '\t' '{printf("<u:Reference rdf:about=\"&u1087;references/%s\">\n\t<u:genomeId>%s</u:genomeId>\n\t<u:filename>%s</u:filename>\n",$2,$2,$1);if($4!="") printf("\t<u:taxon rdf:resource=\"http://purl.uniprot.org/taxonomy/%s\"/>\n",$4); printf("</u:Reference>\n");}'

output:

(...)
<u:Reference rdf:about="&u1087;references/grch37">
	<u:genomeId>grch37</u:genomeId>
	<u:filename>data/hg19.fasta</u:filename>
	<u:taxon rdf:resource="http://purl.uniprot.org/taxonomy/9606"/>
</u:Reference>
<u:Reference rdf:about="&u1087;references/grch38">
	<u:genomeId>grch38</u:genomeId>
	<u:filename>data/hg38.fasta</u:filename>
	<u:taxon rdf:resource="http://purl.uniprot.org/taxonomy/9606"/>
</u:Reference>
<u:Reference rdf:about="&u1087;references/rotavirus">
	<u:genomeId>rotavirus</u:genomeId>
	<u:filename>data/rotavirus_rf.fa</u:filename>
	<u:taxon rdf:resource="http://purl.uniprot.org/taxonomy/10912"/>
</u:Reference>
</rdf:RDF>

Samples

I manually wrote data/samples.rdf defining the samples sequenced in my lab. This is where we can define the gender, associate a sample to a diseases and where we can define the familial relations. The Class foaf:Group is used to create a group of samples.

(...)
  <foaf:Person rdf:about="&u1087;samples/S1">
    <foaf:name>S1</foaf:name>
    <foaf:gender>female</foaf:gender>
    <rel:childOf rdf:resource="&u1087;samples/S2"/>
    <rel:childOf rdf:resource="&u1087;samples/S3"/>
    <u:has-disease rdf:resource="http://purl.obolibrary.org/obo/DOID_0080600"/>
    <u:has-disease rdf:resource="http://purl.obolibrary.org/obo/DOID_3491"/>
  </foaf:Person>


  <foaf:Person rdf:about="&u1087;samples/S2">
    <foaf:name>S2</foaf:name>
    <foaf:gender>female</foaf:gender>
  </foaf:Person>


  <foaf:Person rdf:about="&u1087;samples/S3">
    <foaf:name>S3</foaf:name>
    <foaf:gender>male</foaf:gender>
    <u:has-disease rdf:resource="http://purl.obolibrary.org/obo/DOID_0081013"/>
  </foaf:Person>

  <foaf:Group rdf:about="&u1087;groups/family1">
	<foaf:name>Fam01</foaf:name>
	<foaf:member rdf:resource="&u1087;samples/S1"/>
	<foaf:member rdf:resource="&u1087;samples/S2"/>
	<foaf:member rdf:resource="&u1087;samples/S3"/>
	<foaf:member rdf:resource="&u1087;samples/S5"/>
  </foaf:Group>

</rdf:RDF>

VCF BCF

VCF and genomes

for each VCF files, we need to associate a VCF and the reference genome: Chromosome and length are extracted from the references, we calculate the md5 checksum and we sort on md5.

tail -n+2 data/references.tsv | sort -T TMP -t $'\t' -k1,1 > TMP/sorted.refs.txt
cut -f 1 TMP/sorted.refs.txt | while read FA; do echo -ne "${FA}\t" && cut -f 1,2 "${FA}.fai" | md5sum | cut -d ' ' -f 1; done > TMP/references.md5.tmp.a
join -t $'\t' -1 1 -2 1  TMP/sorted.refs.txt TMP/references.md5.tmp.a | sort -t $'\t' -k5,5 > TMP/references.md5
rm -f  TMP/references.md5.tmp.a

For each VCF, the header is extracted, we extract the chromosome and length of the contig lines, we calculate the md5 checksum and we sort on md5.

find data -type f \( -name "*.vcf.gz" -o -name "*.bcf" -o -name "*.vcf" \) | sort > TMP/vcfs.txt
(cat TMP/vcfs.txt| while read V ; \
		do echo -en "${V}\t" && \
		bcftools view --header-only "${V}"  | awk  -F '[=,<>]' '/^##contig/ {printf("%s\t%s\n",$4,$6);}'   | md5sum | cut -d ' ' -f1 ; done) | sort -t $'\t' -k2,2 > TMP/vcfs.md5.txt

we join both files on md5 and we convert to RDF using awk:

cat data/header.rdf.part > TMP/vcf2ref.rdf
join -t $'\t' -1 2 -2 5 TMP/vcfs.md5.txt TMP/references.md5 |\
	awk -F '\t' '{printf("<u:Vcf rdf:about=\"file://%s\"><u:filename>%s</u:filename><u:reference rdf:resource=\"&u1087;references/%s\"/></u:Vcf>",$2,$2,$4); }' >> TMP/vcf2ref.rdf
cat data/footer.rdf.part >> TMP/vcf2ref.rdf

VCF and samples

to link the VCF files and the sample, we use bcftools query -l to extract the samples and we convert to RDF using awk:

find data -type f \( -name "*.vcf.gz" -o -name "*.bcf" -o -name "*.vcf" \) | sort > TMP/vcfs.txt
cat data/header.rdf.part > TMP/vcf2samples.rdf
cat TMP/vcfs.txt | while read F; do bcftools query -l "${F}" | awk -vVCF="$F" 'BEGIN {printf("<u:Vcf rdf:about=\"file://%s\"><u:filename>%s</u:filename>",VCF,VCF); } {printf("<u:sample rdf:resource=\"&u1087;samples/%s\"/>",$1);} END {printf("</u:Vcf>");}' >> TMP/vcf2samples.rdf ; done
cat data/footer.rdf.part >> TMP/vcf2samples.rdf

BAM files

BAM file contains the sample names in their read-groups; We use samtools samples to extract the samples, the reference and the path of each BAM file. data/samtools.samples.to.rdf.awk is used to convert the output of samtools samples to RDF.

find ${PWD}/data -type f -name "*.bam" |\
	samtools samples -F TMP/references.txt |\
	sort -T TMP -t $'\t' -k3,3 |\
	join -t $'\t' -1 3 -2 1 - TMP/sorted.refs.txt > TMP/bams.txt

cat data/header.rdf.part > TMP/bams.rdf

awk -F '\t' -f data/samtools.samples.to.rdf.awk TMP/bams.txt >> TMP/bams.rdf

cat data/footer.rdf.part >> TMP/bams.rdf

the output:

(...)
  <foaf:Person rdf:about="&u1087;samples/S5">
    <foaf:name>S5</foaf:name>
  </foaf:Person>
  <u:Bam rdf:about="file:///home/lindenb/src/hts-rdf/data/S5.grch38.bam">
    <u:filename>/home/lindenb/src/hts-rdf/data/S5.grch38.bam</u:filename>
    <u:sample rdf:resource="&u1087;samples/S5"/>
    <u:reference rdf:resource="&u1087;references/grch38"/>
  </u:Bam>
(...)

Combining all the RDF chunks

jena/rio is used to merge RDF files into knowledge.rdf

riot --formatted=RDFXML TMP/references.rdf data/species.rdf TMP/bams.rdf data/diseases.rdf data/samples.rdf TMP/vcf2ref.rdf TMP/vcf2samples.rdf > knowledge.rdf

Querying the GRAPH

jena/arq is used to run the SPARQL queries.

arq --data=knowledge.rdf --query=querysparql

Example

show me the species that are a sub-taxon of "Homo"

query data/query.species.01.sparql :

(...)

SELECT DISTINCT ?taxonName
WHERE {
	?taxon dc:title ?taxonName .
	?taxon a u:Taxon .
	?taxon rdfs:subClassOf* ?root .
	?root a u:Taxon .
	?root dc:title "Homo" .
}

execute:

arq --data=knowledge.rdf --query=data/query.species.01.sparql > TMP/species.01.out

output TMP/species.01.out:

taxonName
"Homo sapiens neanderthalensis"
"Homo Sapiens"
"Homo"

Example

show the diseases that are a sub disease of COVID-19.

query data/query.diseases.01.sparql :

(...)

SELECT DISTINCT ?diseaseName
WHERE {
	?disease rdfs:label ?diseaseName .
	?disease a owl:Class .
	?disease rdfs:subClassOf* ?root .
	?root a owl:Class .
	?root rdfs:label "COVID-19" .
}

execute:

arq --data=knowledge.rdf --query=data/query.diseases.01.sparql > TMP/diseases.01.out

output TMP/diseases.01.out:

diseaseName
"COVID-19"
"Severe COVID-19"

Example

find the samples , their children, parents , diseases

query data/query.samples.01.sparql :

(...)

SELECT DISTINCT
		(SAMPLE(?sampleName) as ?colName)
		(SAMPLE(?gender) as ?colGender )
		(SAMPLE(?fatherName) as ?colFather )
		(SAMPLE(?motherName)  as ?colMother)
		(GROUP_CONCAT(DISTINCT ?childName; SEPARATOR=";") as ?colChildren)
		(GROUP_CONCAT(DISTINCT ?diseaseName; SEPARATOR=";") as ?colDiseases)
		
WHERE {
	?sample a foaf:Person .
	?sample foaf:name ?sampleName .
	OPTIONAL {?sample foaf:gender ?gender .}
	OPTIONAL {
		?sample u:has-disease ?disease .
		?disease a owl:Class .
		?disease rdfs:label ?diseaseName .
		} .
	OPTIONAL {
		?father a foaf:Person .
		?sample rel:childOf ?father .
		?father foaf:gender "male" .
		?father foaf:name ?fatherName .
		} .
	OPTIONAL {
		?mother a foaf:Person .
		?sample rel:childOf ?mother .
		?mother foaf:gender "female" .
		?mother foaf:name ?motherName .
		} .
	OPTIONAL {
		?child a foaf:Person .
		?child rel:childOf ?sample .
		?child foaf:name ?childName .
		} .
}
GROUP BY  ?sample

execute:

arq --data=knowledge.rdf --query=data/query.samples.01.sparql > TMP/samples.01.out

output TMP/samples.01.out:

colName	colGender	colFather	colMother	colChildren	colDiseases
"S1"	"female"	"S3"	"S2"		"Turner Syndrome;COVID-19"
"S2"	"female"			"S1"
"S3"	"male"			"S1"	"Severe COVID-19"
"S4"
"S5"

Example

List all the VCF files and their samples, at least containing the sample "S1"

query data/query.vcfs.01.sparql :

(...)

SELECT DISTINCT ?vcfPath ?fasta ?taxonName ?sampleName
WHERE {
  ?vcf a u:Vcf .
  ?vcf u:filename ?vcfPath .

   ?vcf u:sample ?sample1 .
   ?sample1 a foaf:Person .
   ?sample1 foaf:name "S1" .

   ?vcf u:sample ?sample2 .
   ?sample2 a foaf:Person .
   ?sample2 foaf:name ?sampleName .

  OPTIONAL {
	?vcf u:reference ?ref .
	?ref a u:Reference .
	?ref u:filename ?fasta

	OPTIONAL {
		?ref u:taxon ?taxon .
		?taxon a u:Taxon .
		?taxon dc:title ?taxonName .
		}
	}

}

execute:

arq --data=knowledge.rdf --query=data/query.vcfs.01.sparql > TMP/vcfs.01.out

output TMP/vcfs.01.out:

vcfPath	fasta	taxonName	sampleName
"data/variants2.vcf"	"data/hg19.fasta"	"Homo Sapiens"	"S1"
"data/variants2.vcf"	"data/hg19.fasta"	"Homo Sapiens"	"S2"
"data/variants2.vcf"	"data/hg19.fasta"	"Homo Sapiens"	"S3"
"data/variants1.vcf"	"data/hg38.fasta"	"Homo Sapiens"	"S1"
"data/variants1.vcf"	"data/hg38.fasta"	"Homo Sapiens"	"S5"
"data/variants1.vcf"	"data/hg38.fasta"	"Homo Sapiens"	"S2"
"data/variants1.vcf"	"data/hg38.fasta"	"Homo Sapiens"	"S3"

Example

find the bam , their reference, samples , etc..

query data/query.bams.01.sparql :

(...)

SELECT DISTINCT ?bamPath
		(SAMPLE(?fasta) as ?colFasta)
		(SAMPLE(?taxonName) as ?colTaxon)
		(SAMPLE(?sampleName) as ?colSampleName )
		(GROUP_CONCAT(DISTINCT ?groupName; SEPARATOR=";") as ?colGroups )
		(GROUP_CONCAT(DISTINCT ?gender; SEPARATOR=";") as ?colGender )
		(GROUP_CONCAT(DISTINCT ?diseaseName; SEPARATOR=";") as ?colDiseases)
		(SAMPLE(?fatherName) as ?colFather )
		(SAMPLE(?motherName)  as ?colMother)
		(GROUP_CONCAT(DISTINCT ?childName; SEPARATOR="; ") as ?colChildren)
WHERE {
  ?bam a u:Bam .
  ?bam u:filename ?bamPath .

  OPTIONAL {
	?bam u:reference ?ref .
	?ref a u:Reference .
	?ref u:filename ?fasta

	OPTIONAL {
		?ref u:taxon ?taxon .
		?taxon a u:Taxon .
		?taxon dc:title ?taxonName .
		}
	}

  OPTIONAL {
	?bam u:sample ?sample .
	?sample a foaf:Person .
	OPTIONAL {?sample foaf:name ?sampleName .}
	OPTIONAL {?sample foaf:gender ?gender .}
	OPTIONAL {
		?group foaf:member ?sample .
		?group a foaf:Group .
		?group foaf:name ?groupName .
		} .
	OPTIONAL {
		?sample u:has-disease ?disease .
		?disease a owl:Class .
		?disease rdfs:label ?diseaseName .
		} .
	OPTIONAL {
		?father a foaf:Person .
		?sample rel:childOf ?father .
		?father foaf:gender "male" .
		?father foaf:name ?fatherName .
		} .
	OPTIONAL {
		?mother a foaf:Person .
		?sample rel:childOf ?mother .
		?mother foaf:gender "female" .
		?mother foaf:name ?motherName .
		} .
	OPTIONAL {
		?child a foaf:Person .
		?child rel:childOf ?sample .
		?child foaf:name ?childName .
		} .
	}.
}
GROUP BY  ?bamPath

execute:

arq --data=knowledge.rdf --query=data/query.bams.01.sparql > TMP/bams.01.out

output TMP/bams.01.out:

bamPath	colFasta	colTaxon	colSampleName	colGroups	colGender	colDiseases	colFather	colMother	colChildren
"/home/lindenb/src/hts-rdf/data/S1.grch38.bam"	"data/hg38.fasta"	"Homo Sapiens"	"S1"	"Fam01"	"female"	"Turner Syndrome;COVID-19"	"S3"	"S2"
"/home/lindenb/src/hts-rdf/data/S2.grch37.bam"	"data/hg19.fasta"	"Homo Sapiens"	"S2"	"Fam01"	"female"				"S1"
"/home/lindenb/src/hts-rdf/data/S4.RF.bam"	"data/rotavirus_rf.fa"	"Rotavirus"	"S4"
"/home/lindenb/src/hts-rdf/data/S5.grch38.bam"	"data/hg38.fasta"	"Homo Sapiens"	"S5"	"Fam01"
"/home/lindenb/src/hts-rdf/data/S3.grch38.bam"	"data/hg38.fasta"	"Homo Sapiens"	"S3"	"Fam01"	"male"	"Severe COVID-19"			"S1"
"/home/lindenb/src/hts-rdf/data/S1.grch37.bam"	"data/hg19.fasta"	"Homo Sapiens"	"S1"	"Fam01"	"female"	"Turner Syndrome;COVID-19"	"S3"	"S2"

The Graph

and here is the RDF graph as a SVG document:

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
data		data
.gitignore		.gitignore
CITATION.cff		CITATION.cff
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
knowledge.rdf		knowledge.rdf
knowledge.svg		knowledge.svg

License

lindenb/hts-rdf

Folders and files

Latest commit

History

Repository files navigation

hts-rdf

Building the RDF GRAPH

Species

Diseases / Phenotypes

References / FASTA / Genomes

Samples

VCF BCF

VCF and genomes

VCF and samples

BAM files

Combining all the RDF chunks

Querying the GRAPH

Example

Example

Example

Example

Example

The Graph

About

Topics

Resources

License

Stars

Watchers

Forks

Languages