|
| 1 | +--- |
| 2 | +layout: default |
| 3 | +title: Bcftools |
| 4 | +parent: 2. Program guides |
| 5 | +--- |
| 6 | + |
| 7 | +# Bcftools |
| 8 | + |
| 9 | +Bcftools is a set of [utilities for variant calling and manipulating VCFs and BCFs](https://samtools.github.io/bcftools/bcftools.html). |
| 10 | + |
| 11 | +## Generating genotype likelihoods for alignment files using `bcftools mpileup` |
| 12 | + |
| 13 | +`bcftools mpileup` can be used to generate VCF or BCF files containing genotype likelihoods for one or more alignment (BAM or CRAM) files as follows: |
| 14 | + |
| 15 | +```bash |
| 16 | +$ bcftools mpileup --max-depth 10000 --threads n -f reference.fasta -o genotype_likelihoods.bcf reference_sequence_alignment.bam |
| 17 | +``` |
| 18 | + |
| 19 | +In this command... |
| 20 | + |
| 21 | +1. **`--max-depth`** or **`-d`** sets the maximum number of reads per input file that are considered at each position in the alignment. In this case, it is set to 10000. |
| 22 | +2. **`--threads`** sets the number (*n*) of processors/threads to use. |
| 23 | +3. **`--fasta-ref`** or **`-f`** is used to select the [faidx-indexed FASTA](samtools.md#indexing-a-fasta-file-using-samtools-faidx) nucleotide reference file (*reference.fasta*) used for the alignment. |
| 24 | +4. **`--output`** or **`-o`** is used to name the output file (*genotype_likelihoods.bcf*). |
| 25 | +5. The final argument given is the input BAM alignment file (*reference_sequence_alignment.bam*). Multiple input files can be given here. |
| 26 | + |
| 27 | +## Variant calling using `bcftools call` |
| 28 | + |
| 29 | +`bcftools call` can be used to call SNP/indel variants from a BCF file as follows: |
| 30 | + |
| 31 | +```bash |
| 32 | +$ bcftools call -O b --threads n -vc --ploidy 1 -p 0.05 -o variants_unfiltered.bcf genotype_likelihoods.bcf |
| 33 | +``` |
| 34 | + |
| 35 | +In this command... |
| 36 | + |
| 37 | +1. **`--output-type`** or **`-O`** is used to select the output format. In this case, *b* for BCF. |
| 38 | +2. **`--threads`** sets the number (*n*) of processors/threads to use. |
| 39 | +3. **`-vc`** combines **`-v`** (output variant sites only) and **`-c`** (use the original [SAMtools](samtools.md) consensus caller). |
| 40 | +4. **`--ploidy`** specifies the ploidy of the assembly. In this case, *1* (haploid). |
| 41 | +5. **`--pval-threshold`** or **`-p`** is used to set the p-value threshold for variant sites (*0.05*). |
| 42 | +6. **`--output`** or **`-o`** is used to name the output file (*variants_unfiltered.bcf*). |
| 43 | +7. The final argument is the input BCF file (*genotype_likelihoods.bcf*). |
| 44 | + |
| 45 | +## Filtering variants using `bcftools filter` |
| 46 | + |
| 47 | +`bcftools filter` can be used to filter variants from a BCF file as follows: |
| 48 | + |
| 49 | +```bash |
| 50 | +$ bcftools filter --threads n -i '%QUAL>=20' -O v -o variants_filtered.vcf variants_unfiltered.bcf |
| 51 | +``` |
| 52 | + |
| 53 | +In this command... |
| 54 | + |
| 55 | +1. **`--threads`** sets the number (*n*) of processors/threads to use. |
| 56 | +2. **`--include`** or **`-i`** is used to define the expression used to filter sites. In this case, *`%QUAL>=20`* keeps only sites with a quality score greater than or equal to 20. |
| 57 | +3. **`--output-type`** or **`-O`** is used to select the output format. In this case, *v* for VCF. |
| 58 | +4. **`--output`** or **`-o`** is used to name the output file (*variants_filtered.vcf*). |
| 59 | +5. The final argument is the input BCF file (*variants_unfiltered.bcf*). |
| 60 | + |
| 61 | +## See also |
| 62 | + |
| 63 | +- [File formats used in bioinformatics](file_formats.md) |
| 64 | +- [SNP calling script](snp_calling.md) |
| 65 | + |
| 66 | +## Further reading |
| 67 | + |
| 68 | +- [bcftools documentation](https://samtools.github.io/bcftools/bcftools.html) |
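
The three steps in the guide above (`mpileup`, `call`, `filter`) can also be chained with pipes so that the intermediate BCF files are never written to disk. The following is a minimal sketch, not part of the guide itself. It assumes the same hypothetical file names as the examples, an arbitrary thread count of 4, uncompressed BCF (`-O u`) between the steps, and `-` as the standard-input placeholder. The filter expression is written as `QUAL>=20`, without the leading `%` used in the guide. By default the script only prints the pipeline; set `DRY_RUN=0` to execute it.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical inputs and outputs, matching the file names used in the guide.
REF="reference.fasta"
BAM="reference_sequence_alignment.bam"
OUT="variants_filtered.vcf"
THREADS=4   # assumed value; replace with the number of threads available

# DRY_RUN=1 (the default) prints the pipeline instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# mpileup writes uncompressed BCF to stdout, call reads it from stdin ("-"),
# and filter writes the final VCF. File names must not contain spaces,
# because the string is passed through eval below.
PIPELINE="bcftools mpileup --max-depth 10000 --threads $THREADS -f $REF -O u $BAM \
| bcftools call -O u --threads $THREADS -vc --ploidy 1 -p 0.05 - \
| bcftools filter -i 'QUAL>=20' -O v -o $OUT -"

if [ "$DRY_RUN" = "1" ]; then
    echo "$PIPELINE"
else
    eval "$PIPELINE"
fi
```

Keeping the separate commands from the guide is still useful when the unfiltered calls need to be inspected or re-filtered with a different threshold.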