Skip to content

Retro-ingineering of PacBio's modelling of IPD (RSII, Sequel I, Sequel II v2)

License

Notifications You must be signed in to change notification settings

EMeyerLab/ipdtools

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

79 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ipdtools

Retro-ingineering of PacBio's modelling of IPD (RS II, Sequel I, Sequel II v2). This won't work for Sequel II with v1.0 chemistry.

Development state

The project is not anymore under active development, but I will do my best to help you if you post an issue.

Background

The PacBio's SMRT sequencing allows the measurement of the DNA polymerase's kinetics. It has been shown that slowing of the polymerase can be linked to:

  • DNA Modifications
  • Secondary structures

This kinetic can then be compared with the kinetic of a control sample from a Whole Genome Amplification (WGA) to detect possible modifications of nucleotides within DNA:

  • N6-methyl-adenine (n6mA)
  • 5-methyl-cytosine (5mC)
  • 4-methyl-cytosine (4mC)

As it is known that the polymerase's speed at a nucleotide N is influenced by the surrounding N-5/N+4 nucleotides, it is most of the times possible to predict the polymerase's speed if the given nucleotide is methylated or not.

For people who don't use WGA, PacBio provides a modelling of the kinetics of the polymerase for unmethylated DNA and methylated DNA, that uses Machine-Learning of the surrounding context of nucleotides to predict the Inter-Pulse Duration (IPD), which is directly the speed of the polymerase.

Unfortunately, this ML modeling is not directly accessible to scientists

From the source code of kineticsTools and the models published by PacBio at kineticsTools, it was possible to make a user-friendly command-line-interface (CLI) and python API to easily predict any IPD, in any context for unmethylated DNA.

NB: Methylated DNA is harder to predict because it often involves a mixture of many possible combinations of signals. The original "kineticsTools" does it, but the retroingeneering of it is not achieved yet

Installation

The program has only be tested in Ubuntu LTS 18 and 20 with pip and conda

With pip from PyPI

user@computer:$ pip install ipdtools

NB: Editable pip install is not guaranteed to work properly.

With conda

user@computer:$ conda install -c gdelevoye ipdtools # Linux only

Usage

(base) user@computer:~/ipdtools/resources$ ipdtools -f test.fasta -o test.csv -p -n 6 --model P6-C4 --verbosity INFO
2020-12-22 15:14:29,067 [INFO] Using 6 CPUs
2020-12-22 15:14:29,079 [INFO] Number of nucleotides to handle : 463
Nucleotides: 100%|███████████████████████████████| 463/463 [00:10<00:00, 39.59it/s]
2020-12-22 15:14:39,145 [INFO] Result saved at /export/home1/users/gmc/delevoye/Github/ipdtools/ipdtools/resources/test.csv
2020-12-22 15:14:39,148 [INFO] DONE

Command-line interface (CLI)

Generate a .csv of IPDs from a fasta file:

user@computer$:ipdtools -f myfile.fasta -o output.csv -m [MODELNAME]

By default, [MODELNAME] is SP2-C2, which is supposed to work on any Sequel I or Sequel IIv2 Data. You might want to change it if you work with older RSII Data.

See "Which model should I use ?" for more info

Detailed Usage

user@computer$:ipdtools --help
usage: ipdtools [-h] [--model {SP2-C2,C2,P4-C2,P5-C3,P6-C4,XL-C2,XL-XL}]
                --fastafile FASTAFILE --output_csv OUTPUT_CSV
                [--verbosity {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                [--progress_bar] [--nproc NPROC] [--indexing {0,1}]

optional arguments:
  -h, --help            show this help message and exit
  --model {SP2-C2,C2,P4-C2,P5-C3,P6-C4,XL-C2,XL-XL}, -m {SP2-C2,C2,P4-C2,P5-C3,P6-C4,XL-C2,XL-XL}
                        Choose the model for IPD prediction. See the README of
                        package for more info. DEFAULT: SP2-C2
  --fastafile FASTAFILE, -f FASTAFILE
                        Path to a fasta file.
  --output_csv OUTPUT_CSV, -o OUTPUT_CSV
                        Output CSV file of predicted IPDs.
  --verbosity {DEBUG,INFO,WARNING,ERROR,CRITICAL}, -v {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Choose your verbosity. Default: INFO
  --progress_bar, -p    Displays a progress bar
  --nproc NPROC, -n NPROC
                        Max number of processors for parallelism. DEFAULT: 1,
                        positive integers only. Rq: The programm will actually
                        use n+1 >= 2 CPU no matter what.
  --indexing {0,1}, -i {0,1}
                        Is the indexing corresponding to the .fasta reference
                        1-based (default PacBio) or 0-based

Output .csv header

The output will be a .csv file like this one:

Fasta_ID Position Strand Nucleotide Prediction
[CONTIG] 0 0 A xxxx
[CONTIG] 0 1 T xxxx
[CONTIG] 1 0 T xxxx
[CONTIG] 1 1 A xxxx
[CONTIG] 2 0 C xxxx
[CONTIG] 2 1 G xxxx
[CONTIG] 3 0 G xxxx
[CONTIG] 3 1 C xxxx
[CONTIG] 4 0 A xxxx
[CONTIG] 4 1 T xxxx

Of course, depending on using a 0-based or 1-based indexing, the columns will vary.

The csv is sorted (in ascending order) by:

  • Fasta_ID
  • Position
  • Strand

Python API

ipdtools can be used in python to predict from a fasta file

it can also be used to predict directly from a python string with Str2IPD:

>>> import ipdtools
>>> predictor = ipdtools.ipdModel.Str2IPD("ATGCTAGCTTTTTGNCTGATTAGCTGA",model="SP2-C2",indexing=0) # Default model is SP2-C2
>>> predictor.predict(position=0) # The default is strand0
1.0641327
>>> predictor.predict(position=0,strand=1)
0.42285475

WARNING : This feature is experimental. PacBio's model can produce outputs even for very incoherent parameters (like, "Strand 4, Position -1") without raising any error or warning. Although this behaviour is really not reasuring at first, I made sure with manual checkings that the predictions made from valid parameters stay valid (they apparently do). So, when using correct parameters, there is no problem. The problem is only there IF you're using wrong parameters (like, seeking the ipds at position 350 if your sequence is 349 long). I tried to limit all those potential problems by putting additionnal assertions on the sequence, the strand and the position we ask (depending of course of the indexing system). All of these security assumptions are implemented. I just havn't really had the time to test it yet, so be carefull with that.

Which model should I use ?

Available models are:

  • C2
  • P4-C2
  • P5-C3
  • P6-C4
  • SP2-C2
  • XL-C2
  • XL-XL

No model is available for Sequel II with chemistry V1.0

SP2-C2 was the latest model in Jan 2020], and was recommended for Sequel I and Sequel II v2 data

See here for more info.

The model <-> sequencing kit conversions are unfortunately not well documented. PacBio tools parse the headers automatically to determine it

Conventions used

Important notes:

  • 0-based or 1-based indexing depends on the user's choice (PacBio uses 1-based so, our default is 0-based unless specified explicitely)
  • Position N strand 0 refers to the Nth position in the fasta
  • Strand 0 is the fasta reference sequence. Strand 1 is the complementary
  • Position is alswyas expressed from the strand0/fasta reference point of view. We don't care about 5'-3'.

E.g, if we use 0-based indexing (not the default one) then,

Given this fasta sequence :

"ATGCTT"

The DNA molecule could be pictured this way :

position                               012345
strand0                             5' ATGCTT 3'
                                       ||||||
strand1                             3' TACGAA 5'
position                               012345

Which implies, for instance that:

  • Prediction of position=2 and strand = 1 will be the prediction at position 2 on the complementary strand: "C"
  • Prediction of position=2 and strand = 0 will be the prediction at position 2 on read strand : "G"

Here is a table for the output that will be produced from this exemple just to be sure:

For strand 0

position strand nucleotide
0 0 A
1 0 T
2 0 G
3 0 C
4 0 T
5 0 T

And for strand 1

position strand nucleotide
0 1 T
1 1 A
2 1 C
3 1 G
4 1 A
5 1 A

Credits

The Original code is copyrighted from Pacific Bioscience California (See "LICENSE") as it is published on Github: kineticsTools at the date of Commits on Apr 25, 2018 (https://github.com/PacificBiosciences/kineticsTools/tree/27a18788b4bae5956710e3bda804dbd5635db2a2).

Major modifications, retro-ingineering, isolation from the PacBio's API, repackaging, python3 compatibility and new documentation are from Guillaume DELEVOYE.

See "License" for more infos