paul90317/Semi-Global-Sequence-Alignment-with-Cuda

Semi Global Sequence Alignment with Cuda

intro
ppt
youtube

What I have done

As the title says, this project implements semi-global sequence alignment; local and global alignment are also implemented.
The work is split across two programs rather than one.

  1. The program semi_interval first computes the best score under the semi-global setting for x and y, then outputs the corresponding semi-global intervals of x and y.
  2. The program alignment aligns sequences x and y within that interval. It performs a global alignment, but because it is restricted to the semi-global interval, the result is the same as a semi-global sequence alignment.

Config

You can edit the configuration in myconfig.h. The configuration is compiled into the program, so the preprocessor can specialize the code for it, for example by removing branches and class members ("optimized" here is my own term and does not mean compiler optimization).
In this file you can set the number of CUDA threads per block, whether the start and end of sequences x and y are free or fixed, the data type of the score matrix, and so on.

Compile and Run

compile

make

Builds the programs.

make clean

Removes the built programs.

run

./semi_interval.out <x.txt> <y.txt> <best interval.txt> <score.txt>

Computes the best intervals. Before building, configure the semi-global setting and the data type of the score matrix in myconfig.h.

  • <x.txt> <y.txt> the files containing the input sequences.
  • <best interval.txt> the best intervals are stored in this file.
  • <score.txt> the score matrix is read from this file.

./alignment.out <x.txt> <y.txt> <best interval.txt> <score.txt> <alignment.txt>

Computes the alignment using the <best interval.txt> generated by semi_interval.out. This is a global alignment restricted to the interval in <best interval.txt>.

  • <x.txt> <y.txt> the files containing the input sequences.
  • <best interval.txt> the best interval is read from this file (only the first line, i.e. the first interval, is used).
  • <score.txt> the score matrix is read from this file.
  • <alignment.txt> the alignment is stored in this file.
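
The two commands above are meant to be run in sequence, with the interval file written by the first being read by the second. A minimal Python sketch of that chaining, assuming the binaries sit in the current directory; `pipeline_commands` is a hypothetical helper, not part of this repository:

```python
import shlex

def pipeline_commands(x, y, best, score, alignment):
    """Build the argument lists for the two-step run (hypothetical helper)."""
    return [
        # Step 1: writes the best intervals to `best`.
        ["./semi_interval.out", x, y, best, score],
        # Step 2: reads the first interval from `best`, writes `alignment`.
        ["./alignment.out", x, y, best, score, alignment],
    ]

cmds = pipeline_commands("x.txt", "y.txt", "best.txt", "score.txt", "alignment.txt")
for cmd in cmds:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` so that the second step only runs if the first one succeeds.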

./cpu.out <x.txt> <y.txt> <score.txt>

Only for speed testing: it computes and prints the global sequence alignment score and nothing else.

  • <x.txt> <y.txt> the files containing the input sequences.
  • <score.txt> the score matrix is read from this file.
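
For readers who want a reference point, here is a small Python sketch of a global alignment score with an opening and an extension penalty, kept in linear memory by reusing one row. It is not the code in cpu.out. The penalty convention is an assumption: the first gapped position costs `gap` and each further position in the same gap costs `ext`. The `match_score` function is a hypothetical stand-in for the score matrix.

```python
NEG = float("-inf")

def global_score(x, y, score, gap, ext):
    """Affine-gap global alignment score using O(len(y)) memory."""
    m = len(y)
    # H[j]: best score for the current prefix of x against y[:j].
    H = [0.0] + [gap + (j - 1) * ext for j in range(1, m + 1)]
    # E[j]: best score ending with a character of x aligned to a gap.
    E = [NEG] * (m + 1)
    for i in range(1, len(x) + 1):
        diag = H[0]                  # value of the previous row at column j-1
        H[0] = gap + (i - 1) * ext   # x[:i] aligned entirely to gaps
        f = NEG                      # best score ending with a gap in x
        for j in range(1, m + 1):
            E[j] = max(H[j] + gap, E[j] + ext)   # H[j] is still the previous row
            f = max(H[j - 1] + gap, f + ext)     # H[j-1] is already the current row
            best = max(diag + score(x[i - 1], y[j - 1]), E[j], f)
            diag = H[j]
            H[j] = best
    return H[m]

def match_score(a, b):
    # Assumed scoring for the demo: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1

print(global_score("ACGT", "AGT", match_score, -2, -1))
```

Under the assumed convention, aligning "ACGT" with "AGT" gives three matches and one gap of length one, so the score is 3 - 2 = 1.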

Input and Output File Format

Score Matrix <score.txt>

Stores the score matrix used by alignment.out and semi_interval.out.

input format

<number base>
<base> ...
<score matrix> ...
.
.
.
<gap>
<extension>

<base> can be any character (including a space) except a newline.

input example

4
ATGC
1 -5 -5 -1 
-5 1 -1 -5 
-5 -1 1 -5 
-1 -5 -5 1 
-2
-1
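
To make the layout concrete, here is a minimal Python sketch that reads this format. It is not the reader used by the programs; `parse_score_txt` is a hypothetical name, and it assumes the bases are written back to back on the second line, as in the example above.

```python
def parse_score_txt(text):
    """Parse the score-matrix layout shown above (illustrative sketch)."""
    lines = text.split("\n")
    n = int(lines[0])
    # Do not strip this line: a base may be a space character.
    bases = lines[1][:n]
    matrix = [[float(v) for v in lines[2 + i].split()] for i in range(n)]
    gap = float(lines[2 + n])
    extension = float(lines[3 + n])
    return {"bases": bases, "matrix": matrix, "gap": gap, "extension": extension}

example = "4\nATGC\n1 -5 -5 -1 \n-5 1 -1 -5 \n-5 -1 1 -5 \n-1 -5 -5 1 \n-2\n-1\n"
parsed = parse_score_txt(example)
print(parsed["bases"], parsed["matrix"][0], parsed["gap"], parsed["extension"])
```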

Sequence <x.txt> <y.txt>

The file contains only newlines and <base> characters; the programs ignore newlines when reading a sequence file.

input example

AATTCCGAT
AATTCGTT
TGGAAT
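
The newline handling can be pictured with a short Python sketch. It only illustrates the rule and is not the programs' reader; also dropping carriage returns is an assumption made here for files with Windows line endings.

```python
import io

def read_sequence(stream):
    """Concatenate all lines, dropping only line-ending characters."""
    return "".join(line.rstrip("\r\n") for line in stream)

# The three lines of the example above form one sequence of 23 bases.
seq = read_sequence(io.StringIO("AATTCCGAT\nAATTCGTT\nTGGAAT\n"))
print(len(seq), seq)
```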

Best Interval <best interval.txt>

output format by semi_interval.out

<best score> <x start> <x end> <y start> <y end>
.
.
.

input format by alignment.out

<best score> <x start> <x end> <y start> <y end>

If there are multiple lines, only the first line is consumed by alignment.out.
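
A minimal Python sketch of reading such a file the way alignment.out is described to, taking only the first line. `Interval` and `parse_first_interval` are hypothetical names. The first sample line reuses the values printed in the Result section; the second line is made up to show that it is ignored.

```python
from typing import NamedTuple

class Interval(NamedTuple):
    score: float
    x_start: int
    x_end: int
    y_start: int
    y_end: int

def parse_first_interval(text):
    """Return the interval on the first line; later lines are ignored."""
    first = text.splitlines()[0].split()
    return Interval(float(first[0]), *(int(v) for v in first[1:5]))

iv = parse_first_interval("-281.20000 1 972 1 979\n-281.20000 2 972 1 979\n")
print(iv)
```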


Alignment <alignment.txt>

output example

- C
A A
T T
G -
C C
- C

Python scripts

You can use my Python scripts, which run the alignments automatically over a specific file structure. They are useful when you have many alignments to compute.

File structure

├───score.json
├───tasks
│   ├───100K-100K
│   │   └───x.txt
│   │   └───y.txt
│   ├───100K-10K
│   │   └───x.txt
│   │   └───y.txt
│   ├───10K-100K
│   │   └───x.txt
│   │   └───y.txt
│   ├───10K-10K
│   │   └───x.txt
│   │   └───y.txt
│   └───1K-1K
│   │   └───x.txt
│   │   └───y.txt

After running ./gpu_test.sh -a

├───tasks
│   ├───100K-100K
│   │   └───out
│   │       └───best.txt
│   │       └───alm
│   │           └───...
│   ├───100K-10K
│   │   └───out
│   │       └───best.txt
│   │       └───alm
│   │           └───...
│   ├───10K-100K
│   │   └───out
│   │       └───best.txt
│   │       └───alm
│   │           └───...
│   ├───10K-10K
│   │   └───out
│   │       └───best.txt
│   │       └───alm
│   │           └───...
│   └───1K-1K
│   │   └───out
│   │       └───best.txt
│   │       └───alm
│   │           └───...

The folder alm/ contains the alignments <alignment.txt> generated by alignment.out. If you don't want to generate them, omit the -a argument.

The file best.txt stores the best intervals generated by semi_interval.out.

Commands

make cpu_test

Runs only the CPU global alignment score on the tasks. There is no semi-global option. It lets you compare CPU and GPU performance, and it serves as a check: the global alignment score should be the same as the one from the CUDA program when the start and end of x and y are set to fixed.

make gpu_test

Calculates the best scores and their intervals with semi_interval.out, then runs alignment.out to generate alignments for those intervals.

make clean_tasks

Removes the generated alignments in tasks.


Score Matrix score.json

This is important: the Python scripts accept only score.json, not score.txt. I think score.json is easier to edit.

example for DNA

{
    "chars":["A","T","G","C"],
    "matrix":[
        [1,-1,-1,-1],
        [-1,1,-1,-1],
        [-1,-1,1,-1],
        [-1,-1,-1,1]
    ],
    "gap":-2,
    "extension":-1
}

example for a-z, A-Z and space

{
    "chars":[
        {
            "l":"a",
            "r":"z"
        },
        {
            "l":"A",
            "r":"Z"
        },
        " "
    ],
    "matrix":{
        "match":1,
        "miss":-1
    },
    "gap":-2,
    "extension":-1
}
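
The two JSON shapes above can be expanded into the score.txt layout along these lines. This is a Python sketch, not the repository's matrix_transform.sh. It assumes that an "l"/"r" pair means an inclusive character range and that "match"/"miss" fill the diagonal and off-diagonal entries. `expand_score_json` and `to_score_txt` are hypothetical names.

```python
import json

def expand_score_json(text):
    """Expand character ranges and the match/miss shorthand (assumed semantics)."""
    cfg = json.loads(text)
    chars = []
    for item in cfg["chars"]:
        if isinstance(item, dict):
            # Assumed: {"l": "a", "r": "z"} covers 'a'..'z' inclusive.
            chars.extend(chr(c) for c in range(ord(item["l"]), ord(item["r"]) + 1))
        else:
            chars.append(item)
    matrix = cfg["matrix"]
    if isinstance(matrix, dict):
        matrix = [[matrix["match"] if a == b else matrix["miss"] for b in chars]
                  for a in chars]
    return chars, matrix, cfg["gap"], cfg["extension"]

def to_score_txt(chars, matrix, gap, extension):
    """Render the score.txt layout described in the Score Matrix section."""
    rows = [" ".join(str(v) for v in row) for row in matrix]
    return "\n".join([str(len(chars)), "".join(chars), *rows,
                      str(gap), str(extension)]) + "\n"

short = '{"chars":[{"l":"a","r":"c"}," "],"matrix":{"match":1,"miss":-1},"gap":-2,"extension":-1}'
print(to_score_txt(*expand_score_json(short)))
```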

Requirements

  • GCC 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
  • CUDA 11.5
  • NVIDIA-SMI 495.29.05
  • Operating System Ubuntu 20.04.3 LTS (Focal Fossa)
  • Make GNU Make 4.2.1
  • Python3 3.8.10

Result

semi_interval.out

Run the following command to generate score.txt.

$ ./matrix_transform.sh score.json temp/score.txt 
Transfrom score.json to temp/score.txt.

Then run the following command.

$ ./semi_interval.out "tasks/1K-1K/x.txt" "tasks/1K-1K/y.txt" "tasks/1K-1K/out/best.txt" temp/score.txt

semi-global-setting: src/headers/myconfig.h
 - x: [fixed, fixed]
 - y: [fixed, fixed]
score matrix: temp/score.txt
sequence X: tasks/1K-1K/x.txt
 - size: 972
sequence Y: tasks/1K-1K/y.txt
 - size: 979

time taken: 0.01s

[OUTPUT]
best intervals: tasks/1K-1K/out/best.txt
best score: -281.20000
inteval: X=[1, 972] Y=[1, 979]
 - score: -281.20000

cpu.out

$ ./cpu.out "tasks/1K-1K/x.txt" "tasks/1K-1K/y.txt" temp/score.txt

score matrix: temp/score.txt
sequence X: tasks/1K-1K/x.txt
 - size: 972
sequence Y: tasks/1K-1K/y.txt
 - size: 979

time taken: 0.02s

[OUTPUT]
best score: -281.2

alignment.out

Run the gpu_test script,

$ ./gpu_test.sh -a

and you will see that it runs the following command.

$ ./alignment.out "tasks/1K-1K/x.txt" "tasks/1K-1K/y.txt" temp/best.txt temp/score.txt "tasks/1K-1K/out/alm/-281.20000-1-972-1-979.txt"

score matrix: temp/score.txt
interval: temp/best.txt
 - index: 0
 - score: -281.2
 - sequence X: tasks/1K-1K/x.txt
 -  - interval: [1, 972]
 - sequence Y: tasks/1K-1K/y.txt
 -  - interval: [1, 979]

time taken: 1.22s

[OUTPUT]
best score: -281.2
alignment: tasks/1K-1K/out/alm/-281.20000-1-972-1-979.txt
 - score: -281.2

The resulting alignment is stored in "tasks/1K-1K/out/alm/-281.20000-1-972-1-979.txt". Run the following command to display it.

$ ./v2h.sh tasks/1K-1K/out/alm/-281.20000-1-972-1-979.txt
ATGCTAAAAACCCTCAATAAACTAGGTACTGATGGAACATATCTCAAAAT
--G---ACATCCAT---T----TTTGTTGTTATCCAACATCTGCCCACCG

AATAATACCTATTTATGAAAAACCCACAGCCAATACTGAATGGTGAAAAA
A-TATT-CCTTTTGAAGACTA-CCC-CATT-AATCTTGA-GAGTGG----

CTGGAAGCATTCCCTTTGAAAACCAGCACAAG--ACAAGGATGCCCTATC
CTGGTA-C--TCCCTCT-AAGAC-ATCGAAAGGGACTAGCTTTCCAAA-C

...

About

Parallelizing sequence alignment algorithms with CUDA in linear space complexity
