cobilab/HumanGenome

How compressible is a human genome sequence?


This repository provides the information and scripts needed to reproduce measurements of how compressible a human genome sequence is (T2T CHM13 version 2.0 [article, sequence]) using different data compressors.

Results:

The 3,117,292,120 human DNA symbols have been losslessly compressed to the following sizes:


| Rank | Bytes | Bps | Time (m) | RAM (GB) | Program | Replication | Factor (*) |
|---|---|---|---|---|---|---|---|
| 1 | 538,155,679 | 1.381 | ? | ? | JARVIS3 | ? | 31% |
| 2 | 539,129,963 | 1.384 | 641 | 13.7 | JARVIS3 | Run56 | 31% |
| 3 | 543,855,534 | 1.395 | 381 | 28.8 | JARVIS2 | Run52 | 30% |
| 4 | 544,059,173 | 1.396 | 389 | 28.8 | JARVIS2 | Run51 | 30% |
| 5 | 544,267,353 | 1.396 | 420 | 27.4 | JARVIS2 | Run50 | 30% |
| 6 | 544,292,577 | 1.397 | 399 | 26.9 | JARVIS2 | Run49 | 30% |
| 7 | 545,960,947 | 1.401 | 283 | 26.9 | JARVIS2 | Run48 | 30% |
| 8 | 549,594,830 | 1.410 | 284 | 11 | JARVIS2 | Run47 | 30% |
| 9 | 550,041,600 | 1.411 | 340 | 18.8 | JARVIS2 | Run45 | 29% |
| 10 | 550,051,840 | 1.411 | 309 | 18.8 | JARVIS2 | Run43 | 29% |
| 11 | 550,379,520 | 1.412 | 279 | 18.7 | JARVIS2 | Run42 | 29% |
| 12 | 554,823,680 | 1.423 | 253 | 18.7 | JARVIS2 | Run44 | 29% |
| 13 | 554,985,480 | 1.424 | 219 | 4.1 | JARVIS2 | Run46 | 29% |
| 14 | 555,412,871 | 1.425 | 690 | 24.8 | GeCo3 | Run39 | 29% |
| 15 | 555,679,745 | 1.426 | 616 | 24.3 | GeCo3 | Run32 | 29% |
| 16 | 555,977,522 | 1.427 | 488 | 22.2 | GeCo3 | Run31 | 29% |
| 17 | 556,415,717 | 1.428 | 427 | 19.7 | GeCo3 | Run27 | 29% |
| 18 | 557,100,364 | 1.430 | 428 | 17.2 | GeCo3 | Run26 | 29% |
| 19 | 557,438,004 | 1.431 | 426 | 15.7 | GeCo3 | Run24 | 29% |
| 20 | 557,995,100 | 1.432 | 406 | 14.6 | GeCo3 | Run23 | 28% |
| 21 | 558,343,430 | 1.433 | 396 | 13.3 | GeCo3 | Run20 | 28% |
| 22 | 559,124,034 | 1.435 | 425 | 11.6 | GeCo3 | Run11 | 28% |
| 23 | 560,694,405 | 1.439 | 354 | 12.8 | GeCo3 | Run10 | 28% |
| 24 | 560,982,904 | 1.440 | 416 | 8.1 | GeCo3 | Run25 | 28% |
| 25 | 561,644,781 | 1.441 | 280 | 11.3 | GeCo3 | Run9 | 28% |
| 26 | 562,253,393 | 1.443 | 281 | 11.3 | GeCo3 | Run5 | 28% |
| 27 | 564,282,192 | 1.448 | 222 | 6.3 | GeCo3 | Run4 | 28% |
| 28 | 564,613,120 | 1.449 | 82 | 4.5 | JARVIS2 | Run53 | 28% |
| 29 | 564,913,725 | 1.450 | 262 | 7.3 | GeCo3 | Run3 | 28% |
| 30 | 566,108,106 | 1.453 | 54 | 8.4 | JARVIS2 | Run54 | 27% |
| 31 | 566,387,531 | 1.454 | 215 | 6.3 | GeCo3 | Run2 | 27% |
| 32 | 575,830,095 | 1.478 | 94 | 2.9 | GeCo3 | Run41 | 26% |
| 33 | 576,296,690 | 1.479 | 37 | 5.9 | JARVIS2 | Run55 | 26% |
| 34 | 577,672,973 | 1.482 | 88 | 1.9 | GeCo3 | Run40 | 26% |
| 35 | 578,588,274 | 1.485 | 101 | 3.3 | GeCo3 | Run1 | 26% |
| 36 | 581,917,199 | 1.493 | 97 | 1.8 | GeCo3 | Run17 | 25% |
| 37 | 583,746,074 | 1.498 | 86 | 3.3 | GeCo3 | Run21 | 25% |
| 38 | 589,813,339 | 1.514 | 17,465 | 0.6 | nncp | Run14 | 24% |
| 39 | 603,726,643 | 1.549 | 71 | 3.3 | GeCo3 | Run22 | 23% |
| 40 | 607,749,667 | 1.560 | 22 | 2.5 | MFCompress | Run30 | 22% |
| 41 | 607,835,665 | 1.560 | 48 | 1.8 | GeCo2 | Run8 | 22% |
| 42 | 609,579,746 | 1.564 | 171 | 13.8 | JARVIS | Run18 | 22% |
| 43 | 612,331,601 | 1.571 | 4,588 | 1.6 | paq8l | Run12 | 21% |
| 44 | 614,339,951 | 1.577 | 39 | 28.5 | bsc-m03 | Run38 | 21% |
| 45 | 614,919,247 | 1.578 | 39 | 20.4 | bsc-m03 | Run37 | 21% |
| 46 | 618,241,906 | 1.587 | 39 | 16.3 | bsc-m03 | Run36 | 21% |
| 47 | 619,369,574 | 1.590 | 20 | 2.0 | GeCo2 | Run7 | 21% |
| 48 | 619,837,647 | 1.591 | 12 | 0.6 | MFCompress | Run29 | 21% |
| 49 | 620,837,061 | 1.593 | 39 | 11.2 | bsc-m03 | Run35 | 20% |
| 50 | 625,647,034 | 1.606 | 38 | 5.6 | bsc-m03 | Run34 | 20% |
| 51 | 625,753,521 | 1.606 | 11 | 0.6 | MFCompress | Run28 | 20% |
| 52 | 628,342,060 | 1.613 | 18 | 0.5 | GeCo2 | Run6 | 19% |
| 53 | 639,222,915 | 1.640 | 43 | 0.8 | NAF-22 | Run16 | 18% |
| 54 | 646,062,792 | 1.658 | 84 | 0.6 | lzma -9 | Run15 | 17% |
| 55 | 661,591,088 | 1.698 | 36 | 0.05 | bsc-m03 | Run33 | 15% |
| 56 | 752,793,986 | 1.932 | 5 | 0.001 | bzip2 -9 | Run19 | 3% |
| Baseline | 779,323,017 | 2.000 | - | - | 2 BPS | - | 0% |

(*) The baseline of 2 bits per symbol is used to calculate the (data compression) Factor, which is the percentage by which the compressed size falls below that baseline and is given by 100-((CompressedBytes*8)/(3117292120*2)*100).

Run1.sh and Run4.sh were run on a laptop running Linux with an 11th Gen Intel® Core™ i5-1135G7 @ 2.40GHz × 8, 8 GB of RAM, and a 512 GB SSD. The remaining computations were run on a desktop computer running Linux with an Intel® Core™ i7-6700 CPU @ 3.40GHz × 8, 31.2 GiB of RAM, and a 3 TB disk.

The ranking is given by the lowest number of bytes (an approximation of the Kolmogorov complexity).
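As a quick check of the Bps and Factor columns, the arithmetic can be sketched in shell with `awk`. This is only an illustrative sketch: the helper names `bps` and `factor` and the variable `SYMBOLS` are assumptions introduced here, not part of the repository scripts.

```shell
#!/bin/sh
# Number of DNA symbols in the sequence, as stated in the results.
SYMBOLS=3117292120

# bps BYTES: bits per symbol for a compressed size given in bytes
# (hypothetical helper, named here for illustration).
bps() {
  awk -v b="$1" -v n="$SYMBOLS" 'BEGIN { printf "%.3f\n", (b * 8) / n }'
}

# factor BYTES: percentage saved relative to the 2 bits-per-symbol baseline,
# i.e. 100 - ((bytes * 8) / (symbols * 2)) * 100, rounded to an integer.
factor() {
  awk -v b="$1" -v n="$SYMBOLS" 'BEGIN { printf "%.0f\n", 100 - ((b * 8) / (n * 2)) * 100 }'
}

# Example with the rank 1 entry (538,155,679 bytes).
bps 538155679      # prints 1.381
factor 538155679   # prints 31
```

The same two calls can be applied to any byte count in the table; commas must be removed from the number first.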

Data compression tools


| Data Compressor | Repository | Description |
|---|---|---|
| GeCo3 | code | article |
| GeCo2 | code | article |
| paq8l | code | article |
| nncp v3.1 | code | article |
| NAF | code | article |
| lzma 5.2.5 | code | article |
| JARVIS | code | article |
| bzip2 1.0.8 | code | article |
| MFCompress | code | article |
| bsc-m03 v0.2.1 | code | article |
| JARVIS2 | code | article |
| JARVIS3 | code | - |
| Zstandard | code | base |

Reproducibility:

Change to the scripts directory and make the run scripts executable:

cd scripts/
chmod +x Run*.sh

To replicate each run, use the respective replication script.
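For instance, the Replication column entry `Run1` corresponds to `Run1.sh`. A minimal sketch of a wrapper that checks for the script before running it is shown below; the function name `run_replication` is hypothetical and not part of the repository, and the sketch assumes it is called from inside `scripts/`.

```shell
#!/bin/sh
# run_replication N: run RunN.sh from the current directory.
# Hypothetical helper for illustration; assumes the working directory
# is scripts/ and that chmod +x has already been applied.
run_replication() {
  script="Run$1.sh"
  if [ ! -x "$script" ]; then
    echo "missing or not executable: $script" >&2
    return 1
  fi
  "./$script"
}

# Usage (from scripts/):
#   run_replication 1
```

Note that, according to the table, many runs take several hours and tens of GB of RAM, so it is worth checking the Time and RAM columns before starting one.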
