Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reading binary output from cmp #63

Open
tsp-kucbd opened this issue Jan 1, 2021 · 2 comments
Open

reading binary output from cmp #63

tsp-kucbd opened this issue Jan 1, 2021 · 2 comments

Comments

@tsp-kucbd
Copy link

tsp-kucbd commented Jan 1, 2021

Is there a way of reading the binary distance matrix from "dashing cmp -b" directly into numpy?

@tsp-kucbd tsp-kucbd changed the title output reading binary output from cmp Jan 1, 2021
@dnbaker
Copy link
Owner

dnbaker commented Jan 2, 2021

Sure! It's really easy with Numpy. Here's a script that does what you need:

import numpy as np

def parse_binary(path):
    dat = np.fromfile(path, dtype=np.uint8)
    assert dat[0] == 0, "Expected the first byte to be zero"
    n = int(dat[1:9].view(np.uint64)[0])
    vals = dat[9:].view(np.float32)
    return n, vals

def tri2full(n, vals, is_sim=True):
    mat = np.zeros((n,n), dtype=np.float32)
    mat[np.triu_indices(n, k=1)] = vals
    mat[np.tril_indices(n, k=-1)] = vals
    mat[np.diag_indices(n)] = 1. if is_sim else 0.
    return mat

if __name__ == "__main__":
    import sys
    if sys.argv[1:]:
        n, v = parse_binary(sys.argv[1])
        print(n, v)
        full = tri2full(n, v)
        print(full)

The first byte describes the type of the distance matrix. The next 8 are a 64-bit integer for the number of items in the distance matrix, and then the remaining items are float32. You can then use the np.triu_indices function to export items from packed space into the full matrix. In the above example, I set both halves of the matrix.

I'll add some instructions regarding this into the documentation soon.

Thanks!

Daniel

@tsp-kucbd
Copy link
Author

tsp-kucbd commented Jan 4, 2021

Thank you, that worked!
Although only for smaller runs where I have 10,000 genomes.
The current run where I have 400,000 genomes works nicely with "-T" writing the csv files but crashes directly (after 3 minutes) in the beginning when trying to use the binary output "-b"

./dashing_s512 dist -F list_of_genomes  -k 15 -S16 -p 39 -M -b -o labelsB --use-nthash -O  NGOT.dashdists.bin

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
[1]    40318 abort (core dumped)  ./dashing_s512 dist -F list_of_genomes -k 15 -S16 -p 39 -M -b -o labelsB  -O

I will open a new issue with this,
/T

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants