differences between this plugin's gather and sourmash gather #331

Open
ctb opened this issue May 12, 2024 · 1 comment

ctb commented May 12, 2024

Post-#298, we now get full gather results out of fastgather and fastmultigather. But there are some differences between what the plugin outputs and what the OG sourmash gather outputs 😱.

First, note that {'filename', 'md5', 'name'} in OG gather are now {'match_filename', 'match_md5', 'match_name'}.

Also, 'potential_false_negative' is missing from plugin gather.
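
A minimal sketch of how the renamed and missing columns could be lined up before comparing the two outputs. The mapping comes from the renames listed above; the helper names and the toy column lists are hypothetical, and the direction of the rename (OG names onto plugin names) is an assumption.

```python
# Column renames described in this issue: OG gather name -> plugin name.
RENAMED = {
    "filename": "match_filename",
    "md5": "match_md5",
    "name": "match_name",
}


def normalize_og_columns(columns):
    """Translate OG gather column names to the plugin's names (hypothetical helper)."""
    return [RENAMED.get(col, col) for col in columns]


def column_differences(og_columns, plugin_columns):
    """Return (columns only in OG gather, columns only in the plugin) after renaming."""
    og = set(normalize_og_columns(og_columns))
    plugin = set(plugin_columns)
    return sorted(og - plugin), sorted(plugin - og)


# Toy column lists for illustration only; real outputs have many more columns.
og_cols = ["filename", "md5", "name", "potential_false_negative", "ksize"]
plugin_cols = ["match_filename", "match_md5", "match_name", "ksize"]

only_og, only_plugin = column_differences(og_cols, plugin_cols)
print(only_og)      # columns the plugin is missing
print(only_plugin)  # columns OG gather is missing
```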

After that is dealt with, the following columns are the same 🎉:

  • f_orig_query
  • median_abund
  • md5
  • gather_result_rank
  • n_unique_weighted_found
  • query_abundance
  • ksize
  • f_match_orig
  • scaled
  • f_unique_to_query
  • query_name
  • query_filename
  • average_abund
  • unique_intersect_bp
  • name
  • f_match
  • intersect_bp
  • total_weighted_hashes

rounding differences?

std_abund and f_unique_weighted appear to differ only because they are floats, i.e. the differences look like rounding.
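
One way to check whether a float column differs only by rounding is to compare with a relative tolerance rather than exact equality. This is a sketch using the standard library; the helper name, the tolerance, and the sample values are assumptions for illustration, not numbers from the actual comparison.

```python
import math


def rows_differing_beyond_rounding(og_values, plugin_values, rel_tol=1e-6):
    """Return the row indices where two float columns differ by more than rel_tol."""
    return [
        i
        for i, (a, b) in enumerate(zip(og_values, plugin_values))
        if not math.isclose(float(a), float(b), rel_tol=rel_tol)
    ]


# Made-up values, as they might be read from two CSV files (strings).
og_std_abund = ["0.3333333333", "1.5", "2.0"]
plugin_std_abund = ["0.33333334", "1.5", "2.1"]

# Row 0 differs only in the last printed digits; row 2 is a real difference.
print(rows_differing_beyond_rounding(og_std_abund, plugin_std_abund))
```

An empty list would support the "just rounding" explanation for that column.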

trivial/easy to fix differences

  • moltype is lowercase in plugin gather, so dna instead of DNA
  • query_md5 is truncated to 8 characters in the OG gather.
  • filename means different things in OG gather and the plugin: in OG gather, it's the filename of the database being searched; in the plugin it's ... the filename of the sig? not sure.
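
The first two items above can be neutralized at comparison time without changing either tool. A small sketch, assuming rows are dicts keyed by column name; the function names and the sample md5 are hypothetical.

```python
def same_moltype(og_row, plugin_row):
    """Compare moltype case-insensitively ('DNA' in OG gather vs 'dna' in the plugin)."""
    return og_row["moltype"].lower() == plugin_row["moltype"].lower()


def same_query_md5(og_row, plugin_row, length=8):
    """Compare query_md5 on the first 8 characters, since OG gather truncates it."""
    return og_row["query_md5"][:length] == plugin_row["query_md5"][:length]


# Made-up rows for illustration.
og_row = {"moltype": "DNA", "query_md5": "a1b2c3d4"}
plugin_row = {"moltype": "dna", "query_md5": "a1b2c3d4e5f60718293a4b5c6d7e8f90"}

print(same_moltype(og_row, plugin_row))
print(same_query_md5(og_row, plugin_row))
```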

real differences

  • query_n_hashes and query_bp are the original query (so, constant) in OG gather, while in the plugin they are the size of the remaining query at that rank
  • remaining_bp is just different - looks like it's just being calculated very differently.
  • max_containment_ani is just quite different...??
  • average_containment_ani is just different, too
  • query_containment_ani is also different
  • aaand match_containment_ani is also different
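
For the query_n_hashes / query_bp item, a quick diagnostic is to check whether the column is constant across ranks (the OG gather behavior described above) or shrinks with rank (the plugin behavior). This is only a sketch; the helper names are hypothetical and the numbers are invented for illustration.

```python
def is_constant(values):
    """True if every rank reports the same value (original-query semantics)."""
    return len(set(values)) <= 1


def is_non_increasing(values):
    """True if the value never grows from one rank to the next (remaining-query semantics)."""
    return all(a >= b for a, b in zip(values, values[1:]))


# Invented query_bp values by gather_result_rank, NOT real output.
og_query_bp = [100000, 100000, 100000]
plugin_query_bp = [100000, 60000, 25000]

print(is_constant(og_query_bp))
print(is_constant(plugin_query_bp))
print(is_non_increasing(plugin_query_bp))
```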

twilight zone differences

  • sum_weighted_found values are all the same except for in one specific row. WTF.

ctb commented May 12, 2024

@bluegenes here's the notebook I'm using: https://github.com/ctb/2024-debug-gather-difference/blob/main/compare-picklist.ipynb

it's mildly tricksy, because I had to force sourmash gather to use the identical set of sketches used by fastgather, via a picklist. but it seems to work ok ;)
