Computing KNN between two granges objects #77

Open
ShanSabri opened this issue Mar 2, 2020 · 2 comments
@ShanSabri

ShanSabri commented Mar 2, 2020

Hi Stuart,

Thanks for the great package.

I was wondering whether it is possible to find the k nearest neighbors rather than only the single nearest one. For example, I'm interested in tagging ATAC peaks with their 5 nearest genes. I've opened an issue on the GenomicRanges repository regarding its unexported findKNN() function and was wondering if you had any insight.

The functions below seem to work perfectly for k=1 nearest neighbor, but I'd like to extend this to k>1, while also retaining the corresponding distances:

> IRanges::nearest(peaks, tss, ignore.strand = FALSE, select = "all") # k = 1; nearest TSS for each peak
Hits object with 295913 hits and 0 metadata columns:
           queryHits subjectHits
           <integer>   <integer>
       [1]         1       15215
       [2]         2       15215
       [3]         3       15215
       [4]         4       15215
       [5]         5       15215
       ...       ...         ...
  [295909]    295640       16535
  [295910]    295641       16535
  [295911]    295642       16535
  [295912]    295643       16535
  [295913]    295644       16535
  -------
  queryLength: 295644 / subjectLength: 18436

> GenomicRanges::distanceToNearest(peaks, tss, select = "all") # k = 1; nearest TSS for each peak, with distance
Hits object with 295913 hits and 1 metadata column:
           queryHits subjectHits |  distance
           <integer>   <integer> | <integer>
       [1]         1       15215 |    107265
       [2]         2       15215 |    107065
       [3]         3       15215 |    106865
       [4]         4       15215 |    106665
       [5]         5       15215 |    106465
       ...       ...         ... .       ...
  [295909]    295640       16535 |     42858
  [295910]    295641       16535 |     43058
  [295911]    295642       16535 |     43258
  [295912]    295643       16535 |     43458
  [295913]    295644       16535 |     43658
  -------
  queryLength: 295644 / subjectLength: 18436

Any help would be much appreciated!

EDIT: I should mention that reproducible data and examples are posted on the GenomicRanges issue I opened.

@ShanSabri
Author

I managed to work up a solution that seems to work for my case.
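
The solution itself is not posted in this thread. For readers with the same question, here is one possible approach, offered as a minimal sketch and not necessarily the one used here. It collects every subject within a search window using findOverlaps() with maxgap, computes distance() for each candidate pair, and keeps the k closest subjects per query. The helper name k_nearest, the window argument, and the toy peaks/tss ranges are hypothetical and exist only for illustration. Strand is ignored by default, which differs from the ignore.strand = FALSE call above.

```r
library(GenomicRanges)

# Hypothetical helper (not part of GenomicRanges or plyranges):
# for each query range, return up to k nearest subject ranges that lie
# within `window` bp, together with their distances.
k_nearest <- function(query, subject, k = 5L, window = 1000000L,
                      ignore.strand = TRUE) {
  # Candidate pairs: every subject separated from a query by at most `window` bp
  hits <- findOverlaps(query, subject, maxgap = window,
                       ignore.strand = ignore.strand)
  q <- queryHits(hits)
  s <- subjectHits(hits)
  d <- distance(query[q], subject[s], ignore.strand = ignore.strand)

  # Sort by query, then distance (subject index breaks ties)
  o <- order(q, d, s)
  q <- q[o]; s <- s[o]; d <- d[o]

  # Rank candidates within each query and keep the first k
  rnk <- ave(seq_along(q), q, FUN = seq_along)
  keep <- rnk <= k
  data.frame(query = q[keep], subject = s[keep],
             distance = d[keep], rank = rnk[keep])
}

# Toy data (assumed for illustration only)
peaks <- GRanges("chr1", IRanges(start = c(100, 5000), width = 50))
tss   <- GRanges("chr1", IRanges(start = c(10, 300, 1000, 4000, 9000), width = 1))

res <- k_nearest(peaks, tss, k = 2L, window = 10000L)
res
```

A query with fewer than k subjects inside the window returns fewer than k rows, so the window needs to be wide enough for the data at hand. On large inputs a wide window produces many candidate pairs, so memory use grows with the window size.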

@sa-lee
Collaborator

sa-lee commented Mar 4, 2020

Glad you managed to get something to work for your needs. When I have more time, I will try to implement a family of join_nearest_neighbor_*() functions based on your use case. Would be happy to add you as a contributor, if you would like to have a go at implementing a PR. cc @lawremi
