Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSD() stack overflow with many rows #7

Open
wkc1986 opened this issue Jun 6, 2018 · 13 comments
Open

JSD() stack overflow with many rows #7

wkc1986 opened this issue Jun 6, 2018 · 13 comments

Comments

@wkc1986
Copy link

wkc1986 commented Jun 6, 2018

I'm trying to figure this out myself, but no luck yet.

# test.R

library(philentropy)

m <- matrix(runif(1500 * 10), nrow = 1500)
JSD(m)
$ /usr/local/bin/Rscript test.R
Jensen-Shannon Divergence using unit 'log2'.
Metric: 'jensen-shannon' using unit: 'log2'.
Error: segfault from C stack overflow
Execution halted
@HajkD
Copy link
Member

HajkD commented Jun 7, 2018

Dear @wkc1986,

Thank you very much for finding this issue.

I will have a look, it might be a problem with the internal Rcpp loop e.g. DistMatrixWithUnitDF() which I implemented to
pairwise compare all rows in a matrix. My prediction would be that if you write a loop in R and run pairwise comparisons JSD(x_i,y_j) it will work.

But I will keep you posted.

Many thanks and kind regards,
Hajk

@elbamos
Copy link

elbamos commented Jun 11, 2018

I'm seeing something that I suspect is related. I run JSD in a loop, each time generating a 500x500 matrix of distances. After the loop completes, within about 30 seconds I get the C stack overflow error. I'm seeing this on both OS X and Ubuntu 16.04.

@HajkD
Copy link
Member

HajkD commented Jun 11, 2018

Dear @elbamos

Thank you very very much for making me aware of this.
Yes, I also think that it is related and I think I can fix it :)

I am very sorry for the inconvenience it might have caused.

Kind regards,
Hajk

@elbamos
Copy link

elbamos commented Jun 11, 2018

No problem :)

BTW - you know in C you can refer to a function by a pointer. If you did that with custom_log, it would probably simplify a lot of your code. You'd have a single function that takes the log argument and returns a pointer to the C function, that you'd then pass to the respective distance functions.

@HajkD HajkD added the bug label Jun 11, 2018
@HajkD
Copy link
Member

HajkD commented Jun 13, 2018

Dear all,

as I predicted the issue was in the Rcpp functions DistMatrixWithUnitMAT() and DistMatrixWithoutUnitMAT() in file src/dist_matrix.cpp.

The problem was that Rcpp doesn't allow efficiently passing functions as an argument of another function. You can find a detailed discussion here:

https://stackoverflow.com/questions/27391472/passing-r-function-as-parameter-to-rcpp-function

@elbamos: You are absolutely correct and I am aware that it is possible to define C functions as pointers. But when writing philentropy I wanted to use the Rcpp notation throughout my code. Thus, since most functions are now implemented using the syntactic sugar of Rcpp I would have to re-write all functions using the base C/C++ notation to be able to pass functions as pointers. Since I don't have the time to do so in the near future and since after playing a while with the option to still pass function pointers to Rcpp functions (which simply wasn't possible) I decided to write new optimized R functions named DistMatrix(), DistMatrixNoUnit(), and DistMatrixMinkowski() to iterate through the pairwise comparisons of an input matrix.

@wkc1986 When I now run your example code:

m <- matrix(runif(1500 * 10), nrow = 1500)
dist_mat <- JSD(m)
str(dist_mat)

I don't get a segfault anymore.

# >Metric: 'jensen-shannon' using unit: 'log2'; comparing: 1500 vectors.
num [1:1500, 1:1500] 0 0.58 0.347 0.516 0.327 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:1500] "v1" "v2" "v3" "v4" ...
  ..$ : chr [1:1500] "v1" "v2" "v3" "v4" ...

@elbamos Could you please let me know if your example also works now with the newest developer version of philentropy?

In the future, I will also add the option to run matrix computations in parallel (multicore computations) to further speed up the process.

Thank you both once again very very much for making me aware of this issue and I hope that you won't have problems now.

Kind regards,
Hajk

@wkc1986
Copy link
Author

wkc1986 commented Jun 13, 2018

Confirm dev version working for my (actual) application. Thanks so much as always for your responsiveness @HajkD !

@elbamos
Copy link

elbamos commented Jun 13, 2018

@HajkD I can take a look this weekend. I can also tell you that, to solve my issue, I cut-and-pasted the JS function and DistMatrixWithUnitMAT() into another file and took out the function pointer, and that solved it. (That was my guess about what was causing it.)

Regarding function pointers, I actually wasn't referring to the passing of Rcpp functions, which always seemed like voodoo to me. I was referring to structures like this: https://github.com/HajkD/philentropy/blob/9ca1b4c503ae61ab223d892840a13adeb92cc414/src/distances.h#L1288, which could be made more maintainable and simpler, and probably considerably faster, by passing the relevant custom_log function as a pointer. Similarly, when you have structures like this: https://github.com/HajkD/philentropy/blob/9ca1b4c503ae61ab223d892840a13adeb92cc414/src/distances.h#L514 what you probably want to do is just have the check for NAs inside the if() clause. My guess is that 2000 line file could be 500-800 lines of code, which would be much simpler for you to maintain.

@HajkD
Copy link
Member

HajkD commented Jun 23, 2018

Hi @elbamos,

This is a great suggestion.

I would truly appreciate it if you could look into this and whatever makes it easier for me to maintain the code the happier I am of course :)

Many thanks for helping me on this one!

Cheers,
Hajk

@elbamos
Copy link

elbamos commented Jun 25, 2018

@HajkD

Sorry I haven't had a chance to test the new code yet; I had extracted the relevant code into a separate file and I've been using that.

In terms of using a function pointer:

# Necessary because log is an overloaded function
double mylog(const double& x) {
  return log(x);
} 

# Switched to use armadillo for openmp safety...
double jensen_shannon(const vec& P,  const vec& Q, const Rcpp::String unit){
  
...
  # function pointer def
  double (*get_log)(const double&);
  
  if (unit == "log") {
    get_log = mylog; 
  } else if (unit == "log2") {
    get_log = custom_log2; 
  } else if (unit == "log10") {
    get_log = custom_log10; 
  } else {
    Rcpp::stop("Please choose from units: log, log2, or log10.");
  }
  
 ...
    
  for(int i = 0; i < P_len; i++){
    PQsum =   P[i] + Q[i];
    if (P[i] == 0.0 || PQsum == 0.0) {
      sum1  += 0.0;
    } else {
      sum1  +=  P[i] * get_log((2.0 * P[i]) / PQsum);
    }
    if (Q[i] == 0.0 || PQsum == 0.0) {
      sum2  += 0.0;
    } else {
      sum2  +=  Q[i] * get_log((2.0 * Q[i]) / PQsum);
    }
  }
  return 0.5 * (sum1 + sum2);
}

Not fully tested, but you get the idea.

@elbamos
Copy link

elbamos commented Jun 25, 2018

Also, you can buy a performance improvement in the loops with:

  for (int i = 0; i < ncols; i++){
    const vec& vec_i = dists.col(i); 
     # You don't need to consider j < i; actually, you don't need to consider j == i or i == ncols either
     # but that's for another time.
    for (int j = i; j < ncols; j++){
        dist_value = jensen_shannon(vec_i, dists.col(j), unit);
        dist_matrix(i,j) = dist_matrix(j,i) = dist_value;
    }
  } 

@elbamos
Copy link

elbamos commented Jun 26, 2018

The function pointer idea works even better if you do the if..then in the exported function and pass the function pointer as a parameter to all of your distance functions.

Also you don't need to check that the columns are the same size each time the distance function is called -- they come from a matrix, so they have to be the same size.

(Actually it would be better to map2 them using C++11, but that's another story.)

@HajkD
Copy link
Member

HajkD commented Jun 26, 2018

Dear @elbamos,

These are absolutely amazing suggestions!!!
Thank you so so much for taking the time to help me improve the speed of computations.

I will start working on incorporating your approaches and am more than happy for any further improvements :)

Many many thanks,
Hajk

@elbamos
Copy link

elbamos commented Jun 26, 2018

BTW - it shouldn't be terribly hard to get this working with OpenMP. I put an hour or two into it but kept getting seg faults and don't have the time to track them down. But its "embarassingly parallel", as they say.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants