Benchmark to replace the transform functions #3662
Here is another one, with a benchmark:

```cpp
#include <chrono>
#include <iostream>
#include <armadillo>

int main()
{
  arma::mat output = arma::randu(10, 10);
  arma::mat output_2 = output;
  output.print("original matrix");
  double ratio = 0.2;

  auto start_mlpack = std::chrono::high_resolution_clock::now();
  output.transform([&](double val) { return (val > ratio); });
  auto stop_mlpack = std::chrono::high_resolution_clock::now();
  output.print("dropout matrix mlpack: ");

  auto start_omar = std::chrono::high_resolution_clock::now();
  output_2 = arma::sign(output_2 - ratio);
  output_2.clamp(0, 1);
  auto stop_omar = std::chrono::high_resolution_clock::now();
  output_2.print("dropout matrix omar: ");

  auto duration_mlpack = std::chrono::duration_cast<std::chrono::nanoseconds>(stop_mlpack - start_mlpack);
  std::cout << "duration mlpack: " << duration_mlpack.count() << " nanoseconds" << std::endl;
  auto duration_omar = std::chrono::duration_cast<std::chrono::nanoseconds>(stop_omar - start_omar);
  std::cout << "duration omar: " << duration_omar.count() << " nanoseconds" << std::endl;
}
```

n = 10 x 10
n = 100 x 100
Better implementations are welcomed 😄
Hi @shrit,

```cpp
#include <chrono>
#include <cmath>
#include <fstream>
#include <iostream>
#include <vector>
#include <armadillo>

arma::mat generateRandomMatrix(double mean, double stddev, int rows, int cols) {
  // Shift and scale uniform samples to the requested mean and standard deviation.
  arma::mat output = arma::randu(rows, cols);
  output = (output * stddev * std::sqrt(12.0)) + mean - stddev * std::sqrt(3.0);
  return output;
}

// Function to calculate the exponential of a matrix using Armadillo
arma::mat calculateExponentialArmadillo(arma::mat& output) {
  auto start = std::chrono::high_resolution_clock::now();
  output = arma::exp(output * (-1));
  auto stop = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
  std::cout << "Duration arma exp: " << duration.count() << " nanoseconds" << std::endl;
  return output;
}

// Function to calculate the Padé approximant of exp(-x) for positive x
double padeApproximant(double x) {
  static constexpr double num_coeffs[] = {120, -60, 12};
  static constexpr double den_coeffs[] = {120, 60, 12};
  double num = num_coeffs[0] + x * (num_coeffs[1] + x * num_coeffs[2]);
  double den = den_coeffs[0] + x * (den_coeffs[1] + x * den_coeffs[2]);
  return num / den;
}

// Function to calculate the exponential of a matrix using the Padé approximant
arma::mat calculateExponentialPade(arma::mat& output) {
  auto start = std::chrono::high_resolution_clock::now();
  output.transform([](double x) {
    return padeApproximant(x);
  });
  auto stop = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
  std::cout << "Duration Padé exp: " << duration.count() << " nanoseconds" << std::endl;
  return output;
}

// Function to calculate the exponential of a matrix using a fast approximation
arma::mat calculateExponentialFast(arma::mat& output) {
  auto start = std::chrono::high_resolution_clock::now();
  output.transform([](double x) {
    static constexpr double A0 = 1.0;
    static constexpr double A1 = 0.125;
    static constexpr double A2 = 0.0078125;
    static constexpr double A3 = 0.00032552083;
    static constexpr double A4 = 1.0172526e-5;
    if (x < 13.0) {
      double y = A0 + x * (A1 + x * (A2 + x * (A3 + x * A4)));
      y *= y;
      y *= y;
      y *= y;
      y = 1 / y;
      return y;
    }
    return 0.0;
  });
  auto stop = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
  std::cout << "Duration new exp: " << duration.count() << " nanoseconds" << std::endl;
  return output;
}

int main() {
  double mean = 0.0;
  double stddev = 2.0;
  std::ofstream file("benchmark_results.csv");
  file << "MatrixSize,ArmadilloDuration,FastDuration,PadeDuration\n";
  std::vector<int> sizes = {10, 50, 100, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000};
  for (int size : sizes) {
    arma::mat output = generateRandomMatrix(mean, stddev, size, size);
    arma::mat output_2 = output;
    arma::mat output_3 = output;
    std::cout << "\n\nSize: " << size << std::endl;

    auto start = std::chrono::high_resolution_clock::now();
    output = calculateExponentialArmadillo(output);
    auto stop = std::chrono::high_resolution_clock::now();
    auto duration_arma = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);

    start = std::chrono::high_resolution_clock::now();
    output_2 = calculateExponentialFast(output_2);
    stop = std::chrono::high_resolution_clock::now();
    auto duration_fast = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);

    start = std::chrono::high_resolution_clock::now();
    output_3 = calculateExponentialPade(output_3);
    stop = std::chrono::high_resolution_clock::now();
    auto duration_pade = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);

    file << size << "," << duration_arma.count() << "," << duration_fast.count() << "," << duration_pade.count() << "\n";
  }
  file.close();
}
```

I also tried to optimize the dropout algorithm in your second comment with the find_and_fill and conv_to_mat functions. I'd love to get your feedback on these two optimized versions I came up with:

```cpp
#include <chrono>
#include <cmath>
#include <fstream>
#include <iostream>
#include <vector>
#include <armadillo>

// Function to time a given operation and return the duration
template<typename Func>
auto time_operation(Func operation) {
  auto start = std::chrono::high_resolution_clock::now();
  operation();
  auto stop = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

int main()
{
  double ratio = 0.2;
  std::ofstream file("benchmark_results_dropout.csv");
  file << "Matrix Size,mlpack,omar,find_and_fill,conv_to_mat\n";
  std::vector<int> sizes = {10, 50, 100, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000};
  for (int size : sizes) {
    arma::mat output = arma::randu(size, size);
    arma::mat output_2 = output;
    arma::mat output_3 = output;
    arma::mat output_4 = output;
    auto mlpack_time = time_operation([&]{ output.transform([&](double val) { return (val > ratio); }); });
    auto omar_time = time_operation([&]{ output_2 = arma::sign(output_2 - ratio); output_2.clamp(0, 1); });
    auto find_and_fill_time = time_operation([&]{ arma::uvec indices = arma::find(output_3 > ratio); output_3.zeros(); output_3.elem(indices).fill(1); });
    auto conv_to_mat_time = time_operation([&]{ output_4 = arma::conv_to<arma::mat>::from(output_4 > ratio); });
    file << size << "," << mlpack_time << "," << omar_time << "," << find_and_fill_time << "," << conv_to_mat_time << "\n";
  }
  file.close();
  return 0;
}
```

What do you think about the results? Are they even relevant given the minimal time improvements?
@MarkFischinger the idea is not to find a better implementation, but instead to get rid of the usage of the transform function from Armadillo, because we can no longer use it in the mlpack codebase. Nice try though
We do need a generic implementation that will work with Bandicoot, and that is what this issue is about, but since this is such an important piece of code I think having a specialized Armadillo version is worthwhile. We can just use template specializations or SFINAE to do it, so we use the faster approach when the user is using Armadillo and the generic one otherwise. I'm interested in the approximation level of the Padé approximant approach; @MarkFischinger, can you comment on that?
btw, I really like the use of the template parameter as a function pointer:

```cpp
// Function to time a given operation and return the duration
template<typename Func>
auto time_operation(Func operation) {
  auto start = std::chrono::high_resolution_clock::now();
  operation();
  auto stop = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```
@shrit, @rcurtin, for the benchmarks, I chose the coefficients for the numerator polynomial
Oh, nice, this looks good. In the first graph you posted, is that absolute error or relative error?
@rcurtin if we agree on using the Padé approximation, @MarkFischinger would you then open a PR to replace the current solution with Padé? Would you also do the same for the Dropout layer, using the following?

```cpp
auto find_and_fill_time = time_operation([&]{ arma::uvec indices = arma::find(output_3 > ratio); output_3.zeros(); output_3.elem(indices).fill(1); });
```

@rcurtin looking at the graph, I think it is worth having one function instead of two for the dropout; the loss in speed is not that major. What do you think?
Yeah, I think the Padé approximation is an improvement overall. @zoq you did the original fast version, what do you think? For the dropout layer I think we are actually better off changing the strategy overall. We currently generate a randu matrix and then threshold it to generate a mask. I think it might be better if we could either generate the mask on-the-fly, or generate a 0/1 matrix with the right probabilities directly. Anyway, for now I would just go with one solution, since I think more optimization is possible there, even with a general implementation that will work for both Bandicoot and Armadillo.
Hi @rcurtin, it's the absolute error:

```cpp
std::abs(actual - fastApprox);
std::abs(actual - padeApprox);
```

@shrit I have some concerns about the expected range of x. With, for example,

```cpp
arma::mat output = arma::randn(1000, 1000, arma::distr_param(0, 2));
```

and the guard

```cpp
if (x < 13.0) {...}
```

in this context, the likelihood of generating a number greater than 13 is extremely low. Once I'm clear on the range of X values we're dealing with, I plan to run a few more backtests with more variance before opening that PR :)
@MarkFischinger would you open a PR and replace the current transform function for the dropout with find_and_fill? I never had a problem training mlpack on various datasets; also, this is for an activation function, which is basically at the end of the neural network, and most if not all weight distributions are close to zero with a small standard deviation.
I think the
@shrit After spending several days researching and testing, I decided to switch to using RDTSC and spreading the execution to multiple cores so that I get more

Regarding the Padé approximant, I noticed that the non-scaled version performs faster up to about

```cpp
double lambda2(double x) {
  if (x >= 13.0) {
    return 0.0;
  }
  static constexpr double A0 = 1.0, A1 = 0.125, A2 = 0.0078125, A3 = 0.00032552083, A4 = 1.0172526e-5;
  double y = A0 + x * (A1 + x * (A2 + x * (A3 + x * A4)));
  y *= y; y *= y; y *= y;
  return 1 / y;
}
```

I also experimented with another implementation using inline assembly and scaled Padé versions, but it didn't yield faster or more accurate results. For the Find and Fill algorithm, I ran the RDTSC benchmark, which confirmed the previous metrics and provided a clearer visualization. I'll update the Find and Fill method as soon as possible and am looking forward to trying out your idea, @rcurtin
Hi,
The idea here is that I am trying to replace the transform function used by logsoftmax, because we cannot have transform in Bandicoot.
Using an Armadillo function is not that bad, and any other alternative would cost even more, so as a compromise I think the best would be to replace the current implementation with `exp` from Armadillo. I have added some numbers regarding the execution time with different matrix sizes:

n = 1000 x 1000
n = 100 x 100
n = 10 x 10
n = 10000 x 10000

Even on huge matrices we see that the difference is not that important. I agree this will accumulate with each forward pass, but logsoftmax is an activation function, so I think it should be okay.
Hopefully, Bandicoot's speed will compensate for this difference immediately. @rcurtin @zoq