Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Benchmark to replace the transform functions #3662

Open
shrit opened this issue Mar 22, 2024 · 14 comments
Open

Benchmark to replace the transform functions #3662

shrit opened this issue Mar 22, 2024 · 14 comments

Comments

@shrit
Copy link
Member

shrit commented Mar 22, 2024

Hi,

The idea here is that I am trying to replace the transform function used by logsoftmax. Because we can not have transform in bandicoot.

#include <chrono>
#include <armadillo>

int main()
{
  arma::mat output = arma::randn(1000, 1000, arma::distr_param(0, 2));
  arma::mat output_2 = output;

  //output.print("original matrix");

  auto start_arma = std::chrono::high_resolution_clock::now();
  output = arma::exp(output * (-1));
  auto stop_arma = std::chrono::high_resolution_clock::now();
  //output.print("exp arma::");

  auto duration_arma = std::chrono::duration_cast<std::chrono::nanoseconds>(stop_arma - start_arma);
  std::cout << "Duration arma exp: " <<  duration_arma.count() << " nano seconds" << std::endl;

  auto start_marcus = std::chrono::high_resolution_clock::now();
  output_2.transform([](double x)
  {
    //! Fast approximation of exp(-x) for x positive.
    static constexpr double A0 = 1.0;
    static constexpr double A1 = 0.125;
    static constexpr double A2 = 0.0078125;
    static constexpr double A3 = 0.00032552083;
    static constexpr double A4 = 1.0172526e-5;

    if (x < 13.0)
    {
      double y = A0 + x * (A1 + x * (A2 + x * (A3 + x * A4)));
      y *= y;
      y *= y;
      y *= y;
      y = 1 / y;

      return y;
    }

    return 0.0;
  });
  auto stop_marcus = std::chrono::high_resolution_clock::now();

  //output_2.print("exp marcus::");

  auto duration_marcus = std::chrono::duration_cast<std::chrono::nanoseconds>(stop_marcus - start_marcus);
  std::cout <<  "Duration marcus exp: " << duration_marcus.count() << " nano seconds" << std::endl;

  output = arma::log(arma::sum(output));
  //output.print("log softmax arma::");

  output_2 = arma::log(arma::sum(output_2));
  //output_2.print("log softmax marcus::");
}

Using armadillo function is not that bad, any other alternative would cost even more, so as a compromise I think the best would be is to replace the current implementation with exp from armadillo, I have added some numbers regarding the execution time with different matrices sizes.

n = 1000 x 1000

 ./a.out
Duration arma exp: 18352784 nano seconds
Duration marcus exp: 13505136 nano seconds

n = 100 x 100

./a.out
Duration arma exp: 205599 nano seconds
Duration marcus exp: 142744 nano seconds

n = 10 x 10

./a.out
Duration arma exp: 4537 nano seconds
Duration marcus exp: 1833 nano seconds

n = 10000 x 10000

Duration arma exp: 2057701481 nano seconds
Duration marcus exp: 1421418466 nano seconds

Even on huge matrces we see that the difference is not that important, I agree this will accumulate with each forward pass, but logsoftmax is an activation function, so I think it should be okay.
Hopefully, bandicoot speed will compensate for this difference immediately. @rcurtin @zoq

@shrit shrit changed the title Benchmark to replace the approximation for logsoftmax Benchmark to replace the transform functions Mar 22, 2024
@shrit
Copy link
Member Author

shrit commented Mar 22, 2024

Here is another one, with a benchmark

#include <chrono>
#include <armadillo>

int main()
{

  arma::mat output = arma::randu(10, 10);
  arma::mat output_2 = output;

  output.print("original matrix");

  double ratio = 0.2;
  auto start_mlpack = std::chrono::high_resolution_clock::now();
  output.transform([&] (double val) { return (val > ratio); });
  auto stop_mlpack = std::chrono::high_resolution_clock::now();
  output.print("dropout matrix mlpack: ");

  auto start_omar = std::chrono::high_resolution_clock::now();
  output_2 = arma::sign(output_2  - ratio);
  output_2.clamp(0, 1);
  auto stop_omar = std::chrono::high_resolution_clock::now();
  output_2.print("dropout matrix omar: ");


  auto duration_mlpack = std::chrono::duration_cast<std::chrono::nanoseconds>(stop_mlpack - start_mlpack);
  std::cout << "duration mlpack : " <<  duration_mlpack.count() << " nano seconds" << std::endl;

  auto duration_omar = std::chrono::duration_cast<std::chrono::nanoseconds>(stop_omar - start_omar);
  std::cout << "duration omar : " <<  duration_omar.count() << " nano seconds" << std::endl;
}

n = 10 x 10

duration mlpack : 738 nano seconds
duration omar : 3248 nano seconds

n = 100 x 100

duration mlpack : 55383 nano seconds
duration omar : 216340 nano seconds

@shrit
Copy link
Member Author

shrit commented Mar 22, 2024

better implementations are welcomed 😄

@MarkFischinger
Copy link
Contributor

Hi @shrit,
I wanted to share some results from benchmarking I've been doing. I compared the Padé Approximant method (see: https://en.wikipedia.org/wiki/Pad%C3%A9_approximant) with the current Fast approximation approach. The Padé method showed a small but consistent performance improvement.


benchmark_results


#include <chrono>
#include <armadillo>
#include <iostream>
#include <fstream>

arma::mat generateRandomMatrix(double mean, double stddev, int rows, int cols) {
	arma::mat output = arma::randu(rows, cols);
	output = (output * stddev * std::sqrt(12.0)) + mean - stddev * std::sqrt(3.0);
	return output;
}

// Function to calculate the exponential of a matrix using Armadillo
arma::mat calculateExponentialArmadillo(arma::mat& output) {
	auto start = std::chrono::high_resolution_clock::now();
	output = arma::exp(output * (-1));
	auto stop = std::chrono::high_resolution_clock::now();

	auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
	std::cout << "Duration arma exp: " <<  duration.count() << " nano seconds" << std::endl;

	return output;
}

// Function to calculate Padé approximant for exp(-x) for x positive
double padeApproximant(double x) {
	static constexpr double num_coeffs[] = {120, -60, 12};
	static constexpr double den_coeffs[] = {120, 60, 12};

	double num = num_coeffs[0] + x * (num_coeffs[1] + x * num_coeffs[2]);
	double den = den_coeffs[0] + x * (den_coeffs[1] + x * den_coeffs[2]);

	return num / den;
}

// Function to calculate the exponential of a matrix using Padé approximant
arma::mat calculateExponentialPade(arma::mat& output) {
	auto start = std::chrono::high_resolution_clock::now();

	output.transform([](double x) {
    	return padeApproximant(x);
	});

	auto stop = std::chrono::high_resolution_clock::now();

	auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
	std::cout <<  "Duration Padé exp: " << duration.count() << " nano seconds" << std::endl;

	return output;
}


// Function to calculate the exponential of a matrix using a fast approximation
arma::mat calculateExponentialFast(arma::mat& output) {
	auto start = std::chrono::high_resolution_clock::now();

	output.transform([](double x) {
    	static constexpr double A0 = 1.0;
    	static constexpr double A1 = 0.125;
    	static constexpr double A2 = 0.0078125;
    	static constexpr double A3 = 0.00032552083;
    	static constexpr double A4 = 1.0172526e-5;

    	if (x < 13.0) {
        	double y = A0 + x * (A1 + x * (A2 + x * (A3 + x * A4)));
        	y *= y;
        	y *= y;
        	y *= y;
        	y = 1 / y;

        	return y;
    	}

    	return 0.0;
	});

	auto stop = std::chrono::high_resolution_clock::now();

	auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
	std::cout << "Duration new exp: " << duration.count() << " nano seconds" << std::endl;

	return output;
}

int main() {
	double mean = 0.0;
	double stddev = 2.0;

	std::ofstream file("benchmark_results.csv");
	file << "MatrixSize,ArmadilloDuration,FastDuration,PadeDuration\n";

	std::vector<int> sizes = {10, 50, 100, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000};  

	for (int size : sizes) {
    	arma::mat output = generateRandomMatrix(mean, stddev, size, size);
    	arma::mat output_2 = output;
    	arma::mat output_3 = output;

    	std::cout << "\n\nSize: " << size << std::endl;
    	auto start = std::chrono::high_resolution_clock::now();
    	output = calculateExponentialArmadillo(output);
    	auto stop = std::chrono::high_resolution_clock::now();
    	auto duration_arma = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);

    	start = std::chrono::high_resolution_clock::now();
    	output_2 = calculateExponentialFast(output_2);
    	stop = std::chrono::high_resolution_clock::now();
    	auto duration_fast = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);

    	start = std::chrono::high_resolution_clock::now();
    	output_3 = calculateExponentialPade(output_3);
    	stop = std::chrono::high_resolution_clock::now();
    	auto duration_pade = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);


    	file << size << "," << duration_arma.count() << "," << duration_fast.count() << "," << duration_pade.count() << "\n";
	}

	file.close();
}

I also tried to optimize the dropout algorithm in your second comment with the find_and_fill and conv_to_mat functions. I'd love to get your feedback on these two optimized versions I came up with:

benchmark_results_dropout


#include <cmath>
#include <fstream>
#include <vector>

// Function to time a given operation and return the duration
template<typename Func>
auto time_operation(Func operation) {
	auto start = std::chrono::high_resolution_clock::now();
	operation();
	auto stop = std::chrono::high_resolution_clock::now();
	return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

int main()
{
	double ratio = 0.2;
	std::ofstream file("benchmark_results_dropout.csv");
	file << "Matrix Size,mlpack,omar,find_and_fill,conv_to_mat\n";

	std::vector<int> sizes = {10, 50, 100, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000};

	for (int size : sizes) {

    	arma::mat output = arma::randu(size, size);
    	arma::mat output_2 = output;
    	arma::mat output_3 = output;
    	arma::mat output_4 = output;

    	auto mlpack_time = time_operation([&]{ output.transform([&] (double val) { return (val > ratio); }); });
    	auto omar_time = time_operation([&]{ output_2 = arma::sign(output_2  - ratio); output_2.clamp(0, 1); });
    	auto find_and_fill_time = time_operation([&]{ arma::uvec indices = arma::find(output_3 > ratio); output_3.zeros(); output_3.elem(indices).fill(1); });
    	auto conv_to_mat_time = time_operation([&]{ output_4 = arma::conv_to<arma::mat>::from(output_4 > ratio); });
   	 
    	file << size << "," << mlpack_time << "," << omar_time << "," << find_and_fill_time << "," << conv_to_mat_time << "\n";
	}

	file.close();
	return 0;
}

What do you think about the results? Are they even relevant regarding the minimal time improvements?

@shrit
Copy link
Member Author

shrit commented Apr 5, 2024

@MarkFischinger the idea is not to find a better implementation, but instead is to get rid of the usage of the transform function from armadillo because we can not no longer use it in mlpack code base

nice try though

@rcurtin
Copy link
Member

rcurtin commented Apr 5, 2024

We do need a generic implementation that will work with Bandicoot and that is what this issue is about, but since this is such an important piece of code I think having a specialized Armadillo version is worthwhile. We can just use template specializations or SFINAE to do it, so we use the faster approach when the user is using Armadillo and the generic one otherwise.

I'm interested in the approximation level of the Pade approximant approach, @MarkFischinger can you comment on that?

@shrit
Copy link
Member Author

shrit commented Apr 5, 2024

btw, I really like the fact of using the template parameter as a function pointer.
This is really a nice way of doing it 💯

// Function to time a given operation and return the duration
template<typename Func>
auto time_operation(Func operation) {
	auto start = std::chrono::high_resolution_clock::now();
	operation();
	auto stop = std::chrono::high_resolution_clock::now();
	return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

@MarkFischinger
Copy link
Contributor

@Shirt,
Oh, my bad, I don't know how I missed that :/

@rcurtin,
The Padé approximant offers a more accurate representation of $e^{-x}$ for small to moderate values of $x$,compared to straightforward polynomial approximations. This increased accuracy stems from the Padé approximant’s rational form, utilizing both a numerator and a denominator polynomial to closely mimic the function's behavior. The approximation level is denoted as $[m/n]$, where $m$ is the degree of the numerator polynomial and $n$ is the degree of the denominator polynomial. Choosing the approximation level here is a trade-off between performance and accuracy.

errors_and_time_pade

For the benchmarks, I chose the coefficients, numerator polynomial num_coeffs[] = {120, -60, 12} $(m=2)$ and denominator polynomial den_coeffs[] = {120, 60, 12} $(n=2)$, to balance computational efficiency with approximation accuracy. These coefficients are derived from the series expansion of $(e^{-x})$, so that the Padé approximant $[2/2]$ represents the function within the considered range of $(x)$.

@rcurtin
Copy link
Member

rcurtin commented Apr 8, 2024

Oh, nice, this looks good. In the first graph you posted, is that absolute error or relative error?

@shrit
Copy link
Member Author

shrit commented Apr 8, 2024

@rcurtin if we agree on using the Padé Approximation, @MarkFischinger would you by then open a PR ? and replace the current solution with Padé ? Also would you do the same for Dropout layer ? by using the following:

auto find_and_fill_time = time_operation([&]{ arma::uvec indices = arma::find(output_3 > ratio); output_3.zeros(); output_3.elem(indices).fill(1); });

@rcurtin looking at the graph I think it worth having one instead of two function for the dropout, the loss in speed is not that major. What do you think ?

@rcurtin
Copy link
Member

rcurtin commented Apr 9, 2024

Yeah, I think the Padé approximation is an improvement overall. @zoq you did the original fast version, what do you think?

For the dropout layer I think we are actually better off changing the strategy overall. We currently generate a randu matrix and then threshold it to generate a mask. I think it might be better if we could either generate the mask on-the-fly, or generate a 0/1 matrix with the right probabilities directly. Anyway, for now I would just go with one solution, since I think more optimization is possible there, even with a general implementation that will work for both Bandicoot and Armadillo.

@MarkFischinger
Copy link
Contributor

Hi @rcurtin,
the first graph shows the absolute error using:

std::abs(actual - fastApprox);
std::abs(actual - padeApprox);

@shrit I have some concerns about the expected range of $X$ values for the Padé Approximant. It tends to underperform in accuracy for $X$ values above 8 unless we upscale it, impacting execution speed. In my benchmarks, keeping your distribution in mind, this could be a limitation:

arma::mat output = arma::randn(1000, 1000, arma::distr_param(0, 2));

if (x < 13.0) {...}

In this context, the likelihood of generating a number greater than $8$ is about 0.0032%, but I am unsure about other matrices in real-world scenarios. I'm aware that moderate $X$ values are typical, but are there applications with for example, high variance in input data, where this might not hold true? Based on your experience, do you maybe know of any specific situations where the Padé Approximant might not perform well (receives high $X$ values)? Could you clarify the reasoning for limiting $X$ right at $13$ in our existing implementation, while $y$ is aproaching $0$?

Once I'm clear on the range of X values we're dealing with, I plan to run a few more backtests with more variance before opening that PR :)

@shrit
Copy link
Member Author

shrit commented Apr 10, 2024

@MarkFischinger would you open a PR and replace the current transform function for the dropout with find and fill ?

I never had problem training mlpack on various numbers of datasets, also this is for an activation function which is basically at the end of the neural network, and most if not all of weights distributions are close to zero with small standard deviation.
would you also open a PR for padé approximation ? and then see if all the tests we have will pass or not

@rcurtin
Copy link
Member

rcurtin commented Apr 10, 2024

I think the X >= 13 limit is just to give a shortcut for large values. It doesn't have to be 13, but I suspect it won't make a huge difference for neural network training anyway as a large X value implies a very high activation of the neuron, so it is probably true that small differences there don't matter much. You could try on the mnist_cnn example or similar with different limit values to see (you will have to ensure that the random seed is set to the same thing to compare accurately).

@MarkFischinger
Copy link
Contributor

@shrit After spending several days researching and testing, I decided to switch to using RDTSC and spreading the execution to multiple cores so that I get more average_iterations using OpenMP, for more accurate time measurement. I was experiencing inconsistent runtimes in nanoseconds when using chrono::high_resolution_clock, even when averaging multiple iterations, due to CPU variance (learned a lot about the CPU ;)

Regarding the Padé approximant, I noticed that the non-scaled version performs faster up to about $x=~1.3$, maintaining similar accuracy. However, when integrated with the fast approximate and an added if statement, it actually required more cycles than the standalone fast approximant. Interestingly, positioning the if statement at the beginning of only the fast approximate significantly reduced the cycle count by nearly half (eventhough I passed no values above 12.99 in the function! I'm not quite sure why this happens, but maybe it is because of the branch prediction of the CPU?):

double lambda2(double x) {
     if (x >= 13.0) {
          return 0.0;
      }

      static constexpr double A0 = 1.0, A1 = 0.125, A2 = 0.0078125, A3 = 0.00032552083, A4 = 1.0172526e-5;
      double y = A0 + x * (A1 + x * (A2 + x * (A3 + x * A4)));
      y *= y; y *= y; y *= y;
      return 1 / y;
}
Lambda1 Cycles: 32, Average Relative Error: 0.035419 
Lambda2 Cycles: 18, Average Relative Error: 0.035419

I also experimented with another implementation using inline assembly and scaled padé versions, but it didn’t yield faster or more accurate results.

For the Find and Fill algorithm, I ran the RDTSC benchmark, which confirmed the previous metrics and provided a clearer visualization:

93afa1bd-d803-42b7-aba5-667568e0eda8

I'll update the Find and Fill method as soon as possible and am looking forward to trying out your idea, @rcurtin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants