
Investigate discrepancy between non-BLAS and BLAS versions of linfa-linear #277

Open
YuhanLiin opened this issue Nov 12, 2022 · 9 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@YuhanLiin
Collaborator

According to the benchmark results from this comment, linfa-linear is faster with BLAS than without. The OLS algorithm isn't too different with 5 features and is only slightly slower with 10 features, but GLM is significantly slower without BLAS. We should profile and investigate the difference between the BLAS and non-BLAS performance.

@YuhanLiin added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels on Nov 12, 2022
@oojo12
Contributor

oojo12 commented Nov 16, 2022

Flamegraphs from profiling are attached. I haven't studied profiling or flamegraphs enough to build a firm foundation yet, so I can't provide any insights. Each profile was run for 1 min.
Linfa_linear.zip

@oojo12
Contributor

oojo12 commented Nov 16, 2022

Did a quick review. The 10-feature GLM benchmark with 100_000 samples spends most of its CPU time in ndarray::zip::Zip<(P1,P2),D>::for_each, which is called twice. Both calls share argmin::core::executor::Executor<O,S>::run as an ancestor.

One flow has <argmin::solver::linesearch::morethuente::MoreThuenteLineSearch<P,F> as argmin::core::Solver<O>>::next_iter called before their common <linfa_linear::glm::TweedieProblem<A> as argmin::core::ArgminOp>::gradient operation.
This fork in processing leads to ndarray::zip::Zip<(P1,P2),D>::for_each being called twice. If it's possible to consolidate the calls so that the gradient operation runs only once, or to introduce some sort of caching (if applicable; I'm not sure it would be if we're using gradient descent), then I think performance would improve; see the sketch below.
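A minimal sketch of that caching idea, assuming a wrapper around the gradient evaluation (the `CachedGradient` type and its API are illustrative, not part of linfa or argmin):

```rust
// Hypothetical sketch: reuse the last gradient when the line search and the
// main iteration ask about the same parameter vector.
use ndarray::Array1;

struct CachedGradient {
    last_params: Option<Array1<f64>>,
    last_grad: Option<Array1<f64>>,
}

impl CachedGradient {
    fn new() -> Self {
        Self { last_params: None, last_grad: None }
    }

    // `compute` stands in for the expensive gradient evaluation
    // (e.g. TweedieProblem::gradient, which triggers the Zip::for_each calls).
    fn gradient<F>(&mut self, params: &Array1<f64>, compute: F) -> Array1<f64>
    where
        F: FnOnce(&Array1<f64>) -> Array1<f64>,
    {
        if let (Some(p), Some(g)) = (&self.last_params, &self.last_grad) {
            if p == params {
                return g.clone(); // cache hit: skip the second evaluation
            }
        }
        let g = compute(params);
        self.last_params = Some(params.clone());
        self.last_grad = Some(g.clone());
        g
    }
}

fn main() {
    let mut cache = CachedGradient::new();
    let params = Array1::from(vec![1.0, 2.0]);
    let g1 = cache.gradient(&params, |p| p * 2.0); // computed
    let g2 = cache.gradient(&params, |p| p * 2.0); // served from the cache
    assert_eq!(g1, g2);
}
```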

A screenshot of the graph is attached.

[Screenshot: profile-linear]

@oojo12
Contributor

oojo12 commented Nov 16, 2022

Also, if we use the rayon backend we could look into using par_azip! instead of Zip; it's the parallel version. A sketch is below.
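For illustration, a minimal sketch assuming ndarray is built with its rayon feature (the arrays and the element-wise operation here are placeholders):

```rust
// azip! is the serial element-wise loop; par_azip! is the same traversal
// split across the rayon thread pool (requires ndarray's "rayon" feature).
use ndarray::parallel::par_azip;
use ndarray::{azip, Array1};

fn main() {
    let a = Array1::<f64>::ones(100_000);
    let b = Array1::<f64>::ones(100_000);
    let mut out = Array1::<f64>::zeros(100_000);

    // Serial, as Zip::for_each runs today.
    azip!((o in &mut out, &x in &a, &y in &b) *o = x * y);

    // Parallel version of the same loop.
    par_azip!((o in &mut out, &x in &a, &y in &b) *o = x * y);
}
```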

@YuhanLiin
Collaborator Author

YuhanLiin commented Nov 17, 2022

for_each is being called by general_mat_vec_mul_impl, which is ndarray's pure Rust matrix-vector multiplication routine. AFAIK it's the only difference between BLAS and non-BLAS for GLM, since BLAS uses a different multiplication routine. One way to improve this is to call the multiplication less often. We can also try enabling the matrixmultiply-threading feature on ndarray to improve the multiplication speed (see the sketch below). Do we also know whether the BLAS version has the same bottlenecks?
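For reference, a minimal sketch of that routine with the shapes from the 100_000-sample, 10-feature benchmark (enabling the threading feature would be a Cargo.toml change along the lines of `ndarray = { features = ["matrixmultiply-threading"] }`; the exact dependency line is assumed):

```rust
// Sketch of the mat-vec product whose implementation differs between builds.
use ndarray::linalg::general_mat_vec_mul;
use ndarray::{Array1, Array2};

fn main() {
    let a = Array2::<f64>::ones((100_000, 10)); // X: samples x features
    let x = Array1::<f64>::ones(10);            // parameter vector
    let mut y = Array1::<f64>::zeros(100_000);  // output buffer

    // y = 1.0 * A.dot(x) + 0.0 * y. With the `blas` feature this dispatches
    // to gemv; without it, it falls through to ndarray's pure Rust loop,
    // which is where the Zip::for_each samples come from.
    general_mat_vec_mul(1.0, &a, &x, 0.0, &mut y);
}
```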

@oojo12
Contributor

oojo12 commented Nov 17, 2022

I think calling multiplication less often would likely be the bigger performance boost. I can locally test enabling the matrixmultiply-threading feature.

I didn't profile the BLAS version. Is this desirable or just out of curiosity?

@YuhanLiin
Collaborator Author

It would be useful to see if multiplication is also the bottleneck for BLAS, since the BLAS build also calls multiplication in two places.

@oojo12
Contributor

oojo12 commented Nov 17, 2022

Okay, I'll run 2 more: one with BLAS and one with the matrixmultiply-threading feature.

@oojo12
Contributor

oojo12 commented Nov 17, 2022

[Flamegraph attached]
The BLAS flamegraph is a little different than the other one.

@YuhanLiin
Collaborator Author

The fact that gradient appears twice in the first flamegraph is likely due to a sampling inconsistency, not different behaviour; the second flamegraph also has two instances of gradient. The interesting thing is that the first few calls to dot perform similarly in both graphs, but the final dot in gradient is where the non-BLAS build slows down. Dimensionally, that dot is a vector-matrix multiplication of y and X, which means it linearly combines the feature columns (each with 100k elements). I think the non-BLAS dot doesn't scale well for inputs of that size; a sketch of the shapes involved is below.
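To make the dimensional argument concrete, a small sketch of the shapes involved (values are placeholders):

```rust
// The slow dot: a 100_000-element vector times a 100_000 x 10 matrix,
// i.e. one 100k-element reduction per feature column.
use ndarray::{Array1, Array2};

fn main() {
    let n_samples = 100_000;
    let n_features = 10;
    let y = Array1::<f64>::ones(n_samples);               // per-sample terms
    let x = Array2::<f64>::ones((n_samples, n_features)); // design matrix X

    // Each output element reduces over all 100k samples; this is the call
    // that stays fast under BLAS but dominates in the pure Rust path.
    let combined: Array1<f64> = y.dot(&x);
    assert_eq!(combined.len(), n_features);
}
```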
