Performance of cyl_bessel_i() on a low-powered arm64 device #92

Open
mskvortsov opened this issue May 2, 2024 · 4 comments

Comments

@mskvortsov
Contributor

mskvortsov commented May 2, 2024

While running the receiver on a low-powered device such as a Raspberry Pi, I'm seeing high CPU load. The signal is sampled at 5 Msps, with SF 11 and BW 250 kHz.

A quick profile of a run-to-completion flow from a File Source without a Throttle block shows that boost::math::cyl_bessel_i() takes a substantial share of the time. As it turns out, the default Boost math policy promotes doubles to long doubles, which the device struggles to compute with.

The promotion can be disabled as described in https://www.boost.org/doc/libs/1_85_0/libs/math/doc/html/math_toolkit/tradoffs.html:

diff --git a/lib/fft_demod_impl.cc b/lib/fft_demod_impl.cc
index 784403a..f622ada 100644
--- a/lib/fft_demod_impl.cc
+++ b/lib/fft_demod_impl.cc
@@ -14,2 +14,5 @@ extern "C" {

+using namespace boost::math::policies;
+auto no_double_promotion_policy = make_policy(promote_double<false>());
+
 namespace gr {
@@ -197,3 +200,4 @@ namespace gr {
                 if (bessel_arg < 713)  // 713 ~ log(std::numeric_limits<LLR>::max())
-                    LLs[n] = boost::math::cyl_bessel_i(0, bessel_arg);  // compute Bessel safely
+                    // TODO? std::cyl_bessel_i() exists since C++17
+                    LLs[n] = boost::math::cyl_bessel_i(0, bessel_arg, no_double_promotion_policy);  // compute Bessel safely
                 else {

The fix gives a whopping ~3x speedup on an RPi4 with no decoding degradation on my signal. However, I don't know whether long double precision is strictly required here or whether it can be downgraded just like that.
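One rough way to gauge how much numerical accuracy the promotion buys is to evaluate I0 in both double and long double and compare. The sketch below does this with a plain power series. It is an illustration only: `bessel_i0_series` and `promotion_gap` are hypothetical names, the series is not the algorithm Boost uses internally, and a small numerical gap does not by itself prove that decoding is unaffected.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Power series I0(x) = sum_{k>=0} ((x/2)^2)^k / (k!)^2, evaluated in type T.
// All terms are non-negative, so there is no cancellation.
// Reference sketch only; not the algorithm Boost uses internally.
template <typename T>
T bessel_i0_series(T x) {
    assert(x >= 0);
    const T q = (x / 2) * (x / 2);
    T term = 1;  // k = 0 term
    T sum = 1;
    for (int k = 1; k < 5000; ++k) {
        term *= q / (static_cast<T>(k) * static_cast<T>(k));
        sum += term;
        // Stop once further terms can no longer change the sum.
        if (term < sum * std::numeric_limits<T>::epsilon() / 16) break;
    }
    return sum;
}

// Relative difference between the double and long double evaluations.
double promotion_gap(double x) {
    const double d = bessel_i0_series<double>(x);
    const long double ld = bessel_i0_series<long double>(x);
    return static_cast<double>(std::fabs((static_cast<long double>(d) - ld) / ld));
}
```

Calling `promotion_gap(x)` for arguments up to the 713 cutoff used in the patch gives a feel for the relative error introduced by staying in double. On platforms where long double has the same width as double, the gap is zero by construction.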

@miweber67

> The fix gives a whopping ~3x speed up on RPi4 without decoding degradation on my signal. However, I don't know whether this long double precision is strictly required and can be downgraded just like that.

You could create a set of test input files of varying 'quality' by adding varying amounts of Gaussian white noise and center frequency shift to see if the precision is an issue for those variables.
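A minimal sketch of how such test inputs could be produced offline, assuming the recording is already loaded as complex float samples. The helper `impair` and its parameters are hypothetical; they loosely mirror a noise amplitude and a normalized frequency offset (cycles per sample), and the noise scaling is an assumption, so this is not a drop-in equivalent of any GNU Radio block.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <random>
#include <vector>

using cfloat = std::complex<float>;

// Hypothetical helper: rotate the samples by a normalized frequency offset
// (cycles per sample) and add complex white Gaussian noise.
std::vector<cfloat> impair(const std::vector<cfloat>& in,
                           float noise_voltage,
                           double freq_offset,
                           unsigned seed) {
    assert(noise_voltage >= 0.0f);
    std::mt19937 rng(seed);  // fixed seed makes each test file reproducible
    std::normal_distribution<float> gauss(0.0f, 1.0f);
    const double two_pi = 6.283185307179586;
    // Assumption: split the noise power evenly between I and Q.
    const float sigma = noise_voltage / std::sqrt(2.0f);

    std::vector<cfloat> out;
    out.reserve(in.size());
    for (std::size_t n = 0; n < in.size(); ++n) {
        const double phase = two_pi * freq_offset * static_cast<double>(n);
        const cfloat rot(static_cast<float>(std::cos(phase)),
                         static_cast<float>(std::sin(phase)));
        const float ni = sigma * gauss(rng);
        const float nq = sigma * gauss(rng);
        out.push_back(in[n] * rot + cfloat(ni, nq));
    }
    return out;
}
```

Sweeping `noise_voltage` and `freq_offset` over a grid and writing each result to its own file would give a set of inputs of graded quality to run through the patched and unpatched receivers.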

@mskvortsov
Contributor Author

I didn't see any difference in the number of packets decoded with valid CRCs. I used the Channel Model block and varied the noise_voltage and frequency_offset parameters independently in small steps until the number of valid CRCs dropped to zero. On the other hand, there are too many other LoRa block configurations to draw a definite conclusion from this limited experiment.

However, a more obvious point is that my 5 Msps sampling rate is rather high, and unfortunately it's the lowest usable rate of my receiver. cyl_bessel_i() is executed on the order of samp_rate * 2^sf times, so reducing the input sampling rate would probably be a simpler approach for my particular problem.

@miweber67

> I didn't see any difference in response in terms of the number of packets decoded with valid CRC's. I used Channel Model block and varied noise_voltage and frequency_offset parameters independently in small steps until the number of valid crc's declined to zero. On the other hand, there are too many other LoRa block configurations to make a definite conclusion from this limited experiment.

Nice... a single data point to be sure, but, it's a pleasant single data point. :-)

> However, a more obvious point is that my 5 Msps sampling rate is somewhat high, and unfortunately, it's the lowest usable rate of my receiver. cyl_bessel_i() is executed in the order of O(samp_rate * 2^sf) times, so reducing the input sampling rate would probably be a simpler approach for my particular problem.

So your frame_sync of_factor is ... 20? In issue 91 it was suggested that 4 should be adequate. If you filter and decimate by 5, do you still get good results?
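The arithmetic behind the factor of 20, and the effect of decimating by 5, can be sketched as follows. This assumes the bandwidth is 250 kHz and that the oversampling factor is simply the sample rate divided by the bandwidth; `LoraRates` and `lora_rates` are hypothetical names, not part of the project.

```cpp
#include <cassert>
#include <cstdint>

struct LoraRates {
    std::uint64_t oversampling;        // samples per chip
    std::uint64_t samples_per_symbol;  // 2^sf chips per symbol, times oversampling
};

// Assumes the sample rate is an integer multiple of the bandwidth.
LoraRates lora_rates(std::uint64_t samp_rate_hz, std::uint64_t bw_hz, unsigned sf) {
    assert(bw_hz > 0 && samp_rate_hz % bw_hz == 0);
    const std::uint64_t os = samp_rate_hz / bw_hz;
    return {os, (std::uint64_t{1} << sf) * os};
}
```

With these assumptions, `lora_rates(5000000, 250000, 11)` gives an oversampling factor of 20 and 40960 samples per symbol, while the decimated rate `lora_rates(1000000, 250000, 11)` gives a factor of 4 and 8192 samples per symbol.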

@mskvortsov
Contributor Author

It looks like the Low Pass Filter and Rational Resampler blocks are quite CPU intensive. A receiver flow with additional filtering or resampling blocks produces 4x more load and occupies one Cortex-A72 core entirely. I'm going to try a cheap 1 Msps radio next week.
