
Performance enhancement of Variogram._calc_groups() method (scikit-gstat #145) #148

Open
wants to merge 1 commit into main

Conversation

@MartinSchobben (Author)

I made a start here. Some questions:

Should the _groups names run from 1 up to len(bin_edges) - 1, with the group for len(bin_edges) set to -1, since this last group is left out of the variogram? If that is correct, I need to figure out how this can be done efficiently.

I ran pytest, but I got a lot of failures saying "ValueError: ydata must not be empty!". What does this mean?

@mmaelicke (Owner) commented May 23, 2023

Hi there, thanks a lot for getting started!

The error message comes from scipy.optimize.curve_fit, which is used under the hood to fit the variogram function. The group -1 is assigned to all point pairs at the start (L.1827) and acts as a NaN here, because only groups >= 0 are considered later on when yielding the point pairs associated with a specific group. Your adaptation of the code does not yet fill the matrix of assigned group indices (self._groups), so all of them are still -1 and an empty array is passed to the fitting function.
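As a rough standalone illustration of that behaviour (made-up example values, not the actual scikit-gstat code):

# Sketch: a -1 sentinel keeps unassigned point pairs out of every lag class.
import numpy as np

pair_distances = np.array([0.5, 1.2, 3.7, 9.9])        # hypothetical point-pair distances
n_lags = 3
groups = np.full(len(pair_distances), -1, dtype=int)   # all pairs start unassigned

# only pairs with a group index >= 0 end up in a lag class
for lag in range(n_lags):
    print(lag, pair_distances[groups == lag])  # every class stays empty while groups is all -1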

Once that part is finished, the error messages should be resolved.

I hope this already helps; otherwise I can free up some time tomorrow to give a more complete answer and/or commit to this branch, if needed.

Best,
Mirko

@mmaelicke (Owner)

And by the way, the _groups start at 0, not 1, just for info.

@MartinSchobben (Author)

I think I can start with that. It will probably take some time to progress with this, as I will work on this in the spare minutes between other projects.

@mmaelicke (Owner)

> I think I can start with that. It will probably take some time to progress with this, as I will work on this in the spare minutes between other projects.

Yeah, same for me. Just drop me a notice when you want me to contribute somehow.

@MartinSchobben (Author)

It took me some time, but I am wondering if the small changes in the next commit would do the trick. Since the bin_edges accumulate differences up to, but not including, max_lag, one can also modify the loop over the lag classes so that it does not range up to the maximum bin edge. That would replace using -1 as padding for values beyond the max_lag range. Or am I missing something here? I have not run the tests yet.
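A rough sketch of what I mean (illustrative names and values, not the actual method; the -1 initialisation is only kept here to make the out-of-range pairs visible):

# Sketch: loop only over bins below max_lag, so distances beyond the last
# bin edge never receive a lag class in the first place.
import numpy as np

distances = np.array([0.5, 1.2, 3.7, 9.9])       # made-up point-pair distances
bin_edges = np.array([1.0, 2.0, 4.0])            # upper edges, capped at max_lag
lower_edges = np.concatenate(([0.0], bin_edges[:-1]))

groups = np.full(len(distances), -1, dtype=int)
for i, (lo, hi) in enumerate(zip(lower_edges, bin_edges)):
    groups[(distances >= lo) & (distances < hi)] = i

print(groups)  # [ 0  1  2 -1] -> 9.9 lies beyond max_lag and never gets a class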

@mmaelicke (Owner)

I guess that is right. I have to defend my PhD thesis in two weeks, so I can't test or contribute right now. But there should be tests that will fail if the changes yield different results. From mid-July on, I can start testing and contributing myself.
Sorry for the inconvenience.

@MartinSchobben (Author)

That's alright. I can see that has priority. Good luck defending!

@mmaelicke (Owner)

Hey, I have a short update. While working on another project I had to do something really similar (with PyTorch), and I quickly checked the respective NumPy functions.

Consider replacing the Variogram._calc_groups method with a call to np.digitize and np.where, something very similar to this:

# vario is a skg.Variogram instance

digi = np.digitize(vario.distance, vario.bins, right=True)
grp = np.where(digi == vario.n_lags, -1, digi) 

This will yield exactly the same internal Variogram._groups array:

assert (grp == vario._groups).all() 
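To make the edge case concrete, here is a small standalone example with made-up distances and bins (not taken from the package):

import numpy as np

distance = np.array([0.4, 1.0, 2.5, 7.3])   # made-up pairwise distances
bins = np.array([1.0, 2.0, 3.0])            # upper bin edges, capped at maxlag
n_lags = len(bins)

digi = np.digitize(distance, bins, right=True)  # 7.3 falls beyond the last edge -> index n_lags
grp = np.where(digi == n_lags, -1, digi)

print(digi)  # [0 0 2 3]
print(grp)   # [ 0  0  2 -1]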

This should get rid of the for loop. In terms of speed, I ran this short test:

import timeit

import numpy as np
import skgstat as skg

def digi(vario):
    # the proposed vectorised group assignment
    grp = np.digitize(vario.distance, vario.bins, right=True)
    grp = np.where(grp == vario.n_lags, -1, grp)

for n in (30, 50, 100, 300, 500):
    coords, vals = skg.data.pancake(N=n, seed=42).get('sample')
    vario = skg.Variogram(coords, vals, maxlag=0.6, n_lags=25)

    print(f"N = {n}")
    print(f"Old loop:    {round(timeit.timeit(lambda: vario._calc_groups(force=True), number=100) * 10, 2)}ms")
    print(f"np.digitize: {round(timeit.timeit(lambda: digi(vario), number=100) * 10, 2)}ms\n")
N = 30
Old loop:    0.13ms
np.digitize: 0.04ms

N = 50
Old loop:    0.21ms
np.digitize: 0.07ms

N = 100
Old loop:    0.34ms
np.digitize: 0.2ms

N = 300
Old loop:    2.17ms
np.digitize: 1.43ms

N = 500
Old loop:    5.25ms
np.digitize: 4.04ms

If you want, and you agree, you are welcome to push these changes. Then I can put my code review here, and the contribution would be associated with you after the merge (as it was your idea).

Best
Mirko

@MartinSchobben (Author)

Yes, that sounds good. This seems to be the easier solution. I already noticed that not labeling with -1 gave some other issues down the line (I can't remember what exactly). I wonder, however, why the performance increase is only marginal. When I ran benchmarks, it was orders of magnitude faster.

@MartinSchobben (Author)

This is the pytest output

log.txt

@mmaelicke (Owner)

> This is the pytest output
>
> log.txt

Ok, the failure is not related to the changes made here. There is an optional dependency on gstools, which is only needed for the gstools interface. Usually the tests should be skipped if gstools is not present; no idea why that does not happen here. I have to look into that.
The tests here on GitHub should work, as they install the optional dependencies.

About the smaller speed-up compared to your tests: first, np.where could be the bottleneck here; I have not tested that. Second, if the comparison is against the instantiation of a full skg.Variogram (i.e. the old behavior), that also involves things like calculating the semi-variance and fitting a model. Thus, the benchmark needs to time the old _calc_groups(force=True) only. I am not sure how you did that.
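For example, a rough sketch of how the isolated timing could look, reusing the pancake sample from above (exact numbers will differ per machine):

# Sketch: time only the group assignment, not the full Variogram setup,
# which also computes semi-variances and fits a model.
import timeit
import skgstat as skg

coords, vals = skg.data.pancake(N=300, seed=42).get('sample')
vario = skg.Variogram(coords, vals, maxlag=0.6, n_lags=25)

full = timeit.timeit(lambda: skg.Variogram(coords, vals, maxlag=0.6, n_lags=25), number=10)
grouping = timeit.timeit(lambda: vario._calc_groups(force=True), number=10)

print(f"full instantiation: {full / 10 * 1000:.2f} ms per call")
print(f"_calc_groups only:  {grouping / 10 * 1000:.2f} ms per call")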

If you are happy with the changes, I can go ahead and merge?

@MartinSchobben (Author)

Running this:

import numpy as np
import timeit

max_d = 1e6
max_lag = int(1e5)
# distance
d = np.arange(0, max_d)
# get the bin edges
bin_edges = np.linspace(0, max_d, max_lag)
 
n = 5

# for loop
def for_loop_solution():
    # -1 is the group for distances outside maxlag
    groups = np.ones(len(d), dtype=int) * -1

    for i, bounds in enumerate(zip([0] + list(bin_edges), bin_edges)):
        groups[np.where((d >= bounds[0]) & (d < bounds[1]))] = i

result = timeit.timeit(stmt='for_loop_solution()', globals=globals(), number=n)

# calculate the execution time
# get the average execution time
print(f"Execution time is {result / n} seconds")

# numpy
def numpy_solution():
    groups = np.digitize(d, bin_edges)
    groups = np.where(groups == max_lag, -1, groups)

result = timeit.timeit(stmt='numpy_solution()', globals=globals(), number=n)

# calculate the execution time
# get the average execution time
print(f"Execution time is {result / n} seconds")

I get the following:

Execution time is 111.75757568539993 seconds
Execution time is 0.017489605200171354 seconds
