
Performance enhancement of Variogram._calc_groups() method (scikit-gstat #145) #148

Open
wants to merge 1 commit into main

Conversation

@MartinSchobben (Author)

I made a start here. Some questions:

Should the _groups names run from 1 up to len(bin_edges) - 1, with the group for len(bin_edges) set to -1, since this last group is left out of the variogram? If that is correct, I need to figure out how this can be done efficiently.

I ran pytest, but I got a lot of failures saying "ValueError: ydata must not be empty!". What does this mean?

@mmaelicke (Owner) commented May 23, 2023

Hi there, thanks a lot for getting started!

The error message comes from scipy.optimize.curve_fit, which is used under the hood to fit the variogram function. The group -1 is assigned to all point pairs at the start (L.1827) and acts as a NaN here, because only groups >= 0 are considered later on when yielding the point pairs associated with a specific group. Your adaptation of the code does not yet fill the matrix of assigned group indices (self._groups), so all of them are still -1 and an empty array is passed to the fitting function.
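As a rough standalone illustration of that behaviour (made-up example values, not the actual scikit-gstat code):

# Sketch: a -1 sentinel keeps unassigned point pairs out of every lag class.
import numpy as np

pair_distances = np.array([0.5, 1.2, 3.7, 9.9])        # hypothetical point-pair distances
n_lags = 3
groups = np.full(len(pair_distances), -1, dtype=int)   # all pairs start unassigned

# only pairs with a group index >= 0 end up in a lag class
for lag in range(n_lags):
    print(lag, pair_distances[groups == lag])  # every class stays empty while groups is all -1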

Once that part is finished, the error messages should be resolved.

I hope this already helps; otherwise I can free up some time tomorrow to give a more complete answer and/or commit to this branch, if needed.

Best,
Mirko

@mmaelicke (Owner)

And by the way, the _groups start at 0, not 1, just for info.

@MartinSchobben (Author)

I think I can start with that. It will probably take some time to progress with this, as I will work on this in the spare minutes between other projects.

@mmaelicke (Owner)

> I think I can start with that. It will probably take some time to progress with this, as I will work on this in the spare minutes between other projects.

Yeah, same for me. Just drop me a notice when you want me to contribute somehow.

@MartinSchobben (Author)

It took me some time, but I am wondering if the small changes in the next commit would do the trick. Since the bin_edges accumulate differences up to, but not including, max_lag, one can also modify the loop over the lag classes so that it does not range up to the maximum bin edge. That would replace using -1 as padding for values beyond the max_lag range. Or am I missing something here? I have not run the tests yet.
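A rough sketch of what I mean (illustrative names and values, not the actual method; the -1 initialisation is only kept here to make the out-of-range pairs visible):

# Sketch: loop only over bins below max_lag, so distances beyond the last
# bin edge never receive a lag class in the first place.
import numpy as np

distances = np.array([0.5, 1.2, 3.7, 9.9])       # made-up point-pair distances
bin_edges = np.array([1.0, 2.0, 4.0])            # upper edges, capped at max_lag
lower_edges = np.concatenate(([0.0], bin_edges[:-1]))

groups = np.full(len(distances), -1, dtype=int)
for i, (lo, hi) in enumerate(zip(lower_edges, bin_edges)):
    groups[(distances >= lo) & (distances < hi)] = i

print(groups)  # [ 0  1  2 -1] -> 9.9 lies beyond max_lag and never gets a class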

@mmaelicke (Owner)

I guess that is right. I have to defend my PhD thesis in two weeks, so I can't test or contribute right now. But there should be tests that will fail if the changes yield different results. From mid-July on, I can start testing and contributing myself.
Sorry for the inconvenience.

@MartinSchobben (Author)

That's alright. I can see that has priority. Good luck defending!

@mmaelicke (Owner)

Hey, I have a short update. While working on another project I had to do something really similar (with PyTorch), and I quickly checked the respective NumPy functions.

Consider replacing the Variogram._calc_groups method with a call to np.digitize and np.where, something very similar to this:

# vario is a skg.Variogram instance

digi = np.digitize(vario.distance, vario.bins, right=True)
grp = np.where(digi == vario.n_lags, -1, digi) 

This will yield exactly the same internal Variogram._groups array:

assert (grp == vario._groups).all() 
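To make the edge case concrete, here is a small standalone example with made-up distances and bins (not taken from the package):

import numpy as np

distance = np.array([0.4, 1.0, 2.5, 7.3])   # made-up pairwise distances
bins = np.array([1.0, 2.0, 3.0])            # upper bin edges, capped at maxlag
n_lags = len(bins)

digi = np.digitize(distance, bins, right=True)  # 7.3 falls beyond the last edge -> index n_lags
grp = np.where(digi == n_lags, -1, digi)

print(digi)  # [0 0 2 3]
print(grp)   # [ 0  0  2 -1]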

This should get rid of the for loop. In terms of speed, I ran this short test:

import timeit

import numpy as np
import skgstat as skg

def digi(vario):
    # the proposed vectorised group assignment
    grp = np.digitize(vario.distance, vario.bins, right=True)
    grp = np.where(grp == vario.n_lags, -1, grp)

for n in (30, 50, 100, 300, 500):
    coords, vals = skg.data.pancake(N=n, seed=42).get('sample')
    vario = skg.Variogram(coords, vals, maxlag=0.6, n_lags=25)

    print(f"N = {n}")
    print(f"Old loop:    {round(timeit.timeit(lambda: vario._calc_groups(force=True), number=100) * 10, 2)}ms")
    print(f"np.digitize: {round(timeit.timeit(lambda: digi(vario), number=100) * 10, 2)}ms\n")
N = 30
Old loop:    0.13ms
np.digitize: 0.04ms

N = 50
Old loop:    0.21ms
np.digitize: 0.07ms

N = 100
Old loop:    0.34ms
np.digitize: 0.2ms

N = 300
Old loop:    2.17ms
np.digitize: 1.43ms

N = 500
Old loop:    5.25ms
np.digitize: 4.04ms

If you want, and you agree, you are welcome to push these changes. Then I can put my code review here, and the contribution would be associated with you after the merge (as it was your idea).

Best
Mirko

@MartinSchobben (Author)

Yes, that sounds good. This seems to be the easier solution. I already noticed that not labeling with -1 gave some other issues down the line (I can't remember what exactly). I wonder, however, why the performance increase is only marginal. When I ran benchmarks, it was orders of magnitude faster.

@MartinSchobben (Author)

This is the pytest output

log.txt

@mmaelicke (Owner)

> This is the pytest output
>
> log.txt

Ok, the failure is not related to the changes made here. There is an optional dependency on gstools, which is only needed for the gstools interface. Usually the tests should be skipped if gstools is not present; no idea why that does not happen here. I have to look into that.
The tests here on GitHub should work, as they install the optional dependencies.

About the smaller speed-up compared to your tests: first, np.where could be the bottleneck here; I have not tested that. Second, if the comparison is against the instantiation of a full skg.Variogram (i.e. the old behavior), that also involves things like calculating the semi-variance and fitting a model. Thus, the benchmark needs to time the old _calc_groups(force=True) only. I am not sure how you did that.
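For example, a rough sketch of how the isolated timing could look, reusing the pancake sample from above (exact numbers will differ per machine):

# Sketch: time only the group assignment, not the full Variogram setup,
# which also computes semi-variances and fits a model.
import timeit
import skgstat as skg

coords, vals = skg.data.pancake(N=300, seed=42).get('sample')
vario = skg.Variogram(coords, vals, maxlag=0.6, n_lags=25)

full = timeit.timeit(lambda: skg.Variogram(coords, vals, maxlag=0.6, n_lags=25), number=10)
grouping = timeit.timeit(lambda: vario._calc_groups(force=True), number=10)

print(f"full instantiation: {full / 10 * 1000:.2f} ms per call")
print(f"_calc_groups only:  {grouping / 10 * 1000:.2f} ms per call")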

If you are happy with the changes, I can go ahead and merge?

@MartinSchobben (Author)

Running this:

import numpy as np
import timeit

max_d = 1e6
max_lag = int(1e5)
# distance
d = np.arange(0, max_d)
# get the bin edges
bin_edges = np.linspace(0, max_d, max_lag)
 
n = 5

# for loop
def for_loop_solution():
    # -1 is the group for distances outside maxlag
    groups = np.ones(len(d), dtype=int) * -1

    for i, bounds in enumerate(zip([0] + list(bin_edges), bin_edges)):
        groups[np.where((d >= bounds[0]) & (d < bounds[1]))] = i

result = timeit.timeit(stmt='for_loop_solution()', globals=globals(), number=n)

# calculate the execution time
# get the average execution time
print(f"Execution time is {result / n} seconds")

# numpy
def numpy_solution():
    groups = np.digitize(d, bin_edges)
    groups = np.where(groups == max_lag, -1, groups)

result = timeit.timeit(stmt='numpy_solution()', globals=globals(), number=n)

# calculate the execution time
# get the average execution time
print(f"Execution time is {result / n} seconds")

I get the following:

Execution time is 111.75757568539993 seconds
Execution time is 0.017489605200171354 seconds
