Single threaded split #85

maartenbreddels · 2018-12-18T21:24:03Z

(continues from #79, will rebase after merge)

Implement what is discussed in #83.

TODO:

docstrings

codecov · 2018-12-18T21:42:43Z

Codecov Report

Merging #85 into master will decrease coverage by 0.21%.
The diff coverage is 86.66%.

@@            Coverage Diff             @@
##           master      #85      +/-   ##
==========================================
- Coverage   96.96%   96.74%   -0.22%     
==========================================
  Files          10       10              
  Lines        1053     1076      +23     
==========================================
+ Hits         1021     1041      +20     
- Misses         32       35       +3

Impacted Files	Coverage Δ
pygbm/grower.py	`92.89% <100%> (+0.07%)`	⬆️
pygbm/gradient_boosting.py	`96.86% <75%> (+0.01%)`	⬆️
pygbm/splitting.py	`97.59% <86.95%> (-1.35%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 686495b...e84cdc3.

NicolasHug

Looks good in general! Thanks!

Made a few comments.

Waiting for @ogrisel's input, but it might be interesting to add a new benchmark script comparing memory usage with parallel_splitting=True and parallel_splitting=False.

NicolasHug · 2018-12-18T21:49:14Z

pygbm/splitting.py

+def split_indices_single_thread(context, split_info, sample_indices):
+    """Split samples into left and right arrays.
+
+    This implementation requires less memory than the parallel version.


Maybe add a comment about using Hoare's partition scheme, just for reference.

OK, I didn't know it had a name. I'm fine with adding it, but did you check that it is effectively the same algorithm?

Yes, I'm pretty sure it is: https://en.wikipedia.org/wiki/Quicksort#Hoare_partition_scheme
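For reference, a Hoare-style two-pointer partition of an index array can be sketched as follows. This is a minimal illustration, not the code from this PR: `partition_indices` and the boolean mask `goes_left` are hypothetical names, with the mask standing in for whatever per-sample comparison the real split performs. The partition is done in place, so no scratch buffers are allocated, and the relative order of samples within each side is not preserved.

```python
import numpy as np


def partition_indices(sample_indices, goes_left):
    """Partition sample_indices in place and return (left, right) views.

    goes_left is a hypothetical boolean array indexed by sample id; it
    stands in for the real split criterion.
    """
    i, j = 0, sample_indices.shape[0] - 1
    while i <= j:
        # Advance i past samples that already sit on the left side.
        while i <= j and goes_left[sample_indices[i]]:
            i += 1
        # Move j back past samples that already sit on the right side.
        while i <= j and not goes_left[sample_indices[j]]:
            j -= 1
        if i < j:
            # sample_indices[i] belongs right and sample_indices[j] belongs
            # left: swap them and keep scanning inward.
            sample_indices[i], sample_indices[j] = (
                sample_indices[j], sample_indices[i])
            i += 1
            j -= 1
    # Both results are views on the same buffer, not copies.
    return sample_indices[:i], sample_indices[i:]
```

Because the two returned arrays are views on `sample_indices`, the caller must not expect the input array to keep its original order.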

NicolasHug · 2018-12-18T21:50:40Z

pygbm/splitting.py

+    ----------
+    context : SplittingContext
+        The splitting context
+    split_ingo : SplitInfo


split_info (my fault ^^)
Can you also fix it at the other occurrence, please?

NicolasHug · 2018-12-18T21:52:05Z

pygbm/splitting.py

+            i += 1
+            j -= 1
+    return (sample_indices[:i],
+            sample_indices[i:])


return sample_indices[:i], sample_indices[i:] ?

NicolasHug · 2018-12-18T21:53:59Z

pygbm/grower.py

@@ -336,6 +339,7 @@ def split_next(self):
        node = heappop(self.splittable_nodes)

        tic = time()
+        split_indices = split_indices_parallel if self.parallel_splitting else split_indices_single_thread


I think we should fork in SplittingContext instead of in the grower. We probably don't need grower.parallel_splitting.

That is, call SplittingContext.split_indice() in the grower, and in splitting.py do:

def split_indice(...):
    if self.parallel_splitting:
        return ...
    else:
        return ...
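A slightly fuller sketch of that dispatch, in plain Python, might look like the following. Everything here is illustrative: the class is a stripped-down stand-in for the real `SplittingContext`, the method name `split_indices` is an assumption, and the two module-level functions are placeholders that only report which path was taken. If the real class is compiled with Numba, the dispatch may need to be written differently.

```python
def _split_indices_parallel(context, split_info, sample_indices):
    # Placeholder for the parallel implementation.
    return "parallel"


def _split_indices_single_threaded(context, split_info, sample_indices):
    # Placeholder for the single-threaded, lower-memory implementation.
    return "single-threaded"


class SplittingContext:
    """Hypothetical, stripped-down stand-in for the real context class."""

    def __init__(self, parallel_splitting=True):
        self.parallel_splitting = parallel_splitting

    def split_indices(self, split_info, sample_indices):
        # The grower only calls this method and never needs to know
        # which implementation is used underneath.
        if self.parallel_splitting:
            return _split_indices_parallel(self, split_info, sample_indices)
        return _split_indices_single_threaded(self, split_info, sample_indices)
```

With this layout the flag lives in one place, and the grower no longer needs its own `parallel_splitting` attribute.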

NicolasHug · 2018-12-18T21:58:27Z

tests/test_splitting.py

@@ -287,6 +288,12 @@ def test_split_indices():
    assert samples_left.shape[0] == si_root.n_samples_left
    assert samples_right.shape[0] == si_root.n_samples_right

+    samples_left_single_thread, samples_right_single_thread = split_indices_single_thread(


Maybe instead parametrize this test with parallel_split=True and parallel_split=False, so that both methods go through exactly the same tests (though I agree your current test also covers this implicitly)
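As a rough illustration of the suggested parametrization, the sketch below runs one test body against both code paths using `pytest.mark.parametrize`. The `split` function is a hypothetical stand-in for the two splitting implementations, not pygbm code; only the shape of the test is the point.

```python
import pytest


def split(values, threshold, parallel_split):
    """Hypothetical stand-in offering two code paths with the same contract."""
    if parallel_split:
        left = [v for v in values if v <= threshold]
        right = [v for v in values if v > threshold]
    else:
        left, right = [], []
        for v in values:
            (left if v <= threshold else right).append(v)
    return left, right


@pytest.mark.parametrize("parallel_split", [True, False])
def test_split(parallel_split):
    # The same assertions run once per parameter value.
    left, right = split([5, 1, 4, 2, 3], threshold=2,
                        parallel_split=parallel_split)
    assert sorted(left) == [1, 2]
    assert sorted(right) == [3, 4, 5]
```

Sorting before comparing keeps the test valid even when an implementation does not preserve sample order within each side.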

ogrisel

Some more comments on top of @NicolasHug's review.

ogrisel · 2018-12-19T12:33:45Z

benchmarks/bench_higgs_boson.py

@@ -85,7 +85,7 @@ def load_data():
                                         max_leaf_nodes=n_leaf_nodes,
                                         n_iter_no_change=None,
                                         random_state=0,
-                                         verbose=1)
+                                         verbose=1, parallel_splitting=False)


Maybe you can expose this as a command-line flag to make it easy to profile with mprof run benchmarks/bench_higgs_boson.py --disable-parallel-splitting (using memory_profiler).
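One way such a flag could be wired up with `argparse` is sketched below. The flag name follows the comment above; the helper names `parse_args` and `estimator_kwargs` are hypothetical, and the benchmark's other options are omitted.

```python
import argparse


def parse_args(argv=None):
    """Parse the benchmark's command-line options (only the new flag here)."""
    parser = argparse.ArgumentParser(description="Higgs boson benchmark")
    parser.add_argument(
        "--disable-parallel-splitting", action="store_true",
        help="use the single-threaded, lower-memory index splitting")
    return parser.parse_args(argv)


def estimator_kwargs(args):
    """Translate parsed options into keyword arguments for the estimator."""
    return {
        "verbose": 1,
        # Parallel splitting stays on unless the flag is passed.
        "parallel_splitting": not args.disable_parallel_splitting,
    }
```

Memory usage for the two modes can then be compared by running the script under `mprof run` once with and once without the flag.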

ogrisel · 2018-12-19T14:36:01Z

pygbm/splitting.py

@@ -304,6 +312,31 @@ def split_indices(context, split_info, sample_indices):
    return (sample_indices[:right_child_position],
            sample_indices[right_child_position:])



PEP8: you need 2 blank lines to separate top level functions.

NicolasHug

A few more comments.

Also, please see the PEP8 issues on Travis, or run locally:

flake8 pygbm tests examples benchmarks

NicolasHug · 2018-12-21T15:08:23Z

pygbm/splitting.py

@@ -304,6 +312,31 @@ def split_indices(context, split_info, sample_indices):
    return (sample_indices[:right_child_position],
            sample_indices[right_child_position:])

+@njit(parallel=False)


You can just use @njit, parallelism is off by default

Suggested change

-@njit(parallel=False)
+@njit

NicolasHug · 2018-12-21T15:10:23Z

pygbm/splitting.py

+            sample_indices[i], sample_indices[j] = sample_indices[j], sample_indices[i]
+            i += 1
+            j -= 1
+    return (sample_indices[:i], sample_indices[i:])


No need for parentheses.

NicolasHug · 2018-12-21T15:12:28Z

tests/test_splitting.py

@@ -261,7 +262,7 @@ def test_split_indices():
                               n_bins_per_feature,
                               all_gradients, all_hessians,
                               l2_regularization, min_hessian_to_split,
-                               min_samples_leaf, min_gain_to_split)
+                               min_samples_leaf, min_gain_to_split, True)


Please parametrize this test with True/False instead and remove the checks below

NicolasHug · 2018-12-21T15:14:47Z

pygbm/splitting.py

@@ -304,6 +312,31 @@ def split_indices(context, split_info, sample_indices):
    return (sample_indices[:right_child_position],
            sample_indices[right_child_position:])

+@njit(parallel=False)
+def _split_indices_single_threaded(context, split_info, sample_indices):


Suggested change

-def _split_indices_single_threaded(context, split_info, sample_indices):
+def _split_indices_single_threaded(context, split_info, sample_indices):
+    # single-threaded partition into left and right arrays (Hoare's partition scheme)

NicolasHug reviewed Dec 18, 2018

ogrisel reviewed Dec 19, 2018

maartenbreddels added 3 commits December 19, 2018 15:29

non-parallel split_indices that require less memory

e6ea8a3

implemented suggestions

b466047

add --no-parallel-split argument to higg benchmark

e84cdc3

maartenbreddels force-pushed the single_threaded_split branch from d8585bb to e84cdc3 Compare December 19, 2018 14:30

ogrisel reviewed Dec 19, 2018

NicolasHug reviewed Dec 21, 2018
