
[Feature Request] Progress Bars #7574

Open
Erotemic opened this issue Oct 4, 2016 · 16 comments

@Erotemic
Contributor

Erotemic commented Oct 4, 2016

I'm currently working with a fork of sklearn where I include progress bars in long-running estimators (particularly MiniBatchKMeans). Some of the algorithms take up to 9 hours to run, and having an estimated time remaining has been quite helpful for knowing when I should come back and look at the results.

I was wondering if it would be worthwhile to compile some of these into a pull request and add this functionality to the library. If so, there are a few logistics:

Which progress implementation should I use? I see three options here.

  1. I could add a dependency on the external click library and use its ProgressBar class.
  2. I could port a simplified version of my own ProgIter class into sklearn's utils.
  3. I could port a simplified version of click's ProgressBar class.

All options essentially wrap an iterator and dump some info to stdout about how far through the iterator they are.

These options are ordered from least work to most. The first option is the least work, but it adds a dependency. I have experience with the second option, but ProgIter is not as widely used as click (although my port would use a more click-like signature). The third option requires me to delve into the click library, which I haven't had any experience with yet. However, judging from my initial look at its implementation, it looks like my implementation may be more efficient: click seems to compute and display progress information on every iteration, whereas my implementation tries to minimally impact the wrapped loop by adjusting its report frequency.

Here is some example code. Previously a code block looked like this:

        if self.verbose:
            print('Computing label assignment and total inertia')
        x_squared_norms = row_norms(X, squared=True)
        slices = gen_batches(X.shape[0], self.batch_size)
        results = [_labels_inertia(X[s], x_squared_norms[s],
                                   self.cluster_centers_) for s in slices]
        labels, inertia = zip(*results)
        return np.hstack(labels), np.sum(inertia)

If I port a refactored / pared-down version of my progress iterator to sklearn, then it will look something like this:

        if self.verbose:
            print('Computing label assignment and total inertia')
        x_squared_norms = row_norms(X, squared=True)
        n_samples = X.shape[0]
        # gen_batches yields ceil(n_samples / batch_size) slices
        total_batches = (n_samples + self.batch_size - 1) // self.batch_size
        slices = gen_batches(n_samples, self.batch_size)
        slices = ProgIter(slices, label='computing labels and inertia',
                          length=total_batches, enabled=self.verbose)
        results = [_labels_inertia(X[s], x_squared_norms[s],
                                   self.cluster_centers_) for s in slices]
        labels, inertia = zip(*results)
        return np.hstack(labels), np.sum(inertia)

Here is a mockup of the simplified ProgIter object. I pared it down rather quickly, so there may be some small issues in this version.

from __future__ import absolute_import, division, print_function, unicode_literals
import sys
import time
import datetime
import collections
import six

WIN32 = sys.platform.startswith('win32')

if WIN32:
    # Use time.clock in win32
    default_timer = time.clock
    BEFORE_BAR = '\r'
    AFTER_BAR = '\n'
else:
    default_timer = time.time
    BEFORE_BAR = '\r\033[?25l'
    AFTER_BAR = '\033[?25h\n'

text_type = six.text_type

class ProgIter(object):
    def __init__(self, iterable=None, label=None, length=None, enabled=True,
                 freq=1, adjust=True, eta_window=64):
        if length is None:
            try:
                length = len(iterable)
            except Exception:
                pass
        if label is None:
            label = ''
        self.iterable = iterable
        self.label = label
        self.length = length
        self.freq = freq
        self.enabled = enabled
        self.adjust = adjust
        self.count = 0
        # Window size for estimated time remaining
        self.eta_window = eta_window

    def __call__(self, iterable):
        self.iterable = iterable
        return self

    def __iter__(self):
        if not self.enabled:
            return iter(self.iterable)
        else:
            return self.iter_rate()

    def build_msg_fmtstr(self):
        with_wall = True
        tzname = time.tzname[0]
        length_ = '?' if self.length is None else text_type(self.length)
        msg_head = ['', self.label, ' {count:4d}/', length_, '...  ']
        msg_tail = [
            ('rate={rate:4.2f} Hz,'),
            ('' if self.length is None else ' eta={eta},'),
            ' elapsed={elapsed},',
            (' wall={wall} ' + tzname if with_wall else ''),
        ]
        msg_fmtstr_time = ''.join(([BEFORE_BAR] + msg_head + msg_tail))
        return msg_fmtstr_time

    def tryflush(self):
        try:
            # flush sometimes causes issues in IPython notebooks
            sys.stdout.flush()
        except IOError:
            pass

    def write(self, msg):
        sys.stdout.write(msg)

    def iter_rate(self):
        freq          = self.freq
        self.count    = 0
        between_count = 0
        last_count    = 0

        # target time between progress updates, in seconds
        # (used for freq adjustment)
        time_thresh = 2.0

        max_between_time = -1.0
        max_between_count = -1.0

        iters_per_second = -1

        # Prepare for iteration
        msg_fmtstr = self.build_msg_fmtstr()

        self.tryflush()

        msg = msg_fmtstr.format(
            count=self.count, rate=0.0,
            eta=text_type('0:00:00'),
            elapsed=text_type('0:00:00'),
            wall=time.strftime('%H:%M'),
        )
        self.write(msg)

        self.tryflush()

        start_time = default_timer()
        last_time  = start_time

        # use last few (64) times to compute a more stable average rate
        measure_between_time = collections.deque([], maxlen=self.eta_window)
        measure_est_seconds = collections.deque([], maxlen=self.eta_window)

        # Wrap the for loop with a generator
        for self.count, item in enumerate(self.iterable, start=1):
            yield item

            if (self.count) % freq == 0:
                # update progress information every so often
                now_time          = default_timer()
                between_time      = (now_time - last_time)
                between_count     = self.count - last_count
                total_seconds     = (now_time - start_time)

                measure_between_time.append(between_count / (float(between_time) + 1E-9))
                iters_per_second = sum(measure_between_time) / len(measure_between_time)

                if self.length is None:
                    est_seconds_left = -1
                else:
                    iters_left = self.length - self.count
                    est_eta = iters_left / (iters_per_second + 1E-9)

                    measure_est_seconds.append(est_eta)
                    est_seconds_left = sum(measure_est_seconds) / len(measure_est_seconds)

                last_count = self.count
                last_time  = now_time
                # Adjust frequency if printing too quickly
                # so progress doesn't slow down the actual function
                if self.adjust and (between_time < time_thresh or
                                    between_time > time_thresh * 2.0):
                    max_between_time = max(max_between_time, between_time)
                    max_between_time = max(max_between_time, 1E-9)
                    max_between_count = max(max_between_count, between_count)
                    # If progress was uniform and all time estimates were
                    # perfect this would be the new freq to achieve time_thresh
                    new_freq = int(time_thresh * max_between_count / max_between_time)
                    new_freq = max(new_freq, 1)
                    # But things are not perfect. So, don't make drastic changes
                    max_freq_change_up = max(256, freq * 2)
                    max_freq_change_down = freq // 2
                    if (new_freq - freq) > max_freq_change_up:
                        freq += max_freq_change_up
                    elif (freq - new_freq) > max_freq_change_down:
                        freq -= max_freq_change_down
                    else:
                        freq = new_freq
                msg = msg_fmtstr.format(
                    count=self.count,
                    rate=iters_per_second,
                    eta=text_type(datetime.timedelta(seconds=int(est_seconds_left))),
                    elapsed=text_type(datetime.timedelta(seconds=int(total_seconds))),
                    wall=time.strftime('%H:%M'),
                )
                self.write(msg)
                self.tryflush()

        if (self.count) % freq != 0:
            # Write the final progress line if it was not written in the loop
            est_seconds_left = 0
            now_time = default_timer()
            total_seconds = (now_time - start_time)
            msg = msg_fmtstr.format(
                count=self.count,
                rate=iters_per_second,
                eta=text_type(datetime.timedelta(seconds=int(est_seconds_left))),
                elapsed=text_type(datetime.timedelta(seconds=int(total_seconds))),
                wall=time.strftime('%H:%M'),
            )
            self.write(msg)
            self.tryflush()
        self.write(AFTER_BAR)

Here is an example showing what this does:

# Define a function that takes some time
def is_prime(n):
    return n >= 2 and not any(n % i == 0 for i in range(2, n))
N = 50000

# Default behavior adjusts frequency of progress reporting so
# the performance of the loop is minimally impacted
iterable = (is_prime(n) for n in range(N))
piterable = ProgIter(iterable, length=N)
_ = list(piterable)
 50000/50000...  rate=25097.04 Hz, eta=0:00:00, elapsed=0:00:22, wall=11:22 EST

# Adjustments can be turned off to give constant feedback
iterable = (is_prime(n) for n in range(N))
piterable = ProgIter(iterable, length=N, adjust=False, freq=100)
_ = list(piterable)
  2200/50000...  rate=32500.44 Hz, eta=0:00:01, elapsed=0:00:00, wall=11:23 EST
 43300/50000...  rate=1398.73 Hz, eta=0:00:06, elapsed=0:00:18, wall=11:23 EST
 50000/50000...  rate=1129.99 Hz, eta=0:00:02, elapsed=0:00:25, wall=11:25 EST

# Only one line of the above output is shown at a time; I copied different
# points from a few runs to illustrate the output.

Lastly, here is some timeit information that shows how frequency adjustment adds minimal overhead:

def is_prime(n):
    return n >= 2 and not any(n % i == 0 for i in range(2, n))

N = 1000

# Test time of raw loop without any fanciness 
%timeit list((is_prime(n) for n in range(N)))
100 loops, best of 3: 10 ms per loop

# Test time of raw loop when printing on every iteration
%timeit list(ProgIter((is_prime(n) for n in range(N)), length=N, freq=1, adjust=False))
10 loops, best of 3: 159 ms per loop

# Test time of raw loop when printing based on an adjusted schedule 
%timeit list(ProgIter((is_prime(n) for n in range(N)), length=N, freq=1, adjust=True))
100 loops, best of 3: 11 ms per loop
@amueller
Member

amueller commented Oct 4, 2016

I think we would prefer a design that adds callbacks to the estimators, and not bake in the progress bar.
Better logging / callbacks would be great, though.
Then we can talk about whether we actually want progress bar functionality in scikit-learn.

How the callback or logging mechanism should work is a pretty big design decision. I'm not sure if there is any discussion on that anywhere.

@amueller amueller closed this as completed Oct 4, 2016
@amueller amueller reopened this Oct 4, 2016
@amueller
Member

amueller commented Oct 4, 2016

sorry misclick.

@Erotemic
Contributor Author

Erotemic commented Oct 4, 2016

I think it would be extremely useful to monitor the progress of algorithms beyond the current verbosity measures. Does anybody know of other libraries with a robust logging / callback system that could be used for inspiration / ideas? I'd certainly be willing to help with this. How would such a system ideally work with the existing verbosity messages? These messages are already baked in, and although a callback system would make this even more customizable, I think the baked-in messages are still valuable.

In the short term, though, I was thinking more along the lines of simply improving the verbosity of the costly subroutines in some estimators. By improving the verbosity I really mean providing some timing measurements like the current rate, estimated time remaining, start time, and total duration. Because I want to add this in multiple places, I thought it would be nice to have a reusable class that can go in the utils.

What I envisioned for this feature request is almost the same as the way the current verbosity methods work. For example, sklearn.cluster.kmeans_._mini_batch_convergence does print out a notion of absolute progress. All I want to do is add some timing statistics and estimates and put these features in a reusable class that can be dropped into other estimators.

@amueller
Member

amueller commented Oct 4, 2016

We are sometimes somewhat conservative (or slow?) in adding entirely new functionality because it's hard to remove anything without breaking user code or at least annoying them. And if we add a lot of custom logging and displaying of logs, it's easy to get feature creep and hard-to-understand code.

If you want to add this reusable class, it still needs some form of interface, right?

We have a monitor callback in GradientBoostingClassifier but I'm not sure people are very happy with it.
Maybe draft what the interface would be like to support what you'd like to do in MiniBatchKMeans?
You can create a PR that adds the callback to MiniBatchKMeans and then maybe an example with a custom logger?
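
For concreteness, here is a sketch of what such a monitor callback could look like, using a toy estimator. Everything in it (the monitor keyword, its signature, the early-stop contract) is hypothetical, not an existing scikit-learn API:

class ToyMiniBatchEstimator(object):
    """Stand-in for MiniBatchKMeans, showing only the callback hook."""
    def __init__(self, n_iter=100, monitor=None):
        self.n_iter = n_iter
        self.monitor = monitor

    def fit(self, X=None):
        inertia = 100.0
        for it in range(self.n_iter):
            inertia *= 0.95  # stand-in for a real mini-batch update
            if self.monitor is not None:
                # the callback sees progress info and may request a stop
                if self.monitor(it + 1, self.n_iter, inertia):
                    break
        return self

def print_monitor(iteration, n_iterations, inertia):
    print('batch %d/%d, inertia=%.3f' % (iteration, n_iterations, inertia))
    return False  # returning True would stop fitting early

ToyMiniBatchEstimator(n_iter=5, monitor=print_monitor).fit()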

@amueller
Member

amueller commented Oct 4, 2016

Btw, there's an issue over here: #78

@amueller
Member

amueller commented Oct 4, 2016

It's a bit unclear to me if logging is the right solution for what you want, because we need to know what 100% is for a progress bar. Some algorithms have multiple stopping criteria, which makes it kind of tricky to use a bar. And even if there is a single stopping criterion, we sometimes stop when the change in objective is smaller than some threshold. How does that translate to a progress bar?
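
As a sketch of one heuristic answer (illustrative only, not something proposed in this thread): a tolerance-based criterion can be mapped onto a pseudo-percentage by measuring, on a log scale, how far the change in objective has fallen from its initial value toward tol:

import math

def tol_progress(initial_change, current_change, tol):
    """Map a tolerance-based stopping criterion onto [0, 1] on a log scale.

    Returns 1.0 when the change in objective drops to `tol`. Convergence
    is rarely this monotonic, so this is only a rough heuristic.
    """
    if current_change <= tol:
        return 1.0
    num = math.log(initial_change / current_change)
    den = math.log(initial_change / tol)
    return max(0.0, min(1.0, num / den))

print(tol_progress(1.0, 1e-2, 1e-4))  # 0.5: halfway to tol in log-space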

@jnothman
Member

jnothman commented Oct 4, 2016

Improving verbosity is okay, but it would be better to provide structured objects to a logger that the user can configure. The logger then knows how to include timestamps in the display.

@jnothman
Member

jnothman commented Oct 5, 2016

Structured logging, with things like {"context": ['RandomForests.fit'], "progress": 35, "out_of": 100} or {"context": ['SGDClassifier.fit'], "objective": 1.100, "target_range": [-.005, .005]} might work... Main problem is that it's essentially another spec that needs designing and documenting.
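
A minimal sketch of how such structured records could flow through the stdlib logging module; the 'sklearn_event' attribute name and the ProgressHandler class are illustrative, not part of any existing spec:

import logging

logger = logging.getLogger('sklearn.progress')
logger.setLevel(logging.INFO)

class ProgressHandler(logging.Handler):
    """User-side handler that renders structured progress records."""
    def emit(self, record):
        event = getattr(record, 'sklearn_event', None)
        if event and 'progress' in event:
            print('%s: %d/%d' % (event['context'][-1],
                                 event['progress'], event['out_of']))

logger.addHandler(ProgressHandler())

# What an estimator might emit internally:
logger.info('progress', extra={'sklearn_event': {
    'context': ['RandomForests.fit'], 'progress': 35, 'out_of': 100}})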


@amueller
Member

amueller commented Oct 5, 2016

Might. What's the generic interface that the logger implements? This is additional complexity but I think we need to do this at some point.

@amueller
Member

amueller commented Oct 9, 2016

Hm @denis-bz suggested here to just pass locals() to the callback, and it could return a dict of whatever....
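
A tiny sketch of that idea; fit_loop below is a stand-in, not real scikit-learn code:

def fit_loop(X, n_iter, callback=None):
    """Stand-in training loop that hands its locals() to a callback."""
    inertia = 100.0
    for it in range(n_iter):
        inertia *= 0.9  # stand-in for real work
        if callback is not None:
            callback(locals())  # the callback sees X, n_iter, it, inertia, ...

fit_loop(X=None, n_iter=3,
         callback=lambda env: print(env['it'], env['inertia']))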

@jnothman
Member

jnothman commented Oct 9, 2016

That would have to be faked in Cython.

@amueller
Member

Do we want the monitor in Cython? I guess so. I would have left it at Python for now.

@rth
Member

rth commented Jul 14, 2017

A few more thoughts on this. For all estimators that either:

  • support incremental learning (i.e. have a partial_fit method), including MiniBatchKMeans
  • support iterators and process one row of data at a time (e.g. CountVectorizer().fit_transform)

using tqdm progress bars is IMO easier and more reliable than writing custom classes, even if it adds one dependency for people who need this functionality.

For instance, without modifying the scikit-learn code base, one could well do:

from tqdm import tqdm

# [...]
for sl in tqdm(gen_batches(n_samples, self.batch_size)):
    estimator.partial_fit(X[sl], y[sl])

# [...] or

vect = CountVectorizer()
X = vect.fit_transform(tqdm(iterable_of_file_names))

For logging, one can then write a custom logging handler, offloading this problem to the tqdm community, which probably has more experience handling progress bars.
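
For instance, a handler along these lines routes log records through tqdm.write so they don't clobber an active bar (a sketch; the TqdmHandler name is illustrative):

import logging
from tqdm import tqdm

class TqdmHandler(logging.Handler):
    """Emit log records via tqdm.write so an active bar stays intact."""
    def emit(self, record):
        try:
            tqdm.write(self.format(record))
        except Exception:
            self.handleError(record)

logging.getLogger().addHandler(TqdmHandler())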

@GuillaumeDesforges

GuillaumeDesforges commented Mar 26, 2019

It would be a great help to have sklearn use tqdm (or any implementation of a progress bar) to monitor progress during long training runs of complex pipelines.

@Erotemic
Contributor Author

I would actually recommend against tqdm for one simple reason: its implementation makes heavy use of Python threads. This can be an issue for scikit-learn processes that make use of multiprocessing. As noted in Raymond Hettinger's 2017 PyBay Keynote:

The general rule is “thread after you fork not before”. Otherwise, the locks used by the thread executor will get duplicated across processes. If one of those processes dies while it has the lock, all of the other processes using that lock will deadlock.

The implementation that I originally proposed here has been made into its own standalone, pip-installable Python package called progiter. It is both fast (very low overhead) and single-threaded, so it can be safely used with multiprocessing. Its API is mostly compatible with tqdm.
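
For example (assuming progiter's published, tqdm-like signature):

from progiter import ProgIter

def is_prime(n):
    return n >= 2 and not any(n % i == 0 for i in range(2, n))

for n in ProgIter(range(50000), desc='checking primes'):
    is_prime(n)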

@rth
Member

rth commented Apr 14, 2020

An experimental API for callbacks that would allow implementing progress bars was proposed in #16925. Feedback welcome.
