
Use python logging to report on convergence progress at level info for long running tasks #78

Open
ogrisel opened this issue Feb 12, 2011 · 32 comments · May be fixed by #22000
Comments

@ogrisel
Member

ogrisel commented Feb 12, 2011

This is a proposal to use Python's logging module instead of stdout and verbose flags in the models' API.

Using the logging module would make it easier for the user to control the verbosity of scikit-learn through a single, well-documented configuration interface and logging API.

http://docs.python.org/library/logging.html
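For illustration, user-side control could then look something like the sketch below (the "sklearn" logger name is an assumption here, not an existing interface):

import logging

# Hypothetical user-side configuration: route scikit-learn messages at INFO
# and above to the handler installed by basicConfig.
logging.basicConfig(level=logging.INFO)
logging.getLogger("sklearn").setLevel(logging.INFO)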

@GaelVaroquaux
Member

Work has started on this in https://github.com/GaelVaroquaux/scikit-learn/tree/progress_logger

What remains to be done is most probably fairly mechanical work.

@amueller
Member

amueller commented Jan 7, 2014

There is also work in the new Gradient Boosting module.

@larsmans
Member

Logging actually isn't that easy to use at all, in my experience, so -1 on this.

@aadilh

aadilh commented Oct 19, 2015

Is anyone working on this?

@amueller
Member

How about we add a logger that by default prints to STDOUT? That should be fairly simple, right?
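A minimal sketch of that idea (illustrative only, not scikit-learn's actual setup):

import logging
import sys

# Package-level logger that prints INFO messages to stdout by default,
# unless the user has already attached their own handlers.
logger = logging.getLogger("sklearn")
logger.setLevel(logging.INFO)
if not logger.handlers:
    logger.addHandler(logging.StreamHandler(sys.stdout))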

@domoritz

domoritz commented May 30, 2018

This issue has been open since 2011, so I wonder whether it is going to be fixed. I've run into this issue with RFECV (class RFECV(RFE, MetaEstimatorMixin):). I wanted to print the progress, but the default verbose printing prints too many messages. I didn't want to monkey-patch sys.stdout to make this work, and overriding the logger would be the simple and clean solution.

There are other issues in sklearn, such as #8105 and #10973, that would benefit from real logging in sklearn. Overall, I think logging would be a great addition to sklearn.

@jnothman
Member

jnothman commented May 30, 2018 via email

@domoritz

I'm a bit busy right now, but I support customizable logging in sklearn in whatever form (although I prefer standard Python logging).

@TomAugspurger
Contributor

TomAugspurger commented Nov 18, 2019

Has there been any discussion about what verbose=True will mean when scikit-learn starts using logging? We're dealing with this a bit in dask-ml: dask/dask-ml#528.

Given that libraries aren't supposed to do logging configuration, it's up to the user to configure their "application" (which may just be a script or interactive session) to log things appropriately. This isn't always easy to do correctly.

My proposal in dask/dask-ml#528 is for verbose=True to mean "temporarily configure logging for me". You can use a context manager to configure logging, and scikit-learn would want to ensure that INFO level messages are printed to stdout to match the current behavior.
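A rough sketch of what such a "temporarily configure logging for me" context manager could look like (the helper name and the "sklearn" logger name are hypothetical):

import contextlib
import logging
import sys


@contextlib.contextmanager
def _verbose_logging(name="sklearn", level=logging.INFO):
    """Temporarily send records of the named logger at `level` and above to stdout."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(sys.stdout)
    handler.setLevel(level)
    old_level = logger.level
    logger.addHandler(handler)
    logger.setLevel(level)
    try:
        yield logger
    finally:
        logger.removeHandler(handler)
        logger.setLevel(old_level)


# Usage would be roughly:
# with _verbose_logging():
#     estimator.fit(X, y)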

@jnothman
Member

jnothman commented Nov 18, 2019 via email

@thomasjpfan
Member

My proposal in dask/dask-ml#528 is for verbose=True to mean "temporarily configure logging for me".

This seems like a good balance. Using the logging module directly is not all that user friendly. Another "hack" would be to log at info by default, but when a user sets verbose=True the logs can be elevated to warning. This would work because warning-level messages are displayed by default.

@jnothman
Member

jnothman commented Nov 18, 2019 via email

@TomAugspurger
Contributor

@jnothman’s comment matches my thoughts. scikit-learn would always emit the message, and the verbose keyword controls the logger level and handlers.

@thomasjpfan
Member

But the local handler could change from warning to info to debug level on stream as verbose increases

Okay, let's go with this. The standard logging levels are listed at https://docs.python.org/3/library/logging.html#logging-levels. By default we can log at INFO, which is not emitted by default. When verbose=1, we have the handler change info -> warning and debug -> info. When verbose>=2, we still map info -> warning but also debug -> warning, and the estimator can interpret verbose>=2 to mean "emit more debug messages as verbose increases". This meaning can differ between estimators.

What do you think?
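For concreteness, one way to read that mapping (purely illustrative, not an agreed-on API):

import logging


def _message_levels(verbose):
    # Map an estimator's verbose value to the levels used for its
    # "info" and "debug" style messages, per the mapping sketched above.
    if verbose <= 0:
        return logging.INFO, logging.DEBUG      # silent unless the user opts in
    if verbose == 1:
        return logging.WARNING, logging.INFO    # info -> warning, debug -> info
    return logging.WARNING, logging.WARNING     # verbose >= 2: surface everything


# Inside an estimator, a progress message might then be emitted as:
# info_level, debug_level = _message_levels(self.verbose)
# logger.log(info_level, "iteration %d, loss %.3f", i, loss)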

@grisaitis

Hi, I'm very interested in this issue. I have some experience with logging and would love to help implement an enhancement here if consensus and a plan are reached.

It might be helpful to recap the ideas mentioned here:

  1. use a callback pattern
  2. change the level of the message, depending on verbose
    if verbose:
        logger.debug(message)
    else:
        logger.info(message)
  3. change the level of the logger, depending on verbose
    if verbose:
        logger.setLevel("DEBUG")
  4. add a handler with level DEBUG, depending on verbose
    if verbose:
        verbose_handler = logging.StreamHandler()
        verbose_handler.setLevel("DEBUG")
        logger.addHandler(verbose_handler)

My take on these options:

Option 1 or option 4 would probably be best.

  • Option 1 (callbacks) is good in that it's most agnostic (people can log things however they want). But it might be less flexible from a messaging / state capture perspective. (Aren't callbacks only called once or once per some internal loop iteration?)
  • Option 2, as discussed here, I think is misusing the logging library.
  • Option 3 works but, I think, defeats part of the purpose of using the logging library. If sklearn uses logging, then users can adjust verbosity via logging itself, e.g. import logging; logging.getLogger("sklearn").setLevel("DEBUG").
  • Option 4 is probably most canonical. The docs suggest not creating handlers in library code other than NullHandlers, but I think here it makes sense, given that sklearn has verbose flags. In this case, log printing is a "feature" of the library.

A fifth option would be to remove verbose flags, use logging everywhere, and let users adjust verbosity via the logging API. This is what logging was designed for, after all.
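Since option 1 is the only one not sketched above, here is a rough idea of what a callback-based alternative could look like (names and signatures are hypothetical, not an existing scikit-learn API):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("sklearn.demo")


def fit_with_callback(n_iter, progress_callback=None):
    # Stand-in for an estimator's fit loop that reports progress each iteration.
    loss = 1.0
    for i in range(n_iter):
        loss *= 0.5  # pretend the model improves every iteration
        if progress_callback is not None:
            progress_callback(iteration=i, loss=loss)
    return loss


# The user decides what verbosity means: print, log, or drive a progress bar.
fit_with_callback(3, progress_callback=lambda **state: logger.info("%s", state))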

@rth
Member

rth commented Jun 24, 2020

@grisaitis thanks! See also the more recent related discussions in #17439 and #16925 (comment) (regarding callbacks). Your help would be very much appreciated; the main issue is that we haven't decided yet which approach would be best :)

@jnothman
Member

jnothman commented Jun 25, 2020 via email

@thomasjpfan
Member

A fifth option would be to remove verbose flags, use logging everywhere, and let users adjust verbosity via the logging API. This is what logging was designed for, after all.

I would support removing verbose, as I find per-estimator configuration frustrating, and the numeric values of verbose arbitrary, poorly documented,

I think getting rid of verbose and using logging levels would be very nice. The only downside I see is that it would make logging slightly less discoverable.

@mitar
Contributor

mitar commented Jun 26, 2020

Also, one thing logging provides is that you can attach extra information to each logging message, not just strings: a whole dict of useful data. So if you want to report the loss during learning, you can do that and store the numerical value. Even more, you can both store the numerical value as a number and use it to format a user-friendly string, like: logger.debug("Current loss: %(loss)s", {'loss': loss}). I find that very useful in general and would love it if sklearn exposed that as well.
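For illustration, a custom handler could then consume the numeric value directly instead of parsing formatted text (a hypothetical sketch, not scikit-learn code):

import logging


class LossHistoryHandler(logging.Handler):
    # Collect numeric loss values passed as the message's dict argument.
    def __init__(self):
        super().__init__()
        self.losses = []

    def emit(self, record):
        if isinstance(record.args, dict) and "loss" in record.args:
            self.losses.append(record.args["loss"])


logger = logging.getLogger("sklearn.demo")
logger.setLevel(logging.DEBUG)
history = LossHistoryHandler()
logger.addHandler(history)

logger.debug("Current loss: %(loss)s", {"loss": 0.42})
print(history.losses)  # [0.42]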

@amueller
Member

amueller commented Jul 2, 2020

I think having module- or estimator-level loggers is a bit overkill for now and we should start with something simple that allows us to extend it later.
Also, whatever we do should allow users to reasonably easily reproduce the current behavior.

How do logging and joblib interact? The logging level is not preserved (as expected I guess):

import logging

from joblib import Parallel, delayed

logger = logging.getLogger('sklearn')
logger.setLevel(2)

def get_level():
    another_logger = logging.getLogger('sklearn')
    return another_logger.level

results = Parallel(n_jobs=2)(
    delayed(get_level)() for _ in range(2)
)
print(results)  # [0, 0]

But that's probably not needed, since this works:

import logging
import sys

from joblib import Parallel, delayed

logger = logging.getLogger('sklearn')
logger.setLevel(1)

handler = logging.StreamHandler(sys.stdout)
logger.addHandler(handler)


def log_some():
    another_logger = logging.getLogger('sklearn')
    another_logger.critical("log something")

results = Parallel(n_jobs=2)(
    delayed(log_some)() for _ in range(2)
)

Honestly, I'm not entirely sure how this works though.

Both stdout and stderr don't show up in Jupyter, by the way.

My dream: being able to get an approximation of the current behavior with a single line, but also being able to easily use progress bars or plot convergence instead.

Re verbose: it's probably cleaner to deprecate verbose, but deprecating verbose without estimator-level logging will make it a bit trickier to log one estimator but not another. I think it's fine to have the user filter the messages, though.
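For instance, users could filter with a standard logging.Filter; the sketch below is illustrative only and assumes per-module logger names such as "sklearn.ensemble", which is itself one of the open design questions:

import logging
import sys


class ModuleFilter(logging.Filter):
    # Keep only records from loggers whose name contains the given fragment.
    def __init__(self, fragment):
        super().__init__()
        self.fragment = fragment

    def filter(self, record):
        return self.fragment in record.name


handler = logging.StreamHandler(sys.stdout)
handler.addFilter(ModuleFilter("ensemble"))
logging.getLogger("sklearn").addHandler(handler)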

@grisaitis

grisaitis commented Jul 2, 2020

Hey all, thanks for the friendly replies and info. I read the other issues and have some thoughts.

joblib will be tricky. I have some ideas, though.

@amueller that is very weird. I reproduced your example. Things do work with concurrent.futures.ProcessPoolExecutor, which I think joblib uses...

It seems like joblib is nuking the logging state. Do any joblib experts have ideas about what could be going on?

import concurrent.futures
import logging
import os

logger = logging.getLogger("demo🙂")
logger.setLevel("DEBUG")

handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(process)d (%(processName)s) %(levelname)s:%(name)s:%(message)s")
)
logger.addHandler(handler)

def get_logger_info(_=None):
    another_logger = logging.getLogger("demo🙂")
    print(os.getpid(), "another_logger:", another_logger, another_logger.handlers)
    another_logger.warning(f"hello from {os.getpid()}")
    return another_logger

if __name__ == "__main__":
    print(get_logger_info())

    print()
    print("concurrent.futures demo...")
    with concurrent.futures.ProcessPoolExecutor(2) as executor:
        results = executor.map(get_logger_info, range(2))
        print(list(results))

    print()
    print("joblib demo (@amueller's example #2)...")
    from joblib import Parallel, delayed
    results = Parallel(n_jobs=2)(delayed(get_logger_info)() for _ in range(2))
    print(results)

which outputs

19817 another_logger: <Logger demo🙂 (DEBUG)> [<StreamHandler <stderr> (NOTSET)>]
19817 (MainProcess) WARNING:demo🙂:hello from 19817
<Logger demo🙂 (DEBUG)>

concurrent.futures demo...
19819 another_logger: <Logger demo🙂 (DEBUG)> [<StreamHandler <stderr> (NOTSET)>]
19819 (SpawnProcess-1) WARNING:demo🙂:hello from 19819
19819 another_logger: <Logger demo🙂 (DEBUG)> [<StreamHandler <stderr> (NOTSET)>]
19819 (SpawnProcess-1) WARNING:demo🙂:hello from 19819
[<Logger demo🙂 (DEBUG)>, <Logger demo🙂 (DEBUG)>]

joblib demo (@amueller's example #2)...
19823 another_logger: <Logger demo🙂 (WARNING)> []
hello from 19823
19823 another_logger: <Logger demo🙂 (WARNING)> []
hello from 19823
[<Logger demo🙂 (DEBUG)>, <Logger demo🙂 (DEBUG)>]

@mitar
Contributor

mitar commented Jul 2, 2020

I think you should configure the processes joblib spawns to send logging messages to the main logger in the main process. Then one can control logging in the main process only. Something like this or this. So there are logging queue sinks and sources, and you can tie them together. We use this on our cluster to send all logging from all workers on all machines to a central location, to show it on the user's computer.

@grisaitis

grisaitis commented Jul 2, 2020

@mitar I agree, I think that might be the best bet (not necessarily network-based queues, but inter-process communication queues).

I'm actually coding up an example using logging's QueueHandler / QueueListener right now, to test with joblib and concurrent.futures. Will follow up here.

I also love your other suggestion:

Also, one thing logging provides is that you can attach extra information to each logging message, not just strings: a whole dict of useful data. So if you want to report the loss during learning, you can do that and store the numerical value. Even more, you can both store the numerical value as a number and use it to format a user-friendly string, like: logger.debug("Current loss: %(loss)s", {'loss': loss}). I find that very useful in general and would love it if sklearn exposed that as well.

I implemented a visualization of Gaussian mixture modeling using the extra param and a custom Handler class. It works super nicely for passing state around and letting the user decide how to handle the state.

Also, re the joblib idiosyncrasies I noticed above... I'm going to accept that as is and out of scope. A queue-based design would be the most flexible anyway.

The only limitation of using a QueueHandler that I can think of is that any extra state (logger.debug("message", extra={...})) must be serializable for the queue, so no numpy arrays. :/ Can't think of any other issues, though.
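For reference, a rough sketch of what a QueueHandler / QueueListener setup with joblib could look like (purely illustrative; it assumes a Manager queue so the proxy can be pickled into the worker processes):

import logging
import logging.handlers
import multiprocessing

from joblib import Parallel, delayed


def work(i, queue):
    # Worker side: push records onto the shared queue instead of emitting locally.
    logger = logging.getLogger("sklearn.demo")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # workers may be reused, avoid stacking handlers
        logger.addHandler(logging.handlers.QueueHandler(queue))
    logger.info("finished iteration %d", i)
    return i


if __name__ == "__main__":
    manager = multiprocessing.Manager()
    queue = manager.Queue()
    # Main-process side: a listener thread forwards queued records to real handlers.
    listener = logging.handlers.QueueListener(queue, logging.StreamHandler())
    listener.start()
    try:
        Parallel(n_jobs=2)(delayed(work)(i, queue) for i in range(4))
    finally:
        listener.stop()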

@mitar
Contributor

mitar commented Jul 2, 2020

I'm actually coding up an example using QueueHandler / QueueListener right now,

Yes, you should always use a queue handler, because you never know when sending over a socket will block, and you do not want to slow down the model because logging blocks.

Also, you do not even have to use extra. I think logger.debug("message %(foo)s", {'foo': 1, 'bar': 2}) just works.

@grisaitis

Yes, you should always use a queue handler, because you never know when sending over a socket will block, and you do not want to slow down the model because logging blocks.

For the joblib case, if we implemented QueueHandler / QueueListener, what state would we have to pass to the process pool? Just the queue, right?

Also, you do not even have to use extra. I think logger.debug("message %(foo)s", {'foo': 1, 'bar': 2}) just works.

Thanks, yes. I also find it useful to log state without converting it to text, e.g. including a numpy array in extra and doing real-time data visualization (visual logging, in a way) with a custom logging handler in a Jupyter notebook. This would be SUPER nice with sklearn, and it looks like @rth has been doing similar work with callbacks.

@mitar
Contributor

mitar commented Jul 2, 2020

For the joblib case, if we implemented QueueHandler / QueueListener, what state would we have to pass to the process pool? Just the queue, right?

I think so. I have not used this over process boundaries, but it seems there is documented support for multiprocessing, so it should work with joblib as well. I am using QueueHandler / QueueListener inside the same process, to decouple logging writes from logging transport: QueueHandler -> QueueListener -> send to a central logging service. But from the documentation it looks like it can also work through a multiprocessing queue.

I also find it useful to log state without converting it to text

Yes. What I am saying is that you do not have to use extra; you can just pass the dict directly and then use only a few items from that dict for message formatting. (Note that what is used in format strings is decided by users of the sklearn library, not by sklearn itself: one can always configure the formatting to include something you did not expect, so what exactly is converted to text is not really under sklearn's control.) All values in extra can also be used for message formatting. So I am not sure how useful extra is, but I am also not saying we should not use it. It is much more explicit about what is the payload for the string on the left and what is extra, so we could also use both. I just wanted to make sure this alternative is known.
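A tiny same-process sketch of that decoupling (illustrative; the StreamHandler below just stands in for a slow network or central-logging transport):

import logging
import logging.handlers
import queue

log_queue = queue.Queue()
logger = logging.getLogger("sklearn.demo")
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.handlers.QueueHandler(log_queue))  # fast, non-blocking write

slow_transport = logging.StreamHandler()  # imagine this blocks on a socket
listener = logging.handlers.QueueListener(log_queue, slow_transport)
listener.start()

logger.debug("Current loss: %(loss)s", {"loss": 0.42})
listener.stop()  # drains the queue in the listener thread before returning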

grisaitis added a commit to grisaitis/py-logging-multiprocessing that referenced this issue Jul 4, 2020
@amueller
Member

amueller commented Jul 6, 2020

@grisaitis FYI if you mention a name in a commit, every time you do anything with the commit (like rebasing it or merging it or pushing it), the person gets an email, so it's generally discouraged ;)

@grisaitis

Sorry about that, Andreas! 😬 That's embarrassing... I was just trying to have well-documented commits, lol. Will avoid that in the future.

In that repo I figured out how logging can work with joblib with a QueueHandler / QueueListener combo. Appears to work nicely.

As a first step I'll implement logging with that approach in a part of sklearn where joblib is used. Maybe one of the ensemble models. Will open a new PR.

@rth
Member

rth commented Jul 9, 2020

For the joblib case, if we implemented QueueHandler / QueueListener,

Yes, it sounds like it would be necessary to start/stop a monitoring thread (here a QueueListener) both if we use the logging module and with callbacks, in the multiprocessing case (approximate example of callbacks with multiprocessing in #16925 (comment)).

@amueller
Member

So I figure the only reason what I did above "worked" is that it printed to stdout, which was the shared resource, and print is thread-safe in Python 3, or something like that?

@mitar
Contributor

mitar commented Jul 15, 2020

So I figure the only reason what I did above "worked" is that it printed to stdout, which was the shared resource, and print is thread-safe in Python 3, or something like that?

Print is not thread-safe; the processes just write to the same file descriptor. If you ran it for longer, you would probably see that messages sometimes interleave and lines get mixed up.

@adrinjalali
Member

Are we gonna do this? (we should). Moving to 2.0
