Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add download progress bar #697

Open
wants to merge 3 commits into
base: development
Choose a base branch
from

Conversation

stared
Copy link
Contributor

@stared stared commented Nov 26, 2022

I was dearly missing a progress bar, so with a few lines of code, I added tqdm.

Screenshot 2022-11-26 at 15 52 36

As I checked, previously such a feature request was turned down (#469). So I am leaving it up to maintainers if it is for polishing & potentially merging, or for some reason progress bars are taboo.

@OMEGARAZER
Copy link
Contributor

I'd imagine for this to have any chance of making it into the main build you'd have to put it behind an option like --progress as it would likely mess with pre-existing scripts that wouldn't expect it.

@Serene-Arc
Copy link
Collaborator

This was declined for specific reasons that haven't changed. The BDFR is designed to be used in scripts and cron jobs. The progress bar and tqdm cause havoc in log files and output when trying to save it. The idea was always to keep this as the most basic and ubiquitous level tool that can be used for everything and then other more user-friendly tools can be used that expand the interface, such as a progress bar.

Now, since there has been no effort by anyone I know of to make such an interface, I may have softened a bit but I'm still not sure that it's a good thing to include. Such a small thing doesn't seem worth making an entirely separate tool but I try to subscribe to the Unix Philosophy: do one thing and do it well. The BDFR's thing is to download data from Reddit in any environment or conditions and for any use case. I myself use it as a script, and I worry that this feature would negatively impact those kinds of uses.

@stared
Copy link
Contributor Author

stared commented Nov 27, 2022

@OMEGARAZER and @Serene-Arc thank you for your remarks - I now understand more philosophy behind bdfr design. As I now see, it deviates from how I use this tool.

In this case, here is a version that provides end-user CLI experience (as for many command-line apps, including many dev tools, such as conda or npm). However, I created it in a way that without --progress-bar behavior is unaffected.

What do you think about this kind of approach, @Serene-Arc ?

Screenshot 2022-11-27 at 14 32 03

@Serene-Arc
Copy link
Collaborator

This should be fine. One change I will ask for is to make the progress bar False by default e.g. progress_bar : bool = False in the method declaration of setup_logging and that the tests be restored to their previous state. Since this conflicts with the most basic operation of the BDFR I don't think it should be explicitly disabled in things like tests, but explicitly enabled.

Also as a side note, there is a bug in the logging portion of tqdm that means it doesn't handle logging messages. This isn't related to your PR per se but it will mean that it doesn't work as expected since logging messages of all levels will be printed to the screen, which is not the intended behaviour. Despite there being about 5 duplicate PRs to fix this behaviour, nothing has been done to date.

@stared
Copy link
Contributor Author

stared commented Dec 4, 2022

@Serene-Arc Thank you for the remark on the default behavior of setup_logging - fixed that.

In the current setup (as in this PR) it seems that it works as intended, though. Or do you have some examples of unintended behavior?

@OMEGARAZER
Copy link
Contributor

I genuinely don't know so I apologize if this doesn't help but does Alive-Progress have the logging issues you mentioned Serene? This mentions it does handle the logging but not sure if it's in the same situation.

It's the library I was looking at for this purpose when I had last looked into it but I didn't want to step on stared's toes as I hadn't even started on it at all.

@Serene-Arc
Copy link
Collaborator

@OMEGARAZER I wouldn't know unfortunately. My PR for tqdm is here but there are at least three others that fix the same issue.

@stared tqdm does handle the logging but it prints logging messages of every level, not just info level ones, but debugging messages and lower as well.

@Serene-Arc
Copy link
Collaborator

We recently underwent a project-wide format to add consistency, if you could please resolve the conflicts that it has introduced, I can merge the PR.

@stared stared changed the title Add download progress bar (WIP) Add download progress bar Dec 8, 2022
@stared stared force-pushed the progress-bar branch 2 times, most recently from 54f91c0 to e9022ff Compare December 8, 2022 11:45
@stared
Copy link
Contributor Author

stared commented Dec 8, 2022

@Serene-Arc
Polished a bit, rebased to development, and cleaned history a bit.

Right now it looks like that:

Screenshot 2022-12-08 at 11 58 43

@Serene-Arc
Copy link
Collaborator

I've been thinking and I'm not sure that this is the best approach, specifically that most of the code has been added to the downloader module. For one, this ignores the cloner and archiver modules completely; the downloader module now has an option that applies to only it when there isn't really a reason for that.

Additionally, the parsing and reorganisation of logging messages to be more 'user-friendly' definitely doesn't belong in that module. I think another architectural approach is needed, maybe a separate module for a child of Logger that parses messages in the way that you want for the progress bar, and that hooks in at the setup_logging stage if the progress bar has been enabled. Or better yet, a child of tqdm that does the above as well, so it can be imported and used without any additional setup if the option has been parsed.

Do you think that you are able to make these changes?

@stared
Copy link
Contributor Author

stared commented Dec 9, 2022

@Serene-Arc Yes, I should add it to other modules as well. Or for now, I can make it an option for the downloader, to be expanded later if needed.

Or better yet, a child of tqdm that does the above as well, so it can be imported and used without any additional setup if the option has been parsed.

Not sure if I understand. Do you mean something like

for generator in loop_logging(self.reddit_lists)

in which loop_logging is imported from a separate file, does nothing if progress_bar == False?
I agree that moving all progress bar logic to a different file is a good idea.

@Serene-Arc
Copy link
Collaborator

Yes that is what I meant. Something that can wrap the iterator with as little code in those files as possible and handle any logging calls without changing the master log file. Like I said, this is the kind of change that I'd hoped would be handled in different tool built on the BDFR so try to keep as much of the code in a new module as possible.

Basically, try to make as much of it self-contained and separate as possible. I'm going to think about making a more modular architecture but that's a problem for another PR.

@stared
Copy link
Contributor Author

stared commented Dec 10, 2022

@Serene-Arc Thanks for this nice suggestion - refactored code. Now indeed it is more modular and manageable. In particular, now I was able to add it with ease to the archiver.

Let me know what you think about that.

@Serene-Arc
Copy link
Collaborator

I think I am going to make changes to standardize the creation and handling of loggers. This will make your PR much easier if you wait for that.

@stared
Copy link
Contributor Author

stared commented Dec 20, 2022

@Serene-Arc Up to you.

Let me know when I can start working back on this PR.

@stared
Copy link
Contributor Author

stared commented Jan 5, 2023

@ Serene-Arc So, what's the status of the loggers?

@Serene-Arc
Copy link
Collaborator

I'm working on a solution, should be ready soon. The aim will be to decouple the logging so arbitrary logging handlers can be plugged into the RedditConnector class, which will make your PR and subsequent similar PRs much easier in the future.

@Serene-Arc
Copy link
Collaborator

Right now that #750 is merged, you should have an easy way to do it. Make an alternative logging handler interface that can handle the messages however you want. Then, use make_console_logging_handler in __main__.py to slot it in depending on the user's choice.

@stared
Copy link
Contributor Author

stared commented Jan 8, 2023

@Serene-Arc Thanks, I will give it a try.

After so many changes (quite a few other PRs) resolving conflicts will take me some time, though.

@Serene-Arc
Copy link
Collaborator

It's fine, no rush. This will probably hit next release anyway.

@stared
Copy link
Contributor Author

stared commented Jan 10, 2023

@Serene-Arc Rebasing through the previous PRs was a pain (as they also touched downloader loops), but I managed.

Then it comes to the new logging handlers - frankly, I don't understand this pattern, so I don't know how to use it for the progress bar. Do you have any pointers on what to do?

@Serene-Arc
Copy link
Collaborator

So the change in the way loggers are handled mean that you don't have to modify the download loop. You should leave that alone; if everyone who wanted a minor change to the loop implemented it there, then it would get very long and crowded. Instead you should make a separate logger, and replace the default console logger with your replacement.

To do that, you make a logging handler class of your own that utilises tqdm to display a progress bar and the messages that you want. Here's an example bit of code that I wrote that does this.

import logging
from time import sleep
from typing import Optional
from tqdm import tqdm

logger = logging.getLogger()


class TestHandler(logging.StreamHandler):
    def __init__(self, num_of_submissions: int):
        super().__init__()
        self.progress_bar: Optional[tqdm] = None
        self.num_of_submissions = num_of_submissions

    def emit(self, record: logging.LogRecord) -> bool:
        if "test" in record.msg:
            if self.progress_bar is None:
                self.progress_bar = tqdm(total=self.num_of_submissions, initial=0)
            self.progress_bar.update(1)
            self.progress_bar.write(record.msg)
        self.flush()
        return True


if __name__ == "__main__":
    logger.setLevel(1)
    handler = TestHandler(20)
    handler.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.info("random example message")
    logger.info("random example message")
    logger.info("random example message")
    for i in range(0, 20):
        logger.info("testing")
        sleep(0.5)
    logger.info("done")

The only thing that this needs is the maximum number of submissions to download, which can be passed in. So make this logger just before the download() function is called and pass in the maximum size of the progress bar. You can format it and add as much as you want. Since all of the messages that the BDFR sends to the logger are standardised, you can apply a regex or whatever to have the progress bar update only on certain messages, as happened in my example.

@stared
Copy link
Contributor Author

stared commented Jan 14, 2023

@Serene-Arc Thank you for this example.

I guess I would need to start another PR, as this current code is not salvageable for the new architecture.

@stared
Copy link
Contributor Author

stared commented Jan 19, 2023

@Serene-Arc I made some changes, and now it should work.

The intrusion in the loop is minimal - a few other instructions, but no contexts or wraps. That way it is also highly refactorable. If you need to add anything in progress after each subreddit and post (including logs!), it is easy to do so without cluttering the loop.

Should I proceed with this approach?

@stared
Copy link
Contributor Author

stared commented Jan 26, 2023

@Serene-Arc By any chance, did you have some time to look at the last commit?

@Serene-Arc
Copy link
Collaborator

Hi, I just looked at it. That isn't what I meant with the design, you've still added a Progress object to the download loop and changed the signatures of some of those methods? Not sure why you've done that.

@stared
Copy link
Contributor Author

stared commented Jan 27, 2023

@Serene-Arc

Well, if you want to transmit data via text logs, it is a hard no for me. It would be a dirty hack; in the presence of other possibilities - a really, really bad practice. Sure, we can use a library that supports log payloads (How to make context logging in Python less cumbersome), but I don't think it is worth it.

If you have a cleaner solution than progress object, I would be happy to hear it. Otherwise, in the submission download loop, in place of a single line progress.post_done(submission, success), you would get:

title_short = submission.title[:60] + (submission.title[60:] and "...")
log_str = f"{submission.score:5d}🔼 {title_short}"
icon = "✅" if success else "❌"
logger.info(f"Progress: post done - {icon} {log_str}")

I don't think it is cleaner. Do you? Sure, you might want to have something in between, e.g.:

submission_log_message = progress.submission_to_log_message(submission, success)
logger.info(submission_log_message)

Yet still, I hardly see how it is cleaner than progress.post_done(submission, success). Plus, all stuff discussed above from encoding and decoding data as strings.

Signature - to get if the download operation succeeds. It is a standard pattern that (in a practical way) a function returns an object or None, while if it changes the internal state - True or False. I didn't touch other return lines, and True / None is OK-ish. This (IMHO) should be refactored regardless of this PR.

So again:

  • Is for you transmitting progress update data by text logs a hard requirement?
  • Do you have a cleaner approach than progress.post_done(submission, success)?

@Serene-Arc
Copy link
Collaborator

You claim that this is a cleaner solution, but it isn't. You haven't changed either the archiver or the cloner, nor would duplicating the code there make it simpler to maintain going forwards. Wouldn't even have the same code in the case of the cloner. And, again, this PR is neither unique nor special moving forwards; any other PR attempting to change the output of the BDFR will run into the exact same problem and add similar code in the exact same location.

The function doesn't change the internal state, nor does it have a valuable return so I'm not sure why that change is good practice here. Attempting to shoehorn the code to fit your specific use case doesn't seem like good practice to me at all.

Sure, more context aware logging could work. Could make an intermediate level to abstract that handling away from the higher level classes and deal with the messages in a consistent way. Do I have a better solution? No, but neither have I tried to find one. I have outlined the requirements that I feel are best for the project, but this is an open-source tool that I work on in my spare time. If you aren't willing to change it, or I can't convey the requirements properly and you don't want to waste your own time, then the solution is ultimately to wait until I have time to devise a solution.

Up until now, the unofficial stance has been that features like this require a different project built on top of the bdfr's core modules. Moving forwards, that can change, but it may require more work before I'm satisfied with it. Most of the work has gone into broadening the scope of the project, not narrowing it.

Let me know how you'd like to proceed.

@stared
Copy link
Contributor Author

stared commented Jan 27, 2023

@Serene-Arc To be honest, I feel bitter here. While I heard your initial reservations, later it turned into a guessing game, which gave me a lot frustration. While writing code always takes time, a lot of it was needless - especially waiting for quite a few other PR to be merged, so I could go through rebase hell, to have to rewrite the whole code anyway, to well-being in a place where I am.

I guess you are aware that to introduce any functionality you need to change some code. If it was your condition that no code in the download.py would be changed, it was impossible to fulfill.

If you think the loop logging part is the cleanest part of this project - I reassure you, they aren't. There are a lot of "if" statements, which are error-prone, and not transmitting status as data. If it were my library, instead of "log it, and return None" I would return results with status data or at least use this pattern:

self.failed_submission_status('filter_score', { 'score': score })
return False

Then within failed_submission_status anything from saving data (so as to know later how many were failed for a given reason) to log messages and their formatting. That way, BTW, adding an optional progress bar would be seamless.

Well, what's next?

With the said bitterness, I don't wish to develop this PR any further. Regardless - thank you for developing BDFR!

I will keep using mine locally. And will create a fork, so others may use it - as it seems I am not the only one, #745.

@Serene-Arc
Copy link
Collaborator

I'm sorry if I haven't been clear. I have tried to lay out my requirements clearly, to the point of combing through tqdm's and the logging library's API to provide a MWE that I thought fit your requirements but I don't think I understood and there may have been a miscommunication issue.

You are correct that the download loop contains a lot of if statements; when I wrote it, there were much fewer, but there have been several PRs that have added features to filter the downloads further and they have added their code in the same place. That area is definitely due for a rewrite to make it clearer for further additions, perhaps moving it to the DownloadFilter class. I'm aware that it's not a clean part of the code, which is the reason why I'd prefer that it isn't expanded upon here.

If you don't want to continue developing for the BDFR, I really am sorry that this has been so difficult. Keep the PR open and once I have a solution, I'll implement this myself if need be. Thank you for your effort.

@aliparlakci
Copy link
Owner

Hi,

It's nice to see @stared here again, one of the earliest contributors of the project :)

I haven't read the whole discussion but I want to say couple of things.

Although I understand Serene's points and we have had multiple discussions about it, I don't think a feature as simple as this (and much needed, imo) should be this controversial or hard to implement. I cannot explain the frustration I have when I cannot determine if the process hanged for some reason or the file it is downloading is huge.

There were other feature requests and PRs which were rejected because bdfr itself was no place to implement them. Although, I agree with the rejections, some are requested multiple times and it feels like community actually needs them. I think there should be a common ground which allows people to use bdfr however they like easily. We already provide great flexibility through CLI and sometimes it is not enough, as it was in this case.

I want to discuss this matter in the discussion section together with @stared , @Serene-Arc along with the other contributors. I am sure we will find a way.

@Serene-Arc
Copy link
Collaborator

My issue here isn't the feature per se but the implementation. I think we need a more comprehensive event logging system so that, in the future, adding similar features is easy and doesn't clutter the codebase in the wrong locations.

@stared
Copy link
Contributor Author

stared commented Mar 12, 2023

@aliparlakci It is nice to hear you again!

Of course, it is up to you to include (or not) the PR. If you look at the code diff, the changes are minimal and encapsulated.

In particular, if (and when) there is a different logging system or (what I consider more important) a consistent and well-designed error handling, moving the progress bar will be a breeze. While I appreciate some pointers of @Serene-Arc on how to separate the progress bar code better (and I put quite a few of them in action), I don't understand the cluttering argument. Right now inclusions are minimal and won't be much smaller not matter the way they are implemented.

Sure, refactoring logging and error handling will vastly improve code readability and robustness, but it is another project. And the improvement will be mostly in the places not touched by the progress bar at all.

@stared
Copy link
Contributor Author

stared commented Jan 10, 2024

@aliparlakci @Serene-Arc By any chance, did anything change with the status of this PR?

I would like to note that after more than a year of development of BDFR, there is no conflict between this PR and the development branch (except for two trivial ones, as you see above).

So, empirically, this progress bar adds an optional feature while it does not affect other functionalities or hamper development.

(I am still using the progress bar on my device.)

@Serene-Arc
Copy link
Collaborator

Hi, given that, by all means. If you'll resolve the conflicts, I'll merge. If you could add a section on the new option to the README, that would be great too.

@stared
Copy link
Contributor Author

stared commented Feb 2, 2024

@Serene-Arc I added README, resolved conflicts, and rebased on the newest development branch.

@ymgenesis
Copy link

ymgenesis commented Feb 19, 2024

Is there a way to just have the tqdm progress bar without it hijacking the log output like the first version of the PR? I process the log output txt file after bdfr runs and with --progress-bar the txt file is blank after a handful of lines. Wondering if there's a place to see the original commit? I'm assuming it's this

@stared
Copy link
Contributor Author

stared commented Feb 19, 2024

@ymgenesis Yes - the initial approach is in the https://github.com/stared/bulk-downloader-for-reddit/tree/old-progress-bar branch.

@Serene-Arc
Copy link
Collaborator

If the log file is empty when this option is used, that's a bug that will have to be fixed before it's merged.

@ymgenesis
Copy link

ymgenesis commented Feb 20, 2024 via email

@stared
Copy link
Contributor Author

stared commented Feb 20, 2024

Well, it is intended behavior that most errors (except for critical) are suspended in the progress-bar mode, vide

        if self.progress_bar:
            logger.setLevel(logging.CRITICAL)

Otherwise, the terminal-based progress bar wouldn't work - a visual summary of downloads (with success or error) would be littered with messages not intended for the end user.

@Serene-Arc
Copy link
Collaborator

Intended or not is irrelevant, sorry. Nothing you do can interfere with the file log. Change the printed log if you must but the file log must remain complete and unchanged. That's non-negotiable on any PR or change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants