
Can't get spotify_dl to continue downloading after 7k downloads #359

Open
Burn1n9m4n opened this issue Jul 26, 2023 · 8 comments

@Burn1n9m4n

Describe the bug
I'm trying to run the downloader from some Python code to pull a large number of MP3s for a data science project. The output I'm seeing is the following:
[screenshot of terminal output]

My code looks like this:
[screenshot of the code]

It was working before, but after about 7k songs it stopped. Even running the command in the terminal doesn't seem to work. I'm not sure if I got rate limited or if something else is going on. Any guidance would be appreciated.

To Reproduce
spotify_dl -l https://open.spotify.com/track/0BRjO6ga9RKCKjfDqeFgWV -o ../data/mp3s/<track_id> (where the track_id is a Spotify track ID)

Expected behavior
I expected the track to download.


Desktop (please complete the following information):

  • OS: macOS 13.4.1 (c)
  • Python version: 3.8.17
@Burn1n9m4n Burn1n9m4n added the bug label Jul 26, 2023
@SathyaBhat
Owner

I can't say much given that there's no error whatsoever. After you run into the issue, can you try running the spotify-dl command outside of the script? Maybe Spotify is rate limiting you.

@ray2301

ray2301 commented Aug 20, 2023

I have a similar problem, so I came here to search whether anything had already been discussed, but I can't seem to find anything else. It's not a huge problem for me, but it's happening. I have a few playlists with more than 2000 files. It crashes for me too after some time, so I have to re-run it. I've been trying to fully download these two big playlists for the second day now (I'm almost done, but here's what happens):

  • Because it crashes at some point before downloading all the files (OK, they're big, so it will crash at some point; they look like connection errors), I have to re-run it, and it has to fetch the Spotify playlist again. It crashes randomly roughly every 500-1500 downloaded files (I use -s y and -w to skip already downloaded files; I also tried without -s y, but that's not the problem). When it fetches the playlist from Spotify again for the Nth time, it eventually gets a 429 error because of too many API calls, and then I have to either change the keys or wait until tomorrow, because it blocks me for a long time (the first key is still blocked since yesterday, when I had to re-run it about 4 times in a row).

Is there some kind of setting for limiting Spotify API calls, for example to 1 (or 1.5) second(s) between calls? It would be slower, but it would avoid the problem. Or are you perhaps fetching 100 playlist items (tracks) per request in this script? The documented limit is 50 per API call (Spotify's API documentation says you can request a range of 0-50 playlist items in 1 call). That's just something that came to mind that could be causing the "API limit reached" problem so quickly. The safest bet is to fetch 50 items at a time with a 1-second wait between calls until it has them all; that way it won't hit the limit even when the playlist items have to be re-requested within a short amount of time.

I haven't really looked into the code yet, so I have no idea how you fetch the playlist data (artist, track, ...), but I suspect something related to that is making too many requests to Spotify. With playlists as big as the OP's (or even mine), it makes sense that this would happen if there's no wait time between requests in the script, and Spotify is probably the service with the harshest per-second API limits. I know because I recently made a script that collects radio station data (what was playing) for more than 100 stations and builds automatic Spotify playlists from those stations (it re-runs 10 minutes after it finishes, so I always get new tracks), so I had a lot of fun with API restriction implementations. (I did it with the help of ChatGPT; I can't really write code beyond reading Python and understanding it, probably thanks to years and years of Kodi use, so I can change little things or tell the AI what to do in a way it understands.)

When a 429 error appears, Spotify includes a Retry-After value in the response headers with the number of seconds you have to wait. I could never get my scripts to read that and wait that long, so the 1-second wait between API calls is what helped when fetching playlists and playlist data.

Anyway, I just wanted to suggest giving Spotify's API limits a little more thought for big playlists like this, with a huge amount of data to process, because Spotify is much harsher with limiting: after you get a 429 and retry, you wait longer, and every further retry increases the wait. So if you keep retrying, like I did yesterday, the keys get blocked for so long that it's faster to just swap them. If you can read the Retry-After header when the 429 first appears, you can wait exactly that long before retrying; if you can't, a 1 or 1.5 second wait between API calls should be enough.
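A minimal sketch of the Retry-After idea, assuming a hypothetical helper (this is not part of spotify_dl; `do_request` is a stand-in for whatever performs the HTTP call). It prefers the server's Retry-After header and falls back to exponential backoff when the header is missing:

```python
import time

def retry_wait_seconds(headers, attempt, base=1.0):
    """Seconds to sleep after an HTTP 429 response.

    Prefers the server's Retry-After header (Spotify reports the wait
    in whole seconds); falls back to exponential backoff
    (base * 2**attempt) when the header is missing.
    """
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        return float(retry_after)
    return base * (2 ** attempt)

def call_with_backoff(do_request, max_attempts=5, sleep=time.sleep):
    """Call do_request() until it returns a non-429 response.

    do_request must return (status_code, headers, body); `sleep` is
    injectable so the logic can be tested without real delays.
    """
    for attempt in range(max_attempts):
        status, headers, body = do_request()
        if status != 429:
            return body
        sleep(retry_wait_seconds(headers, attempt))
    raise RuntimeError("still rate limited after %d attempts" % max_attempts)
```

With this shape, a 429 with `Retry-After: 30` sleeps 30 seconds, while a 429 without the header backs off 1 s, 2 s, 4 s, and so on.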

Today I'm seeing 0 errors, so the rest of the songs continued just fine, and in subsequent runs I expect no errors, since there won't be as many new tracks (I'm using -w). To me, the bigger problem is Spotify's rate limit.

@ray2301

ray2301 commented Aug 20, 2023

Almost... this is the error that happens :)

[download] Destination: M:\Spotify\0RAYPC\Radio Sljeme\ABBA - Lovers (Live A Little Longer).webm
[download] 100% of    3.30MiB in 00:00:01 at 2.13MiB/s
[SponsorBlock] Fetching SponsorBlock segments
[SponsorBlock] No matching segments were found in the SponsorBlock database
[ModifyChapters] SponsorBlock information is unavailable
[ExtractAudio] Destination: M:\Spotify\0RAYPC\Radio Sljeme\ABBA - Lovers (Live A Little Longer).mp3
Deleting original file M:\Spotify\0RAYPC\Radio Sljeme\ABBA - Lovers (Live A Little Longer).webm (pass -k to keep)
[download] Finished downloading playlist: ABBA - Lovers (Live A Little Longer) Lyrics
Traceback (most recent call last):
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 1348, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 1282, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 1328, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 1277, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 1037, in _send_output
    self.send(msg)
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 975, in send
    self.connect()
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 1447, in connect
    super().connect()
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 941, in connect
    self.sock = self._create_connection(
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\socket.py", line 845, in create_connection
    raise err
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\socket.py", line 833, in create_connection
    sock.connect(sa)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\Scripts\spotify_dl.exe\__main__.py", line 7, in <module>
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\site-packages\spotify_dl\spotify_dl.py", line 210, in spotify_dl
    download_songs(
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\site-packages\spotify_dl\youtube.py", line 321, in download_songs
    find_and_download_songs(kwargs)
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\site-packages\spotify_dl\youtube.py", line 232, in find_and_download_songs
    set_tags(temp, mp3file_path, kwargs)
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\site-packages\spotify_dl\youtube.py", line 130, in set_tags
    with urllib.request.urlopen(req) as resp:  # nosec
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 519, in open
    response = self._open(req, data)
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 496, in _call_chain
    result = func(*args)
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 1391, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "C:\Users\raych\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 1351, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>
Sentry is attempting to send 2 pending events
Waiting up to 2 seconds
Press Ctrl-Break to quit

And now I'm stuck on fetching the playlist in the new run:
Fetched 2700 of 3148 songs from the playlist ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━ 86% 0:01:05

If I re-run it, it won't fetch them; I've reached the API limit. I'd have to change the keys (or wait, but I don't know for how long). Something was happening too fast for the Spotify API.

@ray2301

ray2301 commented Aug 20, 2023

OK, I found spotify.py and did this:

import time
# ... Other imports and code ...

def fetch_tracks(sp, item_type, item_id):
    """
    Fetches tracks from the provided item_id.
    :param sp: Spotify client
    :param item_type: Type of item being requested for: album/playlist/track
    :param item_id: id of the item
    :return Dictionary of song and artist
    """
    songs_list = []
    offset = 0
    songs_fetched = 0

    if item_type == "playlist":
        with Progress() as progress:
            songs_task = progress.add_task(description="Fetching songs from playlist..")
            while True:
                items = sp.playlist_items(
                    playlist_id=item_id,
                    fields="items.track.name,items.track.artists(name, uri),"
                    "items.track.album(name, release_date, total_tracks, images),"
                    "items.track.track_number,total, next,offset,"
                    "items.track.id",
                    additional_types=["track"],
                    offset=offset,
                )
                total_songs = items.get("total")
                track_info_task = progress.add_task(
                    description="Fetching track info", total=len(items["items"])
                )
                for item in items["items"]:
                    # ... Existing code ...
                    pass  # placeholder so the skeleton is valid Python

                # Introduce a wait time between API calls
                time.sleep(1)  # Wait for 1 second before making the next API call

                # ... Rest of the existing code ...

                # Update progress information

                if total_songs == offset:
                    break

    elif item_type == "album":
        with Progress() as progress:
            album_songs_task = progress.add_task(
                description="Fetching songs from the album.."
            )
            while True:
                # ... Existing code ...

                # Introduce a wait time between API calls
                time.sleep(1)  # Wait for 1 second before making the next API call

                # ... Rest of the existing code ...

                # Update progress information

                if album_total == offset:
                    break

    elif item_type == "track":
        # ... Existing code ...
        pass  # placeholder so the skeleton is valid Python

    return songs_list

# ... Rest of the script ...

It's much slower now, but it should hopefully work better with the API restrictions. I can probably limit it even less than that; I'll see what works. There could be a setting for that somewhere... somehow :)

Something like -sl 0.5 (as in "spotify limit", with the wait time in seconds).

I changed this too:

                items = sp.playlist_items(
                    playlist_id=item_id,
                    fields="items.track.name,items.track.artists(name, uri),"
                    "items.track.album(name, release_date, total_tracks, images),"
                    "items.track.track_number,total, next,offset,"
                    "items.track.id",
                    additional_types=["track"],
                    offset=offset,
                    limit=50,  # Adjust this to the desired number of tracks per call
                )

So currently, with time.sleep(0.7) and 50 tracks per API call (instead of the 100 it fetches by default), I was able to process all the playlist items. It took a while, but it didn't stop. I'll see if I can lower time.sleep in future runs, since I suspect that using 2 API calls every time it fetched 100 items from the playlist may have contributed to accumulating too many API calls in a 30-second period.
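The pattern described above (pages of 50 with a fixed delay between calls) can be written as a small generic helper. This is a sketch, not spotify_dl code: `fetch_page` is a hypothetical stand-in for a call like `sp.playlist_items(..., offset=offset, limit=50)`, and the 1-second delay is just the value suggested in this thread.

```python
import time

def fetch_all_items(fetch_page, page_size=50, delay=1.0, sleep=time.sleep):
    """Fetch a paginated collection in pages of `page_size`, sleeping
    `delay` seconds between API calls.

    fetch_page(offset, limit) must return a dict with "items" (list)
    and "total" (int), shaped like spotipy's playlist_items response.
    """
    items = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        items.extend(page["items"])
        offset += len(page["items"])
        if offset >= page["total"] or not page["items"]:
            break
        sleep(delay)  # space out calls to stay under the rate limit
    return items
```

A 3226-item playlist then costs about 65 calls spaced one second apart, rather than a burst of back-to-back requests.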

@ray2301

ray2301 commented Aug 20, 2023

By now I understand that we make an API call not only for the playlist items (which I still think should be 50 per page) but also an additional call for each track's information. And it currently has no limits in spotify.py, so it only depends on the spotipy defaults, as I understand it.

Since this is a playlist with 3200 items, it will go too fast and hit the limit quickly, especially because we make 3200 API calls plus additional ones for each 50 (or 100) tracks we fetch from the playlist. Even if there are no download errors, and even if we don't have to fetch the playlist again but want to download more than one playlist, it will hit the limit too quickly.

It really would be better if the time.sleep were implemented and could be set manually: 0.7 was not enough, since I got that connection error too many times (while downloading tracks) and eventually hit the limit anyway. It works fine with a full 1-second delay; that gets me 50 tracks per minute, and including the 1 API call for those 50 tracks it should still be fewer than 30 API calls per 30 seconds, which should keep it going all day if needed. If you implement that with the ability to set it, as I said before (something like -command 1), it can be adjusted when needed, and it probably should be set for big playlists. This is what I changed, if you want to have a look.

I just have no idea how to debug that download error; it's always the same thing. Sorry for talking to myself a lot here, but I was trying to fix this API restriction issue so I can at least re-run it without changing the API keys every time. I want to add it to my PATH and not think about it too much, but I do :) If the API doesn't get restricted, I can create a .bat with a loop that re-runs on error and forget about it, and that's how this all started :)

@ray2301

ray2301 commented Aug 21, 2023

In the end, this is what worked best. Since it crashes with this playlist every 15-20 minutes while downloading tracks, I can never make the playlist processing faster on the next run and still retain the ability to fetch the tracks again after a crash without being restricted by the Spotify API. The playlist has exactly 3226 songs, and I need to fetch them at least twice in an hour; that's a lot of API calls in one hour. What I did in the end was wait 15 seconds after each 50 tracks were processed. So now it fetches 50 tracks, processes them immediately (blazing fast), waits 15 seconds, fetches the next 50, and so on. That way I can fetch all 3226 tracks again and again and again. I'm at 1761/3226 downloaded files now, and I think I can just make a .bat to re-run it automatically when it crashes and go to sleep. Hopefully I'll wake up tomorrow with it still running, but with all the tracks downloaded :)

@SathyaBhat
Owner

Hi @ray2301, thanks for the detailed write-ups. I never thought I'd see playlists with multiple tens of thousands of tracks, so there wasn't any thought given to batch sizes or (exponential) backoffs when fetching the data. I'm not even sure what Spotify's default rate limit is.

@ray2301

ray2301 commented Aug 27, 2023

Oh, I think nobody knows. It's calculated from how many API calls you make in a 30-second period, so nobody really understands how it works (I did a lot of research on this, but you just can't know; you can only try to get it right by estimating how many API calls Spotify will make for a specific task in a 30-second period and keeping them under the limit). But if you retrieve data the way their API documentation describes (some things can be fetched in batches, with a maximum number of items that still counts as 1 API call), you can make it work. If you start making too many API calls, you get a 429 error, and the error headers should tell you how many seconds you have to wait. If you keep retrying after that, the wait time starts to increase, and if you're still retrying, it keeps increasing forever :) If your script can read that header and implement wait logic based on it, it can stop on a 429 and wait before retrying, but since I could never read the Retry-After header information, I had to restrict the call rate manually.

So the script I mentioned, which gets radio data from about 100 radio stations and updates my playlists (named after those stations) with new tracks, makes 2 API calls per second (it's restricted to 0.5 seconds per API call), and I never get restricted. The script runs 24/7. I found that this works as expected with my flow: it runs in loops (with a 10-minute wait before it fetches all the data again for all the stations; it skips existing entries) for 1 full month before the playlists get cleaned automatically (on the 1st of the month, when the cleaning script runs and deletes songs from the playlists), and then it all starts over. Each month I can look at the playlist for a specific radio station and see how many songs they actually play in a month (since duplicates are not added), and then it all resets on the 1st and starts again. But the script never stops running (it's a .bat with many scripts that execute specific things on specific dates). You should be able to make more than 2 API calls per second, but you need to find the correct back-off strategy. I mean, if you're going to think about it in the future; you don't have to, I'm just sharing my experience :) I'm actually having fun with this even though it can be exhausting, but I have a lot of time in my life now, so it's good for the brain :)
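The "at most 2 calls per second" restriction described above amounts to a minimum-interval throttle. A minimal sketch, not taken from either script (the 0.5 s default is just the interval mentioned here):

```python
import time

class Throttle:
    """Enforce a minimum interval between successive calls,
    e.g. 0.5 s for roughly 2 API calls per second."""

    def __init__(self, min_interval=0.5):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous call

    def wait(self):
        """Block just long enough so that calls are at least
        `min_interval` seconds apart, then record the call time."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Calling `throttle.wait()` immediately before every API request keeps the overall call rate under the cap without slowing down runs that are already spaced out by other work.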

There was nothing I could do in the end with your script, because it was just doing everything so fast and I couldn't really pinpoint why or where. So I found this, which didn't really work (and still doesn't tag the year and the album track number, and I can't figure out how to get more data for the tags), and I started doing what I know how to do: fixing the little things I could and implementing fuzzy matching, so the tracks can at least be chosen with the correct data and the smallest length difference among the best possible results. I mean, I broke the normal downloader (so it doesn't work), but the precise mode works perfectly now, and the results I'm getting are pretty much perfect (it even has cover art, just missing the year and track number in the tags, since I can't figure out how to get those two).

I don't know if you're interested in such an approach for your script, but have a look here to see what I mean. Everything I needed was pretty much in downloader.py, so it was easier for me to understand that script, since it was much simpler. As I updated downloader.py, I left a bunch of comments, so you can see them when you press the three dots that show the comments from a specific update. From them, you'll start to understand what I was trying to do and why. Every time I changed things, I watched at least 500 files being processed and what happened to them, so this was fully tested in the end on 1000 files that all seem to be perfectly synced in length, with the correct songs applied. I don't know if I left "Auto-generated by YouTube" as the search-term add-on in the last downloader.py, but when that is removed, it will fetch everything that matches the search term; if it's left in, it will fetch more official audio files. But then it could skip some songs, because the search results are different, so we don't get everything we would from a search without adding "Auto-generated by YouTube" or "Provided to YouTube by" to the search terms. If you don't know why I'm talking about these two phrases, read those comments :D

In the end I found what I needed and made it work, but I made it just for me. I'm not someone who can maintain things and always pinpoint an error all by myself, so I'm leaving you this so you can see the idea I had: find the correct track on YouTube based on the closest length to Spotify's length, plus fuzzy matching, to get the best possible audio quality in the downloaded files by using specific boosters and filters. Maybe it'll come in handy if you continue working on this.

I'd love to see you implement some backoffs for Spotify and some API restrictions, so this flow can work even when you want to download a playlist of 3000 files :) Anyway, I'm leaving you my ideas, and if they can make something better, good.

Now, the main difference in why that one works and yours runs into 429 errors on Spotify is that the script I used in the end fetches 100 playlist items from Spotify, goes through them, and then downloads from YouTube, so some time passes after the first API calls. Then it fetches the next 100 items from Spotify and downloads them, and so on.
Even when it skips files that are already downloaded, it doesn't retrieve the next 100 Spotify items immediately after the first 100; it waits a few seconds for the next page to load. That's why nothing gets blocked.
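The interleaved flow described above (fetch a page, download that batch, then fetch the next page) can be sketched as follows. `fetch_page` and `download` are hypothetical stand-ins, not spotify_dl functions:

```python
def process_playlist(fetch_page, download, page_size=100):
    """Interleave playlist fetching with downloading so the download
    time itself spaces out the Spotify API calls.

    fetch_page(offset, limit) returns a list of tracks (empty once the
    playlist is exhausted); download(track) downloads one track (and
    may skip files that already exist on disk).
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break
        for track in page:
            download(track)  # slow step; naturally throttles the next fetch
        offset += len(page)
```

Because each Spotify page request is separated by an entire batch of downloads, the API call rate stays low without any explicit sleep.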
