Releases: shailshouryya/yt-videos-list

0.6.7: Fix pip installation problem and improve features & performance

11 Nov 04:06
12a3ef6
  • BUGFIX

    • fix pip installation problem due to incorrectly formatted
      version specifiers
    • update video duration extraction to correctly
      extract the duration of each video and avoid
      writing 'N/A'
  • FEATURE IMPROVEMENTS

    • improve identification of seen videos in csv files by
      • avoiding potentially brittle regular expression matching
      • parsing each row of the csv file and extracting the
        (Video ID|Video URL) value from the corresponding column directly
    • normalize whitespace to avoid including newlines,
      carriage returns, and multiple consecutive whitespace characters
      in the video title
    • improve logging messages by including time.time() and
      time.perf_counter() when logging the time taken to perform
      an operation
  • PERFORMANCE IMPROVEMENTS

    • increase write efficiency by completely avoiding writing to a
      temporary file when no new videos are found for an existing file
  • INTERNAL IMPROVEMENT

    • the following change does not affect the functionality of the program
      • add unit tests for the video title whitespace normalization
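
As a rough illustration of the video title whitespace normalization described above, the sketch below collapses newlines, carriage returns, and runs of whitespace into single spaces. The function name `normalize_whitespace` is hypothetical and is not necessarily what the package uses internally.

```python
def normalize_whitespace(title: str) -> str:
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines, carriage returns) and drops empty strings,
    # so joining with one space collapses every run into a single space
    return ' '.join(title.split())


print(normalize_whitespace('A  video\r\ntitle \n with\textra   spaces'))
# prints: A video title with extra spaces
```

A unit test for this kind of helper only needs a few titles with mixed whitespace and the expected single-spaced result.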

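The row-by-row identification of seen videos mentioned in the 0.6.7 notes can be sketched with the standard `csv` module. This is only an illustration under assumptions: the helper name `seen_video_ids` and the sample rows are made up, and the header uses the column names listed in the 0.6.0 notes.

```python
import csv
import io


def seen_video_ids(csv_file) -> set:
    # hypothetical helper: read the header row, then pull the value from the
    # 'Video ID' or 'Video URL' column of every row (no regular expressions)
    reader = csv.DictReader(csv_file)
    column = 'Video ID' if 'Video ID' in (reader.fieldnames or []) else 'Video URL'
    return {row[column] for row in reader if row.get(column)}


# made-up sample data; a comma inside a quoted title does not confuse the parser
sample = io.StringIO(
    'Video Number,Video Title,Video Duration,Video URL,Watched,Watch again later,Notes\n'
    '2,"Title, with a comma",10:01,https://www.youtube.com/watch?v=bbbbbbbbbbb,,,\n'
    '1,First video,3:20,https://www.youtube.com/watch?v=aaaaaaaaaaa,,,\n'
)
seen = seen_video_ids(sample)
print(len(seen))
```

Reading the value from a named column avoids depending on how a pattern happens to match the rest of the row.
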
0.6.6: Update scraping logic for the new UI

05 Dec 00:47
53e2bc1

0.6.5: Support newer driver binaries

28 Nov 06:51
0b81d1c
  • BINARY UPDATES
    • Mozilla Firefox
      • geckodriver v0.32.0 (Firefox versions ≥ 104)
      • geckodriver v0.31.0 (Firefox versions ≥ 99)
    • Opera Stable 82, 83, 84, 85, 88, 89, 90, 91, 92 & 93
      • operadriver v.107.0.5304.88 (Opera Stable 93)
      • operadriver v.106.0.5249.119 (Opera Stable 92)
      • operadriver v.105.0.5195.102 (Opera Stable 91)
      • operadriver v.104.0.5112.81 (Opera Stable 90)
      • operadriver v.103.0.5060.66 (Opera Stable 89)
      • operadriver v.102.0.5005.61 (Opera Stable 88)
      • there was no operadriver release specifically for version 101 (Opera Stable 87)
      • there was no operadriver release specifically for version 100 (Opera Stable 86)
      • operadriver v.99.0.4844.51 (Opera Stable 85)
      • operadriver v.98.0.4758.82 (Opera Stable 84)
      • operadriver v.97.0.4692.71 (Opera Stable 83)
      • operadriver v.96.0.4664.45 (Opera Stable 82)
    • Google Chrome version 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, & 108 (updated version 97 binaries)
      • chromedriver 108.0.5359.22
      • chromedriver 107.0.5304.62
      • chromedriver 106.0.5249.61
      • chromedriver 105.0.5195.52
      • chromedriver 104.0.5112.79
      • chromedriver 103.0.5060.134
      • chromedriver 102.0.5005.61
      • chromedriver 101.0.4951.41
      • chromedriver 100.0.4896.60
      • chromedriver 99.0.4844.51
      • chromedriver 98.0.4758.102
      • chromedriver 97.0.4692.71 (previously 97.0.4692.20)
    • Brave Browser version 96, 97, 98, 99, 102, 103, 104, 105, 106, & 107
      • bravedriver v.107.0.5304.88 (uses operadriver binaries)
      • bravedriver v.106.0.5249.119 (uses operadriver binaries)
      • bravedriver v.105.0.5195.102 (uses operadriver binaries)
      • bravedriver v.104.0.5112.81 (uses operadriver binaries)
      • bravedriver v.103.0.5060.66 (uses operadriver binaries)
      • bravedriver v.102.0.5005.61 (uses operadriver binaries)
      • there was no operadriver release specifically for version 101
      • there was no operadriver release specifically for version 100
      • bravedriver v.99.0.4844.51 (uses operadriver binaries)
      • bravedriver v.98.0.4758.82 (uses operadriver binaries)
      • bravedriver v.97.0.4692.71 (uses operadriver binaries)
      • bravedriver v.96.0.4664.45 (uses operadriver binaries)
    • Microsoft Edge version 100, 101, 102, 103, 104, 105, 106, 107, 108, & 109 (updated version 96, 97, & 98 binaries)
      • msedgedriver 109.0.1481.0
      • msedgedriver 108.0.1462.15
      • msedgedriver 107.0.1418.42
      • msedgedriver 106.0.1370.52
      • msedgedriver 105.0.1343.53
      • msedgedriver 104.0.1293.91
      • msedgedriver 103.0.1264.77
      • msedgedriver 102.0.1245.62
      • msedgedriver 101.0.1210.53
      • msedgedriver 100.0.1185.60
      • there was no msedgedriver release specifically for version 99
      • msedgedriver 98.0.1085.0 (previously 98.0.1086.0)
      • msedgedriver 97.0.1072.76 (previously 97.0.1072.8)
      • msedgedriver 96.0.1054.75 (previously 96.0.1054.26)
  • MINOR BUGFIXES
    • Update URL for Quanta Magazine channel (commit 06fa9d8)
    • Update time duration for video 130 in test reference files (commit b8641f7)
    • Use call command to properly run helper batch script (commit d519edf)
    • Make browser version detection more robust
  • INTERNAL IMPROVEMENTS
    • Update save_thread_result package dependency version number → 0.0.9 (commit b5a9f14)
    • Support browser versions up to 120 (commit a155c05)

0.6.4: Optimize multithreading and use explicit exception chaining

10 Aug 20:26
305305a
  • BUGFIXES

    • update XPath for blocking cookies button
    • make url a required positional argument
  • FEATURE IMPROVEMENTS

    • raise error instead of printing message and then calling sys.exit()
      • see commits with a commit message starting with "Raise"
      • also, see commit d43ef6a
    • use explicit exception chaining
    • show warning for users on unsupported operating systems
    • include real time taken by program
      • see commits with a commit message
        • starting with "Include real time"
        • including log_time_taken
  • PERFORMANCE IMPROVEMENTS

    • optimize multithreading for create_list_from function
  • INTERNAL IMPROVEMENTS

    • these changes do not affect the functionality of the program
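
For readers unfamiliar with explicit exception chaining, here is a minimal, generic sketch of raising an error (rather than printing a message and exiting) while keeping the original cause attached. The exception class and the function are hypothetical and are not taken from the package.

```python
class InvalidSettingError(ValueError):
    """Hypothetical exception type used only for this illustration."""


def parse_seconds(raw_value: str) -> float:
    try:
        return float(raw_value)
    except ValueError as error:
        # `raise ... from error` stores the original exception as __cause__,
        # so the traceback shows both errors, and the caller decides whether
        # to handle the problem or let the program stop
        raise InvalidSettingError(f'expected a number, got {raw_value!r}') from error


try:
    parse_seconds('abc')
except InvalidSettingError as error:
    print(type(error.__cause__).__name__)
    # prints: ValueError
```

Raising instead of calling sys.exit() lets code that imports the module catch the error and continue.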

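The release notes mention logging the real time taken (0.6.4) and including both time.time() and time.perf_counter() in timing messages (0.6.7). A minimal sketch of measuring an operation with both clocks follows; the helper name `timed` and the message wording are assumptions, not the package's actual logging code.

```python
import time


def timed(operation):
    # hypothetical helper: run operation() and report how long it took
    wall_start = time.time()          # wall-clock timestamp (seconds since the epoch)
    perf_start = time.perf_counter()  # monotonic, high-resolution counter
    result = operation()
    perf_elapsed = time.perf_counter() - perf_start
    wall_elapsed = time.time() - wall_start
    print(f'It took {perf_elapsed} seconds (perf_counter) and {wall_elapsed} seconds (time)')
    return result, perf_elapsed, wall_elapsed


result, perf_elapsed, wall_elapsed = timed(lambda: sum(range(1000)))
```

time.perf_counter() is the better choice for measuring short durations, while time.time() gives a timestamp that can be compared with other logs.
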
0.6.3: Support newer driver binaries

28 Nov 01:06
d8b8555
  • BINARY UPDATES
  • Mozilla Firefox
    • geckodriver v0.30.0 (Firefox versions ≥ 92)
  • Opera Stable 77, 78, 79, 80, & 81
    • operadriver v.95.0.4638.54 (Opera Stable 81)
    • operadriver v.94.0.4606.61 (Opera Stable 80)
    • operadriver v.93.0.4577.63 (Opera Stable 79)
    • operadriver v.92.0.4515.107 (Opera Stable 78)
    • operadriver v.91.0.4472.77 (Opera Stable 77)
  • Google Chrome version 92, 93, 94, 95, 96, & 97 (updated version 91 binaries)
    • chromedriver 97.0.4692.20
    • chromedriver 96.0.4664.45
    • chromedriver 95.0.4638.69
    • chromedriver 94.0.4606.113
    • chromedriver 93.0.4577.63
    • chromedriver 92.0.4515.107
    • chromedriver 91.0.4472.101 (previously 91.0.4472.19)
  • Brave Browser version 91, 92, 93, 94, & 95
    • operadriver v.95.0.4638.54 (uses operadriver binaries)
    • operadriver v.94.0.4606.61 (uses operadriver binaries)
    • operadriver v.93.0.4577.63 (uses operadriver binaries)
    • operadriver v.92.0.4515.107 (uses operadriver binaries)
    • operadriver v.91.0.4472.77 (uses operadriver binaries)
  • Microsoft Edge version 93, 94, 95, 96, 97, & 98 (updated version 90, 91, & 92 binaries)
    • msedgedriver 98.0.1086.0
    • msedgedriver 97.0.1072.8
    • msedgedriver 96.0.1054.26
    • msedgedriver 95.0.1020.53
    • msedgedriver 94.0.992.58
    • msedgedriver 93.0.961.52
    • msedgedriver 92.0.902.84 (previously 92.0.881.0)
    • msedgedriver 91.0.864.71 (previously 91.0.864.19)
    • msedgedriver 90.0.818.66 (previously 90.0.818.56)
  • MINOR BUGFIXES
    • handle videos with no "Video Duration" field (commit 2f538e1)
      • this is an extremely rare edge case
        • based on anecdotal data, occurs about 1 in every 70000 videos
    • update URLs shown in exception messages (commit 3f09612 & commit 99ed682)
    • correctly handle unfinished threads in create_list_from() method (commit aa4ff3d)
    • generalize URL normalization for removing trailing parameters (commit 0789a3e)
      • this removes any trailing tracking parameters that might be associated with a video URL
        • e.g. youtube.com/watch?v=abcdefghijk?pp=sAQB → youtube.com/watch?v=abcdefghijk
    • verify page has videos (commit 82a4856)
      • prevents crashing on channels with 0 public videos
  • LOGGING IMPROVEMENTS
  • INTERNAL CHANGES
    • refactor code to:
      • reduce code duplication
      • make variable and function names more context specific
      • place repeated code inside variables
      • make browser naming more specific (commit 81144cb)
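
The URL normalization described in the MINOR BUGFIXES above (removing trailing parameters after the video ID) could look roughly like the sketch below. The function name and the regular expression are illustrative assumptions; the actual logic lives in commit 0789a3e and may differ.

```python
import re


def strip_trailing_parameters(video_url: str) -> str:
    # keep everything up to and including the v=<video id> value, and drop
    # whatever follows the next '?', '&', or '#'
    match = re.match(r'^(.*?[?&]v=[^?&#]+)', video_url)
    # URLs without a v= parameter are returned unchanged
    return match.group(1) if match else video_url


print(strip_trailing_parameters('youtube.com/watch?v=abcdefghijk?pp=sAQB'))
# prints: youtube.com/watch?v=abcdefghijk
```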

0.6.2: Explicitly order videos page & check existing videos more strictly

12 Sep 21:55
bf96c7f

0.6.1: Change `create_list_for()` return, add features & improvements

07 Sep 03:30
f8ca4a6
  • BREAKING CHANGE

    • BEFORE:
      • create_list_for() returned a str containing the name of the file the program wrote to
    • NOW:
      • create_list_for() returns a tuple containing
        • a list of lists containing the video information found by the program for the current run
          • by default, returns dummy video data to avoid cluttering the output
          • to return the actual video data, set the video_data_returned ListCreator attribute to True
            • dummy data: [[0, '', '', '']]
        • a tuple containing a str with the name of the channel (taken from the channel's heading) and a str with the name of the file written to
          • ('The Channel Name', 'the_name_of_the_file')
          • ('The Channel Name', '') if the ListCreator attributes are txt=False, csv=False, md=False, AND video_data_returned=True
      • see the NEW FEATURES section below for more details about video_data_returned
    • access the full documentation for the updated create_list_for method with help(ListCreator.create_list_for) in the python interpreter
  • BUGFIX

    • fixes cookie_consent blocking logic for new HTML in GDPR regions
      • YouTube updated the HTML formatting for blocking cookie consent, and the previous cookie consent blocking logic broke
      • this release fixes the blocking logic to work with the new HTML formatting
  • NEW FEATURES

    • overview for the new ListCreator attributes given here, but run help(ListCreator) in the python interpreter or read the "More API information" section in the python README to see the full documentation:
      • file_suffix allows more control over the file naming (True by default)
      • all_video_data_in_memory scrapes the ENTIRE YouTube channel's videos page, EVEN if files exist for the channel already (False by default)
        • must also set the video_data_returned attribute to True to actually get this information
      • video_data_returned returns the video data for all videos the program scraped (False by default)
        • data returned depends on a number of factors, see full documentation for more details
      • video_id_only saves only the video ID instead of the entire URL (False by default)
    • overview for the updated file_name argument options in the create_list_for method given here, but run help(ListCreator.create_list_for) in the python interpreter to see the full documentation:
      • file_name='auto' names the output file(s) using the name that shows up under the banner when you navigate to the channel's homepage (with spaces removed)
      • file_name='id' names the output file(s) using the identifier from the URL provided to the url argument
        • run help(ListCreator.create_list_for) for a comprehensive list of examples
        • using file_name='id' is very useful when multiple channels have the SAME channel name
  • PERFORMANCE IMPROVEMENTS

    • BEFORE:
      • the program pulled the video data from the selenium instance and wrote to the file(s) directly
    • NOW:
      • the program loads the video data from the selenium instance into memory, THEN writes the saved video data from memory to the file(s)
        • the performance improvement is more noticeable when writing more information
          • for example:
            • writing information for 200 videos to just a csv file: negligible performance difference between writing to csv file directly and loading to memory & THEN writing to csv file
            • writing information for 200 videos to csv, txt, md files: slight performance difference between writing to files directly and loading to memory & THEN writing to files, but still not much of a performance difference
            • writing information for 20000 videos to just a csv file: noticeable performance difference between writing to csv file directly and loading to memory & THEN writing to csv file
            • writing information for 20000 videos to csv, txt, md files: significant performance difference between writing to files directly and loading to memory & THEN writing to files
          • summary:
            • the performance difference between writing to ONE file directly and loading to memory & THEN writing to ONE file is barely noticeable for small jobs and more noticeable for larger jobs
            • the performance difference between writing to MULTIPLE files directly and loading to memory & THEN writing to MULTIPLE files is more noticeable for small jobs (compared to writing to only ONE file) and SIGNIFICANT for larger jobs
    • logs from tests used to benchmark performance included below:
See logs
for https://www.youtube.com/user/schafer5 (small channel, 230 videos)
writing to 1 file directly with csv=True, txt=False, md=False
  • to create the file:
It took 9.240757292005583            seconds to find 230 videos from https://www.youtube.com/user/schafer5/videos
It took 4.265756259999762            seconds to write all 230 videos to CoreySchafer_reverse_chronological_videos_list.csv
This program took 19.537945401003526 seconds to complete.
  • to update the file:
It took 0.8453300589972059          seconds to find 60 videos from https://www.youtube.com/user/schafer5/videos
It took 0.6392399440010195          seconds to write the 0 ***NEW*** videos to the pre-existing CoreySchafer_reverse_chronological_videos_list.csv
This program took 7.754261410002073 seconds to complete.
writing to 1 file by loading video information into memory THEN writing to files with csv=True, txt=False, md=False
  • to create the file:
It took 9.163404727999989            seconds to find 230 videos from https://www.youtube.com/user/schafer5/videos
It took 4.260267737000007            seconds to load information for 230 videos into memory
It took 0.002389371999996115         seconds to write all 230 videos to CoreySchafer_reverse_chronological_videos_list.csv
This program took 19.483281371000004 seconds to complete.
  • to update the file:
It took 0.8521808300000089          seconds to find 60 videos from https://www.youtube.com/user/schafer5/videos
It took 1.0964175420000117          seconds to load information for 60 videos into memory
It took 0.0015745449999826633       seconds to write the 0 ***NEW*** videos to the pre-existing CoreySchafer_reverse_chronological_videos_list.csv
This program took 7.985743492000012 seconds to complete.
writing to 3 files directly with csv=True, txt=True, md=True
  • to create the files:
It took 9.166668037003546            seconds to find 230 videos from https://www.youtube.com/user/schafer5/videos
It took 10.160974278995127           seconds to write all 230 videos to CoreySchafer_reverse_chronological_videos_list.txt
It took 10.164936708999448           seconds to write all 230 videos to CoreySchafer_reverse_chronological_videos_list.csv
It took 10.168633003995637           seconds to write all 230 videos to CoreySchafer_reverse_chronological_videos_list.md
This program took 25.594990328005224 seconds to complete.
  • to update the files:
It took 0.8503098270011833          seconds to find 60 videos from https://www.youtube.com/user/schafer5/videos
It took 1.5225159670007997          seconds to write the 0 ***NEW*** videos to the pre-existing CoreySchafer_reverse_chronological_videos_list.csv
It took 1.5322243859991431          seconds to write the 0 ***NEW*** videos to the pre-existing CoreySchafer_reverse_chronological_videos_list.txt
It took 1.5359413480036892          seconds to write the 0 ***NEW*** videos to the pre-existing CoreySchafer_reverse_chronological_videos_list.md
This program took 8.472728426997492 seconds to complete.
writing to 3 files by loading video information into memory THEN writing to files with csv=True, txt=True, md=True
  • to create the files:
It took 9.367390958000005      seconds to find 230 videos from https://www.youtube.com/user/schafer5/videos
It took 4.218187391999997      seconds to load information for 230 videos into memory
It took 0.003894963000000473   seconds to write all 230 videos to CoreySchafer_reverse_chronological_videos_list.md
It took 0.005060710999998719   seconds to write all 230 videos to CoreySchafer_reverse_chronological_videos_list.csv
It took 0.006283445999997639   seconds to write all 230 videos to CoreySchafer_reverse_chronological_videos_list.txt
This program took 18.754924324 seconds to complete.
  • to update the files:
It took 0.8672965029999986          seconds to find 60 videos from https://www.youtube.com/user/schafer5/videos
It took 1.0901944209999996          seconds to load information for 60 videos into memory
It took 0.005667658999996661        seconds to write the 0 ***NEW*** videos to the pre-existing CoreySchafer_reverse_chronological_videos_list.csv
It took 0.008393589000000645        seconds to write the 0 ***NEW*** videos to the pre-existing CoreySchafer_reverse_chronological_videos_list.txt
It took 0.008197031000001687        seconds to write the 0 ***NEW*** videos to the pre-existing CoreySchafer_reverse_chronological_videos_list.md
This program took 8.090583961999997 seconds to complete.
for https://www.youtube.com/c/KhanAcademy (medium channel, 8095 videos)
writing to 1 file directly with csv=True, txt=False, md=False
  • to create the file:
It took 322.72226654399856          seconds to find 8095 videos from htt...
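
The "load into memory, THEN write" approach from the PERFORMANCE IMPROVEMENTS notes above can be sketched as follows. Everything here is a stand-in: `slow_source` imitates data pulled from the browser, the txt layout is invented for the example, and io.StringIO objects take the place of real files.

```python
import csv
import io


def slow_source():
    # stand-in for video data pulled from the browser; reading it is the
    # expensive step, so it should happen only once
    yield 2, 'Second video', 'https://www.youtube.com/watch?v=bbbbbbbbbbb'
    yield 1, 'First video', 'https://www.youtube.com/watch?v=aaaaaaaaaaa'


def load_into_memory(row_source):
    # read every row from the source exactly once
    return [list(row) for row in row_source]


def write_csv(rows, file):
    csv.writer(file, lineterminator='\n').writerows(rows)


def write_txt(rows, file):
    # invented layout, used only to show a second output format
    for number, title, url in rows:
        file.write(f'Video Number: {number}\nVideo Title: {title}\nVideo URL: {url}\n')


rows = load_into_memory(slow_source())
csv_file, txt_file = io.StringIO(), io.StringIO()
write_csv(rows, csv_file)  # both writers reuse the same in-memory rows
write_txt(rows, txt_file)
print(csv_file.getvalue())
```

With this shape, adding another output format costs only one more pass over the in-memory list rather than another pass over the browser data, which matches the pattern in the logs above.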

0.6.0: Add `verify_page_bottom_n_times`, `file_buffering`, Video Duration

30 Jul 08:54
c8a9613
  • compare changes to previous version
  • if you are an existing user, skim through the BREAKING CHANGE and NON-BREAKING CHANGES sections below
    • if you are a new user, you do not need to worry about these sections - just skip to the NEW FEATURES section at the bottom and read the python README to get started
  • BREAKING CHANGE
    • the program now extracts the video duration for every video uploaded by a channel
      • this will likely cause problems when updating pre-existing csv files, since
        • the video duration information goes in a new column
        • csv file renderers expect consistent column formatting throughout the file
          • BUT a pre-existing csv file will only have the Video Number,Video Title,Video URL,Watched,Watch again later,Notes columns
          • so updating a pre-existing csv file will result in newly extracted videos having the Video Number,Video Title,Video Duration,Video URL,Watched,Watch again later,Notes columns while the already extracted videos will only have the Video Number,Video Title,Video URL,Watched,Watch again later,Notes columns (no Video Duration column)
          • therefore, updating a pre-existing csv file will result in the newly extracted videos having 7 columns, while pre-existing videos will have only 6 columns
      • if you want to continue using your pre-existing csv file and do NOT WANT TO INCLUDE the video duration for previously extracted videos:
        • if you have NOT yet updated the pre-existing csv file:
          • APPROACH 1: use a csv file editor such as Excel, Google Sheets, Numbers, IDE extension, etc.
            • open the csv file
            • insert the Video Duration column between the Video Title and Video URL columns
            • save the file
              • the csv editor should automatically format the existing rows to include the Video Duration column
              • therefore, all rows should now have an empty cell for the Video Duration column
          • APPROACH 2: use a simple text editor/IDE
            • open the csv file
            • insert the Video Duration column between the Video Title and Video URL columns
            • text editors will NOT automatically format the existing rows to include the Video Duration column
              • so you will need to manually format the existing rows to include the Video Duration column
              • the simplest way to do this would be to use a Find and Replace operation:
                • Find all occurrences of: ,https://
                • Replace with: ,,https://
                  • this assumes the only urls in the csv file are in the Video URL column!
                    • if you have manually added/modified parts of the file and this is no longer true, you will have to modify this approach slightly to meet your needs
        • if you have ALREADY updated the pre-existing csv file:
          • you will not be able to use APPROACH 1 from above
          • you will need to use APPROACH 2 with slight modifications:
            • Find all occurrences of (with regular expression mode enabled): ([^:][^\d]{2}),https://
            • Replace with: $1,,https:// (depending on your editor, you may need to substitute $1 with \1 or something else)
              • looks for ,https:// where it is NOT preceded by :\d\d
                • since the most recently extracted videos will have the video duration but the already existing videos will not have the video duration
                • so this only adds a comma for previously extracted videos without the video duration
                • as with the unmodified APPROACH 2, this also assumes the only urls in the csv file are in the Video URL column!
                  • if you have manually added/modified parts of the file and this is no longer true, you will also have to modify this approach slightly to meet your needs
            • if the file is a chronological_videos_list file (as opposed to a reverse_chronological_videos_list file):
              • you will ALSO need to insert the Video Duration column between the Video Title and Video URL columns in the csv header
                • since chronological_videos_list files use the csv header from the pre-existing csv file
                  • NOTE the program updates the reverse_chronological_videos_list csv header every time the program looks for new videos when rerun on a previously scraped channel
                  • but usually this csv header update is not noticeable since the header does not change
                  • the csv header update is noticeable this time, however, since there is a new column (Video Duration)
                  • for chronological_videos_list files, however, the program never updates the csv header
      • if you want to continue using your pre-existing csv file and WANT TO INCLUDE the video duration for previously extracted videos:
        • rerun the program for the channel (in a different directory)
        • copy over any notes you took in the pre-existing file to the new file with the video duration information
      • if you do NOT want/care about using the pre-existing csv file
        • just delete the pre-existing csv file and rerun the program on the channel again (or run the program on the same channel from a different directory)
          • NOTE that if the channel deleted a video OR unlisted a video between
            • the time the video information was originally scraped
            • and the time you rerun the program after installing release 0.6.0+
            • the deleted/unlisted video(s) will not show up (no workaround for this - this is how YouTube displays videos)
  • NON-BREAKING CHANGES
    • txt and md files now also include the video duration information
      • this is simply an extra line in the output file, and will not cause any rendering issues since txt and md files do not depend on consistent formatting the way csv files do
    • txt and md files now use slightly different formatting such as
      • fewer newlines
      • md files using h3 headings for video information instead of bullet points (the bullet points were also improperly formatted previously, but since they are no longer used, this is not an issue)
    • NOTE that if you want these files to contain the video duration information, you will still need to rerun the program on the channel from scratch (either in a different directory, or after deleting the pre-existing files in the current directory)
  • NEW FEATURES
    • verify_page_bottom_n_times attribute
    • file_buffering attribute
      • for more information, see
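
As an alternative to the manual Find and Replace steps in the BREAKING CHANGE section above, the same column insertion can be sketched with Python's `csv` module. This is an unofficial sketch, not part of the package: the function name is hypothetical, it assumes old rows have exactly 6 columns and updated rows have 7, and it should only be run on a copy of the file.

```python
import csv
import io

OLD_COLUMN_COUNT = 6  # Video Number, Video Title, Video URL, Watched, Watch again later, Notes
DURATION_INDEX = 2    # Video Duration goes between Video Title and Video URL


def add_duration_column(source, destination):
    writer = csv.writer(destination, lineterminator='\n')
    for line_number, row in enumerate(csv.reader(source)):
        if len(row) == OLD_COLUMN_COUNT:
            # an old header row gets the column name, old data rows get an empty cell
            row.insert(DURATION_INDEX, 'Video Duration' if line_number == 0 else '')
        # rows that already have 7 columns are written back unchanged
        writer.writerow(row)


# made-up sample of a file that was ALREADY updated: one new row, one old row
already_updated = io.StringIO(
    'Video Number,Video Title,Video Duration,Video URL,Watched,Watch again later,Notes\n'
    '2,New video,4:05,https://www.youtube.com/watch?v=bbbbbbbbbbb,,,\n'
    '1,Old video,https://www.youtube.com/watch?v=aaaaaaaaaaa,,,\n'
)
fixed = io.StringIO()
add_duration_column(already_updated, fixed)
print(fixed.getvalue())
```

Because the decision is based on the number of columns rather than on the text before ,https://, the same function covers both the "NOT yet updated" and "ALREADY updated" cases, including the header of chronological_videos_list files.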

0.5.9: Add built-in multi-threading

28 Jun 00:14
2eed940
  • compare changes to previous version
  • creates new file I/O threads if writing to more than 1 file
  • supports scraping multiple channels from a txt file containing urls
    • see Scraping multiple channels from a file simultaneously with multi-threading section in python README for usage details
    • see __init__.py file for code changes
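
The idea of creating file I/O threads only when writing to more than 1 file can be illustrated with the standard `threading` module. This is a simplified sketch under assumptions: the function names are hypothetical, io.StringIO objects stand in for real files, and the package's own implementation (see __init__.py) may be organized differently.

```python
import io
import threading


def write_lines(lines, file):
    for line in lines:
        file.write(line + '\n')


def write_to_all(lines, files):
    if len(files) <= 1:
        # a single output does not need an extra thread
        for file in files:
            write_lines(lines, file)
        return
    # one thread per output; each thread writes to its own file object
    threads = [threading.Thread(target=write_lines, args=(lines, file)) for file in files]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait for every writer to finish


outputs = [io.StringIO(), io.StringIO(), io.StringIO()]
write_to_all(['first line', 'second line'], outputs)
print(outputs[0].getvalue())
```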

0.5.8: Fix bugs from release 0.5.7

24 May 02:01
9d8e804
  • compare changes to previous version
  • no changes in API or functionality
    • these changes should have been a part of release 0.5.7, but these bugs were missed during testing
  • see the following commits for bug fixes: