How to pull all URLs modified today? #124

Answered by eliasdabbas
gk2go asked this question in Q&A
You can create a variable holding the desired date (today, for example) and then filter the sitemap DataFrame to the rows where the date extracted from lastmod equals that date.

For example:

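import datetime

import advertools as adv
import pandas as pd

# Download the news sitemap into a DataFrame (lastmod comes back as a datetime column)
sitemap = adv.sitemap_to_df('https://www.nytimes.com/sitemaps/new/news.xml.gz')

# The date to match; replace with the current date as needed
today = datetime.datetime(2021, 1, 19)

# Keep only the rows whose lastmod date equals `today`
sitemap[pd.to_datetime(sitemap['lastmod'].dt.date).eq(today)]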
loc lastmod publication_name news_publication_date
0 https://www.nytimes.com/live/2021/01/19/us/inauguration-day-biden 2021-01-19 22:08:17+00:00 The New York Times 2021-01-19T13:30:02Z
1 https://www.nytimes.com/2021/01/11/us/ca-covid-surge.html 2021-01-19 22:07:33+00:00 Th…
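To filter on the actual current date rather than a hard-coded one, the same idea can be wrapped in a small helper. This is only a sketch: `modified_on` and the `sample` DataFrame are hypothetical names made up for illustration, the sample stands in for the output of `adv.sitemap_to_df`, and "today" is assumed to mean the current date in UTC.

```python
import datetime

import pandas as pd


def modified_on(sitemap_df, day=None):
    """Return the rows whose `lastmod` falls on `day` (UTC); defaults to today."""
    if day is None:
        # Assumption: "today" means the current calendar date in UTC
        day = datetime.datetime.now(datetime.timezone.utc).date()
    # Normalize lastmod to UTC datetimes; missing values become NaT
    lastmod = pd.to_datetime(sitemap_df["lastmod"], utc=True)
    # Compare calendar dates only; NaT rows never match
    return sitemap_df[lastmod.dt.date == day]


# Hypothetical stand-in for the DataFrame returned by adv.sitemap_to_df(...)
sample = pd.DataFrame({
    "loc": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ],
    "lastmod": pd.to_datetime(
        ["2021-01-19 22:08:17+00:00", "2021-01-18 09:00:00+00:00", None],
        utc=True,
    ),
})

result = modified_on(sample, datetime.date(2021, 1, 19))
print(result["loc"].tolist())
```

Calling `modified_on(sitemap)` with no second argument would then select the URLs whose lastmod date is the current UTC date.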

Replies: 2 comments 5 replies

Answer selected by eliasdabbas