Add report generation to workflow (move off of worker VM) #177

giancarloaf · 2023-02-02T01:38:17Z

The report generation script currently runs from a crontab on the "worker" VM. We would like to move it into the data-pipeline workflow so that it runs at the end of the Dataflow pipeline.

tunetheweb · 2023-02-02T15:56:35Z

It's a little trickier, as we have three groups of reports:

1. The bulk of the reports can be run once the old tables are ready.
2. Some reports depend on the httparchive.blink.* tables, which aren't updated until the 1st of the month because they depend on two BigQuery scheduled tasks (materialize_blink_features, then Materialize Blink Feature Percentages, which depends on the first job). Could those tasks be run as part of the pipeline so we don't have to wait until the 1st?
3. The CrUX data is not available until the 2nd Tuesday of the month, so we currently run those reports on the 15th of the month.

Would be lovely to clean all this up!
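
As a rough illustration of the CrUX timing above, here is a minimal Python sketch that computes the 2nd Tuesday of a month. The function name second_tuesday is hypothetical and not part of any existing script. It only illustrates why a fixed run on the 15th is always late enough: the 2nd Tuesday can never fall later than the 14th.

```python
import calendar
from datetime import date


def second_tuesday(year: int, month: int) -> date:
    """Return the date of the 2nd Tuesday of the given month."""
    # weekday() uses Monday = 0, so calendar.TUESDAY is 1.
    first_weekday = date(year, month, 1).weekday()
    # Days from the 1st of the month to the first Tuesday (0 if the 1st is a Tuesday).
    days_until_tuesday = (calendar.TUESDAY - first_weekday) % 7
    # The second Tuesday is one week after the first one.
    return date(year, month, 1 + days_until_tuesday + 7)


if __name__ == "__main__":
    for month in range(1, 13):
        print(second_tuesday(2024, month).isoformat())
```

A workflow step could compare the run date against this value instead of waiting for a fixed day of the month, assuming the CrUX data really does land on that day.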

giancarloaf · 2024-03-09T19:06:40Z

I believe I found the current report generation script on the worker VM, under igrigorik's user crontab:

giancarlo_faranda@worker:~$ sudo su igrigorik 
igrigorik@worker:/home/giancarlo_faranda$ crontab -l
#0 15 * * * /bin/bash -l -c 'cd /home/igrigorik/code && ./sync_csv.sh `date +\%b_1_\%Y`'  >> /var/log/HAimport.log 2>&1
#0  8 * * * /bin/bash -l -c 'cd /home/igrigorik/code && ./sync_csv.sh mobile_`date +\%b_1_\%Y`'  >> /var/log/HAimport.log 2>&1

#0 10 * * * /bin/bash -l -c 'cd /home/igrigorik/code && ./sync_har.sh chrome' >> /var/log/HA-import-har-chrome.log 2>&1
#0 11 * * * /bin/bash -l -c 'cd /home/igrigorik/code && ./sync_har.sh android' >> /var/log/HA-import-har-android.log 2>&1

# Attempt to run the reports everyday
0  8 * * * /bin/bash -l -c 'cd /home/igrigorik/code && sql/generate_reports.sh -th `date "+\%Y_\%m_01"` -l ALL' >> /var/log/generate_reports.log 2>&1

# Run the reports on the 2nd to pick up blink table updates
0  7 2 * * /bin/bash -l -c 'cd /home/igrigorik/code && sql/generate_reports.sh -th `date -d "-1 month" "+\%Y_\%m_01"` -l ALL' >> /var/log/generate_last_months_reports.log 2>&1

# Run the CrUX reports on 15th
0  7 15 * * /bin/bash -l -c 'cd /home/igrigorik/code && sql/generate_reports.sh -tfh `date -d "-1 month" "+\%Y_\%m_01"` -r "*crux*" -l ALL' >> /var/log/crux_reruns.log 2>&1
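
To make the three active crontab entries above easier to port into a workflow step, here is a hedged Python sketch of the date arguments and command lines they produce. The helper names report_date and due_commands are hypothetical. The script path and the flags (-th, -tfh, -r, -l) are copied from the crontab as-is, without assuming what they mean. The sketch only builds the argument lists and does not run anything.

```python
from datetime import date


def report_date(today: date, months_back: int = 0) -> str:
    """Return the first-of-month string in the YYYY_MM_01 form used by the crontab."""
    # Count months from year 0 so that stepping back from January rolls the year over.
    index = today.year * 12 + (today.month - 1) - months_back
    year, month_zero_based = divmod(index, 12)
    return f"{year:04d}_{month_zero_based + 1:02d}_01"


def due_commands(today: date) -> list[list[str]]:
    """List the generate_reports.sh invocations the active cron entries would start today."""
    script = "sql/generate_reports.sh"
    # Daily entry: attempt the current month's reports.
    commands = [[script, "-th", report_date(today), "-l", "ALL"]]
    if today.day == 2:
        # Rerun last month's reports to pick up the blink table updates.
        commands.append([script, "-th", report_date(today, 1), "-l", "ALL"])
    if today.day == 15:
        # Rerun last month's CrUX reports.
        commands.append(
            [script, "-tfh", report_date(today, 1), "-r", "*crux*", "-l", "ALL"]
        )
    return commands


if __name__ == "__main__":
    for command in due_commands(date(2024, 3, 15)):
        print(" ".join(command))
```

In a workflow, the day-of-month checks would presumably be replaced by checks that the upstream tables are actually ready; the fixed days are kept here only to mirror the crontab.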

giancarloaf created this issue from a note in 10x capacity (To do) Feb 2, 2023

giancarloaf added this to the M3: Improving the analysis pipeline milestone Feb 2, 2023

tunetheweb mentioned this issue Feb 2, 2023

Update crontab to latest HTTPArchive/bigquery#181

Merged
