Add report generation to workflow (move off of worker VM) #177

Open
giancarloaf opened this issue Feb 2, 2023 · 2 comments

Comments

@giancarloaf
Collaborator

giancarloaf commented Feb 2, 2023

The report generation script currently runs from a crontab on the "worker" VM. We would like to move it into the data-pipeline workflow so that it runs as a step at the end of the Dataflow pipeline.

@tunetheweb
Member

It's a little trickier, as we have three groups of reports:

  1. The bulk of the reports can be run once the old tables are ready.
  2. Some reports depend on the httparchive.blink.* tables, which aren't updated until the 1st of the month because they depend on two BigQuery scheduled tasks (materialize_blink_features, then Materialize Blink Feature Percentages, which depends on the first job). Could these be run as part of the pipeline so we don't have to wait until the 1st?
  3. The CrUX data is not available until the 2nd Tuesday of the month, so we currently run those reports on the 15th of the month.

Would be lovely to clean all this up!
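
To make the three timing rules above concrete, here is a rough Python sketch of how a workflow step might decide which report groups are due on a given day. The function names and the group labels ("bulk", "blink", "crux") are hypothetical, and the sketch assumes the crawl's own tables are already loaded; the exact readiness conditions would need checking against the real scheduled jobs. One thing it does show: the second Tuesday always falls between the 8th and the 14th, which is why running the CrUX reports on the 15th is always late enough.

```python
from __future__ import annotations

import datetime as dt


def second_tuesday(year: int, month: int) -> dt.date:
    """Return the date of the second Tuesday of the given month."""
    first = dt.date(year, month, 1)
    # date.weekday(): Monday == 0, Tuesday == 1
    days_to_first_tuesday = (1 - first.weekday()) % 7
    return first + dt.timedelta(days=days_to_first_tuesday + 7)


def first_of_next_month(day: dt.date) -> dt.date:
    """Return the 1st of the month following the month of `day`."""
    return (day.replace(day=28) + dt.timedelta(days=4)).replace(day=1)


def report_groups_due(crawl_month: dt.date, today: dt.date) -> list[str]:
    """List the (hypothetical) report groups that could run today.

    Assumptions, not confirmed by the thread:
    - the crawl's tables are already available, so "bulk" is always due;
    - blink tables are refreshed on the 1st of the following month,
      so blink reports are safe from the 2nd onwards;
    - CrUX data is usable on the second Tuesday of the following month.
    """
    following = first_of_next_month(crawl_month)
    groups = ["bulk"]
    if today > following:
        groups.append("blink")
    if today >= second_tuesday(following.year, following.month):
        groups.append("crux")
    return groups


if __name__ == "__main__":
    january_crawl = dt.date(2023, 1, 1)
    for day in (dt.date(2023, 1, 20), dt.date(2023, 2, 2), dt.date(2023, 2, 15)):
        print(day, report_groups_due(january_crawl, day))
```

If the blink materialization jobs were triggered by the pipeline itself rather than on a monthly schedule, the date check for that group could be replaced by a dependency on those jobs completing.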

@giancarloaf
Collaborator Author

I believe I found the current report generation entries on the worker VM, in igrigorik's user crontab:

```
giancarlo_faranda@worker:~$ sudo su igrigorik 
igrigorik@worker:/home/giancarlo_faranda$ crontab -l
#0 15 * * * /bin/bash -l -c 'cd /home/igrigorik/code && ./sync_csv.sh `date +\%b_1_\%Y`'  >> /var/log/HAimport.log 2>&1
#0  8 * * * /bin/bash -l -c 'cd /home/igrigorik/code && ./sync_csv.sh mobile_`date +\%b_1_\%Y`'  >> /var/log/HAimport.log 2>&1

#0 10 * * * /bin/bash -l -c 'cd /home/igrigorik/code && ./sync_har.sh chrome' >> /var/log/HA-import-har-chrome.log 2>&1
#0 11 * * * /bin/bash -l -c 'cd /home/igrigorik/code && ./sync_har.sh android' >> /var/log/HA-import-har-android.log 2>&1

# Attempt to run the reports everyday
0  8 * * * /bin/bash -l -c 'cd /home/igrigorik/code && sql/generate_reports.sh -th `date "+\%Y_\%m_01"` -l ALL' >> /var/log/generate_reports.log 2>&1

# Run the reports on the 2nd to pick up blink table updates
0  7 2 * * /bin/bash -l -c 'cd /home/igrigorik/code && sql/generate_reports.sh -th `date -d "-1 month" "+\%Y_\%m_01"` -l ALL' >> /var/log/generate_last_months_reports.log 2>&1

# Run the CrUX reports on 15th
0  7 15 * * /bin/bash -l -c 'cd /home/igrigorik/code && sql/generate_reports.sh -tfh `date -d "-1 month" "+\%Y_\%m_01"` -r "*crux*" -l ALL' >> /var/log/crux_reruns.log 2>&1
```
