Dev: Cron Job

Code Hugger (Matthew Jones) edited this page Dec 4, 2020 · 3 revisions

Setting up and running the cron job

TODO: Document setting up the cron job (or link to another guide)

Cron Settings

TODO: Fill in the settings relevant to cron

  • CRON_BQ_IN_LIMIT - Maximum number of courses processed in one loop of BigQuery. The current default of 20 has been found to be too low; set it to something much higher, such as 1000. This setting will likely be removed in a future version.
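
As a rough illustration of what a per-loop course limit implies, the sketch below splits a list of course IDs into groups no larger than the limit. It is not taken from cron.py: the helper name chunk_course_ids is hypothetical, and reading the setting as a simple batch size is an assumption.

```python
from typing import Iterator, List, Sequence

# Assumed value, following the recommendation above (the shipped default is 20).
CRON_BQ_IN_LIMIT = 1000


def chunk_course_ids(
    course_ids: Sequence[int], limit: int = CRON_BQ_IN_LIMIT
) -> Iterator[List[int]]:
    """Yield successive groups of at most `limit` course IDs.

    Hypothetical helper: each group would feed one BigQuery loop.
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    for start in range(0, len(course_ids), limit):
        yield list(course_ids[start:start + limit])
```

With a limit of 20, 45 courses would need three loops; with 1000 they fit in one, which is why a low limit means more BigQuery round trips.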

Cron Logic

Cron performs different operations depending on the data being updated:

  • Currently for terms: (TODO: Document term update logic)
  • Currently for assignments, submissions and users (grades): on every run, cron deletes everything from the table and repopulates it entirely from UDW.
  • For Resources (BigQuery) there is update logic to save time and costs. The update logic works like this:

Resources update logic (update_with_bq_access in cron.py)

The resources update works as an "upsert" that inserts only the new data, so it first tries to determine how far back it needs to go.

Determine date when data needs to be gathered

  • If any course has a null course.data_last_updated value, update from the earliest date_start among this set of courses. If every course has the value set, go to the next step.
    • A bug was identified where nothing is updated when courses in this set are in the future; at least one course has to be in the past.
    • Another bug caused a really old course to trigger a full scan instead of just a semester scan.
  • Otherwise, update from the earliest course.data_last_updated. These values should currently all be the same, because they are set by the last cron run.
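
The two steps above can be sketched as follows. This is a minimal illustration, not the code in cron.py: the Course dataclass and the function name resource_update_start are hypothetical stand-ins, and date_start is assumed to be already resolved for each course.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass
class Course:
    # Hypothetical stand-in for a row of the course table.
    date_start: datetime                   # assumed to be already resolved
    data_last_updated: Optional[datetime]  # None until cron has loaded the course


def resource_update_start(courses: Sequence[Course]) -> datetime:
    """Return the date from which resource data needs to be gathered."""
    if not courses:
        raise ValueError("at least one course is required")
    # Step 1: courses that have never been updated drive the range.
    never_updated = [c for c in courses if c.data_last_updated is None]
    if never_updated:
        return min(c.date_start for c in never_updated)
    # Step 2: otherwise go back to the earliest previous update.
    return min(c.data_last_updated for c in courses)
```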

date_start is determined by this logic:

  • Check the course.date_start field in the database; if it is set, use that.
    • This may be populated manually in MyLA, or it may come from the course settings in Canvas. If it is not set, go to the next step.
  • Check the term. If the term is not null and has a date_start, use that date. If not, go to the next step.
  • Otherwise, use today as the start date for the course.
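
The fallback chain above could look roughly like this in Python. It is a sketch under assumptions: Term, CourseRecord and resolve_date_start are hypothetical names, not the actual MyLA models, and the optional today parameter is only there to make the example testable.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Term:
    # Hypothetical stand-in for the term record.
    date_start: Optional[datetime] = None


@dataclass
class CourseRecord:
    # Hypothetical stand-in for the course record.
    date_start: Optional[datetime] = None
    term: Optional[Term] = None


def resolve_date_start(
    course: CourseRecord, today: Optional[datetime] = None
) -> datetime:
    """Resolve a course's start date: course field, then term, then today."""
    # 1. An explicit course.date_start wins.
    if course.date_start is not None:
        return course.date_start
    # 2. Fall back to the term's start date, if there is a term with one.
    if course.term is not None and course.term.date_start is not None:
        return course.term.date_start
    # 3. Last resort: today.
    return today if today is not None else datetime.now()
```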

Remove/update data after this range

After the date range is determined, all records after this date are deleted from resource_access, and new records are inserted into both resource and resource_access from BigQuery. The queries that are run and the data that is loaded depend on the various settings for configuring the cron.
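
The delete-then-insert step can be illustrated with a small self-contained sketch that uses sqlite3 in place of the real database. Everything here is an assumption made for illustration: the column names, the ISO-8601 text timestamps, the inclusive date boundary, and the function name reload_resource_access. The rows that would come from BigQuery are passed in as plain tuples, and the resource table is left out for brevity.

```python
import sqlite3
from typing import Iterable, Tuple

# (resource_id, user_id, access_time as ISO-8601 text); assumed columns.
AccessRow = Tuple[str, int, str]


def reload_resource_access(
    conn: sqlite3.Connection, since: str, new_rows: Iterable[AccessRow]
) -> None:
    """Delete rows from `since` onward, then insert the freshly queried rows."""
    # Used as a context manager, the connection commits on success and
    # rolls back if an exception is raised, so the two steps stay together.
    with conn:
        # Treating the boundary as inclusive is an assumption.
        conn.execute(
            "DELETE FROM resource_access WHERE access_time >= ?", (since,)
        )
        conn.executemany(
            "INSERT INTO resource_access (resource_id, user_id, access_time) "
            "VALUES (?, ?, ?)",
            new_rows,
        )
```

Running the delete and the insert in one transaction means a failed load would not leave the table with the old rows removed and nothing in their place.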

After all resource_access data is loaded, a second process, update_canvas_resource, runs. It updates the resource names based on the data in UDW and removes any resources that are not currently available in Canvas. A file may have been accessed in the past but no longer be available.

Final updates

After everything has run, the unizin_metadata table is updated with information from the UDW, and all of the course.data_last_updated dates are updated.

There is currently a bug where course.data_last_updated is still marked as updated even if an exception was caught during the run.
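
One possible way to avoid this kind of problem is to record the timestamp only for courses whose update finished without an exception. The sketch below is an assumption about how that could look, not the behaviour of cron.py: run_updates, the update_course callback, and the dictionary standing in for course.data_last_updated are all hypothetical, and tracking success per course (rather than per run) is a design choice made for the example.

```python
from datetime import datetime
from typing import Callable, Dict, Iterable, List, Optional


def run_updates(
    course_ids: Iterable[int],
    update_course: Callable[[int], None],
    last_updated: Dict[int, datetime],
    now: Optional[datetime] = None,
) -> List[int]:
    """Update each course; stamp only the ones that succeed.

    Returns the IDs of the courses whose update raised an exception.
    """
    stamp = now if now is not None else datetime.now()
    failed: List[int] = []
    for course_id in course_ids:
        try:
            update_course(course_id)
        except Exception:
            # Leave the old timestamp in place so the next run retries.
            failed.append(course_id)
            continue
        last_updated[course_id] = stamp
    return failed
```

Because a failed course keeps its previous timestamp (or none at all), the next run would still include it when working out how far back to go.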