
Fixes issue #1557 Changes to support switching to bigquery. #1568

Merged
merged 17 commits into tl-its-umich-edu:master from issue_1557 on May 6, 2024

Conversation

@jonespm (Member) commented Apr 1, 2024

The queries currently aren't working, but it feels like this is making progress!

  • Update and test query metadata
  • Update and test query user
  • Update and test query assignment_groups
  • Update and test query assignment
  • Update and test query assignment_weight
  • Update and test query term
  • Update and test query course
  • Update and test query resource
  • Update and test query submission
  • Update and test query 'submission_with_avg_score' (Combined with submission)
  • Clean up the postgres package, the DATA_WAREHOUSE setting, and the code using it in db_util, since it is no longer necessary.
  • See if we need to do anything about timestamp without time zone to keep dates correct
  • Compare old and new databases to ensure accuracy of data
  • Double check that start and end date are being correctly populated into courses
  • Remove utility function and just bind BQ parameters directly
  • Remove left over DATA_WAREHOUSE references

@jonespm jonespm changed the title WIP: Changes to support switching to bigquery. WIP Fixes issue #1557 Changes to support switching to bigquery. Apr 1, 2024
@pushyamig (Contributor)

The queries are in Postgres syntax, so they need some tweaking to BQ format; that's probably the reason they aren't working. I also feel the submission query might perform a bit better with BQ and may not need a temp table (needs some experimenting).

@jonespm (Member, Author) commented Apr 1, 2024

I put this up as a draft right now because @zqian wanted to see it. No need to review any code here, as it's really just a work in progress. I'm planning on keeping the queries in Postgres syntax and calling them using EXTERNAL_QUERY.

I think the reason it's currently not working is that parameter replacement doesn't work for EXTERNAL_QUERY in BigQuery. I've only spent about a day on this.

If we rewrote it against the tables in BigQuery using context_store_entity and context_store_keymap, it would incur a query cost. I'm not sure what the cost would be.
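
A minimal sketch of what I think is going wrong (the connection id, table, and column names here are made up):

from google.cloud import bigquery

client = bigquery.Client()

# BigQuery binds named parameters in the outer query, but the Postgres SQL
# handed to EXTERNAL_QUERY is just an opaque string literal from BigQuery's
# point of view, so @course_id inside it is never substituted.
sql = """
SELECT *
FROM EXTERNAL_QUERY(
  'us.context_store',  -- hypothetical connection id
  'SELECT * FROM course WHERE lms_int_id = @course_id'  -- NOT replaced
)
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("course_id", "INT64", 556393)]
)
rows = client.query(sql, job_config=job_config).result()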

@pushyamig (Contributor)

If we rewrote it against the tables in BigQuery using context_store_entity and context_store_keymap, it would incur a query cost. I'm not sure what the cost would be.

From what I understood, the BQ cost for running context store queries shouldn't be a lot. The reason, as I understood from the Unizin folks, is that the storage for context store data is not as big as the BQ event store, so we don't have the query processing cost we would with the event store.
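
For what it's worth, one way to sanity-check processing cost before running anything is a dry run, which reports the bytes a query would scan without billing anything (the table name below is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT event_time FROM `udp-context.event_store.events`",  # placeholder table
    job_config=job_config,
)
# A dry-run job completes immediately with the scan estimate.
print(f"Would process {job.total_bytes_processed / 1024 ** 4:.3f} TiB")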

I think the reason it's currently not working is that parameter replacement doesn't work for EXTERNAL_QUERY in BigQuery. I've only spent about a day on this.

IMHO, if we are switching to the BQ backend, I feel the queries should also be in BQ format; otherwise we are still dealing with two different systems (Postgres and BQ). But I don't fully understand why parameter replacement doesn't work, so I am just saying what I am thinking. The last time I did an update I was dealing with the Postgres-to-BigQuery connector, and it was not fun updating.

@jonespm (Member, Author) commented Apr 1, 2024

I added some more thoughts on this in Slack. I felt this was possibly the easiest way for a first pass. If we wanted to push this forward and do a rewrite to 100% BigQuery, we might decide to use the Canvas (CD2) tables instead of the context_store. It just feels better to go to one source for everything, even if BigQuery is mostly just a "proxy" for the Postgres data.

@jonespm (Member, Author) commented Apr 1, 2024

Thanks for the feedback here. Because of complications around named parameters and the temporary table, I've decided to take your suggestion and explore rewriting this directly in BigQuery. Currently it runs as far as the assignment table, so it's getting there!

@pushyamig (Contributor)

Yes, all the tables other than submission should be simple: just syntax changes and probably some changes to the way the variables are queried.

With the submission query, we have to check whether we want to go the temp table route in BQ; I feel BQ might handle this well on its own. But the only way to test is in MyLA Beta, since it has a lot of courses.

@jonespm (Member, Author) commented Apr 1, 2024

A temp table doesn't seem to be necessary anymore. We can just create it with a WITH clause (CTE), which seems to act as a fast temporary table now. It was only necessary back in Postgres.
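
Roughly the shape of it, with placeholder table and column names:

# The old Postgres CREATE TEMPORARY TABLE step becomes a CTE at the top of
# the submission query; the CTE acts as the "temporary table" for the rest
# of the query.
SUBMISSION_SQL = """
WITH sub AS (
  SELECT course_id, assignment_id, user_id, score
  FROM `udp-context.context_store.submission`  -- placeholder table path
  WHERE course_id IN UNNEST(@course_ids)
)
SELECT
  sub.*,
  AVG(sub.score) OVER (PARTITION BY sub.assignment_id) AS avg_score
FROM sub
"""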

@jonespm jonespm marked this pull request as ready for review April 12, 2024 16:15
@jonespm jonespm requested review from pushyamig and zqian April 12, 2024 16:15
@jonespm jonespm changed the title WIP Fixes issue #1557 Changes to support switching to bigquery. Fixes issue #1557 Changes to support switching to bigquery. Apr 12, 2024
@pushyamig (Contributor)

I will review this today or tomorrow

dashboard/settings.py (review thread resolved)
config/env_sample.hjson (review thread resolved)
@@ -181,7 +182,7 @@ def get_last_cronjob_run() -> Union[datetime, None]:
         return None
 
 
-def get_canvas_data_date() -> Union[datetime, None]:
+def get_canvas_data_date() -> Union[datetime.datetime, None]:
     if not settings.DATABASES.get('DATA_WAREHOUSE', {}).get('IS_UNIZIN'):
Member

Does this line need an update after the removal of the 'DATA_WAREHOUSE' setting?

Member Author

Yes, this should probably also just be removed, as it's useless code now and will always return the default of {}. I also see this in cron.py.
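
To spell out why the check is dead code once the setting is gone:

from django.conf import settings

# With 'DATA_WAREHOUSE' removed from settings.DATABASES, the first .get()
# always falls back to the default {}, and {}.get('IS_UNIZIN') returns None,
# so `if not ...` is always True and the early return always fires.
assert settings.DATABASES.get('DATA_WAREHOUSE', {}).get('IS_UNIZIN') is None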

logger.info("************ total status=" + status + "\n")
if settings.LRS_IS_BIGQUERY:
    total_tbytes_billed = self.total_bytes_billed / 1024 / 1024 / 1024 / 1024
    # $6.25 per TB as of Feb 2024 https://cloud.google.com/bigquery/pricing
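
For reference, the billing math in the excerpt above works out like this (the byte count is an invented example):

total_bytes_billed = 2_748_779_069_440             # example value from a query job
tib_billed = total_bytes_billed / 1024 ** 4        # bytes -> TiB (what the /1024 chain does)
estimated_cost = tib_billed * 6.25                 # $6.25 per TiB, Feb 2024 on-demand rate
print(f"{tib_billed:.2f} TiB -> ${estimated_cost:.2f}")  # prints: 2.50 TiB -> $15.62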
@zqian (Member) Apr 18, 2024

We need to watch out for the cost. I have 20 MyLA courses on localhost; the initial run with this PR cost $18, and the subsequent run cost $0.10.

@pushyamig (Contributor) Apr 18, 2024

I only have 5 courses and the initial run cost $24; the next run cost $0.10. I would expect it to cost less.

// This is the total row count
resources_access:	2115
resources:	248
submissions:	1759
assignments:	95
user:	100
course:	5

@jonespm (Member, Author) Apr 22, 2024

The cost is likely high because you have courses without a start date, without a term (or with the Default Term/No Term), or with an old term in your database. On an empty DB it tries to pull from the event store from that date onward.

It might also be a bug in this code when retrieving the dates for terms, which wouldn't be good. I was looking at my database for course 556393 and there was no start or end time in the local DB, even though the course has Fall 2022 as its term. I'll have to look into that, so I wouldn't run this until then.

The event store is getting huge now, so doing a full query scans multiple TB of data.

I think we should have a safety case in here so it doesn't retrieve more than X months of data from the event_store unless explicitly directed to (a sketch of such a guard is below). Generally there won't ever be a case for populating old course data. Querying a year is still a high cost, but it would likely be about 1/5 the cost of a full run.
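
A minimal sketch of the kind of guard I mean (the limit and flag names are hypothetical, not part of this PR):

from datetime import datetime, timedelta

MAX_EVENT_STORE_MONTHS = 6  # hypothetical default cap

def clamp_event_store_start(start_date: datetime, force: bool = False) -> datetime:
    """Refuse to scan more than MAX_EVENT_STORE_MONTHS of the event store
    unless explicitly directed to, since a full scan is multiple TB."""
    cutoff = datetime.now() - timedelta(days=30 * MAX_EVENT_STORE_MONTHS)
    if force or start_date >= cutoff:
        return start_date
    return cutoff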

Contributor

I think we should create an issue if that is the case and fix the term start/end time issue.

@pushyamig (Contributor)

I feel this PR should be run on MyLA Beta to see how long it takes, since MyLA Beta has 50+ courses.

@@ -1,7 +1,7 @@
 from datetime import datetime
 import logging
 from collections import namedtuple
-from typing import Any, Dict, List, Union
+from typing import Any, Dict, List, Optional, Union
 from zoneinfo import ZoneInfo
@pushyamig (Contributor) Apr 18, 2024

There is a redundant import (from sqlalchemy.orm import sessionmaker) that could be removed.

def setup_bigquery(self):
    # Instantiates a client
    self.bigquery_client = bigquery.Client()
    # self.bigquery_client = bigquery.Client.from_service_account_json(settings.BIGQUERY_SERVICE_ACCOUNT_JSON)
Contributor

Remove this line, since it is commented-out code.

Member Author

I remember I had this in there because I was thinking of switching the way we configure BigQuery to use an explicit JSON value in the hjson rather than the bq_cred.json file. This is probably a separate issue, but it's something I noticed is supported in the newer version of the library.
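
A rough sketch of that alternative, assuming a hypothetical setting holding the parsed service-account JSON:

from django.conf import settings
from google.cloud import bigquery
from google.oauth2 import service_account

# Build credentials from a dict kept in Django settings (e.g. parsed from the
# hjson config) instead of pointing at a bq_cred.json file on disk.
credentials = service_account.Credentials.from_service_account_info(
    settings.BIGQUERY_SERVICE_ACCOUNT_INFO  # hypothetical setting name
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)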

@pushyamig (Contributor)

The PR seems to be working and fetching results as expected. I have some minor comments, but I feel this PR should be run against the MyLA Beta courses to see how it performs, especially with the submission query.

@jonespm (Member, Author) commented Apr 22, 2024

I feel this PR should be run on MyLA Beta to see how long it takes, since MyLA Beta has 50+ courses.

Do you mean to use the course set that's on the MyLA Beta server? That could be done locally by just exporting the course table. It should run about the same locally or on that server, and it uses the same data set.

@pushyamig (Contributor)

Do you mean to use the course set that's on the MyLA Beta server? That could be done locally by just exporting the course table. It should run about the same locally or on that server, and it uses the same data set.

I don't think I care whether it's done locally or on the Beta server. If you run the courses from MyLA Beta or Test (Test has more courses) and the cron performs the same way as current prod, then that should be good.

@jonespm (Member, Author) commented Apr 23, 2024

Do you mean to use the course set that's on the MyLA Beta server? That could be done locally by just exporting the course table. It should run about the same locally or on that server, and it uses the same data set.

I don't think I care whether it's done locally or on the Beta server. If you run the courses from MyLA Beta or Test (Test has more courses) and the cron performs the same way as current prod, then that should be good.

That makes sense. I was only testing it locally on a few courses, but even that ran significantly faster now. I should be able to try the full list of courses.

@jonespm jonespm requested review from zqian and pushyamig April 23, 2024 17:46
@jonespm (Member, Author) commented Apr 23, 2024

You should only test this with a current-term course for now, as otherwise it will pull down too much data on an empty database.

@pushyamig (Contributor) left a comment

Nice work @jonespm, pulling this off in a short amount of time.

@zqian (Member) left a comment

Please check the Codacy messages. Otherwise, the PR is good to go!

@jonespm (Member, Author) commented Apr 26, 2024

Thanks, I didn't notice the checks; they were valid complaints about unused imports, so they should pass now.

@jonespm (Member, Author) commented Apr 26, 2024

I'm not sure I'll have time to run this on the test environment today, so I'm planning to wait until Monday to merge and test it out then.

@jonespm jonespm merged commit b13d45e into tl-its-umich-edu:master May 6, 2024
1 check passed
@jonespm jonespm deleted the issue_1557 branch May 6, 2024 19:55
@jonespm jonespm linked an issue May 6, 2024 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

change Postgres cron queries to BigQuery queries
3 participants