Bug 1685769 - Add cloud monitoring export script and jobs #1679
Conversation
Yes, that's an existing pattern and this looks like a reasonable approach.

I won't have time for a full review today, so I can take a look early next week. Otherwise, feel free to ask someone else for a full review.

Ok, no rush on this since it's still blocked by the grpc issue and also permissions. Thanks
@@ -331,3 +331,14 @@ bqetl_desktop_platform:
       ]
       retries: 2
       retry_delay: 30m

 bqetl_cloud_monitoring_export:
   schedule_interval: 0 * * * *
I believe this is equivalent to @daily, and we should prefer that shorthand if it's what we use consistently elsewhere in this repo.
This is hourly, but yes, @hourly would work.
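For reference, a minimal sketch of how the DAG entry could use the shorthand (the retry settings here are copied from the neighboring bqetl_desktop_platform entry purely for illustration, not taken from this PR):

```yaml
bqetl_cloud_monitoring_export:
  # "@hourly" is equivalent to "0 * * * *"
  schedule_interval: "@hourly"
  retries: 2
  retry_delay: 30m
```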
    # get existing timestamps in destination table to avoid overlap
    if not overwrite and len(time_series_data) > 0:
        time_series_data = filter_existing_data(
            time_series_data,
            bq_client,
            target_table,
            start_time,
            end_time,
        )
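The de-duplication step above could look roughly like this pure-logic sketch. The real `filter_existing_data` presumably queries the destination table for timestamps between `start_time` and `end_time` using the BigQuery client; here the already-present timestamps are passed in as a set so the filtering itself is visible in isolation, and the `point["timestamp"]` record shape is an assumption, not the script's actual structure:

```python
from datetime import datetime


def filter_existing_data(time_series_data, existing_timestamps):
    """Drop points whose timestamp already exists in the destination table.

    `existing_timestamps` stands in for the result of a BigQuery query over
    [start_time, end_time); the real helper takes a client and a target
    table and runs that query itself.
    """
    return [
        point
        for point in time_series_data
        if point["timestamp"] not in existing_timestamps
    ]
```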
My inclination would be to create these monitoring tables with hourly partitioning, and then have these scripts atomically overwrite the target partition (specifying the destination table as mytable$2021011201, etc.), avoiding the need for filtering logic like this.
We could achieve some simplification by having all this machinery assume that it's operating on one whole hour at a time. I may well be missing some nuance, though, so definitely open to pushback.
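The partition-decorator idea can be sketched as follows: build a `table$YYYYMMDDHH` destination string and write to it with `WRITE_TRUNCATE` so BigQuery atomically replaces exactly that hour's partition. The helper name below is mine, not from the PR:

```python
from datetime import datetime


def hourly_partition_decorator(table: str, hour: datetime) -> str:
    """Return a BigQuery hourly partition decorator, e.g. 'mytable$2021011201'.

    A load or query job whose destination is this decorated table, run with
    write_disposition=WRITE_TRUNCATE, replaces only that hour's partition,
    so no overlap-filtering logic is needed.
    """
    return f"{table}${hour:%Y%m%d%H}"
```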
Going to discuss this with SRE before continuing.

Going with influxdb instead. See https://bugzilla.mozilla.org/show_bug.cgi?id=1619406
https://bugzilla.mozilla.org/show_bug.cgi?id=1685769
This is blocked by the grpc size limit in the monitoring library (googleapis/python-monitoring#62). Some intervals for some metrics can't be exported with the current `google-cloud-monitoring==2.0.0`, and I don't think it's possible to pip install from git with `pip-compile --generate-hashes`.

@jklukas does this structure of a `query.py` in the `project/dataset/table` directory that uses a file in the `bigquery-etl` module fit into the bigquery-etl pattern?
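For concreteness, the layout being asked about would look something like this; the placeholder names are made up for illustration and not taken from the PR:

```
sql/<project>/<dataset>/<table>/
    query.py    # script-based query that imports helpers from the bigquery_etl package
```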