Skip to content

Commit

Permalink
Monitor users by time since last request (#541)
Browse files Browse the repository at this point in the history
* rename existing monitoring task to say what we are actually looking for

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

* new field: fm_user.last_seen_at

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

* changelog entry and update docs to renamed monitor latest-run command.

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

* add some explanation to user model fields

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

* first working version of monitor last-seen

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

* improve docstring

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

* add optional custom message which would be sent to users

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

* add warning about deprecated CLI command name

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

* add short documentation on how to use the monitor command

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

* update CLI changelog as well

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

* review comments on using naive UTC dates and some re-wordings

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

* add missing logic update for server_now()

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

* improve docs from review

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

* update flake8 & black versions (and flake8 repo location)

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

* try to fix ubuntu version

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

* improve --alert-users option by adding the false option'S name; small edits to messages

Signed-off-by: Nicolas Höning <nicolas@seita.nl>

Signed-off-by: Nicolas Höning <nicolas@seita.nl>
  • Loading branch information
nhoening committed Dec 9, 2022
1 parent 8d8461d commit e9d6ef1
Show file tree
Hide file tree
Showing 13 changed files with 310 additions and 24 deletions.
5 changes: 5 additions & 0 deletions documentation/changelog.rst
Expand Up @@ -7,13 +7,15 @@ v0.12.0 | October XX, 2022

.. warning:: Upgrading to this version requires running ``flexmeasures db upgrade`` (you can create a backup first with ``flexmeasures db-ops dump``).


New features
-------------

* Hit the replay button to visually replay what happened, available on the sensor and asset pages [see `PR #463 <http://www.github.com/FlexMeasures/flexmeasures/pull/463>`_]
* Ability to provide your own custom scheduling function [see `PR #505 <http://www.github.com/FlexMeasures/flexmeasures/pull/505>`_]
* Visually distinguish forecasts/schedules (dashed lines) from measurements (solid lines), and expand the tooltip with timing info regarding the forecast/schedule horizon or measurement lag [see `PR #503 <http://www.github.com/FlexMeasures/flexmeasures/pull/503>`_]
* The asset page also allows to show sensor data from other assets that belong to the same account [see `PR #500 <http://www.github.com/FlexMeasures/flexmeasures/pull/500>`_]
* The CLI command ``flexmeasures monitor latest-login`` supports to check if (bot) users who are expected to contact FlexMeasures regularly (e.g. to send data) fail to do so [see `PR #541 <http://www.github.com/FlexMeasures/flexmeasures/pull/541>`_]
* The CLI command ``flexmeasures show beliefs`` supports showing beliefs data in a custom resolution and/or timezone, and also saving the shown beliefs data to a CSV file [see `PR #519 <http://www.github.com/FlexMeasures/flexmeasures/pull/519>`_]
* Improved import of time series data from CSV file: 1) drop duplicate records with warning, 2) allow configuring which column contains explicit recording times for each data point (use case: import forecasts) [see `PR #501 <http://www.github.com/FlexMeasures/flexmeasures/pull/501>`_], 3) localize timezone naive data, 4) support reading in datetime and timedelta values, 5) remove rows with NaN values, and 6) filter by values in specific columns [see `PR #521 <http://www.github.com/FlexMeasures/flexmeasures/pull/521>`_]
* Filter data by source in the API endpoint `/sensors/data` (GET) [see `PR #543 <http://www.github.com/FlexMeasures/flexmeasures/pull/543>`_]
Expand Down Expand Up @@ -41,6 +43,9 @@ Infrastructure / Support
* Clean up table formatting for ``flexmeasures show`` CLI commands [see `PR #540 <http://www.github.com/FlexMeasures/flexmeasures/pull/540>`_]


.. warning:: The CLI command ``flexmeasures monitor tasks`` has been renamed to ``flexmeasures monitor last-run``. The old name will stop working in version 0.13.


v0.11.3 | November 2, 2022
============================

Expand Down
2 changes: 2 additions & 0 deletions documentation/cli/change_log.rst
Expand Up @@ -10,6 +10,8 @@ since v0.12.0 | November XX, 2022
* Add ``--resolution``, ``--timezone`` and ``--to-file`` options to ``flexmeasures show beliefs``, to show beliefs data in a custom resolution and/or timezone, and also to save shown beliefs data to a CSV file.
* Add options to ``flexmeasures add beliefs`` to 1) read CSV data with timezone naive datetimes (use ``--timezone`` to localize the data), 2) read CSV data with datetime/timedelta units (use ``--unit datetime`` or ``--unit timedelta``, 3) remove rows with NaN values, and 4) add filter to read-in data by matching values in specific columns (use ``--filter-column`` and ``--filter-value`` together).
* Fix ``flexmeasures db-ops dump`` and ``flexmeasures db-ops restore`` incorrectly reporting a success when `pg_dump` and `pg_restore` are not installed.
* Add ``flexmeasures monitor last-seen``.
* Rename ``flexmeasures monitor tasks`` to ``flexmeasures monitor last-run``.

since v0.11.0 | August 28, 2022
==============================
Expand Down
34 changes: 30 additions & 4 deletions documentation/host/error-monitoring.rst
Expand Up @@ -6,13 +6,39 @@ Error monitoring
When you run a FlexMeasures server, you want to stay on top of things going wrong. We added two ways of doing that:

- You can connect to Sentry, so that all errors will be sent to your Sentry account. Add the token you got from Sentry in the config setting :ref:`sentry_access_token` and you're up and running!
- Another source of crucial errors are things that did not even happen! For instance, a task to import prices from a day-ahead market, which you depend on later for scheduling. We added a new CLI task called ``flexmeasures monitor tasks``, so you can be alerted when tasks have not successfully run at least so-and-so many minutes ago. The alerts will also come in via Sentry, but you can also send them to email addresses with the config setting :ref:`monitoring_mail_recipients`.
- Another source of crucial errors are things that did not even happen! For instance, a (bot) user who is supposed to send data regularly, fails to connect with FlexMeasures. Or, a task to import prices from a day-ahead market, which you depend on later for scheduling, fails silently.

For illustration of the latter monitoring, here is one example of how we monitor tasks on a server ― the below is run in a cron script every hour and checks if every listed task ran 60, 6 or 1440 minutes ago, respectively:

Let's look at how to monitor for things not happening in more detail:


Monitoring the time users were last seen
-----------------------------------------

The CLI task ``flexmeasures monitor last-seen`` lets you be alerted if a user has contacted your FlexMeasures instance longer ago than you expect. This is most useful for bot users (a.k.a. scripts).

Here is an example for illustration:

.. code-block:: console
flexmeasures monitor last-seen --account-role SubscriberToServiceXYZ --user-role bot --maximum-minutes-since-last-seen 100
As you see, users are filtered by roles. You might need to add roles before this works as you want.

.. todo:: Adding roles and assigning them to users and/or accounts is not supported by the CLI or UI yet (besides ``flexmeasures add account-role``). This is `work in progress <https://github.com/FlexMeasures/flexmeasures/projects/18>`_. Right now, it requires you to add roles on the database level.


Monitoring task runs
---------------------

The CLI task ``flexmeasures monitor latest-run`` lets you be alerted when tasks have not successfully run at least so-and-so many minutes ago.
The alerts will come in via Sentry, but you can also send them to email addresses with the config setting :ref:`monitoring_mail_recipients`.

For illustration, here is one example of how we monitor the latest run times of tasks on a server ― the below is run in a cron script every hour and checks if every listed task ran 60, 6 or 1440 minutes ago, respectively:

.. code-block:: console
flexmeasures monitor tasks --task get_weather_forecasts 60 --task get_recent_meter_data 6  --task import_epex_prices 1440
flexmeasures monitor latest-run --task get_weather_forecasts 60 --task get_recent_meter_data 6  --task import_epex_prices 1440
The first task (get_weather_forecasts) is actually supported within FlexMeasures, while the other two sit in plugins we wrote.

Expand All @@ -24,7 +50,7 @@ This task status monitoring is enabled by decorating the functions behind these
def my_function():
...
Then, FlexMeasures will log if this task ran, and if it succeeded or failed. The result is in the table ``latest_task_runs``, and that's where the ``flexmeasures monitor tasks`` will look.
Then, FlexMeasures will log if this task ran, and if it succeeded or failed. The result is in the table ``latest_task_runs``, and that's where the ``flexmeasures monitor latest-run`` will look.

.. note:: The decorator should be placed right before the function (after all other decorators).

Expand Down
14 changes: 12 additions & 2 deletions flexmeasures/auth/__init__.py
@@ -1,6 +1,6 @@
from flask import Flask
from flask_security import Security, SQLAlchemySessionUserDatastore
from flask_login import user_logged_in
from flask_login import user_logged_in, current_user
from werkzeug.exceptions import Forbidden, Unauthorized

from flexmeasures.data import db
Expand All @@ -21,7 +21,12 @@ def register_at(app: Flask):
unauthorized_handler,
unauthorized_handler_e,
) # noqa: F401
from flexmeasures.data.models.user import User, Role, remember_login # noqa: F401
from flexmeasures.data.models.user import (
User,
Role,
remember_login,
remember_last_seen,
) # noqa: F401

# Setup Flask-Security-Too for user authentication & authorization
user_datastore = SQLAlchemySessionUserDatastore(db.session, User, Role)
Expand All @@ -39,3 +44,8 @@ def register_at(app: Flask):

# add our custom handler for a user login event
user_logged_in.connect(remember_login)

# also store when the last contact was
@app.before_request
def record_last_seen():
remember_last_seen(current_user)
200 changes: 192 additions & 8 deletions flexmeasures/cli/monitor.py
@@ -1,5 +1,5 @@
from datetime import timedelta
from typing import Optional
from datetime import datetime, timedelta
from typing import Optional, List

import click
from flask import current_app as app
Expand All @@ -11,6 +11,7 @@
)

from flexmeasures.data.models.task_runs import LatestTaskRun
from flexmeasures.data.models.user import User
from flexmeasures.utils.time_utils import server_now


Expand All @@ -19,7 +20,7 @@ def fm_monitor():
"""FlexMeasures: Monitor tasks."""


def send_monitoring_alert(
def send_task_monitoring_alert(
task_name: str,
msg: str,
latest_run: Optional[LatestTaskRun] = None,
Expand Down Expand Up @@ -54,7 +55,7 @@ def send_monitoring_alert(
app.logger.error(f"{msg} {latest_run_txt} NOTE: {custom_msg}")


@fm_monitor.command("tasks") # TODO: a better name would be "latest-run"
@fm_monitor.command("tasks") # TODO: deprecate, this is the old name
@with_appcontext
@click.option(
"--task",
Expand All @@ -69,18 +70,48 @@ def send_monitoring_alert(
default="",
help="Add this message to the monitoring alert (if one is sent).",
)
def monitor_tasks(task, custom_message):
@click.pass_context
def monitor_task(ctx, task, custom_message):
"""
DEPRECATED, use `latest-run`.
Check if the given task's last successful execution happened less than the allowed time ago.
If not, alert someone, via email or sentry.
"""
click.echo(
"This function has been renamed (and is now deprecated). Please use flexmeasures monitor latest-run."
)
ctx.forward(monitor_latest_run)


@fm_monitor.command("latest-run")
@with_appcontext
@click.option(
"--task",
type=(str, int),
multiple=True,
required=True,
help="The name of the task and the maximal allowed minutes between successful runs. Use multiple times if needed.",
)
@click.option(
"--custom-message",
type=str,
default="",
help="Add this message to the monitoring alert (if one is sent).",
)
def monitor_latest_run(task, custom_message):
"""
Check if the given task's last successful execution happened less than the allowed time ago.
Tasks are CLI commands with the @task_with_status_report decorator.
If not, alert someone, via email or sentry.
"""
for t in task:
task_name = t[0]
app.logger.info(f"Checking latest run of task {task_name} ...")
latest_run: LatestTaskRun = LatestTaskRun.query.get(task_name)
if latest_run is None:
msg = f"Task {task_name} has no last run and thus cannot be monitored. Is it configured properly?"
send_monitoring_alert(task_name, msg, custom_msg=custom_message)
send_task_monitoring_alert(task_name, msg, custom_msg=custom_message)
return
now = server_now()
acceptable_interval = timedelta(minutes=t[1])
Expand All @@ -89,18 +120,171 @@ def monitor_tasks(task, custom_message):
# latest run time is okay, let's check the status
if latest_run.status is False:
msg = f"A failure has been reported on task {task_name}."
send_monitoring_alert(
send_task_monitoring_alert(
task_name, msg, latest_run=latest_run, custom_msg=custom_message
)
else:
msg = (
f"Task {task_name}'s latest run time is outside of the acceptable range"
f" ({acceptable_interval})."
)
send_monitoring_alert(
send_task_monitoring_alert(
task_name, msg, latest_run=latest_run, custom_msg=custom_message
)
app.logger.info("Done checking task runs ...")


def send_lastseen_monitoring_alert(
users: List[User],
last_seen_delta: timedelta,
alerted_users: bool,
account_role: Optional[str] = None,
user_role: Optional[str] = None,
):
"""
Tell monitoring recipients and Sentry about user(s) we haven't seen in a while.
"""
user_info_list = [
f"{user.username} (last contact was {user.last_seen_at})" for user in users
]

msg = (
f"The following user(s) have not contacted this FlexMeasures server for more"
f" than {last_seen_delta}, even though we expect they would have:\n"
)
for user_info in user_info_list:
msg += f"\n- {user_info}"

# Sentry
set_sentry_context(
"last_seen_context",
{
"delta": last_seen_delta,
"alerted_users": alerted_users,
"account_role": account_role,
"user_role": user_role,
},
)
capture_message_for_sentry(msg)

# Email
msg += "\n"
if account_role:
msg += f"\nThis alert concerns users whose accounts have the role '{account_role}'."
if user_role:
msg += f"\nThis alert concerns users who have the role '{user_role}'."
if alerted_users:
msg += "\n\nThe user(s) has/have been notified by email, as well."
else:
msg += (
"\n\nThe user(s) has/have not been notified (--alert-users was not used)."
)
email_recipients = app.config.get("FLEXMEASURES_MONITORING_MAIL_RECIPIENTS", [])
if len(email_recipients) > 0:
email = Message(
subject="Last contact by user(s) too long ago", bcc=email_recipients
)
email.body = msg
app.mail.send(email)

app.logger.error(msg)


@fm_monitor.command("last-seen")
@with_appcontext
@click.option(
"--maximum-minutes-since-last-seen",
type=int,
required=True,
help="Maximal number of minutes since last request.",
)
@click.option(
"--alert-users/--do-not-alert-users",
type=bool,
default=False,
help="If True, also send an email to the user. Defaults to False, as these users are often bots.",
)
@click.option(
"--account-role",
type=str,
help="The name of an account role to filter for.",
)
@click.option(
"--user-role",
type=str,
help="The name of a user role to filter for.",
)
@click.option(
"--custom-user-message",
type=str,
default="",
help="Add this message to the monitoring alert email to users (if one is sent).",
)
def monitor_last_seen(
maximum_minutes_since_last_seen: int,
alert_users: bool = False,
account_role: Optional[str] = None,
user_role: Optional[str] = None,
custom_user_message: Optional[str] = None,
):
"""
Check if given users last contact (via a request) happened less than the allowed time ago.
Helpful for user accounts that are expected to contact FlexMeasures regularly (in an automated fashion).
If the last contact was too long ago, we send alerts via Sentry, as well as emails to monitoring mail recipients.
The user can be informed, as well.
The set of users can be narrowed down by roles.
"""
last_seen_delta = timedelta(minutes=maximum_minutes_since_last_seen)

# find users we haven't seen in the given time window
users: List[User] = User.query.filter(
User.last_seen_at < datetime.utcnow() - last_seen_delta
).all()

# role filters
if account_role is not None:
users = [user for user in users if user.account.has_role(account_role)]
if user_role is not None:
users = [user for user in users if user.has_role(user_role)]

if not users:
click.echo(
f"All good ― no users were found with relevant criteria and last_seen_at longer than {maximum_minutes_since_last_seen} minutes ago."
)
return

# inform users & monitoring recipients
if alert_users:
for user in users:
msg = (
f"We noticed that user {user.username} has not been in contact with this FlexMeasures server"
f" for at least {maximum_minutes_since_last_seen} minutes (last contact was {user.last_seen_at})."
)
if custom_user_message:
msg += f"\n\n{custom_user_message}"
else:
msg += (
"\nBy our own accounting, this should usually not happen."
"\n\nMaybe you want to check if your local code is still working well."
)
email = Message(
subject=f"Last contact by user {user.username} has been too long ago",
recipients=[user.email],
)
email.body = msg
app.mail.send(email)
else:
click.echo("Users are not being alerted.")

send_lastseen_monitoring_alert(
users,
last_seen_delta,
alerted_users=alert_users,
account_role=account_role,
user_role=user_role,
)


app.cli.add_command(fm_monitor)

0 comments on commit e9d6ef1

Please sign in to comment.