Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PoC: simple, more flexible log aggregation API for Archivers/RecordBuilders #20798

Draft
wants to merge 154 commits into
base: 5.x-dev
Choose a base branch
from

Conversation

diosmosis
Copy link
Member

@diosmosis diosmosis commented May 29, 2023

Description:

This PR is a follow up to the RecordBuilder change. Requested by @tsteur it is an exploration into providing a simpler, more understandable API for performing log aggregation during archiving. It should not be merged as is (nor is it actually merge-able).

Description of the API vs. existing code

The current API is essentially manually constructing SQL and building DataTables from the result set. Developers must manually provide selects, from metadata, where conditions, group bys, etc. Even for simpler cases, this can be confusing to read:

// from https://github.com/matomo-org/matomo/blob/5.x-dev/plugins/UserId/RecordBuilders/Users.php#L37-L93

$record = new DataTable();

$visitorIdsUserIdsMap = [];

$userIdFieldName = Archiver::USER_ID_FIELD;
$visitorIdFieldName = Archiver::VISITOR_ID_FIELD;

$rankingQueryLimit = $this->getRankingQueryLimit();

$rankingQuery = false;
if ($rankingQueryLimit > 0) {
    $rankingQuery = new RankingQuery($rankingQueryLimit);
    $rankingQuery->addLabelColumn($userIdFieldName);
    $rankingQuery->addLabelColumn($visitorIdFieldName);
}

/** @var \Zend_Db_Statement $query */
$query = $archiveProcessor->getLogAggregator()->queryVisitsByDimension(
    array($userIdFieldName),
    "log_visit.$userIdFieldName IS NOT NULL AND log_visit.$userIdFieldName != ''",
    array("LOWER(HEX($visitorIdFieldName)) as $visitorIdFieldName"),
    false,
    $rankingQuery,
    $userIdFieldName . ' ASC'
);

$rowsCount = 0;
foreach ($query as $row) {
    $rowsCount++;

    $columns = [
        Metrics::INDEX_NB_UNIQ_VISITORS => $row[Metrics::INDEX_NB_UNIQ_VISITORS],
        Metrics::INDEX_NB_VISITS => $row[Metrics::INDEX_NB_VISITS],
        Metrics::INDEX_NB_ACTIONS => $row[Metrics::INDEX_NB_ACTIONS],
        Metrics::INDEX_NB_USERS => $row[Metrics::INDEX_NB_USERS],
        Metrics::INDEX_MAX_ACTIONS => $row[Metrics::INDEX_MAX_ACTIONS],
        Metrics::INDEX_SUM_VISIT_LENGTH => $row[Metrics::INDEX_SUM_VISIT_LENGTH],
        Metrics::INDEX_BOUNCE_COUNT => $row[Metrics::INDEX_BOUNCE_COUNT],
        Metrics::INDEX_NB_VISITS_CONVERTED => $row[Metrics::INDEX_NB_VISITS_CONVERTED],
    ];

    $record->sumRowWithLabel($row[$userIdFieldName], $columns);

    // Remember visitor ID per user. We use it to fill metadata before actual inserting rows to DB.
    if (!empty($row[Archiver::USER_ID_FIELD])
        && !empty($row[Archiver::VISITOR_ID_FIELD])
    ) {
        $visitorIdsUserIdsMap[$row[Archiver::USER_ID_FIELD]] = $row[Archiver::VISITOR_ID_FIELD];
    }
}

$this->setVisitorIds($record, $visitorIdsUserIdsMap);

return [Archiver::USERID_ARCHIVE_RECORD => $record];
}

Log aggregation is generally never simple, and usually involves the use of other complex bits of code like RankingQuery (which truncates an entire result set in mysql to a predefined number of rows, the row limit usually numbering in the thousands).

The current attempt at dealing w/ this complexity is the monolithic LogAggregator class which contains several methods for common log aggregation use cases, like querying the visits table for general visits metrics. For core plugins these methods can be used in many places, but for anything more interesting (like every premium plugin), they are inadequate. In fact, the queries for premium plugins are often far more complicated, and thus even harder to understand.

The API introduced in this PR surrounds a new class: LogAggregationQuery. Dimensions and metrics are added one by one and the SQL query is then generated from that information. (Note: It's possible to add from entries w/ specific joins, but the aim is to eventually avoid having to think in relational terms in the API for this class.) Queries return generators that iterate over every row and auto-close the cursor, and can be supplied to new DataTable methods that automatically iterate over them and sum the data. Example usage:

$record = new DataTable();
$visitorIdsUserIdsMap = [];

$visitorIdFieldName = Archiver::VISITOR_ID_FIELD;
$userIdFieldName = Archiver::USER_ID_FIELD;
$rankingQueryLimit = $this->getRankingQueryLimit();

$query = $archiveProcessor->newLogQuery('log_visit');
$query->addDimensionSql('label', $userIdFieldName);
// add as metric so we don't group by. given the association between userid and visitorid, it's probably fine (and better to add
// it as a dimension, but for this example, we are matching the previous code exactly)
$query->addMetricSql($visitorIdFieldName, "LOWER(HEX($visitorIdFieldName))");
// these metrics are common and would normally added via a special method like `$query->addGeneralVisitMetrics()`. added here manually
// so you can see what the code might look like.
$query->addMetricSql(Metrics::INDEX_NB_UNIQ_VISITORS, "count(distinct log_visit.idvisitor)");
$query->addMetricSql(Metrics::INDEX_NB_VISITS, "count(*)");
$query->addMetricSql(Metrics::INDEX_NB_ACTIONS, "sum(log_visit.visit_total_actions)");
$query->addMetricSql(Metrics::INDEX_NB_USERS, "count(distinct log_visit.user_id)");
$query->addMetricSql(Metrics::INDEX_MAX_ACTIONS, "max(log_visit.visit_total_actions)");
$query->addMetricSql(Metrics::INDEX_SUM_VISIT_LENGTH, "sum(log_visit.visit_total_time)");
$query->addMetricSql(Metrics::INDEX_BOUNCE_COUNT, "sum(case log_visit.visit_total_actions when 1 then 1 when 0 then 1 else 0 end)");
$query->addMetricSql(Metrics::INDEX_NB_VISITS_CONVERTED, "sum(case log_visit.visit_goal_converted when 1 then 1 else 0 end)");
$query->addWhere("log_visit.$userIdFieldName IS NOT NULL AND log_visit.$userIdFieldName != ''");
$query->orderBy($userIdFieldName, 'ASC');

if ($rankingQueryLimit > 0) {
    // currently to use ranking query, the metric aggregations must be specified. it should be possible to get this through metadata
    // in the future.
    $query->useRankingQuery($rankingQueryLimit, [
        Metrics::INDEX_NB_UNIQ_VISITORS => false,
        Metrics::INDEX_NB_VISITS => 'sum',
        Metrics::INDEX_NB_ACTIONS => 'sum',
        Metrics::INDEX_NB_USERS => false,
        Metrics::INDEX_MAX_ACTIONS => 'max',
        Metrics::INDEX_SUM_VISIT_LENGTH => 'sum',
        Metrics::INDEX_BOUNCE_COUNT => 'sum',
        Metrics::INDEX_NB_VISITS_CONVERTED => 'sum',
    ]);
    
    // since we add $visitorIdFieldName as metric instead of a dimension above, we need to let RankingQuery know it should be treated
    // like a dimension. wouldn't be needed if we had used addDimensionSql.
    $query->getRankingQuery()->addLabelColumn($visitorIdFieldName);
}

// executed for each row when we iterate over the result set. normally it would be used to transform the row into rows that would
// appear in the record, if the format is different from what is queried. here we're using it to keep track of some data.
$query->addRowTransform(function ($row) use (&$visitorIdsUserIdsMap) {
    // Remember visitor ID per user. We use it to fill metadata before actual inserting rows to DB.
    if (!empty($row[Archiver::USER_ID_FIELD])
        && !empty($row[Archiver::VISITOR_ID_FIELD])
    ) {
        $visitorIdsUserIdsMap[$row[Archiver::USER_ID_FIELD]] = $row[Archiver::VISITOR_ID_FIELD];
    }

    return $row;
});

$resultSet = $query->execute();
$record->sumSimpleArrayTraversable($resultSet); // directly add each result set row to a record

$this->setVisitorIds($record, $visitorIdsUserIdsMap);

return [Archiver::USERID_ARCHIVE_RECORD => $record];

Proof of concept uses can be found here for the Goals plugin and as PRs in the MediaAnalytics and CrashAnalytics plugins (including the use of RankingQuery).

Potential future directions if the API is used

This change is not an end unto itself, but a stepping stone to other potentially more valuable changes:

Remove the LogAggregator class

With LogAggregationQuery class, the (more) hardcoded queries made in LogAggregator are no longer necessary. The class could be deprecated and removed.

Using only ArchivedMetric/Dimension classes to build queries

The original aim of this PoC was to only allow metadata classes like ArchivedMetric/Dimension in methods like LogAggregation::addDimension. Unfortunately, very few plugins use them and use ArchivedMetric in a way that would be optimal for generic log aggregation queries, so the PoC also includes (and depends on) methods that specify SQL specifically.

A future change, however, could introduce more ArchivedMetric/Dimension classes (along with fixing any inefficiencies in existing uses). This would further decouple LogAggregationQuery with relational database concepts, relying solely on concepts in the higher domain of analytics.

This is another stepping stone to what could be a very valuable end result:

Potential path for supporting different backends (for log aggregation at least)

With the above change, and perhaps some other changes added to LogTable entities, log aggregation can be done within plugins without having to write SQL. It follows that non-mysql or other non-relational backends could be supported by providing different LogAggregationQuery implementations. While this would not be all the work required for Matomo supporting other backends, it's a significant step in that direction.

Debugging commands that can help visualize the structure and flow of archiving

Since SQL queries with LogAggregationQuery are created from chunks, like dimension sql, metric sql, etc, it could be possible to create some debugging commands to help developers see how log aggregation for a specific plugin works. The basic process in a RecordBuilder is:

query log tables and get result set -> convert result set row to record row if needed -> add row to datatable result

A simple visualization of how the query result set looks and how the record row looks can be automatically generated from a LogAggregationQuery instance. Which could aid in understanding a plugin's archiving code for someone who doesn't yet understand.

Review

…ta is present within them. if some are not present, only archive those in a new partial archive.
…meric archive table, otherwise blob archive table
@diosmosis diosmosis changed the base branch from goals-no-data-array to custom-dimensions-record-builder June 4, 2023 10:58
@github-actions
Copy link
Contributor

If you don't want this PR to be closed automatically in 28 days then you need to assign the label 'Do not close'.

@github-actions github-actions bot added the Stale The label used by the Close Stale Issues action label Jun 19, 2023
Base automatically changed from custom-dimensions-record-builder to 5.x-dev June 26, 2023 08:25
@diosmosis diosmosis force-pushed the poc-simpler-log-aggregation branch from a2ce9c7 to 7bbdbff Compare July 16, 2023 19:56
@diosmosis diosmosis force-pushed the poc-simpler-log-aggregation branch from 7bbdbff to 3272571 Compare July 16, 2023 20:08
@diosmosis
Copy link
Member Author

FYI @tsteur this can be looked at now

@github-actions github-actions bot removed the Stale The label used by the Close Stale Issues action label Aug 14, 2023
@michalkleiner
Copy link
Contributor

@diosmosis can you please rebase/merge in latest 5.x and look into the failing phpcs test? Thanks

@diosmosis
Copy link
Member Author

@michalkleiner this shouldn't be merged as is, I assume someone will have to make a decision whether it should go into core first. Hence no "Needs Review" label.

@github-actions
Copy link
Contributor

github-actions bot commented Sep 6, 2023

If you don't want this PR to be closed automatically in 28 days then you need to assign the label 'Do not close'.

@github-actions github-actions bot added the Stale The label used by the Close Stale Issues action label Sep 6, 2023
@diosmosis diosmosis added the Do not close PRs with this label won't be marked as stale by the Close Stale Issues action label Sep 6, 2023
@github-actions github-actions bot removed the Stale The label used by the Close Stale Issues action label Sep 7, 2023
Copy link
Contributor

@bx80 bx80 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the general approach here of decoupling the aggregation from the query elements. More abstraction during the query build will allow better optimizations in the future. For example, window function support for ranking queries which would be easier to delegate to the DB adapter if less of the query is being written directly in code.

I can't see any issues with the code or structure here and since it's a PoC I won't get too detailed.

Perhaps it could be a good idea to test the new LogAggregationQuery class with some of the more complicated existing archiving queries such as those in the Funnels plugin or pageview conversions to confirm that edge cases like sub-queries can be handled?

I'm not sure what the next step is for this PR, but if there's any specific feedback that you'd find useful then let me know and I can raise it with the rest of the team, otherwise I look forward to seeing the final version of this merged once it's ready 👍

@diosmosis
Copy link
Member Author

@bx80 fyi, not sure if this was communicated, but there are related PoC PRs in Crash Analytics and Media Analytics if you want to see more complicated uses. Didn't try any other plugins.

@tsteur will come up with plans for this PR, if any.

@bx80
Copy link
Contributor

bx80 commented Sep 21, 2023

Thanks @diosmosis, I've taken a look at those companion PRs and judging by the complexity of those queries it looks like there should be more than enough flexibility to handle pageview conversions and funnels too 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Do not close PRs with this label won't be marked as stale by the Close Stale Issues action
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants