Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve distinct_retract aggregation logic. #519

Open
yokofly opened this issue Jan 24, 2024 · 0 comments · May be fixed by #608
Open

improve distinct_retract aggregation logic. #519

yokofly opened this issue Jan 24, 2024 · 0 comments · May be fixed by #608
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@yokofly
Copy link
Contributor

yokofly commented Jan 24, 2024

Describe what enhancement you'd like to have
distinct_retract is a connector, distinct means filter the duplicate data to the next aggregation operator, and retract is the internal impl about CDC with changelog. (we use virtual column tp_delta, +1 means data in, and -1 means retract.)

short: this issue wanna tackle 2 things:

  1. replace the map->hashmap, this file shall be found in https://github.com/timeplus-io/proton/blob/develop/src/AggregateFunctions/Streaming/CountedValueHashMap.h
  2. add extra logic for numeric data.

detailed

Since we have introduced counted_value_hash_map in #484, the distinct still uses an ordered map, actually, the distinct connector does not care about the order of the element.
the related code can be found at

struct AggregateFunctionDistinctRetractGenericData
{
/// proton: starts.
/// When creating, the hash table must be small.
using Map = CountedValueMap<StringRef, false>; /// map<key(without delta_col), uint32>
using Self = AggregateFunctionDistinctRetractGenericData;
Map map;

another thing we wanna improve/refactor is currently we use the stringRef style for all the input, recalling another distinct implementation, we notice the function has to do dispatch with input data type. that's say, current distinct_retract needs extra serialize/deser for numeric data, but, we are input the numeric data, we do not need such string serialization overhead.

AggregateFunctionPtr transformAggregateFunction(
const AggregateFunctionPtr & nested_function,
const AggregateFunctionProperties &,
const DataTypes & arguments,
const Array & params) const override
{
AggregateFunctionPtr res;
if (arguments.size() == 1)
{
res.reset(createWithNumericType<AggregateFunctionDistinct, AggregateFunctionDistinctSingleNumericData>(
*arguments[0], nested_function, arguments, params));
if (res)
return res;
if (arguments[0]->isValueUnambiguouslyRepresentedInContiguousMemoryRegion())
return std::make_shared<AggregateFunctionDistinct<AggregateFunctionDistinctSingleGenericData<true>>>(
nested_function, arguments, params);
else
return std::make_shared<AggregateFunctionDistinct<AggregateFunctionDistinctSingleGenericData<false>>>(
nested_function, arguments, params);
}
return std::make_shared<AggregateFunctionDistinct<AggregateFunctionDistinctMultipleGenericData>>(
nested_function, arguments, params);
}
};

upstream #426

@yokofly yokofly added enhancement New feature or request good first issue Good for newcomers labels Jan 24, 2024
@yokofly yokofly removed the good first issue Good for newcomers label Feb 1, 2024
@lizhou1111 lizhou1111 added the good first issue Good for newcomers label Mar 12, 2024
@lizhou1111 lizhou1111 linked a pull request Mar 19, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants