-
Notifications
You must be signed in to change notification settings - Fork 224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(core) fix count_values query. #1545
Conversation
The column for count_values aggregator is StringColumn.
@@ -130,7 +130,7 @@ object RowAggregator { | |||
+ s"or calculate the rate and histogram_quantile before applying the aggregation. " | |||
+ s"If you have a genuine use case for this query, please get in touch.") | |||
} | |||
} else if (valColType == ColumnType.DoubleColumn) { | |||
} else if (valColType == ColumnType.DoubleColumn || valColType == ColumnType.StringColumn) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need else if
? Original intent of the PR is to handle histograms which is already done just let everything else follow the same path as before. By restricting the types we don't know additional regression we may see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point. The patch removed the default case so that compile would fail if missing any cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vishramachandran , requesting your feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current patch doesn't work for topk/bottomk
topk(2, foo{..})
fails with The TopK operation is not supported for StringColumn
thus I strongly feel as I originally recommended, minimize the blast radius and restrict your changes to only histograms, which was the original intention. Everything else should take the same code path as before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right.
I think the problem may be caused by this function
def valueColumnType(schema: ResultSchema): ColumnType = { require(schema.isTimeSeries, s"Schema $schema is not time series based, cannot continue query") require(schema.columns.size >= 2, s"Schema $schema has less than 2 columns, cannot continue query") schema.columns(1).colType }
For schema more than 2 columns schema.columns(1).colType may be always be the valueColumnType
case BottomK => new TopBottomKRowAggregator(params(0).asInstanceOf[Double].toInt, true) | ||
case Quantile => new QuantileRowAggregator(params(0).asInstanceOf[Double]) | ||
case Stdvar if valColType == ColumnType.DoubleColumn => StdvarRowAggregator | ||
case Stddev if valColType == ColumnType.DoubleColumn => StddevRowAggregator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vishramachandran I see Group
, Avg
, StdVar
and StdDev
are now restricted to column type double only. Would that be ok? I mentioned a couple of times here that we should try and minimize the blast radius by just fixing the count
and sum
for histograms only (the intent behind the original PR) and let everything else continue to work the way it is. We already found a regression in count and luckily I found regression in topk/bottomk during my local testing. I would like to get your views and I'll be holding off my approval for the PR till then.Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amolnayak311 Thanks for the help in identifying the regression !
I agree on containing the blast radius and minimizing regressions. At the same time we need to send out a clearer error message that helps the user esp if the query is not meaningful. I don't see us supporting some of the aggregations (like min, max, stddev, stdvar etc) on histograms. Do you see that possibility ?
I think we are seeing this regression because of absence of unit tests esp for this function. Improving unit tests by running this function through various types could make it more solid, but it comes at the cost of increasing this PR scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed over the call, this looks good now.
The column for count_values aggregator is StringColumn.
Pull Request checklist