
PoC Time series #27

Closed
wants to merge 14 commits

Conversation

codebien
Contributor

@codebien codebien commented Jun 13, 2022

The code is a PoC based on docs/metrics-data-model.md; more research and polish will be applied once consensus on the design document is reached.

It contains a very basic implementation of tdunning's TDigest based on a 3rd-party library. The math from the library doesn't work correctly, so I will replace it soon.

@codebien codebien requested a review from na-- June 13, 2022 15:54
@codebien codebien self-assigned this Jun 13, 2022
type Metric struct {
Name string `json:"name"`
Type MetricType `json:"type"`
Contains ValueType `json:"contains"`
Contributor Author

@codebien codebien Jun 14, 2022

Suggested change
Contains ValueType `json:"contains"`

Do we need ValueType?
It seems to be used only by the summary. Should we move the mapping directly there? If not, then I think we should extend the current JS API to support all the possible values (currently, it supports only Time).

Member

🤔 Good point, but I am not sure how we can remove it. The other state in Metric will be moved to the MetricsEngine, since it is the only thing that works with that state, but the value type is passed from inside the JS scripts and we have to save it somewhere.

Also, what do you mean by "all the possible values"?

Contributor Author

I agree, and even if we eventually refactor, it's better to keep any such discussion in a dedicated topic. I think it's fine how it is for now.

Also, what do you mean by "all the possible values"?

In the current Trend API, you can't define a custom metric with the value type set to Data; only Time is supported, if I'm reading the current API right.

Name string `json:"name"`
Type MetricType `json:"type"`
Contains ValueType `json:"contains"`
}
Contributor Author

Suggested change
}
Observed bool
}

Do we need it with the time series model? Potentially, when the thresholds are parsed we can just initialize a time series, and it will be available regardless of whether any Sample has been generated.

Member

Thresholds (and the end-of-test summary) can span multiple time series, see my comment at the top. For example, you might have a threshold on http_req_duration{status:200}, while your actual time series are http_req_duration{status:200,url:http://foo,...},http_req_duration{status:200,url:http://bar,...},http_req_duration{status:200,url:http://baz,...}.

We might not need Observed, I don't know, but it doesn't really matter, it will be something local for the MetricsEngine.

Contributor Author

Thresholds (and the end-of-test summary) can span multiple time series, see my comment at the top. For example, you might have a threshold on http_req_duration{status:200}, while your actual time series are http_req_duration{status:200,url:http://foo,...},http_req_duration{status:200,url:http://bar,...},http_req_duration{status:200,url:http://baz,...}.

My idea was that the summary and the thresholds should use centralized per-time-series sinks and aggregate them when required. This should be fine for the summary, where we print it just at the end, but in the case of thresholds it would be heavy if we did it for each generated sample.

What is your idea about threshold integration? Should each Threshold keep the Sink autonomously and make the evaluation on top of it?

We might not need Observed, I don't know, but it doesn't really matter, it will be something local for the MetricsEngine.

Ok, so I'm not including it in this document.

Member

@na-- na-- Jun 15, 2022

What is your idea about threshold integration? Should each Threshold keep the Sink autonomously and make the evaluation on top of it?

Yeah, this is one of the things that I meant with my earlier comment that time series are orthogonal (i.e. independent, unrelated, etc.) to aggregation. Determining the "group" ("time series") of a metric Sample based on its metric name and tag values is one thing. Aggregating groups of metric samples is quite another; the group won't always be related to the time series.

In the Prometheus outputs (whether remote-write or not), the mapping will probably be 1:1 - you'd have a Sink for every time series, and every metric Sample will be sent to exactly one Sink, right?

However, in the thresholds and the end-of-test summary, the mapping would be n:m. Each threshold definition can span multiple time-series, since it is defined only by partial (non-exhaustive) constraints. The example I gave with http_req_duration{status:200} can match a request to any URL and method, which may all be in different time series. But you can also have another threshold in the same test that also matches some of the same Samples, e.g. http_req_duration{method:GET}, and a third threshold http_req_duration{status:200,method:GET} with both constraints 😅 So one metric Sample will potentially need to be sent to multiple Sinks. And you would not necessarily be able to combine individual per-time-series Sinks back in a single per-threshold sink. In any case, even if you could, it would be quite inefficient to do so every few seconds... 😞 Not to speak about potential future improvements to thresholds which are also going to make things more complicated, e.g. grafana/k6#2379 😅

That said, even if we can't necessarily reuse or strongly tie together time series and metric sample aggregation/sinks, there are still some benefits 🎉 For example:

  • We probably could reuse the same Sink implementations here and in other similar outputs and in the MetricsEngine (thresholds / end-of-test summary). The grouping will just need to be done differently in both places and decoupled from the aggregation 🤷‍♂️ In my distributed execution PoC ([WIP] PoC for native distributed execution k6#2438) I used circonusllhist just for expedience, and we still need to do more research and testing about the various tradeoffs of such algorithms (Use HDR histograms for calculating percentiles in thresholds and summary stats k6#763 (comment)), but we probably won't need 2 different ones 🤷‍♂️ 🤞 🙏 😅
  • If every metric Sample has a link to its time series, this part of the MetricsEngine can be implemented much more efficiently 🎉 Right now it does nested string comparison for every SubMetric definition and on every Sample... 😞 With time series, the MetricsEngine can create a mapping for every new time series it sees to the matching sub-metrics and reuse that mapping when new samples with the same sub-metric arrive.
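
A minimal sketch of the caching idea from the last bullet (all names here are hypothetical stand-ins, not the actual MetricsEngine code): the expensive tag matching runs only the first time a given time series is seen, and the cached sink list is reused for every later Sample from that series.

type Sink interface {
    Add(value float64)
}

// subMetric is a simplified stand-in for a sub-metric/threshold definition,
// e.g. http_req_duration{status:200} -> constraints = {"status": "200"}.
type subMetric struct {
    constraints map[string]string
    sink        Sink
}

type ingester struct {
    subMetrics  []subMetric
    seriesSinks map[uint64][]Sink // time series ID -> matching sinks, filled lazily
}

func (in *ingester) sinksFor(seriesID uint64, seriesTags map[string]string) []Sink {
    if sinks, ok := in.seriesSinks[seriesID]; ok {
        return sinks // cached: no tag comparisons for already-seen time series
    }
    var sinks []Sink
    for _, sm := range in.subMetrics {
        matches := true
        for k, v := range sm.constraints { // O(m*n) matching, but only once per series
            if seriesTags[k] != v {
                matches = false
                break
            }
        }
        if matches {
            sinks = append(sinks, sm.sink)
        }
    }
    in.seriesSinks[seriesID] = sinks
    return sinks
}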


## Known issues

* Name and URL tags: `k6` tags HTTP requests with their URL. This will create a high-cardinality issue for the time series data model.
Contributor Author

The simplest and most backwards-compatible solution to me seems to be disabling the url tag by default.

Contributor Author

@na-- or even better, once we have the metadata concept we can just move it there.

Member

Yes, that's what I'd like, though there will be some... complications 😞 The iter, vu (and maybe ip) tags should probably also be metadata, not tags, but it's a bit too early to see how all of this will work out.

Member

@na-- na-- left a comment

These are mostly my comments on the data model document. I still haven't been able to actually read through and grok the code, sorry. I decided to submit these comments first so we have something to talk about until I review the rest.

@@ -0,0 +1,177 @@
# k6 Time Series

We would migrate to a time series data model to get the benefits of aggregated data: samples are allocated only once and the aggregated time series value (or values, e.g. in the case of Trend) is passed around. The Outputs will switch from fetching the buffered samples to fetching a snapshot of the time series (the same concept adopted by Prometheus' scraping operation).
Member

I am not sure I understand exactly what you mean here, but if I do, I am not sure I agree. I think you might be conflating two different concepts that shouldn't necessarily go together.

To me, grouping metric samples with the same metric name and tags (i.e. a "time series") is somewhat orthogonal to actually aggregating the data of every metric sample (i.e. measurement) that is part of the time series. Grouping measurements in time series is somewhat of a prerequisite to aggregating them efficiently, but they are two different things. And we mostly want the first everywhere in the k6 core, so we can implement the second in some places and in potentially different "flavors".

For example, this output will benefit from both, since Prometheus deals with aggregated data. However, the JSON and CSV outputs certainly won't. The most they will benefit from the grouping of metric samples in time series is to maybe cache some things a bit better. The end-of-test summary will benefit somewhat, but note that its aggregation spans multiple time series.

Contributor Author

Ok, I will try to rewrite this part. If we exclude aggregation from the time series concept, what other benefits do we get?

I see:

  • Efficient comparison - we can quickly check whether one time series matches another just by comparing the hashes.
  • Memory storage - we can reference the hash instead of passing around the entire name+tags, where the tags would end up being copied.
  • Post-run analysis - we can count the number of generated time series, so we know the unique combinations.
  • Do we have more?

This should also be a quick summary of what 1831 already documents.

For example, this output will benefit from both, since Prometheus deals with aggregated data. However, the JSON and CSV outputs certainly won't. The most they will benefit from the grouping of metric samples in time series is to maybe cache some things a bit better. The end-of-test summary will benefit somewhat, but note that its aggregation spans multiple time series.

Does it mean we will continue to flush all the samples to the outputs then each output will sink/aggregate autonomously?

Member

Do we have more?

Mentioned some side-benefits here: #27 (comment)

Does it mean we will continue to flush all the samples to the outputs then each output will sink/aggregate autonomously?

Yeah, I think so... 🤔 When every metric Sample is attached to a time-series, at least the aggregation will be quite a bit easier and computationally cheaper. Maybe we can also figure out how to expose Sinks, or some other common aggregation API from the k6 core, so work is not duplicated, but I am not sure if the benefits of that would outweigh the complexity 🤔 In any case, that definitely feels like a v2 kind of thing, we should probably start more simply without it.

Contributor

I have a feeling that we are converging on the idea that we don't need time series, we just want SampleTags to be comparable in O(1) time? If you want to call this time series, I am fine with that as well.

And then we will be aggregating in each output? With the possibility of there being some in-core place where we also aggregate them in some manner... possibly in a future version?

Member

Kind of... 🤔 If SampleTags are comparable in constant time, then they can be used as an index. So that gives us the ability to efficiently build a time series for every unique group of metric tags we want 🤷‍♂️ But we don't always need the exact same groupings, and we don't always need aggregation. So, by necessity, these things would have to be decoupled.

Contributor Author

@codebien codebien Jun 17, 2022

If SampleTags are comparable in constant time

Just to verify my alignment on this: with the solution in the document we can't really compare SampleTags in O(1), we can compare them in O(N). What we can compare in O(1) is the name+SampleTags, which in the end will be the same thing most of the time.

But there are exceptions. For example, in the Ingester, the first time it sees a new time series it should create a tree of all the parent time series to know the full set of Sinks to invoke, and we have to check all the individual tags in the SampleTags to get the right tree. Right?

Based on that, and given that we want distributed sinks, do we really need to involve the name in the hash for now? This way, we could avoid refactoring the signature here https://github.com/grafana/k6/blob/556747fbe585c981507904d634e705c68891e663/metrics/sample.go#L292-L318 mentioned in the other comment. WDYT? (I have the feeling I'm missing the potential limits of this proposal regarding future features.)

In the case of sinks per time series, we could have a nested index (e.g. map[metricName]map[SampleTagHash]*Sink).

Member

@na-- na-- Jun 17, 2022

Just to verify my alignment on this: with the solution in the document we can't really compare SampleTags in O(1), we can compare them in O(N). What we can compare in O(1) is the name+SampleTags, which in the end will be the same thing most of the time.

Yeah, sorry, that's what I meant. If the metric name + tag set, i.e. the "time series", is comparable in constant time, we can use it as an index.

But there are exceptions. For example, in the Ingester, the first time it sees a new time series it should create a tree of all the parent time series to know the full set of Sinks to invoke, and we have to check all the individual tags in the SampleTags to get the right tree. Right?

Kind of. When the MetricsEngine's ingester sees a Sample with a new (for it) time series, it will need to see which of its own sinks for the Sample's Metric match the tags of the new time series. And yeah, that operation is not constant, it's actually O(m*n), where m is the number of sub-metrics (sub-metric thresholds) and n is the length of the constraints for each. However, the improvement here is that this matching Sink list (not a tree, just a simple slice) can be cached for the next time a metric from the same time series is encountered! Whereas now that comparison is done for every metric Sample.

Based on that, and given that we want distributed sinks, do we really need to involve the name in the hash for now? This way, we could avoid refactoring the signature here https://github.com/grafana/k6/blob/556747fbe585c981507904d634e705c68891e663/metrics/sample.go#L292-L318 mentioned in the other comment. WDYT? (I have the feeling I'm missing the potential limits of this proposal regarding future features.)

In the case of sinks per time series, we could have a nested index (e.g. map[metricName]map[SampleTagHash]*Sink).

Hmm, interesting idea, I am not sure which will be better... 🤔

On the one hand, if we don't refactor it, we might introduce more complexity everywhere else 🤔 We'd need either these nested maps or composite indexes with (metric_name, tag_set_hash) tuples everywhere, right? Even if we ignore potential comparison and indexing inefficiencies, that will add development complexity, I think?

On the other hand, if we don't include the metric name in the hash, we'd be able to potentially more easily correlate Samples from different metrics but with the same tags 🤔 Not sure if that would be useful for anything, since we already package related samples in SampleContainers, but 🤷‍♂️

I think the first (including the metric name in the hash) is the better way to go here, but I am far from sure.
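
For illustration, a minimal sketch of how such a combined hash could be computed (hash/fnv is used here purely as an example; the actual hash function and encoding are still open questions): the metric name is mixed in first and the tags are hashed in sorted key order, so the same name+tag set always yields the same time series ID.

// assumes: import ("hash/fnv"; "sort")

type Tag struct {
    Key, Value string
}

func timeSeriesID(metricName string, tags []Tag) uint64 {
    sorted := append([]Tag(nil), tags...) // don't mutate the caller's slice
    sort.Slice(sorted, func(i, j int) bool { return sorted[i].Key < sorted[j].Key })

    h := fnv.New64a()
    h.Write([]byte(metricName))
    for _, t := range sorted {
        // separator bytes avoid ambiguous concatenations, e.g. ("ab","c") vs ("a","bc")
        h.Write([]byte{0xff})
        h.Write([]byte(t.Key))
        h.Write([]byte{0xfe})
        h.Write([]byte(t.Value))
    }
    return h.Sum64()
}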

go.mod
@@ -44,6 +44,7 @@ require (
github.com/prometheus/client_model v0.2.0 // indirect
github.com/prometheus/common/sigv4 v0.1.0 // indirect
github.com/prometheus/procfs v0.6.0 // indirect
github.com/spenczar/tdigest v2.1.0+incompatible // indirect
Member

See my comment here: grafana/k6#763 (comment)
I still haven't had time to read the paper, but the youtube presentation is worth watching. IIRC, there are better implementations than t-digest out there.

Contributor Author

I already experimented with https://openhistogram.io (circonusllhist). It works better; the Go implementation and the quantile math seem better. The only con I see at the moment is that we can't efficiently get the buckets and their relative counts - we would have to use the function for byte/string serialization of the entire histogram. In any case, I see it as fixable.

Member

@na-- na-- Jun 15, 2022

Again, I am not sure this is the best choice, I just picked it because I was in a hurry. I saw some issues with it during my PoC distributed execution experiments. More R&D is definitely needed... 😅


#### Open questions:

* `TimeseriesSample` knows the time series concept and has a reference to a time series. The Sample is raw data and there isn't yet an identified/matched time series to reference. In this way, the collecting phase doesn't require knowledge of a time series database. It could speed up the implementation and introduction of the first stage, where hopefully we don't touch the sample generation (e.g. all the `k6/js/modules`).
Member

I think this might not be the best way to go about it 🤔 It would force the calculation of hashes to run in just a single thread. If that was done as soon as a metric is measured in the js/ package (or elsewhere), that relatively CPU-intensive task would be spread across multiple cores... 🤔

However, if the association with a TimeSeries is done there, it will require locking, whereas if it was done centrally from a single thread, it will not require any locking, just a map lookup (or insertion, if the time series didn't exist).

WDYT?
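
To make the trade-off concrete, a rough sketch of the "central" variant (hypothetical types, reusing the timeSeriesID helper and Tag type sketched earlier): a single goroutine drains the sample channel, so the time-series map needs no locking, but all the hashing happens on one core.

// Sketch only; Sample and TimeSeries are simplified stand-ins for the PR's types.
type TimeSeries struct {
    ID         uint64
    MetricName string
    Tags       []Tag
}

type Sample struct {
    MetricName string
    Tags       []Tag
    Value      float64
    Series     *TimeSeries // nil when the sample is created, resolved centrally
}

func runIngester(samples <-chan *Sample) {
    known := make(map[uint64]*TimeSeries) // no mutex: only this goroutine touches it
    for s := range samples {
        id := timeSeriesID(s.MetricName, s.Tags) // CPU-bound hashing, single core
        ts, ok := known[id]
        if !ok {
            ts = &TimeSeries{ID: id, MetricName: s.MetricName, Tags: s.Tags}
            known[id] = ts
        }
        s.Series = ts
        // ... hand the sample over to the outputs / MetricsEngine sinks ...
    }
}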

Contributor Author

@codebien codebien Jun 16, 2022

@na-- I see the central solution as a more straightforward step for introducing this new model in the core. We can extend the solution with the distributed part in the next step.

I collapsed the two structures into the current Sample by just adding a TimeSeries pointer that I expect the metrics engine or the ingester to resolve. In this way, we can resolve the submetrics/contains issue in the first step, and in a later PR we can drop the Tags field from the Sample structure by distributing the time series generation to the components on the edge.

Do you see limits for it?

Member

@na-- na-- Jun 17, 2022

@na-- I see the central solution as a more straightforward step for introducing this new model in the core. We can extend the solution with the distributed part in the next step.

Hmm... maybe? 😅 isn't the "distributed" part of the time series determination just calculating the hash when the metric sample is assembled? 🤔 That seems like it would be easier to get done at the start, rather than refactoring it later? If it was just dependent on the tags, it could have probably been done by just modifying things here, but since it also depends on the metric name, it's going to need some more refactoring 😞

edit: or just passing the metric name to those functions, even if it's not saved?


#### Thresholds

Thresholds should act as an output; they should work on top of a time series data layer. They get the latest value for the time series and check whether the value triggers the threshold based on the defined condition.
Member

This is how it already is done after my latest refactoring, see how the MetricsEngine receives metric samples by just pretending to be another normal Output:
https://github.com/grafana/k6/blob/b725f03be5873b9b0d933b94a83924ff7c1f15e4/metrics/engine/ingester.go#L14-L16
https://github.com/grafana/k6/blob/b725f03be5873b9b0d933b94a83924ff7c1f15e4/core/engine.go#L98-L103

But it can't directly work on the aggregated data from time series since a single threshold spans multiple time series.

@codebien
Contributor Author

I simplified the model by removing the TagSet concept. If we are going to use the TimeSeries, then it should be enough in terms of memory storage, and we can instead benefit from relaxing the system to avoid the synchronization it would introduce.

// This approach also allows to avoid hard to replicate issues with duplicate timestamps.

labels, err := tagsToLabels(sample.Tags, o.config)
tset, err := o.tagSetFromTags(sample.Tags.CloneTags())
Member

@na-- na-- Jun 23, 2022

FWIW, this is a bit inefficient, since it will make a copy of the map[string]string. Ideally, all of these things should be done with the minimum amount of copying and allocation required, to reduce GC thrashing. That's only possible on the k6 side, by refactoring the types there directly, so just making a note.

Comment on lines +25 to +35
index := sort.Search(len(*set), func(i int) bool {
return (*set)[i].Key > t.Key
})

if len(*set) == index {
*set = append(*set, t)
return
}

*set = append((*set)[:index], append(make([]*Tag, 1), (*set)[index:]...)...)
(*set)[index] = t
Member

Isn't this going to be much less efficient than adding all of the tags unsorted and then sorting the whole slice once at the end? This will just do a binary search on every insertion and then copy (potentially big) chunks of the slice every time an element is added... 😕
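
For comparison, a minimal sketch of the "sort once at the end" alternative na-- describes (a hypothetical helper, not code from this PR, reusing the Tag type from the reviewed diff):

// assumes: import "sort"

// buildTagSet appends all tags unsorted and sorts the slice a single time,
// instead of doing a binary search plus a shifting copy on every insertion.
func buildTagSet(m map[string]string) []*Tag {
    set := make([]*Tag, 0, len(m))
    for k, v := range m {
        set = append(set, &Tag{Key: k, Value: v})
    }
    sort.Slice(set, func(i, j int) bool { return set[i].Key < set[j].Key })
    return set
}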

Contributor Author

Yeah, you're right. TagSet was mostly an experiment that I removed from the document in the end because I don't think we really need to have a second layer of aggregation.

Comment on lines 30 to 44
type Tag struct {
Key, Value string
}

type TimeSeries struct {
ID uint64 // hash(metric+tags)
MetricName string
Tags []Tag
}

type Sample struct {
TimeSeries *TimeSeries
Timestamp uint64
Value float64
}
Member

The basic underlying structure here looks correct, i.e. every element now has the right constituent parts. However, I am not sure the implementation should use these structs precisely.

For one thing, we should probably try to make a lot of these things read-only once they have been created. Public properties are convenient, but we probably don't want any output or JS extension to be able to modify the internals of a TimeSeries, right? So constructors that create objects with unexported properties and read-only getter methods might be the way to go?

Another thing that I noticed is the fact that the tags are a slice. Having them in a sorted slice is necessary to calculate the hash of the TimeSeries, but we should probably also have some index (i.e. a map) of their keys, so the rest of the codebase could efficiently lookup the value of specific tags?
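
A rough sketch of what that could look like (names and layout purely illustrative): unexported fields, a constructor that sorts the tags and builds the key index once, and getters instead of public properties.

// assumes: import "sort"; reuses the timeSeriesID helper sketched earlier

type TimeSeries struct {
    id         uint64
    metricName string
    tags       []Tag          // sorted by key: used for hashing and sequential access
    index      map[string]int // tag key -> position in tags: used for random access
}

func NewTimeSeries(metricName string, tags []Tag) *TimeSeries {
    sorted := append([]Tag(nil), tags...)
    sort.Slice(sorted, func(i, j int) bool { return sorted[i].Key < sorted[j].Key })
    idx := make(map[string]int, len(sorted))
    for i, t := range sorted {
        idx[t.Key] = i
    }
    return &TimeSeries{
        id:         timeSeriesID(metricName, sorted),
        metricName: metricName,
        tags:       sorted,
        index:      idx,
    }
}

func (ts *TimeSeries) ID() uint64         { return ts.id }
func (ts *TimeSeries) MetricName() string { return ts.metricName }

// GetTag looks up a single tag by key without copying the whole tag set.
func (ts *TimeSeries) GetTag(key string) (Tag, bool) {
    i, ok := ts.index[key]
    if !ok {
        return Tag{}, false
    }
    return ts.tags[i], true
}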

Contributor Author

For one thing, we should probably try to make a lot of these things read-only once they have been created.

A common discussion in the Go world. 😄 I'm not convinced, some questions:

  • We would create an inconsistent situation between Metric and TimeSeries. For example, why is it relevant to keep TimeSeries protected and to not protect the Metric for a name change? If you are suggesting doing the same for Metric then the impact in the source code seems big.
  • How should the Get methods work? I expect they have to make a copy of the internal structure. That doesn't sound optimal in terms of efficiency.

but we should probably also have some index

I agree.

so the rest of the codebase could efficiently lookup the value of specific tags?

Out of curiosity, when do you expect to use it? I expect most of the time TimeSeries.Tags access will be used for mapping the entire slice, because the TimeSeries has already been resolved by matching the ID. In that case, iterating the slice would be fine.

I see the index as useful for the Contains operation, where we would need the key+value pair as keys; otherwise we have to access the map by key but still make a string comparison for the value. Right?

Member

@na-- na-- Jun 23, 2022

For example, why is it relevant to keep TimeSeries protected and to not protect the Metric for a name change? If you are suggesting doing the same for Metric then the impact in the source code seems big.

Yes, I want to fix Metric too. I don't think the impact will be that big, it should be a fairly simple substitution. Most extensions don't use the metric name directly, they use a pointer to the Metric object. And besides, we make no promises for backwards compatibility of the Go APIs.

How should the Get methods work? I expect they have to make a copy of the internal structure. That doesn't sound optimal in terms of efficiency.

I guess it depends on the specific thing. For example, for metrics.Metric, I expect getter methods that make copies of its individual properties will be enough: m.GetName() instead of just m.Name, etc. Besides, strings in Go are immutable, so I don't think it even makes an actual copy, just returns a read-only reference to the original string.

The problem here is that we always get a pointer to a Metric, we can't have copies of the whole struct since the whole point after the recent refactorings is that we only have a single Metric object with the same name. So its properties are mutable 😞

Out of curiosity, when do you expect to use it? I expect most of the time TimeSeries.Tags access will be used for mapping the entire slice, because the TimeSeries has already been resolved by matching the ID. In that case, iterating the slice would be fine.

I see the index as useful for the Contains operation, where we would need the key+value pair as keys; otherwise we have to access the map by key but still make a string comparison for the value. Right?

Yeah, mostly for Contains(), since that is needed by the MetricsEngine for sub-metric sinks in the end-of-test summary and thresholds. Though that can probably be implemented internally 🤔 Still, I can imagine how some output would like the value of a specific tag (e.g. url, method, etc.) to do some pre- or post-processing only on specific raw metric samples 🤷‍♂️ In general though, it should be possible to efficiently inspect a TimeSeries and know what are its characteristics.

Contributor Author

@codebien codebien Jun 23, 2022

I guess it depends on the specific thing.

I was thinking more about TimeSeries.tags, where GetTags() []Tag would have to make a copy before returning the slice to keep it "immutable". In this way, it seems similar to the current code: we lose the memory-efficiency benefit of having the TimeSeries, where we would allocate the Tags slice only once.

Member

It's fine to have a GetTags() function, since sometimes you just need a copy of the tags 🤷‍♂️ As long as we don't have to make these copies in the normal flow of metric samples, it should be fine. But you shouldn't need to make a copy of the whole tag set in order to get the value for a specific tag.

Contributor Author

since sometimes you just need a copy of the tags

Maybe I'm not considering something, but it's not sometimes, it's every time a flushMetrics method is invoked for outputs. We need to map the tags and we don't know any specific tag; we just have the Sample and the pointer to the TimeSeries. For example, this line would be replaced with the GetTags() method, right?

Member

Hmm maybe, it's probably not worth optimizing that particular line too much 🤷‍♂️ But if we really needed to optimize it, we can cache the CSV representation of these tags for every TimeSeries the first time we see it and reuse that for every other time 🙂
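
A tiny sketch of that caching idea (hypothetical output-side code, not part of this PR): since there is a single interned TimeSeries object per series, its pointer can key the cache and the joined tag string is rendered only once.

// assumes: import "strings"; a single output goroutine, so no locking around the map

var csvTagCache = make(map[*TimeSeries]string)

func csvTags(ts *TimeSeries) string {
    if s, ok := csvTagCache[ts]; ok {
        return s // already rendered for this time series
    }
    tags := ts.GetTags()
    parts := make([]string, 0, len(tags))
    for _, t := range tags {
        parts = append(parts, t.Key+"="+t.Value)
    }
    s := strings.Join(parts, ",")
    csvTagCache[ts] = s
    return s
}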

@codebien codebien requested a review from na-- June 23, 2022 16:48
id uint64 // hash(metric+tags)
metricName string
tags []Tag
keys map[string]struct{}
Member

@na-- na-- Jun 24, 2022

Suggested change
keys map[string]struct{}
tagKeys map[string]int

or something like that, to allow for direct lookup? 🤔

Contributor Author

Mmm... I'm wondering whether we should do one more step at this point. My idea for the tags map was something like map[keySepValue]struct{} for maximum efficiency on Contains, but it wouldn't support fetching by Tag.Key.

So we could consider this data structure for supporting the single GetTag(key):

tags map[key]struct{
    value string
    valhash uint64
}

In this way, we could support most of the operations with one data structure. The only missing operation is returning the sorted []Tag, e.g. for GetTags(). I see two potential solutions:

  • store just the keys in a sorted slice
  • add another field like order/index in the struct, so we know in which specific place to set the Tag item.
func (ts TimeSeries) GetTags() []Tag {
    s := make([]Tag, len(ts.tags))
    for k, data := range ts.tags { // the index field from the second option gives the sorted position
        s[data.index] = Tag{Key: k, Value: data.value}
    }
    return s
}

The list of operations we potentially have to consider in comparisons:

func (TimeSeries) Contains(subset []Tag) bool
func (TimeSeries) HasKey(string) bool 
func (TimeSeries) GetTag(string) Tag
func (TimeSeries) GetTags() []Tag

@na-- Thoughts?

Contributor Author

Thinking more about this, it doesn't really work, because for Contains we don't have the corresponding tag.valhash to compare. So I think we can just support GetTag via binary search 🤔

Member

I am not sure I understand all of your concerns, or maybe I'm missing something 😕 Please correct me if I'm wrong, but we basically need to be able to efficiently access the tags both in a sorted sequential order and in a random access order (i.e. lookup by their key). So, the basic implementation choices are these:

  1. Store the tag keys and values in a sorted list ([]Tag) for efficient sequential access and have an index (map[string]int) by the key for efficient random access
  2. Store the tag keys and values in a map (map[string]string) for the efficient random access and have a sorted list of keys as a []string slice for the sequential access.
  3. Something weirder

The first approach seems preferable to me, since the second one involves an extra map lookup per tag when iterating across all tags sequentially, and anything else seems like unnecessary over-engineering at this early point. 🤷‍♂️

Given that these time series objects should be measured in thousands at most, and not millions, I don't see why we'd want anything more complicated than this? You probably won't save all that much memory usage if you don't have the index and use binary search, for example, since I think the strings for the keys will be reused between the slice and the map 🤷‍♂️

Besides, we might be verging into bike-shedding territory here 🙂. The actual implementation should be internal to the TimeSeries object, right? So I don't think it matters very much which implementation you pick, we should be able to change it at any time in the future without anything breaking, right?
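
To make option 1 concrete, a sketch of the remaining lookups on top of the sorted-slice-plus-key-index layout from the constructor sketched earlier (again, purely illustrative):

// Contains reports whether every tag in subset is present with the same value.
func (ts *TimeSeries) Contains(subset []Tag) bool {
    for _, want := range subset {
        i, ok := ts.index[want.Key]
        if !ok || ts.tags[i].Value != want.Value {
            return false
        }
    }
    return true
}

// GetTags returns a copy of the sorted tags, so callers can't mutate the
// TimeSeries' internal slice.
func (ts *TimeSeries) GetTags() []Tag {
    out := make([]Tag, len(ts.tags))
    copy(out, ts.tags)
    return out
}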

@codebien
Contributor Author

This PoC can be considered closed; we achieved our goal of drafting a hypothesis for a good data model. The next implementation should directly use the new types defined in k6, once they are merged into k6/master.

@codebien codebien closed this Jun 29, 2022
@codebien codebien deleted the timeseries branch August 16, 2023 12:02