Date type has not enough precision for the logging use case. #10005

Closed
jordansissel opened this issue Mar 5, 2015 · 70 comments
Labels
>feature high hanging fruit :Search/Mapping Index mappings, including merging and defining field types stalled

Comments

@jordansissel
Contributor

At present, the 'date' type is millisecond precision. For many log use cases, higher precision time is valuable - microsecond, nanosecond, etc.

The biggest impact of this is during sorting of search results. If you sort chronologically, newest-first, by a date field, documents with the same date may be returned in the wrong relative order (because they tie on the sort key). Users often report seeing events "out of order" when those events share a timestamp. A specific example: sorting by date newest-first works until there is a tie, at which point the tied events appear oldest-first (or in first-written order?). This causes a bit of confusion for the ELK use case.

Related: logstash-plugins/logstash-filter-date#8

I don't have any firm proposals, but I have two different implementation ideas:

  • Proposal 1, use a separate field: store our own custom-precision time in a separate field as a long. This allows us to do correct sorting (because we have higher precision), but it makes any date-related functionality in Elasticsearch unusable (searching now-1h, doing date_histogram, etc.)
  • Proposal 2, give the date type tunable precision: make the date type's precision configurable, with the default (backwards-compatible) precision being milliseconds. This would let us choose, for example, nanosecond precision for the logging use case, and year precision for an archaeological use case (billions of years ago, or something). The benefit here is that date_histogram and other date-related features would still work. Further, a configurable precision would let the underlying data structure stay a 64-bit long while users choose the precision most appropriate for them.
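A minimal sketch of Proposal 1 in Python (the field name `ts_remainder_ns` and the split logic are hypothetical, not an actual Logstash/Elasticsearch implementation):

```python
# Sketch of Proposal 1 (hypothetical field name 'ts_remainder_ns'):
# index the millisecond part in the normal 'date' field and the
# sub-millisecond remainder in a plain 'long' field, then sort on both.
def split_ns_timestamp(epoch_ns: int) -> dict:
    """Split a nanoseconds-since-epoch value into the two indexed fields."""
    return {
        "@timestamp": epoch_ns // 1_000_000,      # milliseconds, for the date field
        "ts_remainder_ns": epoch_ns % 1_000_000,  # 0..999999, for tie-breaking
    }

# Two events landing in the same millisecond still sort correctly
# when the remainder field is used as a secondary sort key:
events = [1_425_600_000_123_456_789, 1_425_600_000_123_456_788]
docs = sorted((split_ns_timestamp(ns) for ns in events),
              key=lambda d: (d["@timestamp"], d["ts_remainder_ns"]))
```

The trade-off is exactly as described above: sorting is correct, but range queries and date_histogram only see the millisecond field.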
@jordansissel
Contributor Author

I know Joda's got a precision limit (the Instant class is millisecond precision) and a year limit ("year must be in the range [-292275054,292278993]"). I'm open to helping explore solutions in this area.

@synhershko
Contributor

What about the consequences for the date field's size? Even with docvalues in place, cardinality will be ridiculously high. Even for those scenarios which need this, it could be overkill, no?

@clintongormley clintongormley added discuss :Search/Mapping Index mappings, including merging and defining field types labels Mar 9, 2015
@nikonyrh

Couldn't you just store the decimal part of the second in a secondary field (as a float or long) and sort by these two fields when needed? You could still aggregate based on the standard date field but not at a microsecond resolution.

@markwalkom
Contributor

I've been speaking to a few networking firms lately and it's dawned on me that microsecond level is going to be critical for IDS/network analytics.

@markwalkom
Contributor

@anoinoz

anoinoz commented Jun 24, 2015

that last request is mine I believe. I would add that to monitor networking (and other) activities in our field, nanosecond support is paramount.

@clintongormley clintongormley added :Dates and removed :Search/Mapping Index mappings, including merging and defining field types labels Jun 24, 2015
@abrisse

abrisse commented Sep 4, 2015

👍 for this feature

@jack-pappas

What about switching from Joda Time to date4j? It supports higher-precision timestamps compared to Joda and supposedly the performance is better as well.

@dadoonet
Member

Before looking at the technical side: is the BSD license compatible with the Apache2 license?

@dadoonet
Member

So BSD is compatible with Apache2.

@clintongormley

I'd like to hear @jpountz's thoughts on this comment #10005 (comment) about high cardinality with regards to index size and performance.

I could imagine adding a precision parameter to date fields which defaults to ms, but also accepts s, us, ns.

We would need to move away from Joda, but I wouldn't be in favour of replacing Joda with a different dependency. Instead, we have this issue discussing replacing Joda with Java.time #12829

@jpountz
Contributor

jpountz commented Sep 24, 2015

@clintongormley It's hard to predict because it depends so much on the data so I ran an experiment for an application that ingests 1M messages at a 2000 messages per second per shard rate.

| Precision | Terms dict (kB) | Doc values (kB) |
| --- | --- | --- |
| milliseconds | 3348 | 2448 |
| microseconds | 10424 | 3912 |

Millisecond precision is much more space-efficient, in particular because at 2k docs per second several messages fall in the same millisecond. But even if we go with 1M messages at a rate of 200 messages per second, so that sharing the same millisecond is much less likely, there are still significant differences between millisecond and microsecond precision.

| Precision | Terms dict (kB) | Doc values (kB) |
| --- | --- | --- |
| milliseconds | 7604 | 2936 |
| microseconds | 10680 | 4888 |

That said, these numbers are for a single field, the overall difference would be much lower if you include _source storage, indexes and doc values for other fields, etc.

Regarding performance, it should be pretty similar.

@clintongormley

From what users have told me, by far the most important reason for storing microseconds is the sorting of results. It makes no sense to aggregate on buckets smaller than a millisecond.

This can be achieved very efficiently with the two-field approach: one for the date (in milliseconds) and one for the microseconds. The microseconds field would not need to be indexed (unless you really need to run a range query with finer precision than one millisecond), so all that would be required is doc_values. Microseconds can have a maximum of 1,000 values, so doc_values for this field would require just 12 bits per document.

For the above example, that would be only an extra 11kB.

A logstash filter could make adding the separate microsecond field easy.

@clintongormley

Meh - there aren't 1000 bits in a byte. /me hangs his head in shame.

It would require 1,500kB
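For the record, the corrected arithmetic (assuming 1 kB = 1,000 bytes and the 1M-document experiment above):

```python
# 12 bits per document of doc_values for the microsecond field,
# over 1,000,000 documents:
docs = 1_000_000
bits_per_doc = 12
total_kB = docs * bits_per_doc / 8 / 1000  # bits -> bytes -> kB
print(total_kB)  # → 1500.0
```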

@pfennema

If we want to use the ELK framework properly for analyzing network latency, we really need nanosecond resolution. Are there any firm plans/roadmap to change the timestamps?

@portante

portante commented Oct 1, 2015

Let's say I index the following JSON document with nanosecond precision timestamps:

{ "@timestamp": "2015-09-30T12:30:42.123456789-07:00", "message": "time is running out" }

So the internal date representation will be, 2015-09-30T19:30:42.123 UTC, right?

But if I issue a query matching that document, and ask for either the _source document or the @timestamp field explicitly, won't I get back the original string? If so, then in cases where the original time string lexicographically sorts the same as the converted time value, would that be sufficient for a client to further sort to get what they need?

Or is there a requirement that internal date manipulations in ES need such nanosecond precision? I am imagining that if one has records with nanosecond precision, only being able to query for a date range with millisecond precision could potentially result in more document matches than wanted. Is that the major concern?
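On the lexicographic-sort idea: fixed-width ISO-8601 strings do sort chronologically, but only while the UTC offset stays constant. A quick Python check (the timestamps are made up):

```python
# The original @timestamp strings from _source sort lexicographically in
# chronological order as long as the width and UTC offset are fixed:
a = "2015-09-30T12:30:42.123456789-07:00"
b = "2015-09-30T12:30:42.123456790-07:00"
assert a < b  # lexicographic order matches time order here

# But with mixed offsets it breaks: c is one second *earlier* than a
# in absolute time (a is 19:30:42 UTC), yet sorts after it as a string.
c = "2015-09-30T19:30:41.000000000+00:00"
assert c > a
```

So client-side tie-breaking on the original string only works if every producer emits the same offset and fractional-second width.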

@pfennema

pfennema commented Oct 2, 2015

I think the latter: internal date manipulations probably need nanosecond precision. The reason is that when monitoring latency on 10Gb networks, we get pcap records (or packets directly from the switch via UDP) which include multiple fields with nanosecond timestamps. We'd like to compute the differences between those timestamps in order to optimize our network/software and find correlations. To do this we need to zoom in on every single record, not aggregate records.

@abierbaum

👍 for solving this. It is causing major issues for us now in our logging infrastructure.

@clintongormley

@pfennema @abierbaum What problems are you having that can't be solved with the two-field solution?

@pfennema

What we'd like is a timescale in the display (Kibana) where we can zoom in on individual measurements that carry nanosecond-resolution timestamps. A record in our case has multiple fields (NICTimestamp, TransactionTimestamp, etc.) which we'd like to correlate with each other on an individual basis, hence not aggregated. We need to see where spikes occur to optimize our environment. If we can have micro/nanosecond resolution on the x-axis, we should be able to zoom in on individual measurements.

@abierbaum

@clintongormley Our use case is using ELK to analyze logs from the backend processes in our application. The place we noticed it was postgresql logs. With the current ELK code base, even though the logs coming from the database server have the commands in order, once they end up in Elasticsearch and are visualized in Kibana, the order of items happening in the same millisecond is lost. We can add a secondary sequence-number field, but that doesn't work well in Kibana queries (since you can't sort on multiple fields) and causes quite a bit of confusion on the team, because they expect the data in Kibana to be sorted in the same order as it came in from postgresql and logstash.

@gigi81

gigi81 commented Nov 20, 2015

We have the same problem as @abierbaum described. When events happen in the same millisecond, the order of the messages is lost.
Any workaround or suggestion on how to fix this would be really appreciated.

@dtr2

dtr2 commented Jan 17, 2016

You don't need to increase the timestamp accuracy: instead, the time sorting should be based on both timestamp and ID: message IDs are monotonically increasing, and specifically, they are monotonically increasing for a set of messages with the same timestamp...

@StephanX

StephanX commented Nov 22, 2017

Our use case is that we ingest logs kubernetes => fluentd (0.14) => elasticsearch, and logs that are emitted rapidly (anything under a millisecond apart, which is easily done) obviously have no way of being kept in that order when displayed in Kibana.

@varas

varas commented Dec 11, 2017

Same issue: we are tracking events that occur at nanosecond precision.

Is there any plan to increase it?

@clintongormley

Yes, but we need to move from Joda to Java.time in order to do so. See #27330

@gavenkoa

I opened a bug against Logback, as its core interface also stores data at millisecond resolution, so precision is lost even earlier, before ES: https://jira.qos.ch/browse/LOGBACK-1374

It seems that the historical java.util.Date type is the cause of these problems in the Java world.

@shekharoracle

Same use case: using the kubernetes/filebeat/elasticsearch stack for log collection, and the lack of nanosecond precision is leading to incorrect ordering of logs.

@portante

Seems like we need to consider having the collectors provide a monotonically increasing counter which records the order in which the logs were collected. Nanosecond precision does not necessarily solve the problem, because the clock's resolution might not actually be nanosecond.

@lgogolin

Seriously, guys? This bug is almost 3 years old...

@matthid

matthid commented Feb 16, 2018

The problem is also that if you try to find a workaround you run into a series of other bugs, so there is no viable workaround:

  • If you use a string, sorting will be slow
  • If you use an integer and try to make it readable, the numbers won't be big enough (Support for BigInteger and BigDecimal #17006)
  • If you add an additional ordering field, you cannot easily configure Kibana to apply a "thenBy" ordering on that field.

So the only viable workaround seems to be an epoch timestamp plus 2 additional digits which are incremented in logstash when the timestamp matches the previous one.

Has anyone found a better approach?
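A sketch of that epoch-plus-two-digits workaround in Python (the key construction is hypothetical; a real implementation would live in a logstash filter):

```python
# Hypothetical sketch: multiply the epoch-millisecond timestamp by 100
# and use the two extra decimal digits as a counter that increments
# while consecutive events share the same millisecond.
def make_sortable_keys(epoch_ms_list):
    keys, last_ms, counter = [], None, 0
    for ms in epoch_ms_list:
        counter = counter + 1 if ms == last_ms else 0
        if counter > 99:
            raise ValueError("more than 100 events in one millisecond")
        keys.append(ms * 100 + counter)
        last_ms = ms
    return keys

# Three events, the first two in the same millisecond:
print(make_sortable_keys([1518739200000, 1518739200000, 1518739200001]))
# → [151873920000000, 151873920000001, 151873920000100]
```

The resulting keys still fit comfortably in a 64-bit long, but the scheme caps out at 100 events per millisecond and only preserves arrival order, not real sub-millisecond time.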

@jraby

jraby commented Feb 16, 2018

We've been storing microseconds since epoch in a number field for 2 years now.
Suits our needs, but YMMV.

@jpountz
Contributor

jpountz commented Mar 14, 2018

cc @elastic/es-search-aggs

@tlhampton13

Not all time data is collected using commodity hardware; there is plenty of specialty equipment that collects nanosecond-resolution data. Thinking about applications beyond log analysis: sorting by time is critical, but aggregation over small timeframes is also important. For example, maybe I just want to aggregate some scientific data over a one-second or even one-millisecond window.

I have nanosecond resolution data and would love to be able to use ES aggregations to analyze it.

@jimczi
Contributor

jimczi commented Feb 11, 2019

Elasticsearch 7.0 will include a date_nanos field type that handles nanosecond sorting precision:
#37755
Nanosecond precision is now a first-class citizen that doesn't require two fields to retain precision, so I will close this issue. Please open new issues if you find bugs or enhancements to make on this new field type.
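For reference, opting into the new type is a one-line mapping change (the index name `logs` here is just an example):

```json
PUT logs
{
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date_nanos" }
    }
  }
}
```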

@gavenkoa

gavenkoa commented Dec 27, 2021

https://jira.qos.ch/browse/LOGBACK-1374 added Instant getInstant() to the ILoggingEvent interface, allowing appenders to capture nanosecond resolution!

It is in 1.3.0-alpha12. I expect to see it used in new appenders.
