Comprehensive validation 🔎, 30+ fixes integrated/added 🔨🐛, optimized performance 🚀 #159

klondikedragon · 2023-12-21T06:49:12Z

This package is amazing and hugely popular, and has been the best package for automatic date parsing in go for years! ⭐

Thanks @araddon for crafting this package with love over the years!!

I've been using this while developing a new cloud-based log aggregation/search/visualization product, and I've found that there are three major opportunities for improvement for my particular use case:

The package does not strictly validate its input, leading to many false positives. This is OK if you know for sure the input matches one of the known formats, but cannot be trusted if the input could be anything and you only want a returned date/time if it definitely matches a known format.
While still being far more efficient than the "shotgun" parsing approach, the package currently allocates a relatively large amount of memory (several times the average input size), which can add up when parsing megabytes of date strings per second in a high-throughput microservice. It can also be relatively slow when parsing a string that doesn't match a known format and can allocate even more memory in this case, due to custom error messages that include contextual details.
There are a lot of one-off community contributions that haven't been merged and need to be made coherent with each other.

This PR addresses all 3 opportunities:

Validation: it now comprehensively validates the input in the following ways: at each point in the state machine, it makes sure there are cases for all possibilities, and any invalid possibility will fail; additionally, it makes sure that the entire format string was specifically set (excluding any trailing punctuation, which can be safely ignored). False positives should be extremely rare now (hard to prove they don't exist).
Memory Efficiency: bytes allocated have been reduced by 90%, and parsing some formats is now zero-allocation. A SimpleErrorMessages option was added (off by default) that greatly speeds up the case where a string does not match a known format. With the option on, this case is now 4x faster and produces almost no allocations.
Merge & Integrate Community Fixes: Fixes for nearly all of the pending issues from the community (including open pull requests) have been incorporated and adapted.
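
To make the validation idea concrete, here is a minimal sketch of a strict check for a single shape (yyyy-mm-dd). It is not the package's state machine, and the name `looksLikeISODate` is hypothetical. It only illustrates the principle described above: every position has an explicit expectation, anything unexpected fails, and trailing punctuation is ignored.

```go
package main

import "fmt"

// looksLikeISODate is a hypothetical strict shape check for yyyy-mm-dd.
// It validates only the shape (digits and separators), not value ranges.
func looksLikeISODate(s string) bool {
	// Trailing punctuation is ignored, as the PR describes.
	for len(s) > 0 && (s[len(s)-1] == '.' || s[len(s)-1] == ',' || s[len(s)-1] == ';') {
		s = s[:len(s)-1]
	}
	if len(s) != 10 {
		return false
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		switch i {
		case 4, 7:
			// Separator positions must hold a dash.
			if c != '-' {
				return false
			}
		default:
			// Every other position must hold a digit.
			if c < '0' || c > '9' {
				return false
			}
		}
	}
	return true
}

func main() {
	for _, in := range []string{"2023-12-21", "2023-12-21.", "1.jpg", "2023-12-2x"} {
		fmt.Printf("%q -> %v\n", in, looksLikeISODate(in))
	}
}
```

Inputs such as "1.jpg" fail here because no branch accepts them, which is the same reason the false positives listed below go away under comprehensive validation.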

In the process of going through the state machine comprehensively for validation, redundant code/states were merged, and support was added for certain edge cases (for example, some date formats did not support being followed by times).

The example and README.md were updated to incorporate all of the newly supported formats and edge cases. More details on how to properly interpret returned location information with respect to abbreviated timezones were added.

BREAKING -- the package now requires go >= 1.20 to support memory optimizations converting from []byte to string in key places.
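
The PR does not name the go 1.20 feature it relies on. A likely candidate is `unsafe.String` together with `unsafe.SliceData`, both added in go 1.20. The sketch below assumes that is the mechanism; the helper name `bytesToString` is hypothetical.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString returns a string that shares memory with b instead of copying.
// Assumption: the caller never modifies b while the returned string is in use,
// otherwise the "immutable" string would change underneath its users.
func bytesToString(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(unsafe.SliceData(b), len(b))
}

func main() {
	layout := []byte("2006-01-02 15:04:05")
	fmt.Println(bytesToString(layout))
}
```

A plain `string(b)` conversion is always safe but generally copies, which is the allocation this technique avoids in the cases where the byte slice is known not to change.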

A huge thanks to all who posted issues and contributed PRs -- while the PRs were unable to be merged directly because the validation changes were so major, the ideas of all these contributions and the associated test cases were incorporated. Here's credit for all of the issue fixes and contributions in this PR as well as a summary of additional fixes added:

This PR started directly building from Supported extra whitespaces where I needed them + many other changes #151 by @arran4 that fixes Cannot parse 2015-06-10 00:00:00 GMT+02:00 #78 (support for GMT+offset, support additional whitespace in certain places, and other changes)
Fix [BUG] some coma-separated words are considered as a time #145 - words falsely detected as a valid format - fixed by comprehensive validation
Fix "1.jpg" is incorrectly matched as a date #108 - "1.jpg" incorrectly parsed as valid - fixed by comprehensive validation
Fix Names and addresses are parsed without an error #98 - names and addresses incorrectly parsed as valid - fixed by comprehensive validation
Fix Issues with parsing 12:xx am #150 - fix parsing "am" as "pm"
Fix can't parse Thu Jan 28 2021 15:28:21 GMT+0000 (Coordinated Universal Time) #157 - fix parsing of Thu Jan 28 2021 15:28:21 GMT+0000 format
Fix Support ":" as separator for fractional seconds in the absence of a zone offset #137 - variant of ':' as separator for fraction seconds - adapted support ":" as separator for fractional seconds #138 by @dferstay
Fix Support eurodate format (dd-mmm-yy with a digit month) #139 - add support for dd-mm-yyyy (digit month) - adapt Add support for dd-mm-yyyy (digit month) formats #140 by @dferstay (also fixes Cannot parse "13-02-2015", but successfully parses "13/02/2015" #155)
Fix Support "yyyy mmm dd" dates where mmm is an alpha literal #141 - add support for yyyy mon dd format - adapt Add support for dates of the form "yyyy mmm dd" where mmm is alpha #142 by @dferstay
Fix Support combined datetime format with subseconds (yyyyMMddhhmmss.SSS) #143 - add support for yyyymmddhhmmss.SSS format - adapt support combined datetime format with subseconds (yyyyMMddhhmmss.SSS) #144 by @dferstay
Fix Timezone 'Z' not parsed for time string with milliseconds after period #130 - fix timezone parsing after fractional seconds - adapted issue-130 fix timezone detection issue after timePeriod #131 by @zifengyu
Fix Not finding time for CET with millisecond and time with nanosecond #123 - fix parsing of fractional seconds for certain formats (e.g., 2017-04-03 22:32:14.322 CET)
Fix Date format not being parsed #109 - fix parsing of Sun, 07 Jun 2020 00:00:00 +0100 format
Fix bug in comment Cant parse a simple date #100 (comment) - fix parsing of 1 April 2022 23:59 format (time after certain date formats)
Fix Date format of DD.MM.YYYY HH:MM:SS not supported #129 - fix ambiguous and PreferMonthFirst parsing for format mm.dd.yyyy (time) - adapted fix problem with date format 02.01.2006 #133 by @mehanizm (also fixes Better support for dot-format (e.g. 13.1.2009) with ambiguousMMDD #91 and Request: Support for DD/MM/YYYY #28)
Fix Some forms of PM indicator combined with time zone don't work #149 - support PMDT and AMT time zones and validate that AM/PM indicators only appear at most once
Fix Suffixes are not supported in some formats #127 - add support for dd[th,nd,st,rd] Month yyyy format - adapt New date format 1st November 2020 #128 by @krhubert
Fix Does not respect offset for format (time) UTC[+-]NNNN #158 - fix parsing for format (time) UTC[+-]NNNN
Add support for mm/dd/yyyy, hh:mm:ss format - adapt added comma format like 04/2/2014, 03:00:37 #156 by @BrianLeishman
Add support for yyyy.mm.dd (time) format - adapt Bug parsing 2014.02.13 00:00:00 ? #134 by @jmdacruz, and add cases expected to fail to TestParseErrors unit test
Add support for git log format (e.g., Thu Apr 7 15:13:13 2005 -0700) - adapt commit 99d9682 from Add support for git date format #92 by @jiangxin (merge timeWsYearOffset case and add validation)
Add support for RabbitMQ log format (e.g., dd-mon-yyyy::hh:mm:ss) - adapt rabbitmq log datetime support #122 by @bizy01
Expand support for Chinese date formats, and allow times to follow - adapt expand Chinese time format #132 by @xwjdsh
Add support for mon/dd/yyyy format, e.g., Oct/31/1970
Add support for dd-month-year format
Extend format yyyy-mon-dd to allow times to follow it. Also allow full month name instead of just abbreviated.
Fix the case for ambiguous date/time in the mm:dd:yyyy format
Allow full day name before month (e.g., Monday January 4th, 2017)
Additional fixes for mm.dd.yyyy (time) format
Fix ambiguous parsing for mm/dd formats that start with a weekday

Also adds tests to verify that the following stay fixed:

Cannot parse date: 190910 11:51:49 #94 - mysql log format such as 190910 11:51:49
Date Parse Fails on dates returned by Workfront in the European timezone #117 - fractional seconds after ':'

* Don't just assume we were given one of the valid formats.
* Also consolidate the parsing states that occur after timePeriod.
* Add subtests to make it easier to see what fails.
* Additional tests for 4-char timezone names.
* Fix araddon#117
* Fix araddon#150
* Fix araddon#157
* Fix araddon#145
* Fix araddon#108
* Fix araddon#137
* Fix araddon#130
* Fix araddon#123
* Fix araddon#109
* Fix araddon#98
* Addresses bug in araddon#100 (comment)

Adds test cases to verify the following are already fixed:
* araddon#94

Uses a memory pool for parser struct and format []byte

Uses a new go 1.20 feature to avoid allocations for []byte to string conversions in allowable cases. go 1.20 also fixes a go bug for parsing fractional sec after a comma, so we can eliminate a workaround.

The remaining allocations are mostly unavoidable (e.g., time.Parse constructing a FixedZone location or part to strings.ToLower).

Results show an 89% reduction in allocated bytes for the big benchmark cases, and for some formats an allocation can be avoided entirely. There is also a resulting 26% speedup in ns/op.

Details:

BEFORE:

cpu: 12th Gen Intel(R) Core(TM) i7-1255U
BenchmarkShotgunParse-12                  19448 B/op      474 allocs/op
BenchmarkParseAny-12                       4736 B/op       42 allocs/op
BenchmarkBigShotgunParse-12             1075049 B/op    24106 allocs/op
BenchmarkBigParseAny-12                  241422 B/op     2916 allocs/op
BenchmarkBigParseIn-12                   244195 B/op     2984 allocs/op
BenchmarkBigParseRetryAmbiguous-12       260751 B/op     3715 allocs/op
BenchmarkShotgunParseErrors-12            67080 B/op     1679 allocs/op
BenchmarkParseAnyErrors-12                15903 B/op      200 allocs/op

AFTER:

BenchmarkShotgunParse-12                  19448 B/op      474 allocs/op
BenchmarkParseAny-12                         48 B/op        2 allocs/op
BenchmarkBigShotgunParse-12             1075049 B/op    24106 allocs/op
BenchmarkBigParseAny-12                   25394 B/op      824 allocs/op
BenchmarkBigParseIn-12                    28165 B/op      892 allocs/op
BenchmarkBigParseRetryAmbiguous-12        37880 B/op     1502 allocs/op
BenchmarkShotgunParseErrors-12            67080 B/op     1679 allocs/op
BenchmarkParseAnyErrors-12                 3851 B/op      117 allocs/op
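
The memory pool mentioned above can be sketched with `sync.Pool` from the standard library. This is an illustration under assumptions, not the package's code: the `parser` struct here is a hypothetical stand-in with a single field, and `layoutOf` is a made-up function that only shows the borrow, reuse, and return pattern.

```go
package main

import (
	"fmt"
	"sync"
)

// parser is a hypothetical stand-in for the package's internal parser state.
type parser struct {
	format []byte
}

// parserPool hands out parsers whose format buffer is reused across calls.
var parserPool = sync.Pool{
	New: func() any { return &parser{format: make([]byte, 0, 64)} },
}

// layoutOf borrows a parser, fills its buffer, and returns it to the pool.
func layoutOf(datestr string) string {
	p := parserPool.Get().(*parser)
	defer parserPool.Put(p)
	// Reset the length but keep the capacity, so no new buffer is allocated
	// when the input fits.
	p.format = append(p.format[:0], datestr...)
	// Real code would rewrite p.format into a time layout here.
	return string(p.format) // copy out before the buffer is reused
}

func main() {
	fmt.Println(layoutOf("2006-01-02 15:04:05"))
	fmt.Println(layoutOf("2006-01-02"))
}
```

The copy on return is deliberate: a pooled buffer may be handed to another caller right after `Put`, so nothing that outlives the call may alias it.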

Previously, for ambiguous date strings, it was always calling parse twice even when the first parse would have been successful. Refactor so that parsing isn't re-attempted unless the first parse fails ambiguously. Benchmark results show that with RetryAmbiguousDateWithSwap(true), it's now about 6.5% faster (ns/op) and reduces allocated bytes by 3.4%.

Optimize the common and special case where mm and dd are the same length, just swap in place. Avoids having to reparse the entire string. For this case, it's about 30% faster and reduces allocations by about 15%. This format is especially common, hence the reason to optimize for this case. Also fix the case for ambiguous date/time in the mm:dd:yyyy format.
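
The in-place swap can be illustrated on a `time` layout held in a byte slice. This is a sketch under the assumption that the parser keeps the layout as a mutable buffer with known month and day offsets; the function name and parameters are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// swapMonthDay swaps the month and day placeholders of a layout in place.
// moi and dayi are the start offsets of the two fields, and n is their
// shared length. It is only valid when both fields have the same length.
func swapMonthDay(layout []byte, moi, dayi, n int) {
	for k := 0; k < n; k++ {
		layout[moi+k], layout[dayi+k] = layout[dayi+k], layout[moi+k]
	}
}

func main() {
	layout := []byte("01/02/2006") // month-first guess for "13/02/2015"
	if _, err := time.Parse(string(layout), "13/02/2015"); err != nil {
		swapMonthDay(layout, 0, 3, 2) // layout becomes "02/01/2006"
		t, err := time.Parse(string(layout), "13/02/2015")
		fmt.Println(string(layout), t.Format("2006-01-02"), err)
	}
}
```

Because the two fields occupy the same number of bytes, no other offset in the layout moves, which is why the rest of the string does not need to be scanned again.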

Audit every stateDate so every unexpected alternative will fail. In the process, fixed some newly found bugs:
* Extend format yyyy-mon-dd to allow times to follow it. Also allow full month name.
* Allow full day name before month (e.g., Monday January 4th, 2017)

Relevant confirmatory test cases were added.

New option SimpleErrorMessages that avoids allocation in the error path. It's off by default to preserve backwards compatibility. Added benchmark BenchmarkBigParseAnyErrors that takes the big set of test cases, and injects errors to make them fail at pseudo-random places. This optimization speeds up the error path runtime by 4x and reduces error path allocation bytes by 13x!

arran4 · 2023-12-21T23:32:24Z

Great work @klondikedragon

jmdacruz · 2023-12-23T21:54:07Z

this is great work @klondikedragon! Now, this repo hasn't seen much movement in years, do you think we should start using a fork? should we use yours?

arran4 · 2023-12-24T00:26:10Z

I would vote that we use his, we should see if it qualifies for https://github.com/avelino/awesome-go

This is implemented now using the "skip" parser field, indicating to skip the first N characters. This also avoids a recursive parse in one more case (more efficient). This simplifies the state machine a little bit, while the rest of the code needs to properly account for the value of the skip field. Also allow whitespace prefix without penalty. Modify the test suite to pseudo-randomly add a weekday prefix to the formats that allow it (all except the purely numeric ones).
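
One way to picture the skip idea is a helper that computes how many leading bytes to ignore before the date proper begins. This is a sketch under assumptions: `skipPrefix`, `weekdayNames`, and the separator rules are hypothetical and not taken from the package.

```go
package main

import (
	"fmt"
	"strings"
)

// weekdayNames lists the prefixes this sketch recognizes (English only).
var weekdayNames = []string{
	"monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday",
	"mon", "tue", "wed", "thu", "fri", "sat", "sun",
}

// isSep reports whether c may separate a weekday name from the date.
func isSep(c byte) bool { return c == ' ' || c == ',' }

// skipPrefix returns the number of leading bytes to skip: any leading spaces,
// then an optional weekday name followed by at least one separator.
func skipPrefix(s string) int {
	i := 0
	for i < len(s) && s[i] == ' ' {
		i++
	}
	for _, wd := range weekdayNames {
		j := i + len(wd)
		if j < len(s) && strings.EqualFold(s[i:j], wd) && isSep(s[j]) {
			for j < len(s) && isSep(s[j]) {
				j++
			}
			return j
		}
	}
	return i
}

func main() {
	for _, in := range []string{"  Mon, 02 Jan 2006", "Monday January 4th, 2017", "2006-01-02"} {
		n := skipPrefix(in)
		fmt.Printf("skip=%d rest=%q\n", n, in[n:])
	}
}
```

With the offset known up front, the remaining parser can start at `s[n:]` without a second, recursive parse of the trimmed string.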

klondikedragon · 2023-12-24T03:16:59Z

In some further testing, I found that weekday prefixes only worked for some date formats, but not for others. So that is fixed now. As a side effect (benefit?), leading whitespace is now allowed/ignored.

Let's see if @araddon has feedback and/or is interested in merging this PR (it's a pretty big change and changes the philosophy a bit to have validation, and also makes the code a little more complex in favor of performance). The changes are large enough now that they could break backwards compatibility, so at the very least it should deserve a new major version IMO.

Although I don't want to fork something lightly, since we haven't heard any feedback from @araddon for a few years, it could definitely make sense to go ahead. If there is no comment after the holidays, I think it would make sense to go ahead and fork. This package is a key part of freeform date parsing in the "automatic structured field extraction" logic being built for IT Lightning (this new cloud-based log management platform I'm building). Given that's the case, the IT Lightning org would be willing to maintain the forked repo and work with community issues/contributions, since we're motivated to have best-in-class date recognition & parsing in the log ingestion pipeline. The license would remain the same of course.

The community contributions would help us improve our date parsing, we'd be motivated to put energy into it to keep our date parsing bug-free and comprehensive, and community use of the package might help us get a little exposure to devs/SREs who might become interested in our log management solution. So it should benefit everyone.

All feedback is welcome. What do y'all think of this proposed plan?

elliot40404 · 2024-01-03T20:11:36Z

great work @klondikedragon . How can i start using this?

klondikedragon · 2024-01-09T02:01:02Z

I'll go ahead and fork this package. I'm renaming the main branch as part of that.

klondikedragon · 2024-01-09T06:09:12Z

The fork is complete and published as v0.1.0 -- again, a huge thanks to @araddon for authoring and maintaining this package for so many years!

The fork is available using go get github.com/itlightning/dateparse -- issues and PRs are welcome.

@elliot40404 @arran4 @jmdacruz -- see what you think and how this updated package works! If this looks good and after incorporating feedback, I think I'll publish a v1.0.0 at some point soon. I'm also curious to get feedback on my log management project too, check out the site/discord if you're interested. Thanks!

arran4 and others added 30 commits February 15, 2023 15:40

added some actions

335c1f9

My issue

26d95ba

Error return value is not checked (errcheck)

5335e6f

field offsetlen is unused (unused)

14cb70e

S1021: should merge variable declaration with assignment on next line (gosimple)

2fb4c46

S1023: redundant break statement (gosimple)

5143d47

SA4006: this value of err is never used (staticcheck)

57a1767

Lint action out of date.

ad0ab84

Go mod tidy

a8e238d

Bug fixes.

e654ac7

More typo changes

cefe5b3

Another error

4345a38

S1023: redundant break statement (gosimple)

515cd81

Text should be lowercase

eabb56b

Added go releaser

c5b562a

Commented code

094aad3

Unnecessary bracket

53a8cbd

Test improvements.. I think

544b542

My addition last

c5a1edc

So people don't have to check the string they can use the new errors.Is function

268a690

Skip white space - to delete strategically

bf3a5b3

All of these did nothing

3a32cbb

The only required one.

b1fd89e

New failure - still white space

19ef6a2

Skip white space

8b765a5

Unused code

b0b5409

Another case.

01b692d

Fix ineffective break statements

465140d

Incorporate fix for dd.mm.yyyy format

3ebc8bc

Incorporates PR araddon#133 from https://github.com/mehanizm to fix araddon#129 Adds test cases to verify the following are already fixed: * araddon#105

klondikedragon added 13 commits December 16, 2023 10:48

Optimize checks for day of week and full month

a45d593

Reduces CPU usage on large benchmarks by ~2%-3% and prepares for future support of international month names.

Comprehensive time validation

89df0f8

Fix mm.dd.yyyy (time) format

7a3c923

Add support for dd-month-year format

65e6e8d

Update example and README.md with new formats

4f7e854

Audited all test cases to make sure an example was listed for all known formats.

Fix ambiguous mm/dd that start with weekday

4d76f59

Options were not being properly passed to recursive parseTime call.

Update benchmark results

5cb2793

Update go doc

9f7bdf7

arran4 mentioned this pull request Dec 21, 2023

Supported extra whitespaces where I needed them + many other changes #151

Open

klondikedragon added 2 commits December 30, 2023 01:10

Unify/fix timezone offset/name states

c4de5d4

* Merge duplicate states (fixes lots of edge cases)
* Support for +00:00 is consistent with +0000 now
* Support (timezone description) after any offset/name
* Update tests to cover positive/negative cases
* Update example with new supported formats
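
To illustrate what treating +00:00 and +0000 consistently can look like, here is a minimal sketch that normalizes both spellings to a number of seconds east of UTC. The function `parseOffset` is hypothetical and independent of the package's state machine.

```go
package main

import "fmt"

// parseOffset accepts "+hhmm", "-hhmm", "+hh:mm", or "-hh:mm" and returns the
// offset in seconds east of UTC. Any other shape is rejected.
func parseOffset(s string) (int, error) {
	orig := s
	if len(s) == 6 && s[3] == ':' {
		s = s[:3] + s[4:] // "+hh:mm" becomes "+hhmm"
	}
	if len(s) != 5 || (s[0] != '+' && s[0] != '-') {
		return 0, fmt.Errorf("invalid offset %q", orig)
	}
	for i := 1; i < 5; i++ {
		if s[i] < '0' || s[i] > '9' {
			return 0, fmt.Errorf("invalid offset %q", orig)
		}
	}
	hh := int(s[1]-'0')*10 + int(s[2]-'0')
	mm := int(s[3]-'0')*10 + int(s[4]-'0')
	if mm > 59 {
		return 0, fmt.Errorf("invalid offset minutes in %q", orig)
	}
	secs := hh*3600 + mm*60
	if s[0] == '-' {
		secs = -secs
	}
	return secs, nil
}

func main() {
	for _, in := range []string{"+02:00", "+0200", "-0700", "GMT"} {
		secs, err := parseOffset(in)
		fmt.Println(in, secs, err)
	}
}
```

Routing both spellings through one code path is the same idea as merging duplicate states: a fix or validation rule added once applies to every variant.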

Cleanup handling of TZ name parsing

d5b3c60

Fully support the format where a TZ name is in parentheses after the time (and possibly after an offset). This fixes the broken case where a 4 character TZ name was in parentheses after a time.

klondikedragon deleted the branch araddon:master January 9, 2024 01:59

klondikedragon closed this Jan 9, 2024

klondikedragon deleted the master branch January 9, 2024 01:59

klondikedragon restored the master branch January 9, 2024 02:02

klondikedragon reopened this Jan 9, 2024

klondikedragon mentioned this pull request Jan 9, 2024

Prepare v0.1.0 release itlightning/dateparse#1

Merged
