Generate measurements with random station names #125

mtopolnik · 2024-01-05T09:45:28Z

Contributes a new way to create measurements.txt, using 10,000 random station names with length from 1 to 100.

To generate names with a realistic distribution, it uses a public list of city names. It concatenates all the names into one long string, and then uses this as a "source of name randomness". The distribution of generated name lengths is biased towards short names, but with quite frequent outliers.
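A minimal sketch of the approach described above, not the code from this PR: it joins the city names into one long string, then cuts random slices out of it as station names. The class name `StationNameSketch`, the squared-uniform length bias, and the attempt limit are assumptions made for illustration; the actual generator may pick lengths quite differently.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

final class StationNameSketch {
    // Maximum name length in code points, as stated in the PR description.
    static final int MAX_NAME_LENGTH = 100;

    /** Cuts `count` distinct names out of one long string built from the city names. */
    static Set<String> generate(List<String> cityNames, int count, long seed) {
        int[] source = String.join("", cityNames).codePoints().toArray();
        if (source.length == 0) {
            throw new IllegalArgumentException("no source text");
        }
        int maxLength = Math.min(MAX_NAME_LENGTH, source.length);
        Random random = new Random(seed);
        Set<String> names = new LinkedHashSet<>(); // the set drops duplicate names
        long attemptsLeft = 1_000L * count;        // guard against a source that is too small
        while (names.size() < count) {
            if (attemptsLeft-- == 0) {
                throw new IllegalStateException("source text too small for " + count + " names");
            }
            // Assumption: squaring a uniform value biases lengths towards short names
            // while still producing long outliers fairly often.
            double u = random.nextDouble();
            int length = 1 + (int) (u * u * maxLength);
            int start = random.nextInt(source.length - length + 1);
            String name = new String(source, start, length).strip();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> cities = List.of("Zagreb", "S\u00e3o Paulo", "Reykjav\u00edk", "Ulaanbaatar");
        generate(cities, 10, 42L).forEach(System.out::println);
    }
}
```

Working on code points rather than `char` values keeps a slice from splitting a surrogate pair in the middle.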

Fixes #156

gunnarmorling · 2024-01-05T17:17:42Z

Hey @mtopolnik, you mentioned this would fail some of the existing contenders. Can you add a measurements.txt and the corresponding .out file to src/test/resources and report back the result of ./test_all.sh? Thx!

mtopolnik · 2024-01-05T20:38:21Z

I ran it... it's not a pretty sight :)

PASS Ujjwalbharti
PASS anandmattikopp
FAIL armandino
PASS artsiomkorzun
PASS baseline
FAIL bjhara
PASS criccomini
FAIL davecom
FAIL ddimtirov
FAIL deemkeen
PASS ebarlas
FAIL filiphr
PASS fragmede
PASS itaske
PASS jgrateron
PASS khmarbaise
PASS kuduwa-keshavram
FAIL lawrey
FAIL moysesb
FAIL nstng
PASS padreati
FAIL palmr
FAIL rby
PASS richardstartin
FAIL royvanrijn
FAIL seijikun
FAIL spullara
FAIL truelive
FAIL twobiers

I had to disable fatroom because it would just hang with zero CPU.

datdenkikniet · 2024-01-05T20:59:05Z

Where does the data for the CSV file come from? Would be good (and kind) to add attribution, unless you've come up with it yourself :)

mtopolnik · 2024-01-05T21:01:54Z

Attribution provided here (as a code comment). Does it need to be more visible? If I put it into the CSV itself, I'll need more code to ignore those lines.

datdenkikniet · 2024-01-05T21:04:16Z

Ah, sorry, I had missed that! IMO preferable to put it in the CSV itself, should be relatively easy if you just treat all lines starting with # (or similar unused character) as comment.
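As a rough sketch of that suggestion (the class name `CsvCommentSketch` and the sample rows are made up for illustration, and this is not the reader used in the repository), skipping `#` lines only takes a filter:

```java
import java.util.List;
import java.util.stream.Collectors;

final class CsvCommentSketch {
    /** Returns the data rows of a CSV text, skipping blank lines and lines starting with '#'. */
    static List<String> dataLines(String csvText) {
        return csvText.lines()
                .filter(line -> !line.isBlank())
                .filter(line -> !line.startsWith("#"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Illustrative content only; the rows are not taken from the real file.
        String csv = "# Attribution and license note goes here\n"
                + "Tokyo;35.7\n"
                + "Jakarta;-6.2\n";
        dataLines(csv).forEach(System.out::println);
    }
}
```

This only works as long as no station name starts with `#`, which is why an otherwise unused character is needed.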

Mostly important to license it correctly because it comes from a company, and wouldn't want them to come ruin the party (doubt they would, but you never know)! And they're relatively explicit about the data being licensed under Creative Commons Attribution 4.0 and requiring attribution.

gunnarmorling · 2024-01-05T21:17:36Z

> I ran it... it's not a pretty sight :)

Oh wow, that's interesting. Curious why so many are failing. E.g. @royvanrijn, @spullara, @Palmr, any thoughts about it? Before committing this to the test suite I think we need to better understand the reasons. Admittedly, the constraints were not as well-defined initially as they'd ideally have been, and I want to be careful not to move the goalposts (too much, anyways ;).

That being said, one thing every impl definitely should satisfy is support for the maximum station name length (100); I've logged #156 for adding this to the test suite. I think that's rather uncontroversial.

mtopolnik · 2024-01-05T21:24:06Z

The newly added test file contains a worst-case name that has exactly 100 characters, but they are all the "scariest"-looking characters I could find in a larger sample of names. This does bring up the point that was made elsewhere -- in general, Unicode characters have unbounded length due to the combining codepoints.

spullara · 2024-01-05T21:30:26Z

I took the maximum station name length of 100 as bytes (or some larger arbitrary number). If we choose that it will be easier to test and make sense of it.

gunnarmorling · 2024-01-05T21:39:35Z

I think the wording is quite unambiguous in that regard:

> Station name: non null UTF-8 string of min length 1 character and max length 100 characters

That would be UTF-8 characters, not bytes.

That being said, that limit was chosen rather arbitrarily, and I don't think it changes anything substantial about the nature of the challenge or its goals (which are not to solve all character encoding oddities in the world). So IMO we wouldn't compromise anything by effectively limiting it to 100 bytes, e.g. by testing 100 ASCII (one-byte) characters and maybe 50 two-byte chars, and considering implementations which satisfy that as valid. Thoughts?
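To make the difference between characters and bytes concrete, here is a small sketch (the class and method names are hypothetical) that compares the code point count of a name with its UTF-8 byte length for the cases mentioned above, plus a four-byte case:

```java
import java.nio.charset.StandardCharsets;

final class NameLengthSketch {
    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of Unicode code points in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String ascii = "a".repeat(100);                // 100 one-byte characters
        String twoByte = "\u00e9".repeat(50);          // 50 two-byte characters (e with acute)
        String fourByte = "\uD83D\uDE00".repeat(100);  // 100 four-byte characters (an emoji)
        for (String s : new String[] {ascii, twoByte, fourByte}) {
            System.out.println(codePoints(s) + " code points, " + utf8Bytes(s) + " bytes");
        }
    }
}
```

The first two names both fit into 100 bytes, while 100 four-byte characters need 400 bytes, so a buffer sized for "100 characters" read as bytes would be too small for the third.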

mtopolnik · 2024-01-05T21:41:57Z

I agree, let's put a 100 byte limit on it.
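One way a generator could enforce such a limit, sketched under the assumption that names are cut at a code point boundary (the helper `truncateToBytes` is hypothetical and not taken from this PR):

```java
import java.nio.charset.StandardCharsets;

final class ByteLimitSketch {
    /** Returns the longest prefix of s whose UTF-8 encoding fits into maxBytes. */
    static String truncateToBytes(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // UTF-8 encoded size of one code point.
            int cpBytes = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + cpBytes > maxBytes) {
                break; // the next code point would exceed the limit
            }
            bytes += cpBytes;
            i += Character.charCount(cp); // advance by 1 or 2 chars, never splitting a pair
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        String name = "a".repeat(99) + "\u00e9"; // 99 bytes plus one two-byte character
        String limited = truncateToBytes(name, 100);
        System.out.println(limited.getBytes(StandardCharsets.UTF_8).length + " bytes");
    }
}
```

Cutting at a byte offset instead could leave half of a multi-byte character at the end of the name, which would no longer be valid UTF-8.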

mtopolnik · 2024-01-05T22:04:36Z

After this adjustment, test results have improved:

PASS Ujjwalbharti
PASS anandmattikopp
FAIL armandino
PASS artsiomkorzun
PASS baseline
PASS bjhara
PASS criccomini
PASS davecom
FAIL ddimtirov
PASS deemkeen
PASS ebarlas
PASS filiphr
PASS fragmede
PASS itaske
PASS jgrateron
PASS khmarbaise
PASS kuduwa-keshavram
FAIL lawrey
FAIL moysesb
FAIL nstng
PASS padreati
PASS palmr
PASS rby
PASS richardstartin
PASS royvanrijn
FAIL seijikun
PASS spullara
PASS truelive
FAIL twobiers

gunnarmorling · 2024-01-05T22:45:12Z

Hey @twobiers and @seijikun, could you take a look at this additional test we're considering adding and why it currently fails with your implementations? Thx!

twobiers · 2024-01-05T22:51:37Z

Sure, for me just increase the buffer size.

Index: src/main/java/dev/morling/onebrc/CalculateAverage_twobiers.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/main/java/dev/morling/onebrc/CalculateAverage_twobiers.java b/src/main/java/dev/morling/onebrc/CalculateAverage_twobiers.java
--- a/src/main/java/dev/morling/onebrc/CalculateAverage_twobiers.java	(revision 9cef402a3e2246d28091861e17d635c7fc214616)
+++ b/src/main/java/dev/morling/onebrc/CalculateAverage_twobiers.java	(date 1704494949716)
@@ -178,7 +178,7 @@
         var measurements = new ArrayList<Measurement>(100_000);
 
         final int limit = byteBuffer.limit();
-        final byte[] buffer = new byte[64];
+        final byte[] buffer = new byte[128];
 
         while (byteBuffer.position() < limit) {
             final int start = byteBuffer.position();
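
To illustrate why a 64-byte scratch buffer breaks on the new test, here is a simplified sketch. It is not the parsing loop from CalculateAverage_twobiers.java; the class `NameBufferSketch` and the method `readName` are hypothetical and only assume that name bytes are copied into a fixed-size array until the `;` separator.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class NameBufferSketch {
    /** Copies bytes up to the ';' separator into a scratch buffer and decodes the name. */
    static String readName(ByteBuffer input, int capacity) {
        byte[] buffer = new byte[capacity];
        int length = 0;
        while (input.hasRemaining()) {
            byte b = input.get();
            if (b == ';') {
                break;
            }
            // Throws ArrayIndexOutOfBoundsException once the name is longer than the buffer.
            buffer[length++] = b;
        }
        return new String(buffer, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] line = ("x".repeat(100) + ";12.3\n").getBytes(StandardCharsets.UTF_8);
        System.out.println(readName(ByteBuffer.wrap(line), 128).length()); // fits
        try {
            readName(ByteBuffer.wrap(line), 64);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("64-byte buffer overflowed");
        }
    }
}
```

With names limited to 100 bytes, any capacity of at least 100 is enough for the name itself, so 128 leaves some headroom.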

gunnarmorling · 2024-01-05T23:05:41Z

> Sure, for me just increase the buffer size.

Ah yeah, nice. Could you PR this, please? Thx.

seijikun · 2024-01-05T23:14:29Z

It's a bug in the fickle SIMD part of my implementation. Was in the process of refactoring that anyway.
So nothing unexpected. Feel free to proceed.

gunnarmorling · 2024-01-05T23:46:33Z

Ok, cool. Let's do this then. Could someone perhaps log issues for each of those failing ones, tagging the respective authors? Thx!

gunnarmorling · 2024-01-06T08:42:41Z

@mtopolnik, I see you're still doing changes here. Let me know when it's good to go. Thx!

mtopolnik · 2024-01-06T08:44:03Z

I just had a few ideas when I woke up in the middle of the night :) I think this should be it now, at least for a start.

mtopolnik · 2024-01-06T09:13:36Z

One hint, don't know if it's useful: I realized I got a ton of hanging processes from running test_all.sh because some of the submissions never complete. This could hurt the Hetzner instance and impact measurements.

gunnarmorling · 2024-01-06T09:14:11Z

Haha, ok :) One more question on the provenance of the data: is this derived from the "Basic" data set at https://simplemaps.com/data/world-cities, which is published under Creative Commons Attribution 4.0? If so, can you add these lines to the top of the file:

# Adapted from https://simplemaps.com/data/world-cities
# Licensed under Creative Commons Attribution 4.0 (https://creativecommons.org/licenses/by/4.0/)

Then it's good to go. Thx!

gunnarmorling · 2024-01-06T09:35:39Z

Merging, thanks a lot!

mtopolnik force-pushed the main branch from 3868ff2 to 3a43bcf on January 5, 2024 11:20

mtopolnik force-pushed the main branch 2 times, most recently from 0057011 to f1dda12 on January 5, 2024 20:02

datdenkikniet mentioned this pull request Jan 5, 2024

Should the 1 billion row file be deterministic? #35

Open

mtopolnik force-pushed the main branch from e461749 to 12480c4 on January 5, 2024 21:39

mtopolnik force-pushed the main branch from 02965eb to 9cef402 on January 5, 2024 22:03

twobiers added a commit to twobiers/onebrc that referenced this pull request Jan 5, 2024

Adjust buffer size to solve test failure in gunnarmorling#125

2d6afd0

twobiers added a commit to twobiers/onebrc that referenced this pull request Jan 5, 2024

Adjust buffer size to solve test failure in gunnarmorling#125

1e2553f

gunnarmorling pushed a commit that referenced this pull request Jan 5, 2024

Adjust buffer size to solve test failure in #125

c24bcac

seijikun pushed a commit to seijikun/1brc that referenced this pull request Jan 6, 2024

Fix bug caused by new unit-test introduced with gunnarmorling#125

2c4b38a

seijikun added a commit to seijikun/1brc that referenced this pull request Jan 6, 2024

Fix bug caused by new unit-test introduced with gunnarmorling#125

420d66b

seijikun added a commit to seijikun/1brc that referenced this pull request Jan 6, 2024

seijikun: Fix new unit-test introduced with gunnarmorling#125

3c6e78f

seijikun added a commit to seijikun/1brc that referenced this pull request Jan 6, 2024

seijikun: Fix new unit-test introduced with gunnarmorling#125

c9e37d5

Generate measurements with random names

c91c6c0

Name length goes from 1 to 100.

mtopolnik added 4 commits January 6, 2024 07:51

Eliminate duplicate station names

02f4b8d

Add test sample with a worst-case UTF-8 name

7acbef6

Move attribution into weather_stations.csv

55dfc4d

Limit names to 100 bytes

bb330d0

mtopolnik force-pushed the main branch from 9cef402 to bb330d0 on January 6, 2024 06:51

mtopolnik added 2 commits January 6, 2024 08:25

Improve name generation

a65b0ac

One more sample in test file

1b5d6bf

More detailed attribution

3033db6

gunnarmorling merged commit 35b9099 into gunnarmorling:main Jan 6, 2024

gunnarmorling pushed a commit that referenced this pull request Jan 6, 2024

seijikun: Fix new unit-test introduced with #125

093bd35

dmitry-midokura pushed a commit to dmitry-midokura/1brc that referenced this pull request Jan 13, 2024

Adjust buffer size to solve test failure in gunnarmorling#125

7e38917

dmitry-midokura pushed a commit to dmitry-midokura/1brc that referenced this pull request Jan 13, 2024

seijikun: Fix new unit-test introduced with gunnarmorling#125

0d7f77e
