Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate measurements with random station names #125

Merged
merged 8 commits into from
Jan 6, 2024

Conversation

mtopolnik
Copy link
Contributor

@mtopolnik mtopolnik commented Jan 5, 2024

Contributes a new way to create measurements.txt, using 10,000 random station names with length from 1 to 100.

To generate names with a realistic distribution, it uses a public list of city names. It concatenates all the names into one long string, and then uses this as a "source of name randomness". The distribution of generated name lengths is biased towards short names, but with quite frequent outliers.

Fixes #156

@gunnarmorling
Copy link
Owner

Hey @mtopolnik, you mentioned this would fail some of the existing contenders. Can you add a measurements.txt and corresponding out to src/test/resources and report back the result of ./test_all.sh? Thx!

@mtopolnik
Copy link
Contributor Author

I ran it... it's not a pretty sight :)

PASS Ujjwalbharti
PASS anandmattikopp
FAIL armandino
PASS artsiomkorzun
PASS baseline
FAIL bjhara
PASS criccomini
FAIL davecom
FAIL ddimtirov
FAIL deemkeen
PASS ebarlas
FAIL filiphr
PASS fragmede
PASS itaske
PASS jgrateron
PASS khmarbaise
PASS kuduwa-keshavram
FAIL lawrey
FAIL moysesb
FAIL nstng
PASS padreati
FAIL palmr
FAIL rby
PASS richardstartin
FAIL royvanrijn
FAIL seijikun
FAIL spullara
FAIL truelive
FAIL twobiers

I had to disable fatroom because it would just hang with zero CPU.

@datdenkikniet
Copy link

Where does the data for the CSV file come from? Would be good (and kind) to add attribution, unless you've come up with it yourself :)

@mtopolnik
Copy link
Contributor Author

Attribution provided here (as a code comment). Does it need to be more visible? If I put it into the CSV itself, I'll need more code to ignore those lines.

@datdenkikniet
Copy link

datdenkikniet commented Jan 5, 2024

Ah, sorry, I had missed that! IMO preferable to put it in the CSV itself, should be relatively easy if you just treat all lines starting with # (or similar unused character) as comment.

Mostly important to license it correctly because it comes from a company, and wouldn't want them to come ruin the party (doubt they would, but you never know)! And they're relatively explicit about the data being licensed under Creative Commons Attribution 4.0 and requiring attribution.

@gunnarmorling
Copy link
Owner

I ran it... it's not a pretty sight :)

Oh wow, that's interesting. Curious why so many are failing. E.g. @royvanrijn, @spullara, @Palmr, any thoughts about it? Before committing this to the test suite I think we need to better understand the reasons. Admittedly, the constraints were not as well-defined initially as they'd ideally have been, and I want to be cautious of not moving goal posts (too much, anyways ;).

That being said, one thing every impl definitely should satisfy is to support the maximum station name length (100), I've logged #156 for adding this to the test suite. I think that's rather uncontroversial.

@mtopolnik
Copy link
Contributor Author

The newly added test file contains a worst-case name that has exactly 100 characters, but they are all the "scariest"-looking characters I could find in a larger sample of names. This does bring up the point that was made elsewhere -- in general, Unicode characters have unbounded length due to the combining codepoints.

@spullara
Copy link
Contributor

spullara commented Jan 5, 2024

I took the maximum station name length of 100 as bytes (or some larger arbitrary number). If we choose that it will be easier to test and make sense of it.

@gunnarmorling
Copy link
Owner

I think the wording is quite unambigious in that regard:

Station name: non null UTF-8 string of min length 1 character and max length 100 characters

That would be UTF-8 characters, not bytes.

That being said, that limit was chosen rather arbitrarily, and I don't think it changes anything substantial about the nature of the challenge or its goals (which are not to solve all character encoding oddities in the world). So IMO we wouldn't compromise by effectively limiting it to 100 bytes indeed, e.g. by testing a 100 ASCII (one byte) characters and maybe 50 two-byte chars, and consider implementations which satisfy that as valid. Thoughts?

@mtopolnik
Copy link
Contributor Author

I agree, let's put a 100 byte limit on it.

@mtopolnik
Copy link
Contributor Author

After this adjustment, test results have improved:

PASS Ujjwalbharti
PASS anandmattikopp
FAIL armandino
PASS artsiomkorzun
PASS baseline
PASS bjhara
PASS criccomini
PASS davecom
FAIL ddimtirov
PASS deemkeen
PASS ebarlas
PASS filiphr
PASS fragmede
PASS itaske
PASS jgrateron
PASS khmarbaise
PASS kuduwa-keshavram
FAIL lawrey
FAIL moysesb
FAIL nstng
PASS padreati
PASS palmr
PASS rby
PASS richardstartin
PASS royvanrijn
FAIL seijikun
PASS spullara
PASS truelive
FAIL twobiers

@gunnarmorling
Copy link
Owner

Hey @twobiers and @seijikun, could you take a look at this additional test we're considering to add and why it currently fails with your implementations? Thx!

@twobiers
Copy link
Contributor

twobiers commented Jan 5, 2024

Sure, for me just increase the buffer size.

Index: src/main/java/dev/morling/onebrc/CalculateAverage_twobiers.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/main/java/dev/morling/onebrc/CalculateAverage_twobiers.java b/src/main/java/dev/morling/onebrc/CalculateAverage_twobiers.java
--- a/src/main/java/dev/morling/onebrc/CalculateAverage_twobiers.java	(revision 9cef402a3e2246d28091861e17d635c7fc214616)
+++ b/src/main/java/dev/morling/onebrc/CalculateAverage_twobiers.java	(date 1704494949716)
@@ -178,7 +178,7 @@
         var measurements = new ArrayList<Measurement>(100_000);
 
         final int limit = byteBuffer.limit();
-        final byte[] buffer = new byte[64];
+        final byte[] buffer = new byte[128];
 
         while (byteBuffer.position() < limit) {
             final int start = byteBuffer.position();

@gunnarmorling
Copy link
Owner

Sure, for me just increase the buffer size.

Ah yeah, nice. Could you PR this, please? Thx.

twobiers added a commit to twobiers/onebrc that referenced this pull request Jan 5, 2024
twobiers added a commit to twobiers/onebrc that referenced this pull request Jan 5, 2024
@seijikun
Copy link
Contributor

seijikun commented Jan 5, 2024

It's a bug in the fickle SIMD part of my implementation. Was in the process of refactoring that anyway.
So nothing unexpected. Feel free to proceed.

@gunnarmorling
Copy link
Owner

Ok, cool. Let's do this then. Could someone perhaps log issues for each of those failing ones, tagging the respective authors? Thx!

seijikun pushed a commit to seijikun/1brc that referenced this pull request Jan 6, 2024
seijikun added a commit to seijikun/1brc that referenced this pull request Jan 6, 2024
seijikun added a commit to seijikun/1brc that referenced this pull request Jan 6, 2024
seijikun added a commit to seijikun/1brc that referenced this pull request Jan 6, 2024
Name length goes from 1 to 100.
@gunnarmorling
Copy link
Owner

@mtopolnik, I see you're still doing changes here. Let me know when it's good to go. Thx!

@mtopolnik
Copy link
Contributor Author

I just had a few ideas when I woke up in the middle of the night :) I think this should be it now, at least for a start.

@mtopolnik
Copy link
Contributor Author

One hint, don't know if it's useful: I realized I got a ton of hanging processes from running test_all.sh because some of the submissions never complete. This could hurt the Hetzner instance and impact measurements.

@gunnarmorling
Copy link
Owner

Haha, ok :) One more question on the provenience of the data, is this derived from the "Basic" data set at https://simplemaps.com/data/world-cities, which is published under Creative Commons Attribution 4.0? If so, can you add this line to the top of the file:

# Adapted from https://simplemaps.com/data/world-cities https://creativecommons.org/licenses/by/4.0/
# Licensed under Creative Commons Attribution 4.0 (https://creativecommons.org/licenses/by/4.0/

Then it's good to go. Thx!

@gunnarmorling
Copy link
Owner

Merging, thanks a lot!

@gunnarmorling gunnarmorling merged commit 35b9099 into gunnarmorling:main Jan 6, 2024
dmitry-midokura pushed a commit to dmitry-midokura/1brc that referenced this pull request Jan 13, 2024
dmitry-midokura pushed a commit to dmitry-midokura/1brc that referenced this pull request Jan 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add test case for station name of length 100
6 participants