CreateMeasurements4 proposal #347

Open
lehuyduc opened this issue Jan 12, 2024 · 6 comments

@lehuyduc

Use case: simulate new data appearing mid-stream (for example, when a new weather station is built) to test whether a program correctly handles keys that first appear late in the input.

Summary: for a 1B-row file, the first 500M lines use only 2,500 keys; the remaining 500M lines draw from the full 10K keys.

/*
 *  Copyright 2023 The original authors
 *
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 */
package dev.morling.onebrc;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.concurrent.ThreadLocalRandom;

public class CreateMeasurements4 {

    public static final int MAX_NAME_LEN = 100;
    public static final int KEYSET_SIZE = 10_000;

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("Usage: create_measurements4.sh <number of records to create>");
            System.exit(1);
        }
        int size = 0;
        try {
            size = Integer.parseInt(args[0]);
        }
        catch (NumberFormatException e) {
            System.out.println("Invalid value for <number of records to create>");
            System.out.println("Usage: create_measurements4.sh <number of records to create>");
            System.exit(1);
        }
        final var weatherStations = generateWeatherStations();
        final var start = System.currentTimeMillis();
        final var rnd = ThreadLocalRandom.current();
        try (var out = new BufferedWriter(new FileWriter("measurements.txt"))) {
            // First half: draw only from the first quarter of the key set (2,500 of 10,000 stations).
            int mid = size / 2;
            for (int i = 1; i <= mid; i++) {
                var station = weatherStations.get(rnd.nextInt(KEYSET_SIZE / 4));
                double temp = rnd.nextGaussian(station.avgTemp, 7.0);
                out.write(station.name);
                out.write(';');
                out.write(Double.toString(Math.round(temp * 10.0) / 10.0));
                out.newLine();
                if (i % 50_000_000 == 0) {
                    System.out.printf("Wrote %,d measurements in %,d ms%n", i, System.currentTimeMillis() - start);
                }
            }

            // Second half: draw from the full key set, so previously unseen stations start to appear.
            for (int i = mid + 1; i <= size; i++) {
                var station = weatherStations.get(rnd.nextInt(KEYSET_SIZE));
                double temp = rnd.nextGaussian(station.avgTemp, 7.0);
                out.write(station.name);
                out.write(';');
                out.write(Double.toString(Math.round(temp * 10.0) / 10.0));
                out.newLine();
                if (i % 50_000_000 == 0) {
                    System.out.printf("Wrote %,d measurements in %,d ms%n", i, System.currentTimeMillis() - start);
                }
            }
        }
    }

    record WeatherStation(String name, float avgTemp) {
    }

    private static ArrayList<WeatherStation> generateWeatherStations() throws Exception {
        // Use a public list of city names and concatenate them all into a long string,
        // which we'll use as a "source of city name randomness"
        var bigName = new StringBuilder(1 << 20);
        try (var rows = new BufferedReader(new FileReader("data/weather_stations.csv"))) {
            skipComments(rows);
            while (true) {
                var row = rows.readLine();
                if (row == null) {
                    break;
                }
                bigName.append(row, 0, row.indexOf(';'));
            }
        }
        final var weatherStations = new ArrayList<WeatherStation>();
        final var names = new HashSet<String>();
        var minLen = Integer.MAX_VALUE;
        var maxLen = Integer.MIN_VALUE;
        try (var rows = new BufferedReader(new FileReader("data/weather_stations.csv"))) {
            skipComments(rows);
            final var nameSource = new StringReader(bigName.toString());
            final var buf = new char[MAX_NAME_LEN];
            final var rnd = ThreadLocalRandom.current();
            final double yOffset = 4;
            final double factor = 2500;
            final double xOffset = 0.372;
            final double power = 7;
            for (int i = 0; i < KEYSET_SIZE; i++) {
                var row = rows.readLine();
                if (row == null) {
                    break;
                }
                // Use a 7th-order curve to simulate the name length distribution.
                // It gives us mostly short names, but with large outliers.
                var nameLen = (int) (yOffset + factor * Math.pow(rnd.nextDouble() - xOffset, power));
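                var count = nameSource.read(buf, 0, nameLen);
                // A short read (or -1) means the source ran out; fail rather than reuse stale buffer contents.
                if (count < nameLen) {
                    throw new Exception("Name source exhausted");
                }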
                var nameBuf = new StringBuilder(nameLen);
                nameBuf.append(buf, 0, nameLen);
                if (Character.isWhitespace(nameBuf.charAt(0))) {
                    nameBuf.setCharAt(0, readNonSpace(nameSource));
                }
                if (Character.isWhitespace(nameBuf.charAt(nameBuf.length() - 1))) {
                    nameBuf.setCharAt(nameBuf.length() - 1, readNonSpace(nameSource));
                }
                var name = nameBuf.toString();
                while (names.contains(name)) {
                    nameBuf.setCharAt(rnd.nextInt(nameBuf.length()), readNonSpace(nameSource));
                    name = nameBuf.toString();
                }
                int actualLen;
                while (true) {
                    actualLen = name.getBytes(StandardCharsets.UTF_8).length;
                    if (actualLen <= MAX_NAME_LEN) {
                        break;
                    }
                    nameBuf.deleteCharAt(nameBuf.length() - 1);
                    if (Character.isWhitespace(nameBuf.charAt(nameBuf.length() - 1))) {
                        nameBuf.setCharAt(nameBuf.length() - 1, readNonSpace(nameSource));
                    }
                    name = nameBuf.toString();
                }
                if (name.indexOf(';') != -1) {
                    throw new Exception("Station name contains a semicolon!");
                }
                names.add(name);
                minLen = Integer.min(minLen, actualLen);
                maxLen = Integer.max(maxLen, actualLen);
                var lat = Float.parseFloat(row.substring(row.indexOf(';') + 1));
                // Guesstimate mean temperature using cosine of latitude
                var avgTemp = (float) (30 * Math.cos(Math.toRadians(lat))) - 10;
                weatherStations.add(new WeatherStation(name, avgTemp));
            }
        }
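        // Report the number of stations actually generated; the loop stops early if the CSV has fewer rows.
        System.out.format("Generated %,d station names with length from %,d to %,d%n", weatherStations.size(), minLen, maxLen);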
        return weatherStations;
    }

    private static void skipComments(BufferedReader rows) throws IOException {
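        // Note: this also consumes the first non-comment row. The null check avoids an NPE at end of file.
        String line;
        while ((line = rows.readLine()) != null && line.startsWith("#")) {
        }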
    }

    private static char readNonSpace(StringReader nameSource) throws IOException {
        while (true) {
            var n = nameSource.read();
            if (n == -1) {
                throw new IOException("Name source exhausted");
            }
            var ch = (char) n;
            if (!Character.isWhitespace(ch)) {
                return ch;
            }
        }
    }
}
@lehuyduc
Author

lehuyduc commented Jan 12, 2024

@AlexanderYastrebov @gunnarmorling could you check please? Thanks!

@RagnarGrootKoerkamp
Contributor

If you're going this way, you should also add some tests where a few cities only appear once at random places in the input.
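
A minimal sketch of what such a test input could look like, assuming each "rare" station is placed on exactly one randomly chosen line and every other line uses a randomly chosen common station. The class `RareStationSketch` and the method `generateNames` are hypothetical names, not part of the repository; temperatures are left out and would be written the same way as in the generator above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper, not part of the 1BRC repository.
public class RareStationSketch {

    /**
     * Returns `total` station names. Each rare station appears exactly once, at a
     * random position; every other position holds a randomly chosen common station.
     * Assumes `common` is non-empty whenever total > rare.size().
     */
    public static List<String> generateNames(List<String> common, List<String> rare, int total, Random rnd) {
        if (rare.size() > total) {
            throw new IllegalArgumentException("more rare stations than lines");
        }
        // Pick one distinct random line index per rare station.
        Map<Integer, String> rareAt = new HashMap<>();
        for (String name : rare) {
            int pos;
            do {
                pos = rnd.nextInt(total);
            } while (rareAt.containsKey(pos));
            rareAt.put(pos, name);
        }
        // Fill every line: the rare name if one was assigned, otherwise a common one.
        List<String> out = new ArrayList<>(total);
        for (int i = 0; i < total; i++) {
            String rareName = rareAt.get(i);
            out.add(rareName != null ? rareName : common.get(rnd.nextInt(common.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        var names = generateNames(List.of("Hamburg", "Oslo", "Lima"), List.of("Rare-A", "Rare-B"), 20, new Random(42));
        names.forEach(System.out::println);
    }
}
```

For a full-size file the names would be streamed to the writer rather than collected in a list; the list is only there to keep the sketch easy to check.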

@lehuyduc
Author

Please don't take it the wrong way 🥲 I really want to see what creative solutions you come up with for the hash map, which is why I set it to exactly 500M instead of using rand()

@RagnarGrootKoerkamp
Contributor

Oh don't worry :)
But if you put it to any kind of fixed number I will exploit it :P For example I could just scan names from the back.

Another idea is to split the input into K chunks, and use each city only in one / a few of the chunks.

Or of course you could make it completely linear, first 1000 lines for city 1, then 1000 lines for city 2, ...
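
The chunked and linear layouts can be sketched with one index calculation, assuming the stations are partitioned evenly so that each station is used in exactly one chunk. `ChunkedStationSketch` and `stationFor` are hypothetical names, and the sketch requires `chunks <= stations`. Setting `chunks` equal to `stations` gives the fully linear layout (one station per block of lines).

```java
import java.util.Random;

// Hypothetical helper, not part of the 1BRC repository.
public class ChunkedStationSketch {

    /**
     * Returns the station index for 0-based line `line` out of `totalLines`, when the
     * input is split into `chunks` equal parts and the station indices 0..stations-1
     * are partitioned so that each station is used in exactly one chunk.
     * Requires 1 <= chunks <= stations and 0 <= line < totalLines.
     */
    public static int stationFor(long line, long totalLines, int stations, int chunks, Random rnd) {
        int chunk = (int) (line * chunks / totalLines);          // chunk containing this line
        int from = (int) ((long) chunk * stations / chunks);       // first station of the chunk
        int to = (int) ((long) (chunk + 1) * stations / chunks);   // one past its last station
        return from + rnd.nextInt(to - from);
    }

    public static void main(String[] args) {
        var rnd = new Random(42);
        // 20 lines, 10 stations, 5 chunks: each block of 4 lines uses only 2 stations.
        for (int line = 0; line < 20; line++) {
            System.out.println("line " + line + " -> station " + stationFor(line, 20, 10, 5, rnd));
        }
    }
}
```

In the generator above, the returned index would replace the `rnd.nextInt(...)` argument passed to `weatherStations.get(...)`.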

@lehuyduc
Author

lehuyduc commented Jan 12, 2024

I can do that but the goal isn't to kill creativity 🥳 So I will stick to 500M exactly (please don't read the file in reverse 😟)

@gunnarmorling
Owner

Hey @lehuyduc, no problem with adding this generator, though I probably wouldn't use it for any "official" run, so as to keep things somewhat focused (two experiments I do want to run in addition to the original data set are running on all 32 cores / 64 threads of the machine, and running with 10K different station names rather than the ~400 of the original example data set). Still, this could be a nice tool for people experimenting on their own. PR welcome.
