Optimization: Read only first 128B from the file when searching for shebang #1040

WieeRd · 2023-11-20T07:29:51Z

Currently, LanguageType::from_shebang() uses BufReader and BufRead::read_line() to fetch the first line of the file.

BufReader::new() allocates 8 KiB of buffer by default
BufRead::read_line() can end up reading the entire file

A typical shebang line ranges from 10 to 30 characters (e.g. #!/usr/bin/env python3 is only 22B),
and it is very unlikely the file contains a valid shebang syntax if we don't find a newline character within the first few dozens of bytes.

This PR modifies the code so that it only reads the first 128B of the file, into the fixed sized u8 array, without involving the unnecessary BufReader and String.

XAMPPRocky · 2023-11-20T07:52:50Z

Thank you for your PR! Do you have a benchmark or similar that shows that this is faster than the current implementation?

WieeRd · 2023-11-20T12:16:45Z

I cannot use my desktop until I get back home tomorrow, but I'll try to logically break down the detection cost on best/worst cases.

I/O

Current:
- Reads at least 8 KiB (BufReader)
- Reads at most the entire file if it does not contain any newline
PR: Reads at most 128 bytes

Searching

Current: Can end up scanning the entire file if it does not contain any newline
PR: Searches at most 128 bytes
Both stops searching upon discovering a newline*

*Note: Current implementation may be slightly faster in this case since read_line() internally uses memchr to search 2 usizes at a time.

*Note²: Also not really because the reduced I/O and allocation cost is much more significant.

Heap Allocation

Current: Allocates at least 8 KiB (BufReader + String)
PR: None; uses fixed-sized 128B array on stack.

Summary

Both reading and searching is capped at 128B after this patch while the current implementation might process the entire file - in the worst scenario, tens of gigabytes from a binary file if someone ran tokei on a LLM project.

I don't know if there is a binary detection logic in the directory iteration part of tokei to prune that kind of madness, but as someone who plans to use tokei mainly as a filetype detection library, safety features like this on LanguageType::from_* methods are crucial.

Hope that sums up the advantage of this PR.

WieeRd · 2023-11-20T12:18:12Z

It was a misinput!

WieeRd · 2023-11-20T14:20:17Z

Little off-topic question, why does from_path() take unused Config parameter? Are there plans to utilize it somehow?

XAMPPRocky · 2023-11-20T14:27:27Z

Little off-topic question, why does from_path() take unused Config parameter? Are there plans to utilize it somehow?

I couldn't say why, I haven't focused on tokei in a few years.

WieeRd · 2023-11-21T15:18:37Z

Benchmarks

Still not convinced?

Test Data

# ./chungus: 4 GiB of zeros
$ dd if=/dev/zero of=./chungus bs=4M count=1024
1024+0 records in
1024+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 5.50768 s, 780 MB/s

# ./script: tiny script with shebang
$ echo "#!/usr/bin/env python3\nprint('Hello World!')" > ./script

$ ls
╭───┬─────────┬──────┬────────┬───────────────╮
│ # │  name   │ type │  size  │   modified    │
├───┼─────────┼──────┼────────┼───────────────┤
│ 0 │ chungus │ file │ 4.3 GB │ 2 minutes ago │
│ 1 │ script  │ file │   45 B │ 2 minutes ago │
╰───┴─────────┴──────┴────────┴───────────────╯

src
├── bin
│  ├── bufreader.rs
│  └── justreader.rs
└── lib.rs
chungus
script
Cargo.lock
Cargo.toml

`src/bin/bufreader.rs`

use std::{
    fs::File,
    io::{BufRead, BufReader},
    path::Path,
    time::Instant,
};

pub fn bufread(path: &Path) -> Option<usize> {
    let file = File::open(path).ok()?;

    let mut buf = BufReader::new(file);
    let mut line = String::new();

    buf.read_line(&mut line).ok()
}

fn main() {
    let t = Instant::now();
    bufread("./script".as_ref());
    let dt = t.elapsed();
    println!("./script: {}μs", dt.as_micros());

    let t = Instant::now();
    bufread("./chungus".as_ref());
    let dt = t.elapsed();
    println!("./chungus: {}μs", dt.as_micros());
}

`src/bin/justreader.rs`

use std::{fs::File, io::Read, path::Path, time::Instant};

pub fn justread(path: &Path) -> Option<usize> {
    const READ_LIMIT: usize = 128;

    let mut file = File::open(path).ok()?;
    let mut buf = [0; READ_LIMIT];

    let len = file.read(&mut buf).ok()?;
    let buf = &buf[..len];

    let first_line = buf.split(|b| *b == b'\n').next()?;
    let first_line = std::str::from_utf8(first_line).ok()?;

    Some(first_line.len())
}

fn main() {
    let t = Instant::now();
    justread("./script".as_ref());
    let dt = t.elapsed();
    println!("./script: {}μs", dt.as_micros());

    let t = Instant::now();
    justread("./chungus".as_ref());
    let dt = t.elapsed();
    println!("./chungus: {}μs", dt.as_micros());
}

Test Results

$ cargo run --release --bin bufreader
    Finished release [optimized] target(s) in 0.02s
     Running `target/release/bufreader`
./script: 10μs
./chungus: 2619986μs

$ cargo run --release --bin justreader
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/justreader`
./script: 5μs
./chungus: 7μs

Is this enough to get this PR merged?

WieeRd · 2023-11-21T15:33:22Z

I haven't focused on tokei in a few years.

I started using tokei as a library in my recent projects, and hopefully I can find some time to improve the codebase and deal with unresolved issues in the span of next few months :)

WieeRd · 2023-11-21T15:46:15Z

How many CI jobs do you have

XAMPPRocky · 2023-11-21T17:55:16Z

How many CI jobs do you have

Haha, each target * 3 (stable, beta, nightly). Tokei is now a pretty old project in the Rust space, and has been/is used to help catch regressions in the compiler.

I commented out netbsd, and there's issue with that target, if you rebase it should pass.

WieeRd · 2023-11-21T18:41:39Z

Rebased and force-pushed.

WieeRd closed this Nov 20, 2023

WieeRd reopened this Nov 20, 2023

perf: read only first 128B from the file when searching for shebang

242e590

WieeRd force-pushed the master branch from 0b9e45b to 242e590 Compare November 21, 2023 18:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimization: Read only first 128B from the file when searching for shebang #1040

Optimization: Read only first 128B from the file when searching for shebang #1040

WieeRd commented Nov 20, 2023

XAMPPRocky commented Nov 20, 2023

WieeRd commented Nov 20, 2023 •

edited

WieeRd commented Nov 20, 2023

WieeRd commented Nov 20, 2023

XAMPPRocky commented Nov 20, 2023

WieeRd commented Nov 21, 2023 •

edited

WieeRd commented Nov 21, 2023

WieeRd commented Nov 21, 2023

XAMPPRocky commented Nov 21, 2023

WieeRd commented Nov 21, 2023

Optimization: Read only first 128B from the file when searching for shebang #1040

Are you sure you want to change the base?

Optimization: Read only first 128B from the file when searching for shebang #1040

Conversation

WieeRd commented Nov 20, 2023

XAMPPRocky commented Nov 20, 2023

WieeRd commented Nov 20, 2023 • edited

I/O

Searching

Heap Allocation

Summary

WieeRd commented Nov 20, 2023

It was a misinput!

WieeRd commented Nov 20, 2023

XAMPPRocky commented Nov 20, 2023

WieeRd commented Nov 21, 2023 • edited

Benchmarks

Test Data

src/bin/bufreader.rs

src/bin/justreader.rs

Test Results

WieeRd commented Nov 21, 2023

WieeRd commented Nov 21, 2023

XAMPPRocky commented Nov 21, 2023

WieeRd commented Nov 21, 2023

WieeRd commented Nov 20, 2023 •

edited

WieeRd commented Nov 21, 2023 •

edited

`src/bin/bufreader.rs`

`src/bin/justreader.rs`