Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization: Read only first 128B from the file when searching for shebang #1040

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

WieeRd
Copy link

@WieeRd WieeRd commented Nov 20, 2023

Currently, LanguageType::from_shebang() uses BufReader and BufRead::read_line() to fetch the first line of the file.

  • BufReader::new() allocates 8 KiB of buffer by default
  • BufRead::read_line() can end up reading the entire file

A typical shebang line ranges from 10 to 30 characters (e.g. #!/usr/bin/env python3 is only 22B),
and it is very unlikely the file contains a valid shebang syntax if we don't find a newline character within the first few dozens of bytes.

This PR modifies the code so that it only reads the first 128B of the file, into the fixed sized u8 array, without involving the unnecessary BufReader and String.

@XAMPPRocky
Copy link
Owner

Thank you for your PR! Do you have a benchmark or similar that shows that this is faster than the current implementation?

@WieeRd
Copy link
Author

WieeRd commented Nov 20, 2023

I cannot use my desktop until I get back home tomorrow, but I'll try to logically break down the detection cost on best/worst cases.

I/O

  • Current:
    • Reads at least 8 KiB (BufReader)
    • Reads at most the entire file if it does not contain any newline
  • PR: Reads at most 128 bytes

Searching

  • Current: Can end up scanning the entire file if it does not contain any newline
  • PR: Searches at most 128 bytes
  • Both stops searching upon discovering a newline*

*Note: Current implementation may be slightly faster in this case since read_line() internally uses memchr to search 2 usizes at a time.

*Note²: Also not really because the reduced I/O and allocation cost is much more significant.

Heap Allocation

  • Current: Allocates at least 8 KiB (BufReader + String)
  • PR: None; uses fixed-sized 128B array on stack.

Summary

Both reading and searching is capped at 128B after this patch while the current implementation might process the entire file - in the worst scenario, tens of gigabytes from a binary file if someone ran tokei on a LLM project.

I don't know if there is a binary detection logic in the directory iteration part of tokei to prune that kind of madness, but as someone who plans to use tokei mainly as a filetype detection library, safety features like this on LanguageType::from_* methods are crucial.

Hope that sums up the advantage of this PR.

@WieeRd WieeRd closed this Nov 20, 2023
@WieeRd WieeRd reopened this Nov 20, 2023
@WieeRd
Copy link
Author

WieeRd commented Nov 20, 2023

It was a misinput!

@WieeRd
Copy link
Author

WieeRd commented Nov 20, 2023

Little off-topic question, why does from_path() take unused Config parameter? Are there plans to utilize it somehow?

@XAMPPRocky
Copy link
Owner

Little off-topic question, why does from_path() take unused Config parameter? Are there plans to utilize it somehow?

I couldn't say why, I haven't focused on tokei in a few years.

@WieeRd
Copy link
Author

WieeRd commented Nov 21, 2023

Benchmarks

Still not convinced?

Test Data

# ./chungus: 4 GiB of zeros
$ dd if=/dev/zero of=./chungus bs=4M count=1024
1024+0 records in
1024+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 5.50768 s, 780 MB/s

# ./script: tiny script with shebang
$ echo "#!/usr/bin/env python3\nprint('Hello World!')" > ./script
$ ls
╭───┬─────────┬──────┬────────┬───────────────╮
│ # │  name   │ type │  size  │   modified    │
├───┼─────────┼──────┼────────┼───────────────┤
│ 0 │ chungus │ file │ 4.3 GB │ 2 minutes ago │
│ 1 │ script  │ file │   45 B │ 2 minutes ago │
╰───┴─────────┴──────┴────────┴───────────────╯
src
├── bin
│  ├── bufreader.rs
│  └── justreader.rs
└── lib.rs
chungus
script
Cargo.lock
Cargo.toml

src/bin/bufreader.rs

use std::{
    fs::File,
    io::{BufRead, BufReader},
    path::Path,
    time::Instant,
};

pub fn bufread(path: &Path) -> Option<usize> {
    let file = File::open(path).ok()?;

    let mut buf = BufReader::new(file);
    let mut line = String::new();

    buf.read_line(&mut line).ok()
}

fn main() {
    let t = Instant::now();
    bufread("./script".as_ref());
    let dt = t.elapsed();
    println!("./script: {}μs", dt.as_micros());

    let t = Instant::now();
    bufread("./chungus".as_ref());
    let dt = t.elapsed();
    println!("./chungus: {}μs", dt.as_micros());
}

src/bin/justreader.rs

use std::{fs::File, io::Read, path::Path, time::Instant};

pub fn justread(path: &Path) -> Option<usize> {
    const READ_LIMIT: usize = 128;

    let mut file = File::open(path).ok()?;
    let mut buf = [0; READ_LIMIT];

    let len = file.read(&mut buf).ok()?;
    let buf = &buf[..len];

    let first_line = buf.split(|b| *b == b'\n').next()?;
    let first_line = std::str::from_utf8(first_line).ok()?;

    Some(first_line.len())
}

fn main() {
    let t = Instant::now();
    justread("./script".as_ref());
    let dt = t.elapsed();
    println!("./script: {}μs", dt.as_micros());

    let t = Instant::now();
    justread("./chungus".as_ref());
    let dt = t.elapsed();
    println!("./chungus: {}μs", dt.as_micros());
}

Test Results

$ cargo run --release --bin bufreader
    Finished release [optimized] target(s) in 0.02s
     Running `target/release/bufreader`
./script: 10μs
./chungus: 2619986μs
$ cargo run --release --bin justreader
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/justreader`
./script: 5μs
./chungus: 7μs

Is this enough to get this PR merged?

@WieeRd
Copy link
Author

WieeRd commented Nov 21, 2023

I haven't focused on tokei in a few years.

I started using tokei as a library in my recent projects, and hopefully I can find some time to improve the codebase and deal with unresolved issues in the span of next few months :)

@WieeRd
Copy link
Author

WieeRd commented Nov 21, 2023

How many CI jobs do you have

@XAMPPRocky
Copy link
Owner

How many CI jobs do you have

Haha, each target * 3 (stable, beta, nightly). Tokei is now a pretty old project in the Rust space, and has been/is used to help catch regressions in the compiler.

I commented out netbsd, and there's issue with that target, if you rebase it should pass.

@WieeRd
Copy link
Author

WieeRd commented Nov 21, 2023

Rebased and force-pushed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants