Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prefer Vec<u8>/[u8] over Vec<char>/[char] #15

Open
TianyiShi2001 opened this issue Jan 3, 2021 · 4 comments
Open

Prefer Vec<u8>/[u8] over Vec<char>/[char] #15

TianyiShi2001 opened this issue Jan 3, 2021 · 4 comments

Comments

@TianyiShi2001
Copy link
Contributor

When creating a Vec<char> from a string, s.chars().collect() is the standard practice. The .chars() method works on any UTF-8 encoded string and it takes time to find the boundary for each character. However, we know that PDB files contain ASCII characters only, therefore it will be faster to transmute a string directly to bytes using as_bytes() and use vectors/arrays/slices of bytes instead of chars throughout the crate. This practice is used, for example in rust-bio and seq-io

@douweschulte
Copy link
Owner

I totally agree.

@TianyiShi2001
Copy link
Contributor Author

Doing so requires that when we parse a PDB file we assume it's all ASCII, is it OK? (the checks that are run when modifying the structure are preserved)

@DocKDE
Copy link
Collaborator

DocKDE commented Mar 15, 2022

Out of curiosity: what happened to this suggestion? I'm asking because I took another look during profiling and a decent chunk of the time pdbtbx needs for parsing is currently used up by collecting chars into Strings. I'm not sure if this would be impacted by a switch to bytes but since PDB files only contain ASCII characters the crate could benefit from such a change, no?

@DocKDE DocKDE reopened this Mar 15, 2022
@DocKDE DocKDE changed the title Perfer Vec<u8>/[u8] over Vec<char>/[char] Prefer Vec<u8>/[u8] over Vec<char>/[char] Mar 15, 2022
@douweschulte
Copy link
Owner

The progress stalled. But changing to u8 will not necessarily decrease the time spent on collecting to Vecs. The most benefit will be in the use of the more specialised u8 based ascii functions over UTF8. If most time is spent on collecting to Vecs I think it would be more beneficial to find out why and if the collection can be removed/sped up, maybe by providing the final size of the collection. If you want feel free to work on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants