Prefer `Vec<u8>/[u8]` over `Vec<char>/[char]` #15

TianyiShi2001 · 2021-01-03T12:09:16Z

When creating a Vec<char> from a string, s.chars().collect() is the standard practice. The .chars() method works on any UTF-8 encoded string, so it has to decode every character and find its boundary, which takes time. However, we know that PDB files contain ASCII characters only, so it will be faster to view a string directly as bytes using as_bytes() and to use vectors/arrays/slices of bytes instead of chars throughout the crate. This practice is used, for example, in rust-bio and seq-io.
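To make the difference concrete, here is a minimal sketch of the two approaches. The function names are made up for illustration and are not part of the crate; `record_name` assumes the record name sits in the first six columns of a line.

```rust
/// Collects a line into chars: decodes UTF-8 and allocates a new Vec
/// with four bytes per character.
fn columns_as_chars(line: &str) -> Vec<char> {
    line.chars().collect()
}

/// Borrows the bytes already stored in the string: no decoding, no allocation.
fn columns_as_bytes(line: &str) -> &[u8] {
    line.as_bytes()
}

/// Hypothetical helper: returns the record name from the first six columns
/// of a line, with trailing ASCII whitespace removed.
fn record_name(line: &[u8]) -> &[u8] {
    let field = &line[..line.len().min(6)];
    // Index just past the last non-whitespace byte, or 0 if the field is blank.
    let len = field
        .iter()
        .rposition(|b| !b.is_ascii_whitespace())
        .map_or(0, |i| i + 1);
    &field[..len]
}
```

With ASCII input, byte offsets and column numbers coincide, so fixed-width fields can be sliced directly, as `record_name` does.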


douweschulte · 2021-01-03T12:17:30Z

I totally agree.

douweschulte#15

TianyiShi2001 · 2021-01-04T01:50:52Z

Doing so requires that we assume a PDB file is all ASCII when we parse it. Is that OK? (The checks that are run when modifying the structure are preserved.)
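One way to avoid a blind assumption would be to check each line once while parsing and report the offending position. This is only a sketch; `ascii_bytes` is a hypothetical helper, not an existing function in the crate.

```rust
/// Hypothetical helper: returns the line as bytes if it is pure ASCII,
/// otherwise the byte offset of the first non-ASCII byte.
fn ascii_bytes(line: &str) -> Result<&[u8], usize> {
    match line.bytes().position(|b| !b.is_ascii()) {
        Some(offset) => Err(offset),
        None => Ok(line.as_bytes()),
    }
}
```

The check is a single linear scan per line, so the rest of the parser could work on `&[u8]` and still reject non-ASCII input with a useful error.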

DocKDE · 2022-03-15T12:50:23Z

Out of curiosity: what happened to this suggestion? I'm asking because I took another look during profiling and a decent chunk of the time pdbtbx needs for parsing is currently used up by collecting chars into Strings. I'm not sure if this would be impacted by a switch to bytes but since PDB files only contain ASCII characters the crate could benefit from such a change, no?

douweschulte · 2022-03-15T13:35:33Z

The progress stalled. But changing to u8 will not necessarily decrease the time spent on collecting to Vecs. The biggest benefit will come from using the more specialised u8-based ASCII functions instead of their UTF-8 counterparts. If most time is spent on collecting to Vecs, I think it would be more beneficial to find out why and whether the collection can be removed or sped up, maybe by providing the final size of the collection up front. If you want, feel free to work on this.
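As a rough illustration of both points, the sketch below reserves the final size before filling a String and uses a byte-level ASCII comparison. Both functions are hypothetical and do not reflect how the crate currently parses; the column indices are zero-based byte offsets chosen for the example.

```rust
/// Hypothetical helper: copies the field at byte offsets start..end into a
/// String, dropping spaces. The capacity is reserved up front because a
/// filtered iterator reports a lower size bound of zero, so a plain
/// `collect()` could not pre-allocate here.
fn field_to_string(line: &[u8], start: usize, end: usize) -> String {
    // Clamp the range so short lines do not panic.
    let end = end.min(line.len());
    let start = start.min(end);
    let field = &line[start..end];

    let mut out = String::with_capacity(field.len());
    for &b in field {
        if b != b' ' {
            // Assumes ASCII input, so each byte maps to exactly one char.
            out.push(b as char);
        }
    }
    out
}

/// Hypothetical helper: case-insensitive record check using the u8-based
/// ASCII comparison, with no UTF-8 decoding involved.
fn is_atom_record(line: &[u8]) -> bool {
    line.len() >= 4 && line[..4].eq_ignore_ascii_case(b"ATOM")
}
```

Whether this helps in practice would need to be confirmed by profiling, since the cost may lie in the number of allocations rather than in their size.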

TianyiShi2001 added a commit to TianyiShi2001/rust-pdb that referenced this issue Jan 4, 2021

replace char/str/string with u8/[u8]/Vec<u8>

0b582cd

douweschulte#15

TianyiShi2001 mentioned this issue Jan 4, 2021

replace char/str/string with u8/[u8]/Vec<u8> #30

Closed

douweschulte closed this as completed Mar 2, 2022

DocKDE reopened this Mar 15, 2022

DocKDE changed the title ~~Perfer Vec<u8>/[u8] over Vec<char>/[char]~~ Prefer Vec<u8>/[u8] over Vec<char>/[char] Mar 15, 2022
