Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make alloc-free string manipulation more ergonomic with String::mutate(&mut self, ...) #2864

Open
mqudsi opened this issue Feb 14, 2020 · 16 comments

Comments

@mqudsi
Copy link

mqudsi commented Feb 14, 2020

The current API for manipulating String objects is largely relegated to str, and many operations treat the source string as immutable and return a new &str reference with the result of the operation (unless non-contiguous results are not guaranteed, e.g. .replace(), etc.)

Oftentimes destructive modifications to a String are needed "while it is being created" and developers are forced to either write their own logic to duplicate str functionality to update String values or else use .to_owned() on the result of str functions.

e.g. a function runs a command and returns its output as a string, with trailing whitespace removed:

    ....
    let mut path = String::from_utf8(output.stdout)?;
    let new_len = path.trim_end().len();
    path.truncate(new_len);
    Ok(path)

vs

    ....
    let path = String::from_utf8(output.stdout)?;
    let path_str = path.trim_end();
    Ok(path_str.to_owned())

I believe we're close to being able to support a safe mutating api that provides a &str and uses an &str result to redefine the string (mainly to pop or dequeue elements from the underlying storage), something along the lines of

fn String::mutate(&mut self, F) -> ()
    where F: Fn(&'restricted str) -> &'restricted str`

which passes in an (exclusive) reference to the current value of the string and expects to receive a subset: &'restricted str as a result, where self.as_bytes() == &[.., subset.as_bytes(), ..] holds true (i.e. subset.as_bytes().as_ptr() >= self.as_bytes().as_ptr() && subset.as_bytes().as_ptr() + subset.as_bytes().len() < self.as_bytes().as_ptr() + self.as_bytes().len())

This makes the most sense it if it is possible to make compile-time guarantees ensuring the above predicate. I don't think #2000 via rust-lang/rust#44580 currently makes this possible, but I feel as if there is overlap between const generics and the ability to do so.

Using it would look something like this:

let mut start = "  hello, world  ".to_owned();
// Now trim both ends of the string without reallocating
start.mutate(|s| s.trim());
assert_eq!(&start, "hello, world");

while forbidding the following:

let other_string = "not my string";
let mut start = "  hello, world  ".to_owned();
// Try to provide a reference to an str that is not a subset of the string itself
start.mutate(|_| &other_string); // should not compile
@burdges
Copy link

burdges commented Feb 14, 2020

We have String::truncate already. We cannot drop a prefix from the allocation though, so first I'd suggest adding String::shift_left methods to Vec and String:

impl<T> Vec<T> {
    pub fn shift_left(&mut self, mid: usize) {
        if mid >= self.len() { return; }
        self.rotate_left(mid);  // Reimplement with faster unsafe code
        self.truncate(self.len()-shift);
    }
}

impl String {
    pub fn shift_left(&mut self, shift: usize) {
        if shift >= self.len() { return; }
        assert!(self.is_char_boundary(shift));
        self.vec.shift_left(shift);
    }
}

At that point, your String::mutate could be implemented with some unsafe safety checks and shift_left plus truncate, but your signature looks like

fn String::mutate(&mut self, f: F) 
    where for<'restricted> F: FnOnce(&'restricted mut str) -> &'restricted mut str
{
    // unsafely copy self to old_self
    let self = f(self);
    // unsafely compute shift and new_len from old_self and self and check they lie within the allocation
    self.truncate(shift+new_len);
    self.shift_left(shift);
}

I'm unsure if miri could actually be made happy with mutate thought and maybe this needs more is_char_boundary calls. Ask @RalfJung

See also: https://crates.io/search?q=rope

@RustyYato
Copy link

RustyYato commented Feb 14, 2020

note just using:

fn mutate(&mut self, F) -> ()
    where F: for<'a> Fn(&'a str) -> &'a str

doesn't work without checks because of,

string.mutate(|_| "static string always works");

To fix this we can add checks that check if the string is in fact a sub-string of self. If it isn't, then we can just overwrite the current string with the new string entirely. for example

@mqudsi
Copy link
Author

mqudsi commented Feb 14, 2020

Indeed, that is why the musings about const generics came into play.

@KrishnaSannasi it did not occur to me to have the failure mode for mutate (when result is not a substring) be the allocation of a new string rather than an error, which does indeed make it less pressing for the substr check to be a compile-time error.

@RustyYato
Copy link

I don't think const generics can help here, we need a way to say exactly this lifetime, not at least this lifetime. But I don't think that would be reasonable to implement. (sounds tricky)

@mqudsi
Copy link
Author

mqudsi commented Feb 14, 2020

Ah, but it would if you disregard lifetimes altogether and instead think of the generics as the start address of the slice and its length.

we need a way to say exactly this lifetime, not at least this lifetime.

That was my initial line of thought before thinking of a possible out via const generics. It's certainly possible if we pass in a wrapper that includes a phantom type such that it can't be constructed anywhere else, but call any of the str methods would return a bare &str dropping the wrapper so it would be ugly.

@RustyYato
Copy link

A wrapper type may work (we could forward all of str's methods to it, and it would rewrap the str), but that's tricky and I think sticking to run time checks will be much easier to justify.

@burdges
Copy link

burdges commented Feb 14, 2020

I suppose impl Trait tricks could hide opaque wrapper types perhaps, except Deref tricks cannot handle methods like split_at:

pub trait Substring : Deref<Target=str>+DerefMut { }

impl String {
    fn ::mutate<S,F>(&mut self, f: F)
    where
        S: Substring,
        F: FnOnce(S) -> S
    {
        struct MySubstring<'a>(&mut String,Range<usize>);
        ... impls ...

        let s = f(MySubstring(self,0..self.len()));
        ...
    }
}

We should add the shift_left methods regardless because doing this optimally requires new unsafe code.

@kennytm
Copy link
Member

kennytm commented Feb 14, 2020

What about the less ambitious alternative i.e. introduce the specific in-place modification functions?

impl String {
    pub fn trim_start_in_place(&mut self);
    pub fn trim_end_in_place(&mut self);
    pub fn trim_in_place(&mut self);
    pub fn trim_start_matches_in_place<'a>(&'a mut self, pat: impl Pattern<'a>);
    pub fn trim_end_matches_in_place<'a>(&'a mut self, pat: impl Pattern<'a>);
    pub fn trim_matches_in_place<'a>(&'a mut self, pat: impl Pattern<'a>);
}

@burdges

impl<T> Vec<T> {
    pub fn shift_left(&mut self, mid: usize);
}

This already exists, x.shift_left(mid) is the same as x.drain(..mid). (The same exists on String.)

@mqudsi
Copy link
Author

mqudsi commented Feb 14, 2020

Trim was just an example. What if you wanted to replace a string with the first occurrence of foo within it, etc?

@sfackler
Copy link
Member

I don't understand why the magic 'restricted semantics matter at all. In what way would the implementation be unable to deal with a string from outside its own buffer?

@mqudsi
Copy link
Author

mqudsi commented Feb 14, 2020

The original idea was that the method would be compile-time guaranteed to be (re-)allocation-free and the magic lifetimes were to be able to avoid a Result<..> return value. Obviously the only way a String could take over a slice would be if it owned it in the first place, but if the string reallocates as the "operation not possible" failure mode then I suppose that magic isn't needed at all.

@sfackler
Copy link
Member

The only time reallocation would be required is if the new slice is larger than the String's current capacity. Beyond that, the entire implementation is just to copy the slice into the bottom of the String's buffer and set its length to the size of the slice.

@mqudsi
Copy link
Author

mqudsi commented Feb 14, 2020

Ideally, there would be no copying needed if the slice is a subset of the string.

@RustyYato
Copy link

@mqudsi That can only work in cases like truncate. If the substring does not start at the beginning of the string, then there must be a copy.

@burdges
Copy link

burdges commented Feb 14, 2020

I think @sfackler is right: We should implement mutate as a "copy into anywhere inside myself from anywhere including possibly elsewhere inside myself" method. It almost always copies anyways, so zero reason to enforce anything about subslices. Internal copy methods like this make sense even for &mut str and &mut [T] but any truncate calls require String or Vec.

@Mark-Simulacrum
Copy link
Member

This is sounding somewhat similar to Vec::splice, for what it's worth: https://doc.rust-lang.org/nightly/alloc/vec/struct.Vec.html#method.splice

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants