Struct.UTFString.get() fails for UTF-16 #30

Open
blschatz opened this issue Nov 19, 2014 · 5 comments · May be fixed by #254

Comments

@blschatz

This fails because of the underlying call to IO.getZeroTerminatedByteArray, which stops at the first single null byte. For wide Charsets such as UTF-16 it should really be looking for a double null instead.

@headius
Member

headius commented Apr 23, 2015

This should probably be using Java's charset logic to decode. Will investigate.

@headius
Member

headius commented Apr 23, 2015

Ahh I see, it's just looking for the nulls to peel them off. Will see what I can do.

@headius
Member

headius commented Apr 23, 2015

Ok, I understand now.

getZeroTerminatedByteArray is used to return the bytes of a string sans the null terminator. It does this by taking the given string address and calling strlen on it. strlen only looks for \0, and then that length is used to allocate and populate a Java byte array.

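This breaks whenever the data contains embedded null bytes, which is always the case for UTF-16 text in the ASCII range: every such character is encoded as two bytes, one of which is zero, so strlen stops after the first character at the latest.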
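To make the failure concrete, here is a small pure-Java sketch. It does not call jnr-ffi or the native code; `strlenLike` is a hypothetical stand-in that only mimics the "stop at the first zero byte" behavior described above.

```java
import java.nio.charset.StandardCharsets;

public class StrlenTruncationDemo {
    // Hypothetical stand-in for C strlen: count bytes up to the first zero byte.
    static int strlenLike(byte[] bytes) {
        int n = 0;
        while (n < bytes.length && bytes[n] != 0) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // "Hi" plus a terminator in UTF-16LE is laid out as 48 00 69 00 00 00.
        byte[] utf16 = "Hi\0".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(strlenLike(utf16)); // prints 1: stops at the high byte of 'H'

        // The same text in UTF-8 is 48 69 00, so the single-byte scan works.
        byte[] utf8 = "Hi\0".getBytes(StandardCharsets.UTF_8);
        System.out.println(strlenLike(utf8)); // prints 2
    }
}
```

With a one-byte length, only half of the first UTF-16 code unit is copied into the Java array, so the decoded result cannot be the original string.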

This is going to be a much more difficult fix, since the actual strlen call happens inside native code. Whenever we change native code, we need to rebuild the native stubs across platforms.

I'm also not sure that just changing strlen is the right fix. These functions have no way of knowing what encoding the bytes are in.

Here's what I think we should do:

  1. As a workaround, you could work with the strings as bytes and deal with the nulls yourself. Not ideal, I know.
  2. Add a second version of this logic that takes either an encoding or an explicit terminator to look for, along the lines of getTerminatedByteArray(addr, [terminator|encoding]).
  3. Finally figure out how to set up VMs for all the platforms we support, so we can more easily update the native bits (ping @tduehr).
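Option 2 could look roughly like the sketch below. This is only an illustration under assumptions: the real change would live in native code and read from an address, whereas here a plain `byte[]` stands in for native memory. The class name, the exact signature, and the choice to pass the terminator width in bytes are all hypothetical, not existing jnr-ffi API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TerminatedBytes {
    /**
     * Hypothetical helper: returns the bytes from offset up to, but not
     * including, the first aligned run of terminatorWidth zero bytes.
     * If no terminator is found, all whole code units up to the end are returned.
     */
    static byte[] getTerminatedByteArray(byte[] memory, int offset, int terminatorWidth) {
        if (terminatorWidth < 1) {
            throw new IllegalArgumentException("terminatorWidth must be >= 1");
        }
        int end = offset;
        while (end + terminatorWidth <= memory.length) {
            boolean allZero = true;
            for (int k = 0; k < terminatorWidth; k++) {
                if (memory[end + k] != 0) {
                    allZero = false;
                    break;
                }
            }
            if (allZero) {
                break; // found the terminator at an aligned position
            }
            end += terminatorWidth; // advance one whole code unit
        }
        return Arrays.copyOfRange(memory, offset, end);
    }

    public static void main(String[] args) {
        // Width 2 for UTF-16: the trailing "junk" after the terminator is ignored.
        byte[] wide = "Hi\0junk".getBytes(StandardCharsets.UTF_16LE);
        byte[] wideResult = getTerminatedByteArray(wide, 0, 2);
        System.out.println(new String(wideResult, StandardCharsets.UTF_16LE)); // prints Hi

        // Width 1 reproduces the existing single-null behavior for UTF-8.
        byte[] narrow = "Hi\0junk".getBytes(StandardCharsets.UTF_8);
        byte[] narrowResult = getTerminatedByteArray(narrow, 0, 1);
        System.out.println(new String(narrowResult, StandardCharsets.UTF_8)); // prints Hi
    }
}
```

Taking an explicit width keeps the helper independent of any particular charset. A variant that accepts an encoding instead would need some mapping from charset to terminator width.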

@blschatz
Author

My fix was as follows:

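import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

// String here resolves to jnr.ffi.Struct.String (java.lang.String is final),
// so this class is meant to be nested inside a Struct subclass.
public class UTF16String extends String {

    public UTF16String(int length, Charset cs) {
        super(length * 8, 8, length, cs);
    }

    protected jnr.ffi.Pointer getStringMemory() {
        return getMemory().slice(offset(), length());
    }

    public final void set(java.lang.String value) {
        getStringMemory().putString(0, value, length(), charset);
    }

    public final java.lang.String get() {
        jnr.ffi.Pointer memory = getStringMemory();
        byte[] bytes = new byte[length()];
        memory.get(0, bytes, 0, bytes.length);

        // Find the two-byte null terminator, checking aligned code units only.
        // If none is found, decode every whole code unit and drop a trailing odd byte.
        int nullPos = bytes.length & ~1;
        for (int i = 0; i + 1 < bytes.length; i += 2) {
            if (bytes[i] == 0 && bytes[i + 1] == 0) {
                nullPos = i;
                break;
            }
        }
        CharBuffer res = charset.decode(ByteBuffer.wrap(bytes, 0, nullPos));
        return res.toString();
    }
}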
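One detail of the loop above is worth spelling out: it advances two bytes at a time, so it only tests pairs that start on a code-unit boundary. The following standalone sketch (plain Java, no jnr-ffi; both method names are made up for illustration) shows why a byte-by-byte search for two consecutive zeros would not be equivalent.

```java
import java.nio.charset.StandardCharsets;

public class AlignedScanDemo {
    // Steps one byte at a time: can match a zero pair that straddles two code units.
    static int unalignedNullPos(byte[] b) {
        for (int i = 0; i + 1 < b.length; i++) {
            if (b[i] == 0 && b[i + 1] == 0) {
                return i;
            }
        }
        return b.length;
    }

    // Steps two bytes at a time: only matches a whole zero code unit.
    static int alignedNullPos(byte[] b) {
        for (int i = 0; i + 1 < b.length; i += 2) {
            if (b[i] == 0 && b[i + 1] == 0) {
                return i;
            }
        }
        return b.length;
    }

    public static void main(String[] args) {
        // 'a' is 61 00 and U+0100 is 00 01 in UTF-16LE, followed by the 00 00 terminator:
        // 61 00 00 01 00 00
        byte[] bytes = "a\u0100\0".getBytes(StandardCharsets.UTF_16LE);

        System.out.println(unalignedNullPos(bytes)); // prints 1: false match across code units
        System.out.println(alignedNullPos(bytes));   // prints 4: the real terminator

        String decoded = new String(bytes, 0, alignedNullPos(bytes), StandardCharsets.UTF_16LE);
        System.out.println(decoded.equals("a\u0100")); // prints true
    }
}
```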

@headius
Member

headius commented Sep 26, 2016

@blschatz Possible for you to turn that into a pull request we can integrate? I'm not sure how you're using that within jnr-ffi and your own code (i.e. I'd like to see some examples and ideally tests in a PR).

demon36 added a commit to demon36/jnr-ffi that referenced this issue Jul 6, 2021
DirectMemoryIO.getString() fails for non UTF-8
@demon36 demon36 linked a pull request Jul 6, 2021 that will close this issue