Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port 'string-lessp' to Rust #217

Merged
merged 15 commits into from Jul 11, 2017
Merged

Conversation

DavidDeSimone
Copy link
Collaborator

@DavidDeSimone DavidDeSimone commented Jun 25, 2017

This is a WIP PR for porting string-lessp to Rust. As part of this work, it seemed useful to add an iterator for a LispStringRef, in order to replace the macro FETCH_STRING_CHAR_ADVANCE

Work left to do:

  • Port basic string-lessp functionality to Rust
  • Remove string-lessp implmentation in C.
  • Add code point iterator for LispStringRef
  • Add safe API for working with LispSymbol's, similar to LispStringRef/as_string/as_string_or_error
    • Fix issue where the value coming out of the LispSymbolRef's mem::transmute seem to be nonsense
  • Verify that the alignment/size/padding of Rust's Lisp_Symbol matches the alignment/size/padding of C's Lisp_Symbol
  • Basic profiling to make sure that these new concepts do not add significant cost over their C equivalents. (Based on my findings, performance for this function is equivalent to it's C counterpart)

DavidDeSimone added 3 commits June 24, 2017 21:05
…tringRef. This commit is a work is part of a work in progress.
Conflicts:
	rust_src/src/strings.rs
…stead of incrementing the char counter by 1 per iteration.
@@ -155,6 +155,10 @@ impl LispObject {
pub fn is_symbol(self) -> bool {
self.get_type() == LispType::Lisp_Symbol
}

pub fn symbol_name(&self) -> LispObject {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not safe - it'll treat the LispObject as a symbol regardless of its type.

You'll have to introduce a LispSymbolRef newtype and respective as_symbol and as_symbol_or_error methods on LispObject to convert safely.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good callout, I will do this next.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@@ -106,6 +106,22 @@ impl LispStringRef {
}
}

// Substitue for FETCH_STRING_CHAR_ADVANCE
impl Iterator for LispStringRef {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea to use an iterator here!

i1 += 1;

if codept1 != codept2 {
if codept1 < codept2 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just return LispObject::from_bool(codept1 < codept2)

}
}

if i1 < lispstr2.len_bytes() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here (without return)

let end = cmp::min(lispstr1.len_bytes(), lispstr2.len_bytes());
let mut i1 = 0;
while i1 < end {
// Unwraps should be fine here, due to our manual tracking of
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, could this be done with a zip? I don't think we need i1 afterwards since if we finish the loop it will always be = end.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also another good call out, I've changed the implementation to use zip

DavidDeSimone added 2 commits June 25, 2017 23:58
… codepoints on a LispString. Declaring string_lessp in Rust as a #[lisp_fn]. Removing c definition of string-lessp.
…instead of len_chars, causing a potential iteration panic. Adding iteration zip implementation for cleaner code.
codepoint = cp;
self.cur += advance;
} else {
codepoint = ref_slice[self.cur] as u32;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.. as Codepoint?

self.cur += 1;
}

Some((codepoint, self.cur))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this returns the index of the next codepoint. Is this intended?

}

impl LispStringRef {
pub fn iter(&self) -> LispStringRefIterator {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on need by other APIs, I think two iterators a la chars() and char_indices() would make sense.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds reasonable to me, I like the idea of a chars() and char_indicies().

}
}

LispObject::from_bool(count < lispstr2.len_chars())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is just lispstr1.len_chars() < lispstr2.len_chars(), and keeping count is unnecessary.

(If lispstr2 is shorter or equally long, this will test len2 < len2 which is false.)

@@ -1,6 +1,6 @@
//! Functions operating on strings.

use std::ptr;
use std::{ptr, cmp};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused import?

@@ -155,6 +155,10 @@ impl LispObject {
pub fn is_symbol(self) -> bool {
self.get_type() == LispType::Lisp_Symbol
}

pub fn symbol_name(&self) -> LispObject {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

David DeSimone added 2 commits June 28, 2017 11:49
…the current codepoint location, not the next offset location. Updating string-lessp logic to not need an explicit 'count' variable
…. This will allow us to create a LispSymbolRef like we have a LispStringRef. This will also allow a similar API for working with symbols as we have for strings.
fn get_string_or_symbol(mut string: LispObject) -> multibyte::LispStringRef {
if string.is_symbol() {
string = string.symbol_name()
string = string.as_symbol_or_error().symbol_name()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still duplicates the symbol check. I'd make it

match string.as_symbol() {
    Some(sym) => sym.symbol_name().as_string().expect("symbol name not a string?")
    None => string.as_string_or_error()
}

If you like, you can also make string_equal use this function, it currently uses SYMBOL_NAME directly.

…ged Rust union for a SymbolUnion. Updating the implementation of 'get_string_or_symbol' to avoid an additional symbol check.
#[inline]
pub fn as_symbol(&self) -> Option<LispSymbolRef> {
if self.is_symbol() {
Some(LispSymbolRef::new(unsafe { mem::transmute(self.get_untaggedptr()) }))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks to be incorrect. The code in lisp.h for XSYMBOL looks like

INLINE struct Lisp_Symbol *
(XSYMBOL) (Lisp_Object a)
{
#if USE_LSB_TAG
  return lisp_h_XSYMBOL (a);
#else
  eassert (SYMBOLP (a));
  intptr_t i = (intptr_t) XUNTAG (a, Lisp_Symbol);
  void *p = (char *) lispsym + i;
  return p;
#endif
}

and it looks like we will have to emulate this logic for getting the address to mem::transmute.

// @TODO check the value of name post and pre transmutation, it seems that name is surviving but
// may not be the correct value
#[repr(C)]
pub struct Lisp_Symbol {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gcmarkbit, redirect ...etc are missing ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is my understanding of the situation re: the mentioned variables. If anyone spots any incorrect information please do not hesitate to correct me:

Lisp_Symbol has the following definition in lisp.h (I've stripped out the comments for clarity):

struct Lisp_Symbol
{
  bool_bf gcmarkbit : 1;
  ENUM_BF (symbol_redirect) redirect : 3;
  ENUM_BF (symbol_trapped_write) trapped_write : 2;
  unsigned interned : 2;
  bool_bf declared_special : 1;
  bool_bf pinned : 1;
  Lisp_Object name;
  union {
    Lisp_Object value;
    struct Lisp_Symbol *alias;
    struct Lisp_Buffer_Local_Value *blv;
    union Lisp_Fwd *fwd;
  } val;
  Lisp_Object function;
  Lisp_Object plist;
  struct Lisp_Symbol *next;

This struct is using the C notation for bit fields. According to my research, Rust does not (rust-lang/rfcs#314, https://users.rust-lang.org/t/c-structs-with-bit-fields-and-ffi/1429) support setting up an equivalent for C bitfieldsin it's #[repr(C)] directive.

On my system (64-bit Ubuntu 16.04), no matter what the typedef of bf_bool or ENUM_BF, the bit field section of the struct takes 4 bytes. Due to alignment, the struct is padded, and offsetof(struct Lisp_Symbol, name) reports 8. My initial solution to representing this struct in Rust was to represent the bit field block as a u32, taking up the 4 bytes I mentioned earlier. I have not fully convinced myself that this is 100% the correct and portable thing to do.

Even if we can safely represent this block with a u32 on every system we support, it seems that there are 'gotcha's with accessing these bit fields via the bit wise operators (due to endianness, and compiler differences w.r.t bit field implementation.)

Overall I am not sure the best way to handle the interop for C structs that use bit fields that we need to access in Rust. It seems that one option is to simply pad the Rust struct as best we can, and if we need to access these fields, we will need to maintain C bindings that access them for us.

DavidDeSimone added 6 commits July 5, 2017 01:26
…guments. This was due to an improper call to mem::transmute in the Rust layer.

Symbols are a special case in which you cannot just call mem::transmute(self.get_untaggedptr()). Instead, one must offset the pointer value based on the memory address of an emacs global 'lispsym'.
…Ref. This will allow a user to loop over the codepoints of a LispStringRef, or the indicies of the codepoints of a LispStringRef.
LispStringRefCharIterator(self.iter())
}

pub fn char_indices(&self) -> LispStringRefIndexIterator {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to nitpick again, but actually char_indices returns the (index, char) tuple - you just need to rename iter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to apologize, this should be fixed now.

@DavidDeSimone DavidDeSimone changed the title [WIP] Port 'string-lessp' to Rust Port 'string-lessp' to Rust Jul 6, 2017
@DavidDeSimone
Copy link
Collaborator Author

Overall, I feel pretty confident with the PR now. If any maintainers have additional feedback, I will be happy to address their concerns, otherwise I think this is ready to merge.

Copy link
Collaborator

@Wilfred Wilfred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me: code looks clean :). Other than one missing docstring, I think it's good to merge.


pub struct LispStringRefCharIterator<'a>(LispStringRefIterator<'a>);

// Substitue for FETCH_STRING_CHAR_ADVANCE
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*Substitute

@@ -202,6 +202,21 @@ fn string_to_unibyte(string: LispObject) -> LispObject {
}
}

#[lisp_fn]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put the docstring here.

@Wilfred Wilfred merged commit 008064a into remacs:master Jul 11, 2017
@Wilfred
Copy link
Collaborator

Wilfred commented Jul 11, 2017

Marvellous :)

@DavidDeSimone DavidDeSimone deleted the string-lessp branch July 11, 2017 23:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants