Skip to content

Strings

Alan Jeffrey edited this page Dec 30, 2015 · 9 revisions

String representations

Rust

Rust libraries use UTF-8-encoded byte arrays (https://doc.rust-lang.org/std/string/struct.String.html).

Servo

Servo has two representations for strings:

Conversion from atoms to strings is O(n), since the string has to be copied. Conversion from strings to atoms is O(n), since the string has to be hashed. Equality comparison on atoms is O(1), on strings is O(n).

Common atoms are pre-cached, (https://github.com/servo/string-cache/blob/master/shared/static_atom_list.rs).

SpiderMonkey

SpiderMonkey has quite a collection of string representations (https://blog.mozilla.org/ejpbruel/2012/02/06/how-strings-are-implemented-in-spidermonkey-2/) aimed at reducing memory usage on most we pages (https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-javascript-strings-in-firefox/).

A summary (ignoring some of the types) is that a string is either:

  • A rope, representing the concatenation of its left and right child (which are themselves strings). This has O(1) concatenation, at the cost of O(log(n)) random access (if the rope is kept balanced, and O(n) if not).

  • A dependent string, representing a substring of its child (which is a flat string). This has O(1) substring.

  • A flat string is either

    • Interned or not. Interned strings are similar to Servo atoms (but allocated in the SM string pool)

    • Inlined or not. Inlined string are stored in the data value itself rather than on the heap

    • Latin-1 or UCS-2

SM on 64-bit architectures uses 3 words for a JSString, which consists of:

  • 32 bit string length

  • 32 bits of flags

  • 128 bits of data: either a pointer, or 15 Latin-1 characters plus a null terminator, or 7 UCS-2 characters plus a null terminator

In Firefox, SM uses a one-place cache when converting from a JSString to Gecko's flat string representation, so the most recent round-trip can be short-circuited.

Just to make things trickier, JavaScript allows invalid UTF-16 sequences in JSStrings, essentially treating strings as u16 arrays. An arbitrary JSString converted to UTF-8 and back may have different content.

String conversions

JSStrings are converted to and from DOMStrings when they are used in a method call at the SM/Servo boundary.

DOMStrings are converted to and from atoms when they are used in atomic contexts, for example node ids and element names.

DOMStrings are converted to and from Servo Strings when they are used by layout (currently this is a no-op, since DOMString is just a wrapper around String).

Strategies for avoiding some of these conversions:

a) Caching, for example Gecko caches the most recent JSString-to-DOMString conversion. There is the usual memory-vs-time trade-off here.

b) Sharing representations, for example Servo currently shares a representation between String and DOMString.

Experimental implementation

As an experiment, a branch of Servo which uses an alternative string representation is at https://github.com/asajeffrey/servo/commits/domstring_hackery. Strings are either:

  • a Rust String,
  • an inlined string (up to 15 UTF-8 encoded bytes),
  • an Atom

These are packed into a three-word data structure, to ensure no more memory usage than a String.

Comparing the running times on the Dromaeo DOM tests (running each test 1000 times, and reporting running times in ms, best run of 3), we get for Servo master

$ ./target/release/servo-master -x tests/dromaeo/dromaeo/tests/dom-attr.html 
getAttribute 8
element.property 12
setAttribute 6323
element.property = value 5654
element.expando = value 1711
element.expando 435
was ready to save, resetting ready_to_save_state
Shutting down the Constellation after generating an output file or exit flag specified

compared to the experimental build:

$ ./target/release/servo-domstring-hackery -x tests/dromaeo/dromaeo/tests/dom-attr.html 
getAttribute 8
element.property 12
setAttribute 6266
element.property = value 5675
element.expando = value 1773
element.expando 434
was ready to save, resetting ready_to_save_state
Shutting down the Constellation after generating an output file or exit flag specified

which is very little savings. The time benefit of using an alternate representation (less allocation, and fewer pointer indirections) seems to be about the same as the cost (more conditional jumps, so more speculative evaluation and poorer pipelining).

For memory usage on real-world web pages, valgrind reports:

$ valgrind target/release/servo-master -x http://twitter.com
...
==7705== HEAP SUMMARY:
==7705==     in use at exit: 9,072,935 bytes in 23,823 blocks
==7705==   total heap usage: 197,381 allocs, 173,558 frees, 334,235,688 bytes allocated

compared to:

$ valgrind target/release/servo-domstring-hackery -x http://twitter.com
==10725==     in use at exit: 8,932,476 bytes in 23,730 blocks
==10725==   total heap usage: 191,543 allocs, 167,813 frees, 290,674,179 bytes allocated
=...

(These are the results from the best of three runs, with reported allocation of 290M-458M.)

Here, there is a benefit, but it is only 140K (1.5%) less heap usage at exit, and 6K (3%) fewer allocations. Interestingly, there is significantly less total allocation (44M is about 13%) but this is not translating to space savings, presumably because many of the strings are short-lived, and jemalloc is performing well.

In summary, this alternative string representation is saving space, but not a significant amount of space, and the small gain is not obviously worth the cost in code complexity.

Clone this wiki locally