
Implement HTML charset parsing project

Simon Sapin edited this page Feb 17, 2020 · 3 revisions

Background information: Major browsers support parsing HTML content whose character encoding is not given in the HTTP Content-Type header but is instead declared inline in the page in a <meta> element. When such a declaration is found, the bytes of the page are reinterpreted in the requested character encoding. The goal of this project is to implement support for this delayed encoding interpretation in Servo as well, which will increase the number of passing tests and improve compatibility with existing web content that relies on this feature.

Tracking issue: (please ask questions in these issues)

Useful references:

Initial steps:

  • email the mozilla.dev.servo mailing list (be sure to subscribe to it first!) introducing your group and asking any necessary questions
  • create a new prescan.rs module in the html5ever repository and implement the byte stream prescanning algorithm
    • add a new public function which accepts a &[u8] argument and returns Result<&'static Encoding, AbortReason>, where AbortReason is an enum with two variants: not enough bytes were supplied to reach a decision, or no encoding was detected within the first 1024 bytes
    • use Encoding::for_label to convert a named charset into an Encoding value
  • add unit tests that cover success and failure cases for the algorithm (use cargo test prescan to run tests defined in the new prescan.rs module)
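
The shape of the public function described above can be sketched as follows. This is only an illustration of the API, not the spec's prescan algorithm: it looks for a bare `charset=` label instead of tokenizing `<meta>` tags, attributes, and comments. The `Encoding` type here is a small stand-in for the real one so that the block compiles on its own, and the names `prescan`, `NotEnoughBytes`, `NoEncodingDetected`, and `PRESCAN_LIMIT` are assumptions rather than names taken from html5ever.

```rust
/// Hypothetical stand-in for the real `Encoding` type, so that this
/// sketch compiles without external crates.
#[derive(Debug, PartialEq)]
pub struct Encoding {
    pub name: &'static str,
}

pub static UTF_8: Encoding = Encoding { name: "UTF-8" };
pub static WINDOWS_1252: Encoding = Encoding { name: "windows-1252" };

impl Encoding {
    /// Mimics the shape of `Encoding::for_label`; knows only a few labels.
    pub fn for_label(label: &[u8]) -> Option<&'static Encoding> {
        match label.to_ascii_lowercase().as_slice() {
            b"utf-8" | b"utf8" => Some(&UTF_8),
            b"windows-1252" | b"iso-8859-1" | b"latin1" => Some(&WINDOWS_1252),
            _ => None,
        }
    }
}

/// Why prescanning stopped without producing an encoding (assumed names).
#[derive(Debug, PartialEq)]
pub enum AbortReason {
    /// Fewer than 1024 bytes were supplied and nothing conclusive was found.
    NotEnoughBytes,
    /// Nothing usable was found within the first 1024 bytes.
    NoEncodingDetected,
}

const PRESCAN_LIMIT: usize = 1024;

/// Simplified prescan: finds the first `charset=` in the first 1024 bytes
/// and maps the label after it to an encoding.
pub fn prescan(bytes: &[u8]) -> Result<&'static Encoding, AbortReason> {
    let window = &bytes[..bytes.len().min(PRESCAN_LIMIT)];
    let needle: &[u8] = b"charset=";
    let found = window
        .windows(needle.len())
        .position(|w| w.eq_ignore_ascii_case(needle));
    if let Some(start) = found {
        let mut rest = &window[start + needle.len()..];
        // Skip an optional opening quote.
        if matches!(rest.first(), Some(&b'"') | Some(&b'\'')) {
            rest = &rest[1..];
        }
        // The label must be terminated, otherwise it may be cut off mid-chunk.
        let end = rest
            .iter()
            .position(|&b| matches!(b, b'"' | b'\'' | b' ' | b';' | b'/' | b'>'));
        if let Some(end) = end {
            if let Some(encoding) = Encoding::for_label(&rest[..end]) {
                return Ok(encoding);
            }
        }
    }
    // No recognized label: ask for more input until the limit is reached.
    if bytes.len() < PRESCAN_LIMIT {
        Err(AbortReason::NotEnoughBytes)
    } else {
        Err(AbortReason::NoEncodingDetected)
    }
}
```

Unit tests for the real module could follow the same split: at least one input that yields an encoding, one that is too short to decide, and one that exceeds 1024 bytes without any declaration.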

Subsequent steps:

  • Integrate the new prescan algorithm into Servo's HTML parser implementation following the encoding sniffing algorithm:
    • add a Cargo override that uses the locally-modified version of html5ever in Servo's Cargo.toml
    • modify components/script/dom/servoparser/mod.rs to create an enum with two states, Prescanning(Vec<u8>) and Detected(NetworkDecoder), and replace the network_decoder field with this enum
    • in push_bytes_input_chunk, if the Prescanning state is active, run the prescan over the buffered bytes together with the newest chunk; if prescanning completes, transition to the Detected state and update the associated Document's encoding with the detected encoding (step 4)
    • if prescanning does not complete, no parsing should occur in parse_bytes_chunk
    • modify new_inherited to accept an Option<&'static Encoding> argument, which is used as an override that avoids prescanning any input (step 3)
    • when prescanning completes with no detected encoding, check the encoding of the document in the parent of this document's browsing context (step 5)
  • Verify the failing automated tests pass with the new parser changes
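
The two-state design from the steps above might look roughly like the sketch below. It is self-contained, so `Encoding` and `NetworkDecoder` are minimal stand-ins for the real types, and `DecoderState`, `push_chunk`, the `Prescan` alias, and the `fallback` parameter are hypothetical names. In particular, `fallback` is a placeholder for whatever the later sniffing steps (such as the parent document check) would produce. Updating the Document's encoding is left out.

```rust
/// Hypothetical stand-in for the real `Encoding` type.
#[derive(Debug, PartialEq)]
pub struct Encoding(pub &'static str);

pub static UTF_8: Encoding = Encoding("UTF-8");
pub static WINDOWS_1252: Encoding = Encoding("windows-1252");

pub enum AbortReason {
    NotEnoughBytes,
    NoEncodingDetected,
}

/// Assumed signature of the prescan entry point.
pub type Prescan = fn(&[u8]) -> Result<&'static Encoding, AbortReason>;

/// Stand-in for Servo's NetworkDecoder: it only records what it was given.
pub struct NetworkDecoder {
    pub encoding: &'static Encoding,
    pub received: Vec<u8>,
}

impl NetworkDecoder {
    pub fn new(encoding: &'static Encoding) -> Self {
        NetworkDecoder { encoding, received: Vec::new() }
    }

    pub fn feed(&mut self, bytes: &[u8]) {
        self.received.extend_from_slice(bytes);
    }
}

/// Replaces a plain decoder field: bytes are buffered until an encoding is known.
pub enum DecoderState {
    Prescanning(Vec<u8>),
    Detected(NetworkDecoder),
}

impl DecoderState {
    /// An explicit override skips prescanning entirely.
    pub fn new(encoding_override: Option<&'static Encoding>) -> Self {
        match encoding_override {
            Some(encoding) => DecoderState::Detected(NetworkDecoder::new(encoding)),
            None => DecoderState::Prescanning(Vec::new()),
        }
    }

    /// The chosen encoding, or None while still prescanning.
    pub fn encoding(&self) -> Option<&'static Encoding> {
        match self {
            DecoderState::Detected(decoder) => Some(decoder.encoding),
            DecoderState::Prescanning(_) => None,
        }
    }

    /// Handles one network chunk, switching to Detected once prescanning ends.
    pub fn push_chunk(&mut self, chunk: &[u8], prescan: Prescan, fallback: &'static Encoding) {
        match self {
            DecoderState::Detected(decoder) => decoder.feed(chunk),
            DecoderState::Prescanning(buffer) => {
                buffer.extend_from_slice(chunk);
                let encoding = match prescan(buffer.as_slice()) {
                    Ok(encoding) => encoding,
                    // Keep buffering; nothing is decoded or parsed yet.
                    Err(AbortReason::NotEnoughBytes) => return,
                    Err(AbortReason::NoEncodingDetected) => fallback,
                };
                // Replay everything buffered so far through the new decoder.
                let mut decoder = NetworkDecoder::new(encoding);
                decoder.feed(buffer.as_slice());
                *self = DecoderState::Detected(decoder);
            }
        }
    }
}
```

Passing the prescan function in as a parameter keeps the sketch independent of any particular prescan implementation. A sketch like this also leaves open what happens when the stream ends while the state is still Prescanning, which a real integration would need to handle.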