CSC/ECE 517 Spring 2020 - M2001. Implement charset prescanning for the HTML parser

Servo is a modern, high-performance browser engine designed for both application and embedded use. It is written in Rust and is currently developed on 64-bit macOS, 64-bit Linux, 64-bit Windows, and Android. As of February 17, 2020, Servo does not yet prescan an HTML byte stream to determine its character encoding (charset), a feature that other major browsers already provide. The goal of this project is to implement HTML charset prescanning in the current version of Servo.

Introduction

Servo

Servo is an experimental browser engine developed to take advantage of the memory safety properties and concurrency features of the Rust programming language. The project was initiated by Mozilla Research, with Samsung contributing the effort to port it to Android and ARM processors. The prototype seeks to create a highly parallel environment in which many components (such as rendering, layout, HTML parsing, and image decoding) are handled by fine-grained, isolated tasks.

Rust

Rust is a multi-paradigm programming language focused on performance and safety, especially safe concurrency. Rust is syntactically similar to C++ but provides memory safety without using garbage collection.

DOM

  • The HTML DOM is an object model for HTML. It defines HTML elements as objects, together with the properties, methods, and events of every HTML element.
  • The HTML DOM is also an API (application programming interface) for JavaScript. Through it, JavaScript can add, change, or remove HTML elements, HTML attributes, and CSS styles.

[Figure: The HTML DOM tree of objects]

Setup

Setting up the local environment requires "rustup", the installer for the Rust toolchain. The guide to setting up the local environment for each operating system can be found here.

Final Project

Problem Statement

  • Our main focus is to complete the initial steps listed on the project page. The goal here is to create a new Rust module in the html5ever repository and implement the byte stream prescanning algorithm.
  • After completing the initial steps, we integrate the new prescan algorithm into Servo's HTML parser implementation, following the encoding sniffing algorithm. Rust's package manager, Cargo, is used to make Servo build against the locally modified html5ever.
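
To make the idea of prescanning concrete, here is a minimal sketch of the kind of scan involved. It is not the specification's full byte stream prescanning algorithm: it only lowercases at most the first 1024 bytes, looks for a `<meta` tag followed by `charset=`, and returns the label as text. The names `PRESCAN_LIMIT`, `find`, `skip_whitespace`, and `find_charset_label` are hypothetical, and the sketch ignores comments, other tags, and the `http-equiv`/`content` rules that the real algorithm handles.

```rust
/// Number of leading bytes inspected (the 1024-byte window used in this project).
const PRESCAN_LIMIT: usize = 1024;

/// Hypothetical helper: index of the first occurrence of `needle` in `haystack`.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Hypothetical helper: drop leading ASCII whitespace.
fn skip_whitespace(bytes: &[u8]) -> &[u8] {
    let n = bytes.iter().take_while(|b| b.is_ascii_whitespace()).count();
    &bytes[n..]
}

/// Simplified illustration only: returns the lowercased charset label that
/// follows the first `<meta` tag inside the prescan window, if there is one.
fn find_charset_label(input: &[u8]) -> Option<String> {
    // Only the first PRESCAN_LIMIT bytes are examined.
    let window = &input[..input.len().min(PRESCAN_LIMIT)];
    let lower = window.to_ascii_lowercase();

    // Locate "<meta", then the first "charset" after it.
    let meta = find(&lower, b"<meta")?;
    let after_meta = &lower[meta..];
    let charset = find(after_meta, b"charset")?;

    // Expect optional whitespace, '=', optional whitespace.
    let rest = skip_whitespace(&after_meta[charset + b"charset".len()..]);
    if rest.first() != Some(&b'=') {
        return None;
    }
    let rest = skip_whitespace(&rest[1..]);

    // The value may be quoted or unquoted.
    let (rest, quote) = match rest.first() {
        Some(&q) if q == b'"' || q == b'\'' => (&rest[1..], Some(q)),
        _ => (rest, None),
    };
    let end = match quote {
        // A quoted value must be closed by the same quote character.
        Some(q) => rest.iter().position(|&b| b == q)?,
        // An unquoted value ends at whitespace or a delimiter.
        None => rest
            .iter()
            .position(|&b| {
                b.is_ascii_whitespace() || matches!(b, b';' | b'>' | b'/' | b'"' | b'\'')
            })
            .unwrap_or(rest.len()),
    };
    if end == 0 {
        return None;
    }
    String::from_utf8(rest[..end].to_vec()).ok()
}
```

Under these assumptions, `find_charset_label(b"<meta charset=\"UTF-8\">")` yields `Some("utf-8")`, while input without a `<meta` tag, or with the tag placed after the first 1024 bytes, yields `None`. In the actual project the label would then be turned into an encoding value rather than kept as a string.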

Design Pattern

No design pattern is applied here, since our main goal is to write a function that implements the byte stream prescanning algorithm.

Implementation

  • Step 1: Create a new prescan.rs module in the html5ever repository and implement the byte stream prescanning algorithm.
  • Step 1a: Add a new public function that accepts a &[u8] argument and returns Result<&'static Encoding, AbortReason>, where AbortReason is an enum representing either not enough bytes or no encoding detected within the first 1024 bytes.
  • Step 1b: Use Encoding::for_label to convert a named charset into an Encoding value.
  • Step 2: Add unit tests that cover success and failure cases for the algorithm (use cargo test prescan to run the tests defined in the new prescan.rs module).
  • Step 3: Integrate the new prescan algorithm into Servo's HTML parser implementation, following the encoding sniffing algorithm:
  • Step 3a: Add a Cargo override to Servo's Cargo.toml that uses the locally modified version of html5ever.
  • Step 3b: Modify components/script/dom/servoparser/mod.rs to create an enum with two states, Prescanning(Vec<u8>) and Detected(NetworkDecoder), and replace the network_decoder field with this enum.
  • Step 3c: In push_bytes_input_chunk, if the Prescanning state is active, perform prescanning on any existing buffer together with the newest chunk. If prescanning completes, transition into the Detected state and update the associated Document's encoding with the detected encoding (step 4 of the encoding sniffing algorithm).
  • Step 3d: If prescanning does not complete, no parsing should occur in parse_bytes_chunk.
  • Step 3e: Modify new_inherited to accept an Option<&'static Encoding> argument, which is used as an override that avoids prescanning any input (step 3 of the encoding sniffing algorithm).
  • Step 3f: When prescanning completes with no detected encoding, check the encoding of the document in the parent of the document's browsing context (step 5 of the encoding sniffing algorithm).
  • Step 3g: Verify that the failing automated tests pass with the new parser changes.
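
The state handling in steps 3b to 3f can be sketched as follows. This is a self-contained illustration under stated assumptions, not Servo's actual code: `EncodingLabel` stands in for `&'static Encoding`, `Decoder` stands in for `NetworkDecoder`, the `prescan` stub only recognises `charset=utf-8`, and `Parser`, `DecoderState`, the `AbortReason` variant names, and the `fallback` field (standing in for the parent document's encoding) are all hypothetical.

```rust
/// Stand-in for `&'static Encoding`; a plain label keeps the sketch dependency-free.
type EncodingLabel = &'static str;

const PRESCAN_LIMIT: usize = 1024;

/// Variant names are assumptions; the project only describes the two cases.
#[derive(Debug, PartialEq)]
enum AbortReason {
    NotEnoughBytes,
    NoEncodingDetected,
}

/// Stub for the real prescan function: it only recognises "charset=utf-8".
fn prescan(bytes: &[u8]) -> Result<EncodingLabel, AbortReason> {
    let window = &bytes[..bytes.len().min(PRESCAN_LIMIT)];
    let needle: &[u8] = b"charset=utf-8";
    if window.to_ascii_lowercase().windows(needle.len()).any(|w| w == needle) {
        Ok("UTF-8")
    } else if bytes.len() < PRESCAN_LIMIT {
        Err(AbortReason::NotEnoughBytes)
    } else {
        Err(AbortReason::NoEncodingDetected)
    }
}

/// Stand-in for NetworkDecoder: remembers the encoding and the bytes received.
struct Decoder {
    encoding: EncodingLabel,
    received: Vec<u8>,
}

/// The two states described in step 3b.
enum DecoderState {
    Prescanning(Vec<u8>),
    Detected(Decoder),
}

struct Parser {
    state: DecoderState,
    /// Stand-in for the parent document's encoding (step 3f).
    fallback: EncodingLabel,
}

impl Parser {
    /// An explicit override skips prescanning entirely (step 3e).
    fn new(override_encoding: Option<EncodingLabel>, fallback: EncodingLabel) -> Self {
        let state = match override_encoding {
            Some(encoding) => DecoderState::Detected(Decoder { encoding, received: Vec::new() }),
            None => DecoderState::Prescanning(Vec::new()),
        };
        Parser { state, fallback }
    }

    /// Returns true once an encoding is known and parsing may proceed
    /// (steps 3c and 3d); false means the bytes were only buffered.
    fn push_bytes_input_chunk(&mut self, chunk: &[u8]) -> bool {
        match &mut self.state {
            DecoderState::Detected(decoder) => {
                decoder.received.extend_from_slice(chunk);
                true
            }
            DecoderState::Prescanning(buffer) => {
                // Prescan the existing buffer together with the newest chunk.
                buffer.extend_from_slice(chunk);
                let encoding = match prescan(buffer) {
                    Ok(encoding) => encoding,
                    Err(AbortReason::NoEncodingDetected) => self.fallback,
                    Err(AbortReason::NotEnoughBytes) => return false,
                };
                // Hand the buffered bytes over to the decoder and switch state.
                let received = std::mem::take(buffer);
                self.state = DecoderState::Detected(Decoder { encoding, received });
                true
            }
        }
    }

    /// The encoding in use, or None while prescanning is still in progress.
    fn encoding(&self) -> Option<EncodingLabel> {
        match &self.state {
            DecoderState::Detected(decoder) => Some(decoder.encoding),
            DecoderState::Prescanning(_) => None,
        }
    }
}
```

Keeping the buffered bytes inside the `Prescanning` variant means the parser cannot start decoding before an encoding is chosen, which is the property step 3d asks for. The sketch leaves out what happens when the stream ends while still prescanning; a real implementation would need to handle that case.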