CSC/ECE 517 Fall 2015 M1503 Integrate xml5ever XML parser

From Expertiza_Wiki
Revision as of 18:47, 6 December 2015 by Jsharda (talk | contribs) (Details of TreeSink Implementation)

Rust

Rust is a general-purpose, compiled programming language developed by Mozilla Research. Its syntax is similar to that of C and C++, with blocks of code delimited by curly brackets and familiar control-flow keywords. Unlike Java, Rust does not use automatic garbage collection. It achieves memory safety without a garbage collector, and it supports concurrency and parallelism for building platforms.
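
As a small illustration of memory safety without a garbage collector, the sketch below shows Rust's ownership rules at work. It is not taken from Servo; the function names consume and inspect are made up for this example.

```rust
// Hypothetical helpers for illustration only; not Servo code.

/// Takes ownership of the buffer. The buffer is freed when this function
/// returns, without any garbage collector being involved.
fn consume(buffer: Vec<u8>) -> usize {
    buffer.len()
}

/// Only borrows the buffer. The caller keeps ownership and can keep using it.
fn inspect(buffer: &[u8]) -> usize {
    buffer.len()
}
```

Calling inspect(&data) leaves data usable afterwards, while calling consume(data) moves it, so any later use of data is rejected at compile time.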

Servo

Servo is a web browser layout engine developed by Mozilla Research and written in Rust. Servo handles work such as rendering, layout, and image decoding as separate tasks that can run in parallel. It provides APIs and JavaScript support. Servo was not developed explicitly to create a full web browser but to achieve maximum parallelism.

Compilation and Build

Servo's build system automatically downloads a snapshot Rust compiler to build itself. This is normally a specific revision of Rust upstream, but sometimes has a backported patch or two.

Code link: https://github.com/servo/servo/ . This repository is forked to https://github.com/ronak6892/servo .

Servo is built with Cargo, the Rust package manager. We also use Mozilla's Mach tools to orchestrate the build and other tasks.

Normal build

The following commands build Servo in development mode. This is useful for development, but the resulting binary is very slow.

git clone https://github.com/servo/servo
cd servo
./mach build --dev
./mach run tests/html/about-mozilla.html

For benchmarking, performance testing, or real-world use, add the --release flag to create an optimized build:

./mach build --release
./mach run --release tests/html/about-mozilla.html

Building for Android target

git clone https://github.com/servo/servo
cd servo
ANDROID_TOOLCHAIN=/path/to/toolchain ANDROID_NDK=/path/to/ndk PATH=$PATH:/path/to/toolchain/bin ./mach build --android
cd ports/android
ANDROID_SDK=/path/to/sdk make install

Rather than setting the ANDROID_* environment variables every time, you can also create a .servobuild file and then edit it to contain the correct paths to the Android SDK/NDK tools:

cp servobuild.example .servobuild
# edit .servobuild

Running

./mach run [url] 

Project Description

Background information

Servo uses a custom HTML5 parser written in Rust, called HTML5ever. Servo currently lacks a parser for XML documents, which prevents it from running XHTML tests and from implementing APIs that rely on XML parsing. XML5ever is an experimental XML parser that works on a modified specification of XML called XML5, which drops certain properties of XML, such as well-formedness, in exchange for better compatibility with HTML and better error recovery. XML5ever is based largely on the HTML5ever parser.

Goal

The goal of the project is to integrate the XML5ever parser into Servo so that Servo can parse XML documents, which it currently cannot. After the project, Servo will differentiate between HTML and XML documents and parse each with its respective parser.

Steps done as part of the OSS Project

To achieve the project goal, we completed the following initial steps in our OSS project (these have already been merged into Servo's master branch).

  • Compiled Servo and added xml5ever as a dependency of the script crate using the Cargo package manager. To do this, we edited Cargo.toml, located at components/script, to list xml5ever as a dependency.
  • Added xml.rs at components/script/parse with a parse_xml() function. Declared xml as a public module in mod.rs so that the new file is part of the crate.
  • Declared an empty ServoXMLParser interface in a WebIDL file located at components/script/dom/webidls.
  • Implemented the ServoXMLParser interface with the necessary stubs in servoxmlparser.rs, located at components/script/dom. Also declared servoxmlparser as a public module in mod.rs, located at components/script/dom.
  • Called parse_xml from domparser.rs, located at components/script/dom, which helps the build compile.
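
The module layout described in these steps can be sketched in a single file as follows. This is a simplified illustration: in Servo the modules live in separate files (xml.rs, servoxmlparser.rs, and the two mod.rs files), and the signature of parse_xml and the pending_input field shown here are placeholders assumed for the example, not the merged code.

```rust
// Simplified single-file sketch. In Servo these modules live in separate
// files; all signatures and fields below are assumed placeholders.

pub mod parse {
    // Mirrors the `pub mod xml;` declaration in components/script/parse/mod.rs.
    pub mod xml {
        use crate::dom::servoxmlparser::ServoXMLParser;

        /// Stub entry point: hands the input to the parser object.
        /// A real implementation would feed it to xml5ever instead.
        pub fn parse_xml(parser: &mut ServoXMLParser, input: &str) {
            parser.pending_input.push_str(input);
        }
    }
}

pub mod dom {
    // Mirrors the `pub mod servoxmlparser;` declaration in components/script/dom/mod.rs.
    pub mod servoxmlparser {
        /// Stub counterpart of the empty ServoXMLParser WebIDL interface.
        #[derive(Default)]
        pub struct ServoXMLParser {
            /// Hypothetical field: input received but not yet parsed.
            pub pending_input: String,
        }
    }
}
```

Declaring each module as public is what lets a caller in another module, such as domparser.rs in the steps above, reach parse::xml::parse_xml.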

Design

To integrate XML5ever, a dependency on the XML5ever parser was added, similar to the existing one on HTML5ever. A separate interface was defined in the ServoXMLParser WebIDL file, and this interface was implemented in its corresponding Rust file along with the stubs necessary to parse XML.

UML diagram

Design pattern

The Adapter design pattern was applied to extend Servo's parsing mechanism to XML5. The interface defined using the adapter pattern closely resembles the ServoHTMLParser interface. This makes it easier to write or modify code that handles both XML and HTML documents, and future readers do not have to understand the two code paths separately, since they are related in functionality.
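
The adapter idea can be sketched as follows. Every type and method in this example (DocumentParser, HtmlEngine, XmlEngine, parser_for, and so on) is hypothetical and only stands in for the real html5ever and xml5ever entry points. The selection by the text/xml content type follows the rule described in the Implementation section.

```rust
// Illustrative sketch of the adapter pattern; all names are hypothetical.

/// Common interface that the rest of the engine would program against.
pub trait DocumentParser {
    /// Short name of the underlying parser.
    fn name(&self) -> &'static str;
    /// Feeds one chunk and returns the total number of bytes consumed so far.
    fn parse_chunk(&mut self, chunk: &str) -> usize;
}

/// Stand-in for a library that accepts text.
struct HtmlEngine { seen: usize }
impl HtmlEngine {
    fn feed(&mut self, text: &str) { self.seen += text.len(); }
}

/// Stand-in for a library with a different, byte-oriented entry point.
struct XmlEngine { buffered: Vec<u8> }
impl XmlEngine {
    fn push_bytes(&mut self, bytes: &[u8]) { self.buffered.extend_from_slice(bytes); }
}

/// Adapters translate the common interface into each engine's native calls.
pub struct HtmlAdapter(HtmlEngine);
pub struct XmlAdapter(XmlEngine);

impl DocumentParser for HtmlAdapter {
    fn name(&self) -> &'static str { "html" }
    fn parse_chunk(&mut self, chunk: &str) -> usize {
        self.0.feed(chunk);
        self.0.seen
    }
}

impl DocumentParser for XmlAdapter {
    fn name(&self) -> &'static str { "xml" }
    fn parse_chunk(&mut self, chunk: &str) -> usize {
        self.0.push_bytes(chunk.as_bytes());
        self.0.buffered.len()
    }
}

/// Picks the XML adapter for text/xml and falls back to HTML otherwise.
pub fn parser_for(content_type: &str) -> Box<dyn DocumentParser> {
    if content_type == "text/xml" {
        Box::new(XmlAdapter(XmlEngine { buffered: Vec::new() }))
    } else {
        Box::new(HtmlAdapter(HtmlEngine { seen: 0 }))
    }
}
```

Callers only see DocumentParser, so code that drives parsing does not need to know which engine sits behind the adapter.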

Implementation

  • Modify Script::load in script_task.rs to check whether the document being parsed is of type text/xml. When the content type of the document is text/xml, the parse_xml method that was defined in the OSS project will be called. Previously, parse_html was called for all documents; with the XML parser integrated, parse_xml will be called instead for XML documents, passing the appropriate flag to the Document constructor.
  • A Sink was implemented for ServoXMLParser. It defines a utility function, get_or_create, which searches for a child and, if none is found, creates a new one and returns it.
  • XML5ever defines a TreeSink interface for integrating it. An implementation of this TreeSink was provided in xml.rs. It includes the following functions:
  1. get_document : returns the XML document being parsed
  2. elem_name : returns the name of the XML document node specified in the argument
  3. create_element : creates a new element, sets the attributes provided in the argument, and returns the newly created element
  4. create_comment : creates a new comment node using the text specified in the argument
  5. append(parent, child) : finds the child using the get_or_create function defined in Sink (previous step) and appends it to the parent node
  6. append_doctype_to_document : creates a doctype using the public_id and system_id provided in the arguments and appends it to the parent node
  7. create_pi : creates a processing instruction using the target and data in the arguments and returns a reference to it
  • Implement a TreeSink for the XML parser that picks nodes from the XML document and appends them to the XML tree in hierarchical order. The Serializable module also needs to be implemented in order to implement TreeSink.
  • Support for XML document responses needs to be implemented. In this step, the response Document and its MIME type are checked; if either one is null, the function returns null. Otherwise, if the MIME type is text/html, its charset is checked and, if it is null, set to UTF-8. If the MIME type is text/xml, the document is defined as a Document that represents the result of running the XML parser, with XML scripting support disabled, on the bytes. At the end of this step, based on the above conditions, the document’s encoding is set to charset, its content type is set to the final MIME type, and its URL is set to the document’s URL, and this response document object is returned.
  • Implement the XMLDocument API:
  1. adding the new IDL file at components/script/dom/webidls/XMLDocument.webidl;
  2. creating components/script/dom/XMLDocument.rs;
  3. listing XMLDocument.rs in components/script/dom/mod.rs;
  4. defining the DOM struct XMLDocument with a #[dom_struct] attribute, a superclass or Reflector member, and other members as appropriate;
  5. implementing the dom::bindings::codegen::Bindings::XMLDocumentBindings::XMLDocumentMethods trait for &'a XMLDocument.
  6. In XMLDocument.webidl file, implement the load method.
  partial interface XMLDocument {
	boolean load(DOMString url);
  };
  • The load(url) method will check and validate the URL specified in the method call and set the readiness of the current document to "loading". It will then start a request-response flow: it initiates a request with the destination set to sub-resource, and the response is the result of fetching that request. If the response's Content-Type metadata is an XML MIME type, it creates a new XML parser associated with the result document, passes the response's body to this parser, sets success to true, and sets the readiness of the document to "complete". Finally, it replaces all children of the document with the children of the result, followed by mutation events, so that the XML document gets loaded.
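
To make the TreeSink step in the list above more concrete, here is a toy model of a tree sink. The SimpleTreeSink trait is a simplified stand-in that borrows a few method names from the list; it is not xml5ever's real TreeSink signature. The arena-based Sink, the Handle type, and the NodeOrText enum are assumptions made for this sketch, and get_or_create is interpreted here as "reuse the node if one is given, otherwise create a text node".

```rust
// Toy model only: not xml5ever's real TreeSink trait and not Servo's Sink.

type Handle = usize; // index into the arena below

#[derive(Debug, PartialEq)]
enum NodeData {
    Document,
    Element { name: String, attrs: Vec<(String, String)> },
    Comment(String),
    Text(String),
}

struct Node {
    data: NodeData,
    children: Vec<Handle>,
}

/// What the parser hands to `append`: an existing node or raw text.
enum NodeOrText {
    AppendNode(Handle),
    AppendText(String),
}

trait SimpleTreeSink {
    fn get_document(&self) -> Handle;
    fn elem_name(&self, target: Handle) -> Option<&str>;
    fn create_element(&mut self, name: &str, attrs: Vec<(String, String)>) -> Handle;
    fn create_comment(&mut self, text: &str) -> Handle;
    fn append(&mut self, parent: Handle, child: NodeOrText);
}

struct Sink {
    nodes: Vec<Node>, // index 0 is always the document node
}

impl Sink {
    fn new() -> Sink {
        Sink { nodes: vec![Node { data: NodeData::Document, children: Vec::new() }] }
    }

    fn new_node(&mut self, data: NodeData) -> Handle {
        self.nodes.push(Node { data, children: Vec::new() });
        self.nodes.len() - 1
    }

    /// Returns the given node, or creates a text node when only text is given.
    fn get_or_create(&mut self, child: NodeOrText) -> Handle {
        match child {
            NodeOrText::AppendNode(handle) => handle,
            NodeOrText::AppendText(text) => self.new_node(NodeData::Text(text)),
        }
    }
}

impl SimpleTreeSink for Sink {
    fn get_document(&self) -> Handle { 0 }

    fn elem_name(&self, target: Handle) -> Option<&str> {
        match &self.nodes[target].data {
            NodeData::Element { name, .. } => Some(name.as_str()),
            _ => None, // only elements have a name
        }
    }

    fn create_element(&mut self, name: &str, attrs: Vec<(String, String)>) -> Handle {
        self.new_node(NodeData::Element { name: name.to_string(), attrs })
    }

    fn create_comment(&mut self, text: &str) -> Handle {
        self.new_node(NodeData::Comment(text.to_string()))
    }

    fn append(&mut self, parent: Handle, child: NodeOrText) {
        let child = self.get_or_create(child);
        self.nodes[parent].children.push(child);
    }
}
```

A parser would drive such a sink by calling create_element for each start tag and append for each child it encounters. Storing nodes in a flat vector keeps the sketch free of reference counting, which the real DOM integration cannot avoid.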

Challenges

The primary challenge in this project is to stay continuously in sync with the latest Servo commits and to ensure that the build succeeds after integrating each step. Adding the necessary stubs to the newly added files as the parsing mechanism changes will also need continuous and careful effort. Parsing an XML document involves many input arguments, so error checking for all the edge cases will also be a challenge while implementing TreeSink and support for XMLHttpRequest.

Testing

No extra test cases were added for integrating the XML parser model into the current Servo code. We have added interface files and method stubs; the XML parser functionality will be implemented in subsequent steps of the final project. Integration success is tested by successful compilation and build after adding our changes. The following commands are used to check that all test cases pass.

./mach run tests/html/about-mozilla.html

./mach test-tidy

All the modifications suggested by the Servo community through comments on the pull request have been incorporated, and the pull request has been merged successfully.

References

Servo Documentation - http://doc.servo.org/servo/index.html

Project Definition - https://github.com/servo/servo/wiki/Integrate-xml5ever

Rust Documentation - https://doc.rust-lang.org/nightly/index.html

XMLHttpRequest specification (document response) - https://xhr.spec.whatwg.org/#document-response

YouTube - https://youtu.be/i8dONOzYwlc