CSC/ECE 517 Fall 2015 M1505 Add conformance tests to unicode-bidi and fix conformance bugs

Introduction

Web browsers are expected to support international text, and Servo is no exception. This project is an attempt to improve an existing library to implement the Unicode Bidirectional Algorithm for display of mixed right-to-left and left-to-right text, and it has not yet achieved full conformance with the specification<ref name="servo">http://en.wikipedia.org/wiki/Servo_%28layout_engine%29</ref>.

Servo

Servo is an experimental project to build a Web browser engine for a new generation of hardware: mobile devices, multi-core processors and high-performance GPUs. With Servo, we are rethinking the browser at every level of the technology stack — from input parsing to page layout to graphics rendering — to optimize for power efficiency and maximum parallelism. Servo builds on top of Rust to provide a secure and reliable foundation. Memory safety at the core of the platform ensures a high degree of assurance in the browser’s trusted computing base. Rust’s lightweight task mechanism also promises to allow fine-grained isolation between browser components, such as tabs and extensions, without the need for expensive runtime protection schemes, like operating system process isolation<ref name = "servo"/>.

Rust

Rust is a new programming language for developing reliable and efficient systems. It is designed to support concurrency and parallelism in building platforms that take full advantage of modern hardware. Its static type system is safe and expressive and it provides strong guarantees about isolation, concurrency execution and memory safety. Rust combines powerful and flexible modern programming constructs with a clear performance model to make program efficiency predictable and manageable. One important way it achieves this is by allowing fine-grained control over memory allocation through contiguous records and stack allocation. This control is balanced with the absolute requirement of safety: Rust’s type system and runtime guarantee the absence of data races, buffer overflow, stack overflow or access to uninitialized or deallocated memory<ref>http://www.rust-lang.org/</ref>.

Terminology

There is some terminology which may seem a little strange to first time readers. Here are some of the terms and their intended meanings.

RTL: Right-to-Left text. For example Urdu or Hebrew scripture.
LTR: Left-to-Right text. For example English or French scripture.
LRE: Left-to-Right Embedding Character. Usage: Treat the following text as embedded left-to-right.
LRE: Left-to-Right Embedding Character. Usage: Treat the following text as embedded left-to-right.
RLE: Right-to-Left Embedding Character. Usage: Treat the following text as embedded right-to-left.
UBA: Unicode-bidi Algorithm. This is an algorithm defined by Unicode.org which defines the display of the different character types like RTL, LTR, RLE, LRE among others.
N0, L1, W1 - W7: These are rules which are defined as part of the UBA Algorithm. These terms are explained below in the Project Description section.

To find more information on the above terms please check the Specification sheet for the UBA Algorithm here.

Architecture

generate.py - This file is central to the code base as it fetches the test data files BidiTest.txt<ref name="biditest">http://www.unicode.org/Public/UNIDATA/BidiTest.txt</ref> and BidiCharacterTest.txt<ref name="bidichartest">http://www.unicode.org/Public/UNIDATA/BidiCharacterTest.txt</ref>
BidiTest.txt<ref name="biditest"/> - Contains test case sample data at word level.
BidiCharacterTest.txt<ref name="bidichartest"/> - Contains test case sample data at character level.
Cargo - Rust file used to run the cargo test.
lib.rs - Rust file used for Unicode Bidirectional Algorithm testing.

Project Description

Initial Steps:

Clone the unicode-bidi repository, compile it, and run the tests in defined in src/lib.rs.

In-order to run the tests the user must navigate to the unicode-bidi folder and then use the following command. Rust must be downloaded in order for this command to work.

cargo test

Add code to tools/generate.py to download the files that make up the conformance testsuite: BidiTest and BidiCharacterTest.

The fetch commands have been commented out for now, as they were not required in the initial steps.

if __name__ == "__main__":
    os.chdir("../src/") # changing download path to /unicode-bidi/src/
    r = "tables.rs"
    # downloading the test case files
    # fetch("BidiTest.txt")
    # fetch("BidiCharacterTest.txt")

By hand, convert several test cases from the file into Rust tests that can be run automatically.

We have tested the following public methods:

reorder_line() : We added tests to this method. A failing test was found, which will be fixed in the subsequent steps.
is_ltr()
is_rtl()
removed_by_x9()
not_removed_by_x9()

Subsequent Steps:

Need to add methods to tools/generate.py that automatically converts the tests in the conformance suite into Rust tests that can be run automatically [1]

BidiTest and BidiCharacterTest contain a comprehensive list of test cases for testing characters to be rendered in Servo. Our implementation in generate.py will generate the Rust code which will be the input for the reorder_line() method in lib.rs. Two python files, BidiCharacterTest.py and BidiTest.py have been added to generate the Rust test cases. They are invoked from generate.py. Code for deleting test cases has also been added. They can be found in:

tools/BidiCharacterTest.py
tools/BidiTestParser.py

Implement the missing step L1 from the UBA [2]

The L1 step can be described as: On each line, reset the embedding level of the following characters to the paragraph embedding level:

Segment separators,
Paragraph separators,
Any sequence of whitespace characters and/or isolate formatting characters (FSI, LRI, RLI, and PDI) preceding a segment separator or paragraph separator, and
Any sequence of whitespace characters and/or isolate formatting characters (FSI, LRI, RLI, and PDI) at the end of the line.

In combination with the following rule, this means that trailing whitespace will appear at the visual end of the line (in the paragraph direction). Tabulation will always have a consistent direction within a paragraph. <ref>http://unicode.org/reports/tr9/#L1</ref>

This step will be implemented in the lib.rs file.

A function called resolve_white_space(para_level: u8, classes: &[BidiClass], levels: &mut [u8]) has been added to the lib.rs file.

fn resolve_white_space(para_level: u8, classes: &[BidiClass], levels: &mut [u8])
{   
    let mut start = 0;
    let line_end = levels.len();
    let mut level = levels[start];

     for i in start..line_end {

        let new_level = levels[i];

        if new_level != level {
            let prev_start = start;
            start = i;
            level = new_level;
            //Rule L1, clause 4
            for k in (prev_start..start).rev()
            {
                if prepare::white_space(classes[k]) {
                   levels[k] = para_level;
                }
                else if prepare::removed_by_x9(classes[k]) {
                            levels[k] = if k > 0 { levels[k-1] } else { para_level };
                            
                        }
                else {
                    break;
                }
            } 
        }
        else {
            //Rule L1, clauses 1 and 2
            if prepare::is_segement_or_paragraph_separator(classes[i]) {
                    levels[i] = para_level;

                    //Rule L1, clause  3
                    for j in (0..i).rev(){

                        if prepare::white_space(classes[j]) {
                           levels[j] = para_level;
                        }
                        else if prepare::removed_by_x9(classes[j]) {
                            levels[j] = if j > 0 { levels[j-1] } else { para_level };
                        }
                        else {
                             break;
                        }
                    }
                    
                } 
        }
    }

}

Implement the missing step N0 from the UBA [3]

Process bracket pairs in an isolating run sequence sequentially in the logical order of the text positions of the opening paired brackets. Identify the bracket pairs in the current isolating run sequence according to BD16. <ref>http://unicode.org/reports/tr9/#N0</ref>

Solve the conformance problems related to the implementation of steps W1 to W7[4]

We will have to improve the existing implementation of the steps W1 to W7.

Flowchart of Project Description

Design Principles

Three of our proposed design principles are:

Open-Closed principle: In object-oriented programming, the open/closed principle states "software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification" that is, such an entity can allow its behavior to be extended without modifying its source code.<ref>https://en.wikipedia.org/wiki/Open/closed_principle</ref> In the subsequent steps we used this principle in generate.py. We have added code to generate.py to call external modules to convert the test cases in BidiTest.txt and BidiCharacterTest.txt into Rust test cases. However, we haven't changed existing code in the generate.py file. Beyond this we have written modularized code where each module was tested and debugged before it was integrated into the
Single Responsibility Principle: In object-oriented programming, the single responsibility principle states that every class should have responsibility over a single part of the functionality provided by the software, and that responsibility should be entirely encapsulated by the class. All its services should be narrowly aligned with that responsibility<ref>https://en.wikipedia.org/wiki/Single_responsibility_principle</ref>. The code to be implemented in generate.py and lib.rs contains separate single responsibilities. generate.py deals will fetching files, loading, unloading data. Whereas, lib.rs deals with actually testing the existing methods, and extending the functionality of the Unicode-Bidi algorithm. Further we have created parsers and test case generators that by themselves parse or generate test cases from the test case files that they are responsible for. Beyond this we have written modularized code where each module was inherently responsible only for part of the task that it was assigned.
Expert Pattern: This principle is stated as: "The object that contains the necessary data to perform a task should be the object that performs the task.". Since objects of the classes BidiCharacterTestCase and BidiTestCase have all the data necessary to build the test cases, we have made these objects responsible for building test cases.

Design Pattern

Command Pattern was used as the design pattern for this implementation. The Command Pattern uses an object to encapsulate all information needed to perform an action or trigger an event at a later time. <ref>https://en.wikipedia.org/wiki/Command_pattern</ref>

The five terms associated with this pattern are:

Command Object:A command object knows about receiver and invokes a method of the receiver.
Command: Values for parameters of the receiver method are stored in the command
Receiver: Does the work after receiving the command.
Invoker: An invoker object knows how to execute a command, and optionally does bookkeeping about the command execution.
Client: The client decides which commands to execute at which points

In this project the terms given above can be interpreted as:

Command Object: generate.py
Command:

   BidiCharacterTest.txt - Calls a method, BidiCharacterTestCase(object) to convert parsed BidiCharacterTest.txt file to test cases.
   BidiTest.txt - Calls a method, BidiTestCase(object) to convert the parsed BidiTest.txt file to test cases.

Receiver: lib.rs
Invoker: Cargo Test
Client: Cargo

Design Decisions

Here are some of the design decisions that we have made regarding each of the steps:

Step 1

Objective: Add code to tools/generate.py that automatically converts the tests in the conformance suite into Rust tests that can be run automatically.

Here, separate python source files will be used to build the test automation framework.
Essentially the test case repositories will have to be parsed and these will be used to build test cases.
These test cases will be inserted into the lib.rs files between commented automation test markers. These markers will exist to ensure that the lines where the code has to be generated is identified easily.
Each inserted test case will have commented ids associated with them in order to be able to map each test case to it's source.

Step 2

Objective: Implement the missing step L1 from the UBA.

The visual_runs function does not yet implement step L1, which defines steps for adjusting the levels of certain whitespace and formatting characters. This function will be modified to take care of the step L1 and will adjust levels for removed_by_x9 characters.

Step 3

Objective: Implement the missing step N0 from the UBA.

In the resolve_neutral function, implement step N0: Process bracket pairs in an isolating run sequence. A new function will be introduced to take care of the step N0. The resolve neutral function will call this new function to process step N0.

Step 4

Objective: Solved the conformance problems related to the implementation of steps W1 to W7.

The previous implementation of steps W1 - W7 tried to map the steps in a single pass. This strategy produced incorrect results. The current version will preserve the state of the variable that was getting modified and hence ensure that each step becomes mutually exclusive from other steps.

UML Diagrams

Class Diagram

Here is a class diagram representing the different classes involved and their mutual interaction.

Test Cases

As part of this project we had to perform manual testing in the initial steps of the algorithm, and we have to automate testing as part of one of the subsequent steps. Here are some details about both:

Manual Testing

As part of the initial steps, we did add a few manual test cases that would check for conformance to some of the major steps. Here are some of those test cases:

Check for LTR by passing the level number
Check for RTL by passing the level number
Check for removal of characters according to the Rule X9 of the algorithm
Check for non removal of characters according to the Rule X9 of the algorithm
Check for reordering of characters in accordance with the following types of characters:
- Weak LTR
- Strong LTR
- Strong RTL
- Neutral characters
- RTL(Explicit Right-To-Left) Markers (Failing Test Case. The steps to implement this was not implemented till that point in time.)

Automation Test Suite

The project involved adding code from BidiCharacterTest.txt and BidiTest.txt so as to ensure that the implementation of the unicode-bidi algorithm always conforms to the specifications defined in the Unicode Bidirectional algorithm. All the conformance tests are mentioned in these two files. At any point in time, an implementation of the Unicode-bidi algorithm should pass exactly all of the test cases mentioned in these files in order to conform to the specification standards set by the Unicode Consortium. Our design will take care of this with the following steps:

With every new version, the generate.py script will be automatically executed
This script will get the latest copy of the test case files from Unicode.org.
Parse the test case files and generate test cases from these files.
Convert each test case to appropriate test code, and insert these test cases within the lib.rs file

Additionally, with each change to the base version of the unicode-bidi library, when the cargo test command is called, it will test all of the test cases in lib.rs and this will ensure that the library automatically conforms to specifications created by unicode.org.