CSC/ECE 517 Spring 2018- Project M1803: Implement a web page fuzzer to find rendering mismatches (Part 2)


By Alexander Simpson (adsimps3), Abhay Soni (asoni3), Dileep Badveli (dbadvel), and Jake Batty (jbatty)

Introduction

This Mozilla project was broken into two main parts: the previous work and the work to be done. The previous work was finished as part of the OSS project. In that project (explained in more detail below), we created a tool that generates random valid HTML files, and we automated Servo. Servo is an experimental web browser developed by Mozilla "to create a new layout engine using a modern programming language". By automating Servo, we were able to quickly see whether it could render the randomly generated pages.

The goal of this final project was to build on the original project by adding more features. The work we completed is split into three main parts. First, we extended the program created in the OSS project to also control Firefox, which gave us two screenshots of each randomly generated page: one from Servo and one from Firefox. Second, we compared the two screenshots using Python and OpenCV. Finally, we expanded the page generation tool so that the randomly generated web pages can have more properties and be more complex.

A demo video of our project can be found in 2 parts: Part1 and Part2

Previous Work (Part of the OSS Project)

As per the project description, we were expected to complete the initial steps. The implementation is explained below for each of these steps.

1) In a new repository, create a program that can generate a skeleton HTML file with a doctype, head element, and body element, and print the result to stdout
- The project repository contains the code_generation.py file, which is used to generate random valid HTML files.
2) Add a module to the program that enables generating random content specific to the <head> element (such as inline CSS content inside of a <style> element) and add it to the generated output
- The file code_generation.py contains the code that generates random content specific to the head element and adds styling on top of it. After generating the random content, we add CSS rules that apply to it. We have established lists of commonly used styles, weights, fonts, font_styles, and alignments, from which values are picked at random. For practical purposes, we limit the number of options.
3) Add a module to the program that enables generating random content specific to the <body> element (such as a <p> block that contains randomly generated text) and add it to the generated output

4) Generate simple random CSS that affects randomly generated content (i.e. if there is an element with an id foo, generate a CSS selector like #foo that applies a style like color to it)
5) Create a program under Servo's etc/ that launches Servo and causes it to take a screenshot of a particular URL - use this to take screenshots of pages randomly generated by the previous program
Sample Screenshot:

Lists of Tasks

Below is a list of the tasks that were completed as a part of our final project. Below each task we have described how exactly it was implemented with code examples.

1) Extend the program that controls Servo to also control Firefox using geckodriver

Task 1 is relatively simple: it involves downloading geckodriver and running it. Geckodriver is an open-source proxy that lets WebDriver clients, such as Selenium, control Firefox. It allows us to take screenshots of a particular URL, just like task 5 in the previous work section, but in Firefox instead of Servo.

2) Compare the resulting screenshots and report the contents of the generated page if the screenshots differ

This task involves running both the geckodriver-based Firefox automation and the existing Servo program on the same generated page. Each produces its own screenshot. If Servo and Firefox render the page differently, we report that file and mark the differences.

3) Extend the page generation tool with a bunch of additional strategies, such as:
- Generating element trees of arbitrary depth
- Generating sibling elements
- Extending the set of CSS properties that can be generated (display, background, float, padding, margin, border, etc.)
- Extending the set of elements that can be generated (span, div, header elements, table (and associated table contents), etc.)
- Randomly choosing whether to generate a document in quirks mode or not
UML Diagram of Tasks:

Below is a UML diagram of the tasks of the overall system. As you can see, it covers all 3 tasks. It starts with the code generation, then splits to take screenshots in both Firefox and Servo. It then compares the screenshots and reports the distance between them.

Implementation of Tasks

1) Extend the program that controls Servo to also control Firefox using geckodriver

To open Firefox we use geckodriver and Selenium. After the browser is opened, the HTML files are loaded one at a time. A screenshot is taken of each HTML file and stored locally as a PNG file. The width and height variables, which are initialized from the program's arguments, set the resolution of the page area, and the num_of_files variable sets how many HTML pages are loaded and captured. The browser.execute_script call returns the difference between the window's outer and inner dimensions, that is, the space taken up by the browser's title bar, toolbars, and borders. Adding this difference to the requested width and height gives the window size needed to reach the indicated resolution.
import os

from PIL import Image
from selenium import webdriver

browser = webdriver.Firefox()  # opens the Firefox browser through geckodriver
for x in range(num_of_files):
    file = os.path.abspath("file" + str(x) + ".html")
    # size of the browser chrome: outer window size minus the page viewport
    dx, dy = browser.execute_script("var w=window; return [w.outerWidth - w.innerWidth, w.outerHeight - w.innerHeight];")
    browser.set_window_size(width + dx, height + dy)
    browser.get("file:///" + file)  # go to html page with firefox
    browser.save_screenshot("screen" + str(x) + ".png")  # saves the current screen
    img = Image.open("screen" + str(x) + ".png")
    img_width, img_height = img.size
    print('Image width is :', img_width)
    print('Image height is :', img_height)
browser.quit()  # closes the browser and ends the geckodriver session
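One small detail in the listing above: prefixing an absolute POSIX path with "file:///" yields four slashes after the scheme. Browsers generally accept this, but the URL can also be built with the standard pathlib module. The following is only a sketch; the helper name page_url is hypothetical and is not part of the project's script.

```python
from pathlib import Path


def page_url(index, directory="."):
    """Return a file:// URL for the generated page file<index>.html.

    page_url is a hypothetical helper, not part of the original script.
    """
    # resolve() makes the path absolute; as_uri() adds the scheme and
    # percent-encodes characters such as spaces.
    return (Path(directory) / f"file{index}.html").resolve().as_uri()


print(page_url(0))
```

The result could be passed to browser.get() in place of the string concatenation.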

2) Compare the resulting screenshots and report the contents of the generated page if the screenshots differ

To make the comparisons we use OpenCV together with scikit-image's compare_ssim(). This lets us not only compare the images but also mark where the differences are. We first read in the images, convert them to grayscale, and then call compare_ssim().
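import cv2
# newer scikit-image releases provide this as skimage.metrics.structural_similarity
from skimage.measure import compare_ssim

imageA = cv2.imread(image1)
imageB = cv2.imread(image2)
grayA = cv2.cvtColor(imageA, cv2.COLOR_BGR2GRAY)
grayB = cv2.cvtColor(imageB, cv2.COLOR_BGR2GRAY)
(score, diff) = compare_ssim(grayA, grayB, full=True)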


This gives us a score and a diff value. The score represents how similar the two images are (1.0 means identical), and diff is an image showing where the differences are. If the score indicates that the images differ, we threshold diff, pass the result to findContours(), and then draw a rectangle around each difference.

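# compare_ssim returns diff as a floating-point image; Otsu thresholding needs 8-bit input
diff = (diff * 255).astype("uint8")
thresh = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
contours = cv2.findContours(thresh.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# findContours returns (contours, hierarchy) in OpenCV 2 and 4, but (image, contours, hierarchy) in OpenCV 3
contours = contours[0] if len(contours) == 2 else contours[1]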
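For intuition about the score returned by compare_ssim() above, the structural similarity formula can be written out in plain Python. This is a simplified sketch, not what the library does internally: compare_ssim() evaluates the formula over small sliding windows and averages the results, whereas the hypothetical global_ssim function below treats the whole image as a single window. The constants 0.01 and 0.03 are the conventional SSIM stabilizers.

```python
def global_ssim(a, b, data_range=255):
    """Single-window SSIM over two equal-length sequences of gray values."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(a)
    # Small constants that keep the division stable for flat images.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) * (x - mu_a) for x in a) / n
    var_b = sum((y - mu_b) * (y - mu_b) for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a * mu_a + mu_b * mu_b + c1) * (var_a + var_b + c2)
    return numerator / denominator


same = global_ssim([10, 200, 30, 40], [10, 200, 30, 40])
inverted = global_ssim([0, 0, 255, 255], [255, 255, 0, 0])
print(same, inverted)
```

Identical inputs score 1, and the score drops as the pixel patterns diverge, which is why a score below 1 is treated as a rendering mismatch.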

After getting the contours, we get a bounding rectangle for each contour by calling "(x, y, w, h) = cv2.boundingRect(c)". We then combine any rectangles that are inside each other or very close to each other. After combining the rectangles, we draw them on both images. Lastly, we save image A, with the rectangles showing the differences.

for (x, y, w, h) in rectangles:
    # draw a red box (OpenCV uses BGR color order) around each differing region
    cv2.rectangle(imageA, (x, y), (x + w, y + h), (0, 0, 255), 2)
    cv2.rectangle(imageB, (x, y), (x + w, y + h), (0, 0, 255), 2)
cv2.imwrite(filename, imageA)
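The step that combines nested or nearby rectangles is described above but not listed. One possible way to do it is sketched here, assuming rectangles are (x, y, w, h) tuples as returned by cv2.boundingRect. The function names and the gap parameter are hypothetical; the project's actual merging code may work differently.

```python
def close_or_overlapping(r1, r2, gap):
    """True if the rectangles overlap or lie within `gap` pixels of each other."""
    x1, y1, w1, h1 = r1
    x2, y2, w2, h2 = r2
    # Grow r1 by `gap` on every side, then apply a standard overlap test.
    return (x1 - gap <= x2 + w2 and x2 <= x1 + w1 + gap and
            y1 - gap <= y2 + h2 and y2 <= y1 + h1 + gap)


def union(r1, r2):
    """Smallest rectangle that contains both inputs."""
    x = min(r1[0], r2[0])
    y = min(r1[1], r2[1])
    right = max(r1[0] + r1[2], r2[0] + r2[2])
    bottom = max(r1[1] + r1[3], r2[1] + r2[3])
    return (x, y, right - x, bottom - y)


def merge_rectangles(rectangles, gap=10):
    """Repeatedly merge rectangles until no two are close or overlapping."""
    rects = list(rectangles)
    merged = True
    while merged:
        merged = False
        result = []
        for rect in rects:
            for i, kept in enumerate(result):
                if close_or_overlapping(rect, kept, gap):
                    result[i] = union(rect, kept)
                    merged = True
                    break
            else:
                # No nearby rectangle found: keep this one as is.
                result.append(rect)
        rects = result
    return rects


print(merge_rectangles([(0, 0, 10, 10), (5, 5, 10, 10), (100, 100, 5, 5)], gap=2))
```

Merging before drawing keeps the marked-up screenshot readable when one rendering difference produces many small contours.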


3) Extend the page generation tool with a bunch of additional strategies, such as:

For task 3 there are several different parts, but the main goal was to increase the complexity of our randomly generated pages. The code_generation.py file was adapted to provide these functions. The implementation of each subtask is explained below.

Generating elements trees of arbitrary depth and Generating sibling elements

We understood a tree of elements to mean divs nested inside each other. With this definition, creating a tree of elements was quite simple. When RandomDiv is called, we don't just return a div with a random section; we also recursively call RandomDiv a random number of times. We may also call RandomDiv several times on the same level of the tree, which effectively creates sibling elements. Below is the code for RandomDiv.
numDivs = random.randint(1,num_sibling_divs)
for i in range(0, numDivs):
    yield '<div id="a'+str(random.randrange(0,count[0]))+'">\r\n'
    yield RandomSection(count, max_depth, num_sibling_divs, tree_height, max_headers, min_headers)
    if tree_height > 0:
        treeHeight = random.randint(0, tree_height)
        if treeHeight > 0:
            yield RandomDiv(treeHeight - 1, count, max_depth, num_sibling_divs, max_headers, min_headers)          
    yield '</div>'
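Note that the listing above yields generator objects (the results of RandomSection and RandomDiv) alongside plain strings, so the caller has to expand them before writing the page. How the project does this is not shown; the following self-contained sketch illustrates the same recursive pattern together with one possible flattening step. The names random_div and flatten, and the placeholder text, are hypothetical.

```python
import random


def random_div(depth, max_siblings, rng):
    """Yield a random number of sibling <div>s, each possibly nesting more."""
    for _ in range(rng.randint(1, max_siblings)):
        yield "<div>\n"
        yield "text\n"  # stands in for the random section content
        if depth > 0:
            # Pick how deep this branch may still go; 0 ends the branch.
            child_depth = rng.randint(0, depth)
            if child_depth > 0:
                # Yield the child generator itself; flatten() expands it.
                yield random_div(child_depth - 1, max_siblings, rng)
        yield "</div>\n"


def flatten(parts):
    """Recursively expand nested generators into a flat stream of strings."""
    for part in parts:
        if isinstance(part, str):
            yield part
        else:
            yield from flatten(part)


html = "".join(flatten(random_div(depth=3, max_siblings=2, rng=random.Random(1))))
print(html)
```

Passing a seeded random.Random instance makes a generated page reproducible, which helps when a rendering mismatch needs to be regenerated later.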


Extending the set of CSS properties that can be generated (display, background, float, padding, margin, border, etc.)

Implementing this was relatively simple: we did the same thing as before, but with more CSS properties. The code below does not include the arrays of property values, but it shows how we access them.
string_css += 'display: ' + displayTypes[random.randint(0, 3)] + ';\r\n'
string_css += 'background-color: rgb(' + str(random.randint(0, 255)) + ',' + str(random.randint(0, 255)) + ',' + str(random.randint(0, 255)) +');\r\n'
string_css += 'float: ' + floatTypes[random.randint(0, 3)] + ';\r\n'
string_css += 'padding: ' + str(random.randint(0, 20)) + 'px ' + str(random.randint(0, 20)) + 'px ' + str(random.randint(0, 20)) + 'px ' + str(random.randint(0, 20)) + 'px;\r\n'
string_css += 'margin: ' + str(random.randint(0, 20)) + 'px ' + str(random.randint(0, 20)) + 'px ' + str(random.randint(0, 20)) + 'px ' + str(random.randint(0, 20)) + 'px;\r\n'
string_css += 'border-width: ' + str(random.randint(0, 20)) + 'px ' + str(random.randint(0, 20)) + 'px ' + str(random.randint(0, 20)) + 'px ' + str(random.randint(0, 20)) + 'px;\r\n'
string_css += 'border-style: ' + borderTypes[random.randint(0, 4)] + ';\r\n'
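Since the value arrays are not shown, here is a minimal, self-contained sketch of the same idea. The list contents and the helper names four_sides and random_rule are assumptions made for illustration. It uses random.choice instead of hard-coded index ranges, so the lists can grow without the indices having to be updated.

```python
import random

# Hypothetical value lists; the project's own arrays are not shown above.
DISPLAY_TYPES = ["block", "inline", "inline-block", "none"]
FLOAT_TYPES = ["left", "right", "none", "inherit"]
BORDER_STYLES = ["solid", "dotted", "dashed", "double", "groove"]


def four_sides(rng, low=0, high=20):
    """Return four random pixel lengths, e.g. for padding or margin."""
    return " ".join(f"{rng.randint(low, high)}px" for _ in range(4))


def random_rule(selector, rng):
    """Build one CSS rule with random values for the listed properties."""
    red, green, blue = (rng.randint(0, 255) for _ in range(3))
    declarations = [
        f"display: {rng.choice(DISPLAY_TYPES)};",
        f"background-color: rgb({red},{green},{blue});",
        f"float: {rng.choice(FLOAT_TYPES)};",
        f"padding: {four_sides(rng)};",
        f"margin: {four_sides(rng)};",
        f"border-width: {four_sides(rng)};",
        f"border-style: {rng.choice(BORDER_STYLES)};",
    ]
    body = "\n".join("  " + line for line in declarations)
    return f"{selector} {{\n{body}\n}}\n"


print(random_rule("#a0", random.Random(0)))
```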

Extending the set of elements that can be generated (span, div, header elements, table (and associated table contents), etc.)

Separate functions were added to create a span, div, table, ordered list, and unordered list. Because a span element is inline and typically used to label a small chunk of text, the RandomSpan function is called at random within the RandomSentence function. Previously, the RandomSentence function simply called RandomWord a random number of times to create a sentence. To add a span a small percentage of the time, a variable is given a random integer value from 0 to 99. If the variable is less than 5, RandomSpan is called; otherwise RandomWord is called.
#generate a random sentence using random range function
def RandomSentence():
    global count
    num_of_words = random.randrange(5, 20)
    yield ''.join(RandomWord())
    for _ in range(num_of_words-1):
        #1 in 20 words will be within a span
        z = random.randrange(0, 100)  # 0-99; randrange excludes the upper bound
        if z<5:
            yield ' '
            yield ''.join(RandomSpan())
        else:
            yield ' '
            yield ''.join(RandomWord())
    yield '. '

The other newly added element types (divs, tables, and lists) are called at random within the NestedElement function. These functions use the RandomWord and RandomSentence functions to create their elements. For example, see the RandomTable function below:

def RandomTable():
    column_count = random.randrange(1,10)
    row_count = random.randrange(2,20)
    yield '<table>\r\n'
    #generate table head
    yield '\t<tr>\r\n\t\t'
    for _ in range(column_count):
        yield '<th>'
        yield RandomWord()
        yield '</th>\r\n'
    yield '\t</tr>'
    #fill in rows
    for _ in range(row_count-1):
        yield '\t<tr>\r\n\t\t'
        for _ in range(column_count):
            yield '<td>'
            yield str(random.randrange(0,1000))
            yield '</td>'
        yield '\r\n\t</tr>\r\n'
    yield '</table>\r\n'

Randomly choose whether to generate a document in quirks mode or not

Quirks mode is enabled whenever an HTML document does not have a DOCTYPE. To randomly enable quirks mode, we add a DOCTYPE only 50% of the time. Below is the code:
quirksMode = random.randint(0,1)
if quirksMode == 1 or quirks_mode_possible == False:
    yield '<!DOCTYPE html>'
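The same decision can be isolated in a small function that is easy to test on its own. This is only a sketch; the function name document_prefix and the rng parameter are hypothetical, while quirks_mode_possible mirrors the flag used in the listing above.

```python
import random


def document_prefix(quirks_mode_possible, rng):
    """Return the DOCTYPE line, or an empty string to trigger quirks mode."""
    if quirks_mode_possible and rng.random() < 0.5:
        return ""  # no DOCTYPE: browsers fall back to quirks mode
    return "<!DOCTYPE html>\n"


rng = random.Random(0)
print(repr(document_prefix(True, rng)))
print(repr(document_prefix(False, rng)))
```

When quirks_mode_possible is False, the DOCTYPE is always emitted, matching the condition in the original code.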

Testing

Because of the nature of this project, testing the newly added features will be very simple.

* For task 1, once we are controlling Firefox using geckodriver, we know it's working because we get a resulting screenshot.
* Task 2 involves comparing screenshots. By running the code with images that are the same and images that are different, we have verified that the comparison works.
* Additionally, we have visually verified that task 3 is complete by checking that the new page generation features work. While we could probably automate some of the testing for tasks 2 and 3 (to test all possible scenarios), that is outside the scope of this project.
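As an illustration of what automated testing for task 3 might look like, the sketch below checks that every opening tag in a generated page has a matching closing tag, using only Python's standard html.parser module. It is not part of the project. The class and function names are hypothetical, and the check is stricter than HTML itself, which allows some end tags to be omitted.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class BalanceChecker(HTMLParser):
    """Track open tags on a stack and record any mismatched closing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if not self.stack or self.stack.pop() != tag:
            self.errors.append(f"unexpected </{tag}>")


def is_balanced(html):
    """True if every opened element is closed in the right order."""
    checker = BalanceChecker()
    checker.feed(html)
    checker.close()
    return not checker.errors and not checker.stack


page = "<!DOCTYPE html><html><head><style>#a{color:red}</style></head><body><div><p>x</p></div></body></html>"
print(is_balanced(page))
```

A check like this could be run over a batch of generated files to confirm that the generator's output stays well formed as new element types are added.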

Conclusion

The previously completed work allows us to generate simple HTML documents with a randomized structure, render each page in Servo, and take a screenshot of it. This project extended that work by rendering the pages in both Servo and Firefox, taking screenshots in both browsers, and reporting the differences between the two. To make this testing even more informative, this project also expanded the random content generator. This work now allows users to evaluate Servo’s ability to load web pages.