CSC/ECE 517 Spring 2021 - E2111. Refactor github metrics integration

From Expertiza_Wiki
Jump to navigation Jump to search

Abstract

Github repository and pull request links are submitted as part of programming projects in several NC State CSC courses. Instructors would like to compile statistics on these Github links on a per-project, per-team basis, and integrate display of these data within Expertiza (similar to Github Insights). Code was introduced to add this functionality in a 2018 project, E1858, which was not merged, and another team re-visited this project in 2019, which was also not merged. The project team will troubleshoot existing code to eliminate issues that prevented these projects from being merged. Additionally, the team will add new features based on design patterns and MVC principles, including a Metrics model, that will decouple the Github Metrics functionality from other areas of the application and allow for future extensibility. Finally, the team will investigate DRYing the code using existing graphing functionality, and complete an analysis of the automated testing for existing and new code.

Functionality

The Github metrics system, as currently implemented, queries data from Github based on repository and pull request links submitted for the assignment. The commit frequency is displayed in a bar graph. Numeric statistics, as well as Travis CI status, are displayed in a table.

Methodology

The team has synced the revised 2019 project code with the current status of the Expertiza/beta branch and begun analyzing the existing code. Prior customer feedback has advised the team to address inextensibility/coupling issues, documentation issues, and synchronization issues. We are performing our own end-to-end analysis of the code in addition to this feedback, and have so far found numerous action items. We have separated these action items into Primary and Secondary objectives: Primary objectives are "must-haves" that need to be addressed before the existing code is mergeable. Secondary objectives are "nice-to-haves," or features that were either left out of the original project or would make the integration display more complete.

Design Patterns

In order to achieve the Primary Objectives while making the application more extensible, the team proposes using a more strict implementation of the Adapter design pattern to decouple the details of the Github API from the proposed Metrics model. Elements of the Observer design pattern will also come into play to ensure that once Github data is queried and stored, and an assignment has come to a close, the Metric for that Assignment is unsubscribed from the Github API. This will further decouple the dependency on API data while retaining the data for future statistical analysis and/or research studies.

Team Action Items

These action items include both refactoring tasks, and required feature enhancements to be implemented.

Minor API Changes Breaking Functionality

The Github API has had minor changes since 2018 which are causing the previously working code to either not display, or throw a NoMethodError. We have already analyzed and applied bugfixes to address these minor API changes, but would like to further refine how the API calls are set-up to decouple our code from the API and make it more robust.

One key issue is that the Github API no longer permits queries with empty hash keys. This shows up in queries with a single page, where after: ___ has, in the past, been left blank. In order to resume queries, the team now tests whether after: SHA1 should be included in the query, and if not, we do not add the empty key. This permits data to be pulled from Github where repos have fewer than two pages of commits.

Refactoring Improvements

Several method and variable names do not use good Ruby style, and should be refactored to improve style and clarity. Some examples include get_pull_request_details() and authorize_github(). These shall be refactored for improved clarity, and may be relocated within the application to make the code better fit the MVC system. See Metrics Model section for further details.

Improvements to Data Time and User Constraints

If a team submits a repo containing merged data, with many Expertiza Development Team commits, these commits currently show up as part of the team metrics. Improved logic needs to be implemented to timebox the statistics to only cover commits made during the duration of the project. Additionally, improvements are needed for the way in which commit owners are handled, so that metrics are not stored/displayed/statistically analyzed for Expertiza Development Team commits. An example of a problematic submission is shown below.


Major Feature: Metrics Model

The Expertiza system architect has instructed the team to handle a major shortcoming in the original project as part of this body of work: the creation of a new model to handle local storage of the metric data. This new model should be able to support future extensibility for new metrics, like Travis CI or Danger. This change will bring the Github metrics system in full compliance with MVC principles and allow for local warehousing of data for performance and data retention.

The current implementation works synchronously, and each time a Github integration route is followed, the entire collection of Github data is requested from the Github API and used to render the graphs and statistics. These data are cumbersome, difficult to work with, and nearly impossible to extend to future functionality. Furthermore, this coupling to the Github API means that data is not being warehoused within the system and makes adding new functionality much more difficult. The code uses controller methods to load the data and format it for display.

The team proposes creating a fully realized MVC implementation to separate Github queries, warehousing of data, and displaying of graphs and statistics which are currently handled in a single chain of controller methods. This will require implementation of a new model, and additional database tables to store the data as it is queried. Also, views will need to be uncoupled from the Github API and reconfigured to render data from the local data warehouse. Furthermore, this will allow methods that aren't CRUDlike to be moved from the Github metrics controller into the new model.

Diagram of Proposed MVC Architecture for New Metrics Model

One point of discussion among the team and our mentor will be, should an update query be run each time the page is displayed, or should a burst of queries be run at the expiration of a deadline to populate the database table? Another option may be to create a manual "Github Update" link, but reject all data after a deadline. Yet another option would be to keep all Github data and use the "late data" to call out number of late submissions on the statistics page.

Proposed Schema and Entries Example

User JoeCodeMonkey, on team 15, added 3 commits which were queried from Github. The Metrics table will keep track of each metric, with the Metrics_Data_Points table keeping track of each individual data point. The Metric_Sources and Metric_Data_Point_Types tables are used as lookup tables to keep track of the types of metric data being stored in Metrics and Metrics_Data_Point.


Table: Metric_Sources


ID(PK)


Source Name


Description

0

Github

CVS Hosting

1

Travis CI

Build Tool


Table: Metric_Data_Point_Types


ID(PK)

Source_Id (FK)

to Metric Sources Table)


Value_Type


Name


Description

0

Github -0

string

github_user

Github user

1

0

int

lines_added

Number of lines added

2

0

int

lines_removed

Number of lines removed

3

0

datetime

date_committed

Date of change


Table: Metrics

ID(PK)

Source_Id (FK

Team_ID (FK)

0

0 -Github

15 - Team 15

1

0 - Github

15 - Team 15

2

0 - Github

15 - Team 15


Table: Metrics_Data_Points


Id

Metric_Id (FK to metric table)

Metric_Data_P oint_Type_Id (FK)


Value

0

(metric id )0

0 (github_user)

JoeCodeMonkey

1

0

1 (lines_added)

5

2

0

2 (lines_remove

10

3

0

3 (date_commit

1/2/2017 1600

4

1 (metric id 1)

0

JoeCodeMonkey

5

1

1

8

6

1

2

5

7

1

3

1/5/2017 1700

8

2

0

JoeCodeMonkey

9

2

1

7

10

2

2

6

11

2

3

1/20/2017 1430

Lack of Comments / Opaque Code

The majority of methods added in E1858 are lacking comments describing their purpose and functionality and comments were not modified for any modified methods.
Comments will be added to the following files

  • app/controllers/auth_controller.rb: New methods oauth_login and github_login need comments, existing method action_allowed? needs a comment
  • app/controllers/grade_controller.rb: Large amount of new methods need comments, existing method "view" needs modification to existing comment
  • app/helpers/grades_helper.rb: The methods display_github_metrics, chart_options, and graph_scales need comments describing functionality
  • app/views/assignments/list_submissions.html.erb: New GitHub Metrics column added to table
  • app/views/grades/_tabbing.html.erb: Class "tmr_tab" will have to be explained
  • app/views/grades/_teammate_reviews_tab.html.erb: Class "github" will have to be explained
  • app/views/grades/view.html.erb: JavaScript functionality to be moved to a helper .js file and then commented.
  • app/views/grades/view_github_metrics.html.erb : JavaScript functionality to be moved to a helper .js file and then commented.
  • spec/controllers/grades_controller_spec.rb: Comments will have to be added describing the RSpec tests including their purpose and how they achieve their tests

DRYing Graphing Code

The existing code uses new code to graph the statistical Github data. Code already exists elsewhere in Expertiza (Statistics) to graphically display statistical data, and the team intends to replace the Github Graphing code by re-using the existing graphing code, if possible.

Properly Timebox Statistics

The existing code pulls all commits from a repository or pull request. We propose utilizing the assignment start and end dates to ensure that only commits which apply to the scope of the project are included in the statistics view/analysis.

Bugfix: Hide Expertiza Development Team

At this time, if a team merges in many old commits by the Expertiza Development team, the bar graph statistics may be thrown off by including the development team ("Winbobob" et al). Logic exists in the code to prevent these developer's work from being credited to the project team, but it is not working reliably. If the data permanence objective is met, creating more complex queries should be easier, and the team would like to explore updating this feature to to "hide known contributors," and ensure that it is working reliably. This would hide Github data from any user included on the "blacklist" which can be manually curated with the short list of Expertiza developers. While this would not address edge cases where an Expertiza developer later participates in a course using Expertiza, this edge case should be quite rare.

Test Plan

The 2018 project wrote numerous tests, several of which are deeply coupled into the grades_controller tests. The team proposes several changes to the testing strategy.

Decouple Tests / Pass Builds

Builds never passed for this project because, in the current code, a user must be logged in and authorized to Github to render grades#view, and this broke several (expected) test cases. As part of decoupling Github from the grades controller, we expect to decouple and revert tests for the grades controller such that the tests become independent, permitting the build to pass if our Github controller tests pass as expected.

Detailed Analysis of Github Testing

The previous team wrote numerous intricate tests for the new functionality. Our team intends to go through these tests line-by-line and ensure that having so many tests is necessary, and the most elegant way to address the issue. We believe that some of the Github refactoring should move methods to become private, which may mean that specific tests for that method shouldn't be used. Alternately, particularly for API authentication which can be tricky, it may be prudent to have discrete tests. Finally, we want to ensure that the tests are actually covering the code -- we are receiving a TravisCI warning about tests without expectations, which needs to be resolved before the code can be merged.

New Tests for Metrics Model

The new model being created, along with associated database tables, will lead to new tests being required for this model. These tests will need to be written first in keeping with TDD principles. Also, some tests may need revision when code is moved from the Github Metrics controller into the new MVC architecture. These tests will include, but not be limited to, testing cases when Github Pull requests exist, and do not exist (for Expertiza and non-Expertiza projects), validation of all data types being stored in the Model / new database tables, appropriate coverage of new and existing graphing helpers, cases when multiple repos and/or pull requests are included in the assignment links, and edge cases when Repos or pull requests were submitted but access rights have not been granted.

Final Design UML

We show the UML for the new classes and model we created and implemented in our final project for a overview. Other than these, we also modified some of the existing classes and method, we will give detailed introduction later.

Documentation

MetricsHelper

We have a helper model called MetricsHelper, which contain helper methods for graphing:

display_github_metrics(parsed_data, authors, dates)

This function take parsed Github metrics data, authors in a PR, and dates of all commits as parameter. It Creates the bar graph for the Github metrics data. Links the authors with their Github data and assigns them a color. Currently supports up to 6 different colors and will loop if it goes over.

display_totals_piechart(parsed_data, authors, dates)

This function take parsed Github metrics data, authors in a PR, and dates of all commits as parameter. It Creates the pie chart for the Github metrics data. Links the authors with their Github data and assigns them a color. It is easy for an instructor to tell the ratio of contribution of each team member with a pie chart.

chart_options

This function does not take any parameters. It defines the general settings of the Github metrics chart

graph_scales

This function does not take any parameters. It defines the labels and display of the data on the Github metrics chart

Metric

This is a new model that saves user's Github email as the identifier of their Github account. With this information, we can connect Github metrics with each students. It also saves the team_id, the participant_id, and the cumulative total commits in the submitted PR for each user. We make this model in a way that it is easy for future team to extend.

pull_query(hyperlink_data, after)

This function takes a submitted Github pull request URL and a cursor for pagination in Github GraphQL API as parameter. It Formulate and return the actual query message to send over HTTP request to do the GraphQL query for a Github pull request. For more detailed information, check Github GraphQL API.

repo_query(hyperlink_data, date, after)

This function takes a submitted Github repository URL, the date of the project starting date, and a cursor for pagination in Github GraphQL API as parameter. It Formulate and return the actual query message to send over HTTP request to do the GraphQL query for a Github repository. For more detailed information, check Github GraphQL API.

MetricsController

The Metrics controller contains all the logics for querying Github metrics and present these metrics with the default show method.

action_allowed?

This function does not take any parameters, it checks whether the current user has the required privilege to run certain method.

create_github_metric(team_id, github_id, total_commits)

This function takes team_id, github_id, and total_commits as parameter. It does some validations and logic checking of the input value, then it saves all values to database as fields of the Metric model.

query_assignment_statistics

This function does not take any parameters, it runs a query against all the link submissions for an entire assignment, populating the DB fields that are used by the view_team in grades heatgrid showing user contributions

show

The default show method, which renders the html page that shows all Github metrics with a bar chart and a pie chart.

authorize_github

This function redirect the user to Github authorization endpoint to authorize the current user. An authorized user can use the Github API with 5000 rate limits per hour. Unauthorized user only has 60 rate limits per hour, which may not be enough.

single_submission_initial_query(id)

This function takes the participant id as parameter. It calls several helper methods to query Github metrics information from links that this team submitted and populate all related instance variable, which will be used in frontend presentation.

retrieve_github_data

This function call corresponding method to retrieve Github pull request metrics information or repository information base on different conditions.

query_all_pull_requests(pull_links)

This function take all PR URLs that a team submitted and retrieve all Github metrics.

pull_request_data(hyperlink_data)

This function takes all hyperlink_data includes pull request number, repository name, owner name of a single PR url as parameter and retrieve all Github metrics for a single PR.

parse_pull_request_data(github_data)

This function takes the Github metrics of a PR returned by the query as parameter, parse and store them into corresponding instance variable.

query_all_merge_statuses

This function saves each PR's statuses in a hash. This is done through Github REST API not GraphQL.

retrieve_repository_data(repo_links)

This function take all repo URLs that a team submitted and retrieve all Github metrics.

parse_repository_data(github_data)

This function takes the Github metrics of a repo returned by the query as parameter, parse and store them into corresponding instance variable.

team_statistics(github_data, data_type)

This function takes all Github metrics we queried and a data_type that specify the github_data is for a PR or a repo as parameter. It does the basic accounting and store total additions, total deletions, total files changed, total commits, and merge status in instance variable for future presentation purpose.

count_github_authors_and_dates(author_name, author_email, commit_date)

sort_commit_dates

query_commit_statistics(data)

query_pull_request_status(pr_object)

Resources

Key Links

E2111 Repository

E2111 Pull Request

Original E1858 Documentation

E1983 Revision Documentation

Github API Documentation