CSC/ECE 517 Fall 2014/ch1a 15 gs

From Expertiza_Wiki

Git Version Control System

Git is a widely used open source distributed version control system suited to small and large projects alike. It outclasses SCM tools like Subversion, CVS, Perforce, and ClearCase with features like cheap local branching, a convenient staging area, and support for multiple distributed workflows, and it is fast.

Topic Document


Introduction to Git

Source code management (SCM) systems all attempt to store system source code and provide the ability to see what changes have been made to that source code over time. The Git SCM, however, differs in many ways from more traditional SCM systems.

Git is a source code management (SCM) system originally designed and developed by Linus Torvalds after BitKeeper, the SCM previously used by the Linux kernel development team, withdrew its free license. Today it has become the most widely used version control system (VCS) for software development, due in part to its features falling closely in line with the tenets of Agile methodology. Git was designed with three goals in mind: working with branches should be easy and very fast compared to other SCMs, development and collaboration on a project should be possible largely offline from a central server, and data should be protected against corruption, whether accidental or malicious.

Basic Git Structures

Internally Git consists of a number of objects such as commits, tags, trees, and blobs. A commit object contains a reference to its tree object, the committer, the author, a message describing what was changed in that commit, and, if it is not the initial commit, the parent or parents that it has. The tree object represents a snapshot of the directory structure taken as a result of the commit, and the blob objects store the contents of files. Each of these objects has a SHA-1 hash that, for all practical purposes, distinguishes it from every other object in every Git repo in existence. Git also stores a snapshot of every version of every file in the repo rather than just storing the deltas like many other source control systems, and it quickly and efficiently compacts the repo to save space. It is this architecture that allows Git's branching to be so quick. Consider what happens when a developer enters the following:

git checkout -b new-branch-name

Git simply creates a new branch reference that points to the same commit as the branch it was created from. It isn't until a file is modified and committed back to the repo that the two branches diverge. Merges are also quick. If the target branch has not moved since the source branch was created, Git just moves the target's pointer forward (a fast-forward); otherwise it performs a three-way merge of the two branch tips and their common ancestor and does some additional bookkeeping. Provided there are no conflicts, you will be back to writing code in no time.

<ref>https://issues.liferay.com/secure/attachment/40633/git-book.pdf</ref>
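The idea that a branch is only a pointer to a commit can be illustrated with a small toy model. This is a sketch for illustration, not how Git is implemented: the `Repo` class, its fields, and the `c1`, `c2` commit ids (stand-ins for real SHA-1 hashes) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Toy model: commits are records, branches are just names -> commit ids."""
    commits: dict = field(default_factory=dict)   # commit id -> (parent id, message)
    branches: dict = field(default_factory=dict)  # branch name -> commit id
    head: str = "master"                          # name of the checked-out branch

    def commit(self, message):
        parent = self.branches.get(self.head)     # None for the initial commit
        commit_id = f"c{len(self.commits) + 1}"   # stand-in for a real SHA-1
        self.commits[commit_id] = (parent, message)
        self.branches[self.head] = commit_id      # only the current branch moves
        return commit_id

    def checkout_new_branch(self, name):
        # Like `git checkout -b name`: copy one pointer, no file data is duplicated.
        self.branches[name] = self.branches[self.head]
        self.head = name


repo = Repo()
repo.commit("initial commit")
repo.checkout_new_branch("new-branch-name")
same_before = repo.branches["master"] == repo.branches["new-branch-name"]
repo.commit("work on the new branch")
same_after = repo.branches["master"] == repo.branches["new-branch-name"]
print(same_before, same_after)  # True False
```

Creating the branch costs one dictionary entry, and the two names only diverge once a new commit is made on one of them.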

Git also supports a number of different protocols for pushing, cloning, pulling, etc. your repo to and from remote locations. These include HTTP, SSH, and Git's own protocol. This flexibility opens the door to using Git in a number of different ways, which are modeled in workflows.

Using Checksums to Track Files

Traditional sequential version numbers are arbitrary and aren't well suited to a distributed environment, because assigning the next number requires synchronization between everyone who can commit. So Git did away with traditional sequential version numbers in favor of SHA-1 hashes. SHA-1 hashes virtually guarantee that no two different objects will receive the same identifier. In fact SHA-1 hashes identify every object in Git, from commits to trees, blobs, and tags, and a branch is simply a name that points at a commit's hash. It is the SHA-1 hash that allows developers from all corners of the globe to work independently yet collectively.
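As a rough illustration of how an identifier can be derived from content alone, the sketch below computes the id of a blob the way Git's loose-object format does for SHA-1 repositories: it hashes a short header (`blob`, the content length, and a NUL byte) followed by the content. The function name `git_blob_id` is made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 id of a blob: hash of 'blob <size>\\0' + content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The empty blob has a well-known id.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# Identical content always yields the same id, on any machine,
# with no coordination; different content yields a different id.
print(git_blob_id(b"hello\n") == git_blob_id(b"hello\n"))
print(git_blob_id(b"hello\n") == git_blob_id(b"hello!\n"))
```

Because the id depends only on the bytes being stored, two developers who have never communicated will still assign the same id to the same file contents, which is what removes the need for a central counter.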

Git Basics

Initialize a new repository in the current directory:

git init

Clone an existing remote repository:

git clone http://www.kernel.org/pub/scm/git/git.git

To add a file to staging:

git add file-to-add

To commit the staged files to the repo (this will invoke the default editor to enter a check in comment):

git commit

To commit the staged files to the repo with a short check in message:

git commit -m "Short summary of the change"

A powerful tool that comes with Git is gitk, a Tcl/Tk graphical tool that displays information about your repo. Learn to use it.

gitk

Click here to find a tutorial to learn about Git using gitk.

Help can be found for any Git command by entering:

git help <command>

A 15 minute interactive tutorial on using Git can be found here.

A note about check in comments

Check in comments are your friend. Do not overlook or neglect them. Convention suggests the following format. The first line should be a brief summary of the changes in the commit, no longer than 50 characters. The second line should be left blank. All remaining lines should give a complete explanation of the commit, with no line exceeding 72 characters. The reason for the 50 character limit is that some Git commands display only this summary from the commit message, and the 72 character limit is for full text formatting purposes.

For Vim users, the Fugitive Git plugin includes Vim script that highlights any text beyond 50 characters on the first line and automatically wraps at 72 characters.
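The 50/72 convention described above is easy to check mechanically. The following is a minimal sketch of such a check; the function name `check_commit_message` and the wording of the reported problems are invented for this example and are not part of Git.

```python
def check_commit_message(message: str) -> list:
    """Return a list of problems with a message under the 50/72 convention."""
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        return ["missing summary line"]
    problems = []
    if len(lines[0]) > 50:
        problems.append("summary line is longer than 50 characters")
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line should be blank")
    # Body lines start at line 3 of the message.
    for number, line in enumerate(lines[2:], start=3):
        if len(line) > 72:
            problems.append(f"line {number} is longer than 72 characters")
    return problems


message = (
    "Fix pagination of search results\n"
    "\n"
    "The last page was dropped when the number of results was an\n"
    "exact multiple of the page size.\n"
)
print(check_commit_message(message))  # []
```

A check like this could be run by hand or from a commit hook, though how it is wired into a project is left open here.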


Git Workflows

Centralized Workflow

The centralized workflow <ref>https://www.atlassian.com/git/tutorials/comparing-workflows/centralized-workflow</ref> is similar to working with a central Subversion repository (repo). Developers commit all changes to a single central branch, ‘master’ in the case of Git. Git, however, provides features beyond the functionality found in Subversion. In particular, each developer works in an environment isolated from everyone else developing features in the same system.

This isolation matters because pulling in someone else's new feature can unexpectedly delay a developer's work, for example when the feature changes the public interface of code the developer is using. Often these changes have no direct link to what the developer is working on, but once they are introduced into his/her environment they must be addressed immediately. If the developer is working on a hotfix for a production issue, this could cause undue delay in deploying the hotfix. With Git, all commits occur in the developer’s local environment, buffered from changes being committed by others. Once the code they have been working on is stable, the developer can then consider merging in changes made by other developers. The point is that the developer has better control over when to pull in changes made by someone else.

With that said, the Centralized Workflow holds the upstream repository as sacred ground. The central repository should exist on a server that all developers assigned to the project can access, and it should be created as a bare repository, that is, one that has no working directory associated with it. Note that the convention is to name bare repos with an extension of .git. In short, the repository only ever changes when users push their changes into it.

An additional constraint placed on the central repository is that if a developer's local commits diverge from the central repo, the developer cannot simply push his/her changes to the central repo. One way to reconcile the two histories is a merge, which is simply a 3-way comparison of the two branch tips and their common ancestor. In this workflow, however, developers whose local repo diverges from the central repo must instead do what is called a rebase before they are allowed to push their changes back. A rebase replays the developer’s local commits, one at a time, on top of the latest commits from the central repo. If a conflict occurs, the developer must resolve it and then continue the rebase. Once the developer has rebased his/her local repo, it can be pushed back to the central repo, and this push results in a fast-forward merge. The end result of a rebase and push is the same code as a 3-way merge using the merge command without rebasing. The advantage of requiring the rebase is that the fast-forward merge produces a linear commit history that is much easier to read and manage over time. The commits in the central repository appear linear even though their development may have happened in parallel; they are serialized by their relative local commit order and the order in which they were pushed.
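The relationship between divergence, rebasing, and fast-forwarding can be sketched with a toy commit graph. In this simplified model every commit has a single parent, conflicts are ignored, and a replayed commit is given a new name by appending `'` to the old one; the helper names `is_ancestor` and `rebase` are assumptions of the sketch, not Git APIs.

```python
def is_ancestor(parents, ancestor, commit):
    """True if `ancestor` is reachable by following parent links from `commit`."""
    while commit is not None:
        if commit == ancestor:
            return True
        commit = parents[commit]
    return False


def rebase(parents, local_tip, new_base):
    """Replay the local-only commits on top of new_base and return the new tip."""
    # Collect commits reachable from local_tip that new_base does not contain.
    to_replay = []
    commit = local_tip
    while commit is not None and not is_ancestor(parents, commit, new_base):
        to_replay.append(commit)
        commit = parents[commit]
    tip = new_base
    for old in reversed(to_replay):      # oldest first
        new = old + "'"                  # a replayed commit is a new commit
        parents[new] = tip
        tip = new
    return tip


# Central history: A - B - C.  Local history: A - X (diverged after A).
parents = {"A": None, "B": "A", "C": "B", "X": "A"}
central_tip, local_tip = "C", "X"

# A push can fast-forward only if the central tip is an ancestor of the local tip.
can_push_before = is_ancestor(parents, central_tip, local_tip)
new_tip = rebase(parents, local_tip, central_tip)
can_push_after = is_ancestor(parents, central_tip, new_tip)
print(can_push_before, new_tip, can_push_after)  # False X' True
```

After the rebase the local history reads A, B, C, X', so the central branch can simply move its pointer forward to X' and the shared history stays linear.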

Feature Branch Workflow

The feature branch workflow.

The Feature Branch Workflow <ref>https://www.atlassian.com/git/tutorials/comparing-workflows/feature-branch-workflow</ref> builds upon the centralized workflow. In this workflow, however, developers create a new branch each time they start a new feature or hotfix. There is also a role reversal in how commits end up back in the central master branch: when a developer is ready to release code back to the master branch, he/she makes a pull request rather than pushing directly.

There are a couple of advantages to this approach. Not every feature will involve just one developer. In the centralized workflow there is only one branch that everyone uses. Implementing that workflow with Git has the advantage that each developer is isolated from the others' work, which might otherwise cause heartburn due to incompatible changes. The downside is that same isolation: two developers cannot “share” code without making that code visible to everyone. This poses a significant problem if the code that the developers are working on is some crazy idea that very well may end up being sent to the bit bucket.

The feature branch workflow allows sharing of such code between multiple developers without touching the master branch code. Developers push their feature branch changes back to the server in order to share it with other developers. This allows multiple developers to work on a feature while preserving the pristine state of the master branch.

When a feature reaches the stage where the developer is ready to have it merged back into the master branch, they issue a pull request. A pull request identifies the commit that the target branch should start with, which would likely be the common ancestor of the two branches. It also identifies the developer's repo (the source repo) and the end commit, which if omitted just runs to the end of the commit history. A pull request is really a request for a discussion about the feature in the branch. Developers can review the changes included in the branch, and the changes are then either accepted as is, sent back for modification before they can be accepted, or denied.

This type of workflow works very well in a larger environment where a separate change management group controls what code makes it into the master branch. It also is very advantageous when features require more than one developer to implement. Also by pushing feature commits back to the server a backup of commits is effectively made in case a system failure takes out a developer’s system.

Gitflow Workflow

Gitflow is a workflow <ref>https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow</ref> proposed by Vincent Driessen. This workflow uses no Git features beyond those used in the feature branch workflow. It does, however, prescribe a strict branching model that follows the project release cycle. This model works well in a large project due to the predictable and repeatable way that it handles releases while further development of the system continues. Even though the workflow uses no new features, there are a number of open source projects, such as nvie/gitflow provided by Vincent Driessen on github.com and the Git Flow Integration plugin for RubyMine, that provide helper scripts and menus to guide developers in following the structure the workflow prescribes.

In the Gitflow Workflow there are a number of persistent and intermittent branches. Specifically there are two persistent branches, master and develop (or development). When you initialize a Gitflow Workflow you will always have these two branches. Intermittently there will exist a release branch and a hotfix branch as well as many feature branches. The feature branches will be named something that indicates what feature will be developed.

The process in a nutshell is as follows. The master branch is initialized. For a new project this branch will be tagged as a beta version or maybe even an alpha version; if migrating from another version control system, it may be the last stable release of the product. The development branch is then branched from master, and all developers branch their feature branches from the development branch.

No real development actually occurs on the development branch. Along with the master branch, it should be kept as pristine as possible at all times; it is simply a staging branch for holding completed features. All developer implementation occurs on feature branches. When a developer completes a feature they make a pull request, just like in the feature branch workflow.

Over time features build up in the development branch, and at some point the decision is made to release a version of the software. At this point a release branch is created from the development branch and used for final testing. No new features are added to this branch. Only fixes to existing features are allowed, in a process often called release hardening. As issues are discovered they are fixed in the release branch, and once all tests pass, that code is deemed a release. Versions from the release branch may be major or minor releases. The release branch is then merged into the master branch to be built and released to production, and also merged back into the development branch so that the fixes made during hardening are visible to future development. Once the merges have occurred the release branch is deleted.

The final piece of this puzzle is the hotfix branch. No matter the extent of testing done to software, occasionally bugs make it into a production system. When issues are discovered in production that must be fixed immediately, a hotfix branch is made from the master branch. The fix is made and tested. It is then merged back into the master branch as well as into the development branch, and the hotfix branch is deleted, as it is no longer needed until the next hotfix. It is possible to keep the release and hotfix branches around between uses, but the advantage of deleting them is that they don’t need to be kept up to date in the meantime.
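The branching rules of this model are regular enough to write down as a small lookup table. The sketch below follows Driessen's model, in which release and hotfix branches are merged into both master and develop. The `feature/`, `release/` and `hotfix/` name prefixes, the example branch names, and the `rules_for` helper are assumptions made for illustration; Git itself enforces none of this.

```python
# Where each kind of Gitflow branch starts and where it is merged when finished.
GITFLOW_RULES = {
    "feature": {"branches_from": "develop", "merges_into": ["develop"]},
    "release": {"branches_from": "develop", "merges_into": ["master", "develop"]},
    "hotfix": {"branches_from": "master", "merges_into": ["master", "develop"]},
}


def rules_for(branch_name: str) -> dict:
    """Look up the rules for a branch named '<kind>/<description>'."""
    kind = branch_name.split("/", 1)[0]
    if kind not in GITFLOW_RULES:
        raise ValueError(f"{branch_name!r} is not a feature, release or hotfix branch")
    return GITFLOW_RULES[kind]


for name in ("feature/login-form", "release/1.2", "hotfix/1.2.1"):
    rule = rules_for(name)
    print(f"{name}: from {rule['branches_from']}, into {', '.join(rule['merges_into'])}")
```

Seen this way, release and hotfix branches differ only in where they start: both end by updating master and the development branch.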

Forking/Integration Manager Workflow

One of the main issues of a distributed workflow is the access level everyone working on the project can have on a central repository. With large groups, everyone having write-access to a repository can cause messy merges, merge races and frustration due to a constantly changing central repo where it may be difficult to keep up with changes made in the group. The Forking or Integration Manager workflow <ref>http://git-scm.com/book/en/Distributed-Git-Distributed-Workflows</ref> can be used to address these issues.

The Forking Workflow <ref>https://www.atlassian.com/git/tutorials/comparing-workflows/forking-workflow</ref> is significantly different from the other types of workflows. In this workflow every developer has a Git repository on the server. That means that every developer has both a private local repository and a private remote, or server-side, repository. Project maintainers control when code is merged into the official repository. Because this workflow keeps each developer’s code base in its own silo, it is frequently the workflow of choice for open source projects where untrusted developers may be contributing code.

When a developer forks a repo, a copy of the official repo is made on the server side for the developer. Forking is nothing more than creating a copy of an existing repository, so no special functionality is necessary in Git. Git does, however, support this type of workflow by allowing a repository to have more than one remote. In the simplest case a developer would fork a repository, possibly on a repo hosting service like GitHub, and then clone the fork to their development machine. In order to keep up to date with changes being made in the official repo, the developer would add another remote, which by convention is named upstream, using the git remote add command.<ref>https://help.github.com/articles/fork-a-repo</ref>

git remote add upstream https://github.com/octocat/Spoon-Knife.git

Once the developer has linked their repo to the upstream repo, it can be kept up to date by periodically fetching changes from the upstream repo and merging them into their branch. <ref>https://help.github.com/articles/syncing-a-fork</ref>

git fetch upstream
git merge upstream/master

Getting the changes into their server-side repo just requires the developer to execute a push.

git push origin master

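The developer will eventually reach a point where their fix or feature is ready to be released. At that point they generate a pull request, which is then sent to the project maintainer, using the command:

git request-pull <start> <url> [<end>]

Here <start> is the commit the topic branch started from, which is already in the branch it should be pulled into, <url> is the developer's public repository, and <end> is the last commit of the topic branch (HEAD if omitted).

An integration manager receives that request and decides whether to accept or reject the change. If it is accepted, the integration manager merges it into his/her local repository and then pushes it to the central repository.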

<ref>http://git-scm.com/book/en/Distributed-Git-Distributed-Workflows</ref>

A less commonly used expansion of the integration manager workflow is the Lieutenant-Dictator workflow, in which developers given the role of lieutenant act as the first line of integrators. The lieutenants then give pull requests to a "benevolent dictator", who integrates all the changes and makes the push to the central repository. This is used in very large projects where one integration manager cannot handle all of the pull requests.

Git Vs. SVN <ref>https://git.wiki.kernel.org/index.php/GitSvnComparison</ref>

Git and Subversion are two of the most widely used representatives of the two seemingly dichotomous approaches to SCM: distributed and centralized. Git has become the most widely used SCM according to a survey by the Eclipse Foundation <ref>http://ianskerrett.wordpress.com/2014/06/23/eclipse-community-survey-2014-results/</ref>. It lends itself to a very distributed workflow in which multiple developers work independently on parallel branches of a project and then push their changes to each other or to a central repository, depending on the workflow model. Subversion, on the other hand, is based around a central repository from which developers can check out a sub-directory or a recent version, but they do not clone the entire repository history.

Git advantages over SVN

  • Everyone who has cloned the central repository in order to work on it has a complete working backup of all the files. Careful backups of the central server are therefore less critical, since every developer's clone serves as a backup.
  • As a result of having the repository and its complete history locally, users can institute their own version control and choose for themselves what to track, what to merge, and when, while the repository on the central server can remain tightly controlled.
  • To obtain the history of a file or perform a difference operation with SVN, one must have access to the server, whereas all of that information is local with Git.
  • Branching is much easier and more natural. Much more information is collected, such as the user who initiated the merge, the changes made on the branches and who made them, and the changes made to complete the merge. Subversion does not lend itself to working with branches as easily and does not record as much information either.
  • Since everything is local, completing a commit, checking history, and checking differences between commits, among other actions, can be orders of magnitude faster than in SVN.

SVN advantages over Git

  • SVN has the advantage of being considerably simpler from the standpoint of version control. The central repository is tightly controlled and all changes are recorded there, so there is little confusion compared to the potential complexity of having so much parallel, unreported development on a repository in Git.
  • As a result of the ability to have multiple central repositories in Git, as well as a potentially large number of divergent branches on multiple local machines, obtaining all of the files necessary to run a project could get confusing. In contrast, all necessary files for an SVN repository must be in the central location, making them easy to find.
  • With Git, it is necessary to download the whole repository even if you only want to work on a small subdirectory of the repo, which, depending on your connection and the size of the repository, can consume bandwidth and be quite a lengthy process. SVN allows you to pick and choose which files you download.
  • SVN has a much simpler version naming convention, so it is much easier to quickly read forward and backward through histories.

Additional Advantages of Git Over Other SCMs<ref>http://git-scm.com/about</ref>

The Git version control system differs from many other software change management products, such as Subversion, CVS, Perforce, SourceSafe and Team Foundation Server, in the way that it stores source code. Many other source control systems store deltas, that is, only the parts of the files that differ from one version to the next. Git instead takes a snapshot of the full repository with each commit: it stores new content for any files that changed and merely references the existing content for files that have not, which is part of what makes branching in Git so easy.
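A snapshot that only references unchanged files can be sketched with a content-addressed store, where each file's content is saved once under the hash of that content. This is a simplified model: the `store` dictionary and the `put_blob` and `snapshot` helpers are made up for the example, and real Git additionally organizes snapshots into tree objects and compresses what it stores.

```python
import hashlib

store = {}  # blob id -> content; each distinct content is stored once


def put_blob(content: bytes) -> str:
    """Store content under its SHA-1 id (Git-style 'blob <size>\\0' header)."""
    blob_id = hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()
    store[blob_id] = content  # storing the same content again is a no-op
    return blob_id


def snapshot(files: dict) -> dict:
    """Record a full snapshot: every path maps to the id of its content."""
    return {path: put_blob(content) for path, content in files.items()}


first = snapshot({"README": b"version 1\n", "main.c": b"int main(void) { return 0; }\n"})
second = snapshot({"README": b"version 2\n", "main.c": b"int main(void) { return 0; }\n"})

print(first["main.c"] == second["main.c"])  # True: unchanged file is shared
print(len(store))                           # 3 blobs stored for 4 file entries
```

Both snapshots describe the complete set of files, yet the unchanged `main.c` is stored only once and referenced twice.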

A characteristic that Git shares with some other SCM services like BitKeeper and Monotone, but not with Subversion, is that when a repo is cloned, the repository and all of its history are copied onto the local machine, making reading the history and rolling code back or forward much easier and faster. Because of this local nature, one can work and commit changes on his or her own machine without access to an external server, and then sync up and merge as needed when access is available again. With Subversion, you cannot commit if you do not have access to the central repository, as you do not have a full repository on your local machine.

Also, most other version control systems use a version number to distinguish one version of a file from another. The number is sequential, and how it is applied to the changes in the repository (or repo) varies from one SCM to another. Subversion uses a single repository-wide revision number that increases with every commit, while many others just sequentially number individual files. This type of version numbering is arbitrary and posed a significant problem for Linus when he considered how many people contribute to the development and maintenance of Linux. Instead, Git uses SHA-1 checksums to refer to content. This also improves integrity: if a file is changed or corrupted, its checksum no longer matches, so Git can detect it immediately.

When Git was developed, the traditional idea of explicitly tracking a file's history was done away with. In Git, files do not hold an explicit history; instead, the history is derived implicitly by comparing tree snapshots. Files that have the same name are considered to be directly related, and if an equivalently named file is not found, the snapshot is searched for a file with similar content to compare against, which allows renames to be identified. Explicit tracking, by contrast, can introduce inaccuracies when files are merged or split, and it records a rename as only a rename.
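The idea of inferring a rename by comparing snapshots can be sketched as follows. This is not Git's actual algorithm: Python's `difflib.SequenceMatcher` is swapped in as a stand-in similarity measure, and the `detect_renames` function, its 0.5 threshold, and the example file names are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def detect_renames(old: dict, new: dict, threshold: float = 0.5) -> dict:
    """Pair each path missing from `new` with the most similar added path."""
    removed = {path: text for path, text in old.items() if path not in new}
    added = {path: text for path, text in new.items() if path not in old}
    renames = {}
    for old_path, old_text in removed.items():
        best_path, best_score = None, threshold
        for new_path, new_text in added.items():
            score = SequenceMatcher(None, old_text, new_text).ratio()
            if score >= best_score:
                best_path, best_score = new_path, score
        if best_path is not None:
            renames[old_path] = best_path
    return renames


old_snapshot = {
    "util.py": "def add(a, b):\n    return a + b\n",
    "README": "docs\n",
}
new_snapshot = {
    "helpers.py": "def add(a, b):\n    return a + b  # sum\n",
    "README": "docs\n",
}
print(detect_renames(old_snapshot, new_snapshot))  # {'util.py': 'helpers.py'}
```

Nothing in either snapshot says that `util.py` was renamed; the relationship is recovered purely from the content, even though the file was also edited slightly.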

Finally, one of the most widely referenced features that Git has over other SCMs is the staging area. After files are added to a repository in the initial commit, any modifications made to those files are tracked. Once modified, files can be added to the staging area using the git add [filename] command. Once there, diff operations can be performed on them against the last commit. In addition, the staging area makes it easy for a developer to compose each commit in a way that is intuitive and easy to understand, since it is not necessary to commit all modified files at once. The diagram below shows the "File Status Lifecycle" from the Pro Git book, which is a good reference to learn more about Git.

<ref>http://git-scm.com/book/en/Git-Basics-Recording-Changes-to-the-Repository</ref>

Disadvantages of Git Compared to Other SCMs

Centralized SCMs such as SVN have one great advantage over Git and all distributed version control systems: the handling of binary files. That isn't to say that you cannot store binary files in a Git repository. In fact Git can store binary files in its repository just as efficiently as it can store text based files. The root of the issue is that binary files cannot be reliably merged. As long as two developers never modify any version except the latest commit of the binary file, and never modify the latest commit of that file at the same time, there will never be a problem, because in those cases a merge is not necessary.

The connected nature of centralized version control systems, which is considered a disadvantage in most cases, is the key to consistent and controlled modification of binary files. It is that communication and single point of storage that allows a centralized version control system to prevent more than one developer from modifying the file at once, which it can accomplish by allowing only exclusive checkouts of binary files.

There are ways to work around this in Git. They simply require human interaction and communication so that only one developer modifies a file at a time. In a project setting the project manager could assign work on binary files to a single developer. In a more distributed development environment, IMs or e-mails may be used to communicate when a binary file will be modified. However, these approaches don't provide the automated protection that a centralized version control system can for binary files.

References

<references/>