CSC/ECE 517 Fall 2013/ch1 1w01 aj


Version Control Tools

Version control tools manage changes to documents, computer programs, large web sites, and other collections of information. A version control system is a repository of code together with the history of its changes. Every change made to the source is tracked, along with details such as who made the change, why it was made, and any accompanying comments. This is very important for large projects that require collaborative development. Whether it is the history of this wiki page or a large software project like Facebook, the ability to track changes and to reverse them when necessary can make all the difference between a well-managed, controlled process and an uncontrolled ‘first come, first served’ system. Version control can also be used to track the development lifecycle of a project in detail.

Categories of version control tools

Distributed Model

A distributed revision control system follows a peer-to-peer approach in which each developer’s copy of the codebase is a complete repository. Each developer therefore has the full code and its history rather than only the part being worked on. The system synchronizes the developers' copies by exchanging patches that contain sets of changes, so each developer can obtain the latest version of the code that others are working on. This approach usually does not depend on a central repository. The major advantages of this system are:


• Because the full repository is available on the local disk, most actions are very fast, as no interaction with a remote server is required.

• New change sets can be committed locally without anyone else seeing them. Once a group of change sets is ready, they can all be pushed at once.

• Apart from pushing and pulling, most actions can be performed without an internet connection.

Despite these advantages, distributed version control tools pose some problems. Because a copy of the whole repository is kept on the local machine, a large amount of space is required when the files cannot be compressed well (binary files, for example). Moreover, if the project has a long history of changes, downloading the entire history can take a long time.

A few examples of such tools are Codeville, Fossil, LibreSource, Monotone, Veracity and Git. Git is used below to explain the features of the distributed model for version control tools.


Git

Git is a free and open source distributed version control system. Its highest priority is speed. It was initially designed and developed for the Linux kernel in 2005. Git was originally designed as a low-level version control system engine on top of which others could write front ends. However, the core Git project has since become a complete version control system that is usable directly. As of today, Git is estimated to have captured 30% of the market. Every Git working directory is a full-fledged repository with complete history and full version tracking capabilities, not dependent on network access or a central server.

(Image showing the concept of branching for Git)

Design of Git

Git, like any other distributed version control system, has three main functionalities:

• Code storage

• Keeping track of the changes made to the code

• Synchronizing all the developers for the latest code

Code Storage

The blob is the basic unit of storage in Git. To track history, Git stores the full contents of each file, not just the differences between successive versions. The contents are then referenced by a 40-character SHA-1 hash. Pretty much every object, be it a commit, tree, or blob, has a SHA.

This gives the added advantage that if two or more copies of the same file are stored in the repository, Git stores the contents only once internally.
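The content addressing described above can be sketched in a few lines of Python. The hash is computed the way Git computes a blob id (a SHA-1 over the header "blob <size>" followed by a NUL byte and the contents); the `BlobStore` class is a hypothetical in-memory stand-in for Git's object database, used only to illustrate why identical files are stored once.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the 40-character SHA-1 id Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class BlobStore:
    """Hypothetical in-memory object store keyed by blob id (not Git's real storage)."""

    def __init__(self):
        self.objects = {}

    def add(self, content: bytes) -> str:
        blob_id = git_blob_id(content)
        # Identical content hashes to the same id, so it is kept only once.
        self.objects.setdefault(blob_id, content)
        return blob_id


store = BlobStore()
first = store.add(b"hello world\n")
second = store.add(b"hello world\n")  # same contents saved under another file name

print(first)                  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(first == second)        # True
print(len(store.objects))     # 1
```

Because the id depends only on the contents, the file name plays no part in deduplication; names are recorded one level up, in tree objects.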

The next level of object is the tree. Trees can be thought of as folders or directories. Finally, this brings us to the most important object: the commit. Commits can be thought of as snapshots. The major difference between Git and most other tools is the way Git thinks about its data. Conceptually, most other systems store information as a list of file-based changes: they treat the information they keep as a set of files and the changes made to each file over time.

(Image showing data storage in case of normal control system)

In contrast, Git thinks of its data more like a set of snapshots of a mini file system. Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot.

Git stores data as:

(Image showing data storage in case of Git)

This is an important difference between Git and most other tools. It makes Git reconsider almost every aspect of version control that most other systems copied from the previous generation.

Tracking the changes made to the code

Git uses a directed acyclic graph (DAG) to store its history. Every commit except the initial one has one or more parent commits, and this information about its ancestors is stored with the commit. History can therefore be nonlinear: a merge commit has multiple parents, one for each line of development that it combines.



The history of a file is linked all the way up its directory structure to the root directory, which is then linked to a commit node. This commit node, in turn, can have one or more parents. When merging multiple branches we are merging the contents in a DAG. This allows Git to determine common ancestors.
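As a minimal sketch of how a commit graph lets a tool find common ancestors, the Python below models history as a mapping from each commit to its parents. The commit names and the `parents` table are hypothetical, and the search is a plain graph walk chosen for clarity; it is not Git's actual implementation.

```python
from collections import deque

# Hypothetical history: each commit id maps to the list of its parent ids.
parents = {
    "A": [],
    "B": ["A"],
    "C": ["B"],        # tip of one branch
    "D": ["B"],        # tip of another branch
    "M": ["D", "C"],   # merge commit with two parents
}


def ancestors(commit, parents):
    """Return the set containing `commit` and every commit reachable from it."""
    seen = set()
    queue = deque([commit])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parents[current])
    return seen


def common_ancestors(a, b, parents):
    """Return the shared ancestors of `a` and `b` that are not ancestors of another shared ancestor."""
    shared = ancestors(a, parents) & ancestors(b, parents)
    return {
        c for c in shared
        if not any(c != other and c in ancestors(other, parents) for other in shared)
    }


print(common_ancestors("C", "D", parents))  # {'B'}
```

Commit B is the point where the two branches diverged, so it is the natural base for comparing the changes on each side when they are merged.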

Synchronizing all the developers for the latest code

In Git, when a developer modifies a file in the local repository, the change is first recorded locally; the developer can then "push" the commits to other users who are working on the same project. The content changes are stored identically in every Git repository that contains the commit. Upon the local commit, the local Git repository creates a new blob object for the changed file. For each directory above the changed file, a new tree object is created with a new identifier. A DAG is created starting from the newly created root tree object, which points to the new trees and references the newly created blob in place of that file's previous blob in the previous tree hierarchy. At this point the commit is still local to the repository on the developer’s own machine. When the developer "pushes" the commit to a publicly accessible Git repository, the commit is sent to that repository. After the public repository verifies that the commit can be applied to the branch, the same objects are stored in the public repository as were originally created in the local Git repository.
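The effect of a local commit on the object hierarchy can be illustrated with a simplified model. In the sketch below a file is a `bytes` value and a directory is a `dict`; the tree payload (one "name id" line per entry) is an assumption made for readability and is not Git's real tree encoding. The file names are hypothetical.

```python
import hashlib


def object_id(kind: str, payload: bytes) -> str:
    """Hash an object as '<kind> <size>', a NUL byte, then the payload."""
    header = f"{kind} {len(payload)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def snapshot(node, objects):
    """Store `node` (bytes = file, dict = directory) in `objects` and return its id."""
    if isinstance(node, bytes):
        oid = object_id("blob", node)
    else:
        # Simplified tree payload: one "name id" line per entry, sorted by name.
        lines = [f"{name} {snapshot(child, objects)}" for name, child in sorted(node.items())]
        oid = object_id("tree", "\n".join(lines).encode("utf-8"))
    objects[oid] = node
    return oid


objects = {}
before = {"README": b"docs\n",
          "src": {"main.c": b"int main(){}\n", "util.c": b"/* util */\n"}}
root_before = snapshot(before, objects)
count_before = len(objects)

# Change a single file and take a second snapshot.
after = {"README": b"docs\n",
         "src": {"main.c": b"int main(){return 0;}\n", "util.c": b"/* util */\n"}}
root_after = snapshot(after, objects)

print(root_before != root_after)      # True
print(len(objects) - count_before)    # 3
```

Only three new objects appear: the blob for the edited file, the tree for its directory, and the root tree. The unchanged files keep their ids and are shared between both snapshots.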

Tools available for Git

Git provides many command line and UI tools for most platforms. All of these tools are based on the Git core toolkit, which is divided into two parts: plumbing and porcelain. Plumbing contains the low-level commands that enable basic content tracking and manipulation of directed acyclic graphs. Porcelain contains the more commonly used commands for maintaining repositories and communicating with other developers. Work is in progress to improve the toolkit design and make it faster.


Some of the major characteristics of Git:

Good support for expansion

Git is extremely fast and scalable. Working against a local copy of the code instead of a copy on a remote server is very fast, and as the project grows, its size has very little effect on performance.

Distributed development

Each developer is provided with a local copy of the entire development history. Changes are copied from one repository to another; they are imported as additional development branches and can be merged in the same way as a locally developed branch.

Backward compatibility

Repositories can be published by HTTP, FTP, rsync, or a Git protocol over either a plain socket, ssh or HTTP. Git also has a CVS server emulation, which enables the use of existing CVS clients and IDE plugins to access Git repositories.

Non-linear development

Git provides quick branching and merging. A branch in Git is only a reference to a single commit. Git supports non-linear development through various tools.

Support for incomplete merge

Git has a well-defined model of an incomplete merge and multiple algorithms for completing one. When a merge cannot be completed automatically, Git informs the user and provides support for manual editing.

Cryptographic authentication of history

Git stores the history in such a way that the id of a particular version depends on the complete history of changes leading up to that commit. Once a version is published, it is not possible to change old versions without it being noticed.
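The dependence of each id on everything before it can be sketched as a hash chain. The commit format below is deliberately simplified (real Git commits also record author, committer and timestamps), and the tree ids `t1`, `t2`, `t3`, `tX` are hypothetical placeholders.

```python
import hashlib


def commit_id(tree_id: str, parent_ids: list, message: str) -> str:
    """Simplified commit id: a SHA-1 over the tree id, the parent ids and the message."""
    lines = [f"tree {tree_id}"] + [f"parent {p}" for p in parent_ids] + ["", message]
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()


def build_history(snapshots):
    """Chain (tree_id, message) pairs into a linear history and return the commit ids."""
    ids = []
    for tree_id, message in snapshots:
        parent_ids = ids[-1:]  # empty list for the first commit
        ids.append(commit_id(tree_id, parent_ids, message))
    return ids


original = build_history([("t1", "initial"), ("t2", "add parser"), ("t3", "fix bug")])
# Same history, except that the contents of the second commit were altered.
tampered = build_history([("t1", "initial"), ("tX", "add parser"), ("t3", "fix bug")])

print(original[0] == tampered[0])  # True
print(original[1] == tampered[1])  # False
print(original[2] == tampered[2])  # False
```

Altering the second commit changes its id, and because the third commit records that id as its parent, the third id changes as well. Anyone holding the original ids can therefore detect the modification.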

Some of the major companies using Git in their projects include Google, Facebook, Microsoft (ASP.NET), Twitter, Eclipse and Netflix.

Some of the other major tools in this category are given below

GNU Bazaar

GNU Bazaar (formerly Bazaar-NG, command line tool bzr) is a distributed revision control system sponsored by Canonical. Bazaar can be used by a single developer working on multiple branches of local content, or by teams collaborating across a network. Bazaar is written in the Python programming language, with packages for major GNU/Linux distributions, Mac OS X and Microsoft Windows.

Mercurial

Mercurial is a cross-platform, distributed revision control tool for software developers. It is mainly implemented using the Python programming language, but includes a binary diff implementation written in C. It is supported on Windows and Unix-like systems, such as FreeBSD, Mac OS X and Linux. Mercurial is primarily a command line program but graphical user interface extensions are available. All of Mercurial's operations are invoked as arguments to its driver program hg, a reference to the chemical symbol of the element mercury.

BitKeeper

BitKeeper is a software tool for distributed revision control (configuration management, SCM, etc.) of computer source code. A distributed system, BitKeeper competes largely against other systems such as Git and Mercurial. BitKeeper is produced by BitMover Inc., a privately held company based in Campbell, California[2] and owned by CEO Larry McVoy, who had previously designed TeamWare.

BitKeeper builds upon many of the TeamWare concepts. Its key selling point is the fact that it is a distributed version control tool, as opposed to CVS or SVN. One of the defining characteristics of any distributed version control tool is the ease with which distributed development teams can keep their own local source repositories and still work with the central repository. Its web site claims that "BitKeeper has been shown to double the pace of software development".

Client Server Model

Centralized or client-server version control systems are based on the idea that there is a single “central” copy of your project somewhere (probably on a server), and programmers will “commit” their changes to this central copy.

“Committing” a change simply means recording the change in the central system. Other programmers can then see this change. They can also pull down the change, and the version control tool will automatically update the contents of any files that were changed.

Most modern version control systems deal with “changesets,” which are simply groups of changes (possibly to many files) that should be treated as a cohesive whole. For example, a change to a C header file and the corresponding .c file should always be kept together.

Major Advantages of Centralized Version Control over Distributed Version Control

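Centralized version control has the advantage in the following situations:

• If the project contains many large binary files that cannot be easily compressed, the space needed to store all versions of these files on every developer's machine can accumulate quickly under a distributed system.

• If the project has a very long history (50,000 changesets or more), downloading the entire history, as a distributed system requires, can take an impractical amount of time and disk space.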

Perforce

Perforce is an enterprise version control and management system based on the centralized model. Dedicated Perforce applications are used to sync files between the file repository and individual users' workstations. Perforce supports both Git clients and clients that use Perforce's own protocol. A Git client can communicate with the Perforce server over SSH, and other Perforce clients communicate with the server via TCP/IP using a proprietary RPC and streaming protocol.

The Perforce Versioning Service

The Perforce versioning service manages shared file repositories called depots. Depots contain all revisions of all files under Perforce control. Perforce organizes the files in depots into directory trees, like a hard drive. Files in a depot are referred to as depot files or versioned files. Files are identified by namespace (i.e., by OS-neutral filenames). File content itself is not stored in the database; only MD5 hashes of file content are stored there. Text file revisions are stored as RCS deltas and binary file revisions are stored in their entirety.

The service maintains a database to track change logs, user permissions, and which users have which files checked out at any time. The information stored in this database is referred to as metadata.

Database tables are stored as binary files. Checkpoints and journals are written as text files that can be compressed and offloaded. A database that has been corrupted by hardware failure or another catastrophe can be recovered from the most recent journal and checkpoint. Administrators must plan for disaster recovery by configuring database journaling and setting up regular checkpoints.

Perforce applications (clients)

Dedicated Perforce applications communicate with the versioning service to enable checking files in and out, managing conflicts, creating development branches, tracking bugs and change requests, and more. They include:

  • P4, the Perforce Command-Line Client, for all platforms

  • P4V, the Perforce Visual Client, for Mac OS X, UNIX, Linux, and Windows

  • P4Web, the Perforce Web Client, a browser-based interface to Perforce

  • Integrations, or plug-ins, that work with commercial IDEs and productivity software

When files are loaded into the workspace, the Perforce application requests them from the central data repository. To optimize network utilization, the service keeps track of which files you (and other users) have retrieved. Perforce applications do not require a persistent connection to the versioning service.

Connecting and Mapping files to the workspace

Perforce applications manage files in a designated area of the local hard disk, called the workspace. More than one client workspace can co-exist, even on the same workstation.

To customize the location of depot files under the workspace root, users must map the files and directories on the shared versioning service to corresponding areas of the local hard drive.

Workspace views:

  • Determine which files in the depot need to be in the workspace.

  • Custom-map files in the depot to files in the workspace.

Client workspace views may consist of multiple lines, or mappings. Each line in the workspace view has a "depot side" that designates a subset of files within the depot, and a "client side" that controls where the files specified on the depot side are located under the workspace root.
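The two-sided mapping described above can be sketched as a small lookup function. This is a simplified illustration, not Perforce's implementation: it handles only lines in which both sides end in the "..." wildcard, it assumes that a later line overrides an earlier one when both match, and the depot paths, the workspace name `my_ws` and the root `/home/dev/ws` are all hypothetical.

```python
def map_to_workspace(depot_path, view, workspace_root):
    """Return the local path for `depot_path`, or None if no view line matches.

    Only lines whose depot side and client side both end in "..." are handled.
    """
    result = None
    for depot_side, client_side in view:
        depot_prefix = depot_side[:-len("...")]
        client_prefix = client_side[:-len("...")]
        if depot_path.startswith(depot_prefix):
            remainder = depot_path[len(depot_prefix):]
            # Drop the leading "//<workspace name>/" from the client side.
            relative = client_prefix.split("/", 3)[3] + remainder
            result = f"{workspace_root}/{relative}"  # later lines override earlier ones
    return result


view = [
    ("//depot/main/...", "//my_ws/..."),
    ("//depot/main/doc/...", "//my_ws/docs/..."),
]

print(map_to_workspace("//depot/main/src/app.c", view, "/home/dev/ws"))
# /home/dev/ws/src/app.c
print(map_to_workspace("//depot/main/doc/guide.txt", view, "/home/dev/ws"))
# /home/dev/ws/docs/guide.txt
print(map_to_workspace("//depot/other/x.c", view, "/home/dev/ws"))
# None
```

The first line selects which depot files appear in the workspace at all, while the second shows a custom mapping that places one depot directory under a different local name.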

What file types are supported?

  Perforce file types include seven base file types:

  • text files,

  • binary files,

  • native Apple files on the Macintosh,

  • Mac resource forks,

  • symbolic links (symlinks),

  • unicode files,

  • utf16 files.

Working with files

The changelist is the fundamental unit of work. The basic file operations common to all version control systems (such as editing, adding, deleting, rolling back changes, and checking in files) are performed in changelists. A changelist consists of a list of files, their revision numbers, the changes made to them, and a description of the work that has been performed. Changelists serve two purposes:

  • To logically organize the work by grouping related changes to files together,

  • To ensure the work integrity by making sure that related changes to files are checked in together.

Perforce changelists are atomic change transactions: if a changelist affects three files, then either the changes to all three files are committed to the depot, or none of them are. If the network connection between the Perforce client program and the Perforce server is interrupted during changelist submission, the entire submit fails. Each changelist is identified by a changelist number (generated by Perforce) and a description (supplied by the user).
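The all-or-nothing behaviour can be illustrated with a small in-memory model. The `Depot` class, its `submit` method and the file names below are hypothetical and only sketch the idea of an atomic transaction; they do not reflect how the Perforce service is implemented.

```python
import copy


class Depot:
    """Hypothetical in-memory depot: path -> list of revisions (latest last)."""

    def __init__(self):
        self.files = {}
        self.descriptions = {}
        self.next_change = 1

    def submit(self, description, edits):
        """Apply every (path, base_revision, content) edit, or none of them."""
        staged = copy.deepcopy(self.files)  # work on a copy until all edits are validated
        for path, base_revision, content in edits:
            revisions = staged.setdefault(path, [])
            if len(revisions) != base_revision:
                raise ValueError(f"{path} is out of date; changelist rejected")
            revisions.append(content)
        # Reached only if every edit was valid: publish all changes at once.
        self.files = staged
        number = self.next_change
        self.descriptions[number] = description
        self.next_change += 1
        return number


depot = Depot()
first = depot.submit("add parser", [("parser.h", 0, "v1"), ("parser.c", 0, "v1")])
print(first)  # 1

try:
    # The second edit is based on a stale revision, so the whole changelist fails.
    depot.submit("update parser", [("parser.h", 1, "v2"), ("parser.c", 0, "stale")])
except ValueError as error:
    print(error)  # parser.c is out of date; changelist rejected

print(depot.files["parser.h"])  # ['v1']
```

Although the edit to parser.h was valid on its own, it is not committed, which keeps the header and its source file consistent with each other.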

Working concurrently

  Perforce helps teams to work concurrently. The conflict resolution and three-way merge process enables multiple users to work on the same files at the same time without interfering with each other's work.

  The three-way merge process for resolving file conflicts helps you to resolve conflicting changes to text files, but is not necessarily meaningful for binary files such as graphics or compiled code. If the user is working on files where merges are not meaningful, locking such files is an option to prevent others from making changes that conflict.

  Perforce supports two types of file locking:

  • To prevent conflicting changes to a file that is being edited, lock the file. Other users can still check out the locked file, but cannot submit changes to it until the lock is released.

  • To prevent a file from being checked out by more than one user at a time, use the +l exclusive-open filetype modifier. Files that have the +l filetype modifier can only be opened by one user at a time. Your Perforce administrator can use a special table called the typemap table to automatically specify certain file types as exclusive-open.   

Streams

Perforce streams are structured containers for the files that comprise projects, codelines, and components. Streams confer the following benefits:

  • Ensure a hierarchical approach to branching

  • Provide an "out of the box" best-practice branching strategy

  • Provide metadata about the branch hierarchy to the Perforce service

  • Provide a standard approach to structuring code (stability and hierarchy)

  • Automate the generation of client workspace views and branch views

  • Offer a compelling and informative visualization of stream structure and status

  • Enable you to organize and visualize (bodies of) code

  • Provide rules to make development easier

Codeline Management

Codelines are sets of inter-related files that evolve together. Branches are created to organize groups of related files by purpose. Changelists are integrated to move changes between branches. Labels can be used to create a snapshot of files at a specific phase.

Branching

Branching is a method of managing changes between two or more sets of files. Perforce's Inter-File Branching mechanism enables any set of files to be copied to a new location in the depot and allows changes made to one set of files to be copied, or integrated, to the other. The new file set (or codeline) evolves separately from the original files, but changes in either codeline can be propagated to the other by means of integration. Almost all version control systems support some form of branching.

  Merging is actually only one of three possible outcomes of an integration. The others are ignoring (aka "blocking") and copying (aka "promoting"). Merging is used to keep one set of files up to date with another. For example, a development branch may be kept up to date with its trunk through repeated merging. Ignoring disqualifies changes in one set of files from future integration into another. It is often used when a development branch must be up to date with, and yet divergent from, its trunk. Copying is typically used to promote the content of an up-to-date development branch into a trunk.

Some other tools in this category are given below

AccuRev

AccuRev is a centralized version control system which uses a client/server model. Communication is performed via TCP/IP using a proprietary protocol. Servers function as team servers, continuous integration servers, or build servers. AccuRev is built around a stream-based architecture in which streams form a hierarchical structure of code changes where parent streams pass on certain properties to child streams. Developers make changes using command line functions, the Java GUI, the web interface, or one of the IDE plug-ins.

Characteristics:

• Streams and parallel development

• Private developer history

• Change packages

• Distributed development

• Automated merging

IBM Rational ClearCase

The Rational ClearCase family consists of several software tools for supporting software configuration management (SCM) of source code and other software development assets. It is developed by the Rational Software division of IBM. ClearCase forms the base for configuration management for many large and medium sized businesses and can handle projects with hundreds or thousands of developers. One part of Rational ClearCase is its revision control system, the feature that end users interact with.

ClearCase supports two kinds of use models, UCM (Unified Change Management), and base ClearCase. UCM provides an out-of-the-box model while base ClearCase provides a basic infrastructure (upon which UCM is built). Both can be configured to support a wide variety of needs. UCM is part of RUP (Rational Unified Process) and therefore all process templates and roles can be used from RUP.

Features:

• Build auditing

• VOB (Versioned Object Base)

• Configuration Record

• Build Avoidance

• Unix/Windows Interoperability

• Integration With Other Products

• Space Saving

Weaknesses:

• Speed

• Sensitivity to network problems

Local Data Model

In the local-only approach, all developers must use the same computer system. Such software often manages single files individually and has largely been replaced by, or embedded within, newer software. This data model is the grandfather of all modern version control software: these were the very first version control tools, implemented in the 1970s. Technological and software advances have rendered them practically obsolete, but certain legacy systems still use them because of hardware and software limitations.

Source Code Control System (SCCS)

Source Code Control System (SCCS) is an early revision control system, geared toward program source code and other text files. It was originally developed in SNOBOL at Bell Labs in 1972 by Marc Rochkind for an IBM System/370 computer running OS/360 MVT. SCCS was the dominant version control system for Unix until the release of the Revision Control System (RCS). Today, SCCS is generally considered obsolete. However, its file format is still used internally by a few other revision control programs, including BitKeeper and TeamWare. The latter is a frontend to SCCS. Sablime has been developed from a modified version of SCCS but uses a history file format that is incompatible with SCCS. The SCCS file format uses a storage technique called interleaved deltas (or the weave). This storage technique is now considered by many revision control system developers as foundational to advanced merging and versioning techniques, such as the "Precise Codeville" ("pcdv") merge.

Revision Control System (RCS)

The Revision Control System (RCS) is a software implementation of revision control that automates the storing, retrieval, logging, identification, and merging of revisions. RCS is useful for text that is revised frequently, for example programs, documentation, procedural graphics, papers, and form letters. RCS is also capable of handling binary files, though with reduced efficiency. Revisions are stored with the aid of the diff utility.
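Storing revisions as differences can be sketched with Python's standard `difflib` module. The example assumes a layout in which the newest revision is kept in full and an older revision is kept as a delta against it; the `make_delta` and `apply_delta` helpers and the operation format are hypothetical, and they are not the file format RCS actually writes.

```python
from difflib import SequenceMatcher


def make_delta(source, target):
    """Return edit operations that turn the `source` lines into the `target` lines."""
    ops = []
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))          # reuse lines already in source
        elif j2 > j1:
            ops.append(("add", target[j1:j2]))    # lines that exist only in target
        # a pure deletion needs no operation: the lines are simply not copied
    return ops


def apply_delta(source, ops):
    """Rebuild the target lines from the `source` lines and a delta."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(source[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


latest = ["alpha\n", "beta\n", "gamma\n"]      # newest revision, stored in full
previous = ["alpha\n", "gamma\n", "delta\n"]   # older revision, stored only as a delta

delta = make_delta(latest, previous)
print(apply_delta(latest, delta) == previous)  # True
```

Only the lines that differ are stored in the delta, which is why this approach works well for text but offers little benefit for binary files.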

RCS came into existence in 1982 and gave version control systems a new dimension. It was the best VCS for a single-user system, where all the files related to a project are on a single machine. RCS locks only the single file that is being edited rather than the entire project, which is computationally cheaper.

References

1. Version Control Tools

2. DVS

3. Distributed Version Control

4. Git

5. GitReady

6. Archlinux

7. StackOverflow

8. Wikipedia

9. Perforce documentation and manuals