CSC/ECE 517 Fall 2013/ch1 1w01 aj

From Expertiza_Wiki

Version Control Tools

Version control tools manage changes to documents, computer programs, large web sites, and other collections of information. A version control system is a repository of code together with the history of its changes. Every change made to the source is tracked, along with details such as who made the change, why it was made, and any other comments. This concept is very important for large projects that require collaborative development. Whether it is the history of this wiki page or a large software development project like Facebook, the ability to track changes, and to reverse them when necessary, can make all the difference between a well-managed, controlled process and an uncontrolled ‘first come, first served’ system. It can also be used to track the development lifecycle of a project in detail.


Categories of version control tools

Distributed Model

A distributed revision control system follows a peer-to-peer approach in which each person’s copy of the codebase is a complete repository. Each developer therefore has a copy of the full code rather than only the part being worked on. The system synchronizes the developers’ copies by exchanging patches that contain sets of changes, so each developer can obtain the latest version of the code that others are working on. This model usually does not depend on a central repository. The major advantages of this system are:


• Because the full code is available on the local disk, most actions are very fast, since no interaction with a remote server is required.

• New change sets can be committed locally without anyone else seeing them. Once you have a group of change sets ready, you can push all of them at once.

• Except for pushing and pulling, most actions can be performed without an internet connection.

Despite these advantages, distributed version control tools pose some problems. Because a copy of the whole codebase and its history is kept on the local machine, a large amount of space is required when files cannot be compressed well (binary files, for example). Moreover, if the project has a long history of changes, downloading the entire history can take a long time.

A few examples of such tools are Aegis, Bazaar, Codeville, Fossil, LibreSource, Monotone, Veracity and Git. Let us consider Git to explain the features of the distributed model for version control tools.


Git

Git is a free and open source distributed version control system whose highest priority is speed. It was initially designed and developed for the Linux kernel in 2005. Git was originally designed as a low-level version control system engine on top of which others could write front ends. However, the core Git project has since become a complete version control system that is usable directly. As of today, Git is estimated to have captured 30% of the market. Every Git working directory is a full-fledged repository with complete history and full version tracking capabilities, not dependent on network access or a central server.

(Image showing the concept of branching for Git)

Design of Git

Git, like any other distributed version control system, has three main functions:

• Code storage

• Keeping track of the changes made to the code

• Synchronizing all the developers with the latest code

Code Storage

The most basic unit of data storage is the blob. To track history, Git stores the full contents of each file rather than only the differences between versions of the file. The contents are then referenced by a 40-character SHA-1 hash, which means the reference is practically guaranteed to be unique. Every object, be it a commit, tree, or blob, has such a SHA-1 ID. Objects are commonly referenced by the first 7 characters of the ID, which are usually enough to identify the whole string.

One advantage of storing only the content is that if you have two or more copies of the same file in your repository, Git will store only one copy internally.
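The content-addressed storage described above can be sketched in a few lines of Python. This is a minimal illustration, not Git's implementation: the `BlobStore` class and its in-memory dictionary are hypothetical, and Git's real object store also compresses objects on disk. The ID calculation follows the scheme Git uses for blobs, where a short header is hashed together with the file contents.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the 40-character ID Git assigns to a blob with this content."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class BlobStore:
    """Hypothetical in-memory store keyed by blob ID (not Git's on-disk format)."""

    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        blob_id = git_blob_id(content)
        # Identical content always maps to the same key, so it is stored once.
        self.objects[blob_id] = content
        return blob_id


store = BlobStore()
first = store.put(b"hello world\n")
second = store.put(b"hello world\n")  # a second "copy" of the same file

print(first[:7])           # abbreviated ID, as commonly shown to users
print(first == second)     # both copies resolve to the same object
print(len(store.objects))  # only one object is actually stored
```

Because the key is derived from the content alone, adding the same file under a different name or in a different directory does not create a new blob.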

The next object is a tree. These can be thought of as folders or directories. Finally, this brings us to the most important object: the commit. Commits can be thought of as snapshots. The major difference between Git and any other VCS (Subversion and friends included) is the way Git thinks about its data. Conceptually, most other systems store information as a list of file-based changes. These systems (CVS, Subversion, Perforce, Bazaar, and so on) think of the information they keep as a set of files and the changes made to each file over time.

Git doesn’t think of or store its data this way. Instead, Git thinks of its data more like a set of snapshots of a mini filesystem. Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot. To be efficient, if files have not changed, Git doesn’t store the file again; it stores just a link to the previous identical file it has already stored.

This is an important distinction between Git and nearly all other VCSs. It makes Git reconsider almost every aspect of version control that most other systems copied from the previous generation. This makes Git more like a mini filesystem with some incredibly powerful tools built on top of it, rather than simply a VCS.
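To make the snapshot idea concrete, here is a rough Python sketch. It assumes a simplified model in which a commit is just a mapping from file path to blob ID; the names `objects`, `commits`, `store_blob` and `commit` are invented for illustration, and the hash omits the header that Git adds.

```python
import hashlib

objects = {}  # blob id -> file content (hypothetical in-memory object store)
commits = []  # each commit is a full snapshot: {path: blob id}


def store_blob(content: bytes) -> str:
    # Simplified: real Git also hashes a short header before the content.
    blob_id = hashlib.sha1(content).hexdigest()
    objects[blob_id] = content
    return blob_id


def commit(files: dict) -> dict:
    """Record a snapshot of every file; unchanged content reuses its blob."""
    snapshot = {path: store_blob(data) for path, data in files.items()}
    commits.append(snapshot)
    return snapshot


v1 = commit({"README": b"intro\n", "main.c": b"int main(void) { return 0; }\n"})
v2 = commit({"README": b"intro, revised\n", "main.c": b"int main(void) { return 0; }\n"})

print(v1["main.c"] == v2["main.c"])  # the unchanged file points at the same blob
print(len(objects))                  # total blobs stored across both snapshots
```

Each snapshot lists every file, but only the changed `README` produces a new stored object; the unchanged `main.c` is just referenced again.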


Some of the major characteristics of Git:

Good support for expansion

Git is extremely fast and scalable. Working against a local copy of the code instead of a copy on a remote server is very fast, and performance degrades very little as the project grows.

Distributed development

Each developer is provided with a local copy of the entire development history. Changes are copied from one repository to another. These changes are imported as additional development branches and can be merged in the same way as a locally developed branch.

Backward compatibility

Repositories can be published by HTTP, FTP, rsync, or a Git protocol over either a plain socket, ssh or HTTP. Git also has a CVS server emulation, which enables the use of existing CVS clients and IDE plugins to access Git repositories.

Non-linear development

Git provides quick branching and merging. A branch in Git is only a reference to a single commit. Git also includes tools for visualizing and navigating a non-linear development history.
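The claim that a branch is only a reference to a commit can be illustrated with a small Python model. This is a sketch under simplifying assumptions: commit IDs are plain strings, each commit has at most one parent, and the `commits` and `branches` dictionaries are hypothetical stand-ins for Git's object store and refs.

```python
commits = {}   # commit id -> parent commit id (None for the first commit)
branches = {}  # branch name -> commit id it currently points to


def commit(branch: str, commit_id: str) -> None:
    """Add a commit on top of the branch tip and advance the branch pointer."""
    commits[commit_id] = branches.get(branch)
    branches[branch] = commit_id


def create_branch(name: str, start: str) -> None:
    """Creating a branch copies one reference; no files or commits are copied."""
    branches[name] = branches[start]


def history(branch: str) -> list:
    """Walk parent links from the branch tip back to the first commit."""
    result, current = [], branches[branch]
    while current is not None:
        result.append(current)
        current = commits[current]
    return result


commit("master", "c1")
commit("master", "c2")
create_branch("feature", "master")
commit("feature", "c3")

print(history("master"))   # ['c2', 'c1']
print(history("feature"))  # ['c3', 'c2', 'c1']
```

Creating `feature` costs one dictionary entry regardless of project size, which is why branching in this model is cheap.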

Support for incomplete merge

Git provides several algorithms for completing a merge. When none of them can complete the merge automatically, Git tells the user so and supports manual editing to finish it.

Cryptographic authentication of history

Git stores history so that the ID of a particular version (a commit) depends on the complete history of changes leading up to that commit. Once a version is published, older versions cannot be changed without the change being noticed.
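A simplified Python sketch shows why editing an old version is detectable. It assumes a linear history in which each commit ID is the SHA-1 of the parent ID plus a content string; real Git commit objects contain more fields (tree, author, dates), and `commit_id` and `build_history` are illustrative names only.

```python
import hashlib


def commit_id(parent_id: str, content: str) -> str:
    """Derive an ID from the commit's content and its parent's ID."""
    return hashlib.sha1(f"{parent_id}\n{content}".encode("utf-8")).hexdigest()


def build_history(contents: list) -> list:
    """Return the chain of commit IDs for a linear sequence of commits."""
    ids, parent = [], ""
    for content in contents:
        parent = commit_id(parent, content)
        ids.append(parent)
    return ids


original = build_history(["add parser", "fix bug", "release 1.0"])
tampered = build_history(["add parser", "fix bug (edited)", "release 1.0"])

print(original[0] == tampered[0])  # True: commits before the edit are unaffected
print(original[2] == tampered[2])  # False: the edit changes every later ID
```

Anyone who already holds the original ID of the latest commit can therefore notice that an earlier commit was rewritten.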

Some of the major companies using Git in their projects include Google, Facebook, Microsoft, Twitter and Eclipse.

Client Server Model

Centralized or client-server version control systems are based on the idea that there is a single “central” copy of your project somewhere (probably on a server), and programmers will “commit” their changes to this central copy.

“Committing” a change simply means recording the change in the central system. Other programmers can then see this change. They can also pull down the change, and the version control tool will automatically update the contents of any files that were changed.

Most modern version control systems deal with “changesets,” which are simply groups of changes (possibly to many files) that should be treated as a cohesive whole. For example, a change to a C header file and the corresponding .c file should always be kept together.

Major Advantages of Centralized Version Control over Distributed Version Control

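A centralized system avoids two costs of the distributed model, because each client retrieves only the file versions it needs rather than the full history. If the project contains many large binary files that cannot be easily compressed, the space needed to store all versions of these files accumulates quickly in every distributed copy. If the project has a very long history (50,000 changesets or more), downloading the entire history can take an impractical amount of time and disk space.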

Perforce

Perforce is an enterprise version management system in which users connect to a shared file repository. Perforce applications are used to transfer files between the file repository and individual users' workstations. Perforce supports both Git clients and clients that use Perforce's own protocol. A Git client can communicate with the Perforce server over SSH, and other Perforce clients communicate with the server via TCP/IP using a proprietary RPC and streaming protocol.

The Perforce Versioning Service

The Perforce versioning service manages shared file repositories called depots. Depots contain every revision of every file under Perforce control. Perforce organizes files in depots into directory trees, like a large hard drive. Files in a depot are referred to as depot files or versioned files. Files are identified by namespace (i.e., by OS-neutral filenames). File content itself is not stored in the database. MD5 hashes of file content are stored in the database. Text file revisions are stored as RCS deltas and binary file revisions are stored in their entirety.

  The service maintains a database to track change logs, user permissions, and which users have which files checked out at any time. The information stored in this database is referred to as metadata.

  Database tables are stored as binary files. Checkpoints and journals are written as text files that can be compressed and offloaded. A database that has been corrupted by hardware failure or other catastrophe can be recovered from the most recent journal and checkpoint. Administrators must plan for disaster recovery by configuring database journaling and setting up regular checkpoints.

Perforce applications (clients)

  Perforce applications are used to communicate with the versioning service. Perforce applications enable you to check files in and out, manage conflicts, create development branches, track bugs and change requests, and more. Perforce applications include:

  • P4, the Perforce Command-Line Client, for all platforms

  • P4V, the Perforce Visual Client, for Mac OS X, UNIX, Linux, and Windows

  • P4Web, the Perforce Web Client, a browser-based interface to Perforce

  • Integrations, or plug-ins, that work with commercial IDEs and productivity software

  When the files are retrieved into the workspace, the Perforce application requests the files from the shared file repository. To keep network traffic to a minimum, the service keeps track of which files you (and other users) have retrieved. Perforce applications do not require a persistent connection to the versioning service.

Connecting and Mapping files to the workspace

  Perforce applications manage files in a designated area of the local disk, called the workspace. As the name implies, the workspace is where most of the work is done. More than one client workspace can exist, even on the same workstation.

  To control where the depot files appear under the workspace root, users must map the files and directories on the shared versioning service to corresponding areas of the local hard drive. Workspace views:

  • Determine which files in the depot can appear in a workspace.

  • Map files in the depot to files in the workspace.

  Client workspace views consist of one or more lines, or mappings. Each line in the workspace view has two sides: a "depot side" that designates a subset of files within the depot and a "client side" that controls where the files specified on the depot side are located under the workspace root.
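  The two-sided mappings can be sketched in Python as follows. This is a simplified model, not Perforce's implementation: it handles only mappings that end in the `...` wildcard, and the depot paths, the workspace name `my_workspace` and the function `map_to_workspace` are hypothetical examples.

```python
from typing import Optional

WILDCARD = "..."  # matches everything below a directory; the only wildcard handled

# Hypothetical view: each mapping has a depot side and a client side.
view = [
    ("//depot/main/src/...", "//my_workspace/src/..."),
    ("//depot/main/docs/...", "//my_workspace/manuals/..."),
]


def map_to_workspace(depot_path: str, workspace_root: str,
                     client: str = "my_workspace") -> Optional[str]:
    """Return the local path for a depot file, or None if no mapping includes it."""
    result = None
    for depot_side, client_side in view:
        depot_prefix = depot_side[: -len(WILDCARD)]
        client_prefix = client_side[: -len(WILDCARD)]
        if depot_path.startswith(depot_prefix):
            remainder = depot_path[len(depot_prefix):]
            relative = (client_prefix + remainder)[len(f"//{client}/"):]
            # A later matching line replaces the result of an earlier one.
            result = f"{workspace_root}/{relative}"
    return result


print(map_to_workspace("//depot/main/docs/guide.txt", "/home/alice/ws"))
print(map_to_workspace("//depot/other/notes.txt", "/home/alice/ws"))
```

  The first call lands under `manuals/` because the client side of the second mapping renames the directory; the second call returns `None` because no depot side includes that file, so it would not appear in the workspace.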

 What file types are supported?

  Perforce file types include seven base file types.

  • text files,

  • binary files,

  • native apple files on the Macintosh,

  • Mac resource forks,

  • symbolic links (symlinks),

  • unicode files,

  • utf16 files.

Working with files

The changelist is the basic unit of work in Perforce. The basic file editing operations common to all versioning systems (such as editing files, adding files, deleting files, backing out changes, and checking in files) are performed in changelists. A changelist consists of a list of files, their revision numbers, the changes you have made to the files, and a description that you supply to describe the work you have performed. Changelists serve two purposes:

  • To organize your work into logical units by grouping related changes to files together,

  • To guarantee the integrity of your work by ensuring that related changes to files are checked in together.

Perforce changelists are atomic change transactions; if a changelist affects three files, then the changes for all three files are committed to the depot, or none of the changes are. If the network connection between your Perforce client program and the Perforce server is interrupted during changelist submission, the entire submit fails. Each changelist has a changelist number (generated by Perforce), and a changelist description (supplied by the user who performed the changes).
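The all-or-nothing behavior can be modeled with a short Python sketch. The `Depot` class below is hypothetical and keeps everything in memory; it only illustrates the idea of staging changes and publishing them in one step, and says nothing about how the Perforce server implements its transactions. The `fail_after` parameter is an invented hook for simulating an interrupted submit.

```python
import copy


class Depot:
    """Hypothetical in-memory depot: path -> list of revisions."""

    def __init__(self):
        self.files = {}
        self.changelists = {}  # changelist number -> description
        self.next_change = 1

    def submit(self, description, changes, fail_after=None):
        """Apply every change in the changelist, or none of them.

        `fail_after` simulates an interruption after that many files.
        """
        staged = copy.deepcopy(self.files)  # work on a copy, not the live depot
        for count, (path, content) in enumerate(changes.items()):
            if fail_after is not None and count >= fail_after:
                raise ConnectionError("submit interrupted; depot left unchanged")
            staged.setdefault(path, []).append(content)
        self.files = staged  # publish all changes in one step
        number = self.next_change
        self.changelists[number] = description
        self.next_change += 1
        return number


depot = Depot()
first = depot.submit("add parser", {"parser.h": "v1", "parser.c": "v1"})
try:
    depot.submit("rework parser",
                 {"parser.h": "v2", "parser.c": "v2", "main.c": "v1"},
                 fail_after=2)
except ConnectionError:
    pass

print(first, sorted(depot.files), depot.files["parser.h"])
```

After the interrupted submit, `parser.h` still has only its first revision and `main.c` was never added, so the depot never exposes a half-applied changelist.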

Working concurrently

  Perforce helps teams to work concurrently. The conflict resolution and three-way merge process enables multiple users to work on the same files at the same time without interfering with each other's work.

  The three-way merge process for resolving file conflicts helps you to resolve conflicting changes to text files, but is not necessarily meaningful for binary files such as graphics or compiled code. If the user is working on files where merges are not meaningful, locking such files is an option to prevent others from making changes that conflict.
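  The core rule of a three-way merge can be shown with a deliberately simplified Python sketch. It assumes the base version and both edited versions have the same number of lines, so lines can be compared position by position; real merge tools first align the files with a diff algorithm. The function name `three_way_merge` and the sample data are illustrative only.

```python
def three_way_merge(base, yours, theirs):
    """Merge two edited versions of a file against their common base.

    Simplification: all three versions must have the same number of lines.
    Returns (merged_lines, conflicts), where conflicts holds 0-based indexes.
    """
    if not (len(base) == len(yours) == len(theirs)):
        raise ValueError("this sketch only handles versions with equal line counts")
    merged, conflicts = [], []
    for index, (b, y, t) in enumerate(zip(base, yours, theirs)):
        if y == t:      # both sides agree (or neither changed the line)
            merged.append(y)
        elif y == b:    # only "theirs" changed the line
            merged.append(t)
        elif t == b:    # only "yours" changed the line
            merged.append(y)
        else:           # both changed it differently: needs manual resolution
            merged.append(y)
            conflicts.append(index)
    return merged, conflicts


base = ["timeout = 10", "retries = 3", "debug = off"]
yours = ["timeout = 30", "retries = 3", "debug = off"]
theirs = ["timeout = 10", "retries = 3", "debug = on"]

print(three_way_merge(base, yours, theirs))
print(three_way_merge(base, yours, ["timeout = 60", "retries = 3", "debug = off"]))
```

  The first call merges cleanly because the two users changed different lines; the second reports a conflict on line index 0 because both changed the same line in different ways. A comparison like this has no useful meaning for binary content, which is why locking is offered for such files.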

  Perforce supports two types of file locking:

  • To prevent other users from checking in changes to a file you are working on, lock the file. Other users can still check out your locked file, but are unable to submit changelists that affect the locked file until you submit your changes.

  • To prevent a file from being checked out by more than one user at a time, use the +l exclusive-open filetype modifier. Files that have the +l filetype modifier can only be opened by one user at a time. Your Perforce administrator can use a special table called the typemap table to automatically specify certain file types as exclusive-open.   

 Streams

  Perforce streams are structured containers for the files that compose projects, codelines, and components. Applications like the Perforce Command-Line Client and P4V, the Perforce Visual Client, provide extensive support for streams.

  Streams confer the following benefits:

  • Ensure a hierarchical approach to branching

  • Provide an "out of the box" best-practice branching strategy

  • Provide metadata about the branch hierarchy to the Perforce service

  • Provide a standard approach to structuring code (stability and hierarchy)

  • Automate the generation of client workspace views and branch views

  • Offer a compelling and informative visualization of stream structure and status

  • Enable you to organize and visualize (bodies of) code.

  • Provide rules to make development easier.   

 Codeline Management

  Codelines are sets of related files that evolve together. To structure groups of related files by purpose, such as a new product or release, branches are created. To propagate changes between branches, changelists are integrated. To create a snapshot of files in a specific state, you can create a label, or refer to the files collectively by specifying a date or a changelist number.   

 Branching

  Branching is a method of managing changes between two or more sets of related files. Perforce's Inter-File Branching mechanism enables you to copy any set of files to a new location in the depot, and allows changes made to one set of files to be copied, or integrated, to the other. The new file set (or codeline) evolves separately from the original files, but changes in either codeline can be propagated to the other by means of integration.

  Most version control systems support some form of branching; Perforce's mechanism is unique because it mimics the style in which users create their own file copies when no branching mechanism is available.

  Merging is actually only one of three possible outcomes of an integration. The others are ignoring (aka "blocking") and copying (aka "promoting"). Merging is used to keep one set of files up to date with another. For example, a development branch may be kept up to date with its trunk through repeated merging. Ignoring disqualifies changes in one set of files from future integration into another. It is often used when a development branch must be up to date with, and yet divergent from, its trunk. Copying is typically used to promote the content of an up-to-date development branch into a trunk.

Local Data Model

In the local-only approach, all developers must use the same computer system. Such tools often manage single files individually and have largely been replaced by, or embedded within, newer software. This data model is the grandfather of all modern version control software: these were the very first version control tools, implemented in the 1970s. Technological and software advances have rendered them practically obsolete, but certain legacy systems still use them because of hardware and software limitations.

Source Code Control System (SCCS)

  Source Code Control System (SCCS) is an early revision control system, geared toward program source code and other text files. It was originally developed in SNOBOL at Bell Labs in 1972 by Marc Rochkind for an IBM System/370 computer running OS/360 MVT. SCCS was the dominant version control system for Unix until the release of the Revision Control System (RCS). Today, SCCS is generally considered obsolete. However, its file format is still used internally by a few other revision control programs, including BitKeeper and TeamWare. The latter is a frontend to SCCS. Sablime has been developed from a modified version of SCCS but uses a history file format that is incompatible with SCCS. The SCCS file format uses a storage technique called interleaved deltas (or the weave). This storage technique is now considered by many revision control system developers as foundational to advanced merging and versioning techniques, such as the "Precise Codeville" ("pcdv") merge.

Revision Control System (RCS)

The Revision Control System (RCS) is a software implementation of revision control that automates the storing, retrieval, logging, identification, and merging of revisions. RCS is useful for text that is revised frequently, for example programs, documentation, procedural graphics, papers, and form letters. RCS is also capable of handling binary files, though with reduced efficiency. Revisions are stored with the aid of the diff utility.
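The idea of storing revisions as differences can be sketched with Python's standard `difflib` module. This is an illustration of delta storage in general and does not reproduce the RCS file format: the delta representation (a list of line ranges with replacement lines) and the helper names `make_delta` and `apply_delta` are invented for the example. The newest revision is kept in full and the older one is rebuilt from it, a layout commonly attributed to RCS.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Record only the line ranges of `old` that must change to produce `new`."""
    ops = SequenceMatcher(a=old, b=new, autojunk=False).get_opcodes()
    return [(i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def apply_delta(old, delta):
    """Rebuild the other revision from `old` plus the stored delta."""
    result, position = [], 0
    for start, end, replacement in delta:
        result.extend(old[position:start])  # copy the unchanged lines
        result.extend(replacement)          # insert the changed lines
        position = end
    result.extend(old[position:])
    return result


rev1 = ["int main(void) {\n", "    return 0;\n", "}\n"]
rev2 = ["int main(void) {\n", "    puts(\"hi\");\n", "    return 0;\n", "}\n"]

# Keep the newest revision in full and store only a delta back to the older one.
head = rev2
reverse_delta = make_delta(rev2, rev1)

print(reverse_delta)
print(apply_delta(head, reverse_delta) == rev1)
```

Only the changed range is stored, which is cheap for text. For binary files a line-based comparison finds little in common between revisions, which is consistent with the reduced efficiency noted above.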

RCS came into existence in 1982 and gave version control systems a new dimension. It was the best VCS for a single-user setting, where all the files related to a project are on a single system. RCS locks only the single file that is being edited rather than the entire project, which is computationally cheaper.