CSC/ECE 517 Fall 2009/wiki1a 5 rp: Difference between revisions

From Expertiza_Wiki
=Introduction=


The defining characteristic of a version control system is its ability to track a document, or set of documents, across many changes, or revisions.  For the vast majority of applications, version control systems have focused on tracking plain text files, such as those used for programming source code, HTML documents, and various markup formats.


The history of the development of version control tools can be roughly divided into three main phases:


# Local Version Control
# Networked (Client-Server) Version Control
# Distributed Version Control


This breakdown is focused on the mechanisms that underlie how data is ''shared'' and ''stored'' in a version control system. It should not be inferred from this structure that other attributes are unimportant to the history of the development of version control systems. There have been many advances in how:


* conflicts are recognized and merges are performed
* groups of logically coherent changes are tracked
* data is represented within the repository


While these may seem to be three separate topics, we will see in the section on [[#Other Advances]] that they are in fact very closely related.


=Local Version Control=


The first version control systems (like [http://en.wikipedia.org/wiki/Source_Code_Control_System SCCS], 1972) focused on ''local version control''; that is, version control on a centralized computer system that was shared by many users, often at the same time. In such a system, the ''repository'', or location in which the data was stored, was simply a directory on the server to which the users had access. Because of this use case, these systems focused on two main features:


* File Version Tracking
* File Checkout and Locking


We will address each of these fundamental features in turn.


==File Version Tracking==


The primary feature of these early systems was the ability to check in files at various points as they were altered, so that the history of changes made to files under version control was kept permanently. Thus, many users could alter many files over time, and the entire set of documents under version control at a given point in time could be recovered, preventing loss of valuable data, as well as providing a record of which users made changes to files over time.


==File Checkout and Locking==


Because many users in a shared system may wish to edit a file simultaneously, one of the first features developed for version control systems was the ability to ''check out'' and ''lock'' a file. When a user checks out a file for editing, he or she reserves the right to be the sole editor of that file until it is checked back in to version control.  Both SCCS and [http://www.gnu.org/software/rcs/ RCS] were designed for use in a shared environment and, as such, allowed files to be checked out and locked in this way.  Other users could check out the file, but only to view it.  Thus, the file was locked from editing by all but the user who had checked it out for editing.
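The checkout-and-lock behavior described above can be illustrated with a small Python model. This is only a sketch under stated assumptions: the class name <code>LockingRepository</code> and its methods are hypothetical and do not correspond to the actual SCCS or RCS commands or file formats.

```python
class LockingRepository:
    """Toy model of a repository with exclusive edit locks (hypothetical API)."""

    def __init__(self):
        self.revisions = {}  # path -> list of contents, oldest first
        self.locks = {}      # path -> user currently holding the edit lock

    def add(self, path, content):
        """Place a new file under version control as revision 1."""
        self.revisions[path] = [content]

    def checkout(self, path, user, for_edit=False):
        """Return the latest content; take the lock if editing is requested."""
        if for_edit:
            holder = self.locks.get(path)
            if holder is not None and holder != user:
                raise PermissionError(f"{path} is locked by {holder}")
            self.locks[path] = user
        return self.revisions[path][-1]

    def checkin(self, path, user, content):
        """Record a new revision and release the lock held by `user`."""
        if self.locks.get(path) != user:
            raise PermissionError(f"{user} does not hold the lock on {path}")
        self.revisions[path].append(content)
        del self.locks[path]
        return len(self.revisions[path])  # new revision number, counting from 1
```

In this model, a read-only checkout always succeeds, while a second request to check out the same file for editing fails until the lock holder checks the file back in.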


==Weaknesses of Local Version Control Systems==

These local systems had two primary problems.

First, they required that every user log into a single computer to edit or access the information in the repository. This often posed both performance and security risks, in addition to being cumbersome as networks became more prevalent.

Second, they restricted a particular file to having only one editor at any given time. The next development in version control, embodied by CVS, sought to address both of these problems.


=Networked Version Control: Client-Server=


As users moved away from logging into systems locally to make their changes to files, the need for a version control system that supported remote operations emerged.  The natural way to implement such remote operations was as an extension of the existing system, and by far the most prominent manifestation of this philosophy was [http://www.nongnu.org/cvs/ CVS], the Concurrent Versions System, which was initially based on RCS.  Development of CVS began in 1984, and it matured throughout the mid- to late-1980s.  Version 1.0 was released under the [http://www.gnu.org/copyleft/gpl.html GNU GPL], a free software license, in the second half of 1990.


The main feature driving the development of CVS was the need for many users, each on his or her own machine, to be able to perform all the operations present in the original RCS, but over a network connection, and in a way that allowed for concurrent editing to take place.  This led to the development of a client-server model of version control systems, in which one central server would contain the canonical version of the repository, and various clients could connect to the central server and perform file checkouts and commits.  This model is very similar to the original RCS model, but rather than requiring users of the system to log into the version control system locally, it allowed users to access and alter the contents of the repository over the network.

Although CVS supports locking in the same way RCS does, CVS was among the first version control systems to support a ''non-locking repository''. This allowed for concurrent editing of files under version control, and generated the need to develop new features that addressed the resulting complexities. Chief among these new features were the notions of ''branching'' and ''merging''. They are what made a non-locking repository practical, which is why there is an emphasis on the "concurrent" portion of the name "Concurrent Versions System".


==Branching and Merging==


Inherent in the notion of concurrent editing is the problem of how to reconcile conflicting changes to the same file. A ''conflict'' is essentially two or more changes made to the same file that may be difficult to merge into a final file that contains both sets of changes. An example of a conflict would occur if two users both edited a file on line 49, one changing the word "blue" to "red", and the other changing the same word "blue" to "green". The first user would then commit his or her changes back to the repository, and when the second user committed changes, the version control system would detect that the repository had changed since the second user had obtained the file (since the first user had made a change and then committed it). At that point, the version control system would detect a conflict, and prompt the users to coordinate and resolve the conflict by determining what text should be on line 49.
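The "blue" versus "red" versus "green" scenario above can be sketched in a few lines of Python. This is a simplified illustration, not how CVS is implemented: the names <code>NonLockingRepository</code> and <code>ConflictError</code> are hypothetical, and the sketch rejects ''any'' commit made against a stale revision, whereas a real system would first try to merge changes that do not overlap.

```python
class ConflictError(Exception):
    """Raised when a commit is based on an out-of-date revision."""


class NonLockingRepository:
    """Toy single-file repository without locks (hypothetical API)."""

    def __init__(self, content):
        self.history = [content]  # revision 0 is the initial content

    def checkout(self):
        """Return (revision number, content) for the latest revision."""
        return len(self.history) - 1, self.history[-1]

    def commit(self, base_revision, content):
        """Accept the commit only if it was based on the latest revision."""
        head = len(self.history) - 1
        if base_revision != head:
            raise ConflictError(
                f"based on revision {base_revision}, but head is {head}"
            )
        self.history.append(content)
        return head + 1
```

If two users both check out revision 0 and the first commits "red", the second user's attempt to commit "green" against revision 0 raises <code>ConflictError</code>, because the repository has moved on to revision 1 in the meantime.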


The solution to this problem lies in allowing users of the version control system to ''branch'' a version of the repository and make (possibly many) changes to that branch independent of the changes occurring on the main branch of the repository, known as the ''trunk''. Once a logical set of changes is completed on a branch, that branch needs to have its changes reconciled with the current state of the repository on the trunk. This process of reconciliation is known as ''merging''.


This feature is critical in a multi-user environment, as it allows work to progress on multiple fronts simultaneously, requiring only that the files be merged once the users of the system are ready to reconcile their changes with those of other users.


Along with the development of mechanisms to allow this sort of concurrent access to the repository over the network, version control systems developed better algorithms for detecting conflicts and merging changes. This aspect of version control is discussed further in [[#Merge Algorithms]].


==Client-Server Beyond CVS==


Although CVS developed good approaches to solving many of these problems, it had many shortcomings that gained attention when it became the most widely used version control system for open source development. An exhaustive list would be lengthy, but a few examples are illustrative.

* CVS doesn't provide ''atomic'' operations, which means that if a network failure occurs during a commit, the repository can become corrupted.
* CVS does not version control directories or symbolic links, which means the repository is really a lossy copy of a developer's environment, sometimes resulting in failure to track changes accurately.
* CVS doesn't track which files were committed at the same time, so if you make a logical group of changes to several files and want to track the fact that those files were changed together, you can only derive that information from log messages. CVS will not track it for you.
* CVS cannot track when files are renamed; rather, a rename of a file in CVS looks like the original file was deleted and a new file added, thus losing the file's history.
* Creating branches and managing the subsequent merges is slow and difficult.
 
In short, while CVS provided a whole host of new features and advanced the state of the art in version control, it left room for improvement. This resulted in a vast number of client-server version control systems entering the market following CVS. One of the latest and most notable of these is [http://subversion.tigris.org/ Subversion], which seeks to address all of the [http://subversion.tigris.org/features.html issues] mentioned above and a whole lot more.
 
=Distributed Version Control=
 
In the late 1990s, a new paradigm started to emerge with the arrival of new, proprietary version control systems.  The first of these was [http://en.wikipedia.org/wiki/Sun_WorkShop_TeamWare Sun WorkShop TeamWare], the lead designer of which went on to found a new company, [http://www.bitkeeper.com/ BitMover], and develop the leading proprietary distributed version control system, BitKeeper.  These were the first distributed version control systems.
 
''Distributed'' version control systems (DVCS) took many of the advances seen in client-server version control systems and moved them into a less centralized architecture. Essentially, the original version control systems were completely centralized, requiring every user to locally log in to the server on which the repository was located. In client-server version control systems, the system was made slightly more distributed, allowing users to connect from across the network to the repository, copy files from the repository to other machines for editing, and then commit them back to the server when edits were complete. Distributed version control continues the trend of decentralization by putting an entire repository, complete with a history of changes and the ability to support remote connections, on each user's machine.


One of the strengths of CVS is that it supports file locking even though the main advance it provides is a non-locking repository. In the same way that CVS supports legacy locking workflows, distributed version control systems support the workflows usually associated with a centralized repository. The main improvement distributed version control systems offer, however, is that they do not require a central server. There are three advantages to this decentralized approach.


First, it encourages creation of branches.  Specifically, every time a user "checks out" a file or group of files, a new branch is created on that user's machine. This is in stark contrast to the client-server model, in which each branch is carefully planned and coordinated with other users of the system before it is created. Essentially, branching and merging in a centralized system are often difficult and slow, whereas in a DVCS they are designed to be natural and fast.


Second, it allows users to commit their changes without disturbing other users of the system.  In typical client-server work flows, the notion of a ''commit'' is tightly coupled to the notion of a ''merge'' with the code that is currently in the repository.  Distributed version control decouples these two notions, allowing developers to commit freely, and merge with other users at a different time.


Third, because each user has an entire copy of the repository, all work is done ''locally'', which allows users to continue doing work even when they don't have access to the internet or to a particular server. Further, many useful operations that take a long time in a centralized system take far less time in a distributed system, simply because the entire repository is local and therefore no network latency is involved.


All of these changes have been made possible as networks have become faster and the computers on which end users now work are often as powerful as the servers that would host a centralized repository. Thus, as the compute power has moved to the edge of the network, so too has the data in the repositories.


This treatment might make it seem as though DVCS approaches are strictly superior to centralized networked approaches to version control.  In general, DVCS is considered to be an advancement of the state of the art, much as networked systems were considered superior to their local counterparts.  However, one common use case for version control systems is inside the firewall of a corporation, where work is done on possibly proprietary or even classified data.  In such scenarios, there is often a strong desire, at least at the management level, to tightly control storage of the version controlled data.  Sometimes, such control is mandated by the customer.  In these cases, it may well be undesirable to have developers replicating data at will, and in places that are not well managed and controlled.  In these circumstances, it often makes sense to use a more centralized approach simply because the possibility of uncontrolled copying of the data represents a security risk.  Products like IBM's [http://www-01.ibm.com/software/awdtools/clearcase/ ClearCase] are designed to perform very centralized and controlled version control in such environments.


=Other Advances=


In addition to the evolution of the way version control systems allowed users to access, modify and share data in the repository, many advances have been made in the way changes are merged, tracked and stored.


==Merge Algorithms==
Merge algorithms are a good way to frame many of the problems that arise in a concurrent development environment.  It is therefore useful to start by discussing the issue of merge algorithms, even though relatively few advances have been made in recent years on the algorithms themselves.

There are really two kinds of merging algorithm:

* 2-way merge
* 3-way merge

2-way merge was developed first, so we will discuss it first.
===2-way Merge===
A [http://en.wikipedia.org/wiki/Merge_%28revision_control%29#Two-way_merge 2-way merge] takes two files and compares them for differences, merges differences that do not conflict and identifies differences that conflict for human resolution. [http://en.wikipedia.org/wiki/Diff diff] is a very well-known utility for performing such comparisons and its [http://en.wikipedia.org/wiki/Diff#Algorithm algorithm] is based upon a procedure for finding the [http://en.wikipedia.org/wiki/Longest_common_subsequence_problem longest common subsequence] of text in the files to be compared.  Methods for approaching this basic algorithm have gone largely unchanged since 1975, and the algorithm is used in both SCCS and RCS to create ''patches'' as a means of storing multiple versions of files under version control.

While the 2-way merge approach is a good start for managing concurrent modification of the same resource in a version control system, it became apparent that, although the merge algorithm was sound, the version control system actually had much more information than the merge program did. As a result, there were many cases in which a merge was ''conceptually'' easy, but proved to be difficult in practice.  In particular, given knowledge of a ''common ancestor'' from which two files had both originated, a merge program could make more decisions autonomously, streamlining the merge process.  Thus, 3-way merge was born.
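To make the longest-common-subsequence idea behind a 2-way comparison concrete, here is a minimal Python sketch. It uses the textbook dynamic-programming formulation, which is an assumption made for clarity; production diff tools use more efficient algorithms. The function names <code>lcs_table</code> and <code>line_diff</code> are hypothetical.

```python
def lcs_table(a, b):
    """table[i][j] is the length of the longest common subsequence
    of the line lists a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def line_diff(a, b):
    """Return (tag, line) pairs: ' ' unchanged, '-' only in a, '+' only in b."""
    table = lcs_table(a, b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append((" ", a[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            out.append(("-", a[i]))  # dropping a[i] keeps the LCS intact
            i += 1
        else:
            out.append(("+", b[j]))  # b[j] is an insertion relative to a
            j += 1
    out.extend(("-", line) for line in a[i:])
    out.extend(("+", line) for line in b[j:])
    return out
```

For example, comparing <code>["sky", "blue", "sea"]</code> with <code>["sky", "red", "sea"]</code> reports "sky" and "sea" as unchanged, "blue" as removed and "red" as added. Note that the comparison alone cannot say which of the two files holds the intended change, which is the limitation that a common ancestor addresses.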
===3-way Merge===
A [http://en.wikipedia.org/wiki/Merge_%28revision_control%29#Three-way_merge 3-way merge] is an approach to merging two differing files with reference to a third file that is a common ancestor of the two differing files.  By comparing each of the differing files first to the ancestor, and then to each other, the merge program can resolve differences without human intervention more often than is possible in a 2-way merge approach.

This advancement alone made branching a much less risky proposition for development teams, and made distributed version control systems (which place special emphasis on a branched development approach) viable for both small and large development teams.  While both CVS and Subversion (and other modern client-server version control systems) made use of a 3-way merge approach, the newer distributed version control systems tracked groups of changes effectively, and represented the changes of sets of files as a directed acyclic graph ([http://en.wikipedia.org/wiki/Directed_acyclic_graph DAG]), making identification of ''useful'' common ancestors easier, and thus streamlining the merge process.  We will discuss each of these features in turn.
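Returning to the 3-way merge rule itself, the following Python sketch shows how a common ancestor lets a merge program decide more cases on its own. It makes a strong simplifying assumption, stated in the code: all three versions have the same number of lines, so only in-place line edits are handled and no alignment step is needed. The function name <code>merge3_aligned</code> is hypothetical.

```python
def merge3_aligned(base, ours, theirs):
    """Merge two edited versions of `base`, line by line.

    Assumption for this sketch: all three versions have the same number
    of lines (edits change lines in place, none are added or removed).
    Returns (merged_lines, conflicts); a conflicting line is None in
    merged_lines and is listed as (line_index, our_text, their_text).
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")
    merged, conflicts = [], []
    for n, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:        # both sides agree (changed identically or not at all)
            merged.append(o)
        elif o == b:      # only their side changed this line
            merged.append(t)
        elif t == b:      # only our side changed this line
            merged.append(o)
        else:             # both sides changed it differently
            merged.append(None)
            conflicts.append((n, o, t))
    return merged, conflicts
```

With the ancestor available, a change made on only one side is accepted automatically; only a line that both sides changed differently (such as "blue" becoming "red" on one side and "green" on the other) is left for a human to resolve.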


==Tracking Groups of Changes==


An issue closely related to merge algorithms is the issue of exactly ''which'' files the version control system feeds into the 3-way merge.  In early client-server implementations, a file was both the largest and smallest entity tracked by the version control system; that is, if several files were modified in support of a single logical change to e.g. a piece of software, those version control systems tracked changes to each individual file and had no way of recording that a particular file's changes were part of a larger entity (in this case, a group of files changed in a single commit).
 
While the algorithms to perform textual merges improved when the switch was made from 2-way to 3-way merges, most modern improvements for end users in the area of merging stem from how the files fed into the merge algorithm are selected.  By tracking ''groups'' of changes, the version control system can make more effective decisions about which files to pick for a merge, allowing the merge algorithms to perform at their best.
 
The ability to track groups of changes most noticeably affects merging improvements in two ways.
 
First, even when changes are tracked in groups, sometimes a manual merge (one requiring user intervention) is required.  However, because the version control system can match patterns across multiple files, if a similar merge is needed later (i.e. the same feature needs to be merged into another branch or the same branch at a later time), the version control system can "remember" the steps the user took to merge in that change and apply them automatically.  This feature has an enormous impact on user productivity in practice.
 
Second, tracking groups of changes allows the version control system to use algorithms that take advantage of that additional knowledge to better select ancestors for a 3-way merge.  The best way to understand this is to engage in a thought experiment.
 
Imagine that you have a common code base, the trunk, from which two branches, A and B, are created.  Branch A has a set of changes (Set 1) applied to it.  Branch B has a different set of changes applied to it (Set 2).  Then, in another commit, Branch B has another set of changes applied to it (Set 3).  At this point, the user wishes to merge the two branches.  Let us also suppose that Set 1 and Set 2 are different sets of changes that ''do not conflict'' with one another.  Finally, let us also suppose that changes in Set 1 and Set 3 ''do'' conflict.
 
In a legacy version control system, when the merge took place and a file that was affected by Sets 1, 2 and 3 was examined, the version control system would pick the trunk as the ancestor of both versions of the file.  This would mean all the changes from Sets 1, 2 and 3 would be examined simultaneously, presenting a complicated picture.
 
In a modern version control system, the file with changes from Set 1 would be easily merged with the files with changes from Set 2.  This new file would then be chosen as the ancestor and merged with Set 3.  This approach vastly simplifies the merging process for both the computer and the user, but it requires that groups of changes be tracked together and applied individually.
 
 
==Repository Data Representation==
 
In most client-server (and local) version control implementations, the version control system tracks individual files.  The implication of this approach is that the entire view of the data within the system is file-centric.  CVS, for example, allows annotations, but only on a per-file basis.  One of the major recent advancements (pioneered by Git) is the ability to track the entire repository as a single monolithic entity. 
 
One of the results of this is that modern systems (usually DVCS) can track a single block of text (like a single function in a computer program) across several files if it is moved.  This allows users to focus on the actual data in the repository, rather than the files that hold the data.  One of the best-known negative effects of the file-centric approach appeared in CVS, in which file renames would appear in the repository as delete-create pairs, resulting in loss of version history.
 
Another result of decoupling data representation from the file system has already been mentioned: it allows groups of changes across multiple files to be tracked and associated.  A related notion is that the change history, which was traditionally represented as a line in most client-server systems, is being represented as a DAG in more modern systems.  The DAG is a more restricted data structure in some ways, but makes many operations much simpler for the end user.  The most notable example arises when merging from a branch multiple times: a DAG can accurately represent what changes have been incorporated, whereas the traditional file-system oriented approach leaves it as an exercise for the user to re-merge the same changes over and over.  This particular example is discussed in depth in Eric Sink's [http://www.ericsink.com/entries/merge_history.html article on merging].
 
 
=Wikis as an Extension to Version Control=
 
As use of the internet became more widespread and ''users'' of the web became some of its most important ''contributors'', concepts like the [http://en.wikipedia.org/wiki/Wiki Wiki] allowed users of a website to edit the website itself.  Often, the website was just a collection of files (pages) being served, which visitors could create, modify and remove.  One of the aspects of [http://c2.com/cgi/wiki?WikiCommunity WikiCommunity], as it is discussed on the original Wiki, is the notion of a wiki being open to change, and thus, wikis tend to allow anonymous modifications to the wiki's pages.  From this simple fact emerged the idea of saving previous versions of the wiki's pages.
 
So, taken loosely, Wikis are essentially web-enabled version control systems in which the users are anyone who visits the wiki.  This participatory culture lends itself to a distributed model of information exchange, and yet most wikis are still using what amounts to a local or client-server version control system behind the scenes.
 
Looking to the future, it will be very interesting to see what can be done with wiki communities if they incorporate the important lessons learned from distributed version control systems.
 
The first (and only) example I have found of a community of users developing wiki-like content using distributed version control systems is [http://orgmode.org/worg/ Worg], a group of [http://www.gnu.org/software/emacs Emacs] [http://orgmode.org/ Org-Mode] users sharing information about the ways they use the Org-Mode software.  Incidentally, Worg is itself a collection of Org-Mode files stored in [http://git-scm.com/ Git], one of the most prominent distributed version control systems.


=Further Reading= 

For quick refreshers on [http://betterexplained.com/articles/a-visual-guide-to-version-control/ Version Control] and [http://betterexplained.com/articles/intro-to-distributed-version-control-illustrated/ Distributed Version Control], BetterExplained has excellent, concise articles.

Linus Torvalds, the creator and maintainer of both the Linux kernel and Git, gave an entertaining and informative [http://www.youtube.com/watch?v=4XpnKHJAok8&feature=player_embedded tech talk] at Google on version control and Git.  A fabulous resource.

An excellent [http://www.ericsink.com/entries/merge_history.html entry] in Eric Sink's blog on how merge history and repository representation are related.

IBM has a well-written [http://www.ibm.com/developerworks/java/library/j-subversion/index.html DeveloperWorks article] describing the ways in which Subversion improved upon CVS, along with some history of version control up until Subversion.

Latest revision as of 22:46, 18 September 2009

Introduction

The defining characteristic of a version control system is its ability to track changes to a document, or set of documents, over many changes, or revisions. For the vast majority of applications, version control systems have focused on tracking plain text files, such as those used for programming source code, HTML documents, and various markup syntax.

The history of the development of version control tools can be roughly categorized into three main phases:

  1. Local Version Control
  2. Client-Server Version Control
  3. Distributed Version Control

This breakdown is focused on the mechanisms that underlie how data is shared and stored in a version control system. It should not be inferred from this structure that other attributes are not important to the history of the development of version control systems. There have been many advances in how:

  • conflicts are recognized and merges are performed
  • groups of logically coherent changes are tracked
  • and how the data is represented within the repository

While these may seem to be three separate topics, we will see in the section on Other Advances that they are in fact very closely related.

Local Version Control

The first version control systems (like SCCS, 1972) focused on local version control; that is, centralized computer systems that were used by many users, often at the same time. In such a system, there were often many users of the system and the repository, or location in which the data was stored, was simply a directory on the server to which the users had access. Because of this use case, these systems focused on two main features:

  • File version tracking
  • File Checkout and Locking

We will address each of these fundamental features in turn.

File Version Tracking

The primary feature of these early systems was the ability to check in files at various points as they were altered, so that the history of changes made to files under version control was kept permanently. Thus, many users could alter many files over time, and the entire set of documents under version control at a given point in time could be recovered, preventing loss of valuable data, as well as providing a record of which users made changes to files over time.

File Checkout and Locking

Because many users in a shared system may desire to edit a file simultaneously, one of the first features developed for version control systems was the ability to check out and lock a file. When a user checks out a file, he or she reserves the right to be the sole editor of that file until it is checked back in to version control. Both SCCS and RCS were designed for use in a shared environment and, as such, allowed files to be checked out and locked in this way. Other users could check out the file, but only to view it. Thus, the files were locked from editing by all but the user that had checked the file out most recently.
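The checkout-and-lock behaviour described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not the actual SCCS or RCS implementation: the class name LockingRepository and its methods are hypothetical, and the "repository" is just an in-memory dictionary.

```python
class LockingRepository:
    """Toy model of checkout-with-lock; hypothetical, not real SCCS/RCS code."""

    def __init__(self):
        self.locks = {}      # file name -> user currently holding the lock
        self.history = {}    # file name -> list of checked-in contents

    def checkout(self, name, user, for_edit=True):
        """Return the latest contents; take the lock when editing is requested."""
        if for_edit:
            holder = self.locks.get(name)
            if holder is not None and holder != user:
                raise PermissionError(f"{name} is locked by {holder}")
            self.locks[name] = user
        versions = self.history.get(name, [])
        return versions[-1] if versions else ""

    def checkin(self, name, user, contents):
        """Record a new version and release the lock."""
        if self.locks.get(name) != user:
            raise PermissionError(f"{user} does not hold the lock on {name}")
        self.history.setdefault(name, []).append(contents)
        del self.locks[name]
```

A second user who calls checkout for editing while the lock is held gets an error, but can still pass for_edit=False to read the file, which mirrors the view-only checkout described above.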

Weaknesses of Local Version Control Systems

These local systems had two primary problems.

First, they required that every user log into a single computer to edit or access the information in the repository. This often posed both performance and security risks, in addition to being cumbersome as networks became more prevalent.

Second, they restricted a particular file to having only one editor at any given time. The next development in version control, embodied by CVS, sought to address both of these problems.

Networked Version Control: Client-Server

As users moved away from logging into systems locally to make their changes to files, the need for a version control system that supported remote operations emerged. The natural way to implement such remote operations was as an extension of the existing system, and by far the most prominent manifestation of this philosophy was present in CVS, the concurrent versions system, which was initially based on RCS and began development in 1984 and matured throughout the mid- to late-1980s. Version 1.0 was released under the GNU GPL, a free software license, in the second half of 1990.

The main feature driving the development of CVS was the need for many users, each on his or her own machine, to be able to perform all the operations present in the original RCS, but over a network connection, and in a way that allowed for concurrent editing to take place. This led to the development of a client-server model of version control systems, in which one central server would contain the canonical version of the repository, and various clients could connect to the central server and perform file check outs and commits. This model is very similar to the original RCS model, but rather than requiring users of the system to log into the version control system locally, it allowed users to access and alter the contents of the repository over the network.

Although CVS supports locking in the same way RCS does, CVS was among the first version control systems to support a non-locking repository. This system allowed for concurrent editing of files under version control, and generated the need to develop new features that addressed the resulting complexities. Chief among the new features introduced to handle these complexities were the notions of branching and merging. This allowed CVS to offer a non-locking repository, which is why there is an emphasis on the "concurrent" portion of CVS's name "concurrent versions system".

Branching and Merging

Inherent in the notion of concurrent editing is the problem of how to reconcile conflicting changes to the same file. A conflict is essentially two or more changes made to the same file that may be difficult to merge into a final file that contains both sets of changes. An example of a conflict would occur if two users both edited a file on line 49, one changing the word "blue" to "red", and the other changing the same word "blue" to "green". The first user would then commit his or her changes back to the repository, and when the second user committed changes, the version control system would detect that the repository had changed since the second user had obtained the file (since the first user had made a change and then committed it). At that point, the version control system would detect a conflict, and prompt the two users to coordinate to resolve the conflict to determine what text should be on line 49.
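The detection step in this example, noticing that the repository has changed since the second user obtained the file, can be illustrated with a minimal sketch. The Repository class below is hypothetical and tracks a single file as a list of lines; it is not how CVS is implemented, only one simple way to model the check.

```python
class Repository:
    """Toy non-locking repository for one file, stored as a list of lines."""

    def __init__(self, lines):
        self.revisions = [list(lines)]   # revision 0 is the initial content

    def checkout(self):
        """Return (revision number, copy of the lines) for local editing."""
        rev = len(self.revisions) - 1
        return rev, list(self.revisions[rev])

    def commit(self, base_rev, lines):
        """Accept the commit only if nobody else committed since base_rev."""
        head = len(self.revisions) - 1
        if base_rev != head:
            raise RuntimeError(f"out of date: based on r{base_rev}, head is r{head}")
        self.revisions.append(list(lines))
        return head + 1
```

If two users check out revision 0 and the first commits "red", the second user's attempt to commit "green" against revision 0 raises an error, at which point the two changes have to be merged or the conflict resolved by hand.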

The solution to this problem lies in allowing users of the version control system to branch a version of the repository and make (possibly many) changes to that branch independent of the changes occurring on the main branch of the repository, known as the trunk. Once a logical set of changes was completed on a branch, that branch would then need to have its changes reconciled with the current state of the repository on the trunk. This process of reconciliation is known as merging.

This feature is critical in a multi-user client environment as it allows work to progress on multiple fronts simultaneously, only requiring that the files be merged once the users of the system are ready to reconcile changes with other users.

Along with the development of mechanisms to allow this sort of concurrent access to the repository over the network, version control systems became more adept in the algorithms they used to detect conflicts and merge changes. This aspect of version control is discussed further in the section on Merge Algorithms.

Client-Server Beyond CVS

Although CVS developed good approaches to solving many of these problems, it had many problems that gained attention when it became the most widely used version control system for open source development. An exhaustive list would be lengthy, but to mention a few might be illustrative.

  • CVS doesn't provide atomic operations, which means that if there were a network failure during a commit, the repository could become corrupted.
  • CVS does not version control directories or symbolic links, which means the repository is really a lossy copy of a developer's environment, sometimes resulting in failure to track changes accurately.
  • CVS doesn't track what files were committed at the same time, so if you make a logical group of changes to several files and want to track the fact that those files were changed together, you can only derive that information from log messages. CVS will not track it for you.
  • CVS cannot track when files are renamed; rather, a rename of a file in CVS looks like the original file was deleted and a new file added, thus losing the file's history.
  • Creating branches and managing the subsequent merges is slow and difficult.

In short, while CVS provided a whole host of new features and advanced the state of the art in version control, it left room for improvement. This resulted in a vast number of client-server version control systems entering the market following CVS. One of the latest and most notable of these is Subversion, which seeks to address all of the issues mentioned above and a whole lot more.

Distributed Version Control

In the late 1990s, a new paradigm of development started to emerge with the development of new, proprietary version control systems. The first of these was Sun WorkShop TeamWare, the lead designer of which went on to found a new company, BitMover, and develop the leading proprietary distributed version control system, BitKeeper. These were the first distributed version control systems.

Distributed version control systems (DVCS) took many of the advances seen in client-server version control systems and moved them into a less centralized architecture. Essentially, the original version control systems were completely centralized, requiring every user to locally log in to the server on which the repository was located. In client-server version control systems, the system was made slightly more distributed, allowing users to connect from across the network to the repository, copy files from the repository to other machines for editing, and then commit them back to the server when edits were complete. Distributed version control continues the trend of decentralization by putting an entire repository, complete with a history of changes and the ability to support remote connections, on each user's machine.

One of the strengths of CVS is that it supports file locking even though the main advance it provides is a non-locking repository. In the same way that CVS supports legacy locking workflows, distributed version control systems support the workflows usually associated with a centralized repository. The main improvement distributed version control systems offer, however, is that they do not require a central server. There are three advantages to this decentralized approach.

First, it encourages creation of branches. Specifically, every time a user "checks out" a file or group of files, a new branch is created on that user's machine. This is in stark contrast to the client-server model in which each time a branch is created, it is carefully planned and coordinated with other users of the system. Essentially, branching and merging in a centralized system is often difficult and slow, whereas in a DVCS it is designed to be natural and fast.

Second, it allows users to commit their changes without disturbing other users of the system. In typical client-server work flows, the notion of a commit is tightly coupled to the notion of a merge with the code that is currently in the repository. Distributed version control decouples these two notions, allowing developers to commit freely, and merge with other users at a different time.

Third, because each user has an entire copy of the repository, all work is done locally, which allows users to continue doing work even when they don't have access to the internet or to a particular server. Further, many useful operations which take a long time in a centralized system take an order of magnitude less time in a distributed system simply because the entire repository is local, and therefore no network latency is involved.

All of these changes are made possible as networks have become faster and the computers on which end users now work are often as powerful as the servers that would host a centralized repository. Thus, as the compute power has moved to the edge of the network, so too has the data in the repositories.

This treatment might make it seem as though DVCS approaches are strictly superior to centralized networked approaches to version control. In general, DVCS is considered to be an advancement of the state of the art, much as networked systems were considered superior to their local counterparts. However, one common use case for version control systems is inside the firewall of a corporation, where work is done on possibly proprietary or even classified data. In such scenarios, there is often a strong desire, at least at the management level, to tightly control storage of the version controlled data. Sometimes, such control is mandated by the customer. In these cases, it may well be undesirable to have developers replicating data at will, and in places that are not well managed and controlled. In these circumstances, it often makes sense to use a more centralized approach simply because the possibility of uncontrolled copying of the data represents a security risk. Products like IBM's ClearCase are designed to perform very centralized and controlled version control in such environments.

Other Advances

In addition to the evolution of the way version control systems allowed users to access, modify and share data in the repository, many advances have been made in the way changes are merged, tracked and stored.

Merge Algorithms

Merge algorithms are a good way to frame many of the problems that arise in a concurrent development environment. It is therefore useful to start by discussing the issue of merge algorithms, even though relatively few advances have been made in recent years on the algorithms themselves.

There are really two kinds of merging algorithm:

  • 2-way merge
  • 3-way merge

2-way merge was developed first, so we will discuss it first.

2-way Merge

A 2-way merge takes two files and compares them for differences, merges differences that do not conflict and identifies differences that conflict for human resolution. diff is a very well-known utility for performing such comparisons, and its algorithm is based upon a procedure for finding the longest common subsequence of text in the files to be compared. Methods for approaching this basic algorithm have gone largely unchanged since 1975, and the algorithm is used in both SCCS and RCS to create patches as a means of storing multiple versions of files under version control.
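To make the longest-common-subsequence idea concrete, here is a minimal sketch of a line-based comparison. It uses the textbook dynamic-programming formulation rather than the more refined algorithm found in the real diff utility, and the function names lcs_table and diff are chosen for illustration only.

```python
def lcs_table(a, b):
    """Dynamic-programming table: t[i][j] = LCS length of a[i:] and b[j:]."""
    t = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                t[i][j] = 1 + t[i + 1][j + 1]
            else:
                t[i][j] = max(t[i + 1][j], t[i][j + 1])
    return t


def diff(a, b):
    """Return (tag, line) pairs: ' ' for common, '-' only in a, '+' only in b."""
    t = lcs_table(a, b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append((" ", a[i]))
            i += 1
            j += 1
        elif t[i + 1][j] >= t[i][j + 1]:
            out.append(("-", a[i]))    # dropping a[i] keeps the LCS intact
            i += 1
        else:
            out.append(("+", b[j]))    # otherwise b[j] is an insertion
            j += 1
    out.extend(("-", line) for line in a[i:])
    out.extend(("+", line) for line in b[j:])
    return out
```

The lines tagged '-' and '+' are exactly the information a patch needs to store in order to turn one version of a file into the other.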

While the 2-way merge approach is a good start to manage concurrent modification to the same resource in a version control system, it became apparent that while the merge algorithm was sound, the version control system actually had much more information than the merge program did, and therefore there were many cases in which a merge was conceptually easy, but proved to be difficult in practice. In particular, given a knowledge of a common ancestor from which two files had both originated, a merge program could make more decisions autonomously, streamlining the merge process. Thus, 3-way merge was born.

3-way Merge

A 3-way merge is an approach to merging two differing files with reference to a third file that is a common ancestor of the two differing files. By comparing each of the differing files first to the ancestor, and then to each other, the merge program can merge conflicts without human intervention more often than is possible in a 2-way merge approach.
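The decision rule at the heart of a 3-way merge can be sketched as follows. To keep the example short it assumes that all three versions have the same number of lines (that is, lines were only edited in place); real merge tools first align the files with a diff. The function name merge3 is hypothetical.

```python
def merge3(base, ours, theirs):
    """Line-by-line 3-way merge; returns (merged lines, conflicting indexes)."""
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")
    merged, conflicts = [], []
    for n, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:            # both sides agree (or neither changed the line)
            merged.append(o)
        elif o == b:          # only their side changed the line
            merged.append(t)
        elif t == b:          # only our side changed the line
            merged.append(o)
        else:                 # both changed it differently: needs a human
            merged.append(b)
            conflicts.append(n)
    return merged, conflicts
```

Without the ancestor, a line that differs between the two files is ambiguous, because the merge program cannot tell which side changed it. With the ancestor, only the last case, in which both sides changed the same line in different ways, has to be handed to the user.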

This advancement alone made branching a much less risky proposition for development teams, and made distributed version control systems (which place special emphasis on a branched development approach) viable for both small and large development teams. While both CVS and Subversion (and other modern client-server version control systems) made use of a 3-way merge approach, the newer distributed version control systems tracked groups of changes effectively, and represented the changes of sets of files as a directed acyclic graph (DAG), making identification of useful common ancestors easier, and thus streamlining the merge process. We will discuss each of these features in turn.

Tracking Groups of Changes

An issue closely related to merge algorithms is exactly which files the version control system feeds into the 3-way merge. In early client-server implementations, a file was both the largest and smallest entity tracked by the version control system; that is, if several files were modified in support of a single logical change to, e.g., a piece of software, those version control systems tracked changes to each individual file and did not track the fact that a particular file's changes were part of a larger entity (in this case, a group of files that were part of a single commit).

While the algorithms to perform textual merges improved when the switch was made from 2-way to 3-way merges, most modern improvements for end users in the area of merge stem from how the files input into the merge algorithm are selected. By tracking groups of changes, the version control system can make more effective decisions about which files to pick for merge, making the merge algorithms perform optimally.
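One way to picture what tracking groups of changes means is to record each commit as a single object that lists every file it touched. The sketch below is purely illustrative: the Changeset and ChangesetLog names are hypothetical and do not correspond to the data model of any particular version control system.

```python
from dataclasses import dataclass, field


@dataclass
class Changeset:
    """One logical change: every file edit committed together, under one id."""
    id: int
    author: str
    message: str
    files: dict = field(default_factory=dict)   # path -> new contents


class ChangesetLog:
    def __init__(self):
        self.changesets = []

    def commit(self, author, message, files):
        """Record all the given file edits as a single changeset."""
        cs = Changeset(len(self.changesets) + 1, author, message, dict(files))
        self.changesets.append(cs)
        return cs

    def touching(self, path):
        """All changesets that modified the given path."""
        return [cs for cs in self.changesets if path in cs.files]

    def changed_with(self, path):
        """Other paths that were ever committed together with the given path."""
        related = set()
        for cs in self.touching(path):
            related.update(cs.files)
        related.discard(path)
        return related
```

In a purely file-centric system the question answered by changed_with can only be reconstructed from log messages; here it falls out of the stored structure.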

The ability to track groups of changes most noticeably affects merging improvements in two ways.

First, even when changes are tracked in groups, sometimes a manual merge (one requiring user intervention) is required. However, because the version control system can match patterns across multiple files, if a similar merge is needed later (i.e. the same feature needs to be merged into another branch or the same branch at a later time), the version control system can "remember" the steps the user took to merge in that change and apply them automatically. This feature has an enormous impact on user productivity in practice.

Second, tracking groups of changes allows the version control system to use algorithms that take advantage of that additional knowledge to better select ancestors for a 3-way merge. The best way to understand this is to engage in a thought experiment.

Imagine that you have a common code base, the trunk, from which two branches, A and B, are created. Branch A has a set of changes (Set 1) applied to it. Branch B has a different set of changes applied to it (Set 2). Then, in another commit, Branch B has another set of changes applied to it (Set 3). At this point, the user wishes to merge the two branches. Let us also suppose that Set 1 and Set 2 are different sets of changes that do not conflict with one another. Finally, let us also suppose that changes in Set 1 and Set 3 do conflict.

In a legacy version control system, when the merge took place and a file touched by Sets 1, 2 and 3 was examined, the version control system would pick the trunk as the ancestor of both versions of the file. This would mean all the changes from Sets 1, 2 and 3 would be examined simultaneously, presenting a complicated picture.

In a modern version control system, the file with changes from Set 1 would be easily merged with the files with changes from Set 2. This new file would then be chosen as the ancestor and merged with Set 3. This approach vastly simplifies the merging process for both the computer and the user, but it requires that groups of changes be tracked together and applied individually.


Repository Data Representation

In most client-server (and local) version control implementations, the version control system tracks individual files. The implication of this approach is that the entire view of the data within the system is file-centric. CVS, for example, allows annotations, but only on a per-file basis. One of the major recent advancements (pioneered by Git) is the ability to track the entire repository as a single monolithic entity.

One of the results of this is that modern systems (usually DVCS) can track single blocks of text (like a single function in a computer program) across several files if they are moved. This allows users to focus on the actual data in the repository, rather than the files that hold the data. One of the most well-known negative effects of the file-centric approach appeared in CVS, in which file renames would appear in the repository as delete-create pairs, resulting in loss of version history.

Another result of decoupling data representation from the file system has already been mentioned: it allows groups of changes across multiple files to be tracked and associated. A related notion is that the change history, which was traditionally represented as a line in most client-server systems, is being represented as a DAG in more modern systems. The DAG is a more restricted data structure in some ways, but makes many operations much simpler for the end user. The most notable example arises when merging from a branch multiple times: a DAG can accurately represent what changes have been incorporated, whereas the traditional file-system oriented approach leaves it as an exercise for the user to re-merge the same changes over and over. This particular example is discussed in depth in Eric Sink's article on merging.
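The repeated-merge example can be illustrated with a small sketch in which history is stored as a mapping from each commit to its parents. The commit names and the helper functions ancestors and unmerged are hypothetical; the point is only that once merges are recorded as commits with two parents, the set of changes still to be merged can be computed rather than remembered by the user.

```python
def ancestors(parents, commit):
    """Every commit reachable from `commit` via parent links, itself included."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen


def unmerged(parents, target, source):
    """Commits on `source` that `target` has not yet incorporated."""
    return ancestors(parents, source) - ancestors(parents, target)


# A hypothetical history: a branch is merged into the trunk once (m1),
# and then receives one more commit (b3) afterwards.
parents = {
    "root": [],
    "t1": ["root"],        # work on the trunk
    "b1": ["root"],        # work on the branch
    "b2": ["b1"],
    "m1": ["t1", "b2"],    # first merge of the branch into the trunk
    "b3": ["b2"],          # more branch work after the merge
}
```

Before the first merge, unmerged(parents, "t1", "b2") contains b1 and b2. After it, unmerged(parents, "m1", "b3") contains only b3, so a second merge from the same branch does not have to revisit changes that were already incorporated.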


Wikis as an Extension to Version Control

As use of the internet became more widespread and users of the web became some of its most important contributors, concepts like the Wiki allowed users of a website to edit the website itself. Often, the website was just a collection of files (pages) being served, which visitors could create, modify and remove. One of the aspects of WikiCommunity, as it is discussed on the original Wiki, is the notion of a wiki being open to change, and thus, wikis tend to allow anonymous modifications to the wiki's pages. From this simple fact emerged the idea of saving previous versions of the wiki's pages.

So, taken loosely, Wikis are essentially web-enabled version control systems in which the users are anyone who visits the wiki. This participatory culture lends itself to a distributed model of information exchange, and yet most wikis are still using what amounts to a local or client-server version control system behind the scenes.

Looking to the future, it will be very interesting to see what can be done with wiki communities if they incorporate the important lessons learned from distributed version control systems.

The first (and only) example I have found of a community of users developing wiki-like content using distributed version control systems is Worg, a group of Emacs Org-Mode users sharing information about the ways they use the Org-Mode software. Incidentally, Worg is itself a collection of Org-Mode files stored in Git, one of the most prominent distributed version control systems.

Further Reading

For quick refreshers on Version Control and Distributed Version Control, BetterExplained has excellent, concise articles.

Linus Torvalds, the creator and maintainer of both the Linux kernel and Git, gave an entertaining and informative tech talk at Google on version control and Git. A fabulous resource.

An excellent entry in Eric Sink's blog on how merge history and repository representation are related.

IBM has a well-written DeveloperWorks article describing the ways in which Subversion improved upon CVS, along with some history of version control up until Subversion.