CSC/ECE 517 Fall 2009/wiki1a 5 rp

From Expertiza_Wiki
Revision as of 22:46, 18 September 2009 by 7r3ad

Introduction

The defining characteristic of a version control system is its ability to track changes to a document, or set of documents, over many changes, or revisions. For the vast majority of applications, version control systems have focused on tracking plain text files, such as those used for programming source code, HTML documents, and various markup syntax.

The history of the development of version control tools can be roughly categorized into three main phases:

  1. Local Version Control
  2. Client-Server Version Control
  3. Distributed Version Control

This breakdown focuses on the mechanisms that underlie how data is shared and stored in a version control system. It should not be inferred from this structure that other attributes are unimportant to the history of the development of version control systems. There have been many advances in how:

  • conflicts are recognized and merges are performed
  • groups of logically coherent changes are tracked
  • data is represented within the repository

While these may seem to be three separate topics, we will see in the section on #Other Advances that they are in fact very closely related.

Local Version Control

The first version control systems (like SCCS, 1972) focused on local version control; that is, version control on centralized computer systems that were shared by many users, often at the same time. In such a system, the repository, or location in which the data was stored, was simply a directory on the server to which the users had access. Because of this use case, these systems focused on two main features:

  • File version tracking
  • File Checkout and Locking

We will address each of these fundamental features in turn.

File Version Tracking

The primary feature of these early systems was the ability to check in files at various points as they were altered, so that the history of changes made to files under version control was kept permanently. Thus, many users could alter many files over time, and the entire set of documents under version control at a given point in time could be recovered, preventing loss of valuable data, as well as providing a record of which users made changes to files over time.

File Checkout and Locking

Because many users in a shared system may desire to edit a file simultaneously, one of the first features developed for version control systems was the ability to check out and lock a file. When a user checks out a file, he or she reserves the right to be the sole editor of that file until it is checked back in to version control. Both SCCS and RCS were designed for use in a shared environment and, as such, allowed files to be checked out and locked in this way. Other users could check out the file, but only to view it. Thus, the files were locked from editing by all but the user that had checked the file out most recently.
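The checkout-and-lock discipline described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how SCCS or RCS are implemented; the class LockingRepository and all of its method names are hypothetical.

```python
class LockError(Exception):
    """Raised when a user tries to edit a file someone else has locked."""


class LockingRepository:
    """Toy lock-based store: one editor per file at a time (hypothetical API)."""

    def __init__(self):
        self.revisions = {}  # filename -> list of contents, oldest first
        self.locks = {}      # filename -> user currently holding the lock

    def add(self, name, text):
        """Put a new file under version control as revision 1."""
        self.revisions[name] = [text]

    def checkout(self, name, user, for_edit=False):
        """Return the latest contents; take the lock only when editing."""
        if for_edit:
            holder = self.locks.get(name)
            if holder is not None and holder != user:
                raise LockError(f"{name} is locked by {holder}")
            self.locks[name] = user
        return self.revisions[name][-1]

    def checkin(self, name, user, text):
        """Store a new revision and release the lock."""
        if self.locks.get(name) != user:
            raise LockError(f"{user} does not hold the lock on {name}")
        self.revisions[name].append(text)
        del self.locks[name]
        return len(self.revisions[name])  # new revision number, counting from 1
```

A read-only checkout always succeeds, while a second checkout for editing fails until the first editor checks the file back in. Every earlier revision stays in the list, which is the file version tracking described in the previous section.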

Weaknesses of Local Version Control Systems

These local systems had two primary problems.

First, they required that every user log into a single computer to edit or access the information in the repository. This often posed both performance and security risks, in addition to being cumbersome as networks became more prevalent.

Second, they restricted a particular file to having only one editor at any given time. The next development in version control, embodied by CVS, sought to address both of these problems.

Networked Version Control: Client-Server

As users moved away from logging into systems locally to make their changes to files, the need for a version control system that supported remote operations emerged. The natural way to implement such remote operations was as an extension of the existing systems, and by far the most prominent manifestation of this philosophy was CVS, the Concurrent Versions System. CVS was initially based on RCS; its development began in 1984 and it matured throughout the mid- to late 1980s. Version 1.0 was released under the GNU GPL, a free software license, in the second half of 1990.

The main feature driving the development of CVS was the need for many users, each on his or her own machine, to be able to perform all the operations present in the original RCS, but over a network connection, and in a way that allowed for concurrent editing to take place. This led to the development of a client-server model of version control systems, in which one central server would contain the canonical version of the repository, and various clients could connect to the central server and perform file check outs and commits. This model is very similar to the original RCS model, but rather than requiring users of the system to log into the version control system locally, it allowed users to access and alter the contents of the repository over the network.

Although CVS supports locking in the same way RCS does, CVS was among the first version control systems to support a non-locking repository, which is why there is an emphasis on the "concurrent" portion of CVS's name, "concurrent versions system". A non-locking repository allowed for concurrent editing of files under version control, and generated the need to develop new features that addressed the resulting complexities. Chief among these new features were the notions of branching and merging.

Branching and Merging

Inherent in the notion of concurrent editing is the problem of how to reconcile conflicting changes to the same file. A conflict is essentially two or more changes made to the same file that may be difficult to merge into a final file that contains both sets of changes. An example of a conflict would occur if two users both edited a file on line 49, one changing the word "blue" to "red", and the other changing the same word "blue" to "green". The first user would then commit his or her changes back to the repository, and when the second user committed changes, the version control system would detect that the repository had changed since the second user had obtained the file (since the first user had made a change and then committed it). At that point, the version control system would detect a conflict, and prompt the two users to coordinate to resolve the conflict and determine what text should be on line 49.
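The detection step in this example can be sketched as follows. This is a minimal illustration, assuming the repository simply remembers which revision each user started from; the names NonLockingRepo and StaleCommit are hypothetical, and a real system such as CVS would attempt a merge before reporting a conflict rather than rejecting the commit outright.

```python
class StaleCommit(Exception):
    """Raised when the repository changed after the committer's checkout."""


class NonLockingRepo:
    """Toy single-file repository without locks (hypothetical API)."""

    def __init__(self, text):
        self.history = [text]  # revision 0 is the initial contents

    def checkout(self):
        """Return (revision number, contents) of the latest revision."""
        return len(self.history) - 1, self.history[-1]

    def commit(self, base_rev, text):
        """Accept the commit only if nobody has committed since base_rev."""
        head = len(self.history) - 1
        if base_rev != head:
            raise StaleCommit(
                f"based on revision {base_rev}, but repository is at {head}"
            )
        self.history.append(text)
        return head + 1
```

With this sketch, both users check out revision 0 containing "blue". The first commit ("red") succeeds and becomes revision 1; the second commit ("green"), still based on revision 0, raises StaleCommit, which is the point at which the users would be asked to reconcile their changes.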

The solution to this problem lies in allowing users of the version control system to branch a version of the repository and make (possibly many) changes to that branch independent of the changes occurring on the main branch of the repository, known as the trunk. Once a logical set of changes was completed on a branch, that branch would then need to have its changes reconciled with the current state of the repository on the trunk. This process of reconciliation is known as merging.

This feature is critical in a multi-user client environment as it allows work to progress on multiple fronts simultaneously, only requiring that the files be merged once the users of the system are ready to reconcile changes with other users.

Along with the development of mechanisms to allow this sort of concurrent access to the repository over the network, version control systems became more adept in the algorithms they used to detect conflicts and merge changes. This aspect of version control is discussed further in #Merge Algorithms.

Client-Server Beyond CVS

Although CVS developed good approaches to solving many of these problems, it had many shortcomings that gained attention when it became the most widely used version control system for open source development. An exhaustive list would be lengthy, but a few examples are illustrative.

  • CVS doesn't provide atomic operations, which means that if there were a network failure during a commit, the repository could become corrupted.
  • CVS does not version control directories or symbolic links, which means the repository is really a lossy copy of a developer's environment, sometimes resulting in failure to track changes accurately.
  • CVS doesn't track which files were committed at the same time, so if you make a logical group of changes to several files and want to track the fact that those files were changed together, you can only derive that information from log messages. CVS will not track it for you.
  • CVS cannot track when files are renamed; rather, a rename of a file in CVS looks like the original file was deleted and a new file added, thus losing the file's history.
  • Creating branches and managing the subsequent merges is slow and difficult.

In short, while CVS provided a whole host of new features and advanced the state of the art in version control, it left room for improvement. This resulted in a vast number of client-server version control systems entering the market following CVS. One of the latest and most notable of these is Subversion, which seeks to address all of the issues mentioned above and a whole lot more.

Distributed Version Control

In the late 1990s, a new paradigm of development started to emerge with the development of new, proprietary version control systems. The first of these was Sun WorkShop TeamWare, the lead designer of which went on to found a new company, BitMover, and develop the leading proprietary distributed version control system, BitKeeper. These were the first distributed version control systems.

Distributed version control systems (DVCS) took many of the advances seen in client-server version control systems and moved them into a less centralized architecture. Essentially, the original version control systems were completely centralized, requiring every user to locally log in to the server on which the repository was located. In client-server version control systems, the system was made slightly more distributed, allowing users to connect from across the network to the repository, copy files from the repository to other machines for editing, and then commit them back to the server when edits were complete. Distributed version control continues the trend of decentralization by putting an entire repository, complete with a history of changes and the ability to support remote connections, on each user's machine.

One of the strengths of CVS is that it supports file locking even though the main advance it provides is a non-locking repository. In the same way that CVS supports legacy locking workflows, distributed version control systems support the workflows usually associated with a centralized repository. The main improvement distributed version control systems offer, however, is that they do not require a central server. There are three advantages to this decentralized approach.

First, it encourages the creation of branches. Specifically, every time a user "checks out" a file or group of files, a new branch is created on that user's machine. This is in stark contrast to the client-server model, in which each time a branch is created, it is carefully planned and coordinated with other users of the system. Essentially, branching and merging in a centralized system is often difficult and slow, whereas in a DVCS it is designed to be natural and fast.

Second, it allows users to commit their changes without disturbing other users of the system. In typical client-server workflows, the notion of a commit is tightly coupled to the notion of a merge with the code that is currently in the repository. Distributed version control decouples these two notions, allowing developers to commit freely, and merge with other users at a different time.

Third, because each user has an entire copy of the repository, all work is done locally, which allows users to continue doing work even when they don't have access to the internet or to a particular server. Further, many useful operations which take a long time in a centralized system take an order of magnitude less time in a distributed system simply because the entire repository is local, and therefore no network latency is involved.

All of these changes have been made possible as networks have become faster and the computers on which end users now work are often as powerful as the servers that would host a centralized repository. Thus, as the compute power has moved to the edge of the network, so too has the data in the repositories.

This treatment might make it seem as though DVCS approaches are strictly superior to centralized networked approaches to version control. In general, DVCS is considered to be an advancement of the state of the art, much as networked systems were considered superior to their local counterparts. However, one common use case for version control systems is inside the firewall of a corporation, where work is done on possibly proprietary or even classified data. In such scenarios, there is often a strong desire, at least at the management level, to tightly control storage of the version controlled data. Sometimes, such control is mandated by the customer. In these cases, it may well be undesirable to have developers replicating data at will, and in places that are not well managed and controlled. In these circumstances, it often makes sense to use a more centralized approach simply because the possibility of uncontrolled copying of the data represents a security risk. Products like IBM's ClearCase are designed to perform very centralized and controlled version control in such environments.

Other Advances

In addition to the evolution of the way version control systems allowed users to access, modify and share data in the repository, many advances have been made in the way changes are merged, tracked and stored.

Merge Algorithms

Merge algorithms are a good way to frame many of the problems that arise in a concurrent development environment. It is therefore useful to start by discussing the issue of merge algorithms, even though relatively few advances have been made in recent years on the algorithms themselves.

There are really two kinds of merging algorithm:

  • 2-way merge
  • 3-way merge

2-way merge was developed first, so we will discuss it first.

2-way Merge

A 2-way merge takes two files and compares them for differences, merges differences that do not conflict and identifies differences that conflict for human resolution. diff is a very well-known utility for performing such comparisons, and its algorithm is based upon a procedure for finding the longest common subsequence of text in the files to be compared. Methods for approaching this basic algorithm have gone largely unchanged since 1975, and the algorithm is used in both SCCS and RCS to create patches as a means of storing multiple versions of files under version control.
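The longest-common-subsequence idea behind this kind of comparison can be sketched as follows. This is a textbook dynamic-programming version written for clarity, not the optimized algorithm the diff utility actually uses; the function names lcs_table and line_diff are hypothetical.

```python
def lcs_table(a, b):
    """table[i][j] = length of the longest common subsequence of a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def line_diff(a, b):
    """Return (tag, line) pairs: ' ' for common, '-' only in a, '+' only in b."""
    table = lcs_table(a, b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append((" ", a[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            out.append(("-", a[i]))  # dropping a[i] loses nothing from the LCS
            i += 1
        else:
            out.append(("+", b[j]))  # otherwise b[j] is the inserted line
            j += 1
    out.extend(("-", line) for line in a[i:])
    out.extend(("+", line) for line in b[j:])
    return out
```

The lines tagged ' ' form a longest common subsequence of the two files; the '-' and '+' lines are what a patch would need to record to turn the first file into the second.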

While the 2-way merge approach is a good start to manage concurrent modification to the same resource in a version control system, it became apparent that while the merge algorithm was sound, the version control system actually had much more information than the merge program did, and therefore there were many cases in which a merge was conceptually easy, but proved to be difficult in practice. In particular, given a knowledge of a common ancestor from which two files had both originated, a merge program could make more decisions autonomously, streamlining the merge process. Thus, 3-way merge was born.

3-way Merge

A 3-way merge is an approach to merging two differing files with reference to a third file that is a common ancestor of the two differing files. By comparing each of the differing files first to the ancestor, and then to each other, the merge program can tell which side changed a given region, and can therefore merge differences without human intervention more often than is possible in a 2-way merge approach.
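The decision rule at the heart of a 3-way merge can be sketched as follows. To keep the example short, it assumes that all three versions have the same number of lines, so that only in-place edits occur; real merge tools first align the files with a diff. The function name merge3 is hypothetical.

```python
def merge3(base, ours, theirs):
    """Merge two edited copies of `base`, line by line.

    Returns (merged_lines, conflict_indices). Assumption for this sketch:
    all three versions have the same number of lines.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("sketch only handles edits that keep the line count")
    merged, conflicts = [], []
    for index, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:      # both sides agree (same change, or no change at all)
            merged.append(o)
        elif o == b:    # only "theirs" changed this line
            merged.append(t)
        elif t == b:    # only "ours" changed this line
            merged.append(o)
        else:           # both changed it differently: a human must decide
            merged.append(b)
            conflicts.append(index)
    return merged, conflicts
```

If one user changes "blue" to "red" on one line and another user changes a different line, every line is resolved automatically. If both users change the same "blue", to "red" and to "green" respectively, the line is reported as a conflict. A 2-way comparison of the two edited files alone would see only that the lines differ, without knowing which side had made the change.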

This advancement alone made branching a much less risky proposition for development teams, and made distributed version control systems (which place special emphasis on a branched development approach) viable in both small and large development teams. While both CVS and Subversion (and other modern client-server version control systems) made use of a 3-way merge approach, the newer distributed version control systems tracked groups of changes effectively, and represented the changes of sets of files as a directed acyclic graph (DAG), making identification of useful common ancestors easier, and thus streamlining the merge process. We will discuss each of these features in turn.

Tracking Groups of Changes

An issue closely related to merge algorithms is the issue of exactly which files the version control system inputs into the 3-way merge. In early client-server implementations, a file was both the largest and smallest entity tracked by the version control system; that is, if several files were modified in support of a single logical change to, e.g., a piece of software, those version control systems tracked changes to each individual file and had no way of tracking the fact that a particular file's changes were part of a larger entity (in this case, a group of files that were part of a single commit).

While the algorithms to perform textual merges improved when the switch was made from 2-way to 3-way merges, most modern improvements for end users in the area of merging stem from how the files input into the merge algorithm are selected. By tracking groups of changes, the version control system can make more effective decisions about which files to pick for a merge, allowing the merge algorithms to resolve more changes automatically.

The ability to track groups of changes most noticeably affects merging improvements in two ways.

First, even when changes are tracked in groups, sometimes a manual merge (one requiring user intervention) is required. However, because the version control system can match patterns across multiple files, if a similar merge is needed later (i.e. the same feature needs to be merged into another branch or the same branch at a later time), the version control system can "remember" the steps the user took to merge in that change and apply them automatically. This feature has an enormous impact on user productivity in practice.

Second, tracking groups of changes allows the version control system to use algorithms that take advantage of that additional knowledge to better select ancestors for a 3-way merge. The best way to understand this is to engage in a thought experiment.

Imagine that you have a common code base, the trunk, from which two branches, A and B, are created. Branch A has a set of changes (Set 1) applied to it. Branch B has a different set of changes applied to it (Set 2). Then, in another commit, Branch B has another set of changes applied to it (Set 3). At this point, the user wishes to merge the two branches. Let us also suppose that Set 1 and Set 2 are different sets of changes that do not conflict with one another. Finally, let us also suppose that changes in Set 1 and Set 3 do conflict.

In a legacy version control system, when the merge took place and a file that was part of Sets 1, 2 and 3 was examined, the version control system would pick the trunk as the ancestor of both versions of the file. This would mean all the changes from Sets 1, 2 and 3 would be examined simultaneously, presenting a complicated picture.

In a modern version control system, the file with changes from Set 1 would be easily merged with the file with changes from Set 2. This new file would then be chosen as the ancestor and merged with Set 3. This approach vastly simplifies the merging process for both the computer and the user, but it requires that groups of changes be tracked together and applied individually.


Repository Data Representation

In most client-server (and local) version control implementations, the version control system tracks individual files. The implication of this approach is that the entire view of the data within the system is file-centric. CVS, for example, allows annotations, but only on a per-file basis. One of the major recent advancements (pioneered by Git) is the ability to track the entire repository as a single monolithic entity.

One of the results of this is that modern systems (usually DVCS) can track single blocks of text (like a single function in a computer program) across several files if they are moved. This allows users to focus on the actual data in the repository, rather than the files that hold the data. One of the most well-known negative effects of the file-centric approach appeared in CVS, in which file renames would appear in the repository as delete-create pairs, resulting in loss of version history.

Another result of decoupling data representation from the file system has already been mentioned: it allows groups of changes across multiple files to be tracked and associated. A related notion is that the change history, which was traditionally represented as a line in most client-server systems, is being represented as a DAG in more modern systems. The DAG is a more restricted data structure in some ways, but makes many operations much simpler for the end user. The most notable example arises when merging from a branch multiple times: a DAG can accurately represent which changes have been incorporated, whereas the traditional file-system oriented approach leaves it as an exercise for the user to re-merge the same changes over and over. This particular example is discussed in depth in Eric Sink's article on merging.
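The repeated-merge example can be sketched with a small commit graph. This is a minimal illustration, assuming history is stored as a mapping from each commit to its parents; the commit names and the functions ancestors and merge_bases are hypothetical, and real systems use faster graph searches.

```python
def ancestors(parents, commit):
    """All commits reachable from `commit` via parent links, including itself."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def merge_bases(parents, a, b):
    """Common ancestors of a and b that are not ancestors of another common one."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {
        c for c in common
        if not any(c != d and c in ancestors(parents, d) for d in common)
    }


# Hypothetical history: the trunk merges the branch once (m1), then both move on.
history = {
    "t0": [],            # shared starting point
    "t1": ["t0"],        # trunk work
    "b1": ["t0"],        # branch work
    "m1": ["t1", "b1"],  # merge commit: two parents record what was merged
    "t2": ["m1"],        # more trunk work
    "b2": ["b1"],        # more branch work
}
```

Before the first merge, merge_bases(history, "t1", "b1") is {"t0"}. For the second merge, merge_bases(history, "t2", "b2") is {"b1"}: because the merge commit m1 records b1 as a parent, the changes up to b1 are known to be incorporated already, and only the work in b2 remains to be merged.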


Wikis as an Extension to Version Control

As use of the internet became more widespread and users of the web became some of its most important contributors, concepts like the Wiki allowed users of a website to edit the website itself. Often, the website was just a collection of files (pages) being served, which visitors could create, modify and remove. One of the aspects of WikiCommunity, as it is discussed on the original Wiki, is the notion of a Wiki being open to change, and thus, Wikis tend to allow anonymous modifications to the wiki's pages. From this simple fact emerged the idea of saving previous versions of the wiki's pages.

So, taken loosely, Wikis are essentially web-enabled version control systems in which the users are anyone who visits the wiki. This participatory culture lends itself to a distributed model of information exchange, and yet most wikis are still using what amounts to a local or client-server version control system behind the scenes.

Looking to the future, it will be very interesting to see what can be done with wiki communities if they incorporate the important lessons learned from distributed version control systems.

The first (and only) example I have found of a community of users developing wiki-like content using distributed version control systems is Worg, a group of Emacs Org-Mode users sharing information about the ways they use the Org-Mode software. Incidentally, Worg is itself a collection of Org-Mode files stored in Git, one of the most prominent distributed version control systems.

Further Reading

For quick refreshers on Version Control and Distributed Version Control, BetterExplained has excellent, concise articles.

Linus Torvalds, the creator and maintainer of both the Linux kernel and Git, gave an entertaining and informative tech talk at Google on version control and Git. A fabulous resource.

An excellent entry in Eric Sink's blog on how merge history and repository representation are related.

IBM has a well-written DeveloperWorks article describing the ways in which Subversion improved upon CVS, along with some history of version control up until Subversion.