Three Evolutionary Stages of Version Control

Saturday, Jun 27, 2009, 17:28

Recently, an edit to the VersionControlSystem article on the WLUG wiki drew me in to expand on it. My own work on that edit ended up taking up 4 days (in spurts), in the course of which I would add relevant points, only to notice that the additions revealed some structure trying to emerge, which would then cause me to go back to rearrange the text, which in turn would remind me of more relevant points to add, several times over. It was much more work than I expected, and I was quite tired and wanting to be done by the end.

However, the effort ended up crystallising my conceptualisation of version control systems quite thoroughly. I have long been aware of all the individual points I wrote about, but it was only during this work that I came to understand their relations systematically. In order to make the effort expended more worthwhile, and because I do not know of any other article summarising the evolution in this manner (although of course I may well be ignorant), I thought I should give the article additional exposure by also posting it on my weblog.

Remember that getting from each step to the next in this sequence took a long time, both because it took time to realise that there was a problem in the first place, and because the right solution to it was not obvious in advance – trivially obvious as it may all seem when you see it laid out like this.

1+1: One Repository, One Working Copy

The design of the earliest systems revolved around versioning a single working copy, directly edited by all users. To prevent attempts at simultaneous modification of a single file, editing was not allowed without checking files out, which only one user at a time could do for any given file.
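To make the locking discipline concrete, here is a minimal sketch in Python. It is a toy model and not the code of any real system; the class name LockingRepository and its methods are invented for illustration.

```python
class LockingRepository:
    """Toy model of lock-based checkout: one shared copy, one editor per file."""

    def __init__(self):
        # Maps a file name to the user who currently has it checked out.
        self.locks = {}

    def check_out(self, user, path):
        holder = self.locks.get(path)
        if holder is not None and holder != user:
            # Someone else is editing this file: the request is refused.
            raise PermissionError(f"{path} is checked out by {holder}")
        self.locks[path] = user

    def check_in(self, user, path):
        if self.locks.get(path) != user:
            raise PermissionError(f"{user} has not checked out {path}")
        # Releasing the lock lets the next user check the file out.
        del self.locks[path]
```

In this model a second user asking for a file that is already checked out simply gets an error until the first user checks it back in, which is exactly the waiting described below.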

Having to give each user access to the same machine and file system in order to work on code was natural at the time these systems were designed, in the mainframe era, but today would obviously be a problem. Also, the requirement to check files out was a cause of friction even at the time, since everyone had to wait on one another – not to mention that someone might forget to check a file back in before leaving on vacation.

1+n: One Repository, Many Working Copies

The next evolutionary step was to decouple the repository from the working copy, so that there may then be many working copies. The exemplar in this class of systems, known as centralised VCSs, is CVS. It lifts the obvious restrictions of earlier systems with a design in which the repository is mediated by a server. Multiple users can collaborate by each checking out a private working copy of the project.

Note that in CVS, “checking out” no longer implies locking. (In other centralised VCSs, it may; eg. Visual SourceSafe. In some, such as Perforce, it is optional.) Checking in changes is simply blocked if someone else has already checked in other changes in the meantime. Before the latecomer is allowed to check in their own changes, they have to update their working copy with the upstream changes, resolving any conflicts manually.
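The update-before-check-in rule can be sketched roughly as follows. This is a simplified model under stated assumptions, not how CVS is implemented: it uses a single repository-wide revision counter (CVS tracks revisions per file), it compares whole files rather than lines, and the names CentralRepository and WorkingCopy are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class WorkingCopy:
    base_revision: int  # repository revision this copy was last synced to
    base_files: dict    # file contents as of that revision
    files: dict         # current, possibly locally edited, contents


class CentralRepository:
    def __init__(self):
        self.revision = 0
        self.files = {}

    def checkout(self):
        # No locking: every user simply gets a private copy.
        return WorkingCopy(self.revision, dict(self.files), dict(self.files))

    def commit(self, wc):
        if wc.base_revision != self.revision:
            # Someone else checked in first: the latecomer must update.
            raise RuntimeError("out of date: update before checking in")
        self.files = dict(wc.files)
        self.revision += 1
        wc.base_revision = self.revision
        wc.base_files = dict(wc.files)

    def update(self, wc):
        """Pull upstream changes into wc; return paths needing manual resolution."""
        conflicts = []
        for path in set(self.files) | set(wc.files) | set(wc.base_files):
            base = wc.base_files.get(path)
            mine = wc.files.get(path)
            theirs = self.files.get(path)
            if mine == base:
                # Untouched locally: take whatever upstream has.
                if theirs is None:
                    wc.files.pop(path, None)
                else:
                    wc.files[path] = theirs
            elif theirs != base and theirs != mine:
                # Both sides changed the file differently.
                conflicts.append(path)
        wc.base_revision = self.revision
        wc.base_files = dict(self.files)
        return sorted(conflicts)
```

If two users check out, and the first commits, the second user's commit raises an error until they call update, sort out whatever paths it reports, and try again.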

This works reasonably well. CVS ended up as the de facto standard for a decade.

However, its single-repository nature, subsequently adopted by most following major systems, perpetuates problems harking back to the earlier model – and adds new ones:

n+n: Many Working Copies, Paired With Equally Many Repositories

The solution to all this was to give each collaborator not only a separate working copy, but also a separate repository. This class of systems, whose pioneering solid implementation was BitKeeper, is known as distributed version control systems. The technical basis that allows this is algorithmic merging: 3-way merging (in the simplest case) allows combining non-overlapping changes automatically, and merge point tracking allows repeatedly merging branches without unnecessary conflicts.
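For readers who have not seen 3-way merging spelled out, here is a minimal sketch of the idea in Python, built on the standard difflib module. It is deliberately naive and is not the algorithm of any particular VCS: each side is diffed against the common base, identical edits are collapsed, and anything that overlaps is reported as a conflict. The names changes, merge3 and MergeConflict are invented for this example.

```python
import difflib


class MergeConflict(Exception):
    pass


def changes(base, side):
    """Edits turning base into side, as (start, end, new_lines) on base indices."""
    matcher = difflib.SequenceMatcher(a=base, b=side, autojunk=False)
    return [(i1, i2, tuple(side[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def merge3(base, ours, theirs):
    """Combine the changes of both sides relative to their common base."""
    # A set drops edits that both sides made identically.
    edits = sorted(set(changes(base, ours) + changes(base, theirs)))
    merged, pos = [], 0
    prev_start = prev_end = None
    for start, end, new in edits:
        # Edits from one side never overlap each other, so any overlap
        # here means both sides touched the same region of the base.
        if prev_end is not None and (start < prev_end or start == prev_start):
            raise MergeConflict(f"overlapping changes near base line {start + 1}")
        merged.extend(base[pos:start])  # unchanged lines before the edit
        merged.extend(new)              # the replacement lines
        pos = end
        prev_start, prev_end = start, end
    merged.extend(base[pos:])
    return merged
```

Given a base of four lines where one side edits the second line and the other edits the last, merge3 returns a file containing both edits; if both sides rewrite the same line differently, it raises MergeConflict and a human has to decide.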

Since each collaborator has their own repository and can make commits, the effect is that everyone has their own private branch, with full versioning for local changes, and these branches can be published at the discretion of their author and can be merged by others easily. Actually, each collaborator often has several local branches – since merging is easy and branches never “need” be published, it is painless to create short-lived branches for experiments or tests, to use them as a general workflow aspect (eg. start a new branch for every separate bug fix), or for any other purpose, whether intended for public consumption or not.
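What makes repeated merging between such branches painless is that the system remembers where they were last merged, so only changes made since then need to be combined. One way to picture this, as a sketch and not the implementation of any specific system, is a commit graph in which each commit lists its parents; the base for the next merge is then a common ancestor that is not itself an ancestor of another common ancestor. The function names and the dictionary representation are assumptions made for this example.

```python
def ancestors(parents, commit):
    """All commits reachable from commit via parent links, including itself."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def merge_bases(parents, a, b):
    """Common ancestors of a and b that are not ancestors of another one."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {c for c in common
            if not any(c != d and c in ancestors(parents, d) for d in common)}


# Hypothetical history: two branches fork at "root"; branch A merges
# branch B's commit "b1" in merge commit "m"; both branches then continue.
history = {
    "root": [],
    "a1": ["root"],
    "b1": ["root"],
    "m": ["a1", "b1"],
    "a2": ["m"],
    "b2": ["b1"],
}
```

Before the merge, the base of a1 and b1 is root. After it, merge_bases(history, "a2", "b2") yields only b1, so a second merge considers just the changes in b2 and does not trip over the ones already merged.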

Everyone has full offline access to the project history, and all repository operations (except pushing or pulling changes, obviously) take place at full local disk speed.

All this immensely accelerates collaborative development and removes the political headaches surrounding commit access.