Chapter 4. Software Configuration Management

The first half of this chapter describes why keeping track of how your software changes, a process more formally known as software configuration management (SCM), is vital for any project. This chapter covers exactly what is meant by SCM, and how it differs from change management or configuration management (CM). Seven of the most commonly used or promising SCM tools are examined: CVS, Subversion, Arch, Perforce, BitKeeper, ClearCase, and Visual SourceSafe.

The second half of this chapter discusses some of the most common annoyances encountered when using SCM tools and describes some of the ways you can avoid them.

Warning

The acronym SCM has been reverse-engineered over the years to stand for “source configuration management” and “source code management.” The original, most widely used meaning is “software configuration management.” SCM is also known colloquially as “version control” and “revision control.” Since the number of TLAs (three-letter acronyms) is limited, reuse is inevitable; thus SCM also refers to “supply chain management” and “software compliance management,” luckily in slightly different contexts.

Why Do I Need SCM?

The source code of all projects changes over time as the projects grow. Most of the time, the people working on the project add new parts to it and fix the broken ones. Occasionally, large reorganizations of the source code can occur, sometimes as part of cleaning up the code (also known as refactoring).

The simplest and most unwise way to work on a project is for a group of people to work on the project’s files in a shared directory. Each developer has to be very careful not to change any file unless he can be sure that he is the only one changing it.

A wise group, well aware of the woefully short life span of most hard disks, makes regular nightly backups off site, keeping the last three copies available locally for convenience. If an individual is going to make a large change, she can make her own copy of the affected files locally, just in case something goes horribly wrong while she’s making the change. When the time comes for a release, the current versions of all the files are copied to “somewhere safe.”

This simple way of working on projects is how an estimated 40% of software projects are developed.[1] I don’t know about you, but that figure is hard for me to believe. Sure, it’s just an average, and on average every human being has one ovary and one testicle, but if the 40% value really is true, then stunned contemplation is my first reaction. Surely all those people must have heard that there are software tools to help with this kind of thing? The Capability Maturity Model (CMM; see http://www.sei.cmu.edu/cmm) certainly has. For your project to be anything but “ad hoc, and occasionally even chaotic,” it says you need SCM.

The rest of this chapter can serve as an introduction to some of the major problems that such an environment will almost inevitably get tangled up in, and some of the ways to avoid those problems. If this is the situation you are in, the next few paragraphs should also help motivate you and your group to introduce an SCM tool. In the too-simple environment described, eventually the following situations or questions will occur:

  • More than one person wants to work on the same file at the same time, but it’s too hard to find everyone and to get them to agree that a file is available, so nobody works on that file. Integrating different people’s work becomes very hard to schedule and takes a long time to finish.

  • “All this code used to work! What changed since yesterday?”

  • “When was this line of code changed? There’s a bug in it and we need to know how many versions of the product are affected.”

  • “Who changed that line?” This question usually means either “For what purpose was that change made?” or “We need to know who would write code like that!”

  • “Most of those changes were a mistake and they should be removed, but we do want to keep a few of the changes.”

  • “We need to fix that bug in multiple versions of the product.”

  • “Hey! Who stomped on that change I made yesterday?”

A good SCM tool can provide solutions to all of the above situations and questions. No software project of any size should be attempted without some form of SCM, and occasional copying of the source directories doesn’t count as adequate SCM!

What SCM Is and Is Not

A simple description of SCM is that it’s a way to keep track of the different versions (the configuration part of SCM) of everything that is necessary for a software project over time. What is tracked is usually files of one kind or another, but could just as well be versions of entries in a database. SCM tools are usually separate applications from the filesystem, though this is by no means always the case.

Sometimes people confuse build tools and SCM tools, but the difference is simple. Keeping track of which files go into a product is the task of build tools. Keeping track of all the versions of those files as they change is the task of SCM tools. Some build tools can use SCM tools to obtain the files they need to build a product, but that doesn’t make them SCM tools.

Using an SCM tool, you can recover older versions of files after the files have been changed later on. This is very useful when you make a mistake. One view of SCM is that it gives you the ability to retrieve a snapshot of the project at a moment in time and then allows you to move forward or backward in time from that point. You can often tag or label the project at different moments in time and then retrieve the files exactly as they were when the tag was applied.
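The snapshot view can be made concrete with a toy model. The sketch below is not how any particular SCM tool stores its data; the class TinyRepository and its methods are hypothetical names invented for illustration. It keeps every committed version of each file and treats a tag as nothing more than a record of which version of each file was current when the tag was applied.

```python
class TinyRepository:
    """Toy in-memory repository: every version of every file is kept."""

    def __init__(self) -> None:
        self.history: dict[str, list[str]] = {}    # path -> all committed contents
        self.tags: dict[str, dict[str, int]] = {}  # tag -> {path: version index}

    def commit(self, path: str, content: str) -> None:
        """Store a new version of one file."""
        self.history.setdefault(path, []).append(content)

    def tag(self, name: str) -> None:
        """Record which version of each file is current right now."""
        if name in self.tags:
            raise ValueError(f"tag {name!r} already exists")
        self.tags[name] = {path: len(versions) - 1
                           for path, versions in self.history.items()}

    def checkout(self, name: str) -> dict[str, str]:
        """Return the files exactly as they were when the tag was applied."""
        return {path: self.history[path][version]
                for path, version in self.tags[name].items()}


repo = TinyRepository()
repo.commit("main.c", "int main() { return 1; }")
repo.tag("RELEASE_1")
repo.commit("main.c", "int main() { return 0; }")  # a later change
repo.commit("util.c", "/* new file */")            # did not exist at tag time
snapshot = repo.checkout("RELEASE_1")
```

Here snapshot holds only main.c in its first version, because that was the whole project when RELEASE_1 was applied. Later commits leave the tag untouched.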

You can also use an SCM tool to share your changes to files with other people in a controlled manner. Many SCM tools show the differences (or diffs) between two versions of a file, as well as who made the changes, when the changes were made, and which other files changed at the same time.

Many SCM tools also support the idea of branches, which are versions of files in parallel universes. What that means is that you can have two (or more) different versions of a file, both derived from a common version, and you can work with either version at the same time. Branches let you support an existing product made from one set of files, while you develop the next release based on different versions of those same files. Many SCM tools help you with merging changes between branches. Figure 4-4 (in Branches and Tags, later in this chapter) shows this diagrammatically.

SCM tools can be divided into two different kinds: centralized and distributed. Centralized tools store the different versions of the files in a central location, usually on a single server. Distributed tools store the different versions on multiple machines. The difference is somewhat blurred, since distributed tools can choose to use a single location (just like centralized tools), and some centralized tools support distributing their files to multiple servers. There are also SCM tools that support replication, where for performance reasons their files can be read from many different servers but are written to only one server. The difference sometimes simply comes down to how the tool was originally designed.

Another way in which SCM tools can differ is whether they expect each file to be changed by more than one person at a time. Some SCM tools stop other people from changing a file while you are editing it; this is known as a locking or serial model. Other tools expect you to resolve changes that other people may have made while you were all editing the same file; this is the concurrent model. SCM tools also differ in how they declare who can read and write the files that they control. These permissions are often described using an access control list (ACL) for each file.

Some SCM tools use simple text files (“flat text”) while others use a database to store their files. This is a sure source of discussion about the merit of each tool. On one hand, simple text files make it somewhat easier to detect corruption, and you can use existing, independent tools to inspect and edit the files. Text files scale well enough for most projects, and you don’t have to be a database administrator to use them.

On the other hand, databases have many useful properties such as atomic transactions and faster access times. Also, since flat text files generally don’t scale as well as databases do, you might as well use a database right from the start. Databases also let you search more efficiently within the older versions of your files. Subversion (see Subversion, later in this chapter) allows you to choose either approach. The jury is still out on this choice, perhaps because tools based on the two different approaches are aimed at different-sized projects.

Some modern SCM tools support the concept of changesets. A changeset is a group of changes to the files controlled by the SCM tool that were made as one logical operation. The advantage of changesets is that they can be applied or later removed as a single operation.
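One way to picture a changeset is as a bundle of per-file changes that succeeds or fails as a unit. The following is only a sketch under that assumption; the Changeset class, its fields, and the file names are hypothetical and do not correspond to any tool's actual storage format. Each entry records the content a file is expected to have before the change and the content it should have afterward, with None standing for a file that does not exist.

```python
from dataclasses import dataclass
from typing import Optional

# Maps a path to (content before, content after); None means "file absent".
Changes = dict[str, tuple[Optional[str], Optional[str]]]


@dataclass
class Changeset:
    description: str
    changes: Changes

    def apply(self, files: dict[str, str]) -> dict[str, str]:
        """Apply every change, or raise without changing anything."""
        for path, (before, _after) in self.changes.items():
            if files.get(path) != before:
                raise ValueError(f"{path} is not at the expected version")
        result = dict(files)  # work on a copy so a failure leaves no trace
        for path, (_before, after) in self.changes.items():
            if after is None:
                result.pop(path, None)
            else:
                result[path] = after
        return result

    def inverse(self) -> "Changeset":
        """Build the changeset that removes this one again."""
        swapped = {path: (after, before)
                   for path, (before, after) in self.changes.items()}
        return Changeset(f"Revert: {self.description}", swapped)


files = {"parser.c": "old parser"}
fix = Changeset("Fix parser and add its test",
                {"parser.c": ("old parser", "new parser"),
                 "parser_test.c": (None, "test code")})
updated = fix.apply(files)
restored = fix.inverse().apply(updated)
```

Applying the inverse changeset to updated gives back the original set of files, which is the "later removed as a single operation" property in miniature.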

Tip

It’s worth noting that SCM tools are not the same thing as configuration management (CM) and change management systems (CMS). These systems contain SCM abilities for tracking different versions of files, but also contain and enforce complex procedures which have to be followed to make a change. Such procedures may include scheduled reviews of the change, written approval, and formal tests that the change must pass before being accepted. This is much more than what SCM does. Sometimes people expect SCM tools to magically enforce some change management policy or other, which is really the wrong way around; choosing how to configure an SCM tool is just one part of your chosen process for allowing changes to a project.

Such CM processes are often considered to be too heavyweight for many software projects, though there certainly are instances where they are appropriate; nuclear reactor controls, aviation software, and medical devices are three examples that spring immediately to mind. This chapter is about SCM, not CMS.

Drawbacks of SCM

You might agree that SCM is vital to your project, but at what cost? All tools seem to have some drawbacks associated with them, and SCM tools are no exception. This section mentions a few complications of using SCM tools, but it should be stressed that the benefits of SCM outweigh all these issues. I’m sure that there are trapeze artists who feel that safety nets take away some of the thrills of their act, but you never see them work without a net.

Disk space

Keeping track of the different versions of a large number of files soon begins to take up lots of disk space. Even storing just the source code for a product with a million lines of code can easily take 10MB. Naively keeping complete copies of every file will use up 10MB for each tag. SCM tools usually store only the differences between versions, which are much smaller in most cases. Even with just storing the differences, a total of 250MB would not be unusual for such a product after a year’s worth of changes. Storage is cheap enough to allow us to ignore this argument.
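To get a feel for why storing differences is so much cheaper than storing complete copies, here is a small sketch using Python's standard difflib module. The 1,000-line file and the single changed line are made-up data, and real SCM tools use their own delta formats rather than unified diffs, so treat the numbers as an illustration only.

```python
import difflib


def delta_size(old: str, new: str) -> int:
    """Size in characters of a unified diff that turns old into new."""
    diff = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        n=0,  # no context lines, so only the changed lines are counted
    )
    return sum(len(line) for line in diff)


# A hypothetical 1,000-line file in which one line changes between versions.
base = "".join(f"line {i}\n" for i in range(1000))
v2 = base.replace("line 500\n", "line 500 changed\n")

full_copy_cost = len(v2)           # cost of keeping a second complete copy
delta_cost = delta_size(base, v2)  # cost of keeping only the difference
```

For a change this small, delta_cost is a tiny fraction of full_copy_cost. The gap narrows as more of the file changes between versions.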

Performance

Using an SCM tool to obtain a set of files to work with is generally slower than copying the files over from another directory. The SCM tool may keep the files on a remote server across a busy network, and it may have to regenerate in real time the precise versions of the files you requested. You may also have to wait for someone to finish making her changes before you can get the latest set of files. All that work takes a bit more time, but it’s usually not much time.

Connectivity

Some SCM tools don’t work when they are disconnected from a network—for instance, when you are using your laptop on an airplane. If you are going to do a lot of development disconnected from a network, choose a distributed SCM tool that will work in that mode, or at least one that won’t stop you from accessing your files without a connection to its central server.

Complexity

Some minimal training in how to use the SCM tool is likely to be necessary, and any infrequently used commands are quickly forgotten. Complicated activities such as merging different versions of files or merging whole branches of source code together are particularly hard to get right with many SCM tools. This is one reason why the quality of the documentation and support is important to consider when you are choosing an SCM tool.

Cost

If the SCM tool chosen is not free of charge, then the financial cost can become a limiting factor on how the project can grow, especially if a license is needed for each developer who uses the tool. Still, there are plenty of good SCM tools that cost nothing, and my opinion is that you can always find ways to get more money, but you’ll never recover time lost to poor SCM practices.

Risk of corruption

Finally, and most disturbingly, if there is a bug in the SCM tool, or bad hardware, or even operating system errors, then your files could gradually become corrupted within the SCM tool itself. This nightmare scenario is thankfully very rare, but is a great reason to use SCM tools with checksums on their files and with tools to validate their files, and to do nightly backups of your SCM tool’s files.
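If your SCM tool does not validate its own files, a simple external check is better than nothing. The sketch below records a checksum for every file under a directory and later reports any file that is missing or has changed. The function names are hypothetical, and SHA-256 is my choice for the example, not something a particular SCM tool prescribes. It is only meaningful for files that are not supposed to change, such as a copy of the repository taken for backup.

```python
import hashlib
from pathlib import Path


def checksum(path: Path) -> str:
    """SHA-256 digest of one file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root."""
    return {p.relative_to(root).as_posix(): checksum(p)
            for p in sorted(root.rglob("*")) if p.is_file()}


def find_damaged(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or no longer match their checksum."""
    damaged = []
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file() or checksum(path) != expected:
            damaged.append(name)
    return damaged
```

A nightly job could call build_manifest right after a backup is written and find_damaged before the backup is trusted for a restore.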

A Typical Day’s Work with SCM

Each SCM tool has a different name for the collection of files that it tracks. In the rest of this chapter, I’ll use the CVS term repository for these files, simply because it is familiar to many people. The set of files in which a developer makes changes is named the working copy (CVS calls this a sandbox). Obtaining a working copy using CVS is known as checking out a copy. Publishing the changes to a repository is known as committing or checking in the changes.

A typical session with an SCM tool involves the following activities:

Checkout

A developer decides to work on some part of the project. He checks out copies of the necessary files onto his machine. This is his personal working copy. Checking out the files has not changed anything in the repository, and all changes he makes are local to his machine. No one else is affected by his work yet.

Edit

The developer changes the files in some interesting way, maybe even creating new files, and probably builds a new version of the product using the changed files.

Warning

Probably the most common mistake people make when they use SCM tools is to forget to add newly created files to the SCM tool. Even though your own builds and tests work just fine, this mistake breaks the build when your changes are committed, leading to self-defensive comments such as “But it works for me!” and “I ran all the tests.” Some SCM tools will alert you to the presence of local files that they don’t know anything about, but it’s still good practice to get used to adding new files to your SCM tool right after you create them, while you still remember that they are new.
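If your SCM tool does not warn you about unknown local files, the check is easy to approximate yourself. This sketch assumes you can obtain the set of paths that the tool already tracks (how you get that list depends entirely on the tool); the function name is hypothetical. It does not skip the tool's own administrative directories or any ignore patterns, which a real check would need to do.

```python
from pathlib import Path


def untracked_files(working_copy: Path, tracked: set[str]) -> list[str]:
    """Return files present in the working copy but unknown to the SCM tool."""
    present = {p.relative_to(working_copy).as_posix()
               for p in working_copy.rglob("*") if p.is_file()}
    return sorted(present - tracked)
```

Running something like this just before each commit turns “But it works for me!” into a list of files to add.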

Diff

One common thing to do with an SCM tool is to see what changes have been made in the working copy, compared with the versions of the files in the repository. Another diff-related activity is to see who last changed a particular file and exactly what those changes were.
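Most SCM tools present these differences in a format much like the one produced below. This is a minimal sketch using Python's standard difflib module; the file names and contents are invented for the example.

```python
import difflib

# The version stored in the repository and the edited working copy,
# each as a list of lines (hypothetical contents).
repository_version = ["def greet():\n", "    print('hello')\n"]
working_copy = ["def greet():\n", "    print('hello, world')\n"]

diff_lines = list(difflib.unified_diff(
    repository_version,
    working_copy,
    fromfile="repository/greet.py",
    tofile="working/greet.py",
))
print("".join(diff_lines), end="")
```

Lines that start with a minus sign exist only in the repository version, and lines that start with a plus sign exist only in the working copy.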

Update

While the developer was working on the files in his working copy, someone else may have changed those same files in the repository. The developer has to get the latest versions of those files and make sure that his changes still work correctly with the changes from other people.

Commit

Finally, the developer has resolved all these changes, added all his new files, tested a version of the product created from his working copy, and is now ready to let other people in the project see his changes. This happens by committing the changes to the repository. It’s helpful if you commit related changes all together, along with a descriptive comment about what the changes were for.

Various tests can be required by the SCM tool before it accepts the changes. For instance, was there a (possibly required) bug associated with these changes? Were the unit tests run and did they behave as expected? Have the changes been reviewed or checked for security or copyright problems?

Last, when the changes are accepted by the SCM tool, some notification (such as an email) is sent to the group, describing the changes and who made them. A change log may also be updated. If the files are tagged, then information about the tag should appear in the change log as well.

Figure 4-1 shows a centralized repository being used by three users: Alice, Bert, and Cuthbert. Alice is checking out her own working copy of some of the files in the repository. Bert is updating his working copy, merging in the changes that other people have made to the files in the repository. Cuthbert is committing the changes to the files that he has made in his working copy to the repository, thus making them available to other people.

Figure 4-1. Using a centralized SCM tool

Using a distributed SCM tool is similar to the process just described, except that there are now many repositories. In addition to the usual checkout, update, and commit operations on a repository, there are equivalents for repositories themselves, at the next level of abstraction. You can:

  • Create your repository by copying one of the existing ones, which is similar to checking out a working copy

  • Merge in changes from another repository, which is similar to updating a working copy

  • Merge your changes to another repository, which is similar to committing changes in a local copy

Figure 4-2 shows distributed repositories being used in the same way as shown in Figure 4-1 for centralized repositories. One way to think about distributed repositories is that each person has her own repository on her machine, and she can commit files to it while disconnected from a network. Then when she is reconnected to a network, she can synchronize her repository with the other repositories.

Figure 4-2. Using a distributed SCM tool

SCM Annoyances

This section describes some of the common problems that people run into when they use SCM tools with a project. Some problems such as merging are hard work due to the basic nature of the problem, but all the problems can be tamed with a little forethought.

Branches and Tags

To recap, a tag is a name for all the versions of a group of files at one moment in time, just as though you had made a copy of all the files as they were at that moment. A branch does the same thing, but allows SCM-controlled changes to the files later on. Figure 4-3 shows an example of this.

Figure 4-3. Changing a file on a branch

Branches are vital because they allow you to make changes to an older version of the product—for example, when you need to fix a bug in a file belonging to the last release of a product. At the same time, you can make changes for the next release to a different version of the same file. If you don’t use branches but instead only fix bugs in future releases, this can put pressure on the project to create premature releases.

Tip

You should consider how you are going to use branches before you release the first version of your product. You should also check that all your other SCM-related tools work properly with branches.

However, you should try to minimize the number of active branches in your project. Branches make things more complicated because there are now more changes to manage. Imagine three versions of a product: the oldest one is the one that is being maintained, the middle one is the one that is being made available to customers right now, and the newest one is next year’s “yup, that bug’s fixed in the next release” version. A set of changes to fix some problem has to be created for one of the three versions, tested there, then ported to the other two versions and then tested there too. Even if it is straightforward to port the changes to the other versions, the amount of testing work for one bug has just been tripled. Tracking the same bug in multiple releases is also a hard thing to do well with most bug tracking tools (see One Bug, Multiple Releases).

To really see why the number of branches in a product should be minimized, look at Figure 4-4. Each of the source files is named on the vertical axis, and each different version of each source file is a solid circle in the horizontal direction. Every branch that is created is a (logical) copy of all the files into the third axis, the one labeled Branches. Just the copies of File 1 and File 2 are shown, and there have been three changed versions of File 1 on Branch 1. Now this third dimension has an odd characteristic compared with the other two: it’s very easy to move in one direction (creating a branch), but it’s always much more work to move in the other direction (merging). The more branches of a project that you keep active, the more time you will spend building, testing, and documenting the changes to the project. For the sake of simplicity, I recommend keeping the number of active branches small: two or three at most for a medium-sized commercial product.

Figure 4-4. Branches are in a different dimension

To inexperienced project managers, the concept of branching may seem like an easy answer to many of a project’s growing pains. Got a new product? Just put it on a branch. Developing for a new hardware platform? Put it on a branch. Don’t like that developer’s coding style? Put him on a branch. Some SCM tools even encourage you to think like this. My advice is simple: avoid it! You should use just enough branches for your project and no more. The next section discusses what to do when you do have to create a branch.

When to Branch? When to Tag?

The previous section was pretty emphatic about why you want to minimize the number of active branches in a project. So when is creating a branch appropriate? There are just two common cases:

  • A branch for each major release of a product. These long-lasting branches will become inactive when that version of the product is no longer supported.

  • Branches for a small number of developers to work on for a short period (days or weeks, usually not months). If the work on the branch is to be useful, it has to be merged back to the main development branch sooner rather than later.

These two cases can be summarized as “branch on incompatible policies.” That is, create a branch when the guidelines for committing files are different. For example, the rules about who can commit to a release branch are usually different from the more open nature of the main development branch. Since the two sets of rules are different for the same source files, a branch is probably necessary. A useful article that expands this idea is “High-Level Best Practices in Software Configuration Management,” from http://www.perforce.com/perforce/bestpractices.html. (There are other articles that encourage each developer to have his own branch for his work, or even a branch per changeset, but these approaches assume effortless merging abilities from your SCM tool, which is rarely the case in practice.)

Tip

Before you create a branch, create a branch point tag. Then create the branch using that tag. That way, if you branch only a few files but later decide that you want to branch some other files, you can use the tag to branch from the very same point in time. Some SCM tools do this automatically for you.

When you create a branch, always consider when you are going to be able to stop using it, and put as many parts of the project as seems sensible onto the branch. If you branch only a few parts of a project, then it’s good to record somewhere which parts were and were not branched. It’s also a good idea to record the name of the branch, the branch point, and the intended purpose of the branch somewhere that everyone in the project can find it.

When is it a good idea to tag a project? Good practice is to create a tag whenever anything happens to the project that you might want to reproduce. Examples are creating a release, giving an internal demo, reaching a point in time that you might want to branch from one day, or just getting a build to work again. Since tags are just a way to name a set of particular versions of files, they don’t involve the dreaded third dimension of Figure 4-4. Consequently, they require much less effort to work with—there are no merge headaches to deal with later on. However, depending on the SCM tool and the size of the project, tagging may take hours rather than minutes or require locking the repository to stop the files being changed during this time.

Naming Branches and Tags

The naming of branches and tags has surprisingly wide effects on a project. Tag names become associated with builds, test results, and eventually releases, so they appear in many of the related tools such as bug tracking systems. A document with the name of each branch, the branch point tag, and the intended purpose of the branch can help to reduce confusion about how to use different branches. Since there are generally many more tags than branches, it’s easier to simply make the tag and branch names meaningful. Labeling Builds describes the idea of build labels, which are a good basis for tag names.

Warning

If there is no overall naming scheme for your branches and tags, then ad hoc ones will spring up. Changing the names of branches later on is difficult for some SCM tools such as CVS.

Before you settle on a naming scheme for your branches and tags, note that some SCM tools have nonintuitive quirks about what a name can look like. In CVS, for example, names must start with a letter, not a numeral, so 2_1_release is not permitted. Periods and spaces are also not allowed, so release 2.1 won’t work, but hyphens and underscores are permitted (though underscores tend to disappear when the name is used as part of an HTML link). Branch and tag names also have to be unique within a file in CVS; that is, you can’t tag two different versions of a file with, say, ALPHA_RELEASE, even if the versions are on different branches. CVS also makes no distinction between tag names and branch names, and working out whether a name is a tag or branch after the fact can be tedious.
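Quirks like these are easier to live with if a script checks proposed names before anyone tries to use them. The sketch below encodes only the three rules just described: a leading letter, no periods, and no spaces. The function name is hypothetical, and CVS may reject other characters that this check does not know about, so treat it as a starting point rather than a complete validator.

```python
import string


def tag_name_problems(name: str) -> list[str]:
    """Check a proposed name against the CVS quirks described in the text."""
    problems = []
    if not name or name[0] not in string.ascii_letters:
        problems.append("must start with a letter")
    if "." in name:
        problems.append("must not contain a period")
    if " " in name:
        problems.append("must not contain a space")
    return problems
```

An empty list means the name passed these particular checks, so 2_1_release and release 2.1 are both reported while rel_2_1 is accepted.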

Create a document that describes the chosen naming scheme for your project’s tags and branches, and try to make sure that the naming scheme follows the release numbering scheme (see Release Numbering) as closely as possible. If you can enforce the chosen naming scheme using the SCM tool itself, so much the better. Restrict who is allowed to create branches, make sure they know what is expected for branch and tag names, and make sure that they have some good sense about when to create a branch. Once you know who can create branches, automate the process as much as possible for them.

A simple naming scheme that has been used successfully with CVS is as follows:

  • All branch names end in _branch or _b. Tag names do not.

  • Private branches and tags should have _private in their name.

  • Tag names that are connected to points where branches occurred should have _bp (for “branch point”) in their name. Another idea is to start the names of branch point tags with Root-of.

  • Tag names that are connected to points where merges occurred should have _mp (for “merge point”) in their name.

Some examples of tags and branch names using this scheme are:

rel_1_1_branch

The branch for release 1.1 and any of its subsequent patch releases

bob_i18n_private_branch

A private branch, probably used by Bob for some internationalization work

QA#fugu_139

A tag for the internal release of build 139 of the project named “fugu”

Root-of#rel_1_1_branch

The tag that records where the branch rel_1_1_branch originally diverged from the main line

susan_private_branch#main#2_mp

A tag to record the second merging of the branch susan_private_branch back to the main line
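A naming scheme like this one is regular enough for a script to interpret, which helps when CVS itself cannot tell you whether a name is a tag or a branch. The following sketch is my own reading of the scheme, not part of CVS; NameInfo and classify are hypothetical names. Two assumptions are worth stating: names that start with Root-of are treated as tags even when they end in _branch, as in the example above, and any name containing _private is flagged as private. The substring tests are deliberately crude and would misfire on a name that happened to contain _bp or _mp for other reasons.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameInfo:
    kind: str      # "branch" or "tag"
    role: str      # "branch point", "merge point", or "" for anything else
    private: bool  # True when the name contains _private


def classify(name: str) -> NameInfo:
    """Interpret a branch or tag name according to the naming scheme."""
    private = "_private" in name
    # Branch point tags are checked first because Root-of names may
    # end in _branch without being branches themselves.
    if name.startswith("Root-of") or "_bp" in name:
        return NameInfo("tag", "branch point", private)
    if "_mp" in name:
        return NameInfo("tag", "merge point", private)
    if name.endswith(("_branch", "_b")):
        return NameInfo("branch", "", private)
    return NameInfo("tag", "", private)
```

Applied to the examples above, rel_1_1_branch and bob_i18n_private_branch come out as branches, and the other three names come out as tags.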

Dates can be troublesome in branch and tag names, especially if the project has people from different countries reading the dates. Some people like to have the name of the tag that was used as the branch point (or root) of a branch included in the branch name. This seems to make the branch name overly long, in my opinion, and you should be able to use the SCM tool itself to tell you where the branch came from.

Merge Madness

Merging is taking the changes that were made to files on one branch and making the same changes to another branch. Perhaps the branch was where some experimental changes were developed, and now they’re ready for everyone else to use. Perhaps a bug was fixed on a branch for one series of releases, and the same bug needs to be fixed in a different series of releases.

Branching is so tempting, so easy: just copy all those files and make your changes to the copies. Merging is so much harder, and only gets harder as the original and the copies diverge over time. Indeed, there are people who make a whole career out of merging different versions of classical texts back together, word by painful word, but you probably don’t want to spend your career merging files. Even with the merge tools that are mentioned next, merges still take time, usually because some human intervention is necessary when the tools can’t figure out what to do. Large merges inevitably destabilize the branch they are merged into, so extra testing effort is needed after the merge is complete.

In most SCM tools, automated merging uses the diff and patch tools in some manner. diff uses an algorithmic equivalent of finding the shortest path between two points to create the minimum number of hunks, which are groups of lines that could be removed or added to one file to transform it into the other file. patch takes these hunks and applies them to one file to create the other file, along with some smart attempts to cope with changes to where the hunks should be applied within the file. Many SCM tools help you only with merges between branched versions of the same file, not between separate files. For more information about diff and patch, see “Comparing and Merging Files” at http://www.gnu.org/software/diffutils/manual.
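The division of labor between diff and patch can be sketched in a few lines. This example uses Python's standard difflib module, which uses a different matching heuristic than GNU diff and so does not promise the minimum number of hunks. The function names and the hunk representation are hypothetical, and the sketch has none of patch's ability to cope with hunks whose position in the file has shifted.

```python
import difflib

# A hunk here is (start, end, replacement lines): replace old[start:end].
Hunk = tuple[int, int, list[str]]


def make_hunks(old: list[str], new: list[str]) -> list[Hunk]:
    """Describe how to turn old into new as a list of line-range edits."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [(i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def apply_hunks(old: list[str], hunks: list[Hunk]) -> list[str]:
    """Apply the hunks to old; working backward keeps earlier indexes valid."""
    result = list(old)
    for start, end, replacement in reversed(hunks):
        result[start:end] = replacement
    return result


old_file = ["a", "b", "c", "d"]
new_file = ["a", "x", "c", "d", "e"]
hunks = make_hunks(old_file, new_file)
rebuilt = apply_hunks(old_file, hunks)
```

Here rebuilt is identical to new_file, which is all that patch is asked to guarantee when the file it is given matches the one that diff saw.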

So what makes an automated merge fail? Generally, if two files have a common ancestor and both files have had the same lines changed, it is unclear which changes are the correct ones to use. In this case, the changes are conflicts, and someone has to resolve them by choosing one or another of the changes. Luckily for SCM and branches, developers tend not to modify the same lines of code at the same time as other developers. You may be pleasantly surprised by how few conflicts there are when merging changes from one branch to another.
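The rule for spotting a conflict can be sketched as follows: work out which lines of the common ancestor each side changed, and report any place where the two sets of changes overlap. The function names are hypothetical, and this is a simplification of what real merge tools do. In particular, it does not notice when both sides made exactly the same change, and it treats an insertion that touches the other side's change as a conflict.

```python
import difflib

Range = tuple[int, int]  # half-open range of line indexes in the ancestor


def changed_ranges(ancestor: list[str], descendant: list[str]) -> list[Range]:
    """Return the ranges of ancestor lines that descendant changed."""
    matcher = difflib.SequenceMatcher(a=ancestor, b=descendant, autojunk=False)
    return [(i1, i2) for tag, i1, i2, _j1, _j2 in matcher.get_opcodes()
            if tag != "equal"]


def overlaps(first: Range, second: Range) -> bool:
    """Decide whether two changed ranges collide."""
    (a1, a2), (b1, b2) = first, second
    if a1 == a2 or b1 == b2:
        # A pure insertion has an empty range; treat it as a conflict
        # when it touches the other change.
        return a1 <= b2 and b1 <= a2
    return a1 < b2 and b1 < a2


def find_conflicts(ancestor: list[str], ours: list[str],
                   theirs: list[str]) -> list[tuple[Range, Range]]:
    """Pair up every change of ours with every change of theirs that collides."""
    return [(mine, yours)
            for mine in changed_ranges(ancestor, ours)
            for yours in changed_ranges(ancestor, theirs)
            if overlaps(mine, yours)]


ancestor = ["a", "b", "c", "d"]
ours = ["a", "B", "c", "d"]        # we changed the second line
theirs_far = ["a", "b", "c", "D"]  # they changed the fourth line
theirs_same = ["a", "bee", "c", "d"]  # they also changed the second line

no_conflicts = find_conflicts(ancestor, ours, theirs_far)
conflicts = find_conflicts(ancestor, ours, theirs_same)
```

Only the second merge reports a conflict, because only there did both sides change the same line of the ancestor.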

Some SCM tools (including CVSNT, Arch, Perforce, and BitKeeper) automatically keep track of when files were merged. If you have a large number of files to merge and they have many conflicts, then graphical merge tools may be useful. Some of the better-known standalone merge tools are the commercial Araxis Merge (Windows only) and Guiffy (all platforms), and the open source WinMerge (Windows only) and xxdiff (for Unix).

One good way to organize larger merges is to designate a small number of people as “mergemeisters” and let them perform the merge and resolve as many conflicts as possible. Then have the mergemeisters call in the appropriate people for each group of files that still need to be merged by hand.

Security

Some other important aspects of SCM to consider are those related to security. The source code is the heart of your project, where all your intentions, shortcuts, and errors are plain to see. Several large companies including Microsoft and Cisco have been the targets of successful exploits aimed at acquiring their source code. Even the repository of the source to the CVS tool has itself been cracked.

An SCM tool must make sure that only authorized people can read and change files, and it must keep a record of such actions for audits. It must also be able to protect its own files from accidental or malicious corruption, and it should not be vulnerable to denial-of-service attacks.

Some practical suggestions for securing your SCM tool, and CVS in particular, include:

  • Use separate and well-secured machines as SCM servers, which few or no developers can log in to directly. If you have secure server rooms, keep your SCM machines in there. Emergency power is often available in server rooms, which helps keep your filesystem intact, as do redundant disks.

  • Use encrypted connections from SCM clients to SCM servers, especially if there is a wireless connection involved anywhere in the network. If people must have accounts on the SCM server, use a restricted shell such as smrsh to limit the commands that they are allowed to execute.

  • Carefully guard the physical security of your backups of the repository. Destroy the physical media of outdated backups.[2]

  • Track each change in the repository using notifications of commits and inspection of diffs. Train developers to expect to see email when they make changes and to occasionally confirm that the information in the email is also appearing in any change logs.

  • The CVS pserver access mode is not designed to be a secure access method; it should be used only inside trusted networks. Use ssh and the ext mode for external access, and avoid anonymous access to CVS servers if at all possible.

  • Disable the CVS admin command for most people, since this command makes it too easy to change or corrupt a repository in untraceable ways.
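The advice about pserver and ext can be turned into a small guard for wrapper scripts. This is a sketch under stated assumptions: check_cvsroot is a hypothetical function, and the user, host, and path in the comments are placeholders. The only CVS facts relied on are that repository roots name their access method as :pserver: or :ext:, and that the client consults the CVS_RSH environment variable for ext connections.

```shell
#!/bin/bash
# Make the CVS client use ssh as the transport for :ext: repositories.
export CVS_RSH=ssh

# check_cvsroot is a hypothetical helper: it inspects the access method
# named at the front of a repository root and refuses pserver.
check_cvsroot() {
  case "$1" in
    :pserver:*) echo "insecure: pserver"; return 1 ;;
    :ext:*)     echo "ok: ext";           return 0 ;;
    *)          echo "other access method"; return 2 ;;
  esac
}

# Placeholder example (user, host, and path are made up):
#   check_cvsroot ":ext:alice@cvs.example.com:/usr/local/cvs"
```

A wrapper script could call check_cvsroot "$CVSROOT" before running any cvs command against a server outside the trusted network.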

An excellent source of further information about this topic is the paper “Software Configuration Management (SCM) Security,” by David A. Wheeler, which is available from http://www.dwheeler.com/essays/scm-security.html.

Access Wars

The development of a software product is often broken up into functional groups, such as networking, GUI developers, testers, technical writers, and toolsmiths. Not surprisingly, the way that a product’s source code is stored in an SCM tool tends to reflect how the groups are divided. Disagreements about who gets to make changes (“commit rights”) in each group’s files are a common source of irritation in a project.

In many projects, it is considered polite to mention proposed changes in another group’s files to that group before you make them; you can also send diffs by email to the group. Otherwise, someone in the affected group always seems to take offense, whether at the changes themselves, or because they were surprised by who made the changes, or because “you might do it again, and it might break something in the future!” There’s not much you can do to argue with that, so you might as well coordinate changes in other groups’ files with them beforehand: egoless programming only goes so far when it’s a whole group’s ego.

Even more far-reaching than these seemingly petty territorial conflicts are the effects on a project when different groups start to deny others read access to their files. These aren’t the files containing the name of the next CEO of the company or telling where the last project leader was buried. These are cases such as one group of developers allowing only compiled versions of their libraries to be used by other groups, or the Technical Publications group wanting people to use copies of only those documents that they have personally issued. This kind of information restriction hinders effective software development.

Still, looking at the issue from a different angle, preventing your salespeople from promising features in the next release based on a single comment they saw committed to the source code a few weeks ago can actually make software development more coherent. As with all information, it’s what you expect the owner to do with it that matters most. The beauty of SCM tools is that if someone else makes changes that you don’t like to your group’s files, you can not only talk to him but also back out his changes.

Filenames to Avoid

All filesystems have their quirks about what characters are valid in filenames and how long filenames can be. SCM tools have their own set of restrictions on the names of files.

First, a little history. Filenames with spaces in them were most uncommon in older Unix filesystems. Windows 95 began to make them more popular, but Windows also dragged along “8.3” (pronounced “eight dot three”) filename restrictions from its DOS ancestry, where the filename could be at most eight characters long, with an extension of up to three characters. Other characters in filenames that have been known to break cross-platform compatibility, or even corrupt the files stored in SCM tools, are /, \, and newline characters. Just to be safe, these characters are all still worth avoiding in filenames.
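One way to find troublesome names before they reach the repository is to search for them. The following sketch assumes a find that follows the usual pattern-matching rules for -name; risky_names is a hypothetical function name.

```shell
#!/bin/bash
# List files and directories whose names contain whitespace (including
# newlines) or a backslash. risky_names is a hypothetical helper.
risky_names() {
  # -name tests only the last component of each path, so a space in a
  # parent directory does not flag every file beneath it.
  find "$1" \( -name '*[[:space:]]*' -o -name '*\\*' \) -print
}
```

Running risky_names . at the top of a source tree before an import gives a list of names to reconsider. A forward slash cannot appear inside a Unix filename, so it does not need to be searched for there.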

For example, since CVS was originally developed on Unix, filenames longer than 8.3 were just fine, but support for spaces came later. Unfortunately, the format originally chosen for passing the names of files and their versions to the CVS info scripts, which are part of customizing a CVS server for your site, did not really support spaces in the filenames until more recently, around Version 1.12.6.

Windows filesystems are set up by default to be insensitive to the case of filenames. So three files named FileWriter.java, Filewriter.java, and filewriter.java (which differ only in the case of one or two characters) would all be treated as the same file in a Windows filesystem. On Unix, and most other operating systems, they would be three different files. This becomes a problem when a Windows user tries to extract these files from a Unix server; it’s not clear which file the Windows user will finally see, since the three filenames may be identical in their local filesystem. It should be noted that the same problem occurs with tools such as FTP and with shared filesystems such as NFS. The most obvious solution is to use names that are unique on case-insensitive filesystems.
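Names that collide on a case-insensitive filesystem can be detected by folding every path to lowercase and looking for duplicates. This is a minimal sketch; case_collisions is a hypothetical name, and it reads the list of paths from standard input so that it can be fed from find or from any other listing.

```shell
#!/bin/bash
# Read one path per line and print each lowercase-folded path that
# occurs more than once. case_collisions is a hypothetical helper.
case_collisions() {
  tr '[:upper:]' '[:lower:]' | sort | uniq -d
}

# Typical use on a Unix machine:
#   find . -print | case_collisions
```

Any line that this prints is a name that a Windows user will not be able to check out reliably.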

In general, avoid using the name or abbreviated name of the SCM tool as a filename or directory name. A particularly unpleasant problem can occur if you are working in Unix and are using CVS to store information about CVS—for example, some documents about how you configured CVS for your environment. You won’t be permitted to create a subdirectory named CVS, because one already exists as part of how CVS works. However, you can create a subdirectory named cvs, because cvs is a different directory name from CVS in the Unix filesystem. Unpleasant surprises are now in store for anyone who tries to check out the subdirectory to a Windows system. The cvs directory will interfere with the CVS directory that is used by CVS. My suggestion here is to call the subdirectory scm.

Some more general advice about the naming of files and directories in a project:

  • When naming directories, make sure their names start with different characters. Then completing their names will be easier when using a shell prompt at the command line.

  • Use common prefixes for the names of files within the same directory. The extra information can give you more of an idea about where to find the file.

  • Don’t reuse directory names that are significant in your operating system (e.g., sys in Unix and system in Windows). It’s confusing, and one day some tool will pick up files from the wrong sys directory and you may not even realize it.

  • Avoid embedding version numbers into the name of a file or directory that’s managed using an SCM tool—tracking versions is what the SCM tool is for! Put a version into a filename only if there is an occasion when multiple versions might be used at the same time.

Backups and SCM

SCM tools behave like backups for their users’ files, but it is good to remember that unless the SCM tool’s own data is properly backed up, the users’ files are no better protected than if the users had just copied their files over to another machine. Backups of an SCM tool’s data serve at least three purposes:

Disaster recovery

That is, being prepared for “The SCM server just crashed and it won’t come back up!”

Corruption detection

By comparing the files or database contents in backups

Intrusion detection

By tracking all the changes that have been made from backup to backup

Standard server backup practices can usually be followed for SCM servers. If necessary, quiesce or shut down the server, export the data from the database or copy the files, compress, encrypt, and uniquely identify the backup files, and archive them off site on permanent media. As with any backup strategy, all this effort is wasted if you don’t periodically test that the SCM server can be recreated using a recent backup. Keeping one or more identical SCM servers on standby is useful both for testing recovery of backups and for periodic maintenance. Personally, I like to make my own nightly backups to CD and DVD for all the SCM data that I am responsible for, and then have an IT department also back up the SCM machines. One place to read more about basic backup and recovery best practices is Chapter 11 of Essential System Administration, by Æleen Frisch (O’Reilly).
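A cheap first check, well short of a full recovery test, is to confirm that each backup file can at least be read from end to end. The sketch below assumes gzip-compressed tarballs like the ones described later for CVS; verify_backup is a hypothetical name.

```shell
#!/bin/bash
# Return success only if the file is a readable gzip'd tarball.
# verify_backup is a hypothetical helper. It does not prove that a
# server can be rebuilt from the archive, only that the archive is intact.
verify_backup() {
  # gzip -t checks the integrity of the compressed stream;
  # tar -tzf then walks every header in the archive without extracting.
  gzip -t "$1" 2> /dev/null && tar -tzf "$1" > /dev/null 2>&1
}
```

A nightly job could run verify_backup on the file it has just written and send mail if the check fails.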

The backup files’ size can vary quite erratically due to compression artifacts, but the total size of the files always grows every few days, since version control systems can’t discard information if they are to reconstruct the past correctly. Large unexpected changes in the size of consecutive backups can occur and are worth investigating, usually by comparing the contents of the different backups.
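Watching for unexpected size changes is easy to automate. The following sketch prints the percentage change between two consecutive backup files; size_change_percent is a hypothetical name, and what counts as a change worth investigating is left for you to decide.

```shell
#!/bin/bash
# Print the percentage change in size from an older backup to a newer one.
# size_change_percent is a hypothetical helper. Positive means growth,
# negative means shrinkage; integer arithmetic truncates toward zero.
size_change_percent() {
  local old new
  # The $(( )) wrapper strips any padding that wc adds on some systems.
  old=$(( $(wc -c < "$1") ))
  new=$(( $(wc -c < "$2") ))
  if [ "$old" -eq 0 ]; then
    echo "older backup is empty" >&2
    return 1
  fi
  echo $(( (new - old) * 100 / old ))
}
```

Since a repository should only grow, any negative number is worth a closer look.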

What happens in the worst case, if you lose all your SCM data? If you’re lucky, someone will have a recent copy of the files on her local machine. You can recreate the recent state of the project by adding these files back into the SCM tool. For this reason, it’s a good idea to regularly check out the entire contents of the repository onto at least one machine. Automated builds have to do this regularly anyway.

Backing up CVS

Example 4-1 shows a script that can be used on a locked repository to create a gzip’d tarball of the repository. The backup file should be copied to another machine after it has been created. On a Unix server, this kind of script is typically set up to run nightly, using a cron job. Scripts used to back up CVS repositories should expect to encounter filenames with spaces in them.

Example 4-1. A shell script for backing up a CVS repository
#!/bin/bash
#
# Backup a CVS repository to a gzipped tarball. Also generate output
# describing what has changed since the last backup.
#

# The root of the local CVS repository, the one to be backed up
CVSROOT=/usr/local/cvs

# The uniquely-identified backup filename
backup_home=/backups
backup_file=${backup_home}/cvs_backup_`date +"%m%d%Y.tgz"`

# Record what has changed between each consecutive backup
cd "${CVSROOT}" || exit 1
if [ -f "${backup_home}/du.today" ]
then
  mv "${backup_home}/du.today" "${backup_home}/du.yesterday"
fi
# sort -k 2 orders by pathname; it replaces the obsolete "sort +1" syntax
du -k [A-Za-z0-9]* | sort -k 2 > "${backup_home}/du.today"
diff -N "${backup_home}/du.yesterday" "${backup_home}/du.today"

# Create a list of all the files in the repository.  Note that only
# files whose _full_ name starts with [A-Za-z0-9] are matched.  Make
# sure that empty directories and soft links are handled correctly
# (find -type f loses both of these).
repos_filelist=/tmp/all_files.$$
find [A-Za-z0-9]* -not -type l -print > ${repos_filelist}

# You could also use grep -v here to select portions of the
# repository, and you may want to add this script to the list of files
# that are backed up.
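# --no-recursion is given first because newer versions of GNU tar apply
# it only to the names that follow it on the command line.
tar --no-recursion --files-from "${repos_filelist}" -czf "${backup_file}"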
chmod ogu-w ${backup_file}

# Clean up
rm -f ${repos_filelist}

# And copy the backup file to another machine ...

The source to CVS contains a useful script in the contrib directory named validate_repo.pl, also known as check_cvs in earlier versions. This script can be run nightly to confirm that the repository has not been corrupted in any obvious way.

SCM Tools

The seven different SCM tools examined in this section are a mixture of closed and open source software. There are noticeably more usable SCM tools available than build tools (see Build Tools), and there are certainly more tools available from commercial organizations.

What should you look for in an SCM tool? Beyond the basic saving and retrieving of different versions of files, I suggest, in order of importance:

  1. Confidence in the integrity of your data

  2. Fast and simple creation of tags, extraction of tagged files, and generation of diffs

  3. Good support for branching and merging, ideally with both command-line and graphical interfaces

  4. Integration with other existing tools such as bug tracking systems

  5. A good web interface to let people browse the different versions of their files and also to search through earlier versions of the files

  6. Good support from the tool vendor or the tool’s community

Comparison of SCM Tools, later in this chapter, summarizes the major differences between the tools discussed in this chapter.

CVS

CVS (http://www.cvshome.org) is by far the most commonly used open source SCM tool. The CIA project (http://cia.navi.cx), which tracks commits from hundreds of open source projects, shows that 70% of their commits come from projects using CVS. Many of the terms used by CVS, such as commit and check out, have become de facto terms used by other SCM tools. Other SCM tools such as Subversion and Arch are careful to provide a “Migration Guide for CVS Users” document and tools. CVS is licensed under the GPL.

CVS is most commonly used over a network, with a single Unix or Windows-based server providing the repository, though some partial support for distributed servers was added with Version 1.12.10. Developers use CVS clients to check out a sandbox, which is their local working copy of the files under control of CVS. Different developers can check out the same files at the same time, since the C in CVS stands for concurrent. The opposite is true with SCM tools such as Visual SourceSafe, which let only one person at a time work on each file; this becomes a bottleneck even with medium-sized projects. After making changes, developers check the files in to the repository, along with some text comments about the changes. The first person to commit her changes forces the other developers to update their files before they can commit. CVS doesn’t care how long you take between checkout and commit. CVS logs are available for each file, and these logs describe all the checkins for that file. CVS supports branches and tags, and also provides some basic assistance for merges.

The CVS project uses the GNU Autotools suite (see GNU Autotools) to build executables for DEC Alpha, Cray, HP-UX, Solaris, GNU/Linux, FreeBSD, NetBSD, IRIX, OS/2, Windows, Mac OS X, and VMS, among others (see the file INSTALL in the CVS source for the complete list). The CVS source also includes an extensive set of unit tests known as the “sanity checks.” CVSNT (http://www.cvsnt.org) is a well-established fork of CVS taken in 1999 by Tony Hoyle, originally to add native support for Windows NT to CVS, but the two products still interoperate well. Features that have been added to CVSNT include better support for Unicode, ACLs, and Windows authentication. WinCVS and MacCVS, which are popular GUIs for using CVS on Windows and Macintoshes, respectively, both use CVSNT under the covers.

For many years, the best documentation for CVS was “the Cederqvist,” also known formally as “Version Management with CVS” (https://www.cvshome.org/docs/manual), an online manual written by Per Cederqvist that extends the manpage written by Roland Pesch and the FAQ maintained by David G. Grubbs. While the Cederqvist is still useful, and has even been published as a book by Network Theory (http://www.network-theory.co.uk), there are now a number of other good books about CVS. The best ones are Essential CVS, by Jennifer Vesperman (O’Reilly); Open Source Development with CVS, by Moshe Bar and Karl Fogel (Paraglyph), which is also available online at http://cvsbook.red-bean.com; and Pragmatic Version Control Using CVS, by Dave Thomas and Andy Hunt (Pragmatic Bookshelf). There are also numerous how-to documents and tutorials all over the Internet, with particularly good ones at http://en.wikipedia.org/wiki/Concurrent_Versions_System and http://www.devguy.com/fp/cfgmgmt/cvs.

The biggest strength of CVS is that many developers are already familiar with it. It does scale well with reasonably large projects (hundreds of users, thousands of files, millions of lines of code) and large file sizes (tens of megabytes), though the time to tag files increases linearly with the number of files and their sizes. CVS is simple to set up and maintain; most CVS servers have the longest uptimes of any machine in a company. It’s secure against casual attacks, though it has been cracked in the past (see Security, earlier in this chapter).

Since CVS is both open source and mature, there are also dozens of separate tools to add extra functionality to CVS. A few of the most useful are:

ACLs

These allow you to control who can commit files, according to the user, the branch, and the directory name. The cvs_acls script from the contrib directory of the CVS source and the patches from http://cvsacl.sourceforge.net are examples of such add-ons.

Browsing CVS files

For web-based viewing of repositories, the Python-based ViewCVS interface (http://viewcvs.sourceforge.net) is excellent; it also supports browsing of Subversion repositories.

Graphical CVS clients

There are a number of graphical CVS clients in common use, and they all hide some of the details of the CVS command line. The oldest one is WinCVS (http://www.wincvs.org). TortoiseCVS (http://www.tortoisecvs.org) is well integrated with the Windows filesystem browser. My current favorite graphical CVS client is SmartCVS (http://www.smartcvs.com) because it runs on any platform with a JVM and provides all the add-ons of the other clients by default.

Commit email

The activitymail Perl script, available from https://activitymail.cvshome.org, has a large number of choices for sending email about commits. One particularly useful addition to email is to include links to a web-based view of the files’ changes.

Change logs

The cvs2cl Perl script from http://www.red-bean.com/cvs2cl can generate change logs in HTML or XML. These change logs comply with the GNU standard for change logs, which is part of the coding standards at http://www.gnu.org/prep/standards/standards.html#Change-Logs. They can also act as a collection of “poor man’s changesets” for CVS, and you can generate scripts to revert complete changesets or merge them to other branches.

Changesets

CVSps (http://www.cobite.com/cvsps) generates changesets from individual commits to a CVS repository.

Local changes

cvsdelta (http://directory.fsf.org/cvsdelta.html) creates summaries of what has changed locally in your sandbox.

Clients for CVS have been written in Java, Tcl, and C++. Most modern IDEs and many bug tracking systems have some level of integration with CVS. CVS is still the default SCM tool for many preconstructed environments, including SourceForge, which is probably the largest CVS user in the world. (The GNU project may have the largest single CVS repository.) Other products that tie all this extra information into one convenient web site for your project are the excellent FishEye (http://www.cenqua.com) and the open source CVS Monitor project (http://ali.as/devel/cvsmonitor).

The weaknesses of CVS in many ways reflect the fact that it evolved, rather than being designed as a whole. Interactions with a CVS server are atomic on only a per-directory basis, not per transaction. So if you update your local sandbox at the same time that another developer is checking in his changes, you may get only some of his changes. Alternatively, if something nasty happens to the CVS server during a commit, your commit may fail, with some files changed but with others unchanged. Try hitting Ctrl-C sometime during a CVS commit and then see which files were committed and which ones weren’t. (Don’t worry—another commit will catch the files that were missed by the first one.) When you create a tag, CVS doesn’t let you record a message with a description of why the tag was created. Renaming a file causes a break in the recorded history of that file. Changing the name of a directory requires intervention in the repository by the CVS administrator and may not always be possible, so choose your directory names and hierarchy very carefully.

Living with branches and merging in CVS is somewhat of a headache, as described earlier in this chapter in Branches and Tags and When to Branch? When to Tag?; you should always tag CVS branches before merging from them. Using CVS to keep track of source code from a third party by importing it into your repository is a task to do with a clear head and a written set of notes in front of you. After the import, be careful not to keep working in the files that you imported from; check out a fresh copy instead. Authentication, authorization, and accounting support in CVS is rather rudimentary, and there is no support for an internationalized version of the tool. CVS works best with text files but can handle binary files, albeit inefficiently (and don’t forget to use cvs add -kb to disable keyword substitution, in order to avoid corrupting such nontext files). Once an RCS file in a CVS repository exceeds about 10 versions and 100MB on a server with 1GB RAM, you can expect to see slower checkouts of that file, especially if it is on a branch.

Making your life with CVS easier

This section contains a number of ideas that can make administering more complex installations of CVS easier:

Use modules

The name of what you ask CVS to check out for you is referred to as a module. The top-level directories in your repository are the default modules. The interesting thing about modules is how they can be used to collect different directories from the repository together into a single target for checking out. For instance, if there is a project in the directory projects/projectA and projectA also wants to use files from a directory named common/xml, then entries in the CVSROOT/modules administrative file such as:

# The module named common refers to the top-level directory "common"
common          common
# The module named common_xml refers to the "xml" subdirectory 
# in "common" but it will be named src/xml when checked out
common_xml      -d src/xml    common/xml
# The module named projectA is a combination of the 
# projects/projectA directory and the common/xml directory
projectA        projects/projectA &common_xml

will cause the command cvs co projectA to create a local subdirectory projectA with subdirectories src/xml and the directories from projectA. This kind of indirection is important because it can create different directory structures simply by defining new modules. Be warned, though, that you can’t tell CVS to use one particular version of the modules file, so be careful not to change the module definitions that are needed for older releases of projects. Modules are an aspect of CVS that is often overlooked, perhaps because they seem complicated to configure, but understanding what you can do with them will make your life with CVS much easier.
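When a modules file grows, it is handy to see at a glance which module names it defines. The sketch below reads a checked-out copy of the file; list_modules is a hypothetical name, and the sketch assumes one definition per line, so definitions continued across lines with a backslash are not handled.

```shell
#!/bin/bash
# Print the module names defined in a copy of the CVSROOT/modules file.
# list_modules is a hypothetical helper and assumes one module per line.
list_modules() {
  # Skip blank lines and comment lines; the first field of every other
  # line is the name of the module being defined.
  awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}
```

For the three entries shown above, list_modules prints common, common_xml, and projectA, one per line.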

Avoid symbolic links

The temptation is so strong. You want to move a directory within the source tree and yet somehow preserve the change history of all its files. You know that just moving the directory in the repository will break your ability to go back in time, since CVS doesn’t version directories, only files. But what if you moved the directory anyway and then created a symbolic link (a file that points to another file, also known as a soft link) from the old location to the new one? Yes, it works: developers will see the directory in both the old and new locations, and can commit files in either directory, though locking the directory may not work properly if you configure CVS to use LockDir to keep your locks elsewhere. But what about when the next directory move comes along three months from now? Then you’ll have soft links to soft links, and so on. CVS does not keep track of different versions of soft links, so using soft links within a CVS repository always leads to extra work later on.

Sometimes the idea to use soft links arises from wanting to share a directory between two top-level directories without one group having to check out multiple modules. A better approach is to use alias and ampersand modules, as discussed in the previous item in this list.

Synchronize clocks

It’s good practice, both for CVS and for build tools such as make, to synchronize the clocks on every machine that will use the tools. ntp is the most common synchronization client and server for Unix, and your local time server may well even be named something like ntp.example.com. Windows XP has its own synchronization client, and the Tardis tool works for all earlier versions of Windows.

Know which commands make immediate changes

After using CVS for a while, you may be lulled into believing that nothing you do in your sandbox can affect the rest of your team until you commit the changes. Wrong! CVS commands that modify the repository, apart from the tagging and branching ones (both the local and remote versions), include cvs add directory, which adds a directory immediately, and cvs import, which changes the head of the tree straightaway. (There is a -X argument with more recent versions of the import command to avoid this problem.) To make your life easier, pause to consider before using the tag, add directory, and import commands.

Save the output

When you are creating tags or branches with cvs tag, or merging versions with cvs update -j, or using cvs import, it’s a good idea to save the lengthy output from these commands. Important information—such as existing tags not being moved or the names of files with merge conflicts in them—appears in the output and is not saved anywhere else. If you do lose the output from a command, you may be able to see which files have conflicts by running cvs -n update.
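One way to make saving the output a habit is to run these commands through a small wrapper. This is a sketch only; run_logged is a hypothetical name, and the log filename and tag name in the usage note are placeholders.

```shell
#!/bin/bash
# Run a command, showing its output as usual and also appending it to a log.
# run_logged is a hypothetical helper: run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
  local log=$1
  shift
  # Record the command line itself so the log explains what was run.
  echo "\$ $*" >> "$log"
  # Merge stderr into stdout so that warnings are saved too.
  "$@" 2>&1 | tee -a "$log"
  # Report the exit status of the command, not of tee.
  return "${PIPESTATUS[0]}"
}
```

For example, run_logged tag.log cvs tag RELEASE_1_0 leaves a complete record in tag.log that can be searched later for files that were not tagged.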

Be careful with top-level directories

Since renaming directories and moving them around is hard to do well with CVS, some CVS administrators find it helpful to keep all project directories under a single top-level directory. When the time comes to change the directory structure of the project, they can create a new top-level directory and copy the subdirectories into that. One problem with this approach is that it’s now more complicated to merge changes into both the old and new top-level directory structures. The neater approach to this problem is to define a module per project and then have the module refer to the directories that make up the project.

Some CVS administrators also find it convenient to make the top-level directory in their repository unwritable by people who aren’t also CVS administrators, so that accidental imports don’t leave their mistakes there. This does mean that new top-level directories have to be created by a CVS administrator.

Avoid keywords and strings that complicate merges

CVS has some convenient keywords such as $Date$ and $Id$ that are automatically expanded during commits to the current date or other information about the file. Unfortunately, when merging files from one branch to another, CVS does not treat the expanded versions of these variables as special, and merges can end up with hundreds of conflicts to be resolved by hand, where most of them are just changes in the date a file was modified. Many people avoid using these keywords and rely on cvs log for the same information. Still, the $Id$ keyword can be useful if you suspect that releases might escape without their source code being tagged.

Another tip to make merges easier is to avoid using the strings <<<<< and >>>>> in your files. These strings are inserted by CVS to mark conflicts in merged files.
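After a large merge, it is worth searching for conflict markers that were left behind. The sketch below assumes the usual form of the markers, a run of seven < or > characters at the start of a line followed by a space, and a grep that supports recursive searches with -r; find_conflict_markers is a hypothetical name.

```shell
#!/bin/bash
# List the files under a directory that still contain conflict markers.
# find_conflict_markers is a hypothetical helper.
find_conflict_markers() {
  # -r searches recursively, -l prints only the names of matching files,
  # and -E enables the alternation in the pattern.
  grep -rlE '^(<<<<<<<|>>>>>>>) ' "$1"
}
```

Running find_conflict_markers . in a sandbox just before a commit catches files where a conflict was overlooked.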

Beware of unexpected shell expansions

If the cvs commit command is used with the -m "some comment here" argument to make a comment about a commit, then shell characters in the comment are expanded. So a comment such as cvs commit -m "Changed the default $PATH value" will have $PATH replaced by its value in the current shell, and the commit message will end up looking something like “Changed the default /usr/local/bin:/usr/bin:/bin value” in your logs. This doesn’t happen if you use an editor to add the comment or if you use single quotes instead of double quotes.
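The effect is easy to reproduce without touching a repository. In this sketch, DEMO_PATH is a made-up variable standing in for $PATH so that the result is predictable.

```shell
#!/bin/bash
# DEMO_PATH is a stand-in for $PATH, chosen so the output is predictable.
DEMO_PATH=/usr/local/bin:/usr/bin:/bin

# Double quotes: the shell substitutes the variable before the command runs.
double_quoted="Changed the default $DEMO_PATH value"

# Single quotes: the text is passed through exactly as typed.
single_quoted='Changed the default $DEMO_PATH value'

echo "$double_quoted"
echo "$single_quoted"
```

The first line printed contains the expanded directory list, and the second contains the literal text $DEMO_PATH, which is what you want to see in a commit message.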

Change your shell prompt

When you have lots of different branches checked out in different sandboxes, it’s easy to forget which one you’re working on. Obviously, naming your local directory something suggestive helps, but you can also add the branch name to your shell prompt and even change the color of the cursor. The following incantation does this for the bash shell: just replace _branch with some text that appears in your branch names. Other shells have similar abilities.

PS1="[\u@\h\$(\
if [ -d CVS ]; then \
  if [ -e CVS/Tag ]; then \
    cat CVS/Tag | sed -e 's/^T/ /' | sed -e 's/^N/ /' \
    | sed -e 's/^D/ Date /' | sed -e 's/_branch/\[\033]12;blue\007\]/'; \
  else \
    echo ' \[\033]12;black\007\]MAIN' ; \
  fi; \
else \
  echo '\[\033]12;black\007\]' ; \
fi) \W]\\$ "

Avoid empty directories

You can create empty directories in your CVS repository, and when you check out a tree, the directories will appear as you would expect. There is a handy -P argument to cvs update to remove, or prune, empty directories. However, if you check out a tagged version of your tree, the empty directories are automatically pruned, and you have to run cvs update -d to get them back. The easiest thing to do is avoid empty directories in your source tree and instead create them as needed with your build tool. Adding empty dummy files is an ugly workaround.
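To find out whether a tree contains any empty directories, a one-line search is usually enough. This sketch assumes a find that supports the -empty test, as the GNU and BSD versions do; list_empty_dirs is a hypothetical name. Point it at an exported copy of the source tree rather than at a sandbox, because every directory in a sandbox contains a CVS subdirectory and so never appears empty.

```shell
#!/bin/bash
# Print every directory under the given tree that has no entries at all.
# list_empty_dirs is a hypothetical helper; -empty is a find extension
# provided by GNU and BSD find.
list_empty_dirs() {
  find "$1" -type d -empty -print
}
```

Each directory listed is a candidate for removal from the source tree and for creation by the build tool instead.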

Tag CVSROOT too

When you tag some files for a release, don’t forget to tag the files in CVSROOT too. These files describe how CVS is configured and can change over time. If you want to know which directories a particular module represented at the time of a release, this will help.

CVS is the default choice for SCM for many open source and commercial projects. It is also the base standard by which other SCM tools, both commercial and open source, are measured. Subversion (described in the next section) is designed to be a replacement for CVS, but it will be a long time, if ever, before CVS goes away.

Subversion

Subversion (http://subversion.tigris.org) is an open source SCM tool designed as a “compelling replacement for CVS.” Subversion development has been partially funded by CollabNet (http://www.collabnet.com), a commercial PDE discussed in CollabNet. Subversion is released under the Apache Software Foundation license, with CollabNet given as the copyright holder.

Subversion (also known as SVN) really is like CVS 2.0. Even typing the main command svn feels somehow similar to typing cvs. Even apart from the fact that Subversion has an order of magnitude more code, there are substantial differences between Subversion and CVS under the hood, including a default Berkeley DB database backend rather than the flat-file RCS format used by CVS. (A filesystem backend called FSFS is also available.) However, the basic client/server model used by CVS is unchanged, and you still check files out, edit them, update, and commit them.

While using a Subversion client is as easy as using a CVS client, configuring a Subversion server can be a little harder. The default network protocol used to connect a Subversion server and its clients is based on an extension to HTTP that is called WebDAV. If you already have an Apache web server running on your Subversion server machine, you can configure it to use WebDAV and then install and configure the Berkeley DB database. Alternatively, you can use the svnserve executable, which is much more like CVS’s cvs server process in concept.

The major changes in Subversion compared with CVS are:

Renaming directories and files

Directories are now versioned, just like files. You can rename directories and files and still follow their commit history.

Atomic operations

All Subversion operations either succeed fully, or fail with no changes made to the repository.

Versioned metadata

Every file and directory can have arbitrary information (metadata) associated with it as key/value pairs, and this information is versioned. Recording files’ owners, ACLs, and any other information needed for specific sites can be implemented using this mechanism.

Full support for binary files

Subversion is designed to handle both binary and text files, and to do so much more efficiently than CVS does.

Cheaper branching and tagging

The cost of branching and tagging need not increase with the project size.
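To make the "atomic operations" item above concrete, here is a minimal sketch in Python of an all-or-nothing commit. The ToyRepository class, the CommitError exception, and the (action, path, content) change format are all hypothetical names invented for this illustration; this is an in-memory model of the idea, not a description of how Subversion stores data.

```python
import copy


class CommitError(Exception):
    """Raised when any change in a commit cannot be applied."""


class ToyRepository:
    """In-memory stand-in for a repository (hypothetical, for illustration)."""

    def __init__(self):
        self.revisions = [{}]  # revision 0 is an empty tree

    @property
    def head(self):
        return len(self.revisions) - 1

    def commit(self, changes):
        """Apply every (action, path, content) change, or none of them."""
        tree = copy.deepcopy(self.revisions[-1])  # work on a private copy
        for action, path, content in changes:
            if action == "add":
                if path in tree:
                    raise CommitError(f"{path} already exists")
                tree[path] = content
            elif action == "modify":
                if path not in tree:
                    raise CommitError(f"{path} does not exist")
                tree[path] = content
            elif action == "delete":
                if path not in tree:
                    raise CommitError(f"{path} does not exist")
                del tree[path]
            else:
                raise CommitError(f"unknown action {action!r}")
        # Only now does the new tree become visible as a revision.
        self.revisions.append(tree)
        return self.head
```

If any single change fails, the exception is raised before the new tree is appended, so the repository still shows the previous revision and nothing is left half-applied.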

Subversion can run on most Unix versions, Windows (Windows 2000 or later for the server), and Mac OS X. Windows support is native and has always been part of the project. The server limitation on Windows is due to the use of Berkeley DB, which apparently doesn’t run on Windows 95, 98, or ME. Using the FSFS filesystem backend should remove this limitation.

A number of tools to convert data from many other SCM tools to Subversion have been developed as part of the product. The script cvs2svn is one such useful tool; it converts existing CVS repositories to Subversion repositories. Some Apache projects have converted some of their repositories to Subversion, and GCC is in the process of doing so.

One of the most remarkable things about Subversion has been just how many other projects have sprung up around it, integrating it into existing IDEs and extending existing tools to support it. Even the effort to provide internationalized versions has been impressive. For web-based viewing of repositories, the Python-based ViewCVS (http://viewcvs.sourceforge.net) also supports browsing of Subversion repositories. TortoiseSVN (http://www.tortoisesvn.org) is one graphical client for Subversion that is well integrated with the Windows filesystem browser. Another graphical client for Subversion that can be used on Windows, Linux, and Macintosh machines is SmartSVN (http://www.smartsvn.com).

Development of all these supporting tools for Subversion has been made easier by clear documentation from the beginning of the project. One of the main sources of information is the book Version Control with Subversion, by Ben Collins-Sussman, Brian W. Fitzpatrick, and C. Michael Pilato (O’Reilly), which is also available online at http://svnbook.red-bean.com. Other books about Subversion include Practical Subversion, by Garrett Rooney (Apress), which is aimed more at SCM administrators; Pragmatic Version Control Using Subversion, by Mike Mason (Pragmatic Bookshelf); and Subversion in Action, by Jeffrey Machols (Manning). Another useful source of information and discussion about Subversion is the Subversionary web site at http://www.subversionary.org.

However, Subversion has limited support for ACLs and the cvs2svn script may have some difficulties handling complex branching schemes. The known bugs in Subversion are publicly available at the Subversion home page. Subversion still has plenty of room left to grow, with a number of ideas already scheduled for later releases. One such idea is the ability to track who is editing which files. Another is the ability to lock files so only one person can edit them at a time.

In summary, Subversion set out to build a replacement for CVS while keeping its familiar parts, and for the most part it has succeeded. Expect to see Subversion become the other choice for public SCM tools in PDEs like SourceForge and the Apache Project. CollabNet already uses Subversion as the underlying SCM tool in its PDE product, and more companies are likely to follow.

Arch

Arch (http://www.gnu.org/software/gnu-arch) is a distributed open source SCM tool, as opposed to the centralized servers of CVS and Subversion. It’s designed to scale to tens of thousands of users, in the same way that peer-to-peer (P2P) tools such as BitTorrent have scaled well for distributing large files. Arch is licensed under the GNU General Public License. Note that Arch is still changing, and the version discussed here is tla-1.3, released in December 2004.

At its simplest, using Arch is like having a repository on your own machine, one that you can make commits to, branch, and generally rearrange as you wish, even on your laptop on an airplane. Then you synchronize from other repositories when you want, and they can accept your changes at their discretion.

Arch is carefully designed to minimize server-side work, so that it can scale well. It assumes that disk space is cheap and that network communication is the most costly operation. Just like Subversion, Arch provides atomic commits across entire source trees. Practically any shared resource such as a directory, FTP server, or web server can be used as an Arch server. Different versions of the metadata such as tags are stored, in addition to the versions of the files. Arch keeps track of file and directory rename operations by using unique identifiers for everything; these don’t change, even when the name of a directory changes.
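The idea of following a file through renames by giving it a permanent identifier can be sketched in a few lines of Python. The Inventory class, the history_for function, and the identifier strings are hypothetical names chosen for this illustration; Arch's own inventory format is not shown here.

```python
class Inventory:
    """Maps permanent file identifiers to current paths (a toy model)."""

    def __init__(self):
        self._path_by_id = {}

    def add(self, file_id, path):
        self._path_by_id[file_id] = path

    def rename(self, file_id, new_path):
        # The identifier stays the same; only the path changes.
        self._path_by_id[file_id] = new_path

    def path_of(self, file_id):
        return self._path_by_id[file_id]


def history_for(file_id, log):
    """Collect log messages that touched file_id, whatever its name was."""
    return [message for ids, message in log if file_id in ids]


inventory = Inventory()
inventory.add("id-0001", "src/parser.c")
log = [({"id-0001"}, "initial import")]

inventory.rename("id-0001", "src/frontend/parser.c")
log.append(({"id-0001"}, "moved parser into frontend"))
```

Because the log is keyed by identifier rather than by path, asking for the history of "id-0001" returns both entries even though the file now lives under a different name.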

Changesets are a key part of Arch and use the familiar diff format, at least for text files. The unique identifiers for each file make it possible to automatically patch files, even when their names have changed. Arch also remembers which changesets have already been applied, so the potential multiple-merge problems of CVS can be avoided. The default format used for storing files and changesets is simple in the extreme—compressed tarballs and a file formatted exactly like an email message. These tarballs have checksums and can also be cryptographically signed to help ensure their integrity. The simple format means that only a few commonly available tools are required for Arch to work properly after installation.
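Two of the points above, checking a changeset's checksum and remembering which changesets have already been applied, can be sketched as follows. This is a toy model under stated assumptions: the Tree class and its apply method are hypothetical, and SHA-256 is simply a convenient choice for the example, not necessarily the checksum that Arch uses.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a hex digest used to detect corrupted changeset archives."""
    return hashlib.sha256(data).hexdigest()


class Tree:
    """A working tree that remembers which changesets it has absorbed."""

    def __init__(self):
        self.applied = []  # changeset names, in the order they were applied

    def apply(self, name, archive, expected_checksum):
        """Apply a changeset once; reject corrupt data, skip repeats."""
        if checksum(archive) != expected_checksum:
            raise ValueError(f"checksum mismatch for {name}")
        if name in self.applied:
            return False  # already merged, so applying again is a no-op
        self.applied.append(name)  # a real tool would patch files here
        return True
```

Keeping the list of applied changesets is what lets a later merge skip work that has already been done instead of applying the same change twice.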

Arch is known to work on GNU/Linux, FreeBSD, NetBSD, AIX, and Solaris. Portability to Windows is planned for the near future, but the main focus for Arch still seems to be Unix-based platforms. Other versions of Arch have been written in languages other than C, but tla by Tom Lord seems to be the most commonly used version of Arch.

Currently, the best sources of documentation on Arch are the “Arch Meets Hello World” tutorial at http://www.gnu.org/software/gnu-arch/tutorial/arch.html and the ever-changing Wiki at http://wiki.gnuarch.org. Documentation of the rather large number of Arch commands (over a hundred) is terse, which contributes to the generally steep learning curve for Arch.

Like any newer product, Arch has its rough edges. When it was evaluated in April 2005 for use with the Linux kernel, it was felt to be too slow for such a large project. Some people feel that the filenames used to refer to particular versions are too long to type comfortably, and that the choice of special characters in the names clashes awkwardly with the same characters used by common shells such as bash and also tools such as vi and vim. Arch has not yet been internationalized, though a fork of it named ArX has been. Other problem areas, which may or may not have been fixed by the time you read this, include the lack of symbolic links, the lack of file permissions (for controlling access), spaces not being allowed in filenames, and some Unix/Windows end-of-line formatting problems. One issue that is unlikely to have changed is that Arch developers can seem arrogant in their zeal for their project.

Arch is the best open source example of a trend in SCM tools toward tools that are distributed, rather than centralized on a single server. The emphasis on changes to a project’s source code being seen as a collection of separate changesets is also a distinct trend in all modern SCM tools. In terms of development, Arch is roughly where CVS was 10 years ago: definitely usable for noncritical projects, but rough around the edges, particularly with regard to ease of use and documentation. Still, it has the backing that comes with being an official GNU project, and if development continues as it has, Arch could be a strong contender among open source SCM tools.

Perforce

Perforce (http://www.perforce.com) is a commercial SCM tool, currently licensed for around $750 per user, which includes a year of support. There are a range of licensing options, including free use for open source projects.

Perforce, also known as P4, is a modern, centralized, fully networked SCM tool. It provides atomic commits across entire depots (repositories) and supports branching and merging well, including automatically tracking when files were merged. Concurrent access to multiple files is the normal way of using Perforce, but unlike CVS, Perforce also keeps track of who is editing each file. Depots store binary files as compressed files and use an RCS-like format for text files. Metadata about the files and changelists (changesets), such as branch information and associated bugs, are stored in a separate, proprietary, journaled database. Backups of Perforce server depots can be made without stopping the server from being used, and no separate licensing server is used, which also reduces administrative work.

Perforce is supported on a wide variety of platforms, including almost all recent Unixes; Windows NT, 2000, and later; Macintosh Classic and Mac OS X; and VMS. Windows 95 and 98 are not supported for Perforce servers. Dozens of other platforms are supported for Perforce clients. APIs to use Perforce as part of an application exist for C, C++, Java, Perl, and Python, among other languages.

Documentation for Perforce is extensive and of good quality. All the documentation is freely downloadable in convenient file formats from the company’s web site. Judging by comments in newsgroups and weblogs and from what I’ve heard through other sources, the product support team at Perforce is excellent. Training and other consulting services are readily available.

Perforce has been carefully designed to scale well as projects grow. For instance, tagging and branching operations are fast, taking much less than the linear time seen with CVS. The Perforce web page http://www.perforce.com/perforce/reviews.html provides some useful comparisons of various SCM tools and tells how each one scales as a project grows.

Like any SCM tool that uses a database, Perforce requires attention to maintenance. Disk space allocation and tuning procedures are well documented in the Perforce System Administrator’s Guide. Integrity-checking tools are provided to guard against database corruption. Renaming directories and files is a two-step process, but the history of each step is retained. Files on the client machine are read-only until the user tells Perforce that she wants to edit them. This can be awkward if you are working offline, or if an external application wants to write temporary changes to files that are stored in Perforce.

In summary, Perforce is similar in architecture to CVS but has stronger functionality and is much faster. The product is mature and well supported, and there are numerous tools that extend or integrate Perforce in various customized ways. Perforce is a good choice for larger groups of developers, especially within a company with the resources to administer it properly.

BitKeeper

BitKeeper (http://www.bitkeeper.com) is a commercial SCM product from BitMover. BitKeeper is licensed per person who modifies files, and licenses can either be purchased for around $1,750 or leased for around a third of the purchase cost. There is also a different license for using BitKeeper at no cost. The version described here is 3.2.3, released in August 2004.

BitKeeper, also known as BK, is a modern, distributed SCM tool, complete with atomic operations, changesets, file metadata, strong support for branching and merging, and a web-based graphical interface. Since BitKeeper is fully distributed, it has no central point of failure and it scales extremely well. It also helps that the bandwidth requirements for most common BitKeeper actions are relatively small. Every developer effectively has a copy of the repository on his machine, which makes working with the proverbial laptop on an airplane easy. You can make your local changes available (a push) using a wide variety of protocols from SSH to HTTP, or even using email.
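At its core, a push between two distributed repositories amounts to sending the changesets that the other side does not yet have. The sketch below models each repository as a plain Python list of changeset names; the push function and the names in it are hypothetical, and nothing here reflects BitKeeper's actual protocol or data format.

```python
def push(local, remote):
    """Copy changesets that remote lacks from local; return what was sent."""
    known = set(remote)
    missing = [cs for cs in local if cs not in known]
    remote.extend(missing)  # preserve the order in which they were made
    return missing


laptop = ["cs-1", "cs-2", "cs-3", "cs-4"]  # two commits made on the airplane
shared = ["cs-1", "cs-2"]
sent = push(laptop, shared)
```

Only the missing changesets travel over the network, which is one reason the bandwidth needed for common operations can stay small.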

BitKeeper handles all the complexity of pushing the changes in a local repository out to other developers’ repositories. Renaming of files is handled well, including the tricky problem of two developers renaming the same file at the same time. You can add different comments to different files in a changeset, which is sometimes useful. The data format used by BitKeeper is based on SCCS, the original Unix SCM tool created by Marc Rochkind in 1972. SCCS files include checksums to help avoid corrupted data.

BitKeeper runs on most modern Unixes, Mac OS X, and Windows 98 and later releases. There is a long-standing offer from BitMover to support any platform for a sale of over 50 licenses, provided that the platform is POSIX-compliant and not prohibitively expensive.

Documentation for BitKeeper is good, though the printable versions are available only with the product. Online documentation is extensive, and support is reportedly very responsive. There is a good demonstration of BitKeeper available at http://www.bitkeeper.com/Test.html. There is an open source BitKeeper client available from BitMover (http://www.bitmover.com/bk-client.shar), though this tool only extracts files from repositories. There is also an open source tool called SourcePuller (http://sourceforge.net/projects/sourcepuller) that can interact more generally with BitKeeper. Development of this tool is what led BitMover to withdraw the free version of BitKeeper in 2005.

BitKeeper is an attractive commercial SCM tool. The pricing scheme seems to indicate that BitKeeper is competing against ClearCase and is intended for use by large businesses, while still working closely with the open source community for the good publicity. Being chosen for GNU/Linux kernel development is a strong endorsement for any SCM tool.

ClearCase

ClearCase (http://www.ibm.com/software/rational) is the SCM part of a large change management environment known as the Rational Unified Process. ClearCase is licensed commercially at around $5,000 per developer, though this is negotiated on a per-site basis, and there is a “lite” version available for around $1,250.

ClearCase is unique among the major SCM tools in that it uses a separate, versioned, distributed filesystem on each developer’s machine. Once in this filesystem, you automatically see the chosen versions of the files managed by ClearCase. So you never have to manually update your local copy of a file—the filesystem just makes it appear for you. Alternatively, you can freeze different parts of what you see at particular versions. Developers choose which versions of which sets of files they wish to see by modifying their “configuration specification” file, also known as the “config spec.” These files can build on top of each other, allowing for complicated descriptions of which files you end up actually using.
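The way a config spec chooses which version of each file you see can be illustrated with a small rule matcher. This sketch assumes that rules are tried in order and that the first rule that matches an existing version wins; the select_version function, the glob patterns, and the labels REL_1_0 and LATEST are hypothetical and are not ClearCase syntax.

```python
from fnmatch import fnmatchcase


def select_version(path, rules, available):
    """Return the version of path chosen by the first matching rule.

    rules: ordered list of (glob_pattern, version_label) pairs.
    available: dict mapping path -> set of version labels that exist.
    """
    for pattern, label in rules:
        if fnmatchcase(path, pattern) and label in available.get(path, set()):
            return label
    return None  # no rule selects this file, so it is not visible


rules = [
    ("src/gui/*", "REL_1_0"),  # freeze the GUI at the release label
    ("*", "LATEST"),           # everything else follows the newest version
]
available = {
    "src/gui/main.c": {"REL_1_0", "LATEST"},
    "src/core/util.c": {"LATEST"},
}
```

With these rules, files under src/gui stay frozen at the release label while the rest of the tree tracks the newest version, which is the kind of mixed view described above.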

Warning

If the ClearCase server is unavailable, not only will developers be unable to use the SCM tool, they won’t see the directories containing the ClearCase controlled files. To ensure that the networked filesystem remains available all the time, ClearCase supports redundant servers as well as the ability to distribute source trees across multiple servers.

Directories as well as files are versioned, and the ClearCase filesystem supports soft links. The branching and merging environment provided by ClearCase has good graphical support, and the merge tools seem particularly well liked. The ClearCase make tool, ClearMake, provides extensive information about all generated objects—you can even view the precise command used to generate an object file at any time. ClearCase can also use this information to wink in object files that have already been built, rather like ccache does (see Slow Builds). However, ClearMake is noticeably slower than other versions of make, though the accuracy of dependency checking is much improved. ClearMake can also automatically produce a “bill of materials” (BOM) for a release, listing the specific version of each file used to construct the build. Of course, a BOM is only one part of what is needed to reproduce a release: the tools used and their versions are others.
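A bill of materials is simple to approximate even without ClearMake. The following sketch records a content hash for each file that went into a build; the bill_of_materials function is a hypothetical helper, and the SHA-256 digest stands in for the version identifier that a real SCM tool would record.

```python
import hashlib
from pathlib import Path


def bill_of_materials(paths):
    """Return sorted (path, sha256) pairs for the files used in a build."""
    entries = []
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        entries.append((str(path), digest))
    return sorted(entries)
```

Saving this list alongside each release gives you a way to check later whether a rebuilt tree really contains the same inputs, though as noted above the tools and their versions would still need to be recorded separately.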

ClearCase servers and clients are supported on AIX, HP-UX, IRIX, GNU/Linux, Solaris, and Windows NT, 2000, and later versions.

ClearCase comes with extensive documentation and support from IBM. Two useful books are The Art of ClearCase Deployment: The Secrets to Successful Implementation, by Darren W. Pulsipher and Christian D. Buckley (Addison-Wesley), and Software Configuration Management Strategies and Rational ClearCase: A Practical Introduction, by Brian A. White (Addison-Wesley).

The biggest drawback of ClearCase for many organizations is its cost, both the initial per-seat cost and the cost of the substantial administrative team required to keep ClearCase working. The large amount of administrative work needed to keep ClearCase running properly explains why it is rarely found in smaller companies. ClearCase can use large amounts of disk space on developers’ machines, depending on how it is configured, and places substantial demands on networks. When either of these resources is limited, the performance of ClearCase can become very slow. For small to medium projects, ClearCase is usually seen as overkill.

Visual SourceSafe

Visual SourceSafe (http://msdn.microsoft.com/vstudio/productinfo) is a commercial centralized SCM tool from Microsoft. As of 2005, licenses are available for approximately $500 per seat.

Visual SourceSafe is a centralized SCM tool, usually used in a locking (pinning) manner, where only one developer can change a file at a time. It’s designed to be used almost exclusively on Windows-based platforms by small groups of developers. One of its strengths is its tight integration with Visual Studio and other Microsoft tools. However, it is not unique in that respect, since Perforce, BitKeeper, and ClearCase also integrate well with Visual Studio. Commits are not atomic across a source tree.
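The locking style of working, where only one developer can change a file at a time, can be modeled with a small lock table. The LockTable class and its check_out and check_in methods are hypothetical names for this sketch and do not correspond to the Visual SourceSafe API.

```python
class LockTable:
    """Tracks which developer has each file checked out for editing."""

    def __init__(self):
        self._owner = {}

    def check_out(self, path, user):
        """Grant the lock on path to user, or report who already holds it."""
        holder = self._owner.get(path)
        if holder is not None and holder != user:
            raise PermissionError(f"{path} is locked by {holder}")
        self._owner[path] = user

    def check_in(self, path, user):
        """Release the lock, which only the current holder may do."""
        if self._owner.get(path) != user:
            raise PermissionError(f"{user} does not hold the lock on {path}")
        del self._owner[path]
```

The trade-off is visible even in this toy: a second developer who needs the same file must wait until the first one checks it in, which avoids merges but serializes the work.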

There is one non-Microsoft book about Visual SourceSafe—Essential SourceSafe, by Ted Roche and Larry C. Whipple (Hentzenwerke Publishing)—but it doesn’t cover the subjects that many developers find hard to use, such as branching. In the end, the tool’s own online help and the MSDN library have the largest amount of information about Visual SourceSafe.

Visual SourceSafe is an older product, and frankly, it’s showing its age. You can find some (mostly negative) opinions about it at http://www.highprogrammer.com/alan/windev/sourcesafe.html and http://www.developsense.com/testing/VSSDefects.html, and a more balanced discussion at http://c2.com/cgi/wiki?SourceSafe. You could also pay $99 for a formal report by Forrester (http://www.forrester.com). Some people claim that they have had their stored files corrupted using the tool, while others dismiss these claims. Using branches with Visual Studio projects seems to be more complicated than usual to get right, and performance is never fast enough. Supporting multiple time zones for developers requires other add-on products.

Some of these issues may be addressed in future releases, but I don’t recommend using Visual SourceSafe for any new project. If you are looking for a product that feels like Visual SourceSafe, there is Vault, a commercial SCM tool from SourceGear (http://www.sourcegear.com) that uses the same terminology as Visual SourceSafe but does everything more robustly and over larger networks. There is also a new SCM product from Microsoft, provisionally named Visual Studio 2005 Team System, that’s intended for larger groups of developers than is Visual SourceSafe; it is due for release sometime in late 2005.

Comparison of SCM Tools

Table 4-1 briefly summarizes my opinion of how each of the seven SCM tools described in this chapter matches up to the suggestions at the start of SCM Tools, earlier in this chapter, about what to look for in such a tool. Several such comparisons exist on the Internet—for example, http://better-scm.berlios.de/comparison (which has no comparison of merging and is undated) and http://wiki.gnuarch.org/moin.cgi/SubVersionAndCvsComparison (which is a mutable Wiki). However, when comparing SCM tools using these tables, be careful to choose one that will work for your project; don’t just go by the number of features the tool has. In Table 4-1, a plus sign (+) indicates a strength and a minus sign (-) indicates a relative weakness.

Table 4-1. Comparison of SCM tools

Requirement             CVS  Subversion  Arch  Perforce  BitKeeper  ClearCase  Visual SourceSafe
Data integrity          +    +           +     +         +          -          -
Fast tagging            -    +           +     +         +          +          -
Easy branching/merging  -    +           +     +         +          +          -
Integration             +    +           -     +         +          -          +
Web interface           +    +           -     +         +          +          -
Good support            +    +           -     +         +          +          +
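One way to act on the advice to choose by what your project needs, rather than by feature count, is to encode the ratings from Table 4-1 and filter on the requirements you cannot do without. The sketch below copies the plus and minus ratings from the table; the tools_meeting function and the choice of requirements in the example are illustrative assumptions.

```python
TOOLS = ["CVS", "Subversion", "Arch", "Perforce", "BitKeeper",
         "ClearCase", "Visual SourceSafe"]

# One string per requirement; each character rates the tool at that index.
RATINGS = {
    "Data integrity":         "+++++--",
    "Fast tagging":           "-+++++-",
    "Easy branching/merging": "-+++++-",
    "Integration":            "++-++-+",
    "Web interface":          "++-+++-",
    "Good support":           "++-++++",
}


def tools_meeting(required):
    """Return tools rated '+' for every requirement in required."""
    return [
        tool
        for index, tool in enumerate(TOOLS)
        if all(RATINGS[req][index] == "+" for req in required)
    ]


shortlist = tools_meeting(["Data integrity", "Fast tagging"])
```

For a project that insists on data integrity and fast tagging, the shortlist comes out as Subversion, Arch, Perforce, and BitKeeper; a different set of must-haves gives a different shortlist.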

Wider Uses of SCM

Most of this chapter has been about using SCM tools to control the source code for a product. To have confidence that you can repeat previous releases, you need to control much more than just the source code. Parts of the development environment to consider include all the tools used in the build process, the operating system as configured on the build machine, the test environment and the target operating systems, documentation, and finally how the SCM tool itself was configured at the time of the build. If all that sounds like too much load for your SCM tool, then at least create backups of the various tools and machines and store them somewhere safe off site.

One great use for SCM in your development environment is for people’s personal machines, where the data is perhaps not backed up in any other way. On Unix machines, keeping a copy of each person’s /etc directory and all the dotfiles from the home directory provides easy recovery when a disk fails. On Windows, a copy of each person’s My Documents directory will let you recover key files at some point.
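As a starting point for putting a home directory under SCM, a short script can list the dotfiles that are candidates for check-in. The dotfiles function below is a hypothetical helper; it only looks at the top level of the directory it is given and makes no attempt to filter out caches or secrets, which you would want to exclude before committing anything.

```python
from pathlib import Path


def dotfiles(home):
    """Return the sorted names of top-level dotfiles and dot-directories."""
    return sorted(
        entry.name for entry in Path(home).iterdir()
        if entry.name.startswith(".")
    )
```

Calling dotfiles(Path.home()) gives a list to review by hand before adding the files you care about to a repository.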

Checklist

This section contains a short list of questions that you should feel comfortable answering about how you use an existing SCM tool:

  • What is saved in your SCM system? What is not in your SCM system? Why?

  • What have you overlooked? Often the only time this question is carefully answered is when a hard disk dies and you try to recreate your environment. List all the files, tools, and other pieces of your environment that you use to build a release.

  • Can you still recreate older releases if a file is renamed?

  • Can you still recreate older releases if a directory is renamed?

  • How do you know the date on which a file was branched?

  • How do you know the intended purpose of each branch?

  • Who can change permissions for write and read access to the SCM tool?

  • What happens with your SCM tool if two files in the same directory have the same name, but one is uppercase and one is lowercase? What happens if a filename has spaces in it?

  • How does the backup size of your SCM tool’s files change over time? When will you next fill up a key disk, CD, DVD, or tape?

  • Can you develop on a laptop on an airplane? How much of your SCM tool still works, and how do you resynchronize when you reconnect later on?

  • How would you add a process to your SCM tool—for example, requiring each change to be reviewed by other people?

  • Do you have good integration between your SCM tool and your bug tracking system?

  • How do you decide when to upgrade your SCM tool, whether it’s to fix bugs in the tool or for extra functionality?

  • What is the most common mistake that people using your SCM tool make? How could you help them to avoid doing that?
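The checklist question about uppercase and lowercase filenames, and about spaces in names, can be answered for your own tree before your SCM tool answers it for you. This sketch takes a list of paths and reports the risky ones; the risky_names function is a hypothetical helper, and it compares whole paths, so directories that differ only in case are flagged as well.

```python
from collections import defaultdict


def risky_names(paths):
    """Find names that collide when case is ignored, and names with spaces."""
    by_folded = defaultdict(list)
    for path in paths:
        by_folded[path.casefold()].append(path)
    collisions = sorted(
        sorted(group) for group in by_folded.values() if len(set(group)) > 1
    )
    with_spaces = sorted(path for path in paths if " " in path)
    return collisions, with_spaces


collisions, with_spaces = risky_names([
    "src/Makefile",
    "src/makefile",
    "docs/read me.txt",
    "src/main.c",
])
```

Here the two Makefile spellings are reported as a collision and the documentation file is reported for its space, which are exactly the cases worth trying out on a test repository.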



[1] Three different references that suggest this value are:

[2] Heating any CD or DVD in an ordinary microwave oven for 5 to 10 seconds will both physically destroy the disk and entertain onlookers. My wife and lawyer say, “Don’t try this at home!” but my children say, “Again, Daddy, again!”
