Often, when there is discord between a tool and a project, the developers simply create a new tool. Indeed, in the world of software, the temptation to create new tools can be deceptively easy and inviting. In the face of many existing version control systems, the decision to create another shouldn’t be made casually. However, given a critical need, a bit of insight, and a healthy dose of motivation, forging a new tool can be exactly the right course.
Git, affectionately termed “the information manager from hell” by its creator is such a tool. Although the precise circumstances and timing of its genesis are shrouded in political wrangling within the Linux Kernel community, there is no doubt that what came from that fire is a well-engineered version control system capable of supporting worldwide development of software on a large scale.
Prior to Git, the Linux Kernel was developed using the commercial BitKeeper VCS, which provided sophisticated operations not available in then-current, free software version control systems such as RCS and CVS. However, when the company that owned BitKeeper placed additional restrictions on its “free as in beer” version in the spring of 2005, the Linux community realized that BitKeeper was no longer a viable solution.
Linus looked for alternatives. Eschewing commercial solutions, he studied the free software packages but found the same limitations and flaws that led him to reject them previously. What was wrong with the existing VCS systems? What were the elusive missing features or characteristics that Linus wanted and couldn’t find?
There are many facets to “distributed development,” and Linus wanted a new VCS that would cover most of them. It had to allow parallel as well as independent and simultaneous development in private repositories without the need for constant synchronization with a central repository, which could form a development bottleneck. It had to allow multiple developers in multiple locations even if some of them were offline temporarily.
It isn’t enough just to have a distributed development model. Linus knew that thousands of developers contribute to each Linux release, so any new VCS had to handle a very large number of developers, whether they were working on the same or on different parts of a common project. And the new VCS had to be able to integrate all of their work reliably.
Linus was determined to ensure that a new VCS was fast and efficient. In order to support the sheer volume of update operations that would be made on the Linux Kernel alone, he knew that both individual update operations and network transfer operations would have to be very fast. To save space and thus transfer time, compression and “delta” techniques would be needed. Using a distributed model instead of a centralized model also ensured that network latency would not hinder daily development.
Because Git is a distributed revision control system, it is vital to obtain absolute assurance that data integrity is maintained and is not somehow being altered. How do you know the data hasn’t been altered in transition from one developer to the next, or from one repository to the next? For that matter, how do you know that the data in a Git repository is even what it purports to be?
Git uses a common cryptographic hash function, called Secure Hash Function (SHA1), to name and identify objects within its database. Although perhaps not absolute, in practice it has proven to be solid enough to ensure integrity and trust for all of Git’s distributed repositories.
One of the key aspects of a version control system is knowing who changed files, and if at all possible, why. Git enforces a change log on every commit that changes a file. The information stored in that change log is left up to the developer, project requirements, management, convention, etc. Git ensures that changes will not happen mysteriously to files under version control because there is an accountability trail for all changes.
Git’s repository database contains data objects that are immutable. That is, once they have been created and placed in the database, they cannot be modified. They can be recreated differently, of course, but the original data cannot be altered without consequences. The design of the Git database means that the entire history stored within the version control database is also immutable. Using immutable objects has several advantages, including very quick comparison for equality.
With atomic transactions, a number of different but related changes are performed either all together or not at all. This property ensures that the version control database is not left in a partially changed (and hence possibly corrupted) state while an update or commit is happening. Git implements atomic transactions by recording complete, discrete repository states that cannot be broken down into individual or smaller state changes.
Almost all VCSs can name different genealogies of development within a single project. For instance, one sequence of code changes could be called “development” while another is referred to as “test.” Each version control system can also split a single line of development into multiple lines and then unify, or merge, the disparate threads. As with most VCSs, Git calls a line of development a branch and assigns each branch a name.
Along with branching comes merging. Just as Linus wanted easy branching to foster alternate lines of development, he also wanted to facilitate easy merging of those branches. Because branch merging has often been a painful and difficult operation in version control systems, it would be essential to support clean, fast, easy merging.
So that individual developers needn’t query a centralized repository server for historical revision information, it was essential that each repository have a complete copy of all historical revisions of every file.
Even though end users might not be concerned about a clean internal design, it was important to Linus and ultimately to other Git developers as well. Git’s object model has simple structures that capture fundamental concepts for raw data, directory structure, recording changes, etc. Coupling the object model with a globally unique identifier technique allowed a very clean data model that could be managed in a distributed development environment.
Given a clean slate to create a new VCS, many talented software engineers collaborated and Git was born. Necessity was the mother of invention again!