Git Concepts at Work

With some tenets out of the way, let’s see how all these concepts and components fit together in the repository itself. Let’s create a new repository and inspect the internal files and object store in much greater detail.

Inside the .git directory

To begin, initialize an empty repository using git init and then run find to reveal what’s created:

$ mkdir /tmp/hello
$ cd /tmp/hello
$ git init
Initialized empty Git repository in /tmp/hello/.git/

# List all the files in the current directory
$ find .
.
./.git
./.git/hooks
./.git/hooks/commit-msg.sample
./.git/hooks/applypatch-msg.sample
./.git/hooks/pre-applypatch.sample
./.git/hooks/post-commit.sample
./.git/hooks/pre-rebase.sample
./.git/hooks/post-receive.sample
./.git/hooks/prepare-commit-msg.sample
./.git/hooks/post-update.sample
./.git/hooks/pre-commit.sample
./.git/hooks/update.sample
./.git/refs
./.git/refs/heads
./.git/refs/tags
./.git/config
./.git/objects
./.git/objects/pack
./.git/objects/info
./.git/description
./.git/HEAD
./.git/branches
./.git/info
./.git/info/exclude

As you can see, .git contains a lot of stuff. All of the files are based on a template directory that you can adjust, if you so choose. Depending on the version of Git you are using, your actual manifest may look a little different. For example, older versions of Git do not use a .sample suffix on the .git/hooks files.

In general, you don’t have to view or manipulate the files in .git. These “hidden” files are considered part of Git’s plumbing, or configuration. Git has a small set of plumbing commands to manipulate these hidden files, but you will rarely use them.

Initially, the .git/objects directory (the directory for all of Git’s objects) is empty, except for a few placeholders:

$ find .git/objects

.git/objects
.git/objects/pack
.git/objects/info

Let’s now carefully create a simple object:

$ echo "hello world" > hello.txt
$ git add hello.txt

If you typed “hello world” exactly as it appears here (with no changes to spacing or capitalization), your objects directory should now look like this:

$ find .git/objects
.git/objects
.git/objects/pack
.git/objects/3b
.git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
.git/objects/info

All this looks mysterious. But it’s not, as the following sections explain.

Objects, Hashes, and Blobs

When it creates an object for hello.txt, Git doesn’t care that the filename is hello.txt. Git cares only about what’s inside the file: the sequence of 12 bytes that represent “hello world” and the terminating newline (the same blob created earlier). Git performs a few operations on this blob, calculates its SHA1 hash, and enters it into the object store as a file named after the hexadecimal representation of the hash.

How Do We Know a SHA1 Hash Is Unique?

There is an extremely slim chance that two different blobs yield the same SHA1 hash. When this happens, it is called a collision. However, SHA1 collision is so unlikely that you can safely bank on it never interfering with our use of Git.

SHA1 is a cryptographically secure hash. Until recently there was no known way (better than blind luck) for a user to cause a collision on purpose. But could a collision happen at random? Let’s see.

With 160 bits, you have 2¹⁶⁰, or about 10⁴⁸ (1 with 48 zeros after it), possible SHA1 hashes. That number is just incomprehensibly huge. Even if you hired a trillion people to produce a trillion new unique blobs per second for a trillion years, you would still only have about 10⁴³ blobs.

If you hashed 2⁸⁰ random blobs, you might find a collision.

Don’t trust us. Go read Bruce Schneier.

The hash in this case is 3b18e512dba79e4c8300dd08aeb37f8e728b8dad. The 160 bits of an SHA1 hash correspond to 20 bytes, which takes 40 bytes of hexadecimal to display, so the content is stored as .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad. Git inserts a / after the first two digits to improve filesystem efficiency. (Some filesystems slow down if you put too many files in the same directory; making the first byte of the SHA1 into a directory is an easy way to create a fixed, 256-way partitioning of the namespace for all possible objects with an even distribution.)

To show that Git really hasn’t done very much with the content in the file (it’s still the same comforting “hello world”), you can use the hash to pull it back out of the object store any time you want:

$ git cat-file -p 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
hello world

Tip

Git also knows that 40 characters is a bit chancy to type by hand, so Git provides a command to look up objects by a unique prefix of the object hash:

$ git rev-parse 3b18e512d
3b18e512dba79e4c8300dd08aeb37f8e728b8dad

Files and Trees

Now that the “hello world” blob is safely ensconced in the object store, what happens to its filename? Git wouldn’t be very useful if it couldn’t find files by name.

As mentioned earlier, Git tracks the pathnames of files through another kind of object called a tree. When you use git add, Git creates an object for the contents of each file you add, but it doesn’t create an object for your tree right away. Instead, it updates the index. The index is found in .git/index and keeps track of file pathnames and corresponding blobs. Each time you run commands such as git add, git rm, or git mv, Git updates the index with the new pathname and blob information.

Whenever you want, you can create a tree object from your current index by capturing a snapshot of its current information with the low-level git write-tree command .

At the moment, the index contains exactly one file, hello.txt:

$ git ls-files -s
100644 3b18e512dba79e4c8300dd08aeb37f8e728b8dad 0       hello.txt

Here you can see the association of the file hello.txt and the blob 3b18e5....

Next, let’s capture the index state and save it to a tree object:

$ git write-tree
68aba62e560c0ebc3396e8ae9335232cd93a3f60

$ find .git/objects
.git/objects
.git/objects/68
.git/objects/68/aba62e560c0ebc3396e8ae9335232cd93a3f60
.git/objects/pack
.git/objects/3b
.git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
.git/objects/info

Now there are two objects, the “hello world” object at 3b18e5 and a new one, the tree object, at 68aba6. As you can see, the SHA1 object name corresponds exactly to the subdirectory and filename in .git/objects.

But what does a tree look like? Because it’s an object, just like the blob, you can use the same low-level command to view it:

$ git cat-file -p 68aba6
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad    hello.txt

The contents of the object should be easy to interpret. The first number, 100644, represents the file attributes of the object in octal, which should be familiar to anyone who has used the Unix chmod command. Here 3b18e5 is the object name of the “hello world” blob, and hello.txt is the name associated with that blob.

It is now easy to see that the tree object has captured the information that was in the index when you ran git ls-files -s.

A Note on Git’s Use of SHA1

Before peering at the contents of the tree object in more detail, let’s check out an important feature of SHA1 hashes:

$ git write-tree
68aba62e560c0ebc3396e8ae9335232cd93a3f60

$ git write-tree
68aba62e560c0ebc3396e8ae9335232cd93a3f60

$ git write-tree
68aba62e560c0ebc3396e8ae9335232cd93a3f60

Every time you compute another tree object for the same index, the SHA1 hash remains exactly the same. Git doesn’t need to recreate a new tree object. If you’re following these steps at the computer, you should be seeing exactly the same SHA1 hashes as the ones published in this book.

In this sense, the hash function is a true function in the mathematical sense: for a given input, it always produces the same output. Such a hash function is sometimes called a digest to emphasize that it serves as a sort of summary of the hashed object. Of course, any hash function, even the lowly parity bit, has this property.

That’s extremely important. For example, if you create the exact same content as another developer, regardless of where or when or how both of you work, an identical hash is proof enough that the full content is identical, too. In fact, Git treats them as identical.

But hold on a second—aren’t SHA1 hashes unique? What happened to the trillions of people with trillions of blobs per second who never produce a single collision? This is a common source of confusion among new Git users. So read on carefully, because if you can understand this distinction, everything else in this chapter is easy.

Identical SHA1 hashes in this case do not count as a collision. It would be a collision only if two different objects produced the same hash. Here, you created two separate instances of the very same content, and the same content always has the same hash.

Git depends on another consequence of the SHA1 hash function: it doesn’t matter how you got a tree called 68aba62e560c0ebc3396e8ae9335232cd93a3f60. If you have it, you can be extremely confident it is the same tree object another reader of this book has. Bob might have created the tree by combining commits A and B from Jennie and commit C from Sergey, whereas you got commit A from Sue and an update from Lakshmi that combines commits B and C. The results are the same, and this facilitates distributed development.

If you look for object 68aba62e560c0ebc3396e8ae9335232cd93a3f60 and can find it, then you can be confident that you are looking at precisely the same data from which the hash was created (because SHA1 is a cryptographic hash).

The converse is also true: if you don’t find an object with a specific hash in your object store, you can be confident that you do not hold a copy of that exact object.

Thus, you can determine whether your object store does or does not have a particular object even though you know nothing about its (potentially very large) contents. The hash thus serves as a reliable “label” or name for the object.

But Git also relies on something stronger than that conclusion, too. Consider the most recent commit (or its associated tree object). Since it contains, as part of its content, the hash of its parent commits and of its tree, and since that in turn contains the hash of all of its subtrees and blobs, recursively through the whole data structure, it follows by induction that the hash of the original commit uniquely identifies the state of the whole data structure rooted at that commit.

Finally, the implications of my claim in the previous paragraph lead to a powerful use of the hash function: it provides an efficient way to compare two objects , even two very large and complex data structures,^[8] without transmitting either in full.

Tree Hierarchies

I t’s nice to have information regarding a single file, as was shown in the previous section, but projects contain complex, deeply nested directories that are refactored and moved around over time. Let’s see how Git handles this by creating a new subdirectory that contains an identical copy of the hello.txt file:

$ pwd
/tmp/hello
$ mkdir subdir
$ cp hello.txt subdir/
$ git add subdir/hello.txt
$ git write-tree
492413269336d21fac079d4a4672e55d5d2147ac

$ git cat-file -p 4924132693
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad    hello.txt
040000 tree 68aba62e560c0ebc3396e8ae9335232cd93a3f60    subdir

The new top-level tree contains two items: the original hello.txt file as well as the new subdir directory, which is of type tree instead of blob.

Notice anything unusual? Look closer at the object name of subdir. It’s your old friend, 68aba62e560c0ebc3396e8ae9335232cd93a3f60!

What just happened? The new tree for subdir contains only one file, hello.txt, and that file contains the same old “hello world” content. So the subdir tree is exactly the same as the older, top-level tree! And of course it has the same SHA1 object name as before.

Let’s look at the .git/objects directory and see what this most recent change affected:

$ find .git/objects
.git/objects
.git/objects/49
.git/objects/49/2413269336d21fac079d4a4672e55d5d2147ac
.git/objects/68
.git/objects/68/aba62e560c0ebc3396e8ae9335232cd93a3f60
.git/objects/pack
.git/objects/3b
.git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
.git/objects/info

There are still only three unique objects: a blob containing “hello world”; a tree containing hello.txt, which contains the text “hello world” plus a newline; and a second tree that contains another reference to hello.txt along with the first tree.

Commits

The next object to discuss is the commit. Now that hello.txt has been added with git add and the tree object has been produced with git write-tree, you can create a commit object using low-level commands like this:

$ echo -n "Commit a file that says hello\n" \
    | git commit-tree 492413269336d21fac079d4a4672e55d5d2147ac
3ede4622cc241bcb09683af36360e7413b9ddf6c

And it will look something like this:

$ git cat-file -p 3ede462
author Jon Loeliger <jdl@example.com> 1220233277 -0500
committer Jon Loeliger <jdl@example.com> 1220233277 -0500

Commit a file that says hello

If you’re following along on your computer, you probably found that the commit object you generated does not have the same name as the one in this book. If you’ve understood everything so far, the reason for that should be obvious: it’s not the same commit. The commit contains your name and the time you made the commit, so of course it is different, however subtly. On the other hand, your commit does have the same tree. This is why commit objects are separate from their tree objects: different commits often refer to exactly the same tree. When that happens, Git is smart enough to transfer around only the new commit object—which is tiny—instead of the tree and blob objects, which are probably much larger.

In real life, you can (and should!) skip the low-level git write-tree and git commit-tree steps and just use the git commit command. You don’t need to remember all those plumbing commands to be a perfectly happy Git user.

A basic commit object is fairly simple, and it’s the last ingredient required for a real revision control system. The commit object just shown is the simplest possible one, containing:

The name of a tree object that actually identifies the associated files
The name of the person who composed the new version (the author) and the time when it was composed
The name of the person who placed the new version into the repository (the committer) and the time when it was committed
A description of the reason for this revision (the commit message)

By default, the author and committer are the same; there are a few situations where they’re different.

Tip

You can use the command git show --pretty=fuller to see additional details about a given commit.

Commit objects are also stored in a graph structure, although it’s completely different from the structures used by tree objects. When you make a new commit, you can give it one or more parent commits. By following back through the chain of parents, you can discover the history of your project. More details about commits and the commit graph are given in Chapter 6.

Version Control with Git by Jon Loeliger

Git Concepts at Work

Inside the .git directory

Objects, Hashes, and Blobs

Tip

Files and Trees

A Note on Git’s Use of SHA1

Tree Hierarchies

Commits

Tip

Tags

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly