Git Concepts at Work

With some tenets out of the way, let’s see how all these concepts and components fit together in the repository itself. Let’s create a new repository and inspect the internal files and object store in much greater detail.

Inside the .git directory

To begin, initialize an empty repository using git init and then run find to reveal what’s created:

$ mkdir /tmp/hello
$ cd /tmp/hello
$ git init
Initialized empty Git repository in /tmp/hello/.git/

# List all the files in the current directory
$ find .
.
./.git
./.git/hooks
./.git/hooks/commit-msg.sample
./.git/hooks/applypatch-msg.sample
./.git/hooks/pre-applypatch.sample
./.git/hooks/post-commit.sample
./.git/hooks/pre-rebase.sample
./.git/hooks/post-receive.sample
./.git/hooks/prepare-commit-msg.sample
./.git/hooks/post-update.sample
./.git/hooks/pre-commit.sample
./.git/hooks/update.sample
./.git/refs
./.git/refs/heads
./.git/refs/tags
./.git/config
./.git/objects
./.git/objects/pack
./.git/objects/info
./.git/description
./.git/HEAD
./.git/branches
./.git/info
./.git/info/exclude

As you can see, .git contains a lot of stuff. All of the files are based on a template directory that you can adjust, if you so choose. Depending on the version of Git you are using, your actual manifest may look a little different. For example, older versions of Git do not use a .sample suffix on the .git/hooks files.

In general, you don’t have to view or manipulate the files in .git. These hidden files are considered part of Git’s plumbing, or configuration. Git has a small set of plumbing commands to manipulate these hidden files, but you will rarely use them.

Initially, the .git/objects directory (the directory for all of Git’s objects) is empty, except for a few placeholders:

$ find .git/objects

.git/objects
.git/objects/pack
.git/objects/info

Let’s now carefully create a simple object:

$ echo "hello world" > hello.txt
$ git add hello.txt

If you typed hello world exactly as it appears here (with no changes to spacing or capitalization), your objects directory should now look like this:

$ find .git/objects
.git/objects
.git/objects/pack
.git/objects/3b
.git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
.git/objects/info

All this looks mysterious. But it’s not, as the following sections explain.

Objects, Hashes, and Blobs

When it creates an object for hello.txt, Git doesn’t care that the filename is hello.txt. Git cares only about what’s inside the file: the sequence of 12 bytes that represent hello world and the terminating newline (the same blob created earlier). Git performs a few operations on this blob, calculates its SHA1 hash, and enters it into the object store as a file named after the hexadecimal representation of the hash.

The hash in this case is 3b18e512dba79e4c8300dd08aeb37f8e728b8dad. The 160 bits of an SHA1 hash correspond to 20 bytes, which takes 40 bytes of hexadecimal to display, so the content is stored as .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad. Git inserts a / after the first two digits to improve filesystem efficiency. (Some filesystems slow down if you put too many files in the same directory; making the first byte of the SHA1 into a directory is an easy way to create a fixed, 256-way partitioning of the namespace for all possible objects with an even distribution.)

To show that Git really hasn’t done very much with the content in the file (it’s still the same comforting hello world), you can use the hash to pull it back out of the object store any time you want:

$ git cat-file -p 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
hello world

Tip

Git also knows that 40 characters is a bit chancy to type by hand, so Git provides a command to look up objects by a unique prefix of the object hash:

$ git rev-parse 3b18e512d
3b18e512dba79e4c8300dd08aeb37f8e728b8dad

Files and Trees

Now that the hello world blob is safely ensconced in the object store, what happens to its filename? Git wouldn’t be very useful if it couldn’t find files by name.

As mentioned earlier, Git tracks the pathnames of files through another kind of object called a tree. When you use git add, Git creates an object for the contents of each file you add, but it doesn’t create an object for your tree right away. Instead, it updates the index. The index is found in .git/index and keeps track of file pathnames and corresponding blobs. Each time you run commands such as git add, git rm, or git mv, Git updates the index with the new pathname and blob information.

Whenever you want, you can create a tree object from your current index by capturing a snapshot of its current information with the low-level git write-tree command.

At the moment, the index contains exactly one file, hello.txt:

$ git ls-files -s
100644 3b18e512dba79e4c8300dd08aeb37f8e728b8dad 0       hello.txt

Here you can see the association of the file hello.txt and the blob 3b18e5....

Next, let’s capture the index state and save it to a tree object:

$ git write-tree
68aba62e560c0ebc3396e8ae9335232cd93a3f60

$ find .git/objects
.git/objects
.git/objects/68
.git/objects/68/aba62e560c0ebc3396e8ae9335232cd93a3f60
.git/objects/pack
.git/objects/3b
.git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
.git/objects/info

Now there are two objects, the hello world object at 3b18e5 and a new one, the tree object, at 68aba6. As you can see, the SHA1 object name corresponds exactly to the subdirectory and filename in .git/objects.

But what does a tree look like? Because it’s an object, just like the blob, you can use the same low-level command to view it:

$ git cat-file -p 68aba6
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad    hello.txt

The contents of the object should be easy to interpret. The first number, 100644, represents the file attributes of the object in octal, which should be familiar to anyone who has used the Unix chmod command. Here 3b18e5 is the object name of the “hello world” blob, and hello.txt is the name associated with that blob.

It is now easy to see that the tree object has captured the information that was in the index when you ran git ls-files -s.

A Note on Git’s Use of SHA1

Before peering at the contents of the tree object in more detail, let’s check out an important feature of SHA1 hashes:

$ git write-tree
68aba62e560c0ebc3396e8ae9335232cd93a3f60

$ git write-tree
68aba62e560c0ebc3396e8ae9335232cd93a3f60

$ git write-tree
68aba62e560c0ebc3396e8ae9335232cd93a3f60

Every time you compute another tree object for the same index, the SHA1 hash remains exactly the same. Git doesn’t need to recreate a new tree object. If you’re following these steps at the computer, you should be seeing exactly the same SHA1 hashes as the ones published in this book.

In this sense, the hash function is a true function in the mathematical sense: for a given input, it always produces the same output. Such a hash function is sometimes called a digest to emphasize that it serves as a sort of summary of the hashed object. Of course, any hash function, even the lowly parity bit, has this property.

That’s extremely important. For example, if you create the exact same content as another developer, regardless of where or when or how both of you work, an identical hash is proof enough that the full content is identical, too. In fact, Git treats them as identical.

But hold on a second—aren’t SHA1 hashes unique? What happened to the trillions of people with trillions of blobs per second who never produce a single collision? This is a common source of confusion among new Git users. So read on carefully, because if you can understand this distinction, everything else in this chapter is easy.

Identical SHA1 hashes in this case do not count as a collision. It would be a collision only if two different objects produced the same hash. Here, you created two separate instances of the very same content, and the same content always has the same hash.

Git depends on another consequence of the SHA1 hash function: it doesn’t matter how you got a tree called 68aba62e560c0ebc3396e8ae9335232cd93a3f60. If you have it, you can be extremely confident it is the same tree object another reader of this book has. Bob might have created the tree by combining commits A and B from Jennie and commit C from Sergey, whereas you got commit A from Sue and an update from Lakshmi that combines commits B and C. The results are the same, and this facilitates distributed development.

If you look for object 68aba62e560c0ebc3396e8ae9335232cd93a3f60 and can find it, then you can be confident that you are looking at precisely the same data from which the hash was created (because SHA1 is a cryptographic hash).

The converse is also true: if you don’t find an object with a specific hash in your object store, you can be confident that you do not hold a copy of that exact object.

Thus, you can determine whether your object store does or does not have a particular object even though you know nothing about its (potentially very large) contents. The hash thus serves as a reliable label or name for the object.

But Git also relies on something stronger than that conclusion, too. Consider the most recent commit (or its associated tree object). Since it contains, as part of its content, the hash of its parent commits and of its tree, and since that in turn contains the hash of all of its subtrees and blobs, recursively through the whole data structure, it follows by induction that the hash of the original commit uniquely identifies the state of the whole data structure rooted at that commit.

Finally, the implications of my claim in the previous paragraph lead to a powerful use of the hash function: it provides an efficient way to compare two objects, even two very large and complex data structures,[8] without transmitting either in full.

Tree Hierarchies

It’s nice to have information regarding a single file, as was shown in the previous section, but projects contain complex, deeply nested directories that are refactored and moved around over time. Let’s see how Git handles this by creating a new subdirectory that contains an identical copy of the hello.txt file:

$ pwd
/tmp/hello
$ mkdir subdir
$ cp hello.txt subdir/
$ git add subdir/hello.txt
$ git write-tree
492413269336d21fac079d4a4672e55d5d2147ac

$ git cat-file -p 4924132693
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad    hello.txt
040000 tree 68aba62e560c0ebc3396e8ae9335232cd93a3f60    subdir

The new top-level tree contains two items: the original hello.txt file as well as the new subdir directory, which is of type tree instead of blob.

Notice anything unusual? Look closer at the object name of subdir. It’s your old friend, 68aba62e560c0ebc3396e8ae9335232cd93a3f60!

What just happened? The new tree for subdir contains only one file, hello.txt, and that file contains the same old hello world content. So the subdir tree is exactly the same as the older, top-level tree! And of course it has the same SHA1 object name as before.

Let’s look at the .git/objects directory and see what this most recent change affected:

$ find .git/objects
.git/objects
.git/objects/49
.git/objects/49/2413269336d21fac079d4a4672e55d5d2147ac
.git/objects/68
.git/objects/68/aba62e560c0ebc3396e8ae9335232cd93a3f60
.git/objects/pack
.git/objects/3b
.git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
.git/objects/info

There are still only three unique objects: a blob containing hello world; a tree containing hello.txt, which contains the text hello world plus a newline; and a second tree that contains another reference to hello.txt along with the first tree.

Commits

The next object to discuss is the commit. Now that hello.txt has been added with git add and the tree object has been produced with git write-tree, you can create a commit object using low-level commands like this:

$ echo -n "Commit a file that says hello\n" \
    | git commit-tree 492413269336d21fac079d4a4672e55d5d2147ac
3ede4622cc241bcb09683af36360e7413b9ddf6c

And it will look something like this:

$ git cat-file -p 3ede462
author Jon Loeliger <jdl@example.com> 1220233277 -0500
committer Jon Loeliger <jdl@example.com> 1220233277 -0500

Commit a file that says hello

If you’re following along on your computer, you probably found that the commit object you generated does not have the same name as the one in this book. If you’ve understood everything so far, the reason for that should be obvious: it’s not the same commit. The commit contains your name and the time you made the commit, so of course it is different, however subtly. On the other hand, your commit does have the same tree. This is why commit objects are separate from their tree objects: different commits often refer to exactly the same tree. When that happens, Git is smart enough to transfer around only the new commit object—which is tiny—instead of the tree and blob objects, which are probably much larger.

In real life, you can (and should!) skip the low-level git write-tree and git commit-tree steps and just use the git commit command. You don’t need to remember all those plumbing commands to be a perfectly happy Git user.

A basic commit object is fairly simple, and it’s the last ingredient required for a real revision control system. The commit object just shown is the simplest possible one, containing:

  • The name of a tree object that actually identifies the associated files

  • The name of the person who composed the new version (the author) and the time when it was composed

  • The name of the person who placed the new version into the repository (the committer) and the time when it was committed

  • A description of the reason for this revision (the commit message)

By default, the author and committer are the same; there are a few situations where they’re different.

Tip

You can use the command git show --pretty=fuller to see additional details about a given commit.

Commit objects are also stored in a graph structure, although it’s completely different from the structures used by tree objects. When you make a new commit, you can give it one or more parent commits. By following back through the chain of parents, you can discover the history of your project. More details about commits and the commit graph are given in Chapter 6.

Tags

Finally, the last object Git manages is the tag. Although Git implements only one kind of tag object, there are two basic tag types, usually called lightweight and annotated.

Lightweight tags are simply references to a commit object and are usually considered private to a repository. These tags do not create a permanent object in the object store. An annotated tag is more substantial and creates an object. It contains a message, supplied by you, and can be digitally signed using a GnuPG key, according to RFC4880.

Git treats both lightweight and annotated tag names equivalently for the purposes of naming a commit. However, by default, many Git commands work only on annotated tags, as they are considered permanent objects.

You create an annotated, unsigned tag with a message on a commit using the git tag command:

$ git tag -m"Tag version 1.0" V1.0 3ede462

You can see the tag object via the git cat-file -p command, but what is the SHA1 of the tag object? To find it, use the tip from Objects, Hashes, and Blobs.

$ git rev-parse V1.0
6b608c1093943939ae78348117dd18b1ba151c6a

$ git cat-file -p 6b608c
object 3ede4622cc241bcb09683af36360e7413b9ddf6c
type commit
tag V1.0
tagger Jon Loeliger <jdl@example.com> Sun Oct 26 17:07:15 2008 -0500

Tag version 1.0

In addition to the log message and author information, the tag refers to the commit object 3ede462. Usually, Git tags a particular commit as named by some branch. Note that this behavior is notably different from that of other VCSs.

Git usually tags a commit object, which points to a tree object, which encompasses the total state of the entire hierarchy of files and directories within your repository.

Recall from Figure 4-1 that the V1.0 tag points to the commit named 1492, which in turn points to a tree (8675309) that spans multiple files. Thus, the tag simultaneously applies to all files of that tree.

This is unlike CVS, for example, which will apply a tag to each individual file and then rely on the collection of all those tagged files to reconstitute a whole tagged revision. And whereas CVS lets you move the tag on an individual file, Git requires a new commit, encompassing the file state change, onto which the tag will be moved.



[8] This data structure is covered in more detail in Commit Graphs.

Get Version Control with Git now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.