With some tenets out of the way, let’s see how all these concepts and components fit together in the repository itself. Let’s create a new repository and inspect the internal files and object store in much greater detail.
    $ git init
    Initialized empty Git repository in /tmp/hello/.git/

    # List all the files in the current directory
    $ find .
    .
    ./.git
    ./.git/hooks
    ./.git/hooks/commit-msg.sample
    ./.git/hooks/applypatch-msg.sample
    ./.git/hooks/pre-applypatch.sample
    ./.git/hooks/post-commit.sample
    ./.git/hooks/pre-rebase.sample
    ./.git/hooks/post-receive.sample
    ./.git/hooks/prepare-commit-msg.sample
    ./.git/hooks/post-update.sample
    ./.git/hooks/pre-commit.sample
    ./.git/hooks/update.sample
    ./.git/refs
    ./.git/refs/heads
    ./.git/refs/tags
    ./.git/config
    ./.git/objects
    ./.git/objects/pack
    ./.git/objects/info
    ./.git/description
    ./.git/HEAD
    ./.git/branches
    ./.git/info
    ./.git/info/exclude
As you can see, .git contains a lot of stuff. All of the files are based on a template directory that you can adjust, if you so choose. Depending on the version of Git you are using, your actual manifest may look a little different. For example, older versions of Git do not use a .sample suffix on the .git/hooks files.
In general, you don’t have to view or manipulate the files in .git. These “hidden” files are considered part of Git’s plumbing, or configuration. Git has a small set of plumbing commands to manipulate these hidden files, but you will rarely use them.
Initially, the .git/objects directory (the directory for all of Git’s objects) is empty, except for a few placeholders:
    $ find .git/objects
    .git/objects
    .git/objects/pack
    .git/objects/info
Now, let’s carefully create a simple file and add it:
    $ echo "hello world" > hello.txt
    $ git add hello.txt
If you typed “hello world” exactly as it appears here (with no changes to spacing or capitalization), your objects directory should now look like this:
    $ find .git/objects
    .git/objects
    .git/objects/pack
    .git/objects/3b
    .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
    .git/objects/info
All this looks mysterious. But it’s not, as the following sections explain.
When it creates an object for hello.txt, Git doesn’t care that the filename is hello.txt. Git cares only about what’s inside the file: the sequence of 12 bytes that represent “hello world” and the terminating newline (the same blob created earlier). Git performs a few operations on this blob, calculates its SHA1 hash, and enters it into the object store as a file named after the hexadecimal representation of the hash.
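The "few operations" can be sketched in Python. This is a minimal illustration, not Git's own code: it assumes the standard loose-object layout, in which a header of the form `blob <size>` followed by a NUL byte is prepended to the content before hashing. The helper name `blob_id` is hypothetical.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the SHA1 name of a blob holding `content` (hypothetical helper)."""
    # Assumption: the hashed bytes are "blob <size>", a NUL byte, then the content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Only the 12 bytes of content are hashed; the filename is never consulted.
print(blob_id(b"hello world\n"))
```

Dropping the trailing newline, or changing a single character, yields a completely different name.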
The hash in this case is 3b18e512dba79e4c8300dd08aeb37f8e728b8dad. The 160 bits of an SHA1 hash correspond to 20 bytes, which take 40 hexadecimal characters to display, so the content is stored as .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad. Git inserts a / after the first two digits to improve filesystem efficiency. (Some filesystems slow down if you put too many files in the same directory; making the first byte of the SHA1 into a directory is an easy way to create a fixed, 256-way partitioning of the namespace for all possible objects with an even distribution.)
To show that Git really hasn’t done very much with the content in the file (it’s still the same comforting “hello world”), you can use the hash to pull it back out of the object store any time you want:
    $ git cat-file -p 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
    hello world
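To make the storage step concrete, here is a rough Python sketch of writing an object to a scratch directory and reading it back. It assumes the loose-object convention of zlib-compressing the header plus content and splitting the hash into a two-character directory and a 38-character filename. The function names are hypothetical, and a temporary directory stands in for .git/objects.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_object(objects_dir: Path, kind: str, body: bytes) -> str:
    """Store `body` under a name derived from its SHA1 and return that name."""
    store = f"{kind} {len(body)}".encode("ascii") + b"\0" + body
    name = hashlib.sha1(store).hexdigest()
    path = objects_dir / name[:2] / name[2:]  # first byte becomes the directory
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return name


def read_loose_object(objects_dir: Path, name: str) -> tuple:
    """Return (kind, body) for the object called `name`."""
    raw = zlib.decompress((objects_dir / name[:2] / name[2:]).read_bytes())
    header, _, body = raw.partition(b"\0")
    kind, size = header.decode("ascii").split(" ")
    assert int(size) == len(body), "stored size does not match content"
    return kind, body


with tempfile.TemporaryDirectory() as tmp:
    objects = Path(tmp) / "objects"
    name = write_loose_object(objects, "blob", b"hello world\n")
    relative_path = (objects / name[:2] / name[2:]).relative_to(tmp).as_posix()
    kind, body = read_loose_object(objects, name)
    print(relative_path, kind, body)
```

Reading the object back returns the original bytes untouched, which is what git cat-file -p displays.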
Git also knows that 40 characters is a bit chancy to type by hand, so Git provides a command to look up objects by a unique prefix of the object hash:
    $ git rev-parse 3b18e512d
    3b18e512dba79e4c8300dd08aeb37f8e728b8dad
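The idea behind prefix lookup can be sketched as follows. This is only an illustration of "unique prefix" matching over a list of known object names, not how Git searches its object store. The function name `resolve_prefix` is hypothetical, and the four-character minimum is an assumption about Git's shortest accepted abbreviation.

```python
def resolve_prefix(prefix: str, object_names: list) -> str:
    """Expand an abbreviated object name to the single full name it matches."""
    if len(prefix) < 4:  # assumed minimum abbreviation length
        raise ValueError("prefix is too short")
    matches = [name for name in object_names if name.startswith(prefix.lower())]
    if not matches:
        raise KeyError(f"no object matches {prefix!r}")
    if len(matches) > 1:
        raise ValueError(f"prefix {prefix!r} is ambiguous")
    return matches[0]


names = [
    "3b18e512dba79e4c8300dd08aeb37f8e728b8dad",
    "68aba62e560c0ebc3396e8ae9335232cd93a3f60",
]
print(resolve_prefix("3b18e512d", names))
```

A prefix is usable only while it matches exactly one object, so longer prefixes become necessary as a repository grows.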
As mentioned earlier, Git tracks the pathnames of files through another kind of object called a tree. When you use git add, Git creates an object for the contents of each file you add, but it doesn’t create an object for your tree right away. Instead, it updates the index. The index is found in .git/index and keeps track of file pathnames and corresponding blobs. Each time you run commands such as git add, git rm, or git mv, Git updates the index with the new pathname and blob information.
At the moment, the index contains exactly one file, hello.txt:
    $ git ls-files -s
    100644 3b18e512dba79e4c8300dd08aeb37f8e728b8dad 0 hello.txt
Here you can see the association of the file hello.txt and the blob 3b18e5.
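As a mental model, the index can be pictured as a table from pathname to mode, blob name, and stage number. The Python sketch below is a deliberate simplification: the real .git/index is a binary file carrying more metadata, and the names `index`, `add`, and `ls_files_stage` are hypothetical.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """SHA1 name of a blob, assuming the "blob <size>" plus NUL header."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


index = {}  # pathname -> (mode, blob name, stage)


def add(path: str, content: bytes, mode: str = "100644") -> None:
    """Record that `path` currently corresponds to the blob for `content`."""
    index[path] = (mode, blob_id(content), 0)


def ls_files_stage() -> list:
    """Render the table in the style of `git ls-files -s`."""
    return [
        f"{mode} {name} {stage}\t{path}"
        for path, (mode, name, stage) in sorted(index.items())
    ]


add("hello.txt", b"hello world\n")
print("\n".join(ls_files_stage()))
```

Adding the same path again with new content simply replaces its row, which mirrors how repeated git add calls update the staged blob.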
Next, let’s capture the index state and save it to a tree object:
    $ git write-tree
    68aba62e560c0ebc3396e8ae9335232cd93a3f60

    $ find .git/objects
    .git/objects
    .git/objects/68
    .git/objects/68/aba62e560c0ebc3396e8ae9335232cd93a3f60
    .git/objects/pack
    .git/objects/3b
    .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
    .git/objects/info
Now there are two objects, the “hello world” object at 3b18e5 and a new one, the tree object, at 68aba6. As you can see, the SHA1 object name corresponds exactly to the subdirectory and filename in .git/objects.
But what does a tree look like? Because it’s an object, just like the blob, you can use the same low-level command to view it:
    $ git cat-file -p 68aba6
    100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad hello.txt
The contents of the object should be easy to interpret. The first number, 100644, represents the file attributes of the object in octal, which should be familiar to anyone who has used the Unix chmod command. Here 3b18e5 is the object name of the “hello world” blob, and hello.txt is the name associated with that blob.
It is now easy to see that the tree object has captured the information that was in the index when you ran git ls-files -s.
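For the curious, the tree's name can be reproduced outside Git. The sketch below assumes the raw tree layout: each entry is the mode, a space, the filename, a NUL byte, and the entry's SHA1 as 20 raw bytes, with the whole body hashed behind a `tree <size>` header. The helper names are hypothetical.

```python
import hashlib


def object_id(kind: str, body: bytes) -> str:
    """SHA1 name of an object: "<kind> <size>", a NUL byte, then the body."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_entry(mode: str, name: str, target_id: str) -> bytes:
    """Serialize one tree entry; the target hash is stored in binary, not hex."""
    return f"{mode} {name}".encode("utf-8") + b"\0" + bytes.fromhex(target_id)


blob = object_id("blob", b"hello world\n")
tree = object_id("tree", tree_entry("100644", "hello.txt", blob))
print(tree)
```

Unlike the blob, the tree does record the filename, so renaming hello.txt would change the tree's name while leaving the blob's untouched.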
Before moving on, notice what happens when you run git write-tree again on the same index:
    $ git write-tree
    68aba62e560c0ebc3396e8ae9335232cd93a3f60
    $ git write-tree
    68aba62e560c0ebc3396e8ae9335232cd93a3f60
Every time you compute another tree object for the same index, the SHA1 hash remains exactly the same. Git doesn’t need to recreate a new tree object. If you’re following these steps at the computer, you should be seeing exactly the same SHA1 hashes as the ones published in this book.
In this sense, the hash function is a true function in the mathematical sense: for a given input, it always produces the same output. Such a hash function is sometimes called a digest to emphasize that it serves as a sort of summary of the hashed object. Of course, any hash function, even the lowly parity bit, has this property.
That’s extremely important. For example, if you create the exact same content as another developer, regardless of where or when or how both of you work, an identical hash is proof enough that the full content is identical, too. In fact, Git treats them as identical.
But hold on a second—aren’t SHA1 hashes unique? What happened to the trillions of people with trillions of blobs per second who never produce a single collision? This is a common source of confusion among new Git users. So read on carefully, because if you can understand this distinction, everything else in this chapter is easy.
Identical SHA1 hashes in this case do not count as a collision. It would be a collision only if two different objects produced the same hash. Here, you created two separate instances of the very same content, and the same content always has the same hash.
Git depends on another consequence of the SHA1 hash function: it
doesn’t matter how you got a tree called
68aba62e560c0ebc3396e8ae9335232cd93a3f60. If you
have it, you can be extremely confident it is the same tree object
another reader of this book has. Bob might have created the tree by
combining commits A and B from Jennie and commit C from Sergey,
whereas you got commit A from Sue and an update from Lakshmi that
combines commits B and C. The results are the same, and this
facilitates distributed development.
If you look for object 68aba62e560c0ebc3396e8ae9335232cd93a3f60 and can find it, then you can be confident that you are looking at precisely the same data from which the hash was created (because SHA1 is a cryptographic hash).
The converse is also true: if you don’t find an object with a specific hash in your object store, you can be confident that you do not hold a copy of that exact object.
Thus, you can determine whether your object store does or does not have a particular object even though you know nothing about its (potentially very large) contents. The hash thus serves as a reliable “label” or name for the object.
But Git also relies on something stronger than that conclusion. Consider the most recent commit (or its associated tree object). Since it contains, as part of its content, the hash of its parent commits and of its tree, and since that in turn contains the hash of all of its subtrees and blobs, recursively through the whole data structure, it follows by induction that the hash of the original commit uniquely identifies the state of the whole data structure rooted at that commit.
Finally, the implications of my claim in the previous paragraph lead to a powerful use of the hash function: it provides an efficient way to compare two objects, even two very large and complex data structures, without transmitting either in full.
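The induction argument can be demonstrated with a toy structure. The following Python sketch is not Git's object format. It is a generic hash-of-hashes over nested dictionaries, with a hypothetical `digest` function, meant only to show that comparing two root hashes is enough to compare two entire hierarchies.

```python
import hashlib


def digest(node) -> str:
    """Hash a file (bytes) or a directory (dict mapping names to nodes)."""
    if isinstance(node, bytes):
        return hashlib.sha1(b"file\0" + node).hexdigest()
    # A directory's hash covers every child's name and hash, so it changes
    # whenever anything beneath it changes.
    listing = "\n".join(
        f"{name} {digest(child)}" for name, child in sorted(node.items())
    )
    return hashlib.sha1(b"dir\0" + listing.encode("utf-8")).hexdigest()


mine = {"src": {"main.c": b"int main(void) { return 0; }\n"}, "README": b"hello\n"}
yours = {"README": b"hello\n", "src": {"main.c": b"int main(void) { return 0; }\n"}}
edited = {"src": {"main.c": b"int main(void) { return 1; }\n"}, "README": b"hello\n"}

print(digest(mine) == digest(yours))   # same content, built in a different order
print(digest(mine) == digest(edited))  # one byte changed deep in the hierarchy
```

Two parties need to exchange only the 40-character root hash to learn whether their hierarchies match. If the roots differ, they can descend and compare child hashes to find what changed.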
It’s nice to have information regarding a single file, as was shown in the previous section, but projects contain complex, deeply nested directories that are refactored and moved around over time. Let’s see how Git handles this by creating a new subdirectory that contains an identical copy of the hello.txt file:
    $ mkdir subdir
    $ cp hello.txt subdir/
    $ git add subdir/hello.txt
    $ git write-tree
    492413269336d21fac079d4a4672e55d5d2147ac

    $ git cat-file -p 4924132693
    100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad hello.txt
    040000 tree 68aba62e560c0ebc3396e8ae9335232cd93a3f60 subdir
Notice anything unusual? Look closer at the object name of subdir. It’s your old friend, 68aba62e560c0ebc3396e8ae9335232cd93a3f60!
What just happened? The new tree for subdir
contains only one file, hello.txt, and that file
contains the same old “hello world” content. So the
subdir tree is exactly the
same as the older, top-level tree! And of course it has the same SHA1
object name as before.
Let’s look at the .git/objects directory and see what this most recent change affected:
    $ find .git/objects
    .git/objects
    .git/objects/49
    .git/objects/49/2413269336d21fac079d4a4672e55d5d2147ac
    .git/objects/68
    .git/objects/68/aba62e560c0ebc3396e8ae9335232cd93a3f60
    .git/objects/pack
    .git/objects/3b
    .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
    .git/objects/info
There are still only three unique objects: a blob containing “hello world”; a tree containing hello.txt, which contains the text “hello world” plus a newline; and a second tree that contains another reference to hello.txt along with the first tree.
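The sharing can be reproduced with a small recursive sketch in Python. It assumes the raw tree layout (mode, space, name, NUL, 20 raw hash bytes), that directories are recorded with mode 40000, and that entries are sorted by name with directories compared as if their name ended in a slash. Every file is treated as a regular, non-executable file. The `write_tree` function and the `store` dictionary are hypothetical stand-ins for Git's machinery.

```python
import hashlib


def object_id(kind: str, body: bytes) -> str:
    """SHA1 name of an object: "<kind> <size>", a NUL byte, then the body."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def write_tree(directory: dict, store: dict) -> str:
    """Hash a nested dict (bytes = file, dict = subdirectory) bottom-up.

    `store` maps each object name to its kind, so duplicates collapse.
    """
    entries = []
    for name, value in directory.items():
        if isinstance(value, dict):
            entries.append((name + "/", "40000", name, write_tree(value, store)))
        else:
            blob = object_id("blob", value)
            store[blob] = "blob"
            entries.append((name, "100644", name, blob))
    entries.sort(key=lambda entry: entry[0].encode("utf-8"))
    body = b"".join(
        f"{mode} {name}".encode("utf-8") + b"\0" + bytes.fromhex(oid)
        for _, mode, name, oid in entries
    )
    tree = object_id("tree", body)
    store[tree] = "tree"
    return tree


store = {}
root = write_tree(
    {"hello.txt": b"hello world\n", "subdir": {"hello.txt": b"hello world\n"}},
    store,
)
print(root)
print(sorted(store.items()))
```

Although the hierarchy contains two files and two directories, the store ends up with three entries, because identical content always maps to the same name.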
The next object to discuss is the commit. Now that hello.txt has been added with git add and the tree object has been produced with git write-tree, you can create a commit object using low-level commands like this:
    $ echo "Commit a file that says hello" \
        | git commit-tree 492413269336d21fac079d4a4672e55d5d2147ac
    3ede4622cc241bcb09683af36360e7413b9ddf6c
And it will look something like this:
    $ git cat-file -p 3ede462
    tree 492413269336d21fac079d4a4672e55d5d2147ac
    author Jon Loeliger <firstname.lastname@example.org> 1220233277 -0500
    committer Jon Loeliger <email@example.com> 1220233277 -0500

    Commit a file that says hello
If you’re following along on your computer, you probably found that the commit object you generated does not have the same name as the one in this book. If you’ve understood everything so far, the reason for that should be obvious: it’s not the same commit. The commit contains your name and the time you made the commit, so of course it is different, however subtly. On the other hand, your commit does have the same tree. This is why commit objects are separate from their tree objects: different commits often refer to exactly the same tree. When that happens, Git is smart enough to transfer around only the new commit object—which is tiny—instead of the tree and blob objects, which are probably much larger.
In real life, you can (and should!) skip the low-level git write-tree and git commit-tree steps and just use the git commit command. You don’t need to remember all those plumbing commands to be a perfectly happy Git user.
A basic commit object is fairly simple, and it’s the last ingredient required for a real revision control system. The commit object just shown is the simplest possible one, containing:
The name of a tree object that actually identifies the associated files
The name of the person who composed the new version (the author) and the time when it was composed
The name of the person who placed the new version into the repository (the committer) and the time when it was committed
A description of the reason for this revision (the commit message)
By default, the author and committer are the same; there are a few situations where they’re different.
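These ingredients can be assembled by hand to see why commit names differ between readers. The Python sketch below assumes the commit layout printed by git cat-file -p: a tree line, optional parent lines, author and committer lines, a blank line, then the message. The `commit_id` helper and the identity "A U Thor" are hypothetical.

```python
import hashlib


def commit_id(tree, author, committer, message, parents=()):
    """SHA1 name of a commit built from the listed ingredients."""
    lines = [f"tree {tree}"]
    lines += [f"parent {parent}" for parent in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode("utf-8")
    header = f"commit {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


tree = "492413269336d21fac079d4a4672e55d5d2147ac"
who = "A U Thor <author@example.com>"  # hypothetical identity
message = "Commit a file that says hello\n"

first = commit_id(tree, f"{who} 1220233277 -0500", f"{who} 1220233277 -0500", message)
again = commit_id(tree, f"{who} 1220233277 -0500", f"{who} 1220233277 -0500", message)
later = commit_id(tree, f"{who} 1220233278 -0500", f"{who} 1220233278 -0500", message)

print(first == again)  # identical ingredients give an identical name
print(first == later)  # one second later gives a different commit
```

The tree name is the same in all three cases. Only the commit wrapper changes, which is why two commits can share a tree without duplicating any file data.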
Commit objects are also stored in a graph structure, although it’s completely different from the structures used by tree objects. When you make a new commit, you can give it one or more parent commits. By following back through the chain of parents, you can discover the history of your project. More details about commits and the commit graph are given in Chapter 6.
Lightweight tags are simply references to a commit object and are usually considered private to a repository. These tags do not create a permanent object in the object store. An annotated tag is more substantial and creates an object. It contains a message, supplied by you, and can be digitally signed using a GnuPG key, according to RFC4880.
Git treats both lightweight and annotated tag names equivalently for the purposes of naming a commit. However, by default, many Git commands work only on annotated tags, as they are considered “permanent” objects.
You create an annotated, unsigned tag with a message on a commit by using the git tag command:
    $ git tag -m "Tag version 1.0" V1.0 3ede462
You can see the tag object via the git cat-file -p command, but what is the SHA1 of the tag object? To find it, use the tip from Objects, Hashes, and Blobs.
    $ git rev-parse V1.0
    6b608c1093943939ae78348117dd18b1ba151c6a

    $ git cat-file -p 6b608c
    object 3ede4622cc241bcb09683af36360e7413b9ddf6c
    type commit
    tag V1.0
    tagger Jon Loeliger <firstname.lastname@example.org> Sun Oct 26 17:07:15 2008 -0500

    Tag version 1.0
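The difference between the two tag types can be modeled in a few lines of Python. This is a simplified sketch: plain dictionaries stand in for the ref files and the object store, the tagger identity and timestamp are hypothetical, and the `peel` helper is an illustrative name for following a tag to the commit it names.

```python
import hashlib


def object_id(kind: str, body: bytes) -> str:
    """SHA1 name of an object: "<kind> <size>", a NUL byte, then the body."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


commit = "3ede4622cc241bcb09683af36360e7413b9ddf6c"  # commit name from the text

# Hypothetical tagger identity and timestamp, for illustration only.
tag_body = (
    f"object {commit}\n"
    "type commit\n"
    "tag V1.0\n"
    "tagger A U Thor <author@example.com> 1234567890 +0000\n"
    "\n"
    "Tag version 1.0\n"
).encode("utf-8")
tag_id = object_id("tag", tag_body)

objects = {tag_id: ("tag", tag_body)}  # only the annotated tag creates an object
refs = {
    "refs/tags/V1.0": tag_id,         # annotated: ref -> tag object -> commit
    "refs/tags/lightweight": commit,  # lightweight: ref -> commit directly
}


def peel(ref_name: str) -> str:
    """Follow a ref through any tag objects to the object finally named."""
    name = refs[ref_name]
    while name in objects and objects[name][0] == "tag":
        first_line = objects[name][1].split(b"\n", 1)[0]
        name = first_line.decode("ascii").split(" ", 1)[1]
    return name


print(peel("refs/tags/V1.0") == peel("refs/tags/lightweight"))
```

Both names lead to the same commit, but only the annotated tag adds an object that can carry a message, a tagger, and optionally a signature.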
In addition to the log message and author information, the tag refers to the commit object 3ede462. Usually, Git tags a particular commit as named by some branch. This behavior is notably different from that of other VCSs.
Git usually tags a commit object, which points to a tree object, which encompasses the total state of the entire hierarchy of files and directories within your repository.
Recall from Figure 4-1 that the V1.0 tag points to the commit named 1492, which in turn points to a tree (8675309) that spans multiple files. Thus, the tag simultaneously applies to all files of that tree.
This is unlike CVS, for example, which will apply a tag to each individual file and then rely on the collection of all those tagged files to reconstitute a whole tagged revision. And whereas CVS lets you move the tag on an individual file, Git requires a new commit, encompassing the file state change, onto which the tag will be moved.