Chapter 6. Tracking Other Repositories

This chapter discusses copying or “cloning” an existing repository, and thereafter sharing changes between original and clone using the Git “push” and “pull” commands.

Cloning a Repository

The git clone command initializes a new repository with the contents of another one and sets up tracking branches in the new repository so that you can easily coordinate changes between the two with the push/pull mechanism. We call the first repository a “remote” (even if it is in fact on the same host), and by default, this remote is named origin; you can change this with the --origin (-o) option, or with git remote rename later on. You can view and manipulate remotes with git remote; a repository can have more than one remote with which it synchronizes different sets of branches.

After cloning the remote repository, Git checks out the remote HEAD branch (often master); you can have it check out a different branch with -b branch, or none at all with -n:

$ git clone http://nifty-software.org/foo.git
Cloning into 'foo'...
remote: Counting objects: 528, done.
remote: Compressing objects: 100% (425/425), done.
remote: Total 528 (delta 100), reused 528 (delta 100)
Receiving objects: 100% (528/528), 1.31 MiB | 1.30 Mi…
Resolving deltas: 100% (100/100), done.

If you give a second argument, Git will create a directory with that name for the new repository (or use an existing directory, so long as it’s empty); otherwise, it derives the name from that of the source repository using some ad hoc rules. For example, foo stays foo, but foo.git and bar/foo also become foo.
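The clone options described so far can be tried out safely on a throwaway repository. The following is a minimal sketch, not a recipe: the directory names (upstream/foo.git, foo-topic, foo-nocheckout), the remote name nifty, and the demo identity are all made up for illustration, and a local path stands in for the URL used above.

```shell
set -eu
work=$(mktemp -d)   # scratch area; every name below is hypothetical
cd "$work"
# Demo identity (assumed values) so commits do not depend on your configuration
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# A source repository with one file and two branches
mkdir upstream
git init -q upstream/foo.git
git -C upstream/foo.git symbolic-ref HEAD refs/heads/master
echo "hello" > upstream/foo.git/README
git -C upstream/foo.git add README
git -C upstream/foo.git commit -q -m "in principio"
git -C upstream/foo.git branch topic

# No second argument: the directory name is derived from the source name
git clone -q upstream/foo.git

# -o names the remote, -b picks the branch, the second argument names the directory
git clone -q -o nifty -b topic upstream/foo.git foo-topic
remote_name=$(git -C foo-topic remote)
checked_out=$(git -C foo-topic rev-parse --abbrev-ref HEAD)

# -n skips the checkout, leaving the working tree empty
git clone -q -n upstream/foo.git foo-nocheckout

echo "remote=$remote_name branch=$checked_out"
```

The first clone lands in a directory named foo, illustrating the naming rule; the second shows that the remote name and initial branch are independent choices.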

You can specify the remote repository with a URL as shown, or with a simple path to a directory in the filesystem containing a Git repository. Git supports a number of transport schemes natively to access remote repositories, including HTTP, HTTPS, its own git protocol, FTP, FTPS, and rsync.

Git will also automatically use SSH if you use the ssh URL scheme (ssh://), or give the repository as [user@]host:/path/to/repo; this uses SSH to run git upload-pack on the remote side. If the path is relative (no leading slash), then it is usually relative to the home directory of the login account on the server, though this depends on the SSH server configuration. You can specify the SSH program to use with the environment variable GIT_SSH (the default is, unsurprisingly, ssh). With the long form you can also give a TCP port number for the server, e.g., ssh://nifty-software.org:2222/foo.

When you give the origin repository as a simple directory name, and the new repository is on the same filesystem, Git uses Unix “hard links” to the originals for certain files instead of copying them when populating the object database of the clone, saving time and disk space. This is safe for two reasons. First, the semantics of hard links are such that someone deleting a shared file in the origin repository has no effect on you; files remain accessible until the last link is removed. Second, because of content-based addressing, Git objects are immutable; an object with a given ID will not suddenly change out from under you. You can turn off this feature and force actual copying with --no-hardlinks, or by using a URL with the “file” scheme to access the same path: file:///path/to/repo.git (the empty hostname between the second and third slash indicates the local host).
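To see the difference between the three ways of cloning the same local repository, you can count object files that have more than one hard link. This is a rough sketch under stated assumptions: the repository names are hypothetical, the helper linked_objects is ours rather than a Git command, and whether the plain-path clone actually gets hard links depends on your filesystem, so the script only prints that number.

```shell
set -eu
work=$(mktemp -d)   # scratch area; all names are hypothetical
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q origin-repo
echo "hello" > origin-repo/README
git -C origin-repo add README
git -C origin-repo commit -q -m "in principio"

# Helper (not part of Git): count object files with more than one hard link
linked_objects() {
    echo $(( $(find "$1/.git/objects" -type f -links +1 | wc -l) ))
}

git clone -q origin-repo by-path                      # may hard-link objects
git clone -q --no-hardlinks origin-repo copied        # forces real copies
git clone -q "file://$work/origin-repo" by-file-url   # goes through the normal transport

by_path=$(linked_objects by-path)
copied=$(linked_objects copied)
by_url=$(linked_objects by-file-url)
echo "hard-linked object files: path=$by_path no-hardlinks=$copied file-url=$by_url"
```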

Note

When we refer to a “local” repository in this section, we mean one accessible to Git using the filesystem, as opposed to needing an explicit network connection (SSH, HTTP, and so on). That may not in fact be “local” to the host itself, however (meaning on hardware directly attached to it); it could be on a file server accessed over the network via NFS or CIFS, for example. Thus, a repository that is “local” to Git might still be “remote” from the host.

Shared Clone

An even faster method when cloning a local repository is the --shared option. Rather than either copy or link files between the origin and clone repositories, this simply configures the clone to search the object database of the origin in addition to its own. Initially, the object database of the clone is completely empty, because all the objects it needs are in the origin. New objects you create in the clone are added to its own database; the clone never modifies the origin’s database via this link.

It’s important to keep in mind, though, that the clone is now dependent on the origin repository to function; if the origin is not accessible, Git may abort, complaining that its object database is corrupted because it can’t find objects that used to be there. If you know you’re going to remove the origin repository, you can use git repack -a in the clone to force it to copy all the objects it needs into its own database. If you have to recover from accidentally deleting the origin, you can edit .git/objects/info/alternates to point at another local copy, if you have one. You can also add the other repository with git remote add, then use git fetch remote to pull over the objects you need (git fetch --all fetches from every remote and does not accept a remote name).

Another issue with shared clones is garbage collection: if garbage collection is later run on the remote and by then it has removed some refs you still have, objects that are still part of your history may just disappear, again leading to “database corrupted” errors on your side.
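Here is a small sketch of a shared clone and of making it independent again. The names origin-repo and clone are hypothetical. Note two additions that go beyond the text above and are our assumptions about a reasonable cleanup: we pass -d to git repack so the now-redundant packs are dropped, and we delete the alternates file by hand once the objects have been copied.

```shell
set -eu
work=$(mktemp -d)   # scratch area; all names are hypothetical
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q origin-repo
echo "hello" > origin-repo/README
git -C origin-repo add README
git -C origin-repo commit -q -m "in principio"

# The shared clone records where to look for objects instead of copying them
git clone -q --shared origin-repo clone
alternates=$(cat clone/.git/objects/info/alternates)
echo "clone borrows objects from: $alternates"

# Copy everything the clone needs into its own database, then cut the link
git -C clone repack -a -d -q
rm clone/.git/objects/info/alternates

# The clone keeps working even after the origin disappears
rm -rf origin-repo
survived=$(git -C clone show HEAD:README)
echo "README after removing the origin: $survived"
```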

Bare Repositories

A “bare” repository is one without a working tree or index, created by git init --bare; the files normally under .git are right inside the repository directory instead. A bare repository is usually a coordination point for a centralized workflow: each person pushes and pulls to and from the bare copy, which represents the current “official” state of the project. No one uses the bare copy directly, so it doesn’t need a working tree (you can’t push into a non-bare repository if the push tries to update the currently checked-out branch, as that would change the branch out from under the person using it). Another use for a bare repository, using git clone --bare, is shown in the next section.
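As a quick illustration, the sketch below creates a bare repository, confirms that Git regards it as bare, and pushes a first commit into it from an ordinary clone. The names central.git and dev are hypothetical, and a local path stands in for a real server.

```shell
set -eu
work=$(mktemp -d)   # scratch area; all names are hypothetical
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q --bare central.git
is_bare=$(git -C central.git rev-parse --is-bare-repository)
ls central.git   # HEAD, objects, refs, ... sit directly in the directory

# Each person works in a normal clone and pushes to the bare copy
git clone -q central.git dev 2>/dev/null   # Git warns that the repository is empty
git -C dev symbolic-ref HEAD refs/heads/master
echo "hello" > dev/README
git -C dev add README
git -C dev commit -q -m "in principio"
git -C dev push -q origin master

pushed=$(git -C central.git rev-parse refs/heads/master)
local_tip=$(git -C dev rev-parse refs/heads/master)
echo "bare=$is_bare pushed=$pushed"
```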

Reference Repositories

Suppose that:

  • You want to have checkouts of multiple branches of the same project at once; or
  • Several people with access to the same filesystem want clones of the same repository; or
  • Some process requires you to clone the same repository frequently

…and that the repository takes a long time to clone; perhaps it has a large history, or there’s a slow network link in the way. A solution is to share one local copy of the object database, rather than pull it over repeatedly, but using git clone --shared is awkward for this, because it introduces two levels of push/pull: you push from your clone to the local shared (bare) clone, and then you have to push from there to the origin (and similarly for pull).

Git has another option that exactly fits this bill: a “reference repository.” Here’s how it works: first, we make a bare clone of the remote repository, to be shared locally as a reference repository (hence named “refrep”):

$ git clone --bare http://foo/bar.git refrep
Cloning into 'refrep'...
remote: Counting objects: 21259, done.
remote: Compressing objects: 100% (6730/6730), done.
Receiving objects: 100% (21259/21259), 39.84 MiB | 12…
remote: Total 21259 (delta 15427), reused 20088 (delt…
Resolving deltas: 100% (15427/15427), done.

Then, we clone the remote again, but this time giving refrep as a reference:

$ git clone --reference refrep http://foo/bar.git
Cloning into 'bar'...
done.

This happens very quickly, and you see no messages about transferring objects, because none were needed; all the objects were already available in the reference repository. Others at your site who can reach this reference repository can use the same command to create their clones as well, sharing the reference.

The key difference between this and the --shared option is that you are still tracking the remote repository, not the refrep clone. When you pull, you still contact http://foo/, but you don’t need to wait for it to send any objects that are already stored locally in refrep; when you push, you are updating the branches and other refs of the foo repository directly.
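You can check both halves of that claim in a scratch directory: the clone borrows objects from the reference, yet its remote still points at the upstream. In this sketch a file:// URL stands in for http://foo/bar.git so that Git uses its regular transport; the names upstream, refrep, and bar are illustrative only.

```shell
set -eu
work=$(mktemp -d)   # scratch area; all names are hypothetical
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q upstream
echo "hello" > upstream/README
git -C upstream add README
git -C upstream commit -q -m "in principio"
url="file://$work/upstream"   # stand-in for the real remote URL

# One bare clone serves as the shared reference repository
git clone -q --bare "$url" refrep

# Later clones name the reference but still track the upstream
git clone -q --reference refrep "$url" bar

borrowed_from=$(cat bar/.git/objects/info/alternates)
tracked_url=$(git -C bar config remote.origin.url)
echo "objects borrowed from: $borrowed_from"
echo "origin still points at: $tracked_url"
```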

Of course, as soon as you and others start pushing new commits, the reference repository will become out of date, and you’ll start to lose some of the benefit. Periodically, you can run git fetch --all in refrep to pull in any new objects. A single reference repository can be a cache for the objects of any number of others; just add them as remotes in the reference:

$ git remote add zeus http://olympus/zeus.git
$ git fetch zeus

Warning

  1. You can’t safely run garbage collection in a reference repository. Someone using it may be still using a branch that has been deleted in the upstream repository, or otherwise have references to objects that have become unreachable there. Garbage collection might delete those objects, and that person’s repository would then have problems, as it now can’t find objects it needs. Some Git commands periodically run garbage collection automatically, as routine maintenance. You should turn off pruning of unreachable objects in the reference repository with git config gc.pruneexpire never. This still allows other safe operations to run during garbage collection, such as collecting objects stored in individual files (“loose objects”) into more efficient data structures called “packs.” Since people don’t normally use a reference repository directly and thus won’t trigger automatic garbage collection, you may want to arrange for a periodic job to run git gc in a reference repository (after setting gc.pruneexpire as shown).
  2. Be careful about security. If you have restricted who can clone a repository, but then add its objects to a reference, then anyone who can read the files in the reference can get the same information.
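The first warning can be exercised directly. The sketch below turns off pruning in a hypothetical reference repository named refrep, stores an object that no ref points to (standing in for one that some clone still needs), and checks that it is still there after garbage collection. The periodic job mentioned above would simply run the git gc line on a schedule; how you schedule it is up to you.

```shell
set -eu
work=$(mktemp -d)   # scratch area; all names are hypothetical
cd "$work"

git init -q --bare refrep
git -C refrep config gc.pruneexpire never

# An unreachable object, as if its branch had been deleted upstream
orphan=$(echo "still needed elsewhere" | git -C refrep hash-object -w --stdin)

# Routine maintenance: packs objects but must not delete the orphan
git -C refrep gc -q

if git -C refrep cat-file -e "$orphan"; then survived=yes; else survived=no; fi
echo "unreachable object survived gc: $survived"
```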

Local, Remote, and Tracking Branches

When you clone a repository, Git sets up “remote-tracking” branches corresponding to the branches in the origin repository. These are branches in your local repository, which show you the state of the origin branches at the time of your last push or pull. When you check out a branch that doesn’t yet exist, but there is a remote-tracking branch by that name, Git automatically creates it and sets its upstream to be that tracking branch, so that subsequent push/pull operations will synchronize your local version of this branch with the remote’s version. For example, when you first clone a repository, Git checks out the remote’s HEAD branch, so this happens right away for one branch:

$ git clone git://nifty-software.org/nifty.git
...
$ cd nifty
$ git branch --all
master
origin/master
origin/topic

To begin with, your local and remote-tracking branches for master are at the same commit:

$ git log --oneline --decorate=short
3a9ee5f3 (origin/master, master) in principio

If you add a commit, you will see your branch pull ahead:

$ git log --oneline --decorate=short
3307465c (master) the final word
3a9ee5f3 (origin/master) in principio

If you run git fetch, you may find that someone else has also added a commit, and the branches have now diverged:

$ git log --graph --all
* commit baa699bc (origin/master)
| Author: Nefarious O. Committer <nefarious@qoxp.net>
| Date:   Fri Aug 24 09:33:10 2012 -0400
|
|     not quite
|
| * commit 3307465c (master)
|/  Author: Richard E. Silverman <res@qoxp.net>
|   Date:   Fri Aug 24 09:32:54 2012 -0400
|
|       the final word
|
* commit 3a9ee5f3
  Author: Mysterious Author <ma@qoxp.net>
  Date:   Fri Aug 24 09:42:27 2012 -0400

      in principio

git pull will try to merge the now-distinct branches, which is necessary before you can push your changes; otherwise, git push would update the remote’s master branch (and with it your origin/master) to match your master, and lose commit baa699bc in the process.
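The divergence shown above can be reproduced and measured. In this sketch, two clones (hypothetically named mine and theirs) share a bare repository central.git; after a fetch, git rev-list with --left-right --count reports how many commits each side has that the other lacks. The commit messages are borrowed from the example; everything else is an assumption for the demo.

```shell
set -eu
work=$(mktemp -d)   # scratch area; all names are hypothetical
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
git -C seed commit -q --allow-empty -m "in principio"
git clone -q --bare seed central.git
git clone -q central.git mine
git clone -q central.git theirs

# Someone else commits and pushes first
git -C theirs commit -q --allow-empty -m "not quite"
git -C theirs push -q origin master

# You commit locally, then fetch: origin/master moves, master does not
git -C mine commit -q --allow-empty -m "the final word"
git -C mine fetch -q origin

set -- $(git -C mine rev-list --left-right --count master...origin/master)
ahead=$1
behind=$2
echo "master is $ahead ahead of and $behind behind origin/master"
```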

Synchronization: Push and Pull

Having cloned a repository, you use git push and git pull to reconcile your changes with those of others using the same upstream repository. Various things can happen when your changes conflict with theirs; we’ll start discussing that here, and continue in Chapter 7.

Pulling

If a branch foo is tracking a branch in a remote repository, that remote is configured as branch.foo.remote in this repository, and is said to be the remote associated with this branch, or just the “remote of this branch.” git pull updates the tracking branches of the remote for the current branch (or of the origin remote if the branch has none), fetching new objects as needed and recording new upstream branches. If the current branch is tracking an upstream in that remote, Git then tries to reconcile the current state of your branch with that of the newly updated tracking branch. If only you or the upstream has added commits to this branch since your last pull, then this will succeed with a “fast-forward” update: one branch head just moves forward along the branch to catch up with the other. If both sides have added commits, though, then a fast-forward update is not possible: just setting one side’s branch head to match the other would discard the opposite side’s new commits (they would become unreachable from the new head). This is the situation shown previously, and the solution is a merge:

$ git log --graph --oneline
*   2ee20b94     (master, origin/master) Merge branch…
|\
| * 3307465c     the final word
* | baa699bc     not quite
|/
* 3a9ee5f3       in principio

The merge commit 2ee20b94 brings together the divergent local and upstream versions of the branch, and allows both master and origin/master to advance to the same commit without losing information. git pull will automatically attempt this, and if it can combine the actual changes cleanly, this will all happen smoothly. If not, Git will stop and ask you to deal with the conflicts before making the merge commit; we’ll discuss that process in Chapter 7.
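A pull that ends in a merge can be sketched as follows. The repository names (seed, central.git, mine, theirs) and file names are hypothetical. We pass --no-rebase and --no-edit explicitly, which is our choice for the demo: recent Git versions want to be told how to reconcile divergent branches, and --no-edit accepts the default merge message without opening an editor.

```shell
set -eu
work=$(mktemp -d)   # scratch area; all names are hypothetical
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
echo "base" > seed/base.txt
git -C seed add base.txt
git -C seed commit -q -m "in principio"
git clone -q --bare seed central.git
git clone -q central.git mine
git clone -q central.git theirs

# Independent changes on both sides of the same branch
echo "theirs" > theirs/b.txt
git -C theirs add b.txt
git -C theirs commit -q -m "not quite"
git -C theirs push -q origin master
echo "mine" > mine/a.txt
git -C mine add a.txt
git -C mine commit -q -m "the final word"

# The remote associated with the current branch
pull_remote=$(git -C mine config branch.master.remote)

# Fetch, then merge the updated tracking branch into master
git -C mine pull -q --no-rebase --no-edit

set -- $(git -C mine rev-list --parents -n 1 HEAD)
parent_count=$(( $# - 1 ))
git -C mine log --graph --oneline
echo "remote=$pull_remote parents of HEAD=$parent_count"
```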

Pushing

git push is the converse of git pull, with which you apply your changes to the upstream repository. If, as before, your history has diverged from that of the remote, Git will refuse to push unless you address the divergence, which you do by pulling first (as Git helpfully reminds you):

$ git push
To git://nifty-software.org/nifty.git
! [rejected]      master -> master (non-fast-forward)
error: failed to push some refs to 'git://nifty-softw…
hint: Updates were rejected because the tip of your
hint: current branch is behind its remote
hint: counterpart. Merge the remote changes
hint: (e.g. 'git pull') before pushing again.  See
hint: the 'Note about fast-forwards' in 'git push
hint: --help' for details.

Once you pull and resolve any conflicts, you can push again successfully. The goal of pulling with regard to pushing is to integrate the upstream changes with your own so that you can push without discarding any commits in the upstream history. You may accomplish that by merging as previously shown, or by “rebasing” (see Pull with Rebase).

If you have added a local branch of your own and want to start sharing it with others, use the -u option to have Git add your branch to the remote, and set up tracking for your local branch in the usual way, for example:

$ git push -u origin new-branch

After this initial setup you can use just git push on this branch, with no options or arguments, to push to the same remote.
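Both behaviors, the rejected push and the first push of a new branch, can be reproduced in a scratch setup. This is only a sketch; central.git, mine, and theirs are hypothetical names, and the script records the rejection in a variable rather than showing Git’s message.

```shell
set -eu
work=$(mktemp -d)   # scratch area; all names are hypothetical
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
git -C seed commit -q --allow-empty -m "in principio"
git clone -q --bare seed central.git
git clone -q central.git mine
git clone -q central.git theirs

git -C theirs commit -q --allow-empty -m "not quite"
git -C theirs push -q origin master
git -C mine commit -q --allow-empty -m "the final word"

# Histories have diverged, so this push is refused (non-fast-forward)
if git -C mine push -q origin master 2>/dev/null; then rejected=no; else rejected=yes; fi

# A new local branch: -u publishes it and records its upstream
git -C mine checkout -q -b new-branch
git -C mine push -q -u origin new-branch
upstream=$(git -C mine rev-parse --abbrev-ref "new-branch@{upstream}")
echo "push to master rejected: $rejected; new-branch tracks $upstream"
```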

Push Defaults

There are several approaches Git can use when given no specific remote and ref to push (just plain git push, as opposed to git push remote branch):

matching
Push all branches with matching local and remote names
upstream
Push the current branch to its upstream (making push and pull symmetric operations)
simple
Like upstream, but check that the branch names are the same (to guard against mistaken upstream settings)
current
Push the current branch to a remote one with the same name (creating it if necessary)
nothing
Push nothing (require explicit arguments)

You can set this with the push.default configuration variable. The default as of this writing is matching, but with Git 2.0, this will change to simple, which is more conservative and avoids easy accidental pushing of changes on other branches that are not yet ready to be published. To choose an option, think about what would happen in your particular situation if you accidentally typed git push with each of these options in force, and pick the one that makes you most comfortable. Remember that like all options, you can set this on a per-repository basis (see Basic Configuration).
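One way to get a feel for these settings is to try a plain git push under two of them in a disposable clone. The sketch below, with hypothetical repository names, sets push.default per repository, shows that nothing refuses an argument-less push, and then switches to simple.

```shell
set -eu
work=$(mktemp -d)   # scratch area; all names are hypothetical
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
git -C seed commit -q --allow-empty -m "in principio"
git clone -q --bare seed central.git
git clone -q central.git mine
git -C mine commit -q --allow-empty -m "the final word"

# With "nothing", a plain push is an error: explicit arguments are required
git -C mine config push.default nothing
if git -C mine push -q 2>/dev/null; then plain_push=allowed; else plain_push=refused; fi

# With "simple", a plain push sends the current branch to its same-named upstream
git -C mine config push.default simple
git -C mine push -q

setting=$(git -C mine config push.default)
echo "under nothing: $plain_push; now using: $setting"
```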

Pull with Rebase

Along with the facility of merge commits comes the need to make them wisely. The notion of what a merge should indicate with respect to content is subjective and varies as a matter of version control discipline and style, but generally you want a merge to point out a substantive combination of two lines of development. Certainly, too many merges create a commit graph that is difficult to read, thus reducing the usefulness of the structural merge feature itself. In this context, certain workflows can easily create what one might call “spurious merges,” which do not actually correspond to such merging of content. Having lots of these clutters up the commit graph, and makes it difficult to discern the real history of a project.

As an example: suppose you and a colleague are coordinating your individual repositories via push/pull with a shared central one. You commit a change to your repository, while he commits an unrelated change on the same branch. The changes might be to different files, or even to the same file but such that they do not require manual conflict resolution. If he pushes first, then as described earlier, your subsequent push will fail, so you will pull; then Git will do a successful automatic merge (since the changes were independent), and this becomes part of the repository history with your final push. But if you think of a merge as a deliberate step to signal the combination of conflicting or substantially different content, then you don’t really want this merge. The telltale sign of this sort of spurious merge is that it’s purely an artifact of timing; if the order of events had instead been:

  1. You commit and push.
  2. He pulls.
  3. He commits and pushes.

then there would have been no conflict, and no merge. This observation is the key to avoiding such merges using git pull --rebase, which reorders your changes. “Rebasing” is a more general idea, which we treat in Rebasing; the pull-with-rebase option is a special case. Briefly, what happens is this: suppose your master branch diverged from its upstream several commits back. For each divergent commit on your branch, Git constructs a patch representing the changes introduced by that commit; then it applies these in order starting at the tip of the upstream tracking branch origin/master. After applying each patch, Git makes a new commit preserving the author information and message from the original commit. Finally, it resets your master branch to point to the last of these commits. The effect is to “replay” your work on top of the upstream branch as new commits, rather than effecting a merge with your existing commits.

In the earlier example, git pull --rebase would produce the following simple, linear history instead of the “merge bubble” previously pictured, with its extra commit:

* 1e6f2cb2     the final word
* baa699bc     not quite
* 3a9ee5f3     in principio

A push now will succeed without further work (and without merging), because you’ve simply added to the upstream branch; it will be a fast-forward update of that branch. Note that the commit ID for “the final word” has changed; that’s because it’s a new commit made by replaying the changes of the original on top of commit baa699bc.
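The same scenario can be replayed with --rebase to confirm the linear result. In this sketch (hypothetical names throughout), we record the commit ID before and after the pull, count the parents of the new tip, and then push, which now succeeds as a fast-forward.

```shell
set -eu
work=$(mktemp -d)   # scratch area; all names are hypothetical
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
echo "base" > seed/base.txt
git -C seed add base.txt
git -C seed commit -q -m "in principio"
git clone -q --bare seed central.git
git clone -q central.git mine
git clone -q central.git theirs

echo "theirs" > theirs/b.txt
git -C theirs add b.txt
git -C theirs commit -q -m "not quite"
git -C theirs push -q origin master
theirs_tip=$(git -C theirs rev-parse master)

echo "mine" > mine/a.txt
git -C mine add a.txt
git -C mine commit -q -m "the final word"
old_id=$(git -C mine rev-parse master)

# Replay the local commit on top of the updated origin/master
git -C mine pull -q --rebase
new_id=$(git -C mine rev-parse master)

set -- $(git -C mine rev-list --parents -n 1 master)
parent_count=$(( $# - 1 ))
new_parent=$2

git -C mine log --oneline
git -C mine push -q origin master   # a fast-forward of the upstream branch
```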

If git pull starts a merge when you know there’s no need for it, you can always cancel it by giving an empty commit message, or with git merge --abort if the merge failed leaving you in conflict-resolution mode. If you complete such a merge and want to undo it, use git reset --hard HEAD^ to move your branch back again, discarding the merge commit (and any uncommitted changes in your working tree). You can then use git pull --rebase instead. You can set a specific branch to automatically use --rebase when pulling:

$ git config branch.branch-name.rebase yes

and the configuration variable branch.autosetuprebase controls how this is set for new branches:

never
Default: do not set rebase
remote
Set for branches tracking remote branches
local
Set for branches tracking other branches in the same repository
always
Set for all tracking branches
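A short sketch of both settings in action, using hypothetical repository and branch names: with branch.autosetuprebase set to always, a newly created tracking branch gets its rebase flag automatically, while a branch that already existed has to be configured by hand.

```shell
set -eu
work=$(mktemp -d)   # scratch area; all names are hypothetical
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
git -C seed commit -q --allow-empty -m "in principio"
git -C seed branch topic
git clone -q seed mine

# New tracking branches created from now on will pull with --rebase
git -C mine config branch.autosetuprebase always
git -C mine checkout -q -b topic --track origin/topic
auto=$(git -C mine config --bool branch.topic.rebase)

# master was created by the clone, before the setting; configure it explicitly
git -C mine config branch.master.rebase yes
manual=$(git -C mine config --bool branch.master.rebase)

echo "topic rebase=$auto master rebase=$manual"
```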

Notes

  1. If you know it’s the right thing to do, you can perform destructive, non–fast-forward updates with the --force option to either push or pull, although in the case of push the remote must be configured to allow it; repositories created with git init --shared have this disabled by setting receive.denyNonFastForwards.

    Beware! It’s one thing to do a forced pull; you’re just discarding some of your own history. A forced push, on the other hand, causes grief for other people, who will be unable to pull cleanly as a result. For a repository shared by a small set of people in close communication, or that is a read-only reference for most, this may be occasionally appropriate. For anything shared by a wide audience, though, you really don’t want to do this.

  2. The command git remote show remote gives a useful summary of the status of your repository in relation to a remote:

    $ git remote show origin
    * remote origin
      Fetch URL: git://tamias.org/chipmunks.git
      Push  URL: git://tamias.org/chipmunks.git
      HEAD branch: master
      Remote branches:
        alvin    tracked
        theodore tracked
        simon    tracked
      Local branches configured for 'git pull':
        alvin  merges with remote alvin
        simon  merges with remote simon
      Local refs configured for 'git push':
        alvin  pushes to alvin  (up to date)
        simon  pushes to simon  (local out of date)

    Note that unlike most informational commands, this actually examines the remote repository, so it will run ssh or otherwise use the network if necessary. You can use the -n switch to avoid this; Git will skip those operations that require contacting the remote and note them as such in the output.

  3. git branch -vv gives a more compact summary without contacting the remote (and thus reflects the state as of the last fetch or pull; remember that the remote might have changed in the meantime). The following shows a purely local master branch, plus two branches tracking remote ones: alvin is up to date with respect to its upstream, whereas the current local branch, simon, has moved three commits forward:

    $ git branch -vv
      alvin  7e55cfe3 [origin/alvin] I love chestnuts.
      master a675f734 Chipmunks are the real nuts.
    * simon  9b0e3dc5 [origin/simon: ahead 3] Walnuts!
    

    (This state is not one resulting from previous examples.)

  4. There appears to be a lot of pointless redundancy in many of these messages; things like “alvin pushes to alvin,” or updates indicating “master→master.” The reason is that the default, common situation is for corresponding local and remote branches to have matching names, but this need not be the case; for more complex situations, you can have arbitrary associations, and the Git messages take this into account. For example, if you have a repository with two remotes each having a master branch, your local tracking branches can’t both be named master as well. You could proceed this way:

    $ git remote add foo git://foo.com/foo.git
    $ git remote add bar http://bar.com/bar.git
    $ git fetch --all
    Fetching foo
    remote: Counting objects: 6, done.
    remote: Compressing objects: 100% (2/2), done.
    remote: Total 6 (delta 0), reused 0 (delta 0)
    Unpacking objects: 100% (6/6), done.
    From git://foo.com/foo.git
    * [new branch]        master     -> foo/master
    Fetching bar
    remote: Counting objects: 5, done.
    remote: Total 3 (delta 0), reused 0 (delta 0)
    Unpacking objects: 100% (3/3), done.
    From http://bar.com/bar.git
    * [new branch]        master     -> bar/master
    $ git checkout -b foo-master --track foo/master
    Branch foo-master set up to track remote branch
    master from foo.
    Switched to a new branch 'foo-master'
    $ git checkout -b bar-master --track bar/master
    Branch bar-master set up to track remote branch
    master from bar.
    Switched to a new branch 'bar-master'
    $ git branch -vv
    * bar-master f1ace62e [bar/master] bars are boring
      foo-master 11e4af82 [foo/master] foosball is fab
    ...
    

    These messages from git fetch:

     * [new branch]        master     -> foo/master
     ...
     * [new branch]        master     -> bar/master
     ...

    might be a little confusing; they indicate that the remote branch master in each repository is now being tracked by the remote-tracking branches foo/master and bar/master in your repository, respectively (not that it somehow overwrote a local master branch, which might or might not exist and is not relevant here).
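The two-remote arrangement from note 4 can be tried without any network access. In this sketch, two local repositories (hypothetically foo-src and bar-src) play the part of the remote servers, and a third repository named hub adds both as remotes and gives each master branch a differently named local branch.

```shell
set -eu
work=$(mktemp -d)   # scratch area; all names are hypothetical
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Two unrelated projects, each with its own master branch
for name in foo bar; do
    git init -q "$name-src"
    git -C "$name-src" symbolic-ref HEAD refs/heads/master
    git -C "$name-src" commit -q --allow-empty -m "first commit in $name"
done

git init -q hub
git -C hub remote add foo "$work/foo-src"
git -C hub remote add bar "$work/bar-src"
git -C hub fetch -q --all

# Local names differ from the remote branch name they track
git -C hub checkout -q -b foo-master --track foo/master
git -C hub checkout -q -b bar-master --track bar/master

foo_up=$(git -C hub rev-parse --abbrev-ref "foo-master@{upstream}")
bar_up=$(git -C hub rev-parse --abbrev-ref "bar-master@{upstream}")
git -C hub branch -vv
```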

Access Control

In a word (or three): there is none.

It is important to understand that Git by itself does not provide any sort of authentication or comprehensive access control when accessing a remote repository. Git has no internal notion of “user” or “account,” and although some specific actions may be forbidden by configuration (e.g., non–fast-forward updates), generally you can do whatever is possible with the operating-system level access controls in place. For example, remote repositories are often accessed via SSH. This usually means that you need to be able to log into an account on the remote machine (which account may be shared with other people); you can clone and pull from the repository if that account has read access to the repository files on that machine, and you can push to the repository if that account has write access. If you’re using HTTP for access instead, then similar comments apply to the configuration of the web server and the account under which it accesses the repository. That’s it. There is no way within Git to limit access to particular users according to more fine-grained notions, such as granting read-only access to one branch, commit access to another, and no access to a third. There are, however, third-party tools that add such features; Gitolite, Gitorious, and Gitosis are popular ones.
