Publishing the next part of my talk about Git. See the first part here.
Let’s continue to the second part of the talk. Whereas the building blocks we were considering in the first part are more or less abstract and can be used for understanding generic principles of any distributed version control system, the areas and especially commands are specific to Git.
Let’s start with Working tree and Index. The Working tree is something that is understood by all Git users—it’s their sandbox where they do all your edits and file structure manipulations. It can be also referred to as “Working Directory”.
As for the Index, it’s a more confusing area and depending on your experience with other version control systems, index can be either a benefit or a hazard (rephrasing Rick Dekkard). Some version control systems do not have index, thus changes in the working tree are immediately committed into the repository. Not so with Git. Here, Index is an intermediate area where the next commit is staged. That’s why Index is also referred to as “Staging Area”. In order to commit a change to the file, it must be first registered with Index.
Because Index sits between the working tree and the repository, it is also referred to as “Cache”. No wonder why reading Git’s messages and help pages related to Index can be very confusing! Speaking of the building blocks we were considering earlier, both Working tree and Index can be represented as Tree objects. Also note that both Working tree and Index are optional and are only needed to prepare new commits. On a headless Git server they do not exist as there are no users who would commit changes to the server repository directly.
As an example, let’s consider typical output from the git
status
command. The above output
appears when there are changes both to the index and to the working
tree. The changes that have already been registered with Index are
listed first. Changes that haven’t yet been registered are listed at the
bottom.
Note that the same file can be listed in both sections meaning that some of the changes in it are in the Index and some are not. That illustrates the “benefit or a hazard” behavior. On one hand, index allows you to split the changes your are making to the working tree into several commits. But this also can produce an unfinished commit that misses some changes you’ve done in the working tree but forgot to stage.
In this output Git hints us about some commands that can be used in order to move files between the index and the working directory. Let’s explore them.
git add
is used to stage to the
index any changes in the file. It’s also used to put a file under
version control. If you want to undo the changes done to the file in the
working tree and use previously staged version, use git
restore
command. This is a
relatively new command and long time git users know that a more
universal git checkout
command can also be used for the same purpose.
If we want to remove the file from under version control, we use git
rm
command which erases it in the
working tree and also removes from the index. If there are any untracked
garbage files or directories in the working tree, they can be easily
removed using git clean
command.
If we want to see the changes done to the files in the working tree
compared to the staging area, we need to use git
diff
command which generates a
diff object.
Moving onto the next level, let’s consider how do we move changes
between the index and the repository. The git
commit
command is well known—it
creates a commit object that points to the tree object of the index and
adds both to the repository.
In order to see the changes before we commit them we can use git
diff
with --cached
argument—it shows the difference between the
index tree and the tree from the last commit.
Previously mentioned git rm
and git restore
commands can be used
to remove or undo changes in the index. Note that there is an
inconsistency between the argument names they use.
And of course, there is no direct way to remove a committed change from the repository.
An interesting question: what does it contain in a “clean” state, e.g.
after doing git checkout
or git
reset
? A good answer to this
question is in the “Reset demystified” section of “Pro Git”
book.
On checkout, the tree of the index is the same as the tree from the
checked out commit (and same as your working dir assuming you haven’t
made any changes to it yet). Thus, both git diff
and git diff
--cached
produce an empty diff immediately after a checkout. If we
make a change in the working directory to an already tracked file, git
diff
shows it. Once we stage this change (git add
), it is now
shown by git diff --cached
, but isn’t shown by git diff
.
Finally, for those users who don’t want to fiddle with the cache, there
is a way to work through it, as if it didn’t exist. Using -a
argument with git commit command automatically puts all the current
changes from the working tree into the index before creating a commit.
git reset
with --hard
argument can be used to replace the
contents of the working tree and the index from the last commit in the
repository directly. And git diff HEAD
(all caps) can be used to
show combined differences between the working tree plus index and the
last commit. git diff HEAD
puts the tree of the working dir on top
of the tree of Index and diffs the result against the tree of the last
commit (so if there are both staged and unstaged changes to the same
line the latter will be shown).
What is HEAD, exactly? It’s a special reference in the repository that we consider next.
This is the already familiar diagram of the commit graph with branches. HEAD usually points to what is called “current” branch—the line of changes we are working on. Note that HEAD is a meta-reference—instead of pointing to a commit via its hash it instead contains the branch name, thus when we use HEAD, double dereferencing is needed to get to the commit.
This is how HEAD can be used and changed. When we do a commit, git
commit
uses HEAD to get to the last commit on the current branch and
make it the parent of the freshly created commit. Then git commit
updates that branch to point to the new commit.
git branch
command can be
used for creating a new branch without switching to it. A relatively new
command for changing the current branch, and thus updating HEAD is
git switch
. Note that git
switch
will refuse to work if there are any uncommitted changes in the
working tree, because switching to a branch also means populating the
working tree contents using the last commit of the branch, and any
uncommitted changes will be lost. In case we actually want to lose them,
we can use git reset --hard
and provide it with the branch name.
git switch
with -c
argument is used for creating a new branch
and switching to it (that is, updating HEAD).
People often ask, what are the differences between relatively new git
restore
and git switch
commands vs. good old git checkout
and
git reset
. There is an excellent write-up on GitHub
blog about
this. In short, git restore
and git switch
are safer
alternatives to old commands, but they don’t replace them fully. There
are some similarities in their arguments, e.g. for interactive
piece-by-piece restoration both git reset
and git restore
use
-p
argument, but there are differences too, e.g. git switch -c
is the same as git checkout -b
. In general, the arguments of the new
commands should be more straightforward and easier to understand.
The old multi-purpose git checkout
command can be used for changing
HEAD, too. Unlike git switch
which only accept branch names, git
checkout
can switch HEAD to any commit.
However, if the commit is not referenced by any branch, switching to it
creates a so called “detached” HEAD. This isn’t a normal state for
development as git commit
will not be able to refer the new commit
from any branch. That means, the new commit will be considered “garbage”
by Git and removed from the repository sooner or later.
That’s why Git displays this lengthy message when you switch to a
detached HEAD. BTW, it also hints you about a quick way to flip between
the last two branches by passing minus (-
) argument to git
switch
. The takeaway is—if you see this message, don’t commit any
changes unless you really know what you are doing.
As Git was created for distributed development, it natively supports synchronization with remote repositories. “Remote” doesn’t necessarily means that data is on another machine, it can reside in a different folder on the same hard drive.
To add a remote, use git remote add
command. As you can guess, Git sets no artificial restrictions on
how many remote repositories you may have.
Let’s do a brief recap on the main commands dealing with remote
repositories. git clone
allows
starting a new local repository from a remote one. It’s a shorthand for
creating an empty repository, adding a remote and syncing from it. There
are actually two commands for syncing: git
fetch
and git
pull
, we will consider the
differences between them later. git
push
is used to sync your local
changes back to the remote.
You never manipulate the contents of remote repositories directly. When you sync from a remote repository, all its objects and references get imported into your local repository, thus it effectively becomes a cache for remote data. After you’ve made any changes which means you have created new objects and changed references, these changes can be synced back to the remote using the same synchronization process.
As Git was designed as a peer-to-peer system, it doesn’t matter which side initiates data transfer. The remote side can request data from you, or you can push your local data to remote side. In the latter case you must have the necessary permissions.
Now let’s talk about the difference between git fetch
and git
pull
. The former doesn’t change the state of your branches or the
working tree, it just does the synchronization. git pull
by default
is a shorthand for doing git fetch
and then executing git
merge
. UPDATE: You can pass
--rebase
argument to do a rebase instead.
Here is an example of using git remote
command. First, we list the
remotes by passing the -v
argument. There are two remotes here, one
called “aosp” which is accessed via HTTPS, and another called “internal”
which lives on the same machine. It’s possible to use different URLs for
fetching and pushing.
Then we execute git fetch
command to receive any new objects and
references from the remote called “internal”. After it has finished you
can access the contents of objects received from the remote.
As we know, branches are the entry points to repository objects.
Obviously, it’s a good idea to use branches of the remote repository to
access its commits. From the building blocks point of view they don’t
represent anything new—they are also files pointing to commits. However,
because conceptually they are in remote repository, they can’t be
updated in any other way than syncing from a remote. For the same
reason, if you attempt to do a git checkout
using a remote branch
name, this will result in a detached HEAD.
The name of a remote branch reference is the same as the name of the branch in the corresponding remote repository. For namespacing purposes, names of remote branch references are prefixed with the name of the remote.
And here is another term from Git vocabulary—a ”tracking” branch. Since remote branch references are read-only, in order to do any development on a branch coming from a remote repository—it is called “upstream” branch—you need to have a local branch. Usually it has the same name as the upstream branch, but that’s not mandatory. The name of the upstream branch is stored in the metadata of the “tracking” branch.
If you attempt to switch to a non-existing branch using the same name as
one of remote branches has, Git will automatically assume that you want
to create a new branch that should track that remote branch. In order to
have control over this behavior, you can use --track
argument of the
git switch
command.
Here is an example of what happens when you switch to a local tracking
branch after doing git fetch
. If there were remote changes on the
upstream branch since the last sync—the last common commit is labelled
“C” on the diagram—Git will offer to merge your local branch with
the remote branch to integrate the changes together. Since these
branches have diverged, a fast forward is not possible, and a merge
commit will need to be created.
If you are not allowed to have merge commits, then instead of doing a merge you should rebase your changes on top of remote changes. We will consider the mechanics of this a bit later.
Enough with remotes, let’s talk about the tool that helps handling work interruptions. Sometimes you are in the middle of working on a change and then something unexpected happens: your previous change has caused a build break and a simple fix is needed immediately, or you’ve got some ideas and need to test them quickly on some other branch. You don’t want to lose your current changes, but you also don’t want to commit them as is.
One possible way of handling this is to create a branch and commit your
current changes there, then checkout another branch to restore the
working tree contents to a known state. However, there is another more
convenient solution—to stash the change away using git
stash
command. This command saves
your current index and working tree modifications into the repository as
commits and brings the working tree and index into the previous
committed state.
In terms of building blocks, Stash is just another reference, like a branch. However, the last stashed commit does not use the hash of the previously stashed commit as a parent. It’s actually interesting to understand how stashing is implemented.
Let’s say we have the last commit called “M”. We have made some changes
and added them to the staging area, then we have also made some changes
to the working tree and haven’t yet staged them. Now we execute git
stash
, what happens?
First, Git creates a commit from the contents of the index, setting “M”
as a parent—as if we have executed git commit
command. This commit
is shown as “I” on the diagram. Git also snapshots working tree contents
and creates another commit for them, called “W” on the diagram. This
commit has two parents: “M” and “I”, so this is a merge commit. The
stash reference is then set to point to that commit. After that, Git
resets the index and the working tree to the state of the commit “M”.
This brings up a question: if we do git stash
multiple times, how
can we find previously stashed commits?
To answer the question, we introduce another powerful tool of Git—the Reflog. Speaking in terms of building blocks, this is just a list of commit hashes. The power comes from the fact that Git adds a node to this list every time any local reference changes. This includes HEAD, branch references, and Stash.
So, answering the question we have previously stated—in order to find
previously stashed commits we can look into the reflog of Stash. git
stash list
command does exactly that.
Another useful application for reflog is finding commits that you have
accidentally unlinked from branches. You can find their hashes either in
the scroll buffer of your terminal—if they are still there— or in the
reflog.
Here is an example of executing git
reflog
command. It lists the
history of changes to HEAD. The reference specifications in the reflog
commits list use special syntax which we will consider later.