The title of this post isn’t supposed to be provocative. After all, it’s simply the textbook definition of what git is. So why bother stating it? Well, I’ve worked with a fair few junior developers now and I’m starting to see a pattern. Many of these developers have never programmed without git and they see git simply as “the way to get new code into a repository”. A glorified copy, essentially—but an annoying one that is prone to going wrong.
But git is so much more than a glorified copy. In this post I want to go back to basics and show what a version control system is and what it can do for you. I hope this will provide a different view of git that might help you in your git journey.
Dumb version control
Back in the day, before everything was on the cloud, it was frighteningly common to see the following turn up in an email attachment:
important-document-v6-2024-02-16-(gpk).doc
People who knew better would scoff at this, but what you’re seeing here is version control. It’s just very manual, dumb version control. It was scoffed at because it’s the kind of thing that is prone to going wrong, but if implemented very carefully, it could go right. Here’s how it might work:
- Type up the first version of a document, say important-document.doc,
- Make a copy of that, called important-document-v1.doc,
- Continue making further additions/edits to important-document.doc,
- Make another copy of that, called important-document-v2.doc.
The important thing here is discipline. For this to go well, the v1, v2 documents must never be edited again or you’ll undermine the whole system. To make it easier to do the right thing the dumb version control user might opt to keep the untouchable copies in a hidden directory, like .vcs, which might look like:
.
├── important-document.doc
└── .vcs
    ├── important-document-v1.doc
    └── important-document-v2.doc
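The manual steps above can be sketched in a few shell commands. This is only an illustration: the temporary directory and file contents are made up, and the chmod is my own addition as one possible way to discourage editing the stored copies.

```shell
set -eu
work=$(mktemp -d)   # throwaway directory for the demo
cd "$work"
mkdir .vcs

# First version: write the document, then store an untouchable copy.
printf 'first draft\n' > important-document.doc
cp important-document.doc .vcs/important-document-v1.doc
chmod a-w .vcs/important-document-v1.doc   # assumption: read-only as a reminder

# Keep editing the working copy, then store the second version.
printf 'some edits\n' >> important-document.doc
cp important-document.doc .vcs/important-document-v2.doc
chmod a-w .vcs/important-document-v2.doc

ls .vcs
```

The working copy keeps changing while the copies in .vcs stay exactly as they were when they were made.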
What about those other parts in the first example, like the (gpk)? These are to enable collaboration. The way this worked is you would send v6 to me, then continue working and produce a v7. Later, I would send you back some corrections. You now have two branches that need to be reconciled. And that’s exactly what people would do: they would go through the corrected v6-(gpk) and apply all the changes to v7. People just kept this stuff in their heads and, for the most part, it kind of worked.
Git is dumb version control
The big secret is git is, in essence, nothing more than an implementation of the above system, with one small difference.
The first thing to understand about git is a commit is a copy of your entire working directory. This also means a commit and a version are the same thing. Just like the dumb system, making a commit is nothing more than copying the current working directory into a separate storage place. With git, the storage place is actually a .git directory.
The second, and arguably most important, thing to understand is commits are immutable. Remember in the dumb system we said we must not ever touch the v1, v2 etc. copies? Git enforces this. There is no command in git that can modify, overwrite or delete any commit that has been made.1
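One way to see this for yourself is to “edit” a commit and check what happened to the original. The sketch below uses a throwaway repository with made-up file names and identity settings. It rewords a commit with git commit --amend, which produces a new commit with a new hash and leaves the original one in storage.

```shell
set -eu
repo=$(mktemp -d)   # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Example"              # hypothetical identity
git config user.email "example@example.com"

echo hello > file1
git add -A
git commit -q -m "Original message"
old=$(git rev-parse HEAD)

# "Editing" the commit actually creates a replacement commit.
git commit -q --amend -m "Reworded message"
new=$(git rev-parse HEAD)

echo "old: $old"
echo "new: $new"
git cat-file -t "$old"   # the original commit object is still stored
```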
The small difference between the dumb system and git is what version numbers look like. In the dumb system we used a linear sequence of numbers. But this falls apart as soon as we have a second person working on a project. Essentially, my v2 and your v2 are different versions and if we ever hope to merge these together the system needs to be able to store them and refer to them at the same time.
There are many solutions to this problem, but git’s solution is simple: it uses the hash of the entire commit as the version number. These are virtually guaranteed to be universally unique. But, since hashes are not sequential, it also stores a link to the previous version with every version to establish the lineage.
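You can look at the link to the previous version directly. As a small sketch in a throwaway repository (file names and identity are placeholders), git cat-file -p prints the raw contents of a commit, including a parent line holding the hash of the commit before it.

```shell
set -eu
repo=$(mktemp -d)   # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Example"              # hypothetical identity
git config user.email "example@example.com"

echo one > file1
git add -A
git commit -q -m "First"
first=$(git rev-parse HEAD)

echo two >> file1
git add -A
git commit -q -m "Second"

# Print the stored commit: a tree, a parent, author details and the message.
git cat-file -p HEAD
echo "first commit was: $first"
```

The hash on the parent line matches the hash of the first commit, and the first commit itself has no parent line at all.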
Doing dumb things with git
So how do we actually use git? Let’s compare and contrast the dumb version control system with git. Note the dumb VCS commands are supposed to be illustrative and almost certainly don’t work in all cases (like with hidden files/dirs). Also note, when there are multiple commands they are to be taken together as atomic operations; I’m not saying the individual commands are analogous to each other.
Making a commit
To make a new commit in the dumb system we copy the working copy into the .vcs directory:
mkdir .vcs/v6
cp -r * .vcs/v6
Note we have to somehow know that v6 is the next version number.
In git we do:
git add -A
git commit -m "New version"
We didn’t have to know the previous version number, nor the new version number. Git instead tells us the hash of the new version after it’s done.
Checkout an old version
In the dumb system we must first wipe our working copy then copy the version we want:
rm -r *
cp -r .vcs/v1/* .
Note the symmetry between commit and checkout.
With git we need to specify a version somehow. We could use a hash, or a relative lookup like HEAD^, which means the previous commit to the one currently checked out (recall git stores a link to the previous commit with every commit):
git checkout HEAD^
Git warns us about being in a detached head state because anything you do in this state is kind of difficult to keep track of unless you’re good at remembering commit hashes.
It turns out checkout is actually a pretty rare thing to do in git, but it’s included for completeness.
Using meaningful version labels
In the dumb system the version labels are up to us. The v1 labels are already meaningful, but we could use even more meaningful labels if we wish:
mkdir .vcs/v6-test2
cp -r * .vcs/v6-test2
In git, we can’t change the hashes, but we can add as many additional labels to a commit as we like. There are two types of labels in git: branches and tags.
To create a new branch new-branch that labels a commit 124b7c6:
git branch new-branch 124b7c6
To create a tag new-tag that labels the same commit:
git tag -am "New tag" new-tag 124b7c6
Note that in both cases we have only added labels to existing commits. Nothing else has changed.
We can use our meaningful names instead of hashes, for example to create another tag for the very same commit:
git tag -am "Another tag" another-tag new-branch
The difference between branches and tags is that branches are mutable while tags are immutable. If you make a commit, git updates your current branch (if there is one) to point to the new commit. Tags, on the other hand, will forever point to the same commit.
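Here is a small sketch of that difference in a throwaway repository. The branch name main, the tag name v1 and the file are placeholders chosen for the demo. After a second commit the branch has moved on, while the tag still labels the first commit.

```shell
set -eu
repo=$(mktemp -d)   # throwaway repository for the demo
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main       # pin the branch name for the demo
git config user.name "Example"              # hypothetical identity
git config user.email "example@example.com"

echo one > file1
git add -A
git commit -q -m "First"
first=$(git rev-parse HEAD)
git tag -am "Version 1" v1                  # label the first commit

echo two >> file1
git add -A
git commit -q -m "Second"

echo "first commit: $first"
echo "main is now:  $(git rev-parse main)"           # moved to the new commit
echo "v1 is still:  $(git rev-parse 'v1^{commit}')"  # unchanged
```

The ^{commit} suffix is needed because v1 is an annotated tag, which is stored as its own object that points at the commit.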
What is the current version/branch?
In the dumb system you just store the current version in your head. Since we were using sequential numbers you could know by inspecting the .vcs directory and seeing the largest number is v6. This is how you would know the next version is to be v7.
Git stores the current version/branch in its head. Quite literally, in a file called HEAD. You can check this in any git repository by running cat .git/HEAD. You would probably see something like ref: refs/heads/master.
This is how git “knows” what the previous version is when you make a commit. It’s also how it knows which branch to update when you make a commit.
You can use HEAD as a label in its own right as we saw above when we checked out HEAD^ (the ^ is a relative lookup and means the parent of HEAD in this case).
A detached head state happens when you checkout a commit directly using its hash. If you were to look at .git/HEAD in this state you would see an entire commit hash instead of a ref. If you make commits in this state there is no branch to update so these commits can only be found using their hash. Git warns you when you enter a detached head state and again when you leave it with commits that no branch points to. If in doubt, create a branch like it tells you to do!
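To make this concrete, the following sketch prints .git/HEAD in three situations: on a branch, in a detached head state, and after creating a branch to get out of it. It assumes a throwaway repository using git’s default on-disk ref storage, and the branch names main and rescue are made up.

```shell
set -eu
repo=$(mktemp -d)   # throwaway repository for the demo
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main       # pin the branch name for the demo
git config user.name "Example"              # hypothetical identity
git config user.email "example@example.com"

echo one > file1; git add -A; git commit -q -m "First"
echo two >> file1; git add -A; git commit -q -m "Second"

on_branch=$(cat .git/HEAD)
echo "on a branch: $on_branch"     # ref: refs/heads/main

git checkout -q --detach 'HEAD^'   # check out the previous commit directly
detached=$(cat .git/HEAD)
echo "detached:    $detached"      # a full commit hash, no ref

git checkout -q -b rescue          # give the current commit a branch
after=$(cat .git/HEAD)
echo "rescued:     $after"         # ref: refs/heads/rescue
```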
Syncing with a remote
With the dumb system, syncing to a remote can be done using any sync tool, like rsync:
rsync -a .vcs my-server:my-project
This copies just the .vcs directory, so everything we have committed so far.
Git is much more clever in this regard as it tries to minimise the amount of data it sends and manages your remotes itself, but you can do something similar like this:
git remote add my-remote my-server
git push my-remote --follow-tags '*:*'
This pushes all commits as well as all branches and all tags.
Note that in neither case is your working directory transferred. Only things you have already committed.
Differences between versions
In the dumb system, we can use the standard diff tool to see the differences between two versions:
diff -ur .vcs/v2 .vcs/v3
Git has a much more powerful and specialised diff tool built in and there are many different ways to invoke it, but to compare two versions, say a1bf365 and main, it looks almost the same:
git diff a1bf365 main
Beyond dumb version control
So why use git at all then? So far we’ve seen it can all be done using simple tools and some discipline. Let’s look at what git can do beyond the dumb system.
Composing commits
You might have noticed git required two commands to make a commit. One of them is called commit, which makes sense, but what is add? Well, unlike the dumb version control system, git lets us choose what to add to the next commit. Imagine you made two unrelated changes, one in file1 and another in file2. To make your next version contain only the change in file1:
git add file1
git commit -m "Changes to file1"
You can go even further and break down files line by line using git add -p, but I find this is something much easier to achieve with a graphical git client.
This makes it much easier to produce atomic commits rather than one big commit with a bunch of unrelated changes at the end of the day.
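The sketch below plays out the file1/file2 scenario in a throwaway repository (the identity settings and file contents are placeholders). Both files are changed, but only file1 is added, so the new commit contains only that change and file2 is left modified in the working directory.

```shell
set -eu
repo=$(mktemp -d)   # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Example"              # hypothetical identity
git config user.email "example@example.com"

echo a > file1
echo b > file2
git add -A
git commit -q -m "Initial"

# Two unrelated changes in the working directory.
echo "change one" >> file1
echo "change two" >> file2

# Only file1 goes into the next commit.
git add file1
git commit -q -m "Changes to file1"

git diff-tree --no-commit-id --name-only -r HEAD   # files in the new commit
git status --short                                 # file2 is still uncommitted
```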
Tracking branches
When you add a remote and fetch from it, git downloads everything (all commits, branches and tags) from that remote and keeps a copy of it all locally. The branches end up in your local clone as locally immutable branches called remote-tracking branches.
They are locally immutable in the sense that they can only be updated to reflect the state of the remote when syncing with the remote. You can’t update these branches any other way. The branch names will be prefixed with the remote name, like my-remote/my-branch, and can be safely updated at any time by running git fetch.
Git allows you to set any other branch as the upstream of a branch. The meaning of upstream is usually “the branch I eventually want my changes merged into”. You could set my-remote/my-branch as the upstream of your current branch like so:
git branch -u my-remote/my-branch
When you check the status of your local branch git can now tell you useful information like “Your branch is ahead of ‘my-remote/my-branch’ by 1 commit.” If you periodically sync with the remote using git fetch you can see how far behind the upstream branch you are getting.
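Here is a rough sketch of upstream tracking that runs entirely on one machine. A second local repository stands in for the remote, and the names upstream, work, my-remote and main are all placeholders for the demo. After one local commit, git reports the branch as one commit ahead of its upstream.

```shell
set -eu
base=$(mktemp -d)   # throwaway directory holding both repositories
cd "$base"

# A local repository standing in for the remote server.
git init -q upstream
cd upstream
git symbolic-ref HEAD refs/heads/main       # pin the branch name for the demo
git config user.name "Example"              # hypothetical identity
git config user.email "example@example.com"
echo one > file1; git add -A; git commit -q -m "First"
cd ..

# Clone it, naming the remote my-remote instead of the default origin.
git clone -q -o my-remote upstream work
cd work
git config user.name "Example"
git config user.email "example@example.com"

git branch -u my-remote/main                # clone already set this; shown for clarity
echo two >> file1; git add -A; git commit -q -m "Local work"

git status -sb | head -n 1                  # short status including the ahead count
ahead=$(git rev-list --count my-remote/main..HEAD)
echo "ahead by $ahead commit(s)"
```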
Merging
Both of our systems allow branching, but branching isn’t very useful without merging. In the dumb version control system merging is a laborious process of combing through both versions and creating a combined version.
With git you can create such a “combined” version with one command:
git merge another-branch
This automatically calculates all the changes on another-branch that don’t exist on your current branch and applies them, creating a new merge commit. Sometimes there are conflicts, like if both you and someone else changed the same line in different ways. Git can’t resolve these conflicts automatically, so it presents them to you to resolve before completing the merge.
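As a conflict-free illustration, the sketch below builds two branches that each add a different file and then merges them. Everything here (the repository, the files, the branch name main) is made up for the demo. The resulting merge commit has two parents, one for each branch.

```shell
set -eu
repo=$(mktemp -d)   # throwaway repository for the demo
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main       # pin the branch name for the demo
git config user.name "Example"              # hypothetical identity
git config user.email "example@example.com"

echo base > file1; git add -A; git commit -q -m "Base"

# Work on another branch...
git checkout -q -b another-branch
echo theirs > file2; git add -A; git commit -q -m "Add file2"

# ...while the current branch moves on too.
git checkout -q main
echo ours > file3; git add -A; git commit -q -m "Add file3"

# Combine them; --no-edit accepts the default merge message.
git merge -q --no-edit another-branch
git log --oneline --graph
```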
Rebasing
Often when working on a feature for a while you will find your local branch and your upstream branch will diverge due to other changes happening upstream. If you set your upstream as above, git will say something like “Your branch and ‘my-remote/my-branch’ have diverged, and have 8 and 1 different commits each, respectively.”
This means you’ve got 8 commits locally that haven’t been merged and the upstream has 1 commit that you haven’t yet seen. Over time the upstream will get more commits and the longer this happens, the higher the chances of difficult merge conflicts happening later (remember, the only point of a branch is to be able to merge it).
You can keep on top of this by “rebasing” your local branch onto the upstream like this:
git rebase
What git does is take those 8 commits on your branch and, one by one, re-apply the changes on top of the upstream. This can cause conflicts, but the hope is that if you rebase frequently the conflicts are smaller and the changes you are applying are still fresh in your head. By keeping on top of this you’ll never diverge too far from upstream and be stuck with a difficult merge before you can finish your work.
Rebasing also allows you to edit the commits as they are being re-applied. This is very powerful and is one way you can “clean up” a local working branch ready for it to be reviewed and merged.
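The following sketch shows a rebase on a small scale. To keep it self-contained a local branch called main plays the role of the upstream, and the branch feature and all file names are placeholders. After the rebase the feature commit sits directly on top of main, and it has a new hash because it is a re-applied copy. The original commit is still stored.

```shell
set -eu
repo=$(mktemp -d)   # throwaway repository for the demo
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main       # pin the branch name for the demo
git config user.name "Example"              # hypothetical identity
git config user.email "example@example.com"

echo base > file1; git add -A; git commit -q -m "Base"

git checkout -q -b feature
echo mine > file2; git add -A; git commit -q -m "Feature work"
before=$(git rev-parse feature)

# Meanwhile the "upstream" (main, in this sketch) gains a commit.
git checkout -q main
echo theirs > file3; git add -A; git commit -q -m "Upstream change"

# Re-apply the feature commit on top of the new main.
git checkout -q feature
git rebase -q main
after=$(git rev-parse feature)

echo "before: $before"
echo "after:  $after"
git log --oneline --graph
```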
Resetting
Reset is one of the scarier git commands and that is somewhat justified given that it has the --hard option. This is one of the few commands that can actually overwrite your work. But remember, no command in git can change, delete or overwrite commits so, when in doubt, commit your work!
Resetting tells git to point your current branch at a different commit. Normally branches are only updated when you make new commits, as mentioned above. But there are a few reasons why it’s useful to point a branch at some other commit.
One reason to reset is simply to undo any changes in your working directory; this uses the scary --hard option to intentionally overwrite your working directory.
Another is to re-commit some changes using a different set of commits. Perhaps you made a chain of “work in progress” commits and want to rewrite it as one final commit. You can --soft reset to the commit before the first WIP commit then commit your changes again. This can also be achieved with a rebase but sometimes the reset is easier.
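A minimal sketch of the WIP clean-up, using a throwaway repository with made-up commit messages: three “work in progress” commits are replaced by a single commit. The --soft reset moves the branch back but leaves the files and the staged changes alone, so the next commit picks everything up in one go.

```shell
set -eu
repo=$(mktemp -d)   # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Example"              # hypothetical identity
git config user.email "example@example.com"

echo base > file1; git add -A; git commit -q -m "Base"
base=$(git rev-parse HEAD)

# A chain of work-in-progress commits.
for n in 1 2 3; do
  echo "step $n" >> file1
  git add -A
  git commit -q -m "WIP $n"
done

# Point the branch back at the base commit, keeping all changes staged.
git reset -q --soft "$base"
git commit -q -m "Finished feature"

git log --oneline
```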
One more reason is if you have a branching model like git’s own git repository which has a next branch for “pre-release” features. This branch is reset to the top of master after each release. Complicated branching structures like this aren’t recommended if you don’t need them, but git gives you the option.
Finally, resetting is how you make use of the reflog…
The reflog
What happens to the “old” commits following a rebase or a reset? I’ve already mentioned, and it’s worth mentioning again, that no command in git can delete commits. However, unless you somehow remember their commit hashes, commits are no longer practically reachable without some kind of reference (i.e. a branch or tag).
That’s where the reflog comes in. Since branches are mutable, git keeps a log of all changes to a branch including commits, rebases and resets. If you want to “undo” a rebase or a reset, the reflog is where you need to look. Following a rebase or reset, the reflog might be the only way to find some commits.
You can view the reflog by running git reflog. With no arguments this shows the log for HEAD; run git reflog show my-branch to see the log for a particular branch.
The reflog is pruned automatically. By default entries expire after 90 days, or after 30 days if the commit they record is no longer reachable from the branch’s current tip, which is exactly the case for commits left behind by a rebase or reset. Once nothing refers to a commit any more, the commit itself will eventually be deleted. This is to prevent git repos growing indefinitely. So, yes, I have been lying when I said commits can never be deleted, but by default there is a time delay of at least 30 days before they will be. For this reason you shouldn’t be regularly using the reflog to find important commits; always make sure important stuff is referenced by tags or branches.
The reflog is your safety rope and I thoroughly recommend exercising your safety rope until you are confident in how git works. Do a stupid rebase and undo it using the reflog:
git rebase some-silly-place
git reset --hard @{1} # --hard also puts the working directory back
The way to read the second command is “reset my current branch to where my current branch was one operation ago”.
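If you want to practise on something smaller than a rebase, here is a sketch that “loses” a commit with a hard reset and then gets it back from the reflog. It runs in a throwaway repository with placeholder names. It uses @{1}, which git resolves against the reflog of the current branch.

```shell
set -eu
repo=$(mktemp -d)   # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Example"              # hypothetical identity
git config user.email "example@example.com"

echo one > file1; git add -A; git commit -q -m "First"
echo two >> file1; git add -A; git commit -q -m "Second"
second=$(git rev-parse HEAD)

# "Lose" the second commit by pointing the branch at its parent.
git reset -q --hard 'HEAD^'
git reflog                       # the reset and both commits are listed

# Undo: move the branch back to where it was one operation ago.
git reset -q --hard '@{1}'
git log --oneline
```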
The reflog is much less help if you’re in a detached head state, though, because there’s no branch whose log records your commits; only HEAD’s own log does, and that fills up with every checkout you do. This is why git warns you about it and gives you every opportunity to record the hashes of any commits you make. Just heed the warnings and be careful in a detached head state.
Bisecting
In the dumb version control system you’d probably start deleting old versions at some point as your disk fills up. Git stores all the copies much more efficiently and people tend to keep git histories forever. But why do we bother keeping all those old versions? The answer is often a question: why not? But there is a real answer: we keep them to track down potential regressions.
In any long standing project there will eventually be unintended breakage. A user may report a feature that was working in version 23 is broken in version 24. There could be hundreds of commits between those versions, but one of them introduced the regression and finding it can significantly cut down on debugging time.
Git bisect can efficiently and (semi-)automatically find the commit that first broke the feature. It looks something like this:
git bisect start
git bisect bad v24 # the bad version
git bisect good v23 # the good version
Now git will repeatedly checkout commits and let you test them. You can either test them manually somehow and tell git they are good or bad with git bisect good or git bisect bad, or you can run a script to do it completely automatically with git bisect run. It’s so cool you’ll be wishing for the next opportunity to use it.
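To try the fully automatic version without waiting for a real regression, here is a sketch that manufactures one. It creates eight commits in a throwaway repository and, as a stand-in for a bug, adds a file called broken in the fifth. All of the names and the “test” itself are invented for the demo. The test command exits with 0 for a good commit and 1 for a bad one, which is what git bisect run expects.

```shell
set -eu
repo=$(mktemp -d)   # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name "Example"              # hypothetical identity
git config user.email "example@example.com"

# Eight commits; the fifth introduces the "bug" (a file named broken).
i=1
while [ "$i" -le 8 ]; do
  echo "change $i" >> file1
  if [ "$i" -eq 5 ]; then touch broken; fi
  git add -A
  git commit -q -m "Commit $i"
  case $i in
    1) good=$(git rev-parse HEAD) ;;
    5) culprit=$(git rev-parse HEAD) ;;
  esac
  i=$((i + 1))
done

git bisect start >/dev/null
git bisect bad HEAD >/dev/null
git bisect good "$good" >/dev/null
# The test passes (exit 0) only while the broken file does not exist.
git bisect run sh -c 'test ! -e broken' >/dev/null

found=$(git rev-parse refs/bisect/bad)      # bisect's record of the first bad commit
git bisect reset >/dev/null 2>&1
git log -1 --format='first bad commit: %h %s' "$found"
```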
Conclusion
Version control can be difficult. Some of that difficulty is naturally inherited by git. Git adds to the difficulty with a somewhat cumbersome UI. But I do believe most of the difficulties stem from misconceptions and not starting with a basic idea of what version control is.
I’m amazed by how many people, even experienced developers and git users, think git stores diffs and does something more clever than our dumb version control system to make and checkout commits.2 This is a bad start when it comes to understanding git.
In my career I’ve always found myself being the “git guy”. I don’t know why this is. This article is my attempt to teach git in a slightly different way, starting at a lower level with no preconceptions of what version control is, which is, I think, how I learnt it. Whether this is a useful way to learn or not remains to be seen. I’d love to hear feedback either way!
1. Of course, this is only true if you operate within the confines of git. Git can’t help you if you rm -rf your entire repo or something. There is also garbage collection, but this can be safely ignored in normal usage and even disabled if you really wish.
2. OK, it does do something a lot more clever than cp -r internally but, as a user, you do not need to know or worry about that. The details are fascinating if you are interested, though.