I was surprised that people were so unconfident about their understanding –
I’d been thinking of HEAD
as a pretty straightforward topic.
Usually when people say that a topic is confusing when I think it’s not, the
reason is that there’s actually some hidden complexity that I wasn’t
considering. And after some follow up conversations, it turned out that HEAD
actually was a bit more complicated than I’d appreciated!
Here’s a quick table of contents:
After talking to a bunch of different people about HEAD
, I realized that
HEAD
actually has a few different closely related meanings:
- the file .git/HEAD
- HEAD
as in git show HEAD
(git calls this a “revision parameter”)
- HEAD
in the output of various commands (<<<<<<< HEAD
, (HEAD -> main)
, detached HEAD state
, On branch main
, etc)

These are extremely closely related to each other, but I don’t think the relationship is totally obvious to folks who are starting out with git.
.git/HEAD
Git has a very important file called .git/HEAD
. The way this file works is that it contains either:

- the name of a branch (like ref: refs/heads/main
)
- a commit ID (like 96fa6899ea34697257e84865fefc56beb42d6390
)

This file is what determines what your “current branch” is in Git. For example, when you run git status
and see this:
$ git status
On branch main
it means that the file .git/HEAD
contains ref: refs/heads/main
.
If .git/HEAD
contains a commit ID instead of a branch, git calls that
“detached HEAD state”. We’ll get to that later.
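You can see both cases for yourself by peeking at the file. Here’s a quick sketch in a throwaway repository, assuming a reasonably recent git (the -b main flag needs git 2.28+); the commit message is just an example:

```shell
# Throwaway repo so we don't touch any real work
cd "$(mktemp -d)"
git init -q -b main
git -c user.name=me -c user.email=me@example.com commit -q --allow-empty -m "first commit"

cat .git/HEAD        # prints: ref: refs/heads/main
git checkout -q --detach
cat .git/HEAD        # now prints a raw commit ID (detached HEAD state)
```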
(People will sometimes say that HEAD contains a name of a reference or a
commit ID, but I’m pretty sure that the reference has to be a branch.
You can technically make .git/HEAD
contain the name of a reference that
isn’t a branch by manually editing .git/HEAD
, but I don’t think you can do it
with a regular git command. I’d be interested to know if there is a
regular-git-command way to make .git/HEAD a non-branch reference though, and if
so why you might want to do that!)
HEAD
as in git show HEAD
It’s very common to use HEAD
in git commands to refer to a commit ID, like:
git diff HEAD
git rebase -i HEAD^^^
git diff main..HEAD
git reset --hard HEAD@{2}
All of these things (HEAD
, HEAD^^^
, HEAD@{2}
) are called “revision parameters”. They’re documented in man
gitrevisions, and Git will try to
resolve them to a commit ID.
(I’ve honestly never actually heard the term “revision parameter” before, but that’s the term that’ll get you to the documentation for this concept)
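One way to see how git resolves these: git rev-parse takes any revision parameter and prints the commit ID it resolves to. A sketch in a throwaway repo (the commit messages are made up):

```shell
cd "$(mktemp -d)"
git init -q -b main
export GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me@example.com
export GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me@example.com
git commit -q --allow-empty -m "one"
git commit -q --allow-empty -m "two"

git rev-parse HEAD    # prints the commit ID of "two"
git rev-parse HEAD^   # prints the commit ID of "one"
git rev-parse main    # same commit ID as HEAD, since main is checked out
```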
HEAD in git show HEAD
has a pretty simple meaning: it resolves to the
current commit you have checked out! Git resolves HEAD
in one of two ways:

- if .git/HEAD
contains a branch name, it’ll be the latest commit on that branch (for example by reading it from .git/refs/heads/main
)
- if .git/HEAD
contains a commit ID, it’ll be that commit ID

Now we’ve talked about the file .git/HEAD
, and the “revision parameter”
HEAD
, like in git show HEAD
. We’re left with all of the various ways git
uses HEAD
in its output.
git status
: “on branch main” or “HEAD detached”

When you run git status
, the first line will always look like one of these two:

- On branch main
. This means that .git/HEAD
contains a branch.
- HEAD detached at 90c81c72
. This means that .git/HEAD
contains a commit ID.

I promised earlier I’d explain what “HEAD detached” means, so let’s do that now.
“HEAD is detached” or “detached HEAD state” mean that you have no current branch.
Having no current branch is a little dangerous because if you make new commits, those commits won’t be attached to any branch – they’ll be orphaned! Orphaned commits are a problem for 2 reasons:

- they’re easy to lose track of (you can’t run git log somebranch
to find them)
- commits that aren’t reachable from any branch or tag will eventually be garbage collected

Personally I’m very careful about avoiding creating commits in detached HEAD state, though some people prefer to work that way. Getting out of detached HEAD state is pretty easy though, you can either:
- check out an existing branch (git checkout main
)
- create a new branch where you are and check it out (git checkout -b newbranch
)
- if you got into detached HEAD state in the middle of a rebase, abort it (git rebase --abort
)

Okay, back to other git commands which have HEAD
in their output!
git log
: (HEAD -> main)
When you run git log
and look at the first line, you might see one of the following 3 things:
commit 96fa6899ea (HEAD -> main)
commit 96fa6899ea (HEAD, main)
commit 96fa6899ea (HEAD)
It’s not totally obvious how to interpret these, so here’s the deal:

- in the (...)
, git lists every reference that points at that commit, for example (HEAD -> main, origin/main, origin/HEAD)
means HEAD
, main
, origin/main
, and origin/HEAD
all point at that commit (either directly or indirectly)
- HEAD -> main
means that your current branch is main
- if you see HEAD,
instead of HEAD ->
, it means you’re in detached HEAD state (you have no current branch)

If we use these rules to explain the 3 examples above, the result is:
commit 96fa6899ea (HEAD -> main)
means:

- .git/HEAD
contains ref: refs/heads/main
- .git/refs/heads/main
contains 96fa6899ea

commit 96fa6899ea (HEAD, main)
means:

- .git/HEAD
contains 96fa6899ea
(HEAD is “detached”)
- .git/refs/heads/main
also contains 96fa6899ea

commit 96fa6899ea (HEAD)
means:

- .git/HEAD
contains 96fa6899ea
(HEAD is “detached”)
- .git/refs/heads/main
either contains a different commit ID or doesn’t exist

<<<<<<< HEAD is just confusing

When you’re resolving a merge conflict, you might see something like this:
<<<<<<< HEAD
def parse(input):
return input.split("\n")
=======
def parse(text):
return text.split("\n\n")
>>>>>>> somebranch
I find HEAD
in this context extremely confusing and I basically just ignore it. Here’s why.

- In a merge conflict, HEAD
is the same as what HEAD
was when you ran git merge
. Simple.
- In a rebase conflict, HEAD
is something totally
different: it’s the other commit that you’re rebasing on top of. So it’s
totally different from what HEAD
was when you ran git rebase
. It’s like
this because rebase works by first checking out the other commit and then
repeatedly cherry-picking commits on top of it.

Similarly, the meaning of “ours” and “theirs” are flipped in a merge and a rebase.
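Here’s a sketch of the merge case in a throwaway repo (the filenames, contents, and branch names are made up): the HEAD side of the conflict markers is the branch you were on when you ran git merge.

```shell
cd "$(mktemp -d)"
git init -q -b main
export GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me@example.com
export GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me@example.com

echo 'original' > parse.py
git add parse.py && git commit -qm "base"
git checkout -qb somebranch
echo 'their change' > parse.py && git commit -qam "theirs"
git checkout -q main
echo 'our change' > parse.py && git commit -qam "ours"

git merge somebranch >/dev/null 2>&1 || true   # conflict!
cat parse.py
# <<<<<<< HEAD
# our change       <- HEAD = main, the branch we ran `git merge` from
# =======
# their change
# >>>>>>> somebranch
```

In a rebase the two sides would be flipped, which is exactly the confusing behaviour described above.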
The fact that the meaning of HEAD
changes depending on whether I’m doing a
rebase or merge is really just too confusing for me and I find it much simpler
to just ignore HEAD
entirely and use another method to figure out which part
of the code is which.
I think HEAD would be more intuitive if git’s terminology around HEAD were a little more internally consistent.
For example, git talks about “detached HEAD state”, but never about “attached
HEAD state” – git’s documentation never uses the term “attached” at all to
refer to HEAD
. And git talks about being “on” a branch, but never “not on” a
branch.
So it’s very hard to guess that on branch main
is actually the opposite of
HEAD detached
. How is the user supposed to guess that HEAD detached
has
anything to do with branches at all, or that “on branch main” has anything to
do with HEAD
?
If I think of other ways HEAD
is used in Git (especially ways HEAD appears in
Git’s output), I might add them to this post later.
If you find HEAD confusing, I hope this helps a bit!
So I asked about people’s favourite git config options on Mastodon:
what are your favourite git config options to set? Right now I only really have
git config push.autosetupremote true
and git config init.defaultBranch main
set in my ~/.gitconfig
, curious about what other people set
As usual I got a TON of great answers and learned about a bunch of very popular git config options that I’d never heard of.
I’m going to list the options, starting with (very roughly) the most popular ones. Here’s a table of contents:
All of the options are documented in man git-config
, or this page.
pull.ff only
or pull.rebase true
These two were the most popular. These both have similar goals: to avoid accidentally creating a merge commit
when you run git pull
on a branch where the upstream branch has diverged.
- pull.rebase true
is the equivalent of running git pull --rebase
every time you git pull
- pull.ff only
is the equivalent of running git pull --ff-only
every time you git pull
I’m pretty sure it doesn’t make sense to set both of them at once, since --ff-only
overrides --rebase
.
Personally I don’t use either of these since I prefer to decide how to handle
that situation every time, and now git’s default behaviour when your branch has
diverged from the upstream is to just throw an error and ask you what to do
(very similar to what git pull --ff-only
does).
merge.conflictstyle zdiff3
Next: making merge conflicts more readable! merge.conflictstyle zdiff3
and merge.conflictstyle diff3
were both super popular (“totally indispensable”).
The consensus seemed to be “diff3 is great, and zdiff3 (which is newer) is even better!”.
So what’s the deal with diff3
? Well, by default in git, merge conflicts look like this:
<<<<<<< HEAD
def parse(input):
return input.split("\n")
=======
def parse(text):
return text.split("\n\n")
>>>>>>> somebranch
I’m supposed to decide whether input.split("\n")
or text.split("\n\n")
is
better. But how? What if I don’t remember whether \n
or \n\n
is right? Enter diff3!
Here’s what the same merge conflict look like with merge.conflictstyle diff3
set:
<<<<<<< HEAD
def parse(input):
return input.split("\n")
||||||| b9447fc
def parse(input):
return input.split("\n\n")
=======
def parse(text):
return text.split("\n\n")
>>>>>>> somebranch
This has extra information: now the original version of the code is in the middle! So we can see that:

- one side changed \n\n
to \n
- the other side renamed input
to text

So presumably the correct merge conflict resolution is return
text.split("\n")
, since that combines the changes from both sides.
I haven’t used zdiff3, but a lot of people seem to think it’s better. The blog post Better Git Conflicts with zdiff3 talks more about it.
rebase.autosquash true
Autosquash was also a new feature to me. The goal is to make it easier to modify old commits.
Here’s how it works:

- you commit some work, with a message like add parsing code
- later, when you want to modify that commit, you run git commit --fixup OLD_COMMIT_ID
, which gives the new commit the commit message fixup! add parsing code
- when you run git rebase --autosquash main
, it will automatically combine all the fixup!
commits with their targets

rebase.autosquash true
means that --autosquash
always gets passed automatically to git rebase
.
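Here’s a sketch of the whole flow in a throwaway repo (the commit messages and filenames are made up). Setting GIT_SEQUENCE_EDITOR=true just accepts the rebase todo list as-is so the example runs non-interactively:

```shell
cd "$(mktemp -d)"
git init -q -b main
export GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me@example.com
export GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me@example.com

echo one > parse.py && git add parse.py && git commit -qm "initial"
echo two > parse.py && git commit -qam "add parsing code"
echo hi > other.py  && git add other.py && git commit -qm "other work"

# oops, "add parsing code" (currently HEAD~1) needs a fix:
echo three > parse.py
git commit -qa --fixup HEAD~1     # message: "fixup! add parsing code"

# autosquash moves the fixup next to its target and squashes it in
GIT_SEQUENCE_EDITOR=true git rebase -i --autosquash HEAD~3 >/dev/null 2>&1
git log --format=%s               # the fixup! commit is gone
```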
rebase.autostash true
This automatically runs git stash
before a git rebase and git stash pop
after. It basically passes --autostash
to git rebase
.
Personally I’m a little scared of this since it potentially can result in merge conflicts after the rebase, but I guess that doesn’t come up very often for people since it seems like a really popular configuration option.
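Here’s a sketch of what autostash does, in a throwaway repo with made-up filenames: with it set, a rebase with a dirty working tree goes through, and the uncommitted edit comes back afterwards.

```shell
cd "$(mktemp -d)"
git init -q -b main
export GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me@example.com
export GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me@example.com

echo a > a.txt && git add a.txt && git commit -qm "base"
git checkout -qb feature
echo b > b.txt && git add b.txt && git commit -qm "feature work"
git checkout -q main
echo c > c.txt && git add c.txt && git commit -qm "main moved on"
git checkout -q feature

echo "uncommitted edit" >> a.txt              # dirty working tree
git -c rebase.autostash=true rebase -q main   # stash, rebase, pop
grep "uncommitted edit" a.txt                 # the edit survived the rebase
```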
push.default simple
, push.default current
, push.autoSetupRemote true
These push
options tell git push
to automatically push the current branch to a remote branch with the same name.
- push.default simple
is the default in Git. It only works if your branch is already tracking a remote branch
- push.default current
is similar, but it’ll always push the local branch to a remote branch with the same name.
- push.autoSetupRemote true
is a little different – this one makes it so when you first push a branch, it’ll automatically set up tracking for it

I think I prefer push.autoSetupRemote true
to push.default current
because
push.autoSetupRemote true
also lets you pull from the matching remote
branch (though you do need to push first to set up tracking). push.default
current
only lets you push.
I believe the only thing to be careful of with push.autoSetupRemote true
and
push.default current
is that you need to be confident that you’re never going
to accidentally make a local branch with the same name as an unrelated remote
branch. Lots of people have branch naming conventions (like julia/my-change
)
that make this kind of conflict very unlikely, or just have few enough
collaborators that branch name conflicts probably won’t happen.
init.defaultBranch main
Create a main
branch instead of a master
branch when creating a new repo.
commit.verbose true
This adds the whole commit diff in the text editor where you’re writing your commit message, to help you remember what you were doing.
rerere.enabled true
This enables rerere (“reuse recorded resolution”), which remembers how you resolved merge conflicts
during a git rebase
and automatically resolves conflicts for you when it can.
help.autocorrect 10
By default git’s autocorrect tries to check for typos (like git ocmmit
), but won’t actually run the corrected command.
If you want it to run the suggestion automatically, you can set
help.autocorrect
to 1
(run after 0.1 seconds), 10
(run after 1 second), immediate
(run
immediately), or prompt
(run after prompting).
core.pager delta
The “pager” is what git uses to display the output of git diff
, git log
, git show
, etc. People set it to:

- delta
(a fancy diff viewing tool with syntax highlighting)
- less -x5,9
(sets tabstops, which I guess helps if you have a lot of files with tabs in them?)
- less -F -X
(not sure about this one, -F
seems to disable the pager if everything fits on one screen, but my git seems to do that already anyway)
- cat
(to disable paging altogether)

I used to use delta
but turned it off because somehow I messed up the colour
scheme in my terminal and couldn’t figure out how to fix it. I think it’s a
great tool though.
I believe delta also suggests that you set up interactive.diffFilter delta --color-only
to syntax highlight code when you run git add -p
.
diff.algorithm histogram
Git’s default diff algorithm often handles functions being reordered badly. For example look at this diff:
-.header {
+.footer {
margin: 0;
}
-.footer {
+.header {
margin: 0;
+ color: green;
}
I find it pretty confusing. But with diff.algorithm histogram
, the diff looks like this instead, which I find much clearer:
-.header {
- margin: 0;
-}
-
.footer {
margin: 0;
}
+.header {
+ margin: 0;
+ color: green;
+}
Some folks also use patience
, but histogram
seems to be more popular. When to Use Each of the Git Diff Algorithms has more on this.
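You don’t have to commit to the config option to try this out: git diff and git log accept the algorithm per-invocation, which is handy for comparing the output side by side.

```shell
git diff --histogram                  # same as --diff-algorithm=histogram
git log -p --diff-algorithm=histogram
git diff --diff-algorithm=patience    # the other algorithm people mentioned
```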
core.excludesFile
: a global .gitignore

core.excludesFile = ~/.gitignore
lets you set a global gitignore file that
applies to all repositories, for things like .idea
or .DS_Store
that you
never want to commit to any repo. It defaults to ~/.config/git/ignore
.
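Here’s a sketch showing the default path in action, with HOME sandboxed so your real config isn’t touched and .DS_Store as the example:

```shell
export HOME="$(mktemp -d)"    # sandbox: don't touch the real ~/.config
unset XDG_CONFIG_HOME         # make sure git looks at $HOME/.config
mkdir -p "$HOME/.config/git"
echo ".DS_Store" > "$HOME/.config/git/ignore"   # the default global ignore file

cd "$(mktemp -d)"
git init -q
touch .DS_Store
git check-ignore .DS_Store    # prints ".DS_Store": it's ignored
git status --porcelain        # empty: nothing untracked
```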
includeIf
: separate git configs for personal and work

Lots of people said they use this to configure different email addresses for personal and work repositories. You can set it up something like this:
[includeIf "gitdir:~/code/<work>/"]
path = "~/code/<work>/.gitconfig"
url."git@github.com:".insteadOf 'https://github.com/'
I often accidentally clone the HTTP version of a repository instead of the
SSH version and then have to manually go into the repository’s .git/config
and edit the
remote URL. This seems like a nice workaround: it’ll replace
https://github.com
in remotes with git@github.com:
.
Here’s what it looks like in ~/.gitconfig
since it’s kind of a mouthful:
[url "git@github.com:"]
insteadOf = "https://github.com/"
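You can check that the rewrite is working with git ls-remote --get-url, which prints a remote’s URL after applying any insteadOf rules without contacting the remote. A sketch with HOME sandboxed (the repository name is made up):

```shell
export HOME="$(mktemp -d)"   # sandbox ~/.gitconfig
git config --global url."git@github.com:".insteadOf "https://github.com/"

cd "$(mktemp -d)"
git init -q
git remote add origin https://github.com/jvns/mysite
git ls-remote --get-url origin   # prints: git@github.com:jvns/mysite
```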
One person said they use pushInsteadOf
instead to only do the replacement for
git push
because they don’t want to have to unlock their SSH key when
pulling a public repo.
A couple of other people mentioned setting insteadOf = "gh:"
so they can git
remote add gh:jvns/mysite
to add a remote with less typing.
fsckobjects
: avoid data corruption

A couple of people mentioned this one. Someone explained it as “detect data corruption eagerly. Rarely matters but has saved my entire team a couple times”.
transfer.fsckobjects = true
fetch.fsckobjects = true
receive.fsckObjects = true
I’ve never understood anything about submodules but a couple of people said they like to set:
status.submoduleSummary true
diff.submodule log
submodule.recurse true
I won’t attempt to explain those but there’s an explanation on Mastodon by @unlambda here.
Here’s everything else that was suggested by at least 2 people:

- blame.ignoreRevsFile .git-blame-ignore-revs
lets you specify a file with commits to ignore during git blame
, so that giant renames don’t mess up your blames
- branch.sort -committerdate
makes git branch
sort by most recently used branches instead of alphabetically, to make it easier to find branches. tag.sort taggerdate
is similar for tags.
- color.ui false
: to turn off colour
- commit.cleanup scissors
: so that you can write #include
in a commit message without the #
being treated as a comment and removed
- core.autocrlf false
: on Windows, to work well with folks using Unix
- core.editor emacs
: to use emacs (or another editor) to edit commit messages
- credential.helper osxkeychain
: use the Mac keychain for managing credentials
- diff.tool difftastic
: use difftastic (or meld
or nvimdiff
) to display diffs
- diff.colorMoved default
: uses different colours to highlight lines in diffs that have been “moved”
- diff.colorMovedWS allow-indentation-change
: with diff.colorMoved
set, also ignores indentation changes
- diff.context 10
: include more context in diffs
- fetch.prune true
and fetch.prunetags
: automatically delete remote tracking branches that have been deleted
- gpg.format ssh
: allows you to sign commits with SSH keys
- log.date iso
: display dates as 2023-05-25 13:54:51
instead of Thu May 25 13:54:51 2023
- merge.keepbackup false
: to get rid of the .orig
files git creates during a merge conflict
- merge.tool meld
(or nvim
, or nvimdiff
) so that you can use git mergetool
to help resolve merge conflicts
- push.followtags true
: push new tags along with commits being pushed
- rebase.missingCommitsCheck error
: don’t allow deleting commits during a rebase
- rebase.updateRefs true
: makes it much easier to rebase multiple stacked branches at a time. Here’s a blog post about it.

I generally set git config options with git config --global NAME VALUE
, for
example git config --global diff.algorithm histogram
. I usually set all of my
options globally because it stresses me out to have different git behaviour in
different repositories.
If I want to delete an option I’ll edit ~/.gitconfig
manually, where they look like this:
[diff]
algorithm = histogram
My git config is pretty minimal, I already had:

- init.defaultBranch main
- push.autoSetupRemote true
- merge.tool meld
- diff.colorMoved default
(which actually doesn’t even work for me for some reason but I haven’t found the time to debug)

and I added these 3 after writing this blog post:
diff.algorithm histogram
branch.sort -committerdate
merge.conflictstyle zdiff3
I’d probably also set rebase.autosquash
if making carefully crafted pull
requests with multiple commits were a bigger part of my life right now.
I’ve learned to be cautious about setting new config options – it takes me a
long time to get used to the new behaviour and if I change too many things at
once I just get confused. branch.sort -committerdate
is something I was
already using anyway (through an alias), and I’m pretty sold that diff.algorithm
histogram
will make my diffs easier to read when I reorder functions.
I’m always amazed by how useful it is to just ask a lot of people what stuff they like and then list the most commonly mentioned ones, like with this list of new-ish command line tools I put together a couple of years ago. Having a list of 20 or 30 options to consider feels so much more efficient than combing through a list of all 600 or so git config options.
It was a little confusing to summarize these because git’s default options have actually changed a lot over the years, so people occasionally have options set that were important 8 years ago but today are the default. Also a couple of the experimental options people were using have been removed and replaced with a different version.
I did my best to explain things accurately as of how git works right now in 2024 but I’ve definitely made mistakes in here somewhere, especially because I don’t use most of these options myself. Let me know on Mastodon if you see a mistake and I’ll try to fix it.
I might also ask people about aliases later, there were a bunch of great ones that I left out because this was already getting long.
In this post we’re talking about a situation where a local branch (maybe called main
) and a remote branch (maybe also called
main
) have diverged.
There are two things that make this situation hard:

- it can be hard to recognize that your local main
has diverged from the remote main
(git
will often just give you an intimidating but generic error message like
! [rejected] main -> main (non-fast-forward) error: failed to push some refs to 'github.com:jvns/int-exposed'
)
- once you realize that your branch has diverged from main
, there’s
no single clear way to handle it (what you need to do depends on the
situation and your git workflow)

So let’s talk about a) how to recognize when you’re in a situation where a local branch and remote branch have diverged and b) what you can do about it! Here’s a quick table of contents:
Let’s start with what it means for 2 branches to have “diverged”.
If you have a local main
and a remote main
, there are 4 basic configurations:
1: up to date. The local and remote main
branches are in the exact same place. Something like this:
a - b - c - d
^ LOCAL
^ REMOTE
2: local is behind
Here you might want to git pull
. Something like this:
a - b - c - d - e
^ LOCAL ^ REMOTE
3: remote is behind
Here you might want to git push
. Something like this:
a - b - c - d - e
^ REMOTE ^ LOCAL
4: they’ve diverged :(
This is the situation we’re talking about in this blog post. It looks something like this:
a - b - c - d - e
\ ^ LOCAL
-- f
^ REMOTE
There’s no one recipe for resolving this (how you want to handle it depends on the situation and your git workflow!) but let’s talk about how to recognize that you’re in that situation and some options for how to resolve it.
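Here’s a sketch that manufactures situation 4 in throwaway repositories (the directory names are made up), so you can see what git says:

```shell
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me@example.com
export GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me@example.com

git init -q -b main upstream
(cd upstream && git commit -q --allow-empty -m "base")
git clone -q upstream work

# both sides move on independently:
(cd upstream && git commit -q --allow-empty -m "remote-only commit")
cd work
git commit -q --allow-empty -m "local-only commit"

git fetch -q
git status   # → "Your branch and 'origin/main' have diverged, ..."
```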
There are 3 main ways to tell that your branch has diverged.
git status
The easiest way is to run git fetch
and then git status
. You’ll get a message something like this:
$ git fetch
$ git status
On branch main
Your branch and 'origin/main' have diverged, <-- here's the relevant line!
and have 1 and 2 different commits each, respectively.
(use "git pull" to merge the remote branch into yours)
git push
When I run git push
, sometimes I get an error like this:
$ git push
To github.com:jvns/int-exposed
! [rejected] main -> main (non-fast-forward)
error: failed to push some refs to 'github.com:jvns/int-exposed'
hint: Updates were rejected because the tip of your current branch is behind
hint: its remote counterpart. Integrate the remote changes (e.g.
hint: 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.
This doesn’t always mean that my local main
and the remote main
have
diverged (it could just mean that my main
is behind), but for me it often
means that. So if that happens I might run git fetch
and git status
to
check.
git pull
If I git pull
when my branches have diverged, I get this error message:
$ git pull
hint: You have divergent branches and need to specify how to reconcile them.
hint: You can do so by running one of the following commands sometime before
hint: your next pull:
hint:
hint: git config pull.rebase false # merge
hint: git config pull.rebase true # rebase
hint: git config pull.ff only # fast-forward only
hint:
hint: You can replace "git config" with "git config --global" to set a default
hint: preference for all repositories. You can also pass --rebase, --no-rebase,
hint: or --ff-only on the command line to override the configured default per
hint: invocation.
fatal: Need to specify how to reconcile divergent branches.
This is pretty clear about the issue (“you have divergent branches”).
git pull
doesn’t always spit out this error message though when your branches have diverged: it depends on how
you configure git. The three other options I’m aware of are:

- if you’ve set git config pull.rebase false
, it’ll automatically start merging the remote main
- if you’ve set git config pull.rebase true
, it’ll automatically start rebasing onto the remote main
- if you’ve set git config pull.ff only
, it’ll exit with the error fatal: Not possible to fast-forward, aborting.
Now that we’ve talked about some ways to recognize that you’re in a situation where your local branch has diverged from the remote one, let’s talk about what you can do about it.
There’s no “best” way to resolve branches that have diverged – it really depends on your workflow for git and why the situation is happening.
I use 3 main solutions, depending on the situation:

- combine the local and remote changes by rebasing my local main
onto the remote main
. To do this, I’ll run git
pull --rebase
- overwrite the remote changes with my local ones: git push --force
- overwrite my local changes with the remote ones: git reset --hard origin/main

Here are some more details about all 3 of these solutions.
git pull --rebase
This is what I do when I want to keep both sets of changes. It rebases main
onto the remote main
branch. I mostly use this in repositories where I’m
doing all of my work on the main
branch.
You can configure git config pull.rebase true
, to do this automatically every
time, but I don’t because sometimes I actually want to use solutions 2 or 3
(overwrite my local changes with the remote, or the reverse). I’d rather be
warned “hey, these branches have diverged, how do you want to handle it?” and
decide for myself if I want to rebase or not.
git pull --no-rebase
This starts a merge between the local
and remote main
. Here you’ll need to:

- run git pull --no-rebase
. This starts a merge and (if it succeeds) opens a text editor so that you can confirm that you want to commit the merge
- resolve the merge conflicts, if there are any

I don’t have too much to say about this because I’ve never done it. I always use rebase instead. That’s a personal workflow choice though, lots of people have very legitimate reasons to avoid rebase.
git push --force
Sometimes I know that the work on the remote main
is actually useless and I
just want to overwrite it with whatever is on my local main
.
I do this pretty often on private repositories where I’m the only committer, for example I might:

- git push
some commits
- realize I want to change them, and run git commit --amend
- run git push --force
Of course, if the repository has many different committers, force-pushing in this way can cause a lot of problems. On shared repositories I’ll usually enable github branch protection so that it’s impossible to force push.
git push --force-with-lease
I’ve still never actually used git push --force-with-lease
, but I’ve seen a
lot of people recommend it as an alternative to git push --force
that makes
sure that nobody else has changed the branch since the last time you pushed or
fetched, so that you don’t accidentally blow their changes away.
Seems like a good option. I did notice that --force-with-lease
isn’t
foolproof though – for example this git commit
talks about how if you use VSCode’s autofetching feature to continuously git fetch
,
then --force-with-lease
won’t help you.
Apparently now Git also has --force-with-lease --force-if-includes
(documented here),
which I think checks the reflog to make sure that you’ve already integrated the
remote branch into your branch somehow. I still don’t totally understand this
but I found this stack overflow conversation
helpful.
git reset --hard origin/main
You can use this as the reverse of git push --force
(since there’s no git pull --force
). I do this when I know that
my local work shouldn’t be there and I want to throw it away and replace it
with whatever’s on the remote branch.
For example, I might do this if I accidentally made a commit to main
that
actually should have been on new branch. In that case I’ll also create a new
branch (new-branch
in this example) to store my local work on the main
branch, so it’s not really being thrown away.
Fixing that problem looks like this:
git checkout main
# 1. create `new-branch` to store my work
git checkout -b new-branch
# 2. go back to the `main` branch I messed up
git checkout main
# 3. make sure that my `origin/main` is up to date
git fetch
# 4. double check to make sure I don't have any uncommitted
# work because `git reset --hard` will blow it away
git status
# 5. force my local branch to match the remote `main`
# NOTE: replace `origin/main` with the actual name of the
# remote/branch, you can get this from `git status`.
git reset --hard origin/main
This “store your work on main
on a new branch and then git reset --hard
” pattern can
also be useful if you’re not sure yet how to solve the conflict, since most
people are more used to merging 2 local branches than dealing with merging a
remote branch.
As always git reset --hard
is a dangerous action and you can permanently lose
your uncommitted work. I always run git status
first to make sure I don’t
have any uncommitted changes.
Some alternatives to using git reset --hard
for this:

- git branch -f main origin/main
- git fetch origin main:main --force
I’d never really thought about how confusing the git push
and git pull
error messages can be if you’re not used to reading them.
I recently posted an image about everything in your .git
directory and someone requested a text version, so here it is. I added some
extra notes too. First, here’s the image. It’s a ~15 word explanation of each
part of your .git
directory.
You can git clone https://github.com/jvns/inside-git
if you want to run all
these examples yourself.
Here’s a table of contents:
The first 5 parts (HEAD
, branch, commit, tree, blobs) are the core of git.
.git/HEAD
HEAD
is a tiny file that just contains the name of your current branch.
Example contents:
$ cat .git/HEAD
ref: refs/heads/main
HEAD
can also be a commit ID, that’s called “detached HEAD state”.
.git/refs/heads/main
A branch is stored as a tiny file that just contains 1 commit ID. It’s stored
in a folder called refs/heads
.
Example contents:
$ cat .git/refs/heads/main
1093da429f08e0e54cdc2b31526159e745d98ce0
.git/objects/10/93da429...
A commit is a small file containing its parent(s), message, tree, and author.
Example contents:
$ git cat-file -p 1093da429f08e0e54cdc2b31526159e745d98ce0
tree 9f83ee7550919867e9219a75c23624c92ab5bd83
parent 33a0481b440426f0268c613d036b820bc064cdea
author Julia Evans <julia@example.com> 1706120622 -0500
committer Julia Evans <julia@example.com> 1706120622 -0500
add hello.py
These files are compressed, so the best way to see objects is with git cat-file -p HASH
.
.git/objects/9f/83ee7550...
Trees are small files with directory listings. The files in it are called blobs.
Example contents:
$ git cat-file -p 9f83ee7550919867e9219a75c23624c92ab5bd83
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 .gitignore
100644 blob 665c637a360874ce43bf74018768a96d2d4d219a hello.py
040000 tree 24420a1530b1f4ec20ddb14c76df8c78c48f76a6 lib
The permissions here LOOK like unix permissions, but they’re actually super restricted: only 644 and 755 are allowed.
.git/objects/5a/475762c...
blobs are the files that contain your actual code
Example contents:
$ git cat-file -p 665c637a360874ce43bf74018768a96d2d4d219a
print("hello world!")
Storing a new blob with every change can get big, so git gc
periodically
packs them for efficiency in .git/objects/pack
.
.git/logs/refs/heads/main
The reflog stores the history of every branch, tag, and HEAD. For (mostly) every file in .git/refs
, there’s a corresponding log in .git/logs/refs
.
Example content for the main
branch:
$ tail -n 1 .git/logs/refs/heads/main
33a0481b440426f0268c613d036b820bc064cdea
1093da429f08e0e54cdc2b31526159e745d98ce0
Julia Evans <julia@example.com>
1706119866 -0500
commit: add hello.py
each line of the reflog has:

- the previous commit ID
- the new commit ID
- the author
- a timestamp
- a log message (like commit: add hello.py)

Normally it’s all one line, I just wrapped it for readability here.
.git/refs/remotes/origin/main
Remote-tracking branches store the most recently seen commit ID for a remote branch
Example content:
$ cat .git/refs/remotes/origin/main
fcdeb177797e8ad8ad4c5381b97fc26bc8ddd5a2
When git status says “you’re up to date with origin/main
”, it’s just looking
at this. It’s often out of date, you can update it with git fetch origin
main
.
.git/refs/tags/v1.0
A tag is a tiny file in .git/refs/tags
containing a commit ID.
Example content:
$ cat .git/refs/tags/v1.0
1093da429f08e0e54cdc2b31526159e745d98ce0
Unlike branches, when you make new commits it doesn’t update the tag.
.git/refs/stash
The stash is a tiny file called .git/refs/stash
. It contains the commit ID of a commit that’s created when you run git stash
.
$ cat .git/refs/stash
62caf3d918112d54bcfa24f3c78a94c224283a78
The stash is a stack, and previous values are stored in .git/logs/refs/stash
(the reflog for stash
).
$ cat .git/logs/refs/stash
62caf3d9 e85c950f Julia Evans <julia@example.com> 1706290652 -0500 WIP on main: 1093da4 add hello.py
00000000 62caf3d9 Julia Evans <julia@example.com> 1706290668 -0500 WIP on main: 1093da4 add hello.py
Unlike branches and tags, if you git stash pop a commit from the stash, it’s deleted from the reflog so it’s almost impossible to find it again. The stash is the only reflog in git where things get deleted very soon after they’re added. (entries expire out of the branch reflogs too, but generally only after 90 days)
A note on refs:
At this point you’ve probably noticed that a lot of things (branches, remote-tracking branches, tags, and the stash) are commit IDs in .git/refs.
They’re called “references” or “refs”. Every ref is a commit ID, but the
different types of refs are treated VERY differently by git, so I find it
useful to think about them separately even though they all use
the same file format. For example, git deletes things from the stash reflog in
a way that it won’t for branch or tag reflogs.
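Since branches, tags, and HEAD are all just little files, “resolving” a ref to a commit ID is mostly reading files and following one level of indirection. Here’s a rough sketch (simplified: it handles loose ref files and packed-refs, but none of git’s other lookup rules):

```python
import os
import tempfile

def resolve_ref(git_dir: str, name: str = "HEAD") -> str:
    # Follow "ref: ..." indirections (HEAD -> refs/heads/main -> commit ID).
    path = os.path.join(git_dir, name)
    if os.path.exists(path):
        with open(path) as f:
            content = f.read().strip()
        if content.startswith("ref: "):
            return resolve_ref(git_dir, content[len("ref: "):])
        return content
    # Refs can also live in packed-refs, one "<commit-id> <refname>" per line.
    packed = os.path.join(git_dir, "packed-refs")
    if os.path.exists(packed):
        with open(packed) as f:
            for line in f:
                if line.startswith(("#", "^")):
                    continue
                commit_id, _, refname = line.strip().partition(" ")
                if refname == name:
                    return commit_id
    raise KeyError(name)

# Demo: fabricate a tiny .git directory and resolve HEAD through the branch.
git_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(git_dir, "refs", "heads"))
with open(os.path.join(git_dir, "HEAD"), "w") as f:
    f.write("ref: refs/heads/main\n")
with open(os.path.join(git_dir, "refs", "heads", "main"), "w") as f:
    f.write("96fa6899ea34697257e84865fefc56beb42d6390\n")
head_commit = resolve_ref(git_dir)
```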
.git/config
This is a config file for the repository. It’s where you configure your remotes.
Example content:
[remote "origin"]
url = git@github.com:jvns/int-exposed
fetch = +refs/heads/*:refs/remotes/origin/*
[branch "main"]
remote = origin
merge = refs/heads/main
git has local and global settings, the local settings are here and the global ones are in ~/.gitconfig.
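The format is INI-like and simple enough to parse by hand. A minimal sketch (it ignores git-config features like includes, conditional sections, and multi-valued keys):

```python
def parse_git_config(text: str) -> dict:
    # Sections look like [remote "origin"]; settings are "key = value"
    # lines, usually indented.
    config, section = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            config.setdefault(section, {})
        elif "=" in line and section is not None:
            key, _, value = line.partition("=")
            config[section][key.strip()] = value.strip()
    return config

EXAMPLE = '''[remote "origin"]
    url = git@github.com:jvns/int-exposed
    fetch = +refs/heads/*:refs/remotes/origin/*
[branch "main"]
    remote = origin
    merge = refs/heads/main
'''
cfg = parse_git_config(EXAMPLE)
```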
hooks
.git/hooks/pre-commit
Hooks are optional scripts that you can set up to run (eg before a commit) to do anything you want.
Example content:
#!/bin/bash
any-commands-you-want
(this obviously isn’t a real pre-commit hook)
.git/index
The staging area stores files when you’re preparing to commit. This one is a binary file, unlike a lot of things in git which are essentially plain text files.
As far as I can tell the best way to look at the contents of the index is with git ls-files --stage
:
$ git ls-files --stage
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0 .gitignore
100644 665c637a360874ce43bf74018768a96d2d4d219a 0 hello.py
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0 lib/empty.py
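If you want to poke at that output programmatically, each line of git ls-files --stage is a mode, an object ID, a stage number, and then (in the real output) a tab followed by the path. A small parsing sketch:

```python
def parse_ls_files_stage(output: str) -> list:
    # Each line: "<mode> <object-id> <stage>\t<path>"
    entries = []
    for line in output.strip().splitlines():
        meta, _, path = line.partition("\t")
        mode, oid, stage = meta.split()
        entries.append({"mode": mode, "oid": oid,
                        "stage": int(stage), "path": path})
    return entries

output = ("100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0\t.gitignore\n"
          "100644 665c637a360874ce43bf74018768a96d2d4d219a 0\thello.py\n")
entries = parse_ls_files_stage(output)
```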
There are some other things in .git like FETCH_HEAD, worktrees, and info. I only included the ones that I’ve found it useful to understand.
One of the most common pieces of advice I hear about git is “just learn how the .git directory is structured and then you’ll understand everything!”.
I love understanding the internals of things more than anyone, but there’s a LOT that “how the .git directory is structured” doesn’t explain.
Hopefully this will be useful to some folks out there though.
Understanding how git commits are implemented feels pretty straightforward to me (those are facts! I can look it up!), but it’s been much harder to figure out how other people think about commits. So like I’ve been doing a lot recently, I went on Mastodon and started asking some questions.
I did a highly unscientific poll on Mastodon about how people think about Git commits: is it a snapshot? is it a diff? is it a list of every previous commit? (Of course it’s legitimate to think about it as all three, but I was curious about the primary way people think about Git commits). Here it is:
The results were:
I was really surprised that it was so evenly split between diffs and snapshots. People also made some interesting kind of contradictory statements like “in my mind a commit is a diff, but I think it’s actually implemented as a snapshot” and “in my mind a commit is a snapshot, but I think it’s actually implemented as a diff”. We’ll talk more about how a commit is actually implemented later in the post.
Before we go any further: when we say “a diff” or “a snapshot”, what does that mean?
What I mean by a diff is probably obvious: it’s what you get when you run git show COMMIT_ID. For example here’s a typo fix from rbspy:
diff --git a/src/ui/summary.rs b/src/ui/summary.rs
index 5c4ff9c..3ce9b3b 100644
--- a/src/ui/summary.rs
+++ b/src/ui/summary.rs
@@ -160,7 +160,7 @@ mod tests {
";
let mut buf: Vec<u8> = Vec::new();
- stats.write(&mut buf).expect("Callgrind write failed");
+ stats.write(&mut buf).expect("summary write failed");
let actual = String::from_utf8(buf).expect("summary output not utf8");
assert_eq!(actual, expected, "Unexpected summary output");
}
You can see it on GitHub here: https://github.com/rbspy/rbspy/commit/24ad81d2439f9e63dd91cc1126ca1bb5d3a4da5b
When I say “a snapshot”, what I mean is “all the files that you get when you run git checkout COMMIT_ID”.
Git often calls the list of files for a commit a “tree” (as in “directory tree”), and you can see all of the files for the above example commit here on GitHub:
https://github.com/rbspy/rbspy/tree/24ad81d2439f9e63dd91cc1126ca1bb5d3a4da5b (it’s /tree/ instead of /commit/)
Probably the most common piece of advice I hear related to learning Git is “just learn how Git represents things internally, and everything will make sense”. I obviously find this perspective extremely appealing (if you’ve spent any time reading this blog, you know I love thinking about how things are implemented internally).
But as a strategy for teaching Git, it hasn’t been as successful as I’d hoped! Often I’ve eagerly started explaining “okay, so git commits are snapshots with a pointer to their parent, and then a branch is a pointer to a commit, and…”, but the person I’m trying to help will tell me that they didn’t really find that explanation that useful at all and they still don’t get it. So I’ve been considering other options.
Let’s talk about the internals a bit anyway though.
Internally, git represents commits as snapshots (it stores the “tree” of the current version of every file). I wrote about this in In a git repository, where do your files live?, but here’s a very quick summary of what the internal format looks like.
Here’s how a commit is represented:
$ git cat-file -p 24ad81d2439f9e63dd91cc1126ca1bb5d3a4da5b
tree e197a79bef523842c91ee06fa19a51446975ec35
parent 26707359cdf0c2db66eb1216bf7ff00eac782f65
author Adam Jensen <adam@acj.sh> 1672104452 -0500
committer Adam Jensen <adam@acj.sh> 1672104890 -0500
Fix typo in expectation message
and here’s what we get when we look at this tree object: a list of every file / subdirectory in the repository’s root directory as of that commit:
$ git cat-file -p e197a79bef523842c91ee06fa19a51446975ec35
040000 tree 2fcc102acd27df8f24ddc3867b6756ac554b33ef .cargo
040000 tree 7714769e97c483edb052ea14e7500735c04713eb .github
100644 blob ebb410eb8266a8d6fbde8a9ffaf5db54a5fc979a .gitignore
100644 blob fa1edfb73ce93054fe32d4eb35a5c4bee68c5bf5 ARCHITECTURE.md
100644 blob 9c1883ee31f4fa8b6546a7226754cfc84ada5726 CODE_OF_CONDUCT.md
100644 blob 9fac1017cb65883554f821914fac3fb713008a34 CONTRIBUTORS.md
100644 blob b009175dbcbc186fb8066344c0e899c3104f43e5 Cargo.lock
100644 blob 94b87cd2940697288e4f18530c5933f3110b405b Cargo.toml
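Both of those cat-file outputs are small line-oriented text formats. As a rough sketch, here’s how you could parse the pretty-printed commit (note: this parses the `git cat-file -p` output shown above, not git’s on-disk binary encoding):

```python
def parse_commit(text: str) -> dict:
    # Header lines ("key value"), a blank line, then the commit message.
    headers, _, message = text.partition("\n\n")
    commit = {"parents": [], "message": message}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)  # merge commits have several
        else:
            commit[key] = value
    return commit

text = """tree e197a79bef523842c91ee06fa19a51446975ec35
parent 26707359cdf0c2db66eb1216bf7ff00eac782f65
author Adam Jensen <adam@acj.sh> 1672104452 -0500
committer Adam Jensen <adam@acj.sh> 1672104890 -0500

Fix typo in expectation message
"""
commit = parse_commit(text)
```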
What this means is that checking out a Git commit is always fast: it’s just as easy for Git to check out a commit from yesterday as it is to check out a commit from 1 million commits ago. Git never has to replay 10000 diffs to figure out the current state or anything, because commits just aren’t stored as diffs.
I just said that Git commits are snapshots, but when someone says “I think of git commits as a snapshot, but I think internally they’re actually diffs”, that’s actually kind of true too! Git commits are not represented as diffs in the sense you’re probably used to (they’re not represented on disk as a diff from the previous commit), but the basic intuition that if you edit a 10,000-line file 500 times, it would be inefficient to store 500 full copies of it is right.
Git does have a way of storing files as differences from other versions. This is called “packfiles” and periodically git will do a garbage collection and compress your data into packfiles to save disk space. When you git clone a repository git will also compress the data.
I don’t have space for a full explanation of how packfiles work in this post (Aditya Mukerjee’s Unpacking Git packfiles is my favourite writeup of how they work). But here’s a quick summary of my understanding of how deltas work and how they’re different from diffs:
When I run git show SOME_COMMIT to look at the diff for a commit, what actually happens is kind of counterintuitive. My understanding is that git finds the commit and its parent, reconstructs the full snapshot of each one (resolving any packfile deltas along the way), and then calculates the diff between the two snapshots on the fly.
So it takes deltas, turns them into a snapshot, and then calculates a diff. It feels a little weird because it starts with a diff-like-thing and ends up with another diff-like-thing, but the deltas and diffs are actually totally different so it makes sense.
That said, the way I think of it is that git stores commits as snapshots and packfiles are just an implementation detail to save disk space and make clones faster. I’ve never actually needed to know how packfiles work for any practical reason, but it does help me understand how it’s possible for git commits to be snapshots without using way too much disk space.
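To make the delta-vs-diff distinction concrete, here’s a toy sketch (with a made-up delta format – real packfile deltas are a binary encoding of copy/insert instructions) of reconstructing a snapshot from a delta and then computing a diff from the snapshots:

```python
import difflib

def apply_delta(base: str, ops) -> str:
    # A delta is a list of "copy this range of the base" and "insert
    # these new bytes" instructions -- it reconstructs content, it
    # doesn't describe a change the way a diff does.
    out = []
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out.append(base[offset:offset + length])
        else:  # ("insert", data)
            out.append(op[1])
    return "".join(out)

base = "line one\nline two\nline three\n"
# Reconstruct the child snapshot from a delta against the base...
child = apply_delta(base, [("copy", 0, 9),
                           ("insert", "line 2!\n"),
                           ("copy", 18, 11)])
# ...then compute a human-readable diff between the two snapshots,
# which is roughly the shape of what `git show` does.
diff_lines = list(difflib.unified_diff(base.splitlines(),
                                       child.splitlines(), lineterm=""))
```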
I think a pretty common “wrong” mental model for Git is:
This model is obviously not true (in real life, commits are stored as snapshots, and diffs are calculated from those snapshots), but it seems very useful and coherent to me! It gets a little weird with merge commits, but maybe you just say it’s stored as a diff from the first parent of the merge.
I think wrong mental models are often extremely useful, and this one doesn’t seem very problematic to me for every day Git usage. I really like that it makes the thing that we deal with the most often (the diff) the most fundamental – it seems really intuitive to me.
I’ve also been thinking about other “wrong” mental models you can have about Git which seem pretty useful like:
I feel like there’s a whole very coherent “wrong” set of ideas you can have about git that are pretty well supported by Git’s UI and not very problematic most of the time. I think it can get messy when you want to undo a change or when something goes wrong though.
Personally even though I know that in Git commits are snapshots, I probably think of them as diffs most of the time, because when you run git show, you see the diff, so it’s just what I’m used to seeing.
I also think about commits as snapshots sometimes though, because:
it matches how I compare versions of files outside git (like looking at how old.py and new.py are similar), and it helps me understand what git checkout COMMIT_ID is doing (the idea of replaying 10000 commits just feels stressful to me).
Some folks in the Mastodon replies also mentioned:
some other words people use to talk about commits might be less ambiguous.
It’s been very difficult for me to get a sense of what different mental models people have for git. It’s especially tricky because people get really into policing “wrong” mental models even though those “wrong” models are often really useful, so folks are reluctant to share their “wrong” ideas for fear of some Git Explainer coming out of the woodwork to explain to them why they’re Wrong. (these Git Explainers are often well-intentioned, but it still has a chilling effect either way)
But I’ve been learning a lot! I still don’t feel totally clear about how I want to talk about commits, but we’ll get there eventually.
Thanks to Marco Rogers, Marie Flanagan, and everyone on Mastodon for talking to me about git commits.
My motivation for this was that previously I was using Ansible to provision the server, but then I’d ad hoc installed a bunch of stuff on the server in a chaotic way separately from Ansible, so in the end I had no real idea of what was on that server and it felt like it would be a huge pain to recreate it if I needed to.
This server just runs a few small personal Go services, so it seemed like a good candidate for experimentation.
I had trouble finding explanations of how to set up NixOS and I needed to cobble together instructions from a bunch of different places, so here’s a very short summary of what worked for me.
I think the reason NixOS feels more reliable than Ansible to me is that NixOS is the operating system. It has full control over all your users and services and packages, and so it’s easier for it to reliably put the system into the state you want it to be in.
Because Nix has so much control over the OS, I think that if I tried to make any ad-hoc changes at all to my Nix system, Nix would just blow them away the next time I ran nixos-rebuild. But with Ansible, Ansible only controls a few small parts of the system (whatever I explicitly tell it to manage), so it’s easy to make changes outside Ansible.
That said, here’s what I did to set up NixOS on my server and run a Go service on it.
To install NixOS, I created a new Hetzner instance running Ubuntu, and then ran nixos-infect on it to convert the Ubuntu installation into a NixOS install, like this:
curl https://raw.githubusercontent.com/elitak/nixos-infect/master/nixos-infect | PROVIDER=hetznercloud NIX_CHANNEL=nixos-23.11 bash 2>&1 | tee /tmp/infect.log
I originally tried to do this on DigitalOcean, but it didn’t work for some reason, so I went with Hetzner instead and that worked.
This isn’t the only way to install NixOS (this wiki page lists options for setting up NixOS cloud servers), but it seemed to work. It’s possible that there are problems with installing that way that I don’t know about though. It does feel like using an ISO is probably better because that way you don’t have to do this transmogrification of Ubuntu into NixOS.
I definitely skipped Step 1 in nixos-infect’s README (“Read and understand the script”), but I didn’t feel too worried because I was running it on a new instance and I figured that if something went wrong I’d just delete it.
Next I needed to copy the generated Nix configuration to a new local Git repository, like this:
scp root@SERVER_IP:/etc/nixos/* .
This copied 3 files: configuration.nix, hardware-configuration.nix, and networking.nix. configuration.nix is the main file. I didn’t touch anything in hardware-configuration.nix or networking.nix.
I created a flake to wrap configuration.nix. I don’t remember why I did this (I have some idea of what the advantages of flakes are, but it’s not clear to me if any of them are actually relevant in this case) but it seems to work. Here’s my flake.nix:
{ inputs.nixpkgs.url = "github:NixOS/nixpkgs/23.11";
outputs = { nixpkgs, ... }: {
nixosConfigurations.default = nixpkgs.lib.nixosSystem {
system = "x86_64-linux";
modules = [ ./configuration.nix ];
};
};
}
The main gotcha about flakes that I needed to remember here was that you need to git add every .nix file you create, otherwise Nix will pretend it doesn’t exist.
The rules about git and flakes seem to be: you don’t have to commit your changes, but every new .nix file has to be git added, and unstaged changes to files that have already been git added are picked up fine.
These rules feel very counterintuitive to me (why require that you git add files but allow unstaged changes?) but that’s how it works. I think it might be an optimization because Nix has to copy all your .nix files to the Nix store for some reason, so only copying files that have been git added makes the copy faster. There’s a GitHub issue tracking it here so maybe the way this works will change at some point.
Next I needed to figure out how to deploy changes to my configuration. There are a bunch of tools for this, but I found the blog post Announcing nixos-rebuild: a “new” deployment tool for NixOS that said you can just use the built-in nixos-rebuild, which has --target-host and --build-host options so that you can specify which host to build on and deploy to, so that’s what I did.
I wanted to be able to get Go repositories and build the Go code on the target host, so I created a bash script that runs this command:
nixos-rebuild switch --fast --flake .#default --target-host my-server --build-host my-server --option eval-cache false
Making --target-host and --build-host the same machine is certainly not something I would do for a Serious Production Machine, but this server is extremely unimportant so it’s fine.
This --option eval-cache false is because Nix kept not showing me my errors because they were cached – it would just say error: cached failure of attribute 'nixosConfigurations.default.config.system.build.toplevel' instead of showing me the actual error message. Setting --option eval-cache false turned off caching so that I could see the error messages.
Now I could run bash deploy.sh on my laptop and deploy my configuration to the server! Hooray!
I also needed to set up a my-server host in my ~/.ssh/config. I set up SSH agent forwarding so that the server could download the private Git repositories it needed to access.
Host my-server
Hostname MY_IP_HERE
User root
Port 22
ForwardAgent yes
AddKeysToAgent yes
The thing I found the hardest was to figure out how to compile and configure a Go web service to run on the server. The norm seems to be to define your package and define your service’s configuration in 2 different files, but I didn’t feel like doing that – I wanted to do it all in one file. I couldn’t find a simple example of how to do this, so here’s what I did.
I’ve replaced the actual repository name with my-service because it’s a private repository and you can’t run it anyway.
{ pkgs ? (import <nixpkgs> { }), lib, stdenv, ... }:
let myservice = pkgs.callPackage pkgs.buildGoModule {
name = "my-service";
src = fetchGit {
url = "git@github.com:jvns/my-service.git";
rev = "efcc67c6b0abd90fb2bd92ef888e4bd9c5c50835"; # put the right git sha here
};
vendorHash = "sha256-b+mHu+7Fge4tPmBsp/D/p9SUQKKecijOLjfy9x5HyEE"; # nix will complain about this and tell you the right value
}; in {
services.caddy.virtualHosts."my-service.example.com".extraConfig = ''
reverse_proxy localhost:8333
'';
systemd.services.my-service = {
enable = true;
description = "my-service";
after = ["network.target"];
wantedBy = ["multi-user.target"];
script = "${myservice}/bin/my-service";
environment = {
DB_FILENAME = "/var/lib/my-service/db.sqlite";
};
serviceConfig = {
DynamicUser = true;
StateDirectory = "my-service"; # /var/lib/my-service
};
};
}
Then I just needed to do 2 more things: add ./my-service.nix to the imports section of configuration.nix, and add services.caddy.enable = true; to configuration.nix to enable Caddy. And everything worked!!
Some notes on this service configuration file:
I used extraConfig to configure Caddy because I didn’t feel like learning Nix’s special Caddy syntax – I wanted to just be able to refer to the Caddy documentation directly.
I used DynamicUser to create a user dynamically to run the service. I’d never used this before but it seems like a great simple way to create a different user for every service without having to write a bunch of repetitive boilerplate and being really careful to choose unique UIDs and GIDs. The blog post Dynamic Users with systemd talks about how it works.
I used StateDirectory to get systemd to create a persistent directory where I could store a SQLite database. It creates a directory at /var/lib/my-service/
I’d never heard of DynamicUser or StateDirectory before Kamal told me about them the other day but they seem like cool systemd features and I wish I’d known about them earlier.
One quick note on Caddy: I switched to Caddy a while back from nginx because it automatically sets up Let’s Encrypt certificates. I’ve only been using it for tiny hobby services, but it seems pretty great so far for that, and its configuration language is simpler too.
One problem I ran into was this error message:
error: in pure evaluation mode, 'fetchTree' requires a locked input, at «none»:0
I found this really perplexing – what is fetchTree? What is «none»:0? What did I do wrong?
I learned 4 things from debugging this (with help from the folks in the Nix discord):
fetchGit calls an internal function called fetchTree, so errors that say fetchTree might actually be referring to fetchGit
Nix can print a stack trace for errors if you pass --show-trace
this particular error doesn’t come with a stack trace, even with --show-trace. I’m not sure why this is. Some people told me this is because fetchTree is a built in function but – why can’t I see the line number in my nix code that called that built in function?
you can pass --option eval-cache false to turn off caching so that Nix will always show you the error message instead of error: cached failure of attribute 'nixosConfigurations.default.config.system.build.toplevel'
Ultimately the problem turned out to just be that I forgot to pass the Github revision ID (rev = "efcc67c6b0abd90fb2bd92ef888e4bd9c5c50835";) to fetchGit, which was really easy to fix.
I still don’t really understand the nix language syntax that well, but I haven’t felt motivated to get better at it yet – I guess learning new language syntax just isn’t something I find fun. Maybe one day I’ll learn it. My plan for now with NixOS is to just keep copying and pasting that my-service.nix file above forever.
I think my main outstanding questions are:
when I run nixos-rebuild, Nix seems to check that my systemd services are still working in some way. What does it check exactly? My best guess is that it checks that the systemd service starts successfully, but if the service starts and then immediately crashes, it won’t notice.
I really do like having all of my service configuration defined in one file, and the approach Nix takes does feel more reliable than the approach I was taking with Ansible.
I just started doing this a week ago and as with all things Nix I have no idea if I’ll end up liking it or not. It seems pretty good so far though!
I will say that I find using Nix to be very difficult and I really struggle when debugging Nix problems (that fetchTree problem I mentioned sounds simple, but it was SO confusing to me at the time), but I kind of like it anyway. Maybe because I’m not using Linux on my laptop right now I miss having linux evenings and Nix feels like a replacement for that :)
I published How Integers and Floats Work, which I worked on with Marie.
This one started out its life as “how your computer represents things in memory”, but once we’d explained how integers and floats were represented in memory the zine was already long enough, so we just kept it to integers and floats.
This zine was fun to write: I learned about why signed integers are represented in memory the way they are, and I’m really happy with the explanation of floating point we ended up with.
When explaining to people how your computer represents things in memory, I kept wanting to open up gdb or lldb and show some example C programs and how the variables in those C programs are represented in memory.
But gdb is kind of confusing if you’re not used to looking at it! So me and Marie made a cute interface on top of lldb, where you can put in any C program, click on a line, and see what the variable looks like. It’s called memory spy and here’s what it looks like:
I got really obsessed with float.exposed by Bartosz Ciechanowski for seeing how floats are represented in memory. So with his permission, I made a copy of it for integers called integer.exposed.
Here’s a screenshot:
It was pretty straightforward to make (copying someone else’s design is so much easier than making your own!) but I learned a few CSS tricks from analyzing how he implemented it.
I’ve been working on a big project to show people how to implement a working networking stack (TCP, TLS, DNS, UDP, HTTP) in 1400 lines of Python, that you can use to download a webpage using 100% your own networking code. Kind of like Nand to Tetris, but for computer networking.
This has been going VERY slowly – writing my own working shitty implementations was relatively easy (I finished that in October 2022), but writing clear tutorials that other people can follow is not.
But in March, I released the first part: Implement DNS in a Weekend. The response was really good – there are dozens of people’s implementations on GitHub, and people have implemented it in Go, C#, C, Clojure, Python, Ruby, Kotlin, Rust, Typescript, Haskell, OCaml, Elixir, Odin, and probably many more languages too. I’d like to see more implementations in less systems-y languages like vanilla JS and PHP, need to think about what I can do to encourage that.
I think “Implement IPv4 in a Weekend” might be the next one I release. It’s going to come with bonus guides to implementing ICMP and UDP too.
I gave a keynote at Strange Loop this year called Making Hard Things Easy (video + transcript), about why some things are so hard to learn and how we can make them easier. I’m really proud of how it turned out.
In September I decided to work on a second zine about Git, focusing more on how Git works. This is one of the hardest projects I’ve ever worked on, because over the last 10 years of using it I’d completely lost sight of what’s hard about Git.
So I’ve been doing a lot of research to try to figure out why Git is hard, and I’ve been writing a lot of blog posts. So far I’ve written:
What’s been most surprising so far is that I originally thought “to understand Git, people just need to learn git’s internal data model!“. But the more I talk to people about their struggles with Git, the less I think that’s true. I’ll leave it at that for now, but there’s a lot of work still to do.
I worked on a couple of fun Git tools this year:
a prototype exploring the question “what if git had git undo?”. I learned a bunch of things about why that’s not easy through writing the prototype, I might write a longer blog post about it later.
I’m also working on another Git software project, which is a collaboration with a friend.
This year I hired an Operations Manager for Wizard Zines! Lee is incredible and has done SO much to streamline the logistics of running the company, so that I can focus more on writing and coding. I don’t talk much about the mechanics of running the business on here, but it’s a lot and I’m very grateful to have some help.
A few of the many things Lee has made possible:
I spent 10 years building up a Twitter presence, but with the Recent Events, I spent a lot of time in 2023 working on building up a Mastodon account. I’ve found that I’m able to have more interesting conversations about computers on Mastodon than on Twitter or Bluesky, so that’s where I’ve been spending my time. We’ve been having a lot of great discussions about Git there recently.
I’ve run into a few technical issues with Mastodon (which I wrote about at Notes on using a single-person Mastodon server) but overall I’m happy there and I’ve been spending a lot more time there than on Twitter.
one of my questions for 2022 was:
Maybe I’ll work on that in 2024! Maybe not! I did make a little bit of progress on that question this year (I wrote What helps people get comfortable on the command line?).
Some other questions I’m thinking about on and off:
I’ve started to come to terms with the fact that projects always just take longer than I think they will. I started working this “implement your own terrible networking stack” project in 2022, and I don’t know if I’ll finish it in 2024. I’ve been working on this Git zine since September and I still don’t completely understand why Git is hard yet. There’s another small secret project that I initially thought of 5 years ago, made a bunch of progress on this year, but am still not done with. Things take a long time and that’s okay.
As always, thanks for reading and for making it possible for me to do this weird job.
But FUSE is pretty annoying to use on Mac – you need to install a kernel extension, and Mac OS seems to be making it harder and harder to install kernel extensions for security reasons. Also I had a few ideas for how to organize the filesystem differently than those projects.
So I thought it would be fun to experiment with ways to mount filesystems on Mac OS other than FUSE, so I built a project that does that called git-commit-folders. It works (at least on my computer) with both FUSE and NFS, and there’s a broken WebDav implementation too.
It’s pretty experimental (I’m not sure if this is actually a useful piece of software to have or just a fun toy to think about how git works) but it was fun to write and I’ve enjoyed using it myself on small repositories so here are some of the problems I ran into while writing it.
The main reason I wanted to make this was to give folks some intuition for how git works under the hood. After all, git commits really are very similar to folders – every Git commit contains a directory listing of the files in it, and that directory can have subdirectories, etc.
It’s just that git commits aren’t actually implemented as folders to save disk space.
So in git-commit-folders, every commit is actually a folder, and if you want to explore your old commits, you can do it just by exploring the filesystem! For example, if I look at the initial commit for my blog, it looks like this:
$ ls commits/8d/8dc0/8dc0cb0b4b0de3c6f40674198cb2bd44aeee9b86/
README
and a few commits later, it looks like this:
$ ls /tmp/git-homepage/commits/c9/c94e/c94e6f531d02e658d96a3b6255bbf424367765e9/
_config.yml config.rb Rakefile rubypants.rb source
In the filesystem mounted by git-commit-folders, commits are the only real folders – everything else (branches, tags, etc) is a symlink to a commit. This mirrors how git works under the hood.
$ ls -l branches/
lr-xr-xr-x 59 bork bazil-fuse -> ../commits/ff/ff56/ff563b089f9d952cd21ac4d68d8f13c94183dcd8
lr-xr-xr-x 59 bork follow-symlink -> ../commits/7f/7f73/7f73779a8ff79a2a1e21553c6c9cd5d195f33030
lr-xr-xr-x 59 bork go-mod-branch -> ../commits/91/912d/912da3150d9cfa74523b42fae028bbb320b6804f
lr-xr-xr-x 59 bork mac-version -> ../commits/30/3008/30082dcd702b59435f71969cf453828f60753e67
lr-xr-xr-x 59 bork mac-version-debugging -> ../commits/18/18c0/18c0db074ec9b70cb7a28ad9d3f9850082129ce0
lr-xr-xr-x 59 bork main -> ../commits/04/043e/043e90debbeb0fc6b4e28cf8776e874aa5b6e673
$ ls -l tags/
lr-xr-xr-x - bork 31 Dec 1969 test-tag -> ../commits/16/16a3/16a3d776dc163aa8286fb89fde51183ed90c71d0
This definitely doesn’t completely explain how git works (there’s a lot more to it than just “a commit is like a folder!”), but my hope is that it makes the idea that “every commit is like a folder with an old version of your code” feel a little more concrete.
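Here’s a hypothetical sketch of how that symlink layout could be built (illustrative Python, not the project’s actual Go code):

```python
import os
import tempfile

def build_branch_links(root: str, branches: dict) -> None:
    # git-commit-folders-style layout: commits are real directories,
    # branches are relative symlinks pointing into commits/xx/xxxx/<id>.
    os.makedirs(os.path.join(root, "branches"), exist_ok=True)
    for name, commit_id in branches.items():
        commit_dir = os.path.join(root, "commits",
                                  commit_id[:2], commit_id[:4], commit_id)
        os.makedirs(commit_dir, exist_ok=True)
        # Relative target, like the ../commits/ff/ff56/... links above
        target = os.path.join("..", "commits",
                              commit_id[:2], commit_id[:4], commit_id)
        os.symlink(target, os.path.join(root, "branches", name))

root = tempfile.mkdtemp()
build_branch_links(root, {"main": "8dc0cb0b4b0de3c6f40674198cb2bd44aeee9b86"})
```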
Before I get into the implementation, I want to talk about why having a filesystem with a folder for every git commit in it might be useful. A lot of my projects I end up never really using at all (like dnspeep) but I did find myself using this project a little bit while I was working on it.
The main uses I’ve found so far are:
running grep someFunction branch_histories/main/*/commit.go to find an old version of a function
looking at a file from another branch with vim branches/other-branch/go.mod
searching every branch for a function with grep someFunction branches/*/commit.go
All of these are through symlinks to commits instead of referencing commits directly.
None of these are the most efficient way to do this (you can use git show and git log -S or maybe git grep to accomplish something similar), but personally I always forget the syntax and navigating a filesystem feels easier to me. git worktree also lets you have multiple branches checked out at the same time, but to me it feels weird to set up an entire worktree just to look at 1 file.
Next I want to talk about some problems I ran into.
The two filesystems I could find that were natively supported by Mac OS were WebDav and NFS. I couldn’t tell which would be easier to implement so I just tried both.
At first webdav seemed easier and it turns out that golang.org/x/net has a webdav implementation, which was pretty easy to set up.
But that implementation doesn’t support symlinks, I think because it uses the io/fs interface and io/fs doesn’t support symlinks yet. Looks like that’s in progress though. So I gave up on webdav and decided to focus on the NFS implementation, using this go-nfs NFSv3 library.
Someone also mentioned that there’s FileProvider on Mac but I didn’t look into that.
I was implementing 3 different filesystems (FUSE, NFS, and WebDav), and it wasn’t clear to me how to avoid a lot of duplicated code.
My friend Dave suggested writing one core implementation and then writing
adapters (like fuse2nfs
and fuse2dav
) to translate it into the NFS and
WebDav verions. What this looked like in practice is that I needed to implement
3 filesystem interfaces:
- fs.FS for FUSE
- billy.Filesystem for NFS
- webdav.FileSystem for webdav

So I put all the core logic in the fs.FS interface, and then wrote two functions:
func Fuse2Dav(fs fs.FS) webdav.FileSystem
func Fuse2NFS(fs fs.FS) billy.Filesystem
All of the filesystems were kind of similar so the translation wasn’t too hard, there were just 1 million annoying bugs to fix.
Some git repositories have thousands or millions of commits. My first idea for how to address this was to make commits/
appear empty, so that it works like this:
$ ls commits/
$ ls commits/80210c25a86f75440110e4bc280e388b2c098fbd/
fuse fuse2nfs go.mod go.sum main.go README.md
So every commit would be available if you reference it directly, but you can’t list them. This is a weird thing for a filesystem to do but it actually works fine in FUSE. I couldn’t get it to work in NFS though. I assume what’s going on here is that if you tell NFS that a directory is empty, it’ll interpret that the directory is actually empty, which is fair.
I ended up handling this by sharding the commits the way .git/objects does (so that ls commits shows 0b 03 05 06 07 09 1b 1e 3e 4a), but doing 2 levels of this so that 18d46e76d7c2eedd8577fae67e3f1d4db25018b0 ends up at commits/18/18d4/18d46e76d7c2eedd8577fae67e3f1d4db25018b0.
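That sharding scheme can be sketched like this (in Python for illustration – the real project is in Go, and this isn’t its actual code):

```python
def sharded_path(commit_hash: str) -> str:
    # Two levels of prefix directories, like .git/objects but one level
    # deeper: the first 2 characters, then the first 4, then the full hash.
    return f"commits/{commit_hash[:2]}/{commit_hash[:4]}/{commit_hash}"
```

So a commit like 18d46e76... lands in a directory that only contains hashes sharing its 4-character prefix, which keeps each directory listing small.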
This seems to work okay on the Linux kernel which has ~1 million commits. It takes maybe a minute to do the initial load on my machine and then after that it just needs to do fast incremental updates.
Each commit hash is only 20 bytes so caching 1 million commit hashes isn’t a big deal, it’s just 20MB.
I think a smarter way to do this would be to load the commit listings lazily –
Git sorts its packfiles by commit ID, so you can pretty easily do a binary
search to find all commits starting with 1b
or 1b8c
. The git library I was using
doesn’t have great support for this though, because listing all commits in a
Git repository is a really weird thing to do. I spent maybe a couple of days
trying to implement it but I didn’t manage to get the performance I wanted so I
gave up.
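The lazy binary-search idea can be sketched like this (hypothetical Python, assuming you already have the packfile’s sorted list of commit IDs – this isn’t the go-git API):

```python
import bisect

def commits_with_prefix(sorted_ids, prefix):
    # sorted_ids: hex commit IDs in sorted order, as in a packfile index.
    # Binary search for the contiguous range of IDs starting with `prefix`.
    lo = bisect.bisect_left(sorted_ids, prefix)
    # 'g' sorts after every hex digit, so prefix+"g" bounds the range.
    hi = bisect.bisect_right(sorted_ids, prefix + "g")
    return sorted_ids[lo:hi]
```

This is O(log n) per lookup instead of scanning all ~1 million commits up front.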
I kept getting this error:
"/tmp/mnt2/commits/59/59167d7d09fd7a1d64aa1d5be73bc484f6621894/": Not a directory (os error 20)
This really threw me off at first but it turns out that this just means that there was an error while listing the directory, and the way the NFS library handles that error is with “Not a directory”. This happened a bunch of times and I just needed to track the bug down every time.
There were a lot of weird errors like this. I also got cd: system call
interrupted
which was pretty upsetting but ultimately was just some other bug
in my program.
Eventually I realized that I could use Wireshark to look at all the NFS packets being sent back and forth, which made some of this stuff easier to debug.
At first I was accidentally setting all my directory inode numbers to 0. This
was bad because if you run find
on a directory where the inode number of
every directory is 0, it’ll complain about filesystem loops and give up, which
is very fair.
I fixed this by defining an inode(string)
function which hashed a string to
get the inode number, and using the tree ID / blob ID as the string to hash.
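Here’s a sketch of what an inode(string) function like that could look like (Python for illustration; I’m assuming a SHA-256 hash here, which may not be what the project actually used):

```python
import hashlib

def inode(name: str) -> int:
    # Hash a tree/blob ID to a stable 64-bit inode number.
    # Setting the low bit guarantees the result is never 0
    # (inode 0 is what confused `find` in the bug above).
    digest = hashlib.sha256(name.encode()).digest()
    return int.from_bytes(digest[:8], "big") | 1
```

The nice property is that the same tree or blob always gets the same inode number, without having to store a mapping anywhere.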
I kept getting this “Stale NFS file handle” error. The problem is that I need to be able to take an opaque 64-byte NFS “file handle” and map it to the right directory.
The way the NFS library I’m using works is that it generates a file handle for every file and caches those references with a fixed size cache. This works fine for small repositories, but if there are too many files then it’ll overflow the cache and you’ll start getting stale file handle errors.
This is still a problem and I’m not sure how to fix it. I don’t understand how real NFS servers do this, maybe they just have a really big cache?
The NFS file handle is 64 bytes (64 bytes! not bits!) which is pretty big, so it does seem like you could just encode the entire file path in the handle a lot of the time and not cache it at all. Maybe I’ll try to implement that at some point.
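Here’s a sketch of the “encode the path in the handle” idea (hypothetical Python – paths longer than 64 bytes would still need some fallback, like a cache):

```python
HANDLE_SIZE = 64  # NFSv3 file handles are up to 64 bytes

def path_to_handle(path: str) -> bytes:
    # Stuff the file path directly into the fixed-size handle,
    # padded with NULs, so the server needs no handle cache at all.
    data = path.encode("utf-8")
    if len(data) > HANDLE_SIZE:
        raise ValueError("path too long to fit in a 64-byte handle")
    return data.ljust(HANDLE_SIZE, b"\x00")

def handle_to_path(handle: bytes) -> str:
    return handle.rstrip(b"\x00").decode("utf-8")
```

Because the handle is self-describing, it can never go “stale”: the server can always recover the path from the handle alone.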
The branch_histories/
directory only lists the latest 100 commits for each
branch right now. Not sure what the right move is there – it would be nice to
be able to list the full history of the branch somehow. Maybe I could use a
similar subfolder trick to the commits/
directory.
Git repositories sometimes have submodules. I don’t understand anything about submodules so right now I’m just ignoring them. So that’s a bug.
I built this with NFSv3 because the only Go library I could find at the time was an NFSv3 library. After I was done I discovered that the buildbarn project has an NFSv4 server in it. Would it be better to use that?
I don’t know if this is actually a problem or how big of an advantage it would be to use NFSv4. I’m also a little unsure about using the buildbarn NFS library because it’s not clear if they expect other people to use it or not.
There are probably more problems I forgot but that’s all I can think of for now. I may or may not fix the NFS stale file handle problem or the “it takes 1 minute to start up on the linux kernel” problem, who knows!
Thanks to my friend vasi who explained one million things about filesystems to me.
So in this post I want to briefly talk about the intuitive “offshoot” idea of a branch and how it compares with the way git actually defines branches.
Nothing in this post is remotely groundbreaking so I’m going to try to keep it pretty short.
Of course, people have many different intuitions about branches. Here’s the one that I think corresponds most closely to the physical “a branch of an apple tree” metaphor.
My guess is that a lot of people think about a git branch like this: the 2 commits in pink in this picture are on a “branch”.
I think there are two important things about this diagram:

- the branch has 2 commits on it (the two pink commits)
- the branch has a “parent” (main) which it’s an offshoot of

That seems pretty reasonable, but that’s not how git defines a branch – most importantly, git doesn’t have any concept of a branch’s “parent”. So how does git define a branch?
In git, a branch is the full history of every previous commit, not just the “offshoot” commits. So in our picture above both branches (main
and branch
) have 4 commits on them.
I made an example repository at https://github.com/jvns/branch-example which has its branches set up the same way as in the picture above. Let’s look at the 2 branches:
main
has 4 commits on it:
$ git log --oneline main
70f727a d
f654888 c
3997a46 b
a74606f a
and mybranch
has 4 commits on it too. The bottom two commits are shared
between both branches.
$ git log --oneline mybranch
13cb960 y
9554dab x
3997a46 b
a74606f a
So mybranch
has 4 commits on it, not just the 2 commits 13cb960
and 9554dab
that are “offshoot” commits.
You can get git to draw all the commits on both branches like this:
$ git log --all --oneline --graph
* 70f727a (HEAD -> main, origin/main) d
* f654888 c
| * 13cb960 (origin/mybranch, mybranch) y
| * 9554dab x
|/
* 3997a46 b
* a74606f a
Internally in git, branches are stored as tiny text files which have a commit ID in them. That commit is the latest commit on the branch. This is the “technically correct” definition I was talking about at the beginning.
Let’s look at the text files for main
and mybranch
in our example repo:
$ cat .git/refs/heads/main
70f727acbe9ea3e3ed3092605721d2eda8ebb3f4
$ cat .git/refs/heads/mybranch
13cb960ad86c78bfa2a85de21cd54818105692bc
This makes sense: 70f727
is the latest commit on main
and 13cb96
is the latest commit on mybranch
.
The reason this works is that every commit contains a pointer to its parent(s), so git can follow the chain of pointers to get every commit on the branch.
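That pointer-following walk is easy to sketch (hypothetical Python, using the example repo’s commits as data – git itself of course does this in C):

```python
def commits_on_branch(tip, parents):
    # parents maps commit ID -> list of parent commit IDs.
    # Starting from the branch's tip, follow parent pointers to
    # collect every commit reachable from it.
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen
```

Running this from mybranch’s tip (13cb960) gives all 4 commits, including the 2 shared with main – which is exactly the “full history” definition of a branch.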
Like I mentioned before, the thing that’s missing here is any relationship at
all between these two branches. There’s no indication that mybranch
is an
offshoot of main
.
Now that we’ve talked about how the intuitive notion of a branch is “wrong”, I want to talk about how it’s also right in some very important ways.
I think it’s pretty popular to tell people that their intuition about git is “wrong”. I find that kind of silly – in general, even if people’s intuition about a topic is technically incorrect in some ways, people usually have the intuition they do for very legitimate reasons! “Wrong” models can be super useful.
So let’s talk about 3 ways the intuitive “offshoot” notion of a branch matches up very closely with how we actually use git in practice.
Now let’s go back to our original picture.
When you rebase mybranch
on main
, it takes the commits on the “intuitive”
branch (just the 2 pink commits) and replays them onto main
.
The result is that just the 2 commits (x and y) get copied. Here’s what that looks like:
$ git switch mybranch
$ git rebase main
$ git log --oneline mybranch
952fa64 (HEAD -> mybranch) y
7d50681 x
70f727a (origin/main, main) d
f654888 c
3997a46 b
a74606f a
Here git rebase
has created two new commits (952fa64
and 7d50681
) whose
information comes from the previous two x
and y
commits.
So the intuitive model isn’t THAT wrong! It tells you exactly what happens in a rebase.
But because git doesn’t know that mybranch
is an offshoot of main
, you need
to tell it explicitly where to rebase the branch.
Merges don’t copy commits, but they do need a “base” commit: the way merges work is that it looks at two sets of changes (starting from the shared base) and then merges them.
Let’s undo the rebase we just did and then see what the merge base is.
$ git switch mybranch
$ git reset --hard 13cb960 # undo the rebase
$ git merge-base main mybranch
3997a466c50d2618f10d435d36ef12d5c6f62f57
This gives us the “base” commit where our branch branched off, 3997a4
.
That’s exactly the commit you would think it might be based on our intuitive
picture.
If we create a pull request on GitHub to merge mybranch
into main
, it’ll
also show us 2 commits: the commits x
and y
. That makes sense and also
matches our intuitive notion of a branch.
I assume if you make a merge request on GitLab it shows you something similar.
This leaves our intuitive definition of a branch looking pretty good actually! The “intuitive” idea of what a branch is matches exactly with how merges and rebases and GitHub pull requests work.
You do need to explicitly
specify the other branch when merging or rebasing or making a pull request (like git rebase main
),
because git doesn’t know what branch you think your offshoot is based on.
But the intuitive notion of a branch has one fairly serious problem: the way
you intuitively think about main
and an offshoot branch are very different,
and git doesn’t know that.
So let’s talk about the different kinds of git branches.
To a human, main
and mybranch
are pretty different, and you probably have
pretty different intentions around how you want to use them.
I think it’s pretty normal to think of some branches as being “trunk” branches, and some branches as being “offshoots”. Also you can have an offshoot of an offshoot.
Of course, git itself doesn’t make any such distinctions (the term “offshoot” is one I just made up!), but what kind of a branch it is definitely affects how you treat it.
For example:
- you might rebase mybranch onto main, but you probably wouldn’t rebase main onto mybranch – that would be weird!

One thing I think throws people off about git is that because git doesn’t have any notion of whether a branch is an “offshoot” of another branch, it won’t give you any guidance about if/when it’s appropriate to rebase branch X on branch Y. You just have to know.
For example, you can do either:
$ git checkout main
$ git rebase mybranch
or
$ git checkout mybranch
$ git rebase main
Git will happily let you do either one, even though in this case git rebase main
is
extremely normal and git rebase mybranch
is pretty weird. A lot of people
said they found this confusing so here’s a picture of the two kinds of rebases:
Similarly, you can do merges “backwards”, though that’s much more normal than
doing a backwards rebase – merging mybranch
into main
and main
into
mybranch
are both useful things to do for different reasons.
Here’s a diagram of the two ways you can merge:
I hear the statement “the main
branch is not special” a lot and I’ve been
puzzled about it – in most of the repositories I work in, main
is
pretty special! Why are people saying it’s not?
I think the point is that even though branches do have relationships
between them (main
is often special!), git doesn’t know anything about those
relationships.
You have to tell git explicitly about the relationship between branches every
single time you run a git command like git rebase
or git merge
, and if you
make a mistake things can get really weird.
I don’t know whether git’s design here is “right” or “wrong” (it definitely has some pros and cons, and I’m very tired of reading endless arguments about it), but I do think it’s surprising to a lot of people for good reason.
Let’s say you want to look at just the “offshoot” commits on a branch, which as we’ve discussed is a completely normal thing to want.
Here’s how to see just the 2 offshoot commits on our branch with git log
:
$ git switch mybranch
$ git log main..mybranch --oneline
13cb960 (HEAD -> mybranch, origin/mybranch) y
9554dab x
You can look at the combined diff for those same 2 commits with git diff
like this:
$ git diff main...mybranch
So to see the 2 commits x
and y
with git log
, you need to use 2 dots
(..
), but to look at the same commits with git diff
, you need to use 3 dots
(...
).
Personally I can never remember what ..
and ...
mean so I just avoid
them completely even though in principle they seem useful.
Also, it’s worth mentioning that GitHub does have a “special branch”: every
github repo has a “default branch” (in git terms, it’s what HEAD
points at),
which is special in the following ways:
- it’s the branch you get when you git clone the repository

and probably even more that I’m not thinking of.
This all seems extremely obvious in retrospect, but it took me a long time to figure out what a more “intuitive” idea of a branch even might be because I was so used to the technical “a branch is a reference to a commit” definition.
I also hadn’t really thought about how git makes you tell it about the
hierarchy between your branches every time you run a git rebase
or git
merge
command – for me it’s second nature to do that and it’s not a big deal,
but now that I’m thinking about it, it’s pretty easy to see how somebody could
get mixed up.
I found it very hard to find simple examples of flake files and I ran into a few problems that were very confusing to me, so I wanted to write down some very basic examples and some of the problems I ran into in case it’s helpful to someone else who’s getting started with flakes.
First, let’s talk about what a flake is a little.
Every explanation I’ve found of flakes explains them in terms of other nix concepts (“flakes simplify nix usability”, “flakes are processors of Nix code”). Personally I really needed a way to think about flakes in terms of other non-nix things and someone made an analogy to Docker containers that really helped me, so I’ve been thinking about flakes a little like Docker container images.
Here are some ways in which flakes are like Docker containers:
- you can share your flake.nix file with someone and then they can build the software exactly the same way you built it (a little like how you can share a Dockerfile, though flakes are MUCH better at the “exactly the same way you built it” thing)

flakes are also different from Docker containers in a LOT of ways:
- with a Dockerfile, you’re not actually guaranteed to get the exact same results as another user. With flake.nix and flake.lock you are.
- flake.nix files are programs in the nix programming language instead of mostly a bunch of shell commands

Obviously this analogy breaks down pretty quickly (the list of differences is VERY long), but they do share the “you can share a dev environment with a single configuration file” design goal.
To me one of the biggest advantages of nix is that I’m on a Mac and nix has a repository with a lot of pre-compiled binaries of various packages for Mac. I mostly mention this because people always say that nix is good because it’s “declarative” or “reproducible” or “functional” or whatever but my main motivation for using nix personally is that it has a lot of binary packages. I do appreciate that it makes it easier for me to build a 5-year-old version of hugo on mac though.
My impression is that nix has more binary packages than Homebrew does, so installing things is faster and I don’t need to build as much from source.
Previously I was using nix as a Homebrew replacement like this (which I talk about more in this blog post):
- run nix-env -iA nixpkgs.whatever to install stuff

This worked great (except that it randomly broke occasionally, but someone helped me find a workaround for that so the random breaking wasn’t a big issue).
I thought it might be fun to have a single flake.nix
file where I could maintain a list
of all the packages I wanted installed and then put all that stuff in a
directory in my PATH
. This isn’t very well motivated: my previous setup was
generally working just fine, but I have a long history of fiddling with my
computer setup (Arch Linux ftw) and so I decided to have a Day Of Fiddling.
I think the only practical advantages of flakes for me are:
- I can use my flake.nix file to set up a new computer more easily

These are pretty minor though.
Okay, so I want to make a flake with a bunch of packages installed in it, let’s say Ruby and cowsay to start. How do I
do that? I went to zero-to-nix and copied and pasted some things and ended up with this flake.nix
file (here it is in a gist):
{
inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixpkgs-23.05-darwin";
outputs = { self, nixpkgs }: {
devShell.aarch64-darwin = nixpkgs.legacyPackages.aarch64-darwin.mkShell {
buildInputs = with nixpkgs.legacyPackages.aarch64-darwin; [
cowsay
ruby
];
};
};
}
This has a little bit of boilerplate so let’s list the things I understand about this:
- aarch64-darwin is my machine’s architecture; this is important because I’m asking nix to download binaries
- nixpkgs is my one input. I get to pick and choose which bits of it I want to bring into my flake though.
- the github:NixOS/nixpkgs/nixpkgs-23.05-darwin url scheme is a bit unusual: the format is github:USER/REPO_NAME/TAG_OR_BRANCH_NAME. So this is looking at the nixpkgs-23.05-darwin tag in the NixOS/nixpkgs repository.
- mkShell is a nix function that’s apparently useful if you want to run nix develop. I stopped using it after this so I don’t know more than that.
- devShell.aarch64-darwin is the name of the output. Apparently I need to give it that exact name or else nix develop will yell at me
- cowsay and ruby are the things I’m taking from nixpkgs to put in my output
- I have no idea what self is doing here or what legacyPackages is about

Okay, cool. Let’s try to build it:
$ nix build
error: getting status of '/nix/store/w1v41cyqyx4d7q4g7c8nb50bp9dvjm29-source/flake.nix': No such file or directory
This error is VERY mysterious – what is /nix/store/w1v41cyqyx4d7q4g7c8nb50bp9dvjm29-source/
and why does nix think it should exist???
I was totally stuck until a very nice person on Mastodon helped me. So let’s talk about what’s going wrong here.
Apparently nix flakes have some Weird Rules about git. The way it works is:
- if all your files are git add-ed to git, everything is fine
- but if your flake.nix file isn’t tracked by git yet (because you just created it and are trying to get it to work), nix will COMPLETELY IGNORE YOUR FILE

After someone kindly told me what was happening, I found that this is mentioned in this blog post about flakes, which says:
Note that any file that is not tracked by Git is invisible during Nix evaluation
There’s also a github issue discussing what to do about this.
So we need to git add
the file to get nix to pay attention to it. Cool. Let’s keep going.
To get any of the commands we’re going to talk about to work (like nix build
), you need to enable two nix features: nix-command and flakes.
I set this up by putting experimental-features = nix-command flakes
in my
~/.config/nix/nix.conf
, but you can also run nix --extra-experimental-features "flakes nix-command" build
instead of nix build
.
nix develop
The instructions I was following told me that I could now run nix develop
and get a shell inside my new environment. I tried it and it works:
$ nix develop
grapefruit:nix bork$ cowsay hi
 ____
< hi >
 ----
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
Cool! I was curious about how the PATH was set up inside this environment so I took a look:
grapefruit:nix bork$ echo $PATH
/nix/store/v5q1bxrqs6hkbsbrpwc81ccyyfpbl8wk-clang-wrapper-11.1.0/bin:/nix/store/x9jmvvxcys4zscff39cnpw0kyfvs80vp-clang-11.1.0/bin:/nix/store/3f1ii2y5fs1w7p0id9mkis0ffvhh1n8w-coreutils-9.1/bin:/nix/store/8ldvi6b3ahnph19vm1s0pyjqrq0qhkvi-cctools-binutils-darwin-wrapper-973.0.1/bin:/nix/store/5kbbxk18fp645r4agnn11bab8afm0ry3-cctools-binutils-darwin-973.0.1/bin:/nix/store/5si884h02nqx3dfcdm5irpf7caihl6f8-cowsay-3.7.0/bin:/nix/store/5bs5q2dw5bl7c4krcviga6yhdrqbvdq6-ruby-3.1.4/bin:/nix/store/3f1ii2y5fs1w7p0id9mkis0ffvhh1n8w-coreutils-9.1/bin
It looks like every dependency has been added to the PATH separately: for example there’s
/nix/store/5si884h02nqx3dfcdm5irpf7caihl6f8-cowsay-3.7.0/bin
for cowsay
and
/nix/store/5bs5q2dw5bl7c4krcviga6yhdrqbvdq6-ruby-3.1.4/bin
for ruby
. That’s
fine but it’s not how I wanted my setup to work: I wanted a single directory of
symlinks that I could just put in my PATH in my normal shell.
buildEnv
I asked in the Nix discord and someone told me I could use buildEnv
to turn
my flake into a directory of symlinks. As far as I can tell it’s just a way to
take nix packages and copy their symlinks into another directory.
After some fiddling, I ended up with this: (here’s a gist)
{
inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixpkgs-23.05-darwin";
outputs = { self, nixpkgs }: {
defaultPackage.aarch64-darwin = nixpkgs.legacyPackages.aarch64-darwin.buildEnv {
name = "julia-stuff";
paths = with nixpkgs.legacyPackages.aarch64-darwin; [
cowsay
ruby
];
pathsToLink = [ "/share/man" "/share/doc" "/bin" "/lib" ];
extraOutputsToInstall = [ "man" "doc" ];
};
};
}
This put a bunch of symlinks in result/bin
:
$ ls result/bin/
bundle bundler cowsay cowthink erb gem irb racc rake rbs rdbg rdoc ri ruby typeprof
Sweet! Now I have a thing I can theoretically put in my PATH – this result
directory. Next I mostly just
needed to add every other package I wanted to install to this flake.nix
file (I got the list
from nix-env -q
).
I ran into a bunch of weird problems adding all the packages I already had installed to my flake.nix, so let’s talk about them.
I wanted to install a non-free package called ngrok
. Nix gave me 3 options for how I could do this. Option C seemed the most promising:
c) For `nix-env`, `nix-build`, `nix-shell` or any other Nix command you can add
{ allowUnfree = true; }
to ~/.config/nixpkgs/config.nix.
But adding { allowUnfree = true}
to ~/.config/nixpkgs/config.nix
didn’t do
anything for some reason so instead I went with option A, which did seem to
work:
$ export NIXPKGS_ALLOW_UNFREE=1
Note: For `nix shell`, `nix build`, `nix develop` or any other Nix 2.4+
(Flake) command, `--impure` must be passed in order to read this
environment variable.
I made a couple of flakes for custom Nix packages I’d made (which I wrote about in my first nix blog post), and I wanted to set them up like this (you can see the full configuration here):
hugoFlake.url = "path:../hugo-0.40";
paperjamFlake.url = "path:../paperjam";
This worked fine the first time I ran nix build
, but when I reran nix build
again later I got some totally inscrutable error.
My workaround was just to run rm flake.lock every time before running nix build, which seemed to fix the problem.
I don’t really understand what’s going on here but there’s a very long github issue thread about it.
For a while, every time I ran nix build
, I got this error:
$ nix build
error:
… while reading the response from the build hook
error: unexpected EOF reading a line
I spent a lot of time poking at my flake.nix
trying to guess at where I could have gone wrong.
A very nice person on Mastodon also helped me with this one and it turned out
that what I needed to do was find the nix-daemon
process and kill it. I still
have no idea what happened here or what that error message means, I did upgrade
nix at some point during this whole process so I guess the
upgrade went wonky somehow.
I don’t think this one is a common problem.
I wanted to install the zulu
package for Java, but when I tried to add it to
my list of packages I got this error complaining about a broken symlink:
$ nix build
error: builder for '/nix/store/4n9c4707iyiwwgi9b8qqx7mshzrvi27r-julia-dev.drv' failed with exit code 2;
last 1 log lines:
> error: not a directory: `/nix/store/2vc4kf5i28xcqhn501822aapn0srwsai-zulu-11.62.17/share/man'
For full logs, run 'nix log /nix/store/4n9c4707iyiwwgi9b8qqx7mshzrvi27r-julia-dev.drv'.
$ ls /nix/store/2vc4kf5i28xcqhn501822aapn0srwsai-zulu-11.62.17/share/ -l
lrwxr-xr-x 29 root 31 Dec 1969 man -> zulu-11.jdk/Contents/Home/man
I think what’s going on here is that the zulu
package in nixpkgs-23.05
was
just broken (looks like it’s since been fixed in the unstable version).
I decided I didn’t feel like dealing with that and it turned out I already had
Java installed another way outside nix, so I just removed zulu
from my list
and moved on.
Now that I knew how to fix all of the weird problems I’d run into, I wrote a
little shell script called nix-symlink
to build my flake and symlink it to
the very unimaginatively named ~/.nix-flake
. The idea was that then I could
put ~/.nix-flake
in my PATH
and have all my programs available.
I think people usually use nix flakes in a per-project way instead of “a single global flake”, but this is how I wanted my setup to work so that’s what I did.
Here’s the nix-symlink
script. The rm flake.lock
is because of that relative path issue,
and the NIXPKGS_ALLOW_UNFREE
is so I could install ngrok.
#!/bin/bash
set -euo pipefail
export NIXPKGS_ALLOW_UNFREE=1
cd ~/work/nixpkgs/flakes/grapefruit || exit
rm flake.lock
nix build --impure --out-link ~/.nix-flake
I put ~/.nix-flake
at the beginning of my PATH
(not at the end), but I might revisit that, we’ll see.
At the end of all this, I wanted to run a garbage collection because I’d
installed a bunch of random stuff that was taking about 20GB of extra hard
drive space in my /nix/store
. I think there are two different ways to collect
garbage in nix:
nix-store --gc
nix-collect-garbage
I have no idea what the difference between them is, but nix-collect-garbage
seemed to delete more stuff for some reason.
I wanted to check that my ~/.nix-flake
directory was actually a GC root, so
that all my stuff wouldn’t get deleted when I ran a GC.
I ran nix-store --gc --print-roots
to print out all the GC roots and my
~/.nix-flake
was in there so everything was good! This command also runs a GC
so it was kind of a dangerous way to check if a GC was going to delete
everything, but luckily it worked.
The last problem I ran into is speed. Previously, installing a new small package took me 2 seconds with nix-env -iA
:
$ time nix-env -iA nixpkgs.sl
installing 'sl-5.05'
these 2 paths will be fetched (0.41 MiB download, 3.77 MiB unpacked):
/nix/store/yv1c98m5pncx3i5q7nr7i7mfjkiyii72-ncurses-6.4
/nix/store/2k78vf30czicjs0dq9x0sj4017ziwxkn-sl-5.05
copying path '/nix/store/yv1c98m5pncx3i5q7nr7i7mfjkiyii72-ncurses-6.4' from 'https://cache.nixos.org'...
copying path '/nix/store/2k78vf30czicjs0dq9x0sj4017ziwxkn-sl-5.05' from 'https://cache.nixos.org'...
building '/nix/store/zadpfs9k1cw5x7iniwwcqd8lb7nnc7bb-user-environment.drv'...
________________________________________________________
Executed in 1.96 secs fish external
Installing the same package with flakes takes 7 seconds, plus the time to edit the config file:
$ vim ~/work/nixpkgs/flakes/grapefruit/flake.nix
$ time nix-symlink
________________________________________________________
Executed in 7.04 secs fish external
usr time 1.78 secs 0.29 millis 1.78 secs
sys time 0.51 secs 2.03 millis 0.51 secs
I don’t know what to do about this so I’ll just live with it. We’ll see if this ends up being annoying or not.
Now my new nix workflow is:
- edit flake.nix to add or remove packages (this file)
- run my nix-symlink script after editing it
- run nix-collect-garbage once in a while
The last thing I wanted to do was run
nix registry add nixpkgs github:NixOS/nixpkgs/nixpkgs-23.05-darwin
so that if I want to ad-hoc run a flake with nix run nixpkgs#cowsay
, it’ll
take the version from the 23.05 version of nixpkgs. Mostly I just wanted this
so I didn’t have to download new versions of the nixpkgs repository all the
time – I just wanted to pin the 23.05 version.
I think nixpkgs-unstable
is the default which I’m sure is fine too if you
want to have more up-to-date software.
My solutions to all the nix problems I described are maybe not The Best ™,
but I’m happy that I figured out a way to install stuff that just involves one
relatively simple flake.nix
file and a 6-line bash script and not a lot of other
machinery.
Personally I still feel extremely uncomfortable with nix and so it’s important to me to keep my configuration as simple as possible without a lot of extra abstraction layers that I don’t understand. I might try out flakey-profile at some point though because it seems extremely simple.
You can manage a lot more stuff with nix, like:
- your project’s dependencies (instead of npm install and pip install and bundle install)

There are all kinds of tools that build on top of nix and flakes like home-manager. Like I said before though, it’s important to me to keep my configuration super simple so that I can have any hope of understanding how it works and being able to fix problems when it breaks, so I haven’t paid attention to any of that stuff.
I’ve been complaining about nix a little in this post, but as usual with open source projects I assume that nix has all of these papercuts because it’s a complicated system run by a small team of volunteers with very limited time.
Folks on the unofficial nix discord have been helpful, I’ve had a somewhat mixed experience there but they have a “support forum” section in there and I’ve gotten answers to a lot of my questions.
the main resources I’ve found for understanding nix flakes are:
Also Kamal (my partner) uses nix and that really helps, I think using nix with an experienced friend around is a lot easier.
I still kind of like nix after using it for 9 months despite how confused I am about it all the time, I feel like once I get things working they don’t usually break.
We’ll see if that continues to be the case with flakes! Maybe I’ll go back to
just using nix-env -iA
ing everything if it goes badly.
I was trying to explain how git cherry-pick works the other day, and I found myself getting confused.
What went wrong was: I thought that git cherry-pick
was basically applying a
patch, but when I tried to actually do it that way, it didn’t work!
Let’s talk about what I thought cherry-pick
did (applying a patch), why
that’s not quite true, and what it actually does instead (a “3-way merge”).
This post is extremely in the weeds and you definitely don’t need to understand this stuff to use git effectively. But if you (like me) are curious about git’s internals, let’s talk about it!
The way I previously understood git cherry-pick COMMIT_ID
is:
- calculate the diff for COMMIT_ID, like git show COMMIT_ID --patch > out.patch
- apply that patch to the current branch, like git apply out.patch
Before we get into this – I want to be clear that this model is mostly right, and if that’s your mental model that’s fine. But it’s wrong in some subtle ways and I think that’s kind of interesting, so let’s see how it works.
If I try to do the “calculate the diff and apply the patch” thing in a case where there’s a merge conflict, here’s what happens:
$ git show 10e96e46 --patch > out.patch
$ git apply out.patch
error: patch failed: content/post/2023-07-28-why-is-dns-still-hard-to-learn-.markdown:17
error: content/post/2023-07-28-why-is-dns-still-hard-to-learn-.markdown: patch does not apply
This just fails – it doesn’t give me any way to resolve the conflict or figure out how to solve the problem.
This is quite different from what actually happens when you run git cherry-pick
,
which is that I get a merge conflict:
$ git cherry-pick 10e96e46
error: could not apply 10e96e46... wip
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
So it seems like the “git is applying a patch” model isn’t quite right. But the error message literally does say “could not apply 10e96e46”, so it’s not quite wrong either. What’s going on?
I went digging through git’s source code to see how cherry-pick works, and ended up at this line of code:
res = do_recursive_merge(r, base, next, base_label, next_label, &head, &msgbuf, opts);
So a cherry-pick is a… merge? What? How? What is it even merging? And how does merging even work in the first place?
I realized that I didn’t really know how git’s merge worked, so I googled it and found out that git does a thing called “3-way merge”. What’s that?
Let’s say I want to merge these 2 files. We’ll call them v1.py and v2.py.
v1.py:

def greet():
    greeting = "hello"
    name = "julia"
    return greeting + " " + name

v2.py:

def say_hello():
    greeting = "hello"
    name = "aanya"
    return greeting + " " + name
There are two lines that differ: we have def greet() and def say_hello(), and we have name = "julia" and name = "aanya". How do we know what to pick? It seems impossible!
But what if I told you that the original function (base.py) was this?
def say_hello():
    greeting = "hello"
    name = "julia"
    return greeting + " " + name
Suddenly it seems a lot clearer! v1 changed the function’s name to greet and v2 set name = "aanya". So to merge, we should make both those changes:
def greet():
    greeting = "hello"
    name = "aanya"
    return greeting + " " + name
We can ask git to do this merge with git merge-file, and it gives us exactly the result we expected: it picks def greet() and name = "aanya".
$ git merge-file v1.py base.py v2.py -p
def greet():
    greeting = "hello"
    name = "aanya"
    return greeting + " " + name
This way of merging where you merge 2 files + their original version is called a 3-way merge.
If you want to try it out yourself in a browser, I made a little playground at jvns.ca/3-way-merge/. I made it very quickly so it’s not mobile friendly.
The way I think about the 3-way merge is – git merges changes, not files. We have an original file and 2 possible changes to it, and git tries to combine both of those changes in a reasonable way. Sometimes it can’t (for example if both changes change the same line), and then you get a merge conflict.
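The “merge changes, not files” idea is easy to sketch in code. Here’s a toy Python version of a line-by-line 3-way merge – my own simplification, not git’s actual algorithm: it assumes all three versions have the same number of lines, while real merges first align lines with a diff:

```python
def merge3(base, v1, v2):
    # Toy 3-way merge: walk the three versions line by line and keep
    # whichever side changed each line relative to base.
    merged = []
    for b, a, c in zip(base, v1, v2):
        if a == b:
            merged.append(c)      # v1 left this line alone: take v2's version
        elif c == b:
            merged.append(a)      # v2 left this line alone: take v1's version
        elif a == c:
            merged.append(a)      # both sides made the same change
        else:
            merged.append("<<<<<<< conflict")  # both changed it differently
    return merged

base = ['def say_hello():', '    greeting = "hello"', '    name = "julia"', '    return greeting + " " + name']
v1   = ['def greet():',     '    greeting = "hello"', '    name = "julia"', '    return greeting + " " + name']
v2   = ['def say_hello():', '    greeting = "hello"', '    name = "aanya"', '    return greeting + " " + name']
print("\n".join(merge3(base, v1, v2)))
```

On the example above this picks def greet(): and name = "aanya", the same answer git merge-file gave.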
Git can also merge more than 2 possible changes: you can have an original file and 8 possible changes, and it can try to reconcile all of them. That’s called an octopus merge but I don’t know much more than that, I’ve never done one.
Now let’s get a little weird! When we talk about git “applying a patch” (as you do in a rebase or revert or cherry-pick), it’s not actually creating a patch file and applying it. Instead, it’s doing a 3-way merge.
Here’s how applying commit X as a patch to your current commit corresponds to this v1, v2, and base setup from before:

- the file from your current commit is v1
- the file from X’s parent commit is base
- the file from X itself is v2
- git runs git merge-file v1 base v2 to combine them (technically git does not actually run git merge-file, it runs a C function that does it)

Together, you can think of base and v2 as being the “patch”: the diff between them is the change that you want to apply to v1.
Let’s say we have this commit graph, and we want to cherry-pick Y onto main:
A - B (main)
\
\
X - Y - Z
How do we turn that into a 3-way merge? Here’s how it translates into our v1, v2 and base from earlier:

- B is v1
- X is the base, Y is v2

So together X and Y are the “patch”.
And git rebase is just like git cherry-pick, but repeated a bunch of times.
Now let’s say we want to run git revert Y on this commit graph:

X - Y - Z - A - B

- B is v1
- Y is the base, X is v2

This is exactly like a cherry-pick, but with X and Y reversed. We have to flip them because we want to apply a “reverse patch”.
Revert and cherry-pick are so closely related in git that they’re actually implemented in the same file: revert.c.
This trick of using a 3-way merge to apply a commit as a patch seems really clever and cool and I’m surprised that I’d never heard of it before! I don’t know of a name for it, but I kind of want to call it a “3-way patch”.
The idea is that with a 3-way patch, you specify the patch as 2 files: the file before the patch and the file after it (base and v2 in our language in this post). So there are 3 files involved: 1 for the original and 2 for the patch.
The point is that the 3-way patch is a much better way to patch than a normal patch, because you have a lot more context for merging when you have both full files.
Here’s more or less what a normal patch for our example looks like:
@@ -1,1 +1,1 @@:
- def greet():
+ def say_hello():
greeting = "hello"
And here’s a 3-way patch. (This “3-way patch” is not a real file format, it’s just something I made up.)
BEFORE: (the full file)
def greet():
greeting = "hello"
name = "julia"
return greeting + " " + name
AFTER: (the full file)
def say_hello():
greeting = "hello"
name = "julia"
return greeting + " " + name
The book Building Git by James Coglan is the only place I could find, other than the git source code, that explains how git cherry-pick actually uses 3-way merge under the hood (I thought Pro Git might talk about it, but it didn’t seem to as far as I could tell).
I actually went to buy it and it turned out that I’d already bought it in 2019 so it was a good reference to have here :)
There’s more to merging in git than the 3-way merge – there’s something called a “recursive merge” that I don’t understand, and there are a bunch of details about how to deal with handling file deletions and moves, and there are also multiple merge algorithms.
My best idea for where to learn more about this stuff is Building Git, though I haven’t read the whole thing.
what does git apply do?

I also went looking through git’s source to find out what git apply does, and it seems to (unsurprisingly) be in apply.c. That code parses a patch file, and then hunts through the target file to figure out where to apply it. The core logic seems to be around here:
I think the idea is to start at the line number that the patch suggested and
then hunt forwards and backwards from there to try to find it:
/*
* There's probably some smart way to do this, but I'll leave
* that to the smart and beautiful people. I'm simple and stupid.
*/
backwards = current;
backwards_lno = line;
forwards = current;
forwards_lno = line;
current_lno = line;
for (i = 0; ; i++) {
...
That all seems pretty intuitive and about what I’d naively expect.
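Here’s a toy Python sketch of that search – my own simplification of the idea, not git’s actual code: try the suggested line first, then positions progressively further away in both directions, until the hunk’s context lines match:

```python
def find_hunk(lines, context, suggested):
    # Toy version of the search in apply.c: check the suggested position,
    # then fan out backwards and forwards until the context lines match.
    last = len(lines) - len(context)
    for offset in range(len(lines) + 1):
        for lno in (suggested - offset, suggested + offset):
            if 0 <= lno <= last and lines[lno:lno + len(context)] == context:
                return lno
    return -1  # the hunk doesn't apply anywhere

lines = ["a", "b", "old", "c", "d"]
# the patch said the hunk was at line 1, but "old" has drifted to index 2
print(find_hunk(lines, ["old"], 1))
```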
how git apply --3way works

git apply also has a --3way flag that does a 3-way merge. So we actually could have more or less implemented git cherry-pick with git apply like this:
$ git show 10e96e46 --patch > out.patch
$ git apply out.patch --3way
Applied patch to 'content/post/2023-07-28-why-is-dns-still-hard-to-learn-.markdown' with conflicts.
U content/post/2023-07-28-why-is-dns-still-hard-to-learn-.markdown
--3way doesn’t just use the contents of the patch file though! The patch file starts with:

index d63ade04..65778fc0 100644

d63ade04 and 65778fc0 are the IDs of the old/new versions of that file in git’s object database, so git can retrieve them to do a 3-way patch application. This won’t work if someone emails you a patch and you don’t have the old/new versions of the file though: if you’re missing the blobs you’ll get this error:
$ git apply out.patch --3way
error: repository lacks the necessary blob to perform 3-way merge.
A couple of people pointed out that the 3-way merge is much older than git – it’s from the late 70s or so. Here’s a paper from 2007 that talks about it.
I was pretty surprised to learn that I didn’t actually understand the core way that git applies patches internally – it was really cool to learn about!
I have lots of issues with git’s UI but I think this particular thing is not one of them. The 3-way merge seems like a nice unified way to solve a bunch of different problems, it’s pretty intuitive for people (the idea of “applying a patch” is one that a lot of programmers are used to thinking about, and the fact that it’s implemented as a 3-way merge under the hood is an implementation detail that nobody actually ever needs to think about).
Also a very quick plug: I’m working on writing a zine about git, if you’re interested in getting an email when it comes out you can sign up to my very infrequent announcements mailing list.
I’ve found that if many people have a very strong opinion that’s different from mine, usually it’s because they have different experiences around that thing from me.
So I asked on Mastodon:
today I’m thinking about the tradeoffs of using git rebase a bit. I think the goal of rebase is to have a nice linear commit history, which is something I like.

but what are the costs of using rebase? what problems has it caused for you in practice? I’m really only interested in specific bad experiences you’ve had here – not opinions or general statements like “rewriting history is bad”
I got a huge number of incredible answers to this, and I’m going to do my best to summarize them here. I’ll also mention solutions or workarounds to those problems in cases where I know of a solution. Here’s the list:
My goal with this isn’t to convince anyone that rebase is bad and you shouldn’t use it (I’m certainly going to keep using rebase!). But seeing all these problems made me want to be more cautious about recommending rebase to newcomers without explaining how to use it safely. It also makes me wonder if there’s an easier workflow for cleaning up your commit history that’s harder to accidentally mess up.
First, I know that people use a lot of different Git workflows. I’m going to be talking about the workflow I’m used to when working on a team, which is:
- there’s a main branch. It’s protected from force pushes.
- people write code on branches and merge it into main
- main gets updated every time a pull request is merged
- the only way to get something into main is by making a pull request on Github/Gitlab and merging it

This is not the only “correct” git workflow (it’s a very “we run a web service” workflow; open source projects or desktop software with releases generally use a slightly different workflow). But it’s what I know so that’s what I’ll talk about.
Also before we start: one big thing I noticed is that there were 2 different kinds of rebase that kept coming up, and only one of them requires you to deal with merge conflicts.
1. rebasing to squash commits on your own branch: running git rebase -i HEAD^^^^^^^ to squash many small commits into one. As long as you’re just squashing commits, you’ll never have to resolve a merge conflict while doing this.
2. rebasing onto another branch: running git rebase main. This can cause merge conflicts.

I think it’s useful to make this distinction because sometimes I’m thinking about rebase type 1 (which is a lot less likely to cause problems), but people who are struggling with it are thinking about rebase type 2.
Now let’s move on to all the problems!
If you make many tiny commits, sometimes you end up in a hellish loop where you have to fix the same merge conflict 10 times. You can also end up fixing merge conflicts totally unnecessarily (like dealing with a merge conflict in code that a future commit deletes).
There are a few ways to make this better:

- first do a git rebase -i HEAD^^^^^^^^^^^ to squash all of the tiny commits into 1 big commit, and then do a git rebase main to rebase onto a different branch. Then you only have to fix the conflicts once.
- use git rerere to automate repeatedly resolving the same merge conflicts (“rerere” stands for “reuse recorded resolution” – it’ll record your previous merge conflict resolutions and replay them). I’ve never tried this but I think you need to set git config rerere.enabled true and then it’ll automatically help you.

Also if I find myself resolving merge conflicts more than once in a rebase, I’ll usually run git rebase --abort to stop it, and then squash my commits into one and try again.
Generally when I’m doing a rebase onto a different branch, I’m rebasing 1-2 commits. Maybe sometimes 5! Usually there are no conflicts and it works fine.
Some people described rebasing hundreds of commits by many different people onto a different branch. That sounds really difficult and I don’t envy that task.
I heard from several people that when they were new to rebase, they messed up a rebase and permanently lost a week of work that they then had to redo.
The problem here is that undoing a rebase that went wrong is much more complicated than undoing a merge that went wrong (you can undo a bad merge with something like git reset --hard HEAD^). Many newcomers to rebase don’t realize that undoing a rebase is even possible, and I think it’s pretty easy to understand why.
That said, it is possible to undo a rebase that went wrong. Here’s an example of how to undo a rebase using git reflog.

step 1: Do a bad rebase (for example run git rebase -i HEAD^^^^^ and just delete 3 commits)
step 2: Run git reflog. You should see something like this:
ee244c4 (HEAD -> main) HEAD@{0}: rebase (finish): returning to refs/heads/main
ee244c4 (HEAD -> main) HEAD@{1}: rebase (pick): test
fdb8d73 HEAD@{2}: rebase (start): checkout HEAD^^^^^^^
ca7fe25 HEAD@{3}: commit: 16 bits by default
073bc72 HEAD@{4}: commit: only show tooltips on desktop
step 3: Find the entry immediately before rebase (start). In my case that’s ca7fe25.

step 4: Run git reset --hard ca7fe25
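The “find the entry immediately before rebase (start)” part is mechanical enough to sketch in Python, using the reflog output above as data (a toy helper I wrote for illustration, not a real git command):

```python
# Each reflog entry is (commit ID, message), newest first,
# copied from the git reflog output above.
reflog = [
    ("ee244c4", "rebase (finish): returning to refs/heads/main"),
    ("ee244c4", "rebase (pick): test"),
    ("fdb8d73", "rebase (start): checkout HEAD^^^^^^^"),
    ("ca7fe25", "commit: 16 bits by default"),
    ("073bc72", "commit: only show tooltips on desktop"),
]

def pre_rebase_commit(reflog):
    # The entry right after "rebase (start)" (in newest-first order) is
    # where the branch pointed before the rebase began.
    for i, (_, msg) in enumerate(reflog):
        if msg.startswith("rebase (start)"):
            return reflog[i + 1][0]
    return None  # no rebase found in the reflog

print(pre_rebase_commit(reflog))  # the commit to git reset --hard back to
```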
A couple of other ways to undo a rebase:

- @ always refers to your current branch in git, so you can run git reset --hard @{1} to reset your branch to its previous location
- make a backup branch with git switch -c backup before rebasing, so you can easily get back to the old commit

A few people mentioned the following situation:
- you’re collaborating on a branch with someone else
- one of you rebases the branch and runs git push --force (maybe by accident)
- now when the other person runs git pull, it’s a mess – you get a fatal: Need to specify how to reconcile divergent branches error

This is an even worse situation than the “undoing a rebase is hard” situation, because the missing commits might be split across many different people’s machines, and the only thing worse than having to hunt through the reflog is multiple different people having to hunt through the reflog.
This has never happened to me because the only branch I’ve ever collaborated on is main, and main has always been protected from force pushing (in my experience the only way you can get something into main is through a pull request). So I’ve never even really been in a situation where this could happen. But I can definitely see how this would cause problems.
The main tools I know to avoid this are:

- use --force-with-lease when force pushing, to make sure that nobody else has pushed to the branch since your last fetch

Apparently the “since your last fetch” part is important here – if you run git fetch immediately before running git push --force-with-lease, the --force-with-lease won’t protect you at all.
I was curious about why people would run git push --force on a shared branch. Some reasons people gave were:

- their team’s workflow involves rebasing a shared branch before merging it into main. The idea here is that you’re just really careful about coordinating the rebase so nothing gets lost.
- they read instructions online that suggested git rebase and git push --force as a solution, and followed them without understanding the consequences
- they’re used to running git push --force on a personal branch and ran it on a shared branch by accident

The situation here is: you make a pull request, someone reviews it, and then you rebase and force push before they look again – now the reviewer can’t easily see what changed since their last review.
One way to avoid this is to push new commits addressing the review comments, and then after the PR is approved do a rebase to reorganize everything.
I think some reviewers are more annoyed by this problem than others, it’s kind of a personal preference. Also this might be a Github-specific issue, other code review tools might have better tools for managing this.
If you’re rebasing to squash commits, you can lose important commit metadata like Co-Authored-By. Also if you GPG sign your commits, rebase loses the signatures.
There’s probably other commit metadata that you can lose that I’m not thinking of.
I haven’t run into this one so I’m not sure how to avoid it. I think GPG signing commits isn’t as popular as it used to be.
Someone mentioned that it’s important for them to be able to easily revert merging any branch (in case the branch broke something), and if the branch contains multiple commits and was merged with rebase, then you need to do multiple reverts to undo the commits.
In a merge workflow, I think you can revert merging any branch just by reverting the merge commit.
If you’re trying to have a very clean commit history where the tests pass on every commit (very admirable!), rebasing can result in some intermediate commits that are broken and don’t pass the tests, even if the final commit passes the tests.
Apparently you can avoid this by using git rebase -x to run the test suite at every step of the rebase and make sure that the tests are still passing. I’ve never done that though.
accidentally running git commit --amend instead of git rebase --continue

A couple of people mentioned issues with running git commit --amend instead of git rebase --continue when resolving a merge conflict.
The reason this is confusing is that there are two situations where you might want to edit files during a rebase:

- editing a commit (if you set a commit to edit in git rebase -i), where you need to run git commit --amend when you’re done
- resolving a merge conflict, where you need to run git rebase --continue when you’re done

It’s very easy to get these two cases mixed up because they feel very similar. I think what goes wrong here is that you:

step 1: get a merge conflict and resolve it

step 2: run git add file.txt

step 3: run git commit, because that’s what you’re used to doing after you run git add

step 4: but you should have run git rebase --continue! Now you have a weird extra commit, and maybe it has the wrong commit message and/or author

The whole point of rebase is to clean up your commit history, and combining commits with rebase is pretty easy. But what if you want to split up a commit into 2 smaller commits? It’s not as easy, especially if the commit you want to split is a few commits back! I actually don’t really know how to do it even though I feel very comfortable with rebase. I’d probably just do git reset HEAD^^^ or something and use git add -p to redo all my commits from scratch.
One person shared their workflow for splitting commits with rebase.
If you try to do too many things in a single git rebase -i (reorder commits AND combine commits AND modify a commit), it can get really confusing.
To avoid this, I personally prefer to only do 1 thing per rebase, and if I want to do 2 different things I’ll do 2 rebases.
If your branch is long-lived (like for 1 month), having to rebase repeatedly gets painful. It might be easier to just do 1 merge at the end and only resolve the conflicts once.
The dream is to avoid this problem by not having long-lived branches but it doesn’t always work out that way in practice.
A few more issues that I think are not that common:
- if you accidentally run git reset --hard instead of git rebase --abort in the middle of a rebase, things will behave weirdly until you stop it properly

I’ve seen a lot of people arguing about rebase. I’ve been thinking about why this is and I’ve noticed that people work at a few different levels of “commit discipline”:

- Level 1: it doesn’t matter what the commits on main look like
- Level 2: every pull request gets squashed into a single commit with a clear message before it lands on main
- Level 3: every pull request becomes a series of small, logically separate commits, each with a clear message
Often I think different people inside the same company have different levels of commit discipline, and I’ve seen people argue about this a lot. Personally I’m mostly a Level 2 person. I think Level 3 might be what people mean when they say “clean commit history”.
I think Level 1 and Level 2 are pretty easy to achieve without rebase – for Level 1, you don’t have to do anything, and for Level 2, you can either press “squash and merge” in Github or run git switch main; git merge --squash mybranch on the command line.
But for Level 3, you either need rebase or some other tool (like GitUp) to help you organize your commits to tell a nice story.
I’ve been wondering if when people argue about whether people “should” use rebase or not, they’re really arguing about which minimum level of commit discipline should be required.
I think how this plays out also depends on how big the changes folks are making – if folks are usually making pretty small pull requests anyway, squashing them into 1 commit isn’t a big deal, but if you’re making a 6000-line change you probably want to split it up into multiple commits.
A couple of people mentioned using this workflow that doesn’t use rebase:

- work on a feature branch, and run git merge main to merge main into the branch periodically (and fix conflicts if necessary)
- when the branch is done, squash merge it (git checkout main; git merge --squash mybranch) to squash all of the changes into 1 commit. This gets rid of all the “ugly” merge commits.

I originally thought this would make the log of commits on my branch too ugly, but apparently git log main..mybranch will just show you the changes on your branch, like this:
$ git log main..mybranch
756d4af (HEAD -> mybranch) Merge branch 'main' into mybranch
20106fd Merge branch 'main' into mybranch
d7da423 some commit on my branch
85a5d7d some other commit on my branch
Of course, the goal here isn’t to force people who have made beautiful atomic commits to squash their commits – it’s just to provide an easy option for folks to clean up a messy commit history (“add new feature; wip; wip; fix; fix; fix; fix; fix;“) without having to use rebase.
I’d be curious to hear about other people who use a workflow like this and if it works well.
I went into this really feeling like “rebase is fine, what could go wrong?” But many of these problems actually have happened to me in the past, it’s just that over the years I’ve learned how to avoid or fix all of them.
And I’ve never really seen anyone share best practices for rebase, other than “never force push to a shared branch”. All of these honestly make me a lot more reluctant to recommend using rebase.
To recap, I think these are the personal rebase rules I follow:

- if a rebase is going badly, stop it and try another strategy (git rebase --abort)
- know how to use git reflog to undo a bad rebase
- don’t squash commits and rebase onto another branch in the same rebase (first git rebase -i HEAD^^^^, then git rebase main)
- don’t do too many things in a single git rebase -i. Keep it simple.
- never rebase commits that have already been pushed to main
Thanks to Marco Rogers for encouraging me to think about the problems people have with rebase, and to everyone on Mastodon who helped with this.
So I asked people on Mastodon:
what git jargon do you find confusing? thinking of writing a blog post that explains some of git’s weirder terminology: “detached HEAD state”, “fast-forward”, “index/staging area/staged”, “ahead of ‘origin/main’ by 1 commit”, etc
I got a lot of GREAT answers and I’ll try to summarize some of them here. Here’s a list of the terms:
I’ve done my best to explain what’s going on with these terms, but they cover basically every single major feature of git which is definitely too much for a single blog post so it’s pretty patchy in some places.
HEAD and “heads”

A few people said they were confused by the terms HEAD and refs/heads/main, because it sounds like it’s some complicated technical internal thing.
Here’s a quick summary:

- “heads” are branches. Internally, branches are stored in .git/refs/heads. (technically the official git glossary says that the branch is all the commits on it and the head is just the most recent commit, but they’re 2 different ways to think about the same thing)
- HEAD is the current branch. It’s stored in .git/HEAD.

I think that “a head is a branch, HEAD is the current branch” is a good candidate for the weirdest terminology choice in git, but it’s definitely too late for a clearer naming scheme so let’s move on.
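Since .git/HEAD is just a tiny text file, the “current branch or not” logic can be sketched in a few lines of Python (a toy parser for the two formats the file can contain, assuming the reference is always a branch):

```python
def current_branch(head_contents):
    # Toy parse of a .git/HEAD file: "ref: refs/heads/<branch>" means we're
    # on <branch>; a bare commit ID means detached HEAD.
    head_contents = head_contents.strip()
    prefix = "ref: refs/heads/"
    if head_contents.startswith(prefix):
        return head_contents[len(prefix):]
    return None  # detached HEAD: the file holds a commit ID

print(current_branch("ref: refs/heads/main\n"))
print(current_branch("96fa6899ea34697257e84865fefc56beb42d6390"))
```

The first call returns the branch name ("main"); the second returns None, which is the detached HEAD case we’ll get to next.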
There are some important exceptions to “HEAD is the current branch”, which we’ll talk about next.
You’ve probably seen this message:
$ git checkout v0.1
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
[...]
Here’s the deal with this message:

- usually, your current branch is a branch like main
- the current branch is stored in HEAD
- if you run a command like git merge other_branch, that will also affect your current branch
- but HEAD doesn’t have to be a branch! Instead it can be a commit ID, and git calls that “detached HEAD state”
- in detached HEAD state, git pull doesn’t work at all (since the whole point of it is to update your current branch)
- neither does git push, unless you use it in a special way
- git commit, git merge, git rebase, and git cherry-pick do still work, but they’ll leave you with “orphaned” commits that aren’t connected to any branch, so those commits will be hard to find

If you have a merge conflict, you can run git checkout --ours file.txt to pick the version of file.txt from the “ours” side. But which side is “ours” and which side is “theirs”?
I always find this confusing and I never use git checkout --ours because of that, but I looked it up to see which is which.
For merges, here’s how it works: the current branch is “ours” and the branch you’re merging in is “theirs”, like this. Seems reasonable.
$ git checkout merge-into-ours # current branch is "ours"
$ git merge from-theirs # branch we're merging in is "theirs"
For rebases it’s the opposite – the current branch is “theirs” and the target branch we’re rebasing onto is “ours”, like this:
$ git checkout theirs # current branch is "theirs"
$ git rebase ours # branch we're rebasing onto is "ours"
I think the reason for this is that under the hood git rebase main is repeatedly merging commits from the current branch into a copy of the main branch (you can see what I mean by that in this weird shell script that implements git rebase using git merge). But I still find it confusing.
This nice tiny site explains the “ours” and “theirs” terms.
A couple of people also mentioned that VSCode calls “ours”/“theirs” “current change”/“incoming change”, and that it’s confusing in the exact same way.
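One way to keep the rule straight is to treat it as a lookup table. This tiny Python helper is just a mnemonic for the two cases above – it’s not anything git itself provides:

```python
def sides(operation, current_branch, other_branch):
    # "ours"/"theirs" mnemonic: in a merge the current branch is "ours";
    # in a rebase the branch you're rebasing onto is "ours".
    if operation == "merge":
        return {"ours": current_branch, "theirs": other_branch}
    if operation == "rebase":
        return {"ours": other_branch, "theirs": current_branch}
    raise ValueError(f"unknown operation: {operation}")

print(sides("merge", "mybranch", "main"))   # mybranch is "ours"
print(sides("rebase", "mybranch", "main"))  # main is "ours"
```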
This message seems straightforward – it’s saying that your main branch is up to date with the origin!
But it’s actually a little misleading. You might think that this means that your main branch is up to date. It doesn’t. What it actually means is – if you last ran git fetch or git pull 5 days ago, then your main branch is up to date with all the changes as of 5 days ago.
So if you don’t realize that, it can give you a false sense of security.
I think git could theoretically give you a more useful message like “is up to date with the origin’s main as of your last fetch 5 days ago”, because the time that the most recent fetch happened is stored in the reflog, but it doesn’t.
HEAD^, HEAD~, HEAD^^, HEAD~~, HEAD^2, HEAD~2

I’ve known for a long time that HEAD^ refers to the previous commit, but I’ve been confused for a long time about the difference between HEAD~ and HEAD^.
I looked it up, and here’s how these relate to each other:

- HEAD^ and HEAD~ are the same thing (1 commit ago)
- HEAD^^^ and HEAD~~~ and HEAD~3 are the same thing (3 commits ago)
- HEAD^3 refers to the third parent of a commit, and is different from HEAD~3

This seems weird – why are HEAD~ and HEAD^ the same thing? And what’s the “third parent”? Is that the same thing as the parent’s parent’s parent? (spoiler: it isn’t) Let’s talk about it!
Most commits have only one parent. But merge commits have multiple parents – they’re merging together 2 or more commits. In git, HEAD^ means “the parent of the HEAD commit”. But what if HEAD is a merge commit? What does HEAD^ refer to?

The answer is that HEAD^ refers to the first parent of the merge, HEAD^2 is the second parent, HEAD^3 is the third parent, etc.
But I guess they also wanted a way to refer to “3 commits ago”, so HEAD^3 is the third parent of the current commit (which may have many parents if it’s a merge commit), and HEAD~3 is the parent’s parent’s parent.
I think in the context of the merge commit ours/theirs discussion earlier, HEAD^ is “ours” and HEAD^2 is “theirs”.
.. and ...

Here are two commands:

git log main..test
git log main...test

What’s the difference between .. and ...? I never use these so I had to look it up in man git-range-diff. It seems like the answer is that in this case:
A - B main
\
C - D test
- main..test is commits C and D
- test..main is commit B
- main...test is commits B, C, and D

But it gets worse: apparently git diff also supports .. and ..., but they do something completely different than they do with git log? I think the summary is:
- git log test..main shows changes on main that aren’t on test, whereas git log test...main shows changes on both sides
- git diff test..main shows test changes and main changes (it diffs B and D), whereas git diff test...main diffs A and D (it only shows you the diff on one side)

This blog post talks about it a bit more.
Here’s a very common message you’ll see in git status:
$ git status
On branch main
Your branch is behind 'origin/main' by 2 commits, and can be fast-forwarded.
(use "git pull" to update your local branch)
What does “fast-forwarded” mean? Basically it’s trying to say that the two branches look something like this: (newest commits are on the right)
main: A - B - C
origin/main: A - B - C - D - E
or visualized another way:
A - B - C - D - E (origin/main)
|
main
Here origin/main just has 2 extra commits that main doesn’t have, so it’s easy to bring main up to date – we just need to add those 2 commits. Literally nothing can possibly go wrong – there’s no possibility of merge conflicts. A fast-forward merge is a very good thing! It’s the easiest way to combine 2 branches.
After running git pull, you’ll end up in this state:
main: A - B - C - D - E
origin/main: A - B - C - D - E
Here’s an example of a state which can’t be fast-forwarded.
A - B - C - X (main)
|
- - D - E (origin/main)
Here main has a commit that origin/main doesn’t have (X). So you can’t do a fast-forward. In that case, git status would say:
$ git status
Your branch and 'origin/main' have diverged,
and have 1 and 2 different commits each, respectively.
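With linear histories like the ones above, the fast-forward check is easy to model in Python (a toy version: real git actually checks whether your branch’s commit is an ancestor of the remote one, which works for non-linear history too):

```python
def can_fast_forward(local, remote):
    # Toy check for linear histories (lists of commits, oldest first):
    # a fast-forward is possible when local's history is a prefix of remote's.
    return remote[:len(local)] == local

print(can_fast_forward(["A", "B", "C"], ["A", "B", "C", "D", "E"]))      # can fast-forward
print(can_fast_forward(["A", "B", "C", "X"], ["A", "B", "C", "D", "E"])) # diverged
```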
I’ve always found the term “reference” kind of confusing. There are at least 3 things that get called “references” in git:

- branches and tags, like main and v0.2
- HEAD, which is the current branch
- things like HEAD^^^, which git will resolve to a commit ID. Technically these are probably not “references” – I guess git calls them “revision parameters” – but I’ve never used that term.

“symbolic reference” is a very weird term to me because personally I think the only symbolic reference I’ve ever used is HEAD (the current branch), and HEAD has a very central place in git (most of git’s core commands’ behaviour depends on the value of HEAD), so I’m not sure what the point of having it as a generic concept is.
When you configure a git remote in .git/config, there’s this +refs/heads/main:refs/remotes/origin/main thing:

[remote "origin"]
	url = git@github.com:jvns/pandas-cookbook
	fetch = +refs/heads/main:refs/remotes/origin/main

I don’t really know what this means, I’ve always just used whatever the default is when you do a git clone or git remote add, and I’ve never felt any motivation to learn about it or change it from the default.
The man page for git checkout says:

git checkout [-f|--ours|--theirs|-m|--conflict=<style>] [<tree-ish>] [--] <pathspec>...

What’s tree-ish??? What git is trying to say here is that when you run git checkout THING ., THING can be either:

- a commit ID (like 182cd3f)
- a reference to a commit ID (like main or HEAD^^ or v0.3.2)
- a directory inside a commit (like main:./docs)

Personally I’ve never used the “directory inside a commit” thing, and from my perspective “tree-ish” might as well just mean “commit or reference to a commit”.
All of these refer to the exact same thing (the file .git/index, which is where your changes are staged when you run git add):

- git diff --cached
- git rm --cached
- git diff --staged
- the file .git/index
Even though they all ultimately refer to the same file, there’s some variation in how those terms are used in practice:

- the --index and --cached flags do not generally mean the same thing. I have personally never used the --index flag, so I’m not going to get into it, but this blog post by Junio Hamano (git’s lead maintainer) explains all the gnarly details

A bunch of people mentioned that “reset”, “revert” and “restore” are very similar words and it’s hard to differentiate them.
I think it’s made worse because:

- git reset --hard and git restore . on their own do basically the same thing (though git reset --hard COMMIT and git restore --source COMMIT . are completely different from each other)
- the man pages describe them like this:
  - git reset: “Reset current HEAD to the specified state”
  - git revert: “Revert some existing commits”
  - git restore: “Restore working tree files”

Those short descriptions do give you a better sense for which noun is being affected (“current HEAD”, “some commits”, “working tree files”) but they assume you know what “reset”, “revert” and “restore” mean in this context.
Here are some short descriptions of what they each do:
- git revert COMMIT: Create a new commit that’s the “opposite” of COMMIT on your current branch (if COMMIT added 3 lines, the new commit will delete those 3 lines)
- git reset --hard COMMIT: Force your current branch back to the state it was at COMMIT, erasing any new changes since COMMIT. Very dangerous operation.
- git restore --source=COMMIT PATH: Take all the files in PATH back to how they were at COMMIT, without changing any other files or commit history.

Git uses the word “track” in 3 different related ways:
- Untracked files: in the output of git status. This means those files aren’t managed by Git and won’t be included in commits.
- Remote-tracking branches, like origin/main. This is a local reference, and it’s the commit ID that main pointed to on the remote origin the last time you ran git pull or git fetch.
- Branches that “track” a remote, like when main is configured to track the main branch on the remote origin.

The “untracked files” and “remote tracking branch” thing is not too bad – they both use “track”, but the context is very different. No big deal. But I think the other two uses of “track” are actually quite confusing:
- main is a branch that tracks a remote
- origin/main is a remote-tracking branch

But a “branch that tracks a remote” and a “remote-tracking branch” are different things in Git and the distinction is pretty important! Here’s a quick summary of the differences:
- main is a branch. You can make commits to it, merge into it, etc. It’s often configured to “track” the remote main in .git/config, which means that you can use git pull and git push to push/pull changes.
- origin/main is not a branch. It’s a “remote-tracking branch”, which is not a kind of branch (I’m sorry). You can’t make commits to it. The only way you can update it is by running git pull or git fetch to get the latest state of main from the remote.

I’d never really thought about this ambiguity before but I think it’s pretty easy to see why folks are confused by it.
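Here’s a sketch of that difference in a throwaway clone (all local paths, no network involved):

```shell
# make a "remote" repo and clone it (git >= 2.28 for `init -b`)
cd "$(mktemp -d)" && git init -q -b main remote-repo
git -C remote-repo -c user.name=me -c user.email=me@example.com \
    commit -q --allow-empty -m "initial"
git clone -q remote-repo local-repo && cd local-repo

git branch -a                   # main, plus remotes/origin/main
# the "tracks a remote" part is just configuration:
git config branch.main.remote   # prints: origin
git config branch.main.merge    # prints: refs/heads/main
```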
Checkout does two totally unrelated things:
- git checkout BRANCH switches branches
- git checkout file.txt discards your unstaged changes to file.txt
This is well known to be confusing and git has actually split those two
functions into git switch
and git restore
(though you can still use
checkout if, like me, you have 15 years of muscle memory around git checkout
that you don’t feel like unlearning)
Also personally after 15 years I still can’t remember the order of the
arguments to git checkout main file.txt
for restoring the version of
file.txt
from the main
branch.
I think sometimes you need to pass --
to checkout
as an argument somewhere
to help it figure out which argument is a branch and which ones are paths but I
never do that and I’m not sure when it’s needed.
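For what it’s worth, the order is revision first, then paths, with the -- between them. A sketch in a throwaway repo:

```shell
cd "$(mktemp -d)" && git init -q -b main
g() { git -c user.name=me -c user.email=me@example.com "$@"; }
echo hello > file.txt
g add file.txt && g commit -q -m "add file.txt"

echo scribble > file.txt      # mess up the file

# revision first, then --, then the paths to restore:
git checkout main -- file.txt
cat file.txt                  # prints: hello
```

The -- is what tells git that everything after it is a path, which matters if a file and a branch happen to share a name.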
Lots of people mentioned reading reflog as re-flog and not ref-log. I won’t get deep into the reflog here because this post is REALLY long but:
A bunch of people mentioned being confused about the difference between merge and rebase and not understanding what the “base” in rebase was supposed to be.
I’ll try to summarize them very briefly here, but I don’t think these 1-line explanations are that useful because people structure their workflows around merge / rebase in pretty different ways and to really understand merge/rebase you need to understand the workflows. Also pictures really help. That could really be its whole own blog post though so I’m not going to get into it.
rebase --onto
git rebase has a flag called --onto. This has always seemed confusing to me because the whole point of git rebase main is to rebase the current branch onto main. So what’s the extra --onto argument about?
I looked it up, and --onto
definitely solves a problem that I’ve rarely/never
actually had, but I guess I’ll write down my understanding of it anyway.
A - B - C (main)
     \
      D - E - F - G (mybranch)
          |
          otherbranch
Imagine that for some reason I just want to move commits F
and G
to be
rebased on top of main
. I think there’s probably some git workflow where this
comes up a lot.
Apparently you can run git rebase --onto main otherbranch mybranch
to do
that. It seems impossible to me to remember the syntax for this (there are 3
different branch names involved, which for me is too many), but I heard about it from a
bunch of people so I guess it must be useful.
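Here’s that exact picture reproduced in a throwaway repo, so the syntax is at least written down somewhere (branch names as in the diagram above):

```shell
cd "$(mktemp -d)" && git init -q -b main
c() { echo "$1" > "$1.txt" && git add "$1.txt" &&
      git -c user.name=me -c user.email=me@example.com commit -q -m "$1"; }
c A; c B; c C                        # main: A - B - C
git checkout -q -b mybranch main~2   # branch off at A
c D; c E
git branch otherbranch               # otherbranch points at E
c F; c G                             # mybranch: A - D - E - F - G

# replay only otherbranch..mybranch (F and G) onto main:
git -c user.name=me -c user.email=me@example.com \
    rebase -q --onto main otherbranch mybranch
git log --oneline                    # G, F, C, B, A
```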
Someone mentioned that they found it confusing that commit is used both as a verb and a noun in git.

My guess is that most folks get used to this relatively quickly, but this use of “commit” is different from how it’s used in SQL databases, where I think “commit” is just a verb (you “COMMIT” to end a transaction) and not a noun.
Also, you can think of a Git commit in 3 different ways:
- as a diff
- as a snapshot of all your files
- as part of a history
None of those are wrong: different commands use commits in all of these ways.
For example git show
treats a commit as a diff, git log
treats it as a
history, and git restore
treats it as a snapshot.
But git’s terminology doesn’t do much to help you understand in which sense a commit is being used by a given command.
Here are a bunch more confusing terms. I don’t know what a lot of these mean.
things I don’t really understand myself:
- git log -S and git log -G (for searching the diffs of previous commits?)

things that people mentioned finding confusing but that I left out of this post because it was already 3000 words:
- how push and pull aren’t opposites
- the difference between fetch and pull (pull = fetch + merge)
- origin main (like in git push origin main) vs origin/main

github terms people mentioned being confused by:
- “squash and merge” (I didn’t know about git merge --squash until yesterday, I thought “squash and merge” was a special github feature)

I was surprised that basically every other core feature of git was mentioned by at least one person as being confusing in some way. I’d be interested in hearing more examples of confusing git terms that I missed too.
There’s another great post about this from 2012 called the most confusing git terminology. It talks more about how git’s terminology relates to CVS and Subversion’s terminology.
If I had to pick the 3 most confusing git terms, I think right now I’d pick:
- how a head is a branch, and HEAD is the current branch

I learned a lot from writing this – I learned a few new facts about git, but more importantly I feel like I have a slightly better sense now for what someone might mean when they say that everything in git is confusing.
I really hadn’t thought about a lot of these issues before – like I’d never realized how “tracking” is used in such a weird way when discussing branches.
Also as usual I might have made some mistakes, especially since I ended up in a bunch of corners of git that I hadn’t visited before.
Also a very quick plug: I’m working on writing a zine about git, if you’re interested in getting an email when it comes out you can sign up to my very infrequent announcements mailing list.
None of these things feel super surprising in retrospect, but I hadn’t thought about them clearly before.
The facts are:
Let’s talk about them!
When you run git add file.txt
, and then git status
, you’ll see something like this:
$ git add content/post/2023-10-20-some-miscellaneous-git-facts.markdown
$ git status
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
new file: content/post/2023-10-20-some-miscellaneous-git-facts.markdown
People usually call this “staging a file” or “adding a file to the staging area”.
When you stage a file with git add
, behind the scenes git adds the file to its object
database (in .git/objects
) and updates a file called .git/index
to refer to
the newly added file.
This “staging area” actually gets referred to by 3 different names in Git. All of these refer to the exact same thing (the file .git/index):
- git diff --cached
- git diff --staged
- .git/index
I felt like I should have realized this earlier, but I didn’t, so there it is.
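You can check this equivalence yourself in a scratch repo:

```shell
cd "$(mktemp -d)" && git init -q -b main
git -c user.name=me -c user.email=me@example.com \
    commit -q --allow-empty -m "initial"
echo hello > file.txt
git add file.txt

git diff --cached --stat   # file.txt | 1 +
git diff --staged --stat   # exactly the same output
ls .git/index              # the file where all of this lives
```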
When I run git stash
to stash my changes, I’ve always been a bit confused
about where those changes actually went. It turns out that when you run git
stash
, git makes some commits with your changes and labels them with a reference
called stash
(in .git/refs/stash
).
Let’s stash this blog post and look at the log of the stash
reference:
$ git log stash --oneline
6cb983fe (refs/stash) WIP on main: c6ee55ed wip
2ff2c273 index on main: c6ee55ed wip
... some more stuff
Now we can look at the commit 2ff2c273
to see what it contains:
$ git show 2ff2c273 --stat
commit 2ff2c273357c94a0087104f776a8dd28ee467769
Author: Julia Evans <julia@jvns.ca>
Date: Fri Oct 20 14:49:20 2023 -0400
index on main: c6ee55ed wip
content/post/2023-10-20-some-miscellaneous-git-facts.markdown | 40 ++++++++++++++++++++++++++++++++++++++++
Unsurprisingly, it contains this blog post. Makes sense!
git stash
actually creates 2 separate commits: one for the index, and one for
your changes that you haven’t staged yet. I found this kind of heartening
because I’ve been working on a tool to snapshot and restore the state of a git
repository (that I may or may not ever release) and I came up with a very
similar design, so that made me feel better about my choices.
Apparently older commits in the stash are stored in the reflog.
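You can see those older stash commits with the reflog of the stash reference – as far as I can tell, this is what git stash list is reading:

```shell
cd "$(mktemp -d)" && git init -q -b main
g() { git -c user.name=me -c user.email=me@example.com "$@"; }
echo one > f.txt && g add f.txt && g commit -q -m "initial"

echo two > f.txt   && g stash -q
echo three > f.txt && g stash -q

g stash list       # stash@{0} and stash@{1}
g reflog stash     # the same two entries, as a reflog
```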
Git’s documentation often refers to “references” in a generic way that I find
a little confusing sometimes. Personally 99% of the time when I deal with
a “reference” in Git it’s a branch or HEAD
and the other 1% of the time it’s a tag. I
actually didn’t know ANY examples of references that weren’t branches or tags or HEAD
.
But now I know one example – the stash is a reference, and it’s not a branch or tag! So that’s cool.
Here are all the references in my blog’s git repository (other than HEAD
):
$ find .git/refs -type f
.git/refs/heads/main
.git/refs/remotes/origin/HEAD
.git/refs/remotes/origin/main
.git/refs/stash
Some other references people mentioned in reponses to this post:
refs/notes/*
, from git notes
refs/pull/123/head
, and `refs/pull/123/head
for GitHub pull requests (which you can get with git fetch origin refs/pull/123/merge
)refs/bisect/*
, from git bisect
Here’s a toy git repo where I created two branches x
and y
, each with 1
file (x.txt
and y.txt
) and merged them. Let’s look at the merge commit.
$ git log --oneline
96a8afb (HEAD -> y) Merge branch 'x' into y
0931e45 y
1d8bd2d (x) x
If I run git show 96a8afb
, the commit looks “empty”: there’s no diff!
$ git show 96a8afb
commit 96a8afbf776c2cebccf8ec0dba7c6c765ea5d987 (HEAD -> y)
Merge: 0931e45 1d8bd2d
Author: Julia Evans <julia@jvns.ca>
Date: Fri Oct 20 14:07:00 2023 -0400
Merge branch 'x' into y
But if I diff the merge commit against each of its two parent commits separately, you can see that of course there is a diff:
$ git diff 0931e45 96a8afb --stat
x.txt | 1 +
1 file changed, 1 insertion(+)
$ git diff 1d8bd2d 96a8afb --stat
y.txt | 1 +
1 file changed, 1 insertion(+)
It seems kind of obvious in retrospect that merge commits aren’t actually “empty” (they’re snapshots of the current state of the repo, just like any other commit), but I’d never thought about why they appear to be empty.
Apparently the reason that these merge diffs are empty is that merge diffs only show conflicts – if I instead create a repo
with a merge conflict (one branch added x
and another branch added y
to the
same file), and show the merge commit where I resolved the conflict, it looks
like this:
$ git show HEAD
commit 3bfe8311afa4da867426c0bf6343420217486594
Merge: 782b3d5 ac7046d
Author: Julia Evans <julia@jvns.ca>
Date: Fri Oct 20 15:29:06 2023 -0400
Merge branch 'x' into y
diff --cc file.txt
index 975fbec,587be6b..b680253
--- a/file.txt
+++ b/file.txt
@@@ -1,1 -1,1 +1,1 @@@
- y
 -x
++z
It looks like this is trying to tell me that one branch added x
, another
branch added y
, and the merge commit resolved it by putting z
instead. But
in the earlier example, there was no conflict, so Git didn’t display a diff at all.
(thanks to Jordi for telling me how merge diffs work)
I’ll keep this post short, maybe I’ll write another blog post with more git facts as I learn them.
Here’s the video, as well as the slides and a transcript of (roughly) what I said in the talk.
I often give talks about things that I'm excited about, or that I think are really fun.
But today, I want to talk about something that I'm a little bit mad about, which is that sometimes things that seem like they should be basic take me 10 years or 20 years to learn, way longer than it seems like they should.
And sometimes this would feel kind of personal! This shouldn't be so hard for me! I should understand this already. It's been seven years!
And this "it's just me" attitude is often encouraged -- when I write about finding things hard to learn on the Internet, Internet strangers will sometimes tell me: "yeah, this is easy! You should get it already! Maybe you're just not very smart!"
But luckily I have a pretty big ego so I don't take the internet strangers too seriously. And I have a lot of patience so I'm willing to keep coming back to a topic I'm confused about. There were maybe four different things that were going wrong with DNS in my life and eventually I figured them all out.
So, hooray! I understood DNS! I win! But then I see some of my friends struggling with the exact same things.
They're wondering, hey, my DNS isn't working. Why not?
And it doesn't end. We're still having the same problems over and over and over again. And it's frustrating! It feels redundant! It makes me mad. Especially when friends take it personally, and they feel like "hey I should really understand this already".
Because everyone is going through this. From the sounds of recognition I hear, I think a lot of you have been through some of these same problems with DNS.
I started a little publishing company called Wizard Zines where --
(applause)
Wow. Where I write about some of these topics and try to demystify them.
We're going to talk about bash, HTTP, SQL, and DNS.
For each of them, we're going to talk a little bit about:
a. what's so hard about it?
b. what are some things we can do to make it a little bit easier for each other?
First, let's run this script, bad.sh
:
mv ./*.txt /tmmpp
echo "success!"
This moves a file and prints "success!". And with most of the programming languages that I use, if there's a problem, the program will stop.
[laughter from audience]
But I think a lot of you know from maybe sad experience that bash does not stop, right? It keeps going. And going... and sometimes very bad things happen to your computer in the process.
When I run this program, here's the output:
mv: cannot stat './*.txt': No such file or directory
success!
It didn't stop after the failed mv
.
Eventually I learned that you can write set -e at the top of your program, and that will make bash stop if there's a problem.
When we run this new program with set -e
at the top, here's the output:
mv: cannot stat './*.txt': No such file or directory
Here we've put our code in a function. And if the function fails, we want to echo "failed".
So use set -e
at the beginning, and you might think everything should be okay.
But if we run it... this is the output we get
mv: cannot stat './*.txt': No such file or directory
success
We get the "success" message again! It didn't stop, it just kept going. This is because calling the function in an "or" (|| echo "failed") disables set -e inside the function.

Which is certainly not what I wanted, and not what I would expect. But this is not a bug in bash, it's the documented behavior.
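The script on that slide isn't reproduced in this transcript, but it was something like this (my reconstruction, not the exact slide):

```shell
#!/bin/bash
set -e

f() {
    mv ./*.txt /tmmpp   # fails...
    echo "success"      # ...but still runs, because set -e is off here
}

# invoking f on the left of || disables set -e inside f
f || echo "failed"
```

Running it prints the mv error and then "success" – the || never fires, because f's exit status ends up being the successful echo's.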
And I think one reason this is tricky is a lot of us don't use bash very often. Maybe you write a bash script every six months and don't look at it again.
When you use a system very infrequently and it's full of a lot of weird trivia and gotchas, it's hard to use the system correctly.
But I would say this is factually untrue. How many of you are using bash?
A lot of us ARE using it! And it doesn't always work perfectly, but often it gets the job done.
The way I think this is -- you have some people on the left in this diagram who are confused about bash, who think it seems awful and incomprehensible.
And some people on the right who know how to make the bash work for them, mostly.
So how do we move people from the left to the right, from being overwhelmed by a pile of impossible gotchas to being able to mostly use the system correctly?
And for bash, we have this incredible tool called shellcheck.
[ Applause ]
Yes! Shellcheck is amazing! And shellcheck knows a lot of things that can go wrong and can tell you "oh no, you don't want to do that. You're going to have a bad time."
I'm very grateful for shellcheck, it makes it much easier for me to write tiny bash scripts from time to time.
Now let's do a shellcheck demo!
$ shellcheck -o all bad-again.sh

In bad-again.sh line 7:
f || echo "failed!"
^-- SC2310 (info): This function is invoked in an || condition so set -e will be disabled. Invoke separately if failures should cause the script to exit.
Shellcheck gives us this
lovely error message. The message isn't completely obvious on its own (and this
check is only run if you invoke shellcheck with -o all
). But
shellcheck tells you "hey, there's this problem, maybe you should be worried
about that".
And I think it's wonderful that all these tips live in this linter.
I'm not trying to tell you to write linters, though I think that some of you probably will write linters because this is that kind of crowd.
I've personally never written a linter, and I'm definitely not going to create something as cool as shellcheck!
But instead, the way I write linters is I tell people about shellcheck from time to time and then I feel a little like I invented shellcheck for those people. Because some people didn't know about the tool until I told them about it!
I didn't find out about shellcheck for a long time and I was kind of mad about it when I found out. I felt like -- excuse me? I could have been using shellcheck this whole time? I didn't need to remember all of this stuff in my brain?
So I think an incredible thing we can do is to reflect on the tools that we're using to reduce our cognitive load and all the things that we can't fit into our minds, and make sure our friends or coworkers know about them.
I also like to warn people about gotchas and some of the terrible things computers have done to me.
I think this is an incredibly valuable community service. The example I shared
about how set -e
got disabled is something I learned from my
friend Jesse a few weeks ago.
They told me how this thing happened to them, and now I know and I don't have to go through it personally.
One way I see people kind of trying to share terrible things that their computers have done to them is by sharing "best practices".
But I really love to hear the stories behind the best practices!
If someone has a strong opinion like "nobody should ever use bash", I want to hear about the story! What did bash do to you? I need to know.
The reason I prefer stories to best practices is if I know the story about how the bash hurt you, I can take that information and decide for myself how I want to proceed.
Maybe I feel like -- the computer did that to you? That's okay, I can deal with that problem, I don't mind.
Or I might instead feel like "oh no, I'm going to do the best practice you recommended, because I do not want that thing to happen to me".
These bash stories are a great example of that: my reaction to them is "okay, I'm going to keep using bash, I'll just use shellcheck and keep my bash scripts pretty simple". But other people see them and decide "wow, I never want to use bash for anything, that's awful, I hate it".
Different people have different reactions to the same stories and that's okay.
I was talking to Marco Rogers at some point, many years ago, and he mentioned some new developers he was working with were struggling with HTTP.
And at first, I was a little confused about this -- I didn't understand what was hard about HTTP.
The way I was thinking about it at the time was that if you have an HTTP response, it has a few parts: a response code, some headers, and a body.
I felt like -- that's a pretty simple structure, what's the problem? But of course there was a problem, I just couldn't see what it was at first.
So, I talked to a friend who was newer to HTTP. And they asked "why does it matter what headers you set?"
And I said: "well, the browser..."
The browser!
Firefox is 20 million lines of code! It's been evolving since the '90s. There have been, as I understand it, 1 million changes to the browser security model as people have discovered new and exciting exploits and the web has become a scarier and scarier place.
The browser is really a lot to understand.
One trick for understanding why a topic is hard is -- if the implementation of the thing involves 20 million lines of code, maybe that's why people are confused!
Though that 20 million lines of code also involves CSS and JS and many other things that aren't HTTP, but still.
Once I thought of it in terms of how complex a modern web browser is, it made so much more sense! Of course newcomers are confused about HTTP if you have to understand what the browser is doing!
Then my problem changed from "why is this hard?" to "how do I explain this at all?"
So how do we make it easier? How do we wrap our minds around this 20 million lines of code?
One way I think about this for HTTP is: here are some of the HTTP request headers. That's kind of a big list -- there are 43 headers there.
There are more unofficial headers too.
My brain does not contain all of those headers, I have no idea what most of them are.
When I think about trying to explain big topics, I think about -- what is actually in my brain, which only contains a normal human number of things?
This is a comic I drew about HTTP request headers. You don't have to read the whole thing. This has 15 request headers.
I wrote that these are "the most important headers", but what I mean by "most important" here is that these are the ones that I know about and use. It's a subjective list.
I wrote about 12 words about each one, which I think is approximately the amount of information about each header that lives in my mind.
For example I know that you can set Accept-Encoding
to gzip
and then you might get back a compressed response. That's all I know,
and that's usually all I need to know!
This very small set of information is working pretty well for me.
The general way I think about this trick is "turn a big list into a small list".
Turn the set of EVERY SINGLE THING into just the things I've personally used. I find it helps a lot.
Another example of this "turn a big list into a small list" trick is command line arguments.
I use a lot of command line tools, the number of arguments they have can be overwhelming, and I've written about them a fair amount over the years.
Here are all the flags for grep, from its man page. That's too much! I've been using grep for 20 years but I don't know what all that stuff is.
But when I look at the grep man page, this is what I see.
I think it's very helpful to newcomers when a more experienced person says "look, I've been using this system for a while, I know about 7 things about it, and here's what they are".
We're just pruning those lists down to a more human scale. And it can even help other more experienced people -- often someone else will know a slightly different set of 7 things from me.
But what about the stuff that doesn't fit in my brain?
Because I have a few things about HTTP stored in my brain. But sometimes I need other information which is hard to remember, like maybe the exact details of how CORS works.
I often have trouble finding the right references.
For example I've been trying to learn CSS off and on for 20 years. I've made a lot of progress -- it's going well!
But only in the last 2 years or so I learned about this wonderful website called CSS Tricks.
And I felt kind of mad when I learned about CSS Tricks! Why didn't I know about this before? It would have helped me!
But anyway, I'm happy to know about CSS Tricks now. (though sadly they seem to have stopped publishing in April after the acquisition, I'm still happy the older posts are there)
For HTTP, I think a lot of us use the Mozilla Developer Network.
Another HTTP reference I love is the official RFC, RFC 9110 (also 9111, 9112, 9113, 9114)
It's a new authoritative reference for HTTP and it was written just last
year, in 2022! They decided to organize all the information really nicely. So if you
want to know exactly what the Connection
header does, you can look
it up.
This is not really my top reference. I'm usually on MDN. But I really appreciate that it's available.
So I love to share my favorite references.
I do sometimes find it tempting to kind of lie about references. Not on purpose. But I'll see something on the internet, and I'll think it's kind of cool, and tell a friend about it. But then my friend might ask me -- "when have you used this?" And I'll have to admit "oh, never, I just thought it seemed cool".
I think it's important to be honest about what the references that I'm actually using in real life are. Even if maybe the real references I use are a little "embarrassing", like maybe w3schools or something.
I started thinking about SQL because someone mentioned they're trying to learn SQL. I get most of my zine ideas that way, one person will make an offhand comment and I'll decide "ok, I'm going to spend 4 months writing about that". It's a weird process.
So I was wondering -- what's hard about SQL? What gets in the way of trying to learn that?
I want to say that when I'm confused about what's hard about something, that's a fact about me. It's not usually that the thing is easy, it's that I need to work on understanding what's hard about it. It's easy to forget when you've been using something for a while.
So, I was used to reading SQL queries. For example this made up query that tries to find people who own exactly two cats. It felt straightforward to me, SELECT, FROM, WHERE, GROUP BY.
But then I was talking to a friend about these queries who was new to SQL. And my friend asked -- what is this doing?
I thought, hmm, fair point.
And I think the point my friend was making was that the order that this SQL query is written in, is not the order that it actually happens in. It happens in a different order, and it's not immediately obvious what that is.
I like to think about: what does the computer do first? What actually happens first chronologically?
Computers actually do live in the same timeline as us. Things happen. Things happen in an order. So what happens first? What happens first is the FROM clause: the database starts by grabbing the table (in this case, cats).

So, that's how I think about SQL. The way a query runs is first FROM, then WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT.
At least conceptually. Real life databases have optimizations and it's more complicated than that. But this is the mental model that I use most of the time and it works for me. Everything is in the same order as you write it, except SELECT is fifth.
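Here's a sketch of that kind of query (the schema and names are made up, since the slide isn't reproduced here), annotated with the order the clauses conceptually run in:

```shell
sqlite3 <<'EOF'
CREATE TABLE cats (name TEXT, owner TEXT);
INSERT INTO cats VALUES
  ('smudge', 'ada'), ('mabel', 'ada'),
  ('pickle', 'bo'),
  ('tuna', 'cy'), ('miso', 'cy'), ('bean', 'cy');

SELECT owner              -- 5: pick the columns
FROM cats                 -- 1: start with the table
WHERE name != 'garfield'  -- 2: filter rows
GROUP BY owner            -- 3: group them
HAVING COUNT(*) = 2       -- 4: filter the groups
ORDER BY owner;           -- 6: sort the output
EOF
```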
Another place I've used this "what happens chronologically?" trick is CORS, in HTTP.
This comic is way too small to read on the slide. But the idea is if you're making a cross-origin request in your browser, you can write down every communication that's happening between your browser and the server, in chronological order.
And I think writing down everything in chronological order makes it a lot easier to understand and more concrete.
"What happens in chronological order?" is a very straightforward structure, which is what I like about it. "What happens first?" feels like it should be easy to answer. But it's not!
I've found that it's actually very hard to know what our computers are doing, and it's a really fun question to explore.
As an example of how this is hard: I wrote a blog post recently called "Behind Hello World on Linux". It's about what happens when you run "hello world" on a Linux computer. I wrote a bunch about it, and I was really happy with it.
But after I wrote the post, I thought -- haven't I written about this before? Maybe 10 years ago?
And sure enough, I'd tried to write a similar post 10 years before.
I think this is really cool. Because the 2013 version of this post was about 6 times shorter. This isn't because Linux is more complicated than it was 10 years ago -- I think everything in the 2023 post was probably also true in 2013. The 2013 post just has a lot less information in it.
The reason the 2023 post is longer is that I didn't know what was happening chronologically on my computer in 2013 very well, and in 2023 I know a lot more. Maybe in 2033 I'll know even more!
I think a lot of us -- like me in 2013 and honestly me now, often don't know the facts of what's happening on our computers. It's very hard, which is what makes it such a fun question to try and discuss.
I think it's cool that all of us have different knowledge about what is happening chronologically on our computers and we can all chip in to this conversation.
For example when I posted this blog post about Hello World on Linux, some people mentioned that they had a lot of thoughts about what happens exactly in your terminal, or more details about the filesystem, or about what's happening internally in the Python interpreter, or any number of things. You can go really deep.
I think it's just a really fun collaborative question.
I've seen "what happens chronologically?" work really well as an activity with coworkers, where you ask: "when a request comes into this API endpoint we run, how does that work? What happens?"
What I've seen is that someone will understand some part of the system, like "X happens, then Y happens, then it goes over to the database and I have no idea how that works". And then someone else can chime in and say "ah, yes, with the database A B C happens, but then there's a queue and I don't know about that".
I think it's really fun to get together with people who have different specializations and try to make these little timelines of what the computers are doing. I've learned a lot from doing that with people.
Even though I struggled with DNS, once I figured it out, I felt like "dude, this is easy!". Even though it took me 10 years to learn how it works.
But of course, DNS was pretty hard for me to learn. So -- why is that? Why did it take me so long?
So, I have a little chart here of how I think about DNS.
You have your browser on the left. And over on the right there's the authoritative nameservers, the source of truth of where the DNS records for a domain live.
In the middle, there's a function that you call and a cache. So you have browser, function, cache, source of truth.
One problem is that there are a lot of things in this diagram that are totally hidden from you.
The library code that you're using where you make a DNS request -- there are a lot of different libraries you could be using, and it's not straightforward to figure out which one is being used. That was the source of some of my confusion.
There's a cache which has a bunch of cached data. That's invisible to you, you can't inspect it easily and you have no control over it.
And there's a conversation between the cache and the source of truth, these two red arrows which also you can't see at all.
So this is kind of tough! How are you supposed to develop an intuition for a system when it's mostly things that are completely hidden from you? Feels like a lot to expect.
So: let's talk about these red arrows on the right.
We have our cache and then we have the source of truth. This conversation is normally hidden from you because you often don't control either of these servers. Usually they're too busy doing high-performance computing to report to you what they're doing.
But I thought: anyone can write an authoritative nameserver! In particular, I could write one that reports back every single message that it receives to its users. So, with my friend Marie, we wrote a little DNS server.
(demo of messwithdns.net)
This is called Mess With DNS. The idea is I have a domain name and you
can do whatever you want with it. We're going to make a DNS record called
strangeloop
, and we're going to make a CNAME record pointing at
orange.jvns.ca
, which is just a picture of an orange. Because I
like oranges.
And then over here, every time a request comes in from a resolver, this will -- this will report back what happened. So, if we click on this link, we can see -- a Canadian DNS resolver, which is apparently what my browser is configured to use, is requesting an IPv4 record and an IPv6 record, A and AAAA.
(at this point in the demo everyone in the audience starts visiting the link and it gets a bit chaotic, it's very funny)

So the trick here is to find ways to show people parts of what the computer is doing that are normally hidden.
Another great example of showing things that are hidden is this website called float.exposed by Bartosz Ciechanowski who makes a lot of incredible visualizations.
So if you look at this 32-bit floating point number and click the "up" button on the significand, it'll show you the next floating point number, which is 2 more. And then as you make the number bigger and bigger (by increasing the exponent), you can see that the floating point numbers get further and further apart.
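You can play with the same widening-gaps idea outside the website using Python's math.nextafter, which returns the adjacent floating point number. (This is a quick sketch of mine, not part of the talk, and it uses 64-bit floats rather than the 32-bit ones on the site -- the idea is the same.)

```python
import math

# the gap between adjacent 64-bit floats near 1.0 is tiny...
gap_near_one = math.nextafter(1.0, math.inf) - 1.0

# ...but near 1e16 the exponent is much bigger, so adjacent floats
# are a whole 2.0 apart -- odd integers in between just don't exist
gap_near_1e16 = math.nextafter(1e16, math.inf) - 1e16

print(gap_near_one)   # 2.220446049250313e-16
print(gap_near_1e16)  # 2.0
```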
Anyway, this is not a talk about floating point. I could do an entire talk about this site and how we can use it to see how floating point works, but that's not this talk.
Another thing that makes DNS confusing is that it's a giant distributed system -- maybe you're confused because there are 5 million computers involved (really, more!). Most of which you have no control over, and some are doing not what they're supposed to do.
So that's another trick for understanding why things are hard, check to see if there are actually 5 million computers involved.
So what else is hard about DNS?
We've talked about how most of the system is hidden from you, and about how it's a big distributed system.
One of the hidden things I talked about was: the resolver has cached data, right? And you might be curious about whether a certain domain name is cached or not by your resolver right now.
Just to understand what's happening: am I getting this result because it was cached? What's the deal?
I said this was hidden, but there are a couple of ways to query a resolver to see what it has cached, and I want to show you one of them.
It's dig, which has a flag called +norecurse. You can use it to query a resolver and ask it to only return results it already has cached.
With dig +norecurse jvns.ca, I'm kind of asking -- how popular is my website? Is it popular enough that someone has visited it in the last 5 minutes? Because my records are not cached for that long, only for 5 minutes.
But when I look at this response, I feel like "please! What is all this?"
And when I show newcomers this output, they often respond by saying "wow, that's complicated, this DNS thing must be really complicated". But really this is just not a great output format, I think someone just made some relatively arbitrary choices about how to print this stuff out in the 90s and it's stayed that way ever since.
So a bad output format can mislead newcomers into thinking that something is more complicated than it actually is.
One of my favorite tricks, I call eraser eyes.
Because when I look at that output, I'm not looking at all of it, I'm just looking at a few things. My eyes are ignoring the rest of it.
When I look at the output, this is what I see: it says SERVFAIL. That's the DNS response code, which as I understand it is a very unintuitive way of saying "I do not have that in my cache". So nobody has asked that resolver about my domain name in the last 5 minutes, which isn't very surprising.
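(A side note on what +norecurse actually does: it just leaves the "recursion desired" bit in the DNS query header unset, so the resolver won't go ask the authoritative servers on your behalf. Here's a sketch of building a raw query by hand to show where that bit lives -- the helper name is mine, not a real library:)

```python
import struct

def build_query(name: str, recurse: bool) -> bytes:
    # DNS header: ID, flags, then four section counts (1 question)
    flags = 0x0100 if recurse else 0x0000  # 0x0100 is the RD bit
    header = struct.pack(">HHHHHH", 0x1234, flags, 1, 0, 0, 0)
    # the question: length-prefixed labels, then QTYPE=A, QCLASS=IN
    labels = b"".join(
        bytes([len(part)]) + part.encode() for part in name.split(".")
    )
    return header + labels + b"\x00" + struct.pack(">HH", 1, 1)

# with recurse=False, a resolver should only answer from its cache
query = build_query("jvns.ca", recurse=False)
```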
I've learned so much from people doing a little demo of a tool, and showing how they use it and which parts of the output or UI they pay attention to, and which parts they ignore.
Because usually we ignore most of what's on our screens!
I really love to use dig even though it's a little hairy, because it has a lot of features (I don't know of another DNS debugging tool that supports this +norecurse trick), it's everywhere, and it hasn't changed in a long time. And I know if I learn its weird output format once I can know it forever. Stability is really valuable to me.
We've talked about some tricks I use to bring people over, like:
When I practiced this talk, I got some feedback from people saying "julia! I don't do those things! I don't have a blog, and I'm not going to start one!"
And it's true that most people are probably not going to start programming blogs.
But I really don't think you need to have a public presence on the internet to tell the people around you a little bit about how you use computers and how you understand them.
My experience is that a lot of people (who do not have blogs!) have helped me understand how computers work and have shared little pieces of their experience with computers with me.
I've learned a lot from my friends and my coworkers and honestly a lot of random strangers on the Internet too. I'm pretty sure some of you here today have helped me over the years, maybe on Twitter or Mastodon.
So I want to talk about some archetypes of helpful people.
One kind of person who has really helped me is the grumpy old-timer. I'll say "this is so cool". And they'll reply "yes -- however, let me tell you some stories of how this has gone wrong in my life."
And those stories have sometimes helped spare me some suffering.
We have the loud newbie, who asks questions like "wait, how does that work?" And then everyone else feels relieved -- "oh, thank god. It's not just me."
I think it's especially valuable when the person who takes the "loud newbie" role is actually a pretty senior developer. Because when you're more secure in your position, it's easier to put yourself out there and say "uh, I don't get this" because nobody is going to judge you for that and think you're incompetent.
And then other people who feel more like they might be judged for not knowing something can ride along on your coattails.
Then we have the bug chronicler. Who decides "ok, that bug. This can never happen again".
"I'm gonna make sure we understand what happened. Because I want this to end now."
And much like when debugging a computer program, when you have a bug, you want to understand why the bug is happening if you're gonna fix it.
If we're all struggling with the same things together for the same reasons, if we can figure out what those reasons are, we can do a better job of fixing them.
And that's all I have for you. Thank you.
I brought some zines to the conference, if you come to the signing later on you can get one.
This was the last ever Strange Loop and I’m really grateful to Alex Miller and the whole organizing team for making such an incredible conference for so many years. Strange Loop accepted one of my first talks (you can be a kernel hacker) 9 years ago when I had almost no track record as a speaker so I owe a lot to them.
Thanks to Sumana for coming up with the idea for this talk, and to Marie, Danie, Kamal, Alyssa, and Maya for listening to rough drafts of it and helping make it better, and to Dolly, Jesse, and Marco for some of the conversations I mentioned.
Also after the conference Nick Fagerland wrote a nice post with thoughts on why git is hard in response to my “I don’t know why git is hard” comment and I really appreciated it. It had some new-to-me ideas and I’d love to read more analyses like that.
.git
directory, but where exactly in there are all the versions of your old files?
For example, this blog is in a git repository, and it contains a file called
content/post/2019-06-28-brag-doc.markdown
. Where is that in my .git
folder?
And where are the old versions of that file? Let’s investigate by writing some
very short Python programs.
.git/objects
Every previous version of every file in your repository is in .git/objects
.
For example, for this blog, .git/objects
contains about 2700 files:
$ find .git/objects/ -type f | wc -l
2761
note: .git/objects
actually has more information than “every previous version
of every file in your repository”, but we’re not going to get into that just yet
Here’s a very short Python program
(find-git-object.py) that
finds out where any given file is stored in .git/objects
.
import hashlib
import sys

def object_path(content):
    header = f"blob {len(content)}\0"
    data = header.encode() + content
    digest = hashlib.sha1(data).hexdigest()
    return f".git/objects/{digest[:2]}/{digest[2:]}"

with open(sys.argv[1], "rb") as f:
    print(object_path(f.read()))
What this does is:

- construct the header (blob 16673\0) and combine it with the contents
- calculate the sha1 sum of that (8ae33121a9af82dd99d6d706d037204251d41d54 in this case)
- turn the hash into a path (.git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54)

We can run it like this:
$ python3 find-git-object.py content/post/2019-06-28-brag-doc.markdown
.git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54
The term for this storage strategy (where the filename of an object in the database is the same as the hash of the file’s contents) is “content addressed storage”.
One neat thing about content addressed storage is that if I have two files (or
50 files!) with the exact same contents, that doesn’t take up any extra space
in Git’s database – if the hash of the contents is aabbbbbbbbbbbbbbbbbbbbbbbbb
, they’ll both be stored in .git/objects/aa/bbbbbbbbbbbbbbbbbbbbb
.
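One way to convince yourself of this: the hash is a pure function of the contents, so two identical files always land at the same object path. A tiny sketch using the same scheme as find-git-object.py above:

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    # same scheme as git: hash "blob <size>\0" + contents with SHA-1
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# identical contents -> identical hash -> stored once in .git/objects
assert git_blob_hash(b"hello\n") == git_blob_hash(b"hello\n")
```

This matches what `git hash-object` prints for the same contents.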
If I try to look at this file in .git/objects
, it gets a bit weird:
$ cat .git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54
x^A<8D><9B>}s<E3>Ƒ<C6><EF>o|<8A>^Q<9D><EC>ju<92><E8><DD>\<9C><9C>*<89>j<FD>^...
What’s going on? Let’s run file
on it:
$ file .git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54
.git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54: zlib compressed data
It’s just compressed! We can write another little Python program called decompress.py
that uses the zlib
module to decompress the data:
import zlib
import sys

with open(sys.argv[1], "rb") as f:
    content = f.read()
    print(zlib.decompress(content).decode())
Now let’s decompress it:
$ python3 decompress.py .git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54
blob 16673---
title: "Get your work recognized: write a brag document"
date: 2019-06-28T18:46:02Z
url: /blog/brag-documents/
categories: []
---
... the entire blog post ...
So this data is encoded in a pretty simple way: there’s this
blob 16673\0
thing, and then the full contents of the file.
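Since the format is just a header, a NUL byte, and the contents, splitting it back apart takes a couple of lines (a sketch -- the function name is mine):

```python
def parse_object(data: bytes):
    # a decompressed git object is b"<type> <size>\x00<contents>"
    header, _, body = data.partition(b"\x00")
    obj_type, size = header.split(b" ")
    return obj_type.decode(), int(size), body

obj_type, size, body = parse_object(b"blob 6\x00hello\n")
```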
One thing that surprised me the first time I learned it: there aren't any diffs here! That file is the 9th version of that blog post, but the version git stores in .git/objects is the whole file, not the diff from the previous version.
Git actually sometimes also does store files as diffs (when you run git gc
it
can combine multiple different files into a “packfile” for efficiency), but I
have never needed to think about that in my life so we’re not going to get into
it. Aditya Mukerjee has a great post called Unpacking Git packfiles about how the format works.
Now you might be wondering – if there are 8 previous versions of that blog
post (before I fixed some typos), where are they in the .git/objects
directory? How do we find them?
First, let’s find every commit where that file changed with git log
:
$ git log --oneline content/post/2019-06-28-brag-doc.markdown
c6d4db2d
423cd76a
7e91d7d0
f105905a
b6d23643
998a46dd
67a26b04
d9999f17
026c0f52
72442b67
Now let’s pick a previous commit, let’s say 026c0f52
. Commits are also stored
in .git/objects
, and we can try to look at it there. But the commit isn’t
there! ls .git/objects/02/6c*
doesn’t have any results! You know how we
mentioned “sometimes git packs objects to save space but we don’t need to worry
about it?“. I guess now is the time that we need to worry about it.
So let’s take care of that.
So we need to unpack the objects from the pack files. I looked it up on Stack Overflow and apparently you can do it like this:
$ mv .git/objects/pack/pack-adeb3c14576443e593a3161e7e1b202faba73f54.pack .
$ git unpack-objects < pack-adeb3c14576443e593a3161e7e1b202faba73f54.pack
This is weird repository surgery so it’s a bit alarming but I can always just clone the repository from Github again if I mess it up, so I wasn’t too worried.
After unpacking all the object files, we end up with way more objects: about 20000 instead of about 2700. Neat.
$ find .git/objects/ -type f | wc -l
20138
Now we can go back to looking at our commit 026c0f52
. You know how we said
that not everything in .git/objects
is a file? Some of them are commits! And
to figure out where the old version of our post
content/post/2019-06-28-brag-doc.markdown
is stored, we need to dig pretty
deep into this commit.
The first step is to look at the commit in .git/objects
.
The commit 026c0f52
is now in
.git/objects/02/6c0f5208c5ea10608afc9252c4a56c1ac1d7e4
after doing some
unpacking and we can look at it like this:
$ python3 decompress.py .git/objects/02/6c0f5208c5ea10608afc9252c4a56c1ac1d7e4
commit 211tree 01832a9109ab738dac78ee4e95024c74b9b71c27
parent 72442b67590ae1fcbfe05883a351d822454e3826
author Julia Evans <julia@jvns.ca> 1561998673 -0400
committer Julia Evans <julia@jvns.ca> 1561998673 -0400
brag doc
We can also get the same information with git cat-file -p 026c0f52
, which does the same thing but does a better job of formatting the data. (the -p
option means “format it nicely please”)
This commit has a tree. What’s that? Well let’s take a look. The tree’s ID
is 01832a9109ab738dac78ee4e95024c74b9b71c27
, and we can use our
decompress.py
script from earlier to look at that git object. (though I had to remove the .decode()
to get the script to not crash)
$ python3 decompress.py .git/objects/01/832a9109ab738dac78ee4e95024c74b9b71c27
b'tree 396\x00100644 .gitignore\x00\xc3\xf7`$8\x9b\x8dO\x19/\x18\xb7}|\xc7\xce\x8e:h\xad100644 README.md\x00~\xba\xec\xb3\x11\xa0^\x1c\xa9\xa4?\x1e\xb9\x0f\x1cfG\x96\x0b
This is formatted in kind of an unreadable way. The main display issue here is that
the hashes (\xc3\xf7`$8\x9b\x8dO\x19/\x18\xb7}|\xc7\xce…) are raw
bytes instead of being encoded in hexadecimal. So we see \xc3\xf7`$8\x9b\x8d
instead of c3f76024389b8d
. Let’s switch over to using git cat-file -p
which
formats the data in a friendlier way, because I don’t feel like writing a
parser for that.
$ git cat-file -p 01832a9109ab738dac78ee4e95024c74b9b71c27
100644 blob c3f76024389b8d4f192f18b77d7cc7ce8e3a68ad .gitignore
100644 blob 7ebaecb311a05e1ca9a43f1eb90f1c6647960bc1 README.md
100644 blob 0f21dc9bf1a73afc89634bac586271384e24b2c9 Rakefile
100644 blob 00b9d54abd71119737d33ee5d29d81ebdcea5a37 config.yaml
040000 tree 61ad34108a327a163cdd66fa1a86342dcef4518e content <-- this is where we're going next
040000 tree 6d8543e9eeba67748ded7b5f88b781016200db6f layouts
100644 blob 22a321a88157293c81e4ddcfef4844c6c698c26f mystery.rb
040000 tree 8157dc84a37fca4cb13e1257f37a7dd35cfe391e scripts
040000 tree 84fe9c4cb9cef83e78e90a7fbf33a9a799d7be60 static
040000 tree 34fd3aa2625ba784bced4a95db6154806ae1d9ee themes
This is showing us all of the files I had in the root directory of the
repository as of that commit. Looks like I accidentally committed some file
called mystery.rb
at some point which I later removed.
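(In case you're curious, the raw tree format isn't too bad to parse either: each entry is a mode, a space, a filename, a NUL byte, and then the 20 raw hash bytes. Here's a sketch, tried out on a made-up one-entry tree body:)

```python
import binascii

def parse_tree(body: bytes):
    # each entry: b"<mode> <name>\x00" followed by a 20-byte raw SHA-1
    entries = []
    i = 0
    while i < len(body):
        nul = body.index(b"\x00", i)
        mode, name = body[i:nul].split(b" ", 1)
        sha1 = binascii.hexlify(body[nul + 1 : nul + 21]).decode()
        entries.append((mode.decode(), name.decode(), sha1))
        i = nul + 21
    return entries

# a fake one-entry tree body, just to try it out
fake_body = b"100644 README.md\x00" + bytes.fromhex("7e" * 20)
```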
Our file is in the content
directory, so let’s look at that tree: 61ad34108a327a163cdd66fa1a86342dcef4518e
$ git cat-file -p 61ad34108a327a163cdd66fa1a86342dcef4518e
040000 tree 1168078878f9d500ea4e7462a9cd29cbdf4f9a56 about
100644 blob e06d03f28d58982a5b8282a61c4d3cd5ca793005 newsletter.markdown
040000 tree 1f94b8103ca9b6714614614ed79254feb1d9676c post <-- where we're going next!
100644 blob 2d7d22581e64ef9077455d834d18c209a8f05302 profiler-project.markdown
040000 tree 06bd3cee1ed46cf403d9d5a201232af5697527bb projects
040000 tree 65e9357973f0cc60bedaa511489a9c2eeab73c29 talks
040000 tree 8a9d561d536b955209def58f5255fc7fe9523efd zines
Still not done…
The file we’re looking for is in the post/
directory, so there’s one more tree:
$ git cat-file -p 1f94b8103ca9b6714614614ed79254feb1d9676c
.... MANY MANY lines omitted ...
100644 blob 170da7b0e607c4fd6fb4e921d76307397ab89c1e 2019-02-17-organizing-this-blog-into-categories.markdown
100644 blob 7d4f27e9804e3dc80ab3a3912b4f1c890c4d2432 2019-03-15-new-zine--bite-size-networking-.markdown
100644 blob 0d1b9fbc7896e47da6166e9386347f9ff58856aa 2019-03-26-what-are-monoidal-categories.markdown
100644 blob d6949755c3dadbc6fcbdd20cc0d919809d754e56 2019-06-23-a-few-debugging-resources.markdown
100644 blob 3105bdd067f7db16436d2ea85463755c8a772046 2019-06-28-brag-doc.markdown <-- found it!!!!!
Here the 2019-06-28-brag-doc.markdown
is the last file listed because it was
the most recent blog post when it was published.
Finally we have found the object file where a previous version of my blog post
lives! Hooray! It has the hash 3105bdd067f7db16436d2ea85463755c8a772046
, so
it’s in git/objects/31/05bdd067f7db16436d2ea85463755c8a772046
.
We can look at it with decompress.py:
$ python3 decompress.py .git/objects/31/05bdd067f7db16436d2ea85463755c8a772046 | head
blob 15924---
title: "Get your work recognized: write a brag document"
date: 2019-06-28T18:46:02Z
url: /blog/brag-documents/
categories: []
---
... rest of the contents of the file here ...
This is the old version of the post! If I ran git checkout 026c0f52 content/post/2019-06-28-brag-doc.markdown
or git restore --source 026c0f52 content/post/2019-06-28-brag-doc.markdown
, that’s what I’d get.
how git log works

This whole process we just went through (find the commit, go through the
various directory trees, search for the filename we wanted) seems kind of long
and complicated but this is actually what’s happening behind the scenes when we
run git log content/post/2019-06-28-brag-doc.markdown
. It needs to go through
every single commit in your history, check the version (for example
3105bdd067f7db16436d2ea85463755c8a772046
in this case) of
content/post/2019-06-28-brag-doc.markdown
, and see if it changed from the previous commit.
That’s why git log FILENAME
is a little slow sometimes – I have 3000 commits in this
repository and it needs to do a bunch of work for every single commit to figure
out if the file changed in that commit or not.
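Here's a toy simulation of that work (this is not real git -- each "commit" is just a dict mapping filenames to contents): for every commit we hash the file's blob and compare it to the parent commit's version.

```python
import hashlib

def blob_hash(content: bytes) -> str:
    return hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()

def commits_touching(history, path):
    # history is oldest-to-newest; each "commit" is {filename: contents}
    touched = []
    previous = None
    for i, tree in enumerate(history):
        current = blob_hash(tree[path]) if path in tree else None
        if current != previous:  # the file changed in this commit
            touched.append(i)
        previous = current
    return touched

history = [
    {"post.md": b"draft"},
    {"post.md": b"draft", "other.md": b"x"},  # post.md untouched
    {"post.md": b"final", "other.md": b"x"},  # post.md edited
]
```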
Right now I have 1530 files tracked in my blog repository:
$ git ls-files | wc -l
1530
But how many historical files are there? We can list everything in .git/objects
to see how many object files there are:
$ find .git/objects/ -type f | grep -v pack | awk -F/ '{print $3 $4}' | wc -l
20135
Not all of these represent previous versions of files though – as we saw
before, lots of them are commits and directory trees. But we can write another little Python
script called find-blobs.py
that goes through all of the objects and checks
if it starts with blob
or not:
import zlib
import sys

for line in sys.stdin:
    line = line.strip()
    filename = f".git/objects/{line[0:2]}/{line[2:]}"
    with open(filename, "rb") as f:
        contents = zlib.decompress(f.read())
        if contents.startswith(b"blob"):
            print(line)
$ find .git/objects/ -type f | grep -v pack | awk -F/ '{print $3 $4}' | python3 find-blobs.py | wc -l
6713
So it looks like there are 6713 - 1530 = 5183
old versions of files lying
around in my git repository that git is keeping around for me in case I ever
want to get them back. How nice!
Here’s the gist with all the code for this post. There’s not very much.
I thought I already knew how git worked, but I’d never really thought about
pack files before so this was a fun exploration. I also don’t spend too much
time thinking about how much work git log
is actually doing when I ask it to
track the history of a file, so that was fun to dig into.
As a funny postscript: as soon as I committed this blog post, git got mad about
how many objects I had in my repository (I guess 20,000 is too many!) and
ran git gc
to compress them all into packfiles. So now my .git/objects
directory is very small:
$ find .git/objects/ -type f | wc -l
14
I’ve found Mastodon quite a bit more confusing than Twitter because it’s a distributed system, so here are a few technical things I’ve learned about it over the last 10 months. I’ll mostly talk about what using a single-person server has been like for me, as well as a couple of notes about the API, DMs and ActivityPub.
I might have made some mistakes, please let me know if I’ve gotten anything wrong!
First: Mastodon is a decentralized collection of independently run servers instead of One Big Server. The software is open source.
In general, if you have an account on one server (like ruby.social
), you
can follow people on another server (like hachyderm.io
), and they can
follow you.
I’m going to use the terms “Mastodon server” and “Mastodon instance” interchangeably in this post.
These were the things I was concerned about when choosing an instance:
- I didn’t want a username like @b0rk@infosec.exchange, because I’m not an infosec person.
- the most obvious choice was mastodon.social, but some servers choose to block or limit mastodon.social for various reasons

In the end, I chose to run my own mastodon server because it seemed simplest – I could pick a domain I liked, and I knew I’d definitely agree with the moderation decisions because I’d be in charge.
I’m not going to give server recommendations here, but here’s a list of the top 200 most common servers people who follow me use.
One big thing I wondered was – can I use my own domain (and have the username @b0rk@jvns.ca
or something) but be on someone else’s Mastodon server?
The answer to this seems to be basically “no”: if you want to use your own
domain on Mastodon, you need to run your own server. (you can kind of do this,
but it’s more like an alias or redirect – if I used that method to direct b0rk@jvns.ca
to b0rk@mastodon.social
, my
posts would still show up as being from b0rk@mastodon.social
)
There’s also other ActivityPub software (Takahē) that supports people bringing their own domain in a first-class way.
I really wanted to have a way to use my own domain name for identity, but to share server hosting costs with other people. This isn’t possible on Mastodon right now, so I decided to set up my own server instead.
I chose to run a Mastodon server (instead of some other ActivityPub implementation) because Mastodon is the most popular one. Good managed Mastodon hosting is readily available, there are tons of options for client apps, and I know for sure that my server will work well with other people’s servers.
I use masto.host for Mastodon hosting, and it’s been great so far. I have nothing interesting to say about what it’s like to operate a Mastodon instance because I know literally nothing about it. Masto.host handles all of the server administration and Mastodon updates, and I never think about it at all.
Right now I’m on their $19/month (“Star”) plan, but it’s possible I could use a smaller plan with no problems. Right now their cheapest plan is $6/month and I expect that would be fine for someone with a smaller account.
Some things I was worried about when embarking on my own Mastodon server:
- my server is at social.jvns.ca, but I wanted my username to be b0rk@jvns.ca instead of b0rk@social.jvns.ca. To get this to work I followed these Setting up a personal fediverse ID directions from Jacob Kaplan-Moss and it’s been fine.

Being on a 1-person server has some significant downsides. To understand why, you need to understand a little about how Mastodon works.
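For the curious, setups like that work via WebFinger: your own domain serves a small JSON document that points at the account on the real server. Roughly (this is my reconstruction of the shape, not copied from the linked directions), jvns.ca answers requests to /.well-known/webfinger with something like:

```json
{
  "subject": "acct:b0rk@jvns.ca",
  "links": [
    {
      "rel": "self",
      "type": "application/activity+json",
      "href": "https://social.jvns.ca/users/b0rk"
    }
  ]
}
```

so when another server looks up b0rk@jvns.ca, it gets pointed at the actual Mastodon account on social.jvns.ca.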
Every Mastodon server has a database of posts. Servers only have posts that they were explicitly sent by another server in their database.
Some reasons that servers might receive posts:
As a 1-person server, my server does not receive that many posts! I only get posts from people I follow or posts that explicitly mention me in some way.
This causes several problems:
All of these things will happen to users of any small Mastodon server, not just 1-person servers.
A lot of people are on smaller servers, so when they’re participating in a conversation, they can’t see all the replies to the post.
This means that replies can get pretty repetitive because people literally cannot see each other’s replies. This is especially annoying for posts that are popular or controversial, because the person who made the post has to keep reading similar replies over and over again by people who think they’re making the point for the first time.
To get around this (as a reader), you can click “open link to post” or something in your Mastodon client, which will open up the page on the poster’s server where you can read all of the replies. It’s pretty annoying though.
As a poster, I’ve tried to reduce repetitiveness in replies by:
The Mastodon devs are extremely aware of these issues, there are a bunch of github issues about them:
My guess is that there are technical reasons these features are difficult to add because those issues have been open for 5-7 years.
The Mastodon devs have said that they plan to improve reply fetching, but that it requires a significant amount of work.
Some people have built workarounds for fetching profiles / replies.
Also, there are a couple of Mastodon clients which will proactively fetch replies. For iOS:
I haven’t tried those yet though.
Mastodon instances have a “local timeline” where you can see everything other people on the server are posting, and a “federated timeline” which shows sort of a combined feed from everyone followed by anyone on the server. This means that you can see trending posts and get an idea of what’s going on and find people to follow. You don’t get that if you’re on a 1-person server – it’s just me talking to myself! (plus occasional interjections from my reruns bot).
Some workarounds people mentioned for this:
If anyone else on small servers has suggestions for how to make discovery easier I’d love to hear them.
When I moved to my own server from mastodon.social
, I needed to run an account migration to move over my followers. First, here’s how migration works:
The follower move was the part I was most worried about. Here’s how it turned out:
One thing I love about Mastodon is – it has an API that’s MUCH easier to use than Twitter’s API. I’ve always been frustrated with how difficult it is to navigate large Twitter threads, so I made a small mastodon thread view website that lets you log into your Mastodon account. It’s pretty janky and it’s really only made for me to use, but I’ve really appreciated the ability to write my own janky software to improve my Mastodon experience.
Some notes on the Mastodon API:
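As one example of why the API is pleasant: fetching a public timeline is a single unauthenticated GET to /api/v1/timelines/public (a real, documented endpoint). Here's a quick sketch that just builds the URL -- mastodon.social is only an example server:

```python
import urllib.parse

def public_timeline_url(server: str, limit: int = 5) -> str:
    # GET this URL (e.g. with urllib.request.urlopen) and you get back
    # a plain JSON list of posts -- no OAuth dance required for public data
    query = urllib.parse.urlencode({"limit": limit})
    return f"https://{server}/api/v1/timelines/public?{query}"

url = public_timeline_url("mastodon.social")
```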
Next I’ll talk about a few general things about Mastodon that confused or surprised me that aren’t specific to being on a single-person instance.
The way Mastodon DMs work surprised me in a few ways:
There are a couple of different ways for a server to block another Mastodon server. I haven’t really had to do this much but people talk about it a lot and I was confused about the difference, so:
One thing that wasn’t obvious to me is that who servers defederate / limit is sometimes hidden, so it’s hard to suss out what’s going on if you’re considering joining a server, or trying to understand why you can’t see certain posts.
There’s no way to search past posts you’ve read. If I see something interesting on my timeline and want to find it later, I usually can’t. (Mastodon has an Elasticsearch-based search feature, but it only allows you to search your own posts, your mentions, your favourites, and your bookmarks)
These limitations on search are intentional (and a very common source of arguments) – it’s a privacy / safety issue. Here’s a summary from Tim Bray with lots of links.
It would be personally convenient for me to be able to search more easily but I respect folks’ safety concerns so I’ll leave it at that.
My understanding is that the Mastodon devs are planning to add opt-in search for public posts relatively soon.
We’ve been talking about Mastodon a lot, but not everyone who I follow is using Mastodon: Mastodon uses a protocol called ActivityPub to distribute messages.
Here are some examples of other software I see people talking about, in no particular order:
I’m probably missing a bunch of important ones.
This confused me for a while, and I’m still not super clear on how ActivityPub works. What I’ve understood is:
I haven’t written here about what Mastodon culture is like because other people have done a much better job of talking about it than me, but of course it’s the biggest thing that affects your experience and it was the thing that took me longest to get a handle on. A few links:
I don’t regret setting up a single-user server – even though it’s inconvenient, it’s important to me to have control over my social media. I think “have control over my social media” is more important to me than it is to most other people though, because I use Twitter/Mastodon a lot for work.
I am happy that I didn’t start out on a single-user server though – I think it would have made getting started on Mastodon a lot more difficult.
Mastodon is pretty rough around the edges sometimes but I’m able to have more interesting conversations about computers there than I am on Twitter (or Bluesky), so that’s where I’m staying for now.
if you just stopped being scared of the command line in the last year or three — what helped you?
(no need to reply if you don’t remember, or if you’ve been using the command line comfortably for 15 years — this question isn’t for you :) )
This list is still a bit shorter than I would like, but I’m posting it in the hopes that I can collect some more answers. There obviously isn’t one single thing that works for everyone – different people take different paths.
I think there are three parts to getting comfortable: reducing risks, motivation and resources. I’ll start with risks, then a couple of motivations and then list some resources.
A lot of people are (very rightfully!) concerned about accidentally doing some destructive action on the command line that they can’t undo.
A few strategies people said helped them reduce risks:
- alias rm to a tool like safe-rm or rmtrash so that you can't accidentally delete something you shouldn't (or just rm -i)
- use a shell feature that will take rm *.txt and show me exactly what it's going to remove
- use --dry-run options for dangerous commands, if they're available
- build your own --dry-run options into your shell scripts
A couple of people also mentioned getting frustrated with GUI tools (like heavy IDEs that use all your RAM and crash your computer) and being motivated to replace them with much lighter weight command line tools.
One person mentioned being motivated by seeing cool stuff other people were doing with the command line, like:
Several people mentioned explainshell where you can paste in any shell incantation and get it to break it down into different parts.
There were lots of little tips and tricks mentioned that make it a lot easier to work on the command line, like:
- keyboard shortcuts like Ctrl+w (to delete a word), Ctrl+a (to go to the beginning of the line), Ctrl+e (to go to the end), and Ctrl+left arrow / Ctrl+right arrow (to jump back/forward a word)
- cd - to go back to the previous directory
- how to use less to read man pages or other large text files (how to search, scroll, etc)
- checking out git branches (git checkout $(git for-each-ref --format='%(refname:short)' refs/heads/ | fzf))
- opening files to edit (nvim $(fzf))
- switching kubernetes contexts (kubectl config use-context $(kubectl config get-contexts -o name | fzf --height=10 --prompt="Kubernetes Context> "))

The general pattern here is that you use fzf to pick something (a file, a git branch, a command line argument), fzf prints the thing you picked to stdout, and then you insert that as the command line argument to another command.
You can also use fzf as a tool to automatically preview the output and quickly iterate, for example:
- experimenting with jq (echo '' | fzf --preview "jq {q} < YOURFILE.json")
- experimenting with sed (echo '' | fzf --preview "sed {q} YOURFILE")
- experimenting with awk (echo '' | fzf --preview "awk {q} YOURFILE")

You get the idea.
Folks will generally define an alias for their fzf incantations so you can type gcb or something to quickly pick a git branch to check out.
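For example, a minimal bash version of such a gcb function might look like this (a sketch of mine -- it assumes git and fzf are installed):

```shell
# pick a local branch with fzf, then check it out
gcb() {
  local branch
  branch="$(git for-each-ref --format='%(refname:short)' refs/heads/ | fzf)"
  # only check out if we actually picked something (fzf prints nothing on Esc)
  [ -n "$branch" ] && git checkout "$branch"
}
```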
Some people started using a Raspberry Pi, where it’s safer to experiment without worrying about breaking your computer (you can just erase the SD card and start over!)
Lots of people said they got more comfortable with the command line when they started using a more user-friendly shell setup like oh-my-zsh or fish. I really agree with this one – I’ve been using fish for 10 years and I love it.
A couple of other things you can do here:
Some tools for theming your terminal:
A few people mentioned fancy terminal file managers like ranger or nnn, which I hadn’t heard of.
Someone who can answer beginner questions and give you pointers is invaluable.
Several mentions of watching someone more experienced using the terminal – there are lots of little things that experienced users don’t even realize they’re doing which you can pick up.
Lots of people said that making their own aliases or scripts for commonly used tasks felt like a magical “a ha!” moment, because:
A lot of man pages don’t have examples, for example the openssl s_client man page has no examples. This makes it a lot harder to get started!
People mentioned a couple of cheat sheet tools, like:
For example the cheat page for openssl is really great – I think it includes almost everything I’ve ever actually used openssl for in practice (except the -servername option for openssl s_client).
One person said that they configured their .bash_profile to print out a cheat sheet every time they log in.
A couple of people said that they needed to change their approach – instead of trying to memorize all the commands, they realized they could just look up commands as needed and they’d naturally memorize the ones they used the most over time.
(I actually recently had the exact same realization about learning to read x86 assembly – I was taking a class and the instructor said “yeah, just look everything up every time to start, eventually you’ll learn the most common instructions by heart”)
Some people also said the opposite – that they used a spaced repetition app like Anki to memorize commonly used commands.
One person mentioned that they started using vim on the command line to edit files, and once they were using a terminal text editor it felt more natural to use the command line for other things too.
Also apparently there’s a new editor called micro which is like a nicer version of pico/nano, for folks who don’t want to learn emacs or vim.
One person said that they started using Linux as their main daily driver, and having to fix Linux issues helped them learn. That’s also how I got comfortable with the command line back in ~2004 (I was really into installing lots of different Linux distributions to try to find my favourite one), but my guess is that it’s not the most popular strategy these days.
Some people said that they took a university class where the professor made them do everything in the terminal, or that they created a rule for themselves that they had to do all their work in the terminal for a while.
A couple of people said that workshops like Software Carpentry workshops (an introduction to the command line, git, and Python/R programming for scientists) helped them get more comfortable with the command line.
You can see the software carpentry curriculum here.
a few that were mentioned:
articles:
books:
videos:
I’ve often heard the advice “don’t read the comments”, but actually I’ve learned a huge amount from reading internet comments on my posts from strangers over the years, even if sometimes people are jerks. So I want to explain some tactics I use to try to make the comments on my posts more informative and useful to me, and to try to minimize the number of annoying comments I get.
On here I mostly talk about facts – either facts about computers, or stories about my experiences using computers.
For example this post about tcpdump contains some basic facts about how to use tcpdump, as well as an example of how I’ve used it in the past.
Talking about facts means I get a lot of fact-based comments like:
- that you can use sudo tcpdump -s 0 -A 'tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420' to filter for HTTP GET requests

In general, I’d say that people’s comments about facts tend to stay pretty normal. The main kinds of negative comments I get about facts are:
- I didn’t use -n in that post because at the time I didn’t know why the -n flag was useful (it’s useful because it turns off this annoying reverse DNS lookup that tcpdump does by default so you can see the IP addresses)

I think stories encourage pretty good discussion. For example, why you should understand (a little) about TCP is a story about a time it was important for me to understand how TCP worked.
When I share stories about problems I solved, the comments really help me understand how what I learned fits into a bigger context. For example:
Also I think these kinds of stories are incredibly important – that post describes a bug that was VERY hard for me to solve, and the only reason I was able to figure it out in the first place was that I read this blog post.
Often in my blog posts I ask technical questions that I don’t know the answer to (or just mention “I don’t know X…”). This helps people focus their replies a little bit – an obvious comment to make is to provide an answer to the question, or explain the thing I didn’t know!
This is fun because it feels like a guaranteed way to get value out of people’s comments – people LOVE answering questions, and so they get to look smart, and I get the answer to a question I have! Everyone wins!
I make a lot of mistakes in my blog posts, because I write about a lot of things that are on the edge of my knowledge. When people point out mistakes, I often edit the blog post to fix it.
Usually I’ll stay near a computer for a few hours after I post a blog post so that I can fix mistakes quickly as they come up.
Some people are very careful to list every single error they made in their blog posts (“errata: the post previously said X which was wrong, I have corrected it to say Y”). Personally I make mistakes constantly and I don’t have time for that so I just edit the post to fix the mistakes.
A lot of the time when I post a blog post, people on Twitter/Mastodon will reply with various opinions they have about the thing. For example, someone recently replied to a blog post about DNS saying that they love using zone files and dislike web interfaces for managing DNS records. That’s not an opinion I share, so I asked them why.
They explained that there are some DNS record types (specifically TLSA) that they find often aren’t supported in web interfaces. I didn’t know that people used TLSA records, so I learned something! Cool!
I’ve found that asking people to share their experiences (“I wanted to use X DNS record type and I couldn’t”) instead of their opinions (“DNS web admin interfaces are bad”) leads to a lot of useful information and discussion. I’ve learned a lot from it over the years, and written a lot of tweets like “which DNS record types have you needed?” to try to extract more information about people’s experiences.
I try to model the same behaviour in my own work when I can – if I have an opinion, I’ll try to explain the experiences I’ve had with computers that caused me to have that opinion.
I think internet strangers are more likely to reply in a weird way when they have no idea who you are or why you’re writing this thing. It’s easy to make incorrect assumptions! So often I’ll mention a little context about why I’m writing this particular blog post.
For example:
A little while ago I started using a Mac, and one of my biggest frustrations with it is that often I need to run Linux-specific software.
or
I’ve started to run a few more servers recently (nginx playground, mess with dns, dns lookup), so I’ve been thinking about monitoring.
or
Last night, I needed to scan some documents for some bureaucratic reasons. I’d never used a scanner on Linux before and I was worried it would take hours to figure out…
There are some kinds of programming conversations that I find extremely boring (like “should people learn vim?” or “is functional programming better than imperative programming?“). So I generally try to avoid writing blog posts that I think will result in a conversation/comment thread that I find annoying or boring.
For example, I wouldn’t write about my opinions about functional programming: I don’t really have anything interesting to say about it and I think it would lead to a conversation that I’m not interested in having.
I don’t always succeed at this of course (it’s impossible to predict what people are going to want to comment about!), but I try to avoid the most obvious flamebait triggers I’ve seen in the past.
There are a bunch of “flamebait” triggers that can set people off on a conversation that I find boring: cryptocurrency, tailwind, DNSSEC/DoH, etc. So I have a weird catalog in my head of things not to mention if I don’t want to start the same discussion about that thing for the 50th time.
Of course, if you think that conversations about functional programming are interesting, you should write about functional programming and start the conversations you want to have!
Also, it’s often possible to start an interesting conversation about a topic where the conversation is normally boring. For example I often see the same talking points about IPv6 vs IPv4 over and over again, but I remember the comments on Reasons for servers to support IPv6 being pretty interesting. In general if I really care about a topic I’ll talk about it anyway, but I don’t care about functional programming very much so I don’t see the point of bringing it up.
Another kind of “boring conversation” I try to avoid is suggestions of things I have already considered. Like when someone says “you should do X” but I already know I could have done X and chose not to because of A B C.
So I often will add a short note like “I decided not to do X because of A B C” or “you can also do X” or “normally I would do X, here I didn’t because…”. For example, in this post about nix, I list a bunch of Nix features I’m choosing not to use (nix-shell, nix flakes, home manager) to avoid a bunch of helpful people telling me that I should use flakes.
Listing the things I’m not doing is also helpful to readers – maybe someone new to nix will discover nix flakes through that post and decide to use them! Or maybe someone will learn that there are exceptions to when a certain “best practice” is appropriate.
Recently on Mastodon I complained about some gross terminology (“domain information groper”) that I’d just noticed in the dig man page on my machine. A few dudes in the replies (who by now have all deleted their posts) asked me to prove that the original author intended it to be offensive (which of course is beside the point, there’s just no need to have a term widely understood to be referring to sexual assault in the dig man page) or tried to explain to me why it actually wasn’t a problem.
So I blocked a few people and wrote a quick post:
man so many dudes in the replies demanding that i prove that the person who named dig “domain information groper” intended it in an offensive way. Big day for the block button I guess :)
I don’t do this too often, but I think it’s very important on social media to occasionally set some rules about what kind of behaviour I won’t tolerate. My goal here is usually to drive away some of the assholes (they can unfollow me!) and try to create a more healthy space for everyone else to have a conversation about computers in.
Obviously this only works in situations (like Twitter/Mastodon) where I have the ability to garden my following a little bit over time – I can’t do this on HN or Reddit or Lobsters or whatever and wouldn’t try.
As for fixing it – the dig maintainers removed the problem language years ago, but Mac OS still has a very outdated version for license reasons.
(you might notice that this section is breaking the “avoid boring conversations” rule above, this section was certain to start a very boring argument, but I felt it was important to talk about boundaries so I left it in)
Sometimes people seem to want to get into arguments or make dismissive comments. I don’t reply to them, even if they’re wrong. I dislike arguing on the internet and I’m extremely bad at it, so it’s not a good use of my time.
If I get a lot of negative comments that I didn’t expect, I try to see if I can get something useful out of it.
For example, I wrote a toy DNS resolver once and some of the commenters were upset that I didn’t handle parsing the DNS packet. At the time I thought this was silly (I thought DNS parsing was really straightforward and that it was obvious how to do it, who cares that I didn’t handle it?) but I realized that maybe the commenters didn’t think it was easy or obvious, and wanted to know how to do it. Which makes sense! It’s not obvious at all if you haven’t done it before!
Those comments partly inspired implement DNS in a weekend, which focuses much more heavily on the parsing aspects, and which I think is a much better explanation how to write a DNS resolver. So ultimately those comments helped me a lot, even if I found them annoying at the time.
(I realize this section makes me sound like a Perfectly Logical Person who does not get upset by negative public criticism, I promise this is not at all the case and I have 100000 feelings about everything that happens on the internet and get upset all the time. But I find that analyzing the criticism and trying to take away something useful from it helps a bit)
Thanks to Shae, Aditya, Brian, and Kamal for reading a draft of this.
Some other similar posts I’ve written in the past:
print("hello world")
Here’s what it looks like at the command line:
$ python3 hello.py
hello world
But behind the scenes, there’s a lot more going on. I’ll describe some of what happens, and (much much more importantly!) explain some tools you can use to see what’s going on behind the scenes yourself. We’ll use readelf, strace, ldd, debugfs, /proc, ltrace, dd, and stat. I won’t talk about the Python-specific parts at all – just what happens when you run any dynamically linked executable.
Here’s a table of contents:
execve
Before we even start the Python interpreter, there are a lot of things that have to happen. What executable are we even running? Where is it?
First, the shell parses python3 hello.py into a command to run and a list of arguments: python3, and ['hello.py'].

A bunch of things like glob expansion could happen here. For example if you run python3 *.py, the shell will expand that into python3 hello.py.

python3

Now we know we need to run python3. But what’s the full path to that binary? The way this works is that there’s a special environment variable named PATH.
See for yourself: Run echo $PATH in your shell. For me it looks like this:
$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
When you run a command, the shell will search every directory in that list (in order) to try to find a match.
In fish (my shell), you can see the path resolution logic here. It uses the stat system call to check if files exist.
See for yourself: Run strace -e stat bash, and then run a command like python3. You should see output like this:
stat("/usr/local/sbin/python3", 0x7ffcdd871f40) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/python3", 0x7ffcdd871f40) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/python3", 0x7ffcdd871f40) = -1 ENOENT (No such file or directory)
stat("/usr/bin/python3", {st_mode=S_IFREG|0755, st_size=5479736, ...}) = 0
You can see that it finds the binary at /usr/bin/python3 and stops: it doesn’t continue searching /sbin or /bin.
(if this doesn’t work for you, instead try strace -o out bash, and then grep stat out. One reader mentioned that their version of libc uses a different system call instead of stat)
execvp
If you want to run the same PATH searching logic as the shell does without reimplementing it yourself, you can use the libc function execvp (or one of the other exec* functions with p in the name).
stat, under the hood

Now you might be wondering – Julia, what is stat doing? Well, when your OS opens a file, it’s split into 2 steps:

- it maps the filename to an inode, which contains metadata about the file
- it uses the inode to get the file’s actual contents
The stat system call just returns the contents of the file’s inode – it doesn’t read the file’s contents at all. The advantage of this is that it’s a lot faster. Let’s go on a short adventure into inodes. (this great post “A disk is a bunch of bits” by Dmitry Mazin has more details)
$ stat /usr/bin/python3
File: /usr/bin/python3 -> python3.9
Size: 9 Blocks: 0 IO Block: 4096 symbolic link
Device: fe01h/65025d Inode: 6206 Links: 1
Access: (0777/lrwxrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2023-08-03 14:17:28.890364214 +0000
Modify: 2021-04-05 12:00:48.000000000 +0000
Change: 2021-06-22 04:22:50.936969560 +0000
Birth: 2021-06-22 04:22:50.924969237 +0000
See for yourself: Let’s go see where exactly that inode is on our hard drive.
First, we have to find our hard drive’s device name:
$ df
...
tmpfs 100016 604 99412 1% /run
/dev/vda1 25630792 14488736 10062712 60% /
...
Looks like it’s /dev/vda1. Next, let’s find out where the inode for /usr/bin/python3 is on our hard drive:
$ sudo debugfs /dev/vda1
debugfs 1.46.2 (28-Feb-2021)
debugfs: imap /usr/bin/python3
Inode 6206 is part of block group 0
located at block 658, offset 0x0d00
I have no idea how debugfs is figuring out the location of the inode for that filename, but we’re going to leave that alone.
Now, we need to calculate how many bytes into our hard drive “block 658, offset 0x0d00” is on the big array of bytes that is your hard drive. Each block is 4096 bytes, so we need to go 4096 * 658 + 0x0d00 bytes. A calculator tells me that’s 2698496.
$ sudo dd if=/dev/vda1 bs=1 skip=2698496 count=256 2>/dev/null | hexdump -C
00000000 ff a1 00 00 09 00 00 00 f8 b6 cb 64 9a 65 d1 60 |...........d.e.`|
00000010 f0 fb 6a 60 00 00 00 00 00 00 01 00 00 00 00 00 |..j`............|
00000020 00 00 00 00 01 00 00 00 70 79 74 68 6f 6e 33 2e |........python3.|
00000030 39 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |9...............|
00000040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000060 00 00 00 00 12 4a 95 8c 00 00 00 00 00 00 00 00 |.....J..........|
00000070 00 00 00 00 00 00 00 00 00 00 00 00 2d cb 00 00 |............-...|
00000080 20 00 bd e7 60 15 64 df 00 00 00 00 d8 84 47 d4 | ...`.d.......G.|
00000090 9a 65 d1 60 54 a4 87 dc 00 00 00 00 00 00 00 00 |.e.`T...........|
000000a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
Neat! There’s our inode! You can see it says python3 in it, which is a really good sign. We’re not going to go through all of this, but the ext4 inode struct from the Linux kernel says that the first 16 bits are the “mode”, or permissions. So let’s work out how ffa1 corresponds to file permissions.
- The bytes ff a1 correspond to the number 0xa1ff, or 41471 (because x86 is little endian)
- 41471 in octal is 0120777
- The 777 at the end is familiar file permissions, but what are the first 3 digits? I’m not used to seeing those! You can find out what the 012 means in man inode (scroll down to “The file type and mode”). There’s a little table that says 012 means “symbolic link”.

Let’s list the file and see if it is in fact a symbolic link with permissions 777:
$ ls -l /usr/bin/python3
lrwxrwxrwx 1 root root 9 Apr 5 2021 /usr/bin/python3 -> python3.9
It is! Hooray, we decoded it correctly.
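Python’s stat module knows this mode encoding too, so we can double-check the arithmetic straight from the inode bytes:

```python
import stat

mode = 0xA1FF  # the "ff a1" bytes from the inode, byte-swapped
print(oct(mode))            # 0o120777, same as our calculation
print(stat.filemode(mode))  # renders it the way ls -l does
print(stat.S_ISLNK(mode))   # True: the 012 prefix means "symbolic link"
```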
We’re still not ready to start python3. First, the shell needs to create a new child process to run. The way new processes start on Unix is a little weird – first the process clones itself, and then runs execve, which replaces the cloned process with a new process.
See for yourself: Run strace -e clone bash, then run python3. You should see something like this:
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f03788f1a10) = 3708100
3708100 is the PID of the new process, which is a child of the shell process.
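Here’s the same clone-then-execve dance written out with Python’s thin wrappers around those system calls – a sketch, not what a real shell does (shells do a lot more bookkeeping), and it assumes a Unix system with /bin/sh:

```python
import os

# fork() clones the current process; the child then replaces itself
# with a brand-new program via execv(), just like the shell does.
pid = os.fork()
if pid == 0:
    # child: become /bin/sh, which immediately exits with status 7
    os.execv("/bin/sh", ["/bin/sh", "-c", "exit 7"])
else:
    # parent: wait for the child, like the shell waits for python3
    _, status = os.waitpid(pid, 0)
    exit_code = os.WEXITSTATUS(status)
    print("child", pid, "exited with", exit_code)
```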
Some more tools to look at what’s going on with processes:
- pstree will show you a tree of all the processes on your system
- cat /proc/PID/stat shows you some information about the process. The contents of that file are documented in man proc. For example the 4th field is the parent PID.

The new process (which will become python3) has inherited a bunch of things from the shell. For example, it’s inherited:

- its environment variables (you can look at them with cat /proc/PID/environ | tr '\0' '\n')
- its file descriptors (you can list them with ls -l /proc/PID/fd)
execve
Now we’re ready to start the Python interpreter!
See for yourself: Run strace -f -e execve bash, then run python3. The -f is important because we want to follow any forked child subprocesses. You should see something like this:
[pid 3708381] execve("/usr/bin/python3", ["python3"], 0x560397748300 /* 21 vars */) = 0
The first argument is the binary, and the second argument is the list of command line arguments. The command line arguments get placed in a special location in the program’s memory so that it can access them when it runs.
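On Linux you can look at the raw argument list the kernel stored for any process via /proc – here’s the current process inspecting its own copy (this assumes a Linux /proc filesystem):

```python
import sys

# The kernel copies argv into the new process's memory; on Linux the
# raw NUL-separated copy is visible at /proc/self/cmdline.
with open("/proc/self/cmdline", "rb") as f:
    raw = f.read()
args = raw.rstrip(b"\0").split(b"\0")
print(args)      # the raw argv the kernel stored
print(sys.argv)  # Python's view of (roughly) the same thing
```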
Now, what’s going on inside execve? The first thing that has to happen is that we need to open the python3 binary file and read its contents. So far we’ve only used the stat system call to access its metadata, but now we need its contents.
Let’s look at the output of stat again:
$ stat /usr/bin/python3
File: /usr/bin/python3 -> python3.9
Size: 9 Blocks: 0 IO Block: 4096 symbolic link
Device: fe01h/65025d Inode: 6206 Links: 1
...
This takes up 0 blocks of space on the disk. This is because the contents of the symbolic link (python3.9) are actually stored in the inode itself: you can see them here (from the binary contents of the inode above, it’s split across 2 lines in the hexdump output):
00000020 00 00 00 00 01 00 00 00 70 79 74 68 6f 6e 33 2e |........python3.|
00000030 39 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |9...............|
So we’ll need to open /usr/bin/python3.9 instead. All of this is happening inside the kernel, so you won’t see another system call for it.
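We can replay this symlink behaviour with a scratch symlink in Python – note how lstat sees the link itself (size 9, the length of “python3.9”, matching the stat output above) while stat follows it:

```python
import os
import stat
import tempfile

# Recreate the situation: a file python3.9 and a symlink python3
# pointing at it (in a scratch directory, not /usr/bin!).
d = tempfile.mkdtemp()
target = os.path.join(d, "python3.9")
link = os.path.join(d, "python3")
with open(target, "w") as f:
    f.write("pretend interpreter")
os.symlink("python3.9", link)  # relative target, like the real symlink

print(os.readlink(link))                     # the link's contents
print(os.lstat(link).st_size)                # 9: length of "python3.9"
print(stat.S_ISLNK(os.lstat(link).st_mode))  # lstat looks at the link itself
print(stat.S_ISREG(os.stat(link).st_mode))   # stat follows it to the file
```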
Every file is made up of a bunch of blocks on the hard drive. I think each of these blocks on my system is 4096 bytes, so the minimum size of a file is 4096 bytes – even if the file is only 5 bytes, it still takes up 4KB on disk.
See for yourself: We can find the block numbers using debugfs like this: (again, I got these instructions from Dmitry Mazin’s “A disk is a bunch of bits” post)
$ debugfs /dev/vda1
debugfs: blocks /usr/bin/python3.9
145408 145409 145410 145411 145412 145413 145414 145415 145416 145417 145418 145419 145420 145421 145422 145423 145424 145425 145426 145427 145428 145429 145430 145431 145432 145433 145434 145435 145436 145437
Now we can use dd to read the first block of the file. We’ll set the block size to 4096 bytes, skip 145408 blocks, and read 1 block.
$ dd if=/dev/vda1 bs=4096 skip=145408 count=1 2>/dev/null | hexdump -C | head
00000000 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 |.ELF............|
00000010 02 00 3e 00 01 00 00 00 c0 a5 5e 00 00 00 00 00 |..>.......^.....|
00000020 40 00 00 00 00 00 00 00 b8 95 53 00 00 00 00 00 |@.........S.....|
00000030 00 00 00 00 40 00 38 00 0b 00 40 00 1e 00 1d 00 |....@.8...@.....|
00000040 06 00 00 00 04 00 00 00 40 00 00 00 00 00 00 00 |........@.......|
00000050 40 00 40 00 00 00 00 00 40 00 40 00 00 00 00 00 |@.@.....@.@.....|
00000060 68 02 00 00 00 00 00 00 68 02 00 00 00 00 00 00 |h.......h.......|
00000070 08 00 00 00 00 00 00 00 03 00 00 00 04 00 00 00 |................|
00000080 a8 02 00 00 00 00 00 00 a8 02 40 00 00 00 00 00 |..........@.....|
00000090 a8 02 40 00 00 00 00 00 1c 00 00 00 00 00 00 00 |..@.............|
You can see that we get the exact same output as if we read the file with cat, like this:
$ cat /usr/bin/python3.9 | hexdump -C | head
00000000 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 |.ELF............|
00000010 02 00 3e 00 01 00 00 00 c0 a5 5e 00 00 00 00 00 |..>.......^.....|
00000020 40 00 00 00 00 00 00 00 b8 95 53 00 00 00 00 00 |@.........S.....|
00000030 00 00 00 00 40 00 38 00 0b 00 40 00 1e 00 1d 00 |....@.8...@.....|
00000040 06 00 00 00 04 00 00 00 40 00 00 00 00 00 00 00 |........@.......|
00000050 40 00 40 00 00 00 00 00 40 00 40 00 00 00 00 00 |@.@.....@.@.....|
00000060 68 02 00 00 00 00 00 00 68 02 00 00 00 00 00 00 |h.......h.......|
00000070 08 00 00 00 00 00 00 00 03 00 00 00 04 00 00 00 |................|
00000080 a8 02 00 00 00 00 00 00 a8 02 40 00 00 00 00 00 |..........@.....|
00000090 a8 02 40 00 00 00 00 00 1c 00 00 00 00 00 00 00 |..@.............|
This file starts with ELF, which is a “magic number” – a byte sequence that tells us that this is an ELF file. ELF is the binary file format on Linux. Different file formats have different magic numbers; for example, the magic number for gzip is 1f8b. The magic number at the beginning is how file blah.gz knows that it’s a gzip file.
I think file has a variety of heuristics for figuring out the file type of a file, not just magic numbers, but the magic number is an important one.
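A toy version of what file does – sniffing the first few bytes against known magic numbers – looks like this (just the two magic numbers we’ve mentioned, nothing like file’s real database):

```python
import gzip

def sniff(data):
    """A tiny file(1): check the first bytes against known magic numbers."""
    if data[:4] == b"\x7fELF":
        return "ELF"
    if data[:2] == b"\x1f\x8b":
        return "gzip"
    return "unknown"

print(sniff(gzip.compress(b"hello")))  # gzip output really starts with 1f 8b
print(sniff(b"\x7fELF\x02\x01\x01"))
print(sniff(b"some random text"))
```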
Let’s parse the ELF file to see what’s in there.
See for yourself: Run readelf -a /usr/bin/python3.9. Here’s what I get (though I’ve redacted a LOT of stuff):
$ readelf -a /usr/bin/python3.9
ELF Header:
Class: ELF64
Machine: Advanced Micro Devices X86-64
...
-> Entry point address: 0x5ea5c0
...
Program Headers:
Type Offset VirtAddr PhysAddr
INTERP 0x00000000000002a8 0x00000000004002a8 0x00000000004002a8
0x000000000000001c 0x000000000000001c R 0x1
-> [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
...
-> 1238: 00000000005ea5c0 43 FUNC GLOBAL DEFAULT 13 _start
Here’s what I understand of what’s going on here:

- it’s asking the kernel to use /lib64/ld-linux-x86-64.so.2 to start this program. This is called the dynamic linker and we’ll talk about it next
- it specifies an entry point address (0x5ea5c0, which is where this program’s code starts)

Now let’s talk about the dynamic linker.
Okay! We’ve read the bytes from disk and we’ve started this “interpreter” thing. What next? Well, if you run strace -o out.strace python3, you’ll see a bunch of stuff like this right after the execve system call:
execve("/usr/bin/python3", ["python3"], 0x560af13472f0 /* 21 vars */) = 0
brk(NULL) = 0xfcc000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=32091, ...}) = 0
mmap(NULL, 32091, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f718a1e3000
close(3) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 l\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=149520, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f718a1e1000
...
close(3) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
This all looks a bit intimidating at first, but the part I want you to pay attention to is openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0". This is opening a C threading library called pthread that the Python interpreter needs to run.
See for yourself: If you want to know which libraries a binary needs to load at runtime, you can use ldd. Here’s what that looks like for me:
$ ldd /usr/bin/python3.9
linux-vdso.so.1 (0x00007ffc2aad7000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2fd6554000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f2fd654e000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f2fd6549000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2fd6405000)
libexpat.so.1 => /lib/x86_64-linux-gnu/libexpat.so.1 (0x00007f2fd63d6000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f2fd63b9000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2fd61e3000)
/lib64/ld-linux-x86-64.so.2 (0x00007f2fd6580000)
You can see that the first library listed is /lib/x86_64-linux-gnu/libpthread.so.0, which is why it was loaded first.
I’m honestly still a little confused about dynamic linking. Some things I know:

- dynamic linking happens in userspace, and the dynamic linker on my system is at /lib64/ld-linux-x86-64.so.2. If you’re missing the dynamic linker, you can end up with weird bugs like this weird “file not found” error
- the dynamic linker uses the LD_LIBRARY_PATH environment variable to find libraries
- you can use the LD_PRELOAD environment variable to override any dynamically linked function you want (you can use this for fun hacks, or to replace your default memory allocator with an alternative one like jemalloc)
- there are some mprotects in the strace output which are marking the library code as read-only, for security reasons
- on Macs, it’s DYLD_LIBRARY_PATH instead of LD_LIBRARY_PATH
You might be wondering – if dynamic linking happens in userspace, why don’t we see a bunch of stat system calls where it’s searching through LD_LIBRARY_PATH for the libraries, the way we did when bash was searching the PATH?
That’s because ld has a cache in /etc/ld.so.cache, and all of those libraries have already been found in the past. You can see it opening the cache in the strace output – openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3.
There are still a bunch of system calls after dynamic linking in the full strace output that I still don’t really understand (what’s prlimit64 doing? where does the locale stuff come in? what’s gconv-modules.cache? what’s rt_sigaction doing? what’s arch_prctl? what’s set_tid_address and set_robust_list?). But this feels like a good start.
Someone on Mastodon pointed out that ldd is actually a shell script that just sets the LD_TRACE_LOADED_OBJECTS=1 environment variable and starts the program. So you can do exactly the same thing like this:
$ LD_TRACE_LOADED_OBJECTS=1 python3
linux-vdso.so.1 (0x00007ffe13b0a000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f01a5a47000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f01a5a41000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f2fd6549000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2fd6405000)
libexpat.so.1 => /lib/x86_64-linux-gnu/libexpat.so.1 (0x00007f2fd63d6000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f2fd63b9000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2fd61e3000)
/lib64/ld-linux-x86-64.so.2 (0x00007f2fd6580000)
Apparently ld is also a binary you can just run, so /lib64/ld-linux-x86-64.so.2 --list /usr/bin/python3.9 also does the same thing.
init and fini

Let’s talk about this line in the strace output:
set_tid_address(0x7f58880dca10) = 3709103
This seems to have something to do with threading, and I think this might be happening because the pthread library (and every other dynamically loaded library) gets to run initialization code when it’s loaded. The code that runs when the library is loaded is in the init section (or maybe also the .ctors section).
See for yourself: Let’s take a look at that using readelf:
$ readelf -a /lib/x86_64-linux-gnu/libpthread.so.0
...
[10] .rela.plt RELA 00000000000051f0 000051f0
00000000000007f8 0000000000000018 AI 4 26 8
[11] .init PROGBITS 0000000000006000 00006000
000000000000000e 0000000000000000 AX 0 0 4
[12] .plt PROGBITS 0000000000006010 00006010
0000000000000560 0000000000000010 AX 0 0 16
...
This library doesn’t have a .ctors section, just an .init. But what’s in that .init section? We can use objdump to disassemble the code:
$ objdump -d /lib/x86_64-linux-gnu/libpthread.so.0
Disassembly of section .init:
0000000000006000 <_init>:
6000: 48 83 ec 08 sub $0x8,%rsp
6004: e8 57 08 00 00 callq 6860 <__pthread_initialize_minimal>
6009: 48 83 c4 08 add $0x8,%rsp
    600d:       c3                      retq
So it’s calling __pthread_initialize_minimal. I found the code for that function in glibc, though I had to find an older version of glibc because it looks like in more recent versions libpthread is no longer a separate library.
I’m not sure whether this set_tid_address system call actually comes from __pthread_initialize_minimal, but at least we’ve learned that libraries can run code on startup through the .init section.
Here’s a note from man elf on the .init section:
$ man elf
.init This section holds executable instructions that contribute to the process initialization code. When a program starts to run, the system arranges to execute the code in this section before calling the main program entry point.
There’s also a .fini
section in the ELF file that runs at the end, and
.ctors
/ .dtors
(constructors and destructors) are other sections that
could exist.
Okay, that’s enough about dynamic linking.
_start
After dynamic linking is done, we go to _start
in the Python interpreter.
Then it does all the normal Python interpreter things you’d expect.
I’m not going to talk about this because here I’m interested in general facts about how binaries are run on Linux, not the Python interpreter specifically.
We still need to print out “hello world” though. Under the hood, the Python print
function calls some function from libc. But which one? Let’s find out!
See for yourself: Run ltrace -o out python3 hello.py
.
$ ltrace -o out python3 hello.py
$ grep hello out
write(1, "hello world\n", 12) = 12
So it looks like it’s calling write.
I honestly am always a little suspicious of ltrace – unlike strace (which I
would trust with my life), I’m never totally sure that ltrace is actually
reporting library calls accurately. But in this case it seems to be working. And
if we look at the cpython source code, it does seem to be calling write()
in some places. So I’m willing to believe that.
We just said that Python calls the write
function from libc. What’s libc?
It’s the C standard library, and it’s responsible for a lot of basic things
like:
- memory allocation (malloc)
- running programs (execvp, like we mentioned before)
- DNS lookups (getaddrinfo)
- threading (pthread)
Programs don’t have to use libc (on Linux, Go famously doesn’t use it and calls Linux system calls directly instead), but most other programming languages I use (node, Python, Ruby, Rust) all use libc. I’m not sure about Java.
You can find out if you’re using libc by running ldd
on your binary: if you
see something like libc.so.6
, that’s libc.
You might be wondering – why does it matter that Python calls the libc write
and then libc calls the write
system call? Why am I making a point of saying
that libc
is in the middle?
I think in this case it doesn’t really matter (AFAIK the write
libc function
maps pretty directly to the write
system call).
But there are different libc implementations, and sometimes they behave differently. The two main ones are glibc (GNU libc) and musl libc.
For example, until recently musl’s getaddrinfo
didn’t support TCP DNS; here’s a blog post talking about a bug that this caused.
In this program, stdout (the 1
file descriptor) is a terminal. And you can do
funny things with terminals! Here’s one:
- Run ls -l /proc/self/fd/1. I get /dev/pts/2.
- In another terminal, run echo hello > /dev/pts/2.
- hello gets printed there!

Hopefully you have a better idea of how hello world
gets printed! I’m going to stop
adding more details for now because this is already pretty long, but obviously there’s
more to say and I might add more if folks chip in with extra details. I’d
especially love suggestions for other tools you could use to inspect parts of
the process that I haven’t explained here.
Thanks to everyone who suggested corrections / additions – I’ve edited this blog post a lot to incorporate more things :)
Some things I’d like to add if I can figure out how to spy on them:
- how the write(1, "hello world", 11)
gets sent to the TTY that I’m looking at

One of my frustrations with Mac OS is that I don’t know how to introspect my
system on this level – when I print hello world
, I can’t figure out how to
spy on what’s going on behind the scenes the way I can on Linux. I’d love to
see a really in depth explainer.
Some Mac equivalents I know about:
- ldd -> otool -L
- readelf -> otool
- you can use dtruss
or dtrace
on Mac instead of strace, but I’ve never been brave enough to turn off system integrity protection to get it to work
- strace -> sc_usage
seems to be able to collect stats about syscall usage, and fs_usage
about file usage

Some more links:
execve