I was surprised that people were so unconfident about their understanding –
I’d been thinking of HEAD
as a pretty straightforward topic.
Usually when people say that a topic is confusing when I think it’s not, the
reason is that there’s actually some hidden complexity that I wasn’t
considering. And after some follow up conversations, it turned out that HEAD
actually was a bit more complicated than I’d appreciated!
Here’s a quick table of contents:
After talking to a bunch of different people about HEAD
, I realized that
HEAD
actually has a few different closely related meanings:
- the file .git/HEAD
- HEAD
as in git show HEAD
(git calls this a “revision parameter”)
- HEAD
in the output of various commands (<<<<<<< HEAD
, (HEAD -> main)
, detached HEAD state
, On branch main
, etc)

These are extremely closely related to each other, but I don’t think the relationship is totally obvious to folks who are starting out with git.
.git/HEAD
Git has a very important file called .git/HEAD
. The way this file works is that it contains either:

- the name of a branch (like ref: refs/heads/main
)
- a commit ID (like 96fa6899ea34697257e84865fefc56beb42d6390
)

This file is what determines what your “current branch” is in Git. For example, when you run git status
and see this:
$ git status
On branch main
it means that the file .git/HEAD
contains ref: refs/heads/main
.
If .git/HEAD
contains a commit ID instead of a branch, git calls that
“detached HEAD state”. We’ll get to that later.
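You can see both cases for yourself by peeking at the file. Here’s a quick sketch in a throwaway repository, assuming a reasonably recent git (the -b main flag needs git 2.28+); the commit message is just an example:

```shell
# Throwaway repo so we don't touch any real work
cd "$(mktemp -d)"
git init -q -b main
git -c user.name=me -c user.email=me@example.com commit -q --allow-empty -m "first commit"

cat .git/HEAD        # prints: ref: refs/heads/main
git checkout -q --detach
cat .git/HEAD        # now prints a raw commit ID (detached HEAD state)
```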
(People will sometimes say that HEAD contains a name of a reference or a
commit ID, but I’m pretty sure that the reference has to be a branch.
You can technically make .git/HEAD
contain the name of a reference that
isn’t a branch by manually editing .git/HEAD
, but I don’t think you can do it
with a regular git command. I’d be interested to know if there is a
regular-git-command way to make .git/HEAD a non-branch reference though, and if
so why you might want to do that!)
HEAD
as in git show HEAD
It’s very common to use HEAD
in git commands to refer to a commit ID, like:
git diff HEAD
git rebase -i HEAD^^^
git diff main..HEAD
git reset --hard HEAD@{2}
All of these things (HEAD
, HEAD^^^
, HEAD@{2}
) are called “revision parameters”. They’re documented in man
gitrevisions, and Git will try to
resolve them to a commit ID.
(I’ve honestly never actually heard the term “revision parameter” before, but that’s the term that’ll get you to the documentation for this concept)
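One way to see how git resolves these: git rev-parse takes any revision parameter and prints the commit ID it resolves to. A sketch in a throwaway repo (the commit messages are made up):

```shell
cd "$(mktemp -d)"
git init -q -b main
export GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me@example.com
export GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me@example.com
git commit -q --allow-empty -m "one"
git commit -q --allow-empty -m "two"

git rev-parse HEAD    # prints the commit ID of "two"
git rev-parse HEAD^   # prints the commit ID of "one"
git rev-parse main    # same commit ID as HEAD, since main is checked out
```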
HEAD in git show HEAD
has a pretty simple meaning: it resolves to the
current commit you have checked out! Git resolves HEAD
in one of two ways:

- if .git/HEAD
contains a branch name, it’ll be the latest commit on that branch (for example by reading it from .git/refs/heads/main
)
- if .git/HEAD
contains a commit ID, it’ll be that commit ID

Now we’ve talked about the file .git/HEAD
, and the “revision parameter”
HEAD
, like in git show HEAD
. We’re left with all of the various ways git
uses HEAD
in its output.
git status
: “on branch main” or “HEAD detached”

When you run git status
, the first line will always look like one of these two:

- On branch main
. This means that .git/HEAD
contains a branch.
- HEAD detached at 90c81c72
. This means that .git/HEAD
contains a commit ID.

I promised earlier I’d explain what “HEAD detached” means, so let’s do that now.
“HEAD is detached” or “detached HEAD state” mean that you have no current branch.
Having no current branch is a little dangerous because if you make new commits, those commits won’t be attached to any branch – they’ll be orphaned! Orphaned commits are a problem for 2 reasons:

- they’re easy to lose track of (you can’t run git log somebranch
to find them)
- commits that aren’t reachable from any branch or tag will eventually be garbage collected

Personally I’m very careful about avoiding creating commits in detached HEAD state, though some people prefer to work that way. Getting out of detached HEAD state is pretty easy though, you can either:
- check out an existing branch (git checkout main
)
- create a new branch where you are and check it out (git checkout -b newbranch
)
- if you got into detached HEAD state in the middle of a rebase, abort it (git rebase --abort
)

Okay, back to other git commands which have HEAD
in their output!
git log
: (HEAD -> main)
When you run git log
and look at the first line, you might see one of the following 3 things:
commit 96fa6899ea (HEAD -> main)
commit 96fa6899ea (HEAD, main)
commit 96fa6899ea (HEAD)
It’s not totally obvious how to interpret these, so here’s the deal:

- in the (...)
, git lists every reference that points at that commit, for example (HEAD -> main, origin/main, origin/HEAD)
means HEAD
, main
, origin/main
, and origin/HEAD
all point at that commit (either directly or indirectly)
- HEAD -> main
means that your current branch is main
- if you see HEAD,
instead of HEAD ->
, it means you’re in detached HEAD state (you have no current branch)

If we use these rules to explain the 3 examples above, the result is:
commit 96fa6899ea (HEAD -> main)
means:

- .git/HEAD
contains ref: refs/heads/main
- .git/refs/heads/main
contains 96fa6899ea

commit 96fa6899ea (HEAD, main)
means:

- .git/HEAD
contains 96fa6899ea
(HEAD is “detached”)
- .git/refs/heads/main
also contains 96fa6899ea

commit 96fa6899ea (HEAD)
means:

- .git/HEAD
contains 96fa6899ea
(HEAD is “detached”)
- .git/refs/heads/main
either contains a different commit ID or doesn’t exist

<<<<<<< HEAD is just confusing

When you’re resolving a merge conflict, you might see something like this:
<<<<<<< HEAD
def parse(input):
return input.split("\n")
=======
def parse(text):
return text.split("\n\n")
>>>>>>> somebranch
I find HEAD
in this context extremely confusing and I basically just ignore it. Here’s why.

- In a merge conflict, HEAD
is the same as what HEAD
was when you ran git merge
. Simple.
- In a rebase conflict, HEAD
is something totally
different: it’s the other commit that you’re rebasing on top of. So it’s
totally different from what HEAD
was when you ran git rebase
. It’s like
this because rebase works by first checking out the other commit and then
repeatedly cherry-picking commits on top of it.

Similarly, the meaning of “ours” and “theirs” are flipped in a merge and a rebase.
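Here’s a sketch of the merge case in a throwaway repo (the filenames, contents, and branch names are made up): the HEAD side of the conflict markers is the branch you were on when you ran git merge.

```shell
cd "$(mktemp -d)"
git init -q -b main
export GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me@example.com
export GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me@example.com

echo 'original' > parse.py
git add parse.py && git commit -qm "base"
git checkout -qb somebranch
echo 'their change' > parse.py && git commit -qam "theirs"
git checkout -q main
echo 'our change' > parse.py && git commit -qam "ours"

git merge somebranch >/dev/null 2>&1 || true   # conflict!
cat parse.py
# <<<<<<< HEAD
# our change       <- HEAD = main, the branch we ran `git merge` from
# =======
# their change
# >>>>>>> somebranch
```

In a rebase the two sides would be flipped, which is exactly the confusing behaviour described above.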
The fact that the meaning of HEAD
changes depending on whether I’m doing a
rebase or merge is really just too confusing for me and I find it much simpler
to just ignore HEAD
entirely and use another method to figure out which part
of the code is which.
I think HEAD would be more intuitive if git’s terminology around HEAD were a little more internally consistent.
For example, git talks about “detached HEAD state”, but never about “attached
HEAD state” – git’s documentation never uses the term “attached” at all to
refer to HEAD
. And git talks about being “on” a branch, but never “not on” a
branch.
So it’s very hard to guess that on branch main
is actually the opposite of
HEAD detached
. How is the user supposed to guess that HEAD detached
has
anything to do with branches at all, or that “on branch main” has anything to
do with HEAD
?
If I think of other ways HEAD
is used in Git (especially ways HEAD appears in
Git’s output), I might add them to this post later.
If you find HEAD confusing, I hope this helps a bit!
So I asked about people’s favourite git config options on Mastodon:
what are your favourite git config options to set? Right now I only really have
git config push.autosetupremote true
and git config init.defaultBranch main
set in my ~/.gitconfig
, curious about what other people set
As usual I got a TON of great answers and learned about a bunch of very popular git config options that I’d never heard of.
I’m going to list the options, starting with (very roughly) the most popular ones. Here’s a table of contents:
All of the options are documented in man git-config
, or this page.
pull.ff only
or pull.rebase true
These two were the most popular. These both have similar goals: to avoid accidentally creating a merge commit
when you run git pull
on a branch where the upstream branch has diverged.
- pull.rebase true
is the equivalent of running git pull --rebase
every time you git pull
- pull.ff only
is the equivalent of running git pull --ff-only
every time you git pull
I’m pretty sure it doesn’t make sense to set both of them at once, since --ff-only
overrides --rebase
.
Personally I don’t use either of these since I prefer to decide how to handle
that situation every time, and now git’s default behaviour when your branch has
diverged from the upstream is to just throw an error and ask you what to do
(very similar to what git pull --ff-only
does).
merge.conflictstyle zdiff3
Next: making merge conflicts more readable! merge.conflictstyle zdiff3
and merge.conflictstyle diff3
were both super popular (“totally indispensable”).
The consensus seemed to be “diff3 is great, and zdiff3 (which is newer) is even better!”.
So what’s the deal with diff3
? Well, by default in git, merge conflicts look like this:
<<<<<<< HEAD
def parse(input):
return input.split("\n")
=======
def parse(text):
return text.split("\n\n")
>>>>>>> somebranch
I’m supposed to decide whether input.split("\n")
or text.split("\n\n")
is
better. But how? What if I don’t remember whether \n
or \n\n
is right? Enter diff3!
Here’s what the same merge conflict look like with merge.conflictstyle diff3
set:
<<<<<<< HEAD
def parse(input):
return input.split("\n")
||||||| b9447fc
def parse(input):
return input.split("\n\n")
=======
def parse(text):
return text.split("\n\n")
>>>>>>> somebranch
This has extra information: now the original version of the code is in the middle! So we can see that:

- one side changed \n\n
to \n
- the other side renamed input
to text

So presumably the correct merge conflict resolution is return
text.split("\n")
, since that combines the changes from both sides.
I haven’t used zdiff3, but a lot of people seem to think it’s better. The blog post Better Git Conflicts with zdiff3 talks more about it.
rebase.autosquash true
Autosquash was also a new feature to me. The goal is to make it easier to modify old commits.
Here’s how it works:

- you commit some work, with a message like add parsing code
- later, when you want to modify that commit, you run git commit --fixup OLD_COMMIT_ID
, which gives the new commit the commit message fixup! add parsing code
- when you run git rebase --autosquash main
, it will automatically combine all the fixup!
commits with their targets

rebase.autosquash true
means that --autosquash
always gets passed automatically to git rebase
.
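Here’s a sketch of the whole flow in a throwaway repo (the commit messages and filenames are made up). Setting GIT_SEQUENCE_EDITOR=true just accepts the rebase todo list as-is so the example runs non-interactively:

```shell
cd "$(mktemp -d)"
git init -q -b main
export GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me@example.com
export GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me@example.com

echo one > parse.py && git add parse.py && git commit -qm "initial"
echo two > parse.py && git commit -qam "add parsing code"
echo hi > other.py  && git add other.py && git commit -qm "other work"

# oops, "add parsing code" (currently HEAD~1) needs a fix:
echo three > parse.py
git commit -qa --fixup HEAD~1     # message: "fixup! add parsing code"

# autosquash moves the fixup next to its target and squashes it in
GIT_SEQUENCE_EDITOR=true git rebase -i --autosquash HEAD~3 >/dev/null 2>&1
git log --format=%s               # the fixup! commit is gone
```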
rebase.autostash true
This automatically runs git stash
before a git rebase and git stash pop
after. It basically passes --autostash
to git rebase
.
Personally I’m a little scared of this since it potentially can result in merge conflicts after the rebase, but I guess that doesn’t come up very often for people since it seems like a really popular configuration option.
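Here’s a sketch of what autostash does, in a throwaway repo with made-up filenames: with it set, a rebase with a dirty working tree goes through, and the uncommitted edit comes back afterwards.

```shell
cd "$(mktemp -d)"
git init -q -b main
export GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me@example.com
export GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me@example.com

echo a > a.txt && git add a.txt && git commit -qm "base"
git checkout -qb feature
echo b > b.txt && git add b.txt && git commit -qm "feature work"
git checkout -q main
echo c > c.txt && git add c.txt && git commit -qm "main moved on"
git checkout -q feature

echo "uncommitted edit" >> a.txt              # dirty working tree
git -c rebase.autostash=true rebase -q main   # stash, rebase, pop
grep "uncommitted edit" a.txt                 # the edit survived the rebase
```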
push.default simple
, push.default current
, push.autoSetupRemote true
These push
options tell git push
to automatically push the current branch to a remote branch with the same name.
- push.default simple
is the default in Git. It only works if your branch is already tracking a remote branch
- push.default current
is similar, but it’ll always push the local branch to a remote branch with the same name.
- push.autoSetupRemote true
is a little different – this one makes it so when you first push a branch, it’ll automatically set up tracking for it

I think I prefer push.autoSetupRemote true
to push.default current
because
push.autoSetupRemote true
also lets you pull from the matching remote
branch (though you do need to push first to set up tracking). push.default
current
only lets you push.
I believe the only thing to be careful of with push.autoSetupRemote true
and
push.default current
is that you need to be confident that you’re never going
to accidentally make a local branch with the same name as an unrelated remote
branch. Lots of people have branch naming conventions (like julia/my-change
)
that make this kind of conflict very unlikely, or just have few enough
collaborators that branch name conflicts probably won’t happen.
init.defaultBranch main
Create a main
branch instead of a master
branch when creating a new repo.
commit.verbose true
This adds the whole commit diff in the text editor where you’re writing your commit message, to help you remember what you were doing.
rerere.enabled true
This enables rerere (“reuse recorded resolution”), which remembers how you resolved merge conflicts
during a git rebase
and automatically resolves conflicts for you when it can.
help.autocorrect 10
By default git’s autocorrect tries to check for typos (like git ocmmit
), but won’t actually run the corrected command.
If you want it to run the suggestion automatically, you can set
help.autocorrect
to 1
(run after 0.1 seconds), 10
(run after 1 second), immediate
(run
immediately), or prompt
(run after prompting).
core.pager delta
The “pager” is what git uses to display the output of git diff
, git log
, git show
, etc. People set it to:

- delta
(a fancy diff viewing tool with syntax highlighting)
- less -x5,9
(sets tabstops, which I guess helps if you have a lot of files with tabs in them?)
- less -F -X
(not sure about this one, -F
seems to disable the pager if everything fits on one screen, but my git seems to do that already anyway)
- cat
(to disable paging altogether)

I used to use delta
but turned it off because somehow I messed up the colour
scheme in my terminal and couldn’t figure out how to fix it. I think it’s a
great tool though.
I believe delta also suggests that you set up interactive.diffFilter delta --color-only
to syntax highlight code when you run git add -p
.
diff.algorithm histogram
Git’s default diff algorithm often handles functions being reordered badly. For example look at this diff:
-.header {
+.footer {
margin: 0;
}
-.footer {
+.header {
margin: 0;
+ color: green;
}
I find it pretty confusing. But with diff.algorithm histogram
, the diff looks like this instead, which I find much clearer:
-.header {
- margin: 0;
-}
-
.footer {
margin: 0;
}
+.header {
+ margin: 0;
+ color: green;
+}
Some folks also use patience
, but histogram
seems to be more popular. When to Use Each of the Git Diff Algorithms has more on this.
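You don’t have to commit to the config option to try this out: git diff and git log accept the algorithm per-invocation, which is handy for comparing the output side by side.

```shell
git diff --histogram                  # same as --diff-algorithm=histogram
git log -p --diff-algorithm=histogram
git diff --diff-algorithm=patience    # the other algorithm people mentioned
```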
core.excludesFile
: a global .gitignore

core.excludesFile = ~/.gitignore
lets you set a global gitignore file that
applies to all repositories, for things like .idea
or .DS_Store
that you
never want to commit to any repo. It defaults to ~/.config/git/ignore
.
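Here’s a sketch showing the default path in action, with HOME sandboxed so your real config isn’t touched and .DS_Store as the example:

```shell
export HOME="$(mktemp -d)"    # sandbox: don't touch the real ~/.config
unset XDG_CONFIG_HOME         # make sure git looks at $HOME/.config
mkdir -p "$HOME/.config/git"
echo ".DS_Store" > "$HOME/.config/git/ignore"   # the default global ignore file

cd "$(mktemp -d)"
git init -q
touch .DS_Store
git check-ignore .DS_Store    # prints ".DS_Store": it's ignored
git status --porcelain        # empty: nothing untracked
```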
includeIf
: separate git configs for personal and work

Lots of people said they use this to configure different email addresses for personal and work repositories. You can set it up something like this:
[includeIf "gitdir:~/code/<work>/"]
path = "~/code/<work>/.gitconfig"
url."git@github.com:".insteadOf 'https://github.com/'
I often accidentally clone the HTTP version of a repository instead of the
SSH version and then have to manually go into the repository’s .git/config
and edit the
remote URL. This seems like a nice workaround: it’ll replace
https://github.com
in remotes with git@github.com:
.
Here’s what it looks like in ~/.gitconfig
since it’s kind of a mouthful:
[url "git@github.com:"]
insteadOf = "https://github.com/"
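You can check that the rewrite is working with git ls-remote --get-url, which prints a remote’s URL after applying any insteadOf rules without contacting the remote. A sketch with HOME sandboxed (the repository name is made up):

```shell
export HOME="$(mktemp -d)"   # sandbox ~/.gitconfig
git config --global url."git@github.com:".insteadOf "https://github.com/"

cd "$(mktemp -d)"
git init -q
git remote add origin https://github.com/jvns/mysite
git ls-remote --get-url origin   # prints: git@github.com:jvns/mysite
```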
One person said they use pushInsteadOf
instead to only do the replacement for
git push
because they don’t want to have to unlock their SSH key when
pulling a public repo.
A couple of other people mentioned setting insteadOf = "gh:"
so they can git
remote add gh:jvns/mysite
to add a remote with less typing.
fsckobjects
: avoid data corruption

A couple of people mentioned this one. Someone explained it as “detect data corruption eagerly. Rarely matters but has saved my entire team a couple times”.
transfer.fsckobjects = true
fetch.fsckobjects = true
receive.fsckObjects = true
I’ve never understood anything about submodules but a couple of people said they like to set:
status.submoduleSummary true
diff.submodule log
submodule.recurse true
I won’t attempt to explain those but there’s an explanation on Mastodon by @unlambda here.
Here’s everything else that was suggested by at least 2 people:

- blame.ignoreRevsFile .git-blame-ignore-revs
lets you specify a file with commits to ignore during git blame
, so that giant renames don’t mess up your blames
- branch.sort -committerdate
makes git branch
sort by most recently used branches instead of alphabetically, to make it easier to find branches. tag.sort taggerdate
is similar for tags.
- color.ui false
: to turn off colour
- commit.cleanup scissors
: so that you can write #include
in a commit message without the #
being treated as a comment and removed
- core.autocrlf false
: on Windows, to work well with folks using Unix
- core.editor emacs
: to use emacs (or another editor) to edit commit messages
- credential.helper osxkeychain
: use the Mac keychain for managing credentials
- diff.tool difftastic
: use difftastic (or meld
or nvimdiff
) to display diffs
- diff.colorMoved default
: uses different colours to highlight lines in diffs that have been “moved”
- diff.colorMovedWS allow-indentation-change
: with diff.colorMoved
set, also ignores indentation changes
- diff.context 10
: include more context in diffs
- fetch.prune true
and fetch.prunetags
: automatically delete remote tracking branches that have been deleted
- gpg.format ssh
: allows you to sign commits with SSH keys
- log.date iso
: display dates as 2023-05-25 13:54:51
instead of Thu May 25 13:54:51 2023
- merge.keepbackup false
: to get rid of the .orig
files git creates during a merge conflict
- merge.tool meld
(or nvim
, or nvimdiff
) so that you can use git mergetool
to help resolve merge conflicts
- push.followtags true
: push new tags along with commits being pushed
- rebase.missingCommitsCheck error
: don’t allow deleting commits during a rebase
- rebase.updateRefs true
: makes it much easier to rebase multiple stacked branches at a time. Here’s a blog post about it.

I generally set git config options with git config --global NAME VALUE
, for
example git config --global diff.algorithm histogram
. I usually set all of my
options globally because it stresses me out to have different git behaviour in
different repositories.
If I want to delete an option I’ll edit ~/.gitconfig
manually, where they look like this:
[diff]
algorithm = histogram
My git config is pretty minimal, I already had:

- init.defaultBranch main
- push.autoSetupRemote true
- merge.tool meld
- diff.colorMoved default
(which actually doesn’t even work for me for some reason but I haven’t found the time to debug)

and I added these 3 after writing this blog post:
diff.algorithm histogram
branch.sort -committerdate
merge.conflictstyle zdiff3
I’d probably also set rebase.autosquash
if making carefully crafted pull
requests with multiple commits were a bigger part of my life right now.
I’ve learned to be cautious about setting new config options – it takes me a
long time to get used to the new behaviour and if I change too many things at
once I just get confused. branch.sort -committerdate
is something I was
already using anyway (through an alias), and I’m pretty sold that diff.algorithm
histogram
will make my diffs easier to read when I reorder functions.
I’m always amazed by how useful it is to just ask a lot of people what stuff they like and then list the most commonly mentioned ones, like with this list of new-ish command line tools I put together a couple of years ago. Having a list of 20 or 30 options to consider feels so much more efficient than combing through a list of all 600 or so git config options.
It was a little confusing to summarize these because git’s default options have actually changed a lot over the years, so people occasionally have options set that were important 8 years ago but today are the default. Also a couple of the experimental options people were using have been removed and replaced with a different version.
I did my best to explain things accurately as of how git works right now in 2024 but I’ve definitely made mistakes in here somewhere, especially because I don’t use most of these options myself. Let me know on Mastodon if you see a mistake and I’ll try to fix it.
I might also ask people about aliases later, there were a bunch of great ones that I left out because this was already getting long.
In this post we’re talking about a situation where a local branch (maybe called main
) and a remote branch (maybe also called
main
) have diverged.
There are two things that make this situation hard:

- it can be hard to recognize that your local main
has diverged from the remote main
(git
will often just give you an intimidating but generic error message like
! [rejected] main -> main (non-fast-forward) error: failed to push some refs to 'github.com:jvns/int-exposed'
)
- once you realize that your branch has diverged from main
, there’s
no single clear way to handle it (what you need to do depends on the
situation and your git workflow)

So let’s talk about a) how to recognize when you’re in a situation where a local branch and remote branch have diverged and b) what you can do about it! Here’s a quick table of contents:
Let’s start with what it means for 2 branches to have “diverged”.
If you have a local main
and a remote main
, there are 4 basic configurations:
1: up to date. The local and remote main
branches are in the exact same place. Something like this:
a - b - c - d
^ LOCAL
^ REMOTE
2: local is behind
Here you might want to git pull
. Something like this:
a - b - c - d - e
^ LOCAL ^ REMOTE
3: remote is behind
Here you might want to git push
. Something like this:
a - b - c - d - e
^ REMOTE ^ LOCAL
4: they’ve diverged :(
This is the situation we’re talking about in this blog post. It looks something like this:
a - b - c - d - e
\ ^ LOCAL
-- f
^ REMOTE
There’s no one recipe for resolving this (how you want to handle it depends on the situation and your git workflow!) but let’s talk about how to recognize that you’re in that situation and some options for how to resolve it.
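Here’s a sketch that manufactures situation 4 in throwaway repositories (the directory names are made up), so you can see what git says:

```shell
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me@example.com
export GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me@example.com

git init -q -b main upstream
(cd upstream && git commit -q --allow-empty -m "base")
git clone -q upstream work

# both sides move on independently:
(cd upstream && git commit -q --allow-empty -m "remote-only commit")
cd work
git commit -q --allow-empty -m "local-only commit"

git fetch -q
git status   # → "Your branch and 'origin/main' have diverged, ..."
```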
There are 3 main ways to tell that your branch has diverged.
git status
The easiest way is to run git fetch
and then git status
. You’ll get a message something like this:
$ git fetch
$ git status
On branch main
Your branch and 'origin/main' have diverged, <-- here's the relevant line!
and have 1 and 2 different commits each, respectively.
(use "git pull" to merge the remote branch into yours)
git push
When I run git push
, sometimes I get an error like this:
$ git push
To github.com:jvns/int-exposed
! [rejected] main -> main (non-fast-forward)
error: failed to push some refs to 'github.com:jvns/int-exposed'
hint: Updates were rejected because the tip of your current branch is behind
hint: its remote counterpart. Integrate the remote changes (e.g.
hint: 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.
This doesn’t always mean that my local main
and the remote main
have
diverged (it could just mean that my main
is behind), but for me it often
means that. So if that happens I might run git fetch
and git status
to
check.
git pull
If I git pull
when my branches have diverged, I get this error message:
$ git pull
hint: You have divergent branches and need to specify how to reconcile them.
hint: You can do so by running one of the following commands sometime before
hint: your next pull:
hint:
hint: git config pull.rebase false # merge
hint: git config pull.rebase true # rebase
hint: git config pull.ff only # fast-forward only
hint:
hint: You can replace "git config" with "git config --global" to set a default
hint: preference for all repositories. You can also pass --rebase, --no-rebase,
hint: or --ff-only on the command line to override the configured default per
hint: invocation.
fatal: Need to specify how to reconcile divergent branches.
This is pretty clear about the issue (“you have divergent branches”).
git pull
doesn’t always spit out this error message though when your branches have diverged: it depends on how
you configure git. The three other options I’m aware of are:

- if you’ve set git config pull.rebase false
, it’ll automatically start merging the remote main
- if you’ve set git config pull.rebase true
, it’ll automatically start rebasing onto the remote main
- if you’ve set git config pull.ff only
, it’ll exit with the error fatal: Not possible to fast-forward, aborting.
Now that we’ve talked about some ways to recognize that you’re in a situation where your local branch has diverged from the remote one, let’s talk about what you can do about it.
There’s no “best” way to resolve branches that have diverged – it really depends on your workflow for git and why the situation is happening.
I use 3 main solutions, depending on the situation:

- combine the local and remote changes by rebasing my local main
onto the remote main
. To do this, I’ll run git
pull --rebase
- overwrite the remote changes with my local ones: git push --force
- overwrite my local changes with the remote ones: git reset --hard origin/main

Here are some more details about all 3 of these solutions.
git pull --rebase
This is what I do when I want to keep both sets of changes. It rebases main
onto the remote main
branch. I mostly use this in repositories where I’m
doing all of my work on the main
branch.
You can configure git config pull.rebase true
, to do this automatically every
time, but I don’t because sometimes I actually want to use solutions 2 or 3
(overwrite my local changes with the remote, or the reverse). I’d rather be
warned “hey, these branches have diverged, how do you want to handle it?” and
decide for myself if I want to rebase or not.
git pull --no-rebase
This starts a merge between the local
and remote main
. Here you’ll need to:

- run git pull --no-rebase
. This starts a merge and (if it succeeds) opens a text editor so that you can confirm that you want to commit the merge
- resolve the merge conflicts, if there are any

I don’t have too much to say about this because I’ve never done it. I always use rebase instead. That’s a personal workflow choice though, lots of people have very legitimate reasons to avoid rebase.
git push --force
Sometimes I know that the work on the remote main
is actually useless and I
just want to overwrite it with whatever is on my local main
.
I do this pretty often on private repositories where I’m the only committer, for example I might:

- git push
some commits
- realize I want to change them, and run git commit --amend
- run git push --force
Of course, if the repository has many different committers, force-pushing in this way can cause a lot of problems. On shared repositories I’ll usually enable github branch protection so that it’s impossible to force push.
git push --force-with-lease
I’ve still never actually used git push --force-with-lease
, but I’ve seen a
lot of people recommend it as an alternative to git push --force
that makes
sure that nobody else has changed the branch since the last time you pushed or
fetched, so that you don’t accidentally blow their changes away.
Seems like a good option. I did notice that --force-with-lease
isn’t
foolproof though – for example this git commit
talks about how if you use VSCode’s autofetching feature to continuously git fetch
,
then --force-with-lease
won’t help you.
Apparently now Git also has --force-with-lease --force-if-includes
(documented here),
which I think checks the reflog to make sure that you’ve already integrated the
remote branch into your branch somehow. I still don’t totally understand this
but I found this stack overflow conversation
helpful.
git reset --hard origin/main
You can use this as the reverse of git push --force
(since there’s no git pull --force
). I do this when I know that
my local work shouldn’t be there and I want to throw it away and replace it
with whatever’s on the remote branch.
For example, I might do this if I accidentally made a commit to main
that
actually should have been on new branch. In that case I’ll also create a new
branch (new-branch
in this example) to store my local work on the main
branch, so it’s not really being thrown away.
Fixing that problem looks like this:
git checkout main
# 1. create `new-branch` to store my work
git checkout -b new-branch
# 2. go back to the `main` branch I messed up
git checkout main
# 3. make sure that my `origin/main` is up to date
git fetch
# 4. double check to make sure I don't have any uncommitted
# work because `git reset --hard` will blow it away
git status
# 5. force my local branch to match the remote `main`
# NOTE: replace `origin/main` with the actual name of the
# remote/branch, you can get this from `git status`.
git reset --hard origin/main
This “store your work on main
on a new branch and then git reset --hard
” pattern can
also be useful if you’re not sure yet how to solve the conflict, since most
people are more used to merging 2 local branches than dealing with merging a
remote branch.
As always git reset --hard
is a dangerous action and you can permanently lose
your uncommitted work. I always run git status
first to make sure I don’t
have any uncommitted changes.
Some alternatives to using git reset --hard
for this:

- git branch -f main origin/main
- git fetch origin main:main --force
I’d never really thought about how confusing the git push
and git pull
error messages can be if you’re not used to reading them.
I recently posted an image about everything in your .git
directory and someone requested a text version, so here it is. I added some
extra notes too. First, here’s the image. It’s a ~15 word explanation of each
part of your .git
directory.
You can git clone https://github.com/jvns/inside-git
if you want to run all
these examples yourself.
Here’s a table of contents:
The first 5 parts (HEAD
, branch, commit, tree, blobs) are the core of git.
.git/HEAD
HEAD
is a tiny file that just contains the name of your current branch.
Example contents:
$ cat .git/HEAD
ref: refs/heads/main
HEAD
can also be a commit ID, that’s called “detached HEAD state”.
.git/refs/heads/main
A branch is stored as a tiny file that just contains 1 commit ID. It’s stored
in a folder called refs/heads
.
Example contents:
$ cat .git/refs/heads/main
1093da429f08e0e54cdc2b31526159e745d98ce0
.git/objects/10/93da429...
A commit is a small file containing its parent(s), message, tree, and author.
Example contents:
$ git cat-file -p 1093da429f08e0e54cdc2b31526159e745d98ce0
tree 9f83ee7550919867e9219a75c23624c92ab5bd83
parent 33a0481b440426f0268c613d036b820bc064cdea
author Julia Evans <julia@example.com> 1706120622 -0500
committer Julia Evans <julia@example.com> 1706120622 -0500
add hello.py
These files are compressed, so the best way to see objects is with git cat-file -p HASH
.
.git/objects/9f/83ee7550...
Trees are small files with directory listings. The files in it are called blobs.
Example contents:
$ git cat-file -p 9f83ee7550919867e9219a75c23624c92ab5bd83
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 .gitignore
100644 blob 665c637a360874ce43bf74018768a96d2d4d219a hello.py
040000 tree 24420a1530b1f4ec20ddb14c76df8c78c48f76a6 lib
The permissions here LOOK like unix permissions, but they’re actually super restricted: only 644 and 755 are allowed.
.git/objects/5a/475762c...
blobs are the files that contain your actual code
Example contents:
$ git cat-file -p 665c637a360874ce43bf74018768a96d2d4d219a
print("hello world!")
Storing a new blob with every change can get big, so git gc
periodically
packs them for efficiency in .git/objects/pack
.
.git/logs/refs/heads/main
The reflog stores the history of every branch, tag, and HEAD. For (mostly) every file in .git/refs
, there’s a corresponding log in .git/logs/refs
.
Example content for the main
branch:
$ tail -n 1 .git/logs/refs/heads/main
33a0481b440426f0268c613d036b820bc064cdea
1093da429f08e0e54cdc2b31526159e745d98ce0
Julia Evans <julia@example.com>
1706119866 -0500
commit: add hello.py
each line of the reflog has:

- the previous commit ID
- the new commit ID
- the author
- a timestamp
- a log message (like commit: add hello.py)

Normally it’s all one line, I just wrapped it for readability here.
.git/refs/remotes/origin/main
Remote-tracking branches store the most recently seen commit ID for a remote branch
Example content:
$ cat .git/refs/remotes/origin/main
fcdeb177797e8ad8ad4c5381b97fc26bc8ddd5a2
When git status says “you’re up to date with origin/main
”, it’s just looking
at this. It’s often out of date, you can update it with git fetch origin
main
.
.git/refs/tags/v1.0
A tag is a tiny file in .git/refs/tags
containing a commit ID.
Example content:
$ cat .git/refs/tags/v1.0
1093da429f08e0e54cdc2b31526159e745d98ce0
Unlike branches, when you make new commits it doesn’t update the tag.
.git/refs/stash
The stash is a tiny file called .git/refs/stash
. It contains the commit ID of a commit that’s created when you run git stash
.
$ cat .git/refs/stash
62caf3d918112d54bcfa24f3c78a94c224283a78
The stash is a stack, and previous values are stored in .git/logs/refs/stash
(the reflog for stash
).
$ cat .git/logs/refs/stash
62caf3d9 e85c950f Julia Evans <julia@example.com> 1706290652 -0500 WIP on main: 1093da4 add hello.py
00000000 62caf3d9 Julia Evans <julia@example.com> 1706290668 -0500 WIP on main: 1093da4 add hello.py
Unlike branches and tags, if you git stash pop a commit from the stash, it’s deleted from the reflog so it’s almost impossible to find it again. The stash is the only reflog in git where things get deleted very soon after they’re added. (entries expire out of the branch reflogs too, but generally only after 90 days)
A note on refs:
At this point you’ve probably noticed that a lot of things (branches, remote-tracking branches, tags, and the stash) are commit IDs in .git/refs.
They’re called “references” or “refs”. Every ref is a commit ID, but the
different types of refs are treated VERY differently by git, so I find it
useful to think about them separately even though they all use
the same file format. For example, git deletes things from the stash reflog in
a way that it won’t for branch or tag reflogs.
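Since branches, tags, and HEAD are all just little files, “resolving” a ref to a commit ID is mostly reading files and following one level of indirection. Here’s a rough sketch (simplified: it handles loose ref files and packed-refs, but none of git’s other lookup rules):

```python
import os
import tempfile

def resolve_ref(git_dir: str, name: str = "HEAD") -> str:
    # Follow "ref: ..." indirections (HEAD -> refs/heads/main -> commit ID).
    path = os.path.join(git_dir, name)
    if os.path.exists(path):
        with open(path) as f:
            content = f.read().strip()
        if content.startswith("ref: "):
            return resolve_ref(git_dir, content[len("ref: "):])
        return content
    # Refs can also live in packed-refs, one "<commit-id> <refname>" per line.
    packed = os.path.join(git_dir, "packed-refs")
    if os.path.exists(packed):
        with open(packed) as f:
            for line in f:
                if line.startswith(("#", "^")):
                    continue
                commit_id, _, refname = line.strip().partition(" ")
                if refname == name:
                    return commit_id
    raise KeyError(name)

# Demo: fabricate a tiny .git directory and resolve HEAD through the branch.
git_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(git_dir, "refs", "heads"))
with open(os.path.join(git_dir, "HEAD"), "w") as f:
    f.write("ref: refs/heads/main\n")
with open(os.path.join(git_dir, "refs", "heads", "main"), "w") as f:
    f.write("96fa6899ea34697257e84865fefc56beb42d6390\n")
head_commit = resolve_ref(git_dir)
```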
.git/config
This is a config file for the repository. It’s where you configure your remotes.
Example content:
[remote "origin"]
url = git@github.com:jvns/int-exposed
fetch = +refs/heads/*:refs/remotes/origin/*
[branch "main"]
remote = origin
merge = refs/heads/main
git has local and global settings, the local settings are here and the global ones are in ~/.gitconfig.
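The format is INI-like and simple enough to parse by hand. A minimal sketch (it ignores git-config features like includes, conditional sections, and multi-valued keys):

```python
def parse_git_config(text: str) -> dict:
    # Sections look like [remote "origin"]; settings are "key = value"
    # lines, usually indented.
    config, section = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            config.setdefault(section, {})
        elif "=" in line and section is not None:
            key, _, value = line.partition("=")
            config[section][key.strip()] = value.strip()
    return config

EXAMPLE = '''[remote "origin"]
    url = git@github.com:jvns/int-exposed
    fetch = +refs/heads/*:refs/remotes/origin/*
[branch "main"]
    remote = origin
    merge = refs/heads/main
'''
cfg = parse_git_config(EXAMPLE)
```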
hooks
.git/hooks/pre-commit
Hooks are optional scripts that you can set up to run (eg before a commit) to do anything you want.
Example content:
#!/bin/bash
any-commands-you-want
(this obviously isn’t a real pre-commit hook)
.git/index
The staging area stores files when you’re preparing to commit. This one is a binary file, unlike a lot of things in git which are essentially plain text files.
As far as I can tell the best way to look at the contents of the index is with git ls-files --stage
:
$ git ls-files --stage
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0 .gitignore
100644 665c637a360874ce43bf74018768a96d2d4d219a 0 hello.py
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0 lib/empty.py
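If you want to poke at that output programmatically, each line of git ls-files --stage is a mode, an object ID, a stage number, and then (in the real output) a tab followed by the path. A small parsing sketch:

```python
def parse_ls_files_stage(output: str) -> list:
    # Each line: "<mode> <object-id> <stage>\t<path>"
    entries = []
    for line in output.strip().splitlines():
        meta, _, path = line.partition("\t")
        mode, oid, stage = meta.split()
        entries.append({"mode": mode, "oid": oid,
                        "stage": int(stage), "path": path})
    return entries

output = ("100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0\t.gitignore\n"
          "100644 665c637a360874ce43bf74018768a96d2d4d219a 0\thello.py\n")
entries = parse_ls_files_stage(output)
```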
There are some other things in .git like FETCH_HEAD, worktrees, and info. I only included the ones that I’ve found it useful to understand.
One of the most common pieces of advice I hear about git is “just learn how the .git directory is structured and then you’ll understand everything!”.
I love understanding the internals of things more than anyone, but there’s a LOT that “how the .git directory is structured” doesn’t explain.
Hopefully this will be useful to some folks out there though.
Understanding how git commits are implemented feels pretty straightforward to me (those are facts! I can look it up!), but it’s been much harder to figure out how other people think about commits. So like I’ve been doing a lot recently, I went on Mastodon and started asking some questions.
I did a highly unscientific poll on Mastodon about how people think about Git commits: is it a snapshot? is it a diff? is it a list of every previous commit? (Of course it’s legitimate to think about it as all three, but I was curious about the primary way people think about Git commits). Here it is:
The results were:
I was really surprised that it was so evenly split between diffs and snapshots. People also made some interesting kind of contradictory statements like “in my mind a commit is a diff, but I think it’s actually implemented as a snapshot” and “in my mind a commit is a snapshot, but I think it’s actually implemented as a diff”. We’ll talk more about how a commit is actually implemented later in the post.
Before we go any further: when we say “a diff” or “a snapshot”, what does that mean?
What I mean by a diff is probably obvious: it’s what you get when you run git show COMMIT_ID. For example here’s a typo fix from rbspy:
diff --git a/src/ui/summary.rs b/src/ui/summary.rs
index 5c4ff9c..3ce9b3b 100644
--- a/src/ui/summary.rs
+++ b/src/ui/summary.rs
@@ -160,7 +160,7 @@ mod tests {
";
let mut buf: Vec<u8> = Vec::new();
- stats.write(&mut buf).expect("Callgrind write failed");
+ stats.write(&mut buf).expect("summary write failed");
let actual = String::from_utf8(buf).expect("summary output not utf8");
assert_eq!(actual, expected, "Unexpected summary output");
}
You can see it on GitHub here: https://github.com/rbspy/rbspy/commit/24ad81d2439f9e63dd91cc1126ca1bb5d3a4da5b
When I say “a snapshot”, what I mean is “all the files that you get when you run git checkout COMMIT_ID”.
Git often calls the list of files for a commit a “tree” (as in “directory tree”), and you can see all of the files for the above example commit here on GitHub:
https://github.com/rbspy/rbspy/tree/24ad81d2439f9e63dd91cc1126ca1bb5d3a4da5b (it’s /tree/ instead of /commit/)
Probably the most common piece of advice I hear related to learning Git is “just learn how Git represents things internally, and everything will make sense”. I obviously find this perspective extremely appealing (if you’ve spent any time reading this blog, you know I love thinking about how things are implemented internally).
But as a strategy for teaching Git, it hasn’t been as successful as I’d hoped! Often I’ve eagerly started explaining “okay, so git commits are snapshots with a pointer to their parent, and then a branch is a pointer to a commit, and…”, but the person I’m trying to help will tell me that they didn’t really find that explanation that useful at all and they still don’t get it. So I’ve been considering other options.
Let’s talk about the internals a bit anyway though.
Internally, git represents commits as snapshots (it stores the “tree” of the current version of every file). I wrote about this in In a git repository, where do your files live?, but here’s a very quick summary of what the internal format looks like.
Here’s how a commit is represented:
$ git cat-file -p 24ad81d2439f9e63dd91cc1126ca1bb5d3a4da5b
tree e197a79bef523842c91ee06fa19a51446975ec35
parent 26707359cdf0c2db66eb1216bf7ff00eac782f65
author Adam Jensen <adam@acj.sh> 1672104452 -0500
committer Adam Jensen <adam@acj.sh> 1672104890 -0500
Fix typo in expectation message
and here’s what we get when we look at this tree object: a list of every file / subdirectory in the repository’s root directory as of that commit:
$ git cat-file -p e197a79bef523842c91ee06fa19a51446975ec35
040000 tree 2fcc102acd27df8f24ddc3867b6756ac554b33ef .cargo
040000 tree 7714769e97c483edb052ea14e7500735c04713eb .github
100644 blob ebb410eb8266a8d6fbde8a9ffaf5db54a5fc979a .gitignore
100644 blob fa1edfb73ce93054fe32d4eb35a5c4bee68c5bf5 ARCHITECTURE.md
100644 blob 9c1883ee31f4fa8b6546a7226754cfc84ada5726 CODE_OF_CONDUCT.md
100644 blob 9fac1017cb65883554f821914fac3fb713008a34 CONTRIBUTORS.md
100644 blob b009175dbcbc186fb8066344c0e899c3104f43e5 Cargo.lock
100644 blob 94b87cd2940697288e4f18530c5933f3110b405b Cargo.toml
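Both of those cat-file outputs are small line-oriented text formats. As a rough sketch, here’s how you could parse the pretty-printed commit (note: this parses the `git cat-file -p` output shown above, not git’s on-disk binary encoding):

```python
def parse_commit(text: str) -> dict:
    # Header lines ("key value"), a blank line, then the commit message.
    headers, _, message = text.partition("\n\n")
    commit = {"parents": [], "message": message}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)  # merge commits have several
        else:
            commit[key] = value
    return commit

text = """tree e197a79bef523842c91ee06fa19a51446975ec35
parent 26707359cdf0c2db66eb1216bf7ff00eac782f65
author Adam Jensen <adam@acj.sh> 1672104452 -0500
committer Adam Jensen <adam@acj.sh> 1672104890 -0500

Fix typo in expectation message
"""
commit = parse_commit(text)
```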
What this means is that checking out a Git commit is always fast: it’s just as easy for Git to check out a commit from yesterday as it is to check out a commit from 1 million commits ago. Git never has to replay 10000 diffs to figure out the current state or anything, because commits just aren’t stored as diffs.
I just said that Git commits are snapshots, but when someone says “I think of git commits as a snapshot, but I think internally they’re actually diffs”, that’s actually kind of true too! Git commits are not represented as diffs in the sense you’re probably used to (they’re not represented on disk as a diff from the previous commit), but the basic intuition that if you edit a 10,000-line file 500 times, it would be inefficient to store 500 full copies of it is right.
Git does have a way of storing files as differences from other versions. This is called “packfiles” and periodically git will do a garbage collection and compress your data into packfiles to save disk space. When you git clone a repository git will also compress the data.
I don’t have space for a full explanation of how packfiles work in this post (Aditya Mukerjee’s Unpacking Git packfiles is my favourite writeup of how they work). But here’s a quick summary of my understanding of how deltas work and how they’re different from diffs:
When I run git show SOME_COMMIT to look at the diff for a commit, what actually happens is kind of counterintuitive. My understanding is that git finds the commit and its parent, reconstructs the full snapshot of each one (resolving any packfile deltas along the way), and then calculates the diff between the two snapshots on the fly.
So it takes deltas, turns them into a snapshot, and then calculates a diff. It feels a little weird because it starts with a diff-like-thing and ends up with another diff-like-thing, but the deltas and diffs are actually totally different so it makes sense.
That said, the way I think of it is that git stores commits as snapshots and packfiles are just an implementation detail to save disk space and make clones faster. I’ve never actually needed to know how packfiles work for any practical reason, but it does help me understand how it’s possible for git commits to be snapshots without using way too much disk space.
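To make the delta-vs-diff distinction concrete, here’s a toy sketch (with a made-up delta format – real packfile deltas are a binary encoding of copy/insert instructions) of reconstructing a snapshot from a delta and then computing a diff from the snapshots:

```python
import difflib

def apply_delta(base: str, ops) -> str:
    # A delta is a list of "copy this range of the base" and "insert
    # these new bytes" instructions -- it reconstructs content, it
    # doesn't describe a change the way a diff does.
    out = []
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out.append(base[offset:offset + length])
        else:  # ("insert", data)
            out.append(op[1])
    return "".join(out)

base = "line one\nline two\nline three\n"
# Reconstruct the child snapshot from a delta against the base...
child = apply_delta(base, [("copy", 0, 9),
                           ("insert", "line 2!\n"),
                           ("copy", 18, 11)])
# ...then compute a human-readable diff between the two snapshots,
# which is roughly the shape of what `git show` does.
diff_lines = list(difflib.unified_diff(base.splitlines(),
                                       child.splitlines(), lineterm=""))
```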
I think a pretty common “wrong” mental model for Git is:
This model is obviously not true (in real life, commits are stored as snapshots, and diffs are calculated from those snapshots), but it seems very useful and coherent to me! It gets a little weird with merge commits, but maybe you just say it’s stored as a diff from the first parent of the merge.
I think wrong mental models are often extremely useful, and this one doesn’t seem very problematic to me for every day Git usage. I really like that it makes the thing that we deal with the most often (the diff) the most fundamental – it seems really intuitive to me.
I’ve also been thinking about other “wrong” mental models you can have about Git which seem pretty useful like:
I feel like there’s a whole very coherent “wrong” set of ideas you can have about git that are pretty well supported by Git’s UI and not very problematic most of the time. I think it can get messy when you want to undo a change or when something goes wrong though.
Personally even though I know that in Git commits are snapshots, I probably think of them as diffs most of the time, because when you run git show, you see the diff, so it’s just what I’m used to seeing.
I also think about commits as snapshots sometimes though, because:
it matches how I compare versions of files outside git (like looking at how old.py and new.py are similar), and it helps me understand what git checkout COMMIT_ID is doing (the idea of replaying 10000 commits just feels stressful to me).
Some folks in the Mastodon replies also mentioned:
some other words people use to talk about commits might be less ambiguous.
It’s been very difficult for me to get a sense of what different mental models people have for git. It’s especially tricky because people get really into policing “wrong” mental models even though those “wrong” models are often really useful, so folks are reluctant to share their “wrong” ideas for fear of some Git Explainer coming out of the woodwork to explain to them why they’re Wrong. (these Git Explainers are often well-intentioned, but it still has a chilling effect either way)
But I’ve been learning a lot! I still don’t feel totally clear about how I want to talk about commits, but we’ll get there eventually.
Thanks to Marco Rogers, Marie Flanagan, and everyone on Mastodon for talking to me about git commits.
My motivation for this was that previously I was using Ansible to provision the server, but then I’d ad hoc installed a bunch of stuff on the server in a chaotic way separately from Ansible, so in the end I had no real idea of what was on that server and it felt like it would be a huge pain to recreate it if I needed to.
This server just runs a few small personal Go services, so it seemed like a good candidate for experimentation.
I had trouble finding explanations of how to set up NixOS and I needed to cobble together instructions from a bunch of different places, so here’s a very short summary of what worked for me.
I think the reason NixOS feels more reliable than Ansible to me is that NixOS is the operating system. It has full control over all your users and services and packages, and so it’s easier for it to reliably put the system into the state you want it to be in.
Because Nix has so much control over the OS, I think that if I tried to make any ad-hoc changes at all to my Nix system, Nix would just blow them away the next time I ran nixos-rebuild. But with Ansible, Ansible only controls a few small parts of the system (whatever I explicitly tell it to manage), so it’s easy to make changes outside Ansible.
That said, here’s what I did to set up NixOS on my server and run a Go service on it.
To install NixOS, I created a new Hetzner instance running Ubuntu, and then ran nixos-infect on it to convert the Ubuntu installation into a NixOS install, like this:
curl https://raw.githubusercontent.com/elitak/nixos-infect/master/nixos-infect | PROVIDER=hetznercloud NIX_CHANNEL=nixos-23.11 bash 2>&1 | tee /tmp/infect.log
I originally tried to do this on DigitalOcean, but it didn’t work for some reason, so I went with Hetzner instead and that worked.
This isn’t the only way to install NixOS (this wiki page lists options for setting up NixOS cloud servers), but it seemed to work. It’s possible that there are problems with installing that way that I don’t know about though. It does feel like using an ISO is probably better because that way you don’t have to do this transmogrification of Ubuntu into NixOS.
I definitely skipped Step 1 in nixos-infect’s README (“Read and understand the script”), but I didn’t feel too worried because I was running it on a new instance and I figured that if something went wrong I’d just delete it.
Next I needed to copy the generated Nix configuration to a new local Git repository, like this:
scp root@SERVER_IP:/etc/nixos/* .
This copied 3 files: configuration.nix, hardware-configuration.nix, and networking.nix. configuration.nix is the main file. I didn’t touch anything in hardware-configuration.nix or networking.nix.
I created a flake to wrap configuration.nix. I don’t remember why I did this (I have some idea of what the advantages of flakes are, but it’s not clear to me if any of them are actually relevant in this case) but it seems to work. Here’s my flake.nix:
{ inputs.nixpkgs.url = "github:NixOS/nixpkgs/23.11";
outputs = { nixpkgs, ... }: {
nixosConfigurations.default = nixpkgs.lib.nixosSystem {
system = "x86_64-linux";
modules = [ ./configuration.nix ];
};
};
}
The main gotcha about flakes that I needed to remember here was that you need to git add every .nix file you create, otherwise Nix will pretend it doesn’t exist.
The rules about git and flakes seem to be: you don’t have to commit your changes, but every new .nix file has to be git added, and unstaged changes to files that have already been git added are picked up fine.
These rules feel very counterintuitive to me (why require that you git add files but allow unstaged changes?) but that’s how it works. I think it might be an optimization because Nix has to copy all your .nix files to the Nix store for some reason, so only copying files that have been git added makes the copy faster. There’s a GitHub issue tracking it here so maybe the way this works will change at some point.
Next I needed to figure out how to deploy changes to my configuration. There are a bunch of tools for this, but I found the blog post Announcing nixos-rebuild: a “new” deployment tool for NixOS that said you can just use the built-in nixos-rebuild, which has --target-host and --build-host options so that you can specify which host to build on and deploy to, so that’s what I did.
I wanted to be able to get Go repositories and build the Go code on the target host, so I created a bash script that runs this command:
nixos-rebuild switch --fast --flake .#default --target-host my-server --build-host my-server --option eval-cache false
Making --target-host and --build-host the same machine is certainly not something I would do for a Serious Production Machine, but this server is extremely unimportant so it’s fine.
This --option eval-cache false is because Nix kept not showing me my errors because they were cached – it would just say error: cached failure of attribute 'nixosConfigurations.default.config.system.build.toplevel' instead of showing me the actual error message. Setting --option eval-cache false turned off caching so that I could see the error messages.
Now I could run bash deploy.sh on my laptop and deploy my configuration to the server! Hooray!
I also needed to set up a my-server host in my ~/.ssh/config. I set up SSH agent forwarding so that the server could download the private Git repositories it needed to access.
Host my-server
Hostname MY_IP_HERE
User root
Port 22
ForwardAgent yes
AddKeysToAgent yes
The thing I found the hardest was to figure out how to compile and configure a Go web service to run on the server. The norm seems to be to define your package and define your service’s configuration in 2 different files, but I didn’t feel like doing that – I wanted to do it all in one file. I couldn’t find a simple example of how to do this, so here’s what I did.
I’ve replaced the actual repository name with my-service because it’s a private repository and you can’t run it anyway.
{ pkgs ? (import <nixpkgs> { }), lib, stdenv, ... }:
let myservice = pkgs.callPackage pkgs.buildGoModule {
name = "my-service";
src = fetchGit {
url = "git@github.com:jvns/my-service.git";
rev = "efcc67c6b0abd90fb2bd92ef888e4bd9c5c50835"; # put the right git sha here
};
vendorHash = "sha256-b+mHu+7Fge4tPmBsp/D/p9SUQKKecijOLjfy9x5HyEE"; # nix will complain about this and tell you the right value
}; in {
services.caddy.virtualHosts."my-service.example.com".extraConfig = ''
reverse_proxy localhost:8333
'';
systemd.services.my-service = {
enable = true;
description = "my-service";
after = ["network.target"];
wantedBy = ["multi-user.target"];
script = "${myservice}/bin/my-service";
environment = {
DB_FILENAME = "/var/lib/my-service/db.sqlite";
};
serviceConfig = {
DynamicUser = true;
StateDirectory = "my-service"; # /var/lib/my-service
};
};
}
Then I just needed to do 2 more things: add ./my-service.nix to the imports section of configuration.nix, and add services.caddy.enable = true; to configuration.nix to enable Caddy. And everything worked!!
Some notes on this service configuration file:
I used extraConfig to configure Caddy because I didn’t feel like learning Nix’s special Caddy syntax – I wanted to just be able to refer to the Caddy documentation directly.
I used DynamicUser to create a user dynamically to run the service. I’d never used this before but it seems like a great simple way to create a different user for every service without having to write a bunch of repetitive boilerplate and being really careful to choose unique UIDs and GIDs. The blog post Dynamic Users with systemd talks about how it works.
I used StateDirectory to get systemd to create a persistent directory where I could store a SQLite database. It creates a directory at /var/lib/my-service/
I’d never heard of DynamicUser or StateDirectory before Kamal told me about them the other day but they seem like cool systemd features and I wish I’d known about them earlier.
One quick note on Caddy: I switched to Caddy a while back from nginx because it automatically sets up Let’s Encrypt certificates. I’ve only been using it for tiny hobby services, but it seems pretty great so far for that, and its configuration language is simpler too.
One problem I ran into was this error message:
error: in pure evaluation mode, 'fetchTree' requires a locked input, at «none»:0
I found this really perplexing – what is fetchTree? What is «none»:0? What did I do wrong?
I learned 4 things from debugging this (with help from the folks in the Nix discord):
fetchGit calls an internal function called fetchTree, so errors that say fetchTree might actually be referring to fetchGit
Nix can print a stack trace for errors if you pass --show-trace
this particular error doesn’t come with a stack trace, even with --show-trace. I’m not sure why this is. Some people told me this is because fetchTree is a built in function but – why can’t I see the line number in my nix code that called that built in function?
you can pass --option eval-cache false to turn off caching so that Nix will always show you the error message instead of error: cached failure of attribute 'nixosConfigurations.default.config.system.build.toplevel'
Ultimately the problem turned out to just be that I forgot to pass the Github revision ID (rev = "efcc67c6b0abd90fb2bd92ef888e4bd9c5c50835";) to fetchGit, which was really easy to fix.
I still don’t really understand the nix language syntax that well, but I haven’t felt motivated to get better at it yet – I guess learning new language syntax just isn’t something I find fun. Maybe one day I’ll learn it. My plan for now with NixOS is to just keep copying and pasting that my-service.nix file above forever.
I think my main outstanding questions are:
when I run nixos-rebuild, Nix seems to check that my systemd services are still working in some way. What does it check exactly? My best guess is that it checks that the systemd service starts successfully, but if the service starts and then immediately crashes, it won’t notice.
I really do like having all of my service configuration defined in one file, and the approach Nix takes does feel more reliable than the approach I was taking with Ansible.
I just started doing this a week ago and as with all things Nix I have no idea if I’ll end up liking it or not. It seems pretty good so far though!
I will say that I find using Nix to be very difficult and I really struggle when debugging Nix problems (that fetchTree problem I mentioned sounds simple, but it was SO confusing to me at the time), but I kind of like it anyway. Maybe because I’m not using Linux on my laptop right now I miss having linux evenings and Nix feels like a replacement for that :)
I published How Integers and Floats Work, which I worked on with Marie.
This one started out its life as “how your computer represents things in memory”, but once we’d explained how integers and floats were represented in memory the zine was already long enough, so we just kept it to integers and floats.
This zine was fun to write: I learned about why signed integers are represented in memory the way they are, and I’m really happy with the explanation of floating point we ended up with.
When explaining to people how your computer represents things in memory, I kept wanting to open up gdb or lldb and show some example C programs and how the variables in those C programs are represented in memory.
But gdb is kind of confusing if you’re not used to looking at it! So me and Marie made a cute interface on top of lldb, where you can put in any C program, click on a line, and see what the variable looks like. It’s called memory spy and here’s what it looks like:
I got really obsessed with float.exposed by Bartosz Ciechanowski for seeing how floats are represented in memory. So with his permission, I made a copy of it for integers called integer.exposed.
Here’s a screenshot:
It was pretty straightforward to make (copying someone else’s design is so much easier than making your own!) but I learned a few CSS tricks from analyzing how he implemented it.
I’ve been working on a big project to show people how to implement a working networking stack (TCP, TLS, DNS, UDP, HTTP) in 1400 lines of Python, that you can use to download a webpage using 100% your own networking code. Kind of like Nand to Tetris, but for computer networking.
This has been going VERY slowly – writing my own working shitty implementations was relatively easy (I finished that in October 2022), but writing clear tutorials that other people can follow is not.
But in March, I released the first part: Implement DNS in a Weekend. The response was really good – there are dozens of people’s implementations on GitHub, and people have implemented it in Go, C#, C, Clojure, Python, Ruby, Kotlin, Rust, Typescript, Haskell, OCaml, Elixir, Odin, and probably many more languages too. I’d like to see more implementations in less systems-y languages like vanilla JS and PHP, need to think about what I can do to encourage that.
I think “Implement IPv4 in a Weekend” might be the next one I release. It’s going to come with bonus guides to implementing ICMP and UDP too.
I gave a keynote at Strange Loop this year called Making Hard Things Easy (video + transcript), about why some things are so hard to learn and how we can make them easier. I’m really proud of how it turned out.
In September I decided to work on a second zine about Git, focusing more on how Git works. This is one of the hardest projects I’ve ever worked on, because over the last 10 years of using it I’d completely lost sight of what’s hard about Git.
So I’ve been doing a lot of research to try to figure out why Git is hard, and I’ve been writing a lot of blog posts. So far I’ve written:
What’s been most surprising so far is that I originally thought “to understand Git, people just need to learn git’s internal data model!“. But the more I talk to people about their struggles with Git, the less I think that’s true. I’ll leave it at that for now, but there’s a lot of work still to do.
I worked on a couple of fun Git tools this year:
a prototype exploring the question “what if git had git undo?”. I learned a bunch of things about why that’s not easy through writing the prototype, I might write a longer blog post about it later.
I’m also working on another Git software project, which is a collaboration with a friend.
This year I hired an Operations Manager for Wizard Zines! Lee is incredible and has done SO much to streamline the logistics of running the company, so that I can focus more on writing and coding. I don’t talk much about the mechanics of running the business on here, but it’s a lot and I’m very grateful to have some help.
A few of the many things Lee has made possible:
I spent 10 years building up a Twitter presence, but with the Recent Events, I spent a lot of time in 2023 working on building up a Mastodon account. I’ve found that I’m able to have more interesting conversations about computers on Mastodon than on Twitter or Bluesky, so that’s where I’ve been spending my time. We’ve been having a lot of great discussions about Git there recently.
I’ve run into a few technical issues with Mastodon (which I wrote about at Notes on using a single-person Mastodon server) but overall I’m happy there and I’ve been spending a lot more time there than on Twitter.
one of my questions for 2022 was:
Maybe I’ll work on that in 2024! Maybe not! I did make a little bit of progress on that question this year (I wrote What helps people get comfortable on the command line?).
Some other questions I’m thinking about on and off:
I’ve started to come to terms with the fact that projects always just take longer than I think they will. I started working this “implement your own terrible networking stack” project in 2022, and I don’t know if I’ll finish it in 2024. I’ve been working on this Git zine since September and I still don’t completely understand why Git is hard yet. There’s another small secret project that I initially thought of 5 years ago, made a bunch of progress on this year, but am still not done with. Things take a long time and that’s okay.
As always, thanks for reading and for making it possible for me to do this weird job.
But FUSE is pretty annoying to use on Mac – you need to install a kernel extension, and Mac OS seems to be making it harder and harder to install kernel extensions for security reasons. Also I had a few ideas for how to organize the filesystem differently than those projects.
So I thought it would be fun to experiment with ways to mount filesystems on Mac OS other than FUSE, so I built a project that does that called git-commit-folders. It works (at least on my computer) with both FUSE and NFS, and there’s a broken WebDav implementation too.
It’s pretty experimental (I’m not sure if this is actually a useful piece of software to have or just a fun toy to think about how git works) but it was fun to write and I’ve enjoyed using it myself on small repositories so here are some of the problems I ran into while writing it.
The main reason I wanted to make this was to give folks some intuition for how git works under the hood. After all, git commits really are very similar to folders – every Git commit contains a directory listing of the files in it, and that directory can have subdirectories, etc.
It’s just that git commits aren’t actually implemented as folders to save disk space.
So in git-commit-folders, every commit is actually a folder, and if you want to explore your old commits, you can do it just by exploring the filesystem! For example, if I look at the initial commit for my blog, it looks like this:
$ ls commits/8d/8dc0/8dc0cb0b4b0de3c6f40674198cb2bd44aeee9b86/
README
and a few commits later, it looks like this:
$ ls /tmp/git-homepage/commits/c9/c94e/c94e6f531d02e658d96a3b6255bbf424367765e9/
_config.yml config.rb Rakefile rubypants.rb source
In the filesystem mounted by git-commit-folders, commits are the only real folders – everything else (branches, tags, etc) is a symlink to a commit. This mirrors how git works under the hood.
$ ls -l branches/
lr-xr-xr-x 59 bork bazil-fuse -> ../commits/ff/ff56/ff563b089f9d952cd21ac4d68d8f13c94183dcd8
lr-xr-xr-x 59 bork follow-symlink -> ../commits/7f/7f73/7f73779a8ff79a2a1e21553c6c9cd5d195f33030
lr-xr-xr-x 59 bork go-mod-branch -> ../commits/91/912d/912da3150d9cfa74523b42fae028bbb320b6804f
lr-xr-xr-x 59 bork mac-version -> ../commits/30/3008/30082dcd702b59435f71969cf453828f60753e67
lr-xr-xr-x 59 bork mac-version-debugging -> ../commits/18/18c0/18c0db074ec9b70cb7a28ad9d3f9850082129ce0
lr-xr-xr-x 59 bork main -> ../commits/04/043e/043e90debbeb0fc6b4e28cf8776e874aa5b6e673
$ ls -l tags/
lr-xr-xr-x - bork 31 Dec 1969 test-tag -> ../commits/16/16a3/16a3d776dc163aa8286fb89fde51183ed90c71d0
This definitely doesn’t completely explain how git works (there’s a lot more to it than just “a commit is like a folder!”), but my hope is that it makes the idea that “every commit is like a folder with an old version of your code” feel a little more concrete.
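Here’s a hypothetical sketch of how that symlink layout could be built (illustrative Python, not the project’s actual Go code):

```python
import os
import tempfile

def build_branch_links(root: str, branches: dict) -> None:
    # git-commit-folders-style layout: commits are real directories,
    # branches are relative symlinks pointing into commits/xx/xxxx/<id>.
    os.makedirs(os.path.join(root, "branches"), exist_ok=True)
    for name, commit_id in branches.items():
        commit_dir = os.path.join(root, "commits",
                                  commit_id[:2], commit_id[:4], commit_id)
        os.makedirs(commit_dir, exist_ok=True)
        # Relative target, like the ../commits/ff/ff56/... links above
        target = os.path.join("..", "commits",
                              commit_id[:2], commit_id[:4], commit_id)
        os.symlink(target, os.path.join(root, "branches", name))

root = tempfile.mkdtemp()
build_branch_links(root, {"main": "8dc0cb0b4b0de3c6f40674198cb2bd44aeee9b86"})
```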
Before I get into the implementation, I want to talk about why having a filesystem with a folder for every git commit in it might be useful. A lot of my projects I end up never really using at all (like dnspeep) but I did find myself using this project a little bit while I was working on it.
The main uses I’ve found so far are:
running grep someFunction branch_histories/main/*/commit.go to find an old version of a function
looking at a file from another branch with vim branches/other-branch/go.mod
searching every branch for a function with grep someFunction branches/*/commit.go
All of these are through symlinks to commits instead of referencing commits directly.
None of these are the most efficient way to do this (you can use git show and git log -S or maybe git grep to accomplish something similar), but personally I always forget the syntax and navigating a filesystem feels easier to me. git worktree also lets you have multiple branches checked out at the same time, but to me it feels weird to set up an entire worktree just to look at 1 file.
Next I want to talk about some problems I ran into.
The two filesystems I could find that were natively supported by Mac OS were WebDav and NFS. I couldn’t tell which would be easier to implement so I just tried both.
At first webdav seemed easier and it turns out that golang.org/x/net has a webdav implementation, which was pretty easy to set up.
But that implementation doesn’t support symlinks, I think because it uses the io/fs interface and io/fs doesn’t support symlinks yet. Looks like that’s in progress though. So I gave up on webdav and decided to focus on the NFS implementation, using this go-nfs NFSv3 library.
Someone also mentioned that there’s FileProvider on Mac but I didn’t look into that.
I was implementing 3 different filesystems (FUSE, NFS, and WebDav), and it wasn’t clear to me how to avoid a lot of duplicated code.
My friend Dave suggested writing one core implementation and then writing
adapters (like fuse2nfs
and fuse2dav
) to translate it into the NFS and
WebDav verions. What this looked like in practice is that I needed to implement
3 filesystem interfaces:
- fs.FS for FUSE
- billy.Filesystem for NFS
- webdav.FileSystem for webdav

So I put all the core logic in the fs.FS interface, and then wrote two functions:
func Fuse2Dav(fs fs.FS) webdav.FileSystem
func Fuse2NFS(fs fs.FS) billy.Filesystem
All of the filesystems were kind of similar so the translation wasn’t too hard, there were just 1 million annoying bugs to fix.
Some git repositories have thousands or millions of commits. My first idea for how to address this was to make commits/
appear empty, so that it works like this:
$ ls commits/
$ ls commits/80210c25a86f75440110e4bc280e388b2c098fbd/
fuse fuse2nfs go.mod go.sum main.go README.md
So every commit would be available if you reference it directly, but you can’t list them. This is a weird thing for a filesystem to do but it actually works fine in FUSE. I couldn’t get it to work in NFS though. I assume what’s going on here is that if you tell NFS that a directory is empty, it’ll interpret that the directory is actually empty, which is fair.
I ended up handling this by sharding the commits the way .git/objects does (so that ls commits shows 0b 03 05 06 07 09 1b 1e 3e 4a), but doing 2 levels of this so that 18d46e76d7c2eedd8577fae67e3f1d4db25018b0 ends up at commits/18/18d4/18d46e76d7c2eedd8577fae67e3f1d4db25018b0.
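That sharding scheme can be sketched like this (in Python for illustration – the real project is in Go, and this isn’t its actual code):

```python
def sharded_path(commit_hash: str) -> str:
    # Two levels of prefix directories, like .git/objects but one level
    # deeper: the first 2 characters, then the first 4, then the full hash.
    return f"commits/{commit_hash[:2]}/{commit_hash[:4]}/{commit_hash}"
```

So a commit like 18d46e76... lands in a directory that only contains hashes sharing its 4-character prefix, which keeps each directory listing small.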
This seems to work okay on the Linux kernel which has ~1 million commits. It takes maybe a minute to do the initial load on my machine and then after that it just needs to do fast incremental updates.
Each commit hash is only 20 bytes so caching 1 million commit hashes isn’t a big deal, it’s just 20MB.
I think a smarter way to do this would be to load the commit listings lazily –
Git sorts its packfiles by commit ID, so you can pretty easily do a binary
search to find all commits starting with 1b
or 1b8c
. The git library I was using
doesn’t have great support for this though, because listing all commits in a
Git repository is a really weird thing to do. I spent maybe a couple of days
trying to implement it but I didn’t manage to get the performance I wanted so I
gave up.
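The lazy binary-search idea can be sketched like this (hypothetical Python, assuming you already have the packfile’s sorted list of commit IDs – this isn’t the go-git API):

```python
import bisect

def commits_with_prefix(sorted_ids, prefix):
    # sorted_ids: hex commit IDs in sorted order, as in a packfile index.
    # Binary search for the contiguous range of IDs starting with `prefix`.
    lo = bisect.bisect_left(sorted_ids, prefix)
    # 'g' sorts after every hex digit, so prefix+"g" bounds the range.
    hi = bisect.bisect_right(sorted_ids, prefix + "g")
    return sorted_ids[lo:hi]
```

This is O(log n) per lookup instead of scanning all ~1 million commits up front.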
I kept getting this error:
"/tmp/mnt2/commits/59/59167d7d09fd7a1d64aa1d5be73bc484f6621894/": Not a directory (os error 20)
This really threw me off at first but it turns out that this just means that there was an error while listing the directory, and the way the NFS library handles that error is with “Not a directory”. This happened a bunch of times and I just needed to track the bug down every time.
There were a lot of weird errors like this. I also got cd: system call
interrupted
which was pretty upsetting but ultimately was just some other bug
in my program.
Eventually I realized that I could use Wireshark to look at all the NFS packets being sent back and forth, which made some of this stuff easier to debug.
At first I was accidentally setting all my directory inode numbers to 0. This
was bad because if you run find
on a directory where the inode number of
every directory is 0, it’ll complain about filesystem loops and give up, which
is very fair.
I fixed this by defining an inode(string)
function which hashed a string to
get the inode number, and using the tree ID / blob ID as the string to hash.
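Here’s a sketch of what an inode(string) function like that could look like (Python for illustration; I’m assuming a SHA-256 hash here, which may not be what the project actually used):

```python
import hashlib

def inode(name: str) -> int:
    # Hash a tree/blob ID to a stable 64-bit inode number.
    # Setting the low bit guarantees the result is never 0
    # (inode 0 is what confused `find` in the bug above).
    digest = hashlib.sha256(name.encode()).digest()
    return int.from_bytes(digest[:8], "big") | 1
```

The nice property is that the same tree or blob always gets the same inode number, without having to store a mapping anywhere.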
I kept getting this “Stale NFS file handle” error. The problem is that I need to be able to take an opaque 64-byte NFS “file handle” and map it to the right directory.
The way the NFS library I’m using works is that it generates a file handle for every file and caches those references with a fixed size cache. This works fine for small repositories, but if there are too many files then it’ll overflow the cache and you’ll start getting stale file handle errors.
This is still a problem and I’m not sure how to fix it. I don’t understand how real NFS servers do this, maybe they just have a really big cache?
The NFS file handle is 64 bytes (64 bytes! not bits!) which is pretty big, so it does seem like you could just encode the entire file path in the handle a lot of the time and not cache it at all. Maybe I’ll try to implement that at some point.
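Here’s a sketch of the “encode the path in the handle” idea (hypothetical Python – paths longer than 64 bytes would still need some fallback, like a cache):

```python
HANDLE_SIZE = 64  # NFSv3 file handles are up to 64 bytes

def path_to_handle(path: str) -> bytes:
    # Stuff the file path directly into the fixed-size handle,
    # padded with NULs, so the server needs no handle cache at all.
    data = path.encode("utf-8")
    if len(data) > HANDLE_SIZE:
        raise ValueError("path too long to fit in a 64-byte handle")
    return data.ljust(HANDLE_SIZE, b"\x00")

def handle_to_path(handle: bytes) -> str:
    return handle.rstrip(b"\x00").decode("utf-8")
```

Because the handle is self-describing, it can never go “stale”: the server can always recover the path from the handle alone.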
The branch_histories/
directory only lists the latest 100 commits for each
branch right now. Not sure what the right move is there – it would be nice to
be able to list the full history of the branch somehow. Maybe I could use a
similar subfolder trick to the commits/
directory.
Git repositories sometimes have submodules. I don’t understand anything about submodules so right now I’m just ignoring them. So that’s a bug.
I built this with NFSv3 because the only Go library I could find at the time was an NFSv3 library. After I was done I discovered that the buildbarn project has an NFSv4 server in it. Would it be better to use that?
I don’t know if this is actually a problem or how big of an advantage it would be to use NFSv4. I’m also a little unsure about using the buildbarn NFS library because it’s not clear if they expect other people to use it or not.
There are probably more problems I forgot but that’s all I can think of for now. I may or may not fix the NFS stale file handle problem or the “it takes 1 minute to start up on the linux kernel” problem, who knows!
Thanks to my friend vasi who explained one million things about filesystems to me.
So in this post I want to briefly talk about the intuitive “offshoot” idea of a branch and how it compares with the way git actually defines branches.
Nothing in this post is remotely groundbreaking so I’m going to try to keep it pretty short.
Of course, people have many different intuitions about branches. Here’s the one that I think corresponds most closely to the physical “a branch of an apple tree” metaphor.
My guess is that a lot of people think about a git branch like this: the 2 commits in pink in this picture are on a “branch”.
I think there are two important things about this diagram:

- the branch has 2 commits on it (the two pink commits)
- the branch has a “parent” (main) which it’s an offshoot of

That seems pretty reasonable, but that’s not how git defines a branch – most importantly, git doesn’t have any concept of a branch’s “parent”. So how does git define a branch?
In git, a branch is the full history of every previous commit, not just the “offshoot” commits. So in our picture above both branches (main
and branch
) have 4 commits on them.
I made an example repository at https://github.com/jvns/branch-example which has its branches set up the same way as in the picture above. Let’s look at the 2 branches:
main
has 4 commits on it:
$ git log --oneline main
70f727a d
f654888 c
3997a46 b
a74606f a
and mybranch
has 4 commits on it too. The bottom two commits are shared
between both branches.
$ git log --oneline mybranch
13cb960 y
9554dab x
3997a46 b
a74606f a
So mybranch
has 4 commits on it, not just the 2 commits 13cb960
and 9554dab
that are “offshoot” commits.
You can get git to draw all the commits on both branches like this:
$ git log --all --oneline --graph
* 70f727a (HEAD -> main, origin/main) d
* f654888 c
| * 13cb960 (origin/mybranch, mybranch) y
| * 9554dab x
|/
* 3997a46 b
* a74606f a
Internally in git, branches are stored as tiny text files which have a commit ID in them. That commit is the latest commit on the branch. This is the “technically correct” definition I was talking about at the beginning.
Let’s look at the text files for main
and mybranch
in our example repo:
$ cat .git/refs/heads/main
70f727acbe9ea3e3ed3092605721d2eda8ebb3f4
$ cat .git/refs/heads/mybranch
13cb960ad86c78bfa2a85de21cd54818105692bc
This makes sense: 70f727
is the latest commit on main
and 13cb96
is the latest commit on mybranch
.
The reason this works is that every commit contains a pointer to its parent(s), so git can follow the chain of pointers to get every commit on the branch.
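That pointer-following walk is easy to sketch (hypothetical Python, using the example repo’s commits as data – git itself of course does this in C):

```python
def commits_on_branch(tip, parents):
    # parents maps commit ID -> list of parent commit IDs.
    # Starting from the branch's tip, follow parent pointers to
    # collect every commit reachable from it.
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen
```

Running this from mybranch’s tip (13cb960) gives all 4 commits, including the 2 shared with main – which is exactly the “full history” definition of a branch.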
Like I mentioned before, the thing that’s missing here is any relationship at
all between these two branches. There’s no indication that mybranch
is an
offshoot of main
.
Now that we’ve talked about how the intuitive notion of a branch is “wrong”, I want to talk about how it’s also right in some very important ways.
I think it’s pretty popular to tell people that their intuition about git is “wrong”. I find that kind of silly – in general, even if people’s intuition about a topic is technically incorrect in some ways, people usually have the intuition they do for very legitimate reasons! “Wrong” models can be super useful.
So let’s talk about 3 ways the intuitive “offshoot” notion of a branch matches up very closely with how we actually use git in practice.
Now let’s go back to our original picture.
When you rebase mybranch
on main
, it takes the commits on the “intuitive”
branch (just the 2 pink commits) and replays them onto main
.
The result is that just the 2 commits (x and y) get copied. Here’s what that looks like:
$ git switch mybranch
$ git rebase main
$ git log --oneline mybranch
952fa64 (HEAD -> mybranch) y
7d50681 x
70f727a (origin/main, main) d
f654888 c
3997a46 b
a74606f a
Here git rebase
has created two new commits (952fa64
and 7d50681
) whose
information comes from the previous two x
and y
commits.
So the intuitive model isn’t THAT wrong! It tells you exactly what happens in a rebase.
But because git doesn’t know that mybranch
is an offshoot of main
, you need
to tell it explicitly where to rebase the branch.
Merges don’t copy commits, but they do need a “base” commit: the way merges work is that it looks at two sets of changes (starting from the shared base) and then merges them.
Let’s undo the rebase we just did and then see what the merge base is.
$ git switch mybranch
$ git reset --hard 13cb960 # undo the rebase
$ git merge-base main mybranch
3997a466c50d2618f10d435d36ef12d5c6f62f57
This gives us the “base” commit where our branch branched off, 3997a4
.
That’s exactly the commit you would think it might be based on our intuitive
picture.
If we create a pull request on GitHub to merge mybranch
into main
, it’ll
also show us 2 commits: the commits x
and y
. That makes sense and also
matches our intuitive notion of a branch.
I assume if you make a merge request on GitLab it shows you something similar.
This leaves our intuitive definition of a branch looking pretty good actually! The “intuitive” idea of what a branch is matches exactly with how merges and rebases and GitHub pull requests work.
You do need to explicitly
specify the other branch when merging or rebasing or making a pull request (like git rebase main
),
because git doesn’t know what branch you think your offshoot is based on.
But the intuitive notion of a branch has one fairly serious problem: the way
you intuitively think about main
and an offshoot branch are very different,
and git doesn’t know that.
So let’s talk about the different kinds of git branches.
To a human, main
and mybranch
are pretty different, and you probably have
pretty different intentions around how you want to use them.
I think it’s pretty normal to think of some branches as being “trunk” branches, and some branches as being “offshoots”. Also you can have an offshoot of an offshoot.
Of course, git itself doesn’t make any such distinctions (the term “offshoot” is one I just made up!), but what kind of a branch it is definitely affects how you treat it.
For example:
- you might rebase mybranch onto main, but you probably wouldn’t rebase main onto mybranch – that would be weird!

One thing I think throws people off about git is that because git doesn’t have any notion of whether a branch is an “offshoot” of another branch, it won’t give you any guidance about if/when it’s appropriate to rebase branch X on branch Y. You just have to know.
For example, you can do either:
$ git checkout main
$ git rebase mybranch
or
$ git checkout mybranch
$ git rebase main
Git will happily let you do either one, even though in this case git rebase main
is
extremely normal and git rebase mybranch
is pretty weird. A lot of people
said they found this confusing so here’s a picture of the two kinds of rebases:
Similarly, you can do merges “backwards”, though that’s much more normal than
doing a backwards rebase – merging mybranch
into main
and main
into
mybranch
are both useful things to do for different reasons.
Here’s a diagram of the two ways you can merge:
I hear the statement “the main
branch is not special” a lot and I’ve been
puzzled about it – in most of the repositories I work in, main
is
pretty special! Why are people saying it’s not?
I think the point is that even though branches do have relationships
between them (main
is often special!), git doesn’t know anything about those
relationships.
You have to tell git explicitly about the relationship between branches every
single time you run a git command like git rebase
or git merge
, and if you
make a mistake things can get really weird.
I don’t know whether git’s design here is “right” or “wrong” (it definitely has some pros and cons, and I’m very tired of reading endless arguments about it), but I do think it’s surprising to a lot of people for good reason.
Let’s say you want to look at just the “offshoot” commits on a branch, which as we’ve discussed is a completely normal thing to want.
Here’s how to see just the 2 offshoot commits on our branch with git log
:
$ git switch mybranch
$ git log main..mybranch --oneline
13cb960 (HEAD -> mybranch, origin/mybranch) y
9554dab x
You can look at the combined diff for those same 2 commits with git diff
like this:
$ git diff main...mybranch
So to see the 2 commits x
and y
with git log
, you need to use 2 dots
(..
), but to look at the same commits with git diff
, you need to use 3 dots
(...
).
Personally I can never remember what ..
and ...
mean so I just avoid
them completely even though in principle they seem useful.
Also, it’s worth mentioning that GitHub does have a “special branch”: every
github repo has a “default branch” (in git terms, it’s what HEAD
points at),
which is special in the following ways:
- it’s the branch you get when you git clone the repository

and probably even more that I’m not thinking of.
This all seems extremely obvious in retrospect, but it took me a long time to figure out what a more “intuitive” idea of a branch even might be because I was so used to the technical “a branch is a reference to a commit” definition.
I also hadn’t really thought about how git makes you tell it about the
hierarchy between your branches every time you run a git rebase
or git
merge
command – for me it’s second nature to do that and it’s not a big deal,
but now that I’m thinking about it, it’s pretty easy to see how somebody could
get mixed up.
I found it very hard to find simple examples of flake files and I ran into a few problems that were very confusing to me, so I wanted to write down some very basic examples and some of the problems I ran into in case it’s helpful to someone else who’s getting started with flakes.
First, let’s talk about what a flake is a little.
Every explanation I’ve found of flakes explains them in terms of other nix concepts (“flakes simplify nix usability”, “flakes are processors of Nix code”). Personally I really needed a way to think about flakes in terms of other non-nix things and someone made an analogy to Docker containers that really helped me, so I’ve been thinking about flakes a little like Docker container images.
Here are some ways in which flakes are like Docker containers:
- you can share your flake.nix file with someone and then they can build the software exactly the same way you built it (a little like how you can share a Dockerfile, though flakes are MUCH better at the “exactly the same way you built it” thing)

flakes are also different from Docker containers in a LOT of ways:
- with a Dockerfile, you’re not actually guaranteed to get the exact same results as another user. With flake.nix and flake.lock you are.
- flake.nix files are programs in the nix programming language instead of mostly a bunch of shell commands

Obviously this analogy breaks down pretty quickly (the list of differences is VERY long), but they do share the “you can share a dev environment with a single configuration file” design goal.
To me one of the biggest advantages of nix is that I’m on a Mac and nix has a repository with a lot of pre-compiled binaries of various packages for Mac. I mostly mention this because people always say that nix is good because it’s “declarative” or “reproducible” or “functional” or whatever but my main motivation for using nix personally is that it has a lot of binary packages. I do appreciate that it makes it easier for me to build a 5-year-old version of hugo on mac though.
My impression is that nix has more binary packages than Homebrew does, so installing things is faster and I don’t need to build as much from source.
Previously I was using nix as a Homebrew replacement like this (which I talk about more in this blog post):
- run nix-env -iA nixpkgs.whatever to install stuff

This worked great (except that it randomly broke occasionally, but someone helped me find a workaround for that so the random breaking wasn’t a big issue).
I thought it might be fun to have a single flake.nix
file where I could maintain a list
of all the packages I wanted installed and then put all that stuff in a
directory in my PATH
. This isn’t very well motivated: my previous setup was
generally working just fine, but I have a long history of fiddling with my
computer setup (Arch Linux ftw) and so I decided to have a Day Of Fiddling.
I think the only practical advantages of flakes for me are:
- I can use my flake.nix file to set up a new computer more easily

These are pretty minor though.
Okay, so I want to make a flake with a bunch of packages installed in it, let’s say Ruby and cowsay to start. How do I
do that? I went to zero-to-nix and copied and pasted some things and ended up with this flake.nix
file (here it is in a gist):
{
inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixpkgs-23.05-darwin";
outputs = { self, nixpkgs }: {
devShell.aarch64-darwin = nixpkgs.legacyPackages.aarch64-darwin.mkShell {
buildInputs = with nixpkgs.legacyPackages.aarch64-darwin; [
cowsay
ruby
];
};
};
}
This has a little bit of boilerplate so let’s list the things I understand about this:
- aarch64-darwin is my machine’s architecture; this is important because I’m asking nix to download binaries
- nixpkgs is my one input. I get to pick and choose which bits of it I want to bring into my flake though.
- the github:NixOS/nixpkgs/nixpkgs-23.05-darwin url scheme is a bit unusual: the format is github:USER/REPO_NAME/TAG_OR_BRANCH_NAME. So this is looking at the nixpkgs-23.05-darwin tag in the NixOS/nixpkgs repository.
- mkShell is a nix function that’s apparently useful if you want to run nix develop. I stopped using it after this so I don’t know more than that.
- devShell.aarch64-darwin is the name of the output. Apparently I need to give it that exact name or else nix develop will yell at me
- cowsay and ruby are the things I’m taking from nixpkgs to put in my output
- I have no idea what self is doing here or what legacyPackages is about

Okay, cool. Let’s try to build it:
$ nix build
error: getting status of '/nix/store/w1v41cyqyx4d7q4g7c8nb50bp9dvjm29-source/flake.nix': No such file or directory
This error is VERY mysterious – what is /nix/store/w1v41cyqyx4d7q4g7c8nb50bp9dvjm29-source/
and why does nix think it should exist???
I was totally stuck until a very nice person on Mastodon helped me. So let’s talk about what’s going wrong here.
Apparently nix flakes have some Weird Rules about git. The way it works is:
- if all your files are git add-ed to git, everything is fine
- but if your flake.nix file isn’t tracked by git yet (because you just created it and are trying to get it to work), nix will COMPLETELY IGNORE YOUR FILE

After someone kindly told me what was happening, I found that this is mentioned in this blog post about flakes, which says:
Note that any file that is not tracked by Git is invisible during Nix evaluation
There’s also a github issue discussing what to do about this.
So we need to git add
the file to get nix to pay attention to it. Cool. Let’s keep going.
To get any of the commands we’re going to talk about to work (like nix build
), you need to enable two nix features: nix-command and flakes.
I set this up by putting experimental-features = nix-command flakes
in my
~/.config/nix/nix.conf
, but you can also run nix --extra-experimental-features "flakes nix-command" build
instead of nix build
.
nix develop
The instructions I was following told me that I could now run nix develop
and get a shell inside my new environment. I tried it and it works:
$ nix develop
grapefruit:nix bork$ cowsay hi
 ____
< hi >
 ----
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
Cool! I was curious about how the PATH was set up inside this environment so I took a look:
grapefruit:nix bork$ echo $PATH
/nix/store/v5q1bxrqs6hkbsbrpwc81ccyyfpbl8wk-clang-wrapper-11.1.0/bin:/nix/store/x9jmvvxcys4zscff39cnpw0kyfvs80vp-clang-11.1.0/bin:/nix/store/3f1ii2y5fs1w7p0id9mkis0ffvhh1n8w-coreutils-9.1/bin:/nix/store/8ldvi6b3ahnph19vm1s0pyjqrq0qhkvi-cctools-binutils-darwin-wrapper-973.0.1/bin:/nix/store/5kbbxk18fp645r4agnn11bab8afm0ry3-cctools-binutils-darwin-973.0.1/bin:/nix/store/5si884h02nqx3dfcdm5irpf7caihl6f8-cowsay-3.7.0/bin:/nix/store/5bs5q2dw5bl7c4krcviga6yhdrqbvdq6-ruby-3.1.4/bin:/nix/store/3f1ii2y5fs1w7p0id9mkis0ffvhh1n8w-coreutils-9.1/bin
It looks like every dependency has been added to the PATH separately: for example there’s
/nix/store/5si884h02nqx3dfcdm5irpf7caihl6f8-cowsay-3.7.0/bin
for cowsay
and
/nix/store/5bs5q2dw5bl7c4krcviga6yhdrqbvdq6-ruby-3.1.4/bin
for ruby
. That’s
fine but it’s not how I wanted my setup to work: I wanted a single directory of
symlinks that I could just put in my PATH in my normal shell.
buildEnv
I asked in the Nix discord and someone told me I could use buildEnv
to turn
my flake into a directory of symlinks. As far as I can tell it’s just a way to
take nix packages and copy their symlinks into another directory.
After some fiddling, I ended up with this: (here’s a gist)
{
inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixpkgs-23.05-darwin";
outputs = { self, nixpkgs }: {
defaultPackage.aarch64-darwin = nixpkgs.legacyPackages.aarch64-darwin.buildEnv {
name = "julia-stuff";
paths = with nixpkgs.legacyPackages.aarch64-darwin; [
cowsay
ruby
];
pathsToLink = [ "/share/man" "/share/doc" "/bin" "/lib" ];
extraOutputsToInstall = [ "man" "doc" ];
};
};
}
This put a bunch of symlinks in result/bin
:
$ ls result/bin/
bundle bundler cowsay cowthink erb gem irb racc rake rbs rdbg rdoc ri ruby typeprof
Sweet! Now I have a thing I can theoretically put in my PATH – this result
directory. Next I mostly just
needed to add every other package I wanted to install to this flake.nix
file (I got the list
from nix-env -q
).
I ran into a bunch of weird problems adding all the packages I already had installed to my flake.nix, so let’s talk about them.
I wanted to install a non-free package called ngrok
. Nix gave me 3 options for how I could do this. Option C seemed the most promising:
c) For `nix-env`, `nix-build`, `nix-shell` or any other Nix command you can add
{ allowUnfree = true; }
to ~/.config/nixpkgs/config.nix.
But adding { allowUnfree = true}
to ~/.config/nixpkgs/config.nix
didn’t do
anything for some reason so instead I went with option A, which did seem to
work:
$ export NIXPKGS_ALLOW_UNFREE=1
Note: For `nix shell`, `nix build`, `nix develop` or any other Nix 2.4+
(Flake) command, `--impure` must be passed in order to read this
environment variable.
I made a couple of flakes for custom Nix packages I’d made (which I wrote about in my first nix blog post), and I wanted to set them up like this (you can see the full configuration here):
hugoFlake.url = "path:../hugo-0.40";
paperjamFlake.url = "path:../paperjam";
This worked fine the first time I ran nix build
, but when I reran nix build
again later I got some totally inscrutable error.
My workaround was just to run rm flake.lock every time before running nix build, which seemed to fix the problem.
I don’t really understand what’s going on here but there’s a very long github issue thread about it.
For a while, every time I ran nix build
, I got this error:
$ nix build
error:
… while reading the response from the build hook
error: unexpected EOF reading a line
I spent a lot of time poking at my flake.nix
trying to guess at where I could have gone wrong.
A very nice person on Mastodon also helped me with this one and it turned out
that what I needed to do was find the nix-daemon
process and kill it. I still
have no idea what happened here or what that error message means, I did upgrade
nix at some point during this whole process so I guess the
upgrade went wonky somehow.
I don’t think this one is a common problem.
I wanted to install the zulu
package for Java, but when I tried to add it to
my list of packages I got this error complaining about a broken symlink:
$ nix build
error: builder for '/nix/store/4n9c4707iyiwwgi9b8qqx7mshzrvi27r-julia-dev.drv' failed with exit code 2;
last 1 log lines:
> error: not a directory: `/nix/store/2vc4kf5i28xcqhn501822aapn0srwsai-zulu-11.62.17/share/man'
For full logs, run 'nix log /nix/store/4n9c4707iyiwwgi9b8qqx7mshzrvi27r-julia-dev.drv'.
$ ls /nix/store/2vc4kf5i28xcqhn501822aapn0srwsai-zulu-11.62.17/share/ -l
lrwxr-xr-x 29 root 31 Dec 1969 man -> zulu-11.jdk/Contents/Home/man
I think what’s going on here is that the zulu
package in nixpkgs-23.05
was
just broken (looks like it’s since been fixed in the unstable version).
I decided I didn’t feel like dealing with that and it turned out I already had
Java installed another way outside nix, so I just removed zulu
from my list
and moved on.
Now that I knew how to fix all of the weird problems I’d run into, I wrote a
little shell script called nix-symlink
to build my flake and symlink it to
the very unimaginatively named ~/.nix-flake
. The idea was that then I could
put ~/.nix-flake
in my PATH
and have all my programs available.
I think people usually use nix flakes in a per-project way instead of “a single global flake”, but this is how I wanted my setup to work so that’s what I did.
Here’s the nix-symlink
script. The rm flake.lock
is because of that relative path issue,
and the NIXPKGS_ALLOW_UNFREE
is so I could install ngrok.
#!/bin/bash
set -euo pipefail
export NIXPKGS_ALLOW_UNFREE=1
cd ~/work/nixpkgs/flakes/grapefruit || exit
rm flake.lock
nix build --impure --out-link ~/.nix-flake
I put ~/.nix-flake
at the beginning of my PATH
(not at the end), but I might revisit that, we’ll see.
At the end of all this, I wanted to run a garbage collection because I’d
installed a bunch of random stuff that was taking about 20GB of extra hard
drive space in my /nix/store
. I think there are two different ways to collect
garbage in nix:
nix-store --gc
nix-collect-garbage
I have no idea what the difference between them is, but nix-collect-garbage
seemed to delete more stuff for some reason.
I wanted to check that my ~/.nix-flake
directory was actually a GC root, so
that all my stuff wouldn’t get deleted when I ran a GC.
I ran nix-store --gc --print-roots
to print out all the GC roots and my
~/.nix-flake
was in there so everything was good! This command also runs a GC
so it was kind of a dangerous way to check if a GC was going to delete
everything, but luckily it worked.
The last problem I ran into is speed. Previously, installing a new small package took me 2 seconds with nix-env -iA
:
$ time nix-env -iA nixpkgs.sl
installing 'sl-5.05'
these 2 paths will be fetched (0.41 MiB download, 3.77 MiB unpacked):
/nix/store/yv1c98m5pncx3i5q7nr7i7mfjkiyii72-ncurses-6.4
/nix/store/2k78vf30czicjs0dq9x0sj4017ziwxkn-sl-5.05
copying path '/nix/store/yv1c98m5pncx3i5q7nr7i7mfjkiyii72-ncurses-6.4' from 'https://cache.nixos.org'...
copying path '/nix/store/2k78vf30czicjs0dq9x0sj4017ziwxkn-sl-5.05' from 'https://cache.nixos.org'...
building '/nix/store/zadpfs9k1cw5x7iniwwcqd8lb7nnc7bb-user-environment.drv'...
________________________________________________________
Executed in 1.96 secs fish external
Installing the same package with flakes takes 7 seconds, plus the time to edit the config file:
$ vim ~/work/nixpkgs/flakes/grapefruit/flake.nix
$ time nix-symlink
________________________________________________________
Executed in 7.04 secs fish external
usr time 1.78 secs 0.29 millis 1.78 secs
sys time 0.51 secs 2.03 millis 0.51 secs
I don’t know what to do about this so I’ll just live with it. We’ll see if this ends up being annoying or not.
Now my new nix workflow is:
- edit flake.nix to add or remove packages (this file)
- run my nix-symlink script after editing it
- run nix-collect-garbage once in a while
The last thing I wanted to do was run
nix registry add nixpkgs github:NixOS/nixpkgs/nixpkgs-23.05-darwin
so that if I want to ad-hoc run a flake with nix run nixpkgs#cowsay
, it’ll
take the version from the 23.05 version of nixpkgs. Mostly I just wanted this
so I didn’t have to download new versions of the nixpkgs repository all the
time – I just wanted to pin the 23.05 version.
I think nixpkgs-unstable
is the default which I’m sure is fine too if you
want to have more up-to-date software.
My solutions to all the nix problems I described are maybe not The Best ™,
but I’m happy that I figured out a way to install stuff that just involves one
relatively simple flake.nix
file and a 6-line bash script and not a lot of other
machinery.
Personally I still feel extremely uncomfortable with nix and so it’s important to me to keep my configuration as simple as possible without a lot of extra abstraction layers that I don’t understand. I might try out flakey-profile at some point though because it seems extremely simple.
You can manage a lot more stuff with nix, like:
- your project’s dependencies (instead of npm install and pip install and bundle install)

There are all kinds of tools that build on top of nix and flakes like home-manager. Like I said before though, it’s important to me to keep my configuration super simple so that I can have any hope of understanding how it works and being able to fix problems when it breaks, so I haven’t paid attention to any of that stuff.
I’ve been complaining about nix a little in this post, but as usual with open source projects I assume that nix has all of these papercuts because it’s a complicated system run by a small team of volunteers with very limited time.
Folks on the unofficial nix discord have been helpful, I’ve had a somewhat mixed experience there but they have a “support forum” section in there and I’ve gotten answers to a lot of my questions.
the main resources I’ve found for understanding nix flakes are:
Also Kamal (my partner) uses nix and that really helps, I think using nix with an experienced friend around is a lot easier.
I still kind of like nix after using it for 9 months despite how confused I am about it all the time, I feel like once I get things working they don’t usually break.
We’ll see if that continues to be the case with flakes! Maybe I’ll go back to
just using nix-env -iA
ing everything if it goes badly.
I was trying to explain how git cherry-pick works the other day, and I found myself getting confused.
What went wrong was: I thought that git cherry-pick
was basically applying a
patch, but when I tried to actually do it that way, it didn’t work!
Let’s talk about what I thought cherry-pick
did (applying a patch), why
that’s not quite true, and what it actually does instead (a “3-way merge”).
This post is extremely in the weeds and you definitely don’t need to understand this stuff to use git effectively. But if you (like me) are curious about git’s internals, let’s talk about it!
The way I previously understood git cherry-pick COMMIT_ID
is:
- calculate the diff for COMMIT_ID, like git show COMMIT_ID --patch > out.patch
- apply that patch to the current branch, like git apply out.patch
Before we get into this – I want to be clear that this model is mostly right, and if that’s your mental model that’s fine. But it’s wrong in some subtle ways and I think that’s kind of interesting, so let’s see how it works.
If I try to do the “calculate the diff and apply the patch” thing in a case where there’s a merge conflict, here’s what happens:
$ git show 10e96e46 --patch > out.patch
$ git apply out.patch
error: patch failed: content/post/2023-07-28-why-is-dns-still-hard-to-learn-.markdown:17
error: content/post/2023-07-28-why-is-dns-still-hard-to-learn-.markdown: patch does not apply
This just fails – it doesn’t give me any way to resolve the conflict or figure out how to solve the problem.
This is quite different from what actually happens when you run git cherry-pick
,
which is that I get a merge conflict:
$ git cherry-pick 10e96e46
error: could not apply 10e96e46... wip
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
So it seems like the “git is applying a patch” model isn’t quite right. But the error message literally does say “could not apply 10e96e46”, so it’s not quite wrong either. What’s going on?
I went digging through git’s source code to see how cherry-pick works, and ended up at this line of code:
res = do_recursive_merge(r, base, next, base_label, next_label, &head, &msgbuf, opts);
So a cherry-pick is a… merge? What? How? What is it even merging? And how does merging even work in the first place?
I realized that I didn’t really know how git’s merge worked, so I googled it and found out that git does a thing called “3-way merge”. What’s that?
Let’s say I want to merge these 2 files. We’ll call them v1.py and v2.py.
v1.py:

def greet():
    greeting = "hello"
    name = "julia"
    return greeting + " " + name

v2.py:

def say_hello():
    greeting = "hello"
    name = "aanya"
    return greeting + " " + name
There are two lines that differ: we have def greet() and def say_hello(), and we have name = "julia" and name = "aanya". How do we know what to pick? It seems impossible!
But what if I told you that the original function (base.py) was this?
def say_hello():
    greeting = "hello"
    name = "julia"
    return greeting + " " + name
Suddenly it seems a lot clearer! v1 changed the function’s name to greet and v2 set name = "aanya". So to merge, we should make both those changes:
def greet():
    greeting = "hello"
    name = "aanya"
    return greeting + " " + name
We can ask git to do this merge with git merge-file, and it gives us exactly the result we expected: it picks def greet() and name = "aanya".
$ git merge-file v1.py base.py v2.py -p
def greet():
    greeting = "hello"
    name = "aanya"
    return greeting + " " + name
This way of merging where you merge 2 files + their original version is called a 3-way merge.
If you want to try it out yourself in a browser, I made a little playground at jvns.ca/3-way-merge/. I made it very quickly so it’s not mobile friendly.
The way I think about the 3-way merge is – git merges changes, not files. We have an original file and 2 possible changes to it, and git tries to combine both of those changes in a reasonable way. Sometimes it can’t (for example if both changes change the same line), and then you get a merge conflict.
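The “merge changes, not files” idea is easy to sketch in code. Here’s a toy Python version of a line-by-line 3-way merge – my own simplification, not git’s actual algorithm: it assumes all three versions have the same number of lines, while real merges first align lines with a diff:

```python
def merge3(base, v1, v2):
    # Toy 3-way merge: walk the three versions line by line and keep
    # whichever side changed each line relative to base.
    merged = []
    for b, a, c in zip(base, v1, v2):
        if a == b:
            merged.append(c)      # v1 left this line alone: take v2's version
        elif c == b:
            merged.append(a)      # v2 left this line alone: take v1's version
        elif a == c:
            merged.append(a)      # both sides made the same change
        else:
            merged.append("<<<<<<< conflict")  # both changed it differently
    return merged

base = ['def say_hello():', '    greeting = "hello"', '    name = "julia"', '    return greeting + " " + name']
v1   = ['def greet():',     '    greeting = "hello"', '    name = "julia"', '    return greeting + " " + name']
v2   = ['def say_hello():', '    greeting = "hello"', '    name = "aanya"', '    return greeting + " " + name']
print("\n".join(merge3(base, v1, v2)))
```

On the example above this picks def greet(): and name = "aanya", the same answer git merge-file gave.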
Git can also merge more than 2 possible changes: you can have an original file and 8 possible changes, and it can try to reconcile all of them. That’s called an octopus merge but I don’t know much more than that, I’ve never done one.
Now let’s get a little weird! When we talk about git “applying a patch” (as you do in a rebase or revert or cherry-pick), it’s not actually creating a patch file and applying it. Instead, it’s doing a 3-way merge.
Here’s how applying commit X as a patch to your current commit corresponds to this v1, v2, and base setup from before:

- the file from your current commit is v1
- the file from X’s parent commit is base
- the file from X itself is v2
- git runs git merge-file v1 base v2 to combine them (technically git does not actually run git merge-file, it runs a C function that does it)

Together, you can think of base and v2 as being the “patch”: the diff between them is the change that you want to apply to v1.
Let’s say we have this commit graph, and we want to cherry-pick Y onto main:
A - B (main)
\
\
X - Y - Z
How do we turn that into a 3-way merge? Here’s how it translates into our v1, v2 and base from earlier:

- B is v1
- X is the base, Y is v2

So together X and Y are the “patch”.
And git rebase is just like git cherry-pick, but repeated a bunch of times.
Now let’s say we want to run git revert Y on this commit graph:

X - Y - Z - A - B

- B is v1
- Y is the base, X is v2

This is exactly like a cherry-pick, but with X and Y reversed. We have to flip them because we want to apply a “reverse patch”.
Revert and cherry-pick are so closely related in git that they’re actually implemented in the same file: revert.c.
This trick of using a 3-way merge to apply a commit as a patch seems really clever and cool and I’m surprised that I’d never heard of it before! I don’t know of a name for it, but I kind of want to call it a “3-way patch”.
The idea is that with a 3-way patch, you specify the patch as 2 files: the file before the patch and the file after it (base and v2 in our language in this post). So there are 3 files involved: 1 for the original and 2 for the patch.
The point is that the 3-way patch is a much better way to patch than a normal patch, because you have a lot more context for merging when you have both full files.
Here’s more or less what a normal patch for our example looks like:
@@ -1,1 +1,1 @@:
- def greet():
+ def say_hello():
greeting = "hello"
And here’s a 3-way patch. (This “3-way patch” is not a real file format, it’s just something I made up.)
BEFORE: (the full file)
def greet():
greeting = "hello"
name = "julia"
return greeting + " " + name
AFTER: (the full file)
def say_hello():
greeting = "hello"
name = "julia"
return greeting + " " + name
The book Building Git by James Coglan is the only place I could find, other than the git source code, that explains how git cherry-pick actually uses 3-way merge under the hood (I thought Pro Git might talk about it, but it didn’t seem to as far as I could tell).
I actually went to buy it and it turned out that I’d already bought it in 2019 so it was a good reference to have here :)
There’s more to merging in git than the 3-way merge – there’s something called a “recursive merge” that I don’t understand, and there are a bunch of details about how to deal with handling file deletions and moves, and there are also multiple merge algorithms.
My best idea for where to learn more about this stuff is Building Git, though I haven’t read the whole thing.
what does git apply do?

I also went looking through git’s source to find out what git apply does, and it seems to (unsurprisingly) be in apply.c. That code parses a patch file, and then hunts through the target file to figure out where to apply it. The core logic seems to be around here:
I think the idea is to start at the line number that the patch suggested and
then hunt forwards and backwards from there to try to find it:
/*
* There's probably some smart way to do this, but I'll leave
* that to the smart and beautiful people. I'm simple and stupid.
*/
backwards = current;
backwards_lno = line;
forwards = current;
forwards_lno = line;
current_lno = line;
for (i = 0; ; i++) {
...
That all seems pretty intuitive and about what I’d naively expect.
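Here’s a toy Python sketch of that search – my own simplification of the idea, not git’s actual code: try the suggested line first, then positions progressively further away in both directions, until the hunk’s context lines match:

```python
def find_hunk(lines, context, suggested):
    # Toy version of the search in apply.c: check the suggested position,
    # then fan out backwards and forwards until the context lines match.
    last = len(lines) - len(context)
    for offset in range(len(lines) + 1):
        for lno in (suggested - offset, suggested + offset):
            if 0 <= lno <= last and lines[lno:lno + len(context)] == context:
                return lno
    return -1  # the hunk doesn't apply anywhere

lines = ["a", "b", "old", "c", "d"]
# the patch said the hunk was at line 1, but "old" has drifted to index 2
print(find_hunk(lines, ["old"], 1))
```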
how git apply --3way works

git apply also has a --3way flag that does a 3-way merge. So we actually could have more or less implemented git cherry-pick with git apply like this:
$ git show 10e96e46 --patch > out.patch
$ git apply out.patch --3way
Applied patch to 'content/post/2023-07-28-why-is-dns-still-hard-to-learn-.markdown' with conflicts.
U content/post/2023-07-28-why-is-dns-still-hard-to-learn-.markdown
--3way doesn’t just use the contents of the patch file though! The patch file starts with:

index d63ade04..65778fc0 100644

d63ade04 and 65778fc0 are the IDs of the old/new versions of that file in git’s object database, so git can retrieve them to do a 3-way patch application. This won’t work if someone emails you a patch and you don’t have the old/new versions of the file though: if you’re missing the blobs you’ll get this error:
$ git apply out.patch --3way
error: repository lacks the necessary blob to perform 3-way merge.
A couple of people pointed out that the 3-way merge is much older than git – it’s from the late 70s or so. Here’s a paper from 2007 that talks about it.
I was pretty surprised to learn that I didn’t actually understand the core way that git applies patches internally – it was really cool to learn about!
I have lots of issues with git’s UI but I think this particular thing is not one of them. The 3-way merge seems like a nice unified way to solve a bunch of different problems, it’s pretty intuitive for people (the idea of “applying a patch” is one that a lot of programmers are used to thinking about, and the fact that it’s implemented as a 3-way merge under the hood is an implementation detail that nobody actually ever needs to think about).
Also a very quick plug: I’m working on writing a zine about git, if you’re interested in getting an email when it comes out you can sign up to my very infrequent announcements mailing list.
I’ve found that if many people have a very strong opinion that’s different from mine, usually it’s because they have different experiences around that thing from me.
So I asked on Mastodon:
today I’m thinking about the tradeoffs of using git rebase a bit. I think the goal of rebase is to have a nice linear commit history, which is something I like.

but what are the costs of using rebase? what problems has it caused for you in practice? I’m really only interested in specific bad experiences you’ve had here – not opinions or general statements like “rewriting history is bad”
I got a huge number of incredible answers to this, and I’m going to do my best to summarize them here. I’ll also mention solutions or workarounds to those problems in cases where I know of a solution. Here’s the list:
My goal with this isn’t to convince anyone that rebase is bad and you shouldn’t use it (I’m certainly going to keep using rebase!). But seeing all these problems made me want to be more cautious about recommending rebase to newcomers without explaining how to use it safely. It also makes me wonder if there’s an easier workflow for cleaning up your commit history that’s harder to accidentally mess up.
First, I know that people use a lot of different Git workflows. I’m going to be talking about the workflow I’m used to when working on a team, which is:
- there’s a main branch. It’s protected from force pushes.
- people write code on branches and merge it into main
- main gets updated every time a pull request is merged
- the only way to get something into main is by making a pull request on Github/Gitlab and merging it

This is not the only “correct” git workflow (it’s a very “we run a web service” workflow; open source projects or desktop software with releases generally use a slightly different workflow). But it’s what I know so that’s what I’ll talk about.
Also before we start: one big thing I noticed is that there were 2 different kinds of rebase that kept coming up, and only one of them requires you to deal with merge conflicts.
1. rebasing to squash commits on your own branch: running git rebase -i HEAD^^^^^^^ to squash many small commits into one. As long as you’re just squashing commits, you’ll never have to resolve a merge conflict while doing this.
2. rebasing onto another branch: running git rebase main. This can cause merge conflicts.

I think it’s useful to make this distinction because sometimes I’m thinking about rebase type 1 (which is a lot less likely to cause problems), but people who are struggling with it are thinking about rebase type 2.
Now let’s move on to all the problems!
If you make many tiny commits, sometimes you end up in a hellish loop where you have to fix the same merge conflict 10 times. You can also end up fixing merge conflicts totally unnecessarily (like dealing with a merge conflict in code that a future commit deletes).
There are a few ways to make this better:

- first do a git rebase -i HEAD^^^^^^^^^^^ to squash all of the tiny commits into 1 big commit, and then do a git rebase main to rebase onto a different branch. Then you only have to fix the conflicts once.
- use git rerere to automate repeatedly resolving the same merge conflicts (“rerere” stands for “reuse recorded resolution” – it’ll record your previous merge conflict resolutions and replay them). I’ve never tried this but I think you need to set git config rerere.enabled true and then it’ll automatically help you.

Also if I find myself resolving merge conflicts more than once in a rebase, I’ll usually run git rebase --abort to stop it, and then squash my commits into one and try again.
Generally when I’m doing a rebase onto a different branch, I’m rebasing 1-2 commits. Maybe sometimes 5! Usually there are no conflicts and it works fine.
Some people described rebasing hundreds of commits by many different people onto a different branch. That sounds really difficult and I don’t envy that task.
I heard from several people that when they were new to rebase, they messed up a rebase and permanently lost a week of work that they then had to redo.
The problem here is that undoing a rebase that went wrong is much more complicated than undoing a merge that went wrong (you can undo a bad merge with something like git reset --hard HEAD^). Many newcomers to rebase don’t realize that undoing a rebase is even possible, and I think it’s pretty easy to understand why.
That said, it is possible to undo a rebase that went wrong. Here’s an example of how to undo a rebase using git reflog.

step 1: Do a bad rebase (for example run git rebase -i HEAD^^^^^ and just delete 3 commits)
step 2: Run git reflog. You should see something like this:
ee244c4 (HEAD -> main) HEAD@{0}: rebase (finish): returning to refs/heads/main
ee244c4 (HEAD -> main) HEAD@{1}: rebase (pick): test
fdb8d73 HEAD@{2}: rebase (start): checkout HEAD^^^^^^^
ca7fe25 HEAD@{3}: commit: 16 bits by default
073bc72 HEAD@{4}: commit: only show tooltips on desktop
step 3: Find the entry immediately before rebase (start). In my case that’s ca7fe25.

step 4: Run git reset --hard ca7fe25
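The “find the entry immediately before rebase (start)” part is mechanical enough to sketch in Python, using the reflog output above as data (a toy helper I wrote for illustration, not a real git command):

```python
# Each reflog entry is (commit ID, message), newest first,
# copied from the git reflog output above.
reflog = [
    ("ee244c4", "rebase (finish): returning to refs/heads/main"),
    ("ee244c4", "rebase (pick): test"),
    ("fdb8d73", "rebase (start): checkout HEAD^^^^^^^"),
    ("ca7fe25", "commit: 16 bits by default"),
    ("073bc72", "commit: only show tooltips on desktop"),
]

def pre_rebase_commit(reflog):
    # The entry right after "rebase (start)" (in newest-first order) is
    # where the branch pointed before the rebase began.
    for i, (_, msg) in enumerate(reflog):
        if msg.startswith("rebase (start)"):
            return reflog[i + 1][0]
    return None  # no rebase found in the reflog

print(pre_rebase_commit(reflog))  # the commit to git reset --hard back to
```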
A couple of other ways to undo a rebase:

- @ always refers to your current branch in git, so you can run git reset --hard @{1} to reset your branch to its previous location
- make a backup branch with git switch -c backup before rebasing, so you can easily get back to the old commit

A few people mentioned the following situation:
- you’re collaborating on a branch with someone else
- one of you rebases the branch and runs git push --force (maybe by accident)
- now when the other person runs git pull, it’s a mess – you get a fatal: Need to specify how to reconcile divergent branches error

This is an even worse situation than the “undoing a rebase is hard” situation, because the missing commits might be split across many different people’s machines, and the only thing worse than having to hunt through the reflog is multiple different people having to hunt through the reflog.
This has never happened to me because the only branch I’ve ever collaborated on is main, and main has always been protected from force pushing (in my experience the only way you can get something into main is through a pull request). So I’ve never even really been in a situation where this could happen. But I can definitely see how this would cause problems.
The main tools I know to avoid this are:

- use --force-with-lease when force pushing, to make sure that nobody else has pushed to the branch since your last fetch

Apparently the “since your last fetch” part is important here – if you run git fetch immediately before running git push --force-with-lease, the --force-with-lease won’t protect you at all.
I was curious about why people would run git push --force on a shared branch. Some reasons people gave were:

- their team’s workflow involves rebasing a shared branch before merging it into main. The idea here is that you’re just really careful about coordinating the rebase so nothing gets lost.
- they read instructions online that suggested git rebase and git push --force as a solution, and followed them without understanding the consequences
- they’re used to running git push --force on a personal branch and ran it on a shared branch by accident

The situation here is: you make a pull request, someone reviews it, and then you rebase and force push before they look again – now the reviewer can’t easily see what changed since their last review.
One way to avoid this is to push new commits addressing the review comments, and then after the PR is approved do a rebase to reorganize everything.
I think some reviewers are more annoyed by this problem than others, it’s kind of a personal preference. Also this might be a Github-specific issue, other code review tools might have better tools for managing this.
If you’re rebasing to squash commits, you can lose important commit metadata like Co-Authored-By. Also if you GPG sign your commits, rebase loses the signatures.
There’s probably other commit metadata that you can lose that I’m not thinking of.
I haven’t run into this one so I’m not sure how to avoid it. I think GPG signing commits isn’t as popular as it used to be.
Someone mentioned that it’s important for them to be able to easily revert merging any branch (in case the branch broke something), and if the branch contains multiple commits and was merged with rebase, then you need to do multiple reverts to undo the commits.
In a merge workflow, I think you can revert merging any branch just by reverting the merge commit.
If you’re trying to have a very clean commit history where the tests pass on every commit (very admirable!), rebasing can result in some intermediate commits that are broken and don’t pass the tests, even if the final commit passes the tests.
Apparently you can avoid this by using git rebase -x to run the test suite at every step of the rebase and make sure that the tests are still passing. I’ve never done that though.
accidentally running git commit --amend instead of git rebase --continue

A couple of people mentioned issues with running git commit --amend instead of git rebase --continue when resolving a merge conflict.
The reason this is confusing is that there are two situations where you might want to edit files during a rebase:

- editing a commit (if you set a commit to edit in git rebase -i), where you need to run git commit --amend when you’re done
- resolving a merge conflict, where you need to run git rebase --continue when you’re done

It’s very easy to get these two cases mixed up because they feel very similar. I think what goes wrong here is that you:

step 1: get a merge conflict and resolve it

step 2: run git add file.txt

step 3: run git commit, because that’s what you’re used to doing after you run git add

step 4: but you should have run git rebase --continue! Now you have a weird extra commit, and maybe it has the wrong commit message and/or author

The whole point of rebase is to clean up your commit history, and combining commits with rebase is pretty easy. But what if you want to split up a commit into 2 smaller commits? It’s not as easy, especially if the commit you want to split is a few commits back! I actually don’t really know how to do it even though I feel very comfortable with rebase. I’d probably just do git reset HEAD^^^ or something and use git add -p to redo all my commits from scratch.
One person shared their workflow for splitting commits with rebase.
If you try to do too many things in a single git rebase -i (reorder commits AND combine commits AND modify a commit), it can get really confusing.
To avoid this, I personally prefer to only do 1 thing per rebase, and if I want to do 2 different things I’ll do 2 rebases.
If your branch is long-lived (like for 1 month), having to rebase repeatedly gets painful. It might be easier to just do 1 merge at the end and only resolve the conflicts once.
The dream is to avoid this problem by not having long-lived branches but it doesn’t always work out that way in practice.
A few more issues that I think are not that common:
- if you accidentally run git reset --hard instead of git rebase --abort in the middle of a rebase, things will behave weirdly until you stop it properly

I’ve seen a lot of people arguing about rebase. I’ve been thinking about why this is and I’ve noticed that people work at a few different levels of “commit discipline”:

- Level 1: it doesn’t matter what the commits on main look like
- Level 2: every pull request gets squashed into a single commit with a clear message before it lands on main
- Level 3: every pull request becomes a series of small, logically separate commits, each with a clear message
Often I think different people inside the same company have different levels of commit discipline, and I’ve seen people argue about this a lot. Personally I’m mostly a Level 2 person. I think Level 3 might be what people mean when they say “clean commit history”.
I think Level 1 and Level 2 are pretty easy to achieve without rebase – for Level 1, you don’t have to do anything, and for Level 2, you can either press “squash and merge” in Github or run git switch main; git merge --squash mybranch on the command line.
But for Level 3, you either need rebase or some other tool (like GitUp) to help you organize your commits to tell a nice story.
I’ve been wondering if when people argue about whether people “should” use rebase or not, they’re really arguing about which minimum level of commit discipline should be required.
I think how this plays out also depends on how big the changes folks are making – if folks are usually making pretty small pull requests anyway, squashing them into 1 commit isn’t a big deal, but if you’re making a 6000-line change you probably want to split it up into multiple commits.
A couple of people mentioned using this workflow that doesn’t use rebase:

- work on a feature branch, and run git merge main to merge main into the branch periodically (and fix conflicts if necessary)
- when the branch is done, squash merge it (git checkout main; git merge --squash mybranch) to squash all of the changes into 1 commit. This gets rid of all the “ugly” merge commits.

I originally thought this would make the log of commits on my branch too ugly, but apparently git log main..mybranch will just show you the changes on your branch, like this:
$ git log main..mybranch
756d4af (HEAD -> mybranch) Merge branch 'main' into mybranch
20106fd Merge branch 'main' into mybranch
d7da423 some commit on my branch
85a5d7d some other commit on my branch
Of course, the goal here isn’t to force people who have made beautiful atomic commits to squash their commits – it’s just to provide an easy option for folks to clean up a messy commit history (“add new feature; wip; wip; fix; fix; fix; fix; fix;“) without having to use rebase.
I’d be curious to hear about other people who use a workflow like this and if it works well.
I went into this really feeling like “rebase is fine, what could go wrong?” But many of these problems actually have happened to me in the past, it’s just that over the years I’ve learned how to avoid or fix all of them.
And I’ve never really seen anyone share best practices for rebase, other than “never force push to a shared branch”. All of these honestly make me a lot more reluctant to recommend using rebase.
To recap, I think these are the personal rebase rules I follow:

- if a rebase is going badly, stop it and try another strategy (git rebase --abort)
- know how to use git reflog to undo a bad rebase
- don’t squash commits and rebase onto another branch in the same rebase (first git rebase -i HEAD^^^^, then git rebase main)
- don’t do too many things in a single git rebase -i. Keep it simple.
- never rebase commits that have already been pushed to main
Thanks to Marco Rogers for encouraging me to think about the problems people have with rebase, and to everyone on Mastodon who helped with this.
So I asked people on Mastodon:
what git jargon do you find confusing? thinking of writing a blog post that explains some of git’s weirder terminology: “detached HEAD state”, “fast-forward”, “index/staging area/staged”, “ahead of ‘origin/main’ by 1 commit”, etc
I got a lot of GREAT answers and I’ll try to summarize some of them here. Here’s a list of the terms:
I’ve done my best to explain what’s going on with these terms, but they cover basically every single major feature of git which is definitely too much for a single blog post so it’s pretty patchy in some places.
HEAD and “heads”

A few people said they were confused by the terms HEAD and refs/heads/main, because it sounds like it’s some complicated technical internal thing.
Here’s a quick summary:

- “heads” are branches. Internally, branches are stored in .git/refs/heads. (technically the official git glossary says that the branch is all the commits on it and the head is just the most recent commit, but they’re 2 different ways to think about the same thing)
- HEAD is the current branch. It’s stored in .git/HEAD.

I think that “a head is a branch, HEAD is the current branch” is a good candidate for the weirdest terminology choice in git, but it’s definitely too late for a clearer naming scheme so let’s move on.
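Since .git/HEAD is just a tiny text file, the “current branch or not” logic can be sketched in a few lines of Python (a toy parser for the two formats the file can contain, assuming the reference is always a branch):

```python
def current_branch(head_contents):
    # Toy parse of a .git/HEAD file: "ref: refs/heads/<branch>" means we're
    # on <branch>; a bare commit ID means detached HEAD.
    head_contents = head_contents.strip()
    prefix = "ref: refs/heads/"
    if head_contents.startswith(prefix):
        return head_contents[len(prefix):]
    return None  # detached HEAD: the file holds a commit ID

print(current_branch("ref: refs/heads/main\n"))
print(current_branch("96fa6899ea34697257e84865fefc56beb42d6390"))
```

The first call returns the branch name ("main"); the second returns None, which is the detached HEAD case we’ll get to next.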
There are some important exceptions to “HEAD is the current branch”, which we’ll talk about next.
You’ve probably seen this message:
$ git checkout v0.1
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
[...]
Here’s the deal with this message:

- usually, your current branch is a branch like main
- the current branch is stored in HEAD
- if you run a command like git merge other_branch, that will also affect your current branch
- but HEAD doesn’t have to be a branch! Instead it can be a commit ID, and git calls that “detached HEAD state”
- in detached HEAD state, git pull doesn’t work at all (since the whole point of it is to update your current branch)
- neither does git push, unless you use it in a special way
- git commit, git merge, git rebase, and git cherry-pick do still work, but they’ll leave you with “orphaned” commits that aren’t connected to any branch, so those commits will be hard to find

If you have a merge conflict, you can run git checkout --ours file.txt to pick the version of file.txt from the “ours” side. But which side is “ours” and which side is “theirs”?
I always find this confusing and I never use git checkout --ours because of that, but I looked it up to see which is which.
For merges, here’s how it works: the current branch is “ours” and the branch you’re merging in is “theirs”, like this. Seems reasonable.
$ git checkout merge-into-ours # current branch is "ours"
$ git merge from-theirs # branch we're merging in is "theirs"
For rebases it’s the opposite – the current branch is “theirs” and the target branch we’re rebasing onto is “ours”, like this:
$ git checkout theirs # current branch is "theirs"
$ git rebase ours # branch we're rebasing onto is "ours"
I think the reason for this is that under the hood git rebase main is repeatedly merging commits from the current branch into a copy of the main branch (you can see what I mean by that in this weird shell script that implements git rebase using git merge). But I still find it confusing.
This nice tiny site explains the “ours” and “theirs” terms.
A couple of people also mentioned that VSCode calls “ours”/“theirs” “current change”/“incoming change”, and that it’s confusing in the exact same way.
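One way to keep the rule straight is to treat it as a lookup table. This tiny Python helper is just a mnemonic for the two cases above – it’s not anything git itself provides:

```python
def sides(operation, current_branch, other_branch):
    # "ours"/"theirs" mnemonic: in a merge the current branch is "ours";
    # in a rebase the branch you're rebasing onto is "ours".
    if operation == "merge":
        return {"ours": current_branch, "theirs": other_branch}
    if operation == "rebase":
        return {"ours": other_branch, "theirs": current_branch}
    raise ValueError(f"unknown operation: {operation}")

print(sides("merge", "mybranch", "main"))   # mybranch is "ours"
print(sides("rebase", "mybranch", "main"))  # main is "ours"
```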
This message seems straightforward – it’s saying that your main branch is up to date with the origin!
But it’s actually a little misleading. You might think that this means that your main branch is up to date. It doesn’t. What it actually means is – if you last ran git fetch or git pull 5 days ago, then your main branch is up to date with all the changes as of 5 days ago.
So if you don’t realize that, it can give you a false sense of security.
I think git could theoretically give you a more useful message like “is up to date with the origin’s main as of your last fetch 5 days ago”, because the time that the most recent fetch happened is stored in the reflog, but it doesn’t.
HEAD^, HEAD~, HEAD^^, HEAD~~, HEAD^2, HEAD~2

I’ve known for a long time that HEAD^ refers to the previous commit, but I’ve been confused for a long time about the difference between HEAD~ and HEAD^.
I looked it up, and here’s how these relate to each other:

- HEAD^ and HEAD~ are the same thing (1 commit ago)
- HEAD^^^ and HEAD~~~ and HEAD~3 are the same thing (3 commits ago)
- HEAD^3 refers to the third parent of a commit, and is different from HEAD~3

This seems weird – why are HEAD~ and HEAD^ the same thing? And what’s the “third parent”? Is that the same thing as the parent’s parent’s parent? (spoiler: it isn’t) Let’s talk about it!
Most commits have only one parent. But merge commits have multiple parents – they’re merging together 2 or more commits. In git, HEAD^ means “the parent of the HEAD commit”. But what if HEAD is a merge commit? What does HEAD^ refer to?

The answer is that HEAD^ refers to the first parent of the merge, HEAD^2 is the second parent, HEAD^3 is the third parent, etc.
But I guess they also wanted a way to refer to “3 commits ago”, so HEAD^3 is the third parent of the current commit (which may have many parents if it’s a merge commit), and HEAD~3 is the parent’s parent’s parent.
I think in the context of the merge commit ours/theirs discussion earlier, HEAD^ is “ours” and HEAD^2 is “theirs”.
.. and ...

Here are two commands:

git log main..test
git log main...test

What’s the difference between .. and ...? I never use these so I had to look it up in man git-range-diff. It seems like the answer is that in this case:
A - B main
\
C - D test
- main..test is commits C and D
- test..main is commit B
- main...test is commits B, C, and D

But it gets worse: apparently git diff also supports .. and ..., but they do something completely different than they do with git log? I think the summary is:
- git log test..main shows changes on main that aren’t on test, whereas git log test...main shows changes on both sides
- git diff test..main shows test changes and main changes (it diffs B and D), whereas git diff test...main diffs A and D (it only shows you the diff on one side)

This blog post talks about it a bit more.
Here’s a very common message you’ll see in git status:
$ git status
On branch main
Your branch is behind 'origin/main' by 2 commits, and can be fast-forwarded.
(use "git pull" to update your local branch)
What does “fast-forwarded” mean? Basically it’s trying to say that the two branches look something like this: (newest commits are on the right)
main: A - B - C
origin/main: A - B - C - D - E
or visualized another way:
A - B - C - D - E (origin/main)
|
main
Here origin/main just has 2 extra commits that main doesn’t have, so it’s easy to bring main up to date – we just need to add those 2 commits. Literally nothing can possibly go wrong – there’s no possibility of merge conflicts. A fast-forward merge is a very good thing! It’s the easiest way to combine 2 branches.
After running git pull, you’ll end up in this state:
main: A - B - C - D - E
origin/main: A - B - C - D - E
Here’s an example of a state which can’t be fast-forwarded.
A - B - C - X (main)
|
- - D - E (origin/main)
Here main has a commit that origin/main doesn’t have (X). So you can’t do a fast-forward. In that case, git status would say:
$ git status
Your branch and 'origin/main' have diverged,
and have 1 and 2 different commits each, respectively.
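With linear histories like the ones above, the fast-forward check is easy to model in Python (a toy version: real git actually checks whether your branch’s commit is an ancestor of the remote one, which works for non-linear history too):

```python
def can_fast_forward(local, remote):
    # Toy check for linear histories (lists of commits, oldest first):
    # a fast-forward is possible when local's history is a prefix of remote's.
    return remote[:len(local)] == local

print(can_fast_forward(["A", "B", "C"], ["A", "B", "C", "D", "E"]))      # can fast-forward
print(can_fast_forward(["A", "B", "C", "X"], ["A", "B", "C", "D", "E"])) # diverged
```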
I’ve always found the term “reference” kind of confusing. There are at least 3 things that get called “references” in git:

- branches and tags, like main and v0.2
- HEAD, which is the current branch
- things like HEAD^^^, which git will resolve to a commit ID. Technically these are probably not “references” – I guess git calls them “revision parameters” – but I’ve never used that term.

“symbolic reference” is a very weird term to me because personally I think the only symbolic reference I’ve ever used is HEAD (the current branch), and HEAD has a very central place in git (most of git’s core commands’ behaviour depends on the value of HEAD), so I’m not sure what the point of having it as a generic concept is.
When you configure a git remote in .git/config, there’s this +refs/heads/main:refs/remotes/origin/main thing:

[remote "origin"]
	url = git@github.com:jvns/pandas-cookbook
	fetch = +refs/heads/main:refs/remotes/origin/main

I don’t really know what this means, I’ve always just used whatever the default is when you do a git clone or git remote add, and I’ve never felt any motivation to learn about it or change it from the default.
The man page for git checkout says:

git checkout [-f|--ours|--theirs|-m|--conflict=<style>] [<tree-ish>] [--] <pathspec>...

What’s tree-ish??? What git is trying to say here is that when you run git checkout THING ., THING can be either:

- a commit ID (like 182cd3f)
- a reference to a commit ID (like main or HEAD^^ or v0.3.2)
- a directory inside a commit (like main:./docs)

Personally I’ve never used the “directory inside a commit” thing, and from my perspective “tree-ish” might as well just mean “commit or reference to a commit”.
All of these refer to the exact same thing (the file .git/index, which is where your changes are staged when you run git add):

- git diff --cached
- git rm --cached
- git diff --staged
- the file .git/index
Even though they all ultimately refer to the same file, there’s some variation in how those terms are used in practice:

- the --index and --cached flags do not generally mean the same thing. I have personally never used the --index flag, so I’m not going to get into it, but this blog post by Junio Hamano (git’s lead maintainer) explains all the gnarly details

A bunch of people mentioned that “reset”, “revert” and “restore” are very similar words and it’s hard to differentiate them.
I think it’s made worse because:

- git reset --hard and git restore . on their own do basically the same thing (though git reset --hard COMMIT and git restore --source COMMIT . are completely different from each other)
- the man pages describe them like this:
  - git reset: “Reset current HEAD to the specified state”
  - git revert: “Revert some existing commits”
  - git restore: “Restore working tree files”

Those short descriptions do give you a better sense for which noun is being affected (“current HEAD”, “some commits”, “working tree files”) but they assume you know what “reset”, “revert” and “restore” mean in this context.
Here are some short descriptions of what they each do:
- git revert COMMIT: Create a new commit that’s the “opposite” of COMMIT on your current branch (if COMMIT added 3 lines, the new commit will delete those 3 lines)
- git reset --hard COMMIT: Force your current branch back to the state it was at COMMIT, erasing any new changes since COMMIT. Very dangerous operation.
- git restore --source=COMMIT PATH: Take all the files in PATH back to how they were at COMMIT, without changing any other files or commit history.

Git uses the word “track” in 3 different related ways:
- Untracked files: in the output of git status. This means those files aren’t managed by Git and won’t be included in commits.
- Remote-tracking branches, like origin/main. This is a local reference, and it’s the commit ID that main pointed to on the remote origin the last time you ran git pull or git fetch.
- Branches that “track” a remote, like when main is configured to track the main branch on the remote origin.

The “untracked files” and “remote tracking branch” thing is not too bad – they both use “track”, but the context is very different. No big deal. But I think the other two uses of “track” are actually quite confusing:
- main is a branch that tracks a remote
- origin/main is a remote-tracking branch

But a “branch that tracks a remote” and a “remote-tracking branch” are different things in Git and the distinction is pretty important! Here’s a quick summary of the differences:
- main is a branch. You can make commits to it, merge into it, etc. It’s often configured to “track” the remote main in .git/config, which means that you can use git pull and git push to push/pull changes.
- origin/main is not a branch. It’s a “remote-tracking branch”, which is not a kind of branch (I’m sorry). You can’t make commits to it. The only way you can update it is by running git pull or git fetch to get the latest state of main from the remote.

I’d never really thought about this ambiguity before but I think it’s pretty easy to see why folks are confused by it.
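Here’s a sketch of that difference in a throwaway clone (all local paths, no network involved):

```shell
# make a "remote" repo and clone it (git >= 2.28 for `init -b`)
cd "$(mktemp -d)" && git init -q -b main remote-repo
git -C remote-repo -c user.name=me -c user.email=me@example.com \
    commit -q --allow-empty -m "initial"
git clone -q remote-repo local-repo && cd local-repo

git branch -a                   # main, plus remotes/origin/main
# the "tracks a remote" part is just configuration:
git config branch.main.remote   # prints: origin
git config branch.main.merge    # prints: refs/heads/main
```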
Checkout does two totally unrelated things:
- git checkout BRANCH switches branches
- git checkout file.txt discards your unstaged changes to file.txt
This is well known to be confusing and git has actually split those two
functions into git switch
and git restore
(though you can still use
checkout if, like me, you have 15 years of muscle memory around git checkout
that you don’t feel like unlearning)
Also personally after 15 years I still can’t remember the order of the
arguments to git checkout main file.txt
for restoring the version of
file.txt
from the main
branch.
I think sometimes you need to pass --
to checkout
as an argument somewhere
to help it figure out which argument is a branch and which ones are paths but I
never do that and I’m not sure when it’s needed.
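For what it’s worth, the order is revision first, then paths, with the -- between them. A sketch in a throwaway repo:

```shell
cd "$(mktemp -d)" && git init -q -b main
g() { git -c user.name=me -c user.email=me@example.com "$@"; }
echo hello > file.txt
g add file.txt && g commit -q -m "add file.txt"

echo scribble > file.txt      # mess up the file

# revision first, then --, then the paths to restore:
git checkout main -- file.txt
cat file.txt                  # prints: hello
```

The -- is what tells git that everything after it is a path, which matters if a file and a branch happen to share a name.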
Lots of people mentioned reading reflog as re-flog and not ref-log. I won’t get deep into the reflog here because this post is REALLY long but:
A bunch of people mentioned being confused about the difference between merge and rebase and not understanding what the “base” in rebase was supposed to be.
I’ll try to summarize them very briefly here, but I don’t think these 1-line explanations are that useful because people structure their workflows around merge / rebase in pretty different ways and to really understand merge/rebase you need to understand the workflows. Also pictures really help. That could really be its whole own blog post though so I’m not going to get into it.
rebase --onto
git rebase has a flag called --onto. This has always seemed confusing to me because the whole point of git rebase main is to rebase the current branch onto main. So what’s the extra --onto argument about?
I looked it up, and --onto
definitely solves a problem that I’ve rarely/never
actually had, but I guess I’ll write down my understanding of it anyway.
A - B - C (main)
     \
      D - E - F - G (mybranch)
          |
          otherbranch
Imagine that for some reason I just want to move commits F
and G
to be
rebased on top of main
. I think there’s probably some git workflow where this
comes up a lot.
Apparently you can run git rebase --onto main otherbranch mybranch
to do
that. It seems impossible to me to remember the syntax for this (there are 3
different branch names involved, which for me is too many), but I heard about it from a
bunch of people so I guess it must be useful.
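Here’s that exact picture reproduced in a throwaway repo, so the syntax is at least written down somewhere (branch names as in the diagram above):

```shell
cd "$(mktemp -d)" && git init -q -b main
c() { echo "$1" > "$1.txt" && git add "$1.txt" &&
      git -c user.name=me -c user.email=me@example.com commit -q -m "$1"; }
c A; c B; c C                        # main: A - B - C
git checkout -q -b mybranch main~2   # branch off at A
c D; c E
git branch otherbranch               # otherbranch points at E
c F; c G                             # mybranch: A - D - E - F - G

# replay only otherbranch..mybranch (F and G) onto main:
git -c user.name=me -c user.email=me@example.com \
    rebase -q --onto main otherbranch mybranch
git log --oneline                    # G, F, C, B, A
```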
Someone mentioned that they found it confusing that commit is used both as a verb and a noun in git.

My guess is that most folks get used to this relatively quickly, but this use of “commit” is different from how it’s used in SQL databases, where I think “commit” is just a verb (you “COMMIT” to end a transaction) and not a noun.
Also, you can think of a Git commit in 3 different ways:
- as a diff
- as a snapshot of all your files
- as part of a history
None of those are wrong: different commands use commits in all of these ways.
For example git show
treats a commit as a diff, git log
treats it as a
history, and git restore
treats it as a snapshot.
But git’s terminology doesn’t do much to help you understand in which sense a commit is being used by a given command.
Here are a bunch more confusing terms. I don’t know what a lot of these mean.
things I don’t really understand myself:
- git log -S and git log -G (for searching the diffs of previous commits?)

things that people mentioned finding confusing but that I left out of this post because it was already 3000 words:
- how push and pull aren’t opposites
- the difference between fetch and pull (pull = fetch + merge)
- origin main (like in git push origin main) vs origin/main

github terms people mentioned being confused by:
- “squash and merge” (I didn’t know about git merge --squash until yesterday, I thought “squash and merge” was a special github feature)

I was surprised that basically every other core feature of git was mentioned by at least one person as being confusing in some way. I’d be interested in hearing more examples of confusing git terms that I missed too.
There’s another great post about this from 2012 called the most confusing git terminology. It talks more about how git’s terminology relates to CVS and Subversion’s terminology.
If I had to pick the 3 most confusing git terms, I think right now I’d pick:
- how a head is a branch, and HEAD is the current branch

I learned a lot from writing this – I learned a few new facts about git, but more importantly I feel like I have a slightly better sense now for what someone might mean when they say that everything in git is confusing.
I really hadn’t thought about a lot of these issues before – like I’d never realized how “tracking” is used in such a weird way when discussing branches.
Also as usual I might have made some mistakes, especially since I ended up in a bunch of corners of git that I hadn’t visited before.
Also a very quick plug: I’m working on writing a zine about git, if you’re interested in getting an email when it comes out you can sign up to my very infrequent announcements mailing list.
None of these things feel super surprising in retrospect, but I hadn’t thought about them clearly before.
The facts are:
Let’s talk about them!
When you run git add file.txt
, and then git status
, you’ll see something like this:
$ git add content/post/2023-10-20-some-miscellaneous-git-facts.markdown
$ git status
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
new file: content/post/2023-10-20-some-miscellaneous-git-facts.markdown
People usually call this “staging a file” or “adding a file to the staging area”.
When you stage a file with git add
, behind the scenes git adds the file to its object
database (in .git/objects
) and updates a file called .git/index
to refer to
the newly added file.
This “staging area” actually gets referred to by 3 different names in Git. All of these refer to the exact same thing (the file .git/index):
- git diff --cached
- git diff --staged
- .git/index
I felt like I should have realized this earlier, but I didn’t, so there it is.
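You can check this equivalence yourself in a scratch repo:

```shell
cd "$(mktemp -d)" && git init -q -b main
git -c user.name=me -c user.email=me@example.com \
    commit -q --allow-empty -m "initial"
echo hello > file.txt
git add file.txt

git diff --cached --stat   # file.txt | 1 +
git diff --staged --stat   # exactly the same output
ls .git/index              # the file where all of this lives
```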
When I run git stash
to stash my changes, I’ve always been a bit confused
about where those changes actually went. It turns out that when you run git
stash
, git makes some commits with your changes and labels them with a reference
called stash
(in .git/refs/stash
).
Let’s stash this blog post and look at the log of the stash
reference:
$ git log stash --oneline
6cb983fe (refs/stash) WIP on main: c6ee55ed wip
2ff2c273 index on main: c6ee55ed wip
... some more stuff
Now we can look at the commit 2ff2c273
to see what it contains:
$ git show 2ff2c273 --stat
commit 2ff2c273357c94a0087104f776a8dd28ee467769
Author: Julia Evans <julia@jvns.ca>
Date: Fri Oct 20 14:49:20 2023 -0400
index on main: c6ee55ed wip
content/post/2023-10-20-some-miscellaneous-git-facts.markdown | 40 ++++++++++++++++++++++++++++++++++++++++
Unsurprisingly, it contains this blog post. Makes sense!
git stash
actually creates 2 separate commits: one for the index, and one for
your changes that you haven’t staged yet. I found this kind of heartening
because I’ve been working on a tool to snapshot and restore the state of a git
repository (that I may or may not ever release) and I came up with a very
similar design, so that made me feel better about my choices.
Apparently older commits in the stash are stored in the reflog.
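You can see those older stash commits with the reflog of the stash reference – as far as I can tell, this is what git stash list is reading:

```shell
cd "$(mktemp -d)" && git init -q -b main
g() { git -c user.name=me -c user.email=me@example.com "$@"; }
echo one > f.txt && g add f.txt && g commit -q -m "initial"

echo two > f.txt   && g stash -q
echo three > f.txt && g stash -q

g stash list       # stash@{0} and stash@{1}
g reflog stash     # the same two entries, as a reflog
```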
Git’s documentation often refers to “references” in a generic way that I find
a little confusing sometimes. Personally 99% of the time when I deal with
a “reference” in Git it’s a branch or HEAD
and the other 1% of the time it’s a tag. I
actually didn’t know ANY examples of references that weren’t branches or tags or HEAD
.
But now I know one example – the stash is a reference, and it’s not a branch or tag! So that’s cool.
Here are all the references in my blog’s git repository (other than HEAD
):
$ find .git/refs -type f
.git/refs/heads/main
.git/refs/remotes/origin/HEAD
.git/refs/remotes/origin/main
.git/refs/stash
Some other references people mentioned in reponses to this post:
refs/notes/*
, from git notes
refs/pull/123/head
, and `refs/pull/123/head
for GitHub pull requests (which you can get with git fetch origin refs/pull/123/merge
)refs/bisect/*
, from git bisect
Here’s a toy git repo where I created two branches x
and y
, each with 1
file (x.txt
and y.txt
) and merged them. Let’s look at the merge commit.
$ git log --oneline
96a8afb (HEAD -> y) Merge branch 'x' into y
0931e45 y
1d8bd2d (x) x
If I run git show 96a8afb
, the commit looks “empty”: there’s no diff!
$ git show 96a8afb
commit 96a8afbf776c2cebccf8ec0dba7c6c765ea5d987 (HEAD -> y)
Merge: 0931e45 1d8bd2d
Author: Julia Evans <julia@jvns.ca>
Date: Fri Oct 20 14:07:00 2023 -0400
Merge branch 'x' into y
But if I diff the merge commit against each of its two parent commits separately, you can see that of course there is a diff:
$ git diff 0931e45 96a8afb --stat
x.txt | 1 +
1 file changed, 1 insertion(+)
$ git diff 1d8bd2d 96a8afb --stat
y.txt | 1 +
1 file changed, 1 insertion(+)
It seems kind of obvious in retrospect that merge commits aren’t actually “empty” (they’re snapshots of the current state of the repo, just like any other commit), but I’d never thought about why they appear to be empty.
Apparently the reason that these merge diffs are empty is that merge diffs only show conflicts – if I instead create a repo
with a merge conflict (one branch added x
and another branch added y
to the
same file), and show the merge commit where I resolved the conflict, it looks
like this:
$ git show HEAD
commit 3bfe8311afa4da867426c0bf6343420217486594
Merge: 782b3d5 ac7046d
Author: Julia Evans <julia@jvns.ca>
Date: Fri Oct 20 15:29:06 2023 -0400
Merge branch 'x' into y
diff --cc file.txt
index 975fbec,587be6b..b680253
--- a/file.txt
+++ b/file.txt
@@@ -1,1 -1,1 +1,1 @@@
- y
 -x
++z
It looks like this is trying to tell me that one branch added x
, another
branch added y
, and the merge commit resolved it by putting z
instead. But
in the earlier example, there was no conflict, so Git didn’t display a diff at all.
(thanks to Jordi for telling me how merge diffs work)
I’ll keep this post short, maybe I’ll write another blog post with more git facts as I learn them.
Here’s the video, as well as the slides and a transcript of (roughly) what I said in the talk.
I often give talks about things that I'm excited about, or that I think are really fun.
But today, I want to talk about something that I'm a little bit mad about, which is that sometimes things that seem like they should be basic take me 10 years or 20 years to learn, way longer than it seems like they should.
And sometimes this would feel kind of personal! This shouldn't be so hard for me! I should understand this already. It's been seven years!
And this "it's just me" attitude is often encouraged -- when I write about finding things hard to learn on the Internet, Internet strangers will sometimes tell me: "yeah, this is easy! You should get it already! Maybe you're just not very smart!"
But luckily I have a pretty big ego so I don't take the internet strangers too seriously. And I have a lot of patience so I'm willing to keep coming back to a topic I'm confused about. There were maybe four different things that were going wrong with DNS in my life and eventually I figured them all out.
So, hooray! I understood DNS! I win! But then I see some of my friends struggling with the exact same things.
They're wondering, hey, my DNS isn't working. Why not?
And it doesn't end. We're still having the same problems over and over and over again. And it's frustrating! It feels redundant! It makes me mad. Especially when friends take it personally, and they feel like "hey I should really understand this already".
Because everyone is going through this. From the sounds of recognition I hear, I think a lot of you have been through some of these same problems with DNS.
I started a little publishing company called Wizard Zines where --
(applause)
Wow. Where I write about some of these topics and try to demystify them.
We're going to talk about bash, HTTP, SQL, and DNS.
For each of them, we're going to talk a little bit about:
a. what's so hard about it?
b. what are some things we can do to make it a little bit easier for each other?
First, let's run this script, bad.sh
:
mv ./*.txt /tmmpp
echo "success!"
This moves a file and prints "success!". And with most of the programming languages that I use, if there's a problem, the program will stop.
[laughter from audience]
But I think a lot of you know from maybe sad experience that bash does not stop, right? It keeps going. And going... and sometimes very bad things happen to your computer in the process.
When I run this program, here's the output:
mv: cannot stat './*.txt': No such file or directory
success!
It didn't stop after the failed mv
.
Eventually I learned that you can write set -e at the top of your program, and that will make bash stop if there's a problem.
When we run this new program with set -e
at the top, here's the output:
mv: cannot stat './*.txt': No such file or directory
Here we've put our code in a function. And if the function fails, we want to echo "failed".
So use set -e
at the beginning, and you might think everything should be okay.
But if we run it... this is the output we get
mv: cannot stat './*.txt': No such file or directory
success
We get the "success" message again! It didn't stop, it just kept going. This is because calling the function in an "or" (|| echo "failed") disables set -e inside the function.

Which is certainly not what I wanted, and not what I would expect. But this is not a bug in bash, it's the documented behavior.
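The script on that slide isn't reproduced in this transcript, but it was something like this (my reconstruction, not the exact slide):

```shell
#!/bin/bash
set -e

f() {
    mv ./*.txt /tmmpp   # fails...
    echo "success"      # ...but still runs, because set -e is off here
}

# invoking f on the left of || disables set -e inside f
f || echo "failed"
```

Running it prints the mv error and then "success" – the || never fires, because f's exit status ends up being the successful echo's.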
And I think one reason this is tricky is a lot of us don't use bash very often. Maybe you write a bash script every six months and don't look at it again.
When you use a system very infrequently and it's full of a lot of weird trivia and gotchas, it's hard to use the system correctly.
But I would say this is factually untrue. How many of you are using bash?
A lot of us ARE using it! And it doesn't always work perfectly, but often it gets the job done.
The way I think this is -- you have some people on the left in this diagram who are confused about bash, who think it seems awful and incomprehensible.
And some people on the right who know how to make the bash work for them, mostly.
So how do we move people from the left to the right, from being overwhelmed by a pile of impossible gotchas to being able to mostly use the system correctly?
And for bash, we have this incredible tool called shellcheck.
[ Applause ]
Yes! Shellcheck is amazing! And shellcheck knows a lot of things that can go wrong and can tell you "oh no, you don't want to do that. You're going to have a bad time."
I'm very grateful for shellcheck, it makes it much easier for me to write tiny bash scripts from time to time.
Now let's do a shellcheck demo!
$ shellcheck -o all bad-again.sh

In bad-again.sh line 7:
f || echo "failed!"
^-- SC2310 (info): This function is invoked in an || condition so set -e will be disabled. Invoke separately if failures should cause the script to exit.
Shellcheck gives us this
lovely error message. The message isn't completely obvious on its own (and this
check is only run if you invoke shellcheck with -o all
). But
shellcheck tells you "hey, there's this problem, maybe you should be worried
about that".
And I think it's wonderful that all these tips live in this linter.
I'm not trying to tell you to write linters, though I think that some of you probably will write linters because this is that kind of crowd.
I've personally never written a linter, and I'm definitely not going to create something as cool as shellcheck!
But instead, the way I write linters is I tell people about shellcheck from time to time and then I feel a little like I invented shellcheck for those people. Because some people didn't know about the tool until I told them about it!
I didn't find out about shellcheck for a long time and I was kind of mad about it when I found out. I felt like -- excuse me? I could have been using shellcheck this whole time? I didn't need to remember all of this stuff in my brain?
So I think an incredible thing we can do is to reflect on the tools that we're using to reduce our cognitive load and all the things that we can't fit into our minds, and make sure our friends or coworkers know about them.
I also like to warn people about gotchas and some of the terrible things computers have done to me.
I think this is an incredibly valuable community service. The example I shared
about how set -e
got disabled is something I learned from my
friend Jesse a few weeks ago.
They told me how this thing happened to them, and now I know and I don't have to go through it personally.
One way I see people kind of trying to share terrible things that their computers have done to them is by sharing "best practices".
But I really love to hear the stories behind the best practices!
If someone has a strong opinion like "nobody should ever use bash", I want to hear about the story! What did bash do to you? I need to know.
The reason I prefer stories to best practices is if I know the story about how the bash hurt you, I can take that information and decide for myself how I want to proceed.
Maybe I feel like -- the computer did that to you? That's okay, I can deal with that problem, I don't mind.
Or I might instead feel like "oh no, I'm going to do the best practice you recommended, because I do not want that thing to happen to me".
These bash stories are a great example of that: my reaction to them is "okay, I'm going to keep using bash, I'll just use shellcheck and keep my bash scripts pretty simple". But other people see them and decide "wow, I never want to use bash for anything, that's awful, I hate it".
Different people have different reactions to the same stories and that's okay.
I was talking to Marco Rogers at some point, many years ago, and he mentioned some new developers he was working with were struggling with HTTP.
And at first, I was a little confused about this -- I didn't understand what was hard about HTTP.
The way I was thinking about it at the time was that if you have an HTTP response, it has a few parts: a response code, some headers, and a body.
I felt like -- that's a pretty simple structure, what's the problem? But of course there was a problem, I just couldn't see what it was at first.
So, I talked to a friend who was newer to HTTP. And they asked "why does it matter what headers you set?"
And I said: "well, the browser..."
The browser!
Firefox is 20 million lines of code! It's been evolving since the '90s. There have been, as I understand it, 1 million changes to the browser security model as people have discovered new and exciting exploits and the web has become a scarier and scarier place.
The browser is really a lot to understand.
One trick for understanding why a topic is hard is -- if the implementation of the thing involves 20 million lines of code, maybe that's why people are confused!
Though that 20 million lines of code also involves CSS and JS and many other things that aren't HTTP, but still.
Once I thought of it in terms of how complex a modern web browser is, it made so much more sense! Of course newcomers are confused about HTTP if you have to understand what the browser is doing!
Then my problem changed from "why is this hard?" to "how do I explain this at all?"
So how do we make it easier? How do we wrap our minds around this 20 million lines of code?
One way I think about this for HTTP is: here are some of the HTTP request headers. That's kind of a big list -- there are 43 headers there.
There are more unofficial headers too.
My brain does not contain all of those headers, I have no idea what most of them are.
When I think about trying to explain big topics, I think about -- what is actually in my brain, which only contains a normal human number of things?
This is a comic I drew about HTTP request headers. You don't have to read the whole thing. This has 15 request headers.
I wrote that these are "the most important headers", but what I mean by "most important" here is that these are the ones that I know about and use. It's a subjective list.
I wrote about 12 words about each one, which I think is approximately the amount of information about each header that lives in my mind.
For example I know that you can set Accept-Encoding
to gzip
and then you might get back a compressed response. That's all I know,
and that's usually all I need to know!
This very small set of information is working pretty well for me.
The general way I think about this trick is "turn a big list into a small list".
Turn the set of EVERY SINGLE THING into just the things I've personally used. I find it helps a lot.
Another example of this "turn a big list into a small list" trick is command line arguments.
I use a lot of command line tools, the number of arguments they have can be overwhelming, and I've written about them a fair amount over the years.
Here are all the flags for grep, from its man page. That's too much! I've been using grep for 20 years but I don't know what all that stuff is.
But when I look at the grep man page, this is what I see.
I think it's very helpful to newcomers when a more experienced person says "look, I've been using this system for a while, I know about 7 things about it, and here's what they are".
We're just pruning those lists down to a more human scale. And it can even help other more experienced people -- often someone else will know a slightly different set of 7 things from me.
But what about the stuff that doesn't fit in my brain?
Because I have a few things about HTTP stored in my brain. But sometimes I need other information which is hard to remember, like maybe the exact details of how CORS works.
I often have trouble finding the right references.
For example I've been trying to learn CSS off and on for 20 years. I've made a lot of progress -- it's going well!
But only in the last 2 years or so I learned about this wonderful website called CSS Tricks.
And I felt kind of mad when I learned about CSS Tricks! Why didn't I know about this before? It would have helped me!
But anyway, I'm happy to know about CSS Tricks now. (though sadly they seem to have stopped publishing in April after the acquisition, I'm still happy the older posts are there)
For HTTP, I think a lot of us use the Mozilla Developer Network.
Another HTTP reference I love is the official RFC, RFC 9110 (also 9111, 9112, 9113, 9114)
It's a new authoritative reference for HTTP and it was written just last
year, in 2022! They decided to organize all the information really nicely. So if you
want to know exactly what the Connection
header does, you can look
it up.
This is not really my top reference. I'm usually on MDN. But I really appreciate that it's available.
So I love to share my favorite references.
I do sometimes find it tempting to kind of lie about references. Not on purpose. But I'll see something on the internet, and I'll think it's kind of cool, and tell a friend about it. But then my friend might ask me -- "when have you used this?" And I'll have to admit "oh, never, I just thought it seemed cool".
I think it's important to be honest about what the references that I'm actually using in real life are. Even if maybe the real references I use are a little "embarrassing", like maybe w3schools or something.
I started thinking about SQL because someone mentioned they're trying to learn SQL. I get most of my zine ideas that way, one person will make an offhand comment and I'll decide "ok, I'm going to spend 4 months writing about that". It's a weird process.
So I was wondering -- what's hard about SQL? What gets in the way of trying to learn that?
I want to say that when I'm confused about what's hard about something, that's a fact about me. It's not usually that the thing is easy, it's that I need to work on understanding what's hard about it. It's easy to forget when you've been using something for a while.
So, I was used to reading SQL queries. For example this made up query that tries to find people who own exactly two cats. It felt straightforward to me, SELECT, FROM, WHERE, GROUP BY.
But then I was talking to a friend about these queries who was new to SQL. And my friend asked -- what is this doing?
I thought, hmm, fair point.
And I think the point my friend was making was that the order that this SQL query is written in, is not the order that it actually happens in. It happens in a different order, and it's not immediately obvious what that is.
I like to think about: what does the computer do first? What actually happens first chronologically?
Computers actually do live in the same timeline as us. Things happen. Things happen in an order. So what happens first? What happens first is the FROM clause: the database starts by grabbing the table (in this case, cats).

So, that's how I think about SQL. The way a query runs is first FROM, then WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT.
At least conceptually. Real life databases have optimizations and it's more complicated than that. But this is the mental model that I use most of the time and it works for me. Everything is in the same order as you write it, except SELECT is fifth.
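Here's a sketch of that kind of query (the schema and names are made up, since the slide isn't reproduced here), annotated with the order the clauses conceptually run in:

```shell
sqlite3 <<'EOF'
CREATE TABLE cats (name TEXT, owner TEXT);
INSERT INTO cats VALUES
  ('smudge', 'ada'), ('mabel', 'ada'),
  ('pickle', 'bo'),
  ('tuna', 'cy'), ('miso', 'cy'), ('bean', 'cy');

SELECT owner              -- 5: pick the columns
FROM cats                 -- 1: start with the table
WHERE name != 'garfield'  -- 2: filter rows
GROUP BY owner            -- 3: group them
HAVING COUNT(*) = 2       -- 4: filter the groups
ORDER BY owner;           -- 6: sort the output
EOF
```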
Another place I've used this "what happens chronologically?" trick is CORS, in HTTP.
This comic is way too small to read on the slide. But the idea is if you're making a cross-origin request in your browser, you can write down every communication that's happening between your browser and the server, in chronological order.
And I think writing down everything in chronological order makes it a lot easier to understand and more concrete.
"What happens in chronological order?" is a very straightforward structure, which is what I like about it. "What happens first?" feels like it should be easy to answer. But it's not!
I've found that it's actually very hard to know what our computers are doing, and it's a really fun question to explore.
As an example of how this is hard: I wrote a blog post recently called "Behind Hello World on Linux". It's about what happens when you run "hello world" on a Linux computer. I wrote a bunch about it, and I was really happy with it.
But after I wrote the post, I thought -- haven't I written about this before? Maybe 10 years ago?
And sure enough, I'd tried to write a similar post 10 years before.
I think this is really cool. Because the 2013 version of this post was about 6 times shorter. This isn't because Linux is more complicated than it was 10 years ago -- I think everything in the 2023 post was probably also true in 2013. The 2013 post just has a lot less information in it.
The reason the 2023 post is longer is that I didn't know what was happening chronologically on my computer in 2013 very well, and in 2023 I know a lot more. Maybe in 2033 I'll know even more!
I think a lot of us -- like me in 2013 and honestly me now, often don't know the facts of what's happening on our computers. It's very hard, which is what makes it such a fun question to try and discuss.
I think it's cool that all of us have different knowledge about what is happening chronologically on our computers and we can all chip in to this conversation.
For example when I posted this blog post about Hello World on Linux, some people mentioned that they had a lot of thoughts about what happens exactly in your terminal, or more details about the filesystem, or about what's happening internally in the Python interpreter, or any number of things. You can go really deep.
I think it's just a really fun collaborative question.
I've seen "what happens chronologically?" work really well as an activity with coworkers, where you ask: "when a request comes into this API endpoint we run, how does that work? What happens?"
What I've seen is that someone will understand some part of the system, like "X happens, then Y happens, then it goes over to the database and I have no idea how that works". And then someone else can chime in and say "ah, yes, with the database A B C happens, but then there's a queue and I don't know about that".
I think it's really fun to get together with people who have different specializations and try to make these little timelines of what the computers are doing. I've learned a lot from doing that with people.
Even though I struggled with DNS, once I figured it out, I felt like "dude, this is easy!". Even though it took me 10 years to learn how it works.
But of course, DNS was pretty hard for me to learn. So -- why is that? Why did it take me so long?
So, I have a little chart here of how I think about DNS.
You have your browser on the left. And over on the right there's the authoritative nameservers, the source of truth of where the DNS records for a domain live.
In the middle, there's a function that you call and a cache. So you have browser, function, cache, source of truth.
One problem is that there are a lot of things in this diagram that are totally hidden from you.
The library code that you're using where you make a DNS request -- there are a lot of different libraries you could be using, and it's not straightforward to figure out which one is being used. That was the source of some of my confusion.
There's a cache which has a bunch of cached data. That's invisible to you, you can't inspect it easily and you have no control over it.
And there's a conversation between the cache and the source of truth, these two red arrows which also you can't see at all.
So this is kind of tough! How are you supposed to develop an intuition for a system when it's mostly things that are completely hidden from you? Feels like a lot to expect.
So: let's talk about these red arrows on the right.
We have our cache and then we have the source of truth. This conversation is normally hidden from you because you often don't control either of these servers. Usually they're too busy doing high-performance computing to report to you what they're doing.
But I thought: anyone can write an authoritative nameserver! In particular, I could write one that reports back every single message that it receives to its users. So, with my friend Marie, we wrote a little DNS server.
(demo of messwithdns.net)
This is called Mess With DNS. The idea is I have a domain name and you
can do whatever you want with it. We're going to make a DNS record called
strangeloop
, and we're going to make a CNAME record pointing at
orange.jvns.ca
, which is just a picture of an orange. Because I
like oranges.
And then over here, every time a request comes in from a resolver, this will -- this will report back what happened. So, if we click on this link, we can see -- a Canadian DNS resolver, which is apparently what my browser is configured to use, is requesting an IPv4 record and an IPv6 record, A and AAAA.
(at this point in the demo everyone in the audience starts visiting the link and it gets a bit chaotic, it's very funny)

So the trick here is to find ways to show people parts of what the computer is doing that are normally hidden.
Another great example of showing things that are hidden is this website called float.exposed by Bartosz Ciechanowski who makes a lot of incredible visualizations.
So if you look at this 32-bit floating point number and click the "up" button on the significand, it'll show you the next floating point number, which is 2 more. And then as you make the number bigger and bigger (by increasing the exponent), you can see that the floating point numbers get further and further apart.
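You can play with the same widening-gaps idea outside the website using Python's math.nextafter, which returns the adjacent floating point number. (This is a quick sketch of mine, not part of the talk, and it uses 64-bit floats rather than the 32-bit ones on the site -- the idea is the same.)

```python
import math

# the gap between adjacent 64-bit floats near 1.0 is tiny...
gap_near_one = math.nextafter(1.0, math.inf) - 1.0

# ...but near 1e16 the exponent is much bigger, so adjacent floats
# are a whole 2.0 apart -- odd integers in between just don't exist
gap_near_1e16 = math.nextafter(1e16, math.inf) - 1e16

print(gap_near_one)   # 2.220446049250313e-16
print(gap_near_1e16)  # 2.0
```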
Anyway, this is not a talk about floating point. I could do an entire talk about this site and how we can use it to see how floating point works, but that's not this talk.
Another thing that makes DNS confusing is that it's a giant distributed system -- maybe you're confused because there are 5 million computers involved (really, more!). Most of which you have no control over, and some are doing not what they're supposed to do.
So that's another trick for understanding why things are hard, check to see if there are actually 5 million computers involved.
So what else is hard about DNS?
We've talked about how most of the system is hidden from you, and about how it's a big distributed system.
One of the hidden things I talked about was: the resolver has cached data, right? And you might be curious about whether a certain domain name is cached or not by your resolver right now.
Just to understand what's happening: am I getting this result because it was cached? What's the deal?
I said this was hidden, but there are a couple of ways to query a resolver to see what it has cached, and I want to show you one of them.
It's dig, which has a flag called +norecurse. You can use it to query a resolver and ask it to only return results it already has cached.
With dig +norecurse jvns.ca, I'm kind of asking -- how popular is my website? Is it popular enough that someone has visited it in the last 5 minutes? Because my records are not cached for that long, only for 5 minutes.
But when I look at this response, I feel like "please! What is all this?"
And when I show newcomers this output, they often respond by saying "wow, that's complicated, this DNS thing must be really complicated". But really this is just not a great output format, I think someone just made some relatively arbitrary choices about how to print this stuff out in the 90s and it's stayed that way ever since.
So a bad output format can mislead newcomers into thinking that something is more complicated than it actually is.
One of my favorite tricks, I call eraser eyes.
Because when I look at that output, I'm not looking at all of it, I'm just looking at a few things. My eyes are ignoring the rest of it.
When I look at the output, this is what I see: it says SERVFAIL. That's the DNS response code, which as I understand it is a very unintuitive way of saying "I do not have that in my cache". So nobody has asked that resolver about my domain name in the last 5 minutes, which isn't very surprising.
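(A side note on what +norecurse actually does: it just leaves the "recursion desired" bit in the DNS query header unset, so the resolver won't go ask the authoritative servers on your behalf. Here's a sketch of building a raw query by hand to show where that bit lives -- the helper name is mine, not a real library:)

```python
import struct

def build_query(name: str, recurse: bool) -> bytes:
    # DNS header: ID, flags, then four section counts (1 question)
    flags = 0x0100 if recurse else 0x0000  # 0x0100 is the RD bit
    header = struct.pack(">HHHHHH", 0x1234, flags, 1, 0, 0, 0)
    # the question: length-prefixed labels, then QTYPE=A, QCLASS=IN
    labels = b"".join(
        bytes([len(part)]) + part.encode() for part in name.split(".")
    )
    return header + labels + b"\x00" + struct.pack(">HH", 1, 1)

# with recurse=False, a resolver should only answer from its cache
query = build_query("jvns.ca", recurse=False)
```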
I've learned so much from people doing a little demo of a tool, and showing how they use it and which parts of the output or UI they pay attention to, and which parts they ignore.
Because usually we ignore most of what's on our screens!
I really love to use dig even though it's a little hairy, because it has a lot of features (I don't know of another DNS debugging tool that supports this +norecurse trick), it's everywhere, and it hasn't changed in a long time. And I know if I learn its weird output format once I can know it forever. Stability is really valuable to me.
We've talked about some tricks I use to bring people over, like:
When I practiced this talk, I got some feedback from people saying "julia! I don't do those things! I don't have a blog, and I'm not going to start one!"
And it's true that most people are probably not going to start programming blogs.
But I really don't think you need to have a public presence on the internet to tell the people around you a little bit about how you use computers and how you understand them.
My experience is that a lot of people (who do not have blogs!) have helped me understand how computers work and have shared little pieces of their experience with computers with me.
I've learned a lot from my friends and my coworkers and honestly a lot of random strangers on the Internet too. I'm pretty sure some of you here today have helped me over the years, maybe on Twitter or Mastodon.
So I want to talk about some archetypes of helpful people.
One kind of person who has really helped me is the grumpy old-timer. I'll say "this is so cool". And they'll reply "yes -- however, let me tell you some stories of how this has gone wrong in my life."
And those stories have sometimes helped spare me some suffering.
We have the loud newbie, who asks questions like "wait, how does that work?" And then everyone else feels relieved -- "oh, thank god. It's not just me."
I think it's especially valuable when the person who takes the "loud newbie" role is actually a pretty senior developer. Because when you're more secure in your position, it's easier to put yourself out there and say "uh, I don't get this" because nobody is going to judge you for that and think you're incompetent.
And then other people who feel more like they might be judged for not knowing something can ride along on your coattails.
Then we have the bug chronicler. Who decides "ok, that bug. This can never happen again".
"I'm gonna make sure we understand what happened. Because I want this to end now."
And much like when debugging a computer program, when you have a bug, you want to understand why the bug is happening if you're gonna fix it.
If we're all struggling with the same things together for the same reasons, if we can figure out what those reasons are, we can do a better job of fixing them.
And that's all I have for you. Thank you.
I brought some zines to the conference, if you come to the signing later on you can get one.
This was the last ever Strange Loop and I’m really grateful to Alex Miller and the whole organizing team for making such an incredible conference for so many years. Strange Loop accepted one of my first talks (you can be a kernel hacker) 9 years ago when I had almost no track record as a speaker so I owe a lot to them.
Thanks to Sumana for coming up with the idea for this talk, and to Marie, Danie, Kamal, Alyssa, and Maya for listening to rough drafts of it and helping make it better, and to Dolly, Jesse, and Marco for some of the conversations I mentioned.
Also after the conference Nick Fagerland wrote a nice post with thoughts on why git is hard in response to my “I don’t know why git is hard” comment and I really appreciated it. It had some new-to-me ideas and I’d love to read more analyses like that.
.git
directory, but where exactly in there are all the versions of your old files?
For example, this blog is in a git repository, and it contains a file called
content/post/2019-06-28-brag-doc.markdown
. Where is that in my .git
folder?
And where are the old versions of that file? Let’s investigate by writing some
very short Python programs.
.git/objects
Every previous version of every file in your repository is in .git/objects
.
For example, for this blog, .git/objects
contains about 2700 files:
$ find .git/objects/ -type f | wc -l
2761
note: .git/objects
actually has more information than “every previous version
of every file in your repository”, but we’re not going to get into that just yet
Here’s a very short Python program
(find-git-object.py) that
finds out where any given file is stored in .git/objects
.
import hashlib
import sys

def object_path(content):
    header = f"blob {len(content)}\0"
    data = header.encode() + content
    digest = hashlib.sha1(data).hexdigest()
    return f".git/objects/{digest[:2]}/{digest[2:]}"

with open(sys.argv[1], "rb") as f:
    print(object_path(f.read()))
What this does is:

- construct the header (blob 16673\0) and combine it with the contents
- calculate the sha1 sum of that (8ae33121a9af82dd99d6d706d037204251d41d54 in this case)
- turn the hash into a path (.git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54)

We can run it like this:
$ python3 find-git-object.py content/post/2019-06-28-brag-doc.markdown
.git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54
The term for this storage strategy (where the filename of an object in the database is the same as the hash of the file’s contents) is “content addressed storage”.
One neat thing about content addressed storage is that if I have two files (or
50 files!) with the exact same contents, that doesn’t take up any extra space
in Git’s database – if the hash of the contents is aabbbbbbbbbbbbbbbbbbbbbbbbb
, they’ll both be stored in .git/objects/aa/bbbbbbbbbbbbbbbbbbbbb
.
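One way to convince yourself of this: the hash is a pure function of the contents, so two identical files always land at the same object path. A tiny sketch using the same scheme as find-git-object.py above:

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    # same scheme as git: hash "blob <size>\0" + contents with SHA-1
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# identical contents -> identical hash -> stored once in .git/objects
assert git_blob_hash(b"hello\n") == git_blob_hash(b"hello\n")
```

This matches what `git hash-object` prints for the same contents.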
If I try to look at this file in .git/objects
, it gets a bit weird:
$ cat .git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54
x^A<8D><9B>}s<E3>Ƒ<C6><EF>o|<8A>^Q<9D><EC>ju<92><E8><DD>\<9C><9C>*<89>j<FD>^...
What’s going on? Let’s run file
on it:
$ file .git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54
.git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54: zlib compressed data
It’s just compressed! We can write another little Python program called decompress.py
that uses the zlib
module to decompress the data:
import zlib
import sys

with open(sys.argv[1], "rb") as f:
    content = f.read()
    print(zlib.decompress(content).decode())
Now let’s decompress it:
$ python3 decompress.py .git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54
blob 16673---
title: "Get your work recognized: write a brag document"
date: 2019-06-28T18:46:02Z
url: /blog/brag-documents/
categories: []
---
... the entire blog post ...
So this data is encoded in a pretty simple way: there’s this
blob 16673\0
thing, and then the full contents of the file.
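Since the format is just a header, a NUL byte, and the contents, splitting it back apart takes a couple of lines (a sketch -- the function name is mine):

```python
def parse_object(data: bytes):
    # a decompressed git object is b"<type> <size>\x00<contents>"
    header, _, body = data.partition(b"\x00")
    obj_type, size = header.split(b" ")
    return obj_type.decode(), int(size), body

obj_type, size, body = parse_object(b"blob 6\x00hello\n")
```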
One thing that surprised me the first time I learned it: there aren't any diffs here! That file is the 9th version of that blog post, but the version git stores in .git/objects is the whole file, not the diff from the previous version.
Git actually sometimes also does store files as diffs (when you run git gc
it
can combine multiple different files into a “packfile” for efficiency), but I
have never needed to think about that in my life so we’re not going to get into
it. Aditya Mukerjee has a great post called Unpacking Git packfiles about how the format works.
Now you might be wondering – if there are 8 previous versions of that blog
post (before I fixed some typos), where are they in the .git/objects
directory? How do we find them?
First, let’s find every commit where that file changed with git log
:
$ git log --oneline content/post/2019-06-28-brag-doc.markdown
c6d4db2d
423cd76a
7e91d7d0
f105905a
b6d23643
998a46dd
67a26b04
d9999f17
026c0f52
72442b67
Now let’s pick a previous commit, let’s say 026c0f52
. Commits are also stored
in .git/objects
, and we can try to look at it there. But the commit isn’t
there! ls .git/objects/02/6c*
doesn’t have any results! You know how we
mentioned “sometimes git packs objects to save space but we don’t need to worry
about it?“. I guess now is the time that we need to worry about it.
So let’s take care of that.
So we need to unpack the objects from the pack files. I looked it up on Stack Overflow and apparently you can do it like this:
$ mv .git/objects/pack/pack-adeb3c14576443e593a3161e7e1b202faba73f54.pack .
$ git unpack-objects < pack-adeb3c14576443e593a3161e7e1b202faba73f54.pack
This is weird repository surgery so it’s a bit alarming but I can always just clone the repository from Github again if I mess it up, so I wasn’t too worried.
After unpacking all the object files, we end up with way more objects: about 20000 instead of about 2700. Neat.
$ find .git/objects/ -type f | wc -l
20138
Now we can go back to looking at our commit 026c0f52
. You know how we said
that not everything in .git/objects
is a file? Some of them are commits! And
to figure out where the old version of our post
content/post/2019-06-28-brag-doc.markdown
is stored, we need to dig pretty
deep into this commit.
The first step is to look at the commit in .git/objects
.
The commit 026c0f52
is now in
.git/objects/02/6c0f5208c5ea10608afc9252c4a56c1ac1d7e4
after doing some
unpacking and we can look at it like this:
$ python3 decompress.py .git/objects/02/6c0f5208c5ea10608afc9252c4a56c1ac1d7e4
commit 211tree 01832a9109ab738dac78ee4e95024c74b9b71c27
parent 72442b67590ae1fcbfe05883a351d822454e3826
author Julia Evans <julia@jvns.ca> 1561998673 -0400
committer Julia Evans <julia@jvns.ca> 1561998673 -0400
brag doc
We can also get the same information with git cat-file -p 026c0f52
, which does the same thing but does a better job of formatting the data. (the -p
option means “format it nicely please”)
This commit has a tree. What’s that? Well let’s take a look. The tree’s ID
is 01832a9109ab738dac78ee4e95024c74b9b71c27
, and we can use our
decompress.py
script from earlier to look at that git object. (though I had to remove the .decode()
to get the script to not crash)
$ python3 decompress.py .git/objects/01/832a9109ab738dac78ee4e95024c74b9b71c27
b'tree 396\x00100644 .gitignore\x00\xc3\xf7`$8\x9b\x8dO\x19/\x18\xb7}|\xc7\xce\x8e:h\xad100644 README.md\x00~\xba\xec\xb3\x11\xa0^\x1c\xa9\xa4?\x1e\xb9\x0f\x1cfG\x96\x0b
This is formatted in kind of an unreadable way. The main display issue here is that
the hashes (\xc3\xf7`$8\x9b\x8dO\x19/\x18\xb7}|\xc7\xce…) are raw
bytes instead of being encoded in hexadecimal. So we see \xc3\xf7`$8\x9b\x8d
instead of c3f76024389b8d
. Let’s switch over to using git cat-file -p
which
formats the data in a friendlier way, because I don’t feel like writing a
parser for that.
$ git cat-file -p 01832a9109ab738dac78ee4e95024c74b9b71c27
100644 blob c3f76024389b8d4f192f18b77d7cc7ce8e3a68ad .gitignore
100644 blob 7ebaecb311a05e1ca9a43f1eb90f1c6647960bc1 README.md
100644 blob 0f21dc9bf1a73afc89634bac586271384e24b2c9 Rakefile
100644 blob 00b9d54abd71119737d33ee5d29d81ebdcea5a37 config.yaml
040000 tree 61ad34108a327a163cdd66fa1a86342dcef4518e content <-- this is where we're going next
040000 tree 6d8543e9eeba67748ded7b5f88b781016200db6f layouts
100644 blob 22a321a88157293c81e4ddcfef4844c6c698c26f mystery.rb
040000 tree 8157dc84a37fca4cb13e1257f37a7dd35cfe391e scripts
040000 tree 84fe9c4cb9cef83e78e90a7fbf33a9a799d7be60 static
040000 tree 34fd3aa2625ba784bced4a95db6154806ae1d9ee themes
This is showing us all of the files I had in the root directory of the
repository as of that commit. Looks like I accidentally committed some file
called mystery.rb
at some point which I later removed.
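(In case you're curious, the raw tree format isn't too bad to parse either: each entry is a mode, a space, a filename, a NUL byte, and then the 20 raw hash bytes. Here's a sketch, tried out on a made-up one-entry tree body:)

```python
import binascii

def parse_tree(body: bytes):
    # each entry: b"<mode> <name>\x00" followed by a 20-byte raw SHA-1
    entries = []
    i = 0
    while i < len(body):
        nul = body.index(b"\x00", i)
        mode, name = body[i:nul].split(b" ", 1)
        sha1 = binascii.hexlify(body[nul + 1 : nul + 21]).decode()
        entries.append((mode.decode(), name.decode(), sha1))
        i = nul + 21
    return entries

# a fake one-entry tree body, just to try it out
fake_body = b"100644 README.md\x00" + bytes.fromhex("7e" * 20)
```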
Our file is in the content
directory, so let’s look at that tree: 61ad34108a327a163cdd66fa1a86342dcef4518e
$ git cat-file -p 61ad34108a327a163cdd66fa1a86342dcef4518e
040000 tree 1168078878f9d500ea4e7462a9cd29cbdf4f9a56 about
100644 blob e06d03f28d58982a5b8282a61c4d3cd5ca793005 newsletter.markdown
040000 tree 1f94b8103ca9b6714614614ed79254feb1d9676c post <-- where we're going next!
100644 blob 2d7d22581e64ef9077455d834d18c209a8f05302 profiler-project.markdown
040000 tree 06bd3cee1ed46cf403d9d5a201232af5697527bb projects
040000 tree 65e9357973f0cc60bedaa511489a9c2eeab73c29 talks
040000 tree 8a9d561d536b955209def58f5255fc7fe9523efd zines
Still not done…
The file we’re looking for is in the post/
directory, so there’s one more tree:
$ git cat-file -p 1f94b8103ca9b6714614614ed79254feb1d9676c
.... MANY MANY lines omitted ...
100644 blob 170da7b0e607c4fd6fb4e921d76307397ab89c1e 2019-02-17-organizing-this-blog-into-categories.markdown
100644 blob 7d4f27e9804e3dc80ab3a3912b4f1c890c4d2432 2019-03-15-new-zine--bite-size-networking-.markdown
100644 blob 0d1b9fbc7896e47da6166e9386347f9ff58856aa 2019-03-26-what-are-monoidal-categories.markdown
100644 blob d6949755c3dadbc6fcbdd20cc0d919809d754e56 2019-06-23-a-few-debugging-resources.markdown
100644 blob 3105bdd067f7db16436d2ea85463755c8a772046 2019-06-28-brag-doc.markdown <-- found it!!!!!
Here the 2019-06-28-brag-doc.markdown
is the last file listed because it was
the most recent blog post when it was published.
Finally we have found the object file where a previous version of my blog post
lives! Hooray! It has the hash 3105bdd067f7db16436d2ea85463755c8a772046
, so
it’s in git/objects/31/05bdd067f7db16436d2ea85463755c8a772046
.
We can look at it with decompress.py:
$ python3 decompress.py .git/objects/31/05bdd067f7db16436d2ea85463755c8a772046 | head
blob 15924---
title: "Get your work recognized: write a brag document"
date: 2019-06-28T18:46:02Z
url: /blog/brag-documents/
categories: []
---
... rest of the contents of the file here ...
This is the old version of the post! If I ran git checkout 026c0f52 content/post/2019-06-28-brag-doc.markdown
or git restore --source 026c0f52 content/post/2019-06-28-brag-doc.markdown
, that’s what I’d get.
how git log works

This whole process we just went through (find the commit, go through the
various directory trees, search for the filename we wanted) seems kind of long
and complicated but this is actually what’s happening behind the scenes when we
run git log content/post/2019-06-28-brag-doc.markdown
. It needs to go through
every single commit in your history, check the version (for example
3105bdd067f7db16436d2ea85463755c8a772046
in this case) of
content/post/2019-06-28-brag-doc.markdown
, and see if it changed from the previous commit.
That’s why git log FILENAME
is a little slow sometimes – I have 3000 commits in this
repository and it needs to do a bunch of work for every single commit to figure
out if the file changed in that commit or not.
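Here's a toy simulation of that work (this is not real git -- each "commit" is just a dict mapping filenames to contents): for every commit we hash the file's blob and compare it to the parent commit's version.

```python
import hashlib

def blob_hash(content: bytes) -> str:
    return hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()

def commits_touching(history, path):
    # history is oldest-to-newest; each "commit" is {filename: contents}
    touched = []
    previous = None
    for i, tree in enumerate(history):
        current = blob_hash(tree[path]) if path in tree else None
        if current != previous:  # the file changed in this commit
            touched.append(i)
        previous = current
    return touched

history = [
    {"post.md": b"draft"},
    {"post.md": b"draft", "other.md": b"x"},  # post.md untouched
    {"post.md": b"final", "other.md": b"x"},  # post.md edited
]
```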
Right now I have 1530 files tracked in my blog repository:
$ git ls-files | wc -l
1530
But how many historical files are there? We can list everything in .git/objects
to see how many object files there are:
$ find .git/objects/ -type f | grep -v pack | awk -F/ '{print $3 $4}' | wc -l
20135
Not all of these represent previous versions of files though – as we saw
before, lots of them are commits and directory trees. But we can write another little Python
script called find-blobs.py
that goes through all of the objects and checks
if it starts with blob
or not:
import zlib
import sys

for line in sys.stdin:
    line = line.strip()
    filename = f".git/objects/{line[0:2]}/{line[2:]}"
    with open(filename, "rb") as f:
        contents = zlib.decompress(f.read())
        if contents.startswith(b"blob"):
            print(line)
$ find .git/objects/ -type f | grep -v pack | awk -F/ '{print $3 $4}' | python3 find-blobs.py | wc -l
6713
So it looks like there are 6713 - 1530 = 5183
old versions of files lying
around in my git repository that git is keeping around for me in case I ever
want to get them back. How nice!
Here’s the gist with all the code for this post. There’s not very much.
I thought I already knew how git worked, but I’d never really thought about
pack files before so this was a fun exploration. I also don’t spend too much
time thinking about how much work git log
is actually doing when I ask it to
track the history of a file, so that was fun to dig into.
As a funny postscript: as soon as I committed this blog post, git got mad about
how many objects I had in my repository (I guess 20,000 is too many!) and
ran git gc
to compress them all into packfiles. So now my .git/objects
directory is very small:
$ find .git/objects/ -type f | wc -l
14
I’ve found Mastodon quite a bit more confusing than Twitter because it’s a distributed system, so here are a few technical things I’ve learned about it over the last 10 months. I’ll mostly talk about what using a single-person server has been like for me, as well as a couple of notes about the API, DMs and ActivityPub.
I might have made some mistakes, please let me know if I’ve gotten anything wrong!
First: Mastodon is a decentralized collection of independently run servers instead of One Big Server. The software is open source.
In general, if you have an account on one server (like ruby.social
), you
can follow people on another server (like hachyderm.io
), and they can
follow you.
I’m going to use the terms “Mastodon server” and “Mastodon instance” interchangeably in this post.
These were the things I was concerned about when choosing an instance:
- I didn’t want a username like @b0rk@infosec.exchange, because I’m not an infosec person.
- the most obvious choice was mastodon.social, but some servers choose to block or limit mastodon.social for various reasons

In the end, I chose to run my own mastodon server because it seemed simplest – I could pick a domain I liked, and I knew I’d definitely agree with the moderation decisions because I’d be in charge.
I’m not going to give server recommendations here, but here’s a list of the top 200 most common servers people who follow me use.
One big thing I wondered was – can I use my own domain (and have the username @b0rk@jvns.ca
or something) but be on someone else’s Mastodon server?
The answer to this seems to be basically “no”: if you want to use your own
domain on Mastodon, you need to run your own server. (you can kind of do this,
but it’s more like an alias or redirect – if I used that method to direct b0rk@jvns.ca
to b0rk@mastodon.social
, my
posts would still show up as being from b0rk@mastodon.social
)
There’s also other ActivityPub software (Takahē) that supports people bringing their own domain in a first-class way.
I really wanted to have a way to use my own domain name for identity, but to share server hosting costs with other people. This isn’t possible on Mastodon right now, so I decided to set up my own server instead.
I chose to run a Mastodon server (instead of some other ActivityPub implementation) because Mastodon is the most popular one. Good managed Mastodon hosting is readily available, there are tons of options for client apps, and I know for sure that my server will work well with other people’s servers.
I use masto.host for Mastodon hosting, and it’s been great so far. I have nothing interesting to say about what it’s like to operate a Mastodon instance because I know literally nothing about it. Masto.host handles all of the server administration and Mastodon updates, and I never think about it at all.
Right now I’m on their $19/month (“Star”) plan, but it’s possible I could use a smaller plan with no problems. Right now their cheapest plan is $6/month and I expect that would be fine for someone with a smaller account.
Some things I was worried about when embarking on my own Mastodon server:
- my server is at social.jvns.ca, but I wanted my username to be b0rk@jvns.ca instead of b0rk@social.jvns.ca. To get this to work I followed these Setting up a personal fediverse ID directions from Jacob Kaplan-Moss and it’s been fine.

Being on a 1-person server has some significant downsides. To understand why, you need to understand a little about how Mastodon works.
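For the curious, setups like that work via WebFinger: your own domain serves a small JSON document that points at the account on the real server. Roughly (this is my reconstruction of the shape, not copied from the linked directions), jvns.ca answers requests to /.well-known/webfinger with something like:

```json
{
  "subject": "acct:b0rk@jvns.ca",
  "links": [
    {
      "rel": "self",
      "type": "application/activity+json",
      "href": "https://social.jvns.ca/users/b0rk"
    }
  ]
}
```

so when another server looks up b0rk@jvns.ca, it gets pointed at the actual Mastodon account on social.jvns.ca.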
Every Mastodon server has a database of posts. Servers only have posts that they were explicitly sent by another server in their database.
Some reasons that servers might receive posts:
As a 1-person server, my server does not receive that many posts! I only get posts from people I follow or posts that explicitly mention me in some way.
This causes several problems:
All of these things will happen to users of any small Mastodon server, not just 1-person servers.
A lot of people are on smaller servers, so when they’re participating in a conversation, they can’t see all the replies to the post.
This means that replies can get pretty repetitive because people literally cannot see each other’s replies. This is especially annoying for posts that are popular or controversial, because the person who made the post has to keep reading similar replies over and over again by people who think they’re making the point for the first time.
To get around this (as a reader), you can click “open link to post” or something in your Mastodon client, which will open up the page on the poster’s server where you can read all of the replies. It’s pretty annoying though.
As a poster, I’ve tried to reduce repetitiveness in replies by:
The Mastodon devs are extremely aware of these issues, there are a bunch of github issues about them:
My guess is that there are technical reasons these features are difficult to add because those issues have been open for 5-7 years.
The Mastodon devs have said that they plan to improve reply fetching, but that it requires a significant amount of work.
Some people have built workarounds for fetching profiles / replies.
Also, there are a couple of Mastodon clients which will proactively fetch replies. For iOS:
I haven’t tried those yet though.
Mastodon instances have a “local timeline” where you can see everything other people on the server are posting, and a “federated timeline” which shows sort of a combined feed from everyone followed by anyone on the server. This means that you can see trending posts and get an idea of what’s going on and find people to follow. You don’t get that if you’re on a 1-person server – it’s just me talking to myself! (plus occasional interjections from my reruns bot).
Some workarounds people mentioned for this:
If anyone else on small servers has suggestions for how to make discovery easier I’d love to hear them.
When I moved to my own server from mastodon.social
, I needed to run an account migration to move over my followers. First, here’s how migration works:
The follower move was the part I was most worried about. Here’s how it turned out:
One thing I love about Mastodon is – it has an API that’s MUCH easier to use than Twitter’s API. I’ve always been frustrated with how difficult it is to navigate large Twitter threads, so I made a small mastodon thread view website that lets you log into your Mastodon account. It’s pretty janky and it’s really only made for me to use, but I’ve really appreciated the ability to write my own janky software to improve my Mastodon experience.
Some notes on the Mastodon API:
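As one example of why the API is pleasant: fetching a public timeline is a single unauthenticated GET to /api/v1/timelines/public (a real, documented endpoint). Here's a quick sketch that just builds the URL -- mastodon.social is only an example server:

```python
import urllib.parse

def public_timeline_url(server: str, limit: int = 5) -> str:
    # GET this URL (e.g. with urllib.request.urlopen) and you get back
    # a plain JSON list of posts -- no OAuth dance required for public data
    query = urllib.parse.urlencode({"limit": limit})
    return f"https://{server}/api/v1/timelines/public?{query}"

url = public_timeline_url("mastodon.social")
```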
Next I’ll talk about a few general things about Mastodon that confused or surprised me that aren’t specific to being on a single-person instance.
The way Mastodon DMs work surprised me in a few ways:
There are a couple of different ways for a server to block another Mastodon server. I haven’t really had to do this much but people talk about it a lot and I was confused about the difference, so:
One thing that wasn’t obvious to me is that who servers defederate / limit is sometimes hidden, so it’s hard to suss out what’s going on if you’re considering joining a server, or trying to understand why you can’t see certain posts.
There’s no way to search past posts you’ve read. If I see something interesting on my timeline and want to find it later, I usually can’t. (Mastodon has an Elasticsearch-based search feature, but it only allows you to search your own posts, your mentions, your favourites, and your bookmarks)
These limitations on search are intentional (and a very common source of arguments) – it’s a privacy / safety issue. Here’s a summary from Tim Bray with lots of links.
It would be personally convenient for me to be able to search more easily but I respect folks’ safety concerns so I’ll leave it at that.
My understanding is that the Mastodon devs are planning to add opt-in search for public posts relatively soon.
We’ve been talking about Mastodon a lot, but not everyone who I follow is using Mastodon: Mastodon uses a protocol called ActivityPub to distribute messages.
Here are some examples of other software I see people talking about, in no particular order:
I’m probably missing a bunch of important ones.
This confused me for a while, and I’m still not super clear on how ActivityPub works. What I’ve understood is:
I haven’t written here about what Mastodon culture is like because other people have done a much better job of talking about it than me, but of course it’s the biggest thing that affects your experience and it was the thing that took me longest to get a handle on. A few links:
I don’t regret setting up a single-user server – even though it’s inconvenient, it’s important to me to have control over my social media. I think “have control over my social media” is more important to me than it is to most other people though, because I use Twitter/Mastodon a lot for work.
I am happy that I didn’t start out on a single-user server though – I think it would have made getting started on Mastodon a lot more difficult.
Mastodon is pretty rough around the edges sometimes but I’m able to have more interesting conversations about computers there than I am on Twitter (or Bluesky), so that’s where I’m staying for now.
if you just stopped being scared of the command line in the last year or three — what helped you?
(no need to reply if you don’t remember, or if you’ve been using the command line comfortably for 15 years — this question isn’t for you :) )
This list is still a bit shorter than I would like, but I’m posting it in the hopes that I can collect some more answers. There obviously isn’t one single thing that works for everyone – different people take different paths.
I think there are three parts to getting comfortable: reducing risks, motivation and resources. I’ll start with risks, then a couple of motivations and then list some resources.
A lot of people are (very rightfully!) concerned about accidentally doing some destructive action on the command line that they can’t undo.
A few strategies people said helped them reduce risks:
- alias rm to a tool like safe-rm or rmtrash so that you can't accidentally delete something you shouldn't (or just rm -i)
- use a shell feature that will take rm *.txt and show me exactly what it's going to remove
- use --dry-run options for dangerous commands, if they're available
- build your own --dry-run options into your shell scripts
A couple of people also mentioned getting frustrated with GUI tools (like heavy IDEs that use all your RAM and crash your computer) and being motivated to replace them with much lighter weight command line tools.
One person mentioned being motivated by seeing cool stuff other people were doing with the command line, like:
Several people mentioned explainshell where you can paste in any shell incantation and get it to break it down into different parts.
There were lots of little tips and tricks mentioned that make it a lot easier to work on the command line, like:
- keyboard shortcuts like Ctrl+w (to delete a word), Ctrl+a (to go to the beginning of the line), Ctrl+e (to go to the end), and Ctrl+left arrow / Ctrl+right arrow (to jump back/forward a word)
- cd - to go back to the previous directory
- how to use less to read man pages or other large text files (how to search, scroll, etc)
- checking out git branches (git checkout $(git for-each-ref --format='%(refname:short)' refs/heads/ | fzf))
- opening files to edit (nvim $(fzf))
- switching kubernetes contexts (kubectl config use-context $(kubectl config get-contexts -o name | fzf --height=10 --prompt="Kubernetes Context> "))

The general pattern here is that you use fzf to pick something (a file, a git branch, a command line argument), fzf prints the thing you picked to stdout, and then you insert that as the command line argument to another command.
You can also use fzf as a tool to automatically preview the output and quickly iterate, for example:
- experimenting with jq (echo '' | fzf --preview "jq {q} < YOURFILE.json")
- experimenting with sed (echo '' | fzf --preview "sed {q} YOURFILE")
- experimenting with awk (echo '' | fzf --preview "awk {q} YOURFILE")

You get the idea.
Folks will generally define an alias for their fzf incantations so you can type gcb or something to quickly pick a git branch to check out.
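For example, a minimal bash version of such a gcb function might look like this (a sketch of mine -- it assumes git and fzf are installed):

```shell
# pick a local branch with fzf, then check it out
gcb() {
  local branch
  branch="$(git for-each-ref --format='%(refname:short)' refs/heads/ | fzf)"
  # only check out if we actually picked something (fzf prints nothing on Esc)
  [ -n "$branch" ] && git checkout "$branch"
}
```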
Some people started using a Raspberry Pi, where it’s safer to experiment without worrying about breaking your computer (you can just erase the SD card and start over!)
Lots of people said they got more comfortable with the command line when they started using a more user-friendly shell setup like oh-my-zsh or fish. I really agree with this one – I’ve been using fish for 10 years and I love it.
A couple of other things you can do here:
Some tools for theming your terminal:
A few people mentioned fancy terminal file managers like ranger or nnn, which I hadn’t heard of.
Someone who can answer beginner questions and give you pointers is invaluable.
Several mentions of watching someone more experienced using the terminal – there are lots of little things that experienced users don’t even realize they’re doing which you can pick up.
Lots of people said that making their own aliases or scripts for commonly used tasks felt like a magical “a ha!” moment, because:
A lot of man pages don’t have examples, for example the openssl s_client man page has no examples. This makes it a lot harder to get started!
People mentioned a couple of cheat sheet tools, like:
For example the cheat page for openssl is really great – I think it includes almost everything I’ve ever actually used openssl for in practice (except the -servername option for openssl s_client).
One person said that they configured their .bash_profile to print out a cheat sheet every time they log in.
A couple of people said that they needed to change their approach – instead of trying to memorize all the commands, they realized they could just look up commands as needed and they’d naturally memorize the ones they used the most over time.
(I actually recently had the exact same realization about learning to read x86 assembly – I was taking a class and the instructor said “yeah, just look everything up every time to start, eventually you’ll learn the most common instructions by heart”)
Some people also said the opposite – that they used a spaced repetition app like Anki to memorize commonly used commands.
One person mentioned that they started using vim on the command line to edit files, and once they were using a terminal text editor it felt more natural to use the command line for other things too.
Also apparently there’s a new editor called micro which is like a nicer version of pico/nano, for folks who don’t want to learn emacs or vim.
One person said that they started using Linux as their main daily driver, and having to fix Linux issues helped them learn. That’s also how I got comfortable with the command line back in ~2004 (I was really into installing lots of different Linux distributions to try to find my favourite one), but my guess is that it’s not the most popular strategy these days.
Some people said that they took a university class where the professor made them do everything in the terminal, or that they created a rule for themselves that they had to do all their work in the terminal for a while.
A couple of people said that workshops like Software Carpentry workshops (an introduction to the command line, git, and Python/R programming for scientists) helped them get more comfortable with the command line.
You can see the software carpentry curriculum here.
a few that were mentioned:
articles:
books:
videos:
I’ve often heard the advice “don’t read the comments”, but actually I’ve learned a huge amount from reading internet comments on my posts from strangers over the years, even if sometimes people are jerks. So I want to explain some tactics I use to try to make the comments on my posts more informative and useful to me, and to try to minimize the number of annoying comments I get.
On here I mostly talk about facts – either facts about computers, or stories about my experiences using computers.
For example this post about tcpdump contains some basic facts about how to use tcpdump, as well as an example of how I’ve used it in the past.
Talking about facts means I get a lot of fact-based comments like:
- that you can use sudo tcpdump -s 0 -A 'tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420' to filter for HTTP GET requests

In general, I’d say that people’s comments about facts tend to stay pretty normal. The main kinds of negative comments I get about facts are:
- I didn’t use -n in that post because at the time I didn’t know why the -n flag was useful (it’s useful because it turns off this annoying reverse DNS lookup that tcpdump does by default so you can see the IP addresses)

I think stories encourage pretty good discussion. For example, why you should understand (a little) about TCP is a story about a time it was important for me to understand how TCP worked.
When I share stories about problems I solved, the comments really help me understand how what I learned fits into a bigger context. For example:
Also I think these kinds of stories are incredibly important – that post describes a bug that was VERY hard for me to solve, and the only reason I was able to figure it out in the first place was that I read this blog post.
Often in my blog posts I ask technical questions that I don’t know the answer to (or just mention “I don’t know X…”). This helps people focus their replies a little bit – an obvious comment to make is to provide an answer to the question, or explain the thing I didn’t know!
This is fun because it feels like a guaranteed way to get value out of people’s comments – people LOVE answering questions, and so they get to look smart, and I get the answer to a question I have! Everyone wins!
I make a lot of mistakes in my blog posts, because I write about a lot of things that are on the edge of my knowledge. When people point out mistakes, I often edit the blog post to fix it.
Usually I’ll stay near a computer for a few hours after I post a blog post so that I can fix mistakes quickly as they come up.
Some people are very careful to list every single error they made in their blog posts (“errata: the post previously said X which was wrong, I have corrected it to say Y”). Personally I make mistakes constantly and I don’t have time for that so I just edit the post to fix the mistakes.
A lot of the time when I post a blog post, people on Twitter/Mastodon will reply with various opinions they have about the thing. For example, someone recently replied to a blog post about DNS saying that they love using zone files and dislike web interfaces for managing DNS records. That’s not an opinion I share, so I asked them why.
They explained that there are some DNS record types (specifically TLSA) that they find often aren’t supported in web interfaces. I didn’t know that people used TLSA records, so I learned something! Cool!
I’ve found that asking people to share their experiences (“I wanted to use X DNS record type and I couldn’t”) instead of their opinions (“DNS web admin interfaces are bad”) leads to a lot of useful information and discussion. I’ve learned a lot from it over the years, and written a lot of tweets like “which DNS record types have you needed?” to try to extract more information about people’s experiences.
I try to model the same behaviour in my own work when I can – if I have an opinion, I’ll try to explain the experiences I’ve had with computers that caused me to have that opinion.
I think internet strangers are more likely to reply in a weird way when they have no idea who you are or why you’re writing this thing. It’s easy to make incorrect assumptions! So often I’ll mention a little context about why I’m writing this particular blog post.
For example:
A little while ago I started using a Mac, and one of my biggest frustrations with it is that often I need to run Linux-specific software.
or
I’ve started to run a few more servers recently (nginx playground, mess with dns, dns lookup), so I’ve been thinking about monitoring.
or
Last night, I needed to scan some documents for some bureaucratic reasons. I’d never used a scanner on Linux before and I was worried it would take hours to figure out…
There are some kinds of programming conversations that I find extremely boring (like “should people learn vim?” or “is functional programming better than imperative programming?“). So I generally try to avoid writing blog posts that I think will result in a conversation/comment thread that I find annoying or boring.
For example, I wouldn’t write about my opinions about functional programming: I don’t really have anything interesting to say about it and I think it would lead to a conversation that I’m not interested in having.
I don’t always succeed at this of course (it’s impossible to predict what people are going to want to comment about!), but I try to avoid the most obvious flamebait triggers I’ve seen in the past.
There are a bunch of “flamebait” triggers that can set people off on a conversation that I find boring: cryptocurrency, tailwind, DNSSEC/DoH, etc. So I have a weird catalog in my head of things not to mention if I don’t want to start the same discussion about that thing for the 50th time.
Of course, if you think that conversations about functional programming are interesting, you should write about functional programming and start the conversations you want to have!
Also, it’s often possible to start an interesting conversation about a topic where the conversation is normally boring. For example I often see the same talking points about IPv6 vs IPv4 over and over again, but I remember the comments on Reasons for servers to support IPv6 being pretty interesting. In general if I really care about a topic I’ll talk about it anyway, but I don’t care about functional programming very much so I don’t see the point of bringing it up.
Another kind of “boring conversation” I try to avoid is suggestions of things I have already considered. Like when someone says “you should do X” but I already know I could have done X and chose not to because of A B C.
So I often will add a short note like “I decided not to do X because of A B C” or “you can also do X” or “normally I would do X, here I didn’t because…”. For example, in this post about nix, I list a bunch of Nix features I’m choosing not to use (nix-shell, nix flakes, home manager) to avoid a bunch of helpful people telling me that I should use flakes.
Listing the things I’m not doing is also helpful to readers – maybe someone new to nix will discover nix flakes through that post and decide to use them! Or maybe someone will learn that there are exceptions to when a certain “best practice” is appropriate.
Recently on Mastodon I complained about some gross terminology (“domain information groper”) that I’d just noticed in the dig man page on my machine. A few dudes in the replies (who by now have all deleted their posts) asked me to prove that the original author intended it to be offensive (which of course is beside the point, there’s just no need to have a term widely understood to be referring to sexual assault in the dig man page) or tried to explain to me why it actually wasn’t a problem.
So I blocked a few people and wrote a quick post:
man so many dudes in the replies demanding that i prove that the person who named dig “domain information groper” intended it in an offensive way. Big day for the block button I guess :)
I don’t do this too often, but I think it’s very important on social media to occasionally set some rules about what kind of behaviour I won’t tolerate. My goal here is usually to drive away some of the assholes (they can unfollow me!) and try to create a more healthy space for everyone else to have a conversation about computers in.
Obviously this only works in situations (like Twitter/Mastodon) where I have the ability to garden my following a little bit over time – I can’t do this on HN or Reddit or Lobsters or whatever and wouldn’t try.
As for fixing it – the dig maintainers removed the problem language years ago, but Mac OS still has a very outdated version for license reasons.
(you might notice that this section is breaking the “avoid boring conversations” rule above, this section was certain to start a very boring argument, but I felt it was important to talk about boundaries so I left it in)
Sometimes people seem to want to get into arguments or make dismissive comments. I don’t reply to them, even if they’re wrong. I dislike arguing on the internet and I’m extremely bad at it, so it’s not a good use of my time.
If I get a lot of negative comments that I didn’t expect, I try to see if I can get something useful out of it.
For example, I wrote a toy DNS resolver once and some of the commenters were upset that I didn’t handle parsing the DNS packet. At the time I thought this was silly (I thought DNS parsing was really straightforward and that it was obvious how to do it, who cares that I didn’t handle it?) but I realized that maybe the commenters didn’t think it was easy or obvious, and wanted to know how to do it. Which makes sense! It’s not obvious at all if you haven’t done it before!
Those comments partly inspired implement DNS in a weekend, which focuses much more heavily on the parsing aspects, and which I think is a much better explanation how to write a DNS resolver. So ultimately those comments helped me a lot, even if I found them annoying at the time.
(I realize this section makes me sound like a Perfectly Logical Person who does not get upset by negative public criticism, I promise this is not at all the case and I have 100000 feelings about everything that happens on the internet and get upset all the time. But I find that analyzing the criticism and trying to take away something useful from it helps a bit)
Thanks to Shae, Aditya, Brian, and Kamal for reading a draft of this.
Some other similar posts I’ve written in the past:
print("hello world")
Here’s what it looks like at the command line:
$ python3 hello.py
hello world
But behind the scenes, there’s a lot more going on. I’ll describe some of what happens, and (much much more importantly!) explain some tools you can use to see what’s going on behind the scenes yourself. We’ll use readelf, strace, ldd, debugfs, /proc, ltrace, dd, and stat. I won’t talk about the Python-specific parts at all – just what happens when you run any dynamically linked executable.
Here’s a table of contents:
execve
Before we even start the Python interpreter, there are a lot of things that have to happen. What executable are we even running? Where is it?
First, the shell parses python3 hello.py into a command to run and a list of arguments: python3, and ['hello.py'].

A bunch of things like glob expansion could happen here. For example if you run python3 *.py, the shell will expand that into python3 hello.py.

python3

Now we know we need to run python3. But what’s the full path to that binary? The way this works is that there’s a special environment variable named PATH.
See for yourself: Run echo $PATH in your shell. For me it looks like this:
$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
When you run a command, the shell will search every directory in that list (in order) to try to find a match.
In fish (my shell), you can see the path resolution logic here. It uses the stat system call to check if files exist.
See for yourself: Run strace -e stat bash, and then run a command like python3. You should see output like this:
stat("/usr/local/sbin/python3", 0x7ffcdd871f40) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/python3", 0x7ffcdd871f40) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/python3", 0x7ffcdd871f40) = -1 ENOENT (No such file or directory)
stat("/usr/bin/python3", {st_mode=S_IFREG|0755, st_size=5479736, ...}) = 0
You can see that it finds the binary at /usr/bin/python3 and stops: it doesn’t continue searching /sbin or /bin.
(if this doesn’t work for you, instead try strace -o out bash, and then grep stat out. One reader mentioned that their version of libc uses a different system call instead of stat)
execvp
If you want to run the same PATH searching logic as the shell does without reimplementing it yourself, you can use the libc function execvp (or one of the other exec* functions with p in the name).
stat, under the hood

Now you might be wondering – Julia, what is stat doing? Well, when your OS opens a file, it’s split into 2 steps:

- it maps the filename to an inode, which contains metadata about the file
- it uses the inode to get the file’s actual contents
The stat system call just returns the contents of the file’s inode – it doesn’t read the file’s contents at all. The advantage of this is that it’s a lot faster. Let’s go on a short adventure into inodes. (this great post “A disk is a bunch of bits” by Dmitry Mazin has more details)
$ stat /usr/bin/python3
File: /usr/bin/python3 -> python3.9
Size: 9 Blocks: 0 IO Block: 4096 symbolic link
Device: fe01h/65025d Inode: 6206 Links: 1
Access: (0777/lrwxrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2023-08-03 14:17:28.890364214 +0000
Modify: 2021-04-05 12:00:48.000000000 +0000
Change: 2021-06-22 04:22:50.936969560 +0000
Birth: 2021-06-22 04:22:50.924969237 +0000
See for yourself: Let’s go see where exactly that inode is on our hard drive.
First, we have to find our hard drive’s device name:
$ df
...
tmpfs 100016 604 99412 1% /run
/dev/vda1 25630792 14488736 10062712 60% /
...
Looks like it’s /dev/vda1. Next, let’s find out where the inode for /usr/bin/python3 is on our hard drive:
$ sudo debugfs /dev/vda1
debugfs 1.46.2 (28-Feb-2021)
debugfs: imap /usr/bin/python3
Inode 6206 is part of block group 0
located at block 658, offset 0x0d00
I have no idea how debugfs is figuring out the location of the inode for that filename, but we’re going to leave that alone.
Now, we need to calculate how many bytes into our hard drive “block 658, offset 0x0d00” is on the big array of bytes that is your hard drive. Each block is 4096 bytes, so we need to go 4096 * 658 + 0x0d00 bytes. A calculator tells me that’s 2698496.
$ sudo dd if=/dev/vda1 bs=1 skip=2698496 count=256 2>/dev/null | hexdump -C
00000000 ff a1 00 00 09 00 00 00 f8 b6 cb 64 9a 65 d1 60 |...........d.e.`|
00000010 f0 fb 6a 60 00 00 00 00 00 00 01 00 00 00 00 00 |..j`............|
00000020 00 00 00 00 01 00 00 00 70 79 74 68 6f 6e 33 2e |........python3.|
00000030 39 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |9...............|
00000040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000060 00 00 00 00 12 4a 95 8c 00 00 00 00 00 00 00 00 |.....J..........|
00000070 00 00 00 00 00 00 00 00 00 00 00 00 2d cb 00 00 |............-...|
00000080 20 00 bd e7 60 15 64 df 00 00 00 00 d8 84 47 d4 | ...`.d.......G.|
00000090 9a 65 d1 60 54 a4 87 dc 00 00 00 00 00 00 00 00 |.e.`T...........|
000000a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
Neat! There’s our inode! You can see it says python3 in it, which is a really good sign. We’re not going to go through all of this, but the ext4 inode struct from the Linux kernel says that the first 16 bits are the “mode”, or permissions. So let’s work out how ffa1 corresponds to file permissions.
- The bytes ff a1 correspond to the number 0xa1ff, or 41471 (because x86 is little endian)
- 41471 in octal is 0120777
- The 777 at the end is familiar file permissions, but what are the first 3 digits? I’m not used to seeing those! You can find out what the 012 means in man inode (scroll down to “The file type and mode”). There’s a little table that says 012 means “symbolic link”.

Let’s list the file and see if it is in fact a symbolic link with permissions 777:
$ ls -l /usr/bin/python3
lrwxrwxrwx 1 root root 9 Apr 5 2021 /usr/bin/python3 -> python3.9
It is! Hooray, we decoded it correctly.
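Python’s stat module knows this mode encoding too, so we can double-check the arithmetic straight from the inode bytes:

```python
import stat

mode = 0xA1FF  # the "ff a1" bytes from the inode, byte-swapped
print(oct(mode))            # 0o120777, same as our calculation
print(stat.filemode(mode))  # renders it the way ls -l does
print(stat.S_ISLNK(mode))   # True: the 012 prefix means "symbolic link"
```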
We’re still not ready to start python3. First, the shell needs to create a new child process to run. The way new processes start on Unix is a little weird – first the process clones itself, and then runs execve, which replaces the cloned process with a new process.
See for yourself: Run strace -e clone bash, then run python3. You should see something like this:
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f03788f1a10) = 3708100
3708100 is the PID of the new process, which is a child of the shell process.
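Here’s the same clone-then-execve dance written out with Python’s thin wrappers around those system calls – a sketch, not what a real shell does (shells do a lot more bookkeeping), and it assumes a Unix system with /bin/sh:

```python
import os

# fork() clones the current process; the child then replaces itself
# with a brand-new program via execv(), just like the shell does.
pid = os.fork()
if pid == 0:
    # child: become /bin/sh, which immediately exits with status 7
    os.execv("/bin/sh", ["/bin/sh", "-c", "exit 7"])
else:
    # parent: wait for the child, like the shell waits for python3
    _, status = os.waitpid(pid, 0)
    exit_code = os.WEXITSTATUS(status)
    print("child", pid, "exited with", exit_code)
```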
Some more tools to look at what’s going on with processes:
- pstree will show you a tree of all the processes on your system
- cat /proc/PID/stat shows you some information about the process. The contents of that file are documented in man proc. For example the 4th field is the parent PID.

The new process (which will become python3) has inherited a bunch of things from the shell. For example, it’s inherited:

- its environment variables (you can look at them with cat /proc/PID/environ | tr '\0' '\n')
- its file descriptors (you can list them with ls -l /proc/PID/fd)
execve
Now we’re ready to start the Python interpreter!
See for yourself: Run strace -f -e execve bash, then run python3. The -f is important because we want to follow any forked child subprocesses. You should see something like this:
[pid 3708381] execve("/usr/bin/python3", ["python3"], 0x560397748300 /* 21 vars */) = 0
The first argument is the binary, and the second argument is the list of command line arguments. The command line arguments get placed in a special location in the program’s memory so that it can access them when it runs.
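On Linux you can look at the raw argument list the kernel stored for any process via /proc – here’s the current process inspecting its own copy (this assumes a Linux /proc filesystem):

```python
import sys

# The kernel copies argv into the new process's memory; on Linux the
# raw NUL-separated copy is visible at /proc/self/cmdline.
with open("/proc/self/cmdline", "rb") as f:
    raw = f.read()
args = raw.rstrip(b"\0").split(b"\0")
print(args)      # the raw argv the kernel stored
print(sys.argv)  # Python's view of (roughly) the same thing
```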
Now, what’s going on inside execve? The first thing that has to happen is that we need to open the python3 binary file and read its contents. So far we’ve only used the stat system call to access its metadata, but now we need its contents.
Let’s look at the output of stat again:
$ stat /usr/bin/python3
File: /usr/bin/python3 -> python3.9
Size: 9 Blocks: 0 IO Block: 4096 symbolic link
Device: fe01h/65025d Inode: 6206 Links: 1
...
This takes up 0 blocks of space on the disk. This is because the contents of the symbolic link (python3.9) are actually stored in the inode itself: you can see them here (from the binary contents of the inode above, it’s split across 2 lines in the hexdump output):
00000020 00 00 00 00 01 00 00 00 70 79 74 68 6f 6e 33 2e |........python3.|
00000030 39 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |9...............|
So we’ll need to open /usr/bin/python3.9 instead. All of this is happening inside the kernel, so you won’t see another system call for it.
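We can replay this symlink behaviour with a scratch symlink in Python – note how lstat sees the link itself (size 9, the length of “python3.9”, matching the stat output above) while stat follows it:

```python
import os
import stat
import tempfile

# Recreate the situation: a file python3.9 and a symlink python3
# pointing at it (in a scratch directory, not /usr/bin!).
d = tempfile.mkdtemp()
target = os.path.join(d, "python3.9")
link = os.path.join(d, "python3")
with open(target, "w") as f:
    f.write("pretend interpreter")
os.symlink("python3.9", link)  # relative target, like the real symlink

print(os.readlink(link))                     # the link's contents
print(os.lstat(link).st_size)                # 9: length of "python3.9"
print(stat.S_ISLNK(os.lstat(link).st_mode))  # lstat looks at the link itself
print(stat.S_ISREG(os.stat(link).st_mode))   # stat follows it to the file
```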
Every file is made up of a bunch of blocks on the hard drive. I think each of these blocks on my system is 4096 bytes, so the minimum size of a file is 4096 bytes – even if the file is only 5 bytes, it still takes up 4KB on disk.
See for yourself: We can find the block numbers using debugfs like this: (again, I got these instructions from Dmitry Mazin’s “A disk is a bunch of bits” post)
$ debugfs /dev/vda1
debugfs: blocks /usr/bin/python3.9
145408 145409 145410 145411 145412 145413 145414 145415 145416 145417 145418 145419 145420 145421 145422 145423 145424 145425 145426 145427 145428 145429 145430 145431 145432 145433 145434 145435 145436 145437
Now we can use dd to read the first block of the file. We’ll set the block size to 4096 bytes, skip 145408 blocks, and read 1 block.
$ dd if=/dev/vda1 bs=4096 skip=145408 count=1 2>/dev/null | hexdump -C | head
00000000 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 |.ELF............|
00000010 02 00 3e 00 01 00 00 00 c0 a5 5e 00 00 00 00 00 |..>.......^.....|
00000020 40 00 00 00 00 00 00 00 b8 95 53 00 00 00 00 00 |@.........S.....|
00000030 00 00 00 00 40 00 38 00 0b 00 40 00 1e 00 1d 00 |....@.8...@.....|
00000040 06 00 00 00 04 00 00 00 40 00 00 00 00 00 00 00 |........@.......|
00000050 40 00 40 00 00 00 00 00 40 00 40 00 00 00 00 00 |@.@.....@.@.....|
00000060 68 02 00 00 00 00 00 00 68 02 00 00 00 00 00 00 |h.......h.......|
00000070 08 00 00 00 00 00 00 00 03 00 00 00 04 00 00 00 |................|
00000080 a8 02 00 00 00 00 00 00 a8 02 40 00 00 00 00 00 |..........@.....|
00000090 a8 02 40 00 00 00 00 00 1c 00 00 00 00 00 00 00 |..@.............|
You can see that we get the exact same output as if we read the file with cat, like this:
$ cat /usr/bin/python3.9 | hexdump -C | head
00000000 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 |.ELF............|
00000010 02 00 3e 00 01 00 00 00 c0 a5 5e 00 00 00 00 00 |..>.......^.....|
00000020 40 00 00 00 00 00 00 00 b8 95 53 00 00 00 00 00 |@.........S.....|
00000030 00 00 00 00 40 00 38 00 0b 00 40 00 1e 00 1d 00 |....@.8...@.....|
00000040 06 00 00 00 04 00 00 00 40 00 00 00 00 00 00 00 |........@.......|
00000050 40 00 40 00 00 00 00 00 40 00 40 00 00 00 00 00 |@.@.....@.@.....|
00000060 68 02 00 00 00 00 00 00 68 02 00 00 00 00 00 00 |h.......h.......|
00000070 08 00 00 00 00 00 00 00 03 00 00 00 04 00 00 00 |................|
00000080 a8 02 00 00 00 00 00 00 a8 02 40 00 00 00 00 00 |..........@.....|
00000090 a8 02 40 00 00 00 00 00 1c 00 00 00 00 00 00 00 |..@.............|
This file starts with ELF, which is a “magic number” – a byte sequence that tells us that this is an ELF file. ELF is the binary file format on Linux. Different file formats have different magic numbers; for example, the magic number for gzip is 1f8b. The magic number at the beginning is how file blah.gz knows that it’s a gzip file.
I think file has a variety of heuristics for figuring out the file type of a file, not just magic numbers, but the magic number is an important one.
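A toy version of what file does – sniffing the first few bytes against known magic numbers – looks like this (just the two magic numbers we’ve mentioned, nothing like file’s real database):

```python
import gzip

def sniff(data):
    """A tiny file(1): check the first bytes against known magic numbers."""
    if data[:4] == b"\x7fELF":
        return "ELF"
    if data[:2] == b"\x1f\x8b":
        return "gzip"
    return "unknown"

print(sniff(gzip.compress(b"hello")))  # gzip output really starts with 1f 8b
print(sniff(b"\x7fELF\x02\x01\x01"))
print(sniff(b"some random text"))
```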
Let’s parse the ELF file to see what’s in there.
See for yourself: Run readelf -a /usr/bin/python3.9. Here’s what I get (though I’ve redacted a LOT of stuff):
$ readelf -a /usr/bin/python3.9
ELF Header:
Class: ELF64
Machine: Advanced Micro Devices X86-64
...
-> Entry point address: 0x5ea5c0
...
Program Headers:
Type Offset VirtAddr PhysAddr
INTERP 0x00000000000002a8 0x00000000004002a8 0x00000000004002a8
0x000000000000001c 0x000000000000001c R 0x1
-> [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
...
-> 1238: 00000000005ea5c0 43 FUNC GLOBAL DEFAULT 13 _start
Here’s what I understand of what’s going on here:

- it’s asking the kernel to use /lib64/ld-linux-x86-64.so.2 to start this program. This is called the dynamic linker and we’ll talk about it next
- it specifies an entry point address (0x5ea5c0, which is where this program’s code starts)

Now let’s talk about the dynamic linker.
Okay! We’ve read the bytes from disk and we’ve started this “interpreter” thing. What next? Well, if you run strace -o out.strace python3, you’ll see a bunch of stuff like this right after the execve system call:
execve("/usr/bin/python3", ["python3"], 0x560af13472f0 /* 21 vars */) = 0
brk(NULL) = 0xfcc000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=32091, ...}) = 0
mmap(NULL, 32091, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f718a1e3000
close(3) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 l\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=149520, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f718a1e1000
...
close(3) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
This all looks a bit intimidating at first, but the part I want you to pay attention to is openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0". This is opening a C threading library called pthread that the Python interpreter needs to run.
See for yourself: If you want to know which libraries a binary needs to load at runtime, you can use ldd. Here’s what that looks like for me:
$ ldd /usr/bin/python3.9
linux-vdso.so.1 (0x00007ffc2aad7000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2fd6554000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f2fd654e000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f2fd6549000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2fd6405000)
libexpat.so.1 => /lib/x86_64-linux-gnu/libexpat.so.1 (0x00007f2fd63d6000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f2fd63b9000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2fd61e3000)
/lib64/ld-linux-x86-64.so.2 (0x00007f2fd6580000)
You can see that the first library listed is /lib/x86_64-linux-gnu/libpthread.so.0, which is why it was loaded first.
I’m honestly still a little confused about dynamic linking. Some things I know:

- dynamic linking happens in userspace, and the dynamic linker on my system is at /lib64/ld-linux-x86-64.so.2. If you’re missing the dynamic linker, you can end up with weird bugs like this weird “file not found” error
- the dynamic linker uses the LD_LIBRARY_PATH environment variable to find libraries
- you can use the LD_PRELOAD environment variable to override any dynamically linked function you want (you can use this for fun hacks, or to replace your default memory allocator with an alternative one like jemalloc)
- there are some mprotects in the strace output which are marking the library code as read-only, for security reasons
- on Macs, it’s DYLD_LIBRARY_PATH instead of LD_LIBRARY_PATH
You might be wondering – if dynamic linking happens in userspace, why don’t we see a bunch of stat system calls where it’s searching through LD_LIBRARY_PATH for the libraries, the way we did when bash was searching the PATH?
That’s because ld has a cache in /etc/ld.so.cache, and all of those libraries have already been found in the past. You can see it opening the cache in the strace output – openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3.
There are still a bunch of system calls after dynamic linking in the full strace output that I still don’t really understand (what’s prlimit64 doing? where does the locale stuff come in? what’s gconv-modules.cache? what’s rt_sigaction doing? what’s arch_prctl? what’s set_tid_address and set_robust_list?). But this feels like a good start.
Someone on Mastodon pointed out that ldd is actually a shell script that just sets the LD_TRACE_LOADED_OBJECTS=1 environment variable and starts the program. So you can do exactly the same thing like this:
$ LD_TRACE_LOADED_OBJECTS=1 python3
linux-vdso.so.1 (0x00007ffe13b0a000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f01a5a47000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f01a5a41000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f2fd6549000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2fd6405000)
libexpat.so.1 => /lib/x86_64-linux-gnu/libexpat.so.1 (0x00007f2fd63d6000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f2fd63b9000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2fd61e3000)
/lib64/ld-linux-x86-64.so.2 (0x00007f2fd6580000)
Apparently ld is also a binary you can just run, so /lib64/ld-linux-x86-64.so.2 --list /usr/bin/python3.9 also does the same thing.
init and fini

Let’s talk about this line in the strace output:
set_tid_address(0x7f58880dca10) = 3709103
This seems to have something to do with threading, and I think this might be happening because the pthread library (and every other dynamically loaded library) gets to run initialization code when it’s loaded. The code that runs when the library is loaded is in the init section (or maybe also the .ctors section).
See for yourself: Let’s take a look at that using readelf:
$ readelf -a /lib/x86_64-linux-gnu/libpthread.so.0
...
[10] .rela.plt RELA 00000000000051f0 000051f0
00000000000007f8 0000000000000018 AI 4 26 8
[11] .init PROGBITS 0000000000006000 00006000
000000000000000e 0000000000000000 AX 0 0 4
[12] .plt PROGBITS 0000000000006010 00006010
0000000000000560 0000000000000010 AX 0 0 16
...
This library doesn’t have a .ctors section, just an .init. But what’s in that .init section? We can use objdump to disassemble the code:
$ objdump -d /lib/x86_64-linux-gnu/libpthread.so.0
Disassembly of section .init:
0000000000006000 <_init>:
6000: 48 83 ec 08 sub $0x8,%rsp
6004: e8 57 08 00 00 callq 6860 <__pthread_initialize_minimal>
6009: 48 83 c4 08 add $0x8,%rsp
    600d:       c3                      retq
So it’s calling __pthread_initialize_minimal. I found the code for that function in glibc, though I had to find an older version of glibc because it looks like in more recent versions libpthread is no longer a separate library.
I’m not sure whether this set_tid_address system call actually comes from __pthread_initialize_minimal, but at least we’ve learned that libraries can run code on startup through the .init section.
Here’s a note from man elf on the .init section:
$ man elf
.init This section holds executable instructions that contribute to the process initialization code. When a program starts to run, the system arranges to execute the code in this section before calling the main program entry point.
There’s also a .fini
section in the ELF file that runs at the end, and
.ctors
/ .dtors
(constructors and destructors) are other sections that
could exist.
Okay, that’s enough about dynamic linking.
_start
After dynamic linking is done, we go to _start
in the Python interpreter.
Then it does all the normal Python interpreter things you’d expect.
I’m not going to talk about this because here I’m interested in general facts about how binaries are run on Linux, not the Python interpreter specifically.
We still need to print out “hello world” though. Under the hood, the Python print
function calls some function from libc. But which one? Let’s find out!
See for yourself: Run ltrace -o out python3 hello.py
.
$ ltrace -o out python3 hello.py
$ grep hello out
write(1, "hello world\n", 12) = 12
So it looks like it’s calling write.
I honestly am always a little suspicious of ltrace – unlike strace (which I
would trust with my life), I’m never totally sure that ltrace is actually
reporting library calls accurately. But in this case it seems to be working. And
if we look at the cpython source code, it does seem to be calling write()
in some places. So I’m willing to believe that.
We just said that Python calls the write
function from libc. What’s libc?
It’s the C standard library, and it’s responsible for a lot of basic things
like:
- memory allocation (malloc)
- running programs (execvp, like we mentioned before)
- DNS lookups (getaddrinfo)
- threading (pthread)
Programs don’t have to use libc (on Linux, Go famously doesn’t use it and calls Linux system calls directly instead), but most other programming languages I use (node, Python, Ruby, Rust) all use libc. I’m not sure about Java.
You can find out if you’re using libc by running ldd
on your binary: if you
see something like libc.so.6
, that’s libc.
You might be wondering – why does it matter that Python calls the libc write
and then libc calls the write
system call? Why am I making a point of saying
that libc
is in the middle?
I think in this case it doesn’t really matter (AFAIK the write
libc function
maps pretty directly to the write
system call).
But there are different libc implementations, and sometimes they behave differently. The two main ones are glibc (GNU libc) and musl libc.
For example, until recently musl’s getaddrinfo
didn’t support TCP DNS; here’s a blog post talking about a bug that this caused.
In this program, stdout (the 1
file descriptor) is a terminal. And you can do
funny things with terminals! Here’s one:
- Run ls -l /proc/self/fd/1. I get /dev/pts/2.
- In another terminal, run echo hello > /dev/pts/2.
- hello gets printed there!

Hopefully you have a better idea of how hello world
gets printed! I’m going to stop
adding more details for now because this is already pretty long, but obviously there’s
more to say and I might add more if folks chip in with extra details. I’d
especially love suggestions for other tools you could use to inspect parts of
the process that I haven’t explained here.
Thanks to everyone who suggested corrections / additions – I’ve edited this blog post a lot to incorporate more things :)
Some things I’d like to add if I can figure out how to spy on them:
- how the write(1, "hello world", 11)
gets sent to the TTY that I’m looking at

One of my frustrations with Mac OS is that I don’t know how to introspect my
system on this level – when I print hello world
, I can’t figure out how to
spy on what’s going on behind the scenes the way I can on Linux. I’d love to
see a really in depth explainer.
Some Mac equivalents I know about:
- ldd -> otool -L
- readelf -> otool
- you can use dtruss
or dtrace
on Mac instead of strace, but I’ve never been brave enough to turn off system integrity protection to get it to work
- strace -> sc_usage
seems to be able to collect stats about syscall usage, and fs_usage
about file usage

Some more links:
execve