Git internals: how git works, shown through the code examples below.

Sample Repository is here and the Video is here.

A good overview on what’s inside the .git directory.

Introduction

We’re going to show the internals of git and the data structures it uses. We’ll do that by crafting the .git directory and a blob object by hand.

Where are we, and is git happy?

Let’s start by seeing where we are, and if we’re in a valid git repository.

echo "My current working dir is: $PWD"
git status 2>&1 || echo Git is not happy.

My current working dir is: /private/tmp/org

Git is not happy; clearly we’ve got some work to do.

Let’s create the .git directory

OK, so we know that we need a .git directory to start things off. Let’s create one and fill it with the stuff that git needs.

mkdir -p .git/objects
mkdir -p .git/refs
mkdir -p .git/refs/heads
echo "ref: refs/heads/master" > .git/HEAD
tree .git/
git status 2>&1 && echo "Git is happy!"

.git/
├── HEAD
├── objects
└── refs
    └── heads

3 directories, 1 file
Git is happy!

Git needs a few things to be happy: a place to stash objects, a place to track refs, and a HEAD file which points to the current branch (and, through it, to our current commit). As you can see, git is now happy!
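The same minimal layout can be sketched in Python. This is only a rough equivalent of the shell commands above, not how git itself initializes a repository. The helper name make_minimal_git_dir is made up for illustration, and the layout is built in a temporary directory so nothing in your working directory is touched.

```python
import tempfile
from pathlib import Path


def make_minimal_git_dir(root: Path) -> Path:
    """Create the bare-minimum layout described above under root/.git."""
    git_dir = root / ".git"
    # A place to stash objects and a place to track refs.
    (git_dir / "objects").mkdir(parents=True, exist_ok=True)
    (git_dir / "refs" / "heads").mkdir(parents=True, exist_ok=True)
    # HEAD is a symbolic ref naming the current branch.
    (git_dir / "HEAD").write_text("ref: refs/heads/master\n")
    return git_dir


with tempfile.TemporaryDirectory() as tmp:
    git_dir = make_minimal_git_dir(Path(tmp))
    # Collect everything that was created, relative to .git
    layout = sorted(p.relative_to(git_dir).as_posix() for p in git_dir.rglob("*"))
    head_text = (git_dir / "HEAD").read_text()
    print(layout)
    print(head_text, end="")
```

Pointing git at a directory built this way should give the same result as the shell version, since the three pieces are identical.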

Let’s hash something!

We’re going to hash the string “Welcome to Git Internals!” by using the git plumbing command hash-object.

echo -n 'Welcome to Git Internals!' | git hash-object --stdin -w

2e475926c85491ad77fa28ea00ad35686dec7cdc

We’ve asked git to hash the content we passed in via STDIN, and we’ve also asked it to store it in the object database. It returned a 40-character SHA-1 hash of the content, and if you have ever worked with git before, you’ve likely seen one of these. You can also refer to this hash by a unique prefix of at least four characters, like 2e47, which is pretty handy.

Where did git put it?

In the previous example we asked git hash-object to write our string to the object database. Let’s see how that was stored in the .git directory.

tree .git

.git
├── HEAD
├── objects
│   └── 2e
│       └── 475926c85491ad77fa28ea00ad35686dec7cdc
└── refs
    └── heads

4 directories, 2 files

Because file systems get angry with you when you try to stash too many files in the same directory, git shards the objects directory based on the first two hex characters (the first byte) of the hash.
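As a small illustration of that sharding rule, here is a sketch that maps a full hash to the path of its loose object. The function name object_path is hypothetical; only the directory layout comes from the tree output above.

```python
from pathlib import Path


def object_path(git_dir: str, sha: str) -> Path:
    """Map a 40-character hex object id to its loose-object path."""
    if len(sha) != 40:
        raise ValueError("expected a full 40-character hex hash")
    # The first two hex characters name the directory, the remaining 38 the file.
    return Path(git_dir) / "objects" / sha[:2] / sha[2:]


path = object_path(".git", "2e475926c85491ad77fa28ea00ad35686dec7cdc")
print(path.as_posix())
```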

Can we just look at the object?

Let’s see if we can find part of the string we hashed inside the object.

grep -F Welcome .git/objects/2e/475926c85491ad77fa28ea00ad35686dec7cdc || printf "No."

No.

No. Git stores the objects in compressed format, but we can use git cat-file to take a peek inside it. We’ll run it with the -p argument to pretty-print the object. We’ll also reference the object by its short name, because typing long git hashes is no fun!

git cat-file -p 2e47
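Short names work because git expands a unique prefix to the full hash. Below is a minimal sketch of that lookup, assuming only loose objects under the objects directory (real git also consults packfiles, which this ignores). The helper resolve_prefix is hypothetical, and the example uses an empty stand-in file in a temporary directory rather than a real repository.

```python
import tempfile
from pathlib import Path


def resolve_prefix(objects_dir: Path, prefix: str) -> str:
    """Expand an abbreviated hash to a full loose-object id (hypothetical helper)."""
    if len(prefix) < 4:
        raise ValueError("need at least four hex characters")
    shard = objects_dir / prefix[:2]
    # Loose objects live at objects/<first two chars>/<remaining 38 chars>.
    matches = [prefix[:2] + f.name for f in shard.glob(prefix[2:] + "*")]
    if len(matches) != 1:
        raise LookupError(f"{prefix!r} matched {len(matches)} objects")
    return matches[0]


with tempfile.TemporaryDirectory() as tmp:
    objects_dir = Path(tmp) / "objects"
    # Empty stand-in file, named like the object created earlier.
    fake = objects_dir / "2e" / "475926c85491ad77fa28ea00ad35686dec7cdc"
    fake.parent.mkdir(parents=True)
    fake.touch()
    full = resolve_prefix(objects_dir, "2e47")
    print(full)
```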

Let’s do it ourselves

OK, let’s figure out how git compresses the file.

cat .git/objects/2e/475926c85491ad77fa28ea00ad35686dec7cdc | file -
cat .git/objects/2e/475926c85491ad77fa28ea00ad35686dec7cdc | gunzip || echo "Nope."

/dev/stdin: zlib compressed data
Nope.

file recognizes it as zlib compressed data, but gunzip doesn’t know what to make of it: a zlib stream is not a gzip file. The program pigz can deal with these.

cat .git/objects/2e/475926c85491ad77fa28ea00ad35686dec7cdc | pigz -d | hexdump -C

We can see from the hex output that git stores our string with a header: the word blob, which is the type of thing we’re storing, followed by a space and 25, which is the number of bytes of the thing we are storing, followed by a null byte.
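To make the header layout concrete, here is a sketch that decompresses an object and splits it into type, size and content. The function name parse_loose_object is made up for illustration, and the sample bytes are built in memory as a stand-in for the file on disk, so the example runs without a repository.

```python
import zlib


def parse_loose_object(compressed: bytes):
    """Split a zlib-compressed loose object into (type, size, content)."""
    raw = zlib.decompress(compressed)
    # The header ends at the first null byte and looks like b"<type> <size>".
    header, _, content = raw.partition(b"\0")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type.decode("ascii"), int(size), content


# Stand-in for the bytes stored in .git/objects/2e/4759...
sample = zlib.compress(b"blob 25\0Welcome to Git Internals!")
parsed = parse_loose_object(sample)
print(parsed)
```

Python's zlib module handles the stream directly, which is why no external tool such as pigz is needed here.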

Since we’re doing things from the ground up, let’s hash it ourselves using Python.

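import hashlib

hashme = "Welcome to Git Internals!"
content = hashme.encode("utf-8")
# Git's header is "<type> <size in bytes>\0"; note the space after "blob"
header = f"blob {len(content)}\0".encode("utf-8")
myblob = header + content
gitsha = hashlib.sha1(myblob).hexdigest()

print(gitsha)

2e475926c85491ad77fa28ea00ad35686dec7cdc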

Now that we’ve figured out how to hash the string like git would, we just need to compress it and save it. First, let’s get rid of the object we previously created.

rm .git/objects/2e/475926c85491ad77fa28ea00ad35686dec7cdc

Here we go, this will be much like the previous program except now it will compress and then save the file!

import hashlib
import os
import zlib

hashme = "Welcome to Git Internals!"
content = hashme.encode("utf-8")
# Same header as before: type, space, size in bytes, null byte
myblob = f"blob {len(content)}\0".encode("utf-8") + content
gitsha = hashlib.sha1(myblob).hexdigest()

# Calculate filename: the first two hex characters are the directory
gitobj = f".git/objects/{gitsha[:2]}/{gitsha[2:]}"
os.makedirs(os.path.dirname(gitobj), exist_ok=True)

# Write out the zlib-compressed bytes!
with open(gitobj, "wb") as myfile:
    myfile.write(zlib.compress(myblob))

print(f"Wrote: {gitobj}")

Wrote: .git/objects/2e/475926c85491ad77fa28ea00ad35686dec7cdc

Did it work?

Now use git cat-file to verify that our program did everything right.

git cat-file -p 2e47
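The same check can be sketched in Python without git: read the object back, decompress it, and confirm that the SHA-1 of the decompressed bytes matches the directory and file name it is stored under. The helper verify_loose_object is hypothetical, and the example writes its own object into a temporary directory so it stays self-contained.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def verify_loose_object(path: Path) -> bytes:
    """Return the object's content if its hash matches its path, else raise."""
    raw = zlib.decompress(path.read_bytes())
    # The object id is the shard directory name plus the file name.
    expected = path.parent.name + path.name
    actual = hashlib.sha1(raw).hexdigest()
    if actual != expected:
        raise ValueError(f"hash mismatch: {actual} != {expected}")
    # Everything after the first null byte is the stored content.
    return raw.partition(b"\0")[2]


with tempfile.TemporaryDirectory() as tmp:
    store = b"blob 25\0Welcome to Git Internals!"
    sha = hashlib.sha1(store).hexdigest()
    obj = Path(tmp) / "objects" / sha[:2] / sha[2:]
    obj.parent.mkdir(parents=True)
    obj.write_bytes(zlib.compress(store))
    content = verify_loose_object(obj)
    print(content.decode("utf-8"))
```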

Final thoughts

In a future post, I’ll show how we can use similar techniques to craft git trees and commits.