Key Insights on How Git Version Control System Works

Seeing Git Version Control System as a Simple Key-value Store

Most of us know Git as a distributed version control system (DVCS). It may surprise many, as it surprised me, to learn that at its core Git is a content-addressable filesystem (or simply a key-value store) with a VCS layer on top.
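
To make the key-value idea concrete, here is a minimal Python sketch of a content-addressable store. The class name ContentStore and its put/get methods are hypothetical, chosen only for illustration. Real Git also adds a small header before hashing and compresses what it stores, both of which this sketch leaves out.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is derived from the value itself."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The key is the SHA-1 of the content, so identical content
        # always maps to the same key and is stored only once.
        key = hashlib.sha1(data).hexdigest()
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
first = store.put(b"Hello Git\n")
second = store.put(b"Hello Git\n")  # same content, same key
print(first == second, len(store))
```

Because the key depends only on the content, storing the same bytes twice costs nothing extra. That property is what lets a snapshot-based system stay compact.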

Unlike other popular VCS solutions that store changesets (forward or reverse diffs) across revisions, Git stores a snapshot of the files for each revision in its filesystem, as shown in the following diagram.

[Figure: Git Version Control System]

Git Object Model

Git stores content and history as directed acyclic graphs (DAGs) built from different types of objects. Central to its storage strategy is the object database, stored under the .git/objects directory at the repository root.

The database is made up of objects that form DAGs, which in turn capture the state of the repository. Git supports four types of objects, which together can represent every kind of content in the repository.

Blob – A blob object stores the content of a file, compressed with zlib

Tree – A tree object stores references to blobs and subtrees, along with their file modes, types, and file names

Commit – A commit object refers to the tree representing the top-level directory for the commit and to its parent commits, along with standard attributes such as author, committer, timestamp, and message

Tag – A tag object refers to a commit (or any other object) in the repository using a friendly name

Each object has four main attributes: sha, type, size, and an optional content attribute. The sha attribute stores a 40-character hexadecimal SHA-1 checksum and acts as the identifier for the object. The typical structure of a Git object is shown below:

[Figure: Git object]
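
As a rough illustration of how these attributes fit together, the sketch below computes an object identifier by hashing a header made of the type and size, a NUL byte, and then the content. The helper name git_object_id is an assumption made for this example, and the sketch covers only SHA-1 repositories.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    # Header is "<type> <size in bytes>" followed by a NUL byte.
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# Identifier of an empty blob (zero bytes of content).
print(git_object_id("blob", b""))
# Identifier of the text used later in this article (echo adds a newline).
print(git_object_id("blob", b"Hello Git\n"))
```

Since the type is part of the hashed header, a blob and a tree with identical bytes would still get different identifiers.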

When a changed file is committed, Git ends up with a new blob object representing that file (the blob itself is written when the file is staged). For each directory above the changed file, a new tree object is created with a new identifier. A new DAG is thus formed, starting from the newly created root tree object and pointing down to the blobs.

Existing blob references are reused if file contents have not changed. The result is a complete snapshot of the current state of the repository. A sample snapshot with all of the above objects is shown below for ease of understanding.

[Figure: Git repository]
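
The reuse of unchanged blobs can be sketched in a few lines of Python. This is a deliberate simplification: the snapshot here is a flat mapping from path to blob identifier rather than a hierarchy of tree objects, and the names take_snapshot and object_db are hypothetical.

```python
import hashlib


def blob_id(content: bytes) -> str:
    header = f"blob {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def take_snapshot(files: dict, object_db: dict) -> dict:
    """Record a flat snapshot (path -> blob id), storing only unseen content."""
    snapshot = {}
    for path, content in files.items():
        oid = blob_id(content)
        object_db.setdefault(oid, content)  # reuse the blob if it already exists
        snapshot[path] = oid
    return snapshot


object_db = {}
rev1 = take_snapshot({"README": b"v1\n", "main.c": b"int main(){}\n"}, object_db)
rev2 = take_snapshot({"README": b"v2\n", "main.c": b"int main(){}\n"}, object_db)

print(rev1["main.c"] == rev2["main.c"])  # the unchanged file shares one blob
print(len(object_db))                    # blobs stored across both revisions
```

Two revisions of two files produce only three stored blobs, because main.c did not change between them.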

Object Manipulation Commands

Git offers both console and UI tools for version control operations, all of which work on top of its core command-line toolkit. Git follows the Unix philosophy of software tools: it provides a set of command-line tools that each do one task, instead of one special-purpose program that tries to do every imaginable task.

Git’s core toolkit consists of two types of commands – porcelain and plumbing. Porcelain commands are the user-friendly commands (like checkout, branch, commit etc.) that we routinely use for versioning files and maintaining repositories. Plumbing commands are low-level commands (like hash-object, update-ref, write-tree etc.) that enable basic content tracking and manipulation of Git’s object tree. A few interesting plumbing commands are listed below:

1. hash-object: Outputs the 40-character SHA1 checksum hash of the content you would like to store.

$ echo 'Hello Git' | git hash-object --stdin
 9f4d96d5b00d98959ea9960f069585ce42b1349a

The above command outputs the hash key without writing anything to Git's object database. To store the text, we need to add the -w flag as shown:

$ echo 'Hello Git' | git hash-object -w --stdin
 9f4d96d5b00d98959ea9960f069585ce42b1349a
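
What the -w flag adds can be approximated in Python. The sketch below assumes the loose-object layout (zlib-compressed data in a file whose directory is the first two hex characters of the identifier) and writes into a temporary directory standing in for .git/objects. The function name write_loose_object is hypothetical.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir: str, obj_type: str, content: bytes) -> str:
    """Store an object the way a loose Git object is laid out on disk."""
    raw = f"{obj_type} {len(content)}".encode("ascii") + b"\x00" + content
    oid = hashlib.sha1(raw).hexdigest()
    # First two hex characters name the directory, the remaining 38 the file.
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as fh:
        fh.write(zlib.compress(raw))
    return oid


# A temporary directory stands in for .git/objects in this sketch.
objects_dir = tempfile.mkdtemp()
oid = write_loose_object(objects_dir, "blob", b"Hello Git\n")
stored = os.path.join(objects_dir, oid[:2], oid[2:])
print(oid, os.path.exists(stored))
```

Without the write step the function would only compute the identifier, which mirrors the difference between running hash-object with and without -w.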

2. cat-file: Inspects and displays the content of Git objects.

$ git cat-file -p 9f4d96d5b00d98959ea9960f069585ce42b1349a
 Hello Git

To see the type of the object, we can use the -t flag instead of -p:

$ git cat-file -t 9f4d96d5b00d98959ea9960f069585ce42b1349a
 blob
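
Reading an object back is the reverse of storing it. The following sketch, under the same assumptions about the stored format, decompresses an object and splits it into type, size, and content. It builds the compressed bytes in memory rather than reading a real repository, and parse_object is a hypothetical name.

```python
import zlib


def parse_object(compressed: bytes):
    """Split a zlib-compressed Git object into (type, size, content)."""
    raw = zlib.decompress(compressed)
    header, _, content = raw.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, int(size), content


# Build a compressed object in memory instead of reading .git/objects.
sample = zlib.compress(b"blob 10\x00Hello Git\n")
obj_type, size, content = parse_object(sample)
print(obj_type)          # roughly what `cat-file -t` reports
print(content.decode())  # roughly what `cat-file -p` prints for a blob
```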

3. write-tree: Creates a tree object from the staging area (also known as the index). To create a tree object, let us first stage our test blob content.

$ git update-index --add --cacheinfo 100644 9f4d96d5b00d98959ea9960f069585ce42b1349a HelloGit.txt

Here, 100644 is the file mode (a regular, non-executable file, as in Unix systems) and ‘HelloGit.txt’ is the filename we are associating with our test blob content. We are now good to build a tree object:

$ git write-tree
 88fb4b20a2d9c5d69b3a54aff31f68acc8627289

To see the content of the tree object, let us use the cat-file command.

$ git cat-file -p 88fb4b20a2d9c5d69b3a54aff31f68acc8627289
 100644 blob 9f4d96d5b00d98959ea9960f069585ce42b1349a HelloGit.txt
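
The pretty-printed listing above hides the raw layout of a tree. As a sketch, assuming the usual entry format (mode, space, name, NUL byte, then the 20-byte binary identifier), a tree body can be built and parsed as follows. The helper names tree_entry and parse_tree are hypothetical.

```python
import hashlib


def tree_entry(mode: str, name: str, oid_hex: str) -> bytes:
    # Each entry is "<mode> <name>", a NUL byte, then the 20-byte binary id.
    return (mode.encode("ascii") + b" " + name.encode("utf-8")
            + b"\x00" + bytes.fromhex(oid_hex))


def parse_tree(body: bytes):
    """Yield (mode, name, oid_hex) for every entry in a raw tree body."""
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\x00", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        oid_hex = body[nul + 1:nul + 21].hex()
        yield mode, name, oid_hex
        pos = nul + 21


blob = "9f4d96d5b00d98959ea9960f069585ce42b1349a"  # blob id from the text above
body = tree_entry("100644", "HelloGit.txt", blob)
header = b"tree " + str(len(body)).encode("ascii") + b"\x00"
tree_id = hashlib.sha1(header + body).hexdigest()
print(list(parse_tree(body)))
print(tree_id)
```

Note that the object type shown by cat-file -p (blob or tree) is not stored in the entry itself. It is inferred from the mode and the referenced object.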

4. commit-tree: Creates a commit object from a tree. The following command commits our tree, referring to it by an abbreviated hash:

$ echo 'Initial Commit' | git commit-tree 88fb4b
 478a084b741671bcbbb5c9cfc0e6fe743064bfeb

To see the commit object, let us use the cat-file command again.

$ git cat-file -p 478a084b741671bcbbb5c9cfc0e6fe743064bfeb
tree 88fb4b20a2d9c5d69b3a54aff31f68acc8627289
author Frankline <foo@foobar.com> 1495522584 +0530
committer Frankline <foo@foobar.com> 1495522584 +0530

Initial Commit
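
A commit object is plain text, so its body is easy to assemble by hand. The sketch below reuses the values shown above and assumes, for brevity, that author and committer are the same person with the same timestamp. The function name build_commit is hypothetical, and the printed identifier is not claimed to match the one above.

```python
import hashlib


def build_commit(tree, author, timestamp, tz, message, parents=()):
    """Assemble the raw body of a commit object."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # a root commit has no parents
    lines.append(f"author {author} {timestamp} {tz}")
    lines.append(f"committer {author} {timestamp} {tz}")
    # A blank line separates the headers from the commit message.
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")


body = build_commit(
    tree="88fb4b20a2d9c5d69b3a54aff31f68acc8627289",
    author="Frankline <foo@foobar.com>",
    timestamp=1495522584,
    tz="+0530",
    message="Initial Commit",
)
header = b"commit " + str(len(body)).encode("ascii") + b"\x00"
commit_id = hashlib.sha1(header + body).hexdigest()
print(body.decode())
print(commit_id)
```

Because the timestamp and author are part of the hashed body, committing the same tree at a different time yields a different commit identifier.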

The line between porcelain and plumbing commands is thin, since the Git maintainers do not always keep the command category list up to date. As a rule of thumb, a command that produces human-readable text is more likely to be porcelain than plumbing.

Filesystem Layout

Git distinguishes its workspace and staging area from its repository area. The workspace is where we do normal VCS operations. It contains the tracked and untracked files that users manipulate during their day-to-day tasks. The staging area holds the changes selected for the next commit.

The repository contains all data and metadata and is stored under a hidden .git directory. The following are the contents you are likely to see if you list this directory.

[Figure: Git internals, listing of the .git directory]

Git initially saves objects as loose files. Occasionally, several of these object files are packed into a single file called a packfile to save space and improve efficiency. Packing can be triggered manually with the git gc command, and it also happens automatically, for example when pushing changes to a remote server.

Frankline Francis

Frankline has been practicing software development for more than a decade and has worked on products and projects of varying sizes. He is currently working as an architect at Evoke Technologies and is specialized in building service oriented and cloud native applications.
