git-friendly image format?

For repositories with daily updated data plots (including slightly changing background color gradients), I asked myself if there is some preferred format (or compression algorithm) to use, so that git can store them more efficiently, instead of having to re-write about 90% of them, all the time.

Is there any kind of image format which is more 'git-friendly' than others?

There are 2 best solutions below

Nicolas Voron On

Git can handle binary files, but it is not designed for them, so I recommend the excellent git-lfs extension (originally supported by GitHub).

With git, the problem is not what you are versioning, but how you do it. Daily updated data plots will generate a huge amount of data over time, which will become a problem for cloning and fetching within a few years.

How to use it:

Download and install the Git command line extension. Once downloaded and installed, set up Git LFS for your user account by running:

git lfs install

You only need to run this once per user account.

In each Git repository where you want to use Git LFS, select the file types you'd like Git LFS to manage (or directly edit your .gitattributes). You can configure additional file extensions at any time.

git lfs track "*.psd"

Now make sure .gitattributes is tracked:

git add .gitattributes

Note that defining the file types Git LFS should track will not, by itself, convert any pre-existing files to Git LFS, such as files on other branches or in your prior commit history. To do that, use the git lfs migrate command, which has a range of options designed to suit various potential use cases.

There is no step three. Just commit and push to GitHub as you normally would; for instance, if your current branch is named main:

git add file.psd

git commit -m "Add design file"

git push origin main
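For the daily plots in the question, the same setup might look like the sketch below. It assumes the plots are PNG files (the question does not say), and it takes the "directly edit your .gitattributes" route, writing the line that `git lfs track "*.png"` would add.

```shell
# Assumption: the daily plots are PNG files; adjust the pattern to your format.
# This is the line that `git lfs track "*.png"` appends to .gitattributes.
echo '*.png filter=lfs diff=lfs merge=lfs -text' >> .gitattributes

# Show the result; remember to `git add .gitattributes` afterwards.
cat .gitattributes
```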

What it does:

Git LFS stores a pointer file in the git repo in lieu of the real large file. The pointer is swapped out for the real file at checkout (using smudge and clean). The smudge and clean filters are part of core Git and are designed to allow changing a file on checkout (smudge) and on commit (clean). Git LFS uses these techniques to replace the pointer files with the actual large files that are in use.
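To give an idea of how small such a pointer is, here is a rough sketch that prints pointer-style text for a file. `lfs_pointer` is a hypothetical helper name used only for illustration, and the sketch assumes `sha256sum` from GNU coreutils is available; Git LFS generates the real pointers itself through its clean filter.

```shell
# Sketch: print Git LFS pointer text (version, SHA-256 oid, size) for a file.
# lfs_pointer is a hypothetical helper, not part of Git or Git LFS.
lfs_pointer() {
    file=$1
    # SHA-256 of the file contents (assumes GNU coreutils sha256sum).
    oid=$(sha256sum "$file" | cut -d' ' -f1)
    # Size in bytes; tr strips the padding some wc implementations add.
    size=$(wc -c < "$file" | tr -d ' ')
    printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' "$oid" "$size"
}
```

Running `lfs_pointer plot.png` (with a file of your own) prints three short lines, and that is all the Git repository stores per image version; the image itself lives in the LFS store.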

EDIT

As I commented under your question, you might consider a lossless image format such as PNG, so that git has a chance to optimise the deltas over time: two nearly identical pictures are more likely to have similar binary representations in such a format than in a lossy compressed one (e.g. JPEG). Keep in mind that PNG data is itself DEFLATE-compressed, so a small change in the picture can still alter much of the file; a truly uncompressed format (e.g. BMP) is the most delta-friendly. How well this works depends on your pictures and how much they vary each day, but for a plot, PNG is a reasonable starting point.
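The effect of compression on byte similarity can be seen with a small experiment. This is only an illustration under loud assumptions: a text file stands in for raw pixel data, and gzip stands in for a compressed image format. It changes a single byte of the input and counts how many byte positions differ before and after compression.

```shell
# Assumptions for illustration: a text file stands in for raw pixel data,
# and gzip stands in for a compressed image format.
work=$(mktemp -d)
seq 1 20000 > "$work/a.raw"
# Same data with a single byte changed (line 500: "500" becomes "600").
sed '500s/^5/6/' "$work/a.raw" > "$work/b.raw"

# Compress both versions (-n leaves name and timestamp out of the header).
gzip -n -c "$work/a.raw" > "$work/a.gz"
gzip -n -c "$work/b.raw" > "$work/b.gz"

# Count the byte positions at which two files differ.
count_diff() {
    cmp -l "$1" "$2" 2>/dev/null | wc -l | tr -d ' '
}

raw_diff=$(count_diff "$work/a.raw" "$work/b.raw")
gz_diff=$(count_diff "$work/a.gz" "$work/b.gz")
echo "raw: $raw_diff differing bytes; gzip: $gz_diff differing bytes"
```

Git computes deltas between the file contents as you commit them, so the more bytes two versions share on disk, the more it can save when packing.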

Another recommendation is to keep the pictures inside a submodule (unless it's a dedicated image-only repo), so that the weight of the versioned images does not affect cloning and fetching of the whole repo.

joanis On

The theory

Formats that are "Git friendly" will be formats that share long identical byte sequences, whether they are binary or text.

Now, a lossy binary format will probably change most bytes when you change even just the background colour gradients, whereas a more descriptive text-based format might not.

Testing things with your own files

I recommend this test to calculate the compressed size of different file formats in your actual use case.

  1. Before you start, take a sandbox or a clone, and aggressively compress it so we know further compression in later steps is not due to the images being added: run git gc --aggressive a few times, until du .git yields the same answer twice.

Now, for each file format you want to test, copy that sandbox into a new directory and do the following steps:

  1. Add one set of images and aggressively compress the repo again by running git gc --aggressive a few times, until du .git yields the same answer twice.

  2. Write down what du .git tells you: that's your baseline size.

  3. Add and commit a new set of files, slightly changed in the way you describe in your question.

  4. Now du .git tells you the size of just adding those files into the repo. On commit, Git does not (normally) try to apply delta compression or packing; it just adds a new blob for each file being committed, unless an identical blob already existed.

  5. Again, run git gc --aggressive until the size is stable.

  6. Now du .git tells you how much Git was able to compress those files, by whatever means it found, possibly delta compression. The size here minus the size at step 2 is your space cost for adding one new set of files.

By running the above procedure for different file formats for your images, you'll get an answer specific to your use case.
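The numbered steps could be scripted roughly as follows. This is a sketch, not a polished tool: the function names (`repo_size`, `settle`, `measure_format`) and the idea of passing two directories of images are my own assumptions, and it must be run from inside the sandbox copy prepared for one format.

```shell
# Sketch of the measurement steps. Function names and the two-directory
# layout are assumptions for illustration. Run inside the sandbox repo.

# Size of the repository database in KiB, as reported by du.
repo_size() {
    du -sk .git | cut -f1
}

# Run `git gc --aggressive` until du reports the same size twice
# (gives up after 10 rounds), then print the final size.
settle() {
    prev=-1
    cur=$(repo_size)
    i=0
    while [ "$cur" != "$prev" ] && [ "$i" -lt 10 ]; do
        git gc --aggressive --quiet
        prev=$cur
        cur=$(repo_size)
        i=$((i + 1))
    done
    echo "$cur"
}

# measure_format FIRST_SET_DIR SECOND_SET_DIR
# Both directories hold the same file names in one image format.
measure_format() {
    # Steps 1-2: add the first set, compress, record the baseline.
    cp "$1"/* . && git add -A && git commit --quiet -m "first image set"
    baseline=$(settle)
    # Steps 3-4: add the slightly changed set, record the size before gc.
    cp "$2"/* . && git add -A && git commit --quiet -m "second image set"
    after_commit=$(repo_size)
    # Steps 5-6: compress again; the difference to the baseline is the cost.
    after_gc=$(settle)
    echo "baseline=${baseline}K after_commit=${after_commit}K after_gc=${after_gc}K cost=$((after_gc - baseline))K"
}
```

For example, `measure_format ../plots-png/day1 ../plots-png/day2` (hypothetical paths) in one sandbox and the equivalent call with another format in a second sandbox gives two cost figures to compare. Note that du rounds to whole KiB, so very small image sets may show a cost of 0.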

Git LFS is probably your friend

PS: All that being said, I stand by @Nicolas Voron's answer: unless the size cost above is actually small for the file format you end up choosing, use Git LFS to avoid creating problems in the future when your repo gets too large to clone.