Thoughts on Delta Compression

Git can use delta compression to achieve substantial space savings.  As more data is added to a Git project, relatively little additional space is needed, provided the new data resembles what is already stored.
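
To make the idea concrete, here is a toy sketch of delta encoding in Python. It is not Git’s actual algorithm or pack format; it simply leans on the standard library’s `difflib.SequenceMatcher` to describe a target as "copy this range from the base" and "insert these literal bytes" operations. The names `make_delta`, `apply_delta`, and `literal_bytes` are made up for illustration.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copy/insert operations against base (toy model)."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Bytes shared with the base: store only an offset and a length.
            ops.append(("copy", i1, i2 - i1))
        elif j2 > j1:
            # "replace" or "insert": these bytes must be stored literally.
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no operation: those base bytes are just never copied.
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from the base and the operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


def literal_bytes(ops: list) -> int:
    """Count the bytes the delta has to carry itself (a rough size proxy)."""
    return sum(len(op[1]) for op in ops if op[0] == "insert")


base = b"the quick brown fox jumps over the lazy dog\n" * 10
similar = base + b"one more line\n"    # close to the base
unrelated = b"0123456789" * 40         # shares nothing with the base

for name, target in (("similar", similar), ("unrelated", unrelated)):
    delta = make_delta(base, target)
    assert apply_delta(base, delta) == target
    print(name, "target bytes:", len(target), "literal bytes in delta:", literal_bytes(delta))
```

Data that resembles the base should need only a handful of literal bytes, while unrelated data has to be stored almost in full, which is the first observation below in miniature.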

I’ve been thinking through this and have the following observations:

  1. The more the new data differs from what is already stored, the less effective delta compression will be, all other things being equal.
  2. It seems fairly clear that as the number of blobs grows, delta computations become more expensive.  But why?
    1. The obvious reason is that you have to spend more time searching through the blobs.
    2. There is a less obvious reason. If we demand that every blob be addressable, then the number of bits needed to identify a unique blob grows with the number of blobs (roughly log2 of the count). Each reference from a delta to its base therefore costs more, so delta compression becomes relatively less beneficial as the number of blobs grows, all other things being equal.
  3. As a corollary to 2.2 above, I suspect it could be useful, in theory, to remove from the index those blobs that are relatively less useful. This would “clean up” the blob index for both space and time savings.
    1. Perhaps the removal from the blob index is permanent.
    2. Or perhaps the blob index is in flux.
  4. My wild guess is that Git’s delta compression is path-dependent: the order in which you add files affects how the compression turns out. I suppose it is theoretically possible for delta compression to explore all possibilities and pick the most compact result, but that would seem very expensive in time.
  5. As you get more blobs, you might “cover” more of the “information space”.  In other words, the more blobs you have, the more likely you are to find a good match when new data comes in.
  6. What do the entropy formulas look like for delta compression?
  7. I wonder if the human brain uses delta compression, or something similar to it!
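
The point in 2.2 about needing more bits can be put into numbers with a small sketch. It assumes an idealized store in which every blob gets a fixed-width index just wide enough to tell the blobs apart. That is a simplification for the sake of the argument, not how Git names objects (Git uses fixed-length hashes). The function names and the example sizes are hypothetical.

```python
def pointer_bits(num_blobs: int) -> int:
    """Smallest fixed width, in bits, that can name any one of num_blobs blobs.

    Equivalent to ceil(log2(num_blobs)), computed with integers to avoid
    floating-point rounding.
    """
    if num_blobs < 1:
        raise ValueError("need at least one blob")
    return (num_blobs - 1).bit_length()


def net_saving_bits(raw_bits: int, delta_bits: int, num_blobs: int) -> int:
    """Bits saved by storing a delta plus a pointer to its base instead of the raw blob."""
    return raw_bits - (delta_bits + pointer_bits(num_blobs))


# Hypothetical numbers: a 1000-byte blob whose delta takes 100 bytes.
for n in (2, 256, 65_536, 1_000_000):
    print(n, "blobs:", pointer_bits(n), "pointer bits,",
          net_saving_bits(raw_bits=8000, delta_bits=800, num_blobs=n), "bits saved")
```

In this model the pointer cost grows only logarithmically, so for large blobs the effect is small; it matters most when the blobs, or the deltas between them, are tiny.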

I’d like to see a journal paper that digs into some of these observations. Maybe I will dig around and see what I can find.
