Tarsplit is a utility I wrote that splits UNIX tarfiles into multiple parts while keeping the files inside them intact. I wrote it specifically because the other ways I found of splitting up tarballs didn’t keep the individual files intact, which would not play nicely with Docker.
But what does Docker have to do with tar?
While building the Docker images for my Splunk Lab project, I noticed that one of the layers was something like a Gigabyte in size! While Docker can handle large layers, the issue becomes how long it takes to push or pull an image. Docker does validation on each layer as it’s received, and the validation of a 1 GB layer took 20-30 wall clock seconds. Not fun.
It occurred to me that if I could split up that layer into, say, 10 layers of 100 Megabytes each, then Docker would transfer about 3 layers in parallel, and while one layer is being validated post-transfer, the next layer would start being transferred. The end result is fewer wall clock seconds to transfer an entire Docker image.
But there was an issue: Splunk and its applications are big. A few of the tarballs are hundreds of Megabytes in size, and I needed a way to split those tarballs into smaller tarballs without corrupting the files in them. This led to Tarsplit being written.
How to Install Tarsplit
If you’re running Homebrew on macOS or Linux, here’s how to install Tarsplit:
curl https://raw.githubusercontent.com/dmuth/tarsplit/main/Formula/tarsplit.rb > tarsplit.rb \
  && brew install ./tarsplit.rb
If you’d prefer to install Tarsplit manually, that is also possible:
curl https://raw.githubusercontent.com/dmuth/tarsplit/main/tarsplit > /usr/local/bin/tarsplit \
  && chmod 755 /usr/local/bin/tarsplit
The source is also available on GitHub: https://github.com/dmuth/tarsplit
How Tarsplit Works
Python ships with a module called tarfile, which provides a high-level interface to tarballs. I used that module to read the contents of the tarball being split, work out a target chunk size, and write the files out as separate tarballs of roughly equal size. This is all done in a single thread.
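To make that concrete, here is a minimal sketch of the idea using only the standard tarfile module. It is not Tarsplit's actual source: the function name `split_tar`, the gzip output mode, and the exact rule for starting a new part are my assumptions for illustration.

```python
import tarfile


def split_tar(path, num_chunks):
    """Split the tarball at `path` into up to `num_chunks` smaller tarballs.

    Hypothetical helper, not Tarsplit's real code. Every member is written
    whole to exactly one part, so no file is cut in half. Returns the list
    of part filenames that were written.
    """
    with tarfile.open(path, "r:*") as src:
        members = src.getmembers()
        total = sum(m.size for m in members)
        # Target size per part, based on uncompressed member sizes.
        chunk_size = max(1, total // num_chunks)

        parts = []
        out = None
        written = 0
        for member in members:
            # Start a new part once the current one has reached the
            # target, unless we are already writing the last part.
            if out is None or (written >= chunk_size and len(parts) < num_chunks):
                if out is not None:
                    out.close()
                name = f"{path}-part-{len(parts) + 1:02d}-of-{num_chunks:02d}"
                out = tarfile.open(name, "w:gz")
                parts.append(name)
                written = 0
            # Only regular files carry data; directories and links do not.
            fileobj = src.extractfile(member) if member.isreg() else None
            out.addfile(member, fileobj)
            written += member.size
        if out is not None:
            out.close()
    return parts
```

Calling `split_tar("some-archive.tgz", 10)` would write parts named like the ones shown later in this post. A sketch like this ignores edge cases such as hard links whose target ends up in a different part.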
Why not use multi-threading?
Yeah, I tried that after release 1.0. It turns out that even when using every trick I knew, a multithreaded approach consisting of one thread per chunk to be written was slower than just doing everything in a single thread. I observed this on a 10-core machine with an SSD, so I’m just gonna go ahead and point the finger at the GIL and remind myself that threading in Python is cursed.
Tarsplit In Action
The syntax of Tarsplit is fairly straightforward:
$ tarsplit
usage: tarsplit [-h] [--dry-run] file num
Here’s what it looks like in action on a tar file:
$ tarsplit splunk-188.8.131.52-24fd52428b5a-Linux-x86_64.tgz 10
Welcome to Tarsplit!
Reading file splunk-184.108.40.206-24fd52428b5a-Linux-x86_64.tgz...
Total uncompressed file size: 1407526406 bytes, num chunks: 10, chunk size: 140752640 bytes
10 files written to splunk-220.127.116.11-24fd52428b5a-Linux-x86_64.tgz-part-01-of-10
20 files written to splunk-18.104.22.168-24fd52428b5a-Linux-x86_64.tgz-part-01-of-10
[snip]
3000 files written to splunk-22.214.171.124-24fd52428b5a-Linux-x86_64.tgz-part-01-of-10
Successfully wrote 140813431 bytes in 3299 files to splunk-126.96.36.199-24fd52428b5a-Linux-x86_64.tgz-part-01-of-10
10 files written to splunk-188.8.131.52-24fd52428b5a-Linux-x86_64.tgz-part-02-of-10
[snip]
Successfully wrote 142518553 bytes in 35 files to splunk-184.108.40.206-24fd52428b5a-Linux-x86_64.tgz-part-09-of-10
10 files written to splunk-220.127.116.11-24fd52428b5a-Linux-x86_64.tgz-part-10-of-10
20 files written to splunk-18.104.22.168-24fd52428b5a-Linux-x86_64.tgz-part-10-of-10
30 files written to splunk-22.214.171.124-24fd52428b5a-Linux-x86_64.tgz-part-10-of-10
Successfully wrote 59287376 bytes in 30 files to splunk-126.96.36.199-24fd52428b5a-Linux-x86_64.tgz-part-10-of-10
And that’s… pretty much it! The chunks which the tarball was split into will reside in the same directory:
485M Dec 26 14:08 splunk-188.8.131.52-24fd52428b5a-Linux-x86_64.tgz
 32M Dec 26 14:09 splunk-184.108.40.206-24fd52428b5a-Linux-x86_64.tgz-part-01-of-10
 36M Dec 26 14:10 splunk-220.127.116.11-24fd52428b5a-Linux-x86_64.tgz-part-02-of-10
 45M Dec 26 14:10 splunk-18.104.22.168-24fd52428b5a-Linux-x86_64.tgz-part-03-of-10
 25M Dec 26 14:10 splunk-22.214.171.124-24fd52428b5a-Linux-x86_64.tgz-part-04-of-10
 54M Dec 26 14:10 splunk-126.96.36.199-24fd52428b5a-Linux-x86_64.tgz-part-05-of-10
 43M Dec 26 14:10 splunk-188.8.131.52-24fd52428b5a-Linux-x86_64.tgz-part-06-of-10
 53M Dec 26 14:10 splunk-184.108.40.206-24fd52428b5a-Linux-x86_64.tgz-part-07-of-10
104M Dec 26 14:11 splunk-220.127.116.11-24fd52428b5a-Linux-x86_64.tgz-part-08-of-10
 67M Dec 26 14:11 splunk-18.104.22.168-24fd52428b5a-Linux-x86_64.tgz-part-09-of-10
 23M Dec 26 14:11 splunk-22.214.171.124-24fd52428b5a-Linux-x86_64.tgz-part-10-of-10
Note that not all the resulting chunks are the same size. This is due to the underlying files being different sizes. Tarballs with large files in them will see this to a greater degree than tarballs with smaller files.
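A toy model shows why whole-file chunking gives uneven parts. The function below is a hypothetical illustration, not Tarsplit's code, and the sizes are made up: it fills each chunk until it reaches the target size and never splits a file. The directory listing above also shows compressed sizes, so differences in how well each part compresses likely add to the spread.

```python
def assign_to_chunks(sizes, num_chunks):
    """Group file sizes into chunks without splitting any single file.

    Hypothetical helper for illustration. Returns the total size of
    each chunk, in order.
    """
    chunk_size = max(1, sum(sizes) // num_chunks)
    chunks = [[]]
    for size in sizes:
        # Move on once the current chunk has reached the target size,
        # but never create more than num_chunks chunks.
        if sum(chunks[-1]) >= chunk_size and len(chunks) < num_chunks:
            chunks.append([])
        chunks[-1].append(size)
    return [sum(chunk) for chunk in chunks]


# Many small files: every chunk lands on the target.
print(assign_to_chunks([10] * 30, 3))  # → [100, 100, 100]

# One large file among small ones: the chunk holding it overshoots,
# and the last chunk is left with whatever remains.
print(assign_to_chunks([50, 50, 250, 50, 50, 50, 50, 50], 3))  # → [350, 200, 50]
```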
Tarsplit With Docker
How do things look in Docker? This is what I now see in Docker while pushing the Splunk Lab image:
While there are more layers in the Docker image, the layers are smaller, and I get the parallelism that I didn’t have when uploading a Gigabyte-sized layer. So while my build process is now more complicated, I consider this a net gain given the time saved.
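If you are wondering how the parts can become separate layers, one possible approach (not necessarily what Splunk Lab does) is to give each part its own ADD instruction in the Dockerfile, since ADD unpacks a local tar archive into the destination and each instruction produces its own layer. The helper name and the `/opt/` destination below are assumptions for illustration.

```python
def dockerfile_add_lines(part_names, dest="/opt/"):
    """Return one Dockerfile ADD instruction per tarball part.

    Hypothetical helper: `dest` is an assumed extraction directory.
    Parts are sorted so the layers are added in part order.
    """
    return [f"ADD {name} {dest}" for name in sorted(part_names)]


# Example with made-up part names:
for line in dockerfile_add_lines(["app.tgz-part-02-of-02", "app.tgz-part-01-of-02"]):
    print(line)
```

This sketch assumes filenames without spaces, and the generated lines would still need to be pasted or templated into a real Dockerfile.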
I hope you find this utility useful. I had fun writing it, and I enjoy the ability to make my Docker images just a little more manageable. 🙂