splitcompress - A shell script to split and compress file(s)
Why use splitcompress?
splitcompress does exactly that - it splits (presumably large) source file(s) into multiple parts - typically,
several parts per source file - and compresses each of the parts.
The reasons one might want to do this are:
- To save space - e.g., to be able to store N times as many backup files as storing them uncompressed
- To save network bandwidth - e.g., copying compressed files uses less bandwidth
- To save elapsed time copying - one can copy completely compressed parts as soon as they are finished
(while other parts are still compressing)
- When you have spare CPU cycles but not spare network bandwidth and/or disk space
splitcompress Features
splitcompress has quite a few options to configure. Some of them are:
- The number of parallel (simultaneous) compressions to run - use as many processors (cores) as you like
- The checksum command (by default, splitcompress checksums the result to confirm the compressed parts are valid)
- The file part size
- The read block size
- The intermediate block size
- The compress and uncompress commands
- The output filename format
- The minimum time a source file must have been idle before splitcompress will process it
- The width of splitcompress's log file
splitcompress Usage Message
$ splitcompress
usage: splitcompress [OPTIONS] file ...
where OPTIONS are:
-c Compress_command (default 'xz')
-d number_of_numeric Digits (default 4 [1 to 10])
-i Intermediate_block_size (default '64K' [1 to 67108864]) [*]
-k checKsum_command (default 'sha256sum')
-n Number_of_parallel_processes (default 2 [1 to 32])
-r Read_block_size (default '1M' [1 to 2147479552]) [*]
-s part_Size, uncompressed (default '2G' [1 to 137438953472]) [*]
-t minimum_idle_Time (default '10 seconds ago')
-u Uncompress_command (default 'xzcat')
-v Verbosity level (default 2)
-C Compressed summary name (default ':compressed_summary')
-D Destination_directory (default '.')
-N partition Name (default '_compressed_part_')
-U Uncompressed summary name (default ':uncompressed_summary')
-w Wide (do not wrap the) output lines
-x do not calculate (eXclude) the uncompressed checksums
[*] Legal size units are: k, K, m, M, g, and G
Download splitcompress from the GitHub repository.
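The size options marked [*] in the usage message take a number with an optional unit letter. As a rough illustration of how such a value maps to bytes (lower-case letters as powers of 1000, upper-case letters as powers of 1024, as described in the Notes), here is a minimal sketch. The function name size_to_bytes is hypothetical and is not taken from the splitcompress script itself.

```shell
#!/bin/sh
# size_to_bytes is a hypothetical helper, not part of splitcompress.
# It converts a size such as 150k, 64K, 50M or 2G into a byte count.
# Lower-case suffixes are powers of 1000; upper-case are powers of 1024.
size_to_bytes() {
    size=$1
    case $size in
        *k) mult=1000 ;;
        *K) mult=1024 ;;
        *m) mult=1000000 ;;
        *M) mult=1048576 ;;
        *g) mult=1000000000 ;;
        *G) mult=1073741824 ;;
        *)  mult=1 ;;
    esac
    # Strip the trailing unit letter, if any, leaving only the digits.
    digits=${size%[kKmMgG]}
    case $digits in
        ''|*[!0-9]*) echo "invalid size: $size" >&2; return 1 ;;
    esac
    echo $((digits * mult))
}

size_to_bytes 150k   # prints 150000
size_to_bytes 64K    # prints 65536
size_to_bytes 50M    # prints 52428800
size_to_bytes 2G     # prints 2147483648
```

The 50M and 64K results match the ibs=52428800 and obs=65536 values that appear in the dd commands of the example logs.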
splitcompress Examples
-
Split bigfile into 1GiB parts (and then compress them) using 8 processors:
$ splitcompress -s 1G -n 8 bigfile
-
Split bigfile into 2GiB parts (and then compress them) using 4 processors and use 2 digits for the
variable part of the compressed part file names:
$ splitcompress -s 2G -n 4 -d 2 bigfile
-
As above, but use a read block size of 32KiB and an intermediate block size of 64KiB:
$ splitcompress -s2G -n4 -d2 -r32K -i64K bigfile
-
Split file foo (not really very big at all) into 500MiB parts and compress each one. Use a read block
size of 50MiB, 4 processes, 1 digit, sha1sum as the checksum utility, and wide output:
$ splitcompress -s 500M -r 50M -n 4 -d 1 -k sha1sum -w foo
17:40:05[1] Called: splitcompress -s 500M -r 50M -n 4 -d 1 -k sha1sum -w foo
17:40:05[1]
17:40:05[1] Processing [foo] 2021/12/03 ========================================
17:40:05[2] Starting split/compress of [foo]: 3455200476 B, 7 parts)
17:40:05[2] nice dd if='foo' ibs=52428800 count=10 skip=000000 obs=65536 2> tmp_8289_proc_00 | nice xz > 'foo_compressed_part_0.xz'
17:40:05[2] nice dd if='foo' ibs=52428800 count=10 skip=000010 obs=65536 2> tmp_8289_proc_01 | nice xz > 'foo_compressed_part_1.xz'
17:40:05[2] nice dd if='foo' ibs=52428800 count=10 skip=000030 obs=65536 2> tmp_8289_proc_03 | nice xz > 'foo_compressed_part_3.xz'
17:40:05[2] nice dd if='foo' ibs=52428800 count=10 skip=000020 obs=65536 2> tmp_8289_proc_02 | nice xz > 'foo_compressed_part_2.xz'
17:40:30[2] nice dd if='foo' ibs=52428800 count=10 skip=000060 obs=65536 2> tmp_8289_proc_02 | nice xz > 'foo_compressed_part_6.xz'
17:40:30[2] nice dd if='foo' ibs=52428800 count=10 skip=000040 obs=65536 2> tmp_8289_proc_00 | nice xz > 'foo_compressed_part_4.xz'
17:40:30[2] nice dd if='foo' ibs=52428800 count=10 skip=000050 obs=65536 2> tmp_8289_proc_01 | nice xz > 'foo_compressed_part_5.xz'
17:40:54[2] Starting checksum of source file [foo]
17:40:54[2] nice sha1sum 'foo' > 'foo.sha1sum'
17:40:57[2] Starting checksum comparison of [foo]
17:40:57[2] nice xzcat 'foo_compressed_part_'*.xz | sha1sum
17:41:01[1] Checksums of 'foo' match !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
17:41:01[2] Starting checksums of compressed parts: 'foo_compressed_part_'*.xz
17:41:01[2] nice sha1sum 'foo_compressed_part_'*.xz >'foo:compressed_summary.sha1sum'
17:41:01[2] Starting checksums of uncompressed parts of source file [foo]
17:41:01[2] nice dd if='foo' ibs=52428800 count=10 skip=000000 obs=65536 2> tmp_8289_proc_00 | nice sha1sum > 'foo_uncompressed_part_8289_0.sha1sum'
17:41:01[2] nice dd if='foo' ibs=52428800 count=10 skip=000010 obs=65536 2> tmp_8289_proc_01 | nice sha1sum > 'foo_uncompressed_part_8289_1.sha1sum'
17:41:01[2] nice dd if='foo' ibs=52428800 count=10 skip=000020 obs=65536 2> tmp_8289_proc_02 | nice sha1sum > 'foo_uncompressed_part_8289_2.sha1sum'
17:41:01[2] nice dd if='foo' ibs=52428800 count=10 skip=000030 obs=65536 2> tmp_8289_proc_03 | nice sha1sum > 'foo_uncompressed_part_8289_3.sha1sum'
17:41:01[2] nice dd if='foo' ibs=52428800 count=10 skip=000040 obs=65536 2> tmp_8289_proc_00 | nice sha1sum > 'foo_uncompressed_part_8289_4.sha1sum'
17:41:01[2] nice dd if='foo' ibs=52428800 count=10 skip=000060 obs=65536 2> tmp_8289_proc_02 | nice sha1sum > 'foo_uncompressed_part_8289_6.sha1sum'
17:41:01[2] nice dd if='foo' ibs=52428800 count=10 skip=000050 obs=65536 2> tmp_8289_proc_01 | nice sha1sum > 'foo_uncompressed_part_8289_5.sha1sum'
17:41:02[2] Compressed to 511336 bytes : 0.01% of 3455200476 bytes or 6757.20:1
$ ls -l foo*
-rw-rw-r--. 1 tux tux 3455200476 Dec 3 17:28 foo
-rw-rw-r--. 1 tux tux 77504 Dec 3 17:28 foo_compressed_part_0.xz
-rw-rw-r--. 1 tux tux 77512 Dec 3 17:28 foo_compressed_part_1.xz
-rw-rw-r--. 1 tux tux 77512 Dec 3 17:28 foo_compressed_part_2.xz
-rw-rw-r--. 1 tux tux 77524 Dec 3 17:28 foo_compressed_part_3.xz
-rw-rw-r--. 1 tux tux 77504 Dec 3 17:28 foo_compressed_part_4.xz
-rw-rw-r--. 1 tux tux 77512 Dec 3 17:28 foo_compressed_part_5.xz
-rw-rw-r--. 1 tux tux 46268 Dec 3 17:28 foo_compressed_part_6.xz
-rw-rw-r--. 1 tux tux 469 Dec 3 17:41 foo:compressed_summary.sha1sum
-rw-rw-r--. 1 tux tux 46 Dec 3 17:40 foo.sha1sum
-rw-rw-r--. 1 tux tux 462 Dec 3 17:41 foo:uncompressed_summary.sha1sum
-
Split file bar into 150kB parts (and compress each one). Use a read block size of 25kB, 3 processors,
set the number of digits to 6, specify values for the compressed, uncompressed and part file names, and use wide output.
No -k option is given, so the default sha256sum is used as the checksum utility:
$ splitcompress -s 150k -r 25k -n 3 -d 6 -C ,cmp_sum -N -cmp_chunk -U -uncmp_sum -w bar
17:48:33[1] Called: splitcompress -s 150k -r 25k -n 3 -d 6 -C ,cmp_sum -N -cmp_chunk -U -uncmp_sum -w bar
17:48:33[1]
17:48:33[1] Processing [bar] 2021/12/03 ========================================
17:48:33[2] Starting split/compress of [bar]: 1481040 B, 10 parts)
17:48:33[2] nice dd if='bar' ibs=25000 count=6 skip=000000 obs=65536 2> tmp_12979_proc_00 | nice xz > 'bar-cmp_chunk000000.xz'
17:48:33[2] nice dd if='bar' ibs=25000 count=6 skip=000006 obs=65536 2> tmp_12979_proc_01 | nice xz > 'bar-cmp_chunk000001.xz'
17:48:33[2] nice dd if='bar' ibs=25000 count=6 skip=000012 obs=65536 2> tmp_12979_proc_02 | nice xz > 'bar-cmp_chunk000002.xz'
17:48:33[2] nice dd if='bar' ibs=25000 count=6 skip=000018 obs=65536 2> tmp_12979_proc_00 | nice xz > 'bar-cmp_chunk000003.xz'
17:48:33[2] nice dd if='bar' ibs=25000 count=6 skip=000030 obs=65536 2> tmp_12979_proc_02 | nice xz > 'bar-cmp_chunk000005.xz'
17:48:33[2] nice dd if='bar' ibs=25000 count=6 skip=000024 obs=65536 2> tmp_12979_proc_01 | nice xz > 'bar-cmp_chunk000004.xz'
17:48:33[2] nice dd if='bar' ibs=25000 count=6 skip=000036 obs=65536 2> tmp_12979_proc_00 | nice xz > 'bar-cmp_chunk000006.xz'
17:48:33[2] nice dd if='bar' ibs=25000 count=6 skip=000048 obs=65536 2> tmp_12979_proc_02 | nice xz > 'bar-cmp_chunk000008.xz'
17:48:33[2] nice dd if='bar' ibs=25000 count=6 skip=000042 obs=65536 2> tmp_12979_proc_01 | nice xz > 'bar-cmp_chunk000007.xz'
17:48:33[2] nice dd if='bar' ibs=25000 count=6 skip=000054 obs=65536 2> tmp_12979_proc_00 | nice xz > 'bar-cmp_chunk000009.xz'
17:48:33[2] Starting checksum of source file [bar]
17:48:33[2] nice sha256sum 'bar' > 'bar.sha256sum'
17:48:33[2] Starting checksum comparison of [bar]
17:48:33[2] nice xzcat 'bar-cmp_chunk'*.xz | sha256sum
17:48:33[1] Checksums of 'bar' match !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
17:48:33[2] Starting checksums of compressed parts: 'bar-cmp_chunk'*.xz
17:48:33[2] nice sha256sum 'bar-cmp_chunk'*.xz >'bar,cmp_sum.sha256sum'
17:48:33[2] Starting checksums of uncompressed parts of source file [bar]
17:48:33[2] nice dd if='bar' ibs=25000 count=6 skip=000000 obs=65536 2> tmp_12979_proc_00 | nice sha256sum > 'bar_uncompressed_part_12979_000000.sha256sum'
17:48:33[2] nice dd if='bar' ibs=25000 count=6 skip=000006 obs=65536 2> tmp_12979_proc_01 | nice sha256sum > 'bar_uncompressed_part_12979_000001.sha256sum'
17:48:33[2] nice dd if='bar' ibs=25000 count=6 skip=000012 obs=65536 2> tmp_12979_proc_02 | nice sha256sum > 'bar_uncompressed_part_12979_000002.sha256sum'
17:48:33[2] nice dd if='bar' ibs=25000 count=6 skip=000024 obs=65536 2> tmp_12979_proc_01 | nice sha256sum > 'bar_uncompressed_part_12979_000004.sha256sum'
17:48:33[2] nice dd if='bar' ibs=25000 count=6 skip=000018 obs=65536 2> tmp_12979_proc_00 | nice sha256sum > 'bar_uncompressed_part_12979_000003.sha256sum'
17:48:33[2] nice dd if='bar' ibs=25000 count=6 skip=000030 obs=65536 2> tmp_12979_proc_02 | nice sha256sum > 'bar_uncompressed_part_12979_000005.sha256sum'
17:48:33[2] nice dd if='bar' ibs=25000 count=6 skip=000036 obs=65536 2> tmp_12979_proc_00 | nice sha256sum > 'bar_uncompressed_part_12979_000006.sha256sum'
17:48:33[2] nice dd if='bar' ibs=25000 count=6 skip=000042 obs=65536 2> tmp_12979_proc_01 | nice sha256sum > 'bar_uncompressed_part_12979_000007.sha256sum'
17:48:33[2] nice dd if='bar' ibs=25000 count=6 skip=000048 obs=65536 2> tmp_12979_proc_02 | nice sha256sum > 'bar_uncompressed_part_12979_000008.sha256sum'
17:48:33[2] nice dd if='bar' ibs=25000 count=6 skip=000054 obs=65536 2> tmp_12979_proc_00 | nice sha256sum > 'bar_uncompressed_part_12979_000009.sha256sum'
17:48:33[2] Compressed to 12472 bytes : 0.84% of 1481040 bytes or 118.75:1
$ ls -ls bar*
1448 -rw-rw-r--. 1 tux tux 1481040 Dec 3 17:43 bar
4 -rw-rw-r--. 1 tux tux 1240 Dec 3 17:43 bar-cmp_chunk000000.xz
4 -rw-rw-r--. 1 tux tux 1244 Dec 3 17:43 bar-cmp_chunk000001.xz
4 -rw-rw-r--. 1 tux tux 1244 Dec 3 17:43 bar-cmp_chunk000002.xz
4 -rw-rw-r--. 1 tux tux 1244 Dec 3 17:43 bar-cmp_chunk000003.xz
4 -rw-rw-r--. 1 tux tux 1248 Dec 3 17:43 bar-cmp_chunk000004.xz
4 -rw-rw-r--. 1 tux tux 1256 Dec 3 17:43 bar-cmp_chunk000005.xz
4 -rw-rw-r--. 1 tux tux 1260 Dec 3 17:43 bar-cmp_chunk000006.xz
4 -rw-rw-r--. 1 tux tux 1256 Dec 3 17:43 bar-cmp_chunk000007.xz
4 -rw-rw-r--. 1 tux tux 1240 Dec 3 17:43 bar-cmp_chunk000008.xz
4 -rw-rw-r--. 1 tux tux 1240 Dec 3 17:43 bar-cmp_chunk000009.xz
4 -rw-rw-r--. 1 tux tux 890 Dec 3 17:48 bar,cmp_sum.sha256sum
4 -rw-rw-r--. 1 tux tux 70 Dec 3 17:48 bar.sha256sum
4 -rw-rw-r--. 1 tux tux 950 Dec 3 17:48 bar-uncmp_sum.sha256sum
The file names above are particularly poor choices. They are only included to demonstrate that the names can be changed. Recommendation: use the defaults!
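The count and skip values in the dd commands logged above follow from the file size, the part size and the read block size. The sketch below reproduces that arithmetic for the two examples. It is an illustration inferred from the log output, not code from splitcompress, and the function name plan_parts is hypothetical.

```shell
#!/bin/sh
# plan_parts is a hypothetical helper, not splitcompress's own code.
# Arguments: file size, part size and read block size, all in bytes.
# Prints one line per part: part index, dd count, dd skip.
plan_parts() {
    file_size=$1 part_size=$2 read_block=$3
    # Read blocks per part (assumes part_size is a multiple of read_block).
    count=$((part_size / read_block))
    # Number of parts, rounded up so the final short part is included.
    parts=$(( (file_size + part_size - 1) / part_size ))
    i=0
    while [ "$i" -lt "$parts" ]; do
        # dd's skip is measured in input blocks of ibs bytes.
        echo "part=$i count=$count skip=$((i * count))"
        i=$((i + 1))
    done
}

# foo: 3455200476 bytes, -s 500M (524288000 bytes), -r 50M (52428800 bytes)
plan_parts 3455200476 524288000 52428800
```

For foo this prints 7 lines, ending with part=6 count=10 skip=60, which agrees with the "7 parts" and skip=000060 in the log. Calling plan_parts 1481040 150000 25000 gives the 10 parts of bar, the last one with count=6 and skip=54.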
splitcompress Notes
- splitcompress calls dd to read source files. dd has a maximum read block size limitation of 2147479552 bytes.
- N instances of dd (N being the -n value) read different blocks of each source file simultaneously. Each instance of dd pipes its output
to the compression program (default xz), and the compressed output is saved to a compressed chunk file. This approach means that no space is
required to store uncompressed chunks.
- See the usage message for default values for: the number of parallel processes, the various block sizes, compression utility,
uncompression utility, checksum utility, etc.
- Lower-case units (k, m, g) are powers of 1000, and upper-case units (K, M, G) are powers of 1024, so: k=kB, K=KiB, m=MB, M=MiB, g=GB and G=GiB
- Because the part file compressions and checksums are called asynchronously, output log messages can appear in non-deterministic order.
As long as the checksum(s) match, all is well.
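The dd-to-compressor pipeline described in the notes above can be sketched as follows. This is a simplified illustration under stated assumptions, not the splitcompress script: the function name split_compress_sketch is hypothetical, it starts all parts at once rather than limiting itself to N processes, it always uses 4 digits in the part names, and it skips the checksum steps.

```shell
#!/bin/sh
# split_compress_sketch is a hypothetical, simplified illustration; it is
# not the splitcompress script. It starts every part at once instead of
# limiting the number of parallel processes.
# Usage: split_compress_sketch FILE PART_BYTES READ_BLOCK_BYTES COMPRESS_COMMAND SUFFIX
split_compress_sketch() {
    file=$1 part_size=$2 read_block=$3 compress=$4 suffix=$5
    file_size=$(wc -c < "$file")
    # Read blocks per part (assumes part_size is a multiple of read_block).
    count=$((part_size / read_block))
    [ "$count" -ge 1 ] || { echo "part size smaller than read block" >&2; return 1; }
    # Number of parts, rounded up so the final short part is included.
    parts=$(( (file_size + part_size - 1) / part_size ))
    i=0
    while [ "$i" -lt "$parts" ]; do
        out=$(printf '%s_compressed_part_%04d.%s' "$file" "$i" "$suffix")
        # Each dd reads its own region of the file and pipes it straight
        # into the compressor, so no uncompressed part is written to disk.
        nice dd if="$file" ibs="$read_block" count="$count" \
            skip=$((i * count)) obs=65536 2>/dev/null |
            nice $compress > "$out" &
        i=$((i + 1))
    done
    wait   # block until every background pipeline has finished
}
```

For example, split_compress_sketch bigfile 1073741824 1048576 xz xz would write bigfile_compressed_part_0000.xz and so on. As in the log output shown in the examples, piping xzcat bigfile_compressed_part_*.xz into sha256sum should then print the same hash as sha256sum < bigfile.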