During our spot systems support activities at external companies (companies that already have their own systems engineers on staff but need help with something specific), we very frequently come across shell scripts (usually written for bash) that implement backup and storage procedures and routines.
Beyond the various horrors we have seen, such as those who make backups with rsync or rsnapshot instead of using the modern Borg archiver, or those still stuck in 1996 who believe that backing up a 40 GB MySQL DB with mysqldump is adequate rather than using the great Percona XtraBackup we discussed in this article: MySQL backup slow and server down when Google goes by, a STANDARD error (we will use this word often in this article) that we notice is that of the systems engineer who insists on doing archive storage with the classic, by now obsolete, tar.gz, in the archivename.tar.gz format.
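To make the scenario concrete, here is a minimal sketch of the kind of script we typically find; paths and file names are hypothetical, but the pattern (everything into a dated .tar.gz) is exactly the one described above.

#!/bin/bash
# Hypothetical sketch of the classic backup script described above;
# paths and naming are illustrative, not taken from a real system.
DEST="/backup"
DATE=$(date +%F)

# The "standard" approach: everything into a dated archivename.tar.gz
tar czf "$DEST/backup-$DATE.tar.gz" /var/www /etc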
The problem of data compression
Data compression is a big deal. It is one of the oldest and most studied problems in computer science. Facebook has developed its own compression algorithm, called ZStandard or "zstd", which can be faster and more efficient than the compressors traditionally paired with tar, such as gzip and bzip2.
They are all wrong.
And by all, today, we really do mean everyone. Go read the scripts written by your company's systems engineer, or ask him how they came to be, and you will have to agree with us.
The reason is quite easy to explain, being a combination of factors: systems engineers often do not write their own scripts (and neither do we), but "limit themselves" to looking for something ready-made to customize, on a site such as StackOverflow for example. Those examples are often written by other systems engineers who in turn started from a common base, and very often that common base is antiquated (circa 1990-2000), especially on Linux systems, which in turn inherited many management scripts from their older "brothers", the UNIX systems.
It would be different for someone looking for examples on React, Node.js or AngularJS, for instance: being - for now - relatively new technologies, for obvious logical reasons they cannot have examples and documentation that are "obsolete" in the most literal sense of the term.
Another practical reason why you keep dealing with the usual .tar.gz is that it actually works, and it's no small thing to have something that works nowadays.
The systems engineer, as a professional figure, tends to get bored and unmotivated unless driven by real passion and/or a large salary, which is why most limit themselves to doing their job in a way that is functional to the firm's business objectives, but not the most functional way possible.
Moreover, it does not often happen that a systems engineer (unless he works for a datacenter with thousands of machines) has to treat data compression as a real problem. A typical Italian SME has between 3 and 10 employees: imagine how problematic that 1 GB database of customers and suppliers can be. In short, there is no need to adopt an operational logic of maximizing results while minimizing costs.
In large companies with hundreds or thousands of employees, everything is outsourced, either to the usual consultancy firm, perhaps through 2 or 3 layers of subcontracting, or to the large multinational consulting group that takes over the management without explaining too much of the why or the how. The fact is that in these very noble consulting firms senior systems engineers are rarely seen earning more than 2 thousand euros per month, and this too speaks volumes about the motivation a human being can put into something that is consciously experienced as exploitation.
Finally, let's be honest: even the bravest systems engineers, those who love experimenting and venture out like a seasoned Indiana Jones in search of the lost data compressor, after a couple of false positives stop everything and go back to the classic .tar.gz.
Because even when you go and test other alternatives, you almost always end up evaluating the usual bzip2, xz, rar and zip, and at the end of the tests you always come to one of these conclusions:
- The software does its job well, compressing "enough", but with very high compression times and/or CPU usage.
- The software is fast enough, but it doesn't save me any space compared to my .tar.gz.
Which is why, at the end of the day, you are left with a .tar.gz and all your system scripts calibrated on this technology.
ZStandard, data compression made in Facebook
It is always a pleasure to embrace, or simply study, technologies produced by big technology companies such as Facebook, because they immediately denote great ideas followed by impeccable execution. With the best engineers on the market and substantial investments in research and development, the result is always a guarantee.
Zstandard, or zstd in its shortened version, is a fast lossless compression algorithm, targeting real-time compression scenarios at zlib-level and better compression ratios.
It offers a very wide range of trade-offs between compression and speed, while being supported by a very fast decoder. It also offers a special mode for small data, called dictionary compression, and can create dictionaries from any set of samples.
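As a quick illustration of the dictionary mode mentioned above, here is a minimal sketch using the standard zstd command line options (--train, -D, -d); the sample directory and file names are hypothetical.

# Train a dictionary from a set of small, similar samples (hypothetical paths)
zstd --train samples/*.json -o records.dict

# Compress and decompress a small file using that dictionary
zstd -D records.dict new-record.json -o new-record.json.zst
zstd -D records.dict -d new-record.json.zst -o new-record.restored.json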
Facebook open-sourced Zstandard nearly six years ago with the goal of surpassing zlib in terms of both speed and efficiency. Zstandard 1.5 improves compression speed at intermediate compression levels, improves the compression ratio at higher levels, and brings faster decompression.
Zstandard supports compression levels up to 22. Thanks to a new default match finder, Zstandard 1.5 achieves higher compression speed for levels between 5 and 12 on inputs larger than 256 KB. According to Facebook's benchmarks, the improvements range from +25% to +140%, with no significant losses in terms of compression ratio.
Facebook claims even better results on heavily loaded machines with significant cache contention.
In addition to improving compression performance, Zstandard 1.5 is compiled by default with multithreaded support, standardizes some new APIs, and deprecates a number of previous ones. You can find all the details in the official release notes.
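Since multithreaded support is now compiled in by default, it is worth showing how to take advantage of it from the command line; the file name below is hypothetical, while -T and the level flags are standard zstd options.

# Compress using all available cores (-T0 = auto-detect the number of threads)
zstd -T0 -3 dump.sql -o dump.sql.zst

# Or pin the number of worker threads explicitly, for example 4
zstd -T4 -3 dump.sql -o dump.sql.t4.zst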
In fact, Zstandard was created by Yann Collet at Facebook and has been used in Facebook's production environment since 2015.
In addition to the command line tool, Zstandard is provided as a library for you to easily integrate it into your projects.
The Zstandard library is provided as open source software using a BSD license.
Please refer to the official Facebook link, which introduces not only the project but also its enormous versatility, which has led to its adoption by a sea of other projects:
5 ways Facebook improved compression at scale with Zstandard
Faster backups
One of the biggest advantages of zstd is that it can compress data into smaller archives, faster than other algorithms, which makes backups faster. Many companies have terabytes of data to back up every day, so anything that speeds up backups is a big win for them.
According to Facebook benchmarks, Zstandard outperforms zlib for any combination of compression ratio and bandwidth.
In particular, Zstandard showed exceptional performance compared to zlib when using standard lossless compression:
- it was about 3–5 times faster at the same compression rate
- produced 10–15% smaller files at the same compression rate
- it decompressed 2x faster regardless of the compression ratio
- it scaled to a much higher compression ratio (~4x vs ~3.15).
Zstandard uses Finite State Entropy, based on Jarek Duda's work on asymmetric numeral systems (ANS), for entropy encoding. ANS aims to "end the trade-off between speed and compression rate" and can be used both for precise encoding and for very fast encoding, with support for data encryption. But behind Zstandard's better performance there are a number of other design and implementation choices:
- While zlib is limited to a 32KB window, Zstandard takes advantage of much greater memory availability in modern environments, including mobile and embedded environments, and imposes no inherent limitations
- a new Huffman decoder, Huff0, is used to decode symbols in parallel thanks to multiple ALUs, reducing data dependencies between arithmetic operations
- Zstandard tries to be as branch-free as possible, thus minimizing the costly pipeline flushes caused by branch mispredictions. For example, here is how a while loop can be rewritten without using branches:

/* classic version */
while (nbBitsUsed >= 8) {   /* each while test is a branch */
    accumulator <<= 8;
    accumulator += *byte++;
    nbBitsUsed  -= 8;
}

/* branch-less version */
nbBytesUsed = nbBitsUsed >> 3;
nbBitsUsed &= 7;
ptr += nbBytesUsed;
accumulator = read64(ptr);
- repcode modeling greatly improves the compression of sequences that differ only by a few bytes
Zstandard is both a command line tool and a library, both written in C. It provides more than 20 compression levels that let you finely tune its use to the actual hardware available, the data to be compressed and the bottlenecks to be optimized. Facebook recommends starting with the default level 3, which is suitable for most cases, and then trying higher levels up to level 9 for a reasonable compromise between speed and space, or higher still for better compression ratios, reserving levels 20+ for those cases where compression speed does not matter.
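To make that trade-off tangible, here is a minimal sketch using the standard zstd level flags; the file name is hypothetical, and the right level for you depends on your own data and hardware.

# Default level 3: fast, good enough for most cases
zstd -3 dump.sql -o dump.sql.l3.zst

# Level 9: better ratio, still reasonable speed
zstd -9 dump.sql -o dump.sql.l9.zst

# Levels above 19 require --ultra and are much slower
zstd --ultra -22 -T0 dump.sql -o dump.sql.l22.zst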
Collet and Turner also provided some hints about what future versions of Zstandard will bring, including support for multi-threading and new compression levels that allow for faster compression and higher ratios.
Let's do a real test
Beyond all the theoretical disquisitions we can produce, what counts for a systems engineer is having a tool that is easy to use and extremely efficient at doing its job.
We are talking about a 6-core / 12-thread system with the filesystem on RAID 6 over SATA disks (not NVMe). The zstd utility was installed on CentOS 7 from the REMI repository with: yum install zstd.
We have therefore produced this documentation starting from a test that targets the /var/lib/mysql directory of a very normal MariaDB DBMS belonging to one of our real customers.
This directory is 80 GB in size and is compressed first with the tar and gzip utilities and then with tar and zstd.
In each compression run we measure three fundamental parameters:
- The exact syntax of the command
- The compression time
- The size of the archive produced
Compression via tar and gzip
time tar czvf test.tar.gz /var/lib/mysql
real 19m19.862s
10,121,258,046 Mar 11 16:28 test.tar.gz
We can see that the compression operation took 19 minutes and 19 seconds and produced a 10.12 GB file.
Compression via tar and zstd
time tar --use-compress-program zstd -cf test.tar.zst /var/lib/mysql
real 6m24.006s
5,727,249,564 Mar 11 16:47 test.tar.zst
We can see that the compression operation took 6 minutes and 24 seconds and produced a 5.7 GB file.
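For completeness, here is how such an archive can be extracted; the restore directory is hypothetical, and on recent versions of GNU tar the --zstd option (or automatic format detection) can be used instead of --use-compress-program.

# Extract the .tar.zst archive produced above (restore path is hypothetical)
mkdir -p /restore/mysql-test
tar --use-compress-program unzstd -xf test.tar.zst -C /restore/mysql-test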
Conclusions of the test
Even if this is only a single test, on an SQL database and therefore mostly text with high redundancy, the results are clear: ZStandard produced, in just one third of the time, an archive that weighs half of the one produced with tar and gzip.
You can understand how important it is, for our backup and disaster recovery policies covering over a thousand machines with a data retention of 60 days, to have a system as fast and efficient as this.
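As a closing sketch, this is roughly how the classic script shown at the beginning of the article could be rewritten around tar and zstd, including a simple 60-day retention cleanup; paths, naming and the retention value are hypothetical and should be adapted to your own policies.

#!/bin/bash
# Hypothetical zstd-based rewrite of the classic backup script;
# paths, naming and retention are illustrative only.
DEST="/backup"
DATE=$(date +%F)
RETENTION_DAYS=60

# Stream the tar archive into zstd (level 3, all cores) instead of gzip
tar -cf - /var/lib/mysql | zstd -3 -T0 -o "$DEST/backup-$DATE.tar.zst"

# Simple retention: drop archives older than 60 days
find "$DEST" -name "backup-*.tar.zst" -mtime +"$RETENTION_DAYS" -delete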