TOAST’s new compression algorithm LZ4 in PostgreSQL 14: how fast can it be?

Update: July 2, 2023

For column compression, PostgreSQL 14 provides a new compression method, LZ4. Compared with the existing PGLZ method used by TOAST, LZ4 compression is faster. This article describes how to use this option and compares its performance with the other compression algorithm.

Background

In PG, the page is the unit for storing data, and the default page size is 8KB. In general, a row of data is not allowed to span pages. However, some variable-length data types may hold values that exceed the size of one page. To overcome this limitation, large field values are compressed and/or split across multiple physical rows. This technique is TOAST:

By default, if a table has variable-length columns and the size of a row exceeds TOAST_TUPLE_THRESHOLD (default 2KB), TOAST is triggered. The data is compressed first; if it is still too large after compression, it is stored out of line. Note that if a column's storage strategy is EXTERNAL or PLAIN, compression is disabled.

Before PG 14, TOAST supported only one compression algorithm, PGLZ (a PG built-in algorithm). But other compression algorithms may be faster than PGLZ or have higher compression ratios. PG 14 adds a new compression option, LZ4, a lossless compression algorithm known for its speed. So we can expect it to speed up TOAST compression and decompression.

How to use LZ4?

To use the LZ4 compression feature, you need to specify --with-lz4 when compiling PG and have the LZ4 library installed on the operating system. The default TOAST compression algorithm of a PG instance is controlled by the GUC parameter default_toast_compression. It can be configured in postgresql.conf, or changed for only the current connection with the SET command:

postgres=# SET default_toast_compression=lz4;

SET

Specify a column's compression algorithm with the COMPRESSION clause when creating a table in CREATE TABLE:
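The original example statement is not reproduced here; based on the columns described below, it presumably looked like the following (the exact data types are assumptions; id is a fixed-length type, which does not support compression):

postgres=# CREATE TABLE tbl (id int,
postgres(#                   col1 text COMPRESSION pglz,
postgres(#                   col2 text COMPRESSION lz4,
postgres(#                   col3 text);

CREATE TABLE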

We can use the \d+ meta-command to see the compression algorithm for each column. If a column does not support compression, or no algorithm was specified, the Compression column shows a blank. In the example above, the id column does not support compression, col1 uses PGLZ, col2 uses LZ4, and col3 has no algorithm specified, so it will use the default compression algorithm.

You can change a column's compression algorithm with ALTER TABLE, but note that the new algorithm only affects data inserted after the command is executed.

postgres=# INSERT INTO tbl VALUES (1, repeat('abc',1000), repeat('abc',1000),repeat('abc',1000));

INSERT 0 1

postgres=# ALTER TABLE tbl ALTER COLUMN col1 SET COMPRESSION lz4;

ALTER TABLE

postgres=# INSERT INTO tbl VALUES (2, repeat('abc',1000), repeat('abc',1000),repeat('abc',1000));

INSERT 0 1

postgres=# SELECT id,

postgres-#        pg_column_compression(id)   AS compression_colid,

postgres-#        pg_column_compression(col1) AS compression_col1,

postgres-#        pg_column_compression(col2) AS compression_col2,

postgres-#        pg_column_compression(col3) AS compression_col3

postgres-# FROM tbl;

 id | compression_colid | compression_col1 | compression_col2 | compression_col3
----+-------------------+------------------+------------------+------------------
  1 |                   | pglz             | lz4              | lz4
  2 |                   | lz4              | lz4              | lz4
(2 rows)

You can see that for the row inserted before the change, col1 still uses the PGLZ compression algorithm, even though the column's algorithm was changed from PGLZ to LZ4. (So which algorithm is used for decompression after the change? Each compressed value records the method it was compressed with, so old values are still decompressed with PGLZ.)

Note that if you scan data from another table and insert it into this table, for example with CREATE TABLE ... AS ... or INSERT INTO ... SELECT ..., the inserted data keeps the compression method of the original data. pg_dump and pg_dumpall also gained a --no-toast-compression option; with it, the TOAST compression settings are not dumped.
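For example, to dump a database without emitting the per-column SET COMPRESSION clauses (the database name mydb is a placeholder):

$ pg_dump --no-toast-compression mydb > mydb.sql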

Performance comparison

I tested the compression ratio and the compression/decompression speed of LZ4 and PGLZ, and also included results for uncompressed data (with the storage strategy set to EXTERNAL). Uncompressed data avoids the cost of compression and decompression, but takes longer to read and write.
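The uncompressed baseline can be produced by setting the storage strategy per column, which disables compression while still allowing out-of-line storage (table and column names here are placeholders):

postgres=# ALTER TABLE tbl ALTER COLUMN col1 SET STORAGE EXTERNAL;

ALTER TABLE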

Data used in the tests: the PG documentation (one HTML file per row); and data from the Silesia Corpus, including HTML, text, source code, executable binaries, and pictures.

The test machine used an Intel® Xeon® Silver 4210 CPU @ 2.20GHz with 10 cores/20 threads/2 sockets.

pgbench was used to measure the execution time of the SQL statements, and pg_table_size to check the table size (VACUUM FULL was executed before each measurement to eliminate the influence of dead tuples).
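A measurement round presumably looked something like this (the script file name is a placeholder; -n skips pgbench's own vacuum, -t sets the number of transactions):

postgres=# VACUUM FULL tbl;
VACUUM
postgres=# SELECT pg_table_size('tbl');

$ pgbench -n -t 100 -f insert_test.sql postgres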

Compression ratio

The compression ratios of both PGLZ and LZ4 depend on repeated data: the more repetition, the higher the ratio. However, if PG estimates that the achievable ratio is poor, it will not compress at all, even if the data size reaches the threshold, because compression that does not meaningfully save disk space only adds extra time and resource consumption for compression and decompression.

In current PG 14, PGLZ requires a compression ratio of at least 25%, while LZ4 only requires the result to be smaller than the uncompressed data. I compared the sizes of LZ4 tables, PGLZ tables, and uncompressed tables. In most scenarios, PGLZ compresses slightly better: its measured compression ratio is 2.23 versus 2.07 for LZ4, which means PGLZ saves about 7% more disk space.
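The per-value effect can also be inspected directly: pg_column_size reports the stored (compressed) size of a value, while octet_length forces detoasting and returns the logical length. A quick check, using the tbl table from the earlier example:

postgres=# SELECT octet_length(col2) AS raw_bytes,
postgres-#        pg_column_size(col2) AS stored_bytes
postgres-# FROM tbl WHERE id = 1;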

Figure 1 - Comparing table sizes (in KB)

Compression/decompression speed

TOAST data is compressed on insert and decompressed on query. I therefore executed some SQL statements to see the impact of the different compression algorithms.

First, compare INSERT performance with LZ4, with PGLZ, and without compression. Compared with uncompressed data, LZ4 takes slightly more time, while PGLZ takes considerably more. On average, LZ4's compression time is 20% less than PGLZ's. This is a very significant improvement.

Figure 2 - Comparing INSERT performance

Next, compare SELECT. Compared with PGLZ, LZ4 saves 20% of the time, and there is little difference from uncompressed data: the cost of decompression has been reduced to a very low level.

Figure 3 - Comparing SELECT performance

Then compare concurrent INSERT statements from 16 clients. Compared with PGLZ, inserting single large files (HTML, English text, source code, binary executables, pictures) with LZ4 is 60%-70% faster. When inserting many small files (the PG documentation), performance does not improve. Compared with uncompressed data there is a huge improvement, presumably because compression reduces the amount of data written to disk.

Figure 4 - Comparing INSERT performance with 16 clients

With 16 client SELECTs, LZ4 performs better than PGLZ in most scenarios:

Figure 5 - Comparing SELECT performance with 16 clients

I also compared the speed of text processing with string functions in SELECT and UPDATE. LZ4 beats PGLZ in all scenarios, and processing LZ4-compressed data is about as fast as processing uncompressed data, so the LZ4 algorithm hardly affects string-operation speed.

Figure 6 - Comparing performance using string functions

Compared with PGLZ, LZ4 compresses and decompresses TOAST data more efficiently and provides good performance: query speed is almost the same as for uncompressed data, and insertion is 80% faster than with PGLZ. Admittedly, its compression ratio is not as good in some scenarios, but if you want to increase execution speed, the LZ4 algorithm is strongly recommended.

Also note that you need to consider whether the data in a table is suitable for compression. If the compression ratio is poor, PG will still attempt to compress the data and then give up. This wastes additional RAM and greatly affects the speed of inserting data.
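For data known to be incompressible, such as already-compressed images, the wasted attempt can be avoided by disabling compression on the column via the EXTERNAL storage strategy (the table and column names here are hypothetical):

postgres=# ALTER TABLE photos ALTER COLUMN img SET STORAGE EXTERNAL;

ALTER TABLE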

Future

LZ4 has greatly improved the compression and decompression performance of TOAST. Besides LZ4, there are many other compression algorithms, such as Zstandard. Supporting Zstandard would give users a better compression ratio than PGLZ. LZ4 HC compresses more slowly than LZ4 (reportedly by as much as 98.5%), but it can greatly increase the compression ratio. I hope future PG versions can use more compression algorithms.

Besides TOAST, other scenarios also need compression. As far as I know, the current development version already supports LZ4 compression of WAL, which is an exciting feature.