Archive of discourse.mafiasi.de from Saturday, September 21, 2019.

Efficient incremental backup of compressed files

11wiedema

tl;dr: Use gzip --rsyncable and be happy.

From time to time there are compressed files or archives lying around on your computer that you want to back up. For example, we have SQL database dumps which are quite big, so we compress them. With gzip we achieve a compression ratio of about 80%, so the dumps take up only one fifth of their uncompressed size.

But compressed files are not really suitable for incremental backups. Even if the uncompressed data changes only slightly, most of the compressed output typically changes completely, especially when bytes are inserted into or deleted from the stream. The deltas therefore become large, and so do the incremental backups.

This effect comes from the 'adaptiveness' of compression algorithms: the output at any point depends on all of the previous input. But gzip has an option called --rsyncable. With this option gzip regularly resets its internal state and starts a new block. Whether a reset happens at a given point depends only on the input data itself, not on absolute positions, so the block boundaries stay aligned with the content even after insertions or deletions. With this blocked compression, smaller deltas can be computed for slightly differing data, because only the affected blocks change instead of the complete stream.
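If you want to see the effect in isolation, you can compare the deltas yourself, for example with rdiff from librsync. The following is just a sketch; the file names are placeholders and dump_old.sql stands for any large text file, e.g. a plain pg_dump output:

# Simulate a small change: insert one extra line after line 10
{ head -n 10 dump_old.sql; echo '-- one inserted line'; tail -n +11 dump_old.sql; } > dump_new.sql

for OPT in "" "--rsyncable"; do
    gzip --best $OPT < dump_old.sql > old.sql.gz     # $OPT intentionally unquoted
    gzip --best $OPT < dump_new.sql > new.sql.gz
    rdiff signature old.sql.gz old.sig               # signature of the 'previous backup'
    rdiff delta old.sig new.sql.gz new.delta         # delta needed to bring it up to date
    echo "gzip $OPT -> delta: $(wc -c < new.delta) bytes"
done

The delta for the --rsyncable run should come out much smaller, because only the blocks around the inserted line differ.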

Wanna have some numbers?

Our database dump is about 3 GB, and the files always change a bit between two backups, but not much. Without --rsyncable, our changesets vary in size between 500 MB and 1.6 GB. With --rsyncable they drop to about 50 MB. Likewise, the time it takes to actually create the backup is cut in half.
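In case you want to check what this means for your own data: rsync can report how much it would actually have to transfer when refreshing an existing backup copy. Something along these lines should work (the paths are made up; --no-whole-file is needed because rsync skips its delta algorithm for purely local copies):

rsync --no-whole-file --stats my_database.sql.gz /backup/pgsql/my_database.sql.gz \
    | grep -E 'Literal data|Matched data'

'Literal data' is what really has to be transferred; 'Matched data' is what could be reused from the old copy.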

7kraemer

Can this be combined with pg_dump? Do you have an example of what such a backup script should look like?

edit: yes, you can simply pipe it into gzip:

pg_dump my_database | gzip --rsyncable > my_database.dump.gz

Nils

The following script does the database backups on our server; it takes the databases as arguments:

./dbbackup.sh database1 my_database2 ownDataBase3

The dumps are written to ./pgsql next to the script.

#!/bin/bash

# Abort on the first error, including failures inside pipelines, and report it
set -eo pipefail
trap 'echo A fatal error occurred during the pgsql backup' ERR

# Dumps go into a pgsql/ directory next to this script
BACKUP_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
PG_BACKUP_DIR="$BACKUP_DIR/pgsql"
mkdir -p "$PG_BACKUP_DIR"

echo Backing up PostgreSQL databases

cd /
# Cluster-wide objects (roles, tablespaces) first
sudo -u postgres pg_dumpall -g | gzip --best --rsyncable > "$PG_BACKUP_DIR/pg-globals.sql.gz" || echo "Failed to back up pg-globals. Will continue."
# Then each database given on the command line; -Z0 explicitly disables
# pg_dump's own compression so that gzip --rsyncable does all the compressing
for DATABASE; do
    sudo -u postgres pg_dump -Z0 "$DATABASE" | gzip --best --rsyncable > "$PG_BACKUP_DIR/$DATABASE.sql.gz" || echo "Failed to back up $DATABASE. Will continue."
done

echo "PostgreSQL backup finished."