
What’s New in Amanda Community: Postgres Backups

Thursday, March 25th, 2010

Second installment in a series of posts about recent work on Amanda.

The Application API allows Amanda to back up structured data — data that cannot be handled well by dump or tar. Most databases fall into this category, and with the 3.1 release, Amanda Community Edition ships with ampgsql, which backs up Postgres databases using Postgres’s own point-in-time recovery (PITR) mechanism.

The how-to for this application is on the Amanda wiki.

Operation

Postgres, like most “advanced” databases, uses a logging system to ensure consistency even in the face of (some) hardware failures. In essence, it writes every change to a logfile before applying that change to the database itself, much as a journaling filesystem does. The idea is that, after a failure, you simply replay the log to re-apply any potentially corrupted changes.

Postgres calls its log files WAL (write-ahead log) files. By default, they are 16MB. Postgres runs a shell command to “archive” each logfile when it is full.
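
For reference, archiving is controlled by a couple of settings in postgresql.conf. A minimal sketch (the archive directory here is just a placeholder, and archive_mode is available in 8.3 and later):

# postgresql.conf -- illustrative only; pick an archive directory that suits your site
archive_mode = on
archive_command = 'cp %p /var/lib/pgsql/wal-archive/%f'   # %p is the WAL file's path, %f its file name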

So there are two things to back up: the data itself, which can be quite large, and the logfiles. A full backup works like this (a short SQL sketch follows the list):

  • Execute PG_START_BACKUP(ident) with some unique identifier.
  • Dump the data directory, excluding the active WAL logs. Note that the database is still in operation at this point, so the dumped data, taken alone, will be inconsistent.
  • Execute PG_STOP_BACKUP(). This archives a text file with the suffix .backup that indicates which WAL files are needed to make the dumped data consistent again.
  • Dump the required WAL files.
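
In SQL terms, the sequence looks roughly like this (the label below is just an example; the data-directory dump itself happens outside of SQL):

-- Mark the start of the base backup; the label should be unique.
SELECT PG_START_BACKUP('amanda-full-20100325');
-- (Outside of SQL: dump the data directory, excluding the active WAL directory pg_xlog/.)
-- Mark the end of the backup; this triggers archiving of the small .backup history file.
SELECT PG_STOP_BACKUP();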

An incremental backup, on the other hand, only requires backing up the already-archived WAL files.

A restore is still a manual operation, since a DBA will usually want to perform a restore very carefully. The process is described on the wiki page linked above, but it boils down to restoring the data directory and the necessary WAL files, then giving Postgres a shell command to “pull” the WAL files it wants. When Postgres next starts up, it will automatically enter recovery mode and replay the WAL files as necessary.
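
In practice, that “pull” command is the restore_command setting in a recovery.conf file placed in the restored data directory. A minimal sketch, assuming the WAL files were restored to /var/tmp/restored-wal:

# recovery.conf -- placed in the restored data directory before starting Postgres
restore_command = 'cp /var/tmp/restored-wal/%f %p'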

Quiet Databases

On older Postgres versions, making a full backup of a quiet database is actually impossible. After PG_STOP_BACKUP() is invoked, the final WAL file required to reconstruct a consistent database is still “in progress” and thus not archived yet. Since the database is quiet, Postgres never gets any closer to archiving that WAL file, and the backup hangs (or, in the case of ampgsql, times out).

Newer versions of Postgres do the obvious thing: PG_STOP_BACKUP() “forces” an early archiving of the current WAL file.

The best solution for older versions is to make sure transactions are being committed to the database all the time. If the database is truly silent during the dump (perhaps it is only accessed during working hours), then this may mean writing garbage rows to a throwaway table:

CREATE TABLE push_wal AS SELECT * FROM GENERATE_SERIES(1, 500000);
DROP TABLE push_wal;

Note that using CREATE TEMPORARY TABLE will not work, as temporary tables are not written to the WAL file.

As a brief encounter in #postgres taught me, another option is to upgrade to a more modern version of Postgres!

Log Incremental Backups

DBAs and backup admins generally want to avoid making frequent full backups, since they’re so large. The usual pattern is to make a full backup and then dump the archived log files on a nightly basis for a week or two. As the log files are dumped, they can be deleted from the database server, saving considerable space.

In Amanda terms, each of these dumps is an incremental, and is based on the previous night’s backup. That means that the dump after the full is level 1, the next is level 2, and so on. Amanda currently supports 99 levels, but this limit is fairly arbitrary and can be increased as necessary.

The problem with ampgsql as currently implemented is that it lets Amanda schedule incremental levels however it likes. Amanda considers a level-n backup to contain everything that has changed since the last level-(n-1) backup. This works great for GNU tar, but not so well for Postgres. Consider the following schedule:

  • Monday: level 0
  • Tuesday: level 1
  • Wednesday: level 2
  • Thursday: level 1

The problem is that the dump on Thursday, as a level 1, needs to capture all changes since the previous level 0, on Monday. That means that it must contain all WAL files archived since Monday, so those WAL files must remain on the database server until Thursday.

The fix to this is to only perform level 0 or level n+1 dumps, where n is the level of the last dump performed. In the example above, this means either a level 0 or level 3 dump on Thursday. A level 0 is a full backup and requires no history. A level 3 would only contain WAL files archived since the level 2 dump on Wednesday, so any WAL files before that could be deleted from the database server.

Summary

A powerful open source database system and the open source ampgsql plugin combine to provide well-protected storage for your mission-critical data. We will continue to develop additional Application API plugins, and we encourage you and other members of the community to do the same!

What’s New in Amanda: Automated Tests

Friday, March 12th, 2010

This is the first in what will be a series of posts about recent work on Amanda. Amanda has a reputation as old and crusty — not so! Hopefully this series will help to illustrate some of the new features we’ve completed, and what’s coming up. I’m cross-posting these on my own blog, too.

Among open-source applications, Amanda is known for being stable and highly reliable. To ensure that Amanda lives up to this reputation, we’ve constructed an automated testing framework (using Buildbot) that runs on every commit. I’ll give some of the technical details after the jump, but I think the numbers speak for themselves. The latest release of Amanda (which will soon be 3.1.0) has 2936 automated tests!

These tests range from highly focused unit tests (for example, checking that all of Amanda’s spellings of “true” are parsed correctly) all the way up to full integration tests: complete runs of amdump and the recovery applications.

The tests are implemented with Perl’s Test::More and Test::Harness. The result for the current trunk looks like this:

=setupcache.....................ok
Amanda_Archive..................ok
Amanda_Changer..................ok
Amanda_Changer_compat...........ok
Amanda_Changer_disk.............ok
Amanda_Changer_multi............ok
Amanda_Changer_ndmp.............ok
Amanda_Changer_null.............ok
Amanda_Changer_rait.............ok
Amanda_Changer_robot............ok
Amanda_Changer_single...........ok
Amanda_ClientService............ok
Amanda_Cmdline..................ok
Amanda_Config...................ok
Amanda_Curinfo..................ok
Amanda_DB_Catalog...............ok
Amanda_Debug....................ok
Amanda_Device...................ok
        211/428 skipped: various reasons
Amanda_Disklist.................ok
Amanda_Feature..................ok
Amanda_Header...................ok
Amanda_Holding..................ok
Amanda_IPC_Binary...............ok
Amanda_IPC_LineProtocol.........ok
Amanda_Logfile..................ok
Amanda_MainLoop.................ok
Amanda_NDMP.....................ok
Amanda_Process..................ok
Amanda_Recovery_Clerk...........ok
Amanda_Recovery_Planner.........ok
Amanda_Recovery_Scan............ok
Amanda_Report...................ok
Amanda_Tapelist.................ok
Amanda_Taper_Scan...............ok
Amanda_Taper_Scan_traditional...ok
Amanda_Taper_Scribe.............ok
Amanda_Util.....................ok
Amanda_Xfer.....................ok
amadmin.........................ok
amarchiver......................ok
amcheck.........................ok
amcheck-device..................ok
amcheckdump.....................ok
amdevcheck......................ok
amdump..........................ok
amfetchdump.....................ok
amgetconf.......................ok
amgtar..........................ok
amidxtaped......................ok
amlabel.........................ok
ampgsql.........................ok
        40/40 skipped: various reasons
amraw...........................ok
amreport........................ok
amrestore.......................ok
amrmtape........................ok
amservice.......................ok
amstatus........................ok
amtape..........................ok
amtapetype......................ok
bigint..........................ok
mock_mtx........................ok
noop............................ok
pp-scripts......................ok
taper...........................ok
All tests successful, 251 subtests skipped.
Files=64, Tests=2936, 429 wallclock secs (155.44 cusr + 31.48 csys = 186.92 CPU)

The skips are due to tests that require external resources: tape drives, database servers, and so on. The first part of the list contains tests for almost all of the Perl packages in the Amanda namespace. These are generally unit tests of the new Perl code, although some tests exercise several units together due to limitations of the interfaces. The second half of the list contains tests of the Amanda command-line tools. These are integration tests, and they ensure that all of the documented command-line options are present and working, and that each tool’s behavior is correct. The integration tests are necessarily incomplete, as it’s simply not possible to test every permutation of such a highly flexible package.

The =setupcache test at the top is interesting: because most of the Amanda applications need some dumps to work against, we “cache” a few completed amdump runs using tar, and re-load them as needed during the subsequent tests. This speeds things up quite a bit, and also removes some variability from the tests (there are a lot of ways an amdump can go wrong!).

The entire test suite is run at least 54 times for every commit by Buildbot. We test on 42 different architectures: about a dozen Linux distros, in both 32- and 64-bit varieties, plus Solaris 8 and 10, and Darwin-8.10.1 on both x86 and PowerPC. The remaining builds cover special configurations: server-only, client-only, special runs on a system with several tape drives, and so on.