Sunday, 16 November 2014

Ah, does it mean bad hardware or a kernel bug... Uh, I just want to avoid it.

I have seen many customers come up with the errors below and ask for the root cause. They wonder about the reasons behind them and say, "Ah, it's because of bad hardware or a kernel bug... I hate it, I just want to know how to avoid these."

Let's start with errors like this:
ERROR: could not read block 4285 in file "base/xxxxx/xxxx": read only 0 of 8192 bytes

... have rarely been known to be caused by bugs in specific Linux kernel versions.  Such errors are more often caused by bad hardware, anti-virus software, improper backup/restore procedures, etc.
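To read the message itself, it helps to work out where block 4285 sits in the file. Here is a minimal sketch of that arithmetic, assuming the default 8192-byte block size (which matches the "of 8192 bytes" in the message) and the default 1 GB segment files; the function name is made up for illustration.

```python
BLOCK_SIZE = 8192            # assumption: default block size, as in "of 8192 bytes"
BLOCKS_PER_SEGMENT = 131072  # assumption: default 1 GB segment files (1 GB / 8192)

def min_segment_size(block_number):
    """Return (segment_number, minimum_file_size_in_bytes) for a readable block."""
    segment = block_number // BLOCKS_PER_SEGMENT
    block_in_segment = block_number % BLOCKS_PER_SEGMENT
    # The block occupies bytes [block_in_segment * 8192, (block_in_segment + 1) * 8192).
    return segment, (block_in_segment + 1) * BLOCK_SIZE

segment, needed = min_segment_size(4285)
print(segment, needed)
```

For block 4285 this gives segment 0 and a minimum size of 35110912 bytes. "read only 0 of 8192 bytes" means the read at that position returned nothing, so the file on disk is shorter than PostgreSQL expects it to be.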

One very common cause for such corruption lately seems to be incorrect backup and restore. (For example, failure to exclude or delete all files from the pg_xlog directory can cause problems like this, or using filesystem "snapshots" which aren't really atomic.) The history of the database, including any recoveries from backup or promotion of replicas to primary, could indicate whether this is a possible cause. Faulty hardware is another fairly common cause, including SANs. If fsync or full_page_writes were ever turned off for the cluster, that could also explain it.
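As a quick check for the last point, here is a rough sketch that scans the text of a postgresql.conf for fsync or full_page_writes being switched off. The function name and the sample text are made up, it only recognises the common spellings of "off", and it can only tell you the current setting, not whether the setting was ever off in the past.

```python
import re

RISKY_SETTINGS = ("fsync", "full_page_writes")
OFF_VALUES = {"off", "false", "no", "0"}  # common spellings only (assumption)

def find_risky_settings(conf_text):
    """Return the RISKY_SETTINGS whose last uncommented value is 'off'."""
    current = {}
    for raw_line in conf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        # "name = value" or "name value", value optionally in single quotes
        match = re.match(r"(\w+)(?:\s*=\s*|\s+)'?(\w+)'?", line)
        if match:
            current[match.group(1).lower()] = match.group(2).lower()
    return [name for name in RISKY_SETTINGS if current.get(name) in OFF_VALUES]

sample = """
fsync = off                # someone wanted faster bulk loads
full_page_writes = on
#fsync = on
"""
print(find_risky_settings(sample))
```

With the sample text above, only fsync is reported, because the commented-out line is ignored and full_page_writes is on.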

It is good to establish the cause where possible, so that future corruption can be avoided. To recover, however, the cluster should normally be dumped with pg_dumpall and/or pg_dump, and restored to a freshly created (via initdb) cluster on a machine which is not suspected of causing corruption. It may be possible to fix up or drop and recreate individual damaged objects, but when doing that it can be hard to be sure that the last of the corruption (or the cause of the initial corruption) has been eliminated.

Here is a nice article from Robert Haas on finding out why a database got corrupted (why-is-my-database-corrupted).
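The dump-and-restore route can be sketched roughly as below. This only builds and prints the commands, it does not run them. The file path, data directory and port are hypothetical placeholders, and in practice the last three commands would run on the new machine after copying the dump file over.

```python
import shlex

def dump_and_restore_commands(dump_file, new_data_dir, new_port):
    """Build the command lines for a full dump and a restore into a fresh cluster."""
    return [
        # 1. Dump every database and the global objects from the suspect cluster.
        ["pg_dumpall", "-f", dump_file],
        # 2. Create a brand-new cluster.
        ["initdb", "-D", new_data_dir],
        # 3. Start the new cluster on its own port and wait until it is up.
        ["pg_ctl", "-D", new_data_dir, "-o", f"-p {new_port}", "-w", "start"],
        # 4. Replay the dump into the new cluster.
        ["psql", "-p", str(new_port), "-f", dump_file, "postgres"],
    ]

# Hypothetical paths and port, for illustration only.
for command in dump_and_restore_commands("/backup/all.sql", "/srv/pgdata_new", 5433):
    print(shlex.join(command))
```

If the dump itself fails on a damaged table, that table has to be dealt with separately, which is outside what this sketch covers.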

Errors like this:
ERROR: unexpected data beyond EOF in block xxxx of relation pg_tblspc/xxxx
HINT: This has been seen to occur with buggy kernels; consider updating your system.

... are most often caused by Linux kernel bugs. If you are seeing both types of errors, it is likely that a hardware problem (like bad RAM) is the cause of both, although other causes cannot be ruled out. It is recommended that you schedule a maintenance window and run thorough hardware checks. The latter error message has never been known to be caused by a bug in PostgreSQL itself or by improper backup/restore; it can only be caused by an OS bug or by something which interferes with the OS-level actions, like hardware problems or AV software. The kernel bug can affect anything that adds pages to a table or its indexes. It is a race condition in the kernel, so it will probably be infrequent, and it will be hard to reproduce or to predict when it will be encountered. One known cause is an fallocate() bug, which was fixed in the release below:

Red Hat Enterprise Linux 6.5, also termed Update 5, 21 November 2013 (kernel 2.6.32-431): https://rhn.redhat.com/errata/RHSA-2013-1645.html

Given all the distributions of Linux and the different timings with which each has incorporated different bug fixes, it is not feasible to give a list of Linux versions that are known to work well. A more practical approach is to find out the exact version of Linux being used, and then do a web search for known bugs in that version. Most often the main source for that is the list of bugs fixed in later versions. The bug which could cause this error was fixed several years ago in all major distributions, so any bug-fix version of Linux released in the last two years is unlikely to contain it. Simply applying the available bug fixes for the distribution should therefore rule out OS problems, unless this is a new OS bug which nobody has run into yet. If you continue to see this error while running with the latest OS bug fixes, the most likely cause is bad hardware.
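Finding the exact kernel version is easy to script. The sketch below reads the running kernel release and compares it with the 2.6.32-431 kernel from the errata above. The helper names are made up, and the comparison only means something on kernels that follow the same numbering as that errata; other distributions backport fixes under their own version numbers, so for those the web search is still the way to go.

```python
import platform
import re

FIXED_RELEASE = (2, 6, 32, 431)  # kernel 2.6.32-431, from the errata above

def parse_release(release):
    """Turn a string like '2.6.32-431.el6.x86_64' into (2, 6, 32, 431)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if not match:
        return None
    return tuple(int(part) for part in match.groups())

def predates_fix(release):
    """True if older than FIXED_RELEASE, False if not, None if unparseable."""
    parsed = parse_release(release)
    if parsed is None:
        return None
    return parsed < FIXED_RELEASE

running = platform.release()  # same string as `uname -r` on Linux
print(running, predates_fix(running))
```

A result of True is only a hint to go and read the distribution's errata, not proof that the bug is present.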

I googled around for "suggestions to avoid corruption" and found this article from Craig Ringer. Here are some suggestions made by community/core team members:

** Maintain rolling backups with proper ageing. For example, keep one a day for the last 7 days, then one a week for the last 4 weeks, then one a month for the rest of the year, then one a year.
** Use warm standby with log shipping and/or replication to maintain a live copy of the DB.
** If you want point-in-time recovery, keep a few days or weeks worth of WAL archives and a basebackup around. That'll help you recover from those "oops I meant DROP TABLE unimportant; not DROP TABLE vital_financial_records;" issues.
** Keep up to date with the latest PostgreSQL patch releases. Don't be one of those people still running 9.0.0 when 9.0.10 is out.
** Plug-pull test your system when you're testing it before going live. Put it under load with something like pgbench, then literally pull the plug out. If your database doesn't come back up fine you have hardware, OS or configuration problems.
** Don't `kill -9` the postmaster. It should be fine, but it's still not smart.
** ABSOLUTELY NEVER DELETE postmaster.pid
** Use good quality hardware with proper cooling and a good quality power supply. If possible, ECC RAM is a nice extra.
** Never, ever, ever use cheap SSDs. Use good quality hard drives or (after proper testing) high end SSDs. Read the SSD reviews periodically posted on this mailing list if considering using SSDs. Make sure the SSD has a supercapacitor or other reliable option for flushing its write cache on power loss. Always do repeated plug-pull testing when using SSDs.
** Use a solid, reliable file system. zfs-on-linux, btrfs, etc are not the right choices for a database you care about. Never, ever, ever use FAT32.
** If on Windows, do not run an anti-virus program on your database server. Nobody should be using it for other things or running programs on it anyway.
** Avoid RAID 5, mostly because the performance is terrible, but also because I've seen corruption issues with rebuilds from parity on failing disks.
** Use a good quality hardware RAID controller with a battery backup cache unit if you're using spinning disks in RAID. This is as much for performance as reliability; a BBU will make an immense difference to database performance.
** If you're going to have a UPS (you shouldn't need one as your system should be crash-safe), don't waste your money on a cheap one. Get a good online double-conversion unit that does proper power filtering. Cheap UPSs are just a battery with a fast switch, they provide no power filtering and what little surge protection they offer is done with a component that wears out after absorbing a few surges, becoming totally ineffective. Since your system should be crash-safe a cheap UPS will do nothing for corruption protection, it'll only help with uptime.
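The ageing scheme from the first suggestion (one a day for 7 days, one a week for 4 weeks, one a month for the rest of the year, then one a year) can be sketched as a small retention function. This is only one way to read that rule: the function name is made up, weeks are taken as ISO calendar weeks, and the newest backup in each bucket is the one that is kept.

```python
from datetime import date, timedelta

def backups_to_keep(backup_dates, today):
    """Pick which backup dates to keep under the daily/weekly/monthly/yearly rule."""
    keep = []
    seen_buckets = set()
    for day in sorted(set(backup_dates), reverse=True):  # newest first
        age = (today - day).days
        if age < 0:
            continue                      # ignore dates in the future
        if age < 7:
            bucket = ("day", day)         # last 7 days: keep every day
        elif age < 7 + 28:
            bucket = ("week", tuple(day.isocalendar())[:2])  # next 4 weeks
        elif age < 365:
            bucket = ("month", (day.year, day.month))        # rest of the year
        else:
            bucket = ("year", day.year)   # older: one per year
        if bucket not in seen_buckets:
            seen_buckets.add(bucket)
            keep.append(day)
    return sorted(keep)

today = date(2014, 11, 16)
daily_backups = [today - timedelta(days=n) for n in range(400)]
kept = backups_to_keep(daily_backups, today)
print(len(daily_backups), "backups,", len(kept), "kept")
```

Anything not returned by the function is a candidate for deletion. It is worth test-restoring from the kept backups now and then, since a backup that cannot be restored is not much of a backup.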

4 comments:

  1. Regarding the "ERROR: unexpected data beyond EOF in block xxxx of relation pg_tblspc/xxxx" error:

    The latter error message has never been known to be caused by a bug in PostgreSQL itself or by improper backup/restore; it can only be caused by an OS bug or something which is interfering with the OS-level actions -- like hardware problems or AV software.

    I've also seen this error occur when two separate PostgreSQL instances are accessing the same data directory, which is also something to look out for especially if the data directory is located on some form of shared storage.

  2. This *never* happens to me because of a kernel bug. It is almost 100% a case of Postgres running out of drive space and attempting to extend one of its files but not being able to do so (because it's out of drive space, doh). But it happily records elsewhere that it extended the file and put data there (despite the fact that, well, it *didn't*). Then when it later comes back and tries to access that data, it's not there; it gets random gibberish as a result of reading from a seek to a nonexistent place (which extends the file), and this error pops out.

    Yeah, I know, we shouldn't run Postgres out of space, but you're talking to the wrong guy -- that's a server used by a bunch of developers who are always slinging databases around for their own purposes and never seem to look to see whether the tablespace they're about to sling a database into has sufficient space for what they're doing. Thing is, it ends up corrupting other databases too. SIGH.

  3. Hi Baji,

    It's a wonderful article. I really appreciate the effort you have put in.

    I recently went through a situation in GPDB:
    ERROR: read beyond eof in table "my_big_table" in file "base/xxxxx3/xxxxx85.3xxx" (cdbbufferedread.c:199) (seg334 slice1 sdw5x:4000x pid=20x0x) (cdbdisp.c:1520)

    Here I verified the files in the primary and mirror instances and found that the primary file size was zero while the mirror had some data.
    Copying the file from the mirror to the primary worked like a charm!!

    Keep on Rocking as always mate! \m/.
