Sunday, 16 November 2014

Ah, Does it mean a bad hardware or a kernel...Uh, Just want to avoid it.

I have seen many customers coming up with below errors and asking for root cause. They wonder with the reasons behind it and say "Ah, its because of a bad hardware or a kernel.. I hate it, just want to know how to avoid these"

Lets start with this:
ERROR: could not read block 4285 in file "base/xxxxx/xxxx": read only 0 of 8192 bytes

... have rarely been known to be caused by bugs in specific Linux kernel versions.  Such errors are more often caused by bad hardware, anti-virus software, improper backup/restore procedures, etc.

One very common cause for such corruption lately seems to be incorrect backup and restore. (For example, failure to exclude or delete all files from the pg_xlog directory can cause problems like this, or using filesystem "snapshots" which aren't really atomic.) The history of the database, including any recoveries from backup or promotion of replicas to primary, could indicate whether this is a possible cause. Faulty hardware is another fairly common cause, including SANs. If fsync or full_page_writes were ever turned off for the cluster, that could also explain it.

It is good to establish the cause where possible, so that future corruption can be avoided, but to recover the cluster should normally be dumped with pg_dumpall and/or pg_dump, and restored to a freshly created (via initdb) cluster on a machine which is not suspected of causing corruption. It may be possible to fix up or drop and recreate individual damaged objects, but when doing that it can be hard to be sure that the last of the corruption (or the cause of the initial corruption) has been eliminated.

Here is a nice article to find why-is-my-database-corrupted from Robert Haas.

Errors like this:
ERROR: unexpected data beyond EOF in block xxxx of relation pg_tblspc/xxxx
HINT: This has been seen to occur with buggy kernels; consider updating your system.

... are most often caused by Linux kernel bugs. If you are seeing both types of errors suggests it is likely that a hardware problem (like bad RAM) may be the cause of both problems, although other causes cannot be ruled out. It is recommended that you schedule a maintenance window and run thorough hardware checks. The latter error mes age has never been known to be caused by a bug in PostgreSQL itself or by improper backup/restore; it can only be caused by an OS bug or something which is interfering with the OS-level actions -- like hardware problems or AV software. The kernel bug can affect anything adding pages to a table or its indexes. It is a race condition in the kernel, so it will probably be infrequent and it will be hard to reproduce or to predict when it will be encountered. It can be caused by an fallocate() bug which is indeed fixed in below release:

6.5, also termed Update 5, 21 November 2013 (kernel 2.6.32-431): https://rhn.redhat.com/errata/RHSA-2013-1645.html

Given all the distributions of Linux and the different timings with which each has incorporated different bug fixes, it is not feasible to give a list of Linux versions that are known to work well.  A more practical approach would be to find out the exact version of Linux being used, and then do a web search for known bugs in that version.  Most often the main source of that is the list of bugs fixed in later versions.  The bug which could cause this error was fixed several years ago in all major distributions, so any bug-fix version of Linux released in the last two years is unlikely to contain the relevant bug, so simply applying available bug fixes for the distribution should rule out OS problems unless this is a new OS bug which is not yet run into.  If you continue to see this error while running with the latest OS bug fixes, the most likely cause is bad hardware.

I have googled around on "suggestions to avoid corruptions" and found this article from Craig Ringer. Here are some suggestions made by community/core team members:

** Maintain rolling backups with proper ageing. For example, keep one a day for the last 7 days, then one a week for the last 4 weeks, then one a month for the rest of the year, then one a year.
** Use warm standby with log shipping and/or replication to maintain a live copy of the DB.
** If you want point-in-time recovery, keep a few days or weeks worth of WAL archives and a basebackup around. That'll help you recover from those "oops I meant DROP TABLE unimportant; not DROP TABLE vital_financial_records;" issues.
** Keep up to date with the latest PostgreSQL patch releases. Don't be one of those people still running 9.0.0 when 9.0.10 is out.
** Plug-pull test your system when you're testing it before going live. Put it under load with something like pgbench, then literally pull the plug out. If your database doesn't come back up fine you have hardware, OS or configuration problems.
** Don't `kill -9` the postmaster. It should be fine, but it's still not smart.
** ABSOLUTELY NEVER DELETE postmaster.pid
** Use good quality hardware with proper cooling and a good quality power supply. If possible, ECC RAM is a nice extra.
** Never, ever, ever use cheap SSDs. Use good quality hard drives or (after proper testing) high end SSDs. Read the SSD reviews periodically posted on this mailing list if considering using SSDs. Make sure the SSD has a supercapacitor or other reliable option for flushing its write cache on power loss. Always do repeated plug-pull testing when using SSDs.
** Use a solid, reliable file system. zfs-on-linux, btrfs, etc are not the right choices for a database you care about. Never, ever, ever use FAT32.
** If on Windows, do not run an anti-virus program on your database server. Nobody should be using it for other things or running programs on it anyway.
** Avoid RAID 5, mostly because the performance is terrible, but also because I've seen corruption issues with rebuilds from parity on failing disks.
** Use a good quality hardware RAID controller with a battery backup cache unit if you're using spinning disks in RAID. This is as much for performance as reliability; a BBU will make an immense difference to database performance.
** If you're going to have a UPS (you shouldn't need one as your system should be crash-safe), don't waste your money on a cheap one. Get a good online double-conversion unit that does proper power filtering. Cheap UPSs are just a battery with a fast switch, they provide no power filtering and what little surge protection they offer is done with a component that wears out after absorbing a few surges, becoming totally ineffective. Since your system should be crash-safe a cheap UPS will do nothing for corruption protection, it'll only help with uptime.

Saturday, 15 November 2014

Finally, its.... Slony !! (switchover and failover)


Slony is a great PostgreSQL replication tool and it’s used by many organizations. The tool works with different versions of Postgres, and it’s easy to upgrade with minimum downtime – sometimes with zero downtime!

Recently, I was working with a customer on a switchover and failover of Slony. I have seen a lot of bloggers post about installing and configuring Slony so I thought I would share my experience.

The customer wanted to upgrade their database from PostgreSQL 8.4 to v9.2. They had about 400 tables and wanted one set for each table, hence about 400 sets.

Below are the steps I took. Just for convenience, I'm using 10 tables/sets to explain.

Create tables using below script in source(8.4) and target(9.3) databases:
source=# select 'create table slony_tab_'||generate_series(1,10)||'(t int primary key);';

Inserted values using below script in source database:
source=# select 'insert into slony_tab_'||a||' values (generate_series(1,100));' from generate_series(1,10) a;


Configure Slony using below scripts:

1. Init cluser script
#################################################################################################

cluster name = shadow;
node 1 admin conninfo='host=127.0.0.1 dbname=source user=postgres port=5434';
node 2 admin conninfo='host=127.0.0.1 dbname=target user=postgres port=5432';
init cluster (id = 1 , comment = 'Primary Node For the Slave postgres');
#Setting Store Nodes ...
store node (id = 2, event node = 1 , comment = 'Slave Node For The Primary postgres');
#Setting Store Paths ...
echo  'Stored all nodes in the slony catalogs';
store path(server = 1 , client = 2, conninfo = 'host=127.0.0.1 dbname=source user=postgres port=5434');
store path(server = 2, client = 1 , conninfo = 'host=127.0.0.1 dbname=target user=postgres port=5432');
echo  'Stored all Store Paths for Failover and Switchover into slony catalogs ..';

#################################################################################################
2. Create Set Script
#################################################################################################

cluster name = shadow;
node 1 admin conninfo='host=127.0.0.1 dbname=source user=postgres port=5434';
node 2 admin conninfo='host=127.0.0.1 dbname=target user=postgres port=5432';
try { create set (id = 1 ,origin = 1 , comment = 'Set for public'); } on error { echo  'Could not create Subscription set1 for upgrade!'; exit 1;}
try { create set (id = 2 ,origin = 1 , comment = 'Set for public'); } on error { echo  'Could not create Subscription set2 for upgrade!'; exit 1;}
try { create set (id = 3 ,origin = 1 , comment = 'Set for public'); } on error { echo  'Could not create Subscription set3 for upgrade!'; exit 1;}
try { create set (id = 4 ,origin = 1 , comment = 'Set for public'); } on error { echo  'Could not create Subscription set4 for upgrade!'; exit 1;}
try { create set (id = 5 ,origin = 1 , comment = 'Set for public'); } on error { echo  'Could not create Subscription set5 for upgrade!'; exit 1;}
try { create set (id = 6 ,origin = 1 , comment = 'Set for public'); } on error { echo  'Could not create Subscription set6 for upgrade!'; exit 1;}
try { create set (id = 7 ,origin = 1 , comment = 'Set for public'); } on error { echo  'Could not create Subscription set7 for upgrade!'; exit 1;}
try { create set (id = 8 ,origin = 1 , comment = 'Set for public'); } on error { echo  'Could not create Subscription set8 for upgrade!'; exit 1;}
try { create set (id = 9 ,origin = 1 , comment = 'Set for public'); } on error { echo  'Could not create Subscription set9 for upgrade!'; exit 1;}
try { create set (id = 10 ,origin = 1 , comment = 'Set for public'); } on error { echo  'Could not create Subscription set10 for upgrade!'; exit 1;}
set add table (set id = 1 ,origin = 1 , id = 1, full qualified name = 'public.slony_tab_1', comment = 'Table slony_tab_1 with primary key');
set add table (set id = 2 ,origin = 1 , id = 2, full qualified name = 'public.slony_tab_2', comment = 'Table slony_tab_2 with primary key');
set add table (set id = 3 ,origin = 1 , id = 3, full qualified name = 'public.slony_tab_3', comment = 'Table slony_tab_3 with primary key');
set add table (set id = 4 ,origin = 1 , id = 4, full qualified name = 'public.slony_tab_4', comment = 'Table slony_tab_4 with primary key');
set add table (set id = 5 ,origin = 1 , id = 5, full qualified name = 'public.slony_tab_5', comment = 'Table slony_tab_5 with primary key');
set add table (set id = 6 ,origin = 1 , id = 6, full qualified name = 'public.slony_tab_6', comment = 'Table slony_tab_6 with primary key');
set add table (set id = 7 ,origin = 1 , id = 7, full qualified name = 'public.slony_tab_7', comment = 'Table slony_tab_7 with primary key');
set add table (set id = 8 ,origin = 1 , id = 8, full qualified name = 'public.slony_tab_8', comment = 'Table slony_tab_8 with primary key');
set add table (set id = 9 ,origin = 1 , id = 9, full qualified name = 'public.slony_tab_9', comment = 'Table slony_tab_9 with primary key');
set add table (set id = 10 ,origin = 1 , id = 10, full qualified name = 'public.slony_tab_10', comment = 'Table slony_tab_10 with primary key');

#################################################################################################

3. Starting Slon Processes
#################################################################################################
/opt/PostgreSQL/9.2/bin/slon -s 1000 -d2 shadow 'host=127.0.0.1 dbname=source user=postgres port=5434' > /tmp/node1.log 2>&1 &
/opt/PostgreSQL/8.4/bin/slon -s 1000 -d2 shadow 'host=127.0.0.1 dbname=target user=postgres port=5432' > /tmp/node2.log 2>&1 &

#################################################################################################

4. Subscribing the sets.
#################################################################################################
cluster name = shadow;
node 1 admin conninfo='host=127.0.0.1 dbname=source user=postgres port=5434';
node 2 admin conninfo='host=127.0.0.1 dbname=target user=postgres port=5432';
try { subscribe set (id = 1, provider = 1 , receiver = 2, forward = yes, omit copy = false); } on error { exit 1; } echo  'Subscribed nodes to set 1';
try { subscribe set (id = 2, provider = 1 , receiver = 2, forward = yes, omit copy = false); } on error { exit 1; } echo  'Subscribed nodes to set 2';
try { subscribe set (id = 3, provider = 1 , receiver = 2, forward = yes, omit copy = false); } on error { exit 1; } echo  'Subscribed nodes to set 3';
try { subscribe set (id = 4, provider = 1 , receiver = 2, forward = yes, omit copy = false); } on error { exit 1; } echo  'Subscribed nodes to set 4';
try { subscribe set (id = 5, provider = 1 , receiver = 2, forward = yes, omit copy = false); } on error { exit 1; } echo  'Subscribed nodes to set 5';
try { subscribe set (id = 6, provider = 1 , receiver = 2, forward = yes, omit copy = false); } on error { exit 1; } echo  'Subscribed nodes to set 6';
try { subscribe set (id = 7, provider = 1 , receiver = 2, forward = yes, omit copy = false); } on error { exit 1; } echo  'Subscribed nodes to set 7';
try { subscribe set (id = 8, provider = 1 , receiver = 2, forward = yes, omit copy = false); } on error { exit 1; } echo  'Subscribed nodes to set 8';
try { subscribe set (id = 9, provider = 1 , receiver = 2, forward = yes, omit copy = false); } on error { exit 1; } echo  'Subscribed nodes to set 9';
try { subscribe set (id = 10, provider = 1 , receiver = 2, forward = yes, omit copy = false); } on error { exit 1; } echo  'Subscribed nodes to set 10';
#################################################################################################

Till here it is normal, I mean, all can do very easily.... and of-course many sources you can find. However switchover.. failover !!

If you try to insert values in tables of "target" database, it throws below error:
target=# insert into slony_tab_1 values (11);
ERROR:  Slony-I: Table slony_tab_1 is replicated and cannot be modified on a subscriber node - role=0

Ah, its time to SwitchOver and FailOver...

Let us start with Switch-Over script --
cluster name = shadow;
node 1 admin conninfo='host=127.0.0.1 dbname=source user=postgres port=5434';
node 2 admin conninfo='host=127.0.0.1 dbname=target user=postgres port=5432';
lock set (id = 1, origin = 1); sync (id = 1); wait for event (origin = 1, confirmed = 2, wait on = 2); move set (id = 1, old origin = 1, new origin = 2); echo 'Set 1 Has Been Moved From Origin Node 1 To 2 ';
lock set (id = 2, origin = 1); sync (id = 1); wait for event (origin = 1, confirmed = 2, wait on = 2); move set (id = 2, old origin = 1, new origin = 2); echo 'Set 2 Has Been Moved From Origin Node 1 To 2 ';
lock set (id = 3, origin = 1); sync (id = 1); wait for event (origin = 1, confirmed = 2, wait on = 2); move set (id = 3, old origin = 1, new origin = 2); echo 'Set 3 Has Been Moved From Origin Node 1 To 2 ';
lock set (id = 4, origin = 1); sync (id = 1); wait for event (origin = 1, confirmed = 2, wait on = 2); move set (id = 4, old origin = 1, new origin = 2); echo 'Set 4 Has Been Moved From Origin Node 1 To 2 ';
lock set (id = 5, origin = 1); sync (id = 1); wait for event (origin = 1, confirmed = 2, wait on = 2); move set (id = 5, old origin = 1, new origin = 2); echo 'Set 5 Has Been Moved From Origin Node 1 To 2 ';
lock set (id = 6, origin = 1); sync (id = 1); wait for event (origin = 1, confirmed = 2, wait on = 2); move set (id = 6, old origin = 1, new origin = 2); echo 'Set 6 Has Been Moved From Origin Node 1 To 2 ';
lock set (id = 7, origin = 1); sync (id = 1); wait for event (origin = 1, confirmed = 2, wait on = 2); move set (id = 7, old origin = 1, new origin = 2); echo 'Set 7 Has Been Moved From Origin Node 1 To 2 ';
lock set (id = 8, origin = 1); sync (id = 1); wait for event (origin = 1, confirmed = 2, wait on = 2); move set (id = 8, old origin = 1, new origin = 2); echo 'Set 8 Has Been Moved From Origin Node 1 To 2 ';
lock set (id = 9, origin = 1); sync (id = 1); wait for event (origin = 1, confirmed = 2, wait on = 2); move set (id = 9, old origin = 1, new origin = 2); echo 'Set 9 Has Been Moved From Origin Node 1 To 2 ';
lock set (id = 10, origin = 1); sync (id = 1); wait for event (origin = 1, confirmed = 2, wait on = 2); move set (id = 10, old origin = 1, new origin = 2); echo 'Set 10 Has Been Moved From Origin Node 1 To 2 ';

Executing this script gives you this:
-bash-4.1$ /opt/PostgreSQL/9.2/bin/slonik /tmp/Switchover_script.slonik
/tmp/Switchover_script.slonik:4: Set 1 Has Been Moved From Origin Node 1 To 2
/tmp/Switchover_script.slonik:5: Set 2 Has Been Moved From Origin Node 1 To 2
/tmp/Switchover_script.slonik:6: Set 3 Has Been Moved From Origin Node 1 To 2
/tmp/Switchover_script.slonik:7: Set 4 Has Been Moved From Origin Node 1 To 2
/tmp/Switchover_script.slonik:8: waiting for event (1,5000000211) to be confirmed on node 2
/tmp/Switchover_script.slonik:8: Set 5 Has Been Moved From Origin Node 1 To 2
/tmp/Switchover_script.slonik:9: Set 6 Has Been Moved From Origin Node 1 To 2
/tmp/Switchover_script.slonik:10: Set 7 Has Been Moved From Origin Node 1 To 2
/tmp/Switchover_script.slonik:11: Set 8 Has Been Moved From Origin Node 1 To 2
/tmp/Switchover_script.slonik:12: waiting for event (1,5000000224) to be confirmed on node 2
/tmp/Switchover_script.slonik:12: Set 9 Has Been Moved From Origin Node 1 To 2
/tmp/Switchover_script.slonik:13: Set 10 Has Been Moved From Origin Node 1 To 2

-- Now try to insert values in tables of "target" database.. that should allow you.. :-)
target=# insert into slony_tab_1 values(21);
INSERT 0 1

Wow, its done.. isn't it easy.. !!
And now go ahead with failover.. its yours :P
Below is the script for FailOver...
cluster name = shadow;
node 1 admin conninfo='host=127.0.0.1 dbname=source user=postgres port=5434';
node 2 admin conninfo='host=127.0.0.1 dbname=target user=postgres port=5432';
try {failover (id = 1, backup node = 2 );} on error { echo 'Failure Of The Failover For The Set 1  to 2 ';exit 1; }echo 'Failover Has been performed from 1 to 2';

Check if replication is happening..it should not !
so we are done with Switchover and Failover..
Hope this helps to someone. Any comments or suggestion would be appreciated !