Sunday 16 November 2014

Ah, does it mean bad hardware or a kernel bug?... Uh, I just want to avoid it.

I have seen many customers coming up with the errors below and asking for the root cause. They wonder about the reasons behind them and say "Ah, it's because of bad hardware or a kernel bug.. I hate it, I just want to know how to avoid these".

Let's start with this:
ERROR: could not read block 4285 in file "base/xxxxx/xxxx": read only 0 of 8192 bytes

... have rarely been known to be caused by bugs in specific Linux kernel versions.  Such errors are more often caused by bad hardware, anti-virus software, improper backup/restore procedures, etc.

One very common cause for such corruption lately seems to be incorrect backup and restore. (For example, failure to exclude or delete all files from the pg_xlog directory can cause problems like this, or using filesystem "snapshots" which aren't really atomic.) The history of the database, including any recoveries from backup or promotion of replicas to primary, could indicate whether this is a possible cause. Faulty hardware is another fairly common cause, including SANs. If fsync or full_page_writes were ever turned off for the cluster, that could also explain it.

It is good to establish the cause where possible, so that future corruption can be avoided, but to recover the cluster should normally be dumped with pg_dumpall and/or pg_dump, and restored to a freshly created (via initdb) cluster on a machine which is not suspected of causing corruption. It may be possible to fix up or drop and recreate individual damaged objects, but when doing that it can be hard to be sure that the last of the corruption (or the cause of the initial corruption) has been eliminated.
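
A minimal sketch of that recovery path (assuming the suspect cluster still starts, and that paths and ports are placeholders; the fresh cluster should live on a machine not suspected of causing the corruption):

$ pg_dumpall -p 5432 > /tmp/full_dump.sql         # dump everything from the suspect cluster
$ initdb -D /path/to/fresh_data                   # brand-new cluster on known-good hardware
$ pg_ctl -D /path/to/fresh_data -o "-p 5433" start
$ psql -p 5433 -f /tmp/full_dump.sql postgres     # restore into the fresh cluster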

Here is a nice article from Robert Haas on why-is-my-database-corrupted.

Errors like this:
ERROR: unexpected data beyond EOF in block xxxx of relation pg_tblspc/xxxx
HINT: This has been seen to occur with buggy kernels; consider updating your system.

... are most often caused by Linux kernel bugs. Seeing both types of errors suggests that a hardware problem (like bad RAM) may be the cause of both, although other causes cannot be ruled out. It is recommended that you schedule a maintenance window and run thorough hardware checks. The latter error message has never been known to be caused by a bug in PostgreSQL itself or by improper backup/restore; it can only be caused by an OS bug or by something interfering with OS-level actions -- like hardware problems or AV software. The kernel bug can affect anything that adds pages to a table or its indexes. It is a race condition in the kernel, so it will probably be infrequent and will be hard to reproduce or to predict. It can be caused by a fallocate() bug which is fixed in the release below:

6.5, also termed Update 5, 21 November 2013 (kernel 2.6.32-431): https://rhn.redhat.com/errata/RHSA-2013-1645.html

Given all the distributions of Linux and the different timings with which each has incorporated different bug fixes, it is not feasible to give a list of Linux versions that are known to work well. A more practical approach is to find out the exact version of Linux being used and then do a web search for known bugs in that version; most often the main source for that is the list of bugs fixed in later versions. The bug which could cause this error was fixed several years ago in all major distributions, so any bug-fix version of Linux released in the last two years is unlikely to contain it. Simply applying the available bug fixes for your distribution should rule out OS problems, unless this is a new OS bug which has not yet been run into. If you continue to see this error while running with the latest OS bug fixes, the most likely cause is bad hardware.
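
For example, on a RHEL/CentOS box you can compare the running kernel against the fixed release mentioned above (the sample output assumes the RHEL 6.5 kernel):

$ uname -r
2.6.32-431.el6.x86_64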

I googled around for "suggestions to avoid corruption" and found this article from Craig Ringer. Here are some suggestions made by community/core team members:

** Maintain rolling backups with proper ageing. For example, keep one a day for the last 7 days, then one a week for the last 4 weeks, then one a month for the rest of the year, then one a year.
** Use warm standby with log shipping and/or replication to maintain a live copy of the DB.
** If you want point-in-time recovery, keep a few days or weeks worth of WAL archives and a basebackup around. That'll help you recover from those "oops I meant DROP TABLE unimportant; not DROP TABLE vital_financial_records;" issues.
** Keep up to date with the latest PostgreSQL patch releases. Don't be one of those people still running 9.0.0 when 9.0.10 is out.
** Plug-pull test your system when you're testing it before going live. Put it under load with something like pgbench (see the sketch after this list), then literally pull the plug out. If your database doesn't come back up fine you have hardware, OS or configuration problems.
** Don't `kill -9` the postmaster. It should be fine, but it's still not smart.
** ABSOLUTELY NEVER DELETE postmaster.pid
** Use good quality hardware with proper cooling and a good quality power supply. If possible, ECC RAM is a nice extra.
** Never, ever, ever use cheap SSDs. Use good quality hard drives or (after proper testing) high end SSDs. Read the SSD reviews periodically posted on this mailing list if considering using SSDs. Make sure the SSD has a supercapacitor or other reliable option for flushing its write cache on power loss. Always do repeated plug-pull testing when using SSDs.
** Use a solid, reliable file system. zfs-on-linux, btrfs, etc are not the right choices for a database you care about. Never, ever, ever use FAT32.
** If on Windows, do not run an anti-virus program on your database server. Nobody should be using it for other things or running programs on it anyway.
** Avoid RAID 5, mostly because the performance is terrible, but also because I've seen corruption issues with rebuilds from parity on failing disks.
** Use a good quality hardware RAID controller with a battery backup cache unit if you're using spinning disks in RAID. This is as much for performance as reliability; a BBU will make an immense difference to database performance.
** If you're going to have a UPS (you shouldn't need one as your system should be crash-safe), don't waste your money on a cheap one. Get a good online double-conversion unit that does proper power filtering. Cheap UPSs are just a battery with a fast switch, they provide no power filtering and what little surge protection they offer is done with a component that wears out after absorbing a few surges, becoming totally ineffective. Since your system should be crash-safe a cheap UPS will do nothing for corruption protection, it'll only help with uptime.
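
A minimal sketch of the plug-pull test mentioned above, assuming a throwaway database named bench:

$ createdb bench
$ pgbench -i -s 50 bench            # initialize pgbench tables at scale factor 50
$ pgbench -c 8 -j 2 -T 600 bench    # 8 clients hammering the server for 10 minutes
# ...pull the power cord mid-run, power back on, and verify the cluster
# recovers and starts cleanly with no corruption errors in the logs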

Saturday 15 November 2014

Finally, it's.... Slony !! (switchover and failover)


Slony is a great PostgreSQL replication tool and it’s used by many organizations. The tool works with different versions of Postgres, and it’s easy to upgrade with minimum downtime – sometimes with zero downtime!

Recently, I was working with a customer on a switchover and failover of Slony. I have seen a lot of bloggers post about installing and configuring Slony so I thought I would share my experience.

The customer wanted to upgrade their database from PostgreSQL 8.4 to v9.2. They had about 400 tables and wanted one set for each table, hence about 400 sets.

Below are the steps I took. Just for convenience, I'm using 10 tables/sets to explain.

Create the tables using the script below in the source (8.4) and target (9.2) databases:
source=# select 'create table slony_tab_'||generate_series(1,10)||'(t int primary key);';

Insert values using the script below in the source database:
source=# select 'insert into slony_tab_'||a||' values (generate_series(1,100));' from generate_series(1,10) a;
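
Note that these SELECTs only generate SQL text; one way to actually execute it (a sketch, assuming passwordless local access) is to pipe psql's unaligned, tuples-only output straight back into psql:

$ psql -At -p 5434 -c "select 'insert into slony_tab_'||a||' values (generate_series(1,100));' from generate_series(1,10) a" source | psql -p 5434 source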


Configure Slony using below scripts:

1. Init cluster script
#################################################################################################

cluster name = shadow;
node 1 admin conninfo='host=127.0.0.1 dbname=source user=postgres port=5434';
node 2 admin conninfo='host=127.0.0.1 dbname=target user=postgres port=5432';
init cluster (id = 1 , comment = 'Primary Node For the Slave postgres');
#Setting Store Nodes ...
store node (id = 2, event node = 1 , comment = 'Slave Node For The Primary postgres');
#Setting Store Paths ...
echo  'Stored all nodes in the slony catalogs';
store path(server = 1 , client = 2, conninfo = 'host=127.0.0.1 dbname=source user=postgres port=5434');
store path(server = 2, client = 1 , conninfo = 'host=127.0.0.1 dbname=target user=postgres port=5432');
echo  'Stored all Store Paths for Failover and Switchover into slony catalogs ..';

#################################################################################################
2. Create Set Script
#################################################################################################

cluster name = shadow;
node 1 admin conninfo='host=127.0.0.1 dbname=source user=postgres port=5434';
node 2 admin conninfo='host=127.0.0.1 dbname=target user=postgres port=5432';
try { create set (id = 1 ,origin = 1 , comment = 'Set for public'); } on error { echo  'Could not create Subscription set1 for upgrade!'; exit 1;}
try { create set (id = 2 ,origin = 1 , comment = 'Set for public'); } on error { echo  'Could not create Subscription set2 for upgrade!'; exit 1;}
try { create set (id = 3 ,origin = 1 , comment = 'Set for public'); } on error { echo  'Could not create Subscription set3 for upgrade!'; exit 1;}
try { create set (id = 4 ,origin = 1 , comment = 'Set for public'); } on error { echo  'Could not create Subscription set4 for upgrade!'; exit 1;}
try { create set (id = 5 ,origin = 1 , comment = 'Set for public'); } on error { echo  'Could not create Subscription set5 for upgrade!'; exit 1;}
try { create set (id = 6 ,origin = 1 , comment = 'Set for public'); } on error { echo  'Could not create Subscription set6 for upgrade!'; exit 1;}
try { create set (id = 7 ,origin = 1 , comment = 'Set for public'); } on error { echo  'Could not create Subscription set7 for upgrade!'; exit 1;}
try { create set (id = 8 ,origin = 1 , comment = 'Set for public'); } on error { echo  'Could not create Subscription set8 for upgrade!'; exit 1;}
try { create set (id = 9 ,origin = 1 , comment = 'Set for public'); } on error { echo  'Could not create Subscription set9 for upgrade!'; exit 1;}
try { create set (id = 10 ,origin = 1 , comment = 'Set for public'); } on error { echo  'Could not create Subscription set10 for upgrade!'; exit 1;}
set add table (set id = 1 ,origin = 1 , id = 1, full qualified name = 'public.slony_tab_1', comment = 'Table slony_tab_1 with primary key');
set add table (set id = 2 ,origin = 1 , id = 2, full qualified name = 'public.slony_tab_2', comment = 'Table slony_tab_2 with primary key');
set add table (set id = 3 ,origin = 1 , id = 3, full qualified name = 'public.slony_tab_3', comment = 'Table slony_tab_3 with primary key');
set add table (set id = 4 ,origin = 1 , id = 4, full qualified name = 'public.slony_tab_4', comment = 'Table slony_tab_4 with primary key');
set add table (set id = 5 ,origin = 1 , id = 5, full qualified name = 'public.slony_tab_5', comment = 'Table slony_tab_5 with primary key');
set add table (set id = 6 ,origin = 1 , id = 6, full qualified name = 'public.slony_tab_6', comment = 'Table slony_tab_6 with primary key');
set add table (set id = 7 ,origin = 1 , id = 7, full qualified name = 'public.slony_tab_7', comment = 'Table slony_tab_7 with primary key');
set add table (set id = 8 ,origin = 1 , id = 8, full qualified name = 'public.slony_tab_8', comment = 'Table slony_tab_8 with primary key');
set add table (set id = 9 ,origin = 1 , id = 9, full qualified name = 'public.slony_tab_9', comment = 'Table slony_tab_9 with primary key');
set add table (set id = 10 ,origin = 1 , id = 10, full qualified name = 'public.slony_tab_10', comment = 'Table slony_tab_10 with primary key');

#################################################################################################
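
With the customer's 400 tables, typing these lines by hand is painful; a small shell sketch (assuming the naming pattern above) can generate them. Slonik is fine with the create set and set add table lines interleaved, since each table is added right after its set exists:

$ for i in $(seq 1 10)
> do
>   echo "try { create set (id = $i ,origin = 1 , comment = 'Set for public'); } on error { exit 1;}"
>   echo "set add table (set id = $i ,origin = 1 , id = $i, full qualified name = 'public.slony_tab_$i', comment = 'Table slony_tab_$i with primary key');"
> done >> create_set.slonik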

3. Starting Slon Processes
#################################################################################################
/opt/PostgreSQL/8.4/bin/slon -s 1000 -d2 shadow 'host=127.0.0.1 dbname=source user=postgres port=5434' > /tmp/node1.log 2>&1 &
/opt/PostgreSQL/9.2/bin/slon -s 1000 -d2 shadow 'host=127.0.0.1 dbname=target user=postgres port=5432' > /tmp/node2.log 2>&1 &

#################################################################################################

4. Subscribing the sets.
#################################################################################################
cluster name = shadow;
node 1 admin conninfo='host=127.0.0.1 dbname=source user=postgres port=5434';
node 2 admin conninfo='host=127.0.0.1 dbname=target user=postgres port=5432';
try { subscribe set (id = 1, provider = 1 , receiver = 2, forward = yes, omit copy = false); } on error { exit 1; } echo  'Subscribed nodes to set 1';
try { subscribe set (id = 2, provider = 1 , receiver = 2, forward = yes, omit copy = false); } on error { exit 1; } echo  'Subscribed nodes to set 2';
try { subscribe set (id = 3, provider = 1 , receiver = 2, forward = yes, omit copy = false); } on error { exit 1; } echo  'Subscribed nodes to set 3';
try { subscribe set (id = 4, provider = 1 , receiver = 2, forward = yes, omit copy = false); } on error { exit 1; } echo  'Subscribed nodes to set 4';
try { subscribe set (id = 5, provider = 1 , receiver = 2, forward = yes, omit copy = false); } on error { exit 1; } echo  'Subscribed nodes to set 5';
try { subscribe set (id = 6, provider = 1 , receiver = 2, forward = yes, omit copy = false); } on error { exit 1; } echo  'Subscribed nodes to set 6';
try { subscribe set (id = 7, provider = 1 , receiver = 2, forward = yes, omit copy = false); } on error { exit 1; } echo  'Subscribed nodes to set 7';
try { subscribe set (id = 8, provider = 1 , receiver = 2, forward = yes, omit copy = false); } on error { exit 1; } echo  'Subscribed nodes to set 8';
try { subscribe set (id = 9, provider = 1 , receiver = 2, forward = yes, omit copy = false); } on error { exit 1; } echo  'Subscribed nodes to set 9';
try { subscribe set (id = 10, provider = 1 , receiver = 2, forward = yes, omit copy = false); } on error { exit 1; } echo  'Subscribed nodes to set 10';
#################################################################################################

Up to here it is all standard; anyone can do it easily, and of course you can find many sources covering it. However, switchover.. failover !!

If you try to insert values into tables of the "target" database, it throws the error below:
target=# insert into slony_tab_1 values (11);
ERROR:  Slony-I: Table slony_tab_1 is replicated and cannot be modified on a subscriber node - role=0

Ah, it's time to SwitchOver and FailOver...

Let us start with Switch-Over script --
cluster name = shadow;
node 1 admin conninfo='host=127.0.0.1 dbname=source user=postgres port=5434';
node 2 admin conninfo='host=127.0.0.1 dbname=target user=postgres port=5432';
lock set (id = 1, origin = 1); sync (id = 1); wait for event (origin = 1, confirmed = 2, wait on = 2); move set (id = 1, old origin = 1, new origin = 2); echo 'Set 1 Has Been Moved From Origin Node 1 To 2 ';
lock set (id = 2, origin = 1); sync (id = 1); wait for event (origin = 1, confirmed = 2, wait on = 2); move set (id = 2, old origin = 1, new origin = 2); echo 'Set 2 Has Been Moved From Origin Node 1 To 2 ';
lock set (id = 3, origin = 1); sync (id = 1); wait for event (origin = 1, confirmed = 2, wait on = 2); move set (id = 3, old origin = 1, new origin = 2); echo 'Set 3 Has Been Moved From Origin Node 1 To 2 ';
lock set (id = 4, origin = 1); sync (id = 1); wait for event (origin = 1, confirmed = 2, wait on = 2); move set (id = 4, old origin = 1, new origin = 2); echo 'Set 4 Has Been Moved From Origin Node 1 To 2 ';
lock set (id = 5, origin = 1); sync (id = 1); wait for event (origin = 1, confirmed = 2, wait on = 2); move set (id = 5, old origin = 1, new origin = 2); echo 'Set 5 Has Been Moved From Origin Node 1 To 2 ';
lock set (id = 6, origin = 1); sync (id = 1); wait for event (origin = 1, confirmed = 2, wait on = 2); move set (id = 6, old origin = 1, new origin = 2); echo 'Set 6 Has Been Moved From Origin Node 1 To 2 ';
lock set (id = 7, origin = 1); sync (id = 1); wait for event (origin = 1, confirmed = 2, wait on = 2); move set (id = 7, old origin = 1, new origin = 2); echo 'Set 7 Has Been Moved From Origin Node 1 To 2 ';
lock set (id = 8, origin = 1); sync (id = 1); wait for event (origin = 1, confirmed = 2, wait on = 2); move set (id = 8, old origin = 1, new origin = 2); echo 'Set 8 Has Been Moved From Origin Node 1 To 2 ';
lock set (id = 9, origin = 1); sync (id = 1); wait for event (origin = 1, confirmed = 2, wait on = 2); move set (id = 9, old origin = 1, new origin = 2); echo 'Set 9 Has Been Moved From Origin Node 1 To 2 ';
lock set (id = 10, origin = 1); sync (id = 1); wait for event (origin = 1, confirmed = 2, wait on = 2); move set (id = 10, old origin = 1, new origin = 2); echo 'Set 10 Has Been Moved From Origin Node 1 To 2 ';

Executing this script gives you this:
-bash-4.1$ /opt/PostgreSQL/9.2/bin/slonik /tmp/Switchover_script.slonik
/tmp/Switchover_script.slonik:4: Set 1 Has Been Moved From Origin Node 1 To 2
/tmp/Switchover_script.slonik:5: Set 2 Has Been Moved From Origin Node 1 To 2
/tmp/Switchover_script.slonik:6: Set 3 Has Been Moved From Origin Node 1 To 2
/tmp/Switchover_script.slonik:7: Set 4 Has Been Moved From Origin Node 1 To 2
/tmp/Switchover_script.slonik:8: waiting for event (1,5000000211) to be confirmed on node 2
/tmp/Switchover_script.slonik:8: Set 5 Has Been Moved From Origin Node 1 To 2
/tmp/Switchover_script.slonik:9: Set 6 Has Been Moved From Origin Node 1 To 2
/tmp/Switchover_script.slonik:10: Set 7 Has Been Moved From Origin Node 1 To 2
/tmp/Switchover_script.slonik:11: Set 8 Has Been Moved From Origin Node 1 To 2
/tmp/Switchover_script.slonik:12: waiting for event (1,5000000224) to be confirmed on node 2
/tmp/Switchover_script.slonik:12: Set 9 Has Been Moved From Origin Node 1 To 2
/tmp/Switchover_script.slonik:13: Set 10 Has Been Moved From Origin Node 1 To 2

-- Now try to insert values into tables of the "target" database.. it should be allowed now.. :-)
target=# insert into slony_tab_1 values(21);
INSERT 0 1
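
You can also confirm the new origins from Slony's own catalog (the schema is _shadow for cluster name shadow); set_origin should now be 2 for all ten sets:

target=# select set_id, set_origin from _shadow.sl_set order by set_id;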

Wow, it's done.. isn't it easy !!
And now go ahead with the failover.. it's all yours :P
Below is the script for FailOver...
cluster name = shadow;
node 1 admin conninfo='host=127.0.0.1 dbname=source user=postgres port=5434';
node 2 admin conninfo='host=127.0.0.1 dbname=target user=postgres port=5432';
try {failover (id = 1, backup node = 2 );} on error { echo 'Failure Of The Failover For The Set 1  to 2 ';exit 1; }echo 'Failover Has been performed from 1 to 2';
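
Once the old master is really gone, the failed node is normally also dropped from the cluster configuration; a slonik sketch (run after the failover above):

drop node (id = 1, event node = 2);
echo 'Node 1 dropped; node 2 is now the only origin';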

Check whether replication from the old master is still happening.. it should not be !
So we are done with switchover and failover.
Hope this helps someone. Any comments or suggestions would be appreciated !


Saturday 5 July 2014

Updating pg_cast helps, but only sometimes !!

I've seen one of our customers migrating a table from SQL Server to PostgreSQL using EnterpriseDB's Migration Toolkit (MTK). This table has a boolean column. During migration, MTK converts the "boolean" datatype to "bit" in PostgreSQL, and the process was taking 6 hours to complete. The customer wanted to change the datatype from "bit" to "integer", and the ALTER command for changing the type was taking another 6 hours in PostgreSQL. If he migrated only the structure to PostgreSQL first and then changed the type from "bit" to "integer", loading the data failed with the error below. If that load were allowed, the whole migration would take only 6 hours, since there would be no need to alter the type after the data load.

ERROR: column "hidehelm" is of type integer but expression is of type boolean
 Hint: You will need to rewrite or cast the expression.

So I found a workaround: make the cast from "boolean" to "integer" implicit by updating the "pg_cast" catalog as below. With this, he could load the boolean data directly into the integer column, which saves the time of altering the type from bit to integer after migrating. Here is a quick test (the table tarik is just a single integer column, tarik(t int)):

postgres=# insert into tarik values (1::boolean);
ERROR:  column "t" is of type integer but expression is of type boolean
LINE 1: insert into tarik values (1::boolean);
HINT:  You will need to rewrite or cast the expression.
postgres=# select * from pg_cast where castsource='boolean'::regtype and casttarget='int4'::regtype;
castsource | casttarget | castfunc | castcontext | castmethod
------------+------------+----------+-------------+------------
         16 |         23 |     2558 | e           | f
(1 row)
postgres=# update pg_cast set castcontext ='i' where castsource='boolean'::regtype and casttarget='int4'::regtype;
UPDATE 1
postgres=# select * from pg_cast where castsource='boolean'::regtype and casttarget='int4'::regtype;
 castsource | casttarget | castfunc | castcontext | castmethod 
------------+------------+----------+-------------+------------
         16 |         23 |     2558 | i           | f
(1 row)
postgres=# insert into tarik values (1::boolean);
INSERT 0 1

Irrespective of any side effects of this workaround (which I'm not aware of), it worked, and the migration took only 6 hours. Of course, the customer rolled back this setting after migrating the table. Updating catalogs is very dangerous, so maybe I should have concentrated on reducing the time of the ALTER command after migration instead? Anyway, it worked, so I was happy !! ;-)
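
For the record, the slow ALTER looks like the sketch below (the column name is taken from the error above; the table name is hypothetical). It rewrites the whole table, which is why it took 6 hours on their data:

postgres=# ALTER TABLE migrated_table ALTER COLUMN hidehelm TYPE integer USING hidehelm::integer;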

Thanks for any suggestions/comments.

Friday 4 July 2014

Oops, I corrupted my table, of course just to recover salvaged data.


The way I started the title might be confusing ("I corrupted my table"), since everyone starts with "Crap !! why did you do that !!, ...", so just to justify it.......
I see many customers coming to recover corrupted tables without any backup. In such cases it is hard to recover the tables completely and it needs a lot of work; however, we can recover the salvageable data if they don't care about the corrupted rows.

Let me corrupt the table first.. :-)

 I created a million-row table called "to_be_damaged"
postgres=# select count(*) from to_be_damaged ;
  count 
---------
 1000000
(1 row)
postgres=# select relfilenode,relname from pg_class where relname='to_be_damaged';
 relfilenode |    relname
-------------+---------------
      461257 | to_be_damaged
(1 row)
I've used "hexedit" to damage it. Open relfilenode file from OS level using hexedit and try picking a line which is the start of an 8K boundary and typing hex DE AD BE EF across it.

postgres=# select count(*) from to_be_damaged ;
ERROR:  invalid page in block 0 of relation base/12896/461257

Now create an identical table "salvaged" to recover salvaged data from "to_be_damaged" table.
postgres=# create table salvaged(t int);
CREATE TABLE
I prepared the function below, which copies the rows that are still salvageable:
create or replace function salvage_damaged()
  returns void
  language plpgsql
as $$
declare
  pageno int;
  tupno int;
  pos tid;
begin
  <<pageloop>>
  for pageno in 0..35930 loop  -- pg_class.relpages for the damaged table
    for tupno in 1..1000 loop
      pos = ('(' || pageno || ',' || tupno || ')')::tid;
      begin
        insert into salvaged select * from to_be_damaged where ctid = pos;
      exception
        when sqlstate 'XX001' then
          raise warning 'skipping page %', pageno;
          continue pageloop;
        when others then
          raise warning 'skipping row %', pos;
      end;
    end loop;
  end loop;
end;
$$;

Now run the function to copy the salvageable rows:
postgres=# select salvage_damaged();
WARNING: skipping page 0
salvage_damaged
-----------------

(1 row)
postgres=# select count(*) from salvaged ;
count
----------
12999815
(1 row)

postgres=# select 13000000-12999815;
?column?
----------
185
(1 row)
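
An alternative worth knowing (use it with extreme care, preferably on a copy of the data directory) is the zero_damaged_pages setting, which makes the server zero out invalid pages instead of erroring, so a scan or dump can get past them:

postgres=# set zero_damaged_pages = on;
SET
postgres=# select count(*) from to_be_damaged;   -- damaged pages are zeroed and skipped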

I hope it helps someone. Thanks for reading.

Sunday 20 April 2014

Difference between warm standby, hot standby and streaming replication:


We all know about replication and its types in PostgreSQL. There are three basic types of replication in PostgreSQL, i.e. warm standby, hot standby and streaming replication. I used to get confused between these three and was unable to find the difference in my initial stages of learning Postgres; I understood the difference later by reading the docs. So I would like to post the key differences between these 3 types of replication in short (it also helps if anyone wants to find out which one is configured in his/her environment).

Warm Standby:
==========
It was introduced in PostgreSQL 8.3 (IIRC).

1. It is based on WAL log shipping, which typically means WAL archives generated on the Master are transferred to and applied at the Standby side. So the Warm Standby always waits for the WAL archive the Master is currently writing to, and keeps throwing messages like "cp: cannot stat <archive>: No such file or directory". It is therefore always one archive behind the Master, and data loss will be at most 16MB (assuming a healthy warm standby :-) )

2. In the postgresql.conf file, you need to change just three parameters on the master: wal_level to archive, plus archive_mode and archive_command; nothing needs changing in the postgresql.conf file on the standby side.
On Master:
wal_level = archive
archive_mode    = on
archive_command = 'cp %p /path_to/archive/%f'

3. In the recovery.conf file, three parameters: standby_mode, restore_command and trigger_file (a sketch follows below).
4. You cannot connect to the Standby, so the database is not even open for read operations (read queries are not permitted on the db).
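
A minimal recovery.conf sketch for the warm standby (paths are placeholders):
standby_mode = 'on'
restore_command = 'cp /path_to/archive/%f "%p"'
trigger_file = '/path_to/trigger'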

Detailed explanation and related docs are here
Hot Standby:
========
It was introduced in PostgreSQL 9.0.

1. It is also based on WAL log shipping (same as warm standby). And of course, WALs are transferred and applied at the Standby, so it is one WAL behind and always waits for the archive the Master is currently writing to.
2. In the postgresql.conf file, you need to change wal_level to hot_standby, plus archive_mode and archive_command. Since you'll likely want to use pg_basebackup, you should also set max_wal_senders to at least 2 or 3. And set hot_standby = on in the standby's conf file.
On Master:
wal_level = hot_standby
max_wal_senders = 5
wal_keep_segments = 32
archive_mode    = on
archive_command = 'cp %p /path_to/archive/%f'
On Slave:
hot_standby = on
3. In the recovery.conf file, three parameters: standby_mode, restore_command and trigger_file.
4. You can connect to the Standby for read queries (you should set hot_standby to ON in the standby's postgresql.conf file).

Detailed explanation and related docs are here.
Streaming Replication:
==============
It was introduced in PostgreSQL 9.0.
1. XLOG records generated in the primary are periodically shipped to the standby via the network and are replayed as soon as possible, without waiting until the XLOG file has been filled. The combination of Hot Standby and SR makes the latest data inserted into the primary visible in the standby almost immediately. So data loss is minimal (with async rep almost only open transactions will be lost; with sync rep, zero loss).
2. In the postgresql.conf file, this time 5 parameters, streaming-related params like below:
On Master:
wal_level = hot_standby
max_wal_senders = 5
wal_keep_segments = 32
archive_mode    = on
archive_command = 'cp %p /path_to/archive/%f'
On Slave:
hot_standby=on
3. In the recovery.conf file, you need one extra parameter in addition to the three you add for hot/warm standby, namely primary_conninfo, so four parameters in total:
standby_mode          = 'on'
primary_conninfo      = 'host=192.168.0.10 port=5432 user=postgres'
trigger_file = '/path_to/trigger'
restore_command = 'cp /path_to/archive/%f "%p"'
4. You can connect to the Standby for read queries (you should set hot_standby to ON in the standby's postgresql.conf file).
Detailed explanation and related docs are here: http://bajis-postgres.blogspot.in/2013/12/step-by-step-guide-to-setup-steaming.html

Kindly let me know if I miss anything.

Saturday 5 April 2014

Anyone want to change tablespace locations while running pg_upgrade ?? Isn't it easy !!

As we all know, pg_upgrade handles tablespaces in a smart way. However, it creates the new cluster's tablespaces in the same location as the old ones. If you want to change the location after the upgrade, you need some work, including updating the catalog table with the new location (IIRC, in PG 9.4 you can do it using the ALTER TABLESPACE command). So, for current versions, here is a workaround to change the tablespace locations while pg_upgrade is running. The same tweak also applies to contrib modules: if any are installed in databases of the old cluster, you have to install them in the new cluster too, yet we cannot pre-create the databases in the new cluster, since it must be empty to run pg_upgrade; "pg_upgrade" creates the databases using the template "template0" in the new cluster.

You can follow the steps below to tweak the running pg_upgrade process.

1. I've upgraded the PG 9.1 cluster "Im_old_one" to the PG 9.3 cluster "Hey_Im_New"; as you see below, it created the new tablespace in the same directory.

-bash-4.1$ /opt/PostgreSQL/9.1/bin/psql -p 5666 -U postgres postgres
Timing is on.
psql.bin (9.1.7)
Type "help" for help.


postgres=# CREATE TABLESPACE tablsc_91 LOCATION '/tmp/tablsc_91';
CREATE TABLESPACE
Time: 1.663 ms
-bash-4.1$ /opt/PostgreSQL/9.3/bin/pg_upgrade -b /opt/PostgreSQL/9.1/bin/ -B /opt/PostgreSQL/9.3/bin/ -d /tmp/Im_old_one/ -D /tmp/Hey_Im_New/ -p 5666 -P 5667
Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok
Checking database user is a superuser                       ok

.
.
Upgrade Complete
----------------
Optimizer statistics are not transferred by pg_upgrade so,
once you start the new server, consider running:
    analyze_new_cluster.sh

Running this script will delete the old cluster's data files:
    delete_old_cluster.sh
-bash-4.1$
-bash-4.1$
-bash-4.1$ ls -ltrh /tmp/tablsc_91/
total 8.0K
drwx------. 2 postgres postgres 4.0K Apr  5 14:25 PG_9.1_201105231
drwx------. 2 postgres postgres 4.0K Apr  5 14:29 PG_9.3_201306121
2. Now delete the new cluster, re-run the pg_upgrade command and keep monitoring it.
3. Press Control+Z (to pause the job) just after the catalog dump is created. You can pause the job when you see the output below:
-bash-4.1$ /opt/PostgreSQL/9.3/bin/pg_upgrade -b /opt/PostgreSQL/9.1/bin/ -B /opt/PostgreSQL/9.3/bin/ -d /tmp/Im_old_one/ -D /tmp/Hey_Im_New/ -p 5666 -P 5667
Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok
Checking database user is a superuser                       ok
Checking for prepared transactions                          ok
Checking for reg* system OID user data types                ok
Checking for contrib/isn with bigint-passing mismatch       ok
Creating dump of global objects                             ok
Creating dump of database schemas
                                                            ok ---> press control + z just after this.
^Z


**Note: If we fail to pause the job right after the "Creating dump of database schemas" step, then kill the job and re-initiate the pg_upgrade process. Re-initiating pg_upgrade does not require creating the new cluster again or any other changes.

4. Open the global object dump and modify the CREATE TABLESPACE command to point to your new location:

-bash-4.1$ vi pg_upgrade_dump_globals.sql
The above file is created in the directory from which you ran the pg_upgrade command; open it and change the CREATE TABLESPACE command from the old location to the new location.
Ex:-
Change all the CREATE TABLESPACE statements in the file "pg_upgrade_dump_globals.sql"

From
CREATE TABLESPACE "tablsc_91" OWNER "postgres" LOCATION '/tmp/tablsc_91';
To
CREATE TABLESPACE "tablsc_91" OWNER "postgres" LOCATION '/tmp/tablsc_93';
5. Once the modifications have been completed, resume the job with the "fg" command. It should continue with the restoration process.
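
Instead of editing the file in vi, a one-line sed sketch does the same replacement (paths as in this example):

-bash-4.1$ sed -i 's|/tmp/tablsc_91|/tmp/tablsc_93|g' pg_upgrade_dump_globals.sql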

Any suggestions/comments would be most welcome!

Saturday 29 March 2014

Is it failback or switchover with PostgreSQL? Ah, maybe not...

Sometimes I feel like I've found something new in Postgres, but most of the time it turns out to be already known by people. I believe this is the same kind of thing (it might be known to many already). I see many people asking about switchover/failback in PostgreSQL, and I always suggest taking a fresh backup of the new master (after failover) to sync the old master as a slave to it. Of course, that is the correct procedure. However, there is a way to achieve this (the procedure below):
 1. Set up streaming replication between two clusters, i.e. "Master_but_Slave_soon" on port 8888 and "Slave_gonna_Master" on port 8889. See here for how to set up SR.
-bash-4.1$ /opt/PostgreSQL/9.3/bin/pg_ctl -D /tmp/rep/Slave_gonna_Master/ start
server starting
-bash-4.1$ LOG:  database system was interrupted; last known up at 2014-03-30 09:26:59 IST
LOG:  entering standby mode
LOG:  restored log file "000000010000000000000003" from archive
LOG:  redo starts at 0/3000028
LOG:  consistent recovery state reached at 0/30000F0
LOG:  database system is ready to accept read only connections
cp: cannot stat `/tmp/rep/arch/000000010000000000000004': No such file or directory
LOG:  started streaming WAL from primary at 0/4000000 on timeline 1
-bash-4.1$
-bash-4.1$ ./pg_ctl -D /tmp/rep/Master_but_Slave_soon/ status
pg_ctl: server is running (PID: 26531)
/opt/PostgreSQL/9.3/bin/postgres "-D" "/tmp/rep/Master_but_Slave_soon"
-bash-4.1$
-bash-4.1$
-bash-4.1$ ./psql -p 8888 postgres
Timing is on.
psql.bin (9.3.2)
Type "help" for help.

postgres=# \x 
Expanded display is on. 
postgres=# select * from pg_stat_replication; 
-[ RECORD 1 ]----+--------------------------------- 
pid              | 26651 
usesysid         | 10 
usename          | postgres 
application_name | walreceiver 
client_addr      | 127.0.0.1 
client_hostname  | 
client_port      | 26058 
backend_start    | 2014-03-30 09:28:59.204298+05:30 
state            | streaming 
sent_location    | 0/4000000 
write_location   | 0/4000000 
flush_location   | 0/4000000 
replay_location  | 0/4000000 
sync_priority    | 0 
sync_state       | async 
 
Time: 155.089 ms 
postgres=# 
postgres=# 
postgres=# \q 
-bash-4.1$ 
-bash-4.1$ 
-bash-4.1$ ./pg_ctl -D /tmp/rep/Slave_gonna_Master/ status 
pg_ctl: server is running (PID: 26643) 
/opt/PostgreSQL/9.3/bin/postgres "-D" "/tmp/rep/Slave_gonna_Master" 
-bash-4.1$ 
-bash-4.1$ 
-bash-4.1$ ./psql -p 8889 postgres 
Timing is on. 
psql.bin (9.3.2) 
Type "help" for help. 
 
postgres=# 
postgres=# select pg_is_in_recovery(); 
 pg_is_in_recovery 
------------------- 
 t 
(1 row) 
 
Time: 3.031 ms 
postgres=#
2. Now go for failover by creating the trigger file. Our Slave is now the new Master.
-bash-4.1$ cat /tmp/rep/Slave_gonna_Master/recovery.conf 
standby_mode=on 
primary_conninfo='host=localhost port=8888' 
restore_command='cp /tmp/rep/arch/%f %p' 
trigger_file='/tmp/rep/8888.trigger' 
 
-bash-4.1$ touch /tmp/rep/8888.trigger 
LOG:  trigger file found: /tmp/rep/8888.trigger 
LOG:  redo done at 0/5000028 
cp: cannot stat `/tmp/rep/arch/000000010000000000000005': No such file or directory 
cp: cannot stat `/tmp/rep/arch/00000002.history': No such file or directory 
LOG:  selected new timeline ID: 2 
cp: cannot stat `/tmp/rep/arch/00000001.history': No such file or directory 
LOG:  archive recovery complete 
LOG:  autovacuum launcher started 
LOG:  database system is ready to accept connections 
 
-bash-4.1$ 
-bash-4.1$ /opt/PostgreSQL/9.3/bin/psql -p 8889 -U postgres postgres 
Timing is on. 
psql.bin (9.3.2) 
Type "help" for help. 
 
postgres=# 
postgres=# select pg_is_in_recovery(); 
 pg_is_in_recovery 
------------------- 
 f 
(1 row) 
 
3. Once failover is done, the timeline ID of the WALs changes.
bash-4.1$ ls -ltrh /tmp/rep/Slave_gonna_Master/pg_xlog/ 
total 97M 
-rw-------. 1 postgres postgres  16M Mar 30 09:27 000000010000000000000002 
-rw-------. 1 postgres postgres  16M Mar 30 09:27 000000010000000000000001 
-rw-------. 1 postgres postgres  16M Mar 30 09:28 000000010000000000000003 
-rw-------. 1 postgres postgres  16M Mar 30 09:35 000000010000000000000004 
-rw-------. 1 postgres postgres  16M Mar 30 09:35 000000010000000000000005 
-rw-------. 1 postgres postgres   41 Mar 30 09:36 00000002.history 
drwx------. 2 postgres postgres 4.0K Mar 30 09:36 archive_status 
-rw-------. 1 postgres postgres  16M Mar 30 09:41 000000020000000000000005 
-bash-4.1$ 
-bash-4.1$ 
-bash-4.1$ 
-bash-4.1$ cat /tmp/rep/Slave_gonna_Master/pg_xlog/*.history 
1  0/5000090    no recovery target specified
4. Now shut down the Master, create a recovery.conf file (changing the port number in the primary_conninfo parameter to point at the new Master), set hot_standby to on in postgresql.conf, and add pg_hba.conf entries on the new Master to allow connections.
-bash-4.1$ /opt/PostgreSQL/9.3/bin/pg_ctl -D /tmp/rep/Master_but_Slave_soon/ stop -mf 
waiting for server to shut down....LOG:  received fast shutdown request 
LOG:  aborting any active transactions 
LOG:  autovacuum launcher shutting down 
LOG:  shutting down 
LOG:  database system is shut down 
-bash-4.1$ 
-bash-4.1$ 
-bash-4.1$ 
-bash-4.1$ cp -R /tmp/rep/Slave_gonna_Master/recovery.done /tmp/rep/Master_but_Slave_soon/recovery.conf 
-bash-4.1$
5. Now move the pg_control of the old Master to pg_control.old and copy the pg_control of the new Master to the old Master (I'm not sure whether copying pg_control between replication clusters while they are in sync is recommended).
-bash-4.1$ mv /tmp/rep/Master_but_Slave_soon/global/pg_control /tmp/rep/Master_but_Slave_soon/global/pg_control.old 
-bash-4.1$ 
-bash-4.1$ cp -R /tmp/rep/Slave_gonna_Master/global/pg_control /tmp/rep/Master_but_Slave_soon/global/ 
6. Now try to start the Old Master:
-bash-4.1$ ./pg_ctl -D /tmp/rep/Master_but_Slave_soon/ start 
server starting 
-bash-4.1$ LOG:  database system was interrupted; last known up at 2014-03-30 09:36:11 IST 
LOG:  entering standby mode 
LOG:  restored log file "00000002.history" from archive 
LOG:  record with zero length at 0/50000F8 
LOG:  invalid primary checkpoint record 
LOG:  using previous checkpoint record at 0/4000060 
LOG:  database system was not properly shut down; automatic recovery in progress 
LOG:  redo starts at 0/4000028 
LOG:  record with zero length at 0/5000090 
LOG:  consistent recovery state reached at 0/5000090 
LOG:  database system is ready to accept read only connections 
cp: cannot stat `/tmp/rep/arch/000000020000000000000005': No such file or directory 
LOG:  restored log file "000000010000000000000005" from archive 
LOG:  record with zero length at 0/5000090 
LOG:  started streaming WAL from primary at 0/5000000 on timeline 2 
7. Check the replication status between the new Master (old Slave) and the old Master (new Slave).
-bash-4.1$ /opt/PostgreSQL/9.3/bin/psql -p 8889 -U postgres postgres 
Timing is on. 
psql.bin (9.3.2) 
Type "help" for help. 
 
postgres=# 
postgres=# select pg_is_in_recovery(); 
 pg_is_in_recovery 
------------------- 
 f 
(1 row) 
 
Time: 1.469 ms 
postgres=# \x 
Expanded display is on. 
postgres=# select * from pg_stat_replication ; 
-[ RECORD 1 ]----+--------------------------------- 
pid              | 26948 
usesysid         | 10 
usename          | postgres 
application_name | walreceiver 
client_addr      | 127.0.0.1 
client_hostname  | 
client_port      | 52443 
backend_start    | 2014-03-30 09:38:09.862199+05:30 
state            | streaming 
sent_location    | 0/5000340 
write_location   | 0/5000340 
flush_location   | 0/5000340 
replay_location  | 0/5000340 
sync_priority    | 0 
sync_state       | async 
 
Time: 52.213 ms 
postgres=# 
postgres=# 
postgres=# 
postgres=# create table after_failover(t int); 
CREATE TABLE 
Time: 67.959 ms 
postgres=# 
postgres=# 
postgres=# \q 
-bash-4.1$ /opt/PostgreSQL/9.3/bin/psql -p 8888 -U postgres postgres 
Timing is on. 
psql.bin (9.3.2) 
Type "help" for help. 
 
postgres=# 
postgres=# select pg_is_in_recovery(); 
 pg_is_in_recovery 
------------------- 
 t 
(1 row) 
 
Time: 1.411 ms 
postgres=# \dt after_failover 
             List of relations 
 Schema |      Name      | Type  |  Owner   
--------+----------------+-------+---------- 
 public | after_failover | table | postgres 
(1 row) 
Though it works here, I don't have any other test results, and it might not work in other situations.

Saturday 8 March 2014

Want to ensure that Postgres generates core dump after server crash?


As you all know, we have to start the server with the "-c" option in order to generate a core dump on a server crash. If you want to know whether your server was already started so as to generate core dumps, here you go.
Start the PostgreSQL cluster with the core dump option:
pg_ctl -c -D <data directory location>

Here are the steps to find out whether the server was started with the "-c" option.

1. Check the PID of the postmaster, then go to the location /proc/<pid of postmaster>.
2. At this location, you can inspect the limits file to check whether the server was started with the "-c" option. If you do not have any limits set for the enterprisedb/postgres user in the "limits.conf" file at "/etc/security", then the "Soft Limit" of "Max core file size" in the limits file will be "unlimited". The entry in the limits file should look like this:

Limit                     Soft Limit           Hard Limit           Units
Max core file size        unlimited            unlimited            bytes

If "Soft Limit" of "Max core file size" in "limits" file is "0", it indicates that your postmaster was not started with "-c" option.

Limit                     Soft Limit           Hard Limit           Units
Max core file size        0                    unlimited            bytes


If you have limits set for all users or for the enterprisedb user in the "limits.conf" file under "/etc/security", then the above Soft Limit and Hard Limit values will depend on the values in "limits.conf".

Simply put: if your postmaster was started with the "-c" option, the "Soft Limit" of "Max core file size" in the limits file will be some value other than "0", or unlimited (depending on the limits.conf file).

If we started our postmaster with the "-c" option, a core dump file will be generated under the $PGDATA directory on a backend crash. The core dump is generated by the operating system, and we can attach gdb to it to collect a stack trace or other information.

Here is one simple test case which shows how the core dump is generated:

1. We have started the postmaster with "-c" option.
-bash-4.1$ /opt/PostgreSQL/9.3/bin/pg_ctl -D /opt/PostgreSQL/9.3/data -c start
server starting

2. Go to the location /proc/<pid of postmaster> and check the limits file to find out whether the postmaster was started with the "-c" option.
-bash-4.1$ ps -ef|grep PostgreSQL|grep 9.3
postgres 48892     1  0 14:28 pts/15   00:00:00 /opt/PostgreSQL/9.3/bin/postgres -D /opt/PostgreSQL/9.3/data
postgres 48980 46943  0 14:30 pts/15   00:00:00 grep PostgreSQL
-bash-4.1$
-bash-4.1$ cd /proc/48892/
-bash-4.1$ grep 'core' limits
Max core file size        unlimited            unlimited            bytes   
-bash-4.1$ cat limits
Limit                     Soft Limit           Hard Limit           Units   
Max cpu time              unlimited            unlimited            seconds 
Max file size             unlimited            unlimited            bytes   
Max data size             unlimited            unlimited            bytes  
Max stack size            10485760             unlimited            bytes   
Max core file size        unlimited            unlimited            bytes   
Max resident set          unlimited            unlimited            bytes   
Max processes             1024                 15777                processes
Max open files            1024                 1024                 files   
Max locked memory         65536                65536                bytes   
Max address space         unlimited            unlimited            bytes   
Max file locks            unlimited            unlimited            locks   
Max pending signals       15777                15777                signals 
Max msgqueue size         819200               819200               bytes   
Max nice priority         0                    0                   
Max realtime priority     0                    0                   
Max realtime timeout      unlimited            unlimited            us      

3. For example, aborting any server process will generate a core dump file under the $PGDATA directory.
-bash-4.1$ ps -ef|grep 48892
postgres 48892     1  0 14:28 pts/15   00:00:00 /opt/PostgreSQL/9.3/bin/postgres -D /opt/PostgreSQL/9.3/data
postgres 48893 48892  0 14:28 ?        00:00:00 postgres: logger process
postgres 49090 48892  0 14:32 ?        00:00:00 postgres: checkpointer process
postgres 49091 48892  0 14:32 ?        00:00:00 postgres: writer process
postgres 49092 48892  0 14:32 ?        00:00:00 postgres: wal writer process
postgres 49093 48892  0 14:32 ?        00:00:00 postgres: autovacuum launcher process           
postgres 49094 48892  0 14:32 ?        00:00:00 postgres: stats collector process
postgres 49114 46943  0 14:32 pts/15   00:00:00 grep 48892
-bash-4.1$
-bash-4.1$ kill -ABRT 49091 
-bash-4.1$ ls -ltrh /opt/PostgreSQL/9.3/data/core*
-rw-------. 1 postgres postgres 143M Mar  8 14:33 /opt/PostgreSQL/9.3/data/core.49091
4. We can attach gdb to the core dump to collect a stack trace using the command below.
-bash-4.1$ gdb -q -c /opt/PostgreSQL/9.3/data/core.49091 /opt/PostgreSQL/9.3/bin/postgres
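
Inside gdb, the bt command prints the stack trace; you can also grab it non-interactively (a sketch; the frames you get depend on your build):

-bash-4.1$ gdb -q -batch -ex 'bt full' /opt/PostgreSQL/9.3/bin/postgres /opt/PostgreSQL/9.3/data/core.49091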

To get a readable stack trace, the binaries should have been built with debug symbols; I'm not sure the installer used here shipped them, so I'd better stop here.. ;-)

Saturday 1 March 2014

Huge archive generation with PostgreSQL cluster??

Recently I found huge archive log generation on one of the servers (for my client): around 3500 files per day, which needs 3500 * 16 MB, about 55 GB, of space every day, and space was low on that server. The first thing that came to my mind was the checkpoint parameters. I had a look at them and they were at the defaults, so I changed them as below:

checkpoint_segments = 180
checkpoint_timeout = 30min
checkpoint_completion_target = 0.9 

I monitored for a couple of days and found that archive log generation decreased by 50% (to around 1700 files).

With the above parameter settings, archive log generation should be minimal. However, if it is not, here are some general suggestions to look at:

-- Make sure that you don't UPDATE a row to the same values it already has. The row will be logged even if the old and new values match. One way to handle this is to make sure that any UPDATE statement which might set a column to its existing value includes in the WHERE clause a test that oldvalue IS DISTINCT FROM newvalue. Or you can consider creating a trigger on problem tables using the Postgres built-in function suppress_redundant_updates_trigger (see the sketch after these suggestions; you might also want to look here: http://www.postgresql.org/docs/9.3/static/functions-trigger.html)

-- Use one UPDATE statement instead of several where possible. If updates to different columns or from multiple processing steps can be combined, this can reduce WAL logging.

-- Consider using UNLOGGED or TEMPORARY tables where possible. For example, if data is inserted into the database and updated right away, perform the initial steps in an UNLOGGED or TEMPORARY table, and INSERT into the main table once the data has been prepared.

-- Identify and eliminate unused indexes. Index changes are WAL-logged so that indexes can be recovered on a crash. The storage space needed for indexes can be more than the storage space needed for the data itself, with a corresponding increase in generated WAL.
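
A minimal sketch of the suppress_redundant_updates_trigger approach mentioned above, assuming a hypothetical table problem_table (the z_ prefix is the convention from the docs, so the trigger fires last among BEFORE triggers):

CREATE TRIGGER z_min_update
BEFORE UPDATE ON problem_table
FOR EACH ROW EXECUTE PROCEDURE suppress_redundant_updates_trigger();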

Sunday 26 January 2014

Copy all objects of one schema to another schema within the same database


There are two ways to copy all objects of one schema to another schema within the same database.

Procedure 1:
------------
1. Take the dump of the source schema using the command below.
$./pg_dump -U <username> -p <port> -n <schema name> <database name> >> <dump filename>
Ex:-
-bash-4.1$ /opt/PostgreSQL/9.3/bin/pg_dump -U postgres -p 5435 -n test postgres >> /tmp/test_dmp.sql
2. Open the dump file created in step 1 and remove the "CREATE SCHEMA" and "ALTER SCHEMA" commands, for example:
CREATE SCHEMA test;

ALTER SCHEMA test OWNER TO postgres;
3. Replace the source schema name with the target schema name in the dump file (see the sed sketch after this procedure).
4. Restore into the database using the command below.

$./psql -U <username> -p <port> -d <database> -f <dump filename>

Ex:- 
$./psql -U postgres -p 5435 -d postgres -f dump.sql
Or connect to the psql prompt and execute the dump file from the target database.
postgres=# \i <dump file location>
Ex:- postgres=# \i dump.sql
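
A sed sketch for step 3, assuming the source schema is test and the target is test2 (watch out for false matches if "test." appears inside your data):

$ sed 's/\btest\./test2./g' /tmp/test_dmp.sql > /tmp/test2_dmp.sql

Also adjust the SET search_path line at the top of the dump if it mentions the source schema.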
Procedure 2:
------------
Rename the target schema to a different name, then rename the source schema to the target schema's name. Take the dump of the source schema, which is now renamed to the target schema (you can use step 1 of procedure 1 for the dump). Then rename the source and target schemas back to their previous names, and restore the schema (you can use step 4 of procedure 1 for the restore).

 For example, your source schema name is test1 and the target schema name is test2.
 1. Rename schema test2 to test3, and rename test1 to test2.
postgres=#ALTER SCHEMA <schemaname> RENAME TO <schemaname>;
2. Take the dump of test2.
$./pg_dump -U <username> -p <port> -n <schema name> <database name> >> <dump filename>
3. Rename test2 to test1 and test3 to test2.
postgres=#ALTER SCHEMA <schemaname> RENAME TO <schemaname>;
4. Restore the dump of test2 which is taken in step 2.
$./psql -U <username> -p <port> -d <database> -f <dump filename>
Note: You can ignore the "schema already exists" errors during the restore process, since the target schema already exists.