Saturday, 29 March 2014

Is it failback or switchover with PostgreSQL? Ah, maybe not...

Sometimes I feel like I found something new in postgres, but most of the times it turned out to be known one, I mean, already known by people. I believe, this is a kinda same thing now(this might be known by many already). I see many asking about switchover/failback in PostgreSQL and I always suggest to go for a fresh backup of new master(after failover) to sync old master as a slave to it. Of course, it is the correct procedure. However there is something to achieve this(like below procedure): 
 1. Setup steaming replication between two clusters i.e "Master_but_Slave_soon" with 8888 port and "Slave_gonna_Master" with 8889. See here to setup SR.
-bash-4.1$ /opt/PostgreSQL/9.3/bin/pg_ctl -D /tmp/rep/Slave_gonna_Master/ start
server starting
-bash-4.1$ LOG:  database system was interrupted; last known up at 2014-03-30 09:26:59 IST
LOG:  entering standby mode
LOG:  restored log file "000000010000000000000003" from archive
LOG:  redo starts at 0/3000028
LOG:  consistent recovery state reached at 0/30000F0
LOG:  database system is ready to accept read only connections
cp: cannot stat `/tmp/rep/arch/000000010000000000000004': No such file or directory
LOG:  started streaming WAL from primary at 0/4000000 on timeline 1
-bash-4.1$
-bash-4.1$ ./pg_ctl -D /tmp/rep/Master_but_Slave_soon/ status
pg_ctl: server is running (PID: 26531)
/opt/PostgreSQL/9.3/bin/postgres "-D" "/tmp/rep/Master_but_Slave_soon"
-bash-4.1$
-bash-4.1$
-bash-4.1$ ./psql -p 8888 postgres<
Timing is on.
psql.bin (9.3.2)
Type "help" for help.

postgres=# \x 
Expanded display is on. 
postgres=# select * from pg_stat_replication; 
-[ RECORD 1 ]----+--------------------------------- 
pid              | 26651 
usesysid         | 10 
usename          | postgres 
application_name | walreceiver 
client_addr      | 127.0.0.1 
client_hostname  | 
client_port      | 26058 
backend_start    | 2014-03-30 09:28:59.204298+05:30 
state            | streaming 
sent_location    | 0/4000000 
write_location   | 0/4000000 
flush_location   | 0/4000000 
replay_location  | 0/4000000 
sync_priority    | 0 
sync_state       | async 
 
Time: 155.089 ms 
postgres=# 
postgres=# 
postgres=# \q 
-bash-4.1$ 
-bash-4.1$ 
-bash-4.1$ ./pg_ctl -D /tmp/rep/Slave_gonna_Master/ status 
pg_ctl: server is running (PID: 26643) 
/opt/PostgreSQL/9.3/bin/postgres "-D" "/tmp/rep/Slave_gonna_Master" 
-bash-4.1$ 
-bash-4.1$ 
-bash-4.1$ ./psql -p 8889 postgres 
Timing is on. 
psql.bin (9.3.2) 
Type "help" for help. 
 
postgres=# 
postgres=# select pg_is_in_recovery(); 
 pg_is_in_recovery 
------------------- 
 t 
(1 row) 
 
Time: 3.031 ms 
postgres=#
2. Go for Failover now by creating trigger file. Now our Slave is new Master.
-bash-4.1$ cat /tmp/rep/Slave_gonna_Master/recovery.conf 
standby_mode=on 
primary_conninfo='host=localhost port=8888' 
restore_command='cp /tmp/rep/arch/%f %p' 
trigger_file='/tmp/rep/8888.trigger' 
 
-bash-4.1$ touch /tmp/rep/8888.trigger 
LOG:  trigger file found: /tmp/rep/8888.trigger 
LOG:  redo done at 0/5000028 
cp: cannot stat `/tmp/rep/arch/000000010000000000000005': No such file or directory 
cp: cannot stat `/tmp/rep/arch/00000002.history': No such file or directory 
LOG:  selected new timeline ID: 2 
cp: cannot stat `/tmp/rep/arch/00000001.history': No such file or directory 
LOG:  archive recovery complete 
LOG:  autovacuum launcher started 
LOG:  database system is ready to accept connections 
 
-bash-4.1$ 
-bash-4.1$ /opt/PostgreSQL/9.3/bin/psql -p 8889 -U postgres postgres 
Timing is on. 
psql.bin (9.3.2) 
Type "help" for help. 
 
postgres=# 
postgres=# select pg_is_in_recovery(); 
 pg_is_in_recovery 
------------------- 
 f 
(1 row) 
 
3. Once failover is done, timeline ID of WALs will be changed.
bash-4.1$ ls -ltrh /tmp/rep/Slave_gonna_Master/pg_xlog/ 
total 97M 
-rw-------. 1 postgres postgres  16M Mar 30 09:27 000000010000000000000002 
-rw-------. 1 postgres postgres  16M Mar 30 09:27 000000010000000000000001 
-rw-------. 1 postgres postgres  16M Mar 30 09:28 000000010000000000000003 
-rw-------. 1 postgres postgres  16M Mar 30 09:35 000000010000000000000004 
-rw-------. 1 postgres postgres  16M Mar 30 09:35 000000010000000000000005 
-rw-------. 1 postgres postgres   41 Mar 30 09:36 00000002.history 
drwx------. 2 postgres postgres 4.0K Mar 30 09:36 archive_status 
-rw-------. 1 postgres postgres  16M Mar 30 09:41 000000020000000000000005 
-bash-4.1$ 
-bash-4.1$ 
-bash-4.1$ 
-bash-4.1$ cat /tmp/rep/Slave_gonna_Master/pg_xlog/*.history 
1  0/5000090    no recovery target specified
4. Now shutdown the Master and create a recovery.conf file(change the port number to new Master for primary_conninfo parameter) and change hot_standby parameter to on in postgresql.conf and add pg_hba.conf entries in new Master to allow connections.
-bash-4.1$ /opt/PostgreSQL/9.3/bin/pg_ctl -D /tmp/rep/Master_but_Slave_soon/ stop -mf 
waiting for server to shut down....LOG:  received fast shutdown request 
LOG:  aborting any active transactions 
LOG:  autovacuum launcher shutting down 
LOG:  shutting down 
LOG:  database system is shut down 
-bash-4.1$ 
-bash-4.1$ 
-bash-4.1$ 
-bash-4.1$ cp -R /tmp/rep/Slave_gonna_Master/recovery.done /tmp/rep/Master_but_Slave_soon/recovery.conf 
-bash-4.1$
5. Now move pg_control of Old Master to pg_control.old and Copy the pg_control of New Master to Old Master(I'm not sure if copying pg_control between replication clusters while they are in sync is recommended).
-bash-4.1$ mv /tmp/rep/Master_but_Slave_soon/global/pg_control /tmp/rep/Master_but_Slave_soon/global/pg_control.old 
-bash-4.1$ 
-bash-4.1$ cp -R /tmp/rep/Slave_gonna_Master/global/pg_control /tmp/rep/Master_but_Slave_soon/global/ 
6. Now try to start the Old Master:
-bash-4.1$ ./pg_ctl -D /tmp/rep/Master_but_Slave_soon/ start 
server starting 
-bash-4.1$ LOG:  database system was interrupted; last known up at 2014-03-30 09:36:11 IST 
LOG:  entering standby mode 
LOG:  restored log file "00000002.history" from archive 
LOG:  record with zero length at 0/50000F8 
LOG:  invalid primary checkpoint record 
LOG:  using previous checkpoint record at 0/4000060 
LOG:  database system was not properly shut down; automatic recovery in progress 
LOG:  redo starts at 0/4000028 
LOG:  record with zero length at 0/5000090 
LOG:  consistent recovery state reached at 0/5000090 
LOG:  database system is ready to accept read only connections 
cp: cannot stat `/tmp/rep/arch/000000020000000000000005': No such file or directory 
LOG:  restored log file "000000010000000000000005" from archive 
LOG:  record with zero length at 0/5000090 
LOG:  started streaming WAL from primary at 0/5000000 on timeline 2 
7. Check the replication status between New Master(Old Slave) and Old Master(New Slave).
-bash-4.1$ /opt/PostgreSQL/9.3/bin/psql -p 8889 -U postgres postgres 
Timing is on. 
psql.bin (9.3.2) 
Type "help" for help. 
 
postgres=# 
postgres=# select pg_is_in_recovery(); 
 pg_is_in_recovery 
------------------- 
 f 
(1 row) 
 
Time: 1.469 ms 
postgres=# \x 
Expanded display is on. 
postgres=# select * from pg_stat_replication ; 
-[ RECORD 1 ]----+--------------------------------- 
pid              | 26948 
usesysid         | 10 
usename          | postgres 
application_name | walreceiver 
client_addr      | 127.0.0.1 
client_hostname  | 
client_port      | 52443 
backend_start    | 2014-03-30 09:38:09.862199+05:30 
state            | streaming 
sent_location    | 0/5000340 
write_location   | 0/5000340 
flush_location   | 0/5000340 
replay_location  | 0/5000340 
sync_priority    | 0 
sync_state       | async 
 
Time: 52.213 ms 
postgres=# 
postgres=# 
postgres=# 
postgres=# create table after_failover(t int); 
CREATE TABLE 
Time: 67.959 ms 
postgres=# 
postgres=# 
postgres=# \q 
-bash-4.1$ /opt/PostgreSQL/9.3/bin/psql -p 8888 -U postgres postgres 
Timing is on. 
psql.bin (9.3.2) 
Type "help" for help. 
 
postgres=# 
postgres=# select pg_is_in_recovery(); 
 pg_is_in_recovery 
------------------- 
 t 
(1 row) 
 
Time: 1.411 ms 
postgres=# \dt after_failover 
             List of relations 
 Schema |      Name      | Type  |  Owner   
--------+----------------+-------+---------- 
 public | after_failover | table | postgres 
(1 row) 
Though it works here, I don't have any other test results as it might not work in other situations.

6 comments:

  1. And how safe is this method, exactly? I'm not sure how good an idea it is to have these kind of posts indexed by Google, since someone somewhere might actually do it with their data and end up with a broken replica.

    ReplyDelete
  2. Do I assume correctly that this will only work as long as no transaction commits on the "old master" after the trigger file has been created? In that case you have to be careful that the cluster is completely quiesced before creating the trigger file.

    ReplyDelete
  3. Look at pg_rewind tool which allow rewinding old master to new slave. http://www.youtube.com/watch?v=wcwh0icr9GM

    ReplyDelete
    Replies
    1. Thank you Alexey, Nice tool from Heikki. I will look at it.

      Delete
  4. This looks like an exceptionally bad idea to attempt in production on any kind of data you care about even slightly.

    The only safe failback (that doesn't involve a totally new pg_basebackup) in 9.4 and below is to stop the old master, make sure WAL archiving is enabled on the old master, pg_start_backup on the new master, rsync from new master to old, pg_stop_backup on the new master, and start the old master up as a replica with a recovery.conf.

    There's been discussion on ways to do this safely in the next version but I don't think it's gone far yet.

    ReplyDelete
  5. I think it's actually safe to do with pg_rewind as well in 9.3, if (and only if) you enable wal_log_hints or use data checksums. But only with those.

    The method suggested here is definitely very likely to cause corruption. It's enough that any background operation runs in between, such as autovacuum, generating any WAL at all.

    Another safe way is to shut down the old master *first*. That should ensure that no WAL can be created in the "fork".

    (I would also advice you to look at pg_ctl promote as the recommended way to do failover, rather than trigger files)

    ReplyDelete