Friday, 4 July 2014

Oops I corrupted my table, of-course just to recover salvaged data.


The way I started the title might be confusing, "I corrupted my table", so everyone starts with "Crap !! why did you do that !!, ...", so just to justify it.......
I see many customers coming for recovering the corrupted tables without any backup. In such cases, hard to recover the tables completely and it needs lot of work, however we can recover salvaged data if they dont care about corrupted rows.

Let me corrupt the table first.. :-)

 I created a million-row table called "to_be_damaged"
postgres=# select count(*) from to_be_damaged ;
  count 
---------
 1000000
(1 row)
postgres=# select relfilenode,relname from pg_class where relname='to_be_damaged';
relfilenode |    relname   
-------------+---------------
       461257 | to_be_damaged

(1 row)
I've used "hexedit" to damage it. Open relfilenode file from OS level using hexedit and try picking a line which is the start of an 8K boundary and typing hex DE AD BE EF across it.

postgres=# select count(*) from to_be_damaged ;
ERROR:  invalid page in block 0 of relation base/12896/461257

Now create an identical table "salvaged" to recover salvaged data from "to_be_damaged" table.
postgres=# create table salvaged(t int);
CREATE TABLE
Prepared below function which copies the rows which are still salvageable:
create or replace function salvage_damaged()
  returns void
  language plpgsql
as $$
declare
  pageno int;
  tupno int;
  pos tid;
begin
  <<pageloop>>
  for pageno in 0..35930 loop  -- pg_class.relpages for the damaged table
    for tupno in 1..1000 loop
      pos = ('(' || pageno || ',' || tupno || ')')::tid;
      begin
        insert into salvaged select * from to_be_damaged where ctid = pos;
      exception
        when sqlstate 'XX001' then
          raise warning 'skipping page %', pageno;
          continue pageloop;
        when others then
          raise warning 'skipping row %', pos;
      end;
    end loop;
  end loop;
end;
$$;

Now run the function to copy salvagable rows:
postgres# select salvage_damaged();
WARNING: skipping page 0
salvage_damaged
-----------------

(1 row)
postgres=# select count(*) from salvaged ;
count
----------
12999815
(1 row)

postgres=# select 13000000-12999815;
?column?
----------
185
(1 row)

I hope it helps someone. Thanks for reading.

7 comments:

  1. Thanks, thanks a great idea, but here's a slight mod that makes the function a little more flexible:

    CREATE OR REPLACE FUNCTION salvage_damaged_table(bad_table name)
    returns void
    language plpgsql
    AS $$
    DECLARE
    bad_table ALIAS FOR $1;
    totpages int;
    tottuples bigint;
    pageno int;
    tupno int;
    pos tid;

    BEGIN
    SELECT relpages, reltuples INTO totpages, tottuples
    FROM pg_class
    WHERE relname = quote_ident(bad_table)
    AND relkind = 'r';

    RAISE NOTICE 'totpages %, tottuples %', totpages::text, tottuples::text;

    <>
    for pageno in 0..totpages loop -- pg_class.relpages for the damaged table
    for tupno in 1..tottuples loop
    pos = ('(' || pageno || ',' || tupno || ')')::tid;
    begin
    insert into salvaged
    select *
    from my_bad_table -- <-- Replace with actual table name here.
    where ctid = pos;
    exception
    when sqlstate 'XX001' then
    raise warning 'skipping page %', pageno;
    continue pageloop;
    when others then
    raise warning 'skipping row %, SQLSTATE %', pos, SQLSTATE::text;
    end;
    end loop;
    end loop;

    RETURN;
    end;
    $$;

    ReplyDelete
    Replies
    1. Thanks for the suggestion, it looks cool !

      Delete
    2. postgres=# \i a.sql
      CREATE FUNCTION
      postgres=# select salvage_damaged_table('my_bad_table') ;
      NOTICE: totpages 57712, tottuples 2433024
      ERROR: invalid input syntax for type tid: "(0,65536)"
      PL/pgSQL function salvage_damaged_table(character varying) line 20 at assignment



      here :
      for pageno in 0..totpages loop -- pg_class.relpages for the damaged table
      for tupno in 1..tottuples loop


      is wrong

      tuple number can not be bigger than 65536


      so

      for pageno in 0..totpages loop -- pg_class.relpages for the damaged table
      for tupno in 1..65535 loop

      thanks

      Delete
  2. This comment has been removed by the author.

    ReplyDelete
  3. Thanks Baji. Nice approach to get the data back from corrupted table.

    ReplyDelete
  4. Great thanks Baji, very useful post!

    ReplyDelete