Monday 28 October 2013

PostgreSQL Architecture

I know it's not the first blog post on PostgreSQL architecture, but it's my first one.. :-) I have created a diagram of the PostgreSQL architecture based on my experience with the database.

I have discussed this chart with PostgreSQL experts on how Postgres works, and here are some notes from those discussions:


Architecture Diagram:
=============

[PostgreSQL architecture diagram]

Working Process:
===========

When you start PostgreSQL, the postmaster starts first and allocates the shared memory. It also accepts connections and spins off a backend for each new connection, so each backend (server process) gets its pointers to shared memory from the postmaster. It is pretty disastrous if the postmaster dies with backends still running, so we have it do as little as possible, so that there isn't much that can crash it.

Postgres does have a pool of shared memory; however, it does not have a library or dictionary cache stored in that memory. This means that statements need to be parsed and planned every time they are entered. If parse/plan overhead is an issue, we suggest the use of prepared statements. While Oracle is able to avoid the repeated parse/plan overhead, it must still do enough analysis of the query to determine whether the information is present in the library cache, which also consumes some time and CPU resources. The parser is quite lightweight, so we feel that the overhead of parsing the query each time is acceptable.
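As a quick illustration of that last point, here is a minimal sketch of a prepared statement (the employees table and emp_id column are hypothetical). PREPARE parses the statement once for the session, and later EXECUTE calls skip the parse (and usually the plan) step:

    -- Parse and plan once per session:
    PREPARE get_emp (integer) AS
        SELECT * FROM employees WHERE emp_id = $1;

    -- Execute many times without re-parsing:
    EXECUTE get_emp(42);
    EXECUTE get_emp(7);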


1. Shared Memory:

----------------------

SHARED BUFFERS -- The biggest chunk of shared memory is shared_buffers. When pages from a table or index are read from the OS, they are read into shared_buffers, and the backends reference the pages and their contents right there in shared memory. An exception is temporary tables: since only the creating backend can reference a temp table, its data is accessed in temp_buffers space as much as possible (temp_buffers is separate; it is not in shared memory). It's faster to access process-local memory like that because you don't need to worry about pinning or locking the data, since you are not sharing it.
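You can check both settings from any session; a minimal sketch (note that temp_buffers can only be raised before the session first touches a temporary table):

    SHOW shared_buffers;   -- size of the shared buffer pool
    SHOW temp_buffers;     -- per-session cap for temporary-table buffers

    -- Must be set before the first use of a temp table in this session:
    SET temp_buffers = '32MB';
    CREATE TEMP TABLE scratch (id integer);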

WAL BUFFERS -- are for buffering data to be written to the WAL files.
CLOG BUFFERS -- are one of the SLRU-style buffers oriented toward circular "rings" of data, like which transaction numbers have been committed or rolled back.
LOCK SPACE -- Memory structures in shared memory are generally protected by "lightweight" locks, which are in shared memory. Tables are protected by "heavyweight" locks, which are also in shared memory (and themselves protected by lightweight locks). Of course, lightweight locks are protected by spinlocks. It gets very complicated and fussy. :-) (Heavyweight locks are the ones you can observe from SQL; see the pg_locks query after this list.)
OTHER BUFFERS -- are probably mostly SLRU buffers besides CLOG (which was the first user of the SLRU system). SLRU is good for data where you mostly want to use recently accessed data and then you are done with it relatively quickly.
The opposite of shared memory is process-local memory: memory allocated by one backend server process that only that process can access. Shared memory is memory that all of the backend server processes can directly access, but to prevent chaos, access to it must follow some rules, such as locking areas of memory a process will be using, which tends to make it a little slower. Process-local memory can't be seen or used by the other backend server processes, so it's faster to access, and there are no worries about another process trashing it while you're using it.

Getting back to the shared memory uses I hadn't talked about: CLOG buffers and the other SLRU buffers (multixact, notify, subtrans, serial, etc.) keep recently accessed data in memory, but each SLRU system has its own subdirectory under the data directory, and data spills to disk there beyond a limited number of buffers.
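Lightweight locks and spinlocks are internal and not exposed at the SQL level, but the heavyweight locks are visible to every backend through the pg_locks view. A minimal look at the table-level locks currently held or awaited:

    SELECT locktype, relation::regclass, mode, granted
    FROM pg_locks
    WHERE relation IS NOT NULL;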

2. Utility Processes:
---------------------
With a default configuration I see the postmaster, the checkpointer process, the writer process, the wal writer process, the autovacuum launcher process, and the stats collector process. I think you will see more processes running if you turn on archiving or streaming replication. You might also get a process for writing the server log, depending on configuration. As their names say:

WRITER -- responsible for writing dirty buffers from shared buffers to the data files.
CHECKPOINTER -- performs checkpoints.
WAL WRITER -- writes the dirty buffers in the WAL buffers to the WAL files.
AUTOVACUUM LAUNCHER -- launches autovacuum workers when required (depending on your autovacuum settings in the postgresql.conf file).
STATS COLLECTOR -- collects the statistics on objects in the database that the optimizer requires to improve performance.

The checkpointer process is responsible for creating safe points from which a recovery can begin; the background writer tries to keep some pages available for re-use so that processes running queries don't need to wait for page writes in order to have free spots to use in shared buffers. Both the checkpointer and writer processes write to the same files; however, the checkpointer writes all data that was dirty as of a certain time (the start of the checkpoint) regardless of how often it was used since being dirtied, while the background writer writes data that hasn't been used recently, regardless of when it was first dirtied. Neither knows or cares whether the data being written was committed, rolled back, or still in progress.
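A quick way to see this division of labor on your own system is the pg_stat_bgwriter view, which keeps running counters of who wrote what:

    SELECT checkpoints_timed,   -- checkpoints triggered by checkpoint_timeout
           checkpoints_req,     -- checkpoints forced by WAL volume or request
           buffers_checkpoint,  -- buffers written by the checkpointer
           buffers_clean,       -- buffers written by the background writer
           buffers_backend      -- buffers written by the backends themselves
    FROM pg_stat_bgwriter;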

3. Directory Structure:
--------------------------
All the data needed for a database cluster is stored within the cluster's data directory, commonly referred to as PGDATA. You can find a detailed description at the link below:
http://www.enterprisedb.com/docs/en/9.2/pg/storage-file-layout.html

I see the diagram left out one directory I would like to add: pg_serial. pg_serial is used to track summarized information about committed serializable transactions that might still become part of a serialization failure, rolling back some not-yet-committed transaction to protect data integrity.
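From SQL you can see where the data directory lives and which file under it backs a given table; a minimal sketch (my_table is a hypothetical table name):

    SHOW data_directory;   -- the PGDATA path for this cluster

    -- Path of the table's data file, relative to the data directory:
    SELECT pg_relation_filepath('my_table');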
The catalog cache is information from the system tables which describes the tables, indexes, views, etc. in the database. If you had to re-read that from the system tables each time, it would be slow. Even shared memory would be clumsy for that, so each backend process has its own cache of system catalog data for fast lookup. When anything changes, all backends are sent a signal to update or reload their cache data. When pages are read or written, they go through the OS cache, which is not directly under PostgreSQL's control.

The optimizer needs to keep track of a lot of information while it parses and plans a query, which is why that is shown in the diagram. A plan has execution nodes, some of which may need to use memory; that is where work_mem comes in: a sort or hash table (as examples) will try not to exceed work_mem *for that node*. It is significant that one query might use quite a few nodes which each allocate memory up to work_mem. But since most queries are simpler and might not use any work_mem allocations, people often do their calculations based on an expected maximum of one allocation per backend (i.e., per connection). That could be off by quite a bit if all connections might be running queries with five nodes allocating memory.
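You can watch the per-node behavior of work_mem with EXPLAIN ANALYZE; a minimal sketch, assuming a hypothetical big_table with a some_column column large enough to make the sort interesting:

    -- With a small work_mem, a large sort spills to disk:
    SET work_mem = '1MB';
    EXPLAIN ANALYZE SELECT * FROM big_table ORDER BY some_column;
    -- look for "Sort Method: external merge  Disk: ...kB"

    -- With a larger work_mem, the same sort can stay in memory:
    SET work_mem = '64MB';
    EXPLAIN ANALYZE SELECT * FROM big_table ORDER BY some_column;
    -- look for "Sort Method: quicksort  Memory: ...kB"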

It is worth noting that if there is enough RAM on the machine to have a good-sized OS cache, a PostgreSQL page read will often just be a copy from the system cache to pg shared_buffers, and a page write will often just be a copy from pg shared_buffers to the system cache. The fsync of the table files, which is part of the checkpoint process, is when the data is actually written from the OS to the storage system. But even there a server may have a battery-backed RAM cache, so the OS write to storage is often just a copy in RAM... unless there is so much writing that the RAID controller's cache fills, at which point writes suddenly become hundreds of times slower than they were.


Other interesting dynamics: pg will try to minimize disk writes by hanging onto dirty buffers (ones which have logically been updated) before writing them to the OS. But buffers may need to be written so they can be freed so that a new read or write has a buffer to use. If a request to read a page or write to a new buffer can't find an idle buffer, the query might need to write out a buffer dirtied by some other backend before it can do its read (or whatever). The background writer can help with this: it tries to watch how fast new pages are being requested and write out dirty pages at a rate that will stay ahead of demand.
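If you want to see this buffer state for yourself, the pg_buffercache contrib module (assuming it is installed for your cluster) exposes one row per shared buffer, including whether it is dirty:

    CREATE EXTENSION pg_buffercache;

    -- How many shared buffers are currently dirty vs. clean?
    SELECT isdirty, count(*)
    FROM pg_buffercache
    GROUP BY isdirty;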