[[TOC]]
PostgreSQL is an advanced database server that is robust and fast, although possibly less well-known and popular than its eternal rival in the free software world, MySQL.
Tutorial
These are quick reminders on easy things to do in a cluster.
Connecting
Our PostgreSQL setup is fairly standard so connecting to the database is like any other Debian machine:
sudo -u postgres psql
This drops you in a psql shell where you can issue SQL queries and so on.
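For one-off commands, psql can also run a single statement or meta-command directly from the shell; for example, to list databases, or to list the tables (with sizes) of a hypothetical somedatabase:
sudo -u postgres psql -c '\l'
sudo -u postgres psql -d somedatabase -c '\dt+'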
Creating a user and a database
This procedure will create a user and a database named tor-foo:
sudo -u postgres createuser -D -E -P -R -S tor-foo
sudo -u postgres createdb tor-foo
For read-only permissions:
sudo -u postgres psql -d tor-foo -c 'GRANT SELECT ON ALL TABLES IN SCHEMA public TO "tor-foo";
GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO "tor-foo";
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA public TO "tor-foo";'
For read-write:
sudo -u postgres psql -d tor-foo -c 'GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO "tor-foo";
GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO "tor-foo";
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA public TO "tor-foo";'
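Note that the GRANT ... ON ALL ... IN SCHEMA forms only affect objects that already exist. If the owning role keeps creating new tables and the grants should follow automatically, ALTER DEFAULT PRIVILEGES should cover it; a sketch, where tor-foo-owner is a hypothetical role that creates the objects:
-- tor-foo-owner is a hypothetical role that will create future tables
ALTER DEFAULT PRIVILEGES FOR ROLE "tor-foo-owner" IN SCHEMA public
    GRANT SELECT ON TABLES TO "tor-foo";
ALTER DEFAULT PRIVILEGES FOR ROLE "tor-foo-owner" IN SCHEMA public
    GRANT SELECT ON SEQUENCES TO "tor-foo";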
How-to
Checking permissions
It's surprisingly hard to figure out the privileges of a given user in PostgreSQL. First, it's context-sensitive (per database), and second, there are all sorts of places where that information can be found.
The simplest way is to use the documented \du command to list
users, which will also show which databases they own, but only
that. To go beyond (e.g. specific GRANTs), you need something
more. This, for example, will show SELECT grants on a table, given
that you're connected to the right database already:
SELECT *
FROM information_schema.role_table_grants
WHERE grantee='USERNAME';
But it won't show access like table ownerships. For that you need:
SELECT *
FROM pg_tables
WHERE tableowner = 'USERNAME';
But that won't show things like "functions" and so on.
This mouthful of SQL might be more exhaustive:
-- Cluster permissions not "on" anything else
SELECT
'cluster' AS on,
NULL AS name_1,
NULL AS name_2,
NULL AS name_3,
unnest(
CASE WHEN rolcanlogin THEN ARRAY['LOGIN'] ELSE ARRAY[]::text[] END
|| CASE WHEN rolsuper THEN ARRAY['SUPERUSER'] ELSE ARRAY[]::text[] END
|| CASE WHEN rolcreaterole THEN ARRAY['CREATE ROLE'] ELSE ARRAY[]::text[] END
|| CASE WHEN rolcreatedb THEN ARRAY['CREATE DATABASE'] ELSE ARRAY[]::text[] END
) AS privilege_type
FROM pg_roles
WHERE oid = quote_ident(:'rolename')::regrole
UNION ALL
-- Direct role memberships
SELECT 'role' AS on, groups.rolname AS name_1, NULL AS name_2, NULL AS name_3, 'MEMBER' AS privilege_type
FROM pg_auth_members mg
INNER JOIN pg_roles groups ON groups.oid = mg.roleid
INNER JOIN pg_roles members ON members.oid = mg.member
WHERE members.rolname = :'rolename'
-- Direct ACL or ownerships
UNION ALL (
-- ACL or owned-by dependencies of the role - global or in the currently connected database
WITH owned_or_acl AS (
SELECT
refobjid, -- The referenced object: the role in this case
classid, -- The pg_class oid that the dependent object is in
objid, -- The oid of the dependent object in the table specified by classid
deptype, -- The dependency type: o==is owner, and might have acl, a==has acl and not owner
objsubid -- The 1-indexed column index for table column permissions. 0 otherwise.
FROM pg_shdepend
WHERE refobjid = quote_ident(:'rolename')::regrole
AND refclassid='pg_catalog.pg_authid'::regclass
AND deptype IN ('a', 'o')
AND (dbid = 0 OR dbid = (SELECT oid FROM pg_database WHERE datname = current_database()))
),
relkind_mapping(relkind, type) AS (
VALUES
('r', 'table'),
('v', 'view'),
('m', 'materialized view'),
('f', 'foreign table'),
('p', 'partitioned table'),
('S', 'sequence')
),
prokind_mapping(prokind, type) AS (
VALUES
('f', 'function'),
('p', 'procedure'),
('a', 'aggregate function'),
('w', 'window function')
),
typtype_mapping(typtype, type) AS (
VALUES
('b', 'base type'),
('c', 'composite type'),
('e', 'enum type'),
('p', 'pseudo type'),
('r', 'range type'),
('m', 'multirange type'),
('d', 'domain')
)
-- Database ownership
SELECT 'database' AS on, datname AS name_1, NULL AS name_2, NULL AS name_3, 'OWNER' AS privilege_type
FROM pg_database d
INNER JOIN owned_or_acl a ON a.objid = d.oid
WHERE classid = 'pg_database'::regclass AND deptype = 'o'
UNION ALL
-- Database privileges
SELECT 'database' AS on, datname AS name_1, NULL AS name_2, NULL AS name_3, privilege_type
FROM pg_database d
INNER JOIN owned_or_acl a ON a.objid = d.oid
CROSS JOIN aclexplode(COALESCE(d.datacl, acldefault('d', d.datdba)))
WHERE classid = 'pg_database'::regclass AND grantee = refobjid
UNION ALL
-- Schema ownership
SELECT 'schema' AS on, nspname AS name_1, NULL AS name_2, NULL AS name_3, 'OWNER' AS privilege_type
FROM pg_namespace n
INNER JOIN owned_or_acl a ON a.objid = n.oid
WHERE classid = 'pg_namespace'::regclass AND deptype = 'o'
UNION ALL
-- Schema privileges
SELECT 'schema' AS on, nspname AS name_1, NULL AS name_2, NULL AS name_3, privilege_type
FROM pg_namespace n
INNER JOIN owned_or_acl a ON a.objid = n.oid
CROSS JOIN aclexplode(COALESCE(n.nspacl, acldefault('n', n.nspowner)))
WHERE classid = 'pg_namespace'::regclass AND grantee = refobjid
UNION ALL
-- Table(-like) ownership
SELECT r.type AS on, nspname AS name_1, relname AS name_2, NULL AS name_3, 'OWNER' AS privilege_type
FROM pg_class c
INNER JOIN pg_namespace n ON n.oid = c.relnamespace
INNER JOIN owned_or_acl a ON a.objid = c.oid
INNER JOIN relkind_mapping r ON r.relkind = c.relkind
WHERE classid = 'pg_class'::regclass AND deptype = 'o' AND objsubid = 0
UNION ALL
-- Table(-like) privileges
SELECT r.type AS on, nspname AS name_1, relname AS name_2, NULL AS name_3, privilege_type
FROM pg_class c
INNER JOIN pg_namespace n ON n.oid = c.relnamespace
INNER JOIN owned_or_acl a ON a.objid = c.oid
CROSS JOIN aclexplode(COALESCE(c.relacl, acldefault('r', c.relowner)))
INNER JOIN relkind_mapping r ON r.relkind = c.relkind
WHERE classid = 'pg_class'::regclass AND grantee = refobjid AND objsubid = 0
UNION ALL
-- Column privileges
SELECT 'table column', nspname AS name_1, relname AS name_2, attname AS name_3, privilege_type
FROM pg_attribute t
INNER JOIN pg_class c ON c.oid = t.attrelid
INNER JOIN pg_namespace n ON n.oid = c.relnamespace
INNER JOIN owned_or_acl a ON a.objid = t.attrelid
CROSS JOIN aclexplode(COALESCE(t.attacl, acldefault('c', c.relowner)))
WHERE classid = 'pg_class'::regclass AND grantee = refobjid AND objsubid != 0
UNION ALL
-- Function and procedure ownership
SELECT m.type AS on, nspname AS name_1, proname AS name_2, p.oid::text AS name_3, 'OWNER' AS privilege_type
FROM pg_proc p
INNER JOIN pg_namespace n ON n.oid = p.pronamespace
INNER JOIN owned_or_acl a ON a.objid = p.oid
INNER JOIN prokind_mapping m ON m.prokind = p.prokind
WHERE classid = 'pg_proc'::regclass AND deptype = 'o'
UNION ALL
-- Function and procedure privileges
SELECT m.type AS on, nspname AS name_1, proname AS name_2, p.oid::text AS name_3, privilege_type
FROM pg_proc p
INNER JOIN pg_namespace n ON n.oid = p.pronamespace
INNER JOIN owned_or_acl a ON a.objid = p.oid
CROSS JOIN aclexplode(COALESCE(p.proacl, acldefault('f', p.proowner)))
INNER JOIN prokind_mapping m ON m.prokind = p.prokind
WHERE classid = 'pg_proc'::regclass AND grantee = refobjid
UNION ALL
-- Large object ownership
SELECT 'large object' AS on, l.oid::text AS name_1, NULL AS name_2, NULL AS name_3, 'OWNER' AS privilege_type
FROM pg_largeobject_metadata l
INNER JOIN owned_or_acl a ON a.objid = l.oid
WHERE classid = 'pg_largeobject'::regclass AND deptype = 'o'
UNION ALL
-- Large object privileges
SELECT 'large object' AS on, l.oid::text AS name_1, NULL AS name_2, NULL AS name_3, privilege_type
FROM pg_largeobject_metadata l
INNER JOIN owned_or_acl a ON a.objid = l.oid
CROSS JOIN aclexplode(COALESCE(l.lomacl, acldefault('L', l.lomowner)))
WHERE classid = 'pg_largeobject'::regclass AND grantee = refobjid
UNION ALL
-- Type ownership
SELECT m.type, nspname AS name_1, typname AS name_2, NULL AS name_3, 'OWNER' AS privilege_type
FROM pg_type t
INNER JOIN pg_namespace n ON n.oid = t.typnamespace
INNER JOIN owned_or_acl a ON a.objid = t.oid
INNER JOIN typtype_mapping m ON m.typtype = t.typtype
WHERE classid = 'pg_type'::regclass AND deptype = 'o'
UNION ALL
-- Type privileges
SELECT m.type, nspname AS name_1, typname AS name_2, NULL AS name_3, privilege_type
FROM pg_type t
INNER JOIN pg_namespace n ON n.oid = t.typnamespace
INNER JOIN owned_or_acl a ON a.objid = t.oid
CROSS JOIN aclexplode(COALESCE(t.typacl, acldefault('T', t.typowner)))
INNER JOIN typtype_mapping m ON m.typtype = t.typtype
WHERE classid = 'pg_type'::regclass AND grantee = refobjid
UNION ALL
-- Language ownership
SELECT 'language' AS on, l.lanname AS name_1, NULL AS name_2, NULL AS name_3, 'OWNER' AS privilege_type
FROM pg_language l
INNER JOIN owned_or_acl a ON a.objid = l.oid
WHERE classid = 'pg_language'::regclass AND deptype = 'o'
UNION ALL
-- Language privileges
SELECT 'language' AS on, l.lanname AS name_1, NULL AS name_2, NULL AS name_3, privilege_type
FROM pg_language l
INNER JOIN owned_or_acl a ON a.objid = l.oid
CROSS JOIN aclexplode(COALESCE(l.lanacl, acldefault('l', l.lanowner)))
WHERE classid = 'pg_language'::regclass AND grantee = refobjid
UNION ALL
-- Tablespace ownership
SELECT 'tablespace' AS on, t.spcname AS name_1, NULL AS name_2, NULL AS name_3, 'OWNER' AS privilege_type
FROM pg_tablespace t
INNER JOIN owned_or_acl a ON a.objid = t.oid
WHERE classid = 'pg_tablespace'::regclass AND deptype = 'o'
UNION ALL
-- Tablespace privileges
SELECT 'tablespace' AS on, t.spcname AS name_1, NULL AS name_2, NULL AS name_3, privilege_type
FROM pg_tablespace t
INNER JOIN owned_or_acl a ON a.objid = t.oid
CROSS JOIN aclexplode(COALESCE(t.spcacl, acldefault('t', t.spcowner)))
WHERE classid = 'pg_tablespace'::regclass AND grantee = refobjid
UNION ALL
-- Foreign data wrapper ownership
SELECT 'foreign-data wrapper' AS on, f.fdwname AS name_1, NULL AS name_2, NULL AS name_3, 'OWNER' AS privilege_type
FROM pg_foreign_data_wrapper f
INNER JOIN owned_or_acl a ON a.objid = f.oid
WHERE classid = 'pg_foreign_data_wrapper'::regclass AND deptype = 'o'
UNION ALL
-- Foreign data wrapper privileges
SELECT 'foreign-data wrapper' AS on, f.fdwname AS name_1, NULL AS name_2, NULL AS name_3, privilege_type
FROM pg_foreign_data_wrapper f
INNER JOIN owned_or_acl a ON a.objid = f.oid
CROSS JOIN aclexplode(COALESCE(f.fdwacl, acldefault('F', f.fdwowner)))
WHERE classid = 'pg_foreign_data_wrapper'::regclass AND grantee = refobjid
UNION ALL
-- Foreign server ownership
SELECT 'foreign server' AS on, f.srvname AS name_1, NULL AS name_2, NULL AS name_3, 'OWNER' AS privilege_type
FROM pg_foreign_server f
INNER JOIN owned_or_acl a ON a.objid = f.oid
WHERE classid = 'pg_foreign_server'::regclass AND deptype = 'o'
UNION ALL
-- Foreign server privileges
SELECT 'foreign server' AS on, f.srvname AS name_1, NULL AS name_2, NULL AS name_3, privilege_type
FROM pg_foreign_server f
INNER JOIN owned_or_acl a ON a.objid = f.oid
CROSS JOIN aclexplode(COALESCE(f.srvacl, acldefault('S', f.srvowner)))
WHERE classid = 'pg_foreign_server'::regclass AND grantee = refobjid
UNION ALL
-- Parameter privileges
SELECT 'parameter' AS on, p.parname AS name_1, NULL AS name_2, NULL AS name_3, privilege_type
FROM pg_parameter_acl p
INNER JOIN owned_or_acl a ON a.objid = p.oid
CROSS JOIN aclexplode(p.paracl)
WHERE classid = 'pg_parameter_acl'::regclass AND grantee = refobjid
);
Replace :'rolename' with the user, or pass it on the command line
with:
psql -f show-grants-for-role.sql -v rolename=YOUR_ROLE
Show running queries
If the server seems slow, it's possible to inspect running queries with this query:
SELECT datid,datname,pid,query_start,now()-query_start as age,state,query FROM pg_stat_activity;
If a query appears to be waiting, it is worth looking at the
wait_event and wait_event_type columns as well: we're looking for
locks and deadlocks here.
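A quick way to see which sessions are blocked and by whom is the built-in pg_blocking_pids() function (available since PostgreSQL 9.6), for example:
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       state,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;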
Killing a slow query
This will terminate all connections (and thus kill all queries) to database_name:
SELECT
pg_terminate_backend(pid)
FROM
pg_stat_activity
WHERE
-- don't kill my own connection!
pid <> pg_backend_pid()
-- don't kill the connections to other databases
AND datname = 'database_name'
;
A more selective approach is to list running queries (above) and then kill only one PID, say:
SELECT
pg_terminate_backend(pid)
FROM
pg_stat_activity
WHERE
-- don't kill my own connection!
pid = 1234;
Diagnosing performance issues
Some ideas from the #postgresql channel on Libera:
- look at query_start and state, and if state is waiting, wait_event and wait_event_type, in pg_stat_activity, possibly looking for locks here. This is done by the query above, in Show running queries.
- enable pg_stat_statements to see where the time is going, and then dig into the queries/functions found there, possibly with auto_explain and auto_explain.log_nested_statements=on (see the example query below).
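For example, once pg_stat_statements is loaded through shared_preload_libraries and the extension is created in the database, a query like this should surface the most expensive statements (column names are for PostgreSQL 13 and later, where total_time was split into total_exec_time):
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

SELECT calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 1) AS mean_ms,
       rows,
       left(query, 80) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;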
In general, we have a few Grafana dashboards specific to PostgreSQL (see logs and metrics, below) that might help tracing performance issues as well. Obviously, system-level statistics (disk, CPU, memory usage) can also help pinpoint where the bottleneck is, so the basic node-level Grafana dashboards are useful there too.
Consider tuning the whole database with pgtune.
Find what is taking up space
This will show all databases with their sizes and description:
\l+
This will report size and count information for all "relations", which includes indexes:
SELECT relname AS objectname
, relkind AS objecttype
, reltuples AS "#entries"
, pg_size_pretty(relpages::bigint*8*1024) AS size
FROM pg_class
WHERE relpages >= 8
ORDER BY relpages DESC;
It might be difficult to track the total size of a table with the above, because it doesn't add in index sizes, which are typically small but can grow quite significantly.
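For a single table, the pg_total_relation_size() family of functions gives the full footprint, including indexes and TOAST data (my_table is a placeholder):
SELECT pg_size_pretty(pg_relation_size('my_table'))       AS heap_only,
       pg_size_pretty(pg_indexes_size('my_table'))        AS indexes,
       pg_size_pretty(pg_total_relation_size('my_table')) AS total;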
This will report the same, but with aggregated results:
SELECT table_name
, row_estimate
, pg_size_pretty(total_bytes) AS total
, pg_size_pretty(table_bytes) AS TABLE
, pg_size_pretty(index_bytes) AS INDEX
, pg_size_pretty(toast_bytes) AS toast
FROM (
SELECT *, total_bytes-index_bytes-COALESCE(toast_bytes,0) AS table_bytes FROM (
SELECT c.oid,nspname AS table_schema, relname AS TABLE_NAME
, c.reltuples AS row_estimate
, pg_total_relation_size(c.oid) AS total_bytes
, pg_indexes_size(c.oid) AS index_bytes
, pg_total_relation_size(reltoastrelid) AS toast_bytes
FROM pg_class c
LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE relkind = 'r'
) a
) a ORDER BY total_bytes DESC LIMIT 10;
Same with databases:
SELECT d.datname AS Name, pg_catalog.pg_get_userbyid(d.datdba) AS Owner,
CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT')
THEN pg_catalog.pg_size_pretty(pg_catalog.pg_database_size(d.datname))
ELSE 'No Access'
END AS SIZE
FROM pg_catalog.pg_database d
ORDER BY
CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT')
THEN pg_catalog.pg_database_size(d.datname)
ELSE NULL
END DESC -- nulls first
LIMIT 20;
Source: PostgreSQL wiki. See also the upstream manual.
Checking for wasted space
PostgreSQL is particular as a database in the sense that it never
actually returns free space to the operating system unless explicitly
asked for. Modern PostgreSQL releases (8.1+) have an "auto-vacuum"
daemon which takes care of cleaning up DELETE and related operations
to reclaim that disk space, but this only marks those regions of the
database as usable: it doesn't actually return those blocks to the
operating system.
Because databases typically either stay the same size or grow over
their lifetime, this typically does not matter: the next INSERT will
use that space and no space is actually wasted.
But sometimes that disk space can grow too large. How do we check if our database is wasting space? There are many ways...
check_postgresql
There is a monitoring plugin (which we don't actually use) that checks for wasted space. It is called check_postgresql and features a bloat check which can run regularly. This could be ported to Prometheus or, perhaps better, we could have something in the PostgreSQL exporter that could check for bloat.
Running bloat query by hand
The above script might be annoying to deploy for an ad-hoc situation. You can just run the query by hand instead:
SELECT
current_database(), schemaname, tablename, /*reltuples::bigint, relpages::bigint, otta,*/
ROUND((CASE WHEN otta=0 THEN 0.0 ELSE sml.relpages::float/otta END)::numeric,1) AS tbloat,
CASE WHEN relpages < otta THEN 0 ELSE bs*(sml.relpages-otta)::BIGINT END AS wastedbytes,
iname, /*ituples::bigint, ipages::bigint, iotta,*/
ROUND((CASE WHEN iotta=0 OR ipages=0 THEN 0.0 ELSE ipages::float/iotta END)::numeric,1) AS ibloat,
CASE WHEN ipages < iotta THEN 0 ELSE bs*(ipages-iotta) END AS wastedibytes
FROM (
SELECT
schemaname, tablename, cc.reltuples, cc.relpages, bs,
CEIL((cc.reltuples*((datahdr+ma-
(CASE WHEN datahdr%ma=0 THEN ma ELSE datahdr%ma END))+nullhdr2+4))/(bs-20::float)) AS otta,
COALESCE(c2.relname,'?') AS iname, COALESCE(c2.reltuples,0) AS ituples, COALESCE(c2.relpages,0) AS ipages,
COALESCE(CEIL((c2.reltuples*(datahdr-12))/(bs-20::float)),0) AS iotta -- very rough approximation, assumes all cols
FROM (
SELECT
ma,bs,schemaname,tablename,
(datawidth+(hdr+ma-(case when hdr%ma=0 THEN ma ELSE hdr%ma END)))::numeric AS datahdr,
(maxfracsum*(nullhdr+ma-(case when nullhdr%ma=0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
FROM (
SELECT
schemaname, tablename, hdr, ma, bs,
SUM((1-null_frac)*avg_width) AS datawidth,
MAX(null_frac) AS maxfracsum,
hdr+(
SELECT 1+count(*)/8
FROM pg_stats s2
WHERE null_frac<>0 AND s2.schemaname = s.schemaname AND s2.tablename = s.tablename
) AS nullhdr
FROM pg_stats s, (
SELECT
(SELECT current_setting('block_size')::numeric) AS bs,
CASE WHEN substring(v,12,3) IN ('8.0','8.1','8.2') THEN 27 ELSE 23 END AS hdr,
CASE WHEN v ~ 'mingw32' THEN 8 ELSE 4 END AS ma
FROM (SELECT version() AS v) AS foo
) AS constants
GROUP BY 1,2,3,4,5
) AS foo
) AS rs
JOIN pg_class cc ON cc.relname = rs.tablename
JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname = rs.schemaname AND nn.nspname <> 'information_schema'
LEFT JOIN pg_index i ON indrelid = cc.oid
LEFT JOIN pg_class c2 ON c2.oid = i.indexrelid
) AS sml
ORDER BY wastedbytes DESC;
Another way
It is rumored, however, that this is not very accurate. A better option seems to be this ... more complicated query:
-- change to the max number of field per index if not default.
\set index_max_keys 32
-- (readonly) IndexTupleData size
\set index_tuple_hdr 2
-- (readonly) ItemIdData size
\set item_pointer 4
-- (readonly) IndexAttributeBitMapData size
\set index_attribute_bm (:index_max_keys + 8 - 1) / 8
SELECT current_database(), nspname, c.relname AS table_name, index_name, bs*(sub.relpages)::bigint AS totalbytes,
CASE WHEN sub.relpages <= otta THEN 0 ELSE bs*(sub.relpages-otta)::bigint END AS wastedbytes,
CASE WHEN sub.relpages <= otta THEN 0 ELSE bs*(sub.relpages-otta)::bigint * 100 / (bs*(sub.relpages)::bigint) END AS realbloat
FROM (
SELECT bs, nspname, table_oid, index_name, relpages, coalesce(
ceil((reltuples*(:item_pointer+nulldatahdrwidth))/(bs-pagehdr::float)) +
CASE WHEN am.amname IN ('hash','btree') THEN 1 ELSE 0 END , 0 -- btree and hash have a metadata reserved block
) AS otta
FROM (
SELECT maxalign, bs, nspname, relname AS index_name, reltuples, relpages, relam, table_oid,
( index_tuple_hdr_bm +
maxalign - CASE /* Add padding to the index tuple header to align on MAXALIGN */
WHEN index_tuple_hdr_bm%maxalign = 0 THEN maxalign
ELSE index_tuple_hdr_bm%maxalign
END
+ nulldatawidth + maxalign - CASE /* Add padding to the data to align on MAXALIGN */
WHEN nulldatawidth::integer%maxalign = 0 THEN maxalign
ELSE nulldatawidth::integer%maxalign
END
)::numeric AS nulldatahdrwidth, pagehdr
FROM (
SELECT
i.nspname, i.relname, i.reltuples, i.relpages, i.relam, s.starelid, a.attrelid AS table_oid,
current_setting('block_size')::numeric AS bs,
/* MAXALIGN: 4 on 32bits, 8 on 64bits (and mingw32 ?) */
CASE
WHEN version() ~ 'mingw32' OR version() ~ '64-bit' THEN 8
ELSE 4
END AS maxalign,
/* per page header, fixed size: 20 for 7.X, 24 for others */
CASE WHEN substring(current_setting('server_version') FROM '#"[0-9]+#"%' FOR '#')::integer > 7
THEN 24
ELSE 20
END AS pagehdr,
/* per tuple header: add index_attribute_bm if some cols are null-able */
CASE WHEN max(coalesce(s.stanullfrac,0)) = 0
THEN :index_tuple_hdr
ELSE :index_tuple_hdr + :index_attribute_bm
END AS index_tuple_hdr_bm,
/* data len: we remove null values save space using it fractionnal part from stats */
sum( (1-coalesce(s.stanullfrac, 0)) * coalesce(s.stawidth, 2048) ) AS nulldatawidth
FROM pg_attribute AS a
JOIN pg_statistic AS s ON s.starelid=a.attrelid AND s.staattnum = a.attnum
JOIN (
SELECT nspname, relname, reltuples, relpages, indrelid, relam, regexp_split_to_table(indkey::text, ' ')::smallint AS attnum
FROM pg_index
JOIN pg_class ON pg_class.oid=pg_index.indexrelid
JOIN pg_namespace ON pg_namespace.oid = pg_class.relnamespace
) AS i ON i.indrelid = a.attrelid AND a.attnum = i.attnum
WHERE a.attnum > 0
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9
) AS s1
) AS s2
LEFT JOIN pg_am am ON s2.relam = am.oid
) as sub
JOIN pg_class c ON c.oid=sub.table_oid
ORDER BY wastedbytes;
It was modified to sort the output by wastedbytes.
Grouped output
One disadvantage of the above query is that tables and indexes are displayed separately. How do we know which belongs to which? It also makes it less obvious what the big tables are, and which ones are important.
This one comes from the pgx_scripts GitHub repo, and is a 130+ line SQL query:
-- new table bloat query
-- still needs work; is often off by +/- 20%
WITH constants AS (
-- define some constants for sizes of things
-- for reference down the query and easy maintenance
SELECT current_setting('block_size')::numeric AS bs, 23 AS hdr, 8 AS ma
),
no_stats AS (
-- screen out table who have attributes
-- which dont have stats, such as JSON
SELECT table_schema, table_name,
n_live_tup::numeric as est_rows,
pg_table_size(relid)::numeric as table_size
FROM information_schema.columns
JOIN pg_stat_user_tables as psut
ON table_schema = psut.schemaname
AND table_name = psut.relname
LEFT OUTER JOIN pg_stats
ON table_schema = pg_stats.schemaname
AND table_name = pg_stats.tablename
AND column_name = attname
WHERE attname IS NULL
AND table_schema NOT IN ('pg_catalog', 'information_schema')
GROUP BY table_schema, table_name, relid, n_live_tup
),
null_headers AS (
-- calculate null header sizes
-- omitting tables which dont have complete stats
-- and attributes which aren't visible
SELECT
hdr+1+(sum(case when null_frac <> 0 THEN 1 else 0 END)/8) as nullhdr,
SUM((1-null_frac)*avg_width) as datawidth,
MAX(null_frac) as maxfracsum,
schemaname,
tablename,
hdr, ma, bs
FROM pg_stats CROSS JOIN constants
LEFT OUTER JOIN no_stats
ON schemaname = no_stats.table_schema
AND tablename = no_stats.table_name
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
AND no_stats.table_name IS NULL
AND EXISTS ( SELECT 1
FROM information_schema.columns
WHERE schemaname = columns.table_schema
AND tablename = columns.table_name )
GROUP BY schemaname, tablename, hdr, ma, bs
),
data_headers AS (
-- estimate header and row size
SELECT
ma, bs, hdr, schemaname, tablename,
(datawidth+(hdr+ma-(case when hdr%ma=0 THEN ma ELSE hdr%ma END)))::numeric AS datahdr,
(maxfracsum*(nullhdr+ma-(case when nullhdr%ma=0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
FROM null_headers
),
table_estimates AS (
-- make estimates of how large the table should be
-- based on row and page size
SELECT schemaname, tablename, bs,
reltuples::numeric as est_rows, relpages * bs as table_bytes,
CEIL((reltuples*
(datahdr + nullhdr2 + 4 + ma -
(CASE WHEN datahdr%ma=0
THEN ma ELSE datahdr%ma END)
)/(bs-20))) * bs AS expected_bytes,
reltoastrelid
FROM data_headers
JOIN pg_class ON tablename = relname
JOIN pg_namespace ON relnamespace = pg_namespace.oid
AND schemaname = nspname
WHERE pg_class.relkind = 'r'
),
estimates_with_toast AS (
-- add in estimated TOAST table sizes
-- estimate based on 4 toast tuples per page because we dont have
-- anything better. also append the no_data tables
SELECT schemaname, tablename,
TRUE as can_estimate,
est_rows,
table_bytes + ( coalesce(toast.relpages, 0) * bs ) as table_bytes,
expected_bytes + ( ceil( coalesce(toast.reltuples, 0) / 4 ) * bs ) as expected_bytes
FROM table_estimates LEFT OUTER JOIN pg_class as toast
ON table_estimates.reltoastrelid = toast.oid
AND toast.relkind = 't'
),
table_estimates_plus AS (
-- add some extra metadata to the table data
-- and calculations to be reused
-- including whether we can't estimate it
-- or whether we think it might be compressed
SELECT current_database() as databasename,
schemaname, tablename, can_estimate,
est_rows,
CASE WHEN table_bytes > 0
THEN table_bytes::NUMERIC
ELSE NULL::NUMERIC END
AS table_bytes,
CASE WHEN expected_bytes > 0
THEN expected_bytes::NUMERIC
ELSE NULL::NUMERIC END
AS expected_bytes,
CASE WHEN expected_bytes > 0 AND table_bytes > 0
AND expected_bytes <= table_bytes
THEN (table_bytes - expected_bytes)::NUMERIC
ELSE 0::NUMERIC END AS bloat_bytes
FROM estimates_with_toast
UNION ALL
SELECT current_database() as databasename,
table_schema, table_name, FALSE,
est_rows, table_size,
NULL::NUMERIC, NULL::NUMERIC
FROM no_stats
),
bloat_data AS (
-- do final math calculations and formatting
select current_database() as databasename,
schemaname, tablename, can_estimate,
table_bytes, round(table_bytes/(1024^2)::NUMERIC,3) as table_mb,
expected_bytes, round(expected_bytes/(1024^2)::NUMERIC,3) as expected_mb,
round(bloat_bytes*100/table_bytes) as pct_bloat,
round(bloat_bytes/(1024::NUMERIC^2),2) as mb_bloat,
table_bytes, expected_bytes, est_rows
FROM table_estimates_plus
)
-- filter output for bloated tables
SELECT databasename, schemaname, tablename,
can_estimate,
est_rows,
pct_bloat, mb_bloat,
table_mb
FROM bloat_data
-- this where clause defines which tables actually appear
-- in the bloat chart
-- example below filters for tables which are either 50%
-- bloated and more than 20mb in size, or more than 25%
-- bloated and more than 4GB in size
WHERE ( pct_bloat >= 50 AND mb_bloat >= 10 )
OR ( pct_bloat >= 25 AND mb_bloat >= 1000 )
ORDER BY mb_bloat DESC;
It will show only tables which have significant bloat, which is defined in the last few lines above. It makes the output much more readable.
There's also this other query we haven't evaluated.
Recovering disk space
In some cases, you do need to reclaim actual operating system disk space from the PostgreSQL server (see above to check whether you do). This can happen, for example, if you have removed years of old data from a database.
VACUUM FULL
Typically this is done with the VACUUM FULL command (instead of
plain VACUUM, which the auto-vacuum does, see this discussion for
details). This will actually rewrite all the tables to make sure
only the relevant data is actually stored on disk. It's roughly the
equivalent of a dump/restore, except it is faster.
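A minimal example, assuming you can afford the ACCESS EXCLUSIVE lock it takes while it runs (my_table is a placeholder):
-- rewrite a single table and print details as it goes
VACUUM (FULL, VERBOSE) my_table;
-- or rewrite every table in the current database
VACUUM FULL;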
pg_repack
For very large changes (say, dozens of terabytes) however, VACUUM
FULL (and even plain VACUUM) can be prohibitively slow (think
days). And while VACUUM doesn't require an exclusive lock on the
tables it's working on, VACUUM FULL does, which implies a
significant outage.
An alternative to that method is the pg_repack extension, which
is packaged in Debian. In Debian 10 buster, the following
procedure was used on bacula-director-01 to purge old data about
removed Bacula clients that hadn't been cleaned up in years:
apt install postgresql-11-repack
Then install the extension on the database, as the postgres user (sudo -u postgres -i); this needs to be done only once:
psql -c "CREATE EXTENSION pg_repack" -d bacula
Then, for each table:
pg_repack -d bacula --table media
It is a good idea to start with a small table we can afford to lose,
just in case something goes wrong. That job took about 2 hours on a
very large table (150GB, the file table). The entire Bacula database went from
using 161GB to 91GB after that cleanup, see this ticket for
details.
When done, drop the pg_repack extension:
DROP EXTENSION pg_repack;
Also note that, after the repack, VACUUM performance improved
significantly, going from hours (if not days) to minutes.
Note that pg_squeeze is another alternative to pg_repack, but
it isn't available in Debian.
WAL is growing: dangling replication slot
As noted below, we currently generally don't (yet) use PostgreSQL
replication. However, some tools can use a replication slot to
extract backups, as is the case for barman.
If disk usage is growing linearly and you find out that the pg_wal directory
is the biggest item, take a look at whether there is a replication slot that's
left dangling and keeping PostgreSQL from being able to clear out its WAL:
SELECT slot_name,
pg_wal_lsn_diff(
pg_current_wal_lsn(),
restart_lsn
) AS bytes_behind,
active,
wal_status
FROM pg_replication_slots
WHERE wal_status <> 'lost'
ORDER BY restart_lsn;
If there is one entry listed there, especially if the value in the column bytes_behind is high, then you might have found the source of the issue.
First off, verify that the replication slot is really not used by anything anymore. That means checking what other tools are running on the host, whether the name of the replication slot evokes something familiar, and checking in with service admins about this replication slot if necessary.
If you know that you can remove the replication slot safely, then you can do so with:
select pg_drop_replication_slot('barman');
After that, you'll need to wait for the next checkpoint to happen. By default this is 15 minutes, but some hosts may set a different checkpoint interval. Once the checkpoint is reached, you should see the disk usage go down on the machine.
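To check the actual checkpoint interval on a given host, or to force an immediate checkpoint instead of waiting (forcing one requires superuser, or the pg_checkpoint role on PostgreSQL 15 and later):
sudo -u postgres psql -c 'SHOW checkpoint_timeout;'
sudo -u postgres psql -c 'CHECKPOINT;'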
See this page for information on other cases where the WAL can start growing.
Monitoring the VACUUM processes
In PostgreSQL, the VACUUM command "reclaims storage occupied by dead tuples". To quote the excellent PostgreSQL documentation:
In normal PostgreSQL operation, tuples that are deleted or obsoleted by an update are not physically removed from their table; they remain present until a VACUUM is done. Therefore it's necessary to do VACUUM periodically, especially on frequently-updated tables.
By default, the autovacuum launcher is enabled in PostgreSQL (and in our deployments), which should automatically take care of this problem.
This will show that the autovacuum daemon is running:
# ps aux | grep [v]acuum
postgres 534 0.5 4.7 454920 388012 ? Ds 05:31 3:08 postgres: 11/main: autovacuum worker bacula
postgres 17259 0.0 0.1 331376 10984 ? Ss Nov12 0:10 postgres: 11/main: autovacuum launcher
In the above, the launcher is running, and we can see a worker has
been started to vacuum the bacula database.
If you don't see the launcher, check that it's enabled:
bacula=# SELECT name, setting FROM pg_settings WHERE name='autovacuum' or name='track_counts';
autovacuum | on
track_counts | on
Both need to be on for the autovacuum workers to operate. It's
possible that some tables might have autovacuum disabled, however,
see:
SELECT reloptions FROM pg_class WHERE relname='my_table';
In the above scenario, the autovacuum worker bacula process had been
running for hours, which was concerning. One way to diagnose is to
figure out how much data there is to vacuum.
This query will show the tables with dead tuples that need to be cleaned up by the VACUUM process:
SELECT relname, n_dead_tup FROM pg_stat_user_tables where n_dead_tup > 0 order by n_dead_tup DESC LIMIT 1;
In our case, there were tens of millions of rows to clean:
bacula=# SELECT relname, n_dead_tup FROM pg_stat_user_tables where n_dead_tup > 0 order by n_dead_tup DESC LIMIT 1;
file | 183278595
That is almost 200 million tuples to clean up!
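If waiting for autovacuum is not an option, a plain manual VACUUM on the affected table can also be kicked off by hand, using the bacula database and file table from the example above:
sudo -u postgres psql -d bacula -c 'VACUUM (VERBOSE) file;'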
We can see details of the vacuum operation with this funky query, taken from this amazing blog post:
SELECT
p.pid,
now() - a.xact_start AS duration,
coalesce(wait_event_type ||'.'|| wait_event, 'f') AS waiting,
CASE
WHEN a.query ~*'^autovacuum.*to prevent wraparound' THEN 'wraparound'
WHEN a.query ~*'^vacuum' THEN 'user'
ELSE 'regular'
END AS mode,
p.datname AS database,
p.relid::regclass AS table,
p.phase,
pg_size_pretty(p.heap_blks_total * current_setting('block_size')::int) AS table_size,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
pg_size_pretty(p.heap_blks_scanned * current_setting('block_size')::int) AS scanned,
pg_size_pretty(p.heap_blks_vacuumed * current_setting('block_size')::int) AS vacuumed,
round(100.0 * p.heap_blks_scanned / p.heap_blks_total, 1) AS scanned_pct,
round(100.0 * p.heap_blks_vacuumed / p.heap_blks_total, 1) AS vacuumed_pct,
p.index_vacuum_count,
round(100.0 * p.num_dead_tuples / p.max_dead_tuples,1) AS dead_pct
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a using (pid)
ORDER BY now() - a.xact_start DESC;
For example, the above vacuum on the Bacula director is in this state at the time of writing:
bacula=# \x
Expanded display is on.
bacula=# SELECT [...]
-[ RECORD 1 ]------+----------------
pid | 534
duration | 10:55:24.413986
waiting | f
mode | regular
database | bacula
table | file
phase | scanning heap
table_size | 55 GB
total_size | 103 GB
scanned | 29 GB
vacuumed | 16 GB
scanned_pct | 52.2
vacuumed_pct | 29.3
index_vacuum_count | 1
dead_pct | 93.8
This is a lot of information, but basically the worker with PID 534
has been running for 10h55m on the bacula database. It is in the
scanning heap phase, the second of the seven phases of the vacuuming
process. It's working on the file table, which has 55GB of
data on the "heap" and a total size of 103 GB (including indexes). It
scanned 29 GB of data (52%), vacuumed 16GB out of that (29%). The
dead_pct indicates that the maintenance_work_mem buffer is
94% full, which could indicate raising that buffer could improve
performance. I am not sure what the waiting and index_vacuum_count
fields are for.
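To check and, if warranted, raise that buffer, something like this should work (the 1GB value is only an illustration; note that autovacuum workers use autovacuum_work_mem instead if that parameter is set):
SHOW maintenance_work_mem;
ALTER SYSTEM SET maintenance_work_mem = '1GB';
SELECT pg_reload_conf();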
Naturally, this query only returns information while a VACUUM operation is running, and they typically do not take this long. This one VACUUM operation was especially slow because we suddenly removed almost half of the old clients in the Bacula database, see ticket 40525 for more information.
One more trick: this will show last VACUUM dates on tables:
SELECT relname, last_vacuum, last_autovacuum FROM pg_stat_user_tables WHERE last_vacuum IS NOT NULL or last_autovacuum IS NOT NULL ORDER BY relname;
Some of the ideas above were found on this datadog post.
Finally, note that the Debian 10 ("buster") version of PostgreSQL (11) does
not support reporting on "FULL" VACUUM, that feature was introduced in
PostgreSQL 12. Debian 11 ("bullseye") has PostgreSQL 13, but progress
there is reported in the pg_stat_progress_cluster table, so the
above might not work even there.
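For reference, on PostgreSQL 12 and later the progress of a VACUUM FULL (or CLUSTER) shows up in that view, with a query along these lines:
SELECT pid, datname, relid::regclass AS table, command, phase,
       pg_size_pretty(heap_blks_total * current_setting('block_size')::int) AS table_size,
       heap_blks_scanned, heap_tuples_written
FROM pg_stat_progress_cluster;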
Running a backup manually
In pgBackRest, there is a systemd unit for each full or diff backup, so this is as simple as:
systemctl start pgbackrest-backup-full@materculae.service
You'd normally do a "diff" backup though:
systemctl start pgbackrest-backup-diff@materculae.service
You can follow the logs with:
journalctl -u pgbackrest-backup-diff@materculae -f
And check progress with:
watch -d sudo -u pgbackrest-materculae pgbackrest --stanza=materculae.torproject.org info
Checking backup health
The backup configuration can be tested on a client with:
sudo -u postgres pgbackrest --stanza=`hostname -f` check
For example, this was done to test weather-01:
root@weather-01:~# sudo -u postgres pgbackrest --stanza=weather-01.torproject.org check
You should be able to see information about that backup with the
info command on the client:
sudo -u postgres pgbackrest --stanza=`hostname -f` info
For example:
root@weather-01:~# sudo -u postgres pgbackrest --stanza=`hostname -f` info
stanza: weather-01.torproject.org
status: ok
cipher: none
db (current)
wal archive min/max (15): 000000010000001F00000004/00000001000000210000002F
full backup: 20241118-202245F
timestamp start/stop: 2024-11-18 20:22:45 / 2024-11-18 20:28:43
wal start/stop: 000000010000001F00000009 / 000000010000001F00000009
database size: 40.3MB, database backup size: 40.3MB
repo1: backup set size: 7.6MB, backup size: 7.6MB
This will run the check command on all configured backups:
for stanza in $( ls /var/lib/pgbackrest/backup ); do
hostname=$(basename $stanza .torproject.org)
sudo -u pgbackrest-$hostname pgbackrest --stanza=$stanza check
done
This can be used to check the status of all backups in batch:
for stanza in $( ls /var/lib/pgbackrest/backup ); do
hostname=$(basename $stanza .torproject.org)
sudo -u pgbackrest-$hostname pgbackrest --stanza=$stanza info | tail -12
done
It's essentially the same as the first, but with info instead of
check.
See also the upstream FAQ.
Backup recovery
pgBackRest is our new PostgreSQL backup system. It features a restore command and detailed restore procedures, which include instructions on how to restore a specific database in a cluster and how to do point-in-time recovery, to go back to a specific time in the past.
pgBackRest uses a variation of the official recovery procedure, which can also be referred to for more information.
Simple latest version restore
The procedure here assumes you are restoring to the latest version in the backups, overwriting the current server. It assumes PostgreSQL is installed, if not, see the installation procedure.
- visit the right cluster version:

    cd /var/lib/postgresql/15/

- stop the server:

    service postgresql stop

- move or remove all files from the old cluster:

    mv main main.old && sudo -u postgres mkdir --mode 700 main

  or, to remove all files:

    find main -mindepth 1 -delete

  You should typically move files aside unless you don't have enough room to restore while keeping the bad data in place.

- Run the restore command:

    sudo -u postgres pgbackrest --stanza=`hostname -f` restore

  Backup progress can be found in the log files, in:

    /var/log/pgbackrest/`hostname -f`-restore.log

  It takes a couple of minutes to start, but eventually you should see lines like:

    2024-12-05 19:22:52.582 P01 DETAIL: restore file /var/lib/postgresql/15/main/base/16402/852859.4 (1GB, 11.39%) checksum 8a17b30a73a1d1ea9c8566bd264eb89d9ed3f35c

  The percentage there (11.39% above) is how far along the restore is. Note that this number, like all progress bars, lies. In particular, we've seen in the wild a long tail of 8KB files that seem to never finish:

    2024-12-05 19:34:53.754 P01 DETAIL: restore file /var/lib/postgresql/15/main/base/16400/14044 (8KB, 100.00%) checksum b7a66985a1293b00b6402bfb650fa22c924fd893

  It will finish eventually.

- Start the restored server:

    sudo service postgresql start

- You're not done yet. The server will now replay log files from the archives. Monitor the progress in /var/log/postgresql/postgresql-15-main.log; recovery is complete when you see:

    database system is ready to accept connections

  Here's an example of a recovery:

    starting PostgreSQL 15.10 (Debian 15.10-0+deb12u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
    listening on IPv4 address "0.0.0.0", port 5432
    listening on IPv6 address "::", port 5432
    listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
    database system was interrupted; last known up at 2024-12-05 16:28:52 UTC
    starting archive recovery
    starting backup recovery with redo LSN 12B/410000C8, checkpoint LSN 12B/41000100, on timeline ID 2
    restored log file "000000020000012B00000041" from archive
    redo starts at 12B/410000C8
    restored log file "000000020000012B00000042" from archive
    completed backup recovery with redo LSN 12B/410000C8 and end LSN 12B/410B3930
    consistent recovery state reached at 12B/410B3930
    database system is ready to accept read-only connections
    restored log file "000000020000012B00000043" from archive
    restored log file "000000020000012B00000044" from archive
    redo in progress, elapsed time: 10.63 s, current LSN: 12B/43087E50
    restored log file "000000020000012B00000045" from archive
    redo done at 12B/452747D8 system usage: CPU: user: 0.00 s, system: 0.01 s, elapsed: 19.77 s
    last completed transaction was at log time 2024-12-05 19:20:38.375101+00
    restored log file "000000020000012B00000045" from archive
    selected new timeline ID: 3
    archive recovery complete
    checkpoint starting: end-of-recovery immediate wait
    checkpoint complete: wrote 840 buffers (5.1%); 0 WAL file(s) added, 0 removed, 5 recycled; write=0.123 s, sync=0.009 s, total=0.153 s; sync files=71, longest=0.004 s, average=0.001 s; distance=81919 kB, estimate=81919 kB
    database system is ready to accept connections

  Note that the date and LOG parts of the log entries were removed to make it easier to read.
This procedure also assumes that the pgbackrest command is
functional. This should normally be the case on an existing server,
but if pgBackRest is misconfigured or the server is lost or too
damaged, you might not be able to perform a restore with the normal
procedure.
In that case, you should treat the situation as a bare-bones recovery, below.
Restoring on a new server
The normal restore procedure assumes the server is properly configured for backups (technically with a proper "stanza").
If that's not the case, for example if you're recovering the database to a new server, you first need to do a proper PostgreSQL installation which should setup the backups properly.
The only twist is that you will need to tweak the stanza names to match the server you are restoring from and will also likely need to add extra SSH keys.
TODO: document exact procedure, should be pretty similar to the bare bones recovery below
Bare bones restore
This assumes the host is configured with Puppet. If this is a real catastrophe (e.g. the Puppet server is down!), you might not have that luxury. In that case, you need to manually configure pgBackRest, except steps:
- 2.b: user and SSH keys are probably already present on server
- 4.b: server won't be able to connect to client
- 5: don't configure the pgbackrest server, it's already done
- stop at step seven:
- 7: don't create the stanza on the server, already present
- 8: no need to configure backups on the client, we're restoring
- 9: the check command will fail if the server is stopped
- 10: server configuration talks to the old server
- 11: we're doing a restore, not a backup
Essentially, once you have a new machine to restore on, you will:
- Install required software:

    apt install sudo pgbackrest postgresql

- Create SSH keys on the new VM:

    sudo -u postgres ssh-keygen

- Add that public key to the repository server, in /etc/ssh/userkeys/pgbackrest-weather-01:

    echo 'no-agent-forwarding,no-X11-forwarding,no-port-forwarding,command="/usr/bin/pgbackrest ${SSH_ORIGINAL_COMMAND#* }" ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJrOnnOpX0cyzQ/lqvNLQt2mcJUziiJ0MdubSf/c1+2g postgres@test-01' \
      > /etc/ssh/userkeys/pgbackrest-weather-01

- Configure the repository on the client, in /etc/pgbackrest.conf:
[weather-01.torproject.org]
lock-path = /var/lock/pgbackrest/weather-01.torproject.org
pg1-host = weather-01.torproject.org
pg1-path = /var/lib/postgresql/15/main
log-path = /var/log/pgbackrest/weather-01.torproject.org
repo1-path = /var/lib/pgbackrest
- Restore with:
sudo -u postgres pgbackrest --stanza=weather-01.torproject.org restore
Once this is done, make sure to reconfigure the machine with Puppet properly so that it's again hooked up with the backup system.
Note that if the machine has been gone long enough, it's possible the user and configuration are gone from the server as well, in which case you'll need to create those too (step 2.b in the manual procedure).
Restoring without pgBackRest
This is likely not the procedure you want, and should be used only in extreme cases where pgBackRest is completely failing to restore from backups.
This procedure assumes you have already a server installed with enough
disk space to hold the data to recover. We assume you are restoring
the server testdb-01, which is hardcoded in this procedure.
- First, disable Puppet so you have control on when PostgreSQL is running:

    puppet agent --disable 'keeping control of postgresql startup -- anarcat 2019-10-09'

- Then install the right PostgreSQL version and stop the server:

    apt install postgresql-13
    service postgresql stop

  Make sure you run the SAME MAJOR VERSION of PostgreSQL as the backup! You cannot restore across versions. This might mean installing from backports or an older version of Debian.

- On that new PostgreSQL server, show the postgres server public key, creating it if missing:

    ( [ -f ~postgres/.ssh/id_rsa.pub ] || sudo -u postgres ssh-keygen ) && cat ~postgres/.ssh/*.pub

- Then on the backup server, allow the user to access backups of the old server:

    echo "restrict $HOSTKEY" > /etc/ssh/userkeys/pgbackrest-testdb-01.more

  The $HOSTKEY is the public key found on the postgres server above.

  NOTE: the above will not work if the key is already present in /etc/ssh/userkeys/torbackup, as that key will override the one in .more. Edit the key in there instead in that case.

- Then you need to find the right BASE file to restore from. Each BASE file has a timestamp in its filename, so just sorting them by name should be enough to find the latest one.

  Decompress the BASE file in place, as the postgres user:

    sudo -u postgres -i rsync -a pgbackrest-testdb-01@$BACKUPSERVER:/srv/backups/pg/backup/testdb-01.torproject.org/20250604-170509F/pg_data /var/lib/postgresql/13/main/

- Make sure the pg_wal directory doesn't contain any files:

    rm -rf -- /var/lib/postgresql/13/main/pg_wal/*

  Note: this directory was called pg_xlog in earlier PostgreSQL versions (e.g. 9.6).

- Tell the database it is okay to restore from backups:

    touch /var/lib/postgresql/13/main/recovery.signal

- At this point, you're ready to start the database based on that restored backup. But you will probably also want to restore WAL files to get the latest changes.

- Add a configuration parameter in /etc/postgresql/13/main/postgresql.conf that will tell postgres where to find the WAL files. At least the restore_command needs to be specified. Something like this may work:

    restore_command = '/usr/bin/ssh $OLDSERVER cat /srv/backups/pg/backup/anonticket-01.torproject.org/13-1/%f'

  You can specify a specific recovery point in the postgresql.conf, see the upstream documentation for more information. This, for example, will recover meronense from backups of the main cluster up to October 1st, and then start accepting connections (promote; other options are pause to stay in standby to accept more logs, or shutdown to stop the server):

    restore_command = '/usr/local/bin/pg-receive-file-from-backup meronense main.WAL.%f %p'
    recovery_target_time = '2022-10-01T00:00:00+0000'
    recovery_target_action = 'promote'

- Then start the server and look at the logs to follow the recovery process:

    service postgresql start
    tail -f /var/log/postgresql/*

  You should see something like this in /var/log/postgresql/postgresql-13-main.log:

    2019-10-09 21:17:47.335 UTC [9632] LOG: database system was interrupted; last known up at 2019-10-04 08:12:28 UTC
    2019-10-09 21:17:47.517 UTC [9632] LOG: starting archive recovery
    2019-10-09 21:17:47.524 UTC [9633] [unknown]@[unknown] LOG: incomplete startup packet
    2019-10-09 21:17:48.032 UTC [9639] postgres@postgres FATAL: the database system is starting up
    2019-10-09 21:17:48.538 UTC [9642] postgres@postgres FATAL: the database system is starting up
    2019-10-09 21:17:49.046 UTC [9645] postgres@postgres FATAL: the database system is starting up
    2019-10-09 21:17:49.354 UTC [9632] LOG: restored log file "00000001000005B200000074" from archive
    2019-10-09 21:17:49.552 UTC [9648] postgres@postgres FATAL: the database system is starting up
    2019-10-09 21:17:50.058 UTC [9651] postgres@postgres FATAL: the database system is starting up
    2019-10-09 21:17:50.565 UTC [9654] postgres@postgres FATAL: the database system is starting up
    2019-10-09 21:17:50.836 UTC [9632] LOG: redo starts at 5B2/74000028
    2019-10-09 21:17:51.071 UTC [9659] postgres@postgres FATAL: the database system is starting up
    2019-10-09 21:17:51.577 UTC [9665] postgres@postgres FATAL: the database system is starting up
    2019-10-09 21:20:35.790 UTC [9632] LOG: restored log file "00000001000005B20000009F" from archive
    2019-10-09 21:20:37.745 UTC [9632] LOG: restored log file "00000001000005B2000000A0" from archive
    2019-10-09 21:20:39.648 UTC [9632] LOG: restored log file "00000001000005B2000000A1" from archive
    2019-10-09 21:20:41.738 UTC [9632] LOG: restored log file "00000001000005B2000000A2" from archive
    2019-10-09 21:20:43.773 UTC [9632] LOG: restored log file "00000001000005B2000000A3" from archive

  ... and so on. Note that you do see some of those notices in the normal syslog/journald logs, but, critically, not the following recovery one.

  Then the recovery will complete with something like this, again in /var/log/postgresql/postgresql-13-main.log:

    2019-10-10 01:30:55.460 UTC [16953] LOG: redo done at 5B8/9C5BE738
    2019-10-10 01:30:55.460 UTC [16953] LOG: last completed transaction was at log time 2019-10-10 01:04:23.238233+00
    2019-10-10 01:31:03.536 UTC [16953] LOG: restored log file "00000001000005B80000009C" from archive
    2019-10-10 01:31:06.458 UTC [16953] LOG: selected new timeline ID: 2
    2019-10-10 01:31:17.485 UTC [16953] LOG: archive recovery complete
    2019-10-10 01:32:11.975 UTC [16953] LOG: MultiXact member wraparound protections are now enabled
    2019-10-10 01:32:12.438 UTC [16950] LOG: database system is ready to accept connections
    2019-10-10 01:32:12.439 UTC [26501] LOG: autovacuum launcher started

  The server is now ready for use.

- Remove the temporary SSH access on the backup server, either by removing the .more key file or restoring the previous key configuration:

    rm /etc/ssh/userkeys/torbackup.more

- re-enable Puppet:

    puppet agent -t
Troubleshooting restore failures
could not locate required checkpoint record
If you find the following error in the logs:
FATAL: could not locate required checkpoint record
It's because postgres cannot find the WAL logs to restore from. There could be many causes for this, but the ones I stumbled upon were:
- wrong permissions on the archive (put the WAL files in ~postgres, not ~root)
- wrong path or pattern for restore_command (double-check the path and make sure to include the right prefix, e.g. main.WAL)
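One way to rule out both causes is to run the restore_command by hand as the postgres user, with a WAL file name that you know exists in the archive; the host and path below are only illustrations:
sudo -u postgres /usr/bin/ssh $OLDSERVER \
    cat /srv/backups/pg/backup/testdb-01.torproject.org/13-1/00000001000005B200000074 \
    > /tmp/test-wal && ls -l /tmp/test-wal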
missing "archive recovery complete" message
Note: those instructions were copied from the legacy backup system documentation. They are, however, believed to be possibly relevant to certain failure mode of PostgreSQL recovery in general, but should be carefully reviewed.
A block like this should show up in the
/var/log/postgresql/postgresql-13-main.log file:
2019-10-10 01:30:55.460 UTC [16953] LOG: redo done at 5B8/9C5BE738
2019-10-10 01:30:55.460 UTC [16953] LOG: last completed transaction was at log time 2019-10-10 01:04:23.238233+00
2019-10-10 01:31:03.536 UTC [16953] LOG: restored log file "00000001000005B80000009C" from archive
2019-10-10 01:31:06.458 UTC [16953] LOG: selected new timeline ID: 2
2019-10-10 01:31:17.485 UTC [16953] LOG: archive recovery complete
2019-10-10 01:32:11.975 UTC [16953] LOG: MultiXact member wraparound protections are now enabled
2019-10-10 01:32:12.438 UTC [16950] LOG: database system is ready to accept connections
2019-10-10 01:32:12.439 UTC [26501] LOG: autovacuum launcher started
The key entry is archive recovery complete here.
If that does not show up in the logs, the server might just be
still recovering a WAL file, or recovery might be paused.
You can confirm what the server is doing by looking at the processes, for example, this is still recovering a WAL file:
root@meronense-backup-01:~# systemctl status postgresql@13-main.service
● postgresql@13-main.service - PostgreSQL Cluster 13-main
Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
Active: active (running) since Thu 2022-10-27 15:06:40 UTC; 1min 0s ago
Process: 67835 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 13-main start (code=exited, status=0/SUCCESS)
Main PID: 67840 (postgres)
Tasks: 5 (limit: 9510)
Memory: 50.0M
CPU: 626ms
CGroup: /system.slice/system-postgresql.slice/postgresql@13-main.service
├─67840 /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/13/main -c config_file=/etc/postgresql/13/main/postgresql.conf
├─67842 postgres: 13/main: startup recovering 0000000100000600000000F5
├─67851 postgres: 13/main: checkpointer
├─67853 postgres: 13/main: background writer
└─67855 postgres: 13/main: stats collector
... because there's a process doing:
67842 postgres: 13/main: startup recovering 0000000100000600000000F5
In that case, it was stuck in "pause" mode, as the logs indicated:
2022-10-27 15:08:54.882 UTC [67933] LOG: starting PostgreSQL 13.8 (Debian 13.8-0+deb11u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2022-10-27 15:08:54.882 UTC [67933] LOG: listening on IPv6 address "::1", port 5432
2022-10-27 15:08:54.882 UTC [67933] LOG: listening on IPv4 address "127.0.0.1", port 5432
2022-10-27 15:08:54.998 UTC [67933] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-10-27 15:08:55.236 UTC [67939] LOG: database system was shut down in recovery at 2022-10-27 15:08:54 UTC
2022-10-27 15:08:55.911 UTC [67939] LOG: starting point-in-time recovery to 2022-10-01 00:00:00+00
2022-10-27 15:08:56.764 UTC [67939] LOG: restored log file "0000000100000600000000F4" from archive
2022-10-27 15:08:57.316 UTC [67939] LOG: redo starts at 600/F4000028
2022-10-27 15:08:58.497 UTC [67939] LOG: restored log file "0000000100000600000000F5" from archive
2022-10-27 15:08:59.119 UTC [67939] LOG: consistent recovery state reached at 600/F50051F0
2022-10-27 15:08:59.119 UTC [67933] LOG: database system is ready to accept read only connections
2022-10-27 15:08:59.120 UTC [67939] LOG: recovery stopping before commit of transaction 12884886, time 2022-10-01 08:40:35.735422+00
2022-10-27 15:08:59.120 UTC [67939] LOG: pausing at the end of recovery
2022-10-27 15:08:59.120 UTC [67939] HINT: Execute pg_wal_replay_resume() to promote.
The pg_wal_replay_resume() is not actually the right statement to
use here, however. That would put the server back into recovery mode,
where it would start fetching WAL files again. It's useful for
replicated setups, but this is not such a case.
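You can also confirm from SQL whether the server is still in recovery and whether replay is paused (both functions exist in PostgreSQL 13):
SELECT pg_is_in_recovery(), pg_is_wal_replay_paused();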
In the above scenario, a recovery_target_time was added but without
a recovery_target_action, which led the server to be paused instead
of resuming normal operation.
The correct way to recover here is to call pg_promote():
sudo -u postgres psql -c 'SELECT pg_promote();'
Deleting backups
If, for some reason, you need to purge an old backup (e.g. some PII
made it there that should not have), you can manually expire backups
with the expire --set command.
This, for example, will delete a specific backup regardless of retention policies:
sudo -u pgbackrest-weather-01 pgbackrest --stanza=weather-01.torproject.org expire --set 20241205-162349F_20241207-162351D
Logs for this operation will show up in (for example) the
/var/log/pgbackrest/weather-01.torproject.org/weather-01.torproject.org-expire.log
file.
You can also expire incremental backups associated only with the oldest full backup with:
host=weather-01
cd /srv/backups/pg/backup/$host.torproject.org
for set in $(ls -d *F | sort | head -1)*I ; do
sudo -u pgbackrest-$host pgbackrest --stanza=$host.torproject.org --dry-run expire --set $set;
done
Remove --dry-run when you're confident this will work.
To remove all incremental backups:
host=weather-01
cd /srv/backups/pg/backup/$host.torproject.org
for set in *I ; do
sudo -u pgbackrest-$host pgbackrest --stanza=$host.torproject.org --dry-run expire --set $set;
done
To remove all incremental backups from all hosts:
cd /srv/backups/pg/backup &&
ls | sed 's/\..*//'| while read host; do
cd $host.torproject.org &&
echo $host &&
for set in *I ; do
[ -d $set ] && sudo -u pgbackrest-$host pgbackrest --stanza=$host.torproject.org --dry-run expire --set $set
done
cd ..
done
Pager playbook
OOM (Out Of Memory)
We have had situations where PostgreSQL ran out of memory a few times (tpo/tpa/team#40814, tpo/tpa/team#40482, tpo/tpa/team#40815). You can confirm the problem by looking at the node exporter graphs, for example this link will show you the last 4 months of memory usage on materculae:
The blue "dots" (if any) show the number of times the OOM-killer was called. If there are no dots, it wasn't called, obviously. You can see examples of graphs like this in the history of tpo/tpa/team#40815.
If you are not sure PostgreSQL is responsible, you should be able to confirm by looking at the per-process memory graphs established in July 2022. Here's, for example, a graph of the per-process memory usage on materculae for the past 60 days:
... or a similar graph for processes with more than 2GB of usage:
This was especially prominent after the Debian bullseye upgrades where there is a problem with the JIT compiler enabled in PostgreSQL 13 (Debian bug 1019503, upstream thread). So the first thing to do if a server misbehaves is to disable the JIT:
sudo -u postgres psql -c 'SET jit TO OFF;'
This is specifically what fixed a recurring OOM on Materculae in September 2022 (tpo/tpa/team#40815).
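Note that a plain SET only changes the setting for that one psql session. To make the change persistent across connections and restarts, something like the following should work (a sketch, adapt as needed):
sudo -u postgres psql -c 'ALTER SYSTEM SET jit = off;'
sudo -u postgres psql -c 'SELECT pg_reload_conf();'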
If that fails, another strategy is to try to avoid using the OOM killer altogether. By default, the Linux kernel overcommits memory, which means it actually allows processes to allocate more memory than is available on the system. When that memory is actually used is when problems can occur, and when the OOM killer intervenes to kill processes using "heuristics" to hopefully kill the right one.
The PostgreSQL manual actually recommends disabling that feature with:
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=90
To make this permanent, add the setting in /etc/sysctl.d/:
echo vm.overcommit_memory=2 > /etc/sysctl.d/no-overcommit.conf
echo vm.overcommit_ratio=90 >> /etc/sysctl.d/no-overcommit.conf
This will keep the kernel from over-allocating memory, limiting the
total memory usage to the swap size plus 90% of the main memory
(default is 50%). Note that the comments about the oom_score_adj do
not apply to the Debian package as it already sets a proper score for
the PostgreSQL server.
Concretely, avoiding overcommit will make the caller fail when it tries to allocate memory. This can still lead to PostgreSQL crashing, but at least it will give a more useful stack trace that will show what was happening during that allocation.
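Once applied (either with sysctl -w or after the file above is loaded), you can double-check the values in effect with:
sysctl vm.overcommit_memory vm.overcommit_ratio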
Another thing to look into is possible bad behavior on the client
side. A client could abuse memory usage by doing multiple PREPARE
statements and never executing them. Cursors declared WITH HOLD can
apparently also hold on to server memory.
Finally, PostgreSQL itself can be tweaked, see this part of the upstream documentation, again:
In some cases, it may help to lower memory-related configuration parameters, particularly shared_buffers, work_mem, and hash_mem_multiplier. In other cases, the problem may be caused by allowing too many connections to the database server itself. In many cases, it may be better to reduce max_connections and instead make use of external connection-pooling software.
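As a starting point, you can inspect the current values and, say, lower work_mem; this is only a hedged sketch with example values, not a recommendation for any particular server:
sudo -u postgres psql -c 'SHOW shared_buffers;'
sudo -u postgres psql -c 'SHOW work_mem;'
sudo -u postgres psql -c 'SHOW max_connections;'
sudo -u postgres psql -c "ALTER SYSTEM SET work_mem = '16MB';"
sudo -u postgres psql -c 'SELECT pg_reload_conf();'
Note that lowering shared_buffers requires a full PostgreSQL restart, not just a reload.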
Exporter failures
If you get a PgExporterScrapeErrors alert like:
PostgreSQL exporter failure on weather-01.torproject.org
It's because the PostgreSQL exporter cannot talk to the database server.
First, look at the exporter logs, which should show the error, for example in our case:
root@weather-01:~# journalctl -u prometheus-postgres-exporter.service -n 3 | cat
Sep 24 15:04:20 weather-01 prometheus-postgres-exporter[453]: ts=2024-09-24T15:04:20.670Z caller=collector.go:196 level=error msg="collector failed" name=bgwriter duration_seconds=0.002675663 err="pq: Peer authentication failed for user \"prometheus\""
Sep 24 15:04:20 weather-01 prometheus-postgres-exporter[453]: ts=2024-09-24T15:04:20.673Z caller=collector.go:196 level=error msg="collector failed" name=database duration_seconds=0.005719853 err="pq: Peer authentication failed for user \"prometheus\""
Sep 24 15:04:21 weather-01 prometheus-postgres-exporter[453]: ts=2024-09-24T15:04:21.670Z caller=postgres_exporter.go:714 level=error err="Error opening connection to database (user=prometheus%20host=/var/run/postgresql%20database=postgres%20sslmode=disable): pq: Peer authentication failed for user \"prometheus\""
Then you can turn to the PostgreSQL server logs to see the other side of that error:
root@weather-01:~# tail -3 /var/log/postgresql/postgresql-15-main.log
2024-09-24 15:05:20.672 UTC [116289] prometheus@postgres LOG: no match in usermap "torweather" for user "prometheus" authenticated as "prometheus"
2024-09-24 15:05:20.672 UTC [116289] prometheus@postgres FATAL: Peer authentication failed for user "prometheus"
2024-09-24 15:05:20.672 UTC [116289] prometheus@postgres DETAIL: Connection matched pg_hba.conf line 11: "local all all ident map=torweather"
In this case, it is a misconfiguration of the authentication
layer. The fix was to correct the pg_hba.conf file to avoid
overriding the configuration for the prometheus user in the
username map, see tor-puppet.git@123d79c19 (restrict the weather
pg_ident map to the right user, 2024-09-24).
But a more typical scenario is that the database server is down; make sure it is running correctly with:
systemctl status postgresql@15-main.service
Archiver failure
A PgArchiverFailed alert looks like:
Increased PostgreSQL archiver failure rate on test.example.com
It means the archive_command (from postgresql.conf) has been
failing for too long. A failure or two (say when the backup server is
rebooting) is normal, but the alert is specifically designed to alert
after a longer period of time.
This means the "point in time recovery" backups have stopped working, and changes since the failures started are not mirrored on the backup server.
Check the server log file (currently
/var/log/postgresql/postgresql-15-main.log) for errors. The most
typical scenario here is that the backup server is down, or there's a
configuration problem in the archive_command.
Here's a pgBackRest failure, for example:
2025-02-25 23:06:22.117 UTC [648720] DETAIL: The failed archive command was: pgbackrest --stanza=weather-01.torproject.org archive-push pg_wal/00000001000000280000009B
ERROR: [103]: unable to find a valid repository:
repo1: [FileOpenError] raised from remote-0 ssh protocol on 'backup-storage-01.torproject.org': unable to get info for path/file '/var/lock/pgbackrest/weather-01.torproject.org/weather-01.torproject.org.stop': [13] Permission denied
2025-02-25 23:06:25.287 UTC [648720] LOG: archive command failed with exit code 103
2025-02-25 23:06:25.287 UTC [648720] DETAIL: The failed archive command was: pgbackrest --stanza=weather-01.torproject.org archive-push pg_wal/00000001000000280000009B
2025-02-25 23:06:25.287 UTC [648720] WARNING: archiving write-ahead log file "00000001000000280000009B" failed too many times, will try again later
You can try running the archive command by hand; for pgBackRest servers, this would be:
cd /var/lib/postgresql/15/main/
sudo -u postgres pgbackrest --stanza=weather-01.torproject.org archive-push pg_wal/00000001000000280000009B
There used to be an issue where a reboot of the repository server would lead to the lock directory being missing, and therefore errors in the archiver. This was fixed in tpo/tpa/team#42058.
A more typical reason for those failures is a discrepancy between the pgBackRest versions on the server and the client, a known issue with pgBackRest:
status: error (other)
[ProtocolError] expected value '2.x' for greeting key 'version' but got '2.y'
HINT: is the same version of pgBackRest installed on the local and remote host?
The solution is to harmonize those versions across the fleet, see the upgrades section for details.
Once the archiver is fixed, you can force a write with:
sudo -u postgres psql -c CHECKPOINT
Watch the log file for failures; the alert should resolve within a couple of minutes.
Archiver lag
A PgArchiverAge alert looks something like:
PostgreSQL archiver lagging on test.torproject.org
It means the archive_command (from postgresql.conf) has been
struggling to keep up with changes in the database. Check the server
log file (currently /var/log/postgresql/postgresql-15-main.log) for
errors, otherwise look at the backup server for disk saturation.
Once the archiver is fixed, you can force a write with:
sudo -u postgres psql -c CHECKPOINT
Watch the log file for failures; the alert should resolve within a couple of minutes.
If this keeps occurring, settings could be changed in PostgreSQL to
commit changes to WAL files more frequently, for example by changing
the max_wal_size or checkpoint_timeout settings. Normally, a daily
job does a CHECKPOINT; you can check if it's running with:
systemctl status pg-checkpoint.timer pg-checkpoint.service
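Before changing anything, you can check the values currently in effect with:
sudo -u postgres psql -c 'SHOW max_wal_size;'
sudo -u postgres psql -c 'SHOW checkpoint_timeout;'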
Resetting archiver statistics
This is not usually a solution that one should use for archive errors.
But if you're intentionally disabling PostgreSQL archiving and you end
up with the PgArchiverAge alert even though no archiving is being
done, you'll want to reset the archiver statistics to clear out the alert.
To do this, connect to the database with the administrator account and then run one query, as follows:
# sudo -u postgres psql
[...]
postgres=# select pg_stat_reset_shared('archiver');
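You can check the effect by looking at the archiver statistics before and after; the counters and timestamps should be back to zero after the reset:
postgres=# SELECT * FROM pg_stat_archiver;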
Connection saturation
A PgConnectionsSaturation alert looks like:
PostgreSQL connection count near saturation on test.torproject.org
It means the number of connected clients is close to the maximum number of allowed clients. It leaves the server unlikely to respond properly to higher demand.
A few ideas:
- look into the Diagnosing performance issue section
- look at the long term trend, by plotting the pg_stat_activity_count metric over time
- consider bumping the max_connections setting (in postgresql.conf) if this is a long term trend; a quick way to compare the current count against the limit is sketched below
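For the quick check mentioned above, something like this compares the current connection count to the configured limit (a minimal sketch):
sudo -u postgres psql -c 'SELECT count(*) FROM pg_stat_activity;'
sudo -u postgres psql -c 'SHOW max_connections;'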
Stale backups
The PgBackRestStaleBackups alert looks like:
PostgreSQL backups are stale on weather-01.torproject.org
This implies that scheduled (normally, daily) backups are not running on that host.
The metric behind that alert
(pgbackrest_backup_since_last_completion_seconds) is generated by
the pgbackrest_exporter (see backups monitoring), based on the
output of the pgbackrest command.
You can inspect the general health of this stanza with this command on
the repository server (currently backup-storage-01):
sudo -u pgbackrest-weather-01 pgbackrest check --stanza=weather-01.torproject.org
This command takes a dozen seconds to complete; that is normal. It should return without any output; otherwise, it will tell you if there's a problem with the repository server reaching the client.
If that works, next up is to check the last backups with the info
command:
sudo -u pgbackrest-weather-01 pgbackrest info --stanza=weather-01.torproject.org
This should show something like:
root@backup-storage-01:~# sudo -u pgbackrest-weather-01 pgbackrest --stanza=weather-01.torproject.org info | head -12
stanza: weather-01.torproject.org
status: ok
cipher: none
db (current)
wal archive min/max (15): 000000010000001F00000004/000000010000002100000047
full backup: 20241118-202245F
timestamp start/stop: 2024-11-18 20:22:45 / 2024-11-18 20:28:43
wal start/stop: 000000010000001F00000009 / 000000010000001F00000009
database size: 40.3MB, database backup size: 40.3MB
repo1: backup set size: 7.6MB, backup size: 7.6MB
The oldest backups are shown first, and here we're showing the first
one (head -12), let's see the last one:
root@backup-storage-01:~# sudo -u pgbackrest-weather-01 pgbackrest --stanza=weather-01.torproject.org info | tail -6
diff backup: 20241209-183838F_20241211-001900D
timestamp start/stop: 2024-12-11 00:19:00 / 2024-12-11 00:19:20
wal start/stop: 000000010000002100000032 / 000000010000002100000033
database size: 40.7MB, database backup size: 10.3MB
repo1: backup set size: 7.7MB, backup size: 3.5MB
backup reference list: 20241209-183838F
If the backups are not running, check the systemd timer to see if it's properly enabled and running:
systemctl status pgbackrest-backup-incr@weather-01.timer
You can see the state of all pgBackRest timers with:
systemctl list-timers | grep -e NEXT -e pgbackrest
In this case, the backup is fresh enough, but if that last backup is
not recent enough, you can try to run a backup manually through the
systemd unit to see if you can reproduce the issue. For example, an
incr backup:
systemctl start pgbackrest-backup-incr@weather-01
See the Running a backup manually instructions for details.
Note that the pgbackrest_exporter only pulls metrics from pgBackRest
once per --collect.interval which defaults to 600 seconds (10
minutes), so it might take unexpectedly long for an alert to resolve.
It used to be that we would rely solely on OnCalendar and
RandomizedDelaySec (for example, OnCalendar=weekly and
RandomizedDelaySec=7d for diff backups) to spread that load, but
that introduced issues when provisioning new servers or rebooting the
repository server, see tpo/tpa/team#42043. We consider this to be
a bug in systemd itself, and worked around it by setting the
randomization in Puppet (see puppet-control@227ddb642).
Backup checksum errors
The PgBackRestBackupErrors alert looks like:
pgBackRest stanza weather-01.torproject.org page checksum errors
It means that the backup (in the above example, for weather-01
stanza) contains one or more page checksum errors.
To display the list of errors, you need to manually run a command like:
sudo -u pgbackrest-HOSTNAME pgbackrest info --stanza FQDN --set backup_name --repo repo_key
For example:
sudo -u pgbackrest-weather-01 pgbackrest info --stanza weather-01.torproject.org --set 20241209-183838F_20241211-001900D
This will, presumably, give you more information about the checksum errors. It's unclear how those can be resolved; we've never encountered such errors so far.
Backups misconfigurations
A number of conditions detected by the backups monitoring system will raise an alert. Those are, at the time of writing:
| Alert name | Metric | Explanation |
|---|---|---|
| PgBackRestExporterFailure | pgbackrest_exporter_status | exporter can't talk to pgBackRest |
| PgBackRestRepositoryError | pgbackrest_repo_status | misconfigured repository |
| PgBackRestStanzaError | pgbackrest_stanza_status | misconfigured stanza |
We have not encountered those errors so far, so it is currently unclear how to handle them. The exporter README file also has explanations of what the metrics mean.
It is likely that the exporter will log more detailed error messages in its logs, which should be visible with:
journalctl -u prometheus-pgbackrest-exporter.service -e
In all cases, another idea is to check backup health. This will confirm (or not) that stanzas are properly configured, and outline misconfigured stanzas or errors in the global repository configuration.
The status code 99 means "other". This generally means that some external reason is causing things to not run correctly, for example permission errors that make the exporter unable to read from the backup directories.
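One way to spot such permission problems is to walk the ownership and modes of the backup path, for example (using weather-01 purely as an illustration):
namei -l /srv/backups/pg/backup/weather-01.torproject.org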
Disk is full or nearly full
It's possible that pgBackRest backups are taking up all disk space on the backup server. This will generate an alert like this on IRC:
17:40:07 -ALERTOR1:#tor-alerts- DiskWillFillSoon [firing] Disk /srv/backups/pg on backup-storage-01.torproject.org is almost full
The first step is to inspect the directory with:
ncdu -x /srv/backups/pg
The goal of this is to figure out if there's a specific host that's
using more disk space than usual, or if there's a specific kind of
backups that's using more disk space. The files in backup/, for
example, are full/diff/incr backups, while the files in archive/ are
the WAL logs.
You can see the relative size of the different backup types with:
for f in F D I ; do printf "$f: " ; du -ssch *$f | grep total ; done
For example:
root@backup-storage-01:/srv/backups/pg/backup/rude.torproject.org# for f in F D I ; do printf "$f: " ; du -ssch *$f | grep total ; done
F: 9.6G total
D: 13G total
I: 65G total
In the above incident #41982, disk space was used overwhelmingly
by incr backups, which were actually disabled to work around the
problem. This, however, means WAL files will take up more space, so a
balance must be found here.
If a specific host is using more disk space, it's possible there's an explosion in disk use on the originating server, which can be investigated with the team responsible for the service.
It might be possible to recover disk space by deleting or expiring backups as well.
In any case, depending on how long it will take for the disk to fill up, the best strategy might be to resize the logical volume.
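A hedged sketch of such a resize, assuming the filesystem sits on LVM; the volume group and logical volume names below are hypothetical, check lvs first:
lvs
lvextend --resizefs --size +50G /dev/vg_backup/srv_backups_pg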
Disaster recovery
If a PostgreSQL server is destroyed completely or in part, we need to restore from backups, using the backup recovery procedure.
This requires Puppet to be up and running. If the Puppet infrastructure is damaged, a manual recovery procedure is required, see Bare bones restore.
Reference
Installation
The profile::postgresql Puppet class should be used to deploy and manage
PostgreSQL databases on nodes. It takes care of installation, configuration and
setting up the required role and permissions for backups.
Once the class is deployed, run the Puppet agent on both the server and the storage server, then make a full backup. See also the backups section for a discussion about backups configuration.
You will probably want to bind-mount /var/lib/postgresql to
/srv/postgresql, unless you are certain you have enough room in
/var for the database:
systemctl stop postgresql &&
echo /srv/postgresql /var/lib/postgresql none bind 0 0 >> /etc/fstab &&
mv /var/lib/postgresql /srv/ &&
mkdir /var/lib/postgresql &&
mount /var/lib/postgresql &&
systemctl start postgresql
This assumes /srv is already formatted and properly mounted, of
course, but that should have been taken care of as part of the new
machine procedure.
Manual installation
To test PostgreSQL on a server not managed by Puppet, you can probably get away with installing PostgreSQL by hand from Debian packages with:
apt install postgresql
Do NOT do this on a production server managed by TPA, as you'll be missing critical pieces of infrastructure, namely backups and monitoring.
Prometheus PostgreSQL exporter deployment
Prometheus metrics collection is configured automatically when the Puppet class
profile::postgresql is deployed on the node.
Manual deployment
NOTE: This is now done automatically by the Puppet profile. Those instructions are kept for historical reference only.
First, include the following line in pg_hba.conf:
local all prometheus peer
Then run the following SQL queries as the postgres user, for example
after sudo -u postgres psql, you first create the monitoring user to
match the above:
-- To use IF statements, hence to be able to check if the user exists before
-- attempting creation, we need to switch to procedural SQL (PL/pgSQL)
-- instead of standard SQL.
-- More: https://www.postgresql.org/docs/9.3/plpgsql-overview.html
-- To preserve compatibility with <9.0, DO blocks are not used; instead,
-- a function is created and dropped.
CREATE OR REPLACE FUNCTION __tmp_create_user() returns void as $$
BEGIN
IF NOT EXISTS (
SELECT -- SELECT list can stay empty for this
FROM pg_catalog.pg_user
WHERE usename = 'prometheus') THEN
CREATE USER prometheus;
END IF;
END;
$$ language plpgsql;
SELECT __tmp_create_user();
DROP FUNCTION __tmp_create_user();
This will make the user connect to the right database by default:
ALTER USER prometheus SET SEARCH_PATH TO postgres_exporter,pg_catalog;
GRANT CONNECT ON DATABASE postgres TO prometheus;
... and grant the required accesses to do the probes:
GRANT pg_monitor to prometheus;
Note the procedure was modified from the upstream procedure to
use the prometheus user (instead of postgres_exporter), and to
remove the hardcoded password (since we rely on the "peer"
authentication method).
A previous version of this documentation mistakenly recommended creating views and other complex objects that were only required in PostgreSQL < 10, and were never actually necessary. Those can be cleaned up with the following:
DROP SCHEMA postgres_exporter CASCADE;
DROP FUNCTION get_pg_stat_replication;
DROP FUNCTION get_pg_stat_statements;
DROP FUNCTION get_pg_stat_activity;
... and it wouldn't hurt then to rerun the above install procedure to
grant the correct rights to the prometheus user.
Then restart the exporter to be sure everything still works:
systemctl restart prometheus-postgres-exporter.service
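To confirm the exporter serves metrics again, you can scrape it locally; this assumes the exporter listens on its default port (9187):
curl -s http://localhost:9187/metrics | grep '^pg_up'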
Upgrades
PostgreSQL upgrades are a delicate operation that typically require downtime if there's no (logical) replication.
This section generally documents the normal (pgBackRest) procedure. The legacy backup system has been retired and so has its documentation.
Preparation
Before starting the fleet upgrade, read the release notes for the
relevant release (e.g. 17.0) to see if there are any specific
changes that are needed at the application level, for service
owners. In general, the procedure below does use pg_upgrade so
that's already covered.
Also note that the PostgreSQL server might need a fleet-wide
pgBackRest upgrade, as an old pgBackRest might not be compatible with
the newer PostgreSQL server or, worse, a new pgbackrest might not be
compatible with the one from the previous stable. During the Debian
12 to 13 (bookworm to trixie) upgrade, both of those were a problem
and the pgbackrest package was updated across the fleet, using the
apt.postgresql.org repository.
The upstream backports repository can be enabled with the
profile::postgresql::backports class. It's actually included by
default in profile::postgresql but only enabled on older
releases. This can be tweaked from Hiera.
Procedure
This is the procedure for pgBackRest-backed servers.
- Make a full backup of the old cluster or make sure a recent one is present:

  fab -H testdb-01.torproject.org postgresql.backup --no-wait

- Make sure the pgBackRest versions on the client and server are compatible. (See note about fleet-wide upgrades above.)

- Simulate the cluster upgrade:

  fab -H testdb-01.torproject.org --dry postgresql.upgrade

  Look at the version numbers and make sure you're upgrading and dropping the right clusters.

  This assumes the newer PostgreSQL packages are already available and installed, but that the upgrade wasn't performed. The normal "major upgrade" procedures bring you to that state, otherwise the https://apt.postgresql.org sources need to be installed on the server.

- Run the cluster upgrade:

  fab -H testdb-01.torproject.org postgresql.upgrade

  At this point, the old cluster is still present, but runs on a different port, and the upgraded cluster is ready for service.

- Verify service health: test the service which depends on the database, see if you can read and write to the database.

- Check that WAL files are still sent to the backup server. After an hour, if the archiver is not working properly, Prometheus will send a PgArchiverFailed alert, for example. Such errors should be visible in tail -f /var/log/postgresql/p*.log but will silently resolve themselves. You can check the metrics in Prometheus to see if they're being probed correctly with:

  fab prometheus.query-to-series --expression 'pgbackrest_backup_info{alias="testdb-01.torproject.org"}'
Note that the upgrade procedure takes care of destroying the old
cluster, after 7 days by default, with the at(1) command. Make sure
you check everything is alright before that delay!
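To see what is scheduled before that delay expires, you can list the pending at(1) jobs; at -c followed by the job number prints the exact commands that will run:
atq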
SLA
No service level is defined for this service.
Design and architecture
We use PostgreSQL for a handful of services. Each service has its own PostgreSQL server installed, with no high availability or replication, currently, although we use the "write-ahead log" to keep a binary dump of databases on the backup server.
It should be noted for people unfamiliar with PostgreSQL that it (or at least the Debian package) can manage multiple "clusters" of distinct databases with overlapping namespaces, running on different ports. To quote the upstream documentation:
PostgreSQL is a relational database management system (RDBMS). That means it is a system for managing data stored in relations. Relation is essentially a mathematical term for table. [...]
Each table is a named collection of rows. Each row of a given table has the same set of named columns, and each column is of a specific data type. [...]
Tables are grouped into databases, and a collection of databases managed by a single PostgreSQL server instance constitutes a database cluster.
See also the PostgreSQL architecture fundamentals.
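On Debian hosts, the clusters present on a machine, along with their versions, ports and data directories, can be listed with the postgresql-common tool:
pg_lsclusters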
TODO Services
TODO Storage
TODO Queues
TODO Interfaces
TODO Authentication
TODO Implementation
TODO Related services
Issues
There is no issue tracker specifically for this project; file or search for issues in the team issue tracker with the ~PostgreSQL label.
Maintainer
PostgreSQL services are part of the core services maintained by
TPA. The postgres Puppet module and associated backup
synchronisation code was written by Peter Palfrader.
TODO: update wrt pgbackrest and new profile, mention lavamind
TODO Users
TODO Upstream
The PostgreSQL project itself is a major database, free software project, which calls itself "The World's Most Advanced Open Source Relational Database", with regular releases and a healthy community.
Monitoring and metrics
Prometheus monitors the PostgreSQL servers through the PostgreSQL
exporter deployed by Puppet through the
profile::prometheus::postgres_exporter class.
The Grafana server has a handful of dashboards in various working states:
- Postgres Overview - basic dashboard with minimal metrics
- PostgreSQL Overview (Percona) - mostly working
- GitLab Omnibus - PostgreSQL - broken
Note that there is a program called pgstatsmon which can provide very detailed information about the state of a PostgreSQL database, see this blog post for details.
Backups monitoring
PostgreSQL backups are monitored through the
pgbackrest_exporter, which pulls metrics from the pgbackrest
binary on the storage server periodically, and exposes them through a
web interface.
The collected metrics can be seen on this Grafana dashboard (grafana.com source).
Alertmanager has a set of alerts that look for out of date backups, see the pager playbook for a reference.
TODO Tests
Logs
PostgreSQL keeps log files in /var/log/postgresql/, one per
"cluster". Since it logs failed queries, logs may contain PII in the
form of SQL queries. The log rotation policy is the one set by the
Debian package and keeps logs for 10 weeks.
The backup system keeps logs of its periodic full/diff backups in systemd's journal files. To consult the logs for the full backups on rude, for example, see:
journalctl -b -u pgbackrest-backup-full@rude.service
Backups
The new backup system is based on pgBackRest. It works by SSH'ing
between the client and server and running pgbackrest commands, which
encapsulates all functionality, including backup and restore.
Backups are retained for 30 days, although the source of truth for
this is not here but in Hiera, in tor-puppet.git's
hiera/common/postgresql.yaml, the
pgbackrest::config:global:repo1-retention-full value. Expiration is
performed when backups are run, from the systemd timers. See also the
upstream documentation on retention.
pgBackRest considers 3 different backup types; here are the schedules for those:
| type | frequency | note |
|---|---|---|
| full | 30 days | all database cluster files will be copied and there will be no dependencies on previous backups |
| diff | 7 days | like an incremental backup but always based on the last full backup |
| incr | 24h | incremental from the last successful backup |
Backups are scheduled using systemd timers exported from each node,
based on a template per backup type, so there's a matrix of
pgbackrest-backup-{diff,full}@.{service,timer} files on the
repository server, e.g.
root@backup-storage-01:~# ls /etc/systemd/system | grep @\\.
pgbackrest-backup-diff@.service
pgbackrest-backup-diff@.timer
pgbackrest-backup-full@.service
pgbackrest-backup-full@.timer
pgbackrest-backup-incr@.service
pgbackrest-backup-incr@.timer
Each server has its own instance of those units, as symlinks to the templates; for example, weather-01:
root@backup-storage-01:~# ls -l /etc/systemd/system | grep weather-01
lrwxrwxrwx 1 root root 31 Dec 5 02:02 pgbackrest-backup-diff@weather-01.service -> pgbackrest-backup-diff@.service
lrwxrwxrwx 1 root root 49 Dec 4 21:51 pgbackrest-backup-diff@weather-01.timer -> /etc/systemd/system/pgbackrest-backup-diff@.timer
lrwxrwxrwx 1 root root 31 Dec 5 02:02 pgbackrest-backup-full@weather-01.service -> pgbackrest-backup-full@.service
lrwxrwxrwx 1 root root 49 Dec 4 21:51 pgbackrest-backup-full@weather-01.timer -> /etc/systemd/system/pgbackrest-backup-full@.timer
lrwxrwxrwx 1 root root 31 Dec 16 18:32 pgbackrest-backup-incr@weather-01.service -> pgbackrest-backup-incr@.service
lrwxrwxrwx 1 root root 49 Dec 16 18:32 pgbackrest-backup-incr@weather-01.timer -> /etc/systemd/system/pgbackrest-backup-incr@.timer
Retention is configured at the "full" level, with the
repo1-retention-full setting.
Puppet setup
PostgreSQL servers are automatically configured to use pgBackRest to
backup to a central server (called repository), as soon as the
profile::postgresql is included, if
profile::postgresql::pgbackrest is true.
Note that the instructions here also apply if you're converting a legacy host to pgBackRest.
This takes a few Puppet runs to converge: at first, the catalog on the repository side will fail because of missing SSH keys on the client.
By default, the backup-storage-01.torproject.org server is used as a
repository, but this can be overridden in Hiera with the
profile::postgresql::pgbackrest_repository parameter. This is
normally configured automatically per hoster, however, so you
shouldn't need to change anything.
Manual configuration
Those instructions are for disaster recovery scenarios, when a manual configuration of pgBackRest is required. This typically happens when Puppet is down: for example, if the PuppetDB server was destroyed and needs to be recovered, it wouldn't be possible to deploy the backup system with Puppet.
Otherwise those instructions should generally not be used, as they
are normally covered by the profile::postgresql class.
Here, we followed the dedicated repository host installation
instructions. Below, we treat the "client" (weather-01) as the
server that's actually running PostgreSQL in production and the
"server" (backup-storage-01) as the backup server that's receiving
the backups.
- Install the package on both the client and the server:

  apt install pgbackrest

  Note: this creates a postgresql user instead of pgbackrest.

- Create an SSH key on the client:

  sudo -u postgres ssh-keygen

  Create a user and SSH key on the server:

  adduser --system pgbackrest-weather-01
  sudo -u pgbackrest-weather-01 ssh-keygen

- Those keys were exchanged to the other host by adding them in /etc/ssh/userkeys/$HOSTNAME with the prefix:

  restrict,command="/usr/bin/pgbackrest ${SSH_ORIGINAL_COMMAND#* }"

  For example, on the server:

  echo 'restrict,command="/usr/bin/pgbackrest ${SSH_ORIGINAL_COMMAND#* }" ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJrOnnOpX0cyzQ/lqvNLQt2mcJUziiJ0MdubSf/c1+2g postgres@test-01' \
    > /etc/ssh/userkeys/pgbackrest-weather-01

  On the client, the key should be in /etc/ssh/userkeys/postgres.

- Test the cross-connect with:

  root@weather-01:~# sudo -u postgres ssh pgbackrest-weather-01@backup-storage-01.torproject.org

  This should display the pgbackrest usage. Also test from the server to the client:

  root@backup-storage-01:~# sudo -u pgbackrest-weather-01 ssh postgres@weather-01.torproject.org

- Configure the client on the server, in /etc/pgbackrest/conf.d/weather-01.torproject.org.conf:
[weather-01.torproject.org]
lock-path = /var/lock/pgbackrest/weather-01.torproject.org
pg1-host = weather-01.torproject.org
pg1-path = /var/lib/postgresql/15/main
log-path = /var/log/pgbackrest/weather-01.torproject.org
repo1-path = /var/lib/pgbackrest
- Configure the server on the client, in
/etc/pgbackrest/conf.d/server.conf:
[global]
log-level-file = detail
repo1-path = /var/lib/pgbackrest
repo1-host = backup-storage-01.torproject.org
repo1-host-user = pgbackrest-weather-01
[weather-01.torproject.org]
pg1-path = /var/lib/postgresql/15/main
- Create the "stanza" on the server:

  sudo -u pgbackrest-weather-01 pgbackrest --stanza=weather-01 stanza-create

- Modify the PostgreSQL configuration on the client to archive to pgBackRest, in /etc/postgresql/15/main/postgresql.conf:

  archive_command = 'pgbackrest --stanza=main archive-push %p'
  wal_level = replica

- Test the configuration, on the client:

  root@weather-01:~# sudo -u postgres pgbackrest --stanza=weather-01 check

  Note that this will wait for an archive to be successfully sent to the server. It will wait a full minute before failing with a helpful error message, like:

  ERROR: [082]: WAL segment 000000010000001F00000004 was not archived before the 60000ms timeout
  HINT: check the archive_command to ensure that all options are correct (especially --stanza).
  HINT: check the PostgreSQL server log for errors.
  HINT: run the 'start' command if the stanza was previously stopped.

  In my case, the --stanza in the postgresql.conf file was incorrect.

- Test the configuration, on the server:

  root@backup-storage-01:~# sudo -u pgbackrest-weather-01 pgbackrest --stanza=weather-01 check

- Perform a first backup, from the server:

  root@backup-storage-01:~# sudo -u postgres pgbackrest --stanza=weather-01 backup

  The warning (WARN: no prior backup exists, incr backup has been changed to full) is expected. The first full backup completed in 6 minutes on weather-01.
Other documentation
See also:
- PostgreSQL official documentation
- PostgreSQL wiki
- Debian DSA documentation
- postgresqlco.nf: easily accessible parameter documentation
pgBackRest
- Debian setup guide, built with SSH
- RHEL setup guide, built with TLS, see also this TLS guide, TLS considered for performance improvements, but might be premature optimization
- Configuration reference
- FAQ
Discussion
Overview
Technical debt that needs to eventually be addressed:
- the pgbackrest_exporter currently runs as root since it needs to be able to read from backup directories under all of the backup users. We want to implement a better method for the exporter to get access to the files without running as root.

- pgBackRest runs over SSH, while it seems TLS offers better performance and isolation, see this comment and others

- the pgbackrest Puppet module has effectively been forked to support automated multiple servers backup, and should be merged back upstream

- PITR restores (e.g. "go back in time") are not well documented, but should be relatively easy to perform in pgBackRest
Goals
Must have
Nice to have
Non-Goals
Approvals required
Proposed Solution
Cost
Alternatives considered
Backup systems
We used to have a legacy system inherited from DSA without any other upstream, with code living here and there in various git repositories.
In late 2024 and early 2025, it was replaced with pgBackRest as part of TPA-RFC-65. It's not perfect: upstream documentation is, as is often the case, not quite complete, but it's pretty good. Performance is excellent, it's much simpler and more contained, it's well packaged in Debian, and well supported upstream. It seems to be pretty much the standard PG backup tool at this point.
This section documents various alternative backup systems, including the legacy backup system.
Barman
Barman presumably makes "taking an online hot backup of PostgreSQL" "as easy as ordering a good espresso coffee". It seems well maintained (last release 3.2.0 on 20 October 2022, 7 days ago), and with a healthy community (45 contributors, 7 with more than 1000 SLOC, 5 pending PRs, 83 open issues).
It is still seeing active development and new features, with a few sponsors and professional support from the company owning the copyright (EnterpriseDB).
It's in Debian, and well maintained there (only a day between the 3.2.0 release and the upload to unstable). It's licensed under the GPLv3.
The documentation is a little confusing; it's a single HTML page or a PDF on the release page. The main command and configuration files each have a manual page, and so do some sub-commands, but not all.
Quote from the about page:
Features & Goals
- Full hot physical backup of a PostgreSQL server
- Point-In-Time-Recovery (PITR)
- Management of multiple PostgreSQL servers
- Remote backup via rsync/SSH or pg_basebackup (including a 9.2+ standby)
- Support for both local and remote (via SSH) recovery
- Support for both WAL archiving and streaming
- Support for synchronous WAL streaming (“zero data loss”, RPO=0)
- Incremental backup and recovery
- Parallel backup and recovery
- Hub of WAL files for enhanced integration with standby servers
- Management of retention policies for backups and WAL files
- Server status and information
- Compression of WAL files (bzip2, gzip or custom)
- Management of base backups and WAL files through a catalogue
- A simple INI configuration file
- Totally written in Python
- Relocation of PGDATA and tablespaces at recovery time
- General and disk usage information of backups
- Server diagnostics for backup
- Integration with standard archiving tools (e.g. tar)
- Pre/Post backup hook scripts
- Local storage of metadata
Missing features:
- streaming replication support
- S3 support
The design is actually eerily similar to the existing setup: it uses
pg_basebackup to make a full backup, then the archive_command to
stream WAL logs, at least in one configuration. It actually supports
another configuration which provides zero data loss in case of an
outage, as setups depending on archive_command actually can result
in data loss, because PostgreSQL commits the WAL file only in 16MB
chunks. See the discussion in the Barman WAL archive for more
information on those two modes.
In any case, the architecture is compatible with our current setup and it looked like a good candidate. The WAL file compression is particularly interesting, but all the other extra features and the community, regular releases, and Debian packaging make it a prime candidate for replacing our bespoke scripts.
In September 2024, Barman was tested in tpo/tpa/team#40950, but it did not go well and Barman was ultimately abandoned. Debugging was difficult, documentation was confusing, and it just didn't actually work. See this comment for details.
pg_rman
pg_rman is a "Backup and restore management tool for PostgreSQL". It seems relatively well maintained, with a release in late 2021 (1.3.14, less than a year ago), and the last commit in September (about a month ago). It has a smaller community than Barman, with 13 contributors and only 3 with more than a thousand SLOC. 10 pending PRs, 12 open issues.
It's unclear where one would get support for this tool. There doesn't seem to be commercial support or sponsors.
It doesn't appear to be in Debian. It is licensed under an unusual
BSD-like license requiring attribution to the NIPPON TELEGRAPH
AND TELEPHONE CORPORATION.
Documentation is a single manpage.
It's not exactly clear how this software operates. It seems like it's a tool to make PITR backups but only locally.
Probably not a good enough candidate.
repmgr
repmgr is a tool for "managing replication and failover in a cluster of PostgreSQL servers. It enhances PostgreSQL's built-in hot-standby capabilities with tools to set up standby servers, monitor replication, and perform administrative tasks such as failover or manual switchover operations".
It does not seem, in itself, to be a backup manager, but could be abused to be one. It could be interesting to operate hot-standby backup servers, if we'd wish to go in that direction.
It is developed by the same company as Barman, EnterpriseDB. It is packaged in Debian.
No other investigation was performed on the program because its design was seen as compatible with our current design, but also because EnterpriseDB also maintains Barman. And, surely, they wouldn't have two backup systems, would they?
omniptr
omniptr is another such tool I found. Its README is really
lacking in details, but it looks like it does something similar to what
we do, hooking into the archive_command to send logs... somewhere.
I couldn't actually figure out its architecture or configuration from
a quick read of the documentation, which is not a good sign. There's a
bunch of .pod files in a doc directory, but it's kind of a mess
in there.
It does not seem to be packaged in Debian, and doesn't seem very active. The last release (2.0.0) is almost 5 years old (November 2017). It doesn't have a large developer community, only 8 developers, none of them with more than a thousand lines of code (omniptr is small though).
It's written in Perl, with a license similar to the PostgreSQL license.
I do not believe it is a suitable replacement for our backup system.
pgBackRest TLS server
pgBackRest has a server command that runs a TLS-enabled service on both the PostgreSQL server and the repository. The PostgreSQL server then uses TLS instead of SSH pipes to push WAL files to the repository, and the repository pulls backups over TLS from the servers.
We haven't picked that option because it requires running pgbackrest
server everywhere. We prefer to rely on SSH instead.
Using SSH also allows us to use multiple, distinct users for each backup server which reduces lateral movement between backed up hosts.
Legacy DSA backup system
We were previously using a bespoke backup system shared with DSA. It was built with a couple of shell and Perl script deployed with Puppet.
It used upstream's Continuous Archiving and Point-in-Time Recovery
(PITR) which relies on PostgreSQL's "write-ahead log" (WAL) to write
regular "transaction logs" of the cluster to the backup host. (Think
of transaction logs as incremental backups.) This was configured in
postgresql.conf, using a configuration like this:
track_counts = yes
archive_mode = on
wal_level = archive
max_wal_senders = 3
archive_timeout = 6h
archive_command = '/usr/local/bin/pg-backup-file main WAL %p'
The latter was a site-specific script which reads a config file in
/etc/dsa/pg-backup-file.conf where the backup host is specified
(e.g. torbackup@bungei.torproject.org). That command passes the
WAL logs onto the backup server, over SSH. A WAL file is shipped
immediately when it is full (16MB of data by default) but no later
than 6 hours (varies, see archive_timeout on each host) after it was
first written to. On the backup server, the command is set to
debbackup-ssh-wrap in the authorized_keys file and takes the
store-file pg argument to write the file to the right location.
WAL files are written to /srv/backups/pg/$HOSTNAME, where $HOSTNAME
is the short hostname (without .torproject.org). WAL files are prefixed with main.WAL.
(where main is the cluster name) with a long unique string after,
e.g. main.WAL.00000001000000A40000007F.
For that system to work, we also needed full backups to happen on a
regular basis. That was done straight from the backup server (again
bungei) which connects to the various PostgreSQL servers and runs a
pg_basebackup to get a complete snapshot of the cluster. This
happens weekly (every 7 to 10 days) in the wrapper
postgres-make-base-backups, which is a wrapper (based on a Puppet
concat::fragment template) that calls
postgres-make-one-base-backup for each PostgreSQL server.
The base files are written to the same directory as WAL file and are named using the template:
$CLUSTER.BASE.$SERVER_FQDN-$DATE-$ID-$CLIENT_FQDN-$CLUSTER-$VERSION-backup.tar.gz
... for example:
main.BASE.bungei.torproject.org-20190804-214510-troodi.torproject.org-main-13-backup.tar.gz
All of this works because SSH public keys and PostgreSQL credentials are
passed around between servers. That is handled in the Puppet
postgresql module for the most part, but some bits might still be
configured manually on some servers.
Backups were checked for freshness in Nagios using the
dsa-check-backuppg plugin with its configuration stored in
/etc/dsa/postgresql-backup/dsa-check-backuppg.conf.d/, per
cluster. The Nagios plugin also took care of expiring backups when
they were healthy.
The actual retention period was defined in the
/etc/nagios/dsa-check-backuppg.conf configuration file on the
storage server:
retention: 1814400
That number, in seconds, was 21 days.
Running backups was a weird affair; this was the command to run a backup for meronense:
sudo -u torbackup postgres-make-one-base-backup $(grep ^meronense.torproject.org $(which postgres-make-base-backups ))
Indeed, the postgres-make-base-backups file was generated by
Puppet based on Concat exported resources (!) and had its
configuration inline (as opposed to a separate configuration file).
This system was finally and completely retired in June 2025. Most of the code was ripped out of Puppet then, in ad6e74e31 (rip out legacy backup code (tpo/tpa/team#40950), 2025-06-04). Large chunks of documentation about the legacy system were also removed from this page in 67d6000d (postgresql: purge legacy documentation (tpo/tpa/team#40950), 2025-06-17).
Replication
We don't do high availability right now, but if we would, we might want to consider pg_easy_replicate.