Migrating to an Encrypted Postgres RDS Instance using Replication Techniques to Minimize Downtime

Using synchronization to bring up a parallel database you can cut over to.

Introduction

Data encryption at rest has become one of the risk mitigation strategies adopted by most security standards (ISO 27001, among others). If you are running a non-encrypted RDS Postgres instance in a production environment, it can be tricky to upgrade. There is no "click-click" method in RDS to encrypt the database in place (as there is for size changes, parameter changes, minor version bumps, etc.), and if your schema uses data field types that DMS does not support, the DMS route is closed to you as well.

It is easy to upgrade if you take everything down, but if you want to minimize downtime things get tricky. The method outlined in this blog is to create a replicated database in parallel to your production database and then cut over to it. The same method can also be used for major Postgres version upgrades.

"Click-Click" Amazon DMS Method (nice - tiny downtime)

This is the approach recommended by Amazon. It only works, however, if DMS (Database Migration Service) supports all the field types your database uses. In our case we were using Python and Django's ORM, which relies on field types that DMS does not support, so the DMS method was not available to us.

Start by trying this method; it is well documented and clear: here.

Non-Replication Method (ouch - downtime)

As a backdrop and alternative to the replication upgrade method, you can take your service down and follow these steps:

  1. Down all services connecting to your database
  2. Create a snapshot of the primary database
  3. Stop the primary database
  4. Copy the snapshot (this gives you the option to move it to encrypted storage)
  5. Restore the copied & now encrypted snapshot
  6. Move all services to the new encrypted database host
  7. Start all services now connected to the new encrypted database

This method takes quite a while and requires full downtime. The duration of the downtime depends on the size of your database, and every service using the database must be updated to point at the new host. When we tested this on a 150 GB database, the whole process took about an hour.

Replication Method (nice - tiny downtime)

Postgres version 10 introduced logical replication, which lets you set up a publisher and a subscriber for all tables, or a subset of tables (hence the "logical" name). Logical replication is an alternative to binary WAL synchronisation (Write Ahead Log shipping, in which all database activity is synchronised). RDS uses WAL synchronisation internally to run its read replicas, but that mechanism is not exposed for external use, so the only built-in option available to users of RDS is logical replication.

As a side note, there are a few other options for synchronisation, including third-party tools like Fivetran, Xplenty, Stitch and Talend, which use either the XMIN method, plugins like wal2json, or proprietary approaches (they install tracking tables to follow synchronisation progress).

The approach discussed here uses Postgres's built-in tools to bring up a replication subscriber, get that database in sync with the publisher, and then cut over once synchronised. This still requires some downtime, but only for the duration of the cut over, which can be a few seconds.

This method is also commonly used for major version upgrades of Postgres to minimize downtime.

This method requires two downtime slots: one to apply the new database configuration (about one minute if you run a multi-AZ database) and another for the cut over to the new DB.

NB: you must test and provision adequate IOPS, or the process gets stuck (see the IOPS reference notes below).

The process for the Postgres replication method is outlined below.

Summary:

  1. Change the Primary database replication config
  2. Restore a snapshot of the primary DB to become an encrypted version of itself
  3. Truncate all tables in the restored encrypted database
  4. Create the Pub/Sub user and grant access
  5. Create the PUBLICATION on the master database
  6. Create the SUBSCRIPTION on the encrypted database
  7. Check the status of the synchronisation
  8. Cut over

Change the Primary Database Replication Config

  1. Make a new configuration group, or apply this parameter to your existing one: `rds.logical_replication` must be set to "1"
  2. Apply the replication-enabled configuration group to the primary database (requires downtime of about 1 minute)
  3. Click "Modify" >> "Database options, database parameter group" >> change to "postgres13-default-with-logical-replication-enabled" >> Continue >> Apply Immediately (or schedule it for your maintenance window)
  4. This requires a reboot; once the instance is back up you can verify the setting, as shown below
  5. This sets rds.logical_replication to 1 via the configuration editor (see here: Using PostgreSQL logical replication with Aurora - Amazon Aurora)
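After the reboot you can verify from a psql session that the parameter took effect. A minimal check (assuming the RDS-managed rds.logical_replication parameter, which should also flip wal_level to logical):

-- Run on the primary after the reboot
SHOW rds.logical_replication;  -- expect: on
SHOW wal_level;                -- expect: logical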

Restore a snapshot of the primary DB to become an encrypted version of itself

  1. Automated backups (main nav) >> click on the DB (find the latest snapshot) >> Actions >> Copy >> Enable Encryption (check box) >> Copy Snapshot
  2. Snapshots (main nav) >> Click on Snapshot created above >> Actions >> Restore Snapshot >> Add relevant VPC >> Restore DB Instance

Truncate all Tables in the Restored encrypted database

  1. Get connection creds for the newly restored database
  2. Connect and create a function to truncate all tables, then execute the function:

Utility function to truncate all tables for a specific user:

-- Truncates every table in the public schema owned by the given user
CREATE OR REPLACE FUNCTION truncate_tables(username IN VARCHAR) RETURNS void AS $$
DECLARE
    statements CURSOR FOR
        SELECT tablename FROM pg_tables
        WHERE tableowner = username AND schemaname = 'public';
BEGIN
    FOR stmt IN statements LOOP
        EXECUTE 'TRUNCATE TABLE ' || quote_ident(stmt.tablename) || ' CASCADE;';
    END LOOP;
END;
$$ LANGUAGE plpgsql;

SELECT truncate_tables('dbuser');
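To sanity-check the truncation, you can look at the planner's row estimates. This is a rough check only (n_live_tup comes from the stats collector and is an estimate, not an exact count), but every table should now read at or near zero:

-- Rough check that all tables are now empty (stats-collector estimates)
SELECT relname, n_live_tup
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;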

Create Pub/Sub user and grant access

  1. Log into the original unencrypted database and create a replication user:

CREATE USER replication_user WITH password 'a_good_password';

  2. Grant the user replication rights on the Master (RDS exposes these through the rds_replication role):

GRANT rds_replication TO replication_user;

  3. Grant privileges on the tables to the user:

GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO replication_user;
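As a quick sanity check, you can confirm the new user exists and holds the rds_replication role. A sketch that reads the role-membership catalogs directly:

-- Confirm replication_user has been granted rds_replication
SELECT r.rolname AS granted_role, m.rolname AS member
FROM pg_auth_members am
JOIN pg_roles r ON r.oid = am.roleid
JOIN pg_roles m ON m.oid = am.member
WHERE m.rolname = 'replication_user';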

Create the PUBLICATION on the Master (unencrypted DB)

  1. Create the publication (ref. docs here):

CREATE PUBLICATION api_db_pub FOR ALL TABLES;

  2. Check the publication is set up:

SELECT * FROM pg_catalog.pg_publication;

  3. Check which tables are published:

SELECT * FROM pg_publication_tables;
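Since the publication was created FOR ALL TABLES, the published-table count should match the table count in your schema. A quick check (a sketch, assuming everything lives in the public schema):

-- Every public table should appear in the publication
SELECT
    (SELECT count(*) FROM pg_publication_tables WHERE pubname = 'api_db_pub') AS published,
    (SELECT count(*) FROM pg_tables WHERE schemaname = 'public') AS total_public_tables;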

Create the SUBSCRIPTION on the encrypted database

  1. Create the subscription (ref. docs here):

CREATE SUBSCRIPTION api_db_sub
    CONNECTION 'postgresql://replication_user:a_good_password@UN_ENCRYPTEDHOST_URL_HERE/db_name_here'
    PUBLICATION api_db_pub;

  2. Check the subscription:

SELECT * FROM pg_stat_subscription;

  3. Check the status of the subscriber:

SELECT srsubid, pg_filenode_relation(0, srrelid), srsublsn, srsubstate
FROM pg_subscription_rel;

  4. Check the last synchronised messages (the last_msg_send_time and last_msg_receipt_time columns):

SELECT * FROM pg_stat_subscription;
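If SELECT * is too noisy, the columns below are the ones worth watching; received_lsn and latest_end_lsn show how far the apply worker has progressed:

-- Narrow the subscription status to the most useful columns
SELECT subname, received_lsn, latest_end_lsn, last_msg_receipt_time
FROM pg_stat_subscription;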

Check the status of the synchronisation

Check how far along individual tables are (state codes: i = initialize, d = data is being copied, s = synchronized, r = ready (normal replication)):

SELECT * FROM pg_subscription_rel;

Make sure all tables are synchronised: run manual counts (SELECT count(*) ...) on the tables in both databases, and check that the Postgres logs show each table has finished. Here are some example logs reflecting the completion of one table and the start of the next:

2021-10-08 12:35:43.075 SAST [20092] LOG: logical replication table synchronization worker for subscription "api_db_sub", table "cdrs_smscdr_meta" has finished

2021-10-08 12:35:43.089 SAST [20093] LOG: logical replication table synchronization worker for subscription "api_db_sub", table "cdrs_didprovider" has started
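The manual counts can be scripted rather than typed table by table. A minimal sketch that prints an exact row count for every table in the public schema (run it on both databases and compare the output):

-- Print exact row counts for every public table
DO $$
DECLARE
    t record;
    n bigint;
BEGIN
    FOR t IN SELECT tablename FROM pg_tables WHERE schemaname = 'public' LOOP
        EXECUTE format('SELECT count(*) FROM %I', t.tablename) INTO n;
        RAISE NOTICE '%: % rows', t.tablename, n;
    END LOOP;
END;
$$;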

You can also estimate how many rows there are and how far the process has got by comparing reltuples in one database against the other (note: not exact, but you can see progress this way):

SELECT nspname AS schemaname, relname, reltuples
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
  AND relkind = 'r'
ORDER BY reltuples DESC;

Cut Over

Once synchronised you can cut the services over to the new database. Note that this is not a multi-master setup: writes to the new database will not propagate back to the unencrypted one, so the cut over has to be hard (i.e. you can't migrate services that write to the database in piecemeal fashion; it must all be done at once).

  1. Disconnect all services from the unencrypted database
  2. Migrate services to the new encrypted database (see the sequence note below)
  3. Switch off the primary unencrypted database
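One caveat worth handling during the cut over: logical replication copies table data but not sequence values, so auto-increment counters on the encrypted database will still be at their initial values. Before allowing writes, bump each sequence past the maximum of its column. A sketch for a single table, assuming a Django-style id primary key (cdrs_didprovider is one of the tables from the logs above); repeat for each serial column:

-- Sequences are not replicated; reset them before opening the DB for writes
SELECT setval(
    pg_get_serial_sequence('cdrs_didprovider', 'id'),
    COALESCE((SELECT max(id) + 1 FROM cdrs_didprovider), 1),
    false
);

-- Once cut over, the subscription is no longer needed on the new database
DROP SUBSCRIPTION api_db_sub;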


Conclusion

The easiest option for creating a synchronised database in RDS is Amazon's DMS service; failing that, use the logical replication features Postgres offers. Once you have a synchronised database running, it is easy to cut your services over, whether you are migrating from an unencrypted database to an encrypted one or from one major version of Postgres to another.

IOPS Reference Notes

The synchronisation is IOPS-heavy. It is strongly recommended that you migrate to provisioned-IOPS RDS storage so that the synchronisation runs faster. Test the migration first and establish the minimum IOPS you need before doing it in a production environment. In our experience the whole synchronisation stalls and makes no progress if too few IOPS are provisioned.

You can also get higher IOPS by provisioning more storage. Baseline I/O performance for General Purpose SSD storage is 3 IOPS per GiB, with a minimum of 100 IOPS, so larger volumes perform better (e.g. 300 GiB = 900 IOPS).