Bulk loading data into PGD clusters
This guidance is specifically for environments where there's no direct access to the PGD nodes, only PGD Proxy endpoints, such as BigAnimal's distributed high availability deployments of PGD.
Without care, bulk loading data into a PGD cluster can generate a heavy replication load on the cluster. This guide describes a process that mitigates that load.
Provision or prepare a PGD cluster
You must provision a PGD cluster, either manually, using TPA, or on BigAnimal. This will be the target database for the migration. Ensure that you provision it with sufficient storage capacity to hold the migrated data.
We recommend setting the following Postgres GUC variables, either when provisioning the cluster or afterward.
| GUC variable | Setting |
|---|---|
| maintenance_work_mem | 1GB |
| wal_sender_timeout | 60min |
| wal_receiver_timeout | 60min |
| max_wal_size | Either a multiple (2 or 3) of the size of your largest table, or more than one third of the capacity of your dedicated WAL disk (if configured) |
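As a rough illustration of the two sizing options for max_wal_size, the following sketch computes both candidate values. The function name and the sample sizes are hypothetical, and it assumes sizes are given in whole gigabytes.

```shell
# Hypothetical helper: print the two candidate max_wal_size values (in GB).
#   $1 = size of the largest table in GB
#   $2 = capacity of the dedicated WAL disk in GB
#   $3 = multiple of the largest table (2 or 3; defaults to 2)
wal_size_candidates() {
  largest_table_gb="$1"
  wal_disk_gb="$2"
  multiple="${3:-2}"
  by_table=$((largest_table_gb * multiple))
  # "More than one third" of the disk: integer division, then add 1 GB.
  by_disk=$((wal_disk_gb / 3 + 1))
  echo "by_table=${by_table}GB by_disk=${by_disk}GB"
}

# Example with made-up sizes: a 40 GB table and a 300 GB WAL disk.
wal_size_candidates 40 300
```

Which of the two values to use depends on whether a dedicated WAL disk is configured.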
Make note of the target's proxy hostname and port. You also need a user and password for the target cluster.
The following instructions give examples for a cluster named `ab-cluster` with an `ab-group` subgroup and three nodes: `ab-node-1`, `ab-node-2`, and `ab-node-3`. The cluster is accessed through a host named `ab-proxy`. On BigAnimal, a cluster is configured, by default, with an `edb_admin` user that can be used for the bulk upload.
Identify your data source
You need the source hostname, port, database name, user, and password for your source database.
Also, you currently need a list of tables in the database that you want to migrate to the target database.
Prepare a bastion server
Create a virtual machine with your preferred operating system in the cloud to orchestrate your bulk loading.
- Use your EDB account.
  - Obtain your EDB repository token from the EDB Repos 2.0 page.
- Set environment variables.
  - Set the `EDB_SUBSCRIPTION_TOKEN` environment variable to the repository token.
- Configure the repositories.
  - Run the automated installer to install the repositories.
- Install the required software.
  - Install and configure:
    - psql
    - PGD CLI
    - Migration Toolkit
Configure repositories
The required software is available from the EDB repositories. You need to install the EDB repositories on your bastion server.
Run the automated installer for your platform (Red Hat or Ubuntu/Debian).
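The repository setup step might look like the following sketch. It assumes that `EDB_SUBSCRIPTION_TOKEN` is already exported and that the repository is named `postgres_distributed`; the helper name is hypothetical, and the repository name should be checked against what your subscription provides.

```shell
# Build the URL of the automated repository installer script.
#   $1 = repository name (assumed here to be postgres_distributed)
#   $2 = package format: "rpm" for Red Hat, "deb" for Ubuntu/Debian
edb_repo_setup_url() {
  repo="$1"
  format="$2"
  echo "https://downloads.enterprisedb.com/${EDB_SUBSCRIPTION_TOKEN}/${repo}/setup.${format}.sh"
}

# To run the installer on Ubuntu/Debian (not executed here):
#   export EDB_SUBSCRIPTION_TOKEN=<your-repository-token>
#   curl -1sSLf "$(edb_repo_setup_url postgres_distributed deb)" | sudo -E bash
```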
Install the required software
Once the repositories are configured, you can install the required software.
psql and pg_dump/pg_restore
The psql command is the interactive terminal for working with PostgreSQL. It's a client application and can be installed on any operating system. Packaged with psql are pg_dump and pg_restore, command-line utilities for dumping and restoring PostgreSQL databases.
Install the PostgreSQL client package for your platform (Ubuntu or Red Hat).
To simplify logging in to the databases, create a .pgpass file for both your source and target servers:
Create the file in your home directory and change its permissions to read/write only for the owner.
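A minimal sketch of creating the password file follows. Each line uses the standard `hostname:port:database:username:password` format. The helper name is hypothetical and every value in the usage comment is a placeholder.

```shell
# Write a password file with one entry per server and restrict it to
# read/write for the owner only.
#   $1 = path of the file, remaining args = one entry per server
write_pgpass() {
  pgpass_file="$1"
  shift
  # Create (or empty) the file with owner-only permissions from the start.
  ( umask 077 && : > "$pgpass_file" )
  for entry in "$@"; do
    printf '%s\n' "$entry" >> "$pgpass_file"
  done
  chmod 0600 "$pgpass_file"
}

# Example with placeholder values (not executed here):
#   write_pgpass "$HOME/.pgpass" \
#     "source-host:5432:source-db:source-user:source-password" \
#     "ab-proxy:6432:bdrdb:edb_admin:target-password"
```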
PGD CLI
PGD CLI is a command-line interface for managing and monitoring PGD clusters. It's a Go application and can be installed on any operating system.
Install the PGD CLI package for your platform (Ubuntu or Red Hat).
Create a configuration file for the PGD CLI that describes the example `ab-cluster`, and save it as `pgd-cli-config.yml`.
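The configuration file for the example cluster could be generated as in this sketch. The endpoint values in the usage comment (port 6432, database `bdrdb`, user `edb_admin`) are assumptions, and the helper name is hypothetical.

```shell
# Write a PGD CLI configuration file with a single endpoint.
#   $1 = output file, $2 = cluster name, $3 = endpoint connection string
write_pgd_cli_config() {
  config_file="$1"
  cluster="$2"
  endpoint="$3"
  cat > "$config_file" <<EOF
cluster:
  name: ${cluster}
  endpoints:
    - "${endpoint}"
EOF
}

# Example (not executed here):
#   write_pgd_cli_config pgd-cli-config.yml ab-cluster \
#     "host=ab-proxy port=6432 dbname=bdrdb user=edb_admin"
```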
See also Installing PGD CLI.
Migration Toolkit
EDB's Migration Toolkit (MTK) is a command-line tool that can be used to migrate data from a source database to a target database. It's a Java application and requires a Java runtime environment to be installed.
Install the Migration Toolkit package for your platform (Ubuntu or Red Hat).
See also Installing Migration Toolkit.
Set up and tune the target cluster
On the target cluster and within the regional group required, select one node to be the destination for the data.
If you have a group `ab-group` with `ab-node-1`, `ab-node-2`, and `ab-node-3`, you can select `ab-node-1` as the destination node.
Set up a fence
Fence off all other nodes except for the destination node.
Connect to any node on the destination group using the psql command.
Use `bdr.alter_node_option` to set the `route_fence` option to `true` for each node in the group apart from the destination node:
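The fencing step can be sketched as a small generator that prints one `bdr.alter_node_option` call per node to be fenced. The function name is hypothetical, and the connection string in the usage comment is an assumption.

```shell
# Print one statement per node, setting route_fence to the given value
# for every node except the destination.
#   $1 = destination node, $2 = route_fence value, remaining args = all nodes
route_fence_sql() {
  dest="$1"
  value="$2"
  shift 2
  for node in "$@"; do
    if [ "$node" != "$dest" ]; then
      echo "SELECT bdr.alter_node_option('${node}', 'route_fence', '${value}');"
    fi
  done
}

# Print the statements that fence ab-node-2 and ab-node-3.
route_fence_sql ab-node-1 true ab-node-1 ab-node-2 ab-node-3

# To apply them, pipe the output to psql, for example:
#   route_fence_sql ab-node-1 true ab-node-1 ab-node-2 ab-node-3 |
#     psql -d "host=ab-proxy port=6432 dbname=bdrdb user=edb_admin"
```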
The next time you connect with psql, you're directed to the write leader, which should be the destination node. To ensure that it is, you need to send two more commands.
Make the destination node both write and raft leader
To minimize the possibility of disconnections, move the raft and write leader roles to the destination node.
Make the destination node the raft leader using `bdr.raft_leadership_transfer`. Because you fenced off the other nodes in the group, this command triggers a write leader election that elects `ab-node-1` as write leader.
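For the example cluster, the leadership transfer might be issued as in this sketch. The three arguments are the node name, whether to wait for completion, and the node group name. The variable names are illustrative.

```shell
DEST_NODE="ab-node-1"
GROUP="ab-group"

# Arguments: node name, wait for completion, node group name.
RAFT_SQL="SELECT bdr.raft_leadership_transfer('${DEST_NODE}', true, '${GROUP}');"
echo "$RAFT_SQL"

# To apply (not executed here):
#   psql -d "<connection string>" -c "$RAFT_SQL"
```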
Record then clear default commit scopes
You need to make a record of the default commit scopes in the cluster, because the next step overwrites them and you must restore them at the end of this process. Query the default commit scope of each group and record the values. You can then overwrite the settings.
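One way to record and then overwrite the commit scopes is sketched below. It assumes the `bdr.node_group_summary` view exposes `node_group_name` and `default_commit_scope`. The function name and the `commit_scopes.txt` file name are hypothetical.

```shell
# Query that lists each group's default commit scope.
RECORD_SQL="SELECT node_group_name, default_commit_scope FROM bdr.node_group_summary;"

# Print one statement per group that sets its default commit scope to local.
# Pass every group name returned by the query above.
local_scope_sql() {
  for group in "$@"; do
    echo "SELECT bdr.alter_node_group_option('${group}', 'default_commit_scope', 'local');"
  done
}

echo "$RECORD_SQL"
local_scope_sql ab-group

# Save the recorded values before overwriting (not executed here):
#   psql -At -d "<connection string>" -c "$RECORD_SQL" > commit_scopes.txt
#   local_scope_sql ab-group | psql -d "<connection string>"
```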
Prepare to monitor the data migration
Check that the target cluster is healthy.
- To check the overall health of the cluster, run `pgd -f pgd-cli-config.yml check-health`. (When the cluster is healthy, all checks pass.)
- To verify the configuration of the cluster, run `pgd -f pgd-cli-config.yml verify-cluster`. (When the cluster is verified, all checks pass.)
- To check the status of the nodes, run `pgd -f pgd-cli-config.yml show-nodes`.
- To confirm the raft leader, run `pgd -f pgd-cli-config.yml show-raft`.
- To confirm the replication slots, run `pgd -f pgd-cli-config.yml show-replslots`.
- To confirm the subscriptions, run `pgd -f pgd-cli-config.yml show-subscriptions`.
- To confirm the groups, run `pgd -f pgd-cli-config.yml show-groups`.
These commands provide a snapshot of the state of the cluster before the migration begins.
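To keep that snapshot for later comparison, the commands can be run in a loop and their output saved, as in this sketch. The function name and the output directory layout are arbitrary choices, not part of PGD CLI.

```shell
# Run each PGD CLI command listed above and save its output to a file
# named after the command.
#   $1 = PGD CLI configuration file, $2 = output directory
snapshot_cluster() {
  config="$1"
  outdir="$2"
  mkdir -p "$outdir"
  for cmd in check-health verify-cluster show-nodes show-raft \
             show-replslots show-subscriptions show-groups; do
    pgd -f "$config" "$cmd" > "${outdir}/${cmd}.txt" 2>&1
  done
}

# Example (not executed here):
#   snapshot_cluster pgd-cli-config.yml "pgd-snapshot-$(date +%Y%m%d-%H%M%S)"
```

Running the same function after the migration gives a second set of files to compare against the first.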
Migrating the data
Currently, you must migrate the data in three phases:
- Transferring the “pre-data” using pg_dump and pg_restore, which exports and imports all the data definitions.
- Using MTK with the `-dataOnly` option to transfer only the data from each table, repeating as necessary for each table.
- Transferring the “post-data” using pg_dump and pg_restore, which completes the data transfer.
Transferring the pre-data
Use the `pg_dump` utility against the source database to dump the pre-data section in directory format.
Once the pre-data is dumped into the predata directory, you can load it into the target cluster using `pg_restore`.
The `options=` section in the connection string to the server is important. The options disable DDL locking and set the commit scope to `local`, overriding any default commit scopes. Using `--section=pre-data` limits the restore to the configuration that precedes the data in the dump.
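The pre-data dump and restore described above can be sketched as follows. The function names, the `SRC_*` variables, and the connection values are hypothetical. The session options are `bdr.ddl_locking=off` and `bdr.commit_scope=local`, passed through the `options=` part of the connection string.

```shell
# Build the target connection string. The options= part disables DDL
# locking and sets the commit scope to local for this session only.
target_conninfo() {
  host="$1"; port="$2"; db="$3"; user="$4"
  echo "host=${host} port=${port} dbname=${db} user=${user} options='-cbdr.ddl_locking=off -cbdr.commit_scope=local'"
}

# Dump one section of the source database in directory format.
#   $1 = section (pre-data), $2 = output directory
dump_section() {
  section="$1"; outdir="$2"
  pg_dump -Fd -f "$outdir" --section="$section" \
    -h "$SRC_HOST" -p "$SRC_PORT" -U "$SRC_USER" "$SRC_DB"
}

# Restore one section from a dump directory into the target.
#   $1 = section, $2 = dump directory, $3 = target connection string
restore_section() {
  section="$1"; dumpdir="$2"; conninfo="$3"
  pg_restore -Fd --section="$section" -d "$conninfo" "$dumpdir"
}

# Example with placeholder values (not executed here):
#   SRC_HOST=source-host; SRC_PORT=5432; SRC_USER=source-user; SRC_DB=source-db
#   dump_section pre-data predata
#   restore_section pre-data predata "$(target_conninfo ab-proxy 6432 bdrdb edb_admin)"
```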
Transferring the data
In this step, Migration Toolkit is used to transfer the table data between the source and target.
Edit `/usr/edb/migrationtoolkit/etc/toolkit.properties`. Editing this file requires elevated privileges, for example, `sudo vi /usr/edb/migrationtoolkit/etc/toolkit.properties`.
Edit the relevant values in the settings.
Ensure that the configuration file is owned by the user you intend to run the data transfer as and read-write only for its owner.
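A sketch of generating the properties file follows. It assumes both source and target are reached through PostgreSQL JDBC URLs and uses the Migration Toolkit property keys `SRC_DB_*` and `TARGET_DB_*`. The function name and the shell variables that feed it are hypothetical.

```shell
# Write connection settings to a properties file and make it readable
# and writable only by its owner.
#   $1 = path of the properties file
write_toolkit_properties() {
  props_file="$1"
  cat > "$props_file" <<EOF
SRC_DB_URL=jdbc:postgresql://${SRC_HOST}:${SRC_PORT}/${SRC_DB}
SRC_DB_USER=${SRC_USER}
SRC_DB_PASSWORD=${SRC_PASSWORD}
TARGET_DB_URL=jdbc:postgresql://${TGT_HOST}:${TGT_PORT}/${TGT_DB}
TARGET_DB_USER=${TGT_USER}
TARGET_DB_PASSWORD=${TGT_PASSWORD}
EOF
  chmod 0600 "$props_file"
}

# Example (not executed here): write to a scratch file, then install it
# with the right owner and mode.
#   write_toolkit_properties /tmp/toolkit.properties
#   sudo install -o "$USER" -m 0600 /tmp/toolkit.properties \
#     /usr/edb/migrationtoolkit/etc/toolkit.properties
```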
Now, select sets of tables in the source database that must be transferred together, ideally grouping them for redundancy in case of failure:
This command uses the `-truncLoad` option and drops indexes and constraints before the data is loaded. It then recreates them after the loading has completed.
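A single transfer might be wrapped as in the sketch below. The `runMTK.sh` path is assumed from the location of the properties file, the function name and table names are hypothetical, and you may also need `-sourcedbtype` and `-targetdbtype` for your databases.

```shell
# Assumed install location of the Migration Toolkit launcher.
MTK_BIN="${MTK_BIN:-/usr/edb/migrationtoolkit/bin/runMTK.sh}"

# Transfer only the data for one set of tables in one schema.
#   $1 = comma-separated table list, $2 = schema name
run_mtk_data_only() {
  tables="$1"
  schema="$2"
  "$MTK_BIN" -dataOnly -truncLoad -tables "$tables" "$schema"
}

# Example with hypothetical table names (not executed here):
#   run_mtk_data_only "orders,order_items" public > mtk.log 2>&1
```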
You can run multiple instances of this command in parallel. To do so, add an `&` to the end of the command. Ensure that you write the output from each to different files (for example, `mtk_1.log`, `mtk_2.log`).
For example:
This sets up four processes, each transferring a particular table or set of tables as a background process.
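A parallel launch along these lines can be sketched as a loop that starts one background process per table set, each writing to its own log. The launcher path, function name, and table names are assumptions. `nohup` is used so the jobs survive a dropped terminal session.

```shell
# Assumed install location of the Migration Toolkit launcher.
MTK_BIN="${MTK_BIN:-/usr/edb/migrationtoolkit/bin/runMTK.sh}"

# Start one background MTK process per table set, logging to mtk_<n>.log.
#   $1 = schema name, remaining args = comma-separated table sets
start_mtk_jobs() {
  schema="$1"
  shift
  job=0
  for tables in "$@"; do
    job=$((job + 1))
    nohup "$MTK_BIN" -dataOnly -truncLoad -tables "$tables" "$schema" \
      > "mtk_${job}.log" 2>&1 &
  done
  echo "started ${job} jobs"
}

# Four processes with hypothetical table names (not executed here):
#   start_mtk_jobs public "t1,t2" "t3" "t4,t5" "t6"
```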
While this is running, monitor the lag. Log into the destination node with psql, and monitor lag with:
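Lag monitoring from the shell could look like this sketch, which assumes the `bdr.node_replication_rates` view is available on the destination node. The function name and the polling interval are arbitrary.

```shell
# Query that reports replication progress for each peer node.
LAG_SQL="SELECT * FROM bdr.node_replication_rates;"

# Rerun the query every few seconds until interrupted with Ctrl+C
# or until psql fails to connect.
#   $1 = connection string, $2 = seconds between polls (default 5)
watch_lag() {
  conninfo="$1"
  interval="${2:-5}"
  while psql -d "$conninfo" -c "$LAG_SQL"; do
    sleep "$interval"
  done
}

# Example (not executed here):
#   watch_lag "host=ab-proxy port=6432 dbname=bdrdb user=edb_admin" 5
```

In an interactive psql session, running the same query followed by `\watch 5` has a similar effect.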
Once the lag is consumed, return to the shell. You can now use `tail` to monitor the progress of the data transfer by following the log files of each process:
Transferring the post-data
Make sure there's no replication lag across the entire cluster before proceeding with post-data.
Now dump the post-data section of the source database:
Then load the post-data section into the target database:
If this step fails due to a disconnection, return to monitoring lag (as described previously). Then, when no synchronization lag is present, repeat the restore.
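The post-data transfer can be sketched as one function that dumps and then restores. It assumes the restore uses the same session options as the pre-data restore. All connection values are placeholders and the function name is hypothetical.

```shell
# Placeholder connection details; replace with your own.
SRC_HOST="source-host"
SRC_PORT="5432"
SRC_USER="source-user"
SRC_DB="source-db"
TARGET_CONN="host=ab-proxy port=6432 dbname=bdrdb user=edb_admin options='-cbdr.ddl_locking=off -cbdr.commit_scope=local'"

# Dump the post-data section, then restore it only if the dump succeeded.
transfer_post_data() {
  pg_dump -Fd -f postdata --section=post-data \
    -h "$SRC_HOST" -p "$SRC_PORT" -U "$SRC_USER" "$SRC_DB" &&
  pg_restore -Fd --section=post-data -d "$TARGET_CONN" postdata
}

# Run with (not executed here):
#   transfer_post_data
# If the restore is interrupted, wait until there is no lag, then rerun
# only the pg_restore command.
```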
Resume the cluster
Remove the routing fences you set up earlier on the other nodes
Connect directly to the destination node using psql. Use `bdr.alter_node_option` and turn off the `route_fence` option for each node in the group except for the destination node, which is already off:
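For the example cluster, removing the fences might look like the following sketch. The variable names are illustrative, and the node list should match the nodes you fenced earlier.

```shell
# Nodes fenced earlier: every node except the destination, ab-node-1.
FENCED_NODES="ab-node-2 ab-node-3"

# Collect one statement per fenced node that turns route_fence off.
UNFENCE_SQL=$(
  for node in $FENCED_NODES; do
    echo "SELECT bdr.alter_node_option('${node}', 'route_fence', 'false');"
  done
)
echo "$UNFENCE_SQL"

# To apply (not executed here):
#   echo "$UNFENCE_SQL" | psql -d "<connection string>"
```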
Proxies can now route to all the nodes in the group.
Reset commit scopes
You can now restore the default commit scopes to the cluster to allow PGD to manage the replication load. Set `default_commit_scope` for the groups to the value for the groups that you recorded in an earlier step.
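If the recorded values were saved as `group|scope` lines (the unaligned format that `psql -At` produces), restoring them could be sketched as below. The function name, the file name, and the sample scope name are hypothetical. Groups with no recorded scope are skipped and need handling by hand.

```shell
# Read "group|scope" lines on stdin and print one statement per group
# that restores its recorded default commit scope.
restore_scope_sql() {
  while IFS='|' read -r group scope; do
    # Skip blank lines and groups that had no recorded scope.
    [ -n "$group" ] || continue
    [ -n "$scope" ] || continue
    echo "SELECT bdr.alter_node_group_option('${group}', 'default_commit_scope', '${scope}');"
  done
}

# Example with a made-up scope name:
printf 'ab-group|example_scope\n' | restore_scope_sql

# To apply recorded values from a file (not executed here):
#   restore_scope_sql < commit_scopes.txt | psql -d "<connection string>"
```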
The cluster is now loaded and ready for production. For more assurance, you can run the `pgd -f pgd-cli-config.yml check-health` command to check the overall health of the cluster and the other PGD commands from when you checked the cluster earlier.