When you want to evaluate the migration from PostgreSQL to YugabyteDB you may:
- load metadata (DDL) from a pg_dump
- load data with pg_restore
- test the application
Here are some quick scripts that I've used for that. And it can give an idea of what can be done to import data from a http endpoint in JSON into a PostgreSQL table.
DDL
The DDL generated by pg_dump
creates the table and then adds the primary key. This is fine in PostgreSQL as tables are heap tables and the primary key index is a secondary index. However, YugabyteDB tables are stored in LSM Tree organized (sharded by hash, range and sorted) according to the primary key. Similar to tables in MySQL, clustered indexes in SQL Server or Index Organized Tables in Oracle, but with horizontal partitioning to scale out. So, if you run the CREATE TABLE and ALTER TABLE to add the primary key, you will create them with a generated UUID, and then reorganize the tables. On empty tables, this is not a big deal but better do all in the create statement. One way to do that is using yb_dump
, the Yugabyte fork of pg_dump
, to export from PostgreSQL. But if you already have the .sql file, here is a quick awk to modify it:
awk'
# gather primary key definitions in pk[] and cancel lines that adds it later
/^ALTER TABLE ONLY/{last_alter_table=$NF}
/^ *ADD CONSTRAINT .* PRIMARY KEY /{sub(/ADD /,"");sub(/;$/,"");pk[last_alter_table]=$0",";$0=$0"\\r"}
# second pass (i.e when NR>FNR): add primary key definition to create table
NR > FNR && /^CREATE TABLE/{ print $0,pk[$3] > "schema_with_pk.sql" ; next}
# disable backfill for faster create index on empty tables
/^CREATE INDEX/ || /^CREATE UNIQUE INDEX/ { sub("INDEX","INDEX NONCONCURRENTLY") }
NR > FNR { print > "schema_with_pk.sql" }
' schema.sql schema.sql
This is a quick-and-dirty two-pass on the DDL to get the primary key definition, and then pur it in the CREATE TABLE statement.
Note that I also change the CREATE INDEX to disable backfill. More info about this in a previous blog post: https://dev.to/yugabyte/create-index-in-yugabytedb-online-or-fast-2dl3
Youu will probably create the secondary indexes after the load, but I like to run the whole DDL once, before loading data, to see if there are some PostgreSQL features that we do not support yet.
COPY progress
Then you may import terabytes of data. This takes time (there is work in progress to optimize this) and you probably want to see the progress. As we commit every 1000 rows by default (can be changed with rows_per_transaction
) you can select count(*)
but, that takes time on huge tables, requiring to increase the default timeout.
We have many statistics from the tablet server, exposed though the tserver endoint, like http://yb1.pachot.net:9000/metrics, as JSON. But for a PoC you may not have set any statistics collection and want a quick look at the statistics.
Here is a script to gather all tablet metrics into a temporary table:
-- create a temporary table to store the metrics:
drop table if exists my_tablet_metrics;
create temporary table if not exists my_tablet_metrics (
table_id text,tablet_id text
,schema_name text, table_name text
,metric text, value numeric
,tserver text
,timestamp timestamp default now()
);
-- gather the metrics from the tserver:
do $do$
declare
tserver record;
begin
-- gather the list of nodes and loop on them
for tserver in (select * from yb_servers())
loop
-- use COPY from wget/jq to load the metrics
execute format($copy$
copy my_tablet_metrics(table_id,tablet_id
,schema_name,table_name
,metric,value,tserver)
from program $program$
wget -O- http://%s:9000/metrics |
jq --arg node "%s" -r '.[]
|select(.type=="tablet")
|.attributes.table_id+"\t"+.id
+"\t"+.attributes.namespace_name
+"\t"+.attributes.table_name
+"\t"+(.metrics[]|select(.value>0)
|(.name+"\t"+(.value|tostring)))
+"\t"+$node
'
$program$
with (rows_per_transaction 0)
$copy$,tserver.host,tserver.host);
end loop;
end;
$do$;
-- aggregate the per-tablet row insertion metrics:
select sum(value),format('%I.%I',schema_name,table_name)
from my_tablet_metrics where metric='rows_inserted'
group by 2 order by 1
;
I use a server-side COPY FROM PROGRAM here, so you must have installed wget
to read from the http endpoint and jq
to transform the JSON structure to a tab-separated text.
Here is an example after loading the Northwind demo schema:
If you find any issue when loading your PostgreSQL into Yugabyte, please reach out to us (https://www.yugabyte.com/community) there are many things in the roadmap.