osm2pgsql import speed is slow at the ways stage

The import speed for ways is very slow with osm2pgsql. I need tips on how to speed up the import.

Server Setup

6-core Xeon
20 GB RAM
1400 GB SAS

Speed test of my SAS drive

hdparm -tT /dev/sda2
/dev/sda2:
 Timing cached reads:   7492.82 MB/sec
SG_IO: 240.88 MB/sec

Import script

osm2pgsql --slim -r pbf -S mapfeatures_export.style -C 14000 -l --number-processes 5 -U osmdata --create --multi-geometry -d osmdatabase planet-190603.osm.pbf 

osm2pgsql console output during run

osm2pgsql version 0.96.0 (64 bit id space)

Using built-in tag processing pipeline
Using projection SRS 4326 (Latlong)
Setting up table: planet_osm_point
Setting up table: planet_osm_line
Setting up table: planet_osm_polygon
Setting up table: planet_osm_roads
Allocating memory for dense node cache
Allocating dense node cache in one big chunk
Allocating memory for sparse node cache
Sharing dense sparse
Node-cache: cache=14000MB, maxblocks=224000*65536, allocation method=11
Mid: pgsql, cache=14000
Setting up table: planet_osm_nodes
Setting up table: planet_osm_ways
Setting up table: planet_osm_rels

Reading in file: /home/osmdata/planet-190603.osm.pbf
Using PBF parser.
Processing: Node(5234820k 342.0k/s) Way(15203k 0.20k/s) Relation(0 0.00/s)

I tested an SSD setup, where the way import speed was 50k/s, but it is too expensive.

I followed the optimisation guide at https://www.geofabrik.de/media/2012-09-08-osm2pgsql-performance.pdf

I hope there are some additional options to tweak.

1 Answer

Tim Cederquist

Direct answer: try lowering your -C to around 4096; you probably don't have enough memory for both PostgreSQL and the osm2pgsql process. In addition, try '--flat-nodes /tmp/mycache.bin' to speed up processing by keeping node locations in a local file instead of in the database. You might also try --unlogged, but read the docs to make sure you are okay with the consequences.
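
A rough sketch of how those options could be combined with the command from the question (the 4096 MB cache and the /tmp/mycache.bin path are illustrative values, not something I have benchmarked on this dataset):

  osm2pgsql --create --slim -r pbf -S mapfeatures_export.style \
    -C 4096 --flat-nodes /tmp/mycache.bin --unlogged \
    -l --number-processes 5 --multi-geometry \
    -U osmdata -d osmdatabase planet-190603.osm.pbf

With --flat-nodes the in-memory node cache matters much less, which is why a smaller -C works; note that the flat-nodes file for a full planet grows to several tens of GB, so it should sit on a fast local disk.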

Other thoughts: if you are in a cloud environment, use a higher-performance machine for loading, such as an AWS r5a.2xlarge, then drop down to something closer to a t3.medium for serving content. In the cloud you can dedicate machines to different purposes, or even use container hosts like Fargate for processing and pay only while the load or updates are running.

My current config on a 64G memory machine:

  osm2pgsql --create --slim --flat-nodes /tmp/osm-cache.bin -C 18000 --database postgres --unlogged -H osmsource.cu932j20s2i5e.us-east-1.rds.amazonaws.com -U gisloader -W /osmdata/north-america-latest.osm.pbf
  • With the default -C (same as -C 800), loading North America:
    • Node(1137243k 1202.2k/s) Way(92064k 70.22k/s) Relation(291320 55.28/s) (stopped)
  • With -C 11000:
    • Node(1137243k 3203.5k/s) Way(92064k 70.28k/s) Relation(47320 80.87/s) (stopped, relations # dropping)
    • memory usage was at 27%
  • With --number-processes 5 -C 9000
    • Processing: Node(1137243k 3203.5k/s) Way(92064k 69.96k/s) Relation(226630 54.97/s) (stopped)
  • With just -C 19000 (monitored the threads; osm2pgsql appears to use n-2 threads by default)
    • Node(1137243k 3176.7k/s); cancelled, as performance was similar
  • Increased to a machine with a 7.2 TB NVMe SSD drive, 12 CPUs, and 96 GB of memory

    • Node(1137243k 6729.3k/s) Way(92064k 90.17k/s) Relation(957280 69.26/s) parse time: 15012s
  • Loading machine: r5a.2xlarge

  • Database: db.r5.large running PostgreSQL 11.5, with everything else at defaults (the RDS instance could be smaller depending on your needs, but the above was balanced: RDS CPU ran at about 90% while the loading server showed reasonably high memory utilization on and off during the load and only about 10% CPU). You'll need to run the commands on the RDS instance to activate the extensions (listed below, or search the AWS docs for them). After loading, you might consider a smaller instance class to control costs, or backup/restore to your own machine, or simply turn the loading machine off, since the database lives on the RDS instance. From what I've read, running locally is faster if you have the memory and fast drives.

Other thoughts: these settings could surely be tuned for better performance, but for a quick setup this performs well and is cheap, as long as you don't leave the machines at these sizes during normal operation after loading. The above architecture only costs about $8 to load North America, so it is hardly worth the labor to optimize much further.
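
If you do want to squeeze more out of the database itself, the import-time PostgreSQL settings from the Geofabrik guide linked in the question are the usual starting point. A minimal sketch for a self-managed PostgreSQL on a dedicated ~20 GB import server (the values are illustrative, and the fsync/full_page_writes/autovacuum changes are only safe while nothing else depends on the database; revert them after the load):

  # Run as a database superuser (e.g. the postgres OS user on a Debian-style install)
  sudo -u postgres psql -c "ALTER SYSTEM SET shared_buffers = '2GB';"         # needs a restart
  sudo -u postgres psql -c "ALTER SYSTEM SET maintenance_work_mem = '4GB';"
  sudo -u postgres psql -c "ALTER SYSTEM SET max_wal_size = '10GB';"
  sudo -u postgres psql -c "ALTER SYSTEM SET checkpoint_timeout = '30min';"
  sudo -u postgres psql -c "ALTER SYSTEM SET fsync = off;"                    # import only: a crash means re-importing
  sudo -u postgres psql -c "ALTER SYSTEM SET full_page_writes = off;"         # import only
  sudo -u postgres psql -c "ALTER SYSTEM SET autovacuum = off;"               # re-enable after the load
  sudo systemctl restart postgresql                                           # pick up all of the above

These are roughly the knobs that guide discusses; the exact numbers depend on how much RAM you can spare next to the osm2pgsql node cache.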

Next step thoughts for increasing speed if running postgres locally:

  • More memory
  • A higher-speed ephemeral drive or RAID0 storage for the flat-nodes location, as the machine above will only push about 190 MB/s. You'll need to choose a different class of machine to get local ephemeral SSD drives, but with something like an m5ad you can bond two local SSDs together into a RAID0 array, limited only by what you want to spend on the server class and number of drives (see the sketch after this list).
  • Faster machines may cost more per hour, but if you balance load speed against cost you may find that total cost has expensive tails at both ends (very small machines that run for a long time, very large machines with high hourly rates) and a long, low-cost middle zone where the total cost stays roughly the same while the load time can still be shortened (cost/hr and speed being roughly balanced by cost optimization). This would be an interesting study if time permitted.
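
A minimal sketch of the RAID0 idea from the list above, assuming an instance with two local NVMe instance-store drives (the device names, mount point, and file name here are hypothetical and will differ on your machine):

  # Stripe two local ephemeral SSDs into a single RAID0 array (hypothetical device names)
  sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme1n1 /dev/nvme2n1
  sudo mkfs.ext4 /dev/md0
  sudo mkdir -p /osmcache
  sudo mount /dev/md0 /osmcache
  # Then point osm2pgsql at the striped volume, e.g. --flat-nodes /osmcache/osm-cache.bin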

RDS Commands for anyone interested from https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.PostgreSQL.CommonDBATasks.html#Appendix.PostgreSQL.CommonDBATasks.PostGIS

create extension postgis;
create extension fuzzystrmatch;
create extension postgis_tiger_geocoder;
create extension postgis_topology;
create extension hstore;

alter schema tiger owner to rds_superuser;
alter schema tiger_data owner to rds_superuser;
alter schema topology owner to rds_superuser;
\dn
CREATE FUNCTION exec(text) returns text language plpgsql volatile AS $f$ BEGIN EXECUTE $1; RETURN $1; END; $f$;
SELECT exec('ALTER TABLE ' || quote_ident(s.nspname) || '.' || quote_ident(s.relname) || ' OWNER TO rds_superuser;')
  FROM (
    SELECT nspname, relname
    FROM pg_class c JOIN pg_namespace n ON (c.relnamespace = n.oid)
    WHERE nspname IN ('tiger', 'topology')
      AND relkind IN ('r', 'S', 'v')
    ORDER BY relkind = 'S'
  ) s;

--TEST--
SET search_path=public,tiger;   

select na.address, na.streetname, na.streettypeabbrev, na.zip
from normalize_address('1 Devonshire Place, Boston, MA 02109') as na;
select topology.createtopology('my_new_topo',26986,0.5);

--setup for loading
create user gisloader with encrypted password 'somepassword';
create schema gis;
grant all privileges on schema gis to gisloader;