How to load Yahoo! GeoPlanet data into a PostgreSQL database

Yahoo! GeoPlanet is a fun API to play around with, especially because the WOEID (“Where On Earth”) place identifiers are used in services like Flickr and Twitter. But Yahoo! also makes the GeoPlanet data set available under a Creative Commons license for you to download and use yourself. If you’re just using the public API, the number of queries you are allowed to make per day is limited, so having the raw data available is great when you want to do some heavy lifting.

Here’s a quick overview of how I loaded the data set into PostgreSQL on my Mac:

1. Download the data

At the time of writing (22 May 2011), the latest version of the data set is 7.6.0 from 18 October 2010. It’s a 120MB zip file, available on the GeoPlanet data page. Unzip the file, and make sure that the Postgres user will be able to read all the files in the unzipped directory (chmod 755 geoplanet_data_7.6.0/).

Inside the archive are four TSV (tab-separated value) files. (“x” represents the version number.)

  • geoplanet_places_x.tsv (240MB, 5.6M rows) holds the core list of places
  • geoplanet_aliases_x.tsv (80MB, 2.2M rows) holds aliases (synonyms or colloquial names) for the places. This allows you to determine, for example, that “Big Apple” refers to New York City.
  • geoplanet_adjacencies_x.tsv (260MB, 9.6M rows) tells you how places relate to each other geographically, so you can find neighbouring places.
  • geoplanet_changes_x.tsv (2KB, 100 rows) holds changes for this version. Geography is political. Places are not static over time.

2. Create tables

Here’s the DDL I used to create tables for each of these four files:

DROP TABLE IF EXISTS places;
CREATE TABLE places
(
  woe_id integer NOT NULL,
  iso character varying(2) NOT NULL,
  "name" character varying(150) NOT NULL,
  "language" character varying(3) NOT NULL,
  placetype character varying(20) NOT NULL,
  parent_id integer NOT NULL,
  CONSTRAINT places_pkey PRIMARY KEY (woe_id)
);

DROP TABLE IF EXISTS aliases;
CREATE TABLE aliases
(
  woe_id integer NOT NULL,
  "name" character varying(150) NOT NULL,
  name_type character(1) NOT NULL,
  "language" character varying(3) NOT NULL
);

DROP TABLE IF EXISTS adjacencies;
CREATE TABLE adjacencies
(
  place_woe_id integer NOT NULL,
  place_iso character varying(2) NOT NULL,
  neighbour_woe_id integer NOT NULL,
  neighbour_iso character varying(2) NOT NULL,
  CONSTRAINT adjacencies_pkey PRIMARY KEY (place_woe_id, neighbour_woe_id)
);

DROP TABLE IF EXISTS changes;
CREATE TABLE changes
(
  woe_id integer NOT NULL,
  rep_id integer NOT NULL,
  data_version character varying(10) NOT NULL
);

I will leave creation of indexes up to you – your use of the data will likely be different from mine. But the structures are simple and obvious to work with.
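That said, if you plan to look places up by name and join through the aliases and hierarchy, something like the following might be a reasonable starting set (the index names here are just my own suggestions):

```sql
-- Hypothetical starting indexes; create only the ones your queries need.
CREATE INDEX places_name_idx ON places (name);
CREATE INDEX places_parent_id_idx ON places (parent_id);
CREATE INDEX aliases_woe_id_idx ON aliases (woe_id);
CREATE INDEX aliases_name_idx ON aliases (name);
```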

3. Import the data

The COPY command makes importing data a breeze in PostgreSQL. The only minor wrinkle with these data files is that despite being tab-separated, you can’t use the default “text” format for importing, because the files have header rows. You could just chop the header rows yourself, but alternatively you can treat them as CSV (comma-separated) files, and override the default delimiter to be a tab instead of a comma:

COPY 
    <table name> 
FROM 
    <file path>
WITH 
    csv header delimiter '	';

Pay particular attention to the fact that the character between the apostrophes is a literal TAB character. Using ‘\t’ doesn’t work, although the escaped string constant E'\t' does.
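One more wrinkle worth knowing about: the server-side COPY ... FROM 'file' form requires superuser privileges, because the PostgreSQL server process reads the file itself. If you’re not connecting as a superuser, psql’s client-side \copy meta-command does the same job, reading the file from the client machine instead (a sketch, with a placeholder path):

```sql
-- psql meta-command: runs client-side on a single line, no superuser needed
\copy places from '/path/to/geoplanet_places_7.6.0.tsv' with csv header delimiter E'\t'
```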

For these four GeoPlanet tables, the exact commands I used were:

COPY 
    places
FROM 
    '/Users/martin/Public/geoplanet_data_7.6.0/geoplanet_places_7.6.0.tsv'
WITH 
    csv header delimiter '	';

COPY 
    aliases 
FROM 
    '/Users/martin/Public/geoplanet_data_7.6.0/geoplanet_aliases_7.6.0.tsv' 
WITH 
    csv header delimiter '	';

COPY 
    adjacencies 
FROM
    '/Users/martin/Public/geoplanet_data_7.6.0/geoplanet_adjacencies_7.6.0.tsv' 
WITH 
    csv header delimiter '	';

COPY 
    changes
FROM 
    '/Users/martin/Public/geoplanet_data_7.6.0/geoplanet_changes_7.6.0.tsv'
WITH 
    csv header delimiter '	';

4. Query away!

Now that I have the data loaded up, I can run neat SQL queries to do things like find out what places are nearby where I live:

SELECT 
	neighbours.*
FROM
	places
	INNER JOIN adjacencies 
	    ON places.woe_id = adjacencies.place_woe_id
	INNER JOIN places neighbours 
	    ON adjacencies.neighbour_woe_id = neighbours.woe_id
WHERE 
	places.name = 'Oostzaan';
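The parent_id column also makes it easy to walk the place hierarchy. Since PostgreSQL 8.4 you can use a recursive CTE for this; here’s a sketch that lists a place’s ancestors, using the same example place name as above:

```sql
-- Walk up the hierarchy from a place towards the root (WOEID 1, "Earth")
WITH RECURSIVE ancestors AS (
    SELECT woe_id, name, placetype, parent_id
    FROM places
    WHERE name = 'Oostzaan'
  UNION ALL
    SELECT p.woe_id, p.name, p.placetype, p.parent_id
    FROM places p
    INNER JOIN ancestors a ON p.woe_id = a.parent_id
)
SELECT * FROM ancestors;
```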

The rest is up to you. Happy querying!

Further reading