{"id":2273,"date":"2011-05-22T15:32:08","date_gmt":"2011-05-22T15:32:08","guid":{"rendered":"http:\/\/sunpig.com\/mt-entry-2273.html"},"modified":"2012-03-27T14:12:38","modified_gmt":"2012-03-27T14:12:38","slug":"how-to-load-yahoo-geoplanet-data-into-a-postgresql-database","status":"publish","type":"post","link":"https:\/\/sunpig.com\/martin\/2011\/05\/22\/how-to-load-yahoo-geoplanet-data-into-a-postgresql-database\/","title":{"rendered":"How to load Yahoo! GeoPlanet data into a PostgreSQL database"},"content":{"rendered":"<p><a href=\"http:\/\/developer.yahoo.com\/geo\/geoplanet\/\">Yahoo! GeoPlanet<\/a> is a fun API to play around with, especially because the WOEID (&#8220;Where On Earth&#8221;) place identifiers are used in services like <a href=\"http:\/\/code.flickr.com\/blog\/2008\/09\/04\/whos-on-first\/\">Flickr<\/a> and <a href=\"http:\/\/engineering.twitter.com\/2010\/02\/woeids-in-twitters-trends.html\">Twitter<\/a>. But Yahoo! also makes the GeoPlanet data set available under a <a href=\"http:\/\/creativecommons.org\/licenses\/by\/3.0\/us\/\">Creative Commons license<\/a> for you to download and use yourself. If you&#8217;re just using the public API, the number of queries you are allowed to make per day is limited, so having the raw data available is great when you want to do some heavy lifting.<\/p>\n<p>Here&#8217;s a quick overview of how I loaded the data set into <a href=\"http:\/\/www.postgresql.org\/\">PostgreSQL<\/a> on my Mac:<\/p>\n<h2>1. Download the data<\/h2>\n<p>At the time of writing (22 May 2011), the latest version of the data set is 7.6.0 from 18 October 2010. It&#8217;s a 120MB zip file, available on the <a href=\"http:\/\/developer.yahoo.com\/geo\/geoplanet\/data\/\">GeoPlanet data page<\/a>. Unzip the file, make sure that the Postgres user will be able to access all the files in the unzipped directory (<code>chmod 755 geoplanet_data_7.6.0\/<\/code>).<\/p>\n<p>Inside the file are four TSV (tab separated value) files. 
(&#8220;x&#8221; represents the version number.)<\/p>\n<ul>\n<li><b>geoplanet_places_x.tsv<\/b> (240MB, 5.6M rows) holds the core list of places.<\/li>\n<li><b>geoplanet_aliases_x.tsv<\/b> (80MB, 2.2M rows) holds aliases &#8212; synonyms or colloquial names &#8212; for the places. This allows you to determine, for example, that &#8220;Big Apple&#8221; refers to New York City.<\/li>\n<li><b>geoplanet_adjacencies_x.tsv<\/b> (260MB, 9.6M rows) tells you how places relate to each other geographically, so you can find neighbouring places.<\/li>\n<li><b>geoplanet_changes_x.tsv<\/b> (2KB, 100 rows) holds changes for this version. Geography is political. Places are not static over time.<\/li>\n<\/ul>\n<h2>2. Create tables<\/h2>\n<p>Here&#8217;s the DDL I used to create tables for each of these four files:<\/p>\n<pre class=\"code\"><code class=\"sql\" data-language=\"sql\">DROP TABLE IF EXISTS places;\nCREATE TABLE places\n(\n  woe_id integer NOT NULL,\n  iso character varying(2) NOT NULL,\n  \"name\" character varying(150) NOT NULL,\n  \"language\" character varying(3) NOT NULL,\n  placetype character varying(20) NOT NULL,\n  parent_id integer NOT NULL,\n  CONSTRAINT places_pkey PRIMARY KEY (woe_id)\n);\n\nDROP TABLE IF EXISTS aliases;\nCREATE TABLE aliases\n(\n  woe_id integer NOT NULL,\n  \"name\" character varying(150) NOT NULL,\n  name_type character (1) NOT NULL,\n  \"language\" character varying(3) NOT NULL\n);\n\nDROP TABLE IF EXISTS adjacencies;\nCREATE TABLE adjacencies\n(\n  place_woe_id integer NOT NULL,\n  place_iso character varying(2) NOT NULL,\n  neighbour_woe_id integer NOT NULL,\n  neighbour_iso character varying(2) NOT NULL,\n  CONSTRAINT adjacencies_pkey PRIMARY KEY (place_woe_id, neighbour_woe_id)\n);\n\nDROP TABLE IF EXISTS changes;\nCREATE TABLE changes\n(\n  woe_id integer NOT NULL,\n  rep_id integer NOT NULL,\n  data_version character varying(10) NOT NULL\n);<\/code><\/pre>\n<p>I will leave the creation of indexes up to you &#8211; your use of the data will be different from mine. 
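<\/p>\n<p>As one illustrative starting point (the index names here are my own invention, not part of the GeoPlanet distribution), if you expect to look places up by name and to walk the parent hierarchy, indexes along these lines might serve you well:<\/p>\n<pre class=\"code\"><code class=\"sql\" data-language=\"sql\">-- Speed up lookups of places by name and by parent\nCREATE INDEX places_name_idx ON places (\"name\");\nCREATE INDEX places_parent_id_idx ON places (parent_id);\n\n-- Speed up alias lookups in both directions\nCREATE INDEX aliases_woe_id_idx ON aliases (woe_id);\nCREATE INDEX aliases_name_idx ON aliases (\"name\");<\/code><\/pre>\n<p>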
But the structures are simple and obvious to work with.<\/p>\n<h2>3. Import the data<\/h2>\n<p>The <a href=\"http:\/\/www.postgresql.org\/docs\/9.0\/static\/sql-copy.html\">COPY<\/a> command makes importing data a breeze in PostgreSQL. The only minor wrinkle with these data files is that despite being tab-separated, you can&#8217;t use the default &#8220;text&#8221; format for importing, because the files have header rows. You could just chop off the header rows yourself, but alternatively you can treat them as CSV (comma-separated) files, and override the default delimiter to be a tab instead of a comma:<\/p>\n<pre class=\"code\"><code class=\"sql\" data-language=\"sql\">COPY \n    &lt;table name&gt; \nFROM \n    &lt;file path&gt;\nWITH \n    csv header delimiter '\t';<\/code><\/pre>\n<p>Pay particular attention to the fact that the character between the apostrophes is a literal TAB character. Using the two-character sequence &#8220;\\t&#8221; doesn&#8217;t work, although the escape-string form <code>E'\\t'<\/code> does.<\/p>\n<p>For these four GeoPlanet tables, the exact commands I used were:<\/p>\n<pre class=\"code\"><code class=\"sql\" data-language=\"sql\">COPY \n    places\nFROM \n    '\/Users\/martin\/Public\/geoplanet_data_7.6.0\/geoplanet_places_7.6.0.tsv'\nWITH \n    csv header delimiter '\t';\n\nCOPY \n    aliases \nFROM \n    '\/Users\/martin\/Public\/geoplanet_data_7.6.0\/geoplanet_aliases_7.6.0.tsv' \nWITH \n    csv header delimiter '\t';\n\nCOPY \n    adjacencies \nFROM\n    '\/Users\/martin\/Public\/geoplanet_data_7.6.0\/geoplanet_adjacencies_7.6.0.tsv' \nWITH \n    csv header delimiter '\t';\n\nCOPY \n    changes\nFROM \n    '\/Users\/martin\/Public\/geoplanet_data_7.6.0\/geoplanet_changes_7.6.0.tsv'\nWITH \n    csv header delimiter '\t';<\/code><\/pre>\n<h2>4. 
Query away!<\/h2>\n<p>Now that I have the data loaded up, I can run neat SQL queries to do things like find out which places are adjacent to where I live:<\/p>\n<pre class=\"code\"><code class=\"sql\" data-language=\"sql\">SELECT \n\tneighbours.*\nFROM\n\tplaces\n\tINNER JOIN adjacencies \n\t    ON places.woe_id = adjacencies.place_woe_id\n\tINNER JOIN places neighbours \n\t    ON adjacencies.neighbour_woe_id = neighbours.woe_id\nWHERE \n\tplaces.name = 'Oostzaan';<\/code><\/pre>\n<p>The rest is up to you. Happy querying!<\/p>\n<h2>Further reading<\/h2>\n<ul>\n<li>Christian Heilmann: <a href=\"http:\/\/thinkvitamin.com\/code\/getting-started-with-yahoo-geoplanet-explorer\/\">Getting Started with Yahoo! GeoPlanet Explorer<\/a> (Think Vitamin)<\/li>\n<li>Christian Heilmann&#8217;s <a href=\"http:\/\/isithackday.com\/geoplanet-explorer\/\">GeoPlanet Explorer<\/a> (<a href=\"https:\/\/github.com\/codepo8\/geoplanet-explorer\">source on Github<\/a>)<\/li>\n<li><a href=\"http:\/\/developer.yahoo.com\/geo\/geoplanet\/guide\/\">Yahoo! GeoPlanet developer docs<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Yahoo! GeoPlanet is a fun API to play around with, especially because the WOEID (&#8220;Where On Earth&#8221;) place identifiers are used in services like Flickr and Twitter. But Yahoo! also makes the GeoPlanet data set available under a Creative Commons license for you to download and use yourself. If you&#8217;re just using the public API, &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/sunpig.com\/martin\/2011\/05\/22\/how-to-load-yahoo-geoplanet-data-into-a-postgresql-database\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;How to load Yahoo! 
GeoPlanet data into a PostgreSQL database&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[],"class_list":["post-2273","post","type-post","status-publish","format-standard","hentry"],"_links":{"self":[{"href":"https:\/\/sunpig.com\/martin\/wp-json\/wp\/v2\/posts\/2273","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sunpig.com\/martin\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sunpig.com\/martin\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sunpig.com\/martin\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sunpig.com\/martin\/wp-json\/wp\/v2\/comments?post=2273"}],"version-history":[{"count":0,"href":"https:\/\/sunpig.com\/martin\/wp-json\/wp\/v2\/posts\/2273\/revisions"}],"wp:attachment":[{"href":"https:\/\/sunpig.com\/martin\/wp-json\/wp\/v2\/media?parent=2273"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sunpig.com\/martin\/wp-json\/wp\/v2\/categories?post=2273"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sunpig.com\/martin\/wp-json\/wp\/v2\/tags?post=2273"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}