You carry *how many* phones?

A few years ago, during one of the most entertaining job interviews I’ve ever had, one of the questions was, “How many computers do you have at home?”

Even at the time (circa 2005), this was a bit of a head-scratcher for me, and I had to ask what they considered a “computer,” whether it mattered if the computer was in active use or stored in the attic “just in case”, and if games consoles counted. The main purpose of the question, though, was to act as a nerd detector. If you even have to think about the answer, you’ve passed some kind of barrier.

The same is happening in mobile development now. If you are looking to hire a mobile (web) developer, ask them how many “mobile devices” they own. If their answer starts with “Wellll…”, then there’s a good chance you’ve got a live one on your hands.

(Warning: do not use this piece of advice after July 2012. I’m guessing that by that point, it’ll be much more common for techies to have multiple mobiles on hand. Right now, though, we’re at a point where it’s still mostly the hard-core who wander around with more than one.)

How to load Yahoo! GeoPlanet data into a PostgreSQL database

Yahoo! GeoPlanet is a fun API to play around with, especially because the WOEID (“Where On Earth”) place identifiers are used in services like Flickr and Twitter. But Yahoo! also makes the GeoPlanet data set available under a Creative Commons license for you to download and use yourself. If you’re just using the public API, the number of queries you are allowed to make per day is limited, so having the raw data available is great when you want to do some heavy lifting.

Here’s a quick overview of how I loaded the data set into PostgreSQL on my Mac:

1. Download the data

At the time of writing (22 May 2011), the latest version of the data set is 7.6.0 from 18 October 2010. It’s a 120MB zip file, available on the GeoPlanet data page. Unzip the file and make sure that the Postgres user will be able to access all the files in the unzipped directory (chmod 755 geoplanet_data_7.6.0/).

Inside the file are four TSV (tab separated value) files. (“x” represents the version number.)

  • geoplanet_places_x.tsv (240MB, 5.6M rows) holds the core list of places
  • geoplanet_aliases_x.tsv (80MB, 2.2M rows) holds aliases — synonyms or colloquial names — for the places. This allows you to determine, for example, that “Big Apple” refers to New York City
  • geoplanet_adjacencies_x.tsv (260MB, 9.6M rows) tells you how places relate to each other geographically, so you can find neighbouring places.
  • geoplanet_changes_x.tsv (2KB, 100 rows) holds changes for this version. Geography is political. Places are not static over time.

2. Create tables

Here’s the DDL I used to create tables for each of these four files:

DROP TABLE IF EXISTS places;
CREATE TABLE places
(
  woe_id integer NOT NULL,
  iso character varying(2) NOT NULL,
  "name" character varying(150) NOT NULL,
  "language" character varying(3) NOT NULL,
  placetype character varying(20) NOT NULL,
  parent_id integer NOT NULL,
  CONSTRAINT places_pkey PRIMARY KEY (woe_id)
);

DROP TABLE IF EXISTS aliases;
CREATE TABLE aliases
(
  woe_id integer NOT NULL,
  "name" character varying(150) NOT NULL,
  name_type character (1) NOT NULL,
  "language" character varying(3) NOT NULL
);

DROP TABLE IF EXISTS adjacencies;
CREATE TABLE adjacencies
(
  place_woe_id integer NOT NULL,
  place_iso character varying(2) NOT NULL,
  neighbour_woe_id integer NOT NULL,
  neighbour_iso character varying(2) NOT NULL,
  CONSTRAINT adjacencies_pkey PRIMARY KEY (place_woe_id, neighbour_woe_id)
);

DROP TABLE IF EXISTS changes;
CREATE TABLE changes
(
  woe_id integer NOT NULL,
  rep_id integer NOT NULL,
  data_version character varying(10) NOT NULL
);

I will largely leave the creation of indexes up to you – your use of the data will be different from mine. But the structures are really simple and obvious to work with.
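
As a starting point, though, if you expect to look places up by name (as in the query in step 4) or to walk the parent hierarchy, indexes along these lines are obvious first candidates (the names are just suggestions):

-- Speeds up lookups by place name (as in the query in step 4)
CREATE INDEX places_name_idx ON places (name);
-- Speeds up walking up and down the place hierarchy
CREATE INDEX places_parent_id_idx ON places (parent_id);
-- Speeds up joining aliases to places
CREATE INDEX aliases_woe_id_idx ON aliases (woe_id);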

3. Import the data

The COPY command makes importing data a breeze in PostgreSQL. The only minor wrinkle with these data files is that despite being tab-separated, you can’t use the default “text” format for importing, because the files have header rows. You could just chop the header rows yourself, but alternatively you can treat them as CSV (comma-separated) files, and override the default delimiter to be a tab instead of a comma:

COPY 
    <table name> 
FROM 
    <file path>
WITH 
    csv header delimiter '	';

Pay particular attention to the fact that the character between the single quotes is a literal TAB character. Using “\t” in a plain string doesn’t work (although the escape-string form E'\t' does).

For these four GeoPlanet tables, the exact commands I used were:

COPY 
    places
FROM 
    '/Users/martin/Public/geoplanet_data_7.6.0/geoplanet_places_7.6.0.tsv'
WITH 
    csv header delimiter '	';

COPY 
    aliases 
FROM 
    '/Users/martin/Public/geoplanet_data_7.6.0/geoplanet_aliases_7.6.0.tsv' 
WITH 
    csv header delimiter '	';

COPY 
    adjacencies 
FROM
    '/Users/martin/Public/geoplanet_data_7.6.0/geoplanet_adjacencies_7.6.0.tsv' 
WITH 
    csv header delimiter '	';

COPY 
    changes
FROM 
    '/Users/martin/Public/geoplanet_data_7.6.0/geoplanet_changes_7.6.0.tsv'
WITH 
    csv header delimiter '	';

4. Query away!

Now that I have the data loaded up, I can run neat SQL queries to do things like find out which places are near where I live:

SELECT 
	neighbours.*
FROM
	places
	INNER JOIN adjacencies 
	    ON places.woe_id = adjacencies.place_woe_id
	INNER JOIN places neighbours 
	    ON adjacencies.neighbour_woe_id = neighbours.woe_id
WHERE 
	places.name = 'Oostzaan';

The rest is up to you. Happy querying!


Considerations for caching resources in localStorage

Many years ago I was bitten by a radioactive tester, and since then I have been cursed with the inability to ignore edge cases. As superpowers go, it’s a bit rubbish.

Steve Souders recently wrote about a technique he had observed on Google and Bing’s mobile search sites. It involves using the HTML5 localStorage API to cache JavaScript and CSS resources (and also images, as data-uris) instead of the standard browser cache or offline storage API (AppCache). Nicholas Zakas covered it in his presentation “Mobile Web Speed Bumps” at Web Directions Unplugged, and there was an excited buzz when Steve discussed it again in his presentation at Mobilism last week.

The technique

When a new visitor comes to your web site (no cookies), your server sends them the resources inline with the HTML:

<script id="js.resource1">
var tester = {'type':'Radioactive'};
...
</script>

<style id="css.resource2">
.tester {color:#f00;}
...
</style>

This can give you a pretty large file, but that’s okay, because this only happens on the first visit. (Well…maybe.) You also need to transmit some kind of storage/recovery (dehydration/rehydration) library; I think this probably needs to be sent as inline script with every visit, so that it is guaranteed to always be available immediately.

When the client receives the HTML, the storage code examines candidate inline resources (perhaps flagged with a data-* attribute, e.g. <script data-cacheable="1">), and inserts them into localStorage, using keys derived from the resources themselves, such as their IDs. The code sets a cookie to match the resources that have been cached. Roughly, in JavaScript:

var resources = document.querySelectorAll('[data-cacheable]');
var cachedResources = [];
for (var i = 0; i < resources.length; i++) {
    localStorage.setItem(resources[i].id, resources[i].textContent);
    cachedResources.push(resources[i].id);
}
document.cookie = 'cachedResources=' + cachedResources.join('|') + '; path=/';

On subsequent visits to the server, the client passes it the cachedResources cookie, which contains a list of resources that are already available locally. When the server interprets the cookie, it can send down a completely different inline script block:

<script>
rehydrate(['js.resource1', 'css.resource2']);
</script>

The rehydration script then dives into the localStorage, retrieves the resources by key, and creates new <script> and <style> blocks to inject the resources into the page (with the necessary attention to script execution order). Job done; no external HTTP calls needed for JS and CSS.
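
As a rough sketch (the helper name and the idea of deriving the element type from the key prefix are my own illustration, not something taken from Bing’s or Google’s code), injecting a single cached resource could look like this:

// Hypothetical helper: turn one cached resource back into a live <script> or <style> block.
function injectFromCache(id, contents) {
    // The key prefix ('js.' or 'css.') tells us which kind of element to create.
    var el = document.createElement(id.indexOf('js.') === 0 ? 'script' : 'style');
    el.id = id;
    el.textContent = contents;
    // Appending a populated <script> element executes it immediately.
    document.getElementsByTagName('head')[0].appendChild(el);
}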

Except…for the edge cases.

Edge cases, part 1

The biggest ones to consider are:

  • What if the browser doesn’t support the localStorage API?
  • What if the browser’s cookies and localStorage have got out of sync?

No localStorage

If the browser doesn’t support the localStorage API, the first request from the visitor will be exactly the same: no cookies, so the server transmits the resources inline. The storage code detects that localStorage is not available, skips the caching steps, and records the browser’s absence of capabilities in a cookie, e.g. cookie['localStorageAvailable'] = false.
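
The detection itself only takes a few lines. A minimal sketch (the try/catch matters, because some browsers throw as soon as you touch localStorage, for example in private browsing mode):

var hasLocalStorage;
try {
    // Write-and-remove round trip: merely checking window.localStorage isn't
    // enough, since some browsers expose the object but refuse writes.
    window.localStorage.setItem('__ls_test__', '1');
    window.localStorage.removeItem('__ls_test__');
    hasLocalStorage = true;
} catch (e) {
    hasLocalStorage = false;
}
if (!hasLocalStorage) {
    // Tell the server not to bother with inline resources next time.
    document.cookie = 'localStorageAvailable=false; path=/';
}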

In the second visit, the server gets the cookie, and now knows that the browser can’t handle localStorage. Sending the resources inline regardless would be a waste, because the browser could probably cache them if they were referenced as external resources. So this is exactly what the server can do: head down a third code path, where it renders standard script or link tags for the resources:

<script id="js.resource1" 
    src="/js/resource1.js" ></script>
<link rel="stylesheet" id="css.resource2" 
    href="/css/resource2.css" />

When the browser receives these script tags in its second visit, it will have to make follow-up HTTP requests to retrieve the external resources. But on every subsequent visit, the browser should be able to pull them from its standard file cache.

This is somewhat wasteful, because the data is transmitted twice. That makes this an excellent use case for server-side browser sniffing to identify known old browsers that are incapable of using localStorage, and give them the external resource tags right from the very start.

Cookies and localStorage are out of sync

Browsers have no native mechanism for keeping cookies and localStorage in sync, and no obligation to do so. Therefore, the two will be out of sync on a regular basis.

If the user deletes their cookies, but leaves their localStorage intact, the server will think that the browser needs to receive the resources inline again. The storage code will stash the inline resources in localStorage, overwriting the old values. This uses more bandwidth than strictly necessary, but otherwise there is no harm in it.

If the user deletes their localStorage, but leaves their cookies intact, that’s a bit more problematic. The server won’t transmit the resource inline, and the rehydration script will try to pull it out of localStorage, and find that the cupboard is bare. Oops.

So the rehydration script needs a secondary code path to request a missing resource from the server as an external resource. Then, once the external resource has loaded, it can stash it in localStorage and update the cookie, ready for the next round.

function rehydrate(id) {
    var serializedResource = localStorage.getItem(id);
    if (!serializedResource) {
        // Cupboard is bare: fall back to fetching the resource as an
        // external file, then stash it and update the cookie.
        loadAndStashRemoteScript(getPathFromId(id));
    }
}

Edge cases, part 2

Inserting stuff into a cache: easy. Fun, too. Figuring out when to remove items from a cache: hard.

I am sure that Bing and Google have put a lot of thought into this already, but I haven’t had time to dissect their code yet. Here are some of the things that just spring immediately to mind for me, though. (Edge case superpowers, remember?) What happens when…

  • the localStorage cache is full?
  • you have updated a resource server-side, and want to cache this new revision?
  • you stop using a particular server-side resource altogether?
  • you’re running multiple applications on the same domain, all of which use localStorage?

If localStorage is full, you get a QUOTA_EXCEEDED_ERR when you try to insert a value. The localStorage API has no provision for increasing storage, either automatically, or with a user’s authorization. (You could tell them that their cache is full, so they can empty it themselves, but nobody will have a clue what you’re talking about.)

So what to do? You could call localStorage.clear(), which will empty the cache (for that domain), but that’s the nuclear option. You may be deleting items that you yourself inserted just a moment ago.

You could iterate over the key/value pairs in the cache, and evaluate each one as a candidate for deletion. But how do you decide which items stay, and which ones go?

One possibility would be to make a timestamp part of the key whenever you insert an item, e.g. ‘js.resource1.2011-05-20T23:43:27+01:00’. Then you can sort all the keys by time, eject the oldest, and try inserting your original value into the cache again to see if it fits. If it doesn’t, keep ejecting keys until it does.
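
A sketch of that insert-and-eject loop, assuming the timestamped key format above (ISO 8601 timestamps have the handy property that they sort correctly as plain strings, at least within a single timezone):

function setWithEviction(key, value) {
    for (;;) {
        try {
            localStorage.setItem(key, value);
            return true;
        } catch (e) {
            // Most likely QUOTA_EXCEEDED_ERR: collect all keys, eject the
            // oldest one, and try again.
            var keys = [];
            for (var i = 0; i < localStorage.length; i++) {
                keys.push(localStorage.key(i));
            }
            if (keys.length === 0) {
                return false; // Nothing left to eject; the value simply doesn't fit.
            }
            keys.sort(function (a, b) {
                // Compare the timestamp portion after the last dot in each key.
                var ta = a.slice(a.lastIndexOf('.') + 1);
                var tb = b.slice(b.lastIndexOf('.') + 1);
                return ta < tb ? -1 : 1;
            });
            localStorage.removeItem(keys[0]);
            // Remember to drop the ejected key from the cachedResources cookie too.
        }
    }
}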

Of course, you should remember to update the matching cookie, so that the server knows that any ejected resources are no longer available locally. And you still run the risk of removing values from the cache that you just inserted as part of the current server round-trip. Given that most browsers give you 2–5MB of localStorage to play with, you’d have to be inserting quite a lot of data for that to happen, but don’t rule it out. Unless you implement your own compression algorithm, the data you insert into the localStorage cache will not be gzipped like it is on the wire.

To ensure that you don’t delete items that were cached in localStorage by other applications running on the same domain, you might also want to consider namespacing your keys (e.g. ‘monkeyfez.js.resource1.2011-05-20T23:43:27+01:00’), so you don’t tamper with values that aren’t yours. On the other hand, what if other applications have used up all of the available localStorage, leaving nothing for you to play with? Coordinate with your colleagues to ensure that you use a common algorithm for de-caching, and that you come to some agreement about how much cache each application should be allowed to use.

Timestamping the key allows you to track when the value was inserted into the cache, but it doesn’t say anything about the version of the resource on the server. Revving your resource file names is a good idea for busting caches, and it would probably help here, too.

Say that your code is in a file called “main.js”. During your build process, you append a revision ID to it, so it becomes main.4e56afa.js. When you serve this to the client, generate the cache id based on the application name, resource type, file name and the revision number:

<script id="monkeyfez.js.main.4e56afa">
var tester = {'type':'Radioactive'};
...
</script>

When you come to insert the file into localStorage, you add the timestamp to the key, so it becomes monkeyfez.js.main.4e56afa.2011-05-20T23:43:27+01:00.

Now, when you come to retrieve the resource, you won’t know in advance when it was inserted into the cache, so you’ll have to iterate over all keys to find the ones that match the pattern monkeyfez.js.main.4e56afa.*. I wouldn’t rule out the possibility that you might have multiple copies of the same file (even with the same revision number) in the cache under different timestamps, so sort the keys based on their dates, and take the most recent one.
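
A lookup along those lines might be sketched like this (again, the key scheme is just the one described above):

// Find the most recently cached copy of a resource, given the key prefix
// up to and including the revision, e.g. 'monkeyfez.js.main.4e56afa.'.
function getFreshestFromCache(prefix) {
    var bestKey = null;
    for (var i = 0; i < localStorage.length; i++) {
        var key = localStorage.key(i);
        // ISO 8601 timestamps sort correctly as strings, so the "largest"
        // matching key is the most recent copy.
        if (key.indexOf(prefix) === 0 && (bestKey === null || key > bestKey)) {
            bestKey = key;
        }
    }
    return bestKey === null ? null : localStorage.getItem(bestKey);
}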

If you update the resource on the server, the file will get a different revision number, e.g. main.b99a341.js. The first time that a visitor comes back after this change, they will be unaware that this new version exists. According to their cookies, they have monkeyfez.js.main.4e56afa. The server recognizes that this is useless, so it sends the new version as an inline code block. When the new version hits the client, the storage code inserts the new version into the cache.

But the old version (monkeyfez.js.main.4e56afa) will still be there in localStorage, and a reference to that revision will still be there in the user’s cookie. Is it safe to delete all previous versions of monkeyfez.js.main when you receive a differently revved version of this resource? Probably. In the odd case where you have multiple branches of your application living on the same domain, the worst that will happen is that when you use the other branches, the application will transmit a different revision, which will then cause your revision to be ejected from the cache again. Annoying, but not catastrophic.

The same applies if you stop using the file main.js altogether, and decide to move to core.js instead. Assuming that you implement this kind of FIFO strategy, and synchronize the cookie whenever you update localStorage, then eventually the unused monkeyfez.js.main.4e56afa.2011-05-20T23:43:27+01:00 cache item will be kicked out of the cache when it is the oldest item available.

Summary

There’s a lot more to this technique than first meets the eye. Implementing it involves changes on both the client and the server, and getting all the caching/cookie synchronization right is a non-trivial task. But is it worth it?

Reducing HTTP requests is always a good thing, especially on mobile, where the latency really kills you. But read the comments on Steve Souders’s article to see some discussion about slow I/O for localStorage. If you have to run through a manual try/fail cycle of cache expiry as soon as it fills up, I imagine that could slow things down considerably.

Is it faster than using the browser’s normal cache? I don’t know; I haven’t got a working implementation with which to test it yet. And even though you’re storing the resources locally, your browser still has to parse them, which will take just as long as it ever did. But I find the fact that Google and Bing are both using it for their mobile sites highly suggestive, and it’s certainly something I want to investigate more.


Big Bag o’ Mobile (aka White Van Mobile Phone Man)

Mobilism business ideas #1:

Suppose I were to offer a service where you could book me to show up at your office with a suitcase full of mobile devices. A large spectrum of phones (& tablets), ranging from old to new, smartphones to feature phones, different software levels. iPhones, Androids, Nokias, Blackberries, etc. Let’s say about 30 different models, just to give it a number. They’d all be fully charged (spare chargers in my other suitcase), and loaded with SIM cards with data plans. You could then proceed to spend a few hours or a day loading up your app or web site on the ones you are interested in, and running through test scripts.

The benefit of this approach over remote testing, or testing with emulators, or testing with devices connected to local networks, is that you would get to feel the devices in your hand, and experience actual network speeds and latencies. (One of the biggest take-away points from the Mobilism conference was that you really have to use actual devices to understand how your web site will behave in real life.)

Another benefit is that you wouldn’t have to buy a suitcase load of phones yourself, and tie yourself to the subscriptions that go with them. You could also use the time to invite along subjects so you could run usability tests. Lots of possibilities!

Questions, then:

  • Would you (or your company, or your clients) use such a service?
  • How often do you think you would want to use the service?
  • How much would you (or your company, or your clients) be willing to pay for this? (Per hour, or per day.)

The spectrum of commonly disabled browser features

For each combination of images, CSS, and JavaScript being enabled or disabled, here’s the use case it represents.

Images: yes, CSS: yes, JS: yes

The full web. The happy path. The zen garden. 95% of people will see some form of this. However, progressive enhancement, natural browser variations, and adaptive/responsive design mean that you can’t (and shouldn’t) treat this 95% as a monolithic block of pixel-perfection. It’s a sub-spectrum of features.

Images: yes, CSS: yes, JS: no

JavaScript is disabled. Common reasons for this:

  • You are in a corporate environment that does not allow untrusted scripts to run
  • You have grown sick of dynamic web pages, and want a more peaceful web environment
  • Your bandwidth is metered, or your connection speed is slow, and you don’t want to download lots of scripts that probably aren’t designed to enhance your reading experience.
  • A script has crashed somewhere on the page, taking everything else down along with it.

There are plenty of other reasons, but a lack of JavaScript does not imply that the user is working with a screen reader, or some other form of assistive technology. Screen readers do actually handle JavaScript these days.

This is definitely a use case you should care about. As Mark Pilgrim points out, “If your web application fails in browsers with scripting disabled, Jakob Nielsen’s dog will come to your house and shit on your carpet.”

Images: yes, CSS: no, JS: yes

CSS is disabled, but JavaScript is enabled. You’re a web developer.

Images: yes, CSS: no, JS: no

CSS and JavaScript are both disabled, but you can still see images. You’re a web developer trying to figure out why your print stylesheet doesn’t work.

Images: no, CSS: yes, JS: yes

Images are disabled. This is actually more common and more useful than you might think, especially in low-bandwidth or expensive-bandwidth situations. Most modern web pages still work just fine without images; they’re just a little less visually interesting. You should try it some time.

Opera is the only browser that has consistently made it easy to toggle image downloading from a menu or keyboard shortcut. Other browsers allow you to do this, but you have to dive into their options or preferences screen, or install an add-on.

Images: no, CSS: yes, JS: no

Images and JavaScript are disabled, but CSS is still turned on. You’re trying to save ink while printing a web page.

Images: no, CSS: no, JS: yes

Images and CSS are disabled, but JavaScript still works. You are a client-side unit testing framework.

Images: no, CSS: no, JS: no

Free-range organic HTML, rendered as nature intended it. You’re an awesome time-travelling hard-core Unix geek from the Dark Days before Mosaic (aka 1992), or, more disappointingly, a bot. Hello, bot. Nice bot. Have a cookie.

Using a VPN from the guest OS in Parallels

Problem statement: I am running OS X 10.6 (Snow Leopard), and have a Windows 7 virtual machine running under Parallels 5. I want to have both of them connected to a remote VPN at the same time. Getting the host Mac to connect to the VPN was not a problem. But connecting to the VPN from the Windows 7 guest would not work. It would just sit there at the “Verifying user name and password…” message.

Solution:

  1. Set the virtual machine instance to use Bridged Networking (Default Adapter) instead of Shared Networking.
  2. Make sure that the advanced options for the VPN in the host OS X are not set to send all traffic over the VPN connection. (This is actually the default setting, but if you’ve just spent an hour fucking around with your network configuration trying to make it work, you may very well have changed this.)

This now allows me to connect the VPN on the host and the guest completely independently. (Either one or the other, or both at the same time.) I have no idea if this is a universal solution; if it doesn’t work for you, asking me why really isn’t going to help.