(Update: since originally posting this, I have learned a trick or two which avoids some of the problems discussed here. Nonetheless, this post still stands as potentially helpful to anyone walking the same path I did. See below for details.)

This is one of those posts that a confounded developer hopes to find in a time of need. The situation: generating TopoJSON files. Presented here are two attempts to generate a TopoJSON file; the first failed, the second worked for me, YMMV. The failed path was a naive attempt to do the conversion in one step. The successful path, which I eventually figured out, involves two steps: first GDAL's ogr2ogr (on OS X, GDAL can be installed via brew install gdal), then topojson.

As part of the seattle-boundaries project, I needed to translate a shapefile for the Seattle City Council Districts to TopoJSON. I got the shapefile, Seattle_City_Council_Districts.zip, from the City of Seattle’s open data site.

TopoJSON is from the mind of Mike Bostock. As part of that effort he created a “small, sharp tool” for generating TopoJSON files. The command line tool is appropriately called topojson.

My naive and doomed first attempt was to generate the TopoJSON directly from the shapefile, which is advertised as possible.

topojson --out seattle-city-council-districts-as-topojson.bad.json City_Council_Districts.shp

(Update: turns out the mistake I made was not using the --spherical option. Inspecting the *.prj file that came with the *.shp file revealed that the data was on a spherical projection. Re-running the original command as topojson --spherical ... worked like a charm.)
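For reference, the fix amounts to something like the following (the output filename here is my own placeholder; the .prj file is plain text, so it can simply be printed to see the projection):

# inspect the projection that ships with the shapefile
cat City_Council_Districts.prj
# tell topojson the coordinates are spherical
topojson --spherical --out seattle-city-council-districts-as-topojson.json City_Council_Districts.shp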

Below is the generated file, uploaded to GitHub. Notice the orange line at the north pole; that is the TopoJSON as rendered by GitHub (read: FAIL).

Generated TopoJSON which does not work

The topojson tool can read multiple file formats, including shapefiles (*.shp), but it had problems converting the Seattle City Council Districts file. The root of the problem shows up in the generated JSON: the nutty bbox and translate values are clearly not latitude and longitude numbers:

bbox: [
  1244215.911418721,
  183749.53649789095,
  1297226.8887299001,
  271529.74938854575
],
...
translate: [
  1244215.911418721,
  183749.53649789095
]
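A quick way to see what coordinate system a shapefile declares, without opening the .prj by hand, is GDAL's ogrinfo with its summary-only flag (the layer name below is my assumption; running ogrinfo with just the file name will list the layers):

# print a summary of the layer, including its projection
ogrinfo -so City_Council_Districts.shp City_Council_Districts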

On the other hand, if this file is uploaded to mapshaper.org, it renders well. Note, though, that there is no geographic "context" for the rendering, i.e., no global map; the Seattle City Council Districts are simply scaled to fill the window's allocated pixels. Perhaps mapshaper ignores the bbox and such, which enables it to render.

I explored the topojson command line switches but was not getting anywhere, so I went to Plan B which eventually got me better results. This involved two stages: first use GDAL’s ogr2ogr to translate the shapefile to GeoJSON, and then feed the GeoJSON to topojson.

ogr2ogr -f GeoJSON -t_srs crs:84 seattle-city-council-districts-as-geojson.json City_Council_Districts.shp
topojson --out seattle-city-council-districts-as-topojson.good.json seattle-city-council-districts-as-geojson.json

The resulting TopoJSON renders on GitHub as follows.

Generated TopoJSON which does work

Notice how the districts are colored orange, similar to the TopoJSON “North Pole line” in the bad file.

I guess ogr2ogr is simply better at handling shapefiles. TopoJSON was invented to be a more efficient geo-info JSON format than the rather "naive" GeoJSON, so it stands to reason that Bostock's tool is better at GeoJSON-to-TopoJSON than at shapefile-to-TopoJSON. Or at least that is my guess. I have no way to judge the quality of the input shapefile; maybe the thing was funky to start with.

For more information and all the files related to this task, check out my GitHub repo on this topic.

Update: Take 2 of the work in this post used topojson@1.6.20. I am not sure which version was used for Take 1, but it had been installed over six months earlier.

Also, the City now has a UI for exporting (read: downloading) its datasets as GeoJSON, which leads to another option: use topojson to convert the City's GeoJSON straight to TopoJSON, with no shapefile involved at all.
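That route would look something like the following (a sketch; the input filename is my placeholder for whatever the City's GeoJSON export is saved as):

# GeoJSON from the City's export is already lat/lon, so no ogr2ogr step is needed
topojson --out seattle-city-council-districts-as-topojson.json seattle-city-council-districts.geojson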

This year some of my projects needed maps of Seattle, in particular while working with Open Seattle and the City of Seattle. Specifically, I have needed City Council District and census-related maps. The census geographic hierarchy has three levels for Seattle: Census Blocks are the smallest areas, which are aggregated into Census Block Groups, which in turn are combined into Census Tracts. Of course, there are other administrative geographic units into which Seattle has been subdivided: neighborhoods, zip code areas, school districts, etc.

The City of Seattle actually does a good job of making this information available as shapefiles on its open data site. Nonetheless, what web developers want is the data readily available in a source code repository (for modern open source, read: a git repo) and in formats that are immediately usable in web apps, specifically GeoJSON and TopoJSON.

So, in Open Seattle we have been building out such a repository of JSON files and hosting it on GitHub. That repository is called seattle-boundaries.

Some Seattle boundaries

As a further convenience, Seth Vincent has packaged up the data for distribution via npm. Additionally, he has taken the maps and made available an API service, boundaries.seattle.io. This service will reverse-geocode a point into a list containing one entry for each map in the repo, where each list item is the GeoJSON feature to which the point belongs. For example, say you already know where in town the best place for dim sum is (read: you have a point) and you are curious which regions it belongs to. The URL to fetch is:
http://boundaries.seattle.io/boundaries?long=-122.323334&lat=47.598109
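From the command line, that request looks like this (the response, per the description above, should be a JSON list with one GeoJSON feature per boundary map):

# quote the URL so the shell does not treat the & as a job control operator
curl "http://boundaries.seattle.io/boundaries?long=-122.323334&lat=47.598109"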

Over the last few weeks I surveyed every bit of available data on the 2014 Ebola Outbreak in West Africa that I could find. There were two major sub-tasks to the survey: broad-search and then dig-into-the-good-stuff.

Various ebola visualizations

For the first sub-task, the work started with cataloging the datasets on eboladata.org. I sifted through those 36 (and growing) datasets. My analysis of those can be found on the EbolaMapper wiki at Datasets Listed on eboladata.org. An additional part of this first sub-task was to catalog the datasets available at HDX and NetHope.

I have posted the conclusions of sub-task #1 on the wiki at Recommendations for Data to Use in Web Apps. The humanitarian community seems most comfortable with CSV and Excel spreadsheets. Coming from a web services background I expected a JSON or XML based format, but the humanitarians are not really thinking about web services, although the folks at HDX have started on an effort which shows promise. Finally, for data interchange, the best effort within the CSV/spreadsheet space is #HXL.

The second major sub-task centered on hunting down any "hidden" JSON: finding the best visualizations on the web and dissecting them with various web dev-tools in order to ferret out the JSON. What was found could be considered "private" APIs; it seems that there has not yet been any attempt to come up with an API (JSON and/or XML) for infectious disease records. At best, folks just pass around non-standard but relatively simple CSVs and then manually work out the ETL hassles. My analysis of the web service-y bits can be found on the EbolaMapper wiki as well, at JSON Ebola2014 Data Found on the Web.

My conclusion from the second sub-task is that the world needs a standard data format for outbreak time series, one which is friendly to both the humanitarian community and to web apps, especially for working with mapping software (read: Leaflet). Someone should do something about that.

The EbolaMapper project is all about coming up with computer graphics (charts, interactives, maps, etc.) for visualizing infectious disease outbreaks.

An excellent sign for any given outbreak is when the bullseye plot animations go static. For example, consider the WHO's visualization shown below, which plots data for the 2014 Ebola Outbreak in West Africa.

Each bullseye shows two data points: the outer circle is cumulative deaths and the inner circle is new deaths in the last 21 days. Twenty-one days is the accepted incubation period, which is why Hans Rosling tracks new cases over the last 21 days. When the inner circles shrink to zero, the outbreak is over.

Yet there are much lower tech ways of presenting information to people that can be quite affecting. On the grim side there are the graves.

Sadder still are memorials such as The Lancet’s obituary for health care workers who died of ebola while caring for others – true fallen heroes.

On the other hand there are signs of positive progress.

The image on the left is from an MSF tweet:

The best part about battling #Ebola is seeing our patients recover. Here, hand prints are left in Monrovia #Liberia

The image on the right is from an UNMEER tweet:

Survivor Tree in Maforki, #SierraLeone holds piece of cloth for each patient who left ETU Ebola-free. #EbolaResponse

Those must be quite uplifting reminders on the front lines of the ebola response. Likewise, EbolaMapper should have positive messages, say, turning bullseyes green when there are no new cases. That will need to be kept in mind.

As part of the EbolaMapper project, I reported on the state of the ebola open data landscape to the Africa Open Data Group. That report is currently the best overview of the EbolaMapper project, including its goals and deliverables, i.e., open source tools for visualizing epidemiological outbreaks via the Outbreak Time Series Specification data APIs.

the state of government open data

The project name, EbolaMapper, has become misleading as there is nothing ebola specific about the project (although ebola data is used for testing). That name was chosen during the ebola outbreak in West Africa before the project’s scope expanded beyond just a specific ebola crisis. So, a better name is needed to go with the description “reusable tools and data standard for tracking infectious disease outbreaks, with the intent of helping to modernize the global outbreak monitoring infrastructure, especially by decentralizing the network.”

(When the project started, there was a serious dearth of open data on the crisis. The project eventually outgrew the immediate need of curating data on the outbreak: the UN started publishing ebola statistics, and, independently, Caitlin Rivers' efforts seeded the open data collection effort.)

If the report is tl;dr then perhaps just check out the interactive Ebola visualizations.

The Humanitarian Data Exchange is doing great work, and they have a well-defined roadmap they are plugging away at.

The sub-national time series dataset hosted there is the first data feed I am using to test EbolaMapper against.

They have a very nice, interactive ebola dashboard. The story of how the dashboard came to be is impressive, but I want to work to ensure that tortured path does not have to be traveled again for future outbreaks.

The HDX ebola dashboard

They do not have cases mapped to locations over time. For that, The New York Times is the best.

I have finally found quality outbreak data with which to work:

Sub-national time series data on Ebola cases and deaths in Guinea, Liberia, Sierra Leone, Nigeria, Senegal and Mali since March 2014

The dataset comes from The Humanitarian Data Exchange. Those HDX folks are doing great work. More on them later.

I came to this dataset via a long, convoluted hunt. The difficulty of this search has led me to understand that the problem I am working on is one of data as well as of code. This will need to be addressed, with APIs and discoverability, but for now it is time to move on to coding (finally).

After I concluded that the data was usable, I started poking around on its page on HDX a bit more. On the left side of the page there are links to various visualizations of the data. This is how I discovered Google's Public Data Explorer, which is quite a nice tool. Below is one view of the HDX data in the Explorer. Click through the image to interactively explore the data over at Google.
Google Public Data Explorer view of the HDX data

Also among the visualizations on the HDX page was, to my surprise, the NYTimes visualization. Lo and behold, that visualization credits its data source as HDX:

Source: United Nations Office for the Coordination of Humanitarian Affairs, The Humanitarian Data Exchange

So, that is good enough for me: the data hunt is over. It is time to code.

OpenStreetMap of Monrovia

The Economist has a story, Off The Map, about the launch of MissingMaps.org, which is a development out of OpenStreetMap.

On November 7th a group of charities including MSF, Red Cross and HOT unveiled MissingMaps.org, a joint initiative to produce free, detailed maps of cities across the developing world—before humanitarian crises erupt, not during them.

I mention this here for multiple reasons.

1) OpenStreetMap is a great resource, and HOT (the Humanitarian OpenStreetMap Team) has been active in the ebola response. OpenStreetMap is the source for the maps in the EbolaMapper project I am working on.

Although the current focus is the ebola outbreak, these open source tools that I am calling EbolaMapper can be easily repurposed for any future outbreaks, as they read their data via the generic Outbreak Time Series Specification APIs (https://github.com/JohnTigue/EbolaMapper/wiki/Outbreak-Time-Series-Specification-Overview) I am developing. Next time (and statistically that is likely to occur before 2020) there should be free, quality tools at the ready so people can quickly get started on outbreak monitoring without having to wait for large organizations to mobilize.

As I have gotten to know some of the folks who have been involved with responding to previous epidemic outbreaks, they sound like they are living through a nightmare version of Groundhog Day (swine flu in 2009, SARS, etc.). Yet now this type of problem can be solved with generic, mature, widely available Web technology, i.e., it does not require complex novel technology that needs to be scaled massively. (On the other hand, we do need to be mindful that currently in Liberia "less than one percent of the population is connected to the internet." [Vice News])

With the current established culture of open source it would be shameful for this type of flatfooted, delayed response to occur again. We have the technology to enable local actors to immediately get started by themselves the next time there is an epidemic outbreak.

2) This is a perfect example of one way that funding in this weird space can be successful. Folks (private and public) trying to effectively allocate money can find open source and/or open data projects that are already working and then juice them with cash for scaling, which is always an aspect of the large success stories in open source.

This is a bizarre but exciting variant of the thinking of Steve Blank and the lean start-up folks as applied to open source business models. I say bizarre because the customers (those benefiting from public health and disaster relief projects like the ebola response) cannot pay and have no obvious monetizable value as users. Here the open-source community has found a successful model, and now that it is proven out, the funding organizations are providing the cash to accelerate tech development to scale, where normally that cash would come via a Series A round with venture capitalists.

In many successful open source projects, tech companies are paying talent to produce code that will immediately be placed in essentially the public domain. The value of doing so is expert status in the eyes of paying customers, plus keeping that scarce talent in-house to service those customers.

As Marc Andreessen quipped on twitter:

“The best minds of my generation are thinking about how to make people buy support contracts for free software.” –Anonymous

But who is going to do the funding in disaster relief contexts, specifically for maps, and do so proactively? That is why this MissingMaps.org situation is great news.

3) This blog loves a good map visualization related to the ebola response. The Off The Map story has a neat one. In the above picture the red handle can be dragged left and right to see the before versus after map.

Hans Rosling was recently interviewed on the BBC's More or Less. He was doing his regular excellent job of entertainingly engaging the public via statistics. The full interview is less than ten minutes. The BBC also did a write-up of the interview.

Hans Rosling

Rosling reported that in Liberia, at the peak of the outbreak, new infections were about 75 per day, and they are now stuck at around 25 per day. He believes the current (second) stage of the outbreak could well be labeled endemic, an intermediate-level epidemic that will take some time to put out.

A statistic that he says is important is the reproductive number. At the peak of the outbreak it was almost 2.0; currently it is closer to 1.0. The point is that the reproductive number is a key stat that needs to be tracked.

Later, five minutes into the interview, he has a go at mainstream media's reportage, specifically the use of cumulative numbers:

It is a bad habit of media. Media just wants as many zeros as possible, you know. So, they would prefer them to tell that in Liberia we have had about 2,700 cases or 3,000 case. The important thing is that it was 28 yesterday. We have to follow cases per day. I can take Lofa province, for instance, that has had 365 cases cumulative and the last week it was zero, zero, zero, zero every day. That is really hopeful that we can see the first infected [county] is where we now have very low numbers because everyone is aware.

Notice that the NYTimes ebola viz uses cases-per-week. We can build out visualization tools which provide a similarly level-headed overview of a situation, which might even help to reduce public anxiety compared to cumulative-case representations.

Takeaway: the two statistics he pointed out, cases per day and the reproductive number, will be visualized in the open source epidemic monitoring dashboard tool set being built out here.