My main side project over the last many months is in response to what I saw during the last big Ebola outbreak which – sadly – is still simmering along such that “flare-ups of cases continue” despite the emergency having been officially declared over. The first piece of software to come out of that side project is called an omolumter (live demo of v0.1.0).
Back in December of last year I completed analysis of the open data on the Ebola 2014 West Africa outbreak. Currently, building on that analysis, I am working on a data format I am calling the Outbreak Time Series Specification (“the Spec”). The API design is actually leveraging the recently completed W3C Recommendation, CSV on the Web (CSVW). In parallel I am writing software which reads files (or in HTTP terminology, resources) that are formatted compliant to the Spec.
The software is called an omolumter, a word I just made up. It’s a meter for Omolu, the African orisha (“god”) of epidemics and the healing thereof. It will be the first app that reads outbreak data compliant to the Spec. Currently all it does is parse a CSV that contains outbreak data and renders it in a table view. Fancier visualizations such as charts and maps are to be added in later versions of the omolumeter. Everything will be implemented using liberally licensed web technology, no flash or native apps. This will be the codebase that is the reference implementation for the Spec. I am building this out as I write the spec in order to prove it is easy to implement software which works with the Spec.
Internet Explorer is not going to be supported. (Yeah, I said it.) Well, not by me. If anyone wants to join the party, pull requests are welcome :)
Back in April 2015, I started working with the City of Seattle on a very interesting project which just went live at version 1.0 (press release). Note: here “1.0” means it has been publicly deployed and is collecting data but 2.0 will be slicker still. I am really excited about this technology and more importantly the long term legal implications.
I was the project lead on this. Open Technology Institute was the “vendor.” Bruce Blood, Open Data Manager for the City, was the point man inside the government. Bruce is a good guy and he is doing great work; it will be a civic loss when he retires later this year.
The City is calling the tool the “Broadband Speed Test”. Personally I label it as an “ISP quality monitor.” (“Broadband” sounds so highfalutin that the term must have come out of some marketing department.) We already have air quality monitors and the County mails out pamphlets regularly which enumerate quality measures for the drinking water. The goal is of this project is that one day we will have laws regulating net neutrality and quality. In such a regulatory scheme we will need ISP quality monitoring tools.
Clearly we do not currently have laws for this but even it the government simply collects and disseminates network quality information that is a big win. And now for the first time in the USA we have the government (albeit only a municipal government) collecting this sort of information. Hopefully efforts like this will lead to federal legislation.
Even without regulatory laws, simply governmental monitoring of the network may well lead to the situation improving. An analogy can be made to commercial airlines. The federal Department of Transportation collects performance data on airlines. For example, “Gate Departure Time” data is collected. Then reports of delays per airline are periodically published.
Simply collecting and making available such data was sufficient to get the airline corporations to improve those numbers. (Ironically, this is one of the reasons that we Americans now spend so much time parked on runways: the numbers reflect when an aircraft pushed off from its gate, not when the airplane started flying.) Based on IP addresses collected during network speed testing, we can determine which ISP the test is being run over and then later generate quality reports per ISP. Hopefully, similar to the airline departure times, we will see an improvement in the performance numbers of ISPs now that we are starting to collect the relevant information.
So, legally this may well prove to be interesting stuff in the long term. Moving on to the technological aspects of this project, that is currently interesting. NDT (Network Diagnostic Test) is the very core of this technology package. NDT is the bit-pushing speed tester. The code is liberally licensed. This is the code which actually generates a fake file and sends it to the server for upload timing. Then the server sends a fake file to the client for download timing.
Piecewise is the name of the web app that runs the tests and generates the reports and provides the user interface. Unfortunately Piecewise is licensed under GPL3, rather than Apache or MIT, but that detail is not a show stopper.
(Update: since originally posting this, I have learned a trick or two which avoids some of the problems discussed here. Nonetheless, this post still stands as potentially helpful to anyone walking the same path I did. See below for details.)
This is one of those posts that a confounded developer hopes to find in a time of need.
The situation: generating TopoJSON files.
Presented here are two attempts to generate a TopoJSON file; the first one failed, the second worked for me, YMMV.
The fail path is due to a naive attempt to do the conversion in one step. The successful path is one I eventually figured out that involves two steps.
The two step process consists of using
topojson and then GDAL (on OSX, it can be installed via
brew install gdal).
As part of the seattle-boundaries project, I needed to translate a shapefile for the Seattle City Council Districts to TopoJSON. I got the shapefile, Seattle_City_Council_Districts.zip, from the City of Seattle’s open data site.
My naive and doomed first attempt was to generate the TopoJSON directly from the shapefile, which is advertised as possible.
(Update: turns out the mistake I made was not using the
--spherical option. Inspecting the
*.prj file that came with the
*.shp file revealed that the data was on a spherical projection. Re-running original command as
topojson --spherical ... worked like a charm.)
Below is the generated file, uploaded to GitHub. Notice the orange line at the north pole. That is the TopoJSON rendered (read: FAIL).
topojson, can read from multiple file formats, including shapefiles (*.shp), but there were problems with it converting the Seattle City Council District file.
The root of the problem is in the JSON with the nutty
transform which clearly are not latitude and longitude numbers:
1 2 3 4 5 6 7 8 9 10 11
On the other hand, if this file is uploaded to mapshaper.org then it renders well. Note though that there is no geographic “context” for the rendering, i.e., no global map and the Seattle City Districts are scaled to take up the full window’s allocated pixels. Perhaps mapshaper is not using the
bbox and such, which enables it to render.
I explored the
topojson command line switches but was not getting anywhere, so I went to Plan B which eventually got me better results. This involved two stages: first use GDAL’s ogr2ogr to translate the shapefile to GeoJSON, and then feed the GeoJSON to
The resulting TopoJSON renders on GitHub as follows.
Notice how the districts are colored orange, similar to the TopoJSON “North Pole line” in the bad file.
ogr2ogr is better at handling shapefiles. TopoJSON was invented in order to make a more efficient geo-info JSON format than the rather “naive” GeoJSON, so it stands to reason that Bostock’s tool is better at
GeoJSON to TopoJSON than it is at
Shapefile to TopoJSON. Or at least that is my guess. I have no ability to judge the quality of the input Shapefile; maybe the thing was funky to start with.
For more information and all the files related to this task, check out my GitHub repo on this topic.
Take 2 of the work in the post used
firstname.lastname@example.org and I am not sure which version was used for Take 1 but that version of
topojson was installed over six months earlier.
Also, the City now has UI for exporting (read: downloading) their datasets as GeoJSON, which leads to another option: use
topojson to convert the City’s GeoJSON to TopoJSON, no shapefile involved at all.
This year some of my projects needed maps of Seattle, in particular while working with Open Seattle and the City of Seattle. Specifically I have needed City Council District and census-related maps. The census geographic hierarchy has three levels for Seattle: Census Blocks are the smallest areas which are then aggregated into Census Block Groups which in turn are combined into Census Tracks. Of course, there are other administrative geographic units into which Seattle has be subdivided: neighborhoods, zips code areas, school districts, etc.
The City of Seattle actually does a good job of making this information available as shapefiles on its open data site. Nonetheless, what web developers want is to have the data extremely readily available in a source code repository (for modern open source, read: a git repo) and in formats that are immediately usable in web apps, specifically GeoJSON and TopoJSON.
So, in Open Seattle we have been building out such a repository of JSON files and hosting it on GitHub. That repository is called seattle-boundaries.
As a further convenience, Seth Vincent has packaged up the data for distribution via npm.
Additionally, he has also taken the maps and made available an API service, boundaries.seattle.io.
This service will reverse-geocode a point into a list containing one entry for each map in the repo, where the list items are the GeoJSON feature to which the point belongs.
For example, let us say you already know where in town the best place for dim sum is (read: you have a point) and you are curious as to which regions it belongs.
The URL to fetch from is:
Over the last few weeks I surveyed every bit of available data on the 2014 Ebola Outbreak in West Africa that I could find. There were two major sub-tasks to the survey: broad-search and then dig-into-the-good-stuff.
For the first sub-task, the work started with cataloging the datasets on eboladata.org. I sifted through those 36 (and growing) datasets. My analysis of those can be found on the EbolaMapper wiki at Datasets Listed on eboladata.org. An additional part of this first sub-task was to catalog the datasets available at HDX and NetHope.
I have posted the conclusions of sub-task #1 on the wiki at Recommendations for Data to Use in Web Apps. The humanitarian community seems most comfortable with CSV and Excel spreadsheets. Coming from a web services background I expected a JSON or XML based format, but the humanitarians are not really thinking about web services, although the folks at HDX started on an effort which shows promise. Finally, for data interchange, the best effort within the CVS/spreadsheet space is #HXL.
The second major sub-task centered on hunting down any “hidden” JSON: finding the best visualizations on the web and dissecting them with various web dev-tools in order to ferret out the JSON. That which was found could be considered “private” APIs; it seems that there has not yet been any attempt to come up with a API (JSON and/or XML) for infectious disease records. At best, folks just pass around non-standard but relatively simple CSVs and then manually work out the ETL hassles. My analysis of the web service-y bits can be found on EbolaMapper wiki as well at JSON Ebola2014 Data Found on the Web.
My conclusion from the second sub-task is that the world needs a standard data format for outbreak time series, one which is friendly to both the humanitarian community and to web apps, especially for working with mapping software (read: Leaflet). Someone should do something about that.
The EbolaMapper project is all about coming up with computer graphics (charts, interactives, maps, etc.) for visualizing infectious disease outbreaks.
A sign of excellent news for any given outbreak is when the bullseye plot animations go static. For example, consider the WHO’s visualization shown below which is plotting data for the 2014 Ebola Outbreak in West Africa.
Each bullseye shows two datum: the outer circle is cumulative deaths and the inner circle is new deaths in the last 21 days. 21 days is the accepted incubation period and that is why Hans Rosling tracks new cases for the last 21 days. When the inner circles shrink to zero the outbreak is over.
Yet there are much lower tech ways of presenting information to people that can be quite affecting. On the grim side there are the graves.
Sadder still are memorials such as The Lancet’s obituary for health care workers who died of ebola while caring for others – true fallen heroes.
On the other hand there are signs of positive progress.
The image on the left is from a MSF tweet:
The best part about battling #Ebola is seeing our patients recover. Here, hand prints are left in Monrovia #Liberia
The image on the right is from an UNMEER tweet:
Survivor Tree in Maforki, #SierraLeone holds piece of cloth for each patient who left ETU Ebola-free. #EbolaResponse
Those must be quite uplifting reminders on the front lines of the ebola response. Likewise EbolaMapper should have positive messages, say turning bulls-eyes green when there are no new cases. That will need to be kept in mind.
As part of the EbolaMapper project, I reported on the state of the ebola open data landscape to the Africa Open Data Group. That report is currently the best overview of the EbolaMapper project, including its goals and deliverables i.e. open source tools for visualizing epidemiological outbreaks via the Outbreak Time Series Specification data APIs.
The project name, EbolaMapper, has become misleading as there is nothing ebola specific about the project (although ebola data is used for testing). That name was chosen during the ebola outbreak in West Africa before the project’s scope expanded beyond just a specific ebola crisis. So, a better name is needed to go with the description “reusable tools and data standard for tracking infectious disease outbreaks, with the intent of helping to modernize the global outbreak monitoring infrastructure, especially by decentralizing the network.”
(When the project started, there was a serious dearth of open data on the crisis. The project eventually outgrew the immediate need of curating data on the outbreak. Eventually the UN started publishing ebola statistics, and independently Caitlin Rivers’ efforts seeded the open data collection effort.)
National Geographic sure knows the power of good visuals. Their ebola tracking report has one particularly neat yet unusual view: a graphical calendar of 2014, one map per week.
They do not have case mapped to locations over time. For that The New York Times is the best.
I have finally found quality outbreak data with which to work:
Sub-national time series data on Ebola cases and deaths in Guinea, Liberia, Sierra Leone, Nigeria, Senegal and Mali since March 2014
I came to this dataset via a long, convoluted hunt. The difficulty of this search has led me to understand that the problem I am working on is one of data, as well as of code. This will need to be addressed, with APIs and discoverability but for now it is time to move on to coding (finally).
After I concluded that the data was usable, I started poking around on its page on HDX a bit more. On the left side of the page there are links to various visualizations of the data. This is how I discovered Google’s Public Data Explorer which is quite a nice tool. Below is one view of the HDX data in the Explorer. Click through the image to interactively explore the data over at Google.
Also among the visualizations on the HDX page was, to my surprise, the NYTimes visualization. Low and behold that visualization credits their data source as the HDX:
Source: United Nations Office for the Coordination of Humanitarian Affairs, The Humanitarian Data Exchange
So, that is good enough for me: the data hunt over. It is time to code.