Over the weekend of 29-31 July 2016 I participated in my first GovHack. I teamed up with Jack Simpson, whom I’ve known since we attended the same Software Carpentry instructor training in Melbourne and worked out, through the normal course of conversation, that we were both Canberra residents.
We camped out at the UC Heritage Hack node at the University of Canberra. Tim Sherratt organised the “themed” node to cater to those interested in devising hacks around cultural heritage datasets. He provided a large number of examples demonstrating the fascinating insights that can be gleaned from such datasets. He mentioned that cultural heritage data is often far from “clean”. Instead it contains inconsistencies, ambiguous elements, formatting issues, missing fields, and a blend of structured and unstructured data.
All of this was true of the dataset Jack and I chose to work with. Our stated goal was simple: to visualise convict ship journeys to Australia over time. For this we turned to the State Library of Queensland’s digitised records of the British convict transportation registers. They list over 123,000 of the estimated 160,000 convicts transported to Australia: what they were convicted of, what ship (or fleet) they journeyed on, when they departed, and where in Australia they alighted. It is a remarkable resource.
At first, a weekend to work solely on the one project felt like an ample time budget. As a father of two marvellous children, I eke out time for hobbies in dribs and drabs. But as the list of tasks dawned on the team, it didn’t take long to realise that time would be exceptionally tight. I caught myself subvocalising the words “minimum viable product” throughout the weekend. (Such things happen to those who find interesting lists on Wikipedia.) At several points we narrowed our scope from its initial grandiose vision:
The result was Conviction Currents. No clever interactive slider, no ability to freeze on a journey and unpack each convict’s story. But something, nevertheless, on the internet, nonetheless. And a video to boot:
By the late afternoon on the Sunday, we had submitted; one of 437 hacks. I’m not sure about Jack, but I felt shattered the next day.
Here’s what we used to complete our hack:
- OpenRefine: Once known as Freebase Gridworks and then as Google Refine, this tool is still one of my favourites for getting a handle on an unfamiliar dataset. Its faceting feature is great for seeing the range of values a particular field takes on and very quickly reveals any inconsistency. The “cluster and merge” functionality it provides is super neat for auto-detecting and fixing inconsistencies.
- Python and Jupyter: What’s not to love about the simplicity “Python in a notebook” brings to visualising data?
- Ruby: Ruby is my programming language of choice and its CSV support proved super handy for reading in the datasets and simplifying them.
- Regular Expressions: s/kind of/massively/ useful.
- Classic Unix commands: awk. All supremely useful. (How many lines of code could I have avoided writing if as a whippersnapper I had bothered to learn these properly?)
- Snap SVG: Our chosen library for producing animated curves representing the ship journeys.
- Git and GitHub for source control and static website hosting.
- iMovie: Quite fun but worth learning the keyboard shortcuts.
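
For the curious, here is a rough idea of how a “cluster and merge” step like the one mentioned above can spot inconsistent spellings. This is a minimal Python sketch of key-collision clustering: each value is boiled down to a normalised “fingerprint”, and values that share a fingerprint are grouped as likely variants of the same thing. It is loosely modelled on that general technique and is not OpenRefine’s actual implementation. The function names and the ship names are invented for illustration and are not taken from the convict registers.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a string to a normalised key (a simplified key-collision fingerprint)."""
    # Decompose accented characters and drop anything that is not ASCII.
    ascii_value = (
        unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    )
    # Lowercase, then turn punctuation into spaces.
    cleaned = re.sub(r"[^\w\s]", " ", ascii_value.lower())
    # Sort and de-duplicate the tokens so word order and repeats do not matter.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values by fingerprint, keeping only groups with more than one variant."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Invented spellings for illustration only (assumption: not rows from the real dataset).
ships = [
    "Northern Star",
    "northern  star ",
    "Star, Northern",
    "Sea Lark",
    "Sea Lark.",
    "Albatross",
]
clusters = cluster(ships)

for key, variants in clusters.items():
    print(key, "<-", variants)
```

Each group it reports is a candidate for merging into a single canonical spelling. A human still needs to confirm every merge, since two genuinely different names can collide on the same fingerprint.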
- 6-10pm Friday
- 9-5pm Saturday
- Create project page
- Data exploration in OpenRefine
- Successive data cleaning, simplification and summarisation in Ruby scripts
- Informative conversations
- GitHub account, git add ., git commit -m...,
- 9-6pm Sunday
For next time…
- Bigger team: Jack and I worked well as a team. But more people, more better. So many things to do, calling for skill sets in project management, marketing, catering, reading fine print and domain expertise in the data, just to name a few.
- Pack wholesome food: Our self-catering consisted of danishes and croissants. Delicious: yes. Healthy: no.
- Make more time for conversations: There were some brilliant mentors on site and there was some incredible stuff happening at other tables. But some conversations were cut short as the pressure mounted.
That’s it. Thanks to all involved. Looking forward to next year.