
Visualising Research Data in Australia

This visualisation explores the growth of research data collections in ANDS Research Data Australia (RDA) over time. We wanted to see who contributed collection records, in which disciplines, and when.

The visualisation gives us insights into questions such as:

  • Who got the ball rolling?

  • When were there bursts of activity?

  • How did RDA grow over time?

  • What did the discipline coverage look like at any point in time?

What you’re seeing are collections grouped by contributing research institution and then by top-level Field of Research (FOR) code.

  • Each cluster represents an institution.

  • Each colour represents an FOR code.

  • Each dot represents a collection.

This visualisation was produced by Luc Small and Jared Berghold as part of the 2012 eResearch Australasia conference Developers Lounge Challenge.

It makes extensive use of the source code visualisation tool Gource, as discussed further below. Gource is great for viewing the contributions of actors (usually developers) to repositories (usually of source code) over time. RDA is effectively a repository, and the contributing institutions are its actors, so Gource is a perfect tool for visualising its growth over time.

The data was sourced from the Australian National Data Service (ANDS) and used under a Creative Commons Attribution 3.0 Australia License.

We also acknowledge the work of J Lang, whose song “The Garden of Forking” is used here and has been made available under a Creative Commons Attribution Noncommercial (3.0) License.

Methodology

The visualisation was produced using these steps:

  1. Download all collection records as XML/RIF-CS from RDA (about 41,000 on the day of the download).

  2. Use the Ruby REXML SAX2Parser to extract all collection records with ANZSRC FOR codes. For each such collection, the registryObject group attribute, the collection dateModified attribute and the FOR code value were dumped out in a simple line-oriented file format, one line per FOR code: a collection record with three FOR codes produced three lines of output. The output file was in excess of 110,000 lines. This script also converted the date modified to Unix timestamp format (seconds since the Unix epoch), as required by Gource.

  3. Use the Unix sort command to ensure the dumped log file is sorted chronologically by the date modified field.

  4. Transform the sorted dump from step 3 into a valid Gource custom log format file. A Ruby script was used for this. Its key job was to map the collections data into a hierarchical (i.e. file system-like) structure that Gource could display. After trying a few permutations we settled on structuring the “file paths” as InstitutionName/Textual top-level FOR/RandomNumber. The first value was derived from the registryObject group attribute. The second was produced by taking the numeric FOR code, trimming it to the first two digits, and mapping those two digits to their textual equivalent. The random number was simply used to give each collection a (hopefully) unique “file name”; a hash would have been better. A colour was also assigned to each collection record according to the FOR code it belonged to. A collection record was discarded if it contained any bad data, e.g. an invalid FOR code.

  5. Produce a background image, with a key to the FOR code colouring system, with the aid of Processing.

  6. Run the log file through Gource, using some command-line switches to help tune the visualisation presented.

  7. Use FFmpeg to dump the Gource visualisation to a video file.

  8. Use iMovie to assemble and edit the video.
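
For readers who want to try something like step 2, here is a minimal sketch of the extraction using REXML's SAX2Parser. It is not the original script. The sample XML is a made-up, heavily trimmed RIF-CS fragment, the `anzsrc-for` subject type and the tab-separated output are assumptions, and `extract_for_rows` is a name invented for this example.

```ruby
require "rexml/parsers/sax2parser"
require "time"

# Hypothetical, trimmed RIF-CS sample; real records carry many more elements.
SAMPLE_XML = <<~XML
  <registryObjects>
    <registryObject group="Example University">
      <collection type="dataset" dateModified="2012-10-01T00:00:00Z">
        <subject type="anzsrc-for">0601</subject>
        <subject type="local">frogs</subject>
        <subject type="anzsrc-for">0502</subject>
      </collection>
    </registryObject>
  </registryObjects>
XML

# Returns one [unix_timestamp, group, for_code] row per FOR code found.
def extract_for_rows(xml)
  rows = []
  group = nil
  modified = nil
  in_for_subject = false

  parser = REXML::Parsers::SAX2Parser.new(xml)

  # Remember the enclosing institution and modification date as we descend.
  parser.listen(:start_element) do |_uri, local, _qname, attrs|
    case local
    when "registryObject" then group = attrs["group"]
    when "collection"     then modified = attrs["dateModified"]
    when "subject"        then in_for_subject = (attrs["type"] == "anzsrc-for")
    end
  end

  # Text inside <subject> is the code itself; keep it only for FOR subjects.
  parser.listen(:characters, ["subject"]) do |text|
    code = text.strip
    next unless in_for_subject && group && modified && !code.empty?
    begin
      rows << [Time.iso8601(modified).to_i, group, code]
    rescue ArgumentError
      # Skip records whose date cannot be parsed.
    end
  end

  # Reset state on the way back up so values never leak between records.
  parser.listen(:end_element) do |_uri, local, _qname|
    case local
    when "subject"        then in_for_subject = false
    when "collection"     then modified = nil
    when "registryObject" then group = nil
    end
  end

  parser.parse
  rows
end

extract_for_rows(SAMPLE_XML).each { |row| puts row.join("\t") }
```

Because the timestamp is the first field on each line, a numeric sort of the output (step 3) puts the rows in chronological order.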
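
The transformation in step 4 can be sketched roughly as follows. Gource's custom log format is one pipe-delimited line per event: `timestamp|username|type|file|colour`. Everything else here is an assumption made for illustration: the `gource_line` helper, the partial `FOR_DIVISIONS` table and the hex colours. The sketch also swaps the random number for a short SHA-1 digest, as the text suggests would have been better.

```ruby
require "digest"

# Two-digit ANZSRC FOR divisions mapped to [name, colour]. The names follow the
# ANZSRC 2008 divisions; this subset and the hex colours are illustrative only.
FOR_DIVISIONS = {
  "01" => ["Mathematical Sciences", "1F77B4"],
  "05" => ["Environmental Sciences", "2CA02C"],
  "06" => ["Biological Sciences", "FF7F0E"],
  "21" => ["History and Archaeology", "9467BD"]
}.freeze

# Converts one dumped row into a Gource custom log line:
#   timestamp|username|type|path|colour
# Returns nil when the row should be discarded (bad timestamp, unknown FOR
# division, or missing institution).
def gource_line(timestamp, institution, for_code)
  ts = Integer(timestamp, exception: false)
  name, colour = FOR_DIVISIONS[for_code.to_s.strip[0, 2]]
  return nil if ts.nil? || name.nil? || institution.to_s.strip.empty?

  # "|" separates log fields and "/" separates path levels, so replace both.
  actor = institution.strip.gsub(/[|\/]/, "-")

  # A digest stands in for the random number used in the original script.
  file_name = Digest::SHA1.hexdigest([ts, institution, for_code].join("|"))[0, 12]

  path = [actor, name, file_name].join("/")
  [ts, actor, "A", path, colour].join("|")
end

rows = [
  [1349049600, "Example University", "0601"],
  [1349049600, "Example University", "9999"] # unknown division, discarded
]
rows.each do |row|
  line = gource_line(*row)
  puts line if line
end
```

Every event uses type `A` (added), so each collection appears once as a new "file" under its institution and discipline. Two collections sharing the same institution, timestamp and FOR code would hash to the same name here; mixing a record identifier into the digest, where one is available, would avoid that.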
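
Steps 6 and 7 are usually done together by piping Gource's PPM frame stream straight into FFmpeg. The sketch below only builds and prints such a command line. The switches are real Gource and FFmpeg options, but the file names, resolution, speed and codec settings are placeholders rather than the values used for this video, and `render_pipeline` is a hypothetical helper.

```ruby
require "shellwords"

# Builds a "gource ... | ffmpeg ..." shell pipeline as a string.
# All option values are placeholders, not the settings used for the original video.
def render_pipeline(log_file:, background:, output:, fps: 30)
  gource = [
    "gource", log_file,
    "--log-format", "custom",        # read the custom log produced in step 4
    "-1280x720",                     # viewport size
    "--seconds-per-day", "0.1",      # playback speed
    "--background-image", background,
    "--output-framerate", fps.to_s,
    "--output-ppm-stream", "-"       # write raw PPM frames to stdout
  ]
  ffmpeg = [
    "ffmpeg", "-y",
    "-r", fps.to_s,
    "-f", "image2pipe", "-vcodec", "ppm", "-i", "-", # read PPM frames from stdin
    "-vcodec", "libx264", "-pix_fmt", "yuv420p",
    output
  ]
  "#{Shellwords.join(gource)} | #{Shellwords.join(ffmpeg)}"
end

puts render_pipeline(log_file: "rda.log", background: "key.png", output: "rda.mp4")
```

Building each command as an array and joining it with `Shellwords.join` keeps file names containing spaces safely escaped. The printed line can be pasted into a shell once Gource and FFmpeg are installed.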
