This chapter outlines a set of staple tools for data-driven research, explaining what each is useful for and what the barriers to entry are.
| Issue | Tools | Level | Notes |
| --- | --- | --- | --- |
| Data not available | Freedom of Information portals (e.g. WhatDoTheyKnow, Frag den Staat) | Basic, though some education may be required: people need to know that they have the right to ask, how to phrase an FOI request, whether requests can be submitted electronically, and so on. | While Freedom of Information portals are a good way of getting data, the results often end up scattered. It would be useful to have responses structured into data directories so that successful responses could be searched together with proactively released data, giving one common source for data. |
| Data available online but not downloadable (e.g. in HTML tables on webpages) | For simple sites (e.g. information on a single webpage): Google Spreadsheets and the ImportHTML function, or the Google Scraper extension (basic). For more complex sites (information spread across numerous pages): a scraper is required. Scrapers extract structured information from websites using code; ScraperWiki is a useful online tool that makes this easier (advanced). | At the basic level, anyone who can use a spreadsheet and its functions can do this. ImportHTML is not a well-known function, however, and awareness needs to be spread about how it can be used: people are often unnecessarily daunted because they presume scraping involves code. Scraping with code is advanced and requires knowledge of at least one programming language. | In the Open Knowledge Foundation's international interviews with data-driven researchers, the need to scrape was mentioned in every country interviewed. For more information, or to start learning to scrape, see the School of Data course on scraping. A minimal code sketch is given after this table. |
| Data available only in PDFs or images | A variety of tools are available to extract this information. The most promising non-code options are ABBYY FineReader (not free) and Tabula (new software, still somewhat buggy, and users have to host it themselves). | Advanced. Most tools require knowledge of coding, though some progress is being made on non-technical tools. For more information, and to see some of the advanced methods, see the School of Data course. | These tools are still imperfect, and it remains vastly preferable to advocate for data in the correct formats rather than to teach people how to extract it. Recently published guidelines from the UK and US governments can now be cited as examples when asking for data in the required formats. |
| Leaked data | Several projects make use of secure dropboxes and services for whistleblowers, for example MagyarLeaks. | Advanced. Security is of the utmost concern. | |
| Data no longer available online | Internet Archive Wayback Machine | Basic | |
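To give a sense of the two levels in the scraping row above: in Google Spreadsheets, the formula `=IMPORTHTML("http://example.org/page", "table", 1)` pulls the first HTML table on a page into the sheet without any code. The code-based equivalent is only a few lines longer. Below is a minimal sketch in Python using the pandas library; the URL and output file are placeholders for illustration, not a real data source.

```python
# Minimal scraping sketch: extract an HTML table from a web page into a CSV file.
# The URL below is a placeholder; substitute a real page containing a <table>.
# pandas.read_html needs an HTML parser such as lxml installed.
import pandas as pd

url = "http://example.org/spending-by-department"

# read_html returns a list of DataFrames, one per <table> found on the page
tables = pd.read_html(url)
spending = tables[0]

print(spending.head())                        # quick look at the extracted rows
spending.to_csv("spending.csv", index=False)  # save for further analysis
```

For information spread across many pages, the same idea applies, but the script has to loop over the page URLs and concatenate the results, which is where services such as ScraperWiki help.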
| Issue | Tools | Level | Notes |
| --- | --- | --- | --- |
| Messy data: typos, blanks, etc. | Spreadsheets; Open Refine; powerful text editors (e.g. TextWrangler) plus knowledge of regular expressions; messytables | Basic: spreadsheets. Intermediate: Open Refine. Advanced: messytables, regular expressions. | A short cleaning sketch using regular expressions is given after this table. |
| Managing and transforming data with code | Miso, Recline.js | Advanced | |
| Reconciliation of entity names | Nomenklatura, OpenCorporates | Advanced | Reconciling entities is complicated, both because of the tools needed and because of the often corrupt state of the data. |
| Mapping names to geographic data | GeoNames | | |
| Visualising networks and relationships between entities | Gephi | Intermediate to Advanced | |
| Working with many lines of data (too big to fit in Excel) | OpenSpending.org; other database software (PostgreSQL, MySQL); command-line tools | Basic to Advanced | As few countries currently release transaction-level data, this is not a frequent problem, but it is already an issue in places such as Brazil, the US, and the UK. As we push for greater disclosure, this will be needed ever more. |
| Performing repetitive tasks or modelling | Excel macros | Basic to Intermediate | |
| Slicing and dicing CSV files | csvkit Python library | Advanced | |
| Entity extraction (e.g. from large bodies of documents) | OpenCalais, Yahoo/YQL Content Analysis API, TSO data enrichment service | Intermediate | This is far from a perfect method, and it would be vastly easier to answer questions about entities if they were codified with a unique identifier. |
| Analysis of datasets published in different languages (e.g. in India) | To some extent, Google Translate for web-based data | Basic | Still searching for a solution to automatically translate offline spreadsheets. |
| Figures change in data after publication | For non-machine-readable data: tricky. For simple, machine-readable formats such as CSV: version control tools (e.g. Git, Mercurial). For web-based data: some scrapers can be configured to trigger an alert (e.g. email someone) whenever a field changes. | Intermediate to Advanced | Future projects that are likely to tackle this problem: DeDupe. |
| Finding statistical patterns in spending data | R (free), SPSS (proprietary), and other statistical software for clustering and anomaly detection (see the note below) | Advanced | Example: data from Supervizor has been used to track changes in spending on contractors across changes of government. A note on statistical analysis software follows below. |
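As a concrete illustration of the "messy data" row above, here is a minimal cleaning sketch using regular expressions in Python. The sample values are invented examples of the kinds of typos and formatting noise found in real spending tables; the parsing rules are assumptions for illustration, not a general-purpose parser.

```python
# Minimal cleaning sketch: normalise messy amount fields with a regular expression.
# The sample values are invented examples of common formatting problems.
import re

raw_values = ["1,234.50", "£2 000", " 350.00 ", "n/a", "", "4.120,75"]

def parse_amount(value):
    """Strip currency symbols, spaces and thousands separators; return a float or None."""
    cleaned = re.sub(r"[^\d,.\-]", "", value)   # keep digits, separators and minus signs
    if not cleaned:
        return None
    # Treat a trailing ',dd' as a decimal comma (e.g. '4.120,75' -> 4120.75)
    if re.search(r",\d{2}$", cleaned):
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        cleaned = cleaned.replace(",", "")
    return float(cleaned)

print([parse_amount(v) for v in raw_values])
# [1234.5, 2000.0, 350.0, None, None, 4120.75]
```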
Note on SPSS and R: It is our impression that interviewees had largely been trained to use SPSS. R is important to mention, however, as it offers free access to a broad range of the same models, though through a programming interface.
A few examples of analyses of spending data that can be done with statistical software such as SPSS or R:
a) Hidden Markov Models: Hidden Markov Models were originally developed for finding patterns in bioinformatics, but they have turned out to be useful for predicting fraudulent and corrupt behaviour as well. They were, for example, used to analyse spending data covering 50 million transactions in the Slovenian platform Supervizor. Using Hidden Markov Models requires high-quality data. A minimal code sketch is given after this list.
b) Benford's law: Benford's law describes how the leading digits of figures in many real-world datasets should be distributed, so the digits in your data can be checked against that expected distribution. Deviations from it can help detect fraudulent reporting (e.g. if companies tend to report earnings of less than $500 million in order to fit a particular regulation, Benford's law could be a tool to detect that). Check out this example using Benford's law to test the release of all Danish corporate tax filings, and this R blog post on the topic. A minimal code sketch is also given after this list.
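For readers who want a sense of what a Hidden Markov Model analysis looks like in code, below is a minimal sketch using Python and the hmmlearn library instead of SPSS or R. The data are randomly generated stand-ins for payment sequences, and the number of states and the anomaly threshold are illustrative assumptions, not the method actually used in Supervizor.

```python
# Minimal Hidden Markov Model sketch for flagging unusual spending sequences.
# Uses the hmmlearn library; the data here are random stand-ins, not real transactions.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

# Pretend each row is a monthly payment amount: 20 sequences of 24 months each.
sequences = [rng.normal(loc=100, scale=10, size=(24, 1)) for _ in range(20)]
X = np.concatenate(sequences)
lengths = [len(s) for s in sequences]

# Fit a two-state Gaussian HMM on the pooled sequences.
model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=100, random_state=0)
model.fit(X, lengths)

# Score each sequence; unusually low log-likelihoods may indicate anomalous patterns.
scores = [model.score(s) for s in sequences]
threshold = np.percentile(scores, 5)   # illustrative cut-off, not a real rule
for i, s in enumerate(scores):
    if s <= threshold:
        print(f"sequence {i} looks unusual (log-likelihood {s:.1f})")
```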
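Similarly, here is a minimal sketch of a Benford's law check in Python; the same calculation is straightforward in R or SPSS. The amounts are invented for illustration, and a serious test would add a formal goodness-of-fit statistic (e.g. chi-squared) rather than comparing the frequencies by eye.

```python
# Minimal Benford's law sketch: compare observed leading-digit frequencies
# with the frequencies Benford's law predicts. The amounts are invented examples.
import math
from collections import Counter

amounts = [1250.0, 934.10, 17.25, 4321.0, 88.0, 263.7, 1999.0, 540.0, 7.3, 112.0]

# Leading (first significant) digit of each amount
leading = [int(str(abs(a)).lstrip("0.")[0]) for a in amounts if a != 0]
observed = Counter(leading)

print("digit  observed  expected (Benford)")
for d in range(1, 10):
    expected = math.log10(1 + 1 / d)          # Benford probability for digit d
    obs_share = observed.get(d, 0) / len(leading)
    print(f"{d:>5}  {obs_share:>8.2f}  {expected:>8.2f}")
```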
Though SPSS is fairly easy to get started with, it can be difficult to collaborate around, as it uses its own SPSS data format. Some models may also be unavailable in the basic SPSS package. R is a free alternative in which all extensions are accessible and community support and code samples are widely available. One possible compromise, bridging the convenience of SPSS and the wide usability of R, is the proprietary distribution Revolution R.
| Issue | Tools | Level | Notes |
| --- | --- | --- | --- |
| Basic visualisations, time series, bar charts | DataWrapper, Tableau Public, Many Eyes, Google tools | Basic | |
| More advanced visualisation | D3.js, Raphael, Rickshaw | Advanced | Used in e.g. OpenBudgetOakland |
| Mapping | TileMill, Fusion Tables, Kartograph, QGIS | Basic to Advanced | |
| Creating a citizen's budget | OpenSpending.org, with the Disqus commenting module added for commenting and feedback | Making a custom visualisation with OpenSpending.org: basic. Making a custom site enabling discussion: advanced. | Used in e.g. OpenBudgetOakland |
| Issue | Tools | Level | Notes |
| --- | --- | --- | --- |
| Need a place online to store and manage raw data, especially from Freedom of Information requests | DataNest, CKAN, Socrata, various data portal software options | Basic to use; intermediate to advanced to install. | Can require a programmer to set up and run a new instance. |
| Individual storage of, and online collaboration around, datasets | Google Spreadsheets, Google Fusion Tables, GitHub | GitHub: intermediate. All others: basic. | |