This chapter outlines a set of staple tools for data-driven research, explaining what each is useful for and what the barriers to entry are.
| Issue | Tools | Level | Notes |
| --- | --- | --- | --- |
| Data not available | Freedom of Information portals (e.g. WhatDoTheyKnow, Frag den Staat) | Basic, though some education may be required: people need to know that they have the right to ask, how to phrase an FOI request, whether requests can be submitted electronically, and so on. | While Freedom of Information portals are a good way of getting data, the results often end up scattered. It would be useful to have responses structured into data directories so that successful responses could be searched together with proactively released data, giving one common source for data. |
| Data available online but not downloadable (e.g. in HTML tables on webpages) | For simple sites (e.g. information on a single webpage): Google Spreadsheets and the ImportHTML function, or the Google Scraper extension (basic). For more complex sites (information spread across numerous pages): a scraper is required. Scrapers extract structured information from websites using code; ScraperWiki is a useful online tool that makes this easier (advanced). | At the basic level, anyone who can use a spreadsheet and its functions can do this. ImportHTML is not a well-known function, however, and awareness needs to be spread about how it can be used: people are often unnecessarily daunted because they presume scraping involves code. Scraping with code is advanced and requires knowledge of at least one programming language. | In the Open Knowledge Foundation's international interviews with data-driven researchers, the need to scrape was mentioned in every country interviewed. For more information, or to start learning to scrape, see the School of Data course on scraping. A minimal code sketch is given after this table. |
| Data available only in PDFs or images | A variety of tools are available to extract this information. The most promising non-code options are ABBYY FineReader (not free) and Tabula (new software, still somewhat buggy, and users have to host it themselves). | Advanced. Most tools require knowledge of coding, though some progress is being made on non-technical tools. For more information, and to see some of the advanced methods, see the School of Data course. | These tools are still imperfect, and it remains vastly preferable to advocate for data in the correct formats rather than to teach people how to extract it. Recently published guidelines from the UK and US governments can now be cited as examples when asking for data in the required formats. |
| Leaked data | Several projects make use of secure dropboxes and services for whistleblowers, for example MagyarLeaks. | Advanced. Security is of the utmost concern. | |
| Data no longer available online | Internet Archive Wayback Machine | Basic | |
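To give a sense of the two levels in the scraping row above: in Google Spreadsheets, the formula `=IMPORTHTML("http://example.org/page", "table", 1)` pulls the first HTML table on a page into the sheet without any code. The code-based equivalent is only a few lines longer. Below is a minimal sketch in Python using the pandas library; the URL and output file are placeholders for illustration, not a real data source.

```python
# Minimal scraping sketch: extract an HTML table from a web page into a CSV file.
# The URL below is a placeholder; substitute a real page containing a <table>.
# pandas.read_html needs an HTML parser such as lxml installed.
import pandas as pd

url = "http://example.org/spending-by-department"

# read_html returns a list of DataFrames, one per <table> found on the page
tables = pd.read_html(url)
spending = tables[0]

print(spending.head())                        # quick look at the extracted rows
spending.to_csv("spending.csv", index=False)  # save for further analysis
```

For information spread across many pages, the same idea applies, but the script has to loop over the page URLs and concatenate the results, which is where services such as ScraperWiki help.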
| Issue | Tools | Level | Notes |
| --- | --- | --- | --- |
| Messy data: typos, blanks, etc. | Spreadsheets; Open Refine; powerful text editors (e.g. TextWrangler) plus knowledge of regular expressions; messytables | Basic: spreadsheets. Intermediate: Open Refine. Advanced: messytables, regular expressions. | A short cleaning sketch using regular expressions is given after this table. |
| Managing and transforming data with code | Miso, Recline.js | Advanced | |
| Reconciliation of entity names | Nomenklatura, OpenCorporates | Advanced | Reconciling entities is complicated, both because of the tools needed and because of the often corrupt state of the data. |
| Mapping names to geographic data | GeoNames | | |
| Visualising networks and relationships between entities | Gephi | Intermediate to Advanced | |
| Working with many lines of data (too big to fit in Excel) | OpenSpending.org; other database software (PostgreSQL, MySQL); command-line tools | Basic to Advanced | As few countries currently release transaction-level data, this is not a frequent problem, but it is already an issue in places such as Brazil, the US, and the UK. As we push for greater disclosure, this will be needed ever more. |
| Performing repetitive tasks or modelling | Excel macros | Basic to Intermediate | |
| Slicing and dicing CSV files | csvkit Python library | Advanced | |
| Entity extraction (e.g. from large bodies of documents) | OpenCalais, Yahoo/YQL Content Analysis API, TSO data enrichment service | Intermediate | This is far from a perfect method, and it would be vastly easier to answer questions about entities if they were codified with a unique identifier. |
| Analysis of datasets published in different languages (e.g. in India) | To some extent, Google Translate for web-based data | Basic | Still searching for a solution to automatically translate offline spreadsheets. |
| Figures change in data after publication | For non-machine-readable data: tricky. For simple, machine-readable formats such as CSV: version control tools (e.g. Git, Mercurial). For web-based data: some scrapers can be configured to trigger an alert (e.g. email someone) whenever a field changes. | Intermediate to Advanced | Future projects that are likely to tackle this problem: DeDupe. |
| Finding statistical patterns in spending data | R (free), SPSS (proprietary), and other statistical software for clustering and anomaly detection (see the note below) | Advanced | Example: data from Supervizor has been used to track changes in spending on contractors across changes of government. A note on statistical analysis software follows below. |
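As a concrete illustration of the "messy data" row above, here is a minimal cleaning sketch using regular expressions in Python. The sample values are invented examples of the kinds of typos and formatting noise found in real spending tables; the parsing rules are assumptions for illustration, not a general-purpose parser.

```python
# Minimal cleaning sketch: normalise messy amount fields with a regular expression.
# The sample values are invented examples of common formatting problems.
import re

raw_values = ["1,234.50", "£2 000", " 350.00 ", "n/a", "", "4.120,75"]

def parse_amount(value):
    """Strip currency symbols, spaces and thousands separators; return a float or None."""
    cleaned = re.sub(r"[^\d,.\-]", "", value)   # keep digits, separators and minus signs
    if not cleaned:
        return None
    # Treat a trailing ',dd' as a decimal comma (e.g. '4.120,75' -> 4120.75)
    if re.search(r",\d{2}$", cleaned):
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        cleaned = cleaned.replace(",", "")
    return float(cleaned)

print([parse_amount(v) for v in raw_values])
# [1234.5, 2000.0, 350.0, None, None, 4120.75]
```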
Note on SPSS and R: It is our impression that interviewees had largely been trained to use SPSS. R is important to mention, however, as it offers free access to a broad range of the same models, though through a programming interface.
A few examples of analyses of spending data that can be done with statistical software such as SPSS or R:
a) Hidden Markov Models: Hidden Markov Models were originally developed for finding patterns in bioinformatics, but they have turned out to be useful for predicting fraudulent and corrupt behaviour as well. They were, for example, used to analyse spending data covering 50 million transactions in the Slovenian platform Supervizor. Using Hidden Markov Models requires high-quality data. A minimal code sketch is given after this list.
b) Benford's law: Benford's law describes how the leading digits of figures in many real-world datasets should be distributed, so the digits in your data can be checked against that expected distribution. Deviations from it can help detect fraudulent reporting (e.g. if companies tend to report earnings of less than $500 million in order to fit a particular regulation, Benford's law could be a tool to detect that). Check out this example using Benford's law to test the release of all Danish corporate tax filings, and this R blog post on the topic. A minimal code sketch is also given after this list.
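For readers who want a sense of what a Hidden Markov Model analysis looks like in code, below is a minimal sketch using Python and the hmmlearn library instead of SPSS or R. The data are randomly generated stand-ins for payment sequences, and the number of states and the anomaly threshold are illustrative assumptions, not the method actually used in Supervizor.

```python
# Minimal Hidden Markov Model sketch for flagging unusual spending sequences.
# Uses the hmmlearn library; the data here are random stand-ins, not real transactions.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

# Pretend each row is a monthly payment amount: 20 sequences of 24 months each.
sequences = [rng.normal(loc=100, scale=10, size=(24, 1)) for _ in range(20)]
X = np.concatenate(sequences)
lengths = [len(s) for s in sequences]

# Fit a two-state Gaussian HMM on the pooled sequences.
model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=100, random_state=0)
model.fit(X, lengths)

# Score each sequence; unusually low log-likelihoods may indicate anomalous patterns.
scores = [model.score(s) for s in sequences]
threshold = np.percentile(scores, 5)   # illustrative cut-off, not a real rule
for i, s in enumerate(scores):
    if s <= threshold:
        print(f"sequence {i} looks unusual (log-likelihood {s:.1f})")
```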
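Similarly, here is a minimal sketch of a Benford's law check in Python; the same calculation is straightforward in R or SPSS. The amounts are invented for illustration, and a serious test would add a formal goodness-of-fit statistic (e.g. chi-squared) rather than comparing the frequencies by eye.

```python
# Minimal Benford's law sketch: compare observed leading-digit frequencies
# with the frequencies Benford's law predicts. The amounts are invented examples.
import math
from collections import Counter

amounts = [1250.0, 934.10, 17.25, 4321.0, 88.0, 263.7, 1999.0, 540.0, 7.3, 112.0]

# Leading (first significant) digit of each amount
leading = [int(str(abs(a)).lstrip("0.")[0]) for a in amounts if a != 0]
observed = Counter(leading)

print("digit  observed  expected (Benford)")
for d in range(1, 10):
    expected = math.log10(1 + 1 / d)          # Benford probability for digit d
    obs_share = observed.get(d, 0) / len(leading)
    print(f"{d:>5}  {obs_share:>8.2f}  {expected:>8.2f}")
```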
Though SPSS is fairly easy to get started with, it can be difficult to collaborate around, as it uses its own SPSS data format. Some models may also be unavailable in the basic SPSS package. R is a free alternative in which all extensions are accessible and community support and code samples are widely available. One possible compromise, bridging the convenience of SPSS and the wide usability of R, is the proprietary distribution Revolution R.
| Issue | Tools | Level | Notes |
| --- | --- | --- | --- |
| Basic visualisations, time series, bar charts | DataWrapper, Tableau Public, Many Eyes, Google tools | Basic | |
| More advanced visualisation | D3.js, Raphael, Rickshaw | Advanced | Used in e.g. OpenBudgetOakland |
| Mapping | TileMill, Fusion Tables, Kartograph, QGIS | Basic to Advanced | |
| Creating a citizen's budget | OpenSpending.org, with the Disqus commenting module added for commenting and feedback | Making a custom visualisation with OpenSpending.org: basic. Making a custom site enabling discussion: advanced. | Used in e.g. OpenBudgetOakland |
| Issue | Tools | Level | Notes |
| --- | --- | --- | --- |
| Need a place online to store and manage raw data, especially from Freedom of Information requests | DataNest, CKAN, Socrata, various data portal software options | Basic to use; intermediate to advanced to install. | Can require a programmer to set up and run a new instance. |
| Individual storage of, and online collaboration around, datasets | Google Spreadsheets, Google Fusion Tables, GitHub | GitHub: intermediate. All others: basic. | |