Hackathon Diaries: Prospecting for Charity Donors
Earlier this year I participated in a charity hackathon hosted by the BI team at Barclays.
Charities, like any organisation nowadays, gather large amounts of data related to their operations but, unlike the big companies that employ many of us, generally can’t budget for teams of BI consultants to extract value from it. So here was my chance to reconcile all the times I’ve pretended to receive a phone call when walking past a charity mugger on the street.
I wanted to do justice to any data arriving in my inbox. Rather than rendering numbers to simply reconstruct a familiar story for the charity, I set myself the goal of delivering some deeper insight… something that could inspire action or at least stimulate productive discussions among decision-makers.
I also had more selfish motives for throwing my name in the hat. Forcing myself to deliver something meaningful in scant time would push me out of my comfort zone, and into a zone conducive to personal development. A week to deliver an end-to-end project alongside a full-time job is a tall order that’s bound to have you thinking on your feet, while pushing your data exploration arsenal to its limit. We all know it can sometimes take the best part of a week just to familiarise yourself with a novel dataset, as you try to reason how the numbers relate to actual operations at the ground level.
I formed a team with Lukas Deibel, a colleague from the German branch of The Information Lab. We decided to tackle different but complementary components of the project: myself looking at geo-spatial patterns in charity support, and Lukas at how these patterns change over time. Our efforts earned us runner-up position with a special mention from the guest judges. This was a nice little surprise, considering on day 6 it was touch and go whether we’d have anything concrete to deliver. It also suggested there might be something of value in our approach worth packaging into a blog post and sharing.
Below I’ve provided a step-by-step walk-through of my analytical process, including key decisions made to focus the angle of interrogation, and the application of predictive modelling techniques to extract meaning from complex support patterns, and in turn inform fundraising strategy.
1. Eye-balling the data
Our assigned charity (which I can’t disclose due to an element of commercial sensitivity) set us quite a simple yet open brief: to find interesting geo-spatial insights into how their supporters are distributed over the UK.
Supporters are people or organisations who have donated directly or raised money somehow, e.g. through running a marathon, organising a church cake sale, or jumping in a bath of baked beans. As the main revenue source for the charity, understanding how support is distributed over the UK will help to assess where campaigns have/haven’t been working, and to identify opportunities for more targeted marketing – vital information to the survival of a charity with finite resources.
Our raw material was a list of ~75K unique supporters, each linked to a location right down to the ward level (the most granular UK electoral sub-division). A decent dataset in my opinion; simple enough so as not to overwhelm, and deep enough to support a range of analytical approaches – a high row count and a high spatial granularity together opened the door to a multitude of opportunities.
In fact, we considered ward an unnecessarily granular spatial unit for our purposes, and instead rolled support up to the parliamentary constituency (PC) level. There were 3 main justifications for this:
(1) Patterns were quite evident at this level, with initial explorations revealing high variability among constituencies.
…In landscape ecology, where explicit consideration of scale is central to theory and practice, the level at which the dominant pattern emerges is referred to as the characteristic scale. This is inherently the level at which the mechanisms of interest are likely to operate, and analyses are likely to be most fruitful.
(2) Parliamentary constituencies are still numerous enough (~570 constituencies across England & Wales) to provide ample statistical power in quantitative analyses.
(3) From the charity’s perspective, parliamentary constituencies seemed a meaningful level to think of communities as entities that can be practically targeted with marketing campaigns.
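The roll-up itself is a simple aggregation. A minimal sketch in pandas, using a hypothetical ward-to-constituency lookup (the column names and figures here are illustrative, not the charity’s real data):

```python
import pandas as pd

# Hypothetical ward-level supporter counts, each ward mapped to its
# parliamentary constituency (PC). All names and numbers are made up.
wards = pd.DataFrame({
    "ward": ["Ward A", "Ward B", "Ward C"],
    "constituency": ["PC 1", "PC 1", "PC 2"],
    "supporters": [12, 7, 30],
})

# Roll ward-level counts up to the characteristic scale: the constituency.
by_pc = wards.groupby("constituency", as_index=False)["supporters"].sum()
print(by_pc)
```

In practice this was a one-step summarise in Alteryx, but the logic is the same: collapse ~75K supporter rows into one support count per constituency.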
2. Normalising the data
Alone, the raw figures provided in the dataset are not particularly informative. For starters, if you were to map support numbers across constituencies, you’d see a picture heavily skewed by population density. Therefore, before doing anything else I jumped onto NOMIS – an online repository of census data – to normalise values by population size, providing a supporters/1000 residents density value for each constituency (mapped below; click on the image to interact).
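The normalisation step is straightforward: join the support counts to census population figures and divide. A sketch in pandas with fabricated numbers (column names and values are illustrative only):

```python
import pandas as pd

# Hypothetical support counts per constituency and NOMIS-style
# census populations. All figures are invented for illustration.
supporters = pd.DataFrame({
    "constituency": ["Hackney South & Shoreditch", "St Austell & Newquay"],
    "supporter_count": [420, 150],
})
population = pd.DataFrame({
    "constituency": ["Hackney South & Shoreditch", "St Austell & Newquay"],
    "residents": [130000, 100000],
})

# Join the two tables and express support as a density per 1,000 residents,
# removing the skew that raw counts inherit from population size.
merged = supporters.merge(population, on="constituency")
merged["supporters_per_1000"] = merged["supporter_count"] / merged["residents"] * 1000
print(merged[["constituency", "supporters_per_1000"]])
```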
With a level playing field we now have a basis to pick out interesting patterns in supporter distribution, and formulate hypotheses of causality. For example, we can see a lot of support in and around London, and tailing off into the South-West. Are people here wealthier and more likely to donate? Alternatively, is this a reflection of mentality/value systems of people in these areas? Also, what’s happening in that cluster of supportive constituencies towards the Scottish border?
Instead of using folk wisdom and stereotypes to speculate, we are better off providing a quantitative basis to interpret and rationalise patterns.
3. Explaining patterns
Patterns of support are evidently complex, and it would be naive to assume that any single variable would adequately explain the nuances of the support map above. Instead, patterns will be shaped by myriad variables operating over different scales: some may have broad countrywide influence, while others may be more influential in shaping finer-scale patterns at the local level. Interactions can also contribute further texture to the picture. E.g. age may be negatively associated with support in rural areas but positively associated with support in built-up areas – a potentially valuable nugget of wisdom to a charity looking to parsimoniously distribute their resources.
To enable this complexity to be fully captured, I chose to model the data with an analytical technique that does a good job of preserving hierarchical structure in the data – a decision tree.
Decision Trees work by repeatedly partitioning samples (in this case parliamentary constituencies) into smaller sub-groups based on split points in the predictor variables, such that the values in each sub-group remain as similar as possible. The resulting tree defines patterns in the data with a series of if-then statements, providing an intuitive framework for understanding and navigating complex patterns. The mechanics of decision trees are elegantly explained in this animated walk-through.
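The modelling for this project was done in Alteryx (more on that below), but the same idea can be sketched in Python with scikit-learn. Everything in this snippet is synthetic and illustrative, not the charity’s data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
n = 500

# Synthetic constituency-level predictors (names are illustrative).
pct_professional = rng.uniform(10, 40, n)
median_age = rng.uniform(30, 50, n)
is_south = rng.integers(0, 2, n)  # crude stand-in for a Region grouping

# Fabricated response: support density higher in the South and in
# constituencies with more professionals, plus noise.
support_density = 2 * is_south + 0.1 * pct_professional + rng.normal(0, 0.5, n)

X = np.column_stack([pct_professional, median_age, is_south])

# Fit a shallow tree: each split is an if-then rule that partitions
# constituencies into sub-groups with increasingly similar densities.
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=30)
tree.fit(X, support_density)

print(export_text(tree, feature_names=["pct_professional", "median_age", "is_south"]))
print("R^2:", tree.score(X, support_density))
```

Limiting depth and leaf size, as here, is the equivalent of pruning: it keeps the rules few and interpretable rather than chasing noise.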
Like any predictive technique, a decision tree is only as powerful as the explanatory/predictor variables factored into the model. Recent analyses of election/referendum results had armed me with some preconceptions about factors associated with charity support, while a paper I found on characteristics affecting charitable donations in Britain corroborated these ideas.
With this in mind, I went back to NOMIS and pulled out the following explanatory variables, all handily available at parliamentary constituency level:
- Median Age
- Occupational Characteristics: % in professional roles, % in trade-type roles
- Median Salary
- Educational Qualifications: % residents with university degree
Alongside these, I also included Region (which splits the UK into 10 sub-divisions) as a predictor, to capture any geographic influences that might operate independently of the socio-demographic factors.
Note: These variables will not tell us definitively or explicitly which mechanisms drive support, but instead will describe characteristics of communities where support is more likely.
I built the decision tree in Alteryx. Alteryx is perhaps best known for its data cleaning and preparation capabilities, but it also has a very powerful suite of tools for predictive analytics. These tools are powered by the R language, but easily deployed through drag & drop functionality, and configured through a user interface – making it ideal for quickly playing around with a variety of model iterations.
The resulting Decision Tree model explained ~60% of the variability in the data with only 3–4 splits.
In other words, our small bag of predictor variables had done a pretty good job of accounting for support patterns. The remaining ~40% unexplained variability represents a combination of noise and influence of factors we didn’t/couldn’t control for.
The final pruned tree is visualised below. Click on the image to interact.
The biggest split in the data came from a grouping of Southern vs. Northern regions, dividing the UK in half and perhaps reinforcing popular notions of some difference in mentality, value system, or lifestyle either side of this imaginary line.
Within these 2 broad groupings, lots of patterning remains – explained by subsequent splits in the tree. Within the Southern half of the UK, where support on average is greater, constituencies with a high proportion of professionals (>22% of workers) foster particularly high supporter densities.
Then, among those professional Southern constituencies, those with a particularly low representation of tradesmen have even greater densities. What we are left with is a cluster of affluent constituencies in and around West London, and interestingly also Bristol – consistent with its reputation as an outpost for the liberal ‘metropolitan elite’, with large student and hippy contingents.
On the Northern branch, where support on average is lower, the same sort of demographic factors separate constituencies. Prevalence of university education is the biggest split point, with occupational characteristics and income having secondary influence.
So what we have here is essentially a predictive model informing the charity where they are most likely to find support based on geography and characteristics of inhabitants. Great!
Not so fast… there’s a slight caveat here.
Much of what we have just described may be a self-fulfilling prophecy. Perhaps the patterns aren’t a true reflection of support propensity, but more a reflection of the charity’s historical engagement strategy. Separating these influences would mean requisitioning additional data and counsel from the charity – something out of the scope of this week-long exercise. I was assured by my charity contact, however, that it would be reasonable to assume an even marketing distribution for this exercise, though we should still be wary of its potentially confounding influence.
4. Teasing out the signal
Whether the patterns explained above are a full reflection of support propensity or not, the socio-demographic links – though interesting – are also kind of intuitive. We can more or less guess where support will be greatest based on our existing knowledge of areas.
I was verging into territory I initially set out to avoid – ‘rendering numbers to simply reconstruct a familiar story for the charity’.
So at this stage I asked myself: what can the charity actually do with this information? And how could I add value to this predictive model of support?
An oft fruitful pathway of exploratory data analysis is to look for the extreme values, or the exceptions to the rule. In this instance, finding out where support is unexpectedly high or unexpectedly low would surely have a direct link to action. Areas that are under-performing could represent focal points of opportunity, while areas that are over-performing could hold secrets to garnering extra support elsewhere.
Our decision tree provides a robust model of expectation, telling us what levels of support to expect given the socio-demographic characteristics of an area (be it direct effect or mediated by marketing effort). Now, if we strip away the support attributable to these characteristics, any remaining pattern will by definition be unexpected. The more extreme the deviations from expected (in either direction), the stronger the signal, the bigger the surprise, and the greater the intrigue. I have mapped these surprise values below (click to interact).
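Mechanically, a ‘surprise’ value is just a residual: observed support density minus the density the model predicts from socio-demographics alone. A minimal sketch on synthetic data (all values fabricated for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n = 200

# Hypothetical socio-demographic predictors and observed support densities.
X = rng.uniform(0, 1, (n, 3))
actual = 3 * X[:, 0] + rng.normal(0, 0.3, n)

# The fitted tree acts as the model of expectation.
model = DecisionTreeRegressor(max_depth=3, min_samples_leaf=20).fit(X, actual)
expected = model.predict(X)

# Surprise = observed minus expected. Large positive values flag
# over-performing constituencies; large negative ones, under-performers.
surprise = actual - expected
print("most over-performing index:", int(np.argmax(surprise)))
print("most under-performing index:", int(np.argmin(surprise)))
```

Mapping these residuals, rather than the raw densities, is what turns the model from a description of the familiar into a prospecting tool.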
The value and utility of visualising ‘surprise’ over actual values is championed by Jeffrey Heer and co. of the UW Interactive Data Lab, which heavily influenced my thinking at this stage.
The resulting ‘Surprise Map’ provides a tool for the charity to ‘prospect’ for opportunity and added insight. Both blue cells (over-performing constituencies) and red cells (under-performing constituencies) are interesting to the charity in their own way.
Red: Maybe these PCs are being neglected? They are home to people who are theoretically receptive to the charity, but perhaps they aren’t being targeted effectively?
Of particular note are the clusters of red constituencies in and around Birmingham (the UK’s second largest metropolitan area) and Brighton (a town with quite a liberal, educated populace). These areas represent untapped opportunities that could potentially be exploited by deploying more fundraisers on the ground. Or…maybe traditional methods haven’t been working here, and resources would be better spent targeting responsive factions of these communities through alternative marketing channels.
The blue over-performing constituencies might hold secrets to help inform such a change in tack.
Blue: What things are we doing right in these PCs that could be extended to enhance engagement elsewhere?
The dark blue constituencies, such as St. Austell & Newquay in the South-West and Hackney South & Shoreditch in East London, are anomalies that warrant ad hoc interrogation. Perhaps success here can be tied back to bespoke engagement strategies or greater local presence. If so, similar methods could be transplanted to comparable communities elsewhere.
The connected groups of blue constituencies, such as the little cluster by the Scottish border and those chained through Southern Wales, also provide interesting focal points – perhaps indicating a local social influence not captured by our predictor variables. Interrogating these will at worst help to improve future models, and at best provide the charity with a key insight into unexpected sources of support.
5. Closing Thoughts
So despite starting with a relatively basic dataset, we’ve managed to uncover some pertinent insights that can readily be acted upon.
Supplementing the dataset with open source census data supported deeper analysis that allowed us to control for the banal and highlight salient cases which demand attention. In doing so we could provide the charity with a signposted trail to identify low-hanging fruit, instead of sending them on a wild goose chase littered with false positives.
This process wasn’t a linear trajectory, but more a meandering pathway. The decision tree initially used to make sense of support patterns subsequently became the model of expectation, acting as a baseline from which to tease out ‘surprise’ values.
In other words, there was no real plan and things could have gone either way. However, by considering our options at each stage, iterating quickly, and framing the problem from the stakeholder’s perspective I feel we managed to keep this on track and boost our chances of delivering something of worth.
Although we didn’t win, the subsequent response from the charity was reward in itself. It was great to see charity employees with domain expertise connecting with our findings and getting excited about the utility of BI in their organisation. Lukas and myself have since been invited to give an extended presentation to the board of directors.
Off the back of the hackathon, I also had the privilege of spending the day at Google HQ with one of the guest judges, Anders, and a couple of other winners, discussing creative ways of integrating technology into business operations… high-brow conversation not accurately represented in the pictorial evidence.