Computers are reasonably good at analysing large datasets, but there is one class of problem where they require a bit of help from puny humans: high-dimensional datasets. By ‘high-dimensional’, we mean ‘wide’, as in lots of columns. When we have wide data, it’s very hard to spot commonalities across a number of those columns. For example, if we have data from a large number of sensors, all of which have something to say about what’s going on, it’s very hard to detect what is similar about all those readings when a particular type of event occurs.
A typical example of such a difficult problem is predicting cancer outcomes based on levels of gene expression. We have measurements of hundreds (sometimes thousands) of genes, all of which interact to influence medical outcomes, such as survival rates, response to particular chemotherapies and growth rates. We also have thousands of patients, so we end up with a table that is hundreds (or thousands) of columns wide and thousands of rows deep.
It’s impossible for a human to work out what combinations of levels of gene expression predict a given outcome. However, it’s also a very difficult problem for a computer to solve, because the number of possible combinations of values it has to compare grows explosively with the number of columns; this is known as the ‘curse of dimensionality’. A computer also needs to know what to look for if it’s going to come up with anything useful in a reasonable timeframe, which pretty much means you need to know the outcome before you run the analysis. A fat lot of good that is.
One answer to this problem is to combine human cognition with machine learning to circumvent some of the heavier computation. The computer organises the data in an unbiased fashion and presents it as a graphical model to the analyst. At this point, the (as yet) uniquely human skill of pattern recognition comes into play, and the analyst is able to spot shapes and clusters in the data that indicate something interesting is going on. The tricky bit is in getting the computer to present something that helps the analyst. The traditional approach is to conduct what is known as a dimensional collapse (or dimension reduction) on the data. This sounds like a cataclysmic event from an episode of Doctor Who, but is actually just a way to simplify the data.
Before looking at dimension reduction, let’s look at a simpler example: a dataset with three columns of numbers. We can treat each row as a coordinate in 3D and plot the points. When the points are visualised, we can immediately tell if there is a random distribution (a cloud of points) or if some clusters are present. If we’re getting clusters, it’s a good indication that there is some commonality in the data. If we then colour-code our points based on some other variable (e.g. patient survival rate) we can immediately see if any of the clusters have a dominant colour. If they do, then we’ve found a pattern in our data that’s worth investigating.
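As a minimal sketch of that idea (the data and the outcome variable below are made up purely for illustration), plotting three columns as 3D coordinates and colouring each point by an outcome takes only a few lines of Python:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical dataset: 200 rows, three numeric columns, plus an outcome
# label standing in for something like patient survival.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
outcome = (X[:, 0] + X[:, 1] > 0).astype(int)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
# If a cluster of points comes out mostly one colour, it is worth investigating.
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=outcome, cmap="coolwarm")
ax.set_xlabel("column 1")
ax.set_ylabel("column 2")
ax.set_zlabel("column 3")
plt.show()
```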
But what if I have a thousand columns – how do I visualise that? This is where the dimensional collapse comes in. Traditional approaches such as Principal Component Analysis (PCA) and more recent machine learning algorithms such as t-SNE allow us to collapse thousands of dimensions down to two or three while still (hopefully) maintaining proximity between the data points – i.e. if two points were close in n dimensions, they’ll also be close to each other in two or three dimensions. What this means is that if we see clusters in 2D or 3D, it’s a direct result of there being some similarity in the data, and we may have something worth investigating.
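As a rough sketch (the ‘gene expression matrix’ here is just random placeholder data), both kinds of collapse are a few lines with scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Placeholder for a wide dataset: 500 patients x 1,000 gene-expression levels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1000))

# Linear collapse: keep the two directions of greatest variance.
X_pca = PCA(n_components=2).fit_transform(X)

# Non-linear collapse: t-SNE tries to keep points that are near each other
# in 1,000 dimensions near each other in 2 dimensions.
X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X)

print(X_pca.shape, X_tsne.shape)  # (500, 2) (500, 2)
```

Either 2D result can then be scatter-plotted and colour-coded by an outcome variable, exactly as in the 3D example above.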
Such approaches have been around for years, and work well for spotting clusters. They’re not without their drawbacks, though. PCA is a linear projection, so if it isn’t used carefully it can bias what the analyst sees towards whatever happens to dominate the variance. t-SNE is a more neutral approach, but if the data is noisy and the signal is weak, it becomes very hard to find that signal in a cloud of points. What is needed is a way to increase the contrast and reveal the patterns to the user in a more obvious way.
This is what topological data analysis (TDA) is all about. TDA attempts to build a topological model of the data by grouping and linking data points that are similar in n dimensions. This can then be visualised as a network chart to show the ‘shape of the data’. This has the effect of making the clusters more obvious to the human eye, as well as highlighting smaller clusters that would be lost in the noise in a traditional dimensional collapse approach. It also reveals a shape to the data which tends to draw the eye to particularly interesting features.
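To make that concrete, here is a deliberately simplified sketch in the spirit of the Mapper construction often used in TDA (an illustration only, not any vendor’s actual algorithm): project the data through a ‘lens’, cluster within overlapping slices of that lens, then link clusters that share points.

```python
import numpy as np
import networkx as nx
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

def simple_mapper(X, n_intervals=8, overlap=0.3, eps=0.5, min_samples=3):
    """Tiny Mapper-style sketch: one node per cluster, edges where clusters overlap."""
    # Lens: a 1D projection of the n-dimensional data (PCA here, but any filter works).
    lens = PCA(n_components=1).fit_transform(X).ravel()
    lo, hi = lens.min(), lens.max()
    width = (hi - lo) / n_intervals

    graph = nx.Graph()
    members = {}  # node id -> set of row indices belonging to that cluster
    node_id = 0
    for i in range(n_intervals):
        # Overlapping interval of lens values (the "cover").
        start = lo + i * width - overlap * width
        end = lo + (i + 1) * width + overlap * width
        idx = np.where((lens >= start) & (lens <= end))[0]
        if len(idx) < min_samples:
            continue
        # Cluster the points that fall in this slice, in the original n dimensions.
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[idx])
        for label in set(labels) - {-1}:  # -1 is DBSCAN's noise label
            members[node_id] = set(idx[labels == label])
            graph.add_node(node_id, size=len(members[node_id]))
            node_id += 1

    # Link any two clusters that share data points: this is what retains the topology.
    ids = list(members)
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            if members[ids[a]] & members[ids[b]]:
                graph.add_edge(ids[a], ids[b])
    return graph

# The resulting graph can be drawn as a network chart: the 'shape of the data'.
X = np.random.default_rng(1).normal(size=(300, 10))
g = simple_mapper(X)
print(g.number_of_nodes(), g.number_of_edges())
```

In real TDA tools the lens, the cover and the clustering step are all far more sophisticated, but the idea of clusters as nodes, linked wherever they share members, is the same.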
The market leader for this sort of analytics is Ayasdi, and they make some bold, but supportable, claims about what the shape of the data can reveal, such as the claim that circular shapes in the model are indicative of time-series data.
The company I work for (illumr) takes a slightly different approach, but still with an emphasis on using shapes to draw the analyst’s eye to interesting aspects of the data. We don’t emphasise TDA, though our approach results in a topological model of the data. Debate rages (as only mathematicians can rage) on the internet about how TDA differs from dimensional collapse. This is somewhat missing the point, as the end-result is still a dimensional collapse but with the topology of the n-dimensional shape retained and presented in 2D (Ayasdi) or 3D (illumr).
There are different ways to achieve this, but Ayasdi tends to focus on a persistent homology approach – extracting the overall topology and displaying it as nodes and links. In the case of Ayasdi, each node corresponds to either a single data point or a cluster of points, depending on how they connect (the topology). The illumr approach results in a node for every data point, as the emphasis here is on finding weak signals in the data.
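As an illustration only (illumr’s actual algorithm isn’t described here, so this is just one generic way to end up with a node for every data point), such a graph can be built by linking each point to its nearest neighbours in the original n dimensions:

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import NearestNeighbors

def knn_graph(X, k=5):
    """One node per data point, linked to its k nearest neighbours in n dimensions."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)  # the first neighbour of each point is itself
    g = nx.Graph()
    g.add_nodes_from(range(len(X)))
    for i, neighbours in enumerate(idx):
        for j in neighbours[1:]:  # skip the point itself
            g.add_edge(i, int(j))
    return g

X = np.random.default_rng(2).normal(size=(200, 50))
g = knn_graph(X)
print(g.number_of_nodes(), g.number_of_edges())
```

Laying such a graph out in 2D or 3D gives every row its own node, so weak signals are not absorbed into larger cluster nodes.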
For instance, illumr worked with HouseMark, a provider of social housing data for the UK. Our algorithms let the data self-organise to reveal the inherent structure of the dataset in 3D, independent of any human bias. Nodes in the resulting model are coloured by the number of repairs (red = low number of repairs; green = high number of repairs).
The dataset has some clearly defined clusters associated with the number of repairs. By examining these clusters, we can reveal non-intuitive insights missed by other methodologies.
For example, there are important exceptions to the rule that maintenance costs increase with the age of the property. We identified and described a complex interaction between the age of the property, its size and its type (house or flat), which explains a large proportion of the variance in the number of repairs carried out.
This interaction would be impossible to identify without illumr’s exploratory data analysis, guided by self-organising the data in 3D space. Most importantly, because the analysis is based on the natural, inherent structure within the data, and not on blind machine learning, the interaction can be summarised in a human-understandable format.
Such findings are empowering housing providers to better understand and predict patterns of responsive repairs, helping to improve their operational efficiency.
Jason Lee is founder and CEO of illumr.