24 August, 2016

Is A Picture Worth A Thousand Words? The Truth About Big Data And Visualisation

Data visualisation has always been a vital weapon in the arsenal of an effective analyst, enabling complex data sets to be represented efficiently and complex ideas to be communicated with clarity and brevity. And as data volumes and analytic complexity continue to increase in the era of big data and data science, visualisation has come to be regarded as an even more vital technique – with a vast and growing array of new visualisation technologies and products coming to market.

Whilst recently preparing for a presentation on the Art of Analytics, I had reason to revisit Charles Minard’s visualisation depicting Napoleon’s disastrous Russian campaign of 1812. In case you aren’t familiar with this seminal work, it is shown below.
This visualisation has been described as “the best statistical graphic ever drawn”, and by no less an authority than Edward Tufte, author of “The Visual Display of Quantitative Information”, the standard reference on the subject for statisticians, analysts and graphic designers.
There are many reasons why Minard’s work is so revered. One reason is that he manages to represent six types of data – geography, time, temperature (more on this in a moment), the course and direction of the movement of the Grande Armée and the number of troops in the field – in only two dimensions.
A second is the clarity and economy that enable the visualisation to speak for itself with almost no additional annotation or elaboration. We can see clearly and at a glance that the Grande Armée set off from Poland with 422,000 men, but returned with only 10,000 – and this only after the “main force” was rejoined by 6,000 men who had feinted northwards, instead of joining the advance on Moscow.

And yet a third reason is that the visualisation was ground-breaking; though flow diagrams like these are named for the Irish engineer Matthew Sankey, he first used this approach nearly 30 years after the Minard visualisation was published. Today, Sankey diagrams are used to understand a wide variety of business phenomena where sequence is important. For example, we can use them to map how customers interact with websites so that we can learn the “golden path” most likely to lead to a high-value purchase – and equally to understand which customer journeys are likely to lead to the abandonment of purchases before checkout.
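To make the customer-journey idea concrete, here is a minimal Python sketch of the data preparation that typically sits behind a Sankey diagram: counting how often visitors move from one step to the next. The page names, the sample journeys and the `sankey_links` helper are all invented for illustration, and the source/target/value layout is simply one common way of describing the flows, not the input format of any particular charting product.

```python
from collections import Counter


def sankey_links(journeys):
    """Count step-to-step transitions across a list of journeys.

    Each journey is an ordered list of page names (hypothetical data).
    Returns one dict per distinct transition, sorted by source then target.
    """
    counts = Counter()
    for journey in journeys:
        # Pair each step with the step that follows it.
        for source, target in zip(journey, journey[1:]):
            counts[(source, target)] += 1
    return [
        {"source": s, "target": t, "value": v}
        for (s, t), v in sorted(counts.items())
    ]


# Invented sample journeys: two end in a purchase, two in abandonment.
journeys = [
    ["home", "product", "basket", "checkout"],
    ["home", "product", "basket", "abandon"],
    ["home", "search", "product", "basket", "checkout"],
    ["home", "search", "abandon"],
]

links = sankey_links(journeys)
for link in links:
    print(f'{link["source"]} -> {link["target"]}: {link["value"]}')
```

The width of each band in the finished diagram would be proportional to `value`, so the widest route into "checkout" is the candidate "golden path", and the bands flowing into "abandon" show where purchases are being lost.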
But even Minard’s model visualisation is arguably partial. Minard shows us the temperature that the Grande Armée endured during the winter retreat from Moscow – inviting us to conclude that this was a significant reason for the terrible losses incurred as the army fell back, as indeed it was.
However, the Russians themselves regarded the winter of 1812/1813 as unexceptional – and the conditions certainly did not stop the Cossack cavalry from harrying Napoleon’s retreating forces at every turn. Napoleon’s army was equipped only for a summer campaign – because Napoleon had believed that he could force the war to a successful conclusion before the winter began. As the explorer Sir Ranulph Fiennes has said, “There is no such thing as bad weather, only inappropriate clothing.”
Exceptional weather also affected the campaign’s advance, with a combination of torrential rain followed by extremely hot conditions killing many men from dysentery and heatstroke. But Minard either cannot find a way to represent this information, or chooses not to. In fact, he gives us few clues as to why the main body of Napoleon’s attacking force was reduced by a third during the first eight weeks of the invasion and before the major battle of the campaign – even though, numerically at least, this loss was greater than that suffered during the retreat the following winter.
Terrible casualties also arose from many other sources – with starvation as a result of the Russian scorched earth policy and inadequate supplies playing key roles. The state of the Lithuanian roads is regarded by historians as a key factor in this latter issue, impassable as they were to Napoleon’s heavy wagon trains both after the summer rains and during the winter. But again, Minard either cannot find a way to represent the critical issue (the tonnage of supplies reaching the front line) or its principal cause (the state of the roads) – or chooses not to.
Minard produced this work more than 50 years after the events it describes, at a time when many in France yearned for former Imperial glories and certainties. His purpose – at least if the author of his obituary is to be believed – seems to have been to highlight the waste of war and the futility of overweening Imperial ambition. It arguably would not have suited his narrative to articulate that Napoleon’s chances of success might have been greater had the Russia of 1812 been a more modern nation with a more modern transport infrastructure – or had Napoleon’s strategy made due allowance for the fact that it was not.
With the benefit of 20th-century hindsight, today we might still conclude that the vastness of the Russian interior and the obduracy of Russian resistance would have doomed even a better-planned and better-executed campaign; but that hindsight was not available in 1869, either to Minard or to the contemporaries he sought to influence.
Did Minard’s politics affect his choice of which data to include? Or were the other data simply not available to him? Or beyond his ability to represent in a single figure? From our vantage point 150 years after the fact, it is difficult to answer these questions with certainty.
But when you are looking at a data visualisation, you certainly should attempt to understand the author’s agenda, preconceptions and bias. What is it that the author wants you to see in the data? Which data have been included? Which omitted? And why? Precisely because good data visualisations are so powerful, you should make sure that you can answer these questions before you make a decision based on a data visualisation. Because whilst a good data visualisation is worth a thousand words, it does not automatically follow that it tells the whole truth.
This post first appeared on Forbes TeradataVoice on 31/03/2016.
