I was talking to somebody last week who had recently moved to San Francisco, and she randomly interjected Big Data into the conversation. She said she’d learned that’s what you do in SF – Big Data is a buzzword that can be used at any time on any topic. I found this amusing, because Big Data is becoming almost messianic – all of our problems will be solved once we have Big Data! Everybody should become a statistician or economist or data analyst!
My response? Don’t believe the hype.
The trends are unmistakable – humans are creating and capturing more data than ever before. IBM estimates that 90% of the data in the world today was created in the last two years. And the tools we have to sift through data are becoming ever more powerful, with open-source packages like Hadoop for MapReduce and R for statistical computing.
This flood of data and the tools to analyze it are creating market opportunities for businesses and career opportunities for individuals – the story of Target identifying a pregnant teenager from her purchases is a founding myth of Big Data. And some credit Obama winning the presidency to data analysis. So why am I skeptical?
There is a belief that if we could only quantify everything, we would be in control. The management saying is “You can’t manage what you don’t measure.” So if we have more data, and can measure everything, we should be able to manage everything! Except that the world is not that simple. Just because something has been quantified doesn’t mean that it is good or meaningful data – how the data is collected can introduce biases or spurious trends that render it useless for making decisions. And just because an analysis gives a numerical output does not make it useful knowledge or wisdom.
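To make the collection-bias point concrete, here is a minimal sketch in Python. Everything in it is a made-up assumption for illustration: the function name `biased_vs_true_mean`, the 0-to-10 satisfaction scale, and the rule that happier customers are more likely to answer a survey. It is not drawn from any real data set.

```python
import random
import statistics


def biased_vs_true_mean(n=10000, seed=1):
    """Compare the true average score with the average among survey responders.

    Hypothetical setup: each customer has a satisfaction score drawn uniformly
    from 0 to 10, and (by assumption) answers the survey with probability
    score / 10, so unhappy customers are under-represented.
    """
    rng = random.Random(seed)
    scores = [rng.uniform(0, 10) for _ in range(n)]
    # Keep only the customers who "chose" to respond.
    responses = [s for s in scores if rng.random() < s / 10]
    return statistics.mean(scores), statistics.mean(responses)


if __name__ == "__main__":
    true_mean, survey_mean = biased_vs_true_mean()
    print(f"true mean: {true_mean:.2f}, survey mean: {survey_mean:.2f}")
```

Under these assumptions the survey average comes out noticeably higher than the true average, even though every individual number recorded is accurate. The bias is in who got measured, not in the measurement.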
What I am seeing in the rush to Big Data is the urge to quantify things before understanding them. Recording 600 metrics and tossing them all into a database creates a ton of data, and analysts can spend weeks or months looking through it. But is that really driving value for an organization? Similarly, I’ve seen situations where an analyst uses a standard ARIMA model to forecast a trend with confidence (because it’s data-driven!), and is later surprised that the forecast is wrong because they never really understood the underlying data. Another example is when a consultant creates a 500-line Excel spreadsheet, where every possible variable is quantified and every change ripples through the spreadsheet… but of those 500 lines, 490 are assumptions, so it’s impossible to tell which variables really matter.
Another potential peril is when analysts start their work with a preconceived notion of the result they want to get. With Big Data, you have enough data to support almost any conclusion if you slice the data in the right way. One of my favorite stories about the perils of data analysis came from my time as an intern at CERN – a grad student was looking for a particular energy resonance from the L3 detector data, and displayed this beautiful graph showing that resonance. Dr. Sam Ting, Nobel Prize winner, smelled a rat – the result looked _too_ clean. He told the grad student to show the data with all of the filters removed, and the raw data showed nothing but noise. The student had applied the filters to show what he wanted to see. Note that I’ve seen similar things happen at Google – as a coworker commented to me recently, if even Google (and MIT grad students) can’t consistently get data analysis right, can anybody?
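The "slice it until it says what you want" problem is easy to reproduce. The sketch below is a toy simulation under assumptions of my own choosing: the outcome is pure random noise, each "slice" is a random yes/no attribute with no relation to the outcome, and a slice counts as a "finding" when a rough two-sample z-score exceeds 2. The function name `count_spurious_slices` and all of the parameter values are hypothetical.

```python
import random
import statistics


def count_spurious_slices(n_users=2000, n_slices=200, threshold=2.0, seed=0):
    """Count how many random slices of pure noise look 'significant'.

    The outcome has no real structure, so every slice that clears the
    threshold is a false discovery produced by trying many slices.
    """
    rng = random.Random(seed)
    outcomes = [rng.gauss(0, 1) for _ in range(n_users)]
    hits = 0
    for _ in range(n_slices):
        # A random attribute that is unrelated to the outcome by construction.
        in_slice = [rng.random() < 0.5 for _ in range(n_users)]
        a = [y for y, s in zip(outcomes, in_slice) if s]
        b = [y for y, s in zip(outcomes, in_slice) if not s]
        # Approximate z-score for the difference in group means.
        std_err = (statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b)) ** 0.5
        z = (statistics.mean(a) - statistics.mean(b)) / std_err
        if abs(z) > threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    hits = count_spurious_slices()
    print(f"{hits} of 200 meaningless slices cleared the threshold")
```

With a threshold of 2, roughly one slice in twenty would be expected to clear the bar by chance alone, so an analyst who tries a couple of hundred slices and reports only the winners will always have a story to tell. This is the same trap as the filtered graph, just automated.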
I worry that the quantification of the world in the form of Big Data is being seen by businesspeople as an end in itself, rather than as the tool it is. Like any tool, data analysis can be used well by those who have trained in its use, or it can be used poorly and cause damage by those without experience. Understanding data is hard. It takes time and effort, and while a well-constructed tool can accelerate that process, it doesn’t replace the need to sit and work with the data to understand its quirks and characteristics. After really understanding the data, you may discover that only 3 metrics out of 600 really matter, and so you don’t need Big Data to run your organization – just a dashboard with the 3 things that matter.
Big Data isn’t a silver bullet that will fix everything in your organization. It is a powerful tool that can help you better understand what is going on, but only if you spend the time to use it properly. Just because your analysts create output that is quantitative doesn’t mean it’s right. Trust, but verify. Use your judgment and all of your tools, including walking around, to figure out what to do, because in the end, you are the one responsible, no matter what the data says.