Friday, May 24, 2013

Review of "Big Data"

I had mentioned in an earlier post that I was looking forward to reading "Big Data: A Revolution That Will Transform How We Live, Work, and Think" by Viktor Mayer-Schonberger and Kenneth Cukier. Mayer-Schonberger and Cukier live (or at least write) by the law of threes, and they have three good points to make about the big data culture whose development we are witnessing:
  1. We will have so much more data that we won't need to sample;
  2. More data means that we won't need to worry so much about exactitude; and
  3. We will make decisions based on correlation, not causality.
Naturally, each of these ideas, when it gets a chapter of its own, gets developed a little further. (Each chapter has a clever one-word title.) In the chapter "More" the authors point out that, while a truly random sample can be quite small, it can be difficult to obtain one. Systematic biases, for example, often taint the collection process. Further, if you are analyzing a small, random sample, you often do not have enough data points to drill down further into the data. Modern computing power and the huge amount of data now available mean that analysts don't have to limit themselves to samples. Much bigger datasets allow us to "spot connections and details that are otherwise cloaked in the vastness of the information." So, they conclude, with data, bigger really is better.

Large datasets, they go on in a chapter entitled "Messy," will have several types of errors: some measurements will be wrong, and combining different datasets that don't always match up exactly will give approximations rather than exact numbers. But the tradeoff, say the authors, is worth it. As an example they offer language translation programs: simple programs with more data translate more accurately than complex models with less data. They are careful to add that the results are not exact. "Big data transforms figures into something more probabilistic than precise."

The chapter "Correlation" explains why it's not so important to know "why" when you can know, through correlations, "what" happens, or, to put it more precisely, what is more likely to happen. As the authors put it, with correlations, "there is no certainty, only probability." As a result, we need to be very chary of coincidence. (We often think we see causality when in fact we have observed correlation. Or coincidence.) They add that correlations can point the way to test for causal relationships.

So far, so good. The authors go on to chapters about the turning of information into data, and the creation or capture of value. The book is written in a breezy, accessible style; it never mentions the term "Bayesian," for example, although that is clearly what the authors are talking about. But towards the end the energy peters out, and the final chapters feel like filler. The chapter "Risks," which raises some entirely speculative concerns - that we might be punished simply for our "propensity" to behave in a certain way, for example - feels rushed and empty. Its over-simplification of the US criminal justice system made me wonder what else might have been altered beyond recognition. So read the first part of the book for its useful outline of what big data entails, but go elsewhere for a more serious discussion of the policy implications.
