
Friday

Review of "Big Data"

I had mentioned in a post that I was looking forward to reading "Big Data: A Revolution That Will Transform How We Live, Work, and Think" by Viktor Mayer-Schonberger and Kenneth Cukier. Mayer-Schonberger and Cukier live (or at least write) by the law of threes, and they have three good points to make about the big data culture whose development we are witnessing:
  1. We will have so much more data that we won't need to sample;
  2. More data means that we won't need to worry so much about exactitude; and
  3. We will make decisions based on correlation, not causality.
Naturally, each of these ideas, when it gets a chapter of its own, gets developed a little further. (Each chapter has a clever one-word title.) In the chapter "More" the authors point out that, while a truly random sample can be quite small, it can be difficult to obtain one. Systematic biases, for example, often taint the collection process. Further, if you are analyzing a small, random sample, you often do not have enough data points to drill down further into the data. Modern computing power and the huge amount of data now available mean that analysts don't have to limit themselves to samples. Much bigger datasets allow us to "spot connections and details that are otherwise cloaked in the vastness of the information." So, they conclude, with data, bigger really is better.
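
The authors' sampling point is easy to see in a toy example. Here's a minimal sketch (my own illustration with made-up numbers, not an example from the book): a small random sample estimates an overall average well, but as soon as you drill down into a single subgroup there are too few sample points left to be reliable, which is the argument for using all the data when you have it.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical "full" dataset: one million records spread over ten regions,
# with average spend that differs by region.
population = [(region, random.gauss(100 + 5 * region, 30))
              for _ in range(100_000) for region in range(10)]

sample = random.sample(population, 1_000)  # a small random sample

# The overall average is estimated well by the sample...
print(mean(s for _, s in population), mean(s for _, s in sample))

# ...but drilling down to one region leaves only ~100 sample points,
# so the subgroup estimate is noticeably noisier than the full-data value.
region9_all = [s for r, s in population if r == 9]
region9_sampled = [s for r, s in sample if r == 9]
print(mean(region9_all), mean(region9_sampled), len(region9_sampled))
```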

Large datasets, the authors continue in a chapter entitled "Messy," will have several types of errors: some measurements will be wrong, and combining different datasets that don't always match up exactly will give approximations rather than exact numbers. But the tradeoff, say the authors, is worth it. They offer language translation programs as an example: simple programs with more data are better at accurate translation than complex models with less data. They are careful to add that the results are not exact. "Big data transforms figures into something more probabilistic than precise."

The chapter "Correlation" explains why it's not so important to know "why" when you can know, through correlations, "what" happens, or, to put it more precisely, what is more likely to happen. As the authors put it, with correlations, "there is no certainty, only probability." As a result, we need to be very chary of coincidence. (We often think we see causality when in fact we have observed correlation. Or coincidence.) They add that correlations can point the way to test for causal relationships.

So far, so good. The authors go on to chapters about the turning of information into data, and the creation or capture of value. The book is written in a breezy, accessible style; it never mentions the term "Bayesian," for example, although that is clearly what the authors are talking about. But towards the end the energy peters out, and the final chapters feel like filler. The chapter "Risks," which raises some entirely speculative concerns - that we might be punished simply for our "propensity" to behave in a certain way, for example - feels rushed and empty. Its over-simplification of the US criminal justice system made me wonder what else might have been altered beyond recognition. So read the first part of the book for its useful outline of what big data entails, but go elsewhere for a more serious discussion of the policy implications.
Image via Amazon.com

Monday

One simple - too simple? - graph to explain US economy's performance

The graph comes from Thomson Datastream via Derek Thompson of TheAtlantic.com - and it shows that the US economy's performance over the last five years was better than that of comparable developed countries: a shallower recession with a faster recovery. Thompson attributes this performance to the facts that we:
(a) control our own currency and (b) used aggressive monetary policy to save the banks and lower interest rates while running high deficits.
Do you agree? What's your interpretation of the graph?

Tuesday

David Brooks is not thinking straight about big data

David Brooks is doing the public no favors in his column today in which he suggests, among other things, that analysis of big data is devoid of human interpretation, bias, or judgment. (I am looking forward to reading "Big Data" by Viktor Mayer-Schonberger and Kenneth Cukier.) Leaving aside the headline, which Brooks may not have written, the commenters take the argument apart pretty well. I would just add one more thing: Mr. Brooks gave Jim Manzi's "Uncontrolled" a pretty big push last year. Has he forgotten what he said then? Big data is big data.

You can read my review of "Uncontrolled" here. And if you haven't read it yet, you should.

Friday

NYC's 311 Map


That's a visualization of the 1,551,402 contacts New York City's 311 system received in 2012 - by phone, text, and online. New York City's Open Data project has posted the video. There are other visualizations available here.

Tuesday

The Signal and the Noise, by Nate Silver

I've been a fan of Nate Silver's work since the 2008 election when I, like perhaps many of you, obsessively checked his blog. I've always thought that his writing is clear and that he is transparent - to a point - about his methodology. So I was eager to read his very interesting book, "The Signal and the Noise."

What Silver sets out to do in this book is explore our ability to make predictions based on big data. Silver's main thesis is that we should be using Bayesian statistics to make and judge our predictions about the world. As Silver puts it,
The argument made by Bayes and Price is not that the world is intrinsically probabilistic or uncertain . . . It is, rather, a statement . . . about how we learn about the universe: that we learn about it through approximation, getting closer and closer to the truth as we gather more evidence. [Italics in original.]
As Silver acknowledges, this approach is not the one we are taught in school, or in classes on the history and philosophy of science. (For a review of that approach, read the first third or so of Jim Manzi's book "Uncontrolled." My review of "Uncontrolled" is here.) Instead, Silver argues, we use statistics that focus on our ability to measure events. We ask: given cause X, how likely is effect Y to occur? This approach raises lots of issues, such as separating cause from effect - we get mixed up a lot about the difference between correlation and causality. We mistake the approximation for reality. And we forget we have prior beliefs, and so allow our conclusions to be biased.

In contrast, Silver explains, the Bayesian approach is to regard events in a probabilistic way. We are limited in our ability to measure the universe, and Pierre-Simon Laplace, the mathematician who developed Bayes' theorem into a mathematical expression, found an equation to express this uncertainty. We state what we know, then make a prediction based on it. After we collect information about whether or not our prediction is correct, we revise the hypothesis. Probability, prediction, scientific progress - Silver describes them as intimately connected. And then he makes a broader claim:
Science may have stumbled later when a different statistical paradigm, which de-emphasized the role of prediction and tried to recast uncertainty as resulting from the errors of our measurements rather than the imperfections in our judgments, came to dominate in the twentieth century.
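
The updating Silver describes can be written down in a few lines. Here is a minimal sketch (my own illustration, with invented numbers, not an example taken from the book): state a prior probability, then apply Bayes' theorem each time a new piece of evidence arrives.

```python
def update(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' theorem: the revised probability that a hypothesis is true
    after observing one piece of evidence."""
    numerator = prior * p_evidence_if_true
    denominator = numerator + (1 - prior) * p_evidence_if_false
    return numerator / denominator

# Hypothetical example: we start by giving a candidate a 30% chance of
# winning (the prior). A favorable poll appears 70% of the time when the
# candidate goes on to win, and 40% of the time when she doesn't.
belief = 0.30
for _ in range(3):                      # three favorable polls in a row
    belief = update(belief, 0.70, 0.40)
    print(f"revised probability of winning: {belief:.2f}")
# Prints roughly 0.43, 0.57, 0.70 - each observation nudges the estimate.
```
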
Silver describes the use of Bayesian statistics (with greater or lesser rigor) in many contexts, including sports betting, politics, the stock market, earthquakes, the weather, chess, and terrorism. We are better at predictions in some of these contexts than in others, and he uses the chapters to illustrate various corollaries to his main theme. In his first chapter, on the 2008 financial meltdown, he identifies characteristics of failed predictions: the forecasters focused on stories that describe the world as we want it to be, ignored risks that are hard to measure, and produced estimates that were cruder than they appeared. On the other hand, in a chapter about sports data, he makes a compelling case for the premise that a competent forecaster gets better with more information. Throughout, he urges us to remember that data are not abstractions but need to be understood in context.

This is not a how-to book, and it certainly left me with many questions - how do you test social programs using Bayesian analysis, for example? But it is a very good starting point.

Image via amazon.com

Data managers and proxy data

I've been thinking about measurement lately (see also here for a slightly different view). So I read this post in the HBR Blog Network with some interest. In it, two consultants argue that corporations need a Chief Data Officer - and I think it's a concept that has an analogy in the not-for-profit world as well. The CDO's purpose, say the authors, Anthony Goldbloom and Merav Block, is "oversight and evangelism at the highest levels of management." Specifically, the person:
  • Figures out how data can be used to support the organization's most important priorities - it's easy, after you've gone to the trouble of setting up outcome measures and other data management systems, to think that the functions can now be handed off to mid-level staff. But there's always something pressing that new or different data might help resolve.
  • Keeps checking to make sure the organization is collecting the right data - what was once an outcome may now be an output, or you may have obtained funding to do further follow-up. Having a senior manager responsible for thinking about what is collected allows organizations to collect the data they need. It also opens up the possibility of experimenting with different service models, because you can compare the data.
  • Ensures the organization is able to collect the needed data - a top-level view combined with the clout of a senior manager means that the organization will probably make better decisions about allocating limited resources for developing or enhancing data systems.
Update: Just after I posted this I came across an article that nicely illustrates the second point. There is a great deal of uncertainty about whether hurricanes are increasing in number along with global warming - not all hurricanes strike land, and while we have good records from the 1970s forward (when we started using satellites to track hurricanes), that's not enough time to tell whether we are seeing natural fluctuations or a real change.

Now Alex Grinsted, a scientist at the University of Copenhagen, has published a report looking not at hurricanes but at storm surges, which have been measured reliably by tide gauges since the 1920s. It's a nice use of proxy data - a different type of measurement that helps you find the information you are actually looking for. Grinsted's conclusion?
“Using surges as an indicator  . . . we see an increase in all magnitudes of storms when ocean temperatures are warmer.” As ocean temperatures have risen inexorably higher in the general warming of the planet due to human greenhouse-gas emissions, the scientists concluded, hurricane numbers have moved upward as well. The implication: they’ll keep increasing along with global temperatures unless emissions are cut significantly.

Wednesday

Big data and the Olympics

NBC has unleashed a huge amount of data about how Americans watched the London Olympics: we watched on TV, yes, but we also watched on NBC's web sites, on our computers, tablets, and phones. We tweeted and posted comments on Facebook. And, the New York Times reports, a bunch of us participated in studies about what we did. The data don't seem to be available publicly yet, but NBC Universal's research lab shared some of its conclusions with the Times. Here are some of the findings:
[E]ight million people downloaded NBC’s mobile apps for streaming video, and there were two billion page views across all of NBC’s Web sites and apps. Forty-six percent of 18- to 54-year-olds surveyed said they “followed the Olympics during my breaks at work,” and 73 percent said they “stayed up later than normal” to watch, according to a survey of about 800 viewers by the market research firm uSamp . . . .
The results signaled vast changes from just two years ago in Vancouver, when tablets and mobile video streaming were still in their infancy. The two most streamed events on any device during the London Olympics, the women’s soccer final and women’s gymnastics, surpassed all the videos streamed during the Vancouver Olympics combined. 
Fascinating. That's all I have time for today, but I will keep an eye on the story.

Monday

The Human Face of Big Data

There's a lot of excitement this morning about the Human Face of Big Data project. It's an attempt to use a smartphone data capture tool to "help measure the world." From September 25 to October 2 participants in what the organizers are calling a "crowdsourced media project" will provide data on their lives, families, sleep, trust, dating and dreams. The idea is to illustrate some of what we can learn about ourselves by aggregating data. There are a lot of practical uses for this kind of data aggregating; you can see some of them by clicking through a series of screens here. There are some cool videos here; you can download the app here if you decide to participate.

I've written before about self-quantifiers, but I'm thinking about participating in this one. The idea of finding a data doppelganger, someone who matches your data closely - one of the promised payoffs - is appealing. (Your identity, and your doppelganger's, will not be revealed.) The production team was responsible for the Day in the Life series, and judging by the photos and video that have already been produced, the pictures will be beautiful. The screenshot above? It's a photograph showing pizza deliveries in Manhattan.

I have a couple of analysis questions: even with the hoped-for 10 million downloads, is that enough to represent a world with a population of more than seven billion people? And what about people who download the app but contribute only a few data points? There's also some reason to be cautious. According to news reports, the project will not collect identifying information, but the privacy policies have not yet been posted. What do you think? Are you participating? Share your experience in the comments.
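
For what it's worth, on the sample-size question, a back-of-the-envelope calculation (my own arithmetic, using the standard formula for a simple random sample) suggests that sheer size is not the problem: ten million respondents would give a vanishingly small margin of error if they were a random sample. The real issue is that people who download a smartphone app are nothing like a random sample of the world's population, so selection bias, not sample size, limits what the data can represent.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for a proportion estimated from a *random* sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

print(f"{margin_of_error(10_000_000):.5f}")   # ~0.0003 - tiny
print(f"{margin_of_error(1_000):.3f}")        # ~0.031 - a typical opinion poll
```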

Thursday

I've commented before (see here for an example) about how important classification is to data analysis: you have to put data into categories before you can count them. Defining the categories, and deciding which category best fits ambiguous data, is work you'll have to do yourself (one researcher I know called this 'digging around in the data' her favorite part of the work).
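
As a toy illustration (my own sketch, not tied to any real dataset), here is what "put data into categories before you can count them" looks like in code. Note that the keyword rules, and the catch-all bucket for ambiguous records, are exactly where the judgment calls live.

```python
from collections import Counter

# Hypothetical keyword rules mapping raw complaint text onto categories.
CATEGORY_KEYWORDS = {
    "noise":   ["loud music", "party", "construction noise"],
    "heating": ["no heat", "no hot water"],
    "streets": ["pothole", "broken streetlight"],
}

def categorize(description):
    text = description.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "other"   # ambiguous records still need a human decision

complaints = ["Loud music next door", "No heat in apartment 4B",
              "Pothole on 5th Ave", "Strange smell in hallway"]
print(Counter(categorize(c) for c in complaints))
# Counter({'noise': 1, 'heating': 1, 'streets': 1, 'other': 1})
```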

Classification of information is also important on the web, though of course at a much larger scale. You can read, here, a very interesting article by David Auerbach called "The Stupidity of Computers" (it's from the current issue of the magazine n+1). In the context of searching the web, all of human knowledge becomes hard to classify. But there are some shortcuts, as Google has demonstrated.

Auerbach argues that two of the best shortcuts are those used by Amazon and Facebook. Amazon reaches shoppers by using categories that they already know: books, jewelry, housewares, and so on.

[Amazon] didn’t have to explain their categories to people or to computers, because both sides already agreed what the categories were. . . . They could tell customers which were the bestselling toasters, which toasters had which features, and which microwaves were bought by people who had bought your toaster. 
We don't complain about Amazon and privacy; we are willing to give up information because of the great convenience of Internet shopping. Facebook, on the other hand, goes much further: it asks for information, and then categorizes it:
As it grew, Facebook continued to impose structure on information, but the kind of information it cared about changed. It cared less about where you went to school and a lot more about your tastes and interests—i.e., what you might be willing to buy. This culminated in a 2010 redesign in which Facebook hyperlinked all their users’ interests, so that each interest now led to a central page for that artist, writer, singer, or topic, ready to be colonized by that artist’s management, publisher, or label. “The Beatles,” “Beatles,” and “Abbey Road” all connected to the same fan page administered by EMI. Updates about new releases and tours could be pushed down to fans’ news feeds.
And, Auerbach says, there's more: Facebook wants to amass information about what its users do on other sites. Every time you log in somewhere using your Facebook ID, you are contributing data for analysis. It's something we can expect to see more of in the future, and with every login it's worth thinking about what use some corporation is making of this data.

Wednesday

Recipe organization


Organizing recipes, especially the kind clipped out of newspapers and magazines over many years, can be a challenge. Here's how one (perhaps obsessive? what do you think? let me know in the comments) New York Times staffer managed it. In case you are not ready to program your own recipe database, in addition to the options the article lists, you might also check out Evernote - I like it because it syncs across platforms, letting you link to recipes at your desk but look up ingredients on your phone while you're grocery shopping. And for another approach, try Cookstr, which allows searches by ingredients and has a separate gluten-free index.
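
And if you did someday want to roll your own minimal recipe index, a sketch like the one below would do (this is hypothetical and not how Evernote or Cookstr work internally): tag each recipe with its ingredients and dietary labels, then filter on either.

```python
# A tiny, hypothetical recipe index searchable by ingredient or dietary tag.
recipes = [
    {"name": "Pesto pasta",
     "ingredients": {"basil", "pasta", "parmesan"}, "tags": set()},
    {"name": "Flourless chocolate cake",
     "ingredients": {"chocolate", "eggs", "butter"}, "tags": {"gluten-free"}},
]

def search(ingredient=None, tag=None):
    """Return the names of recipes matching an ingredient and/or a tag."""
    return [r["name"] for r in recipes
            if (ingredient is None or ingredient in r["ingredients"])
            and (tag is None or tag in r["tags"])]

print(search(ingredient="basil"))   # ['Pesto pasta']
print(search(tag="gluten-free"))    # ['Flourless chocolate cake']
```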

Photo via cookstr.com

Monday

Facebook and Big Data

Today's column by Nick Bilton in the Bits blog of the New York Times is a good reminder of the power of huge amounts of data. Harness it, and it can give you a sense of what your users want. But it's also a reminder that every time you click on 'like' or 'share,' well, you're giving Facebook (or Microsoft or Google or numerous others) yet another piece of information about what you like, use, or want.

Big data roundup

Update, March 4: Here's an article in the NY Times about an exciting use of integrated big data, a series of programs IBM designed for Rio de Janeiro.

There has been a series of articles in the past couple of years about the era of big data we've arrived in, including today's NY Times article about Facebook's efforts to manage its stream ("firehose" is the term the NY Times used) of user data in the face of privacy concerns. McKinsey recently compiled a chart showing the potential some sectors of the US economy have for using big data:
Source: McKinsey and Company

Not-for-profits that provide social services, including mental health, health care, and other human services, have been collecting large amounts of data for years, but many have had trouble unlocking its potential for any number of reasons. Yet analysis of large amounts of data can allow providers to test interventions and make better management decisions, improving productivity and services.

The McKinsey Global Institute analyzed data in five domains, including health care in the US and the public sector in Europe, and concluded that big data can generate value in each. These conclusions seem easily extendable to the not-for-profit sector in the US, and I recommend reading McKinsey's full report or at least the executive summary, both available free, after registration here. A shorter version of the executive summary, focusing on strategy, is available as an article here. The basic points are:
  • the era of big data is here, allowing providers to collect data across units, integrate it, and analyze it;
  • big data can change the way you do business, making processes and information transparent;
  • you can use data to experiment and test hypotheses;
  • privacy is already a concern;
  • managers will need to understand big data, and will need specialists who can provide analytic support.
These are big ideas. But it's not too early to start thinking about them.
