Small Big Data?

Of the many hyperbolic predictions in bestselling books devoted to big data, none is more astounding than Mayer-Schönberger and Cukier’s claim that big data will eliminate the need for sampling (why sample when you’ve got all the data?). But here’s the thing: we don’t have all of the data. Let’s look at Twitter. First, people who tweet are not a representative sample of the population. Second, like most commercial platforms, Twitter has moved towards more proprietary policies on the data it has mined from us. Most of us can only access up to 1% of relevant tweets for a given query. That can still be a lot of tweets, and that data is, for the moment, free. But is that big data? In other words, what we have is a sample, and a biased one. If you can detect the bias, though, you can correct for it: Morstatter et al. recommend bootstrapping the data in order to correct for the biased sample. But it may not be so easy with some of the work we do. For example, t...
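To make the bootstrapping idea mentioned above a little more concrete, here is a minimal sketch in Python. It is not Morstatter et al.'s procedure, only the generic resampling trick: draw many resamples (with replacement) from the tweets you were given, recompute a statistic each time, and use the spread of those values as an uncertainty range. The function names, the `#bigdata` hashtag, and the toy tweets are all invented for illustration.

```python
import random


def bootstrap_ci(sample, stat, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(sample) (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    # Resample with replacement, recompute the statistic, and sort the results.
    estimates = sorted(
        stat([sample[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def hashtag_share(tweets):
    """Fraction of tweets containing an (assumed) hashtag of interest."""
    return sum(1 for t in tweets if "#bigdata" in t) / len(tweets)


# Made-up toy data standing in for a 1% sample of tweets.
tweets = ["#bigdata is here"] * 30 + ["lunch was good"] * 70

low, high = bootstrap_ci(tweets, hashtag_share)
print(f"observed share: {hashtag_share(tweets):.2f}")
print(f"bootstrap 95% interval: [{low:.2f}, {high:.2f}]")
```

If a figure drawn from a second source falls well outside an interval like this, that is a hint the two samples are not describing the same thing. The resampling only quantifies uncertainty within the tweets you already hold; it cannot tell you anything about the people who never tweet.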