Data Anonymization and Re-identification: Some Basics Of Data Privacy

Why Personally Identifiable Information is irrelevant. An introduction to information entropy, open data, and the possible end of crowdsourcing. 

Tim O'Reilly and ZIP Codes

From his Strata Conference on Data Science, Tim O'Reilly tweeted his dismay at the recent California court decision that the ZIP code is now to be classified as "personally identifiable information". "No more demographics" he lamented. A little later he retweeted a response that "apparently 87% of US residents can be uniquely identified by zip+DOB+gender: bit.ly/qysMqs" and later followed up with "Here's a reference for the claim that zip code, gender and DOB uniquely identify 87% of individuals: http://www.citeulike.org/user/burd/article/5822736 via @crdant".

These tweets are odd and disturbing. The zip/DOB/gender finding is a basic one in studies of privacy, published years ago by Latanya Sweeney of Carnegie Mellon University. I gave a talk at work on privacy a year ago, and this was one of the first references I came across. Tim O'Reilly has been pushing an agenda of Open Data, particularly Open Government Data, for the last couple of years, and yet it looks as if he isn't aware of the basic privacy issues around such data. Can that really be the case?

If it is, then here, to help Tim along, are some notes from my talk as a kind of introduction to data privacy, or at least to data-anonymization and re-identification. A great resource on some of these issues from a legal perspective is Paul Ohm's 2009 paper "Broken Promises of Privacy: Responding to the Surprising Failures of Anonymization" (PDF), University of Colorado Law Legal Studies Research Paper No. 09-12. It's long, but it's so well written it's an easy read. Many of these notes originated with this paper, in one form or another.

How Privacy Broke Crowdsourcing

A few years ago Netflix ran its highly successful and widely publicised crowdsourced prize competition, in which it released a data set of users and their movie ratings and let competitors download them and search for patterns. The data consisted of a customer ID (faked), a movie, the customer's rating of the movie, and the date of the rating.

In the FAQ for the competition, Netflix said this:

Q. Is there any customer information in the dataset that should be kept private?

A. No, all customer identifying information has been removed; all that remains are ratings and dates. This follows our privacy policy… Even if, for example, you knew all your own ratings and their dates you probably couldn't identify them reliably in the data because only a small sample was included (less than one tenth of our complete dataset) and that data was subject to perturbation.

This certainly looked reasonable enough, but Arvind Narayanan and Vitaly Shmatikov of the University of Texas had other ideas [1]. First, they tested the claim that the data was perturbed by asking acquaintances for their ratings. They found that only a small number of the ratings were perturbed at all, which makes sense because perturbing the data gets in the way of its usefulness.

In the Netflix data set, different users have distinct sets of movies that they have watched. The data set is sparse (most people have not seen most movies), and there are many different movies available, so individual tastes and viewing histories leave a clear fingerprint. That is, if you knew what movies someone watched, you could pick them out of the data set because no one else would have seen the same combination.

A closer look showed that with 8 ratings (of which 2 may be completely wrong) and dates that may have a 14-day error, 99% of the records in the Netflix data set uniquely identify an individual. For 68% of records, two ratings and dates are sufficient. Various combinations of information are sufficient to identify users, eg 84% by 6 of 8 movies outside the top 500.
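To make the fingerprinting idea concrete, here is a minimal sketch of matching partial, noisy knowledge about one person against a sparse ratings table. It is a simplified illustration and not the scoring algorithm from the Narayanan and Shmatikov paper; the records, the movie titles, and the `matches` and `candidates` helpers are all made up for the example.

```python
from datetime import date

# Hypothetical "anonymized" release: fake customer ID -> {movie: (rating, date)}.
RELEASED = {
    101: {"Movie A": (4, date(2005, 3, 1)), "Movie B": (2, date(2005, 3, 9)),
          "Movie C": (5, date(2005, 4, 2))},
    102: {"Movie A": (4, date(2005, 3, 3)), "Movie D": (3, date(2005, 5, 20))},
    103: {"Movie B": (5, date(2005, 1, 15)), "Movie C": (1, date(2005, 6, 30))},
}


def matches(record, known, max_days=14, max_wrong=0):
    """True if `record` agrees with what we know about a person.

    Dates may be off by up to `max_days`, and up to `max_wrong`
    of the known observations may fail to match at all.
    """
    wrong = 0
    for movie, (rating, when) in known.items():
        entry = record.get(movie)
        ok = (entry is not None
              and entry[0] == rating
              and abs((entry[1] - when).days) <= max_days)
        if not ok:
            wrong += 1
    return wrong <= max_wrong


def candidates(known, **tolerances):
    """IDs of all released records consistent with the known observations."""
    return [cid for cid, rec in RELEASED.items()
            if matches(rec, known, **tolerances)]


# One approximate observation leaves two candidates...
one_fact = {"Movie A": (4, date(2005, 3, 10))}
print(candidates(one_fact))   # [101, 102]

# ...a second observation singles out one record.
two_facts = {"Movie A": (4, date(2005, 3, 10)), "Movie B": (2, date(2005, 3, 1))}
print(candidates(two_facts))  # [101]
```

In a data set with many movies and few ratings per user, the candidate list shrinks very quickly, which is the effect the percentages above describe.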

But of course there is no personally identifiable information in the data set. So is this a privacy issue? It is when you have another data set to look at. The researchers took a sample of 50 IMDB users. The IMDB data is noisy – there is no ranking, for example. Still, they identified two users whose Netflix records were 28 and 15 standard deviations away from the next best. One from ratings, another from dates.

So despite Netflix's best efforts, the data set included enough information to identify some individuals. Partly because of this, a planned follow-up competition was scrapped, and the whole enterprise of crowdsourcing recommender algorithms was given a possibly terminal blow.

What's this all about?

Just to be clear, this set of notes is not about the following things:

  • Encryption
  • Restricting access to data
  • Lost USB keys and CDs

It is about these:

  • Deliberately released data that turns out to infringe on privacy
  • HIPAA, EU Data Directive, corporate rules for handling customer data
  • Advertising and ISPs
  • Gov 2.0, data.gov, and "openness"

It's about claims such as: "Attorneys on Monday accused Google of intentionally divulging millions of users' search queries to third parties in violation of federal law and its own terms of service" (October 26 2010)

"MySpace and some popular applications on the social-networking site have been transmitting data to outside advertising companies that could be used to identify users, a Wall Street Journal investigation has found" (October 23, 2010)

"Facebook users may inadvertently reveal their sexual preference to advertisers in an apparent wrinkle in the social-networking site's advertising system, researchers have found" (October 22, 2010)

(These claims are a year old, found in the week before I gave the talk. I'm sure there are many more.) The Facebook case was one in which advertisers (for a nursing program I believe) asked to target their ads specifically at females and at men interested in other men. But unlike, for example, an ad about a gay bar where the target demographic is blatantly obvious, a male user reading the ad text would have no idea that it had been targeted solely at a very specific demographic, and that by clicking it he would reveal to the advertiser both his sexual preference and a unique identifier (cookie, IP address, or e-mail address if he signs up on the advertiser's site). "Furthermore (the researchers wrote) such deceptive ads are not uncommon; indeed exactly half of the 66 ads shown exclusively to gay men (more than 50 times) during our experiment did not mention 'gay' anywhere in the ad text."

Don't we have laws to deal with this?

Indeed we do. Europe and the USA adopt different approaches to balancing privacy and utility, with the US adopting industry-specific rules (HIPAA for health, FERPA for education, Driver's Privacy Protection Act, FDA regulations, Video Privacy Protection Act etc), while the EU has taken a global approach with the Data Protection Directive. But both approaches are based on a common set of concepts and assumptions.

The big thing is that there is an assumption that data can be anonymized, and once it is then you can share it, because where's the harm? Both sets of rules are built on the idea that there is such a thing as personally identifiable information (PII) and that you can hide it, while still releasing data that is useful. The release process is "release and forget" because if data is properly anonymized why do you have to track what's done with it? There is a faith in the anonymization process, and that faith was broken by the Netflix study and a couple of other related studies.

Latanya Sweeney and the Massachusetts Governor

Let's go back to a time before HIPAA, when the debate was focused in terms of how much anonymization you needed to do. Here are some quotations from Latanya Sweeney's paper (PDF), that Tim O'Reilly appeared unaware of.

"Figure 1" below is a simple Venn diagram with two intersecting circles. The left circle holds medical data: Ethnicity, Visit Date, Diagnosis, Procedure, Medication, Total Charge. The right circle holds a voter list: Name, Address, Date Registered, Party Affiliation, Date Last Voted. And in the intersection is ZIP, Date of Birth, Sex.

The National Association of Health Data Organizations (NAHDO) reported that 37 states in the USA have legislative mandates to collect hospital level data and that 17 states have started collecting ambulatory care data from hospitals, physicians offices, clinics, and so forth. The leftmost circle in Figure 1 contains a subset of the fields of information, or attributes, that NAHDO recommends these states collect; these attributes include the patient's ZIP code, birth date, gender, and ethnicity.

In Massachusetts, the Group Insurance Commission (GIC) is responsible for purchasing health insurance for state employees. GIC collected patient-specific data with nearly one hundred attributes per encounter along the lines of the those shown in the leftmost circle of Figure 1 for approximately 135,000 state employees and their families. Because the data were believed to be anonymous, GIC gave a copy of the data to researchers and sold a copy to industry.

For twenty dollars I purchased the voter registration list for Cambridge Massachusetts and received the information on two diskettes. The rightmost circle in Figure 1 shows that these data included the name, address, ZIP code, birth date, and gender of each voter. This information can be linked using ZIP code, birth date and gender to the medical information, thereby linking diagnosis, procedures and medications to particularly named individuals.

For example, William Weld was governor of Massachusetts at that time and his medical records were in the GIC data. Governor Weld lived in Cambridge Massachusetts. According to the Cambridge Voter list, six people had his particular birth date; only three of them were men; and, he was the only one in his 5-digit ZIP code. [Editor's note: a 5-digit zip code may have several thousand people in it.]

The example above provides a demonstration of re-identification by directly linking (or "matching") on shared attributes. The work presented in this paper shows that altering the released information to map to many possible people, thereby making the linking ambiguous, can thwart this kind of attack. The greater the number of candidates provided, the more ambiguous the linking, and therefore, the more anonymous the data.
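The linking step Sweeney describes is essentially a database join on the shared attributes. Here is a minimal sketch with invented records (none of this is real data, and the field names are assumptions chosen for the example): it reports a match only when exactly one voter shares the ZIP, birth date, and sex of a medical record.

```python
# Hypothetical, made-up records; only the join logic matters.
medical = [
    {"zip": "02138", "dob": "1950-01-02", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02138", "dob": "1950-01-02", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "dob": "1962-07-15", "sex": "F", "diagnosis": "fracture"},
]
voters = [
    {"name": "Voter One", "zip": "02138", "dob": "1950-01-02", "sex": "M"},
    {"name": "Voter Two", "zip": "02139", "dob": "1962-07-15", "sex": "F"},
    {"name": "Voter Three", "zip": "02139", "dob": "1962-07-15", "sex": "F"},
]

QUASI_IDENTIFIER = ("zip", "dob", "sex")


def key(row):
    """The attributes the two data sets have in common."""
    return tuple(row[a] for a in QUASI_IDENTIFIER)


def link(medical, voters):
    """Return (name, diagnosis) pairs where the link is unambiguous."""
    names_by_key = {}
    for v in voters:
        names_by_key.setdefault(key(v), []).append(v["name"])
    linked = []
    for m in medical:
        names = names_by_key.get(key(m), [])
        if len(names) == 1:  # exactly one candidate: re-identified
            linked.append((names[0], m["diagnosis"]))
    return linked


print(link(medical, voters))  # [('Voter One', 'hypertension')]
```

The third medical record matches two voters, so it stays ambiguous. That is the property k-anonymity tries to guarantee for every record.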

In a theatrical flourish, Dr. Sweeney sent the Governor's health records (which included diagnoses and prescriptions) to his office.

Now, of course, health information in the US is governed by HIPAA, but according to HIPAA, "de-identified" health information is unregulated. De-identified means either: a statistician says it is de-identified, or the 18 Personally Identifying Information (PII) identifiers are suppressed or generalized. These PIIs are things like Name, e-mail address, social security numbers, computer IP addresses, and so on.

The EU doesn't list specifics. Instead it says that PII is "anything that can be used to identify you". But what does that cover? IP addresses perhaps? Here is Google in their argument to the EU:

  • we "are strong supporters of the idea that data protection laws should apply to any data that could identify you. The reality is though that in most cases, an IP address without additional information cannot."
  • "We believe anonymizing IP addresses after 9 months and cookies in our search engine logs after 18 months strikes the right balance."
  • "we delete the last octet after nine months (170.123.456.XXX)"

The Latanya Sweeney result was the first to show that once you can mix and match data sets, PII is just not enough to provide privacy. And nowadays, of course, data mining multiple data sets is big business.

How Do You Anonymize Data? k-anonymity

Let's step back a little and look at the technical side of anonymization. There are four basic methods for anonymizing data:

    Replacement - substitute identifying numbers
    Suppression - omit from the released data
    Generalization - for example, replace birth date with something less specific, like year of birth
    Perturbation - make random changes to the data
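As a rough illustration of the four methods, here is a sketch that applies each one to a single record. The record, the field names, the salt, and the helper functions are hypothetical, and a salted hash is only one possible way to do replacement.

```python
import hashlib
import random

# Hypothetical record; "charge" stands in for any numeric attribute.
record = {"name": "Sean", "birth": "1965-09-20", "zip": "02141", "charge": 1200.0}


def replace_id(value, salt="example-salt"):
    """Replacement: swap an identifier for an opaque token."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:8]


def suppress(row, field):
    """Suppression: omit a field from the released row."""
    return {k: v for k, v in row.items() if k != field}


def generalize_birth(birth):
    """Generalization: keep only the year of an ISO date string."""
    return birth[:4]


def generalize_zip(zip_code, keep=4):
    """Generalization: mask the trailing digits of a ZIP code."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def perturb(value, spread, rng):
    """Perturbation: add uniform random noise within +/- spread."""
    return value + rng.uniform(-spread, spread)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
released = suppress(record, "name")
released["id"] = replace_id(record["name"])
released["birth"] = generalize_birth(record["birth"])
released["zip"] = generalize_zip(record["zip"])
released["charge"] = perturb(record["charge"], 50.0, rng)
print(released)
```

Each step throws away some information, which is why these methods trade privacy against usefulness.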

Then you have to measure how private a data set is. Latanya Sweeney came up with the notion of k-anonymity to define this. Here's how it works.

Think about a table, with rows and attributes. Each attribute is either part of a quasi-identifier (like a name or address), or is sensitive information (like the fact you had an operation on a particular afternoon). A quasi-identifier is a set of attributes that, perhaps in combination, can uniquely identify individuals. Sensitive information includes the attributes that we want to keep private. Your driving license number is an identifier; your driving record is sensitive information. The table satisfies k-anonymity iff each sequence of values in any quasi-identifier appears with at least k occurrences. Bigger k is better.

So here is a table with 11 rows.

Name    Race   Birth       Gender  Zip    Problem
Sean    Black  1965-09-20  M       02141  Short breath
Daniel  Black  1965-02-14  M       02141  Chest pain
Kate    Black  1965-10-23  F       02138  Hypertension
Marion  Black  1965-08-24  F       02138  Hypertension
Helen   Black  1964-07-11  F       02138  Obesity
Reese   Black  1964-01-12  F       02138  Chest pain
Forest  White  1964-10-23  M       02138  Chest pain
Hilary  White  1964-03-15  M       02139  Obesity
Philip  White  1964-08-13  M       02139  Short breath
Jamie   White  1967-05-05  M       02139  Chest pain
Sean    White  1967-03-21  M       02138  Chest pain

If we remove all the attributes except for the problem we have a very anonymized data set (k = 11):

Name    Race   Birth       Gender  Zip    Problem
                                          Short breath
                                          Chest pain
                                          Hypertension
                                          Hypertension
                                          Obesity
                                          Chest pain
                                          Chest pain
                                          Obesity
                                          Short breath
                                          Chest pain
                                          Chest pain

On the other hand, if we just remove the name and generalize the zip code and date of birth we have a less anonymized set. Exercise: convince yourself that k=2 for this set.

Name    Race   Birth  Gender  Zip    Problem
        Black  1965   M       0214*  Short breath
        Black  1965   M       0214*  Chest pain
        Black  1965   F       0213*  Hypertension
        Black  1965   F       0213*  Hypertension
        Black  1964   F       0213*  Obesity
        Black  1964   F       0213*  Chest pain
        White  1964   M       0213*  Chest pain
        White  1964   M       0213*  Obesity
        White  1964   M       0213*  Short breath
        White  1967   M       0213*  Chest pain
        White  1967   M       0213*  Chest pain

Of course, the issue is utility. There is a tradeoff between keeping the data useful for research and maintaining privacy. Researchers and attackers are doing the same thing after all: looking for useful patterns in the data. With the k=2 data set you can ask questions about correlation of problems with gender, or with geography to some extent (although not very specific geographical factors, like toxic leaks).
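The k-anonymity check itself is easy to sketch: group the rows by their quasi-identifier values and take the size of the smallest group. The example below uses the first six rows of the 11-row table (names already removed); the `generalize` and `k_anonymity` helpers are illustrative names, not part of any library.

```python
from collections import Counter

# First six rows of the example table: (race, birth, gender, zip, problem).
ROWS = [
    ("Black", "1965-09-20", "M", "02141", "Short breath"),
    ("Black", "1965-02-14", "M", "02141", "Chest pain"),
    ("Black", "1965-10-23", "F", "02138", "Hypertension"),
    ("Black", "1965-08-24", "F", "02138", "Hypertension"),
    ("Black", "1964-07-11", "F", "02138", "Obesity"),
    ("Black", "1964-01-12", "F", "02138", "Chest Pain"),
]


def generalize(row):
    """Reduce birth date to a year and mask the last ZIP digit."""
    race, birth, gender, zip_code, problem = row
    return (race, birth[:4], gender, zip_code[:4] + "*", problem)


def k_anonymity(rows, quasi_identifier=(0, 1, 2, 3)):
    """Size of the smallest group of rows sharing quasi-identifier values."""
    groups = Counter(tuple(r[i] for i in quasi_identifier) for r in rows)
    return min(groups.values())


print(k_anonymity(ROWS))                           # 1: every birth date is unique
print(k_anonymity([generalize(r) for r in ROWS]))  # 2
```

The sensitive column (the problem) is deliberately left out of the grouping: k-anonymity only constrains the quasi-identifier.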

It would be nice if you could make the data set anonymous for the purposes of attackers, but still useful for researchers. But it turns out you can't. In a paper called The Cost of Privacy: Destruction of Data-Mining Utility in Anonymized Data Publishing, Justin Brickell and Vitaly Shmatikov investigated the problem. They took a set of different sanitization methods and compared it to a data set with trivial sanitization (removal of identifiers). Here are the results.

[Figure: Privacy_utility, a bar chart with a pair of bars (privacy, utility) for each sanitization method]

The left bar of each pair represents the privacy (smaller = more private); the right bar represents the utility to the researcher (bigger = more useful). Anonymization seeks to shorten the left bar without shortening the right, but the results show, depressingly, that small increases in privacy cause large decreases in utility.

Please could you tell me about the Database of Ruin?

OK, if you insist.

If we are going to take a new look, we need to recognize that privacy is not a binary issue, and it is not a property of a single data set. We need to worry about reidentification attacks that do not reveal sensitive information. As Paul Ohm writes: "For every person on earth, there is at least one fact about them stored in a computer database that an adversary could use to blackmail, discriminate against, harass, or steal the identity of him or her… the 'database of ruin' containing this fact but now splintered across dozens of databases on computers around the world."

Privacy is erased incrementally as successive queries reduce uncertainty and narrow in on an individual. The way to quantify this reduction in uncertainty is to use the idea of information entropy, adopted from the thermodynamic concept, and usually identified by H. The information gained in a query is

H(before) – H(after)

As you increase your knowledge of a system, the entropy (loosely, disorder) decreases.

So what is the formula for H?

For a set of outcomes {i,…}, each with probability p_i, the information entropy is:

H = - SUM_i p_i log2(p_i)

(excuse the lack of greek sigma), and is measured in bits. The logarithm appears because the probability of two independent events occurring is the product of the probabilities of each event, but the information we gain from observing two independent events is the sum of the information we gain from each event.

Take a simple example: a coin toss. Before the toss, there are two outcomes with equal probabilities, so

H = -(1/2) log2(1/2) – (1/2)log2(1/2)

= – log2(1/2)

= log2(2)

= 1 bit

which makes sense if you think about it, because the coin could be heads (1) or tails (0).

After the toss, H = log2(1) = 0 : there is no uncertainty left and we have complete information about the system.

In the same way, a dice roll has (before rolling it) an entropy of

H = log2(6) ~ 2.6 bits
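The coin and dice figures can be checked with a few lines of code. This is a minimal sketch of the entropy formula; the `entropy` function name is just a label chosen for the example.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    # Outcomes with zero probability contribute nothing to the sum.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


coin_before = entropy([0.5, 0.5])   # fair coin before the toss
coin_after = entropy([1.0])         # outcome known: no uncertainty left
die = entropy([1 / 6] * 6)          # fair six-sided die

print(coin_before)                  # 1.0
print(coin_after)                   # 0.0
print(round(die, 3))                # 2.585
print(coin_before - coin_after)     # information gained by seeing the toss: 1.0
```

A biased coin, say `entropy([0.9, 0.1])`, comes out below 1 bit: the less evenly spread the outcomes, the less there is to learn from observing one.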

So if the challenge is to identify one person in the population of the world, how much information entropy is there? The identity of a random, unknown person is just under 33 bits (2^33 ~ 8.6 billion). Hence the web site 33bits.org.

Learning a fact about the individual reduces the uncertainty (reduces information entropy). So if you learn that the star sign is Capricorn then that's -log2(1/12) = log2(12) = 3.58 bits of information.

If you find out other independent pieces of information you add up the contributions to find out how much the entropy has been reduced. So a ZIP code may provide 23.81 bits of information, a birthday 8.51, and gender 1 bit for a total of 33.32 bits: it probably identifies one individual.
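Here is a small sketch of that bookkeeping. The world population of 7 billion is an assumption made for the example, the ZIP figure is taken directly from the text above, and `bits` is an illustrative helper name.

```python
import math

WORLD_POPULATION = 7e9  # assumption: roughly the 2011 figure


def bits(fraction):
    """Information gained from a fact shared by `fraction` of the population."""
    return -math.log2(fraction)


needed = math.log2(WORLD_POPULATION)  # bits needed to single out one person

star_sign = bits(1 / 12)       # about 3.58 bits
birthday = bits(1 / 365.25)    # about 8.51 bits
gender = bits(1 / 2)           # 1 bit
zip_code = 23.81               # figure quoted in the text, not derived here

# Independent facts add up.
total = zip_code + birthday + gender

print(round(needed, 2))
print(round(total, 2))
print(total >= needed)  # True: enough information to point at one person
```

The addition is only valid if the facts are independent. Correlated facts (a ZIP code and an area code, say) contribute less than the sum of their separate bits.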

The Netflix de-anonymization paper used these ideas a bit. The a priori entropy of the data set (additional information required for complete de-anonymization) is 19 bits (2^19 = 524288, which is about the number of individuals in the data set). Individual movies give from 1 to 18 bits of information, depending on what you know about them (dates within 14 days, ratings +/- 1). Very popular movies gave little information, but niche movies viewed by few individuals yielded many bits of information. So little auxiliary information is needed to re-identify records in the database. In a theoretical excursion, the researchers showed that de-anonymization is going to be robust against noise, and does not need much additional information, so long as the data set is large and sparse enough.

So what now?

I can do nothing better than quote from Paul Ohm to summarize the privacy dilemma we find ourselves in.

"Abandoning PII is a disruptive and necessary first step, but it is not enough alone to restore the balance between privacy and utility that we once enjoyed. How do we fix the dozens, perhaps hundreds, of laws and regulations that we once believed reflected a finely calibrated balance but in reality rested on a fundamental misunderstanding of science?

Techniques that eschew release-and-forget may improve over time, but because of inherent limitations like those described above, they will never supply a silver bullet alternative. Technology cannot save the day, and regulation must play a role."

Ohm notes that the US sectoral approach to regulation sets the privacy bar too low, by focusing on explicitly listed PII. Meanwhile the EU standard, "anything that can be used to identify you", would set the bar too high if interpreted in the light of modern de-anonymization techniques.

The direction to take, says Ohm, is to focus on the people, not the anonymization, and to distinguish private from public release. We need to codify notions of trust and practices and apply strong sanctions against re-identification. This will put more administrative and procedural burdens in our future, but is needed to preserve privacy.

And, to return to Tim O'Reilly's tweet, open data advocates and big data enthusiasts need to pay more attention to these issues rather than relying, as some do, on the removal of personally identifying information as a sufficient solution.

Footnotes:

[1] Arvind Narayanan and Vitaly Shmatikov, Robust De-anonymization of Large Sparse Datasets, IEEE Symposium on Security and Privacy, 2008. (gated link)

Internet-Centrism 3 (of 3): Tweeting the Revolution (and Conflict of Interest)

[Time saver announcement: The most interesting thing in here is probably the conflict of interest in a recent Twitter-Arab Spring paper, which starts here.]

Earlier today I thought I was doomed to fail; that part 3 of this prematurely-announced trilogy was just not going to get written. I tried a few things, threw them away. Tried a few more, scrunched them up into balls of electrons, and dragged them to the little waste bin up there in the corner of the screen. Life was looking grim (although not as grim as the awesomely atmospheric trailer for the Andrea Arnold version of Wuthering Heights, shot in wonderful Swaledale, and coming soon to a theatre near me, I hope).

But then two research papers about Twitter and the Arab Spring appeared within days of each other. Could these be raw material for a little grumbling about Internet-centrism, I thought? Well knock me sideways, it turns out they could.

First up is Opening Closed Regimes, by Philip N. Howard and others, a report from the Project on Information Technology and Political Islam at the University of Washington. "What was the role of social media during the Arab Spring?" it asks, and it answers the question with ne'er a duelling anecdote in sight, which you would think would be a good thing.

But it's not. The paper is resolutely and exclusively like the drunk looking under the lamp post to find his keys, because that's where the light is brightest. There's lots of data about what happens on Twitter because the Twitter API provides access to large numbers of tweets together with all their metadata, and then you can slice and dice them to your heart's content. And that's what these researchers have done. I know from personal experience how tempting it is to think there's gold in them thar queries, that if you interrogate the data in just the right way you'll find the key to unlock the box and find enlightenment, or at least an unmixed metaphor.

Not that there's anything wrong with data mining Twitter. But if it's the only data source you use, you really can't use it to make broader statements. And the paper does make broad claims, like "Social media played a central role in shaping political debates in the Arab Spring", and "Twitter seems to have been a key tool in the region for raising expectations of success and coordinating strategy. Twitter also seems to have been the key media for spreading immediate news about big political changes from country to country in the region." These are statements that reach out from Twitter to the broader social context of which it is a part, but if Twitter is literally the only medium you look at, then obviously it is likely to appear "central" and "a key tool". A section titled "Social Media's Centrality to Political Conversation" analyzes the volume of online content during the time leading up to the uprisings, and looks at how the web sites of political actors linked to social networking and other news services. There's nowt but digital content. What the section is really looking at is social media's centrality to online political content.

There's no mention here of other ways in which individuals actually exchanged information, whether in university campuses, on the football terraces, in mosques, through professional networks of lawyers and teachers, via posters or leaflets, or around kitchen tables. All this teeming activity is simply invisible, out there in the obscure unilluminated and unexamined darkness. Out of sight, out of mind.

Opening Closed Regimes reminds us that comparisons are essential. Mapping social media interactions, or online linkage patterns, no matter how ingeniously, is never going to tell us anything about how online behaviour fits into the broader social world. And it's never going to justify conclusions like "During the Arab Spring, individuals demonstrated their desire for freedom through social media, and social media became a critical part of the toolkit used to achieve that freedom."

But Opening Closed Regimes is better than the other paper, The Revolutions Were Tweeted: Information Flows During the 2011 Tunisian and Egyptian Revolutions (pdf) by Gilad Lotan and others including (to my dismay) danah boyd, whose writings I usually really learn from – like Six Provocations for Big Data, for example.

This paper also takes Twitter feed data – nothing else at all: nada, zilch, zero – slices, dices, etc etc as before, and then says "we discuss how Twitter plays a key role in amplifying and spreading timely information across the globe".

The paper is published in the International Journal of Communication, which you'll be glad to know "adheres to the highest standards of peer review and engages established and emerging scholars from anywhere in the world". So I was a little surprised when I looked up lead author Gilad Lotan, who lists his organizational affiliation as Social Flow.

[Update: the authors now provide an appendix to the paper explaining the relationship between the company, the lead author, and the work. I accept that explanation. I leave the post here, but please read it in that light.]

Why? Because it turns out that Social Flow is a private company that "has developed the industry's first and only social media optimization technology that uses real-time data – including the Twitter and bitly firehoses – to help publishers, retailers, and brands earn maximum potential engagement on Twitter." That's right, it's basically SEO for Twitter. It develops algorithms that "dynamically publishes the best message at the best time to ensure its clients get the maximum amount of engagement from their Followers." Yup, these people are going to pollute your Twitter feed, and make money from it. Small wonder that the paper paints such a positive picture of Twitter's influence.

Social Flow is a venture capital funded company based in New York. According to CrunchBase, they recently received $7M in Series A funding from a variety of venture capitalists including "Softbank, with Softbank NY, RRE Ventures, Betaworks, High Line Venture Partners, AOL Ventures, SV Angel and some individual angels investing." Plus, they have a lovey-dovey relationship with Twitter itself. TechCrunch comments that:

Neither side would comment on the terms of the relationship — but it's clearly beneficial to both.

"This is exactly what we want to see out of the ecosystem," Twitter Platform lead Ryan Sarver adds. "These guys are building a real business,"

Thanks Ryan.

And just in case you wondered: their investor Betaworks has also invested in Tweetdeck and Summize: other companies that have actually been bought by Twitter.

There is no mention of this blatant conflict of interest, which took me all of 30 minutes to look up, in the paper. The "International Journal of Communication" should be ashamed for publishing it.

After that, it's hardly worth even reading the paper. But I did. It suffers from some of the usual Internet-centric fallacies.

Here's the VERY FIRST SENTENCE (capital letters in tribute to Tiger Beatdown's third birthday): "The shift from an era of broadcast mass media to one of networked digital media has altered both information flows and the nature of news work." Apparently positioning Internet sources as a competitor to broadcast media is so obviously true that it doesn't need support. Except that it's far from obvious that there is such a shift, even in North America, with audiences consuming just as much television as they ever have, with social media being an increasingly important complement to TV. And of course in North Africa the spread of television is much more recent – the growth of Al Jazeera even more so – so that there has almost certainly been no such shift at all.

The conceptual background of the paper is rife with such assumptions of the way the digital and "mass media" world differ. Studies investigate "how news emerges from networked actors who span different professional and organization identities and contexts" – note the organic, bottom up word "emerges" – in contrast to the world of mainstream media, in which "journalists tend to produce news designed for their publishers and editors", and enact "rituals of objectivity" to distance themselves from audiences. It's a caricature of the Internet compared to a cartoon of non-digital media.

And so it goes. The study itself is quite narrow, analysing information flows around Twitter, and makes cautious conclusions about the meaning of those information flows. But it acquires interest only from the suggestions, throughout, that this is part of a bigger picture of bottom-up newsmaking through social media. Their findings, for example, "confirm the notion that Twitter supports distributed conversation among participants and that journalism, in this era of social media, has become a conversation".

Now I don't mind a critical look at mainstream media – there's no shortage of things to criticize after all, and I suspect this is why well-intentioned people with anti-establishment agendas fall into the Internet-centricity trap. But even if "mass media" was nothing but a Debord-ian "society as spectacle" dystopia, that doesn't make the venture-capital-driven world of social networks an alternative: social media companies are as mainstream, capitalist, and establishment as you can get. And they are no friends of liberation movements.

So, going back to the MIT Technology Review articles that I discussed last time, there are three contributions from people who are clearly on the left politically, and who seem to follow a broadly anti-establishment agenda, and whose writings I respect. But, if I can psycho-analyze from a distance, their writings show that they would dearly love Internet technology to be a natural ally of those who seek a more egalitarian, less hierarchical world. Unfortunately, wishes don't always come true.

Aaron Bady argues that social media promotes, by its architecture, a notion of leaderlessness that is counter to the authoritarian nature of Egypt and other Arab states. Zeynep Tufekci sees Facebook's features as "the ideal infrastructure to create the information/action cascade" that was the Arab Spring uprisings. Jillian York is more sceptical, but still looks for "the democratizing prospects of social media tools elsewhere: the strengthening of the public sphere… the transfer of the interaction from social networks to manifestation in the real world, on the street". Reading these accounts, one cannot help but see three people who want to change the world for the better, and who are hoping that these new technological tools tip the balance of this continuing struggle in favour of the powerless, and against the powerful.

But their claims place technology ahead of issues of commerce and ownership, and I think that's a mistake:

  • Networks on Facebook may not have leaders, but Facebook itself is a centralized, commercialized system that has promoted its owners into positions of wealth (six billionaires) and power (access to heads of state, influence in Washington, a welcome at Davos). Is this an architecture of leaderlessness? Why would we spend intellectual and other effort arguing that this business is (even with caveats) a force for good in the world? Sure I use Facebook, but I also watch TV and that doesn't mean I trust the TV companies further than I could throw them.
  • The wealth and power that Facebook and Google owners have accumulated depends directly on the commercial exploitation of the traffic through their sites. YouTube, Facebook, and Twitter will act as a "public sphere" only until the commercial interests of the site collide with the politics of the discussion. Once that boundary is crossed, these spaces become more Panopticon than public square. Is this an ideal infrastructure for activism?
  • By focusing their hopes on a particular set of technologies that carry massive public data sets with them, they make it almost impossible not to overestimate the role these technologies play in public discussion, as we saw with the two research papers. We track the conversations on Twitter during the revolution, but not the conversations among neighbours around kitchen tables. Out of sight, out of mind.

So, from those who seek to make money from commercializing our every social interaction, to those who mistakenly identify today's mainstream Internet with the alternative ethos of a decade ago, there are plenty of people out there who are prone to see the world through the rose-tinted and digitally-enhanced glasses of Internet-centricity. But that doesn't make them right.

Internet-Centrism 2 (of 3): Streetbook

So three cheers for Evgeny. Now back to the MIT Technology Review, and in particular John Pollock's Streetbook, an extended set of interviews with two secretive Tunisian digital activists who go under the pseudonyms "Foetus" and "Waterman". The article describes how "their organization, Takriz, performed a remarkable and largely unknown role" in the uprisings.


F—b— is such an important technology of mass coordination among young people that it was shut down in Egypt and other North African countries as the uprisings erupted in January 2011. F—b— was a technology that allowed many-to-many communication of political ideas, even under the censorious eye of repressive states; it provided a space where urban youth could gather to display their collective strength and build a clear identity separate from the state; it had fostered trans-national networks across the whole of North Africa; it lent itself to leaderless mass organization; and it produced some of the key organizing groups of the Tahrir Square protests. F—b— is, of course, Football.

The Ultras are groups of hard-core soccer fans who carry out dramatic displays of their fanaticism (fireworks, huge choreographed artworks, inventive songs) and who compete amongst each other to out-invent and out-display the fans of opposing teams. And, yes, fight with police from time to time. James Dorsey {blog} has followed Ultra culture closely. Here he is from Egypt in January 2011: {link}

With Egypt entering its second day of unprecedented anti-government protests, soccer fans constitute a well-organized and feared pillar of the marshalling grassroots coalition determined to ensure that President Hosni Mubarak suffers the same fate as Tunisian President Zine El Abidine Ben Ali, who was toppled earlier this month by mass demonstrations.

Alaa Abd El Fattah, a prominent Egyptian blogger {Wikipedia article}, was widely quoted as saying "The ultras–the football fan associations–have played a more significant role than any political group on the ground at this moment." {link}

Here is Debbie Randle of BBC's "Newsbeat" {link}.

When Newsbeat spoke to some of the ultras, they told us large numbers of them fought on the front line at the protests.

They said they passed on their knowledge to other demonstrators and protected people and their homes.

Abdullah, who's a member of the Zamalek White Knight Ultras, told Newsbeat they also helped to organise the protests.

He said: "We had many meetings to discuss the revolution before it started. We discussed that we must play an important part for Egypt, we must support Egypt."

The official line from the Ultras is that they didn't take part as a group but Newsbeat has been assured this isn't the case.

An ultra from a rival group, thought to have around 15,000 members, said they also played a role.

They worked alongside the White Knights, usually sworn enemies, to protest for their country.

A member of Egypt's NDP was reported on January 26 2011 as acknowledging the importance of the Ultras: "what we saw on the streets yesterday are not just Muslim Brotherhood members or sympathisers but Egyptians at large; those are the Egyptians that you would see supporting the football national team– and their show of frustration was genuine and it had to be accommodated." {link}

And Sports Illustrated, on January 31 {link}:

Over the decades that have marked the tenure of Egypt's "President for Life" Hosni Mubarak, there has been one consistent nexus for anger, organization, and practical experience in the ancient art of street fighting: the country's soccer clubs. Over the past week, the most organized, militant fan clubs, also known as the "ultras," have put those years of experience to ample use.

Why soccer? Let's return to James Dorsey as quoted in Sports Illustrated: "The involvement of organized soccer fans in Egypt's anti-government protests constitutes every Arab government's worst nightmare. Soccer, alongside Islam, offers a rare platform in the Middle East, a region populated by authoritarian regimes that control all public spaces, for the venting of pent-up anger and frustration." Does this seem familiar?

So it looks like the Ultras, which some describe as a social movement, played a self-conscious role in the Egyptian uprising, using the uncensored public space of the football terraces to organize themselves in impressive ways. See (via @techsoc's Twitter feed) this remarkable protest last week over the continued imprisonment of Ultra members following recent protests. The words above the pictures of the imprisoned apparently read "Free Fans (Ultras)".

The Ultras do use digital technologies to organize themselves, but the primary means is clearly physical. See these videos for a few examples: the Stretford End has nothing on these characters.

As we read Streetbook, however, we get a different impression of the Ultras. The narrative that the article conveys is one in which the digital activists of Takriz played the key role while the Ultras themselves are secondary.

Streetbook highlights Takriz's myriad connections, but presents its "main audience" as "alienated street youth: the lifeblood, often spilled, of the rebellion in North Africa". Though the article has many fascinating things to say about the contacts between Takriz and the street groups, it does not support the presentation of Takriz as the primary actor in these contacts, and there is a recurrent classism in the description of how the groups relate.

So Pollock describes how the Taks (members of Takriz), including his interviewees "Foetus" and "Waterman", had the inspiration of "turning the spirit (of soccer fans) to political ends", and developing a web forum for Ultras from different teams. Then "when the revolution began, the Ultras would come out to play a very different game. They were transformed into a quick-reaction force of bloody-minded rioters." The narrative is that the digital activists shaped the events, transforming soccer hooligans into a "bloody-minded" political fighting force. But is that what happened? Did initiatives such as the web forum or the Takriz efforts to direct the Ultras have any impact? There is no evidence for this in the article. And there is no voice here from the Ultras themselves. Do they not have anything to say beyond the wielding of sticks? It is remarkable that in all the coverage of the Ultras listed above, there is only one piece that mentions a name. Is digital presence required to have a voice in the debate about what happened in North Africa in 2011? It's not surprising that the importance of digital tools is exaggerated.

In the same spirit Streetbook describes how, when the Tunisian regime fell, Egyptian digital activist Hassan Mostafa "reached out to some of the hardened criminals, 'murderers and drug dealers', he had met while imprisoned for his Khaled Said protest: their skills would prove useful in stealing police riot helmets and guns. Through them, Mostafa recruited an army of toughs from the poorest areas." In Tunisia, Takriz leaders "use street culture, slang, and obscenities to fire up street youth". Again, the narrative has the digital activists as the brains, and the "toughs from the poorest areas" with no digital connections are reduced to a tool to be "fired up", with no agency of their own.

The over-emphasis on digital tools is everywhere in Streetbook. Most obvious is the claim that after visiting (overland, the old way) a group in Serbia, an activist returned with "a book about peaceful tactics and a computer game called A Force More Powerful, which lets people play with scenarios for regime change. Taking advantage of the game's Creative Commons license, April 6 members wrote an Egyptian version." As if activists in earlier times failed to share resources and build on each other's work for want of the right clauses in their license agreements. The hacker group Anonymous gets a mention for targeting Tunisian government websites with DDOS "attacks", but was Anonymous's action a significant step in the escalation of protest? I doubt it. A Paris-based activist "wrote a script, using semantic search techniques based on keywords… to measure how much time it took for posts to result in responses like comments". There is nothing wrong with such work, and it is clearly secondary – at best – to the main events, yet in the Internet-centric "digital revolution" narrative it gets a speaking part.

Yet throughout the interview, there are nuggets that reveal an alternative plot between the lines. Despite the focus on YouTube as a mechanism for sharing videos, Foetus describes how the video that "made the second half of the [Tunisian] revolution" was taken when the regime had shut down the Internet, so "according to a Tak who asked to remain anonymous, Takriz smuggled a CD of the video over the Algerian border" before forwarding it to Al Jazeera. What I see here is that, while YouTube has made it easier and safer to make videos available (at least so long as Google lets it be done anonymously), when an important video was available, the Internet was not essential to the process of distribution. On December 30 in Tunisia, "Lawyers gathered around the country to protest the government and were attacked and beaten", and a week later they all went on strike, followed by the teachers. Lawyers and teachers have their own pre-existing networks, of course, and probably used them to organize – at least there is no mention of a lawyers' Facebook page. So while Pollock diligently tracks each use of FourSquare, each request for help over Facebook, each use of emails with attachments, and each fake Twitter account, he skips quickly over the role of leaflets and flyers, meetings in mosques, university campus organizing, television, and balcony-to-balcony messaging.

Those who are convinced of the progressive potential of digital technologies sometimes accuse sceptics of exaggeration. No one, of course, thinks that Facebook caused the Arab Spring uprisings. But the Internet-centric narrative is not so much wrong in any one particular, as inherently distorted in the overall weight it gives to digital tools. It lends itself to narratives in which technological tools play a central role, in which the direction of influence is "from social networks to manifestation in the real world", as Streetbook quotes Samir Garbaya. The nuggets embedded in Streetbook tend to show that, more often than not, digital technologies are one option among several when it comes to the tasks activists face, and that information transmission is commonly not the bottleneck. There is no doubt that Facebook played a significant role in the Egyptian uprising, and that some key videos uploaded to YouTube helped to raise awareness of persecution, but alternatives have existed and do exist, and it is a mistake to pass quickly over these alternatives.

Next post – which I hope I'll get done in a day or two – is something more constructive. If Internet-centrism is a problem, how do we debate the role of digital technologies in politics? Henry Farrell has an extensive survey of the topic and some fine insights here, and I'll just try to add a few postscripts of my own.

Internet-Centrism 1 (of 3): Evgeny Morozov and The Net Delusion

I've never met Evgeny Morozov, but I have to say I love the guy because over the last three years he's succeeded in doing something that desperately needed to be done: he's provided a strong counterweight to an overhyped narrative of digital revolution. And then, since The Net Delusion came out at the beginning of the year, he's had to put up with a lot of unjustified condescension and caricature from some who suffer from the very "Internet-centrism" that is the target of his book.

Take the MIT Technology Review, which just ran a set of writings about the Arab Spring uprisings. The contributors characterize the debate over social media's importance as "wildly overdrawn" (Aaron Bady), "highly polarized" (John Pollock), a "false debate" (Zeynep Tufekci), and a "false dichotomy" (Jillian C. York). And some put Morozov, together with Malcolm Gladwell, at one pole of that debate, with "the cyber-utopians" at the other. Now Gladwell is a high-profile name who dipped his oar into the waters of digital scepticism in a single article, and he can deal with what he gets from that. But Morozov, judging from his age, is looking to build an academic career on his work and he's at a vulnerable stage, so being the focus of so much attention is both a blessing and a curse.

In Streetbook, his contribution to the MIT Review articles, John Pollock summarizes The Net Delusion by writing that "[it] decries the 'naive belief in the emancipatory nature of online communication'". Now if you are going to accuse Morozov of being one pole in an overly polarized and, by implication, simplistic debate, you should at least get his position right, and this portrayal is simply unfair. The quotation comes from page xiii of the book's introduction, and is Morozov's portrayal of "cyber-utopians". Using this quotation makes it seem that Morozov sees his opponents in exaggerated and simplistic terms. But the bigger target of the book is what Morozov labels "Internet-centrism", a set of assumptions that has policy-makers "answer[ing] every question about democratic change by first reframing it in terms of the Internet rather than the context in which that change is to occur." {p. xvi}

Morozov is not a digital dystopian; hell, he owns both a Kindle and an iPad; he writes "The premise of this book (The Net Delusion) is thus very simple: To salvage the Internet's promise to aid the fight against authoritarianism, those in the West who still care about the future of democracy will need to ditch both cyber-utopianism and Internet-centrism" {p xvii}. His position is well within the confines of mainstream political science — in particular, he accepts the "project of promoting democracy" as generally a well-intentioned effort from the US, which is a far-from radical stance — so it's remarkable that he has become such a polarizing figure. The Net Delusion is a strong book, but a limitation is that it fails to challenge that project and the intentions behind it sufficiently.

Over the course of the year, The Net Delusion has become a victim of its own success. Critiques argue that it is time to move beyond "duelling anecdotes" to a more sophisticated level of analysis, and both Patrick Meier and Mary Joyce (as well as the more predictable Adam Thierer) accuse Morozov of dealing too much in story, not enough in larger-scale data analysis or theoretical work. The accusations miss the point: there was no duel of any significant kind until Morozov stood up against the tide of "Twitter Revolution" euphoria that erupted along with the Iranian protests in 2009. He introduced a counterweight into a public debate environment that uncritically accepted the thesis that the Internet is intrinsically on the side of dissidents under authoritarian regimes; that the affordances it offers benefit the powerless against the powerful. Having rebalanced the debate, he is now met with a somewhat patronizing declaration that we are past all that, and it is time to move on to more sophisticated and nuanced discourse. Joyce is right when she says that he can be binary in his thinking, and that the book lacks theory, but these accusations are beside the point. It was a book written quickly, in response to the needs of the time, and it met those needs. Arguing that a more subtle, thorough book would be better is to miss the political situation in which it was written, and which it effectively challenged. Morozov deserves a better response from social scientists than he has received.

Despite the claims of moving beyond the Morozov/Shirky binary debate, some of the MIT Review articles suffer from the very Internet-centrism that Morozov criticizes, in particular the Streetbook centrepiece. But that's a topic for Part 2.

Two Digital Fallacies

The first is what I call the Long Tail Fallacy. It goes like this:

  1. Look on the shelves of a big chain bookstore or music store. It's mainly mainstream stuff. Boo.
  2. Look at the variety at Amazon or iTunes. Hooray!
  3. Isn't it great how the Internet has liberated us from the tyranny of physical shelves and geography?

Did you see the switch? Here it is again. Watch closely.

  1. Look at what was on mainstream network TV decades ago. Not much. Boo.
  2. Look at all that variety on YouTube. Hooray!
  3. Isn't it great how the Internet has liberated us from the tyranny of mainstream media?

See how I did that? Or again, this time from Digitally Enabled Social Change by Jennifer Earl and Katrina Kimport (p91).

  1. Look at offline rallies as reported by the New York Times. Only big and complex protest events. Boo.
  2. Look on the Internet. Online petitions, campaigns to save TV shows, all kinds of actions. Hooray!
  3. Isn't it great how the Internet has unleashed a torrent of activism among the population?

I think of the second as the Christmas Fallacy:

  1. Publishing used to be expensive.
  2. Now it's cheap.
  3. We have an abundance of publishing!

Seems reasonable enough? What about this:

  1. Christmas comes but once a year.
  2. I wish it could be Christmas every day.
  3. We'd have an abundance of Christmas!

We wouldn't, of course. We'd have no Christmas at all.

Media Disruption Exacerbates Revolutionary Unrest? Notes.

In the New York Times today, there's a piece about a conference paper by Navid Hassanpour, "Media Disruption Exacerbates Revolutionary Unrest: Evidence from Mubarak's Natural Experiment" {link}. I took a look: here are some immediate reactions.

The theoretical part of the paper is yet another cascade model of protest, a kind of Granovetter++, which comes down to this:

  • Individuals are nodes in a network, and face a decision of protesting or not. Each has a personal threshold for action (in terms of the activity of their neighbours) and some small portion have a zero threshold, acting as seeds for the dynamics.
  • Individuals update their threshold based on the level of activity among their neighbours; the more their neighbours are protesting, the more likely they are to join in.

The dynamics of the model are the spread of action through the network: a cascade in which each actor first becomes more prone to action as those around them take up action, and then joins in themselves.

A "media disruption" reshapes the network – different media lend themselves to different network structures – and the dynamics of this reshaped network will be different, leading to different regions of activity and inactivity.

For many networks, one equilibrium is for almost everyone to be inactive, as each actor has a high threshold for action absorbed from their inactive neighbours. A set of dense connections can freeze regions of a network in states of inactivity: each individual may come in contact with active individuals, but the numerous connections to inactive individuals keep them mute.

Individuals in a less dense network have a greater possibility of becoming active in response to a population of activists because they are less restrained by the inactive people around them, and this is the basic mechanism for the "media disruption exacerbates unrest" thesis.

So it's a bit like an Ising model of magnetism with ferromagnetic coupling: ions can be spin Up or Down, and their orientation depends on the orientation of their neighbours. One difference: ions don't come with different tendencies to choose Spin Up. It's a simple model – it doesn't try to model the availability of information itself, or have anything to say about political preferences or economic situation beyond the threshold for action. But it does bring in the relationship between network structure and outcome.
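To make the mechanism concrete, here is a minimal sketch in Python. It is not Hassanpour's model: I assume fixed thresholds (the paper has individuals updating them), a ring-shaped network, and made-up numbers (40 people, two zero-threshold seeds, everyone else joining once 40% of their neighbours are active). The names ring_network and run_cascade are mine, purely for illustration.

```python
def ring_network(n, k):
    """Hypothetical toy network: each node is linked to its k nearest
    neighbours on either side of a ring of n nodes."""
    return {i: {(i + d) % n for d in range(-k, k + 1) if d != 0}
            for i in range(n)}


def run_cascade(neighbours, seeds, threshold):
    """Threshold cascade: a node becomes active once the active fraction
    of its neighbours reaches `threshold`. Seeds start out active."""
    active = set(seeds)
    while True:
        joiners = {
            node for node, nbrs in neighbours.items()
            if node not in active
            and sum(1 for nb in nbrs if nb in active) / len(nbrs) >= threshold
        }
        if not joiners:          # nobody else is willing to join: equilibrium
            return active
        active |= joiners


N = 40            # assumed population size
SEEDS = {0, 1}    # assumed zero-threshold activists
THRESHOLD = 0.4   # assumed threshold shared by everyone else

sparse = run_cascade(ring_network(N, 1), SEEDS, THRESHOLD)  # 2 links each
dense = run_cascade(ring_network(N, 3), SEEDS, THRESHOLD)   # 6 links each
print(len(sparse), len(dense))  # prints "40 2"
```

With two links each, one active neighbour is already half of a person's contacts, and the protest spreads around the whole ring. With six links each, the same two seeds never amount to more than a third of anyone's contacts, and nothing moves. That is the freezing effect of dense connections in toy form, nothing more.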

There is an empirical part to the paper as well. There is a survey of past revolutions and their relationship to media penetration, which does not convince one way or another (it seems to me there is a conflation of social media and mass media that extends throughout the paper). More interestingly, Hassanpour looks at the progress of the Egyptian uprising and contends that when the government shut down the Internet and phone networks on January 28 it produced, instead of the single gathering at Tahrir Square that had been in progress for several days, eight separate protests in different parts of Cairo. He quotes Peter Bouckaert of Human Rights Watch: "It's clear that the very extensive police force in Egypt is no longer able to control these crowds. There are too many protests in too many places." In the absence of a broadly unifying social media that focused attention on the single winner-take-all protest at Tahrir, people were driven out onto the street to see what was happening and generated local protests instead.

Hassanpour suggests three effects of the disruption of the Internet and of the more widely used cell phone networks:

  • it upset apolitical citizens, turning them against the government;
  • it forced more face-to-face contact;
  • it decentralized the rebellion.

The facts on the ground are open to many interpretations, of which Hassanpour's is one. He doesn't convince me (I don't have the expertise to judge his empirical evidence) but it's a possibility, and I found it a valuable read.

Despite the title, the main interest is not whether social media increases or decreases the net level of activity, but that it reshapes it. There is overlap with Kieran Healy's just-posted work-in-progress, The Performativity of Networks, {link} where he argues that network algorithms, when implemented in social media products, "reorganize the phenomena they purport to describe" (social networks in this case). Facebook is a performance of a theory of friendship.

Digital technologies have been given credit for an almost endless list of roles in uprisings in authoritarian states. Digital technologies help activist circles to better communicate (blogs, IRC chat, and encrypted communications of various levels of security). They build a broader and deeper public sphere by being a space for open discussion (Facebook and Twitter). They give voice to citizens to speak to the outside world and witness the actions of their government (YouTube videos, Twitter, and blogs). They are a stark contrast to mass media. But they also complement Al Jazeera, which can draw from citizen videos. And they complement offline collaboration and offline organizing as well. Hassanpour points out that digital organizing activities are mainly Internet based while SMS is a medium more suited for on-the-fly information. In short, there is little that can't be attributed to social media in one form or another.

Hassanpour at least makes a specific claim, that a disruption of digital technologies produced an immediate and tangible reshaping of political activity, and he has a mechanism for his claim. I'm sure others will interrogate the evidence. There have been many calls to go beyond "duelling anecdotes" in looking at the role of social media, and this paper is an attempt to do that. It's flawed, but if you ask me it's going in the right direction.

As TV goes online, the Internet becomes more like TV

As I said in my So You Think You Can Dance confession, it's quite possible that the Internet will end up complementing mass media, rather than competing with it. It turns out this is a current topic.

At Rough Type, Nick Carr rounds up some recent evidence, including a Nielsen report showing that American TV viewing per person is at an all-time high:

A year ago, the Nielsen Company reported that Americans' TV viewing hit an all-time record high in the first quarter of 2010, with the average person spending 158 hours and 25 minutes a month in front of the idiot box.* That record didn't last long. Nielsen has released a new media-usage report, and it shows that in the first quarter of 2011, the average American watched TV for 158 hours and 47 minutes a month, up another 0.2 percent and, once again, a new all-time high.* Twenty years into the Web revolution, and we're boob-tubier than ever.

… and more, here.
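A quick check of the arithmetic in that quote, as a small Python sketch. The only inputs are the two monthly figures Carr cites; the helper name to_minutes is mine.

```python
def to_minutes(hours, minutes):
    """Convert an hours-and-minutes figure to total minutes."""
    return hours * 60 + minutes


q1_2010 = to_minutes(158, 25)  # monthly viewing, first quarter of 2010
q1_2011 = to_minutes(158, 47)  # monthly viewing, first quarter of 2011

extra_minutes = q1_2011 - q1_2010
increase_pct = extra_minutes / q1_2010 * 100

print(extra_minutes)           # 22
print(round(increase_pct, 1))  # 0.2
```

So the new record is 22 more minutes a month, which does round to the 0.2 percent quoted.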

Meanwhile, in the mainstream press, the Globe and Mail's John Doyle has been to Beverly Hills to listen to TV execs talk about their line-ups, and they are surprisingly cheery about the state of the biz. A decade ago, according to NBC's Robert Greenblatt, “there was a sense that it’s a declining business, and let’s just sort of manage the decline and hope we can get the best out of it.” Now, things are looking up. Here's CBS chief of research David Poltrack, echoing some of what Nick Carr reports (emphasis is mine):

"As this century opened, the focus of the world was on the Internet, and the focus of the television industry was on the DVR. At that time, the business of prognostication was booming, and no one was too optimistic about the future of the broadcast networks. Yet here we stand 11 years later, and the business of network television is alive and well. That doesn’t mean that the entertainment market has not been transformed by the new technologies. It has. What the pundits got wrong was not the impact of the new technologies but the alleged vulnerability of the television networks. … As the broadcasters and the cable networks move more content online, the viewers’ online video diet has included more and more episodes of television programs.

“It would be no surprise to find, for the first time, a significant number of viewers reporting that they’re watching more television than last year. The fact is, the viewer’s perception of television viewing is program- or content-driven, not distribution-driven. Whether they are watching their favourite program on their computer, their tablet or their smartphone, they consider themselves to be watching television. So contrary to the expectations of the pundits, the explosion of new sources of video distribution has actually fortified the market for the broadcast networks’ content.”

A couple of years ago Cosma Shalizi said this about search: "If your search process succeeds in aggregating what large numbers of people think, it will mostly reproduce established, mainstream cultural hierarchies, by definition."  (link now rotten – no, fixed) The same goes for social media as a whole.

Along the same lines, it is perhaps not surprising to see that Wikipedia is, like other institutions, male dominated with less than 10% of edits made by women. The full paper, from the University of Wisconsin, is here.

For me, what comes out of these observations is not that the Internet is retrograde (despite my posts, I do try not to be an old fogy when it comes to technological changes) but that it is best viewed as a site of contention rather than as an actor in the continuing struggle over the future shape of society.