Ethan Zuckerman’s “Cute Cats and the Arab Spring”

Cory Doctorow (*) and Jillian York (*) were both full of praise for Ethan Zuckerman's Vancouver Human Rights Lecture on Cute Cats and the Arab Spring (*), so I listened to the podcast from CBC's Ideas (*). You can also watch the lecture on YouTube (*).

Ethan Zuckerman (EZ) has a long and admirable history of involvement in digital activism and a wide knowledge of both technology and social change; the lecture is worth an hour of your time. But (you knew there was a but) in the end I have to disagree with his main thesis.

1 Dry Tunisian Tinder

EZ tells us how, after years of sporadic and failed protests in Tunisia, one particular spark in the city of Sidi Bouzid ignited the forest fire of revolution. When Mohamed Bouazizi set himself on fire in protest at official interference with his vegetable stall it was a dramatic and desperate act, but not a unique one: he wasn't even the first person to do so that year. What was different this time?

EZ's argument is that digital social media was different. The early protest was captured on video using a cheap phone and posted to a social networking site where… it did NOT "go viral". Instead the video was picked up by Tunisians outside the country (including EZ's friend Sami ben Gharbia1), who were scanning Tunisian web content for political news and curating it on a site called nawaat.org (*).

Al Jazeera got the video from nawaat.org and broadcast it back into Tunisia; Tunisians found out in turn what was going on from Al Jazeera. What's important here, says EZ, is that the new low-cost participatory media is an essential part of a larger media ecosystem that helped to stir up feelings within Tunisia.

2 Cute Cats and Malaysian Opposition

In the 1990s EZ ran a web site called Tripod for college/university students. Surprisingly, many people used it not for the Worthy Purposes he and his colleagues had planned, but to share simple and casual things, like pictures of cute cats. Also surprisingly, some of the heaviest use came from Malaysia. Wondering what was going on, Zuckerman got the Malay content translated, only to find that his site was hosting the Malaysian opposition Reformasi movement (*). Tripod was a space that was difficult for the Malaysian government to censor while remaining an easy place to hold discussions.2

And so we reach the "cute cat theory": the ideal places for those who suddenly have important, politically sensitive material they want to share are sites designed for sharing videos and pictures of "cute cats" (Facebook, Twitter, YouTube, Flickr). These sites are easy to use, have a wide reach, and are difficult to censor – if the government shuts them down it annoys a lot of people and alerts them that something interesting is going on. "Cute cats" sites are natural tinder boxes for revolutionary sparks.

The events EZ recounts are compelling, but a lot of compelling things happen in this strange world, so my first thought whenever I hear a story of the Internet producing some unique chain of events is: can I think of a non-Internet example that matches? So here is the lunch-room theory of political dissent (details from here).

3 Polish lunch rooms

On July 8, 1980, in the lunch room at a transport equipment plant in the eastern Polish town of Swidnik, the price of a pork cutlet jumped from 10.20 zloty to 18.10. For Miroslaw Kaczan, this jump was the final straw, and after lunch he switched off the machines he was working on. Others in Department 320 joined him, and other departments in the factory were quick to join. Soon there was a factory-wide stoppage, and it wasn't just about pork cutlets: the demands of the protesters revealed a wealth of pent-up frustration.

News about the strike in Swidnik spread so quickly that within two weeks 50,000 people in the region were on strike. This wave of strikes was resolved on July 25, but the disruption was far from over: three weeks later the strikes at the Gdansk shipyards in northern Poland started, and within a year Solidarnosc had over 9 million members.

In the early days of the strikes, Poles had a hunger for news of the protests, of course, and despite the heavy censorship of official media they found it, through short-wave radio broadcasts from other countries.

So the lunch-room theory is not that different from the cute-cat theory, except that there's no Internet. People gather wherever they gather for their everyday conversations and interactions, and it is in these everyday places that a spark of frustration can catch fire. And once it does catch fire, a combination of broadcast media and a networked public spreads the news quickly.

Perhaps, the Polish example shows, the Internet is not essential for the spark to turn into a fire. Perhaps a digitally networked public is not the only networked public.

4 Tunisia's Second Act

Even in Tunisia, politically sensitive material for which there is a high demand has found its way through dangerous pathways to reach a public desperate for news.

In a long piece called Streetbook (*) John Pollock interviews two members of an underground Tunisian group called Takriz [update: see Ethan Zuckerman and Jillian York's comments below for reservations about Streetbook]. One of these "Taks" describes how the video that "made the second half of the [Tunisian] revolution" was taken when the regime had shut down the Internet, so "Takriz smuggled a CD of the video over the Algerian border" before forwarding it to Al Jazeera. YouTube may make it easier and safer to make videos available (at least so long as Google lets it be done anonymously), but when an important video was available, the Internet was not essential to the process of distribution.

5 Media Ecology or Network Ecology?

If we are really going to talk about a "media ecology" in the sense EZ means, we need to include all those gathering places – online and offline – which are difficult to shut down precisely because of their everyday, general purpose role. In addition to Facebook and YouTube we need to include factory lunchrooms, mosques and churches, football stadia (*), universities, popular music (*), balconies (*), and more.

All these share a number of properties with Cute Cats sites. They are difficult to shut down without annoying large numbers of previously quiescent people, they are difficult to monitor in detail because of the dispersed and varied nature of the interactions that go on, and they are already familiar places for the gathering and sharing of information. EZ says that "we don't take these 'cute cat tools' seriously enough. These tools that anyone can use, that are used 99% of the time for completely banal purposes" but he doesn't take offline everyday institutions for banal sharing seriously enough.

EZ's mistake is the Achilles heel of social media advocates. Talk of a "networked society" is justified by comparing today's digitally connected populations to a population of couch potatoes watching prime time TV, but such a comparison overlooks all those other institutions of public networking. Instead of talking of a "media ecology" we should be talking of a "network ecology": the intricate tapestry of multiple networking institutions and practices that makes up a society.

Do digital social media supplement other networking institutions or displace them? There has been a lot of work on this at the individual level, but it's much more difficult to evaluate on a societal level. It is possible that digital social media increase the richness of social networks in a society, but it's also possible (likely?) that digital social media are the kudzu of networks, thriving while they strangle the other components of a rich and diverse network ecology; the best network left standing in an impoverished environment.

Footnotes

1 Among other things, Sami ben Gharbia is author of a fantastic essay on The Internet Freedom Fallacy and Arab Digital Activism (*)

2 In fact it may not have been so much that the site was difficult to censor, as that Malaysian government had decided to exclude the Internet as a whole from its otherwise-strict censorship rules (*).

Date: 2012-01-05 22:50:21 EST

2012 Predictions: Turning Points for the Web

Avoiding Cynicism (As If)

Peering into the New Year, my better half Lynne reflected yesterday that it is a duty of each of us, as we grow older, to be vigilant against encroaching cynicism. She's right (of course!), and I do feel that strong and steady current tugging me sluggishly downstream towards the lazy, easy waters of geezerhood, to a place where everything new shows itself only by its flaws and in which every new glass is basically empty.

Luckily, 2012 looks like being a banner year for those of us who take a critical view of the hype and commercialization around digital technology, so I'm actually feeling quite cheery. The number of digital hecklers is growing1, and will continue to do so as the relations between the mainstream Internet and its audience/members sour. A growing wave of disenchantment is gathering enough steam [sic] to become a creative force in its own right, and I think that's going to be fascinating to watch, as well as potentially a period of renewal for alternative culture.

So Happy New Year, and here are a few predictions for 2012. I don't think the full impact of any will be over and done during the calendar year, but I do think we'll look back at 2012 as a turning point in attitudes to digital technologies.

Facebook: Privacy Hits the Mainstream

Prediction
High-profile privacy cases in 2012 will dramatically accelerate the level of public distrust in Facebook, which will spill over to other Internet aggregators.

Privacy has always been the other side of the openness coin. Everyone loves openness, of course, but the last year or two has made it clear that behind Mark Zuckerberg's Facebook profile (*) claim that "I'm trying to make the world a more open place" there is a hard, cold, commercial reality. Are we sharing among each other, or are we feeding Facebook? And where is the boundary between the two?

Here's a dilemma my son faces, which also confronts many other young people. After university one potential employer is the Canadian government. If he clicks his support on Facebook for political protests, will government background checks have access to this information and will it count against him? There's no point asking Facebook even if you did trust it, because today's terms and conditions may change, and the laws governing it may change too. From being an open space where it is easy to express our political views, Facebook is becoming a panopticon where we censor ourselves, not knowing who is watching.

It's not clear that the advertising-driven model of web technology is sustainable given its dependence on data that we are increasingly reluctant to give up. As ex-Facebook engineer Jeff Hammerbacher says, "The best minds of my generation are thinking about how to make people click ads. That sucks." We've lived with this downside until now, but as the choices become more stark this may change, and when things change on the Internet, they can change very quickly. danah boyd's view that "Facebook is a utility; utilities get regulated" (*) will become mainstream. We'll see demands2 for changes to Facebook's practices (see the Europe vs Facebook group (*), and the Irish data protection commissioner's report here) gaining momentum.

Amazon: Abusing Community

Prediction
Change in the open source world as Google takes on Amazon.

Amazon is rapidly making a name for itself as the company to give the Internet a bad name. From brutal working conditions (*) to treating physical bookstores as showrooms (*) to union-bashing (*) to McCarthyist policies around Wikileaks (*) to tax opposition (*) to screwing libraries (*), this company has done everything it can to demolish the image of the Internet as a source of cooperation, collaboration, and open friendship. It has perfected the act of free-riding on open source efforts, building its (remarkable, it must be said) profitable EC2 infrastructure on Xen Hypervisor, using Linux extensively, and not contributing back (*), in the same way it happily takes all those volunteer hours put into Wikipedia and uses them to sell its own devices, messing with authors' rights as it does so (*).

The Kindle Fire is the icing on the cake: Amazon has taken the Android operating system and its Linux kernel and used it to power the Amazon tablet. In doing so, it has taken Google's language of openness around Android (always suspect) and thrown it right back in Google's face, removing the Google applications and most evidence that the device is running Android, and making it an Amazon device from end to end.

With the Kindle Fire looking likely to become the top selling Android tablet, you have to wonder how long Google will welcome this state of affairs. There's a lot of talk about the rivalry between Google and Apple, but the tension between Google and Amazon is the conflict that may change the open source world. The licensing terms for open source software have been increasingly friendly to commercial exploitation of community projects, moving steadily away from the more restrictive GPL (*), and Amazon's nose-thumbing may be the step that forces a re-evaluation of this enterprise-friendly stance.

Apple: Stepping in Front of Google

Prediction
As the open web fragments, Google will look to its bottom line.

Speaking of Google, Apple's voice control system Siri may be the biggest threat the friendly ad-broker has yet faced, and you could argue that Siri is the major threat to the openness of the web.

It's increasingly obvious that the web has several natural bottlenecks, and that these bottlenecks are simultaneously the places where money can be made and chokepoints where political pressure can be applied. Ever since broadband and mobile access replaced ye olde dialup and Internet access became dominated by telcos and cable companies, ISPs have been one set of bottlenecks. Mobile device makers are another. The DNS system itself is yet another, which SOPA is looking to squeeze. Finally, there is aggregation, Silicon Valley's preferred source of influence.

Aggregation creates a single point of entry into a part of the web, whether it's aggregating consumer items (Amazon), digital products (Apple), people (Facebook), or the web itself (Google), and aggregation is driven by increasing returns to scale. The point of aggregators is to stand between us and what we want to reach, guiding us to those parts of it that seem best.

The thing about Siri is that it stands in front of Google, potentially displacing the search box as iPhone users' point of entry to the web. Just as removal from Google's search engine makes you vanish from the web, so Siri has the potential to make Google vanish. Well, not vanish in the short term, but fade at least. Apple negotiates deals with providers like Yelp and Wolfram Alpha, doing an end run around the PageRank algorithm.

If Siri and other voice-recognition "assistants" move towards the mainstream, we can expect to experience an increasingly curated/censored version of the web (*). The relationship between Apple and the anti-establishment has always been love-hate, and Siri may drive it into hate-hate.

Google's friendly image can last only so long as its growth rate and profit margins stay healthy. It has already lost the aura of being the place to be for programmers, and soon we'll see enough competition to force Google into a more orthodox stance, and that will shock a lot of observers.

Footnotes:

1 A few years ago Andrew Keen's silly "Cult of the Amateur" was the most prominent digital criticism book. Now we have Zittrain ("The Future of the Internet and How to Stop It"), Carr ("The Shallows"), Turkle ("Alone Together"), Wu ("The Master Switch"), Lanier ("You Are Not a Gadget") and many more.

2 "Demands" is not the best word, as Chris Dillow points out (*)

Comment problems and one other update

I've had a couple of people tell me they were unable to post comments. After taking this up with Typepad, I have shifted from their "Typepad connect" system back to the vanilla commenting system. If you have problems posting a comment, then I would appreciate an email at last name dot firstname (gmail). And you can rest assured it isn't personal unless you are a spambot, in which case it is, or would be if you were a person.

Also, in a recent post I suggested that there was a conflict of interest in a paper I read. The authors have now published an explanation at the end of the paper and I retract the suggestion (I still don't understand why lead author Gilad Lotan lists his affiliation as he does, but that's his personal decision). I've put a note in the post to clarify.

Morozov on Jarvis: Is There a Point?

Jeff Jarvis's 2009 book What Would Google Do? is a breathless paean to the benefits of sharing, linking, and being open, but it has not a single reference or footnote, and no bibliography. Jarvis extols the virtues of listening and speaks of mutuality but in the end, of course, the benefits flow one way. Jeff Jarvis has become wealthy from this new ethic of sharing — he is fond of "starting conversations" which he can then take ownership of — but when it comes to giving credit to those who come before, for example by referencing previous writers on the topics he addresses, well it just seems like it's too much work for him. The book is one long argument by assertion, unsupported by facts and liberally sprinkled with utterances like "small is the new big" or "We have shifted from an economy based on scarcity to one based on abundance" or "Google has built its empire on trusting us".

It looks like his new book, Public Parts, is more of the same. The New Republic just published a long review of the book by Evgeny Morozov here or here. It's forthright, opinionated, angry, entertaining and also makes some damning arguments against the book. 

Jeff Jarvis responds to the review here in bizarre fashion. He first raises the prospect of personal prejudice ("Morozov reliably dislikes me, just as he dislikes people I quote") and then dismisses the review as "he writes only a personal attack". Morozov spends 800 words critiquing Jarvis's misunderstanding of ideas about the public sphere and his oversimplification of Habermas, which Jarvis distorts and reduces to a complaint "about the names Habermas and Oprah appearing in the same book". Morozov spends 600 words on Public Parts' culturally narrow ideas about Germany, Finland, and the strange attitudes of non-Americans to privacy, which Jarvis encapsulates as "[Morozov] finds Streetview to be a case of Germans 'tyrannized by an American company'". In short, Jarvis exaggerates and distorts the arguments before dismissing them.

*                                *                                 *

To anyone who reads carefully, the argument is over and Morozov wins, but unfortunately that's not the end of the story. Much of Morozov's frustration comes from Jarvis's refusal to engage with the world of facts. He stays safely in the world of pronouncement ("Publicness is a sign of our empowerment", "the crowd owns the wisdom of the crowd" and so on). Jarvis is skilled at the marketing of ideas: if you Google [publicness], four of the first page listings are about or by Jarvis, and this canny use of branding will keep his profile high, well beyond the reach of factual criticism. Jarvis knows his audience and what they want to hear, and what they want is a self-help message for businesses: the world is changing, everything you thought you knew is irrelevant, and I have the key to the future.

So what, then, is the point of the hours Morozov spent writing a 7,000 word review if he won't reach Jarvis's core constituency? There are two other audiences that such pieces can reach. The first is those who broadly agree with Morozov's perspective (yes, like me) that there is an ulterior motive, a very familiar and old-fashioned one, behind this talk of sharing and publicness. We cannot read every new book, watch every new TED talk, attend every conference, and yet we do need to stay current and stay informed. I am not going to read Public Parts because there are so many other things to read, but I cannot afford to be completely ignorant of it. Morozov's review does the job for me.

The second is more important. Many people are attracted by the romantic rhetoric of openness, sharing, and the end of existing institutions, but not all have yet sorted out the political consequences of a commitment to these virtues. There are still people on the fence – and it's important for these people to know that, no matter what progressive-sounding language is used, some of the most idealistic arguments for sharing are made by those who will mine the data you provide in order to build fortunes from advertising. To shape that debate and to keep a political space open for an Internet that does not simply follow the venture-capitalist idea of progress, we need fact based arguments, so kudos to Morozov for doing the necessary work in this case.

Broken Promises: Following Your Dreams, and the 99 percent

This speech by Steve Jobs has been posted in many places over the last 24 hours:

[Embedded video: Steve Jobs's 2005 Stanford commencement address]

It is a strange speech: quite moving, personal, modest, and thoughtful. But in the end it’s a “follow your dreams” speech, and as such is quite a contrast to another Internet event of the moment, the very moving stories being posted at We are the 99 percent.

“Follow your dreams” invokes a cosmic bargain (fortune favours the brave) and it also invokes a social bargain: that if you work hard, and have a little luck, society will ensure that your efforts are rewarded. Meanwhile, the "we are the 99%" posters “sense that the fundamental bargain of our economy – work hard, play by the rules, get ahead – has been broken, and they want to see it restored” (Felix Salmon, quoted here).

So nothing against the guy, but over the next few days I’ll think more about what the 99%-ers say than what Steve Jobs said at Stanford. One of the stories he tells is of dropping out of college and, instead, auditing courses independently. It's an inspiring story, but the contrast to this post, made today, is glaring.

My favourite post…

… is this one.

Every now and then I look back at previous posts on this blog. Some I still like, some not so much.  Some got a lot of views at one time or another, but my favourite post of all got little attention. 

I think that this time right now, with Amazon's and Facebook's recent announcements and Apple's to come next month, marks a turning point in attitudes to the web and the companies that dominate it.

So please, I don't often trumpet my own writing and it's not that easy to read, but this post is exactly what this blog is all about.

Data Anonymization and Re-identification: Some Basics Of Data Privacy

Why Personally Identifiable Information is irrelevant. An introduction to information entropy, open data, and the possible end of crowdsourcing. 

Tim O'Reilly and ZIP Codes

From his Strata Conference on Data Science, Tim O'Reilly tweeted with dismay the recent California court decision that the zipcode is now to be classified as "personally identifiable information". "No more demographics" he lamented. A little later he retweeted a response that "apparently 87% of US residents can be uniquely identified by zip+DOB+gender: bit.ly/qysMqs" and later followed up with "Here's a reference for the claim that zip code, gender and DOB uniquely identify 87% of individuals: http://www.citeulike.org/user/burd/article/5822736 via @crdant".

These tweets are odd and disturbing. The zip/DOB/gender finding is a basic one in studies of privacy, published years ago by Latanya Sweeney of Carnegie Mellon University. I gave a talk at work on privacy a year ago, and this was one of the first references I came across. Tim O'Reilly has been pushing an agenda of Open Data, particularly Open Government Data, for the last couple of years, and yet it looks as if he isn't aware of the basic privacy issues around such data. Can that really be the case?

If it is, then here, to help Tim along, are some notes from my talk as a kind of introduction to data privacy, or at least to data-anonymization and re-identification. A great resource on some of these issues from a legal perspective is Paul Ohm's 2009 paper "Broken Promises of Privacy: Responding to the Surprising Failures of Anonymization" (PDF), University of Colorado Law Legal Studies Research Paper No. 09-12. It's long, but it's so well written it's an easy read. Much of these notes originated with this paper, in one form or another.

How Privacy Broke Crowdsourcing

A few years ago Netflix ran its highly successful and widely publicised crowdsourced prize competition, in which it released a data set of users and their movie ratings and let competitors download them and search for patterns. The data consisted of a customer ID (faked), a movie, the customer's rating of the movie, and the date of the rating.

In the FAQ for the competition, Netflix said this:

Q. Is there any customer information in the dataset that should be kept private?

A. No, all customer identifying information has been removed; all that remains are ratings and dates. This follows our privacy policy… Even if, for example, you knew all your own ratings and their dates you probably couldn't identify them reliably in the data because only a small sample was included (less than one tenth of our complete dataset) and that data was subject to perturbation.

This certainly looked reasonable enough, but Arvind Narayanan and Vitaly Shmatikov of the University of Texas had other ideas.1 First, they looked at the claim that the data was perturbed by asking acquaintances for their ratings. They found that only a small number of the ratings were perturbed at all, which makes sense, because perturbing the data gets in the way of its usefulness.

In the Netflix data set, different users have distinct sets of movies that they have watched. The data set is sparse (most people have not seen most movies), and there are many different movies available, so individual tastes and viewing histories leave a clear fingerprint. That is, if you knew what movies someone watched, you could pick them out of the data set because no one else would have seen the same combination.

A closer look showed that with 8 ratings (of which 2 may be completely wrong) and dates that may have a 14-day error, 99% of the records in the Netflix data set uniquely identify an individual. For 68% of records, two ratings and dates are sufficient. Various other combinations of information are sufficient to identify users: for example, 84% of records are identified by 6 of 8 movies outside the top 500.
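To see why sparseness makes this kind of fingerprinting work, here is a minimal Python sketch with invented toy data (not the real Netflix records, and not the authors' actual algorithm): score every released record against a handful of noisy auxiliary observations, allowing ratings to be off by one and dates to be off by 14 days.

    from datetime import date

    # Toy "anonymized" release: record id -> {movie: (rating, date)}.
    # All names and numbers below are invented for illustration.
    released = {
        "record_1": {"Memento": (5, date(2005, 3, 1)),
                     "Heat": (3, date(2005, 4, 2)),
                     "Pi": (4, date(2005, 4, 9))},
        "record_2": {"Memento": (4, date(2005, 6, 7)),
                     "Shrek": (5, date(2005, 6, 8))},
    }

    # Auxiliary knowledge about a target: a few movies with approximate
    # ratings and dates (gleaned, say, from public reviews).
    aux = {"Memento": (5, date(2005, 3, 3)), "Pi": (4, date(2005, 4, 20))}

    def score(obs, record, rating_slop=1, date_slop=14):
        """Count auxiliary observations consistent with a released record."""
        hits = 0
        for movie, (rating, when) in obs.items():
            if movie in record:
                r, d = record[movie]
                if abs(r - rating) <= rating_slop and abs((d - when).days) <= date_slop:
                    hits += 1
        return hits

    scores = {rid: score(aux, rec) for rid, rec in released.items()}
    print(scores)  # {'record_1': 2, 'record_2': 0}: record_1 stands out

The paper's scoring is more careful than this sketch – it weights rare movies more heavily and measures how far the best candidate stands out from the next best – but the principle is the same: in a large, sparse data set, two or three noisy observations leave almost no candidates.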

But of course there is no personally identifiable information in the data set. So is this a privacy issue? It is when you have another data set to look at. The researchers took a sample of 50 IMDB users. The IMDB data is noisy – there is no ranking, for example. Still, they identified two users whose Netflix records were 28 and 15 standard deviations away from the next best candidate: one matched on ratings, the other on dates.

So despite Netflix's best efforts, the data set included enough information to identify some individuals. Partly because of this, a planned follow-up competition was scrapped, and the whole enterprise of crowdsourcing recommender algorithms was given a possibly terminal blow.

What's this all about?

Just to be clear, this set of notes is not about the following things:

  • Encryption
  • Restricting access to data
  • Lost USB keys and CDs

It is about these:

  • Deliberately released data that turns out to infringe on privacy
  • HIPAA, EU Data Directive, corporate rules for handling customer data
  • Advertising and ISPs
  • Gov 2.0, data.gov, and "openness"

It's about claims such as: "Attorneys on Monday accused Google of intentionally divulging millions of users' search queries to third parties in violation of federal law and its own terms of service" (October 26 2010)

"MySpace and some popular applications on the social-networking site have been transmitting data to outside advertising companies that could be used to identify users, a Wall Street Journal investigation has found" (October 23, 2010)

"Facebook users may inadvertently reveal their sexual preference to advertisers in an apparent wrinkle in the social-networking site's advertising system, researchers have found" (October 22, 2010)

(These claims are a year old, found in the week before I gave the talk. I'm sure there are many more.) The Facebook case was one in which advertisers (for a nursing program I believe) asked to target their ads specifically at females and at men interested in other men. But unlike, for example, an ad about a gay bar where the target demographic is blatantly obvious, a male user reading the ad text would have no idea that it had been targeted solely at a very specific demographic, and that by clicking it he would reveal to the advertiser both his sexual preference and a unique identifier (cookie, IP address, or e-mail address if he signs up on the advertiser's site). "Furthermore (the researchers wrote) such deceptive ads are not uncommon; indeed exactly half of the 66 ads shown exclusively to gay men (more than 50 times) during our experiment did not mention 'gay' anywhere in the ad text."

Don't we have laws to deal with this?

Indeed we do. Europe and the USA adopt different approaches to balancing privacy and utility, with the US adopting industry-specific rules (HIPAA for health, FERPA for education, Driver's Privacy Protection Act, FDA regulations, Video Privacy Protection Act etc), while the EU has taken a global approach with the Data Protection Directive. But both approaches are based on a common set of concepts and assumptions.

The big thing is that there is an assumption that data can be anonymized, and once it is then you can share it, because where's the harm? Both sets of rules are built on the idea that there is such a thing as personally identifiable information (PII) and that you can hide it, while still releasing data that is useful. The release process is "release and forget" because if data is properly anonymized why do you have to track what's done with it? There is a faith in the anonymization process, and that faith was broken by the Netflix study and a couple of other related studies.

Latanya Sweeney and the Massachusetts Governor

Let's go back to a time before HIPAA, when the debate focused on how much anonymization you needed to do. Here are some quotations from Latanya Sweeney's paper (PDF), which Tim O'Reilly appeared to be unaware of.

"Figure 1" below is a simple Venn diagram with two intersecting circles. The left circle holds medical data: Ethnicity, Visit Date, Diagnosis, Procedure, Medication, Total Charge. The right circle holds a voter list: Name, Address, Date Registered, Party Affiliation, Date Last Voted. And in the intersection is ZIP, Date of Birth, Sex.

The National Association of Health Data Organizations (NAHDO) reported that 37 states in the USA have legislative mandates to collect hospital level data and that 17 states have started collecting ambulatory care data from hospitals, physicians offices, clinics, and so forth. The leftmost circle in Figure 1 contains a subset of the fields of information, or attributes, that NAHDO recommends these states collect; these attributes include the patient's ZIP code, birth date, gender, and ethnicity.

In Massachusetts, the Group Insurance Commission (GIC) is responsible for purchasing health insurance for state employees. GIC collected patient-specific data with nearly one hundred attributes per encounter along the lines of those shown in the leftmost circle of Figure 1 for approximately 135,000 state employees and their families. Because the data were believed to be anonymous, GIC gave a copy of the data to researchers and sold a copy to industry.

For twenty dollars I purchased the voter registration list for Cambridge Massachusetts and received the information on two diskettes. The rightmost circle in Figure 1 shows that these data included the name, address, ZIP code, birth date, and gender of each voter. This information can be linked using ZIP code, birth date and gender to the medical information, thereby linking diagnosis, procedures and medications to particularly named individuals.

For example, William Weld was governor of Massachusetts at that time and his medical records were in the GIC data. Governor Weld lived in Cambridge Massachusetts. According to the Cambridge Voter list, six people had his particular birth date; only three of them were men; and, he was the only one in his 5-digit ZIP code. [Editor's note: a 5-digit zip code may have several thousand people in it.]

The example above provides a demonstration of re-identification by directly linking (or "matching") on shared attributes. The work presented in this paper shows that altering the released information to map to many possible people, thereby making the linking ambiguous, can thwart this kind of attack. The greater the number of candidates provided, the more ambiguous the linking, and therefore, the more anonymous the data.

In a theatrical flourish, Dr. Sweeney sent the Governor's health records (which included diagnoses and prescriptions) to his office.
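For concreteness, here is a toy sketch of the linking step, with invented records standing in for the GIC release and the Cambridge voter list: build an index of the voter list keyed on (ZIP, birth date, sex), then look up each medical record and report the unique matches.

    # Toy linking attack: every record below is invented.
    medical = [  # released "anonymously", without names
        {"zip": "02138", "dob": "1945-07-31", "sex": "M", "diagnosis": "hypertension"},
        {"zip": "02139", "dob": "1972-04-13", "sex": "F", "diagnosis": "asthma"},
    ]
    voters = [  # purchased voter registration list, with names
        {"name": "A. Voter", "zip": "02138", "dob": "1945-07-31", "sex": "M"},
        {"name": "B. Voter", "zip": "02139", "dob": "1980-01-02", "sex": "F"},
    ]

    def key(r):
        return (r["zip"], r["dob"], r["sex"])

    index = {}
    for v in voters:
        index.setdefault(key(v), []).append(v["name"])

    for record in medical:
        names = index.get(key(record), [])
        if len(names) == 1:  # a unique match re-identifies the record
            print(names[0], "->", record["diagnosis"])
    # prints: A. Voter -> hypertension

The attack is nothing more than a database join on attributes nobody thought of as identifying.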

Now, of course, health information in the US is governed by HIPAA, but according to HIPAA, "de-identified" health information is unregulated. De-identified means either: a statistician says it is de-identified, or the 18 Personally Identifying Information (PII) identifiers are suppressed or generalized. These PIIs are things like Name, e-mail address, social security numbers, computer IP addresses, and so on.

The EU doesn't list specifics. Instead it says that PII is "anything that can be used to identify you". But what does that cover? IP addresses perhaps? Here is Google in their argument to the EU:

  • we "are strong supporters of the idea that data protection laws should apply to any data that could identify you. The reality is though that in most cases, an IP address without additional information cannot."
  • "We believe anonymizing IP addresses after 9 months and cookies in our search engine logs after 18 months strikes the right balance."
  • "we delete the last octet after nine months (170.123.456.XXX)"

The Latanya Sweeney result was the first to show that once you can mix and match data sets, PII is just not enough to provide privacy. And nowadays, of course, data mining multiple data sets is big business.

How Do You Anonymize Data? k-anonymity

Let's step back a little and look at the technical side of anonymization. There are four basic methods for anonymizing data, illustrated in the toy sketch after this list:

  • Replacement: substitute identifying numbers
  • Suppression: omit fields from the released data
  • Generalization: replace a value with something less specific; for example, replace birth date with year of birth
  • Perturbation: make random changes to the data
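As a quick illustration (the record, field names, and particular transformations are all invented for the example; each is one choice among many, not a recommended recipe):

    import random

    # A toy record to anonymize.
    record = {"id": 12345, "dob": "1965-09-20", "zip": "02141", "charge": 312.50}

    record["id"] = "P-001"                    # replacement: an opaque substitute ID
    del record["charge"]                      # suppression: drop the field entirely
    record["dob"] = record["dob"][:4]         # generalization: birth date -> year only
    record["zip"] = record["zip"][:3] + "**"  # generalization: 5-digit ZIP -> 3-digit prefix
    record["dob"] = str(int(record["dob"]) + random.choice([-1, 0, 1]))  # perturbation: add noise
    print(record)  # e.g. {'id': 'P-001', 'dob': '1966', 'zip': '021**'}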

Then you have to measure how private a data set is. Latanya Sweeney came up with the notion of k-anonymity to define this. Here's how it works.

Think about a table, with rows and attributes. Each attribute is either part of a quasi-identifier (like a name or address), or is sensitive information (like the fact you had an operation on a particular afternoon). A quasi-identifier is a set of attributes that, perhaps in combination, can uniquely identify individuals. Sensitive information includes the attributes that we want to keep private. Your driving license number is an identifier; your driving record is sensitive information. The table satisfies k-anonymity iff each sequence of values in any quasi-identifier appears with at least k occurrences. Bigger k is better.
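The definition translates directly into code. A minimal sketch (with three invented toy rows; the fuller tables below do the same calculation by hand): k is the size of the smallest group of rows agreeing on the quasi-identifier.

    from collections import Counter

    rows = [  # invented toy data
        {"race": "Black", "birth": "1965", "sex": "M", "zip": "0214*", "problem": "Short breath"},
        {"race": "Black", "birth": "1965", "sex": "M", "zip": "0214*", "problem": "Chest pain"},
        {"race": "White", "birth": "1964", "sex": "F", "zip": "0213*", "problem": "Obesity"},
    ]

    def k_anonymity(rows, quasi_identifier):
        """k = size of the smallest group of rows sharing quasi-identifier values."""
        groups = Counter(tuple(r[a] for a in quasi_identifier) for r in rows)
        return min(groups.values())

    print(k_anonymity(rows, ["race", "birth", "sex", "zip"]))  # 1: the third row is unique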

So here is a table with 11 rows.

Name Race Birth Gender Zip Problem
Sean Black 1965-09-20 M 02141 Short breath
Daniel Black 1965-02-14 M 02141 Chest pain
Kate Black 1965-10-23 F 02138 Hypertension
Marion Black 1965-08-24 F 02138 Hypertension
Helen Black 1964-07-11 F 02138 Obesity
Reese Black 1964-01-12 F 02138 Chest Pain
Forest White 1964-10-23 M 02138 Chest Pain
Hilary White 1964-03-15 F 02139 Obesity
Philip White 1964-08-13 M 02139 Short breath
Jamie White 1967-05-05 M 02139 Chest pain
Sean White 1967-03-21 M 02138 Chest pain

If we remove all the attributes except for the problem we have a very anonymized data set (k = 11):

Name Race Birth Gender Zip Problem
          Short breath
          Chest pain
          Hypertension
          Hypertension
          Obesity
          Chest Pain
          Chest Pain
          Obesity
          Short breath
          Chest pain
          Chest pain

On the other hand, if we just remove the name and generalize the zip code and date of birth we have a less anonymized set. Exercise: convince yourself that k=2 for this set.

Name Race Birth Gender Zip Problem
  Black 1965 M 0214* Short breath
  Black 1965 M 0214* Chest pain
  Black 1965 F 0213* Hypertension
  Black 1965 F 0213* Hypertension
  Black 1964 F 0213* Obesity
  Black 1964 F 0213* Chest Pain
  White 1964 M 0213* Chest Pain
  White 1964 F 0213* Obesity
  White 1964 M 0213* Short breath
  White 1967 M 0213* Chest pain
  White 1967 M 0213* Chest pain

Of course, the issue is utility. There is a tradeoff between keeping the data useful for research and maintaining privacy. Researchers and attackers are doing the same thing after all: looking for useful patterns in the data. With the k=2 data set you can ask questions about correlation of problems with gender, or with geography to some extent (although not very specific geographical factors, like toxic leaks).

It would be nice if you could make the data set anonymous for the purposes of attackers, but still useful for researchers. But it turns out you can't. In a paper called The Cost of Privacy: Destruction of Data-Mining Utility in Anonymized Data Publishing, Justin Brickell and Vitaly Shmatikov investigated the problem. They took a set of different sanitization methods and compared them to a data set with trivial sanitization (removal of identifiers). Here are the results.

[Figure: paired privacy and utility scores for a range of sanitization methods]

The left bar of each pair shows the privacy score (smaller = more private); the right shows the utility to the researcher (bigger = more useful). Anonymization seeks to shorten the left bar without shortening the right, but the results show, depressingly, that small increases in privacy cause large decreases in utility.

Please could you tell me about the Database of Ruin?

OK, if you insist.

If we are going to take a new look, we need to recognize that privacy is not a binary issue, and it is not a property of a single data set. We need to worry about reidentification attacks that do not reveal sensitive information. As Paul Ohm writes: "For every person on earth, there is at least one fact about them stored in a computer database that an adversary could use to blackmail, discriminate against, harass, or steal the identity of him or her… the 'database of ruin' containing this fact but now splintered across dozens of databases on computers around the world."

Privacy is erased incrementally as successive queries reduce uncertainty and narrow in on an individual. The way to quantify this reduction in uncertainty is to use the idea of information entropy, adopted from the thermodynamic concept, and usually identified by H. The information gained in a query is

H(before) – H(after)

as you increase your knowledge of a system, the entropy (loosely, disorder) decreases.

So what is the formula for H?

For a set of outcomes {i, …}, each with probability p_i, the information entropy is:

H = – Σ_i p_i log2(p_i)

and is measured in bits. The logarithm appears because the probability of two independent events occurring is the product of the probabilities of each event, but the information we gain from observing two independent events is the sum of the information we gain from each event.

Take a simple example: a coin toss. Before the toss, there are two outcomes with equal probabilities, so

H = -(1/2) log2(1/2) – (1/2)log2(1/2)

= – log2(1/2)

= log2(2)

= 1 bit

which makes sense if you think about it, because the coin could be heads (1) or tails (0).

After the toss, H = log2(1) = 0: there is no uncertainty left and we have complete information about the system.

In the same way, a dice roll has (before rolling it) an entropy of

H = log2(6) ~ 2.6 bits

So if the challenge is to identify one person in the population of the world, how much information entropy is there? The identity of a random, unknown person is just under 33 bits (2^33 ~ 8 billion). Hence the web site 33bits.org.

Learning a fact about the individual reduces the uncertainty (reduces information entropy). So if you learn that someone's star sign is Capricorn then that's -log2(1/12) = log2(12) = 3.58 bits of information.

If you find out other independent pieces of information you add up the contributions to find out how much the entropy has been reduced. So a ZIP code may provide 23.81 bits of information, a birthday 8.51, and gender 1 bit, for a total of 33.32 bits: it probably identifies one individual.
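The arithmetic is easy to check. A minimal sketch using the figures from the text (the 23.81-bit ZIP code estimate is taken as given):

    from math import log2

    print(log2(2**33))    # 33.0 bits: one person among 2^33 ~ 8.6 billion

    print(log2(12))       # 3.58 bits: learning a star sign
    print(log2(365))      # 8.51 bits: learning a birthday
    print(log2(2))        # 1.0 bit: learning a gender

    # Independent facts add up: ZIP + birthday + gender
    print(23.81 + log2(365) + log2(2))  # 33.32 bits: probably one individual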

The Netflix de-anonymization paper used these ideas a bit. The a priori entropy of the data set (the additional information required for complete de-anonymization) is 19 bits (2^19 = 524288, which is about the number of individuals in the data set). Individual movies give from 1 to 18 bits of information, depending on what you know about them (dates within 14 days, ratings +/- 1). Very popular movies gave little information, but niche movies viewed by few individuals yielded many bits of information. So little auxiliary information is needed to re-identify records in the database. In a theoretical excursion, the researchers showed that de-anonymization is going to be robust against noise, and does not need much additional information, so long as the data set is large and sparse enough.

So what now?

I can do nothing better than quote from Paul Ohm to summarize the privacy dilemma we find ourselves in.

"Abandoning PII is a disruptive and necessary first step, but it is not enough alone to restore the balance between privacy and utility that we once enjoyed. How do we fix the dozens, perhaps hundreds, of laws and regulations that we once believed reflected a finely calibrated balance but in reality rested on a fundamental misunderstanding of science?

Techniques that eschew release-and-forget may improve over time, but because of inherent limitations like those described above, they will never supply a silver bullet alternative. Technology cannot save the day, and regulation must play a role."

Ohm notes that the US sectoral approach to regulation sets the privacy bar too low, by focusing on explicitly listed PII. Meanwhile the EU standard of "anything that can be used to identify you", interpreted in the light of modern de-anonymization techniques, would set it too high.

The direction to take, says Ohm, is to focus on the people, not the anonymization, and to distinguish private from public release. We need to codify notions of trust and practices and apply strong sanctions against re-identification. This will put more administrative and procedural burdens in our future, but is needed to preserve privacy.

And, to return to Tim O'Reilly's tweet, open data advocates and big data enthusiasts need to pay more attention to these issues rather than relying, as some do, on personally identifying information as a sufficient solution.

Footnotes:

1 Arvind Narayanan and Vitaly Shmatikov, Robust De-anonymization of Large Sparse Datasets, IEEE Symposium on Security and Privacy, 2008. (gated link)