Another Few Words on Netflix

Last week’s post on the Netflix Prize brought more readers here than anything else I’ve written.

Being vain I tracked its progress – it got few readers in the first couple of days, then a burst because of being picked up by the Economist Link Mafia of Brad DeLong and Marginal Revolution – thanks to both. But the big spike came when someone picked it up from there and posted it to reddit. Then it made its way to some other tech sites like ycombinator and geekpress, and that kept things going for a while. After a week, the server’s disks are having a chance to cool down.

There were lots of good comments as well on the various places it got listed (as well as here).

ZH suggested using a Java app together with HSQL to do analysis. I’m confident that SQL Anywhere is as fast as HSQL, but you are right that you could do more with the data in an application than with simple queries like I was doing. However, in the end you have to deal with sparse matrices, and databases don’t usually store or manipulate those well.

Someone on reddit (can’t find the link now) said they have posted a C++ framework for analyzing the data set, together with one of the more popular approaches, a Singular Value Decomposition algorithm. So if you want to try out algorithms that are better than Netflix’s current one, go hunt for the article on reddit and you’ll find one.

Quite a few people pointed out that the challenges the Netflix Prize faces because of the way its data is collected don’t mean all recommender systems must be thrown in the bin, and I agree. Narrowing it down to one person per "customer" instead of a household would help. Other things (mood indicators, half star ratings) may make incremental improvements. But for movies, mostly watched once, there are always going to be some issues of variability. For music – and anything else where we return over and over again to something we buy or experience – rating systems have a better chance of success. Mostly we buy music we either know we like or have a good idea we will like, so we can use purchases instead of surveys as a way of tracking preferences. But movies are experience goods except for those few you may watch more than once.

The other main topic was people’s experience of rating systems. This seemed highly variable – some say they are useless, others that they are fine, and others that a small difference in accuracy may make a big difference. Some suggested that tracking the RMSE is not very useful – but that was Netflix’s decision, not mine. What I find interesting here is that with all the impressions we have collected now there is so little (that I know of) to back up or refute our subjective impressions. 

And now it’s time for this blog to step off the information highway and back onto the information backroads where it belongs.

The Netflix Prize: 300 Days Later

Today the Netflix Prize Competition has been running for 300 days.

Online DVD rental outfit Netflix caused a real buzz last October when it announced the competition. If anyone can come up with a recommender system for predicting customer DVD preferences that beats its own algorithm (Cinematch) by a certain amount, Netflix will hand over $1 million. The prize got a lot of attention because it exemplifies the idea of crowdsourcing. Not only does Netflix rely on crowdsourcing of DVD ratings (user ratings of DVD titles) but the competition itself is an attempt to use crowdsourcing to develop the algorithms to make the most of those ratings. Instead of doing the work itself, or hiring specialists, Netflix lets anyone enter its competition and pays the winner. The competition is still in progress: Netflix says it will run until at least 2011. So now the initial buzz has died down, what can we learn from the Netflix Prize?

First, the competition details (see here (PDF) for a short paper by two Netflix employees). Netflix made public a database of customer DVD ratings (tweaked to ensure privacy) that included over 100 million individual ratings of 17,770 titles by 480,189 people. If you sign up for the prize, you can download these ratings. Each rating involves one customer giving an integer number from one star (very bad) to five stars (very good) for a given title. For example, customer 296452 gave title 234 ("Animation Legend: Winsor McCay") a rating of 1 (very bad).

The idea is that competition entrants develop an algorithm using the training set (the set of 100 million plus ratings), try it out on a probe set of test data that Netflix also provides, and, once they think they have a good algorithm, create a set of predictions for a qualifying set of users and titles and upload it to Netflix. Netflix tests these predictions against the actual ratings (which it keeps private) for that qualifying set and posts the leading algorithms on a leaderboard.

The quality of any algorithm is determined by its root mean squared error (RMSE). To calculate the RMSE you take the difference between the rating the algorithm predicts and the actual rating, and square it so it’s guaranteed to be a positive number. Then you take the average of all these over the set of data to get the mean squared error. Finally, taking the square root gives the RMSE, which is roughly the size of a typical error.
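As a minimal sketch, the three steps just described (square the differences, average them, take the square root) look like this in Python. The function name and the four sample ratings are made up for illustration; nothing here comes from the Netflix data or from Netflix's own scoring code.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of the same length")
    # Square each difference so every error counts as a positive number.
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    # Average the squared errors, then take the square root.
    mean_squared_error = sum(squared_errors) / len(squared_errors)
    return math.sqrt(mean_squared_error)


# Hypothetical predictions and ratings, for illustration only.
predictions = [3.5, 4.0, 2.0, 5.0]
actuals = [4, 4, 1, 3]
print(rmse(predictions, actuals))
```

Because the errors are squared before averaging, the one prediction that is off by two stars contributes far more to the result than the one that is off by half a star.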

A perfect algorithm would predict exactly what rating every user would give to every title and would have an RMSE of zero. A random set of predictions has an RMSE of 1.95. But the actual range of action is much narrower than this 1.95 range. A simple algorithm that uses the average rating for each title as the prediction – "let’s see, the average rating for the 104,000 customers who rated Mean Girls was 3.514, so I predict you will give it a rating of 3.514" – gets an RMSE of 1.0540. Netflix’s Cinematch algorithm has an RMSE of 0.9525. Netflix set the prize target at a 10% improvement over that, which is an RMSE of 0.8563. So the range that recommendation systems can realistically cover – from naively simple to cutting-edge research – seems to be the narrow band between the middle three lines in the following diagram.
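The "average rating for each title" baseline is simple enough to sketch in a few lines of Python. This is an illustration only: the tuple layout (customer, title, rating), the function names, and the toy data are assumptions, and falling back to the overall average for a title with no training ratings is my own choice rather than anything the competition specifies.

```python
import math
from collections import defaultdict


def title_means(training):
    """Map each title to the mean of its ratings in the training data."""
    totals = defaultdict(lambda: [0, 0])  # title -> [sum of ratings, count]
    for _customer, title, rating in training:
        totals[title][0] += rating
        totals[title][1] += 1
    return {title: s / n for title, (s, n) in totals.items()}


def baseline_rmse(training, held_out):
    """RMSE of predicting each held-out rating with its title's training mean."""
    means = title_means(training)
    # Assumed fallback for titles that never appear in the training data.
    overall_mean = sum(r for _, _, r in training) / len(training)
    squared = [(means.get(t, overall_mean) - r) ** 2 for _, t, r in held_out]
    return math.sqrt(sum(squared) / len(squared))


# Toy data, not taken from the Netflix set.
training = [(1, "A", 4), (2, "A", 2), (3, "B", 5), (1, "B", 5)]
held_out = [(4, "A", 3), (4, "B", 4)]
print(baseline_rmse(training, held_out))
```

Every customer gets the same prediction for a given title, which is why this baseline marks the naive end of the band.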

In the days and weeks after the prize was announced, progress was rapid. The Cinematch score was matched within a week. Within a month the leaders were half way to the winning prize with a 5% improvement. But further progress has proved more and more difficult. It took another month to get to a 6% improvement, about 5 more months to get to 7%, and the current (July 29 2007) leader is at 7.8% improvement and has been unchanged for a month. Here is a graph of the progress, showing the three lines above and the prize leader progress:

At this stage it is not clear if the prize is winnable: the existing algorithms use a lot of linear algebra and some pretty fancy machine learning ideas (see a description by a leading participant here and some sample code for a similar approach here), the leading groups include university research labs from around the world, and many of the more obvious approaches have been explored. Certainly media and blog interest – huge in the early days – has dropped off in recent months. This New York Times article is one of the few from the last month or two.

Let’s not get into the computer science of recommender systems – there’s a good review from 2004 called Evaluating Collaborative Filtering Recommender Systems here if you want to know more. Instead, let’s step back a bit and ask what this prize tells us so far, and look at a couple of things we can learn by poking around the massive data set that Netflix provided.

One question is: how good is an algorithm with an RMSE of 1, and is an algorithm with an RMSE of 0.8563 much better for the average customer? Actually, I guess that’s two questions. Anyway, if the errors followed a normal distribution (which they don’t, but we’re talking back-of-envelope here) then if a customer actually rated a title as 2 (poor), an algorithm with an RMSE of 1.0 would predict somewhere between 1 and 3 about 70% of the time. Not bad, but not startling. If the algorithm gave ten recommended movies, then it would get on average seven out of ten within one unit of the customer’s actual rating. Meanwhile, the RMSE=0.8563 algorithm would get 7.6 out of ten. While this is an improvement, and while it may be a remarkable technical accomplishment, it does not seem to be exactly a revolutionary leap compared to the really simple algorithms as far as customers go.
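The back-of-envelope numbers can be checked with a short Python sketch. It rests on the same simplifying assumption as the paragraph above, that errors are normally distributed with mean zero and a standard deviation equal to the RMSE; the function name is made up for illustration.

```python
import math


def share_within(tolerance, rmse):
    """P(|error| <= tolerance) for normal errors with mean 0 and std dev rmse."""
    return math.erf(tolerance / (rmse * math.sqrt(2)))


for value in (1.0, 0.8563):
    hits = 10 * share_within(1.0, value)
    print(f"RMSE {value}: about {hits:.1f} of ten predictions within one star")
```

Under that assumption the two cases come out at roughly 6.8 and 7.6 predictions out of ten, in line with the figures above.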

[Update, December 25, 2007: Yehuda Koren of leading team KorBell approaches the recommendation problem a different way, looking at ordering of recommendations rather than at matching them. His way is more appropriate, and gives much more encouraging results. See here.]

As soon as you start looking at the data set it becomes obvious why it is so difficult to get good results. Databases don’t have the linear algebra and other mathematical tools for taking a run at the prize, but they are convenient for exploring data sets, so I loaded the data into a SQL Anywhere database (the developer edition is a free download, and I’ll provide a perl script to load the data if you really want it) and started poking around. Here are a few of the more obvious oddities (all these observations have been posted elsewhere – see the Netflix prize forum for more):

  • Customer 2170930 has rated 1963 titles and given each and every one a rating of one (very bad). You would think they would have cancelled their subscription by now.
  • Five customers have rated over 10,000 of the 17,770 titles selected – and presumably they also have rated some of the others among the 60,000 or so titles Netflix had available when they released the ratings. Are these real people?
  • Customer 305344 has rated 17654 titles. Even though Netflix makes it easy to rate titles that you have not rented from them (so they can get a handle on your preferences), can this be real?
  • Customer 1664010 rated 5446 titles in a single day (October 12, 2005).
  • Customer 2270619 has rated 1975 titles. 1931 were given a 5, 31 were given a 4, 10 given a 3, 2 given a 2 (Grumpy Old Men and Sex In Chains) and a single title was given a 1. That title? Gandhi, which has an average rating of over 4 and which fewer than 2% of those who rated it gave a 1.
  • The most often rated movie? Miss Congeniality with ratings by over 232,000 of the 480,000 customers. And which title is most similar to it in terms of ratings (using a slightly weighted Pearson formula)? Bloodfist 5: Human Target.
  • Most highly rated – Lord of the Rings: Return of the King (Extended Edition), with 4.7.
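For readers wondering what a Pearson similarity between two titles involves, here is a sketch in Python. The "slightly weighted" variant mentioned in the list is not spelled out, so this is the plain, unweighted Pearson correlation, computed over the customers who rated both titles. The function name, the dictionary layout (customer id to rating), and the sample ratings are all assumptions for illustration.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation of two titles over customers who rated both.

    Each argument maps customer id -> rating. Returns None when the
    correlation is undefined (fewer than two shared raters, or no
    variation in one title's shared ratings).
    """
    common = ratings_a.keys() & ratings_b.keys()
    n = len(common)
    if n < 2:
        return None
    mean_a = sum(ratings_a[c] for c in common) / n
    mean_b = sum(ratings_b[c] for c in common) / n
    cov = sum((ratings_a[c] - mean_a) * (ratings_b[c] - mean_b) for c in common)
    var_a = sum((ratings_a[c] - mean_a) ** 2 for c in common)
    var_b = sum((ratings_b[c] - mean_b) ** 2 for c in common)
    if var_a == 0 or var_b == 0:
        return None
    return cov / math.sqrt(var_a * var_b)


# Made-up ratings: customers 1, 2 and 3 rated both titles.
title_a = {1: 5, 2: 3, 3: 1, 4: 4}
title_b = {1: 4, 2: 3, 3: 2, 5: 5}
print(pearson(title_a, title_b))
```

In the toy data only three customers rated both titles, and they happen to agree perfectly on the ordering. A correlation built on so few shared raters says little, which is one plausible motive for weighting the formula by the amount of overlap.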

Some of the more bizarre facts above may be artifacts of whatever tweaking process Netflix put the data set through (although they claim not to have materially affected the statistics). While odd, bizarre users are not always difficult to deal with: if you have rated each of the last 1963 titles you’ve watched as 1 it is pretty easy to predict what you will rate the next title. But others are more tricky.
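The same kind of poking around can be done outside a database. The Python sketch below summarizes each customer's ratings and picks out the "always the same rating" customers, who are the easy ones to predict. It is not the SQL used for the observations above; the tuple layout (customer, title, rating, date), the function names, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def customer_summaries(ratings):
    """Summarize (customer, title, rating, date) rows per customer."""
    counts = defaultdict(int)
    values = defaultdict(set)
    per_day = defaultdict(lambda: defaultdict(int))
    for customer, _title, rating, date in ratings:
        counts[customer] += 1
        values[customer].add(rating)
        per_day[customer][date] += 1
    return {
        c: {
            "count": counts[c],                        # titles rated
            "values": values[c],                       # distinct ratings given
            "max_per_day": max(per_day[c].values()),   # busiest single day
        }
        for c in counts
    }


def constant_raters(summaries, min_count):
    """Customers with at least min_count ratings who only ever gave one value."""
    return sorted(
        c for c, s in summaries.items()
        if s["count"] >= min_count and len(s["values"]) == 1
    )


# Made-up rows, for illustration only.
rows = [
    ("A", 1, 1, "2005-10-12"), ("A", 2, 1, "2005-10-12"), ("A", 3, 1, "2005-10-13"),
    ("B", 1, 4, "2005-10-12"), ("B", 2, 2, "2005-10-14"),
]
summaries = customer_summaries(rows)
print(constant_raters(summaries, min_count=3))
```

For a customer flagged this way, predicting their one and only rating value would have been right every time so far.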

One reason for these oddities is identified in Evaluating Collaborative Filtering Recommender Systems. The authors note that (on other, smaller data sets) even the best algorithms don’t seem to get beyond an RMSE of 0.73 on a five-point scale, and speculate that the cause may be "natural variability". We users provide inconsistent ratings – sometimes we’d rate a movie a 3 and sometimes a 4, with no consistency. It may depend on our mood when we watched the movie – we may give a romantic movie a higher rating if we watched it on a first date than if we watched it a week later after being left broken-hearted, or a demanding movie a low rating because we were tired and out of sorts when we watched it – or it may depend on our mood when we actually provide the rating.

There are other, more obvious causes which, for reasons I don’t understand, don’t seem to get discussed much. Netflix itself and most competitors talk about the data in terms of "movies" and "users". But the "movies" in the list are not all movies: a lot are TV series or music video collections. The variability among the episodes of a series (Do you think Lost Season 1 deserves a 3 or a 4?) must make a single-number rating even more variable, and these composite DVDs figure prominently among those titles that have the biggest variance in ratings.

Then there’s the fact that a customer might not really be a single person. It might be a household with several viewers in it. So perhaps one person likes Terminator, one likes Bridget Jones, and one likes Spongebob Squarepants. Once we realize that the "user" might be a collection of people there is nothing strange about giving high ratings to each of these, but you can see how, depending on which household member entered the rating, the values may be quite different (perhaps this is why titles like ‘N Sync: Making of the Tour, Pokemon Vol 9, and Boston Red Sox 2004 World Series Collectors’ Edition have high variance – the person rating may not always have been the person who wanted to watch it). If the data set contains these inevitable variations (in addition to the plain kookiness on show in the Netflix set) then it may be that even the cleverest algorithm can make little progress in untangling all the intrinsic vagueness of the data.

So what I get from the Netflix prize is that there are probably significant limits to recommender systems. Even the smartest don’t do a whole lot better than the simple approaches, and a lot of work is required to eke out even a little more actual information from the morass of data. It seems surprisingly difficult to get reliable, factual information on this important question of how useful they can be. Part of the reason is that they are new – Amazon has only been in business for about ten years after all – and part of the reason is that the behaviour of these systems is often a closely guarded secret despite the aura of openness that web companies cultivate.

This matters because there is a surprising amount riding on the effectiveness of recommender systems. Silicon Valley’s new-economy enthusiasts see them as the key to developing a new level of cultural democracy: they see recommender systems as a trebuchet hurling rocks at the castles of the old elite of mainstream media, big publishers with big marketing departments, big-chain book stores and Hollywood sequels. Recommender systems are claimed to embody the "wisdom of crowds". The idea is that everyone just publishes stuff (blogs, wikipedia entries and so on) and amateur readers or viewers decide what has merit by their actions (rating stories, buying and rating books and DVDs and so on). The work of critics is "crowdsourced" to customers, but it is the recommender system that distills these ratings to yield the aforementioned wisdom.

If faith in recommender systems is misplaced, then the new boss may look much like the old boss only with more computer hardware. There is a danger that recommender systems may simply magnify the popularity of whatever is currently hot – that they may just amplify the voice of marketing machines rather than reveal previously-hidden gems. Even worse, their presence may drive out other sources of cultural diversity (small bookstores, independent music labels, libraries) concentrating the rewards of cultural production in fewer hands than ever and leading us to a more homogeneous, winner-take-all culture.

I’m no futurist, but I see little evidence from the first 300 days of the Netflix Prize that recommender systems are the magic ingredient that will reveal the wisdom of crowds.

Arms Races

There’s nothing that isn’t obvious here. I’m just so mad I have to post something.

First the Americans say that uranium enrichment is a valid part of a civilian nuclear program, but only for countries they like.

If you were Iran, how would you respond to nuclear proliferation among American allies?

Now we hear about a huge arms deal with Saudi Arabia (BBC):

The United States is reported to be preparing a major arms deal with Saudi Arabia worth $20bn (£9.8bn) over the next decade.

It is said to be part of a strategy for countering Iran’s growing strength.

If you were Iran, how would you respond to additional American arms in the Middle East?

Of course, the Americans realise that someone in the region is likely to be pissed off when large amounts of arms start rolling off the ships in Saudi Arabia. But they have a plan to deal with that:

To counter objections from Israel, they said, the Jewish state would be offered significantly increased military aid.

If you were Iran, how would you respond to increased American military aid to Israel?

Of course, the Americans realise that other Arab states in the region are likely to be pissed off because the Saudis are being treated as a favourite. But they have a plan to deal with that too:

Other US allies in the region – Bahrain, Kuwait, Oman, Qatar and the United Arab Emirates – could receive equipment and weaponry as part of the deal, the officials said.

If you were designing a policy to encourage Iran to build nuclear weapons and expand its military capabilities, is there anything that you would do differently? I can’t think of anything.

Dave Meleney

This is a difficult post to write.

Those of you who read the comments section of the blog will have seen quite a few comments by Dave Meleney, a libertarian from Colorado. Sadly I found out – simply because I looked at a google search that came to this weblog and clicked a link – that Dave died last Thursday in an accident while trimming trees. The link I clicked led to a brief note here.

As is the way with Internet contacts I knew a few things about Dave, but almost nothing about his real life. I knew that he was a Bonsai enthusiast who had grown huge numbers of bonsai trees, and yet I didn’t know how old he was. I knew that he had been to China, but not what he looked like.

The other thing I know is that Dave commented on weblogs a lot (both here at Whimsley and elsewhere, at Marginal Revolution for example). Most recently we had a brief exchange in which he posted a comment last Wednesday (link) that I replied to a few days later, not having a clue of what had happened in the meantime. We disagreed – mainly over the role of multinational companies in poor countries, which he saw as overall hugely positive for the poorest people in those countries, and over the role of government – but he was unfailingly polite and thoughtful and articulate. He even offered my family the use of a cabin even though we had never met and even though I don’t know what family he has.

So I am in the odd position of saying that I will miss Dave, even though I hardly knew him. I am convinced he was a kind and generous man – something that emails I have received in the last 24 hours from a couple of people who did know him better confirm. My heart goes out to his family.

If it is not inappropriate I’d like to suggest that any of his Internet contacts who have read his comments and engaged in conversation with Dave, and who wish to express their condolences, might go to the posting at the Libertarian Party of Colorado and add a brief note to the comments there.

Rest in peace Dave.

If That’s All Right With You – A Modest Manifesto

My "Happy Shoes" series seems to have faded out one episode before I meant it to. Oh well, maybe I’ll get back to it soon. Meanwhile, here is something a little different, which owes a lot to various posts by Chris Dillow at Stumbling and Mumbling.


The names of Vasili Arkhipov and Stanislav Petrov do not appear in most lists of 20th century heroes, but they should. After all, who else could claim to have literally saved the world?

Arkhipov’s moment came during the Cuban Missile Crisis on October 27, 1962, when he was an officer on a Soviet nuclear-armed submarine. When the submarine was bombarded by an American ship an intelligence officer on board thought "that’s it – the end" and the captain gave the order to prepare to fire a nuclear missile. Had the missile launched, nuclear war would have begun, but firing a nuclear warhead required the approval of three officers and Arkhipov prevailed on his fellow officers to wait — and things calmed down. When the story became public in 2002 Thomas Blanton, director of the US National Security Archive, said simply: "A guy called Vasili Arkhipov saved the world".

In 1983, Stanislav Petrov was monitoring the Soviet early warning satellites for signs of a US attack. His instructions were clear: if he detected missiles targeting the USSR he was "to push the button and launch a counter-offensive". But when the system showed five missile launches in the US all headed towards the USSR, sirens blared and warning lights flashed, and a room full of people waited for him to push the button, he didn’t. It didn’t look right to him and he reported the alarm to his superiors but declared it false. Petrov was right: the signal was a false alarm triggered by the satellite itself, and a war was averted.

You could hardly find two more unheroic heroes. They were not powerful generals or strong warriors, they were mid-level functionaries in the Soviet military: small cogs in a very big machine, far from the centres of political power. And their actions were not the decisive, bold gestures that we expect from heroes but were cautious and sceptical. When others demanded action their heroic response was to say "let’s wait and see".

We should value modest people such as Arkhipov and Petrov more highly. It is time for modest people to get the credit they deserve.


Look around. It is easy to see what kind of character traits we value. We nourish "leaders" all the way from elementary school (where programs select "leaders of tomorrow") through university scholarships and on into the adult world. We heap admiration on those who succeed in reaching positions of power. We encourage single-minded ambition with every book or show or speaker that tells you to "find your passion" or "follow your dream". We applaud extroverts for their gregariousness and self-confidence.

On the other hand, we undervalue traits such as meekness, generosity, doubt, introversion, a sense of balance, gentleness, irony and competence. In short, we reward arrogance and punish modesty. (I don’t mean modesty as in women covering up their bodies so that men don’t get excited, of course, I mean modesty as in a sense of limits, lack of pretension).

These unbalanced values distort many aspects of our culture. Consider heroism. We search for our heroes among those who are "exceptional" (climb higher mountains, score more goals, make more money…) but a different and better idea of heroism is possible, in which heroic acts are those that reveal our humanity. Arkhipov and Petrov are two examples. For another, consider the 17th century inhabitants of the village of Eyam (rhymes with dream) in Derbyshire, England. I heard of this from John Trevor’s song "Roses of Eyam", as sung by the wonderful Roy Bailey. The song records how the villagers gave up their lives when bubonic plague arrived in Eyam by sealing themselves off to make sure the disease did not spread to the surrounding areas. The number who died varies according to the telling: some say 259 of 333, some 318 of 350, but there was no doubt that those who sealed themselves off had little chance of survival. There was nothing grandiose here – no rewards, no trophies, no immortality save of the most simple kind. And yet how much more heroic these unknown people are than, say, Bill Gates or Bono.

James Joyce realized that our common humanity is at the core of heroism. By building the greatest novel of the twentieth century around a day in the life of Leopold Bloom – a middle-aged, cuckolded, advertising salesman – Joyce highlighted the epic nature of everyday life. And he is right; the great things in life are universal. Birth and death, giving birth, caring for others: you don’t need to explore the remote corners of the world to find these things, yet what can be greater? As Chris Dillow reminds us, Thomas Gray knew the value of modest lives. In his "Elegy Written in a Country Churchyard" he surveyed the graves of the "rude Forefathers" of a hamlet:

Let not Ambition mock their useful toil,
Their homely joys, and destiny obscure;
Nor Grandeur hear with a disdainful smile
The short and simple annals of the Poor

The poet reminds us of the importance and nobility of modest work compared to the contributions of those who enjoy "the pomp of power". Dillow again: "All the essentials of life come from the little people who clean the streets and make our food. The humblest binman has done more good for me in the last 10 years than [Tony] Blair’s managed." Or, as I read in a cookbook, "The discovery of a new dish does more for human happiness than the discovery of a star".

We would have better politicians and CEOs if a sense of service – an inherently modest quality – rather than "leadership quality" was seen as a character trait to be rewarded, and if they recognized that their success comes as much from luck and the quirks of history as from merit. And even those who do reach the pinnacles with good motives must be treated with suspicion. Power of all strands does, after all, corrupt, and those who seek power are the most vulnerable to corruption.

Public discourse suffers from the same warped perspective. Those who parade bold visions and big ideas gain much of the limelight, but such efforts often have more to say about the vanity and presumption of their authors than about the reach of their intellects. A discourse based on big ideas is prone to being diverted by demagogues. Flashy writing and speaking do have their place, but mainly as popularization, and they mean little if not built on cautious and detailed work carried out by those with more modest aims. The devil is often in the details and the world can more often be seen by looking closely at a grain of sand than by scanning the far horizon.

Vanity is linked to what must be The Word For 2007: "passionate". At every turn it seems that we are told that the way to happiness is to "find your passion". Companies boast that they have a "passion for excellence". Be all you can be. Follow your dream.

Such ideas are literally self-centred, and are the opposite of the modest life. When you are the star of your own life, those around you are reduced to supporting roles. Who, I cannot help but wonder, is paying for that dream? Who is finding the children’s clothes while you are busy finding your passion? Usually it is the family members who have to suffer the absences (physical and mental) of the dreamer. Conor Cruise O’Brien has been called "the greatest living Irishman", yet in a trip to Toronto in the 1990s he was without his family for once, and he was at a loss. Why? Because his family members usually handled the money, the arrangements, the mundane details of his trips. Surely no one who gets those around him to do the drudge work can be considered "great".

Passion not only leads to a self-centred life; it is also the enemy of scepticism, of doubt, and of reflection. To be passionate is to be blinkered. Evangelists, monomaniacs, and demagogues are as passionate as anyone, and follow their dream wherever it takes them. But they are terrible role models. We would do better to emulate those who make and accept the compromises of a modest life; those who treat people around them with respect, who accept that others have dreams too and that, if we all give a little, we may not reach our dream but we may have a better world.

It’s not that we should cast down the extrovert and immodest. Every parade needs a leader and, as the saying goes, "you can’t lead a parade if you think you look funny sitting on a horse". Movies need stars and some rock bands benefit from a little swagger, but the point is that the starring role, while it grabs the spotlight, is just one of many that combine to produce the finished event. The star cannot shine without a supporting cast. Every great band needs its rhythm section, every orchestra its second fiddles. No politician gets elected without dedicated campaign workers and no matter how comfortable you feel on a horse, you can’t lead a parade all on your own because that’s not a parade. We need to remember that music, parades, and other events are collective efforts, and value those who feel more comfortable behind the scenes together with those who revel in the spotlight.

Of all roles, perhaps that of the spectator is least appreciated. Being a spectator is seen as passive and uninspired: how often do you hear the phrase "mere spectator" contrasted with "active participant"? Yet we spectators have an essential part to play too, because great events are made great by their spectators. What is a cup final without the fans? A rock concert without the crowd? A festival without the festival-goers? Or consider books: novelist Zadie Smith recently wrote that "A novel is a two-way street, in which the labour required on either side is, in the end, equal… Reading is a skill and an art and readers should take pride in their abilities and have no shame in cultivating them if for no other reason than the fact that writers need you".

Extroverts, as Jonathan Rauch says in a widely-read essay, dominate public life. "This is a pity", he goes on, "If introverts ran the world, it would no doubt be a calmer, saner, more peaceful sort of place." Yet introverts get little respect:

Extroverts are seen as bighearted, vibrant, warm, empathic. "People person" is a compliment. Introverts are described with words like "guarded", "loner", "reserved", "taciturn", "self-contained", "private" – narrow, ungenerous words.

There are good reasons for this. Extroverts, after all, are in a good position to spread the idea that being outgoing is a good thing. Introverts, on the other hand, are not well suited to evangelize the virtues of quiet.

Restoring a balance will not be easy, because it demands immodest behaviour of modest people. It is difficult to promote the quiet virtues in a world drowning in the verbiage of the loudmouthed. Does it even make sense to stage an outspoken demand for modesty (if that’s all right with you?), a brash call for humility, a blunt demand for subtlety, an uncompromising plea for flexibility? Can we be unequivocally on the side of doubt? Does it make sense to spout a monologue on the benefits of shutting the hell up?

Probably not. The very idea of a manifesto is, of course, immodest. But I think it could be saved by something that is lacking in this attempt, which is irony. Perhaps someone else can do a better job.

We should be able to speak up while accepting the limits of our own arguments if we acknowledge (with Leonard Cohen) that "there is a crack in everything. That’s how the light gets in." Contradiction is an opportunity to learn rather than a debate to be fought. Thesis and antithesis are the beginning of synthesis. In exploring both sides of contradictions and arguments we learn to see both sides of a dispute, empathize with hurts and griefs that are not our own, and start to see the cracks in our own beliefs. And that must surely be a goal of any modest agenda – although not before we get the immodest down off their high horses and get them to shut up for a minute.