Netflix Prize: Was The Napoleon Dynamite Problem Solved?

I just gave a talk at work on “Recommender Systems and the Netflix Prize”, and included the two major popular articles about the prize in its final year or so. One was in Wired Magazine and one was in the New York Times, and each focused on one outstanding problem that the competitors faced. Wired looked at the quirkiness of users as they rate movies, and the NYT focused on the difficulty of predicting ratings for a handful of divisive movies.

Now that the contest is over, we can answer the question: were those problems solved?

Let’s start with the Wired article. Entitled “This Psychologist Might Outsmart the Math Brains Competing for the Netflix Prize” [link], it interviewed Gavin Potter, aka “Just a guy in a garage”. Here’s the hook:

The computer scientists and statisticians at the top of the leaderboard have developed elaborate and carefully tuned algorithms for representing movie watchers by lists of numbers, from which their tastes in movies can be estimated by a formula. Which is fine, in Gavin Potter’s view — except people aren’t lists of numbers and don’t watch movies as if they were.

Potter is focusing on effects like the Kahneman-Tversky anchoring effect:

If a customer watches three movies in a row that merit four stars — say, the Star Wars trilogy — and then sees one that’s a bit better — say, Blade Runner — they’ll likely give the last movie five stars. But if they started the week with one-star stinkers like the Star Wars prequels, Blade Runner might get only a 4 or even a 3. Anchoring suggests that rating systems need to take account of inertia — a user who has recently given a lot of above-average ratings is likely to continue to do so. Potter finds precisely this phenomenon in the Netflix data; and by being aware of it, he’s able to account for its biasing effects and thus more accurately pin down users’ true tastes.

Well, Potter didn’t win, but did these kinds of ideas help when it came to the winning submission? The answer is yes. The winning teams worked these kinds of patterns, which are independent of particular user-movie combinations, into their models under the bland name of “baseline predictors”.

Any model has to predict the rating an individual user will give to a particular movie. A very simple baseline predictor could take the average of all the ratings for that movie, take the average of all the ratings given by the user in question, and split the difference. So if the movie has an average rating of 3.45 and the user’s ratings average 2.55, then the model would predict a rating of 3. This includes some minimal level of user quirkiness (are they a high or low rater?) and some level of information about the movie (is it rated highly or not?), yet it has nothing to say about how this particular user is expected to react to this particular movie.
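As a purely illustrative sketch of that “split the difference” baseline, here it is in a few lines of Python. The toy ratings and the `baseline_predict` helper are mine, invented for the example, and have nothing to do with any actual submission:

```python
from collections import defaultdict

# Toy ratings: (user, movie, stars). Invented data, just to make the arithmetic visible.
ratings = [
    ("alice", "Blade Runner", 5), ("alice", "Episode I", 2),
    ("bob", "Blade Runner", 3), ("bob", "Miss Congeniality", 2),
    ("carol", "Miss Congeniality", 4), ("carol", "Episode I", 3),
]

by_user = defaultdict(list)
by_movie = defaultdict(list)
for user, movie, stars in ratings:
    by_user[user].append(stars)
    by_movie[movie].append(stars)

def baseline_predict(user, movie):
    """Split the difference between the user's average and the movie's average."""
    user_mean = sum(by_user[user]) / len(by_user[user])
    movie_mean = sum(by_movie[movie]) / len(by_movie[movie])
    return (user_mean + movie_mean) / 2

# alice's ratings average 3.5 and Miss Congeniality's average 3.0, so the
# prediction is 3.25 -- the same "split the difference" arithmetic as above.
print(baseline_predict("alice", "Miss Congeniality"))
```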

In their winning submission, the BellKor team [PDF link] list their major improvements in the final year of the competition, and the first item they give is improved baseline predictors. In particular,

Much of the temporal variability in the data is included within the baseline predictors, through two major temporal effects. The first addresses the fact that an item’s popularity may change over time. For example, movies can go in and out of popularity as triggered by external events such as the appearance of an actor in a new movie. This is manifested in our models by treating the item bias as a function of time. The second major temporal effect allows users to change their baseline ratings over time. For example, a user who tended to rate an average movie “4 stars”, may now rate such a movie “3 stars”. This may reflect several factors including a natural drift in a user’s rating scale, the fact that ratings are given in the context of other ratings that were given recently and also the fact that the identity of the rater within a household can change over time…

It was brought to our attention by our colleagues at the Pragmatic Theory team (PT) that the number of ratings a user gave on a specific day explains a significant portion of the variability of the data during that day.

A model with these variations, and no specific user-movie considerations (i.e., one that is useless for presenting a list of recommendations to a user), actually ends up being significantly more accurate than Netflix’s own Cinematch algorithm was at the beginning of the competition.
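For the curious, here is a minimal sketch of what a time-aware baseline looks like in code. It is my own simplification of the idea in the quote above, not BellKor’s model: the global mean, the bin width, and the zero-initialized biases are all assumptions, and in a real system the biases would be fit to the data rather than defaulting to zero.

```python
from collections import defaultdict

GLOBAL_MEAN = 3.6   # assumed overall mean rating, for illustration only
BIN_DAYS = 70       # arbitrary width of the time bins used for the item bias

# In a real system these offsets are learned from the data; here they default to zero.
user_bias = defaultdict(float)        # how generous a rater this user is overall
movie_bias = defaultdict(float)       # how well liked this movie is overall
movie_bias_at = defaultdict(float)    # how the movie's popularity deviates in a given time bin

def time_bin(day):
    """Collapse an absolute day number into a coarse time bin."""
    return day // BIN_DAYS

def baseline(user, movie, day):
    """Global mean + user bias + movie bias + time-dependent movie bias.

    There is no user-movie interaction term at all: this is exactly the kind of model
    that says nothing about how *this* user feels about *this* movie.
    """
    return (GLOBAL_MEAN
            + user_bias[user]
            + movie_bias[movie]
            + movie_bias_at[(movie, time_bin(day))])
```

The winning models go further, as the quote describes, letting the user bias drift over time and vary with how many ratings the user entered on a given day, but the overall structure is the same.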

So score one for the winners – they solved the user-quirkiness problem.

The second article was in the New York Times and was called “If You Liked This, You’re Sure to Love That” [link]. Its focus was not the quirkiness of users, but unpredictable movies. Author Clive Thompson interviewed Len Bertoni, a leading contestant:

The more Bertoni improved upon Netflix, the harder it became to move his number forward. This wasn’t just his problem, though; the other competitors say that their progress is stalling, too, as they edge toward 10 percent. Why?

Bertoni says it’s partly because of “Napoleon Dynamite,” an indie comedy from 2004 that achieved cult status and went on to become extremely popular on Netflix. It is, Bertoni and others have discovered, maddeningly hard to determine how much people will like it. When Bertoni runs his algorithms on regular hits like “Lethal Weapon” or “Miss Congeniality” and tries to predict how any given Netflix user will rate them, he’s usually within eight-tenths of a star. But with films like “Napoleon Dynamite,” he’s off by an average of 1.2 stars…

And while “Napoleon Dynamite” is the worst culprit, it isn’t the only troublemaker. A small subset of other titles have caused almost as much bedevilment among the Netflix Prize competitors. When Bertoni showed me a list of his 25 most-difficult-to-predict movies, I noticed they were all similar in some way to “Napoleon Dynamite” — culturally or politically polarizing and hard to classify, including “I Heart Huckabees,” “Lost in Translation,” “Fahrenheit 9/11,” “The Life Aquatic With Steve Zissou,” “Kill Bill: Volume 1” and “Sideways.”

So this is the question that gently haunts the Netflix competition, as well as the recommendation engines used by other online stores like Amazon and iTunes. Just how predictable is human taste, anyway? And if we can’t understand our own preferences, can computers really be any better at it?

Napoleon Dynamite is a problem not because it’s the most difficult movie to predict – it isn’t – but because it’s difficult to predict and it was rated by a lot of people. A movie that is difficult to predict but which is rated by only a handful of users will contribute very little to the total error in a model.
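Put another way, a movie’s contribution to the total squared error is its rating count times its per-movie RMSE squared, so the same error rate hurts far more on a popular title. A quick back-of-the-envelope comparison, with round numbers chosen only for illustration:

```python
# Two equally hard-to-predict movies (per-movie RMSE of 1.2), one popular, one obscure.
popular_contrib = 100_000 * 1.2 ** 2   # 144,000 units of squared error
obscure_contrib = 500 * 1.2 ** 2       # 720 units of squared error

print(popular_contrib / obscure_contrib)  # 200.0 -- the popular movie matters 200x more
```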

Well, now that the competition is done, the complete data set and the predictions of the winning submission are available for download [link]. So download them I did, loaded them into a SQL Anywhere database, and graphed the results. Here is a plot of the remaining error for each movie against the total number of ratings for that movie, for all 17,770 movies. Napoleon Dynamite is the red dot.

[Figure: grand prize error vs. rating count, one point per movie; Napoleon Dynamite shown in red]
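(For anyone who wants to reproduce the plot without a database: the calculation is just a per-movie RMSE against a per-movie rating count. The Python sketch below is roughly equivalent to what I did in SQL Anywhere, but the file name and column layout are placeholders; the real prize files come in their own per-movie format and would need to be joined to the winning predictions first.)

```python
import csv
import math
from collections import defaultdict

# Hypothetical input: one CSV row per withheld rating, already joined to the winning
# team's predictions, with columns movie_id, actual, predicted.
sq_err = defaultdict(float)
n = defaultdict(int)

with open("joined_predictions.csv", newline="") as f:
    for row in csv.DictReader(f):
        diff = float(row["actual"]) - float(row["predicted"])
        sq_err[row["movie_id"]] += diff * diff
        n[row["movie_id"]] += 1

rmse = {m: math.sqrt(sq_err[m] / n[m]) for m in n}
# Plot rmse[m] against each movie's total rating count (taken from the full data set)
# to get one point per movie, as in the figure above.
```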

With 116,362 ratings, Napoleon Dynamite has a higher error than any other movie rated more than 50,000 times. Its RMSE is 1.1934: just as bad as it was for Len Bertoni when the original article was written.

So there you go: user quirkiness was resolved, at least to the extent that was needed to win the prize, while quirky movies remained stubbornly unpredictable till the end.

Why I Am Such an Infrequent Blogger

“My assumption, always, is that everyone knows everything I know AND MORE. Rephrase. Everyone who is interested in the kinds of thing that interest me knows everything I know AND MORE. If they're not interested they don't know but don't want to. So there's no point in mentioning things that strike me as interesting, unless a) these are events in the last, say, 5 minutes (so those disposed to be interested might not be au fait) or b) I'm up for proselytizing (those not disposed to be interested might be with enough encouragement).”

See? Helen DeWitt even knows more than I do about why I am such an infrequent blogger.

More Long Tail: Everyone Is Still Wrong

Every now and then another study comes out about the long tail and gets discussed in the usual places. More often than not, the end result is additional confusion, because the thing they are talking about (Chris Anderson’s book) defines the concept in many different ways, depending on what the author feels like talking about at the moment.

The latest is a working paper called Is Tom Cruise Threatened: Using Netflix Prize Data to Examine the Long Tail of Electronic Commerce, by Tom Tan and Serguei Netessine of the Wharton Business School at the University of Pennsylvania [PDF, summarized here]. It got slashdotted here, was written up in The Register here, and Chris Anderson responded to it here.

Here is what the paper does. It takes the sample of 100 million DVD ratings provided by Netflix for the recently completed Netflix Prize and breaks down the trends in ratings from 2000, when Netflix stocked a relatively small number of titles and had relatively few users, to 2005, when Netflix had many more DVD titles and many more users. The authors then ask whether demand for “hit movies” and “niche movies” increased or decreased over that time, as reflected in the number of ratings in Netflix’s sample data set. Not surprisingly, they find that when hits are measured in absolute terms (top 10, top 100) the demand for them decreases, and when they are measured in percentage terms (top 1%, top 10%) it increases.
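The divergence between those two measures is easy to reproduce. Here is a toy calculation with invented numbers (not the paper’s data): two catalogs with the same Zipf-like shape of demand, one small and one large, and the share of ratings captured by the top 100 titles versus the top 1% of titles:

```python
def hit_shares(ratings_per_title):
    """Share of all ratings captured by the top-100 titles and by the top 1% of titles."""
    counts = sorted(ratings_per_title, reverse=True)
    total = sum(counts)
    top_100_share = sum(counts[:100]) / total
    top_1pct_share = sum(counts[: max(1, len(counts) // 100)]) / total
    return top_100_share, top_1pct_share

# Invented catalogs with the same 1/rank demand curve: a small 2000-style catalog
# and a much larger 2005-style one.
small_catalog = [10_000 // rank for rank in range(1, 4_001)]
large_catalog = [100_000 // rank for rank in range(1, 18_001)]

print(hit_shares(small_catalog))  # top-100 share is higher, top-1% share is lower
print(hit_shares(large_catalog))  # top-100 share falls, but top-1% (now 180 titles) share rises
```

The absolute cutoff is a fixed number of titles while the percentage cutoff grows with the catalog, so the two measures can move in opposite directions even when the shape of demand hasn’t changed at all.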

The problem is that the Netflix Prize data set, while fascinating to explore, has nothing to say about the long tail by itself. Between 2000 and 2005 the DVD as a format exploded, with many old titles getting put on the new format, and Netflix exploded as the convenience of online movie rental took off. But comparing early Netflix to late Netflix doesn’t tell us anything at all about the evolution of consumer taste in the online world, or about the relative diversity of demand from online and ‘bricks and mortar’ stores, which is supposed to be what this is all about.

To be charitable, the nearest we can get is that it’s a comparison of a restricted set of choices (2000) and a broad set of choices (2005), but given that the size of the available set of titles increased by a factor of about 5 while the user base increased by a factor of 50, interpreting the results as “the effect on demand of this increase in variety [of titles]”, as Anderson does, is simply seeing what you’d like to see.

If we are going to take Anderson seriously then we should adopt his standard definition when the long tail gets challenged:

This is a good moment to remind everyone of the normal definition of "head" and "tail" in entertainment markets such as music. "Head" is the selection available in the largest bricks-and-mortar retailer in the market (that would be Wal-Mart in this case). "Tail" is everything else, most of which is only available online, where there is unlimited shelf space. [link]

It’s a definition that is skewed to guarantee success for his model, and which is completely uninteresting (as I have posted about ad nauseam), but hey, it’s his definition. And the Netflix data has nothing to say about it.

So when Chris Anderson posts his favourite graph from the data and claims it’s a vindication of the long tail (“Netflix data shows shifting demand down the Long Tail”), it can only be because it looks like the schematic, unlabelled, number-free graphs in his book. It’s cherry-picking the data for the most simplistic of reasons: the two lines he’s comparing have no relation to what he talks about elsewhere, but hey, who cares?

Google Book Settlement: I think I opted out

This morning I opted out of the Google Book Settlement. At least, I think I did. After going through the forms and clicking the buttons, I got a generic page saying something like "your opt-out has been received". But I have no receipt, no acknowledgement – in short, no proof that I have opted out. Very odd. I emailed Rust Consulting (the settlement administrator) and assume I will get something sometime, but it seems very amateurish for what they keep telling us is such an important change.

As for why I opted out, well I hope to write about that. But I have a deadline for next weekend and a busy time at home (never mind the day job), so I can't do that right now, and this blog will continue to be quiet for a while.

Update: After emailing the settlement administrator, I did get an email confirming my opt-out.

No Free Stuff Here

A few people have asked me whether I’m going to repeat my Critical Reader’s Guide to The Long Tail with Chris Anderson’s new book Free. It’s nice to be asked, so thank you. And I am, of course, susceptible to bribes and flattery. But the answer right now is No. Three reasons:

  • I don’t have a vendetta against the guy, so there’s no particular reason to go after him again.
  • To be honest, I doubt that it will be interesting enough to spend that amount of time on.
  • I don’t think the ideas will be as destructive as the Long Tail has been.