Infrequently Asked Questions

I did say I have a few pieces I wanted to post still, so here (now that the Ontario election is over) is one of them: a few questions a few people have asked me about No One Makes You Shop at Wal-Mart.

Where did you get the title?
The title evolved along with the content of the book. I think the first version of it came from Paul Krugman’s essay Enemies of the WTO which was published in Slate on November 24, 1999. I’d admired Krugman’s writings for some time, and yet had anti-globalization leanings. The end of his essay (which I quote and misattribute on p. 190) encapsulates the challenge I wanted to address:

Although they [anti-globalization protesters] talk of freedom and democracy, their key demand is that individuals be prevented from getting what they want–that governments be free, nay encouraged, to deny individuals the right to drive cars, work in offices, eat cheeseburgers, and watch satellite TV. Why? Presumably because people will really be happier if they retain their traditional "language, dress, and values." Thus, Spaniards would be happier if they still dressed in black and let narrow-minded priests run their lives, and residents of the American South would be happier if planters still sipped mint juleps, wore white suits, and accepted traditional deference from sharecroppers … instead of living in this "dreary" modern world in which Madrid is just like Paris and Atlanta is just like New York.                                    
Well, somehow I suspect that the residents of Madrid and Atlanta, while they may regret some loss of tradition, prefer modernity. And you know what? I think the rest of the world has the right to make the same choice.

Just before that paragraph he uses the sentence "And nobody forces you to eat at McDonald’s", which I latched onto. When I first sent the book to the publisher I took a couple of syllables out, shortening it to "No One Makes You Eat at McDonald’s".

The change to Wal-Mart came about for two reasons. First, the McDonald’s example turns out to be a difficult one that I only address 90% of the way through the book, and so the title didn’t match the structure of the manuscript. My Wal-Mart story had moved into Chapter 1 by that time, as the very first story in the book, and so "No One Makes You Shop at Wal-Mart" just made a lot more sense. Also, of course, in the years between 1999 and 2005 Wal-Mart had supplanted McDonald’s as the pre-eminent symbol of rampaging capitalism. So that decided it.

In retrospect: I’m pretty pleased with how the title worked out. It’s misleading, because everyone thinks it’s a book about Wal-Mart and assumes it’s a kind of popular journalistic polemic, but at least it catches the attention, and that’s all you can hope for.

 

Why do you call the town Whimsley?

To me the name is a mixture of something whimsical and something down to earth, both of which the stories are meant to be. I think the whimsical-Whimsley part of that is obvious enough. The "ley" ending is a reference to the many millstone-grit towns of Yorkshire that end in that syllable – Burley, Ilkley, Batley, Bramley, Barnsley and probably hundreds of others – but particularly Otley, the nearest town of any size to where I grew up. There cannot be many more pragmatic, realistic places around.

In retrospect: I still like the name.

Why are the two characters called Jack and Jill?

These names are a reference to the R. D. Laing book Knots, which I was introduced to by either Nigel Perry or Clive Norris in 1978. It’s a quirky book that sets out patterns of tangled behaviour in short structured vignettes like this:

    Jack is afraid Jill is like his mother
    Jill is afraid Jack is like her mother

    Jack is afraid
                 Jill thinks he is like her mother
    and that     Jill is afraid
                                    Jack thinks she is like his mother

    Jill is afraid
                Jack thinks she is like his mother
    and that     Jack is afraid
                                    Jill thinks he is like her mother

To tell the truth, I never really got very far with the book, slim though it is, but the recursive form of these vignettes has stayed with me. I had hoped I could make my stories condensed and elegant in the same fashion, but they ended up being more pedestrian. Ah well. (In earlier drafts the two characters were called Winston and Julia, but that was melodramatic and obvious, so it went.)

Why did you coin the word MarketThink?

Was it needed? I’m not sure now. At the time it seemed important. It is close to the idea of market populism that Thomas Frank writes about so well in One Market Under God, and close to the idea of market fundamentalism, a phrase used by many people. But both of those terms convey the idea that their supporters are promoting markets as solutions to all problems. The reason I didn’t want to use them is that those who promote this loose, populist ideology do not always promote free markets – certainly not ideal competitive markets. Intellectual property is an obvious area where promoters of private industry are keen to prevent the competition they claim to believe in elsewhere. Nevertheless, the rhetoric of MarketThink portrays the world (governments aside) as if it worked like an ideal competitive market, even when proposing actions that contradict that portrayal. Boeing is quite happy to argue for the necessity of government subsidy in the name of markets, and companies that grew large under protectionist regimes are happy to promote free trade as long as they are the beneficiaries. So I thought a different word was needed, and MarketThink seemed to be it.

In retrospect: I’d probably avoid coining a new word and simply use "market populism". I was splitting hairs.

INLAND EMPIRE

Some films you watch for plot, some for action, some for characterization, some for laughs. David Lynch films you watch for the atmosphere and for the occasional shocking scene.  Complaining about the plot of a Lynch movie is like complaining about olives not being sweet: it’s just not the point.

I don’t know what I think of INLAND EMPIRE "overall". I don’t even know what such an "overall" would mean – should I add up the minutes I like, subtract the minutes I didn’t like, and assess the film on the resulting number?

To judge a Lynch film I ask whether it stays with me; whether scenes play themselves over in my mind during the days after I watch it. And INLAND EMPIRE has enough of those scenes to make me glad I watched it. It’s not Mulholland Drive, one of my favourite films of all time, but then what is?

One scene in particular is stunning – not one I’ve seen talked about elsewhere. It’s about fifteen minutes in. Fading star Nikki Grace (Laura Dern) has won a part in a new film being directed by the unctuous Kingsley Stewart (Jeremy Irons), and she turns up on the empty, hangar-like set to do a read-through with her co-star, the sleazy Devon Berk (Justin Theroux). From what we’ve seen so far, all these characters seem shallow: caricatures of actors and directors.

Dern, Irons, and Theroux sit at a trestle table and then Irons suggests they read through a scene where Devon’s character (Billy Side) arrives home to find Nikki’s character (Susan Blue) upset. After a few awkward moments, Devon/Theroux starts reading.

And everything changes.

The camera closes in on Dern as the two characters say their lines, quietly and intently, staring at each other, the plain dialogue punctuated by long pauses. You start to wonder, is this acting or is it real? Who is meaning these lines? Susan or Nikki or Laura Dern? Billy or Devon or Justin Theroux?

"Are you crying?" whispers Billy/Devon/Theroux.

A long pause.

"Yes" mouths Susan/Nikki/Dern. And a tear tracks down her cheek.

Then they are interrupted and the tension is broken.

When the interruption came I realized I’d been holding my breath during the whole reading. It’s actually shocking, how fine good acting can be. The film is worth it for that scene alone.

Winding Down Whimsley

Blogging has been non-existent for the last few weeks. I also have several emails from blog contacts that I have failed to reply to — sorry Henry, Aaron and others. I’ve been diverted by non-digital politics (the Ontario election), home life, and trying to do my day job.

I started the blog, as the top of the page says, mainly in an attempt to promote No One Makes You Shop at Wal-Mart. Well, that came out 16 months ago, so anything I could do along those lines is done.

I have a few more pieces I want to post here, and then I’ll take an indefinite blogbreak and try to do some longer-form writing.

After all, in the words of the Talking Heads: "Say something once, why say it again?"

Preconceptions

If you read this paragraph about Kanye West from today’s Observer and don’t do a double take, you have fewer preconceptions than I.

An only child (‘Kanye’ is an Ethiopian name which means ‘the only one’), whose parents split before he was a year old and who divorced when he was four, Kanye was mainly raised by his doting mother, Donda. They are still extremely close; he wrote a song for her on his last album, ‘Hey Mama’, which, sweetly, she now has as the ringtone on her cellphone. When Kanye was still a young child, they moved from Atlanta, Georgia, to Chicago, where she became the chair of the English department at Chicago State University before latterly taking over as his manager.

Another Few Words on Netflix

Last week’s post on the Netflix Prize brought more readers here than anything else I’ve written.

Being vain, I tracked its progress – it got a few readers in the first couple of days, then a burst after being picked up by the Economist Link Mafia of Brad DeLong and Marginal Revolution – thanks to both. But the big spike came when someone picked it up from there and posted it to reddit. Then it made its way to some other tech sites like ycombinator and geekpress, and that kept things going for a while. After a week, the server’s disks are having a chance to cool down.

There were lots of good comments on the various places it got listed (as well as here).

ZH suggested using a Java app together with HSQL to do the analysis. I’m confident that SQL Anywhere is as fast as HSQL, but ZH is right that you could do more with the data in an application than with the simple queries I was running. However, in the end you have to deal with sparse matrices, and databases don’t usually store or manipulate those well.

Someone on reddit (I can’t find the link now) mentioned that they have published a C++ framework for analyzing the data set, together with one of the more popular avenues of attack, a singular value decomposition (SVD) algorithm. So if you want to try out algorithms that are better than Netflix’s current one, go hunt for the post on reddit and you’ll find it.
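For the curious, here is a minimal sketch of the latent-factor idea behind those SVD approaches: pack the ratings into a (sparse) matrix, keep only a handful of singular values, and read predictions off the low-rank reconstruction. The data below is made up and the mean-filling of missing entries is the crudest possible choice – this is an illustration of the idea, not anything close to a competitive entry.

    # A toy illustration of the SVD / latent-factor idea (illustrative only; made-up data).
    import numpy as np
    from scipy.sparse import coo_matrix

    # (customer_index, title_index, rating) triples -- a made-up miniature data set
    ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1), (2, 1, 5), (2, 2, 2)]
    rows, cols, vals = zip(*ratings)

    # Sparse storage is what you'd need for the real 480,189 x 17,770 matrix...
    R_sparse = coo_matrix((vals, (rows, cols)), shape=(3, 3), dtype=float)

    # ...but for a 3x3 toy we can densify, filling unrated cells with the overall mean
    R = R_sparse.toarray()
    R[R == 0] = np.mean(vals)

    # Keep only k singular values; the low-rank reconstruction is the prediction matrix
    k = 2
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    predicted = (U[:, :k] * s[:k]) @ Vt[:k, :]

    print(predicted.round(2))   # predicted ratings for every (customer, title) pair

The hard part, and where the serious entrants spend their effort, is handling the missing entries sensibly rather than naively filling them in.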

Quite a few people pointed out that the challenges the Netflix Prize faces because of the way its data is collected don’t mean all recommender systems must be thrown in the bin, and I agree. Narrowing things down to one person per "customer" instead of a household would help. Other things (mood indicators, half-star ratings) may make incremental improvements. But for movies, mostly watched once, there are always going to be some issues of variability. For music – and anything else we return to over and over again – rating systems have a better chance of success. Mostly we buy music we either know we like or have a good idea we will like, so we can use purchases instead of surveys as a way of tracking preferences. But movies are experience goods, except for the few you may watch more than once.

The other main topic was people’s experience of rating systems. This seemed highly variable – some say they are useless, others that they are fine, and others that a small difference in accuracy may make a big difference. Some suggested that tracking the RMSE is not very useful – but that was Netflix’s decision, not mine. What I find interesting here is that, for all the ratings we have now collected, there is so little evidence (that I know of) to back up or refute these subjective impressions.

And now it’s time for this blog to step off the information highway and back onto the information backroads where it belongs.

The Netflix Prize: 300 Days Later

Today the Netflix Prize Competition has been running for 300 days.

Online DVD rental outfit Netflix caused a real buzz last October when it announced the competition. If anyone can come up with a recommender system for predicting customer DVD preferences that beats its own algorithm (Cinematch) by 10%, Netflix will hand over $1 million. The prize got a lot of attention because it exemplifies the idea of crowdsourcing. Not only does Netflix rely on crowdsourcing of DVD ratings (user ratings of DVD titles), but the competition itself is an attempt to use crowdsourcing to develop the algorithms that make the most of those ratings. Instead of doing the work itself, or hiring specialists, Netflix lets anyone enter the competition and pays the winner. The competition is still in progress: Netflix says it will run until at least 2011. So now that the initial buzz has died down, what can we learn from the Netflix Prize?

First, the competition details (see here (PDF) for a short paper by two Netflix employees). Netflix made public a database of customer DVD ratings (tweaked to ensure privacy) that included over 100 million individual ratings of 17,770 titles by 480,189 people. If you sign up for the prize, you can download these ratings. Each rating involves one customer giving an integer number from one star (very bad) to five stars (very good) for a given title. For example, customer 296452 gave title 234 ("Animation Legend: Winsor McCay") a rating of 1 (very bad).

The idea is that competition entrants develop an algorithm using the training set (the 100-million-plus set of ratings), try it out on a probe set of test data that Netflix also provides, and, once they think they have a good algorithm, create a set of predictions for a qualifying set of users and titles and upload it to Netflix. Netflix tests these predictions against the actual ratings (which it keeps private) for that qualifying set, and posts the leading algorithms on a leaderboard.

The quality of any algorithm is measured by its root mean squared error (RMSE). To calculate the RMSE you take the difference between the rating the algorithm predicts and the actual rating, and square it so it’s guaranteed to be a positive number. Then you take the average of all these squared differences over the set of data to get the mean squared error. Finally, taking the square root gives the RMSE, which is roughly the size of a typical error.
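In code the whole measure is only a couple of lines. Here is a minimal sketch (Python, with made-up numbers – not Netflix’s actual scoring code):

    import math

    def rmse(predicted, actual):
        """Root mean squared error between two equal-length lists of ratings."""
        squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
        return math.sqrt(sum(squared_errors) / len(squared_errors))

    # Made-up example: four predictions against what the customers actually rated
    print(rmse([3.5, 4.1, 2.0, 4.8], [4, 4, 1, 5]))   # about 0.57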

A perfect algorithm would predict exactly what rating every user would give to every title and would have an RMSE of zero. A random set of predictions has an RMSE of 1.95. But the actual range of action is much narrower than this. A simple algorithm that uses the average rating for each title as the prediction – "let’s see, the average rating for the 104,000 customers who rated Mean Girls was 3.514, so I predict you will give it a rating of 3.514" – gets an RMSE of 1.0540. Netflix’s Cinematch algorithm has an RMSE of 0.9525. Netflix set the prize target at a 10% improvement over that, which is an RMSE of 0.8563. So the range that recommendation systems can realistically cover – from naively simple to cutting-edge research – seems to be the narrow band between the middle three lines in the following diagram.
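That per-title-average predictor really is as simple as it sounds. A rough sketch (my own illustration with made-up data, not anything Netflix distributes):

    from collections import defaultdict

    # Made-up (customer, title, rating) triples standing in for the 100-million-row training set
    training = [("c1", "Mean Girls", 4), ("c2", "Mean Girls", 3),
                ("c3", "Mean Girls", 4), ("c1", "Gandhi", 5), ("c2", "Gandhi", 4)]

    totals = defaultdict(lambda: [0, 0])      # title -> [sum of ratings, number of ratings]
    for _customer, title, rating in training:
        totals[title][0] += rating
        totals[title][1] += 1

    averages = {title: s / n for title, (s, n) in totals.items()}

    def predict(customer, title):
        # Ignore the customer entirely: everyone gets the title's average rating
        return averages.get(title, 3.0)       # fall back to a middling 3 for unseen titles

    print(predict("c4", "Mean Girls"))        # 3.67ish, whoever c4 is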

In the days and weeks after the prize was announced, progress was rapid. The Cinematch score was matched within a week. Within a month the leaders were halfway to the winning prize with a 5% improvement. But further progress has proved more and more difficult. It took another month to get to a 6% improvement, about five more months to get to 7%, and the current (July 29, 2007) leader is at a 7.8% improvement and has been unchanged for a month. Here is a graph of the progress, showing the three lines above and the prize leader’s progress:

At this stage it is not clear if the prize is winnable: the existing algorithms use a lot of linear algebra and some pretty fancy machine learning ideas (see a description by a leading participant here and some sample code for a similar approach here), the leading groups include university research labs from around the world, and many of the more obvious approaches have been explored. Certainly media and blog interest – huge in the early days – has dropped off in recent months. This New York Times article is one of the few from the last month or two.

Let’s not get into the computer science of recommender systems – there’s a good review from 2004 called Evaluating Collaborative Filtering Recommender Systems here if you want to know more. Instead, let’s step back a bit and ask what this prize tells us so far, and look at a couple of things we can learn by poking around the massive data set that Netflix provided.

One question is: how good is an algorithm with an RMSE of 1, and is an algorithm with an RMSE of 0.8563 much better for the average customer? Actually, I guess that’s two questions. Anyway, if the errors followed a normal distribution (which they don’t, but we’re talking back-of-envelope here), then if a customer actually rated a title as 2 (poor), an algorithm with an RMSE of 1.0 would predict somewhere between 1 and 3 about 70% of the time. Not bad, but not startling. If the algorithm gave ten recommended movies, then on average it would get seven out of ten within one unit of the customer’s actual rating. Meanwhile, the RMSE=0.8563 algorithm would get 7.6 out of ten. While this is an improvement, and while it may be a remarkable technical accomplishment, it is not exactly a revolutionary leap over the really simple algorithms as far as customers are concerned.
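If you want to check that back-of-envelope arithmetic, the normal-error assumption makes it a one-liner: the chance of landing within one star of the true rating is erf(1/(RMSE·√2)). A quick sketch (the numbers are only as good as the assumption):

    import math

    def within_one_star(rmse):
        """P(|error| <= 1 star) if errors were normally distributed with standard deviation rmse."""
        return math.erf(1.0 / (rmse * math.sqrt(2)))

    for r in (1.0540, 1.0, 0.9525, 0.8563):
        print(f"RMSE {r}: {10 * within_one_star(r):.1f} out of ten within one star")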

[Update, December 25, 2007: Yehuda Koren of leading team KorBell approaches the recommendation problem a different way, looking at ordering of recommendations rather than at matching them. His way is more appropriate, and gives much more encouraging results. See here.]

As soon as you start looking at the data set it becomes obvious why it is so difficult to get good results. Databases don’t have the linear algebra and other mathematical tools needed to take a run at the prize, but they are convenient for exploring data sets, so I loaded the data into a SQL Anywhere database (the developer edition is a free download, and I’ll provide a perl script to load the data if you really want it) and started poking around; a rough Python sketch of this sort of exploration appears after the list below. Here are a few of the more obvious oddities (all these observations have been posted elsewhere – see the Netflix prize forum for more):

  • Customer 2170930 has rated 1963 titles and given each and every one a rating of one (very bad). You would think they would have cancelled their subscription by now.
  • Five customers have rated over 10,000 of the 17,770 titles selected – and presumably they also have rated some of the others among the 60,000 or so titles Netflix had available when they released the ratings. Are these real people?
  • Customer 305344 has rated 17,654 titles. Even though Netflix makes it easy to rate titles that you have not rented from them (so they can get a handle on your preferences), can this be real?
  • Customer 1664010 rated 5446 titles in a single day (October 12, 2005).
  • Customer 2270619 has rated 1975 titles: 1931 were given a 5, 31 were given a 4, 10 a 3, 2 a 2 (Grumpy Old Men and Sex In Chains), and a single title was given a 1. That title? Gandhi, which has an average rating of over 4 and which fewer than 2% of those who rate it give a 1.
  • The most often rated movie? Miss Congeniality with ratings by over 232,000 of the 480,000 customers. And which title is most similar to it in terms of ratings (using a slightly weighted Pearson formula)? Bloodfist 5: Human Target.
  • Most highly rated – Lord of the Rings: Return of the King (Extended Edition), with 4.7.
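For anyone who wants to poke around themselves, here is a rough Python sketch of the sort of exploration behind the list above. It assumes the per-title text files in the prize’s training set (each starting with a title id followed by customer,rating,date lines, if memory serves), and it uses a plain Pearson correlation rather than the slightly weighted version behind the Miss Congeniality result, so treat it as illustrative:

    # Rough sketch: load the ratings and compare two titles' ratings directly.
    # Fine for a subset of the files; the full 100 million ratings will strain memory.
    import os
    from collections import defaultdict
    from math import sqrt

    def load_ratings(directory):
        """Return {title_id: {customer_id: rating}} from a directory of per-title files."""
        ratings = defaultdict(dict)
        for name in os.listdir(directory):
            if not name.endswith(".txt"):
                continue
            with open(os.path.join(directory, name)) as f:
                title_id = int(f.readline().strip().rstrip(":"))
                for line in f:
                    customer, rating, _date = line.strip().split(",")
                    ratings[title_id][int(customer)] = int(rating)
        return ratings

    def pearson(ratings, title_a, title_b):
        """Plain Pearson correlation between two titles, over customers who rated both."""
        common = ratings[title_a].keys() & ratings[title_b].keys()
        if len(common) < 2:
            return 0.0
        xs = [ratings[title_a][c] for c in common]
        ys = [ratings[title_b][c] for c in common]
        n = len(common)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
        sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
        return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

    # ratings = load_ratings("training_set")   # directory name is illustrative
    # print(pearson(ratings, 1234, 5678))      # hypothetical title ids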

Some of the more bizarre facts above may be artifacts of whatever tweaking process Netflix put the data set through (although they claim not to have materially affected the statistics). Odd as they are, bizarre users are not always difficult to deal with: if you have rated each of the last 1963 titles you’ve watched as a 1, it is pretty easy to predict what you will rate the next title. But other problems are more tricky.

One reason for the difficulty is something that Evaluating Collaborative Filtering Recommender Systems identifies. The authors note that (on other, smaller data sets) even the best algorithms don’t seem to get beyond an RMSE of 0.73 on a five-point scale, and speculate that the cause may be "natural variability". We users provide inconsistent ratings – sometimes we’d rate a movie a 3 and sometimes a 4, with no real pattern. It may depend on our mood when we watched the movie – we may give a romantic movie a higher rating if we watched it on a first date than if we watched it a week later after being left broken-hearted, or a demanding movie a low rating because we were tired and out of sorts when we watched it – or it may depend on our mood when we actually provide the rating.

There are other, more obvious sources of variability which, for reasons I don’t understand, don’t seem to get discussed much. Netflix itself and most competitors talk about the data in terms of "movies" and "users". But the "movies" in the list are not all movies: a lot are TV series or music video collections. The variability among the episodes of a series (do you think Lost Season 1 deserves a 3 or a 4?) must make a single-number rating even more variable, and these composite DVDs figure prominently among the titles with the biggest variance in ratings.
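Ranking titles by the variance of their ratings is easy once the data is loaded; here is a toy sketch with made-up numbers, just to show the shape of the query:

    from statistics import pvariance

    # Made-up miniature ratings: {title: list of ratings} -- substitute the real data
    ratings_by_title = {
        "Lost: Season 1":     [5, 5, 2, 1, 5, 2],   # a series: episodes divide opinion
        "Miss Congeniality":  [3, 4, 3, 3, 4, 3],
        "Return of the King": [5, 5, 4, 5, 5, 5],
    }

    for title, rs in sorted(ratings_by_title.items(),
                            key=lambda item: pvariance(item[1]), reverse=True):
        print(f"{title}: mean {sum(rs) / len(rs):.2f}, variance {pvariance(rs):.2f}")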

Then there’s the fact that a customer might not really be a single person. It might be a household with several viewers in it. So perhaps one person likes Terminator, one likes Bridget Jones, and one likes Spongebob Squarepants. Once we realize that the "user" might be a collection of people there is nothing strange about giving high ratings to each of these, but you can see how, depending on which household member entered the rating, the values may be quite different (perhaps this is why titles like ‘N Sync: Making of the Tour, Pokemon Vol. 9, and Boston Red Sox 2004 World Series Collectors’ Edition have high variance – the person rating may not always have been the person who wanted to watch it). If the data set contains these inevitable variations (in addition to the plain kookiness on show in the Netflix set) then it may be that even the cleverest algorithm can make little progress in untangling the intrinsic vagueness of the data.

So what I get from the Netflix prize is that there are probably significant limits to recommender systems. Even the smartest don’t do a whole lot better than the simple approaches, and a lot of work is required to eke out even a little more actual information from the morass of data. It seems surprisingly difficult to get reliable, factual information on this important question of how useful they can be. Part of the reason is that they are new – Amazon has only been in business for about ten years after all – and part of the reason is that the behaviour of these systems is often a closely guarded secret despite the aura of openness that web companies cultivate.

This matters because there is a surprising amount riding on the effectiveness of recommender systems. Silicon Valley’s new-economy enthusiasts see them as the key to a new level of cultural democracy: a trebuchet hurling rocks at the castles of the old elite of mainstream media, big publishers with big marketing departments, big-chain book stores and Hollywood sequels. Recommender systems are claimed to embody the "wisdom of crowds". The idea is that everyone just publishes stuff (blogs, Wikipedia entries and so on) and amateur readers or viewers decide what has merit by their actions (rating stories, buying and rating books and DVDs and so on). The work of critics is "crowdsourced" to customers, but it is the recommender system that distills these ratings to yield the aforementioned wisdom.

If faith in recommender systems is misplaced, then the new boss may look much like the old boss only with more computer hardware. There is a danger that recommender systems may simply magnify the popularity of whatever is currently hot – that they may just amplify the voice of marketing machines rather than reveal previously-hidden gems. Even worse, their presence may drive out other sources of cultural diversity (small bookstores, independent music labels, libraries) concentrating the rewards of cultural production in fewer hands than ever and leading us to a more homogeneous, winner-take-all culture.

I’m no futurist, but I see little evidence from the first 300 days of the Netflix Prize that recommender systems are the magic ingredient that will reveal the wisdom of crowds.