What does it mean when a computer scores higher than a human?
Next week (February 14, 15, and 16) special episodes of the quiz show Jeopardy! will feature two former champions of the show playing against a contestant who is a newcomer in more than one sense of the word: a computer named Watson. Watson isn’t coming to the event unprepared, however: IBM specifically designed, built, programmed, trained, and tested Watson to play Jeopardy! IBM’s designers say it was a challenging project; to play the game, a contestant has to judge whether or not to respond to a clue, at the risk of losing points for a wrong answer: “It’s going to have to know what it knows,” says IBM project manager David Ferrucci. (“IBM ‘Watson’ to Challenge Humans at Jeopardy!” http://www.youtube.com/watch?v=3e22ufcqfTs )
The news that the nonhuman Watson scored higher than the two human Jeopardy! champions it played against in a practice game has been everywhere lately. In a promotional video about it, “IBM and the Jeopardy Challenge” at http://www.youtube.com/watch?v=FC3IryWr4c8 , Jeopardy!’s executive producer Harry Friedman says, “I think we’ve gone from impressed to blown away.”
But what does it mean when a computer does better than a human on a certain task? Does it say something about the computer . . . or something about the task? Or does it, perhaps, tell us something about the humans who decide how performance on the task is to be judged? The rules of a given game generally do not provide a straightforward measure of any one capability, especially when extremely capable contestants are playing each other: my spouse, who speaks from his own experience of being on Jeopardy!, tells me a lot rides on not pressing your buzzer button too early. (You need to press your buzzer button for a chance to respond, but do so before the cue and you’re locked out for a bit; someone who presses their buzzer button just after you might get the chance instead.) Designing a game that can serve as a test of a high-level cognitive skill is genuinely difficult, so, practically speaking, researchers often look to time-tested games already in use for a benchmark. When discussing the question of when people might find it natural to speak of a machine’s behavior as intelligent, the Cambridge mathematician Alan Turing, writing in 1950, imagined how one might employ a computer as a contestant in a parlor game of his day called the imitation game.
Certainly IBM put a lot of thought into choosing the next challenge it would take on. IBM evaluated several possible challenges before deciding to invest talent and money in designing a computer that could meet the challenge of being a champion-level contestant on Jeopardy! And we’re talking about a lot of talent and money: Watson is more than an algorithm or even a very sophisticated piece of software. Watson is a specially built computer, physically much larger than a human: it’s about the size of eight refrigerators and requires “tons of air conditioning,” according to IBM’s informational videos. IBM’s David Ferrucci observes that being able to win at Jeopardy! doesn’t entail having mastered natural language, yet he does think that designing a machine able to compete in a Jeopardy! game “will drive technology in the right direction” (“IBM Watson – Why Jeopardy?” http://www.youtube.com/watch?v=_1c7s7-3fXI ): on Jeopardy!, there is “the broad and open domain” of subject matter a contestant has to handle. There is wide variety in the ways a question might be asked (IBM counted thousands of ways), and also in the kind of answer that might be sought. Deciding which kind of answer is appropriate for a particular question can get tricky, too. Watson has to handle puns, along with many other sources of ambiguity.
The topic of question answering is not new to philosophy; at an American Philosophical Association (APA) convention 45 years ago, there was a symposium devoted to the topic of “Questions” (The Journal of Philosophy, vol. 63, no. 20, October 26, 1966). One of my favorite professors in graduate school, Nuel Belnap, co-authored a now-classic book on it: The Logic of Questions and Answers. IBM collaborated with Carnegie Mellon’s Eric Nyberg, who, though now a professor at the Language Technologies Institute in CMU’s School of Computer Science, obtained his PhD in a program that was located in CMU’s Department of Philosophy at the time.
The discipline of artificial intelligence has a name for the kind of task for which IBM felt huge advances were yet to be made: “automated open domain question answering.” (I’m ignoring the conceit that Jeopardy! responses are worded as questions; for our purposes, responses on Jeopardy! are answers.) What especially interested IBM about the rules of Jeopardy!, though, was the need for an accurate assessment of confidence in a particular response. Watson computes a confidence level for a batch of candidate responses, but it doesn’t just pick the best one from the list and go with it: there is a threshold of confidence that must be exceeded before Watson will try to buzz in for the opportunity to answer a question. It was the challenge of designing a natural language system with this “confidence aspect” that made taking on the Jeopardy! challenge so valuable to IBM.
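The buzz-in decision described above can be sketched in a few lines. This is only an illustration of the idea, not IBM’s actual code; the function name, the candidate list, and the threshold value are all my own inventions.

```python
# Illustrative sketch only, not IBM's implementation: score candidate
# responses, then buzz in only if the best one clears a confidence threshold.

def decide_to_buzz(candidates, threshold=0.5):
    """candidates: list of (response, confidence) pairs, confidence in [0, 1].
    Returns the best response if its confidence clears the threshold,
    otherwise None (meaning: don't risk buzzing in)."""
    if not candidates:
        return None
    best_response, best_confidence = max(candidates, key=lambda pair: pair[1])
    if best_confidence > threshold:
        return best_response
    return None

# A confident clue triggers a buzz; a shaky one does not.
print(decide_to_buzz([("Tab", 0.92), ("Fresca", 0.30)]))  # -> Tab
print(decide_to_buzz([("Tab", 0.40), ("Fresca", 0.35)]))  # -> None
```

The point of the design is that “knowing what it knows” is a separate judgment from finding candidate answers: a system can generate a plausible list and still correctly decline to answer.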
Designing a computer that can parse natural language, never mind converse in it, is a notoriously difficult problem in itself. Alan Turing, too, identified a question answering game as the next challenge (after chess) for computers, though he put his own spin on the task (more on that in my next post). Ferrucci points out the ways in which mastering natural language differs from some other tasks computers are well-suited for: natural language is “implicit, highly contextual, ambiguous, and often imprecise.”
I think an especially interesting point in the IBM promotional videos about Watson’s algorithms is that in order to parse natural language, Watson has to take into account how people actually use words, which can vary with context. That, AI researchers have come to realize, is a huge deal. You might think dealing with context dependence means adding a few footnotes to the rulebook here and there, but, it turns out, it means treating the rulebook as just one textual resource among many others. Ferrucci gives an example from the game in which finding the correct answer to a Jeopardy! question required making use of a relation between the words fluid and liquid that is not indicated in the award-winning lexical database WordNet. Context dependence of word use is a far bigger deal than you might realize, until you try to build a computer that speaks a natural language. David Gondek of IBM gives this great example of a Jeopardy! question that Watson was able to answer only because it made use of context-sensitive meanings of the word “introduce”: “The clue was: It was introduced by the Coca-Cola Company in 1963. Watson can find a passage stating that ‘Coca-Cola first manufactured Tab (the correct response) in 1963’, so in order to answer the question, Watson needed to understand that introducing and manufacturing can be equivalent – if a company is introducing a product. But that is highly dependent on context: if you introduce your uncle, it doesn’t mean you manufactured him.” (February 3 entry on http://ibmresearchnews.blogspot.com/ )
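Gondek’s “introduce”/“manufacture” example can be made concrete with a toy lookup table. This is my own illustration, not Watson’s algorithm: the idea is simply that a paraphrase is licensed only for certain combinations of verb, subject type, and object type.

```python
# Toy illustration (my own, not IBM's method) of context-conditioned
# paraphrase: "introduce" counts as "manufacture" only when a company
# introduces a product, not when a person introduces a person.

# Hypothetical table of licensed substitutions.
PARAPHRASES = {
    ("introduce", "company", "product"): "manufacture",
}

def paraphrase(verb, subject_type, object_type):
    """Return an equivalent verb for this context, or None if the
    substitution is not licensed in this context."""
    return PARAPHRASES.get((verb, subject_type, object_type))

print(paraphrase("introduce", "company", "product"))  # -> manufacture
print(paraphrase("introduce", "person", "person"))    # -> None: you didn't manufacture your uncle
```

A fixed table like this is exactly what a WordNet-style resource amounts to; the hard part, which the toy skips entirely, is learning such context conditions from unstructured text rather than writing them by hand.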
Scoring high on Jeopardy! against champion-level players, then, is not a matter of displaying skill at applying the kind of strict usage rules found in references such as WordNet. In fact, because word use is so highly context-dependent, Watson may even have to violate those rules occasionally in order to compete effectively against champion-level players. And making sense of slang and idioms, which enter a natural language through cultural products and practices with each passing day, often requires massive violations of such generalized formal rules, too. There are many things Watson has to learn in some way other than by consulting WordNet-style rules for using words, and some cultural aspects of what it has to learn are available only in unstructured formats.
Watson can learn from text in unstructured formats, and it can learn dynamically. In fact, Watson updates even as it plays the game; it can do so because it receives feedback on its own responses. On Jeopardy!, question categories can be constructed around puns, wordplay, or other capricious things. IBM researchers say that Watson is often learning about a category as it moves through it during play. And, adds IBM’s Jon Lenchner: “Not only does Watson get better at answering as clues in a category are revealed, but its understanding of its own in-category ability is also refined.” (Thursday, February 3, entry on http://ibmresearchnews.blogspot.com/ )
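The refinement Lenchner describes can be sketched as a running estimate of in-category accuracy, updated as feedback on each revealed clue arrives. Again, this is my own minimal illustration of the idea, not IBM’s method; the class and its neutral prior are assumptions for the sketch.

```python
# Sketch (my illustration, not IBM's method) of refining an estimate of
# one's own in-category ability: start from a neutral prior and update
# as feedback on each clue in the category comes in.

class CategoryConfidence:
    def __init__(self, prior_right=1, prior_wrong=1):
        # Beta-style prior: one imagined right and one imagined wrong answer.
        self.right = prior_right
        self.wrong = prior_wrong

    def record(self, was_correct):
        """Incorporate feedback on one clue in this category."""
        if was_correct:
            self.right += 1
        else:
            self.wrong += 1

    def estimate(self):
        # Estimated chance of answering the next clue in this category correctly.
        return self.right / (self.right + self.wrong)

category = CategoryConfidence()
print(category.estimate())            # 0.5: neutral before any clue is revealed
category.record(True)
category.record(True)
category.record(False)
print(round(category.estimate(), 2))  # 0.6 after two right, one wrong
```

An estimate like this could then feed back into the buzz-in decision: a low in-category estimate argues for a higher threshold before risking a buzz.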
What role does the human competition play here? That is a valuable question to ask; it is seldom asked, and the answers are sometimes surprising. Is it necessary that Watson learn in the same way the humans it is competing against have learned? Even if that were ideal, it seems to ask the impossible; certainly there are emotional and developmental aspects of human learning that no nonhuman learner could duplicate. We might still ask what is indispensable about learning the sorts of things needed to parse natural language, however.
Is it necessary that Watson at least perform as a human would, i.e., give a set of responses that a human might give? Maybe not. Even though it is competing against humans, its answers are not being directly compared to those of the humans against which it competes. It doesn’t matter if it shows a combination of ignorance of one fact and knowledge of another that a human champion-level player would never exhibit. Scoring is pretty straightforward, seldom requiring a human judge to weigh in. The use of a numeric score that can be compared to the scores of other contestants in a time-tested game eliminates the need to answer questions about how similar Watson is to a human. I don’t see any reason that Watson would have to give a set of answers that meets a consistency criterion corresponding to having a human personality or upbringing. It is what it is: a score in a Jeopardy! game played against the current top champions. Watson doesn’t need to come off as plausibly human to meet this Jeopardy! challenge.
The practice round seems to indicate that Watson has met the challenge. If Watson wins the actual match next week, I expect there will be those who wonder about details such as the buzz-in lockouts: Watson’s nonhuman capabilities and lack of human weaknesses might give it an unfair advantage there. IBM’s assurances seem a bit off the mark to me. They compare Watson’s interaction with the physical components of the game to that of the human contestants. It is the same in some ways: “Watson sends a signal to a mechanical thumb, which is mounted on exactly the same type of Jeopardy! buzzer used by human contestants. Just like Ken and Brad, Watson must physically depress a button to buzz in.” It differs in others: “The best human contestants don’t wait for, but instead anticipate when Trebek will finish reading a clue. They time their ‘buzz’ for the instant when the last word leaves Trebek’s mouth and the ‘Buzzer Enable’ light turns on. Watson cannot anticipate. He can only react to the enable signal. While Watson reacts at an impressive speed, humans can and do buzz in faster than his best possible reaction time.” This leaves unaddressed one of the things contestants find most frustrating: pressing the buzzer button a tiny fraction of a second too early, only to find it is locked out for a certain period before it will register again. If the description IBM gives in the January 10 entry on its blog ( http://ibmresearchnews.blogspot.com/ ) is correct, Watson appears to be protected from the lockout for pressing the buzzer prematurely. That lockout is what is going on when, in some of the videos of the practice rounds between Watson and human contestants, you see the humans pumping their buzzer buttons frantically. It happens a lot on shows featuring only human contestants, too.
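The lockout mechanic is easy to model, which makes its cost vivid. This is my own reconstruction of the rule as contestants describe it, not Jeopardy!’s actual hardware; the lockout duration and the assumption that an in-lockout press restarts the penalty are both hypothetical.

```python
# A minimal model (my reconstruction, not the show's actual hardware) of
# the buzz-in lockout: pressing before the "Buzzer Enable" signal locks
# the buzzer out for a fraction of a second, and a press during the
# lockout is ignored and restarts the penalty window.

LOCKOUT_SECONDS = 0.25  # hypothetical penalty window

def first_registered_press(press_times, enable_time, lockout=LOCKOUT_SECONDS):
    """press_times: times (in seconds) the contestant pressed the buzzer.
    Returns the time of the first press that registers, or None."""
    locked_until = 0.0
    for t in sorted(press_times):
        if t < enable_time or t < locked_until:
            # Early press, or a press while still locked out: ignored,
            # and the lockout window restarts from this press.
            locked_until = t + lockout
            continue
        return t
    return None

# Pressing 0.05 s early costs dearly: only the much later press registers.
print(first_registered_press([0.95, 1.30], enable_time=1.0))  # -> 1.3
```

Under this model a contestant who anticipates the enable signal perfectly beats one who reacts to it, but a contestant who anticipates slightly too early loses to both, which is why the frantic pumping looks the way it does.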
Should there be squabbles about who would have won had the buzz-in rules been different, they won’t detract from this important fact: Watson is able to give responses at a champion level of play. And that really does say something.
More to come . . .
How is Watson able to do what no computer before it has been able to do? Watson’s “learning from reading.”
The Jeopardy! challenge compared with challenges derived from the parlor game that Cambridge mathematician Alan Turing described.
Approaches Watson designers did not use: some surprises here.