NLP is hard even for Amazon

I was looking at Judea Pearl’s book on Bayesian networks, Probabilistic Reasoning in Intelligent Systems, on Amazon today, and ran into an interesting demonstration of why NLP is so difficult. I’m including the full quote below, because those interested in language processing may enjoy thinking about how they might fix this kind of fluke.

Recently I needed to learn the principles of Bayesian networks quickly, so I bought three books: this one by Pearl, “Pattern Recognition and Machine Learning” by Bishop, and “Bayesian Artificial Intelligence” by Korb and Nicholson. Each has a very different audience and different set of strengths.

The Bishop book would probably be a great text for a serious student with a year to spend learning the theory of machine learning. But I found it a bit too concise, with a bias towards an “algebraic” description rather than a “geometric” one (my preference). I wound up spending a lot of time trying to translate equations into mental pictures in order to grasp the concepts. Too much work, so I dropped this after a couple of days.

Next I tackled the Korb and Nicholson book. This one’s aimed at the application engineer who wants to get a network up and running quickly, and is not too concerned about how it works. I’ve been in that position many times in my career, and have always welcomed books like this for giving me a quick start into a new field. But this time I needed to really understand how Bayesian networks worked, and for this the Korb and Nicholson book is not great. In the first 9 pages of chapter 3 they try to explain the belief propagation algorithm, but their hearts just weren’t in it–I found their explanation to be unintelligible. (I suspect most readers just skim this to get to the applications.) So after several days of struggling and getting nowhere, I tossed this aside as well.

The Pearl book was the only one left; I put it off to last since I was initially somewhat intimidated by it. After all, this is one of the books that kicked off the “Bayesian revolution,” so I was fearing a foundational math book consisting of one dry theorem after another. Not so! Although you have to read 174 pages to get through the belief propagation algorithm for trees, this took far less time than reading the first 62 pages of Korb and Nicholson, which cover roughly the same ground. The reason: Pearl is a gifted teacher and writer. His explanations are a series of baby steps, leaving nothing out, never assuming the reader will make “obvious” inferences, and supplying motivation every step of the way. Although he doesn’t have a lot of figures in the book, the ones he does have are excellent, and by the time I hit page 175, I had a clear picture in my head of not just how it all worked, but why it worked. In fact, after just two days of reading, I was able to implement the belief propagation algorithm in software in an afternoon (I tested the software with examples from the Korb and Nicholson book, so that book was ultimately useful). Pearl made the subject seem almost obvious. If you are looking for a book to help you get canned Bayesian software up and running for an application quickly, this is not it. But if you want to really understand how these things work, and don’t have a lot of time available, I cannot imagine a better book than this.

Amazon has a very cool feature under the “Customer Reviews” section of product pages that displays quotes from reviews that people found to be generally useful. These reviews can be either positive or negative, and their usefulness is determined by the number of people who click “Yes” when asked “Was this review useful to you?” The interesting part is that Amazon extracts quotes that it believes sum up or characterize the overall message of the review. In the particular review above, the following snippet was extracted and displayed at the top of the customer reviews section:

So after several days of struggling and getting nowhere, I tossed this aside as well.

If you find this sentence in the review (final sentence of the third paragraph), you’ll see that the customer was actually talking about another book he had bought to learn the same subject! When I saw this snippet, I was shocked to see that someone had had such a negative experience with this book—I’ve heard from many people that it’s incredibly lucid. I pulled up the full review, and was tickled to see that an intelligent system had made a mistake when summarizing user reviews on a book about intelligent systems.

My first thought was that anaphora resolution would be the way to prevent this type of mistake. After reading the paragraph, though, I think it would be tricky for an automatic system to resolve this reference. The antecedent is fairly far back in the text. I’ve never worked on this problem, but I imagine it’s fairly difficult to resolve a reference when it spans more than a few sentences.


Visualizing Structured Prediction

There doesn’t seem to be a settled definition of structured prediction yet, at least not one that I’ve encountered. There are definitely candidate definitions (from people like Noah Smith, Hal Daume, Michael Collins, etc.), but it’s still unclear what exactly makes a problem fall into the domain of structured prediction. One of my favorite descriptions, from Noah Smith in one of his online lectures, is “you’ll know it when you see it”.

There’s a cool “news for researchers” site started by Percy Liang and Hal Daume. You can monitor research papers in a particular area by subscribing to a channel. I looked at the structured prediction channel, and wanted to visualize which terms occurred frequently in the abstracts of the structured prediction papers.
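For what it’s worth, the kind of term counting I had in mind can be sketched in a few lines of Python. The abstracts below are made up for illustration; in practice they would come from the site’s feed. The stopword list is also just a hypothetical placeholder:

```python
from collections import Counter
import re

# Hypothetical abstracts standing in for real ones pulled from a channel.
abstracts = [
    "We present a structured prediction model for sequence labeling.",
    "Structured prediction with large-margin methods for parsing.",
    "A sequence labeling approach based on conditional random fields.",
]

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "we", "with", "for", "of", "on", "based"}

def term_frequencies(texts):
    """Lowercase, tokenize on alphabetic runs, drop stopwords, and count."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts

print(term_frequencies(abstracts).most_common(3))
```

From there, the counts could feed a bar chart or a word cloud.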


Conditional Random Fields

This post will cover the original conditional random fields (CRF) paper by Lafferty et al. You can find bibliographic information on my CiteULike page here.

First pass summary: The paper introduces a new method for labeling and segmenting sequential data. This problem is traditionally solved using Hidden Markov Models (HMMs) and, more recently, Maximum Entropy Markov Models (MEMMs). HMMs are generative models, meaning that they model a joint distribution over observation and label sequences. Using generative models with complex data (such as sequences) can make inference intractable, since there are exponentially many observation-label pairs. MEMMs, on the other hand, are conditional models that address some of the intractability issues of HMMs. Lafferty et al. point out, however, that non-generative sequence models suffer from the label-bias problem. They claim that CRFs retain the advantages of MEMMs over HMMs while addressing the label-bias problem.
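To make the conditional form concrete (my sketch, not the paper’s notation): a linear-chain CRF defines p(y|x) as exp of a sum of local transition and emission scores, normalized by a partition function Z(x). The point of the chain structure is that Z(x) is computable by a forward pass instead of summing over exponentially many label sequences. A toy example with made-up scores:

```python
import math
from itertools import product

# Toy linear-chain CRF over 2 labels and a length-3 input.
# All scores are made up purely for illustration.
LABELS = [0, 1]
trans = [[0.5, -0.2], [0.1, 0.3]]            # trans[i][j]: log-score of label i -> j
emit = [[1.0, 0.2], [0.3, 0.8], [0.6, 0.1]]  # emit[t][y]: log-score of label y at position t

def seq_score(ys):
    """Unnormalized log-score of a full label sequence."""
    s = emit[0][ys[0]]
    for t in range(1, len(ys)):
        s += trans[ys[t - 1]][ys[t]] + emit[t][ys[t]]
    return s

def log_Z_brute():
    """log Z(x) by enumerating all label sequences (exponential in length)."""
    return math.log(sum(math.exp(seq_score(ys))
                        for ys in product(LABELS, repeat=len(emit))))

def log_Z_forward():
    """log Z(x) by the forward algorithm (linear in length)."""
    alpha = [emit[0][y] for y in LABELS]
    for t in range(1, len(emit)):
        alpha = [math.log(sum(math.exp(alpha[i] + trans[i][y]) for i in LABELS))
                 + emit[t][y]
                 for y in LABELS]
    return math.log(sum(math.exp(a) for a in alpha))

# The two computations agree, so p(y|x) = exp(seq_score(y) - log Z(x))
# is a proper distribution over label sequences.
print(abs(log_Z_brute() - log_Z_forward()) < 1e-9)
```

The same forward recursion, with sums replaced by maxes, gives Viterbi decoding; more on that when I get to the inference sections of the paper.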


Hello world!

Hi! I’m a graduate student at Carnegie Mellon University in the School of Computer Science. I’m interested in natural language processing, machine learning, and artificial intelligence. You can find my academic web page here.

This blog will be an informal place for me to write up summaries and reviews of various research papers that I read. It’s primarily for personal use, but I’m making everything public in case someone might find the content useful in his or her own research. That being said, you should probably take everything you read here with a grain of salt. I am not an expert by any measure.
