How Consistent Are Wine Judges?
Survey shows only 30 of 65 judging panels achieved similar results

Writing in the current issue of the Journal of Wine Economics, Dr. Robert Hodgson documents the significant variability in decisions by judges at the California State Fair Commercial Wine Competition.
A survey of approximately 65 judging panels between 2005 and 2008 yielded just 30 panels that achieved anything close to similar results, with the data pointing to "judge inconsistency, lack of concordance--or both" as reasons for the variation. The phenomenon was so pronounced, in fact, that one panel of judges rejected two samples of identical wine, only to award the same wine a double gold in a third tasting.
Hodgson, a retired professor in the Department of Oceanography at Humboldt State University in Arcata, Calif., and proprietor of Fieldbrook Winery in Fieldbrook, Calif., became curious about the phenomenon as a result of his own experience as a judge for the California fair's competition.
"Sitting there with 30 wines in front of you, trying to make decisions and going back from one to another, I just didn't think I was very capable of it," he said. He eventually resigned as a judge, joining the competition's advisory board at the invitation of G.M. "Pooch" Pucilowski, the competition's chief judge.
While the lack of concordance among competition judges is well known, Hodgson was intrigued by the specific causes and suggested examining judge reliability at the California competition. To do so, he developed two sets of trials in which triplicate samples were embedded in the flights of wines the judges tasted during regular competitions at the California State Fair. The goal was to see whether judges could replicate their own results. "It turns out to be a fairly difficult task," Hodgson said.
The first trial tested whether a wine would qualify for the final judging. A single wine was given to four judges in triplicate; a sample rejected by at least half the judges would not proceed to the final judging. Three of the four judges rejected the first two of the triplicate samples, yet the judges accepted the third sample for the final judging, where it received a double-gold medal.
In the second trial, four triplicate samples were given to each judge, all poured from the same bottle to reduce variation among the samples each judge received. The triplicate samples were typically delivered during the second flight of the day, Hodgson explained, so the judges would have the best chance of reaching similar conclusions.
"We decided to try to make it as easy as possible," he said. "The results show that there are some judges that are more consistent than others, but there's a lot of variation or inconsistency in general."
Hodgson made clear that the results don't necessarily reflect the skill of the judges.
"It's just a terribly difficult task," he said of the judging process. "It's not that the judges are bad judges. I think the format of having a judge taste 30 wines four times a day exceeds the limits of their abilities."
Indeed, the conditions are such that the same judges enjoying the same wines over the course of a dinner might reach different conclusions.
Pucilowski agreed. Although the wine competition's advisory board debated whether to let Hodgson publish the study's results, initially discussed at the August 2008 meeting of the American Association of Wine Economists in Portland, Ore., it considered the question of inconsistent results a significant issue--not just at the California fair's competition, but at competitions across the industry.
"I'm happy we did it. I'm happy we'll continue to do it. And I'd be a whole lot happier if other competitions would step up and also do it," Pucilowski said.
His support of Hodgson's work stems from an interest in establishing a baseline for judges that would reduce inconsistent results and thereby improve the professional rigor of decisions at the competition, which dates to 1855. "Frankly, the reason I did this is I wanted to get the best judges I could possibly get. How do I know by looking at you or by looking at your scores that you're a good judge? I have no idea," Pucilowski told Wines & Vines. "The one thing I want is consistency."
Hodgson said the question of judge reliability is important because competitions are not cheap for participants, but winning gold medals or other honors stands to significantly improve a winery's sales. Consistent results could underscore the merits of the wines receiving awards.
The question of inconsistent results has drawn attention from researchers in the past. Hodgson's own interest was stirred by a report in the California Grapevine in 2003 surveying medals won by wines at competitions in various states.
More recently, writing in the Australian Journal of Grape and Wine Research, Richard Gawel and P.W. Godden examined the results of "expert wine tasters" over a 15-year period. They concluded that consistency varied greatly among individuals but improved when scores from a small team of tasters were combined.
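That improvement from pooling is what elementary statistics predicts: averaging the independent scores of n judges shrinks the spread of panel scores by a factor of the square root of n, so a four-judge panel should be roughly twice as repeatable as a lone taster. Here is a minimal sketch of that effect, assuming a simple model in which each judge's score is the wine's true quality plus independent noise (the score scale and noise figure are illustrative assumptions, not values from either study):

    import random
    import statistics

    random.seed(0)

    TRUE_QUALITY = 88.0   # assumed "true" quality of the wine (100-point scale)
    NOISE_SD = 4.0        # assumed per-judge scoring noise, in points
    TRIALS = 10_000

    def panel_score(n_judges):
        # Average the independent scores of n judges tasting the same wine once.
        return statistics.mean(
            random.gauss(TRUE_QUALITY, NOISE_SD) for _ in range(n_judges)
        )

    for n in (1, 4):
        scores = [panel_score(n) for _ in range(TRIALS)]
        print(f"panel of {n}: std dev of panel scores = "
              f"{statistics.stdev(scores):.2f} points")

Run repeatedly, the four-judge panel's averages cluster about twice as tightly around the wine's true quality as a single judge's scores do, which is the pooling effect Gawel and Godden observed.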
On the personal side, Hodgson feels better about his own performance as a judge after seeing the results of his research.
"After doing the study, I don't think I'm any worse than anyone else," he said.
Competitions would do far better and be much fairer if the judging were done by panels and not individuals. We saw the improvement when we switched a 30-year-old competition from individual judging with no discussion to four-person panels with a leader. It's still less than perfect, but vastly better and fairer, because we try to bring a mix of experience to each panel: a winemaker, a retailer, a journalist, a sommelier, etc. When a wine is graded by four palates who must find agreement, it makes a big difference.
I would argue that had the study's researchers put those wines in front of panels instead of individuals, three or four times during a competition, the chances are vastly greater that a panel would have spotted the similarity in the wines.
In that case, that would have made for much better, more accurate judging. No?
--The attack on wine writers is absurd if one praises Jancis and criticizes Parker. For whatever else he may be, he is consistent.
--Wine judging is not a science. Big competitions use lots of judges and ask them to rate hundreds of wines. Very few people could ever do that consistently, but experienced palates who are used to judging lots of wines will do it better.
--Many people, including some who have posted here, will no longer submit themselves to the agony of trying to judge hundreds of wines at a time. The dreaded Petite Sirah 75 at nine in the morning, or the Sparkling Shiraz as the first order of the day, were enough to send me screaming into the night.
--These various judgings are no substitute for one's own analysis. Nor is the informed opinion of Robinson, Parker, Tanzer, Laube, Connoisseurs' Guide or anyone else, but consumers want and need help. Hence, the variety of beauty contests.