In a recent posting on his website The Morning Claret, Simon Woolf astutely contemplates the state of wine competitions. Intriguing and encouraging are the dynamics of “discussion, retasting, and encouragement” among judges that regularly result in reassessments and a mutual learning experience. There’s no question humans are amenable to discussing or disputing taste and to rational persuasion, whether via requests to reexamine under a different aspect, to attend to characteristics one might have missed, or to see how those might be perceived to hang together in a certain way, perhaps whereby the whole proves more aesthetically appealing than the sum of its parts. Arguably, though, the role of point scores in group assessments rests on indefensible assumptions.
Woolf points to the ubiquity of “medal thresholds” at wine competitions as measured in points. But a cogent, credible account of group scoring would presuppose a commensurability of tasters’ scores that is chimerical, its implausibility merely camouflaged by the shared use of numerals. To see this, suppose you and I taste the same St-Joseph under common conditions and ask each other what score (if forced to) we would award it. I say 90, and you say 88. Can we even conclude from this that I like the wine more, consider it a better example of the wine grower’s craft, or deem it a better example of some particular type among the many it represents (northern Rhône, St-Joseph, Syrah…) than you do? By no means. The numerical difference might instead reflect a difference in the level of craftsmanship, aesthetic appeal, or satisfaction each of us deems appropriate to a given numerical rating. And how would, or indeed how could, that level be ascertained or that source of difference be inferred?
The situation with wine competitions calls to mind how doctors, after asking patients to rate experiences of pain one to ten, purport to thereby infer their “tolerance” for pain. How could they rule out the obvious possibility that different patients simply have different notions of what numeral is appropriate to ranking a given experience? Or, indeed (if unafraid to dive in at the epistemological deep end), how could they rule out that putting an inch of one’s index finger into a vise grip under controlled conditions (or mutatis mutandis, the same wine under one’s nose and into one’s mouth) is simply experienced in a distinctly different way by different people? Perhaps neurophysiologists now believe themselves up to this, but it would require debatable theoretical assumptions.
The pain example suggests a further conundrum shared with the 100-point wine scale. If “10” or “100” is significant, wouldn’t its employment imply that one cannot possibly imagine experiences more painful or more aesthetically satisfying? What happens, then, when a pain or wine reveals itself as surpassing what was heretofore imaginable? Presumably, one would be forced into downward revision of one’s previous point assessments.
Beyond standard deviation
Failure of commensurability is demonstrated with equal obviousness in the context that prompted Robert Parker’s adoption of a 100-point scale—namely, its nearly universal use in US schools. That scale, incidentally—with group averages to the decimal place, no less—was treated as routine in certain tasting circles 125 years before the founding of The Wine Advocate, as witness an 1853 account in The Western Horticultural Review and Botanical Magazine of a session featuring Cincinnati’s once numerous and internationally respected wine growers. (Harvard had been assessing academic performance that way from as early as 1837.)
No grade determination in American schools could be more fraught than that separating the letter grades D and F, since the latter signifies failure, hence lack of accreditation or of permission to continue on from one sequential course or grade level to the next. And yet the numerical threshold for failure has long been set quite variously depending on the institution, being in some cases as low as 50 and in some as high as 70. Ultimately, each institution, and indeed each teacher, sets his or her standard and gauges the appropriateness of a given quiz or test not just with reference to the intuitive meaning of “failed” (to have sufficiently grasped the relevant material), of “excellent” (aka A), “good” (B), “fair” (C) or “poor” (D), but also to his or her conviction as to what degree of comprehension, grasp, or proficiency merits those intuitively intelligible descriptions. If numerical test results deviate widely from a teacher’s intuitive sense of where the test’s takers rank in their degree of comprehension, those intuitions justifiably take precedence, and he or she decides, in this instance, to “curve” the grades. Surely the situation for wine judges is not so different. They likely have a more intuitive and securely shared sense of what, in wine competitions, justifies a gold medal than they do of what constitutes “a 90-point wine.”
Is mutual calibration possible? In principle, yes. Teachers tasked with grading so-called Advanced Placement Exams in the US are trained on a series of paradigms. But these reflect rigidly structured test questions in a unique context and are meant to exemplify just six whole-number scores. Try applying that approach to the diversity of wine, on a scale of 100, and with each judge asked, for the sake of commensurability, to jettison his or her usual grading procedure… Barring that, the positing of a common denominator that could underwrite averaging quantified assessments from myriad tasters is on scarcely firmer footing than averaging the numerical distance between two points as expressed in kilometers, miles, leagues, and furlongs.
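For readers who like to see the arithmetic, the units analogy can be put as a small Python sketch. Everything in it is illustrative rather than drawn from any competition: the 10-kilometer distance, the names `KM_PER_UNIT`, `express`, and `mean`, and the choice to treat a league as three statute miles (historical leagues varied) are all assumptions made for the example.

```python
# Kilometers per unit of length.
# ASSUMPTION: a league is taken as 3 statute miles; historical leagues varied.
KM_PER_UNIT = {
    "kilometers": 1.0,
    "miles": 1.609344,
    "leagues": 3 * 1.609344,
    "furlongs": 0.201168,
}


def express(distance_km, unit):
    """Return the numeral a reporter using `unit` would give for a distance."""
    return distance_km / KM_PER_UNIT[unit]


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Hypothetical scenario: four reporters measure the SAME distance,
# each in a different unit.
TRUE_DISTANCE_KM = 10.0
reports = {unit: express(TRUE_DISTANCE_KM, unit) for unit in KM_PER_UNIT}

# Averaging the bare numerals ignores the units entirely.
naive_average = mean(reports.values())

# Averaging only makes sense after converting every report to one unit.
calibrated_average = mean(reports[unit] * KM_PER_UNIT[unit] for unit in reports)

print("numerals reported:", reports)
print("average of bare numerals:", naive_average)
print("average after conversion to km:", calibrated_average)
```

The bare-numeral average corresponds to no distance at all, while the converted average recovers the shared one. With distances, the rescue is possible only because a conversion table exists. For tasters’ scores, no such table is on offer, which is precisely the difficulty.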