Why Rubrics Do Not Work For Competitions & Music Festivals

Rubrics are everywhere. We experience rubrics when filling out surveys to rate our experience with a home builder, cable company, or hotel. In academia, rubrics are present when teachers define that 20% of the semester grade will be homework, 40% will be papers/presentations, 30% tests, etc. Assignments themselves have rubrics: 25% of one’s presentation grade will be for length of presentation, 50% for quality of content, 20% for eye contact / speaking clearly, etc. Even a stereotypical math homework assignment that appears at first to be devoid of rubrics, such as “Do page 117, every odd problem,” is saturated with an implicit rubric, since each problem is assigned equal weight. Rubrics come into play whenever an evaluator assigns (whether intentionally or not) equal or unequal weight to each scoring component of an evaluation. Rubrics are used in academia so that a) students know what to expect from an evaluation, b) students are given clear feedback after their evaluation, and c) feedback is made as objective as possible (in other words, hopefully all teachers would arrive at the same grade given a certain rubric).

So if rubrics are used so ubiquitously and successfully in so many places, why do I claim that using rubrics to judge music performance is a terrible mistake?

The Useful Rubric That Malfunctions

First, just a brief point of clarification: according to the dictionary, the word rubric can refer to general rules or procedures, but within the realm of pedagogy, it is widely accepted that the term refers specifically to scoring/assessment rubrics.

For argument’s sake, suppose we have the following rubric to score a musical performance (as you’ll see, it really doesn’t matter how exactly we construct the rubric):

Score 0-10 for all of the following performance aspects:

  • Dynamics
  • Balance & Voicing
  • Phrasing
  • Articulative detail (legato, staccato, rests)
  • Pedaling
  • Pitch and rhythmic accuracy
  • Musical imagination
  • Technical facility / physical control
  • Tempo / interpretive appropriateness
  • Memory / fluency of performance

Whether deliberately or not, by creating this rubric, I have predetermined that each of these categories will be 10% of every performer’s overall grade (100% divided by 10 equal categories). What if a student plays a gigue at an inappropriately slow tempo but otherwise plays it very musically, accurately and fluently? A real example that I experienced as a judge of a local music festival years ago was the Tischer Gigue in E minor played at the extremely slow tempo of dotted quarter = 66. The rubric above will deliver a very high rating to this student: 100% in all categories except “Tempo / interpretive appropriateness,” resulting in no less than a 90% grade. This translates to the highest grade possible in most or all music festivals, when in fact the character of the student’s performance is drastically affected by the decision to play it so incredibly slowly (or perhaps the student needs more time to speed the piece up). At this tempo, it can’t even be called a gigue!
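
To make the arithmetic concrete, here is a minimal sketch in Python, with hypothetical scores for the performance just described: with ten equally weighted categories, even a zero in the tempo category cannot pull the total below 90%.

```python
# Hypothetical scores (0-10) for the slow gigue described above:
# near-perfect in every category, worst possible mark for tempo.
categories = [
    "dynamics", "balance", "phrasing", "articulation", "pedaling",
    "accuracy", "imagination", "technique", "tempo", "memory",
]
scores = {category: 10 for category in categories}
scores["tempo"] = 0  # the inappropriately slow tempo

# Each category is implicitly worth 10% of the grade (100% / 10 categories).
overall = sum(scores.values()) / (10 * len(categories)) * 100
print(overall)  # 90.0 -- still the top rating at most festivals
```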

As tragic as this may be for the student involved, it would have been irresponsible to deliver such a dishonest rating to the student for something that interferes so profoundly with the character of the music, even when the tempo was the teacher’s decision, based on a lack of understanding of compound meter, as I later learned – the teacher thought the “Allegro” referred to the speed of the eighth notes. [It helps me to be honest with students when I remember the central core of my teaching philosophy, which is that there is always a gentle way to deliver the truth to students.] The unfortunate reality of musical evaluation is that it is necessary to adapt the weight of each factor to the individual piece and to the individual performance in order to deliver fair and accurate results, and rubrics do not offer this critical flexibility.

Similarly, sometimes we hear aspects of a performance that are so good, they deserve more than just a 5 out of 5 or 10 out of 10 in the given category, and again, no rubric allows for one category to be given additional weight on an individual basis. This isn’t just some tiny flaw of using rubrics in music performance judging. It is nothing short of debilitating.

Let’s consider Burgmuller’s “Harmony of the Angels” as well as Bach’s Little Prelude in C, BWV 939, and think about various rubric categories that would work for both pieces. How about the category of tempo? Bach did not specify a tempo in the prelude, and it could be played rapidly/technically or slowly/expressively. Either approach could be pulled off successfully, so the tempo category completely fails for the Bach piece. How about the category of articulation/rests? There is a lot to hear in this category in the Bach, but in the Burgmuller, the pedal is down almost constantly, making this almost a null factor. Similarly, a rubric category of pedaling would fail for the Bach piece, since the vast majority of performers will play that piece without any pedal. OK then, what about the category of dynamics? Bach did not specify dynamics. Some of our greatest artists today play Bach with lots of dynamic expression. Others put in very minimal dynamic inflection, always lingering somewhere between mp and mf. The rare purist doesn’t play with any dynamics at all in order to simulate the harpsichord, and in fact those purists would usually object to playing Bach on the piano in the first place, since the piece was conceived for harpsichord.

One reader labeled the rubric I describe above as a “stupid rubric.” I agree. Any rubric that is built with such specific aspects of musicianship will work well with a small number of pieces and completely malfunction for the rest.

The rubric used for guild exams put on by the American College of Musicians has no fewer than forty-three categories, but it doesn’t malfunction, since not all categories must be filled out for all pieces. Not only that, but the feedback is extremely vague: the report card instructs the judge merely to indicate which piece has a given issue, so if piece #4 has an issue with pedaling, the judge need only write “4” in the pedaling category. As one who participated in the guild for many years, I found it incredibly impersonal and anticlimactic, as if my performance were just a product on an assembly line. The feedback didn’t feel customized. (My opinion here isn’t tainted by any negative experiences – I always received high marks, and I received gold pins my last two years.)

Maja Wilson reflects on this impersonal nature of rubrics in the realm of evaluating English writing when she says in her article, “Why I Won’t Be Using Rubrics to Respond to Students’ Writing” (English Journal 96.4, 2007):

“No matter how elaborate or eloquent the phrases I was invited to circle, the feedback they offered to students was still generic because they weren’t uttered in reaction to the students’ actual work. Instead of emerging from what Louise M. Rosenblatt would call the transaction between an individual reader and text, the feedback offered by a rubric made bypassing that interaction all too easy.” (p. 63)

“The way that rubrics attempt to facilitate my responses to students—by asking me to choose from a menu of responses—troubles me, no matter how eloquent or seemingly comprehensive or conveniently tabbed that menu might be.” (p. 63)

At the 2007 World Piano Pedagogy Conference, I attended a very interesting lecture by Yoheved Kaplinsky. In that lecture, she said something fascinating that has really stuck with me: “Talent as a factor in musical success is like sex as a factor in marriage: if it’s there, it’s 10%; if it’s not, it’s 90%.” [Being the head of the pre-college department at Juilliard, she of course defines “success” here as becoming a concert pianist – obviously she would define success differently if we were just talking about the average or recreational student.] The logic implied by this statement relates to this rubric scenario, because in both cases, you have factors that contribute in drastically different ways to the overall result depending on just how good or bad each factor is (and on the nature of what is wrong with each factor). If pedaling is correct, it may contribute 5% or 10% to a student’s high score. If pedaling is abysmal, it may contribute 50% to the student’s low score. The same goes for most other aspects of the performance.
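
To sketch that idea in code (Python, with invented numbers and an invented severity_weight mapping, not anything any festival actually uses): the share of the overall impression that a single factor claims depends on how badly it goes wrong, which is exactly what a fixed set of weights chosen in advance cannot express.

```python
# Invented scores (0-10) for a hypothetical performance with abysmal pedaling.
scores = {"pedaling": 2, "phrasing": 9, "accuracy": 9, "tone": 9}

def severity_weight(score: float) -> float:
    """Illustrative mapping only: a factor handled well claims a small share
    of the overall impression; the same factor done badly dominates it."""
    return 0.05 if score >= 8 else (0.50 if score <= 3 else 0.20)

# A fixed-weight rubric gives every factor 25%, no matter how it went.
fixed = sum(s / 10 for s in scores.values()) / len(scores) * 100

# A severity-sensitive view lets the weights adapt, then normalizes them.
weights = {name: severity_weight(s) for name, s in scores.items()}
adaptive = sum(weights[n] * scores[n] / 10 for n in scores) / sum(weights.values()) * 100

print(round(fixed), round(adaptive))  # roughly 72 vs 36: the pedaling colors everything
```

Whether 36 (or 50, or 60) is the “right” number is beside the point; the point is that no weighting decided before the performance can track this.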

This breakdown still occurs in academic evaluations, whether of presentations or of overall class grades, but it occurs at a negligible level, since the evaluated factors are more objective than the factors in musical performance. A speaker either covers the topic completely or does not. A student either turns in homework assignments or does not. But the same cannot be said about pedaling, phrasing, or any other musical/technical aspect of performance.

So then, if we still insist on using rubrics in spite of their impersonal, inflexible and malfunctioning nature, the only place left to go is to broaden the categories in order to make them more personal and flexible.

The Less Useful Rubric That Doesn’t Malfunction as Much

An example of this broadening would be the five rubric categories used for the Associated Board of the Royal Schools of Music (ABRSM) exam: pitch, time, tone, shape and performance (each category is worth 20%). This gets rid of certain malfunctions: tempo and articulation/rests are both now part of the time category, pedaling is now part of tone, and dynamics are now part of the tone and/or shape categories. Sometimes these things can even cross into different categories, such as pedaling: if the pedal goes down too early or late, it’s in the time category, while if it doesn’t go down (or back up) far enough, it’s in the tone category.

Unfortunately, all five of these aspects are still automatically given equal weight for every piece and every performer, which makes no sense. For example, there is a lot more that can go wrong in the time category when playing the first movement of Mozart’s Sonata K. 545 than in the first movement of Beethoven’s Op. 27, No. 2 (“Moonlight”) Sonata. In the Beethoven, as long as the student uses the metronome to ward off rushing – which is a thousand percent easier in the Beethoven than in the Mozart – the time category almost acts as an automatic 20% (an “easy A”). In the Mozart, the time category is more like a tightrope over a pool of sulfuric acid: the slightest thing wrong with rests, rhythm or pulse/steadiness just about wrecks the piece. Nailing the time category in the Mozart ought to be worth 40% of the student’s grade. As another example, tone and shape will play a far more significant role in a Chopin nocturne than in a Bach gigue, and yet the rubric still forces the judge to give tone and shape the same weight in the Bach gigue as in the Chopin nocturne. So again, this rubric isn’t flexible enough to address the reality of each individual piece of music.

You can see where this is going. Ultimately we need a rubric that is so vague that we are left with maybe two or three categories. Just imagine the extreme here: what if we go all the way down to one single rubric category called quality? Could that still be called a “rubric”? Of course not. This would represent the other end of the spectrum – this time a “stupid” rubric because it literally changes nothing. Having a rubric with one category is the same as not having a rubric at all and just asking the evaluator to “assign a single overall grade using the best of your intuition.” The fewer categories there are, the less the rubric even deserves to be called a rubric.

But let’s keep going on this progression of making categories more and more vague anyway.

The Useless Rubric That Never Malfunctions

The vaguest categories I’ve ever seen in any rubric are those of the Trinity College London rubric, which assigns a total of 22 points across only three categories:

  • Fluency & Accuracy (7 points)
  • Technical Facility (7 points)
  • Communication and Interpretation (8 points)

These categories are now so vague that just about anything wrong with a piece could be accounted for in any of them, depending on how the judge frames it. Problems with pedaling, tempo and rhythm could fall into any of the three categories depending on whether the problem is deliberate (wrongly taught or wrongly learned) or accidental (a fluke).

As an example, if a student stumbles several times, I could dock the student in any one of the three categories, since it was a problem with fluency, it was possibly a problem with technical facility, and it certainly interferes with communication during a performance. Should I dock in one, two, or all three categories? This is as subjective a decision – if not more so – than the one I would face without the rubric at all, namely, “How much do those stumbles detract from the overall rating?” Does placing this stumble into a category make my feedback to the student more transparent or accurate? No, it does neither. Normally, I would have just briefly cited the stumble, but now I have to arbitrarily assign it to possibly multiple categories, and I have to arbitrarily decide how many points the stumble is worth in each category. From a student’s point of view, this would be like the math problem that is marked wrong at several steps, all because of one single mistake early in the problem. This is not an improvement in clarity of feedback over a traditional freeform style of judging. It is cumbersome for the judge and actually more confusing for the student, possibly even unfair.

The more rigid the rubric, the more it malfunctions. The looser the rubric, the less it really changes anything and the more mental gymnastics it requires of the adjudicator to make it work. The Trinity rubric is clever in that it is harder to make it malfunction the way stereotypical rubrics do, but then why have the rubric at all at that point? Communication, transparency, and putting people in the loop? These categories are quite vague – they do not speak for themselves the way the more commonly understood “stupid” rubrics above do – so they still rely on responsible and competent judges to explain what it was in each category that caused them to mark it down. Whether there is no rubric or only a vague rubric, the student and parent are still very much out of the loop if the judge isn’t good at explaining him/herself.

Authenticity

Using rubrics also produces less genuine feedback. Nobody articulates this better than Maja Wilson in the same article referenced above. Regarding the judging of written papers (an activity just as complex and subjective as judging a musical performance), Wilson says:

“In my first few years of teaching, I often “fudged” the scores to make sure I didn’t award high scores to vacuous writing or low scores to writing that showed great promise. In addition, I’d come to think that the categories of the rubric represented only a sliver of my values about writing: voice, wording, sentence fluency, conventions, content, organization, and presentation didn’t begin to articulate the things I valued such as promise, thinking through writing, or risk-taking (Broad, Bob).” (p. 62-63)

Putting competent judges into rubric straitjackets turns them into manipulators. Judging performance with rubrics is literally a process of manipulation. We score everything in each category, then we look at the overall score and think, “Wait a minute, the student doesn’t deserve a score that low / that high.” So we add or subtract a point here and there to make sure the overall score isn’t so blatantly inaccurate. This is inevitable when we remember how arbitrary it is that we assign a pedaling issue to the “technique” category or the “interpretation” category, and how arbitrary it is that we take off 1, 2 or 3 points for that issue. In the end, we are left with a score that is the sum of a pile of calculations, each even more error-prone than the single judgment we would have made without the rubric. A competent judge must manipulate their scores, because few if any of the small judgments made along the way to the overall score were objective ones.
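
Here is a minimal sketch of that back-fitting process in Python, with invented numbers and the Trinity-style categories from above used purely for illustration: the judge’s ear has already settled on the overall score, and the category points are then nudged until the rubric agrees.

```python
# Invented category scores a judge might jot down first (out of 7, 7 and 8).
scores = {"fluency": 6, "technique": 5, "interpretation": 6}  # totals 17 of 22
target = 19  # the overall score the judge's ear says the performance deserves

# Back-fit: add a point here and there until the rubric "agrees" with intuition.
for name in sorted(scores, key=scores.get):
    if sum(scores.values()) >= target:
        break
    scores[name] += 1

print(scores, sum(scores.values()))  # the rubric now dutifully reports 19
```

The category breakdown the student eventually reads is, in effect, generated from the overall score rather than the other way around.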

Bad Apples

Some say that we use rubrics to protect students from bad judging. In festivals and evaluations, judges are always told to carefully justify the ratings they give out, especially when the student receives anything but the highest rating, and the vast majority of judges do a good job of this. If a student doesn’t receive a high rating, they are told why. Transparency is already there, and if it isn’t, then that judge should simply be avoided in future judging. Bad apples shouldn’t be allowed to spoil the basket. Resorting to rubrics is too heavy a price for the majority of decent judges to pay for the marginal benefit of improving the judging of a very small number of judges who can’t follow directions and who shouldn’t be judging in the first place.

There are certain people who, for whatever reason, are just not good judges, and no scoring rubric is going to change the quality of their judging. Over-penalization is the biggest problem with certain “bad apples.” If a student stumbles just once, misses one tiny staccato, and is docked too much for it, it doesn’t matter whether that reasoning is delivered to the student in paragraph form (“the reason you only got an excellent rating is…”) or in point/rubric form (“stumbled once and missed a staccato – minus 5 in such-and-such category”). Neither way is more objective. Neither way is more transparent. They both stink; it’s just that one of them requires the judge to jump through pointless hoops in order to provide the student with the disservice of over-penalization. In both cases, the student feels cheated, and the judge needs additional training and guidance.

Focus

Rubrics can also impede the focus of the evaluator. Wilson says in her article, “When Terra Lange, a teacher from Illinois, put aside the rubric, she was shocked to see the difference in her focus.” In the realm of musical performance, it’s certainly not an impediment to have an optional list of musical qualities on which a judge can take notes, but forcing the judge to use rubrics of any kind will definitely make the judge think less, not more, about what they’re actually hearing. I know personally that any brainpower devoted to deciding, “What category does that issue I just heard belong in?” or, “How many points was that issue worth?” is brainpower I could instead use to give the performer the same highest-quality feedback I give when conducting lessons.

Having discussed all of these vices of rubrics, I still haven’t gotten to the one that burns me the most. This one leaves me confused about whether to laugh or cry, which is why I saved it for last.

Replacing Real Fruit With Fake Fruit

Suppose hypothetically that you were able to create a rubric that anyone (including myself) looks at and says, “Yes, that’s awesome, that is the best rubric in the universe and will always produce consistent, accurate results no matter who uses it!” Tell me, where exactly does the authority of a “good rubric” come from? In other words, if there is an “answer key” to tell us which rubrics are good and which are bad, where would we find that answer key?

The answer: we judge a good written scoring rubric by how closely it imitates a good judge who isn’t using a written scoring rubric. We cannot evaluate the effectiveness of a rubric against some magical rubric meter. In an act of terrible irony, we create rubrics in the illusory hope that they will automatically and mechanically produce the same grade and feedback that we would have given anyway, without the rubric! It is for this reason that I have a hard time holding back giggles when I hear the phrase “stupid rubric.” My only thought when I hear that phrase is, “Exactly!”

There is only one way to create a truly “smart” rubric that is both useful (i.e., specific) and resistant to malfunction (i.e., accommodates complexity), and I’ll tell you right now that it’s not possible for a human being to ever use it, at least not directly. To call it “cumbersome” would be an understatement. A smart rubric will consist of millions of lines of code that a computer software system follows 20, 50, 100 or 200 years from now in order to evaluate a performance. [I can never bring myself to call this kind of thing “artificial intelligence” since I’m not sure if computers will ever develop a will of their own, and you can’t have actual intelligence without first having will. Only through independent will can a computer ever carry out an instruction that diverges from what it is told to do, and true intelligence is not possible without independent, divergent thinking.]

Perhaps (and I say lamentably, probably) one day, a very sophisticated piece of software will evaluate a performance with more fairness, consistency, transparency and thoroughness than any single human being ever possibly could. But even then, we will still ultimately reach that conclusion by comparing its results to the work of a skilled human not using rubrics. As long as humans are doing the judging, instead of tweaking rubrics to continue making futile improvements to overly simplistic algorithms of judging, we should simply focus on improving the judges themselves. For the few judges who could be considered hopeless cases (i.e., judges who don’t understand music well enough or judges who “do their own thing” even when told not to), we simply shouldn’t use them once we figure out who they are, and we should put forth more effort to motivate decent judges to judge, paying them if necessary. We should study the makings of good judging and hold adjudication workshops that train judges to listen, think creatively, prioritize, encourage, contextualize and synthesize more effectively.

What is the real reason we use rubrics?

The Law of Duality states that there are always two reasons for doing anything: the reason that sounds good, and the real reason. Maja Wilson offers some words that I take to be an answer to this question of why we really use rubrics (emphasis added by me):

“Rubrics are writing assessment’s current sacred cow because they provide the appearance of objectivity and standardization that allows direct writing assessment a place in standardized testing programs (Broad, Bob – What We Really Value: Beyond Rubrics in Teaching and Assessing Writing, 2003).”  (p. 62)

Alfie Kohn offers similar criticism in “The Trouble with Rubrics” (English Journal 95.4, 2006):

“Rubrics are, above all, a tool to promote standardization, to turn teachers into grading machines or at least allow them to pretend that what they are doing is exact and objective.” (p. 12)

“…they [rubrics] can never deliver the promised precision; judgments ultimately turn on adjectives that are murky and end up being left to the teacher’s discretion. … This attempt to deny the subjectivity of human judgment is objectionable in its own right.” (p. 13)

The desire to standardize writing so it can be included in standardized tests is no less worthy a goal than the goal of standardizing the evaluation of musical performances. Judging writing is no less subjective than judging a musical performance. But both of these drives to standardize lead to a decision that is, if you really think about it, rooted in fear. Stakes are high. Both kinds of tests can affect admission to colleges as well as scholarships, and the appearance of accuracy becomes more important than accuracy itself. Whether a rubric is so specific that it malfunctions most of the time, or so vague that it doesn’t actually have much function (in fact, it may actually increase the number of subjective decisions the judge makes, since the judge has to arbitrarily assign categories and points to everything they hear), people favor form over function and still sometimes opt to use rubrics.

Transparency? Feedback? Say What?

In competitions, competitors receive feedback only on their own playing, and without seeing the feedback of other competitors, there is still absolutely no way for competitors, parents or even teachers to be “in the loop” on the judges’ relative decisions, scoring rubric or not. In a competition, comparison between performers is what actually matters as far as the outcome is concerned, and rubrics aren’t going to help that at all.

As for judged music festivals, I don’t think I’ve ever encountered one that doesn’t in some way encourage generous ratings for students. How in the world can there be a need for additional transparency in events whose main problem is giving students ratings that are, if anything, too generous? This policy is put into place for good reason – festivals designed for average, hard-working students need to find a careful balance between honesty and encouragement. Some festivals explicitly ask judges to err on the side of generosity if they are torn between two ratings. Other festivals embed generosity into their deliberation systems by having only two judges, so that if one judge rates “Superior” and the other “Excellent,” the student’s final rating is rounded up to a Superior. The shock and surprise I experience at judges’ comments and ratings is almost always in the student’s favor, and the rare anomalies where students are “ripped off” by crabby judges wouldn’t have been helped by a rubric so vague that it still gives judges license to infuse their crabbiness into the rating.

In my 20+ years of reading judges’ comments to students, never did we feel that a rubric would have made things easier to understand. But I can think of many scoring-rubric scenarios where I could imagine someone saying, “Oh come on, that mistake couldn’t have been worth 8 points – what was the judge thinking?” Points are still assigned arbitrarily, so the student and parent are not brought into the loop any more than if no rubric were used. It is also possible to be too accountable and too transparent – anyone who has ever served on a board, jury or committee knows this! Good judges already put everyone in the loop by giving the high-quality feedback they are asked to give. So many arguments from the proponents of rubrics seem cynical to me – they make it sound like a significant percentage of judges are utterly incapable of communicating effectively when trusted to do so, and that’s just not my experience at all in events run by people who pay attention to judging quality.

The Today and Tomorrow of Judging

At least for now, the job of adjudicating should be left to human intuition. Our amazing minds, while imperfect, are still incomprehensibly complex compared to any computer software in existence, let alone a simple rubric! While it is an honorable goal to attempt to neutralize skewed results by limiting the rare rogue judges’ ability to impose their idiosyncrasies onto the overall outcome, any gain achieved by the rubric remedy will not come close to countering the collective unfairness of imposing the same rigid rubric on every piece, every performer, every performance, and every judge in a festival or competition. Class piano evaluations (i.e., piano proficiency exams for non-piano music majors) are a different situation entirely: a single evaluator assesses every student, so a rigid rubric works well there, since that evaluator is the one who designed it. Most importantly, these performers are being evaluated on skills that are much more objective, such as the ability to play scales/arpeggios with correct fingering, basic principles of musicianship/technique, etc.

Thankfully, most festivals and nearly all competitions get it: they do not provide rubrics for judges to evaluate performances. Those that do provide such rubrics find that the rubrics are almost never used as the exclusive mode of evaluation – instead, a small number of judges might use the rubric as a way to keep mental notes from being forgotten. Ultimately, I believe most judges intuitively and rightfully know that the evaluation of performed music is far too complex an endeavor to be helped by the framework of rubrics, since the framework itself needs to adapt to each unique piece and performance. We need to give more weight to the factor of pedaling in one piece, while in another piece, pedaling isn’t a factor at all. Some pieces have very little to say musically and are all about technique, while other pieces are easy to play technically but are more involved musically. Using the same rubric for both of these pieces would, at best, force a skilled judge to manipulate their thoughts so that the “real” rating they know the student deserves is justifiable under the awkward rubric, or, at worst, force a judge to give an inaccurate rating because the rubric fails to adjust to the piece.

In the few festivals (and competitions?) that currently use rubrics, let’s stop with the illusory optics and instead focus on the real thing: developing the judging skills of those who judge. Instruct them in the values, philosophies, strategies and processes that are helpful in good judging, and stop trying to make judging that is so inevitably subjective conform to preconceived ideas that only pretend to make it objective.

(c) 2020 Cerebroom

6 thoughts on “Why Rubrics Do Not Work For Competitions & Music Festivals”

  1. Thank you for sharing this. In general, I am not a fan of rubrics, although I’ve also experienced how useful they can be when grading a large class of college students. The points that you bring up are interesting, and should definitely be considered for future MTNA forms. I’ve only adjudicated at two local events so far, but I too appreciated the manner in which they have judges conduct themselves. I’m looking forward to our studio joining the festivals next year.

    1. You’re right, and all the education experts seem to agree on the benefits of rubrics in the classroom environment. What’s important is to distinguish the subject material. It’s one thing to evaluate student output in purely cognitive endeavors (homework, reports, tests, presentations), but music throws a wrench into this simplicity by also involving so many other dimensions that interact with each other – physical coordination, timing, dynamic expression… too much for any rubric to handle, even when we construct the rubric for ourselves! We must use a different rubric for every performer and every piece of music, which of course is what any good judge does in their head as they assign an overall score to each performer in a competition.

  2. Interesting perspective. I agree about the ineffectiveness of having 10 categories all scored out of 10, since it assumes each accounts for 10% of the total performance. However, when I’ve adjudicated using this kind of system, I think of the overall score it should be (88, for example), and then fill in the 10 categories so they add up to 88.

    1. I would just reiterate that the ineffectiveness isn’t because of which numbers are assigned; it is because numbers are assigned at all. As stated above, the actual numbers and percentages don’t matter – as soon as any are assigned, they crystallize the priorities of this “meta-judge” (the one who creates the rubric) and impose them onto every performer and piece they hear (and onto every judge who uses the rubric, if it is used by others too). If a rubric allows judges to change the numbers, then (to me anyway) this makes the judging process more complicated than it needs to be.

      1. Good points. I like the idea of having a list of elements as prompts for the judges, but not assigning a numerical value/percentage. Thanks.
