Measurement, Reliability, and Validity:
Part Four in the continuing series, Getting Quality Right
By
Cliff Hurst
July/August 2008
In conducting a statistical
analysis, it is important to understand the level of measurement being used, how
reliable it is, and whether it is valid. This article addresses each of these
issues.
Measurement can occur at
four levels: nominal, ordinal, interval, and ratio. For our purposes, we’ll
treat interval and ratio data the same.
Nominal Data assigns
numbers to names since software can only crunch numbers, not names. Suppose a
call center operates two shifts a day. For analytic purposes, you may want to
differentiate between shifts as you monitor the scores. To enter data into a
statistical software package, you will assign a number to “day” and a number to
“evening.” This is called coding the data.
You may define the day shift as
“1” and the evening shift as “2.” This doesn’t mean that “2” is higher than “1”
in either a judgmental or an arithmetic sense. In nominal data, numbers simply
become placeholders for names.
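As a minimal sketch of what coding nominal data can look like in practice, here is a short Python example. The names SHIFT_CODES and encode_shift and the monitoring records are hypothetical, made up purely for illustration; they do not come from any particular statistical package.

```python
from collections import Counter

# Hypothetical coding scheme: the numbers are placeholders, not quantities.
SHIFT_CODES = {"day": 1, "evening": 2}

def encode_shift(label: str) -> int:
    """Return the numeric code for a shift name (case-insensitive)."""
    return SHIFT_CODES[label.strip().lower()]

# Illustrative monitoring records: (shift name, monitoring score).
records = [("Day", 88), ("Evening", 92), ("day", 75), ("evening", 81)]

coded = [(encode_shift(shift), score) for shift, score in records]

# Counting calls per code is meaningful for nominal data;
# averaging the codes themselves would not be.
calls_per_shift = Counter(code for code, _ in coded)
print(coded)
print(dict(calls_per_shift))
```

Swapping the codes (day as “2,” evening as “1”) would change nothing about the analysis, which is exactly what makes the data nominal.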
Ordinal Data signifies
good, bad, and variations thereof. Ordinal means that there is an order to the
numeric rating. It is good practice to code better ratings with higher numbers
and lower ratings with lower numbers.
Let’s say you want to monitor
whether agents verify a caller’s identity before disclosing confidential
information. If the caller provides his or her last name and account number and
these match your records, then proper verification has been made. Call
monitoring will yield either a “yes” or a “no” determination. To follow common
practice, assign the number “1” to “yes” and “0” to “no.”
In another situation, you might
want to evaluate professional courtesy using a Likert scale of 1 to 5, where 1
means not acceptable, 2 means below average, 3 means average, 4 means above
average, and 5 means excellent. You will be making an extrinsic judgment
because you are looking for “shades of gray.” Scores for this will also be
measured at the ordinal level of measurement because “excellent” is better than
“average,” and so forth. There is an order to the rankings.
This is where confusion between
ordinal and interval levels of measurement can creep in. It appears that the
intervals are built upon a five-point scale. However, this is not an interval
level of measurement – it is ordinal because there are no standard increments of
measurement within the scale that was used. The difference between “average”
and “above average” can only be qualitatively determined.
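To make the ordinal idea concrete, here is a small Python sketch that codes the five-point courtesy scale described above. The class name Courtesy and the ratings are assumptions invented for illustration. The example uses the median, which depends only on rank order, as one reasonable summary for ordinal scores.

```python
from enum import IntEnum
from statistics import median

class Courtesy(IntEnum):
    """The 1-to-5 courtesy scale; higher codes mean better ratings."""
    NOT_ACCEPTABLE = 1
    BELOW_AVERAGE = 2
    AVERAGE = 3
    ABOVE_AVERAGE = 4
    EXCELLENT = 5

# Illustrative ratings for five monitored calls (made-up data).
# An odd number of calls keeps the median an actual rating on the scale.
ratings = [Courtesy.AVERAGE, Courtesy.EXCELLENT, Courtesy.BELOW_AVERAGE,
           Courtesy.ABOVE_AVERAGE, Courtesy.AVERAGE]

# Order comparisons are meaningful on an ordinal scale.
is_better = Courtesy.EXCELLENT > Courtesy.AVERAGE

# The median depends only on rank order, so it suits ordinal data.
middle = Courtesy(median(ratings))
print(is_better, middle.name)
```

Note what the code does not do: it never subtracts one rating from another, because the size of the gap between two ratings has no standard meaning on this scale.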
Interval/Ratio Data: If
you want to get more granular in your analysis, you can develop an interval or
ratio scale as a subset of the category of professional courtesy. The following
is a far-fetched example, not a recommendation; I am only illustrating the
statistical principle.
Suppose you decide that the more
the agent says “thank you,” the more professional courtesy is displayed.
Evaluating the number of times the agent says “thank you” gives an
interval/ratio level of measurement. You can code this part of the form with a
“0” if the agent does not say thank you at all, “1” if the agent says it once,
“2” if the agent says it twice, “3” if the agent says it three times, and “4” if
the agent says it four times, ad nauseam.
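Continuing the same far-fetched illustration, a count like this could be produced from a call transcript. The following Python sketch assumes a hypothetical transcript stored as a list of the agent's lines; the function name count_thank_yous is my own invention.

```python
import re

def count_thank_yous(agent_lines: list[str]) -> int:
    """Count occurrences of "thank you" across an agent's lines, ignoring case."""
    pattern = re.compile(r"\bthank you\b", re.IGNORECASE)
    return sum(len(pattern.findall(line)) for line in agent_lines)

# Made-up transcript of what the agent said during one call.
agent_lines = [
    "Thank you for calling. May I have your account number?",
    "Thank you. I see the charge you mentioned.",
    "Is there anything else I can help with?",
]

count = count_thank_yous(agent_lines)
print(count)
```

Because the result is a true count, zero means “none at all” and four really is twice as many as two, which is what separates ratio data from an ordinal rating.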
The type of statistical analysis
that your data requires is determined by whether it is nominal, ordinal, or
interval/ratio.
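One common rule of thumb, offered here as an illustration rather than a complete guide, is to summarize nominal data with the mode, ordinal data with the median, and interval/ratio data with the mean. The Python sketch below applies that rule; the function name summarize and all of the data are assumptions made up for this example.

```python
from statistics import mean, median, mode

def summarize(values, level):
    """Pick a measure of central tendency suited to the level of measurement."""
    if level == "nominal":
        return mode(values)      # most frequent category code
    if level == "ordinal":
        return median(values)    # middle value by rank
    if level == "interval/ratio":
        return mean(values)      # arithmetic average
    raise ValueError(f"unknown level of measurement: {level!r}")

shift_codes = [1, 2, 1, 1, 2]    # nominal: 1 = day, 2 = evening
courtesy = [3, 5, 2, 4, 3]       # ordinal: 1-to-5 scale
thank_yous = [0, 2, 1, 4, 3]     # interval/ratio: counts per call

print(summarize(shift_codes, "nominal"))
print(summarize(courtesy, "ordinal"))
print(summarize(thank_yous, "interval/ratio"))
```

The same logic extends to more advanced tests: the level of measurement narrows down which techniques are appropriate before any calculation begins.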
Calibration involves using
a set of proven statistical and analytical tools to measure how reliable and how
valid your quality monitoring process is. Although reliability and validity are
often lumped into one category, they are actually two distinct components:
1. Reliability addresses
consistency. Does your quality monitoring form allow your QA team to
measure things consistently? Would different evaluators likely assign the same
score to the same call? Does the team score similar calls similarly over time,
or do they tend to drift apart in their scoring practices? These are the
questions you must answer when establishing the reliability of your scoring
forms. There are four different kinds of reliability:
Inter-rater reliability
measures how similarly different evaluators rate the same call when they score
it.
Test-retest reliability
tracks whether the same evaluators would rate a call consistently if they were
to score that same call again.
Parallel forms reliability
ascertains whether one version of a form is at least as reliable as, or more
reliable than, another.
Internal consistency
reliability makes sure you are not “double-dipping” among what you think are
distinct categories on your form.
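As one illustration of how inter-rater reliability might be quantified, the Python sketch below computes Cohen's kappa, a widely used statistic that compares how often two raters agree against how often they would agree by chance. Kappa is my choice for this example, not the only option, and the two evaluators' scores are made-up data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same calls on a categorical item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of calls")
    n = len(rater_a)
    # Observed agreement: share of calls where both raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: computed from each rater's own score frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)

# Made-up yes/no verification scores (1 = yes, 0 = no) for ten calls.
evaluator_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
evaluator_2 = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]

kappa = cohens_kappa(evaluator_1, evaluator_2)
print(round(kappa, 3))
```

Here the evaluators agree on eight of ten calls, yet kappa comes out at only about 0.52, because two raters who both say “yes” most of the time would agree fairly often by chance alone. A kappa of 1 means perfect agreement; a kappa near 0 means agreement no better than chance.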
2. Validity assesses
whether your measurements are appropriate, meaningful, and useful. Validity is
more difficult to quantify than reliability. There are three types of validity:
content, criterion, and construct.
Content validity
determines whether the things you measure are really an accurate reflection of
what you intend to measure. We tend to do this pretty well in terms of the
greeting, the closing, and accuracy of data entry. It is harder to measure
“soft” areas such as courtesy, professionalism, and tone of voice.
Criterion validity
determines how well the criteria we use in our monitoring forms correlate with
other measures of customer satisfaction, such as post-call IVR surveys, written
or phone surveys, measures of first call resolution, escalations, accuracy of
data entry, customer retention, and even financial measures like goodwill,
credits, average collection period, and returns. It’s important not to create
and use monitoring forms in a vacuum, removed from these other performance
measures.
Construct validity can be
difficult to get right. One example that misses its mark is something that I
see quite often. Many monitoring forms ask, “Did the CSR use the customer’s
name three times during this call?” It seems like a good measure of
customer-focus, but it really isn’t. Callers do not generally count the number
of times their name is used during the call.
As an industry, we ought to get
better at construct validity. For example, I propose that the best indicator of
courteousness and professionalism is whether CSRs acknowledge the reason for the
call or the emotional state of that caller before asking for verifying
information. This truly makes the caller feel heard and valued. Yet that
acknowledgement is seldom included on monitoring forms.
A thorough comparison of customer
survey results, correlated with assorted monitoring criteria, can help us
determine authoritatively which elements really contribute to customer-focus,
professionalism, and courtesy, rather than relying on conjecture. This sort of
thorough analysis, within the overall context of quality assurance, will lead to
the next improvement in call center management: “getting quality right.”
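For readers who want to try the kind of comparison described above, here is a minimal Python sketch that correlates monitoring scores with survey scores for the same calls. I use Spearman's rank correlation because monitoring scores are typically ordinal; that choice, the function names, and all of the data are assumptions for illustration only.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither sequence is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up data: monitoring-form courtesy scores and post-call survey
# scores for the same eight calls.
monitoring = [3, 5, 2, 4, 3, 1, 4, 5]
survey = [6, 9, 4, 8, 7, 3, 7, 10]

rho = spearman(monitoring, survey)
print(round(rho, 3))
```

A value near 1 would suggest that the monitoring criterion rises and falls with customer satisfaction; a value near 0 would suggest the form is measuring something customers do not notice. With real data, a sample this small would be far too thin to draw conclusions from.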
Cliff Hurst is president of Career Impact, Inc., which he started in 1988.
Contact Cliff at 207-499-0141, 800-813-8105, or
cliff@careerimpact.net. Sign up for his free email newsletter or order his
book, Your Pivotal Role: Frontline Leadership in the Call Center
at
www.careerimpact.net.