Does Your Text Analytics Tool Make Sense of “Slanguage?”

By Manya Mayes

If your organization is actively analyzing the sentiment of customer commentary, then you have some insight into the new 4G, mobile, social media world. Not only do your customers have an opinion, they are more than ready to broadcast it – and in whatever “slanguage” suits them. But how do you make sense of this? Are there sentiment analysis tools that can process, with accuracy, this type of challenging customer feedback in call center transcripts as well as social media?

First, let’s define “slanguage.” Slanguage is language characterized by excessive use of slang, particular to a group or subgroup of people, such as users of social media forums. Internet slanguage includes abbreviated and clipped speech, but it is unique in that it also contains letters, numbers, and non-alphanumeric characters.

Slanguage can include any topic, such as politics, economics, or current affairs, but is most commonly seen in expressions of personal feelings and opinions. It can also be punchy and playful and, in the case of social media, signal that the author is savvy (“hip to the Internet”). Mixing letters with numbers in words like “gr8” is an example. This shorthand, used for the sake of speed and brevity, draws on knowledge of orthography, the keyboard, and phonetics (how words sound): the number “8,” when spoken aloud, makes the same sound as the standard “-ate.” This is actually a clever formulation of slang – far more special than coining new terms.

Character-Related Challenges: The sheer volume of customer opinion freely available via the Internet has given forward-thinking organizations the ability to gain key insights from valuable customer feedback. The problem, though, with this voluminous feedback is that there is no editor-in-chief to review the accuracy of the language being published. Moreover, the frequent need to fit opinion into a 160-character SMS text message or a 140-character tweet means there is motivation to be brief and to the point.

In an effort to make a statement in 160 characters or fewer, it is the words, rather than the statement itself, that get shortened. For some users, predictive text on mobile phones and iPads allows typing in shorthand and prompts users with the full-word equivalent. But when these users move to an environment without predictive text, shorthand becomes the norm. Bcuz y’all no wot I mean neway! Tools are propelling the creation of this new language, too: TweetDeck, for example, offers the ability to “TweetShrink” written text so that customers can fit in as much information as they possibly can, which means there are now even more ways to misspell a word. That can be a real problem for call center analysts.

Is “Twitterspeak” Really Different? Let’s look at the following tweet:

Doctor couldn’t give me any meds because I waited too long to go. FML

When I first saw this, I had no idea what FML meant. I had to go to one of the “netlingo” sites to get a definition. Suffice it to say that, from a sentiment analysis point of view, it is negative. Software that does not know what FML means could classify this statement as “neutral,” given that it reads as a statement of fact. But the addition of FML to the end of the tweet gives the statement a negative slant, and it is exactly this information that should not be overlooked.
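To make the point concrete, here is a minimal sketch of lexicon-based sentiment scoring, run on the tweet above with and without a slang lexicon. The word lists, their scores, and the function names are hypothetical and far smaller than anything a real tool would use; this is not how any particular product works.

```python
import re

# Hypothetical, deliberately tiny lexicons (assumptions for illustration only).
STANDARD_LEXICON = {"lousy": -1, "hate": -1, "great": 1, "love": 1}
SLANG_LEXICON = {"fml": -1, "gr8": 1}


def score(text, lexicon):
    """Sum the lexicon scores of every token in the text; unknown tokens count as 0."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


def label(total):
    """Turn a numeric score into a coarse sentiment label."""
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


tweet = "Doctor couldn't give me any meds because I waited too long to go. FML"

# Without slang knowledge, no token is recognized and the tweet looks neutral.
without_slang = label(score(tweet, STANDARD_LEXICON))

# With the slang lexicon merged in, "FML" tips the tweet to negative.
with_slang = label(score(tweet, {**STANDARD_LEXICON, **SLANG_LEXICON}))

print(without_slang, with_slang)
```

The only difference between the two runs is whether the tool recognizes “FML,” which is exactly the information that should not be overlooked.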

Let’s compare Twitter and survey text. If you parse sample records from each with a mature knowledge extraction base and then evaluate the differences, you’ll find that the Twitter text has far fewer known words and far more unknown words – to the tune of seven times more distinct words than the survey text examined. That leaves a business analyst with seven times more information in which to separate the signal from the noise.

Dealing with Reality: I have analyzed text that is grammatically correct and ready for publishing and text that is neither of those things but is published anyway. News and media sites such as The New York Times regularly post articles with the occasional spelling mistake or grammatical error. Meanwhile, social media sites such as Twitter post tweets with occasional grammatical accuracy! The omnipresence of what I call the “Ten Transgressions of Text” (which include misspellings/typos, shorthand, slanguage, and profanity) makes for challenging analyses.

For the analyst who is responsible for gleaning value from this slanguage, there is always the ethical question of who should be reading any X-rated content and what HR and legal think of exposing their employees to such content. Did I think I would ever be analyzing information that looked like this? Actually, I did. Did it surprise me when a large market research organization asked me if we could automatically remove profanity? No, not really. Did it feel weird to aggregate lists of text speak, slanguage, and profanity to remove from analyses? Absolutely! This slanguage is enough to make the hairs on your neck stand up.

There is a real problem here in that profanity and crassness can hold vital information for a company. It is important to identify the meaning but also to abstract away from it, because the words themselves may not be that valuable; it is what they convey about the speaker that we care about. This approach allows the analyst to retrieve the necessary information while remaining “protected” from details, and – what’s most important – to aggregate similar data for slicing and dicing. For example, grouping the following three items together allows for aggregation of three different statements, all getting at the same thing:

“That product is really lousy” as standard
“That product is hella lousy” as slang
“That product is *BLEEP* lousy” as profanity
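The grouping above can be sketched in a few lines. This is only an illustration under stated assumptions: the intensifier table, the `<INTENSIFIER>` placeholder, and the function name are all hypothetical. The idea is to replace the surface word with one shared tag while keeping a note of the register it came from.

```python
# Hypothetical table mapping surface forms to the register they belong to.
INTENSIFIERS = {
    "really": "standard",
    "hella": "slang",
    "*bleep*": "profanity",
}


def normalize(statement):
    """Replace a known intensifier with a shared tag and report its register."""
    register = None
    words = []
    for word in statement.split():
        key = word.lower()
        if key in INTENSIFIERS:
            register = INTENSIFIERS[key]
            words.append("<INTENSIFIER>")
        else:
            words.append(word)
    return " ".join(words), register


statements = [
    "That product is really lousy",
    "That product is hella lousy",
    "That product is *BLEEP* lousy",
]

results = [normalize(s) for s in statements]
for text, register in results:
    print(text, "|", register)
```

All three statements collapse to the same normalized text, so they can be counted together, and the analyst never has to read the original word to know that it was profanity.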

Given the seemingly endless options for spelling a word, I would also say that there are at least sixteen ways to misspell one. Consider the number of ways you could spell “BlackBerry”: blackberry, black berry, bberry, bb, blackbery, blackberi, blckberry, blkberry, blacberrie… you get the idea!
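One common way to fold such variants back into a single brand name is fuzzy string matching. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the 0.7 cutoff, the alias set, and the function name are assumptions chosen for this example, not settings from any real product.

```python
from difflib import SequenceMatcher

CANONICAL = "blackberry"
# Very short forms share too few characters to match by similarity,
# so they are listed explicitly (hypothetical alias set).
ALIASES = {"bb"}
CUTOFF = 0.7  # assumed similarity threshold; would need tuning on real data


def is_blackberry(token):
    """Return True if the token looks like a spelling of the brand name."""
    cleaned = token.lower().replace(" ", "")
    if cleaned in ALIASES:
        return True
    return SequenceMatcher(None, CANONICAL, cleaned).ratio() >= CUTOFF


variants = [
    "blackberry", "black berry", "bberry", "bb", "blackbery",
    "blackberi", "blckberry", "blkberry", "blacberrie",
]

matched = [v for v in variants if is_blackberry(v)]
print(len(matched), "of", len(variants), "variants matched")
```

A cutoff is always a trade-off: set it too high and clipped spellings are missed, set it too low and unrelated words that happen to look similar get pulled in, so any threshold should be checked against real customer text.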

Further, consider the following text lingo examples from netlingo.com:

*$: Starbucks (I wonder if Starbucks knows this lingo for their brand?)

AWLTP: Avoiding Work Like The Plague

AYOR: At Your Own Risk (useful for analyzing product sentiment)

BlkBry: BlackBerry (I missed that one in my list above)

BTD: Bored To Death (useful for analyzing movie reviews and presentation feedback)
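A lookup table is the simplest way to put a list like this to work. The sketch below expands the five entries above before any further analysis; the function name and the sample message are made up for illustration, and a real glossary would be far larger and would need regular updating.

```python
# Glossary built from the five entries listed above (keys lowercased).
LINGO = {
    "*$": "Starbucks",
    "awltp": "avoiding work like the plague",
    "ayor": "at your own risk",
    "blkbry": "BlackBerry",
    "btd": "bored to death",
}


def expand(text):
    """Replace known lingo tokens with their expansions, keeping trailing punctuation."""
    words = []
    for token in text.split():
        core = token.rstrip(".,!?")
        trailing = token[len(core):]
        words.append(LINGO.get(core.lower(), core) + trailing)
    return " ".join(words)


# Made-up sample message, not taken from real customer data.
expanded = expand("Stuck in the *$ line, BTD!")
print(expanded)
```

Once the shorthand is expanded, the text can be handed to the same analysis used for standard language.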

Slanguage Analysis (Natural Slanguage Processing): When (not if) you are presented with the task of analyzing slanguage-ridden text, all is not lost. If you are to understand what your customers are saying about your products and services, text analytics provides the means to automatically find the signal amidst the noise. The presence of slanguage, emoticons, clipped text, abbreviations, and profanity can tell you something about both the content and the author. The aim of the analytics is to report results in your own business terms – not the terms of TweetShrink.

I am frequently asked, “How much information in social media data is useful?” The answer to that question must depend on the definition of “useful.” In a manual review of a sample of tweets, some 63 percent of extracted information was considered to be well formed (representative of the information in the tweets). The remaining 37 percent was noise. If the well-formed information does one of the following – helps protect customer safety and avoid a public recall for an automotive manufacturer; detects a segment of high value but unhappy customers; detects an issue that, when fixed, helps leapfrog a competitor; or identifies customers with “at-risk” behavior – then it is useful.

Context Is Key: The aim is not to remove all of the noise but to give meaning to it based on the appropriate context, where possible. In order to do this, companies will have to face language evolution, and text analytics vendors will need to put processes in place to record and translate the language changes in the context in which they are used. R&R to the military has a completely different translation than R&R for the automotive industry, and it is that context that helps us understand which translation is appropriate.
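A context-aware translation can be as simple as keying the glossary on both the term and the industry. In the sketch below, the two expansions of R&R are commonly cited readings that I am assuming for illustration; the glossary structure and function name are hypothetical.

```python
# Hypothetical glossary keyed on (term, context). The expansions are
# assumed, commonly cited readings, not taken from a specific product.
GLOSSARY = {
    ("r&r", "military"): "rest and recuperation",
    ("r&r", "automotive"): "remove and replace",
}


def translate(term, context):
    """Return the context-specific expansion, or the term unchanged if unknown."""
    return GLOSSARY.get((term.lower(), context), term)


military = translate("R&R", "military")
automotive = translate("R&R", "automotive")
unknown = translate("R&R", "retail")

print(military, "|", automotive, "|", unknown)
```

Leaving an unknown term untouched, rather than guessing, keeps the noise visible so that the glossary can be extended as the language evolves.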

The ability to automate the detection of slanguage and provide the correct context-based translation for the relevant call center industry is key. The technology will evolve as the language does.

Manya Mayes is the director of advanced analytics at Attensity, which delivers an integrated suite of customer analytics and engagement applications that help organizations use the voice of the customer to deliver a superior customer experience. Manya has spent over twenty years using statistical methods to analyze data, the last fifteen working specifically with data and text mining software.

[From Connection Magazine – June 2011]