At the end of July, Microsoft Research held its 2008 Faculty Summit to survey the state of computing R&D, which this year included a social media summit. A major topic of conversation was the transition of the internet from a network of documents to a network of people.
As participant (host) and Microsoft scientist Matthew Hurst explains on his blog, “The PageRank era is marked by a very simple link with no explicit meaning and a simple assumption (a positive endorsement).” But this assumption of positive endorsement is becoming unnecessary as more and more direct evidence of people’s opinions and categorizations of content becomes available online. Research repeatedly shows that others take notice of human-generated tags and reviews; to cite just one example, “consumers report being willing to pay from 20% to 99% more for a 5-star-rated item than a 4-star-rated item (with variance depending on type of item/service)”.
Many are excited by how much less processing-intensive content tagging becomes with this trend: clusters of pages and facts seem to grow organically as a result of human tagging. This helps overcome long-standing problems in content indexing within information retrieval, such as the gap between the language that businesses or organizations use to label their content and the terminology their customers and users actually prefer.
But this transition also brings challenges that are less discussed. As one scientist aptly describes the phenomenon, “fragmenting media and changing consumer behavior have crippled traditional [media] monitoring methods. Technorati estimates that 75,000 new blogs are created daily, along with 1.2 million new posts each day, many discussing consumer opinions on products and services. Tactics [of the traditional sort] such as clipping services, field agents, and ad hoc research simply can’t keep pace.” Call it what you will: Brand Monitoring, Online Image Tracking, Buzz Monitoring, Online Anthropology, Conversation Mining, Online Consumer Intelligence, Market Influence Analytics … the challenges remain the same. As an example, I think of a project I did here at Pure Visibility last year, which involved analyzing online review content related to a client’s company. After gathering the reviews (hundreds of them), I was faced with the daunting task of mining them for basic information, such as the overall majority sentiment expressed and how it correlated with the source. My ultimate method was mostly manual and more than a little tedious.
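A first pass over a pile of reviews like that can be automated with even a crude lexicon-based tally. The sketch below is purely illustrative, with made-up word lists and sample data (not the method used in that project): it counts positive, negative, and neutral reviews per source site.

```python
from collections import Counter, defaultdict

# Tiny illustrative word lists; a real analysis would need a far larger
# lexicon (and much more care with negation, sarcasm, etc.).
POSITIVE = {"great", "excellent", "love", "helpful", "fast"}
NEGATIVE = {"terrible", "slow", "broken", "rude", "disappointing"}

def score_review(text):
    """Naive lexicon score: +1 per positive word, -1 per negative word."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def sentiment_by_source(reviews):
    """Tally positive/negative/neutral review counts per source site."""
    tally = defaultdict(Counter)
    for source, text in reviews:
        score = score_review(text)
        label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
        tally[source][label] += 1
    return tally

# Made-up sample reviews, tagged with the site they came from.
reviews = [
    ("yelp", "Great service and fast helpful staff"),
    ("yelp", "Slow checkout and a rude manager"),
    ("google", "Excellent selection, love this place"),
]
print({src: dict(counts) for src, counts in sentiment_by_source(reviews).items()})
```

Even a toy tally like this answers the “majority sentiment per source” question that took me hours by hand, though its accuracy on real review prose would be far from trustworthy.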
Hurst’s blog references a new book by Pang and Lee that surveys the state of opinion mining and sentiment analysis (essentially, data mining and classification using human-generated content). In addition to interesting facts about the power of opinions, like the one cited above, the book clearly outlines the process such analysis requires and the associated challenges. For example, incorporating user opinions into a search engine typically requires the following steps:
- determining whether the user is looking for subjective information
- accurately classifying documents into opinionated and non-opinionated bins
- identifying overall sentiments expressed and/or specific opinions regarding particular aspects
- summarizing the information, including aggregating votes across different rating scales, highlighting representative opinions, representing points of disagreement and consensus, identifying opinion holders, etc.
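The four steps above can be sketched as a toy pipeline. Every function here is a hypothetical keyword-based stub standing in for what would, in practice, be a trained model, and all word lists are invented for illustration:

```python
# Invented word lists; real systems would use learned models instead.
SUBJECTIVE_CUES = {"review", "reviews", "opinion", "best", "worst"}
POSITIVE = {"love", "amazing", "great"}
NEGATIVE = {"hate", "awful", "disappointing"}

def query_is_subjective(query):
    """Step 1: is the user looking for subjective information?"""
    return any(w in SUBJECTIVE_CUES for w in query.lower().split())

def doc_is_opinionated(doc):
    """Step 2: route documents into opinionated vs. non-opinionated bins."""
    words = set(doc.lower().split())
    return bool(words & (POSITIVE | NEGATIVE))

def overall_sentiment(doc):
    """Step 3: overall polarity (a real system would also extract
    opinions about specific aspects of the product)."""
    words = doc.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def summarize(docs):
    """Step 4: aggregate per-document 'votes' into a simple tally."""
    labels = [overall_sentiment(d) for d in docs if doc_is_opinionated(d)]
    return {label: labels.count(label) for label in set(labels)}

docs = [
    "I love this camera, amazing lens",
    "awful battery life, very disappointing",
    "the camera ships with a 50mm lens",  # not opinionated; filtered out
]
if query_is_subjective("best camera reviews"):
    print(summarize(docs))
```

Each stub here is a few lines; the book’s point is precisely that making each stage accurate on real text is where the difficulty lies.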
The challenges are numerous. To summarize some of the excellent points made by Pang and Lee, I sketched out the following table, which compares opinion mining to traditional text mining:
| Opinion Mining | Fact-based Text Analysis |
|---|---|
| relatively few classes generalizing over many domains/users | often numerous classes (e.g., topic classification) |
| classes represent opposing (binary) or ordinal/numerical categories | classes can be unrelated |
| word order can overcome frequency (in importance) | frequency typically correlates with classification |
| sentiment typically expressed subtly, not isolated to a single sentence | though dependent on document length, single-sentence extractive summarization is often reasonable |
| defining human-preferred keywords is a non-trivial task | accurate classification possible via purely data-driven methods |
To clarify this last point: the authors note that this fact alone does not make the task harder than traditional topic classification, since data-driven approaches can also be applied to the latter to improve accuracy over classification using a human-picked keyword list. The problem is that the accuracy of data-driven methods for opinion analysis is only around 80%, which is still not comparable to the performance expected in traditional topic-based classification.
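The contrast can be illustrated with a toy data-driven classifier that learns word-polarity associations from labeled examples rather than relying on a hand-picked keyword list. This is a minimal sketch with invented training data, not the authors’ experiment:

```python
from collections import Counter

def train_word_polarity(labeled_docs):
    """Data-driven step: count how often each word appears in
    positive vs. negative training documents."""
    counts = {"pos": Counter(), "neg": Counter()}
    for text, label in labeled_docs:
        counts[label].update(text.lower().split())
    return counts

def classify(text, counts):
    """Score a document by summing learned word-polarity counts;
    ties default to positive."""
    score = sum(counts["pos"][w] - counts["neg"][w]
                for w in text.lower().split())
    return "pos" if score >= 0 else "neg"

# Invented training data for illustration.
train = [
    ("the plot was predictable and dull", "neg"),
    ("a predictable premise but a wonderful cast", "pos"),
    ("wonderful direction and a wonderful score", "pos"),
]
counts = train_word_polarity(train)
print(classify("dull and predictable", counts))  # prints "neg"
```

Note that no human picked “dull” or “wonderful” as keywords; the training data did. The 80%-accuracy caveat above is about how far even much more sophisticated versions of this idea get on real sentiment data.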
While these challenges may seem intimidating enough to remain on the horizon for years to come, the fact that this book was co-written by a Yahoo! research scientist and a professor at one of the country’s top CS schools suggests that the right people are thinking about these trends. Significant changes in how we use the web may not be far off.
The post The Subjective Web: Online Opinion Mining appeared first on Pure Visibility.