In recent posts, we’ve been devoting a lot of time to our sentiment scoring capabilities, and with good reason: they are our bread and butter. However, sentiment is not the only thing that we’ve been working on over the last few months. One of the most exciting new features coming out in our 4.1 release of Salience is what I’m calling “The Entity Toolkit”. I say “calling” because we haven’t actually finished the debate on what we’re going to name this tool, so feel free to chime in with ideas.
So what is it, and why am I writing about it? Well, that’s really a two-part answer. First, entity recognition (People, Companies, Places and that sort of thing) is one of the “baseline” features of any text analytics engine, so if you’re gonna call yourself a text analytics vendor you’d better be pretty adept at it. Second, we’ve taken a much different approach with our new toolkit, one which I think is worthy of some explanation. At its base, the detection of entities in text is the process of finding proper nouns of a particular type (company, person, etc.). We’ve had entity recognition in the product for years now, and its quality (precision and recall) was mediocre at best. In the 4.0 release that came out in Q4 2008, we began addressing this shortcoming with a new “Maximum Entropy” based training engine for our base entity types (People, Companies, Places, Products), and the effect was significant. Our accuracy improved a ton, especially for People and Places. The only issue was that our users were still at our mercy in terms of the types of recognizers we’d provide. I’m happy to say that this limitation will now be a thing of the past.
With the Entity Toolkit, users can build and train their own entity types (think Diseases, Legal Terms, etc.). I can hear the collective “So what? We’ve been able to build our own user-defined lists in lots of tools for years now.” The difference here is that we’re allowing users to dump in domain-specific content, define their own types of entities for this content, and train an entity recognizer to extract this type of entity based on the context of the content, not through a simple string match. Basically, we’re exposing the maximum entropy model to users through an easy-to-use toolkit where you simply mark up the entities in some sample content, and then train a model based on this markup. We believe that this will take entity recognition out of the hands of the geeks and put it into the hands of the users, where its business value will really begin to shine.
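
To make the "mark up, then train" idea a little more concrete, here is a rough Python sketch of a tiny maximum entropy tagger. To be clear, this is not Salience code or its API: the bracket markup format, the function names (`parse`, `features`, `train`, `extract`) and the toy training sentences are all made up for illustration. The features deliberately look only at the neighbouring words and ignore the word itself, which is a simplification meant to show context doing the work instead of a string match. A real recognizer would use far richer features and far more data.

```python
import math
import re
from collections import defaultdict

# Hypothetical inline markup: "[Disease malaria]" marks an entity of type Disease.
MARKUP = re.compile(r"\[(\w+) ([^\]]+)\]|(\S+)")

def parse(marked_up):
    """Turn a marked-up sentence into (token, label) pairs; "O" means no entity."""
    pairs = []
    for m in MARKUP.finditer(marked_up):
        if m.group(1):  # bracketed entity, possibly several words long
            pairs += [(tok, m.group(1)) for tok in m.group(2).split()]
        else:           # plain token outside any markup
            pairs.append((m.group(3), "O"))
    return pairs

def features(tokens, i):
    """Context-only features: the previous and next word, never the word itself."""
    prev = tokens[i - 1].lower() if i > 0 else "<s>"
    nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
    return ["prev=" + prev, "next=" + nxt]

def probs(weights, labels, feats):
    """Maximum entropy (softmax) distribution over labels for one token."""
    scores = {y: sum(weights[(f, y)] for f in feats) for y in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

def train(sentences, epochs=200, lr=0.2):
    """Fit the weights by plain stochastic gradient ascent on the log-likelihood."""
    data = []
    for sentence in sentences:
        pairs = parse(sentence)
        tokens = [tok for tok, _ in pairs]
        data += [(features(tokens, i), label) for i, (_, label) in enumerate(pairs)]
    labels = sorted({label for _, label in data})
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, gold in data:
            p = probs(weights, labels, feats)
            for y in labels:
                grad = (1.0 if y == gold else 0.0) - p[y]  # observed minus expected
                for f in feats:
                    weights[(f, y)] += lr * grad
    return weights, labels

def extract(model, text):
    """Return (token, type) for every token the model tags as an entity."""
    weights, labels = model
    tokens = text.split()
    found = []
    for i, tok in enumerate(tokens):
        p = probs(weights, labels, features(tokens, i))
        best = max(labels, key=p.get)
        if best != "O" and p[best] > 0.5:  # undecided tokens default to "O"
            found.append((tok, best))
    return found

# Toy marked-up sample content (invented for this sketch).
TRAINING = [
    "She was diagnosed with [Disease diabetes] last year",
    "He was diagnosed with [Disease influenza] in March",
    "Doctors treat [Disease malaria] with medication",
    "The patient suffers from [Disease arthritis] and takes medication",
    "He was pleased with the results",
]

if __name__ == "__main__":
    model = train(TRAINING)
    print(extract(model, "My uncle was diagnosed with asthma last year"))
```

The word "asthma" never appears in the toy training data, yet the sketch tags it as a Disease because it sits in the same context ("diagnosed with ... last year") as a marked-up example. A user-defined list would have missed it, and that is the gap that context-based training is meant to close.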