LexaBlog

Our Sentiment about Text Analytics and Social Media

The New Face of Entity Recognition
Submitted by Christine Sierra on Tue, 2009-04-21 04:00

(Jeff recently did a post about the release of Salience 4.1 and the entity management toolkit. At the time, it was still in its early stages and hadn’t been released. Now that we have the feature available, I thought I’d do a quick outline of the benefits.)

What do you do if you need to search for information within your corpus of data, but your data is unique and not driven by the generic information used for every domain? If you are integrating enterprise search into your organization, here is some information about how to enhance the benefits of your search engine.

Enterprise search has become one of the most critical applications found within larger companies over recent years. This trend will continue to accelerate, but we’ve discovered that applications have to be able to provide corporate users with results that are tailored to their specific needs, not just results based on the general recognizers available out of the box. Utilizing entity recognizers that “understand” the items that matter to a given business will allow search and discovery applications to expose valuable content to their users, and to do so using the correct vocabulary.

Beyond the obvious technical benefits of customer-driven recognizers, the financial benefits to organizations are compelling as well. Data preparation and content mapping typically represent the largest part of an enterprise search implementation, and the money for all this work typically goes to the search vendor. Allowing users to build their own entity recognizers will reduce the amount of money spent preparing content for applications, and will give them more control over how their applications are presented to their own users or customers.

The best way to understand user-defined entities is to walk through the build-out of a user-defined recognizer. Let’s start by looking at the entity processing of a document with standard recognizers in place.

For this example, we’ll work with the premise of medical research documents, and build out a “Disease” recognizer. If we were to process a medical document with only the standard recognizers, just people, places and similar items would be discovered. We wouldn’t find the references to possible diseases in the document, such as:

    • Lung Cancer (Adenocarcinoma)
    • Leukemia
    • Lymphoma

If you were in the pharmaceutical or medical industry, having to create and train all the possible references to diseases would be the heaviest lift in implementing an enterprise search solution. The new Entity Management Toolkit from Lexalytics can help users build out a simple “Disease” recognizer from a fairly small set of medical research documents (typically 100 or more).

A human would begin the process by identifying instances of “breast cancer” and possibly generic “cancer” terms within a document. Once the user had marked up most of the instances of “breast cancer” and “cancer” as diseases, they would then process the document through the entity management toolkit. The system would then highlight the instances it found, indicating possible disease references that the user has not yet accepted.

In some cases, the user may decline the machine’s suggestion, for example “Breast Cancer Awareness Tea Party”, because this is an event, not a disease. Noting that this reference is not a disease is important because it’s an additional piece of evidence that the Maximum Entropy model will use when it builds the disease recognizer.

Once a document has been marked up by the user, its state is changed to “Ready”, and additional documents are then marked up until enough of them are available to build a model. The user would then apply the Disease Recognizer model and begin to process the rest of the documents for entity recognition beyond people, places or companies.

One important point about the marked-up documents is that none of them mentioned lung cancer by name; a user would rely on the model to discern that lung cancer is a disease because of the way it’s described within the document. As stated earlier, users have had to rely on vendors to provide domain-specific recognizers up until now, but the vendors are not the content and domain experts.
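
To make the idea a little more concrete, here is a minimal sketch of a two-class maximum entropy classifier (which is equivalent to logistic regression) trained on marked-up phrases. This is not the toolkit’s implementation or API: the function names, the three-word context window, the training sentences and the training settings are all assumptions made for illustration. The sketch deliberately looks only at the words around a candidate phrase, never at the phrase itself, which is how it can score “lung cancer” as a likely disease without ever having seen it marked up.

```python
import math
from collections import defaultdict

WINDOW = 3  # assumed context window size, in tokens


def span_features(sentence, phrase):
    """Return features from the words around `phrase`, ignoring the phrase itself."""
    tokens = sentence.lower().split()
    words = phrase.lower().split()
    for start in range(len(tokens) - len(words) + 1):
        end = start + len(words)
        if tokens[start:end] == words:
            feats = {"bias": 1.0}
            for tok in tokens[max(0, start - WINDOW):start]:
                feats["before=" + tok] = 1.0
            for tok in tokens[end:end + WINDOW]:
                feats["after=" + tok] = 1.0
            return feats
    raise ValueError(f"{phrase!r} not found in {sentence!r}")


def train(examples, epochs=200, lr=0.5, l2=0.01):
    """Fit weights by stochastic gradient ascent on the regularized log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            score = sum(weights[f] * v for f, v in feats.items())
            p = 1.0 / (1.0 + math.exp(-score))
            for f, v in feats.items():
                weights[f] += lr * ((label - p) * v - l2 * weights[f])
    return weights


def predict(weights, feats):
    """Probability that the candidate phrase is a disease."""
    score = sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical mark-up: 1.0 = accepted as a disease, 0.0 = declined by the user.
marked_up = [
    ("patients diagnosed with breast cancer received chemotherapy", "breast cancer", 1.0),
    ("she was diagnosed with cancer last year", "cancer", 1.0),
    ("treatment of breast cancer in patients", "breast cancer", 1.0),
    ("join us at the Breast Cancer Awareness Tea Party", "breast cancer", 0.0),
    ("tickets for the Breast Cancer Awareness Tea Party", "breast cancer", 0.0),
]
examples = [(span_features(s, p), y) for s, p, y in marked_up]
weights = train(examples)

# "lung cancer" never appears in the training data; only its context is scored.
p_disease = predict(weights, span_features(
    "patients diagnosed with lung cancer received radiation", "lung cancer"))
p_event = predict(weights, span_features(
    "join the lung cancer awareness tea party", "lung cancer"))
print(f"disease-like context: {p_disease:.2f}, event-like context: {p_event:.2f}")
```

The declined “Tea Party” examples are what give this toy model its negative evidence: words such as “awareness” and “tea” pick up negative weights, while “diagnosed with” picks up positive ones. A production recognizer would use far richer features and far more documents than this sketch.
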
The value of this tool is in empowering content owners to expose the value found within their actual content, particularly in specialized market and industry segments. We believe that publishers will be one of the first groups to see the value of this tool in differentiating their content from the freely available content found on the web.