Categorization

Categorization is the process of sorting documents into buckets. For NLP purposes, the buckets are usually the subjects you want to analyze. Those categories might be very broad, like newspaper sections (sports, world, art, etc.), or very focused, like the different substrates used in silicon wafer manufacturing processes.

There are many ways to categorize documents, and different techniques suit different categorization needs. Lexalytics provides several ways to categorize your documents.

Queries

Queries are saved Boolean searches that are run against all the documents you submit. If the search hits the document, we return the name of the query, the hit count, and the sentiment of the query. Queries are very precise and transparent: you can see exactly what you searched for and why it hit, and only what you search for will trigger a hit. This makes query-based classification a good fit when the bucket is easy to define. For instance, if you are looking for all documents where someone talks about their iPhone, you can easily write a query to find that.
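To make the idea concrete, here is a minimal sketch of a saved Boolean search in plain Python. It is not the Lexalytics query syntax or API. The function count_hits, its parameters, and the sample documents are all hypothetical, and sentiment is left out entirely.

```python
import re

def count_hits(text, any_terms, none_terms=()):
    """Count occurrences of any_terms in text.

    Returns 0 if any excluded term (none_terms) appears, which mimics
    a Boolean NOT clause. Hypothetical helper, not a Lexalytics API.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    if any(word in none_terms for word in words):
        return 0
    return sum(1 for word in words if word in any_terms)

# Illustrative documents and a saved query named "iPhone".
docs = [
    "I love my iPhone, the iPhone camera is great.",
    "My Android tablet is fine.",
]
query = {"name": "iPhone", "any_terms": {"iphone"}}

results = [
    {"query": query["name"], "hits": count_hits(doc, query["any_terms"])}
    for doc in docs
]
# Only documents with at least one hit are reported.
matched = [result for result in results if result["hits"] > 0]
```

Because the match is a literal term lookup, it is easy to see why the first document hit (two occurrences of "iPhone") and why the second did not.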

The downside to using queries for categorization is that, for buckets that represent more of a concept, it can be difficult and time-consuming to construct a query that fits. You have to worry about synonyms, ambiguous terms, and so on. For instance, if you are looking for all documents about mobile technology, you need to search for the various types of mobile technologies, exclude documents that talk about mobility and technology separately, and so on.

Categories

Categories are saved searches constructed using our Concept Matrix language. They are run against all documents you submit. Each category is evaluated and given a percentage relevance score against the document. If the score is higher than your category threshold, we return the category name, the relevance score, and the sentiment of the category. Category queries use our Concept Matrix to extend the terms you submit. For instance, a category of "food" will score well on a document such as "I had some chicken wings the other day that were just awesome," even though the word "food" never appears in it. This makes categories a good choice to match broader buckets, such as sports, food, art, or technology.
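The threshold step described above can be sketched as follows. This does not reproduce the Concept Matrix scoring. The scores are made-up illustrative values, and categories_above_threshold is a hypothetical helper rather than part of any Lexalytics API.

```python
def categories_above_threshold(scores, threshold):
    """Keep categories whose relevance score is strictly above the threshold.

    scores maps category name to a relevance score between 0 and 1.
    Returns (name, score) pairs sorted from most to least relevant.
    """
    kept = [(name, score) for name, score in scores.items() if score > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Illustrative scores for one document; real scores come from the engine.
scores = {"Food": 0.71, "Sports": 0.08, "Technology": 0.45}

returned = categories_above_threshold(scores, threshold=0.4)
# returned == [("Food", 0.71), ("Technology", 0.45)]
```

Raising the threshold trades recall for precision: at 0.5, only "Food" would be returned for this document.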

The downside to using categories for categorization is that they are not transparent. You will hit documents that do not contain any of your terms, and it will be difficult to tell exactly why they hit. In addition, categories do not work well for very short content such as tweets, because categories need context to work with.

Auto categories

Auto categories are categories that we built for you that match the Wikipedia taxonomy. There are about 4,000 entries and the taxonomy is three levels deep. The taxonomy and the categories are not user-modifiable. If one of the categories hits your document, you will receive back the name of the category along with the sentiment, the ancestor nodes of the category, the score, and the URL to the Wikipedia page representing the category.
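As a rough picture of what comes back for an auto category hit, the fields listed above could be modeled like this. The class name, field names, and sample values are assumptions made for illustration. They are not the actual response keys, and "Science" is only a stand-in for a parent node.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AutoCategoryHit:
    """One auto category hit. Field names are hypothetical."""
    name: str
    sentiment: float
    score: float
    url: str
    # Ancestor nodes, root first. The taxonomy is three levels deep,
    # so a category has at most two ancestors.
    ancestors: List[str] = field(default_factory=list)

    def path(self) -> str:
        """Full taxonomy path, root first, ending at this category."""
        return " > ".join([*self.ancestors, self.name])

# Illustrative values only.
hit = AutoCategoryHit(
    name="Physics",
    sentiment=0.12,
    score=0.83,
    url="https://en.wikipedia.org/wiki/Physics",
    ancestors=["Science"],
)
```

Here hit.path() gives "Science > Physics", which shows how the ancestor nodes place a hit within the taxonomy.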

Auto categories mainly provide broad subject areas such as Agriculture, Physics, and Computers, although some of the leaves are fairly granular. For a complete list, see here.

Machine Learning

Machine learning is a categorization technique where, instead of writing queries, the user assembles a set of documents, each tagged with the appropriate bucket. Once you have a set of tagged documents, a machine learning process is run to create a statistical model from them. For instance, if you tagged lots of documents with the bucket "Mobile Phones" and most of them contained the word iPhone, the machine will learn that iPhone is closely related to the bucket Mobile Phones. When it sees a new document containing the word iPhone, it will give that document a high confidence score for the category Mobile Phones.
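The iPhone example can be reproduced with a very small statistical model. The sketch below uses a multinomial naive Bayes classifier with add-one smoothing, chosen only because it is short. It is not necessarily the kind of model Semantria trains, and the functions train and predict and the training documents are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def train(tagged_docs):
    """Count words per bucket from (text, bucket) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, bucket in tagged_docs:
        doc_counts[bucket] += 1
        word_counts[bucket].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab

def predict(model, text):
    """Return {bucket: confidence}, with confidences summing to 1."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for bucket in doc_counts:
        total_words = sum(word_counts[bucket].values())
        # Start from the prior: how common the bucket is in the training set.
        score = math.log(doc_counts[bucket] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                count = word_counts[bucket][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        log_scores[bucket] = score
    # Normalize log scores into confidences.
    top = max(log_scores.values())
    weights = {b: math.exp(s - top) for b, s in log_scores.items()}
    norm = sum(weights.values())
    return {b: w / norm for b, w in weights.items()}

# Illustrative tagged documents.
training = [
    ("My new iPhone has a great screen", "Mobile Phones"),
    ("The iPhone battery lasts all day", "Mobile Phones"),
    ("I dropped my iPhone yesterday", "Mobile Phones"),
    ("The chicken wings were awesome", "Food"),
    ("This pizza place has great crust", "Food"),
]
model = train(training)
confidence = predict(model, "Thinking about buying an iPhone")
```

Because "iPhone" appears only in documents tagged "Mobile Phones", that bucket receives the higher confidence for the new document, even though nobody wrote a query for it.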

Various machine learning models are supported in Semantria and can be managed using the following endpoints: