Semantria offers an intelligent categorizer that maps users' content against all of Wikipedia's taxonomies. There are approximately 400 first-level auto-categories and more than 4000 second-level auto-categories. For each category, the Semantria API returns a "relevancy score" between 0 and 1 that represents how confident Semantria is that the document falls into that category.
Users can create their own categories to extract the exact information they're looking for. Semantria ships with a default list of 40 categories, but users may create as many new categories as they would like. To create a new category, give Semantria the name of your category as well as 4-6 very obvious examples of your category. For example, if you wanted to create a "Vegan" category, some good sample words might be "vegan, diet, vegetarian, animal, tofu, veganism." Semantria will use Wikipedia's ontology to categorize sentences and comments into different categories. Read more about this on our technology page.
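As a rough sketch of the recipe above, a user category can be thought of as a name plus a short list of sample words. The `make_category` helper, the `has_recommended_sample_count` check, and the `name`/`samples` field names are hypothetical, chosen only for illustration; they are not the documented request schema.

```python
RECOMMENDED_MIN = 4  # the text suggests 4-6 very obvious examples
RECOMMENDED_MAX = 6


def make_category(name, samples):
    """Build an illustrative category record (hypothetical structure)."""
    # Strip whitespace and drop empty entries from the sample list.
    cleaned = [s.strip() for s in samples if s.strip()]
    return {"name": name, "samples": cleaned}


def has_recommended_sample_count(category):
    """Check whether the category has the suggested 4-6 samples."""
    return RECOMMENDED_MIN <= len(category["samples"]) <= RECOMMENDED_MAX


vegan = make_category(
    "Vegan",
    ["vegan", "diet", "vegetarian", "animal", "tofu", "veganism"],
)
print(vegan["name"], vegan["samples"], has_recommended_sample_count(vegan))
```

The check is advisory rather than a hard limit, since a definition may contain more terms when that helps.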
Two pre-generated user categories are Agriculture (farming, agriculture, farmer) and Food (food, meals, vegetables, meat, fruit). Here is how these categories score the following example sentences:
|Sentence for Analysis||Food||Agriculture|
|I like chicken.||0.58||No match|
|I like chickens.||No match||0.71|
|I like to eat chicken.||0.59||0.51|
Notice the different scores based on the different categories. Download the list of default user categories as a template for your own user categories.
The default user categories provided are not comprehensive. They should serve as examples for your own user categories.
Semantria returns a relevancy score with every entry and its associated category in an analysis. This score, from 0 to 1, represents how confident Semantria is that the entry fits the listed category. A low score implies low certainty and a high score implies confidence.
A category's weight adjusts the relevancy score reported for that category; it has no bearing on the categorization algorithm itself. The engine calculates the relevancy score from its own metrics, and the score expresses the engine's confidence that the category applies to the text. By default, Semantria applies a concept threshold: the engine drops any category whose relevancy score falls below the threshold, because it is not confident in that category.
The default threshold is 0.45, so any category with a relevancy score lower than 0.45 is not reported in the output. The weight lets the user adjust the relevancy score so that certain categories clear the threshold and appear in the output. If a category is not returned in the output, its relevancy score did not reach the minimum threshold.
Assume a category “Stock Exchange” has a relevancy score of 0.3. It scores below the threshold and will not be returned. The table below shows how a custom weight multiplies the engine score and changes the result.
|Engine score||Custom weight||Adjusted score||Result|
|0.3||1.7||0.3 * 1.7 = 0.51||Category will be returned|
|0.35||0.95||0.35 * 0.95 = 0.33||Category will NOT be returned|
|0.35||1.5||0.35 * 1.5 = 0.53||Category will be returned|
There are no fixed bounds on the weight; the right value depends on the specifics of the document and categories. However, best practice is to keep weights between 0.85 and 2.
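The weighting arithmetic can be sketched in a few lines. This is an illustration of the rule as described here, not the engine's code: the function names are hypothetical, and treating a weight of 1.0 as "no adjustment" is an assumption.

```python
DEFAULT_THRESHOLD = 0.45  # default concept threshold from the text


def adjusted_score(engine_score, weight=1.0):
    """Multiply the engine's relevancy score by the custom weight."""
    return engine_score * weight


def is_returned(engine_score, weight=1.0, threshold=DEFAULT_THRESHOLD):
    """A category is reported only if its adjusted score is not below the threshold."""
    return adjusted_score(engine_score, weight) >= threshold


# The unweighted case plus the three rows of the table above.
for score, weight in [(0.3, 1.0), (0.3, 1.7), (0.35, 0.95), (0.35, 1.5)]:
    verdict = "returned" if is_returned(score, weight) else "NOT returned"
    print(f"engine score {score}, weight {weight}: {verdict}")
```

Because the adjustment is a plain multiplication, a weight below 1 can also push a borderline category under the threshold.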
The basic category definition syntax is simply words and phrases expressing the idea you'd like to match. Use commas to separate distinct ideas: 'oil paint' looks for articles about the artistic medium, while 'oil, paint' looks for articles about petroleum and/or any type of paint.
The list can be as long as you'd like, although many short queries usually outperform very long, detailed ones. A perfect match would be related to all the terms given, but partial matches can occur where the article is only related to a subset of the terms.
When the concept matrix is given a phrase, it matches both the phrase form and the individual words. Thus 'power plant', while matching stories about electric generation most strongly, may also pull in articles about plant life. In most queries the individual words in a phrase are related and contribute positively. But in cases where the individual words mean something different on their own, an underscore instructs the engine to use only the phrase form. Thus 'power_plant' will not match articles about flowers at all.
NOT excludes certain ideas from consideration. This operator is primarily intended for narrowing down the meanings of words and phrases, or otherwise limiting the scope implied by a word or phrase. For example, using ‘bank’ as a sample for a ‘Financial’ category may match articles related to mortgages and finance, but also articles about riverbanks. To avoid this, the NOT operator instructs the engine to ignore the idea named after NOT. Therefore, ‘bank NOT river’ would match articles related to finance but exclude articles related to a riverbank.
The CONTEXT operator is the opposite of the NOT operator. While NOT excludes certain ideas implied by a definition, CONTEXT highlights certain ideas.
Assume you are interested in automobile manufacturing. The definition 'automobile, manufacturing' is likely to get relevant results, but may also pull in articles about manufacturing in general. The query 'automobile_manufacturing' is highly specific, but possibly overly so.
With the CONTEXT operator, the text to the left of CONTEXT supplies the general idea being searched for, and the text to the right supplies the ideas you want that topic to be discussed with. Therefore, ‘automobile CONTEXT manufacturing’ will search for automobiles in general, with a focus on manufacturing. It will not return results that are only about manufacturing.
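The definition syntax described so far (commas, underscores, NOT, and CONTEXT) is plain text, so definitions can be assembled with ordinary string handling. The helper names below are hypothetical conveniences, not part of any Semantria SDK; they only reproduce the example strings from this section.

```python
def definition(*ideas):
    """Separate independent ideas with commas."""
    return ", ".join(ideas)


def phrase_only(phrase):
    """Join words with underscores so only the phrase form is used."""
    return "_".join(phrase.split())


def excluding(idea, unwanted):
    """Narrow an idea by excluding an unwanted meaning with NOT."""
    return f"{idea} NOT {unwanted}"


def in_context(topic, context):
    """Focus a general topic on a particular context with CONTEXT."""
    return f"{topic} CONTEXT {context}"


print(definition("oil", "paint"))                  # oil, paint
print(phrase_only("power plant"))                  # power_plant
print(excluding("bank", "river"))                  # bank NOT river
print(in_context("automobile", "manufacturing"))   # automobile CONTEXT manufacturing
```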
In addition to a list of terms to aid in defining the concept you are looking to match, boolean logic can be included in a category definition to provide a level of specific filtering for concept matches. Boolean queries are enclosed within [ ] brackets in the category definition. For example, [(pizza AND seafood) NOT (shrimp OR “king crab”)] will match and categorize content that discusses pizza with seafood, but doesn’t contain shrimp or king crab.
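To make the logic of that example concrete, the sketch below evaluates the same Boolean structure by hand. It assumes a deliberately naive stand-in for matching (a case-insensitive substring test); how the engine actually matches query terms is not specified here, and the function names are hypothetical.

```python
def mentions(text, term):
    """Naive stand-in for matching: case-insensitive substring test."""
    return term.lower() in text.lower()


def matches_pizza_query(text):
    """Evaluate (pizza AND seafood) NOT (shrimp OR "king crab") against text."""
    wanted = mentions(text, "pizza") and mentions(text, "seafood")
    excluded = mentions(text, "shrimp") or mentions(text, "king crab")
    return wanted and not excluded


reviews = [
    "The seafood pizza was the best thing on the menu.",
    "We shared a seafood pizza topped with shrimp.",
    "The pizza was fine, nothing special.",
]
for review in reviews:
    print(matches_pizza_query(review), "-", review)
```

Only the first review satisfies the query: the second is excluded by the shrimp mention, and the third never mentions seafood.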
The following example from a Tripadvisor review is a good illustration of the differences and advantages of categories versus query-based categorizers when identifying broad concepts.
We were on the ship by 11:45. About 10 minutes after my VIP parents. Went to Lido deck for lunch. It was hard to find a table for ten or two table near each other so we went to the back of the ship near the pizza. The ship was full and you could tell. Very crowded. We got to our room at 1:30 met Edouardo our room steward. Loved him got everyone's name but had a hard time with mine and called me misses all week. The room was the same as on the Splendor which we did two years ago. 4 of us had plenty of room and loved the balcony. We had early seating in Washington Dining room 3rd floor in the middle so no view for us. I wanted to do anytime dining but with ten it wouldn't of worked. The boys and us went to club sign up. My 14 year old was going to be 15 in September and I wanted him moved to club O2 they said to write it on the form. He was able to switch no problem. Club started at 9:00pm. They had a great time all week and came home 12:30-1:00 every night.
With Categories, the user could have searched for "travel" or "tourism" categories and found this passage in about 10 seconds. A query search would have taken much longer and would have been less successful. For example, the query "motel OR hotel OR show OR resort OR pool OR travel OR vacation" would have taken about 5 minutes to define and would not have detected that this passage was about travel.