Salience 5.1.1 Feature Highlight: Automatic Categorization


This is probably the coolest new thing that we have in our toolbox.  All of our customers like the idea of “magic” functionality.  We have a long tradition of providing functionality that is available and worthwhile straight out of the box (sentiment, themes, summarization), all the while allowing you to configure and tune the content to meet your application’s needs.  The one area where we didn’t have any magic was categorization/classification.  When we introduced Concept Topics, they made the work of configuring classifiers way easier, because you could deal with large topic areas without the labor of training a classifier or building a large taxonomy.  However, we only gave you a sample set of categories, and you were then supposed to go off and build your own.

And, lo, there was a demand for more!  It turns out that Wikipedia has some very nice category areas of its own, and so we’ve produced a new Salience call that will automatically categorize your documents into a set of about 4,000 categories, which are rolled up into 125 high-level categories.  Each category matches a single Wikipedia page of reasonable breadth, for good coverage of the Wikipedia knowledge base.

For example, take the following text:

"We're seeing a new revolution in artificial intelligence known as deep learning: algorithms modeled after the brain have made amazing strides and have been consistently winning both industrial and academic data competitions with minimal effort. 'Basically, it involves building neural networks — networks that mimic the behavior of the human brain. Much like the brain, these multi-layered computer networks can gather information and react to it. They can build up an understanding of what objects look or sound like. In an effort to recreate human vision, for example, you might build a basic layer of artificial neurons that can detect simple things like the edges of a particular shape. The next layer could then piece together these edges to identify the larger shape, and then the shapes could be strung together to understand an object. The key here is that the software does all this on its own — a big advantage over older AI models, which required engineers to massage the visual or auditory data so that it could be digested by the machine-learning algorithm.' Are we ready to blur the line between hardware and wetware?" 

This text is automatically classified into the following top-level and second-level categories (scores in parentheses):

  • Mind (1)
    • Computational neuroscience (.75)
    • Cognitive neuroscience (.58)
    • Cerebrum (.56)
    • Neural engineering (.55)
    • Neuroprosthetics (.51)
    • Neurotechnology (.51)
  • Computer Science (1)
    • Philosophy of artificial intelligence (.61)
    • Machine learning algorithms (.54)
  • Computers (.73)
  • IT (.71)
  • Robotics (.63)
  • Life (.57)
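
If you want to work with a result like this in your own code, here is a minimal Python sketch of one way to hold it and trim it by score.  To be clear, this is not the Salience API: the flat `(category, parent, score)` tuple layout and the `build_tree` helper are assumptions made up for illustration, and the data is just a subset of the names and scores from the list above.

```python
from collections import defaultdict

# Hypothetical flat representation: (category, parent, score).
# parent is None for a top-level category. This layout is an assumption
# for illustration only; it is not the format Salience returns.
RESULTS = [
    ("Mind", None, 1.0),
    ("Computational neuroscience", "Mind", 0.75),
    ("Cognitive neuroscience", "Mind", 0.58),
    ("Cerebrum", "Mind", 0.56),
    ("Neural engineering", "Mind", 0.55),
    ("Neuroprosthetics", "Mind", 0.51),
    ("Neurotechnology", "Mind", 0.51),
    ("Computer Science", None, 1.0),
    ("Philosophy of artificial intelligence", "Computer Science", 0.61),
    ("Machine learning algorithms", "Computer Science", 0.54),
    ("IT", None, 0.71),
    ("Robotics", None, 0.63),
    ("Life", None, 0.57),
]


def build_tree(results, min_score=0.0):
    """Group second-level categories under their top-level parent.

    Entries scoring below min_score are dropped. A second-level entry
    whose parent was dropped is dropped as well. Both levels are
    sorted by descending score.
    """
    top = {}
    children = defaultdict(list)
    for name, parent, score in results:
        if score < min_score:
            continue
        if parent is None:
            top[name] = score
        else:
            children[parent].append((name, score))

    tree = []
    for name, score in sorted(top.items(), key=lambda item: -item[1]):
        kids = sorted(children.get(name, []), key=lambda item: -item[1])
        tree.append((name, score, kids))
    return tree


def print_tree(tree):
    """Print the tree in the same bullet style as the list above."""
    for name, score, kids in tree:
        print(f"  • {name} ({score:g})")
        for kid_name, kid_score in kids:
            print(f"    • {kid_name} ({kid_score:g})")


if __name__ == "__main__":
    print_tree(build_tree(RESULTS, min_score=0.6))
```

With `min_score=0.6`, Life drops out of the top level, Mind keeps only Computational neuroscience, and Computer Science keeps only Philosophy of artificial intelligence.  Where you set that threshold is a judgment call for your application.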

Since we’re Lexalytics, we couldn’t just leave it at that.  We know that you’re not going to be completely satisfied unless the taxonomy matches the one in your head.  Two new files in the data directory allow you to a) configure new categories or override the categories that we already have, and b) configure a multi-level taxonomy of your categories.

Categories: Announcements, Categorization