Document Collection Processing

Overview

Collection Processing

Prior to Salience Five, there was only one way to process content through Salience: a single piece of text at a time. Salience Five adds the ability to process an entire collection of text simultaneously, which provides new extraction capabilities for the whole text mining analytics process. Each kind of processing has its place, and this document discusses both.

Single Document at a Time

When processing one document at a time, Salience examines each document and extracts all the relevant metadata. A later stage of your processing pipeline then stores the content in a data store, either a database or a search index.

[Figure: Document at a time]

It is then up to the user to run the necessary reports to find, for example, the top entities or the top themes. The advantage of this sort of processing is that you get access to every piece of metadata that Salience produces, and can therefore set up exactly the reports you need.
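The store-then-report workflow described above can be sketched as follows. This is a minimal illustration, not Salience code: the rows are made-up stand-ins for per-document entity output, the table name `entity_mentions` and the helper names are hypothetical, and an in-memory SQLite database stands in for whatever database or search index a real pipeline would use.

```python
import sqlite3

# Hypothetical per-document output: (document id, entity, sentiment score).
# In a real pipeline these rows would come from the text analytics engine;
# here they are invented purely for illustration.
ROWS = [
    ("doc1", "Hilton", 0.6),
    ("doc1", "Boston", 0.1),
    ("doc2", "Hilton", -0.4),
    ("doc3", "Hilton", 0.2),
    ("doc3", "Boston", 0.3),
    ("doc3", "Logan Airport", -0.1),
]


def build_store(rows):
    """Load per-document metadata into an in-memory SQLite data store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entity_mentions (doc_id TEXT, entity TEXT, sentiment REAL)"
    )
    conn.executemany("INSERT INTO entity_mentions VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def top_entities(conn, n):
    """Run the 'top N entities' report: mention count and mean sentiment."""
    cur = conn.execute(
        "SELECT entity, COUNT(*) AS mentions, AVG(sentiment) AS avg_sentiment "
        "FROM entity_mentions "
        "GROUP BY entity "
        "ORDER BY mentions DESC, entity ASC "
        "LIMIT ?",
        (n,),
    )
    return cur.fetchall()


conn = build_store(ROWS)
report = top_entities(conn, 2)
for entity, mentions, avg_sentiment in report:
    print(entity, mentions, round(avg_sentiment, 2))
```

Because every row is kept, any other report (per-document, per-date, and so on) can be written against the same store, which is the flexibility this mode of processing offers.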

However, when the documents share some “relatedness”, it would be useful for the core Salience algorithms to take advantage of it to perform better or different metadata extraction.

Salience Five Collections

Collection processing takes a collection of documents, processes them simultaneously, and then gives summary results for the entire collection.

[Figure: Collection Processing]

As with document-at-a-time processing, Salience can provide the top entities, themes, and topics. With document-at-a-time processing, you have to populate a search index or database and then run a report to get the “top N” entities, themes, and topics. With collection processing, you simply query Salience directly, and it returns the mentions/hits, score, and sentiment analysis of the top entities, themes, and topics.
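To make the shape of such a summary concrete, here is a toy sketch of a collection that is queried directly for its top themes. It is not the Salience API: `ToyCollection`, `ThemeSummary`, and their methods are hypothetical names, the input is pre-annotated (theme, sentiment) pairs rather than raw text, and the reading of "mentions" as total occurrences and "hits" as the number of documents containing the theme is an assumption made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ThemeSummary:
    theme: str
    mentions: int     # total occurrences across the collection (assumed meaning)
    hits: int         # number of documents containing the theme (assumed meaning)
    sentiment: float  # mean sentiment over all mentions


class ToyCollection:
    """Hypothetical stand-in for a collection that can be queried directly."""

    def __init__(self):
        self._docs = []

    def add_document(self, annotations):
        # annotations: list of (theme, sentiment) pairs for one document
        self._docs.append(list(annotations))

    def top_themes(self, n):
        mentions = defaultdict(int)
        sentiment_sum = defaultdict(float)
        docs_with = defaultdict(set)
        for doc_index, annotations in enumerate(self._docs):
            for theme, sentiment in annotations:
                mentions[theme] += 1
                sentiment_sum[theme] += sentiment
                docs_with[theme].add(doc_index)
        summaries = [
            ThemeSummary(
                theme,
                mentions[theme],
                len(docs_with[theme]),
                sentiment_sum[theme] / mentions[theme],
            )
            for theme in mentions
        ]
        # Most-mentioned first; ties broken alphabetically for stable output.
        summaries.sort(key=lambda s: (-s.mentions, s.theme))
        return summaries[:n]


collection = ToyCollection()
collection.add_document(
    [("room service", -0.5), ("front desk", 0.5), ("room service", -0.5)]
)
collection.add_document([("room service", 0.25)])
collection.add_document([("front desk", 0.5), ("free breakfast", 1.0)])

top = collection.top_themes(2)
for item in top:
    print(item.theme, item.mentions, item.hits, item.sentiment)
```

The point of the sketch is the calling pattern: documents go in, and a single query returns the summary, with no separate data store or report step in between.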

More interesting is the new functionality that collection processing enables.   Salience Five is the first text analytics product to be able to directly track how real people are describing their real experiences, without pre-configuring a large taxonomy.

Take the sentence “My bed was hard.” No other text analytics software can take a collection of hotel reviews and automatically extract “bed” as an important aspect (or, as we call it, a “facet”). Facets have “attributes”, so you can quickly see that 10 people said the bed was hard and 3 people said it was uncomfortable. Not only do you know that the opinion was negative (from the sentiment analysis), you also know why.
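The relationship between facets and attributes can be pictured as a mapping from each facet to a count of the attributes people used for it. The sketch below only illustrates that data shape: the (facet, attribute) pairs are supplied by hand, the extraction itself is not shown, and the "staff" entries are invented to give the example a second facet.

```python
from collections import Counter, defaultdict

# Hand-written (facet, attribute) pairs, one per reviewer statement.
# The "bed" counts mirror the example in the text; "staff" is made up.
PAIRS = (
    [("bed", "hard")] * 10
    + [("bed", "uncomfortable")] * 3
    + [("staff", "friendly")] * 4
)


def group_facets(pairs):
    """Group attributes under their facet and count how often each appears."""
    facets = defaultdict(Counter)
    for facet, attribute in pairs:
        facets[facet][attribute] += 1
    return facets


facets = group_facets(PAIRS)
for facet, attributes in facets.items():
    print(facet, attributes.most_common())
```

Reading the counts for "bed" shows both that the facet matters (13 statements) and why reviewers were unhappy with it.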

Please see this webpage for more information on Salience Facets.

Over the upcoming releases, we will take more and more advantage of the contextual, statistical, and semantic relationships available in collections of related content. Watch for more good stuff.