December 22nd, 2011
This is another extract from our customer files. Not something that comes up all the time, but often enough that it warranted a blog article with a good worked example.
In general, Salience Engine has been and continues to be very economical in terms of hardware requirements. Text analytics with Salience Engine is more CPU intensive than I/O or memory intensive, though the inclusion of the Concept Matrix™ in Salience Five has increased the memory footprint.
So let’s say you’re looking to process 2 million documents per day, where half are tweets and half are news articles of 4 KB or less. What kind of hardware spec are you looking at? Read on to see how you could spec out handling this amount of content with Salience Five.
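As a rough back-of-envelope sketch (not the sizing method from the full article), the daily volume above can be turned into a sustained processing rate and an upper bound on raw news text. The function names are hypothetical, 1 KB is assumed to be 1024 bytes, and tweet size is left out because the post does not state one.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_docs_per_second(docs_per_day: int) -> float:
    """Sustained rate needed to keep up, assuming load is spread evenly over 24 hours."""
    return docs_per_day / SECONDS_PER_DAY


def max_news_bytes_per_day(docs_per_day: int,
                           news_fraction: float = 0.5,
                           max_news_kb: int = 4) -> int:
    """Upper bound on raw news text per day, assuming every article hits the size cap."""
    news_docs = int(docs_per_day * news_fraction)
    return news_docs * max_news_kb * 1024  # assumption: 1 KB = 1024 bytes


if __name__ == "__main__":
    rate = required_docs_per_second(2_000_000)
    news_bytes = max_news_bytes_per_day(2_000_000)
    print(f"Sustained rate needed: {rate:.2f} docs/sec")       # about 23.15
    print(f"News text upper bound: {news_bytes / 1e9:.2f} GB/day")
```

Real traffic is rarely spread evenly, so a peak-hour multiplier would normally be applied on top of the sustained rate.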
Read the rest of this entry »
Posted in Salience Five, Support | No Comments »
December 21st, 2011
I wanted to write up a detailed explanation of the methods of entity extraction available in Salience Five for a client, where they overlap and where they differ. And as I did, I thought, “That would make for a bloody useful blog post for the dev blog.” So here it is.
Prior to Salience 4.x, entity extraction was solely list-based. Salience 4.0 introduced model-based entity extraction, which allowed for novel entity extraction. In other words, “I didn’t think to add ‘John Smith’ to my list of people to extract, but Salience Engine found him in today’s news magically because it knows what names of people look like.” Very powerful stuff.
Salience Five continues to provide the model-based and list-based entity extraction found in Salience 4.x, with some of the same cross-over between the two and some changes to the terminology.
Read the rest of this entry »
Posted in Salience Five, Support | No Comments »
August 5th, 2011
Today’s assignment: Convert some docx files to txt and then time how long it takes to process them, getting document sentiment and entities. Use PowerShell.
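So first, let’s convert the Word documents to text files:
# Start Word through COM; the original snippet assumed $word already existed.
$word = New-Object -ComObject Word.Application
function Save-AsText($fn) {
    $doc = $word.Documents.Open($fn.FullName)
    # Swap only the file extension, not every 'docx' that appears in the path.
    $txtName = [System.IO.Path]::ChangeExtension($fn.FullName, 'txt')
    # 2 = wdFormatText (plain text)
    $doc.SaveAs([ref] $txtName, [ref] 2)
    $doc.Close()
    echo $txtName
}
$c = Get-ChildItem -recurse -include *.docx
foreach ($fn in $c) {
    Save-AsText $fn
}
# Shut down the Word instance once all files are converted.
$word.Quit()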
Now that we’ve got our text files, we can use Measure-Command and Measure-Object to do the measuring:
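# Load the Salience .NET wrapper and create an engine from the license file and data directory.
Add-Type -Path "C:\Program Files\Lexalytics\Salience\bin\SalienceEngineFour.NET.dll"
$se = New-Object Lexalytics.SalienceEngine(
    'C:\Program Files\Lexalytics\license.dat',
    "C:\Program Files\Lexalytics\data")
$timings = @()
$c = Get-ChildItem -recurse -include *.txt
$cnt = 0
$s = 0
foreach ($fn in $c) {
    # Time text preparation plus entity and sentiment extraction for one file.
    $m = Measure-Command -OutVariable t {
        $rc = $se.PrepareTextFromFile($fn.toString())
        if ($rc -ne 0) {
            # Measure-Command discards pipeline output, so write straight to the host.
            Write-Host "Failed to prepare text with code $rc on $fn"
            continue
        }
        # -ExpandProperty yields the bare number rather than an object wrapping it.
        $cnt = $se.GetEntities(0, 0, 0, 0, 50, 5) | Measure-Object | Select-Object -ExpandProperty Count
        $s = $se.GetDocumentSentiment(0).fScore
    }
    $timings += $t[0].TotalMilliseconds
    Write-Host $fn $cnt $s $t[0].TotalMilliseconds
}
# Summarize the per-file timings (in milliseconds).
$timings | Measure-Object -minimum -maximum -average -sum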
And you’ll end up with a summary at the end like this:
Count : 100
Average : 511.2
Sum : 51120
Maximum : 999
Minimum : 63
That works out to an average of about 511 milliseconds per document across the 100 documents processed.
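To put that average in context, here is a small sketch of the arithmetic behind the summary, written in Python rather than PowerShell purely for illustration. It recomputes the average from the Count and Sum shown above and extrapolates to a daily figure. The extrapolation assumes a single process working through documents one at a time, around the clock, at the same average speed, which is an assumption and not a measured result. The function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def throughput_from_timings(count: int, total_ms: float):
    """Return (average ms per doc, docs per second, docs per day) for sequential processing."""
    avg_ms = total_ms / count
    docs_per_sec = 1000.0 / avg_ms
    docs_per_day = docs_per_sec * SECONDS_PER_DAY  # assumes constant speed for 24 hours
    return avg_ms, docs_per_sec, docs_per_day


if __name__ == "__main__":
    # Count and Sum taken from the Measure-Object summary above.
    avg_ms, per_sec, per_day = throughput_from_timings(100, 51120)
    print(f"Average: {avg_ms:.1f} ms/doc")        # 511.2
    print(f"Rate:    {per_sec:.2f} docs/sec")     # about 1.96
    print(f"Per day: {per_day:,.0f} docs")        # roughly 169,000
```

The per-day number is only as good as the sample: these 100 converted Word documents may be much larger or smaller than the content you actually need to process.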
Posted in Dev, How-to | No Comments »
July 25th, 2011
This is a bit of a forward-looking blog post about new features that we’re debuting in Salience Five. At our Lexalytics User Group meeting in New York in April, we introduced the “collections” functionality that will be provided in Salience Five. Salience Five is in beta right now, so I decided to put together a worked example of collection functionality using some customer review data for Bally’s in Las Vegas gathered from a public website.
Read on to see how you’ll be able to use collections to analyze a group of documents as a cohesive set, extract the commonly occurring themes (with rollup using our concept matrix), and other pieces of actionable data we’re calling facets.
Read the rest of this entry »
Posted in Dev, How-to, Java, Salience Five | No Comments »
April 29th, 2011
To round out my overview of ways to get quickly up and running with scripting Salience on Windows, I’ll conclude with another way to take advantage of the .NET wrapper: IronPython.
Read the rest of this entry »
Posted in Dev, How-to | No Comments »
April 23rd, 2011
A common request from customers looking to evaluate Salience Engine is to process a sample set of data. Often this will take the form of an Excel or CSV file where there is a column that contains the text to be processed. I’m going to show one way of tackling this problem, using PowerShell.
Read the rest of this entry »
Posted in Dev, How-to | No Comments »
April 22nd, 2011
This is a follow-on to our list of ten things to know about Salience Engine. Together, these two articles are intended to guide developers in some of the main aspects of working with Salience Engine when they first start out.
In the first part, most of the topics focused on deployment strategies and approaches. In this second part, we’ll look at areas of tuning results from Salience Engine. So let’s roll up our shirt-sleeves and get back into it…
Read the rest of this entry »
Posted in General, How-to, Support | No Comments »
April 22nd, 2011
I had a meeting with a client recently, and one of the suggestions they raised was a list of the top 10 things that an engineer should know when they start working with Salience Engine. Some of these may seem basic; however, it’s not safe to assume that things which seem obvious actually are. With all due respect to David Letterman and his Late Night Top Ten lists, here we go…
Read the rest of this entry »
Posted in General, How-to, Support | 1 Comment »
April 15th, 2011
By way of introduction, my name is Matt King and I’m a Solution Architect in the Lexalytics Services group. I’m also the guy who brought you the interactive Salience Python script the other day. Most of my current work is on Linux (Python/Java/bash/etc.) and both my home and work laptops run OS X as the primary OS. I do have VMware Fusion with a Windows OS, but until a day or two ago that was a copy of XP Professional that I dutifully purchased back in 2008.
After upgrading to Windows 7, I was looking around for something to do. As I’ve been hearing good things about PowerShell, I figured it was worth checking out. But what to do with it? I’d heard that one of the cooler things, besides the object-passing pipelines, is that it allows easy access to just about everything via .NET. And Salience comes with a .NET wrapper…
Read the rest of this entry »
Posted in Dev, How-to | No Comments »
April 6th, 2011
One of the key strengths of Salience Engine is that it is provided as a library, which customers can integrate into their own systems. To make the integration easier, we provide wrappers for some of the most popular development environments, namely .NET, Java, PHP, and Python. The first hurdle for a developer to cross in accessing Salience Engine is getting the wrapper of choice set up within their development environment so they can start coding against it. This blog article shows how to build and deploy the Python wrapper for Salience Engine on both Windows and Linux. Also provided is an interactive script, written by one of our professional services engineers, that can be used to get your feet wet with Salience Engine in a Python environment.
Read the rest of this entry »
Posted in Dev, How-to, Support | No Comments »