Over the past couple of days several CMS companies have been coming clean about their products, due in no small part to this blog post, which ties in nicely with Jeff’s post a while back about transparency among the Text Analytics vendors. To that end I thought it would be fun to get the ball rolling on our side of the fence and see what sort of response we could get.
So slightly modified from the original rules:
- A Text Analytics vendor is challenged to honestly answer all items on the “Reality checklist for vendors” below.
- Where possible, the vendor has to supply screenshots, links or other means to make it easy to verify the answers.
- The answers also need to be supplied in a short form of one to three stars (denoting “no”, “sort-of”, “yes”).
- Answering all questions on their blog allows the vendor to tag some other Text Analytics vendors.
- A tagged vendor should provide a link back to the blog that tagged them.
Questions and answers

Right, so question and answer time…
- Platforms: We support various flavors of Windows (XP, Server 2003 and Vista) together with Linux, and we support both 32-bit and 64-bit machines natively.
- Ease of install/licensing: Ever since V1.0 we have tried to make the installation process as simple as possible, bundling everything into a standard installer package on Windows (built with the ever-lovely NSIS). On Linux everything is packaged as a standard tarball: just unpack it, set an environment variable and you are ready to go. It really is a 30-second job. Licensing is only slightly more involved, but at the end of the day it comes down to getting a license file from us and saving it somewhere on your machine. A recent quote from a client: “I think it took me longer to download and untar it than it did to get a quick sample running.”
- Ability to start processing customer data and give meaningful results straight out of the box: First impressions are important, and while the best results from text analytics will always require at least some degree of tuning to your particular content of interest, that tuning shouldn’t be necessary just to get a first look at results. Our Salience install provides a sample application that can be used right out of the box on your own content, so you can judge exactly how the default model will perform. A command-line utility is also provided that can be scripted to automate proof-of-concept testing and results analysis.
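The kind of proof-of-concept automation described above can be sketched in a few lines of Python. The post doesn’t give the utility’s name or flags, so the sketch below uses a placeholder callable where the real command-line invocation (e.g. via `subprocess`) would go; everything here is illustrative, not the actual Salience tooling.

```python
import os

def batch_analyze(doc_dir, analyze):
    """Run an analysis callable over every .txt file in doc_dir.

    `analyze` is a stand-in for invoking the vendor's command-line
    utility on each document; here it is any function that takes the
    document text and returns a result.
    """
    results = {}
    for name in sorted(os.listdir(doc_dir)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(doc_dir, name), encoding="utf-8") as f:
            results[name] = analyze(f.read())
    return results
```

In a real proof of concept you would replace the callable with a `subprocess.run([...])` call to the actual utility and parse its output, then diff the collected results across configuration changes.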
- Flexibility of entity extraction: There are many exciting applications for text analytics, and just about every customer we’ve come across has a different focus when it comes to their content. Many text analytics vendors offer standard entity recognizers for people, companies, places and so on, but how good are those recognizers when a new company comes onto the scene? Can you adjust and tweak the provided entity recognition? You can with Salience: user customizations are a matter of text-file editing, from simple authority lists to complex patterns for advanced entity recognition. With Salience 4.1 we’re taking it one step further with the introduction of the Entity Management Toolkit. This new tool allows you to train an entity recognition model for the types of entities that are important to you. Jeff’s favorite example is a disease recognizer: annotate six documents containing mentions of various diseases, and use the resulting model in your Salience environment to recognize disease mentions in other documents, without authority lists.
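To make the authority-list idea concrete, here is a minimal sketch of the concept: a plain text file of known names, loaded and matched against incoming text. The file format, function names and `COMPANY` label are all made up for illustration and are not Salience’s actual format or API.

```python
def load_authority_list(path):
    """Read one entity name per line; blank lines and # comments are ignored."""
    with open(path, encoding="utf-8") as f:
        return [ln.strip() for ln in f
                if ln.strip() and not ln.startswith("#")]

def tag_entities(text, names, label):
    """Return a (name, label) pair for each authority-list name found in text."""
    found = []
    lowered = text.lower()
    for name in names:
        if name.lower() in lowered:
            found.append((name, label))
    return found
```

The appeal of the approach is exactly what the post claims: adding a newly founded company is a one-line edit to a text file, with no retraining required.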
- Document-level and entity-level sentiment: Many people consider entity extraction to be enough in the world of text analytics. We disagree: we consider sentiment analysis a must-have feature for text analytics vendors. Salience provides both document-level and entity-level sentiment, using a grammatical parse to determine which parts of the document are talking about which entity. We allow customers to customize the sentiment by applying their domain-specific knowledge to the data they are working with, and going forward we will be extending our Sentiment Toolkit application to generate new sentiment models that exist alongside our traditional methodology.
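The document-level versus entity-level distinction can be illustrated with a deliberately crude sketch: score an entity only from the sentences that mention it. This sentence co-occurrence heuristic and its tiny hand-made lexicon are far simpler than the grammatical parse described above; they exist only to show why per-entity attribution differs from a whole-document score.

```python
import re

# Tiny hand-made lexicon, purely for illustration.
LEXICON = {"great": 1, "excellent": 1, "poor": -1, "terrible": -1}

def entity_sentiment(text, entity):
    """Sum lexicon hits in the sentences that mention the entity.

    A grammatical parse would attribute sentiment-bearing phrases far
    more precisely; this sketch just uses sentence co-occurrence.
    """
    score = 0
    for sentence in re.split(r"[.!?]+", text):
        if entity.lower() in sentence.lower():
            for word in re.findall(r"[a-z']+", sentence.lower()):
                score += LEXICON.get(word, 0)
    return score
```

On a review like “The camera is excellent. The battery is terrible.” a document-level score nets to zero, while the per-entity view correctly separates the positive camera from the negative battery.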
- Ease of integration: There are a number of technology solutions, not just in the text analytics space, that are provided via web services. The introduction of web services was a boon in that it allowed access to functionality without needing to embed a vendor’s library into your product. The downside? Network latency in performance-intensive applications. Need a web service for a distributed application? One can be written around the Salience C API with the development technology of your choosing. To support this we ship a fully functional C API which is then wrapped by native code for the other main development languages: we have wrappers for Python, PHP, Ruby, Java and .NET that support all the same functionality as the C API. We ship the source code for those wrappers as well, so if you want to add your own functions you can.
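As a generic illustration of the wrapper pattern described above (not Salience’s actual API, whose function names aren’t given here), this is what a thin Python wrapper over a C shared library looks like with `ctypes`, using libc’s `strlen` as a stand-in for a vendor function:

```python
import ctypes

# Load the C library. On Linux, passing None exposes the symbols of the
# running process, which include libc; a vendor wrapper would instead
# load the vendor's shared library by name.
libc = ctypes.CDLL(None)

# Declare the C signature so ctypes converts arguments correctly,
# just as a generated wrapper would for each API function.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

def c_strlen(s: str) -> int:
    """Python-friendly facade over the raw C call."""
    return libc.strlen(s.encode("utf-8"))
```

Shipping wrapper source, as the post describes, means a customer can add a facade like this for any new C entry point without waiting for the vendor.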
- Documentation and SDK samples: All of our software and its associated APIs are documented on a public wiki at http://dev.lexalytics.com/wiki. Being a wiki lets us make very quick edits and grant access to customers if required, but it also puts the onus on us to keep it up to date and structured so that customers can easily find the piece of information they are looking for. As stated above, we also ship the source code for all of our wrappers, together with the source code for both of the sample applications installed with the package.
- Reporting problems: We maintain a public bug database at http://dev.lexalytics.com/bugs/ which lets customers see whether an issue they are having is affecting other people as well. All of our internal services staff use this bug-reporting mechanism too; transparency is something that is very important to us. The dev team can also be contacted directly via a diverse set of technologies including email, Twitter and Skype, oh, and this thing called a telephone as well.
- Foreign language support: Currently Salience is English-only, but it is on our internal product roadmap to expand this during the year. French and Spanish will be the focus of our initial efforts, with more languages added as and when there is demand for them. It’s our intention to support our full range of capabilities, including entity-level sentiment, in these languages.
- Openness and transparency: One of the things Lexalytics has always stayed away from is being a ‘black box’ type of solution. The solutions our customers are trying to build are complex beasts, and handing them a one-size-fits-all black box that solves today’s problem, only to require a completely new black box for the problem that occurs tomorrow, doesn’t (IMO) fly in the current marketplace. To this end we provide almost endless customization options, from changing fundamentals like tokenization to adding an extra entity or a new sentiment-bearing phrase. On the technical side we are more than happy to tell you how we do something, from where we use lexical chain techniques to what our grammatical parsers are actually doing to your document. We even supply a freely available trial application (you can download it from our website) so you can see that we can actually do what we say we can do.
Conclusion

So how did we do? Well, we came up with the questions (though I do feel they are a good representative set), so unsurprisingly we feel we’re pretty solid in these areas. There is always room for improvement, however, and our short-term development plan includes items such as more accurate entity extraction out of the box, better sentiment analysis of short text, more text analytics features and faster processing. Finally, we need to tag some other vendors, so let’s go with Leximancer, Crimson Hexagon, Clarabridge, Attensity and Jodange.