
LexaBlog: Our Sentiment about Text Analytics and Social Media

Newsgator and aggregation

At a customer request, I spent some time looking at Newsgator’s API. The customer wanted to see if they could get blog and social media content about a specific company from a single place. Currently, they are searching Google Blogs, Twitter, etc. The model they had in mind was Moreover’s company feeds, for example this one on Dupont.

Unfortunately, I couldn’t get clean content. There was a lot of foreign-language content and I could not specify the language I desired. There was obvious spam and so forth - all classic problems of information retrieval on the web. Moreover’s free feeds are a lot cleaner due in no small part to content selection by humans, but to really get something you want, you need to pay and customize the feed. Most customers don’t want to see every single article about a company - they have a target area.

You might say this is an opportunity for a Reputation Management or Media Intelligence platform. There are many, at various price points. However, for a lot of people, that’s a much bigger pipe of information than they really want, and it comes with a lot of overhead of its own. Would it be possible, and worthwhile, for a content aggregator to set up an infrastructure that allows more customization, takes care of enough of the cleanup, categorization, etc., and is cheap enough to be ad-supported, like Moreover’s free feeds? Something that could provide a “pulse” of information on a target area you are interested in without involving things like historical reporting, influence graphs, campaign management, and other unwanted content?

Has Google, with all its problems in searching social media, done well enough to make this a losing proposition for anyone else? I’d be interested in hearing what other sources people go to for relevant, aggregated information without investing in a full-blown vendor solution. Or maybe I missed some key features of Newsgator; if so, I would welcome learning how to get cleaner results.


Comments

Mike, this is a good idea.



We do recognize that the questions are tailored to a particular vendor’s feature set, even if responses are fair. For instance, your questions 1 & 2 are not relevant for OpenCalais, which is delivered as a Web service.



I am, however, skeptical of self ratings. For instance, Lexalytics doesn’t run on Unix or MacOS, right? So I’d give you two stars. A rating point is foreign-language support, but Lex has none so you get no stars, and Lex is less open than open-source GATE or RapidMiner or open-code LingPipe so I’d reduce you to two stars. But this is nibbling at the edges on my part. It’s a nice effort.



I do have three significant questions.



1) Your point 4 is “Flexibility of entity extraction” but what you show is annotation, not extraction. What are Lex’s extraction capabilities, either into XML possibly including RDF triples or into a database or some other format?



2) Do we not need to look at accuracy, performance, and throughput?



3) Ecosystem: Some vendors have alliances with vendors whose applications consume text-analytics output, for instance, BI or e-discovery or search tools. Some vendors offer visualization or data mining capabilities. All this matters for some users.



These are just off-hand thoughts. Good start.



Thanks for the feedback, Seth. I actually agree with most of what you say, especially around the star ratings; with hindsight, no stars for language support is perfectly correct. I’m not sure that I would agree with you on the OpenCalais issue, though. Surely it is of utmost importance to know whether you can install the software on your own stack or have to use a web service, with all the latency issues that involves.



Answering your specific questions:



1) The screenshot is of a tool that we are releasing alongside our 4.1 release. It will let you annotate your own data with what you consider entities and then train a model that sits alongside our existing ones, giving you complete flexibility in what you recognize as an entity, and thus in what you can extract. To your other point, this write-up focuses solely on Salience, which is an SDK, so entities are returned as structures or objects in whatever language you are using. It is a simple task to turn those into XML, insert them into a DB, etc.
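As a rough illustration of that last step only, here is a minimal Python sketch that turns a list of entity objects into XML. The `Entity` class, its field names, and the sample values are hypothetical stand-ins made up for this example; they are not the Salience API.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Entity:
    # Hypothetical stand-in for an entity object returned by an SDK.
    # These field names are assumptions, not the Salience API.
    text: str
    type: str
    sentiment: float


def entities_to_xml(entities):
    """Serialize a list of Entity objects to an XML string."""
    root = ET.Element("entities")
    for e in entities:
        # Each entity becomes one element; metadata goes into attributes.
        node = ET.SubElement(
            root, "entity", type=e.type, sentiment=f"{e.sentiment:.2f}"
        )
        node.text = e.text
    return ET.tostring(root, encoding="unicode")


# Made-up sample data for illustration only.
sample = [Entity("Dupont", "Company", 0.25), Entity("Seth", "Person", 0.0)]
xml_out = entities_to_xml(sample)
print(xml_out)
```

The same loop could just as easily build rows for a database insert instead of XML elements.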



2) Yes, we do, though as discussed in other places, accuracy is very hard to judge. Hopefully one day someone will produce a baseline test set that we can all run against.



With regard to performance, I wrote something about that a few weeks ago: http://blog.lexalytics.com/2009/02/17/every-little-bit-helps/



3) Yep, it does. Our own partnerships with the Enterprise Search vendors or various VoC companies, for example, help bring text analytics into a more mainstream market. Not sure how you would rate that on a meme, however. Number of partnerships?


