There's an old adage, 'rubbish in, rubbish out', and it has never been truer than in the text analytics field. The number one issue our customers face is getting clean content to work on, and the problem is at its worst when we're asked to process web content. Whilst it may look lovely to the browsing eye, your typical webpage is a mass of links, adverts and callouts, all hiding the actual text of the document. Take a look at this page, for example: http://www.lightreading.com/document.asp?doc_id=139757
The actual story content makes up less than 20% of the text on the page, and the other 80% is packed full of companies and brands that aren't really related, and certainly shouldn't be counted as entities within the context of the story.

Now, whilst it is certainly possible to go to one of the many feed providers for your data (Moreover, for example), they are typically restricted in the sources they will offer you, normally because they use a template-based extraction method, which is great until the site design changes. And let's face it, they cost a fair chunk of change.

So, based on conversations with both internal people and various customers, I decided it was time to revisit our HTML extraction technology and see if we couldn't do a better job. We already have technology (based on lexical chains) that enables us to make good summaries of documents, basically working out which bits of text are connected to each other and which are the most important. Surely we could extend this to work with HTML content?

Well, it turns out that we could, and after a short while I had a decent prototype up and running. The initial results are very promising, and after some more tweaking (there always seems to be tweaking involved in these things) I'm pretty confident we'll be able to roll the technology out in one form or another in the near future. It should tide us over until full-text RSS feeds become the norm rather than the exception.
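To give a flavour of the problem, here is a minimal sketch of the kind of heuristic that separates story text from navigation clutter. This is not our lexical-chain technology, just a toy illustration using Python's standard-library HTML parser: it splits a page into text blocks and discards blocks that are short or dominated by link text (high "link density"), which is where most adverts, menus and callouts live. All names and thresholds here are made up for the example.

```python
from html.parser import HTMLParser

# Tags that typically delimit a visual block of text on a page.
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section"}

class BlockExtractor(HTMLParser):
    """Split a page into text blocks, recording how many characters
    of each block sit inside <a> tags (its link density)."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # list of (text, chars_inside_links)
        self._text = []
        self._link_chars = 0
        self._in_link = 0      # nesting depth of <a> tags
        self._skip = 0         # nesting depth of <script>/<style>

    def _flush(self):
        # Close off the current block, normalising whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data)

def extract_story(html, min_chars=80, max_link_density=0.3):
    """Keep only blocks that are long enough and mostly plain text."""
    parser = BlockExtractor()
    parser.feed(html)
    parser._flush()
    kept = []
    for text, link_chars in parser.blocks:
        if len(text) >= min_chars and link_chars / len(text) <= max_link_density:
            kept.append(text)
    return "\n\n".join(kept)
```

On the Light Reading page above, a rule like this would throw away the navigation bars and "related stories" boxes (short, nearly 100% link text) and keep the long editorial paragraphs. The lexical-chain approach goes further, using the connections between sentences rather than raw block statistics, but the link-density idea shows why template-free extraction is feasible at all.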