Following on from what I was writing about last week, I’ve been doing some more work on a generic text-from-HTML extractor. In showing the results to a few people, however, a question came up. Overall they were very happy with the amount of rubbish that got removed from a page, but were unhappy that the odd sentence would get missed out. “It’s not a complete extraction” was the cry, and to be honest that’s a fair point: the technique doesn’t get every single word of a story from the page, just as it doesn’t remove every single extraneous bit of text. But does it do a good enough job? I of course think it does (I would though, wouldn’t I?) and thought I would take a few moments to say why. To illustrate my point I’m going to use this article from the BBC website: http://news.bbc.co.uk/1/hi/uk/7126162.stm As you can see, the page has lots of extraneous links, navigation elements and so on, the sort of things that make extracting meaningful data difficult. However, if you run it through the new extractor it gets reduced to this:
A British teacher jailed in Sudan for letting her class name a teddy bear Muhammad has spoken of her” ordeal”, after returning to the UK. After speaking at Heathrow, she was taken by police to an unnamed location. The teacher and her family were expected to return to Mrs Gibbons ‘ son’s home in Wavertree, Liverpool, but reporters from around the world have been left waiting there for hours. Mrs Gibbons ‘ son, John, and, daughter, Jessica met her at Heathrow Airport, and the BBC’s Matt Prodger said a homecoming party would be held in Liverpool later. ‘ Fabulous time ‘. Mrs Gibbons, a mother - of - two, was arrested on 25 November and later given a 15 - day sentence after allowing her pupils to hold a vote and choose the name Muhammad, the same name as the Islamic Prophet, for a teddy bear. She arrived back to London accompanied by British Muslim peers Baroness Warsi and Lord Ahmed, who had mediated for her release. After a meeting with Baroness Warsi and Lord Ahmed, the press office of President al - Bashir announced that Mrs Gibbons had been pardoned and released after” mediation” On her arrival at Heathrow, Mrs Gibbons looked tired but relieved as she was whisked to a private room to speak to reporters for the first time since her ordeal began, our correspondent said. He understands that when Mrs Gibbons was first arrested, she asked a British consular official not to tell her family for fear it might worry them. Only then was she told that her case had become an international media story. Mrs Gibbons said the incident had” all come as a huge shock to me” and that going to prison was” terrifying” although she never actually spent any time in the Omdurman women’s jail. ” I’m just an ordinary middle - aged primary school teacher. I went out there to have an adventure and got a lot more adventure than what I was looking for. She said:” It is a beautiful place and I had a chance to see some of the countryside. 
Mrs Gibbons said she was treated the same as other Sudanese prisoners and that the Ministry of Interior sent her a bed, which was” the best present” When asked if she was going to continue as a teacher, Mrs Gibbons said:” I’m looking for a job - I am jobless.” ” And then at the end of Sunday we were presented with some hope that we may be able to see the president on Monday and we may be able to reach a resolution. ” The original incident was something very innocent and then what should have been seen as a minor error - and certainly a very innocent one - suddenly became blown up into something extremely important and the whole thing has been very, very worrying and quite horrendous.” Khalid al Mubarak, media counsellor at the Sudanese embassy in London, said he was very pleased the situation had been resolved. He also suggested that orientation classes for westerners coming to work in Sudan should be reintroduced. They had been standard procedure during the colonial era, he said. He said a short course ending in an exam, perhaps run at local colleges in Sudan, would be” very useful” to help new - comers avoid basic mistakes such as using the left hand to offer something to so mebody - the left hand is considered unclean. Meanwhile, Jonah Fisher, former BBC Khartoum correspondent, said that the arrest of Mrs Gibbons must have seemed like an easy opportunity to give Sudan’s former colonial masters a bloody nose. But in actuality, it appears to be Sudan’s President al - Bashir who has been left with a red face, he added. Teddy row teacher freed from jail. In pictures: Teddy row. Fire crew aid in penis operation.
which I think is a pretty fine job of getting the meat of the content without lots of spurious references. Sure, it’s not perfect (check out the last sentence), but does it cover all the salient points and, more importantly for downstream processing, all the entities and sentiment terms? Yes it does, and for text analytics purposes that’s what really matters. Does it really matter that the odd sentence got missed out here and there, especially as you aren’t getting your story polluted with all the navigation items? If you would like to try out the HTML extractor yourself to see that it’s not all smoke and mirrors, then head over to http://www.politicaltrends.info/htmlextract/ and try it out.
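If you want to put a rough number on the "does it cover all the entities and sentiment terms" question, something like the following would do. This is only a minimal sketch, not how the extractor itself works: the function names `normalise` and `term_coverage` are made up for illustration, and the list of expected terms is one I picked by hand from the article rather than the output of any entity recogniser.

```python
import re


def normalise(text):
    """Lower-case the text and reduce it to letter/digit tokens joined by
    single spaces, so that "al - Bashir" and "al-Bashir" compare equal."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def term_coverage(terms, extracted_text):
    """Return (fraction of terms found, list of terms that were missed).

    Matching is on whole tokens, so "Sudan" does not match "Sudanese".
    """
    haystack = " " + normalise(extracted_text) + " "
    missing = [t for t in terms if " " + normalise(t) + " " not in haystack]
    if not terms:
        return 1.0, missing
    return (len(terms) - len(missing)) / len(terms), missing


# Hypothetical, hand-picked terms a downstream analysis might care about.
expected_terms = ["Mrs Gibbons", "Baroness Warsi", "Lord Ahmed",
                  "al-Bashir", "pardoned", "Khartoum"]

# A short excerpt of the extractor output shown above (not the whole thing).
excerpt = ("She arrived back to London accompanied by British Muslim peers "
           "Baroness Warsi and Lord Ahmed, who had mediated for her release. "
           "After a meeting with Baroness Warsi and Lord Ahmed, the press "
           "office of President al - Bashir announced that Mrs Gibbons had "
           "been pardoned and released")

coverage, missing = term_coverage(expected_terms, excerpt)
print(f"coverage: {coverage:.0%}, missing: {missing}")
```

Run against the full extracted text rather than the excerpt, a check like this gives a quick feel for whether the odd dropped sentence actually cost you any of the terms you care about.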