Porn Plan Pitfalls: What David Cameron Should Know About Classification


David Cameron’s recent announcement of his plan to automatically block porn across all of England has certainly stirred up its share of controversy. When asked how exactly he planned to enact this filtration and what, precisely, would get filtered, the Prime Minister admitted he was unsure of the specifics.

Of course, problem number one would be defining what, precisely, porn entails; that’s something that’s been notoriously difficult to pin down. But even if he could successfully overcome that rather large stumbling block, the automatic porn filter would still come up against some pretty serious accuracy issues that Cameron seems somewhat unaware of.

Text analytics grapples with the very same issues of accuracy and misclassification that such a block would have to deal with, so we thought we’d use this opportunity to discuss “accuracy” and what that term really means.

 

Accuracy:

Suppose that we had a solid definition of pornography, so that every search result could be classified as either “porn” or “not porn” by a human. In this scenario, machine accuracy becomes the most important issue. If the block is overzealous in its filtration, it runs the risk of blocking search results that are not pornographic, such as medical and sexual health information. According to this article, filters in hospitals had to be turned off because they also blocked clinical studies on breast cancer. On the other hand, if the filter is too lenient, it will miss items that really are porn and end up not being very effective. These two failure modes are the domain of the two metrics into which accuracy is divided: precision and recall.
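To make those two failure modes concrete, here’s a minimal sketch in Python. The page names, labels, and filter decisions are entirely made up for illustration; the point is just to show how blocked/not-blocked decisions break down into the four buckets that precision and recall are built from.

```python
# Each entry: (page, what a human says it is, whether the filter blocked it).
pages = [
    ("porn-site-1",         "porn",     True),   # correctly blocked (true positive)
    ("porn-site-2",         "porn",     False),  # slipped through (false negative)
    ("breast-cancer-study", "not porn", True),   # wrongly blocked (false positive)
    ("news-site",           "not porn", False),  # correctly left alone (true negative)
]

tp = sum(1 for _, truth, blocked in pages if truth == "porn" and blocked)
fp = sum(1 for _, truth, blocked in pages if truth == "not porn" and blocked)
fn = sum(1 for _, truth, blocked in pages if truth == "porn" and not blocked)
tn = sum(1 for _, truth, blocked in pages if truth == "not porn" and not blocked)

print(tp, fp, fn, tn)  # 1 1 1 1
```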

 

Precision:

Precision measures what fraction of the returned items are relevant. In our scenario, the returned items are the search items that are filtered out, relevant items are those that actually are porn, and non-relevant items are those that are not. So, for example, if you had 5 documents tagged as porn, and only 3 of them were actually porn (as judged by some human), you would have 60% precision.

To put it simply, precision is the measure of how much of what is filtered out is porn. The higher the precision, the closer this number is to 100%. The danger here is that a highly precise filter may leave out relevant (pornographic) items, because its goal is to have as few non-relevant items filtered out as possible. In other words, there is always a balancing act between precision and the next metric, recall.
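Here’s the 5-documents example above as a tiny calculation (the numbers are just the ones from that example):

```python
def precision(true_positives, false_positives):
    """Fraction of the blocked items that really are porn."""
    return true_positives / (true_positives + false_positives)

# 5 pages blocked, but only 3 of them were actually porn.
print(precision(true_positives=3, false_positives=2))  # 0.6, i.e. 60%
```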

 

Recall:

Conversely, recall describes the number of relevant items that are returned compared to the total number of relevant items that exist. The more relevant items the filter fails to catch, the lower the recall. Search engines are a perfect example: they return thousands of links or images for your query, a large percentage of which probably aren’t very useful, but by doing so they are less likely to miss something that is useful. The problem is that a classification system optimized for high recall doesn’t care how many non-relevant items it sweeps up along the way. A porn filter tuned for high recall would also block sites and images that aren’t porn-related at all, typically things like educational sites or sites with technical information (like, say, condom vendors). That would be a problem.
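Recall is the same kind of ratio, just with a different denominator. A quick sketch with invented numbers:

```python
def recall(true_positives, false_negatives):
    """Fraction of all the porn out there that the filter actually caught."""
    return true_positives / (true_positives + false_negatives)

# Hypothetical: 3 porn pages blocked, 7 porn pages that slipped through.
print(recall(true_positives=3, false_negatives=7))  # 0.3, i.e. 30%
```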

 

F1 Scores:

F1 scores are a blended metric that combines precision and recall (specifically, their harmonic mean). They are a much better gauge of “accuracy” than simply looking at the percentage of agreement. You can read more about F1 scores here: http://en.wikipedia.org/wiki/F1_score
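Because it is a harmonic mean, the F1 score gets dragged down toward whichever of the two numbers is weaker. A quick sketch using the precision and recall values from the examples above:

```python
def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# A filter with 60% precision and 30% recall:
print(f1_score(0.6, 0.3))  # 0.4 -- closer to the weaker score than the stronger one
```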

 

As an aside, it’s important to consider the number of possible “states” when looking at the quality of your accuracy numbers. Say you’re doing sentiment analysis with positive/neutral/negative labels and you’re getting 50% accuracy. The immediate reaction would be to think “wow, I’m not doing any better than flipping a coin”, and that would be wrong. Remember, there are three possible states, so a truly random process would only give you about 33% accuracy. In the case of porn you could certainly make the case that there are more than two states; think along the lines of American movie ratings: G/PG/PG-13/R/NC-17/X/XXX. That’s a whole continuum of possibilities, and you have to ask where the line gets drawn.
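A toy simulation makes the coin-flip intuition easy to check (this is purely illustrative, not real sentiment data):

```python
import random

labels = ["positive", "neutral", "negative"]
truth = [random.choice(labels) for _ in range(100_000)]    # "true" labels
guesses = [random.choice(labels) for _ in range(100_000)]  # a purely random classifier

accuracy = sum(1 for t, g in zip(truth, guesses) if t == g) / len(truth)
print(round(accuracy, 3))  # roughly 0.333, not 0.5
```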

 

We deal with accuracy questions all the time, and our customers have to balance precision and recall for their applications. Some social media monitoring systems only do trending on content that is almost certain to be positive or negative, so they are highly precise but have low recall. But if you’re building a social CRM system where you’re interacting with people, missing an upset customer is a real problem, so it’s fine to wade through a lot of content that doesn’t indicate a problem as long as you catch the content you really need to see.
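One common way that trade-off gets made in practice is by moving a score threshold. Here’s a toy sketch (the scores and labels are invented, and this is not any particular product’s pipeline): raise the threshold and you get high precision but miss upset customers; lower it and you catch them all but flag plenty of happy ones too.

```python
# Each item: (classifier score, what a human would call it).
items = [
    (0.95, "upset"), (0.90, "upset"), (0.80, "ok"),
    (0.70, "upset"), (0.55, "ok"),    (0.40, "upset"),
    (0.30, "ok"),    (0.10, "ok"),
]

def evaluate(threshold):
    flagged = [(s, label) for s, label in items if s >= threshold]
    tp = sum(1 for _, label in flagged if label == "upset")
    relevant = sum(1 for _, label in items if label == "upset")
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / relevant
    return precision, recall

for t in (0.9, 0.6, 0.2):
    p, r = evaluate(t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.9: precision=1.00, recall=0.50
# threshold=0.6: precision=0.75, recall=0.75
# threshold=0.2: precision=0.57, recall=1.00
```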

 

Our point is that it is not a simple proposition to wave your hands and say “pornography shall be filtered henceforth and forevermore”. First you have to get humans to consistently agree on what counts as pornography, then you have to build a classification scheme that balances precision and recall. If your goal is 100% recall (no porn ever allowed through), then precision is necessarily going to be low, which means there will be collateral damage in the form of non-porn content (educational material, for example) being filtered. If you try to limit that collateral damage, then some porn is going to slip through. It’s just how the math works.

 
