Last week, a talking point was born. The FBI gave the public their long-awaited “October Surprise.” But it didn’t last long in the news cycle before Director James Comey gave the all clear. In a matter of days, the FBI determined Hillary Clinton’s latest batch of emails contained nothing noteworthy. Although we’d argue this puzzlingly urgent email about Gefilte fish warrants at least one raised eyebrow (btw, thanks to Hillary I now know Gefilte fish is actually carp. Cool).
But the quick-to-tweet Donald Trump and his surrogates rejected the FBI’s analysis outright. His military analyst, General Michael Flynn, tweeted that it is impossible to scour 600,000+ emails in eight days. We’re not sure why a man with decades of experience within the US intelligence community thinks emails are such an impenetrable problem. So I decided to walk through the process and demystify it for General Flynn.
Edward Snowden, a former NSA contractor who is currently exiled in Moscow, was the first to call out General Flynn for his technical oversight. I’d like to unpack what Snowden said and add to it a bit. Soon, the methodology used by the FBI to analyze Clinton’s emails will seem glaringly obvious.
The Bureau discovered a lot of data on Clinton aide Huma Abedin’s computer. The FBI needed to know if there was any new evidence in these emails, specifically emails on that computer sent to or from Hillary Clinton’s email address.
So, to begin, data scientists pull out the emails with Clinton’s address in the To, From, or CC field. This alone narrows the pile down to the relevant conversations. The next step is to “hash” the data. A hashing algorithm turns an object (e.g. text) into a short number, subject to the constraint that if it’s given the same object again it’ll produce the same number. That number is called a hash code. They probably used MD5, where any string at all turns into 128 bits. That’s easier to work with than, say, a 4 KB document. If two MD5 codes differ, you know the objects they represent are different. If they’re the same, then they almost certainly represent the exact same data.
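To make the filter-then-hash step concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the FBI’s actual tooling: the dictionary layout of each email, the example addresses, the helper names `involves` and `unseen`, and the idea of a `known_hashes` set built from previously reviewed emails are all hypothetical.

```python
import hashlib


def involves(email, address):
    """True if `address` appears in the From, To, or CC field (case-insensitive)."""
    address = address.lower()
    participants = [email["from"], *email.get("to", []), *email.get("cc", [])]
    return any(p.lower() == address for p in participants)


def md5_hex(text):
    """128-bit MD5 digest of the text, written as 32 hex characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def unseen(emails, address, known_hashes):
    """Keep emails involving `address` whose body hash has not been seen before."""
    relevant = [e for e in emails if involves(e, address)]
    return [e for e in relevant if md5_hex(e["body"]) not in known_hashes]


# Hypothetical sample data.
emails = [
    {"from": "clinton@example.com", "to": ["aide@example.com"], "cc": [],
     "body": "Where are we on the fish?"},
    {"from": "aide@example.com", "to": ["friend@example.com"], "cc": [],
     "body": "Lunch on Friday?"},
    {"from": "aide@example.com", "to": ["other@example.com"],
     "cc": ["clinton@example.com"], "body": "Schedule for Monday attached."},
]

# Hashes of emails that were already reviewed (assumed to exist from earlier work).
known_hashes = {md5_hex("Where are we on the fish?")}

fresh = unseen(emails, "clinton@example.com", known_hashes)
for e in fresh:
    print(e["body"])
```

Looking up a hash in a set takes roughly constant time, so checking hundreds of thousands of emails against an already reviewed collection is a job measured in seconds, not days.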
The catch is that two nearly identical objects can be given completely different codes, because changing even one character changes the hash. For example, news editors might cut different paragraphs from a newswire story for length or clarity. We’ll analyze 20 or 40 different news stories across networks and papers on the same subject and find that 95 percent of the text is identical. An exact-match hash treats every one of those as a separate document, so it slips up on near-duplicates. But it’s an easily solvable problem.
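The newswire case can be sketched in a few lines of Python. The two sample stories are made up, and the word-shingle Jaccard similarity shown here is just one common way to measure near-duplication. It is swapped in for illustration and is not something the FBI or any particular product is known to use.

```python
import hashlib


def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a, b):
    """Share of k-word shingles the two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


# Hypothetical wire story and an edited copy with one sentence cut.
wire = ("The committee met on Tuesday to review the budget. "
        "Members voted to delay the decision. "
        "A final vote is expected next month.")
edited = ("The committee met on Tuesday to review the budget. "
          "A final vote is expected next month.")

print(md5_hex(wire) == md5_hex(edited))  # exact hashes no longer match
print(round(jaccard(wire, edited), 2))   # yet most of the wording is shared
```

The exact hashes disagree even though most of the wording is shared, which is why near-duplicates need either a similarity measure like this or a cleanup pass before hashing.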
The FBI surmounts problems like this by leveraging tech like that found in our email solution. Director Comey and his team just need to isolate relevant emails and rip out the headers, attachments, and those confidentiality statements at the bottom of the documents. You can go a step further and even excise all repeated text in an email thread. What’s left is the main text in the body of the emails. The FBI then runs this new information through an API, like Lexalytics’ Salience Engine. From here they can determine whether the emails that hashing could not match are actually new, and whether the batch as a whole is relevant to the State Department.
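The cleanup pass described above might look something like the following Python sketch. The rules are deliberately simple assumptions: headers end at the first blank line, quoted thread text starts with `>` or an “On … wrote:” line, and the footer begins with one of the phrases in the hypothetical `FOOTER_MARKERS` tuple. Attachments are not handled, and real email needs a proper MIME parser.

```python
import hashlib
import re

# Hypothetical phrases that mark the start of a confidentiality footer.
FOOTER_MARKERS = ("this email is confidential", "confidentiality notice")


def main_text(raw):
    """Return only the author's own words from a raw plain-text email."""
    # Headers end at the first blank line; keep what follows.
    # (If there is no blank line, the body is treated as empty.)
    _, _, body = raw.partition("\n\n")
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith(FOOTER_MARKERS):
            break  # everything from the footer onward is boilerplate
        if stripped.startswith(">"):
            continue  # quoted text repeated from earlier in the thread
        if re.match(r"^On .+ wrote:$", stripped):
            continue  # attribution line introducing quoted text
        kept.append(stripped)
    # Collapse whitespace so layout differences do not change the hash.
    return " ".join(" ".join(kept).split())


def body_hash(raw):
    return hashlib.md5(main_text(raw).encode("utf-8")).hexdigest()


original = (
    "From: a@example.com\nTo: b@example.com\nSubject: Fish\n\n"
    "Where are we on the fish?\n"
)
forwarded = (
    "From: b@example.com\nTo: c@example.com\nSubject: Fwd: Fish\n\n"
    "Where are we on the fish?\n\n"
    "> earlier quoted line\n\n"
    "Confidentiality notice: do not forward.\n"
)

print(body_hash(original) == body_hash(forwarded))
```

Once the headers, quoted text, and footer are gone, the two copies reduce to the same string, so a plain hash comparison catches them as duplicates again.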
Now, we’re not saying that Donald Trump and his friends on the campaign trail were saying something they knew to be untrue. The ‘magic’ of text analytics is something not many people know about. But General Flynn’s resume is peppered with credentials from his time in the US intelligence community. Someone of his caliber within the industry should take it as a given that the FBI analyzed those emails using these processes.