Tuning vs. Training, Nazar Amulets, and Sentiment

Lexalytics’ AI-based “natural language understanding” uses a hybrid of many different sorts of machine learning, combined with dictionaries and NLP code. We are not wedded to any one sort of technology (for example, we use “deep learning,” but we also use Bayesian networks). This is in marked contrast to all but a few of our competitors. It’s hard to do this; it’s much easier to focus on a single technology, like, say, supervised machine learning with neural networks.

Unsurprisingly, we believe that our approach has significant advantages.

One of these advantages is that we can both “tune” and “train” our system. Tuning means just reaching in and placing a line in a file (or turning a knob) that tells the system exactly what we want it to do. Training means gathering a set of examples, annotating them if necessary, and teaching the system through these examples.

Tune first, then train (if necessary). Tuning is fast and easy, and it can be done ahead of time. Training must be done with historical data, which has to be gathered from the real world first.
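To make the distinction concrete, here is a minimal Python sketch of "tune first, then train." It is an illustration only, not our production code: the names `TUNED`, `train`, and `build_lexicon`, the toy scores, and the rule that a tuned entry wins over a learned one are all assumptions made for this example.

```python
from collections import defaultdict

# Tuning: hand-written entries, "a line in a file". The scores are made up.
TUNED = {"great": 0.8, "awful": -0.8}


def train(examples):
    """Training: learn a score per token from labeled examples.

    Each example is (text, label) with label in [-1.0, 1.0]. A token's
    learned score is the mean label of the examples that contain it.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for text, label in examples:
        for token in set(text.lower().split()):
            totals[token] += label
            counts[token] += 1
    return {token: totals[token] / counts[token] for token in totals}


def build_lexicon(tuned, examples):
    """Combine both sources; tuned entries override learned ones (assumption)."""
    learned = train(examples)
    return {**learned, **tuned}
```

With `examples = [("great ride", 1.0), ("bumpy ride", -1.0)]`, training alone would score "great" at 1.0, but `build_lexicon(TUNED, examples)` keeps the tuned 0.8 and fills in "bumpy" and "ride" from the data. The tuned entries are available before a single example has been collected, which is the point of tuning first.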

As an example of this, Lexalytics was the first text mining company to understand emoji and be able to calculate emoji sentiment value as well as any semantic value carried therein. (For example, 😁 is positive, and ⚓ doesn’t carry any sentiment, but carries semantic/contextual information of “anchor,” “nautical,” and “boating.”)

With some interesting technical exceptions that we’ll probably discuss in a future blog post, it was relatively easy for us to support emoji in the first place, and recent events let us show off our capabilities.

The Unicode Consortium (long shall they live) have seen fit to grace the world with Emoji 11.0.

As a bit of background, “The Unicode Consortium is a non-profit corporation devoted to developing, maintaining, and promoting software internationalization standards and data, particularly the Unicode Standard, which specifies the representation of text in all modern software products and standards.” So, if you are reading something (like this blog post), you almost certainly have the folks over at the Unicode Consortium to thank.  As part of their responsibilities, they handle the international standards for emoji. Each device vendor is responsible for how the emoji looks on their device, but the basic “U+1F600 means 😀” comes from the Unicode Consortium.
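For the curious, the mapping between a code point and a character is easy to poke at. The short Python sketch below uses only the standard library (`chr` and `unicodedata.name`); the helper name `describe` is ours, invented for this example. The glyph you see depends on your device's font, but the code point and its name come from the Unicode data.

```python
import unicodedata


def describe(code_point: int) -> str:
    """Format a code point as 'U+XXXX <character> <official Unicode name>'."""
    character = chr(code_point)
    return f"U+{code_point:04X} {character} {unicodedata.name(character)}"


print(describe(0x1F600))  # U+1F600 😀 GRINNING FACE
```

One caveat: `unicodedata` only knows the Unicode version bundled with your Python build, so a brand-new emoji may raise `ValueError` on an older interpreter.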

Now we have a shiny new set of emoji to play with, including the Asian red gift envelope, petri dish, smiling face with 3 hearts, and the Nazar Amulet. Oddly, the smiling face with 3 hearts contains 4 hearts in total. There are over a hundred new emoji, so you should go check out the linked blog post above, or go to Emojipedia.

Considering the few samples I pulled out, the red gift envelope carries both semantic and sentiment information (“gift, money” and +0.5), the smiling face with 3 hearts carries both semantic and sentiment information (“love” and +0.75), and the petri dish only carries semantic information (“biology”). This is already of use to our partners, like social media monitoring company Falcon.io. Says Christopher Sugrue, Content & Communications Team Lead at Falcon,

“Last year we added an emoji-picker to our Publish and Engage products and it has quickly become one of our customers’ most beloved features. Emojis have become a critical part of everyday communication, and we’ll need the ability to process and analyze the new Emoji 11.0 the day they hit the market. What’s great about working with Lexalytics is that we know the capability will already be seamlessly built into their platform so we don’t have to worry about missing out on key insights for our customers.”

We now support all of these new emoji, months before any other company will. This gives customers time to look them over, fine-tune if necessary, implement and test the new tuning, and just generally make sure that they don’t have to play catch-up. Of course, we can train on actual data, and for some cases that will be useful. Those wacky teenagers and their non-standard emoji use mean that we can’t predict exactly how these are going to be used, but we can get pretty close: close enough to provide useful information starting on the very day any device manufacturer supports these new characters.
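As a rough picture of what "glance at them and fine-tune" could look like, here is a small Python sketch. The default values are the ones quoted earlier in this post (+0.5, +0.75, and no sentiment for the petri dish). Everything else is hypothetical: the dictionary layout, the `analyze` function, the override mechanism, and the choice to average the sentiment of the emoji found are assumptions for illustration, not a description of our API.

```python
# Default entries, using the sample values quoted earlier in this post.
# The data layout is hypothetical.
DEFAULTS = {
    "\U0001F9E7": {"tags": ["gift", "money"], "sentiment": 0.5},  # red gift envelope
    "\U0001F970": {"tags": ["love"], "sentiment": 0.75},  # smiling face with 3 hearts
    "\U0001F9EB": {"tags": ["biology"], "sentiment": None},  # petri dish
}


def analyze(text, overrides=None):
    """Collect semantic tags and an average sentiment for known emoji in text.

    `overrides` lets a customer replace a default entry (the tuning step).
    Averaging the scores is an assumption made for this sketch.
    """
    entries = {**DEFAULTS, **(overrides or {})}
    tags, scores = [], []
    for character in text:
        entry = entries.get(character)
        if entry is None:
            continue
        tags.extend(entry["tags"])
        if entry["sentiment"] is not None:
            scores.append(entry["sentiment"])
    sentiment = sum(scores) / len(scores) if scores else None
    return {"tags": tags, "sentiment": sentiment}
```

A customer whose audience uses the petri dish sarcastically could pass `overrides={"\U0001F9EB": {"tags": ["biology", "lab"], "sentiment": -0.25}}` on day one, and later replace that guess with a value trained on real data.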

