Wikipedia™ is a fantastic lexical analysis resource. It will probably be one of the most important elements in bringing up a true artificial intelligence, but first we have to figure out how machines can leverage Wikipedia's inherent semantic knowledge.
Lexalytics is the first to bring to market a semantic analysis technology based on Wikipedia™: the Lexalytics Concept Matrix.
The Concept Matrix is a foundational technology that Lexalytics is exposing through various features like Concept Topics and Facets. There are other uses that we have planned, but you're going to have to wait to find out about them.
If you'd like to stop reading now, you can think of it as the world's biggest thesaurus. That is a radical over-simplification, but it will get you most of the way toward understanding it.
To learn more, please read on:
A concept like "science" has many aspects that are explicitly stated in Wikipedia via linking and in-article mentions.
Wikipedia, at a semantic level, looks something like this:
Each article has a bunch of content and concepts in it - but, crucially, is about a single main topic. Each article links to other articles, with their own set of content, concepts, and crosslinks.
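This article-and-link structure is, in effect, a graph. As a rough sketch (the article names and links here are hypothetical, chosen only for illustration), it might be modeled like this:

```python
# A hypothetical sketch of Wikipedia's semantic structure: each article
# covers one main topic and links out to related articles, which have
# their own content and crosslinks.
wiki_links = {
    "Science": ["Scientific method", "Physics", "Knowledge"],
    "Physics": ["Science", "Matter", "Energy"],
    "Knowledge": ["Science", "Epistemology"],
}

def related_topics(topic, depth=1):
    """Collect topics reachable from `topic` within `depth` link hops."""
    frontier, seen = {topic}, {topic}
    for _ in range(depth):
        frontier = {n for t in frontier for n in wiki_links.get(t, [])} - seen
        seen |= frontier
    return seen - {topic}
```

Following links one hop from "Science" surfaces its directly stated aspects; following them two hops starts to pull in the broader neighborhood of the concept.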
We took the whole Wikipedia content dump and processed it for words and bi-grams:
So, now we have a Concept Matrix composed of 1.1 million words and bi-grams with 56 million links between them.
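The processing step above can be sketched in miniature. This is an assumption about the general shape of such a pipeline, not Lexalytics' actual implementation: tokenize article text into words, then pair adjacent words into bi-grams.

```python
import re

def words_and_bigrams(text):
    """Extract lowercase words and adjacent-word bi-grams from a text.
    A toy stand-in for processing the full Wikipedia content dump."""
    words = re.findall(r"[a-z]+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

words_and_bigrams("The scientific method")
# → ['the', 'scientific', 'method', 'the scientific', 'scientific method']
```

Run over every article, units like these become the rows of the matrix, and the articles' crosslinks supply the connections between them.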
This core concept analysis can be used to determine the semantic distance between any two pieces of text, and that simple yet fundamental comparison allows us to do a lot of very interesting things.
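One common way to turn "semantic distance" into a number (an illustration of the general idea, not necessarily the method Lexalytics uses) is to represent each piece of text as a vector of concept weights and compare the vectors with cosine similarity:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse concept-weight vectors (dicts)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical concept weights for two short texts:
doc1 = {"science": 0.9, "experiment": 0.6}
doc2 = {"science": 0.7, "laboratory": 0.5}

distance = 1.0 - cosine_similarity(doc1, doc2)  # smaller = more related
```

Texts that share many weighted concepts score close to 1.0 in similarity (distance near 0), while texts with no concepts in common score 0 — the basic comparison that underlies features like Concept Topics and Facets.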