Low Level Text Operations

Unless you're an NLP expert, you almost certainly aren't going to want to read this page.

But! If you are, we've got you covered. Read on for all the low-level functions you have access to that will help you rapidly deploy your system.

The following are the important functions handled by the text preparation stage of the Lexalytics Salience Engine.

  • Text Validation

  • String Substitution

  • Tokenization

  • Part-of-Speech Tagging

  • Stemming

A few important aspects of each section (to add to the above):

  • Text Validation: these are basically safety functions

    • Too big? If you feed it a terabyte of text, it will valiantly do its best to parse the content, but you're going to want to know how long it will take to process that single, terabyte-sized document.
    • Is at least 70% of the content alphanumeric? This threshold is programmer-modifiable. We've found that for long content this is a good ratio, but if you're dealing with a lot of ugly Twitter content, you may want to lower it a bit so that "OMG!!!!! I L0VE!!!!?? FRUITY!!!! LOOPS!!!!!" has a chance of parsing correctly. On the other hand, you may not. In general, 70% works just fine. (See the ratio-check sketch after this list.)
  • String Substitution: generally used for dealing with consistent errors in pre-processing that generate dirty content

    • Can be used to replace acronyms with fully expanded text

    • Replace any tags that your HTML scraper missed (see the substitution sketch after this list)

  • Tokenization: Used to break the document into meaningful chunks. It's important for us to work with things like sentence boundaries.

    • Sentence boundaries: You could use the output of the tokenizer to break your document into smaller chunks for your own use, if you desire - otherwise just let it do its thing. (So, if you wanted to do sentence-by-sentence sentiment analysis, and treat each sentence as its own document, you could do that. You'd lose the value of the surrounding sentences/lexical chains, but you could do it. See the sentence-splitting sketch after this list.)

  • Part-of-Speech Tagging: Reliable, robust tagging of parts of speech. (A tagging sketch, using NLTK as a stand-in, appears after this list.)

  • Stemming: We use a relatively light stemming algorithm based on the Krovetz stemmer. (A toy sketch follows the list.)
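
To make the validation check concrete, here's a minimal sketch of the alphanumeric-ratio test in plain Python. This is not the Salience API; the function name and the decision to ignore whitespace are assumptions for illustration, with the 70% default mirroring the ratio described above.

```python
def is_mostly_alphanumeric(text: str, min_ratio: float = 0.70) -> bool:
    """Return True if at least min_ratio of the non-whitespace
    characters in text are alphanumeric."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False  # empty or whitespace-only text fails validation
    alnum = sum(1 for c in chars if c.isalnum())
    return alnum / len(chars) >= min_ratio
```

By this measure, the tweet above is only about 49% alphanumeric, so it fails the 70% default; you'd have to drop min_ratio to roughly 0.45 before it would pass.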
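
String substitution amounts to running an ordered list of find-and-replace rules over the raw text before anything else sees it. A sketch, assuming a simple regex rule table (the entries are made-up examples, not engine defaults):

```python
import re

# Hypothetical rule table: (pattern, replacement), applied in order.
SUBSTITUTIONS = [
    (re.compile(r"\bIRS\b"), "Internal Revenue Service"),  # expand an acronym
    (re.compile(r"<[^>]+>"), " "),                         # strip tags the scraper missed
    (re.compile(r"&amp;"), "&"),                           # undo a stray HTML entity
]

def apply_substitutions(text: str) -> str:
    for pattern, replacement in SUBSTITUTIONS:
        text = pattern.sub(replacement, text)
    return text
```

Since the rules run in order, put the most specific patterns first so a broad rule doesn't consume text a later rule was meant to match.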
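
If you do want to go sentence-by-sentence, the loop looks like this. The sketch uses NLTK's sentence tokenizer as a stand-in for the engine's own tokenizer, and the per-sentence scoring call is left as a comment because it depends on which sentiment API you're calling:

```python
import nltk
nltk.download("punkt", quiet=True)  # one-time sentence-model download
from nltk.tokenize import sent_tokenize

text = "The camera is fantastic. The battery, though, is a disaster."
for sentence in sent_tokenize(text):
    # Treat each sentence as its own document and score it here, e.g.:
    # score = my_sentiment_call(sentence)  # hypothetical call
    print(sentence)
```

Keep the trade-off above in mind: scored in isolation, each sentence loses the context of its neighbors and any lexical chains that span them.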
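
For a feel of what part-of-speech output looks like, here's a quick example, again with NLTK standing in for the Salience tagger:

```python
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The engine tags each token with its part of speech.")
print(nltk.pos_tag(tokens))
# Something like: [('The', 'DT'), ('engine', 'NN'), ('tags', 'VBZ'), ...]
```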
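
Finally, a toy illustration of what "light" stemming means: strip only common inflections (plurals, -ing, -ed) and leave derivational suffixes alone. The actual Krovetz stemmer is dictionary-backed and repairs the stems it produces; this sketch has no dictionary, so its rough edges are deliberate:

```python
def light_stem(word: str) -> str:
    """Toy inflectional stemmer -- NOT the actual Krovetz algorithm."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"   # parties -> party
    if w.endswith("es") and len(w) > 3:
        return w[:-2]         # boxes -> box
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]         # loops -> loop
    if w.endswith("ing") and len(w) > 5:
        return w[:-3]         # parsing -> pars (a dictionary would repair this)
    if w.endswith("ed") and len(w) > 4:
        return w[:-2]         # tagged -> tagg (likewise)
    return w
```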