Every little bit helps

Customers want to process more and more content (and more and more content is available), and they want to do it as close to real time as possible, so the throughput of our various components is something we are very interested in. Of course, customers always want more features as well (Pronominal CoReference, Sentiment Concepts etc.), which require more complex engineering solutions (grammatical engines), which in turn require more processing time. Trading off this increased complexity against the need to keep document throughput up can be a challenging task. My internal metric has always been the ability to process 2 normal-sized (1000-1500 words) documents a second in a single thread. If we can keep to that, then scaling becomes a hardware problem: add more cores and you get more throughput. With all the changes we’ve made over the last couple of releases, though, it was time to see if we were still keeping up with that. To this end, last week I had one of our engineers profile the various bits and pieces that make up Salience internally, and we came up with some interesting conclusions:
    • Machines based around the Core 2 architecture completely trounce the Pentium 4 architecture with Salience. A 3.4GHz Pentium D was only 20% faster than a 1.66GHz Core 2 machine and was slower than a 2.5GHz Xeon-based machine.
    • Perhaps not surprisingly, the 64-bit versions of the software were substantially faster than the 32-bit versions, with a document taking 338ms on a 64-bit platform compared with 458ms on a 32-bit platform.
    • Linux and Windows performance was pretty much the same, so no adding to platform debates there :)
The testing also identified (as we hoped it would) a couple of areas where we can make some more speed improvements in the upcoming release, so we should be able to get the numbers down further by the time 4.1 goes out the door. And as for the internal metric? Well, as you can see from the numbers above, even before we do any more performance enhancements we are still at the 2 docs a second mark, and on the 64-bit platform we are nearly at 3, which all in all makes me happy, and a happy Mike is a good thing.
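For anyone who wants to check the arithmetic, here is a rough sketch in Python that turns the per-document timings quoted above into single-thread throughput. The function names are made up for illustration and are not part of Salience, and the multi-core figure assumes perfectly linear scaling, which is a best case rather than something we measured.

```python
# Per-document latencies quoted in this post, in milliseconds.
LATENCY_MS = {"64-bit": 338, "32-bit": 458}

# The single-thread target described in this post.
TARGET_DOCS_PER_SEC = 2.0


def docs_per_second(latency_ms: float) -> float:
    """Convert a per-document latency into single-thread throughput."""
    return 1000.0 / latency_ms


def ideal_throughput(latency_ms: float, cores: int) -> float:
    """Best-case throughput across cores.

    Assumes perfectly linear scaling (one thread per core, no contention).
    This is an illustrative assumption, not a measured result.
    """
    return cores * docs_per_second(latency_ms)


if __name__ == "__main__":
    for platform, ms in LATENCY_MS.items():
        rate = docs_per_second(ms)
        status = "meets" if rate >= TARGET_DOCS_PER_SEC else "misses"
        print(f"{platform}: {rate:.2f} docs/sec ({status} the target)")
```

On the numbers above, that works out to roughly 2.96 docs a second on 64-bit and 2.18 on 32-bit, so both clear the 2 docs a second bar.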
Categories: Technology, Text Analytics