Tuesday, December 4, 2018

OCR and Text Enrichment at Scale

Have you ever needed to copy information from a document such as a business card or a paper form, or to copy part of a PDF presentation? Or to extract value from the many documents lying dormant somewhere in your storage repositories? The need to process documents is increasingly common as the world moves further towards Big Data.

Optical Character Recognition (OCR) algorithms have evolved and can now answer most of those needs. The next challenges lie in the subsequent steps, such as enriching the extracted content and indexing it.

A solution like QWAM Text Analytics enables you to extract key elements and indicators from text of any nature (web crawls, documents from your repositories, your RSE), in any format (OCR output, PDF, Office, HTML, plain text), and in four languages (English, French, German, Spanish).

Key indicators are:

  • Named entities: Persons, Companies, Organizations, Events, Locations, Countries, Regions, Cities, Products, Objects, whether they are simply cited in the document or are its main subject
  • Concepts: text expressions that are frequent on the web, frequent in any category (Business, Politics, Environment, Technology, Justice, etc.), or simply frequent in your corpus
  • Relations between entities: a Company buys another Company, a Company hires a Person, a Company takes part in an Event
  • Sentiment analysis: sentiment about your products or services (Voice of the Customer), about life inside your company (Voice of the Employee), or about any topic in your business

In this article, we will go through a complete use case: document download, OCR processing in a distributed architecture, and text enrichment with QWAM Text Analytics, which is also parallelized. This processing will be done at scale, using the Activeeon scheduler for the distribution.
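
To make the shape of this use case concrete, here is a minimal sketch of the three stages (download, OCR, enrichment) applied to several documents in parallel. Everything in it is an assumption made for illustration: the functions `download`, `run_ocr` and `enrich` are hypothetical stand-ins, not the QWAM Text Analytics or Activeeon APIs, and a local thread pool takes the place of the scheduler that distributes the work.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins: none of these are QWAM or Activeeon APIs.
def download(url: str) -> bytes:
    """Pretend to fetch a document; the 'content' is derived from the URL."""
    return f"scanned content of {url}".encode("utf-8")


def run_ocr(document: bytes) -> str:
    """Pretend OCR step: turn the raw bytes into plain text."""
    return document.decode("utf-8")


def enrich(text: str) -> dict:
    """Pretend enrichment step: attach a trivial indicator (word count)."""
    return {"text": text, "word_count": len(text.split())}


def process(url: str) -> dict:
    """Chain the three stages for a single document."""
    return enrich(run_ocr(download(url)))


def run_pipeline(urls, workers=4):
    """Process documents in parallel; results keep the order of `urls`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, urls))


if __name__ == "__main__":
    for result in run_pipeline(["a.pdf", "b.pdf", "c.pdf"]):
        print(result)
```

In a real deployment each stage would be a separate task handed to the scheduler rather than a local function call, so that OCR and enrichment can run on different nodes.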