Tuesday, December 4, 2018

OCR and Text Enrichment at Scale

Have you ever needed to copy information from a document, such as a business card, a paper form, or part of a PDF presentation? Or to extract value from the piles of documents sleeping somewhere in your storage repositories? The need to process documents is increasingly common as the world moves towards ever more Big Data.

Optical Character Recognition (OCR) algorithms have evolved and can now answer most of those needs. The next challenges lie in the steps that follow, such as enriching the content and indexing it.

A solution like QWAM Text Analytics enables you to extract key elements and indicators from text of any origin (Web crawl, documents from your repositories, your enterprise social network), in any format (OCR, PDF, Office, HTML, plain text), in four languages (English, French, German, Spanish).

Key indicators are:

  • Named entities: Persons, Companies, Organizations, Events, Locations, Countries, Regions, Cities, Products, Objects, whether they are simply cited in the document or are its main subject
  • Concepts: Text expressions that are frequent on the web, frequent in a given category (Business, Politics, Environment, Technology, Justice, ...), or simply frequent in your corpus
  • Relations between entities: a Company buys another Company, a Company hires a Person, a Company takes part in an Event
  • Sentiment analysis: Sentiment on your products or services (Voice of the customer), on life inside your company (Voice of the employee), or any topic in your business

In this article, we will go through a complete use case: document download, OCR processing in a distributed architecture, and text enrichment with QWAM Text Analytics, which is also parallelized. The processing is done at scale, using the Activeeon scheduler for distribution.

Workflow Description

Step 1: Download and Prepare

The first task is to select a set of images with text to extract and process.

Step 2: The Replication / Distribution Structure

The steps between OCR_task and end_replicate are part of the process that will be replicated on each image. Thanks to ProActive, an Activeeon solution, the parallelization of the whole process is handled natively and at scale. The user will have access to an intuitive workflow studio and a scalable environment.
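The replicate-then-merge structure can be pictured with a small local sketch. This is not the ProActive API: it uses Python's standard thread pool as a stand-in for the scheduler, and `process_image` and the file names are hypothetical placeholders for the replicated branch and its inputs.

```python
from concurrent.futures import ThreadPoolExecutor


def process_image(image_name: str) -> dict:
    """Stand-in for the replicated branch (OCR, then enrichment) on one image."""
    # Hypothetical placeholder: a real branch would run OCR and enrichment here.
    return {"image": image_name, "text": f"text extracted from {image_name}"}


def replicate_and_merge(image_names: list, max_workers: int = 4) -> list:
    """Fan out one branch per image, then gather the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of image_names in its results.
        return list(pool.map(process_image, image_names))


results = replicate_and_merge(["card_01.png", "form_02.png", "slide_03.png"])
```

In the workflow itself, the scheduler plays the role of the pool and distributes each branch to the available compute nodes rather than to local threads.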

Step 3: OCR and Enrichment

Every replicated process goes through several steps. The first one, OCR_task, extracts the text from a page.

  • The visualize_image branch is used for debugging or visualizing the image being processed
  • qwamci_call and qwamci_result enrich the extracted text
  • The text_enrichment branch provides enrichment from another third-party solution
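The shape of one replicated branch (OCR first, then several independent enrichments of the same text) might be sketched as follows. All names here are assumptions made for illustration: `fake_ocr` and `capitalized_words` are toy stand-ins, not the OCR engine or the QWAM Text Analytics API.

```python
from typing import Callable, Dict


def run_branch(
    image_bytes: bytes,
    ocr: Callable[[bytes], str],
    enrichers: Dict[str, Callable[[str], object]],
) -> dict:
    """Run OCR once, then pass the same text to every enrichment branch."""
    text = ocr(image_bytes)
    enrichments = {name: enrich(text) for name, enrich in enrichers.items()}
    return {"text": text, "enrichments": enrichments}


# Hypothetical stand-ins so the sketch runs without any external service.
def fake_ocr(image_bytes: bytes) -> str:
    return image_bytes.decode("utf-8")


def capitalized_words(text: str) -> list:
    # Toy enrichment: keep the words that start with an uppercase letter.
    return [word for word in text.split() if word[:1].isupper()]


branch_result = run_branch(
    b"Alice joined Acme in Paris",
    ocr=fake_ocr,
    enrichers={
        "capitalized": capitalized_words,
        "word_count": lambda t: len(t.split()),
    },
)
```

Passing the enrichers as a dictionary keeps each enrichment independent of the others, which mirrors the separate branches of the workflow.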

Step 4: Merge Results and Store

The final tasks simply gather the results from all replicated processes. Depending on the enrichment, we could also store this information in an Elasticsearch database for future searches.
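As a rough sketch of the storage step, the merged results could be serialized into the newline-delimited JSON body that the Elasticsearch bulk endpoint accepts (one action line followed by one document line). The index name `documents`, the positional ids, and the sample records are assumptions; sending the payload over HTTP is left out.

```python
import json


def to_bulk_payload(results: list, index_name: str = "documents") -> str:
    """Build an NDJSON bulk body: an action line, then the document itself."""
    lines = []
    for doc_id, result in enumerate(results):
        # Action line telling Elasticsearch where to index the next document.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(doc_id)}}))
        lines.append(json.dumps(result))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Hypothetical merged results from two replicated branches.
merged = [
    {"image": "card_01.png", "text": "Alice, Acme"},
    {"image": "form_02.png", "text": "Paris office"},
]
payload = to_bulk_payload(merged)
```

Such a body would typically be posted to the `_bulk` endpoint with the `application/x-ndjson` content type.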

QWAM Text Analytics is a great companion for Elasticsearch, as French speakers can see here or there.

Conclusion

Overall, these workflows are quite standard and widely used in industries that need to index and search large quantities of documents. Possible use cases include:

  • Extracting value from your old documents in a search application
  • Producing facets to enable faceted search and enhance the user search experience with interactive drill-downs on the search results.
  • Merging structured data from an RDBMS with unstructured text that has been made structured.
  • Performing business analytics on your contents to include texts in your big data strategy.
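To make the faceted search idea concrete, here is a minimal in-memory sketch. It assumes each enriched document carries a list of extracted company names under a hypothetical `companies` field; a search engine would compute the same counts with its own aggregation features.

```python
from collections import Counter


def facet_counts(documents: list, field: str) -> Counter:
    """Count how many documents mention each value of the given field."""
    counts = Counter()
    for doc in documents:
        # set() so a value repeated inside one document is counted once.
        counts.update(set(doc.get(field, [])))
    return counts


def drill_down(documents: list, field: str, value: str) -> list:
    """Keep only the documents whose field contains the selected facet value."""
    return [doc for doc in documents if value in doc.get(field, [])]


# Hypothetical enriched documents.
docs = [
    {"title": "a", "companies": ["Acme", "Globex"]},
    {"title": "b", "companies": ["Acme"]},
    {"title": "c"},
]
company_facets = facet_counts(docs, "companies")
globex_docs = drill_down(docs, "companies", "Globex")
```

The counts feed the facet panel shown next to the search results, and selecting a value narrows the result list in the way `drill_down` does.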

The digitization of paper documents is also an opportunity for individuals and businesses to save space, to protect documents against theft, fire, and flood, and to add a great amount of value to their information systems.

Activeeon offers a solution that enables users to run their processes at scale. On the one hand, with a user-friendly studio interface, business-line users can quickly create workflows that scale. On the other hand, with a resource manager integrated into the solution, access to compute resources is handled with elastic provisioning when necessary.

What are your text analysis use cases that need scale? Let us know in the comments below or send us an email at contact@activeeon.com
