At a high level, OCR has two main tasks:
- Text detection: detect the portions of text in the image (word-level or character-level detection)
- Text transcription: convert an image into a sequence of characters
Common problems with OCR are:
- Tesseract and most other OCR solutions transcribe text from left to right. If the information we want to extract does not follow that reading order (for example, key/value extraction rules often assume that the value sits immediately to the right of its key), we cannot extract the text correctly.
- Complex entities that span multiple lines are hard to extract.
- Some information cannot be extracted by applying rules after text transcription.
- The quality of the OCR transcription itself can be poor (spelling mistakes, special characters that are not recognised).
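One way to work around the reading-order problem is to pair keys and values by position rather than by transcription order. The sketch below assumes the OCR engine returns word-level bounding boxes; the `Word` class, the `value_right_of` helper, and the sample coordinates are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    x: int  # left edge in pixels
    y: int  # top edge in pixels
    w: int  # width in pixels
    h: int  # height in pixels


def value_right_of(key: str, words: List[Word]) -> Optional[str]:
    """Return the nearest word to the right of `key` on the same text line."""
    anchor = next((wd for wd in words if wd.text == key), None)
    if anchor is None:
        return None
    anchor_mid = anchor.y + anchor.h / 2
    candidates = [
        wd for wd in words
        if wd is not anchor
        and wd.x >= anchor.x + anchor.w        # starts to the right of the key
        and wd.y <= anchor_mid <= wd.y + wd.h  # overlaps the key's vertical centre
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda wd: wd.x).text


# A two-column form: the transcription order interleaves both columns,
# so "the next token after the key" would be wrong for "Name:".
words = [
    Word("Name:", 10, 10, 50, 12),
    Word("Invoice:", 300, 10, 60, 12),
    Word("Alice", 70, 11, 40, 12),
    Word("INV-7", 370, 9, 45, 12),
]

print(value_right_of("Name:", words))
print(value_right_of("Invoice:", words))
```

A rule like this still breaks when the value sits below the key or spans several lines, which is where layout detection helps.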
Most of the above problems can be solved if we recognise the layout of the document. Understanding the areas of the document, such as paragraphs, titles, headers, tables, images, or any other custom entity of your choice, improves the way we extract the text. This brings us to the concept of Region of Interest extraction.
Region of Interest Detection:
Document layout analysis is the process of locating and categorising regions of interest on a picture or scanned image of a page. Broadly, most approaches can be distilled into page segmentation and logical structural analysis. Page segmentation methods focus on appearance and use visual cues to partition pages into distinct regions; the most common are text, figures, images, and tables. In contrast, logical structural analysis focuses on providing finer-grained semantic classifications for these regions, i.e. identifying a region of text that is a paragraph and distinguishing that from a caption or document title.
Even the quality of the OCR transcription improves when you pass a sub-image (Region of Interest) instead of the whole image.
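Passing a sub-image amounts to cropping the detected bounding box, optionally with a small margin, before calling the OCR engine. A minimal sketch, assuming the image is a row-major list of pixel rows and the box is `(x_min, y_min, x_max, y_max)` with exclusive maximums; `crop_region` and the `pad` parameter are illustrative names, not part of any library.

```python
def crop_region(image, box, pad=0):
    """Crop `box` from `image`, adding `pad` pixels of margin clamped to the image bounds."""
    height, width = len(image), len(image[0])
    x_min, y_min, x_max, y_max = box
    x0, y0 = max(x_min - pad, 0), max(y_min - pad, 0)
    x1, y1 = min(x_max + pad, width), min(y_max + pad, height)
    return [row[x0:x1] for row in image[y0:y1]]


# Toy 4x6 "image" whose pixel value encodes its own row and column.
image = [[r * 10 + c for c in range(6)] for r in range(4)]

roi = crop_region(image, (1, 1, 3, 3))
print(roi)
```

With a NumPy array the same crop is the slice `image[y0:y1, x0:x1]`, and the resulting sub-image is what would be handed to the OCR engine.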
Approaches for Document Layout Analysis:
- Computer vision based approaches (Object detection, Image Segmentation)
- NLP-based approaches (Masked Visual Language Modelling, BERTgrid, LayoutLM, Chargrid)
Computer vision-based approaches:
- Image Segmentation: Segmentation is a method for identifying the different sub-segments or sub-objects in an image. The goal of segmentation is to change the representation of an image into something simpler to analyse further.
Steps to identify the different text blocks in an image:
Binarization: Convert the image to grayscale (and typically threshold it to black and white), as we don't need 3 colour channels to represent text content.
Edge detection: Detect the edges of each character.
Dilation: Dilation thickens the edges so that nearby characters and words overlap and merge into a solid mass.
Contour detection: Contours are curves joining continuous points that have the same colour or intensity. The bounding box of each contour gives one text block.
Changing the number of iterations of the dilate operation and the kernel size changes how the bounding boxes are detected.
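The steps above can be sketched in plain Python on a toy page. This is an illustration under stated assumptions, not a production pipeline: the edge-detection step is skipped because the toy image is already clean, contour detection is approximated by 4-connected component labelling, and all function names are made up for this example. In practice OpenCV provides `cv2.threshold`, `cv2.dilate`, and `cv2.findContours` for the same operations.

```python
def binarize(gray, threshold=128):
    """Dark pixels (ink) become 1, light background becomes 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def dilate(mask, kernel_w=3, kernel_h=1, iterations=1):
    """Grow each foreground pixel into a kernel_w x kernel_h rectangle centred on it."""
    h, w = len(mask), len(mask[0])
    rx, ry = kernel_w // 2, kernel_h // 2
    for _ in range(iterations):
        out = [[0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                if mask[y][x]:
                    for yy in range(max(0, y - ry), min(h, y + ry + 1)):
                        for xx in range(max(0, x - rx), min(w, x + rx + 1)):
                            out[yy][xx] = 1
        mask = out
    return mask


def bounding_boxes(mask):
    """Return (x_min, y_min, x_max, y_max) for every 4-connected blob, in scan order."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            stack = [(x, y)]
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while stack:
                cx, cy = stack.pop()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < w and 0 <= ny < h and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


# Toy page: '#' is ink. Row 0 holds two "words", row 2 holds one.
PAGE = [
    "#.#.#....#.#",
    "............",
    "#.#.........",
]
gray = [[0 if ch == "#" else 255 for ch in row] for row in PAGE]
mask = binarize(gray)

char_boxes = bounding_boxes(mask)                           # no dilation
word_boxes = bounding_boxes(dilate(mask, iterations=1))     # characters merge into words
line_boxes = bounding_boxes(dilate(mask, iterations=2))     # words merge into lines
print(len(char_boxes), len(word_boxes), len(line_boxes))
```

On this toy page the box count drops from 7 (characters) to 3 (words) to 2 (lines) as the number of iterations grows, which is the effect the kernel size and iteration count have on real documents.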
Sometimes we also apply erosion followed by dilation.
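Erosion followed by dilation is usually called morphological opening; it removes small specks of noise while roughly preserving larger shapes. A minimal sketch with a square kernel, assuming pixels outside the image count as background (a simplification chosen for this example); the helper names are illustrative.

```python
def _morph(mask, size, hit):
    """Slide a size x size window over the mask; `hit` is `any` (dilate) or `all` (erode)."""
    h, w = len(mask), len(mask[0])
    r = size // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                mask[yy][xx] if 0 <= yy < h and 0 <= xx < w else 0
                for yy in range(y - r, y + r + 1)
                for xx in range(x - r, x + r + 1)
            ]
            row.append(1 if hit(window) else 0)
        out.append(row)
    return out


def erode(mask, size=3):
    """Keep a pixel only if its whole neighbourhood is foreground."""
    return _morph(mask, size, all)


def dilate(mask, size=3):
    """Set a pixel if any pixel in its neighbourhood is foreground."""
    return _morph(mask, size, any)


def opening(mask, size=3):
    """Erosion followed by dilation."""
    return dilate(erode(mask, size), size)


def to_mask(rows):
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


NOISY = [
    "........",
    ".###....",
    ".###..#.",  # the lone '#' on the right is a noise speck
    ".###....",
    "........",
]
CLEAN = [
    "........",
    ".###....",
    ".###....",
    ".###....",
    "........",
]

cleaned = opening(to_mask(NOISY))
print(cleaned == to_mask(CLEAN))
```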
- DLA as object detection: Document layout analysis (DLA) can be treated as a custom object detection task on the document image.
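When DLA is framed as object detection, the model's output is a list of labelled boxes with confidence scores, and downstream code picks the regions it cares about. The sketch below assumes a hypothetical `Region` record and made-up detections; real libraries return their own result types, so treat this only as an outline of the post-processing.

```python
from typing import List, NamedTuple, Set, Tuple


class Region(NamedTuple):
    label: str
    score: float
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def select_regions(detections: List[Region], wanted: Set[str], min_score: float = 0.5) -> List[Region]:
    """Keep confident detections of the wanted classes, ordered top-to-bottom, then left-to-right."""
    kept = [d for d in detections if d.label in wanted and d.score >= min_score]
    return sorted(kept, key=lambda d: (d.box[1], d.box[0]))


# Made-up detector output for one page, in arbitrary order.
detections = [
    Region("table", 0.91, (40, 400, 560, 700)),
    Region("title", 0.97, (40, 30, 560, 80)),
    Region("paragraph", 0.42, (40, 90, 560, 120)),   # low confidence, dropped
    Region("paragraph", 0.88, (40, 130, 560, 380)),
]

text_regions = select_regions(detections, {"title", "paragraph"})
print([r.label for r in text_regions])
```

Each selected box can then be cropped and sent to OCR on its own.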
Some of the pretrained DLA models are:
Layout Parser (library): based on Facebook's Detectron2.
Layout Parser is a deep learning based tool for document image layout analysis tasks. Use pip or conda to install the…
Training custom objects:
The scripts for training Detectron2-based Layout Models on popular layout analysis datasets …
Monk object detection is a collection of all object detection pipelines. The benefit is two-fold for each pipeline: installation is made compatible with multiple OSes, CUDA versions, and Python versions, and usage is low-code with a standardized flow. Monk object detection enables a user to solve a computer vision problem in very few lines of code. For this task, we'll be using 3 different pipelines of this library for 3 different architectures: yolov3, gluoncv_finetune, and mxrcnn.
Deep learning models that take a document image file as input, locate the position of paragraphs, lines, images, etc…
Tools for labelling: LabelImg can export your labelled data directly in formats such as YOLO and Pascal VOC.
LabelImg is a graphical image annotation tool. It is written in Python and uses Qt for its graphical interface…
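The two export formats mentioned above describe the same box differently: Pascal VOC stores absolute pixel corners `(x_min, y_min, x_max, y_max)`, while YOLO stores the box centre and size normalised by the image dimensions. A small conversion sketch follows; the function names are illustrative and the class index that YOLO label files also carry is left out.

```python
def voc_to_yolo(box, img_w, img_h):
    """Pixel corners (x_min, y_min, x_max, y_max) -> normalised (x_center, y_center, width, height)."""
    x_min, y_min, x_max, y_max = box
    return (
        (x_min + x_max) / 2 / img_w,
        (y_min + y_max) / 2 / img_h,
        (x_max - x_min) / img_w,
        (y_max - y_min) / img_h,
    )


def yolo_to_voc(box, img_w, img_h):
    """Normalised (x_center, y_center, width, height) -> pixel corners, rounded to whole pixels."""
    xc, yc, w, h = box
    return (
        round((xc - w / 2) * img_w),
        round((yc - h / 2) * img_h),
        round((xc + w / 2) * img_w),
        round((yc + h / 2) * img_h),
    )


voc_box = (100, 200, 300, 400)          # a labelled paragraph on a 1000x800 page
yolo_box = voc_to_yolo(voc_box, 1000, 800)
print(yolo_box)
print(yolo_to_voc(yolo_box, 1000, 800))
```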
NLP based approaches:
Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the widespread use of pretraining models for NLP applications, they almost exclusively focus on text-level manipulation, while neglecting layout and style information that is vital for document image understanding. In this paper, we propose the LayoutLM to jointly model interactions between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents. Furthermore, we also leverage image features to incorporate words’ visual information into LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for document-level pretraining. It achieves new state-of-the-art results in several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification (from 93.07 to 94.42).
LayoutLM - transformers 4.4.2 documentation
The LayoutLM model was proposed in the paper LayoutLM: Pre-training of Text and Layout for Document Image Understanding…
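LayoutLM takes a bounding box for every token alongside the text, and the Hugging Face documentation asks for those boxes to be normalised to a 0-1000 scale relative to the page size. A minimal sketch of that normalisation, with made-up page dimensions and word boxes:

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel box (x_min, y_min, x_max, y_max) to the 0-1000 range LayoutLM expects."""
    x_min, y_min, x_max, y_max = box
    return [
        int(1000 * x_min / page_width),
        int(1000 * y_min / page_height),
        int(1000 * x_max / page_width),
        int(1000 * y_max / page_height),
    ]


# Hypothetical OCR output for an 850x1100 pixel page.
words = ["Invoice", "Date:"]
pixel_boxes = [(85, 110, 340, 132), (425, 110, 510, 132)]

layout_boxes = [normalize_box(b, 850, 1100) for b in pixel_boxes]
print(list(zip(words, layout_boxes)))
```

The words and their normalised boxes are then passed to the model together, so that it can learn from position as well as text.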
Finetuning on a specific dataset:
This repository contains demos I made with the Transformers library by HuggingFace. - NielsRogge/Transformers-Tutorials
Summarising all the advantages of Region of Interest Detection:
- Any reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order
- Detect duplicate content in the same document.
- Eases the document classification process by letting the model focus on the important content in the document.
- Improves the extraction accuracy of OCR by narrowing the region of interest.
- Helps in entity linking and post-processing (for example, if you are deriving Age and DOB from a document, having labelled entities lets you validate one against the other).
- Very useful when rules cannot be written for information extraction.
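The Age/DOB validation mentioned in the list can be sketched as a cross-check between two labelled entities. The example assumes the date of birth has already been parsed into ISO `YYYY-MM-DD` form and that the reference date is supplied by the caller; both helper names are hypothetical.

```python
from datetime import date


def age_on(dob: date, today: date) -> int:
    """Completed years between `dob` and `today`."""
    years = today.year - dob.year
    if (today.month, today.day) < (dob.month, dob.day):
        years -= 1  # birthday has not happened yet this year
    return years


def age_matches_dob(age_text: str, dob_text: str, today: date) -> bool:
    """Return True when the extracted Age agrees with the extracted DOB."""
    try:
        dob = date.fromisoformat(dob_text)
        age = int(age_text)
    except ValueError:
        return False  # an unparsable field is treated as a failed validation
    return age == age_on(dob, today)


print(age_matches_dob("34", "1990-06-15", date(2024, 6, 20)))
print(age_matches_dob("34", "1990-06-15", date(2024, 6, 10)))
```

A failed check flags the document for review, since it usually means one of the two fields was misread by the OCR step.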