Document Layout Analysis

OCR at a high level has two main tasks:

  • Text detection — locate the portions of text in the image (word-level or character-level detection)
  • Text transcription — convert an image region into a sequence of characters

Common problems with OCR are:

  • Tesseract, like most OCR solutions, transcribes text from left to right. If the information we want to extract does not follow that reading order (for example, in key/value extraction the value should always sit to the right of the key), we cannot extract the text correctly.
  • Extracting complex entities that span multiple lines is difficult.
  • Some information cannot be extracted with rules applied after text transcription.
  • The quality of the OCR transcription itself can be poor (spelling mistakes, failure to recognise special characters).

Most of the above problems can be solved if we recognise the layout of the document. Understanding the areas in a document (paragraphs, titles, headers, tables, images, or any other custom entity of your choice) improves the way we extract the text. This brings us to the concept of Region of Interest extraction.

Region of Interest Detection:

Document layout analysis is the process of locating and categorising regions of interest on a picture or scanned image of a page. Broadly, most approaches can be distilled into page segmentation and logical structural analysis. Page segmentation methods focus on appearance and use visual cues to partition pages into distinct regions; the most common are text, figures, images, and tables. In contrast, logical structural analysis focuses on providing finer-grained semantic classifications for these regions, i.e. identifying a region of text that is a paragraph and distinguishing that from a caption or document title.

Even the quality of the OCR transcription improves when you pass a sub-image (a Region of Interest) instead of the whole image.

Approaches for Document Layout Analysis:

  • Computer vision-based approaches (object detection, image segmentation)
  • NLP-based approaches (masked visual-language modelling, BERTgrid, LayoutLM, CharGrid)

Computer vision-based approaches:

  1. Image segmentation: Segmentation is a method to identify the different sub-segments/sub-objects in an image. The goal of segmentation is to change the representation of an image into something simpler to analyse further.

Steps to identify different text blocks in an image:

Binarization: Convert the image to grayscale, as we don't need 3 channels to represent text content.

Edge detection: Detect the edges of each character.

Dilation: Dilation is the process of making the edges thicker, so that nearby characters and words overlap and merge into a single solid mass.

Contour detection: Contours are sets of continuous points having the same colour or intensity.

OpenCV code to detect segments:

Playing with the number of iterations in the dilate operation and the kernel size changes how the bounding boxes are detected.

Sometimes we also apply erosion followed by dilation (morphological opening), which removes small noise before the blocks are grown.

2. DLA as object detection

DLA can be treated as a subtask of custom object detection in an image.

We create labelled data by tagging our documents, similar to what is shown above, with our labels of interest, and then fine-tune a pretrained vision model with a classification head.

Some of the pretrained DLA models are:

Layout Parser (library) — based on Facebook's Detectron2 model.

Training custom objects:

Monk AI

Monk Object Detection is a collection of object detection pipelines. The benefit is two-fold for each pipeline: it makes installation compatible across multiple OSes, CUDA versions, and Python versions, and it keeps the code low with a standardized flow. Monk Object Detection enables a user to solve a computer vision problem in very few lines of code. For this task, we'll be using 3 different pipelines of this library for 3 different architectures: yolov3, gluoncv_finetune, and mxrcnn.

Tools for labelling: Labelling tools let you export your labelled data directly in different formats like YOLO, Pascal VOC, etc.

NLP based approaches:


From the abstract of the LayoutLM paper (Xu et al., 2020): "Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the widespread use of pretraining models for NLP applications, they almost exclusively focus on text-level manipulation, while neglecting layout and style information that is vital for document image understanding. In this paper, we propose the LayoutLM to jointly model interactions between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for document-level pretraining. It achieves new state-of-the-art results in several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification (from 93.07 to 94.42)."
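The core idea can be sketched in a few lines of numpy: LayoutLM adds 2-D position embeddings, looked up from each word's bounding-box coordinates (normalized to a 0–1000 grid), to the usual token embedding. The table sizes and embedding dimension below are toy stand-ins, not the real model's:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                    # toy embedding dim (LayoutLM uses 768)
VOCAB, COORD = 100, 1001  # toy vocab; coordinates normalized to 0..1000

tok_emb = rng.normal(size=(VOCAB, D))
x_emb = rng.normal(size=(COORD, D))  # shared table for x0 and x1
y_emb = rng.normal(size=(COORD, D))  # shared table for y0 and y1

def layoutlm_input(token_id, box):
    """Sum the token embedding with embeddings of the word's bounding box."""
    x0, y0, x1, y1 = box
    return tok_emb[token_id] + x_emb[x0] + y_emb[y0] + x_emb[x1] + y_emb[y1]

vec = layoutlm_input(token_id=7, box=(100, 200, 180, 230))
print(vec.shape)  # one joint text+layout input vector per word
```

Because the box coordinates enter the input directly, the same word at different positions on the page produces different input vectors, which is what lets the model learn layout-aware representations.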

Finetuning on a specific dataset:

Summarising all the advantages of Region of Interest Detection:

  • Any reading system requires segmenting text zones from non-textual ones and arranging them in their correct reading order.
  • Detects duplicate content within the same document.
  • Eases document classification by letting the model focus on the important content in the document.
  • Improves the extraction accuracy of OCR by narrowing the region of interest.
  • Helps in entity linking and post-processing (e.g., if you are deriving both Age and DOB from a document, having labelled entities helps you validate one against the other).
  • Very useful when rules cannot be written for information extraction.
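As a concrete sketch of the Age/DOB example above, a hypothetical post-processing check that cross-validates two extracted entities (the function name and signature are illustrative):

```python
from datetime import date

def validate_age(dob: date, claimed_age: int, today: date) -> bool:
    """Cross-check an extracted Age entity against an extracted DOB entity."""
    # Compute age in whole years, accounting for whether the birthday
    # has occurred yet this year.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return age == claimed_age

print(validate_age(date(1990, 5, 1), 35, date(2025, 6, 1)))  # True
```

A mismatch between the two entities flags the document for review instead of silently emitting inconsistent fields.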

