Deep Learning OCR: Deep Learning Algorithm and Robotics Process Automation (RPA) to Extract and Automate Invoice Data

Kunal Bhashkar
39 min read · Jun 28, 2020
source

Word Embedding, Bounding Box, Data Augmentation, Instance and Semantic Segmentation, YOLO, YOLOv2 and YOLOv3, Darknet, R-CNN, Mask R-CNN, Fast R-CNN, Faster R-CNN, Connectionist Text Proposal Network (CTPN), Optical Character Recognition, Recurrent Connectionist Text Proposal Network, Attention-based Encoder-Decoder for text recognition, Bidirectional Encoder Representations from Transformers (BERT), BART, Transformer Model, Generative Adversarial Networks, Robotics Process Automation (RPA)

A. Word Embedding

Word Embedding is a type of word representation that allows words with similar meaning to be understood by machine learning algorithms. It uses language modeling and feature learning techniques. We can say that it is a mapping of words into vectors of real numbers using a neural network, dimensionality reduction on a word co-occurrence matrix, or a probabilistic model. There are various word embedding models available, such as word2vec (Google), GloVe (Stanford), and fastText (Facebook).

Word Embedding is also called a distributed semantic model, distributional semantic modeling, or vector space model. The word semantic means that similar words are grouped together; for example, fruits like apple, mango, and banana should be placed close to each other, whereas the word book will be far away from them. In a broader sense, word embedding will create vectors for fruits that are placed far away from the vector representation of books. The basic uses of Word Embedding are computing similar words, creating groups of related words, features for text classification, document clustering, and other natural language processing tasks.

For generating word embeddings there are many different approaches available, whose relative merit is based on how good they are at placing words close to one another in vector space. They fall into two categories: (i) probabilistic approaches (e.g., using a neural network to optimize an embedding), and (ii) frequency or count-based embeddings (count vectors, TF-IDF, co-occurrence matrix). In this blog, we will discuss only probabilistic approaches.

Word2vec

Word2vec represents words in a vector space representation. Words are represented in the form of vectors, and placement is done in such a way that similar-meaning words appear together and dissimilar words are located far away. It captures a large number of precise syntactic and semantic word relationships. Neural networks do not understand text; they understand only numbers. Word Embedding provides a way to convert text to a numeric vector.

Word2vec Architecture

Word2vec uses two architectures, both of which learn the underlying word representation for each word using neural networks. These include,

  • The Continuous Bag of Words (CBOW) Model
  • The Skip-gram Model

i. Continuous Bag of words (CBOW)

In the CBOW model, the distributed representations of context (or surrounding words) are combined to predict the word in the middle. The CBOW model architecture tries to predict the current target word (the center word) based on the source context words (surrounding words).

Let’s consider an example like “the quick brown fox jumps over the lazy dog”. This can be turned into pairs of (context_window, target_word): if we consider a context window of size 2, we have examples like ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy) and so on. Thus the model tries to predict the target_word based on the context_window words.

source

Skip-gram

In the Skip-gram model, the distributed representation of the input word is used to predict the context of words or neighboring words.

Let’s consider an example like “I will have orange juice and eggs for breakfast.” and a window size of 2. If the target word is juice, its neighboring words will be (have, orange, and, eggs), and our input and target word pairs would be (juice, have), (juice, orange), (juice, and), (juice, eggs). Also note that within the sample window, the proximity of the words to the source word plays no role, so have, orange, and, and eggs will be treated the same while training.

source
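As a concrete illustration of these two training modes, here is a minimal sketch using the gensim library (an assumption of this example; gensim ≥ 4.0 is assumed, where the dimensionality argument is `vector_size`). Setting sg=0 trains CBOW and sg=1 trains skip-gram on a toy corpus.

```python
# Minimal sketch: training CBOW and skip-gram word2vec models with gensim
# (assumes gensim >= 4.0; older releases use `size` instead of `vector_size`).
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["i", "will", "have", "orange", "juice", "and", "eggs", "for", "breakfast"],
]

# sg=0 -> CBOW (predict the center word from its context window)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=100)

# sg=1 -> skip-gram (predict the context words from the center word)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

# Each word is now a dense vector; similar words should end up close together.
print(cbow.wv["juice"].shape)            # (50,)
print(skipgram.wv.most_similar("juice")) # nearest neighbours in vector space
```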

Global Vectors for Word Representation (GloVe)

GloVe, coined from Global Vectors, is a model for distributed word representation. The model is an unsupervised learning algorithm for obtaining vector representations for words. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity. It was developed as an open-source project at Stanford. As a log-bilinear regression model for unsupervised learning of word representations, it combines the features of two model families, namely global matrix factorization and local context window methods.

The approach of GloVe is to capture the meaning of one word embedding with the structure of the whole observed corpus. The GloVe model trains on global co-occurrence counts of words and makes sufficient use of statistics by minimizing least-squares error, producing, as a result, a word vector space with meaningful substructure. Such an outline sufficiently preserves word similarities with vector distance. For quantifying the relatedness of two words, the similarity metrics used for nearest-neighbor evaluations produce a single scalar. This simplicity can be problematic, since two given words almost always exhibit more intricate relationships than can be captured by a single number. For example, man may be regarded as similar to woman in that both words describe human beings; on the other hand, the two words are often considered opposites, since they highlight a primary axis along which humans differ from one another.

In order to capture in a quantitative way the nuance necessary to distinguish man from woman, it is necessary for a model to associate more than a single number with the word pair. In the figure below, the underlying concept that distinguishes man from woman, i.e. sex or gender, may be equivalently specified by various other word pairs, such as king and queen or brother and sister. To state this observation mathematically, we might expect that the vector differences man − woman, king − queen, and brother − sister might all be roughly equal.

source
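The vector-difference idea can be tried directly with a few lines of numpy. This is a hedged sketch that assumes pretrained embeddings are already loaded into a Python dict called `vectors` (for example, parsed from a GloVe text file); the function and variable names are illustrative.

```python
# Sketch of the king - man + woman ≈ queen analogy using plain numpy,
# assuming `vectors` is a dict mapping words to pretrained embedding arrays.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vectors, topn=3):
    """Return words d such that a - b + c ≈ d (e.g. king - man + woman ≈ queen)."""
    target = vectors[a] - vectors[b] + vectors[c]
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in (a, b, c)]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:topn]

# Example usage (with hypothetical pretrained embeddings):
# print(analogy("king", "man", "woman", vectors))  # expect "queen" near the top
```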

GloVe can be used to find relations between words, like synonyms, company-product relations, zip codes and cities, etc. It is also used by the spaCy model to build semantic word embeddings/feature vectors while computing the top words that match with distance measures such as cosine similarity and the Euclidean distance approach.

Note: The difference between Word2Vec and Glove:

Word2Vec is a feedforward neural network-based model for finding word embeddings, whereas GloVe is based on matrix factorization techniques applied to the word-context matrix. The GloVe algorithm first constructs a large (words × context) co-occurrence matrix, i.e., for each “word” (the rows), it counts how frequently that word is seen in some “context” (the columns) across a large corpus.
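A minimal sketch of that counting step, on a toy corpus with a symmetric window of size 2, might look like this (the variable names are illustrative):

```python
# Minimal sketch of the (word x context) co-occurrence counting that GloVe starts from,
# using a symmetric window of size 2 over a toy corpus.
from collections import defaultdict

corpus = [["the", "quick", "brown", "fox"], ["the", "lazy", "dog"]]
window = 2

cooc = defaultdict(float)
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                cooc[(word, sentence[j])] += 1.0

# cooc[("quick", "brown")] is the number of times "brown" appears
# within two positions of "quick" anywhere in the corpus.
print(cooc[("quick", "brown")])
```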

FastText (Enriching Word Vectors with Subword Information)

The fastText classifier is a linear bag-of-words (LBoW) classifier, the name emphasizing the fact that it uses a linear technique both for combining the word vectors into the vector representing the document and for computing the classification criterion. The fastText library was created by the Facebook Research team for the efficient learning of word representations and sentence classification. It is an open-source, free, lightweight library.

FastText supports training continuous bag of words (CBOW) or Skip-gram models using negative sampling, softmax, or hierarchical softmax loss functions.

source
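As a rough illustration, the following sketch trains a fastText model with the gensim library (an assumption of this example) and queries a word that never appears in the training corpus; its subword n-grams still yield a vector.

```python
# Sketch of training a fastText model with gensim; because fastText builds word vectors
# from character n-grams, it can produce a vector even for words unseen during training.
from gensim.models import FastText

sentences = [
    ["invoices", "are", "scanned", "and", "recognized"],
    ["the", "scanner", "reads", "each", "invoice"],
]

model = FastText(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

# "invoicing" never appears in the corpus, but its subword n-grams overlap with
# "invoice"/"invoices", so fastText can still return a sensible vector.
print(model.wv["invoicing"].shape)
print(model.wv.similarity("invoice", "invoices"))
```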

Poincaré embeddings (Poincaré Embeddings for Learning Hierarchical Representations)

Poincaré embeddings are a recent trend in the natural language processing community, based on the idea of using hyperbolic geometry (non-Euclidean spaces of constant negative curvature) to capture hierarchical properties of words that we can’t capture directly in Euclidean space. This geometry, realized on the Poincaré ball, captures the fact that the distance from the root of a tree to its leaves grows exponentially with every new child, a property that hyperbolic space can represent naturally.

The two-dimensional Poincaré embeddings of the transitive closure of the WORDNET mammals subtree are shown below,

source
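For experimentation, gensim also ships a PoincareModel implementation; the following is a minimal sketch on a hypothetical is-a hierarchy (the relations listed are illustrative, not the WORDNET data).

```python
# Sketch of training Poincaré embeddings on a small is-a hierarchy with gensim.
# Each relation is a (child, parent) pair; the ball dimension here is 2 so the
# result could be plotted like the WORDNET mammals figure above.
from gensim.models.poincare import PoincareModel

relations = [
    ("dog", "mammal"), ("cat", "mammal"), ("whale", "mammal"),
    ("mammal", "animal"), ("bird", "animal"),
]

model = PoincareModel(relations, size=2, negative=2)
model.train(epochs=50)

print(model.kv["dog"])                      # 2-d point inside the Poincaré ball
print(model.kv.distance("dog", "mammal"))   # hyperbolic distance
```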

B. Deep Learning-based Optical Character Recognition (OCR)

OCR is a technology that recognizes and locates text, such as letters, numbers, and symbols, within a digital image. It is commonly used to recognize text in scanned documents, but it serves many other purposes as well. Some OCR software will simply export the text, while other programs can convert the characters to editable text directly in the image. Advanced OCR software can export the size and formatting of the text as well as the layout of the text found on a page.

A Neural Network (NN) is a wonderful tool that can help to solve OCR-type problems. The neural network is an information processing paradigm inspired by the way the human brain processes information. NNs are collections of mathematical models that represent some of the observed properties of biological nervous systems and draw on the analogies of adaptive biological learning. A deep learning OCR methodology involves the following steps: a. recognition by adjusting the weight matrix, b. an image labeling algorithm, c. finding boundaries and generating an (X, Y) coordinate pixel array, d. matching connected pixels with the learned set, e. word formation. The process of OCR using deep learning is shown below,

source

The Convolutional Recurrent Neural Network (CRNN) is a combination of a CNN, an RNN, and CTC (Connectionist Temporal Classification) loss for image-based sequence recognition tasks, such as scene text recognition and OCR. At the bottom of the CRNN, the convolutional layers automatically extract a feature sequence from each input image. On top of the convolutional network, a recurrent network is built for making predictions for each frame of the feature sequence output by the convolutional layers. Though the CRNN is composed of different kinds of network architectures, it can be jointly trained with one loss function. The network architecture of the CRNN is shown below,

source
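The following is a much-simplified PyTorch sketch of the same idea, not the exact CRNN from the paper: a small CNN produces one feature vector per image column, a bidirectional LSTM models the sequence, and CTC loss aligns the per-frame predictions with the label text. All layer sizes are illustrative.

```python
# Tiny CRNN-style model: CNN feature extractor -> bidirectional LSTM -> per-frame logits,
# trained with CTC loss so no per-character segmentation of the image is needed.
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    def __init__(self, num_classes, img_height=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        feat_height = img_height // 4
        self.rnn = nn.LSTM(128 * feat_height, 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)   # num_classes includes the CTC "blank"

    def forward(self, x):                       # x: (batch, 1, H, W)
        f = self.cnn(x)                         # (batch, C, H/4, W/4)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)   # one feature vector per column
        out, _ = self.rnn(f)
        return self.fc(out)                     # (batch, frames, num_classes)

model = TinyCRNN(num_classes=37)                # e.g. 26 letters + 10 digits + blank
images = torch.randn(4, 1, 32, 128)             # dummy batch of text-line crops
logits = model(images)                          # (4, 32, 37)

# CTC loss expects (frames, batch, classes) log-probabilities.
log_probs = logits.log_softmax(2).permute(1, 0, 2)
targets = torch.randint(1, 37, (4, 10))         # dummy label sequences (no blanks)
input_lengths = torch.full((4,), logits.shape[1], dtype=torch.long)
target_lengths = torch.full((4,), 10, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```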

Segmentation-free OCR methods eliminate the need for pre-segmented inputs. Cropped words or entire text lines are usually geometrically normalized and then can be directly recognized. Deep learning detection approaches such as SSD, YOLO, and Mask R-CNN are also used to detect characters and words. Deep learning models can find it more challenging to recognize digits and letters than to identify objects such as dogs, cats, or humans. They often don’t reach the desired accuracy, and therefore specialized approaches are needed. The architecture below shows a hybrid CNN-LSTM model inspired by the CRNN.

source

The above OCR system surpasses the accuracy of leading commercial and open-source engines on distorted text samples.

C. Some Fundamental concepts

Data Augmentation

Data augmentation is a strategy that enables practitioners to significantly increase the diversity of data available for training models without actually collecting new data. Data augmentation techniques such as cropping, padding, and horizontal flipping are commonly used to train large neural networks. However, most approaches used in training neural networks only use basic types of augmentation. Popular augmentation techniques are flip, rotation, crop, translation, scale, Gaussian noise, horizontal and vertical shift, random brightness, and random zoom. A critical tool for overcoming training data overfit with modern CNNs is data augmentation, the use of randomized data transformations to greatly expand the effective size of a training set. Data augmentation randomly applies certain types of label-preserving transforms to the training data. Some more data augmentation techniques are shown below,

source

Data augmentation techniques can be implemented using the scikit-image and OpenCV libraries in Python, as sketched below.
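A minimal sketch of a few label-preserving transforms with OpenCV and scikit-image (file names and parameter values are illustrative):

```python
# Minimal sketch of common label-preserving augmentations with OpenCV and scikit-image.
import cv2
import numpy as np
from skimage.util import random_noise

image = cv2.imread("sample.png")                # hypothetical input image

flipped = cv2.flip(image, 1)                    # horizontal flip

h, w = image.shape[:2]                          # small rotation about the centre
M = cv2.getRotationMatrix2D((w / 2, h / 2), 10, 1.0)
rotated = cv2.warpAffine(image, M, (w, h))

M_shift = np.float32([[1, 0, 20], [0, 1, 10]])  # translation (horizontal/vertical shift)
shifted = cv2.warpAffine(image, M_shift, (w, h))

noisy = random_noise(image, mode="gaussian")    # Gaussian noise, returns floats in [0, 1]
noisy = (noisy * 255).astype(np.uint8)

bright = cv2.convertScaleAbs(image, alpha=1.0, beta=40)  # simple brightness shift
```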

Bounding Box

The bounding box is a rectangular box that can be determined by the coordinates of the upper-left corner and the coordinates of the lower-right corner of the rectangle. In object detection, we usually use a bounding box to describe the target location. A 4-dimensional representation encodes the x-y position, the scale, and the aspect ratio of a bounding box. To allow multiple bounding boxes in each image, a Gaussian distribution is placed at each location and the labels are re-normalized to sum to one. During evaluation, multiple boxes are predicted by applying non-max suppression to the resulting probability distribution over bounding boxes. Bounding boxes have been used to count the number of obstacles of the same class in a crowd, in self-driving cars, drones, surveillance cameras, autonomous robots, and all sorts of systems using computer vision. In the figure below, the output of the algorithm is a list of bounding boxes in the format [class, x-coordinate, y-coordinate, width, height, confidence score].

source

Creating bounding boxes in an image is used to represent a possible region of interest (ROI). In general, every feature recognition/detection algorithm returns the ROI in the form of pixel coordinates together with the width and height. In the figure below, the first three bounding boxes are correct detections but the last three are incorrect detections for their classes.

source
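Whether a detection counts as correct is usually decided with Intersection over Union (IoU) against the ground-truth box, e.g. requiring IoU ≥ 0.5. A minimal sketch:

```python
# Sketch of Intersection over Union (IoU), the standard measure used to decide whether
# a predicted bounding box counts as a correct detection (e.g. IoU >= 0.5).
def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2): upper-left and lower-right corners."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143: a weak overlap
```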

Bounding Box Regression

In mathematical statistics, the Kullback–Leibler divergence (KL-divergence, also called relative entropy) is a measure of how one probability distribution differs from a second, reference probability distribution. In the KL-divergence-based region proposal network for object detection, the learning of the region proposal using deep neural networks (DNN) is divided into two tasks: a binary classification task and a bounding box regression task. The network architecture below shows the KL Region Proposal Network (KL-RPN) on Faster R-CNN. KL-RPN predicts the mean and standard deviation of the bounding box offset.

source

The bounding box regression loss is defined as the KL divergence between the predicted distribution and the ground-truth distribution. Learning with the KL loss has three benefits:

a. The bounding box regressor gets smaller losses from ambiguous bounding boxes and the ambiguities in a dataset can be successfully captured.

b. The learned variance is useful during post-processing.

c. The learned probability distribution is interpretable. Since it reflects the level of uncertainty of the bounding box prediction, it can potentially be helpful in down-stream applications like self-driving cars and robotics.

Let’s consider a predicted bounding box p = (px, py, pw, ph) (center coordinates, width, height) and its corresponding ground-truth box g = (gx, gy, gw, gh). The regressor is configured to learn a scale-invariant transformation between the two centers and a log-scale transformation between the widths and heights. All the transformation functions take p as input. The figure below shows the transformation between predicted and ground-truth bounding boxes,

source
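A small sketch of the standard R-CNN-style regression targets described above, together with the inverse transform that applies predicted offsets back to a proposal (box values are illustrative):

```python
# R-CNN-style regression targets: scale-invariant shifts for the centre and log-scale
# ratios for width/height, computed from a proposal p and its matched ground-truth box g
# (both given as centre-x, centre-y, width, height).
import math

def regression_targets(p, g):
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    tx = (gx - px) / pw            # horizontal shift, measured in proposal widths
    ty = (gy - py) / ph            # vertical shift, measured in proposal heights
    tw = math.log(gw / pw)         # log-scale width change
    th = math.log(gh / ph)         # log-scale height change
    return tx, ty, tw, th

def apply_offsets(p, t):
    """Invert the transform: move the proposal by the predicted offsets."""
    px, py, pw, ph = p
    tx, ty, tw, th = t
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))

p = (50.0, 50.0, 20.0, 40.0)       # hypothetical proposal
g = (54.0, 48.0, 24.0, 38.0)       # hypothetical ground truth
t = regression_targets(p, g)
print(t)
print(apply_offsets(p, t))         # recovers g
```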

In the figure below, the network takes an image with roughly localized bounding boxes and refines them so that they tightly enclose nearby objects,

source

YOLO, YOLOv2, and YOLOv3: Algorithm to predict accurate Bounding boxes

YOLO stands for You Only Look Once, which is a good way to get more accurate output bounding boxes. YOLO is an object detection algorithm quite different from region-based algorithms. In YOLO, a single convolutional network predicts the bounding boxes and the class probabilities for these boxes. With sliding windows, by contrast, the algorithm takes a set of windows that move throughout the image, and by applying a classifier to each window we can see whether there is, say, a car in that particular sliding window or not. In the figure below we can see how a bounding box is predicted,

source

The YOLO algorithm works as follows: first it takes an image and splits it into a grid; within each grid cell it takes some bounding boxes. For each bounding box, the network outputs a class probability and offset values for the box. Finally, bounding boxes having a class probability above a threshold value are selected and used to locate the object within the image. YOLO is orders of magnitude faster (45 frames per second) than other object detection algorithms. The limitation of the YOLO algorithm is that it struggles with small objects within the image; for example, it might have difficulties detecting a flock of birds. This is due to the spatial constraints of the algorithm. The figure below shows how the YOLO algorithm creates a grid and produces the final detections.

source
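The thresholding and box selection mentioned above is typically finished with non-max suppression (NMS), which keeps the highest-scoring box and discards overlapping ones. A minimal numpy sketch:

```python
# Sketch of non-max suppression (NMS): keep the highest-scoring boxes and discard any
# box that overlaps an already-kept box by more than `iou_threshold`.
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,). Returns kept indices."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]

        # IoU of the best box with all remaining boxes (vectorised).
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)

        order = rest[iou <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the second box is suppressed by the first
```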

For more accurate prediction of the bounding box, the YOLOv3 algorithm predicts a confidence score (objectness) for each bounding box using logistic regression. The confidence score should be 1 if the bounding box prior overlaps a ground-truth object by more than any other bounding box prior. For example, prior 1 overlaps the first ground-truth object more than any other bounding box prior (it has the highest Intersection over Union, IoU), and prior 2 overlaps the second ground-truth object more than any other bounding box prior. The system only assigns one bounding box prior to each ground-truth object. If a bounding box prior is not assigned to a ground-truth object, it incurs no loss for coordinate or class predictions, only objectness. If the box does not have the highest IoU but does overlap a ground-truth object by more than some threshold, we ignore the prediction.

The following steps are also involved in the prediction of the bounding box using the YOLO algorithm in each grid cell:

i. Firstly it predicts B boundary boxes and each box has one box confidence score,

ii. Detects one object only regardless of the number of boxes B,

iii. Predicts C conditional class probabilities (one per class for the likeliness of the object class).

In the figure below, we can see the working of YOLO-algorithm to predict bounding box,

source

Here each ground-truth object is associated with one boundary box prior only. If a bounding box prior is not assigned, it incurs no classification or localization loss, just confidence loss on objectness.

A problem with YOLOv3 is that it is biased towards the size of the objects seen in the training images: if it encounters mostly bigger objects while training, it cannot detect the same object at a smaller scale perfectly. To remove this type of ambiguity, YOLOv5 is now available.

Anchor Boxes

Anchor boxes are a set of predefined bounding boxes of a certain height and width. These boxes are defined to capture the scale and aspect ratio of the specific object classes you want to detect and are typically chosen based on object sizes in your training datasets. In the detection process, the predefined anchor boxes are tiled across the image and the network predicts the probability and other attributes, such as background, intersection over union (IoU), and offsets for every tiled anchor box. The predictions are used to refine each individual anchor box. We can define several anchor boxes of different object sizes. YOLOv2 introduced anchor boxes, which do classification and prediction in a single framework. These are responsible for predicting bounding boxes and are designed for a given dataset by using a clustering algorithm (k-means clustering), as sketched below.
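A rough sketch of that clustering step over the (width, height) of ground-truth boxes follows; for brevity it uses plain Euclidean k-means, whereas YOLOv2 actually clusters with a 1 − IoU distance, and the box sizes are illustrative.

```python
# Sketch of choosing anchor box sizes with k-means over (width, height) of training boxes.
# Plain Euclidean k-means is used here for brevity; YOLOv2 clusters with a 1 - IoU distance.
import numpy as np

def kmeans_anchors(wh, k=3, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # Assign every box to its nearest anchor, then move anchors to cluster means.
        d = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        new_centers = []
        for j in range(k):
            members = wh[assign == j]
            new_centers.append(members.mean(axis=0) if len(members) else centers[j])
        centers = np.array(new_centers)
    return centers

# Hypothetical ground-truth box sizes (width, height) in pixels.
wh = np.array([[30, 60], [35, 70], [120, 80], [110, 90], [60, 60], [55, 65]], dtype=float)
print(kmeans_anchors(wh, k=3))   # three representative anchor (width, height) pairs
```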

The figure below shows how the position of an anchor box is determined by mapping the location of the network output,

source

Darknet

Darknet is a framework for training neural networks; it is open source, written in C/CUDA, and serves as the basis for YOLO. We can say it is the backbone CNN. Darknet requires only 5.58 billion operations. With Darknet, YOLO achieves 72.9% top-1 accuracy and 91.2% top-5 accuracy on ImageNet. Darknet mostly uses 3 × 3 filters to extract features and 1 × 1 filters to reduce output channels. It also uses global average pooling to make predictions. The detailed network description of Darknet is as follows:

source

Darknet is used as the framework for training YOLO, meaning it sets the architecture of the network. The network architecture of CNN and darknet is shown below,

source

Semantic and Instance Segmentation

There are various techniques that are used in computer vision tasks like classification, semantic segmentation, object detection, and instance segmentation. In the figure below we can see the differences between all the tasks mentioned above,

source

Semantic segmentation is the task of labeling every pixel in an image with a pre-defined object category. Autonomous vehicles and medical diagnosis are scenarios where such a detailed understanding of an image is required.

source

There are many ways of describing a scene. A high-level summary of a scene can be obtained by predicting image tags that describe the objects in the picture (such as “person”) or the scene (such as “city” or “office”). This task is known as image classification. The object detection task, on the other hand, aims to localize different objects in an image by placing bounding boxes around each instance of a pre-defined object category. Semantic segmentation aims for a more precise understanding of the scene by assigning an object category label to each pixel within the image. Scene understanding tasks such as semantic segmentation enable computers to extract information from real-world scenarios and to leverage this information to accomplish given tasks. Semantic segmentation has numerous applications, such as autonomous vehicles which need a precise, pixel-level understanding of their environment, robots which can navigate and manipulate objects in their environment, diagnosing medical conditions by segmenting cells, tissues, and organs of interest, image and video editing, and “smart glasses” which describe the scene to the blind. Semantic segmentation has traditionally been approached using probabilistic models known as Conditional Random Fields (CRFs), which explicitly model the correlations among the pixels being predicted.
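As a quick illustration of per-pixel labeling in practice, the following sketch runs a pretrained semantic segmentation model from torchvision (assuming torchvision ≥ 0.13 for the weights argument; the image file name is illustrative):

```python
# Sketch of per-pixel labeling with a pretrained torchvision semantic segmentation model.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.segmentation.deeplabv3_resnet50(weights="DEFAULT").eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("street.jpg").convert("RGB")   # hypothetical input image
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    out = model(batch)["out"]                     # (1, num_classes, H, W) per-pixel scores
labels = out.argmax(1)[0]                          # per-pixel class index map
print(labels.shape, labels.unique())
```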

The evolution of Semantic Segmentation systems is as below,

source

Instance segmentation aims to assign a unique identifier to each segmented object in the image. Related work bridges the gap between natural language processing and computer vision with tasks such as image captioning and visual question answering, which aim at describing an image in words and answering textual questions about images, respectively. Instance segmentation identifies the boundaries of the objects at the detailed pixel level. In the example below, there are 7 balloons at certain locations, and these are the pixels that belong to each one of the balloons.

source

The fundamental concepts of the R-CNN, Fast R-CNN, Mask R-CNN, and Faster R-CNN algorithms

R-CNN

R-CNN is short for “Region-based Convolutional Neural Networks”. The main idea is composed of two steps. First, using selective search, it identifies a manageable number of bounding-box object region candidates (“regions of interest” or “RoI”). Then it extracts CNN features from each region independently for classification. The goal of R-CNN is to take in an image and correctly identify where the main objects in the image are (via bounding boxes). The figure below shows an overview of the object detection system using R-CNN, which is simply Regions with CNN features,

source

The object detection system using R-CNN has three modules:

i. The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector.

ii. The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region.

iii. The third module is a set of class-specific linear SVMs.

The arrangement of all three modules is shown below,

source

Region proposals

Region proposal methods are “normal” algorithms that work out of the box; they do not require any training. Selective Search is a region proposal algorithm used in object detection. It is built on top of the image segmentation output and uses region-based characteristics (not just attributes of a single pixel) to do a bottom-up hierarchical grouping. It is designed to be fast with very high recall. It is based on computing a hierarchical grouping of similar regions based on color, texture, size, and shape compatibility. Selective Search starts by over-segmenting the image based on the intensity of the pixels using a graph-based segmentation method. The output of the algorithm is shown below,

source

A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score. In the figure below, the left side shows the Region Proposal Network (RPN) and the right side shows detections using RPN proposals on the PASCAL VOC 2007 test set.

source

Fast R-CNN

A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional and max-pooling layers to produce a convolutional feature map. Then, for each object proposal, a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class, and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes. Training all network weights with back-propagation is an important capability of Fast R-CNN. Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test time, and achieves a higher mAP on PASCAL VOC 2012. The Fast R-CNN architecture shown below is trained end-to-end with a multi-task loss.

source

Fast R-CNN consists of a CNN (usually pre-trained on the ImageNet classification task) with its final pooling layer replaced by an “RoI pooling” layer and its final FC layer replaced by two branches — a (K + 1)-category softmax layer branch and a category-specific bounding box regression branch.
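The RoI pooling step can be seen in isolation with torchvision.ops.roi_pool, which pools each proposal from the shared feature map into a fixed grid. A minimal sketch with made-up feature-map and proposal values:

```python
# Sketch of RoI pooling in isolation: each proposal box is pooled from the shared
# feature map into a fixed 7x7 grid, giving a fixed-length feature per proposal.
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 50, 50)    # (batch, channels, H, W) from the backbone

# Proposals in (batch_index, x1, y1, x2, y2) format, in feature-map coordinates.
rois = torch.tensor([
    [0, 10.0, 10.0, 30.0, 40.0],
    [0,  5.0, 20.0, 25.0, 45.0],
])

pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)   # torch.Size([2, 256, 7, 7]) -- one fixed-size feature per proposal
```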

For a better understanding of how Fast R-CNN improves on the efficiency and performance of the R-CNN and SPP-Net pipelines, their pipeline structures are shown below,

source

Faster R-CNN

Faster R-CNN was first published in 2015 and is the most widely used state-of-the-art version of the R-CNN family. Faster R-CNN is composed of three parts: convolution layers, a Region Proposal Network (RPN), and class and bounding box prediction. The architecture of Faster R-CNN is shown below,

source

The region proposal network (RPN) in the faster region-based convolutional neural network (Faster R-CNN) is used to decide “where” to look in order to reduce the computational requirements of the overall inference process. The RPN quickly and efficiently scans every location in order to assess whether further processing needs to be carried out in a given region. It does that by outputting k bounding box proposals each with 2 scores representing the probability of object or not at each location. The architecture of Faster R-CNN which is a single unified network for object detection and RPN is shown below,

source

The figure below shows the R-CNN detection framework, which consists of a multistage pipeline: i. region proposal computation, ii. CNN model fine-tuning, iii. class-specific SVM classifier training, iv. class-specific bounding box regressor training.

source

Mask R-CNN

Mask R-CNN, extends Faster R-CNN by adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding box regression. The mask branch is a small FCN applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner. Mask R-CNN is simple to implement and train given the Faster R-CNN framework, which facilitates a wide range of flexible architecture designs. In addition, the mask branch only adds a small computational overhead, enabling a fast system and rapid experimentation. Most importantly, Faster RCNN was not designed for pixel-to-pixel alignment between network inputs and outputs. The architecture of the Mask R-CNN framework for instance segmentation is shown below,

source

Faster R-CNN has two outputs for each candidate object, a class label and a bounding-box offset; to this, Mask R-CNN adds a third branch that outputs the object mask. Mask R-CNN is thus a natural and intuitive idea. But the additional mask output is distinct from the class and box outputs, requiring extraction of the much finer spatial layout of an object. The figure below shows how Mask R-CNN is able to segment as well as classify the objects in an image,

source

The Mask R-CNN adopts the same two-stage procedure, with an identical first stage which is RPN. In the second stage, in parallel to predicting the class and box offset, Mask R-CNN also outputs a binary mask for each RoI.
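For a quick, hands-on view of these outputs, the following sketch runs a pretrained Mask R-CNN from torchvision on a single image (assuming torchvision ≥ 0.13 for the weights API; the image file name is illustrative). The prediction dictionary contains boxes, labels, scores, and per-instance masks.

```python
# Sketch of running a pretrained Mask R-CNN from torchvision on one image.
import torch
from torchvision import models
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = Image.open("balloons.jpg").convert("RGB")   # hypothetical input image
tensor = to_tensor(image)                           # (3, H, W), values in [0, 1]

with torch.no_grad():
    pred = model([tensor])[0]

keep = pred["scores"] > 0.5
print(pred["boxes"][keep])          # (N, 4) bounding boxes
print(pred["labels"][keep])         # class indices
print(pred["masks"][keep].shape)    # (N, 1, H, W) soft per-instance masks
```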

Finally, the high-level diagrams of the leading frameworks for generic object detection are as below,

D. Connectionist Text Proposal Network (CTPN)

The CTPN detects a text line as a sequence of fine-scale text proposals directly in convolutional feature maps. In the CTPN architecture, the sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model. Conventional approaches consist of a multi-stage pipeline: these algorithms basically follow bottom-up approaches, starting with low-level character detection and then following multiple stages such as non-text component filtering, text line construction, and verification. The recurrent connection allows the CTPN to explore rich context information of an image, making it powerful in detecting extremely ambiguous text. The CTPN works reliably on multi-scale and multi-language text without further post-processing, departing from previous bottom-up methods requiring multi-step post-filtering.

The structure of the CTPN for text detection is shown below,

source

In the above architecture, CTPN follows these steps: i. First, the input image is passed through a pretrained VGG16 model trained on the ImageNet dataset. ii. The features output from the last convolutional maps of the VGG16 model are taken. iii. These outputs are passed through a 3×3 spatial window. iv. The outputs of the 3×3 spatial window are then passed through a 256-D bi-directional Recurrent Neural Network. v. The recurrent output is then fed to a 512-D fully connected layer. vi. Finally, the output layer consists of 3 different outputs: 2k vertical coordinates, 2k text/non-text scores, and k side-refinement values.

Traditional methods of text detection can be divided into two categories: one is connected components (CC) and the other is sliding windows. CC methods differentiate text/non-text pixels by using a fast filter and then group the text pixels into strokes or candidate characters by using low-level properties (intensity, color, gradient). A sliding window is a multi-scale window that moves densely over the image. The character/non-character windows are distinguished by a pre-trained classifier using manually designed features or CNN features from a later layer. A big problem with sliding windows is that they are computationally expensive, because you need to run the classifier on a large number of windows. Recent improvements refine the side-refinement detection box merging mechanism, taking height information into account in the detection location and combination, and also change the BiLSTM network to a GRU, thereby accelerating network training and application runtime and improving network efficiency. The main idea is that every two similar proposals form a pair, and different pairs are merged until they can no longer be merged. The results of side-refinement are shown below,

source

CTPN detects horizontal text quite well but fails for multi-oriented text. CTPN detects longer text areas than other methods; we believe this is due to the connectionist mechanism, which tends to connect horizontally close text proposals. The detected text bounding boxes are shown below,

source

E. Attention-based Encoder-Decoder for text recognition (AED)

For recognizing handwritten mathematical expressions, the AED has been successfully applied. The AED has two main modules: DenseNet for extracting features from a text image, and an LSTM combined with an attention model for predicting the output text.

source

In attention-based text recognition models, the decoder recurrently outputs predictions. Specifically, the prediction of the previous step is generally embedded into a high-dimensional feature space. Then, the embedded vector directly participates in the next decoding step as a guide. As shown in the figure below, the unchanging color represents that all the guidance weights are fixed in existing attention-based decoders, regardless of the correlation between neighboring characters.

source

The above model consists of two components: (a) a convolutional encoder network that extracts features from an input image and converts features to high-level visual representations. (b) a recurrent attention-based decoder network that combined with the proposed AEG (Adaptive Embedding Gate) to generate target sequences.

In the convolutional encoder network, a residual network (ResNet) based feature extractor is adopted as the primary structure. The encoder first extracts a feature map from an input image, where features are constrained by their receptive fields. To enlarge the image region for feature expression, it employs a two-layer Bidirectional Long Short-Term Memory (BLSTM) network over the feature map. The figure shown below is the architecture of a multilayer convolutional model with seven encoder and seven decoder layers,

source

The recurrent attention-based decoder network aims at translating the encoded features into the prediction sequence, where the attention mechanism is used to align the prediction sequence. It has a novel module called AEG to adaptively strengthen or weaken the influence of the previous prediction in the decoding stage by exploiting character language modeling. The formulation and three implementations of AEG are presented in this architecture. i. Extensive experiments are conducted on various scene text benchmarks, demonstrating the performance superiority and flexibility of AEG. ii. The AEG architecture significantly improves the robustness of existing attentional decoders under different noise disturbances, e.g., Gaussian blur, salt-and-pepper noise, and random occlusion.

source

For DenseNet-based feature extraction in the AED, DenseNet is employed as the feature extractor. DenseNet has direct connections from any preceding layer to the succeeding layers, which helps the network reuse and learn features across layers. The figure shown below is the architecture of the fast DenseNet convolutional network, the FDenseNet-U framework,

source

In the attention-based LSTM decoder, an LSTM decoder predicts one character at each time step. The decoder predicts the output symbol based on the embedding vector of the previously decoded symbol, the current hidden state of the decoder, and the current context vector. The context vector is computed by the attention mechanism. The decoder is initialized by averaging the extracted feature map. The figure shown below is the architecture of the attention-based LSTM decoder,

source

Finally, in OCR verification, the OCR is used to verify whether a text line is handwritten or printed. For a handwritten text line, the AED predicts a special symbol; if a text line is recognized as handwritten, it is removed from the result of text line detection. The figure below shows an example of a good result from the recognition system using deep learning,

source

F. Container Text Detection and Recognition Network (CTDRNet)

There are three main trends in the field of text detection: (a) pipeline simplification; (b) changes in prediction units; (c) specified targets. The CTDRNet consists of three components: (i) CTDRNet text detection improves detection accuracy for single words; (ii) CTDRNet text recognition has faster convergence speed and detection accuracy; (iii) CTDRNet post-processing improves detection and recognition accuracy. Character-based text detection detects characters one by one before connecting them into a word, which has poor detection accuracy. The existing text recognition methods can be divided into CTC-based and attention mechanism-based methods.

The workflow of CTDRNet is shown below,

source

An overview of recent progress and dominant trends is shown below,

G. Transformer Model

The Transformer model is able to solve sequence transduction problems such as neural machine translation and many other NLP tasks, that is, any task that transforms an input sequence into an output sequence. This includes speech recognition, text-to-speech transformation, etc. A Transformer architecture is trained as a language model on a large corpus, then fine-tuned for individual text classification and similarity tasks. Multiple sentences are combined together into a single sequence using delimiters in order to work with the same model. The figure below shows the Transformer architecture and the input transformations for fine-tuning on different tasks,

source

The Transformer model decouples the problem into two sub-problems and consequently it has two modules that solve these sub-problems: a feature extraction module and a transformer module. Here, convolutional feature maps are used as word embeddings for input to the transformer, and in this way the method leverages the potential of the powerful attention mechanism of transformers. Take an example like “I arrived at the bank after crossing the river”: to determine that the word “bank” refers to the shore of a river and not a financial institution, the Transformer can learn to immediately attend to the word “river” and make this decision in a single step.

The overall architecture of the Transformer, which uses stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, is shown below,

source

Based on this architecture, the Transformer was introduced, in which a self-attention technique is used instead of a recurrent neural network in the encoder and decoder. The Transformer encoder consists of self-attention heads and fully connected neural networks. This encoder modifies the representation of each token to suit the contents of the other tokens and presents a new representation. Each self-attention head discovers a new semantic relation between various tokens and converts it into a new vector, similar to the input vectors, using a fully connected neural network. The BERT (Bidirectional Encoder Representations from Transformers) language model uses the Transformer encoder component to implement its language model.
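The core of each self-attention head is scaled dot-product attention: every token's new representation is a weighted mix of all token values, with weights given by query-key similarity. A minimal PyTorch sketch (dimensions are illustrative):

```python
# Minimal sketch of scaled dot-product self-attention, the core operation inside
# each Transformer encoder head.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)        # (seq_len, seq_len) similarities
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ v, weights                    # new token representations + attention map

torch.manual_seed(0)
seq_len, d_model, d_k = 8, 16, 16                  # e.g. 8 tokens of "I arrived at the bank ..."
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)                       # (8, 16) and (8, 8)
```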

H. Bidirectional Encoder Representations from Transformers (BERT)

Bidirectional Encoder Representations from Transformers (BERT) is a technique for NLP pre-training developed by Google. BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right contexts in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. Learning from the experience of the ELMo (Embeddings from Language Models) and GPT (Generative Pre-Training) pre-trained models, BERT applies the bidirectional training of the Transformer to language modeling. Using BERT for a specific task is very straightforward: we can download Google’s pre-trained BERT model first, then use fine-tuning to update the pre-trained model to fit the downstream task; BERT is essentially a transfer learning method for NLP.
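A minimal sketch of this fine-tuning idea with the Hugging Face transformers library is shown below; the model name, example sentence, and label are illustrative, and in practice the forward/backward pass runs inside a proper training loop.

```python
# Sketch of fine-tuning: load a pretrained BERT, add a classification head,
# and compute the loss on one labeled example.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("The invoice total does not match the purchase order.",
                   return_tensors="pt", truncation=True, padding=True)
labels = torch.tensor([1])                       # hypothetical class: "needs review"

outputs = model(**inputs, labels=labels)
print(outputs.loss)                              # cross-entropy loss for this example
print(outputs.logits.softmax(dim=-1))            # predicted class probabilities
```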

To help the model distinguish between the two sentences in training, the input is processed in the following way before entering the model:

i. A [CLS] token is inserted at the beginning of the first sentence, and a [SEP] token is inserted at the end of each sentence.

ii. A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings with a vocabulary of 2.

iii. A positional embedding is added to each token to indicate its position in the sequence. The concept and implementation of positional embedding are presented in the Transformer paper.

We can see the steps in the figure below,

source

For task-specific models, BERT is a multi-layer, bidirectional Transformer encoder that comes in two variants: BERT-Base and the larger BERT-Large. There are basically four types of task-specific BERT models: i. sentence pair classification, ii. single sentence classification, iii. question answering, iv. single sentence tagging.

All the types of task-specific BERT models are shown below,

source

In Word Character BERT, a word or character is mapped to a continuous vector representation (embedding) that captures the context of the word and character respectively. While word-based models need accurate token-level segmentation, character-level models have the ability to perform accurate labeling of tokens or character units without the need for prior word segmentation.

The architectures of the word-, character-, and BERT-level representation models are shown below,

source

Finally, In the Data Preprocessing task, we first apply optical character recognition (OCR) to convert the documents into a textual representation with the additional goal of preserving the original document layout as well as possible. The character vocabulary is constrained to alpha-numeric characters and some special symbols.

source

Types of BERT

RoBERTa: A Robustly Optimized BERT

RoBERTa comes from a replication study of BERT pretraining, which includes a careful evaluation of the effects of hyperparameter tuning and training set size. The study found that BERT was significantly undertrained and proposed an improved recipe for training BERT models, called RoBERTa, that can match or exceed the performance of all of the post-BERT methods.

DistilBERT: A distilled version of BERT

DistilBERT has the same general architecture as BERT, but the token-type embeddings and the pooler are removed, while the number of layers is reduced by a factor of 2. It is a smaller general-purpose pre-trained language representation model, which can then be fine-tuned with good performance on a wide range of tasks, like its larger counterparts.

CamemBERT: A Tasty French Language Model

CamemBERT differs from RoBERTa mainly in the addition of whole-word masking and the use of SentencePiece tokenization. Similar in architecture to RoBERTa and BERT, CamemBERT is a multi-layer bidirectional Transformer and a French version of the Bidirectional Encoder Representations from Transformers (BERT). The performance of CamemBERT has been measured against multilingual models in multiple downstream tasks, namely part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference.

ALBERT: A LITE BERT

An ALBERT configuration similar to BERT-large has 18x fewer parameters and can be trained about 1.7x faster. ALBERT incorporates two parameter-reduction techniques that lift the major obstacles in scaling pre-trained models.

i. A factorized embedding parameterization: the large vocabulary embedding matrix is decomposed into two small matrices, separating the size of the hidden layers from the size of the vocabulary embedding. This separation makes it easier to grow the hidden size without significantly increasing the parameter size of the vocabulary embedding.

ii. Cross-layer parameter sharing: this technique prevents the number of parameters from growing with the depth of the network.

Both techniques significantly reduce the number of parameters for BERT without seriously hurting performance, thus improving parameter-efficiency. The parameter reduction techniques also act as a form of regularization that stabilizes the training and helps with generalization.

Multilingual BERT

Multilingual BERT (mBERT) provides sentence representations for 104 languages, which are useful for many multilingual tasks. The mBERT representations can be split into a language-specific component, which identifies the language of the sentence, and a language-neutral component, which captures the meaning of the sentence in a language-independent way. The language centroids of the mean-pooled representations are shown below,

source

FlauBERT: Unsupervised Language Model

FlauBERT is a model learned on a very large and heterogeneous French corpus. When French language models are applied to diverse NLP tasks (text classification, paraphrasing, natural language inference, parsing, word sense disambiguation), they outperform other pre-training approaches most of the time. Different versions of FlauBERT, as well as a unified evaluation protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared with the research community for further reproducible experiments in French NLP.

I. BART: A denoising autoencoder for pretraining sequence-to-sequence models

The Facebook AI researchers have further developed the BART model with the introduction of mBART, which they say is the first method for pretraining a complete sequence-to-sequence model by denoising full texts in multiple languages for machine translation purposes.

A BART model with 12 encoder layers and 12 decoder layers was pretrained on different sets of languages. The final models were named mBARTNum, in which “Num” represents the number of languages used for training, and Random, which is a baseline model randomly initialized without pretraining.

The architecture of mBART with fine-tuning on machine translation is shown below,

source

The BART is a denoising autoencoder for pretraining sequence-to-sequence models. It is trained by (i) corrupting text with an arbitrary noising function, and (ii) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. The Fine-tuning BART for classification and translation is shown below,

source

J. Generative Adversarial Networks

Generative adversarial networks (GANs) are algorithmic architectures that use two neural networks, pitting one against the other (thus the “adversarial”) in order to generate new, synthetic instances of data that can pass for real data. They are used widely in image generation, video generation, and voice generation.

GANs are not directly used in recognition problems. They may be used for the generation of training data, rather than in the main pipeline, but we can also enhance OCR accuracy with super-resolution. The accuracy of OCR is often marred by the poor quality of the input document images. Generally, this performance degradation is attributed to the resolution and quality of scanning. This calls for special efforts to improve the quality of document images before passing them to the OCR engine. One compelling option is to super-resolve these low-resolution document images before passing them to the OCR engine.

source

GAN employs adversarial training which essentially means pitting two neural networks against each other. One is a generator while the other is a discriminator, where the former aims at producing data that are indistinguishable from real data while the latter tries to distinguish between real and fake data.
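A minimal PyTorch sketch of that adversarial objective is shown below; the networks are tiny fully connected stacks and the "real" data is random noise, purely to illustrate the alternating discriminator/generator updates.

```python
# Minimal GAN training sketch: the discriminator learns to separate real from generated
# samples, while the generator is updated to fool it. Data and shapes are toy values.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(100):
    real = torch.randn(32, data_dim)                 # stand-in for a batch of real samples
    fake = G(torch.randn(32, latent_dim))

    # 1) Discriminator step: real -> 1, fake -> 0 (fake detached so G is not updated here).
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Generator step: push D's output on fake samples towards "real".
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```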

CycleGAN has shown its worth in scenarios where there is a paucity of paired data, i.e., an image in the source domain and the corresponding image in the target domain. CycleGAN uses a cycle-consistency loss which says that if an image is transformed from the source distribution to the target distribution and back again to the source distribution, then we should get samples from the source distribution. The figure below shows CycleGAN: it consists of two generators, GA and GB, which map noisy images to clean images and clean images to noisy images, respectively, using the cycle-consistency loss.

source

GAN for Table localization

A table contains important information in a document, but it does not always follow a structured format. We use deep neural network architectures both for table localization, i.e., finding the table boundary area in a document image, and for table segmentation, which analyzes a table by finding its rows and columns to extract the structure of the table. Conditional GAN and Convolutional Neural Network (CNN) based architectures are very useful for localizing the table and segmenting the table structure.

A general approach for table extraction from an image using a GAN is shown below,

source

Table localization and segmentation is an important and critical step in document image analysis. Table segmentation is much harder than table localization, particularly in invoice documents, because sometimes there are nested rows, nested columns, or even nested tables in an invoice. High-resolution image synthesis and semantic manipulation can be performed with a conditional GAN based architecture well known as pix2pixHD. The pix2pixHD model was originally trained with a global and a local generator to achieve high-resolution images. The figure below is an example of a correctly localized table area using the pix2pixHD architecture,

source

A sample predicted output for table localization from the model trained with the pix2pixHD architecture is shown below,

source


K. Invoice Automation

Introduction

In invoice automation, the automation software scans the invoice and then converts it into an image or a text-searchable document. Different regions of an invoice can also be defined in the software so that it remembers from which region it should capture the data and register it into the Enterprise Resource Planning (ERP) system. The basic operations in invoice automation are (i) import of the images through scanning or email, (ii) identification of the vendor and business unit associated with the invoice, (iii) data extraction, and (iv) export of the extracted data and images.

Requirements of RPA for invoice automation

You might receive hundreds or thousands of invoices from your vendors for further processing. Your accounting department manually enters and verifies every invoice in the accounting systems and pushes it for payment. The efficiency of this process depends on the number of hours your team has in a day, and, well, that’s limited. Any errors or delays in processing their invoices can leave your vendors fuming; they may delay the delivery of the next set of goods/services. This process also takes time and can be automated for faster and more efficient processing. RPA is now well established in invoice processing: tasks which used to consume more time and assets can now be automated to deliver faster, more stable, and more affordable results.

Invoice processing with UiPath

RPA software robots can automate data input, error reconciliation, and some of the decision-making required by finance staff when processing invoices. At the same time, automation is able to limit errors in such processes and reduce the need for manual exception handling.

The steps where the UiPath Enterprise RPA Platform can be used end-to-end to move an invoice from receipt to payment in a matter of minutes are,

i. Invoice receipt: In this step, the UiPath RPA software robots are able to constantly monitor a dedicated folder where invoices are saved by employees (or other software robots) in PDF format. Once robots detect the presence of an invoice in the folder, they begin to extract information from the document.

ii. Information extraction & transfer: In this step by using intelligent optical character recognition (OCR) and natural language processing (NLP) capabilities, software robots are able to read out the information that is visible on the invoice. After robots extract the key information from each invoice, they use their credentials to open the company’s database or enterprise resource planning system, if not already open. The robots then start processing the invoices one-by-one by transferring over the relevant invoice information.

iii. Email notification: In this step, after successfully registering each invoice, the software robots are then able to send posting notifications in the form of emails to the responsible employee or to the vendor in question. An email is also sent to the responsible party in case of an exception.

iv. Other background activities: In the final step, during this whole process, the software robots are also running background activities such as monitoring the dedicated invoice folder or its email address, performing basic checks to see if the company’s database is open, and verifying whether vendor information (e.g. VAT number) on the invoice matches what is already in the database.

source

The benefits of using RPA in invoice automation are: i. reduced human error, ii. money saved by avoiding late fees, iii. lower irrelevant costs, iv. increased focus on activities with higher value add, v. organized invoice processing with your ERP, etc.

Invoice Processing with the UiPath RPA Tool

UiPath’s software robots are able to continuously monitor a dedicated folder where invoices are saved in PDF format. Once the robots detect the presence of an invoice in the folder, they begin to extract the information from that document. UiPath mainly uses three OCR technologies: i. Microsoft OCR, ii. Google OCR, iii. ABBYY OCR.

UiPath can also integrate with our deep learning applications via APIs so that deep learning can easily form part of the workflow. By using such intelligent algorithms and natural language processing capabilities, we are able to read the informative data that is visible on the invoice. Once the invoice is retrieved by a robot, it can read and find specific data in the invoice. The data grabbed by the robot can be modified and configured to the preferences of the use case.
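As an illustration of the OCR-and-extract step that such a workflow could call via an API, here is a hedged sketch using pytesseract and regular expressions; the field patterns and file path are purely illustrative and depend on the invoice layout (UiPath itself uses its own OCR activities rather than this code).

```python
# Sketch of the OCR + extraction step a software robot could call: run Tesseract over the
# invoice image with pytesseract, then pull a few fields with regular expressions.
import re
import pytesseract
from PIL import Image

def extract_invoice_fields(image_path):
    text = pytesseract.image_to_string(Image.open(image_path))

    invoice_no = re.search(r"Invoice\s*(?:No\.?|Number)[:\s]*([A-Z0-9-]+)", text, re.I)
    date = re.search(r"Date[:\s]*(\d{2}[/-]\d{2}[/-]\d{4})", text, re.I)
    total = re.search(r"Total[:\s]*\$?([\d,]+\.\d{2})", text, re.I)

    return {
        "invoice_number": invoice_no.group(1) if invoice_no else None,
        "invoice_date": date.group(1) if date else None,
        "total_amount": total.group(1) if total else None,
        "raw_text": text,
    }

# Example usage: the robot drops invoice images into a folder and calls this per file.
# print(extract_invoice_fields("invoices/INV-0001.png"))
```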

The workflow of RPA is shown below,

source

Final Outputs:

output_01
output_02

Summary

With this, we have covered invoice process automation using deep learning algorithms like the Recurrent Connectionist Text Proposal Network, Bidirectional Encoder Representations from Transformers (BERT), the Transformer model, and Generative Adversarial Networks for text recognition and table segmentation and localization, together with Robotics Process Automation. We have discussed many scenarios with which we can detect and extract meaningful information from invoice data. We have also discussed OCR verification to improve text detection.

Thank you for reading my blog.


