Classifying Business Documents with Language Models

March 2, 2021

The Egnyte platform has been extended to classify documents by business document type. This lets the Egnyte governance solution assign documents to types such as invoices, contracts, NDAs, or financial statements. From a machine learning / AI perspective, this is a natural language processing (NLP) problem, specifically a classification task. The input is the raw text of the document, and the output is the name of the class to which it belongs.

There are many approaches to this problem, including classical ones such as bag-of-words, n-grams, or TF-IDF representations, as well as more advanced techniques that leverage embeddings such as word2vec or fastText. Current state-of-the-art approaches are based on neural language models: neural networks trained in a self-supervised manner on a large corpus of raw text. The goal of a language model is usually to predict a word given its context.

For instance, a language model might be asked to predict the word represented by <?> in the input sentence “Egnyte is a Mountain View based <?>”. The words “company” and “startup” are examples of semantically valid completions. It has been demonstrated that one can obtain state-of-the-art results by first training a network with a language-model objective on a large corpus of data and then fine-tuning it on a task-specific objective, such as text classification. This approach also requires much less task-specific data. Fine-tuning is currently the standard way of doing transfer learning in NLP.
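To make the "predict a word given its context" objective concrete, here is a minimal sketch that fills in the masked word using simple neighbor co-occurrence counts. It is a toy stand-in for a neural language model, not how RoBERTa or GPT-2 work internally. The tiny corpus, the company names in it, and the function names are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict

MASK = "<?>"

# Tiny illustrative corpus (hypothetical sentences, not real training data).
CORPUS = [
    "egnyte is a mountain view based company",
    "acme is a london based company",
    "initech is a texas based startup",
    "globex is a berlin based company",
]


def train_context_counts(sentences):
    """Count which words appear between each (left, right) neighbor pair."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Pad so that the first and last words also have two neighbors.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            counts[(tokens[i - 1], tokens[i + 1])][tokens[i]] += 1
    return counts


def predict_masked(counts, sentence, top_k=2):
    """Return the top_k most frequent words seen in the masked position's context."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    i = tokens.index(MASK)
    context = (tokens[i - 1], tokens[i + 1])
    return [word for word, _ in counts[context].most_common(top_k)]


counts = train_context_counts(CORPUS)
print(predict_masked(counts, "Egnyte is a Mountain View based <?>"))
# prints ['company', 'startup']
```

A real language model replaces the count table with a neural network that generalizes to contexts it has never seen, but the training signal is the same: the raw text itself supplies the labels.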

Publicly available generic language models are usually trained on Wikipedia and English books. This kind of text is well formatted and consists largely of well-formed sentences. Business documents are usually not so well structured; they often contain tables or information that is not presented in the form of sentences (think spreadsheets). For these documents, publicly available generic language models do not deliver the best results. We therefore decided to train our own language model to handle business-specific wording and semantics better than the publicly available ones.

Choosing the right architecture

The decision to train our own language model is just the beginning of the journey. Two of the most important decisions that follow are:

Which language model architecture to use? This affects the training cost, the quality on the downstream task, and the cost of running the model in production at inference time.
How to set up training infrastructure? This largely affects the overall cost of the project as well as the time to market.

We have considered BERT, RoBERTa, DistilBERT, GPT-2 as well as ALBERT, Reformer, and ULMFiT. In summary:

BERT - a breakthrough when released, but as the base architecture it ignited a lot of follow-up research that quickly surpassed it
RoBERTa - a carefully studied version of BERT with practical improvements, widely used in production systems
DistilBERT - a smaller version of BERT with a great performance-to-quality ratio but usually not used when training from scratch
GPT-2 - a large model showing its power in many natural language tasks, especially text generation
ALBERT - another iteration of BERT architecture with further improvements (also from Google)
Reformer - promising in terms of application, especially for handling long documents, but it was not widely used at the time of our research and its training procedure is tricky and error-prone
ULMFiT - an LSTM-based architecture, but for many applications Transformer-based models from the BERT-* family have surpassed its quality

After this theoretical assessment, we narrowed our choices down to RoBERTa and GPT-2. We then set up an experiment to train a document classifier on our dataset using publicly available pre-trained models, to see which one performs better. The GPT-2 architecture poses some challenges when adapted to classification tasks, partly because it is not a bidirectional language model like the BERT-* family. We observed that the final classification quality of the GPT-2-based model was inferior to that of the RoBERTa-based one, so we decided to proceed with the RoBERTa architecture.

Efficient training pipeline

Our large-scale training pipelines run on a Kubeflow cluster on Google Cloud Platform (GCP), alongside the Dataflow jobs used for data preprocessing.

Although Dataflow with dynamic scaling was already giving us a good performance-to-cost ratio, we took a few additional steps to keep our training costs (and time) under control.

First, we estimated that we would need to train on the most powerful GPUs on GCP, Nvidia V100s, for one to two weeks, assuming a single machine with 8xV100 accelerators.
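As a rough sanity check, that estimate can be converted into a GPU-hour budget. The sketch below only does the arithmetic implied by the figures above (8 GPUs, one to two weeks); it deliberately includes no prices, since none are given here, and the function name is hypothetical.

```python
GPUS_PER_MACHINE = 8  # a single machine with 8xV100, as stated above
HOURS_PER_DAY = 24


def gpu_hours(days, gpus=GPUS_PER_MACHINE):
    """Total GPU-hours consumed by one machine running for `days` days."""
    return days * HOURS_PER_DAY * gpus


low, high = gpu_hours(7), gpu_hours(14)
print(f"{low} to {high} V100 GPU-hours")  # prints: 1344 to 2688 V100 GPU-hours
```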

We also decided to try out Kubeflow’s retry mechanism as well as dynamic node pool scaling with preemptible virtual machines on the Kubernetes side to optimize training costs.

In this setup, virtual machines with GPUs are only spawned when needed during training. Moreover, the virtual machines are preemptible and therefore cost only a fraction of the standard ones. The downside of preemptible machines is that the cloud provider can reclaim them at any time, and always does so within 24 hours of creation. This is where Kubeflow's retry mechanism comes in handy. It monitors the running job and, if the job fails, automatically restarts it until it either finishes successfully or the maximum number of retries is exceeded.
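The retry semantics described above can be illustrated in plain Python. This is only a sketch of the idea (rerun until success or until the retry budget is spent), not Kubeflow's actual API; `Preempted`, `run_with_retries`, and `make_flaky_job` are hypothetical names introduced for the example.

```python
class Preempted(Exception):
    """Raised by the simulated job when its (hypothetical) VM is reclaimed."""


def run_with_retries(job, max_retries):
    """Run `job` until it succeeds or `max_retries` retries are used up.

    Returns the number of attempts made. Re-raises the last error once
    the retry budget is exhausted.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            job()
            return attempts
        except Preempted:
            # The first attempt is not a retry, so allow max_retries + 1 attempts.
            if attempts > max_retries:
                raise


def make_flaky_job(failures_before_success):
    """Build a job that is 'preempted' a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def job():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise Preempted("VM was reclaimed")

    return job


print(run_with_retries(make_flaky_job(2), max_retries=5))  # prints 3
```

Retrying alone is not enough: unless the job can pick up where it left off, every restart begins training from scratch.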

Having the infrastructure ready is only part of the success. By using preemptible VMs, we traded the training job's stability for lower cost. The script that trains our language model therefore needs to be fault-tolerant: training can stop at any moment, and the script must be able to resume and continue from where it left off. As a consequence, we need to:

Store our whole dataset in a regional Google Cloud Storage bucket to minimize the network cost of accessing it from VMs in the same region;
Implement a checkpointing mechanism that saves both the model weights and all of the optimizer's internal state to Google Cloud Storage every 200 iterations;
Implement an automatic resume mechanism: when the script starts, it scans the specified Google Cloud Storage bucket for existing checkpoints. If it finds one, it downloads the latest checkpoint and continues training as if no preemption had ever happened.
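The checkpoint-and-resume steps above can be sketched as follows. To keep the example self-contained, a local temporary directory stands in for the Google Cloud Storage bucket and a two-number "model" stands in for the real network and optimizer; the file naming scheme, the JSON format, and all function names are assumptions for illustration, not the actual training script.

```python
import json
import os
import tempfile

CHECKPOINT_EVERY = 200  # iterations, as in the list above


def latest_checkpoint(ckpt_dir):
    """Return the path of the newest checkpoint in ckpt_dir, or None."""
    names = [n for n in os.listdir(ckpt_dir)
             if n.startswith("step-") and n.endswith(".json")]
    if not names:
        return None
    # Step numbers are zero-padded, so lexicographic order equals step order.
    return os.path.join(ckpt_dir, max(names))


def save_checkpoint(ckpt_dir, state):
    """Write to a temp file and rename, so a preemption never leaves a partial file."""
    final_path = os.path.join(ckpt_dir, f"step-{state['step']:08d}.json")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, final_path)


def train(ckpt_dir, total_steps, preempt_at=None):
    """Toy training loop that resumes from the latest checkpoint if one exists."""
    path = latest_checkpoint(ckpt_dir)
    if path is not None:
        with open(path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "weights": 0.0, "optimizer": {"momentum": 0.0}}

    while state["step"] < total_steps:
        if preempt_at is not None and state["step"] == preempt_at:
            raise RuntimeError("simulated preemption")
        # Stand-in for one optimizer update; both values must be checkpointed.
        state["optimizer"]["momentum"] = 0.9 * state["optimizer"]["momentum"] + 1.0
        state["weights"] += 0.01 * state["optimizer"]["momentum"]
        state["step"] += 1
        if state["step"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(ckpt_dir, state)
    return state


with tempfile.TemporaryDirectory() as bucket:  # stand-in for a GCS bucket
    try:
        train(bucket, total_steps=1000, preempt_at=450)
    except RuntimeError:
        pass  # the VM was "reclaimed"; the retry simply starts the script again
    resumed = train(bucket, total_steps=1000)  # picks up from step 400

with tempfile.TemporaryDirectory() as bucket:
    uninterrupted = train(bucket, total_steps=1000)

print(resumed == uninterrupted)  # prints True
```

Saving the optimizer state alongside the weights is what makes the resumed run match the uninterrupted one; restoring the weights alone would reset the momentum term and change the training trajectory. The work done between the last checkpoint and the preemption (steps 400 to 450 here) is repeated, which is the cost that the checkpoint interval controls.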

With this setup, we are able to train our custom language model within two weeks of pure compute time while staying cost effective.

Applications of the custom language model

With our trained business-language model, we can now improve the precision of our previously developed in-house document classification system (based on public language models) by up to 4 percentage points per class. This business-language model can also be leveraged in other tasks, such as named entity recognition (NER) or document similarity search, without having to source large amounts of training data first. As a consequence, new models operating on business documents can be trained faster by starting from our business-language model, and their quality is expected to be higher than that of models based on generic public language models.
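For readers unfamiliar with the metric, the sketch below shows one way to compute precision per class and the change in percentage points between two classifiers. The labels and predictions are made up purely for illustration and do not reflect our actual results; `precision_per_class` is a hypothetical helper name.

```python
from collections import Counter


def precision_per_class(y_true, y_pred):
    """Precision per class: correct predictions of a class / all predictions of it."""
    predicted = Counter(y_pred)
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / predicted[label] for label in predicted}


# Hypothetical ground truth and predictions, invented for this example.
y_true = ["invoice", "invoice", "contract", "nda", "contract", "nda"]
baseline = ["invoice", "contract", "contract", "nda", "invoice", "nda"]
improved = ["invoice", "invoice", "contract", "nda", "invoice", "nda"]

before = precision_per_class(y_true, baseline)
after = precision_per_class(y_true, improved)
for label in sorted(before):
    delta_pp = 100 * (after[label] - before[label])
    print(f"{label}: {delta_pp:+.1f} percentage points")
```

Note that a percentage-point change is an absolute difference between two precision values, not a relative improvement.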
