What Are the Top NLP Language Models?

*Before reading through our piece on NLP language models, make sure to check out the previous two installments of this series: “What is Natural Language Processing” and “Natural Language Processing Applications.”*

What’s this about: As we have covered in the previous two articles, natural language processing (NLP) is one of the most innovative and consequential artificial intelligence (AI) technologies. It is becoming increasingly present in our everyday lives and is disrupting many industries. NLP applications rely on complex language models, which are extremely difficult and tedious to develop from start to finish. Because of this, AI developers often turn to pre-trained language models that can be repurposed for various NLP functions and new datasets.

With NLP, machines are trained to understand how human language works, and they do this by analyzing and processing massive amounts of data. NLP’s abilities have come a long way in a short time, and they make our technological lives far easier and more efficient. Many people might not realize that when they begin typing a question into Google and the search engine finishes it for them, that is NLP at work.

Luckily for developers, there are many pre-trained NLP models that serve specific purposes. These models can be used for speech recognition, text-to-speech transformation, question answering, dialogue systems, document classification, article generation, sentiment analysis, and much more.

Let’s take a look at some of the top NLP language models on the market: 

BERT (Bidirectional Encoder Representations from Transformers) 

Developed in 2018, BERT is Google’s major contribution to the field of NLP models. The pre-trained model enables anyone to train their own question-answering models in as little as 30 minutes, and it can perform various NLP tasks simultaneously. 

BERT is the first deeply bidirectional, unsupervised system for pre-training NLP models. Bidirectional refers to its ability to read text input from both left-to-right and right-to-left, while unsupervised refers to a type of machine learning in which the algorithm does not receive any pre-assigned labels or scores for the training data. BERT was pre-trained using only a plain text corpus (in this case, Wikipedia), which enables the system to gain a better understanding of word context. When BERT was first released back in 2018, it demonstrated state-of-the-art performance on 11 NLP tasks, leading some in the AI community to say it ushered in a “new era of NLP.”

Here are some of the possible business applications of BERT: 

  • Chatbots for customer experience;

  • Customer review analysis;

  • Relevant information search.
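
For a sense of how little code it takes to put BERT to work, here is a minimal sketch that loads a publicly available BERT checkpoint fine-tuned for question answering. It assumes the Hugging Face transformers library, which the article itself does not prescribe.

```python
# Minimal sketch, assuming the Hugging Face `transformers` library (not something
# the article prescribes). Loads a BERT checkpoint fine-tuned on the SQuAD
# question-answering dataset and asks it a question about a short passage.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

context = (
    "BERT was released by Google in 2018. It is a deeply bidirectional, "
    "unsupervised system for pre-training NLP models."
)

result = qa(question="Who released BERT?", context=context)
print(result["answer"], result["score"])  # e.g. "Google" plus a confidence score
```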

OpenAI’s GPT-3

OpenAI’s GPT-3 model is the successor to the GPT and GPT-2 models, and it is one of the most famous and controversial pre-trained models in the field. It has an incredible 175 billion parameters, far more than any non-sparse language model released before it.

GPT-3 is a large-scale transformer-based language model, and it has demonstrated an impressive ability to perform various NLP tasks like translation, question answering, and even unscrambling words. More recently, GPT-3 has been used to generate news articles from scratch, which caused quite a stir as some feared machines were getting even closer to handling tasks once believed to be possible only for humans.

Those 175 billion parameters were learned from roughly 45 TB of text sourced from the internet, making GPT-3 one of the largest pre-trained NLP models. One of the most impressive features of the model is that it does not require fine-tuning to perform downstream tasks; developers can “reprogram” it simply by giving it instructions or a few examples in a prompt.

Here are some of the possible business applications of GPT-3:

  • Automated translation of documents;

  • Programming without code;

  • Cloning websites;

  • Producing tests and quizzes;

  • Automating documentation.
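
As a rough illustration of how developers “reprogram” GPT-3 with a prompt rather than fine-tuning, here is a minimal sketch using OpenAI’s Python client. The package interface and the model name have changed over time, so treat both as assumptions and check OpenAI’s current documentation.

```python
# Minimal sketch of calling GPT-3 via OpenAI's API. The `openai` package interface
# and the model name below have changed over time and are assumptions here.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]  # assumes an API key in the environment

# No fine-tuning: the prompt alone tells the model to act as a translator,
# one of the downstream tasks mentioned above.
response = openai.Completion.create(
    model="text-davinci-003",  # hypothetical choice of GPT-3 model
    prompt="Translate to French: 'Pre-trained language models save time.'",
    max_tokens=60,
    temperature=0,
)

print(response.choices[0].text.strip())
```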

RoBERTa (Robustly Optimized BERT Pre-Training Approach)

Facebook’s RoBERTa is an optimized method for pre-training a self-supervised NLP system. The model is specifically trained to predict intentionally hidden sections of text within unannotated language examples. For example, Facebook used RoBERTa to train its tools to know what hate speech or bullying looks like in text.

RoBERTa modifies hyperparameters in the BERT model, which enables it to improve on the masked language modeling objective and achieve better downstream task performance. For example, these modifications include training with larger mini-batches and removing BERT’s next-sentence pre-training objective. RoBERTa and other similarly pre-trained models have been able to outperform BERT on all individual tasks of the General Language Understanding Evaluation (GLUE) benchmark.

Here are some of the possible business applications of RoBERTa:

  • Dialogue systems;

  • Question answering;

  • Document classification;

  • and other downstream tasks.
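
RoBERTa’s pre-training objective, predicting intentionally hidden sections of text, is easy to see in action. Here is a minimal sketch, again assuming the Hugging Face transformers library, that asks the publicly released roberta-base checkpoint to fill in a masked word.

```python
# Minimal sketch, assuming the Hugging Face `transformers` library: RoBERTa's
# masked language modeling objective, i.e. predicting a hidden span of text.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

# RoBERTa uses "<mask>" as its mask token.
for prediction in fill_mask("Pre-trained language models make NLP <mask> easier."):
    print(prediction["token_str"], round(prediction["score"], 3))
```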

CodeBERT

Microsoft’s CodeBERT is a bimodal pre-trained model for programming language (PL) and natural language (NL). The system learns general-purpose representations that support downstream NL-PL applications like natural language code search and code documentation generation. CodeBERT can understand the connection between NL and PL and has been evaluated on NL-PL tasks by fine-tuning model parameters. The model was trained on a large dataset of GitHub code repositories spanning six programming languages.

The six programming languages are:

  • Python

  • Java

  • JavaScript

  • PHP

  • Ruby

  • Go
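
Because CodeBERT is bimodal, it encodes a natural language description and a code snippet as a single paired input, which is what powers tasks like natural language code search. The sketch below, assuming the Hugging Face transformers library and the publicly released microsoft/codebert-base checkpoint, shows that pairing.

```python
# Minimal sketch, assuming the Hugging Face `transformers` library and the
# publicly released "microsoft/codebert-base" checkpoint.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

nl_query = "return the maximum value in a list"
code_snippet = "def max_value(xs): return max(xs)"

# CodeBERT is bimodal: the natural-language query and the code snippet are
# encoded together as a sentence pair, as in natural language code search.
inputs = tokenizer(nl_query, code_snippet, return_tensors="pt")
outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```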

ALBERT

Google released ALBERT, a lite version of BERT, to address the issues that come with ever-growing model sizes. While increasing the size of pre-trained language models improves performance on downstream tasks, it can also lead to problems like longer training times and GPU/TPU memory limitations.

ALBERT was introduced with two parameter-reduction techniques to help lower consumption and increase training speed: 

  • Factorized Embedding Parameterization: The size of the hidden layers is decoupled from the size of the vocabulary embeddings. 

  • Cross-Layer Parameter Sharing: This technique helps prevent the number of parameters from growing as the depth of the network grows. 

The model advanced the state of the art on 12 NLP tasks and has been released as an open-source implementation on top of the TensorFlow framework. Its base configuration uses 89% fewer parameters than the comparable BERT model while still averaging 80.1% accuracy across benchmark tasks.

Here are some of the possible business applications of ALBERT:

  • Chatbot performance;

  • Sentiment analysis;

  • Document mining;

  • Text classification;

  • and other downstream tasks. 
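
The effect of ALBERT’s parameter-reduction techniques is easy to verify directly. The sketch below, assuming the Hugging Face transformers library, loads the open-source albert-base-v2 checkpoint alongside a comparable BERT checkpoint and compares their parameter counts.

```python
# Minimal sketch, assuming the Hugging Face `transformers` library: comparing
# parameter counts of comparable ALBERT and BERT checkpoints.
from transformers import AutoModel

albert = AutoModel.from_pretrained("albert-base-v2")
bert = AutoModel.from_pretrained("bert-base-uncased")

def count_params(model):
    return sum(p.numel() for p in model.parameters())

print(f"ALBERT parameters: {count_params(albert):,}")  # roughly 12 million
print(f"BERT parameters:   {count_params(bert):,}")    # roughly 110 million
```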

The Wide Range of Pre-Trained NLP Models

There is very little reason for the average developer to set out and build their own model from scratch; with limited resources and the time required, doing so can be nearly impossible. This is why it is so important that multiple state-of-the-art NLP models are already in the field. These models are far from restrictive, as they allow you to add new layers on top depending on the specific NLP task you are setting out to accomplish, as sketched below. With flexibility like that, they are likely to be far more successful than models built from the ground up.
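
To make the “add new layers on top” point concrete, here is a minimal sketch, assuming the Hugging Face transformers library, that attaches a fresh two-class classification head to a pre-trained BERT checkpoint. The head is randomly initialized, so its outputs only become meaningful after fine-tuning on your own labeled data.

```python
# Minimal sketch, assuming the Hugging Face `transformers` library: reusing a
# pre-trained model and adding a new task-specific layer on top of it.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# num_labels=2 attaches a new, randomly initialized two-class head; that head is
# what you would fine-tune on your own labeled data (e.g. sentiment analysis).
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

inputs = tokenizer(
    "These pre-trained models save a huge amount of work.", return_tensors="pt"
)
outputs = model(**inputs)
print(outputs.logits)  # untrained head, so these scores are not meaningful yet
```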

If you want to learn more about NLP models, here is a list of current models to read about:

  • XLNet: A generalized autoregressive model in which each token is conditioned on all previous tokens.

  • StructBERT: Developed by the Alibaba research team, StructBERT is an extension of BERT that leverages word-level and sentence-level ordering. 

  • GPT-4: Sam Altman, CEO of OpenAI, has recently confirmed the upcoming release of the GPT-4 model. It follows the GPT-1, GPT-2, and GPT-3 models, and according to Altman, GPT-4 will use more compute resources despite not being any bigger. 

If you want to gain more insight into NLP and other artificial intelligence technologies, make sure to sign up for the Medium blog: https://gcmori.medium.com/membership

Giancarlo Mori