How to Create Your Own State-of-the-Art Text Generation System

Hardly a day goes by when there isn’t a story about fake news. It reminds me of a quote from a favorite radio newsman of my youth: “If you don’t like the news, go out and make some of your own.” OpenAI’s breakthrough language model, the 1.5-billion-parameter version of GPT-2, got good enough at generating convincing text that the group decided it was too dangerous to release publicly, at least for now. However, OpenAI has now released two smaller versions of the model, along with tools for fine-tuning them on your own text. So, without too much effort, and using dramatically less GPU time than it would take to train from scratch, you can create a tuned version of GPT-2 that will be able to generate text in the style you give it, or even start to answer questions similar to ones you train it with.

What Makes GPT-2 Special

GPT-2 (Generative Pre-trained Transformer, version 2) is based on the Transformer, a very powerful attention-based neural network architecture. What got the researchers at OpenAI so excited about it was finding that it could address a number of language tasks without being directly trained on them. Once pre-trained on its massive corpus of web pages linked from Reddit and given the proper prompts, it did a passable job of answering questions and translating languages. It certainly isn’t anything like Watson as far as semantic knowledge, but this type of unsupervised learning is particularly exciting because it removes much of the time and expense needed to label data for supervised learning.

Overview of Working With GPT-2

For such a powerful tool, the process of working with GPT-2 is thankfully fairly simple, as long as you are at least a little familiar with TensorFlow. Most of the tutorials I’ve found also rely on Python, so having at least a basic knowledge of programming in Python or a similar language is very helpful. Currently, OpenAI has released two pre-trained versions of GPT-2. One (117M) has 117 million parameters, while the other (345M) has 345 million. As you might expect, the larger version requires more GPU memory and takes longer to train. You can train either on your CPU, but it is going to be really slow.
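Before committing to a long run, it’s worth confirming that TensorFlow can actually see your GPU and how much memory it has. Here’s a minimal check, assuming one of the TensorFlow 1.x releases the GPT-2 code targets:

```python
# Quick sanity check that TensorFlow 1.x can see a GPU before training.
import tensorflow as tf
from tensorflow.python.client import device_lib

print("GPU available:", tf.test.is_gpu_available())

# List each device TensorFlow found, with its memory limit, so you can judge
# whether it has room for the model you picked (117M vs. 345M).
for device in device_lib.list_local_devices():
    print(device.name, device.device_type, device.memory_limit)
```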

The first step is downloading one or both of the models. Fortunately, most of the tutorials, including the ones we’ll walk you through below, have Python code to do that for you. Once downloaded, you can run the pre-trained model either to generate text automatically or in response to a prompt you provide. But there is also code that lets you build on the pre-trained model by fine-tuning it on a data source of your choice. Once you’ve tuned your model to your satisfaction, then it’s simply a matter of running it and providing suitable prompts.

Working with GPT-2 On Your Local Machine

There are a number of tutorials on this, but my favorite is by Max Woolf. In fact, until the OpenAI release, I was working with his text-generating RNN, textgenrnn, which he borrowed from for his GPT-2 work. He’s provided a full package on GitHub for downloading, tuning, and running a GPT-2-based model. You can even snag it directly as a package from PyPI. The readme walks you through the entire process, with some suggestions on how to tweak various parameters. If you happen to have a massive GPU handy, this is a great approach, but since the 345M model needs most of a 16GB GPU for training or tuning, you may need to turn to a cloud GPU.
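To give a flavor of how simple his package makes things, here is a minimal sketch of the basic flow as I understand it from the readme. The corpus filename is a placeholder for your own text file, and depending on the package version the models may be named “124M”/“355M” rather than “117M”/“345M”:

```python
# Minimal fine-tuning sketch with Max Woolf's gpt-2-simple package
# (install it first with: pip install gpt-2-simple).
# "my_corpus.txt" is a placeholder; newer releases of the package may refer to
# the models as "124M"/"355M" instead of "117M"/"345M".
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="117M")   # fetch the pre-trained model to ./models/

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="my_corpus.txt",  # plain-text file to fine-tune on
              model_name="117M",
              steps=1000)               # bump this up for a longer run

# Generate a sample from the tuned model, optionally seeded with a prompt.
gpt2.generate(sess, prefix="The latest graphics cards from Nvidia")
```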

Working with GPT-2 for Free Using Google’s Colab

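If you don’t have a big GPU of your own, Max Woolf also provides a ready-made Colab notebook that wraps the same package, so you can borrow one of Google’s free GPUs. Roughly, the notebook’s flow looks like the sketch below; the run name and corpus filename are placeholders, the model naming caveat above applies, and the Google Drive helpers are how I recall the package working around Colab’s ephemeral storage:

```python
# Rough sketch of the gpt-2-simple workflow inside a Colab notebook.
# "my_corpus.txt" and "run1" are placeholders; the Drive helpers below exist
# because a Colab session reset wipes the local filesystem.
import gpt_2_simple as gpt2

gpt2.mount_gdrive()                     # authorize access to your Google Drive
gpt2.download_gpt2(model_name="345M")   # the free Colab GPUs can handle 345M

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="my_corpus.txt",
              model_name="345M",
              steps=1000,
              run_name="run1")

# Copy the checkpoint to Drive so a session reset doesn't cost you the run,
# and pull it back down the next time you reconnect.
gpt2.copy_checkpoint_to_gdrive(run_name="run1")
# gpt2.copy_checkpoint_from_gdrive(run_name="run1")
```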

Getting Data for Your Project

Now that powerful language models have been released onto the web, and tutorials abound on how to use them, the hardest part of your project might be creating the dataset you want to use for tuning. If you want to replicate the experiments of others by having it generate Shakespeare or write Star Trek dialog, you can simply snag one that is online. In my case, I wanted to see how the models would do when asked to generate articles like those found on wfoojjaec. I had access to a back catalog of over 12,000 articles from the last 10 years. So I was able to put them together into a text file, and use it as the basis for fine-tuning.
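If your articles start out as individual files, stitching them into a single training file is only a few lines of Python. A hypothetical sketch (the folder and output names are placeholders; `<|endoftext|>` is the document-separator token GPT-2 itself was trained with, which is a handy convention to follow):

```python
# Hypothetical sketch: combine a folder of article text files into one corpus.
# "articles/" and "corpus.txt" are placeholder names; <|endoftext|> marks the
# boundary between documents, matching GPT-2's own training convention.
from pathlib import Path

article_dir = Path("articles")
with open("corpus.txt", "w", encoding="utf-8") as corpus:
    for article in sorted(article_dir.glob("*.txt")):
        text = article.read_text(encoding="utf-8").strip()
        corpus.write(text + "\n<|endoftext|>\n")
```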

If you have other ambitions that include mimicking a website, scraping is certainly an alternative. There are some sophisticated services like ParseHub, but they are limited unless you pay for a commercial plan. I have found the Webscraper.io Chrome extension to be flexible enough for many applications, and it’s fast and free. One big cautionary note: pay attention to the Terms of Service of whatever website you’re thinking of scraping, as well as any copyright issues. Judging from the output of various language models, they certainly haven’t been taught not to plagiarize.
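If you’d rather script the scraping yourself instead of using a GUI tool, the requests and BeautifulSoup libraries cover a lot of ground. A bare-bones sketch, with a placeholder URL and CSS selector, and the same caveat about terms of service applies:

```python
# Bare-bones scraping sketch with requests + BeautifulSoup, as an alternative
# to GUI tools like Webscraper.io. The URL and selector below are placeholders;
# adjust them for a site you have permission to scrape.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-article"
response = requests.get(url, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
paragraphs = soup.select("div.article-body p")   # placeholder selector

article_text = "\n".join(p.get_text(strip=True) for p in paragraphs)
with open("scraped_article.txt", "w", encoding="utf-8") as f:
    f.write(article_text)
```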

So, Can It Do Tech Journalism?

Once I had my corpus of 12,000 wfoojjaec articles, I started by trying to train the simplified GPT-2 on my desktop’s Nvidia GTX 1080 GPU. Unfortunately, the GPU’s 8GB of RAM wasn’t enough. So I switched to training the 117M model on my 4-core i7. It wasn’t insanely terrible, but it would have taken over a week to make a real dent, even with the smaller of the two models. So I switched to Colab and the 345M model. The training was much, much faster, but needing to deal with session resets and the unpredictability of which GPU I’d get for each session was annoying.

Upgrading to Google’s Compute Engine

After that, I bit the bullet, signed up for a Google Compute Engine account, and decided to take advantage of the $300 credit Google gives new customers. If you’re not familiar with setting up a VM in the cloud, it can be a bit daunting, but there are lots of online guides. It’s simplest if you start with one of the pre-configured VMs that already has TensorFlow installed. I picked a Linux version with 4 vCPUs. Even though my desktop system is Windows, the same Python code ran perfectly on both. You then need to add a GPU, which in my case took a request to Google support for permission. I assume that is because GPU-equipped machines are more expensive and less flexible than CPU-only machines, so they have some type of vetting process. It only took a couple of hours, and I was able to launch a VM with a Tesla T4. When I first logged in (using the built-in SSH), it reminded me that I needed to install Nvidia drivers for the T4 and gave me the command I needed.

Next, you need to set up a file transfer client like WinSCP and get started working with your model. Once you upload your code and data, create a Python virtual environment (optional), and load up the needed packages, you can proceed the same way you did on your desktop. I trained my model in increments of 15,000 steps and downloaded the model checkpoints each time, so I’d have them for reference. That can be particularly important if you have a small training dataset, as too much training can cause your model to overfit and actually get worse. So having checkpoints you can return to is valuable.
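With gpt-2-simple, training in increments like this comes down to a few arguments to finetune, as I recall them: a run name so checkpoints land in their own folder, save_every to write checkpoints during the run, and restore_from="latest" so each new invocation picks up where the last one stopped. A sketch with placeholder names:

```python
# Sketch of incremental training with gpt-2-simple. "wfoojjaec_corpus.txt" and
# "run1" are placeholders; restore_from="latest" resumes from the most recent
# checkpoint, so repeated runs keep adding steps to the same model.
import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="wfoojjaec_corpus.txt",
              model_name="345M",
              run_name="run1",
              steps=15000,            # one 15,000-step increment
              restore_from="latest",  # continue from the last checkpoint
              save_every=1000,        # write a checkpoint every 1,000 steps
              sample_every=500,       # print a sample so you can eyeball progress
              print_every=10)
```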

Speaking of checkpoints, like the models, they’re large. So you’ll probably want to add a disk to your VM. By having the disk separate, you can always use it for other projects. The process for automatically mounting it is a bit annoying (it seems like it could be a checkbox, but it’s not). Fortunately, you only have to do it once. After I had my VM up and running with the needed code, model, and training data, I let it loose. The T4 was able to run about one step every 1.5 seconds. The VM I’d configured cost about $25/day (remember that VMs don’t turn themselves off; you need to shut them down if you don’t want to be billed, and persistent disk keeps getting billed even then).

To save some money, I transferred the model checkpoints (as a .zip file) back to my desktop. I could then shut down the VM (saving a buck or two an hour) and interact with the model locally. You get the same output either way because the model and checkpoint are identical. The traditional way to evaluate the success of your training is to hold out a portion of your training data as a validation set. If the training loss continues to decrease but the loss on the held-out validation data starts to increase, it is likely you’ve started to overfit your data and your model is simply “memorizing” your input and feeding it back to you. That reduces its ability to deal with new information.
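One way to set that up is to carve the validation slice off when you build the corpus and then compute the loss on it periodically, however your training script supports that. A hypothetical sketch that reserves the last 10 percent of the corpus, splitting on the document separator so no article straddles the boundary:

```python
# Hypothetical sketch: hold out the last 10% of the corpus as a validation set.
# "corpus.txt" is a placeholder; splitting on <|endoftext|> keeps each whole
# article on one side of the boundary or the other.
documents = open("corpus.txt", encoding="utf-8").read().split("<|endoftext|>")
split_point = int(len(documents) * 0.9)

with open("train.txt", "w", encoding="utf-8") as f:
    f.write("<|endoftext|>".join(documents[:split_point]))

with open("validation.txt", "w", encoding="utf-8") as f:
    f.write("<|endoftext|>".join(documents[split_point:]))
```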

Here’s the Beef: Some Sample Outputs After Days of Training

After experimenting with various types of prompts, I settled on feeding the model (which I’ve nicknamed The Oracle) the first sentences of actual wfoojjaec articles and seeing what it came up with. After 48 hours (106,000 steps in this case) of training on a T4, here is an example:

The output of our model after two days of training on a T4 when fed the first sentence of Ryan Whitwam’s Titan article. Obviously, it’s not going to fool anyone, but the model is starting to do a decent job of linking similar concepts together at this point.

The more information the model has about a topic, the more it starts to generate plausible text. We write about Windows Update a lot, so I figured I’d let the model give it a try:

The model’s response to a prompt about Windows Update after a couple of days of training.

With something as subjective as text generation, it is hard to know how far to go with training a model. That’s particularly true because each time a prompt is submitted, you’ll get a different response. If you want to get some plausible or amusing answers, your best bet is to generate several samples for each prompt and look through them yourself. In the case of the Windows Update prompt, we fed the model the same prompt after another few hours of training, and it looked like the extra work might have been helpful:

After another few hours of training, here is the best of the samples when given the same prompt about Microsoft Windows.
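If you want to try the several-samples-per-prompt approach yourself, gpt-2-simple makes it a single call, as I recall: prefix sets the prompt, nsamples controls how many completions you get back, and temperature controls how adventurous they are. A sketch with a placeholder prompt and run name:

```python
# Sketch: generate several candidate completions for one prompt and pick the
# best by hand. The prompt text and run name are placeholders.
import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name="run1")   # load the fine-tuned checkpoint

samples = gpt2.generate(sess,
                        run_name="run1",
                        prefix="Microsoft has released a new cumulative update for Windows 10.",
                        length=300,        # tokens to generate per sample
                        temperature=0.8,   # lower = safer, higher = weirder
                        nsamples=5,        # several tries per prompt
                        batch_size=5,      # generate them in one batch
                        return_as_list=True)

for i, text in enumerate(samples):
    print(f"--- sample {i + 1} ---\n{text}\n")
```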

Here’s Why Unsupervised Models are So Cool

I was impressed, but not blown away, by the raw predictive performance of GPT-2 (at least the public version) compared with simpler solutions like textgenrnn. What I didn’t catch on to until later was its versatility. GPT-2 is general purpose enough that it can address a wide variety of use cases. For example, if you give it pairs of French and English sentences as a prompt, followed by only a French sentence, it does a plausible job of generating translations. Or if you give it question-and-answer pairs, followed by a question, it does a decent job of coming up with a plausible answer (a rough sketch of building that kind of prompt follows below). If you generate some interesting text or articles, please consider sharing, as this is definitely a learning experience for all of us.
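Trying that trick amounts to packing the example pairs into the prompt and letting the model continue the pattern. A hypothetical sketch of a question-and-answer style prompt (the pairs are made up for illustration, the run name is a placeholder, and the same idea works with English/French sentence pairs):

```python
# Sketch of "prompting" GPT-2 with question/answer pairs so it continues the
# pattern. The example pairs are made up; swap in your own, and point run_name
# at whichever checkpoint you want to query.
import gpt_2_simple as gpt2

prompt = (
    "Q: What company makes the GeForce line of graphics cards?\n"
    "A: Nvidia.\n"
    "Q: What operating system does Microsoft sell?\n"
    "A: Windows.\n"
    "Q: Who makes the Ryzen line of CPUs?\n"
    "A:"
)

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name="run1")   # placeholder fine-tuned run

gpt2.generate(sess,
              run_name="run1",
              prefix=prompt,
              length=20,          # only a short continuation is needed
              temperature=0.7,
              nsamples=3)
```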