Friday, March 17, 2017

Teaching a Machine to Regurgitate

Artificial intelligence (AI) is a topic that is dominating the news. Often, these stories are cast in a fearful light (e.g. your smart refrigerator will one day roll into your bedroom and kill your family). The truth is that while AI as both a topic of study and a commercialization opportunity is noticeably stronger than it has been in years, there's very little to fear from its ascendancy, at least for now.

To gain intelligence, something must be able to "learn." In order for something to learn, it has to be "taught." This process of teaching a machine to learn something is machine learning. As a discipline in computer science, machine learning has been around for decades, but putting it into practice has been difficult -- until now. Advances in computing power, storage, and data have pushed the field forward in significant ways; consequently, this has fueled its resurgence.

As we teach machines to learn, we as humans also learn how to teach them better. Indeed, years of research and experimentation in the field have led to huge growth in the body of machine learning techniques, and these techniques can be used to bring a semblance of intelligence to our computers. One of those machine learning methods is the neural network. The idea behind neural networks is to use vast computing resources to mimic the structure of the brain, with cells modeled after neurons that are attached to other neurons, firing as various thresholds are reached. These networks can be trained to react to certain inputs (stimuli). Much research has been, and continues to be, poured into this area.

One particular type of neural network is a recurrent neural network, or RNN for short. RNNs are interesting in that they form internal, cyclical connections that can be used to mimic dynamic behavior of sorts. Think of it like a bunch of interconnected roads on a map that turn into and out of each other, allowing traffic to flow in more than just one direction. This flow of information allows RNNs to learn differently than traditional feed-forward networks.
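To make the "cyclical connection" idea a little more concrete, here is a minimal sketch of a single recurrent step in plain Python. It assumes the simplest ("vanilla") RNN formulation, where the new hidden state is a tanh of the current input plus the previous hidden state; the names `rnn_step`, `run_sequence`, `w_xh`, `w_hh`, and `b_h` are my own illustrative choices, not anything taken from Karpathy's code.

```python
import math

def rnn_step(x, h_prev, w_xh, w_hh, b_h):
    """One step of a vanilla RNN: h = tanh(W_xh * x + W_hh * h_prev + b_h)."""
    h = []
    for i in range(len(b_h)):
        total = b_h[i]
        # Contribution of the current input.
        total += sum(w_xh[i][j] * x[j] for j in range(len(x)))
        # Contribution of the previous hidden state: this is the "loop".
        total += sum(w_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h.append(math.tanh(total))
    return h

def run_sequence(inputs, w_xh, w_hh, b_h):
    """Feed a sequence through the step function, carrying the state along."""
    h = [0.0] * len(b_h)
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_xh, w_hh, b_h)
        states.append(h)
    return states
```

With a non-zero `w_hh`, the state after the second input still carries a trace of the first input. Set `w_hh` to zero and that memory disappears, which is roughly the difference between a recurrent network and a feed-forward one.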

Neural networks can be trained to do a number of tasks, including recognizing and regurgitating patterns, from images, to sound and music, to text. It's the last of these that Andrej Karpathy explores in his blog post on the effectiveness of RNNs. To demonstrate his point, he trains an RNN on text from numerous plays written by Shakespeare. This creates a model that is then used to generate text that reads a lot like Shakespeare, although it doesn't necessarily make sense.
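Generating text from a trained model boils down to a loop: ask the model for the odds of each possible next character, pick one at random according to those odds, append it, and repeat. The sketch below shows only that loop. `toy_model` is a hypothetical stand-in that I made up so the example runs on its own; a real trained network would take its place.

```python
import random

def sample_text(next_char_probs, seed_text, length, rng=None):
    """Grow seed_text one character at a time by sampling from the model."""
    rng = rng or random.Random()
    text = seed_text
    for _ in range(length):
        probs = next_char_probs(text)  # dict mapping char -> probability
        chars = list(probs)
        weights = [probs[c] for c in chars]
        text += rng.choices(chars, weights=weights, k=1)[0]
    return text

def toy_model(context):
    """Hypothetical stand-in for a trained network: alternates 'a' and 'b'."""
    if context.endswith("a"):
        return {"a": 0.0, "b": 1.0}
    return {"a": 1.0, "b": 0.0}

print(sample_text(toy_model, "a", 4))  # prints "ababa"
```

Because the sampling is random, a real model produces different text on every run.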

So what does this all have to do with the CoCo?

As you read Karpathy's blog, it should occur to you that any collection of text could be used to feed into and train an RNN like the one he discusses. Why stop with Shakespeare? Any sufficiently large body of text is representative of a certain style of writing. Terminology, sentence structure, the use of punctuation... all of this can be "learned" and used to generate new data in the same style and vein. The Malted Media CoCo List is itself a large body of text -- it happens to be written by many authors, each with their own word usage and writing style. However, it has a common structure: everything is in the context of an email, containing headers, subject, body, and signature. There's sentence structure, quoting of other messages, and even repeatable terminology (CoCo, GIME, OS-9, etc.).
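Karpathy's post works at the level of individual characters, so the first step with any body of text is to turn it into a sequence of numbers, one per character. Here is a rough sketch of that step; the function names and the sample string are my own, chosen just for illustration.

```python
def build_vocab(text):
    """Assign a small integer ID to every distinct character in the text."""
    chars = sorted(set(text))
    char_to_id = {c: i for i, c in enumerate(chars)}
    id_to_char = {i: c for c, i in char_to_id.items()}
    return char_to_id, id_to_char

def encode(text, char_to_id):
    """Turn text into the list of character IDs a network would train on."""
    return [char_to_id[c] for c in text]

def decode(ids, id_to_char):
    """Turn a list of character IDs back into text."""
    return "".join(id_to_char[i] for i in ids)

sample = "> NitrOS-9 on the CoCo 3\n"
char_to_id, id_to_char = build_vocab(sample)
ids = encode(sample, char_to_id)
```

Nothing here knows about email headers, quoting, or words like "GIME". Whatever structure the model picks up has to come from the patterns in those character sequences.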

With this in mind, I decided to apply Karpathy's RNN work to training on the entire set of CoCo List raw text data from December 2003 to September 2016. Gathering the data was a matter of pulling down the monthly archives available on the site, then concatenating all of that data together into one large file composed of 7,985,340 lines of text, 45,272,983 words, or 288,690,945 characters (bytes).
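For the curious, the concatenate-and-count step can be sketched in a few lines of Python. This is not necessarily how I did it at the time, and the function name and file paths are hypothetical. It reads in binary mode so that "characters" means bytes, as in the figures above.

```python
def concatenate_and_count(paths, out_path):
    """Append each archive to out_path and tally lines, words, and bytes."""
    lines = words = chars = 0
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                for line in src:
                    out.write(line)
                    lines += 1
                    words += len(line.split())  # whitespace-separated words
                    chars += len(line)          # bytes, newline included
    return lines, words, chars
```

Calling it with a list of monthly archive files and an output name returns the three totals for the combined file.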

Training on a quad-core Intel Core i7 at 3.4GHz running Ubuntu Linux took exactly 14,654 minutes and 58 seconds (let's be generous and say 14,655 minutes). That's 244.25 hours, or 10.177 days to train an RNN with all of that data. The resulting model file came out to 2GB in size.
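If you want to check the arithmetic, the conversion is short enough to do in a couple of lines (the helper name is just for illustration):

```python
def minutes_to_hours_days(minutes):
    """Convert a duration in minutes to (hours, days)."""
    hours = minutes / 60
    days = hours / 24
    return hours, days

hours, days = minutes_to_hours_days(14_655)
print(f"{hours:.2f} hours, {days:.3f} days")  # prints "244.25 hours, 10.177 days"
```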

Now you can see the model regurgitate somewhat meaningful CoCo List messages by clicking here.

Here's some sample output:

>To: ColorComputer at, "James M color at says
my threat Aaron files and I was as well except reads platfor
allowed by a way to make one.}_So---Mug
12V quite a %dolleddlls prompted, so that it wouldn't include it and
going for change ports, but what
the lowercase could be group to
allow the earth. I already exist with "owner" had the asked just application on 4 years. Also why
a Dragon 2560 for Gun & e-mak with the disk reference.

Run is replaceing inteal for PC cartridge domain needed
'descriptor box pully Pin Cloud-9 USB drives on my
seems to be the authors are hard come.

The online?

I really happen to have to see it is directed by cpu data registration with Drive
computer??? Anyway, any
rest of the weag gback ever. :

Just polited to "bus and lc and 12 bit format). Been the interruct and discussion out. This)...

Message: 7
11:09 Nox6993 #DSK=500+21O auction
+723 Electronic Oh
Topit controlments in 279-30-1004


Cheers, Glenith at ?



> I have currently "need from disk image rares). in specifics plus: *** COCO Daying of
> multi-Page Disks' need on prefer or the "but Magazine GIME" and
> other
> > available?
> NitrOS-9's made command code.
> Marcution from acsaging to actually learn more of user removable, but keep this memory or a Luis I say and 300, 3. I love one
> but this is not writting that thread but based in that Cloud-9... the 1.4N talk about?
> Cocolist platform I have made for tweak and
> work because of
they are not zero and it always looking or have the
> Disto OS-9 disks screen, too, do a disk image you? Is a personal MESS ERROR if there is a end you?
> What should possibher.
> The
work if not up Wildort Crs9/new prompt possing the faster people work in chance to afford
> and I reavenPil config as not going to a development internal language and

At a glance, it looks like a message on a mailing list. There's a semblance of a header at the top, then there's a body of text, what looks like a signature, and the quoting of another message with the '>' character indicating the previous reply in the thread.

The sentences may be somewhat awkward to read, but for the most part, there are sensible words there. You can see recognizable terms from the CoCo lexicon: OS-9, Cloud-9, Disto, etc.

What's interesting is just how well the RNN mimics a CoCo List message. It's not perfect, but it's quite close, and it provides just enough of a mixture at times to make for a bit of humor.

It's a simple but interesting example of the power and utility of RNNs. I suggest reading Andrej Karpathy's previously referenced blog post for the details behind this. And don't forget to refresh your browser to get another dose of random CoCo List goodness.