Introduction to Large Language Models (LLMs)

Generative artificial intelligence chatbots like ChatGPT use natural language processing (NLP) to communicate with their users. More specifically, these so-called generative pre-trained transformers (GPTs) are based on large language models (LLMs). This article gives a short introduction to (large) language models in general, as they provide the basis for further understanding and working with such bots.

Natural Language vs. Formal Language

Natural language, no matter where in the world it is spoken, is ambiguous. For example, the sentence “I saw the man with the telescope” can be interpreted in two different ways: the person speaking saw a man who had a telescope, or the person speaking used a telescope to see a man.

Formal languages, however, stand in stark contrast to our human spoken or written languages. They have to be unambiguous by design. A grammar defines the syntax of legal sentences, and semantic rules define their meaning [RusNor22]. Formal languages are used, for example, in mathematics or in the definition of programming languages such as Python. And as we all know, an interpreter or compiler would rather give us an error message than try to make assumptions or deal with ambiguities.

N-Gram Character Models

A language model is a probability distribution describing the likelihood of a string [RusNor22]. Humans can often recognize quite quickly which language a text is written in, at least for the major languages or language regions of the world (e.g., “Eastern European”).

By using a character model (also called a character-level model), it is relatively easy to determine the probability that a given word belongs to a certain language, even if we have never seen (and thus never learned) that word before.

Assume the following sentence:

The playful puppy pranced proudly in the picturesque park.

A program can look at every letter and analyze which letters (or punctuation marks) follow it, and with what probability. Examining the letter p, for example, we get the following probability distribution:

p→a 12.5 %
p→i 12.5 %
p→l 12.5 %
p→p 12.5 %
p→r 25.0 %
p→u 12.5 %
p→y 12.5 %

Note the 25 % probability of the sequence p→r, as it appears twice (and thus with double probability) within our text.
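
The following minimal Python sketch shows how such a program could compute this distribution. The function name successor_distribution is just an illustrative choice for this article, not part of any particular library.

from collections import Counter

sentence = "the playful puppy pranced proudly in the picturesque park."

def successor_distribution(text, letter):
    """Count which characters follow `letter` and turn the counts into probabilities."""
    followers = Counter(b for a, b in zip(text, text[1:]) if a == letter)
    total = sum(followers.values())
    return {succ: count / total for succ, count in followers.items()}

for succ, prob in sorted(successor_distribution(sentence, "p").items()):
    print(f"p -> {succ}  {prob:.1%}")

Running it prints the same seven successors as the table above, including the doubled probability for r.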

If we look not at one letter and its successor, but at two or even more letters and their (one-character) successors, we can capture syllables, which are even more characteristic of a particular language. Examining the two-letter sequence th, for example, we get (in this very short text) a 100 % probability that it is followed by e.

A two-letter sequence (as shown in the table above) is called a 2-gram or bigram; it depends on one preceding letter. Three-letter sequences are called 3-grams or trigrams and depend on the two preceding letters.

When trained on large (labeled) datasets of different languages, combining the probabilities of such n-gram models yields very high accuracy for language detection, usually greater than 99 % [RusNor22].
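
To make the idea concrete, here is a small sketch of how such a detector could work in Python. The two “training corpora” are tiny made-up snippets chosen only for illustration; a real detector would be trained on large datasets and would combine several n-gram orders.

import math
from collections import Counter

def bigram_model(text, alpha=1.0):
    """Estimate smoothed character-bigram log-probabilities from a training text."""
    text = text.lower()
    pair_counts = Counter(zip(text, text[1:]))
    context_counts = Counter(text[:-1])
    vocab_size = len(set(text))
    def log_prob(a, b):
        # Laplace smoothing so unseen pairs do not get probability zero.
        return math.log((pair_counts[(a, b)] + alpha) /
                        (context_counts[a] + alpha * vocab_size))
    return log_prob

def score(text, log_prob):
    """Sum the bigram log-probabilities of a text under a given model."""
    text = text.lower()
    return sum(log_prob(a, b) for a, b in zip(text, text[1:]))

english = bigram_model("the quick brown fox jumps over the lazy dog and the cat")
german = bigram_model("der schnelle braune fuchs springt ueber den faulen hund und die katze")

word = "through"
guess = "English" if score(word, english) > score(word, german) else "German"
print(f"'{word}' looks more like {guess}")

The model with the higher total log-probability “wins”, which is exactly the comparison a language detector makes, only on a much larger scale.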

N-Gram Word Models

We can now simply scale the n-gram character model to an n-gram word model. Instead of looking at character sequences (that make up words), we now look at word sequences (that make up sentences).

Assume the following sentence:

The cat chased its tail in circles, and as it spun around, the tail followed, creating a playful dance of the cat and its tail.

This results in the following probability distribution for the word tail, which can be determined in a straightforward manner with a program similar to the one shown above:

tail → in 33.3 %
tail → followed 33.3 %
tail → . 33.3 %
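
The word-level variant of the earlier character sketch is just as short. The simple tokenization (splitting off commas and the final period) is an assumption made for this example only:

from collections import Counter

sentence = ("The cat chased its tail in circles, and as it spun around, "
            "the tail followed, creating a playful dance of the cat and its tail.")

def word_successors(text, word):
    """Count which tokens follow `word` and convert the counts into probabilities."""
    # Split off punctuation so that "tail." yields the successor ".".
    tokens = text.lower().replace(",", " ,").replace(".", " .").split()
    followers = Counter(b for a, b in zip(tokens, tokens[1:]) if a == word)
    total = sum(followers.values())
    return {succ: count / total for succ, count in followers.items()}

for succ, prob in word_successors(sentence, "tail").items():
    print(f"tail -> {succ}  {prob:.1%}")

The output matches the three successors listed above, each with a probability of 33.3 %.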

Applying the same concept to bigram, trigram, or even higher-order word models results in more or less probable sentence structures. This is what humans usually call “grammar”. Not only can we now detect the language a text is written in (using both character and word models), but also whether its sentences are grammatically correct.

Large Language Models

A language model, as described above, can not only be used to read or understand natural language, but also to generate it. Given the probabilities of the next word based on its n predecessor words, it is quite straightforward to generate sentences (or, in the case of character models, to create known or even new words that “sound” like the natural language). However, combining words (or characters) mainly based on probabilities may also result in new, possibly wrong sentences (or facts), a phenomenon usually known as “hallucination”.
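
As a minimal sketch of this generative use, the following snippet builds a word-bigram model from a tiny made-up training text (again only an assumption for illustration) and then produces a sentence by repeatedly sampling the next word:

import random
from collections import Counter, defaultdict

training_text = ("the cat chased its tail in circles and the cat spun around "
                 "and the tail followed the cat")

def build_bigram_model(text):
    """Map each word to a Counter of the words observed directly after it."""
    model = defaultdict(Counter)
    tokens = text.lower().split()
    for a, b in zip(tokens, tokens[1:]):
        model[a][b] += 1
    return model

def generate(model, start, length=10):
    """Generate text by repeatedly sampling the next word from the bigram counts."""
    words = [start]
    for _ in range(length):
        followers = model.get(words[-1])
        if not followers:  # dead end: the last word never appeared as a context
            break
        choices = list(followers)
        weights = [followers[w] for w in choices]
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate(build_bigram_model(training_text), "the"))

Every run may produce a different, and not necessarily sensible, word sequence, which already hints at the hallucination problem mentioned above.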

The difference between a language model and a large language model is essentially just the amount of data it has been trained on. The famous ChatGPT has reportedly been trained on hundreds of billions of words, including all of Wikipedia, public source code, and probably the majority of books and news articles that were publicly accessible online at the time.

References

[RusNor22]
S. Russell, P. Norvig: Artificial Intelligence: A Modern Approach, Fourth Edition, Global Edition, Pearson, 2022

Shortlink to this blog post: link.simplexacode.ch/nnz62024.01
