Happy New Year: GPT in 500 lines of SQL
Translations: Russian
This year, the talk of the town was AI and how it can do everything for you.
I like it when someone or something does everything for me. To this end, I decided to ask ChatGPT to write my New Year's post:
"Hey ChatGPT. Can you implement a large language model in SQL?"
"No, SQL is not suitable for implementing large language models. SQL is a language for managing and querying data in relational databases. Implementing a language model requires complex algorithms, large datasets, and often deep learning frameworks, which are beyond the capabilities of SQL."
It just proves that if you want something done right, you have to do it yourself.
Encouraged by this optimistic forecast, today we will implement a large language model in SQL.
Theory
While writing this post, I used the wonderful article GPT in 60 Lines of NumPy by Jay Mody. This article explains the inner workings of a GPT model much better than I can hope to do. Still, a little recap is in order.
What is a generative large language model from a technical perspective?
A generative LLM is a function. It takes a text string as input (called a "prompt" in AI parlance) and returns an array of strings and numbers. Here's what the signature of this function looks like:
llm(prompt: str) -> list[tuple[str, float]]
This function is deterministic. It does a lot of math under the hood, but all this math is hardwired. If you call it repeatedly with the same input, it will always return the same output.
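To make this concrete, here is a toy stand-in in Python. The llm name matches the signature above, but the hardcoded distribution is made up for illustration; the real model computes the probabilities instead of looking them up, yet the interface and the determinism are the same:

# A toy stand-in for the real model: a hardcoded distribution instead of
# the actual math, but the same interface and the same determinism.
TOY_DISTRIBUTION = {
    "I wish you a happy New": [(" Year", 0.97), (" Years", 0.02), (" York", 0.01)],
}

def llm(prompt: str) -> list[tuple[str, float]]:
    # No randomness anywhere: the same prompt always yields the same list.
    # The fallback exists only so that this toy never returns an empty list.
    return TOY_DISTRIBUTION.get(prompt, [(".", 1.0)])

assert llm("I wish you a happy New") == llm("I wish you a happy New")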
This may come as a surprise to anyone who has used ChatGPT and similar products, because they can give different answers to the same question. Yet, it's true. We will shortly see how that works.
What are the values this function returns?
Something like this:
llm("I wish you a happy New") 0 (' Year', 0.967553) 1 (' Years', 0.018199688) 2 (' year', 0.003573329) 3 (' York', 0.003114716) 4 (' New', 0.0009022804) … 50252 (' carbohyd', 2.3950911e-15) 50253 (' volunte', 2.2590102e-15) 50254 ('pmwiki', 1.369229e-15) 50255 (' proport', 1.1198108e-15) 50256 (' cumbers', 7.568147e-17)
It returns an array of tuples. Each tuple consists of a word (or, rather, a string) and a number. The number is the probability that this word will continue the prompt. The model "thinks" that the phrase "I wish you a happy New" will be followed by the character sequence " Year" with a probability of 96.7%, " Years" of 1.8% and so on.
The word "think" above is quoted because, of course, the model doesn't really think. It mechanically returns arrays of words and numbers according to some hardwired internal logic.
If it's that dumb and deterministic, how can it generate different texts?
Large language models are used in text applications (chatbots, content generators, code assistants, etc.). These applications call the model repeatedly and select the word it suggests (with some degree of randomness). The selected word is added to the prompt, and the model is called again. This continues in a loop until enough words are generated.
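Here is a minimal sketch of that loop, again assuming the toy llm from above. It uses plain weighted sampling for brevity; real applications use more elaborate schemes (temperature, top-k, top-p and so on):

import random

def generate(prompt: str, n_tokens: int = 10) -> str:
    text = prompt
    for _ in range(n_tokens):
        candidates = llm(text)  # deterministic call
        tokens = [token for token, _ in candidates]
        weights = [probability for _, probability in candidates]
        # The randomness lives here, in the application, not in the model.
        text += random.choices(tokens, weights=weights, k=1)[0]
    return text

print(generate("I wish you a happy New", n_tokens=3))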
The accumulated sequence of words will look like text in a human language, complete with grammar, syntax, and even what appears to be intelligence and reasoning. In this respect, it is not unlike a Markov chain, which works on the same principle.
The internals of a large language model are wired up so that the next suggested word will be a natural continuation of the prompt, complete with its grammar, semantics, and sentiment. Equipping a function with such logic became possible through a series of scientific breakthroughs (and programming drudgery) that resulted in the family of algorithms known as GPT, or Generative Pre-trained Transformer.
What does "Generative Pre-trained Transformer" mean?
"Generative" means that it generates text (by adding continuations to the prompt recursively, as we saw earlier).
"Transformer" means that it uses a particular type of neural network, first developed by Google and described in this paper.
"Pre-trained" is a little bit historical. Initially, the ability for the model to continue text was thought of as just a prerequisite for a more specialized task: inference (finding logical connections between phrases), classification (for instance, guessing the number of stars in a hotel rating from the text of the review), machine translation and so on. It was thought that these two parts should have been trained separately, the language part being just a pre-training for a "real" task that would follow.
As the original GPT paper puts it:
We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task.
It was not until later that people realized that, with a model large enough, the second step was often not necessary. A Transformer model trained to do nothing but generate text turned out to be able to follow human-language instructions contained in that text, with no additional training ("fine-tuning" in AI parlance) required.
With that out of the way, let's focus on the implementation.