
Fall Asleep While Learning About LLMs
In this episode of the I Can’t Sleep Podcast, drift off while learning about LLMs—large language models. Step into the thrilling (read: not thrilling) world of artificial intelligence and discover everything you never wanted to know about how they work. Fun fact: since machine learning algorithms only understand numbers, text has to be converted into numbers first. Riveting, right? Get ready for that kind of excitement. Happy sleeping!
Transcript
Welcome to the I Can't Sleep podcast,
Where I read random articles from across the web to bore you to sleep with my soothing voice.
I'm your host,
Benjamin Boster,
And today's episode is from a Wikipedia article titled Large Language Model.
A large language model,
LLM,
Is a type of machine learning model designed for natural language processing tasks,
Such as language generation.
LLMs are language models with many parameters and are trained with self-supervised learning on a vast amount of text.
The largest and most capable LLMs are generative pre-trained transformers,
GPTs.
Modern models can be fine-tuned for specific tasks or guided by prompt engineering.
These models acquire predictive power regarding syntax,
Semantics,
And ontologies inherent in human language corpora,
But they also inherit inaccuracies and biases present in the data they are trained on.
Before 2017,
There were a few language models that were large compared to the capacities then available.
In the 1990s,
The IBM Alignment models pioneered statistical language modeling.
A smoothed n-gram model in 2001,
Trained on 0.3 billion words,
Achieved state-of-the-art perplexity at the time.
In the 2000s,
As Internet use became prevalent,
Some researchers constructed Internet-scale language databases,
Web as corpus,
Upon which they trained statistical language models.
In 2009,
In most language processing tasks,
Statistical language models dominated over symbolic language models,
As they can usefully ingest large datasets.
After neural networks became dominant in image processing around 2012,
They were applied to language modeling as well.
Google converted its translation service to neural machine translation in 2016.
As it was before transformers,
It was done by sequence-to-sequence deep LSTM networks.
At the 2017 NeurIPS conference,
Google researchers introduced the transformer architecture in their landmark paper,
Attention is All You Need.
This paper's goal was to improve upon 2014 sequence-to-sequence technology,
And was based mainly on the attention mechanism developed by Bahdanau et al. in 2014.
The following year,
In 2018,
BERT was introduced and quickly became ubiquitous.
Though the original transformer has both encoder and decoder blocks,
BERT is an encoder-only model.
Academic and research usage of BERT began to decline in 2023,
Following rapid improvements in the abilities of decoder-only models,
Such as GPT,
To solve tasks via prompting.
Although decoder-only GPT-1 was introduced in 2018,
It was GPT-2 in 2019 that caught widespread attention,
Because OpenAI at first deemed it too powerful to release publicly out of fear of malicious use.
GPT-3 in 2020 went a step further,
And as of 2024 is available only via API,
With no offering of downloading the model to execute locally.
But it was the 2022 consumer-facing browser-based ChatGPT that captured the imaginations of the general population,
And caused some media hype and online buzz.
The 2023 GPT-4 was praised for its increased accuracy,
And as a holy grail for its multimodal capabilities.
OpenAI did not reveal the high-level architecture and the number of parameters of GPT-4.
The release of ChatGPT led to an uptick in LLM usage across several research subfields of computer science,
Including robotics,
Software engineering,
And societal impact work.
Competing language models have for the most part been attempting to equal the GPT series,
At least in terms of number of parameters.
Since 2022,
Source-available models have been gaining popularity,
Especially at first with BLOOM and LLaMA,
Though both have restrictions on the field of use.
Mistral AI's models,
Mistral 7B and Mixtral 8x7B,
Have the more permissive Apache license.
As of June 2024,
The instruction-fine-tuned variant of the Llama-3 70-billion-parameter model is the most powerful open LLM according to the LMSYS chatbot arena leaderboard,
Being more powerful than GPT-3.5,
But not as powerful as GPT-4.
Since 2023,
Many LLMs have been trained to be multimodal,
Having the ability to also process or generate other types of data,
Such as images or audio.
These LLMs are also called large multimodal models,
LMMs.
As of 2024,
The largest and most capable models are all based on the transformer architecture.
Some recent implementations are based on other architectures,
Such as recurrent neural network variants and Mamba,
A state-space model.
As machine learning algorithms process numbers rather than text,
The text must be converted to numbers.
In the first step,
A vocabulary is decided upon.
Then,
Integer indices are arbitrarily but uniquely assigned to each vocabulary entry.
And finally,
An embedding is associated to the integer index.
Algorithms include byte-pair encoding,
BPE,
And WordPiece.
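To make the pipeline above concrete, here is a minimal sketch in Python. The toy vocabulary, embedding dimension, and random embedding table are illustrative stand-ins for what a real model learns during training.

```python
import numpy as np

# Toy vocabulary; real vocabularies hold tens of thousands of entries.
vocabulary = ["<unk>", "I", "like", "to", "eat", "ice", "cream"]
token_to_id = {token: i for i, token in enumerate(vocabulary)}  # arbitrary but unique integer indices

embedding_dim = 4  # real models use hundreds or thousands of dimensions
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocabulary), embedding_dim))  # learned in practice, random here

def encode(words):
    # Map each word to its integer index, falling back to <unk> for unknown words.
    return [token_to_id.get(w, token_to_id["<unk>"]) for w in words]

ids = encode(["I", "like", "to", "eat"])
vectors = embedding_table[ids]  # one embedding vector per token
print(ids)            # [1, 2, 3, 4]
print(vectors.shape)  # (4, 4)
```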
There are also special tokens serving as control characters,
Such as mask,
For mask-out token,
As used in BERT,
And unk,
Unknown,
For characters not appearing in the vocabulary.
Also,
Some special symbols are used to denote special text formatting.
For example,
A G with a dot above,
Ġ,
Denotes a preceding whitespace in RoBERTa and GPT.
A double hash,
##,
Denotes continuation of a preceding word in BERT.
Tokenization also compresses the datasets.
Because LLMs generally require input to be an array that is not jagged,
The shorter texts must be padded until they match the length of the longest one.
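A small sketch of that padding step, with a hypothetical pad index of 0 standing in for a dedicated padding token:

```python
PAD_ID = 0  # hypothetical index reserved for the padding token

def pad_batch(sequences):
    # Pad every sequence to the length of the longest one,
    # so the batch forms a rectangular, non-jagged array.
    longest = max(len(seq) for seq in sequences)
    return [seq + [PAD_ID] * (longest - len(seq)) for seq in sequences]

print(pad_batch([[5, 17, 2], [8, 3], [9, 12, 4, 21]]))
# [[5, 17, 2, 0], [8, 3, 0, 0], [9, 12, 4, 21]]
```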
How many tokens are,
On average,
Needed per word depends on the language of the dataset.
As an example,
Consider a tokenizer based on byte-pair encoding.
In the first step,
All unique characters,
Including blanks and punctuation marks,
Are treated as an initial set of n-grams,
I.e.,
The initial set of unigrams.
Successively,
The most frequent pair of adjacent characters is merged into a bigram,
And all instances of the pair are replaced by it.
All occurrences of adjacent pairs of previously merged n-grams that most frequently occur together are then again merged into even lengthier n-grams,
Until a vocabulary of prescribed size is obtained.
In the case of GPT-3,
The size is 50,257.
After a tokenizer is trained,
Any text can be tokenized by it,
As long as it does not contain characters not appearing in the initial set of unigrams.
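The merge loop just described is short enough to sketch directly. This toy trainer only illustrates the idea; the sample text and vocabulary size are arbitrary, and production tokenizers such as GPT-2's add many refinements.

```python
from collections import Counter

def train_bpe(text, vocab_size):
    tokens = list(text)  # initial set of unigrams: the individual characters
    vocab = set(tokens)
    while len(vocab) < vocab_size:
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # most frequent adjacent pair
        merged = a + b
        vocab.add(merged)
        # Replace every occurrence of the pair with the merged n-gram.
        new_tokens, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                new_tokens.append(merged)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens
    return vocab, tokens

vocab, tokens = train_bpe("low lower lowest", vocab_size=12)
print(sorted(vocab))  # characters plus merged n-grams such as 'lo' and 'low'
print(tokens)
```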
A token vocabulary based on the frequencies extracted from mainly English corpora uses as few tokens as possible for an average English word.
An average word in another language encoded by such an English-optimized tokenizer is,
However,
Split into a suboptimal number of tokens.
The GPT-2 tokenizer can use up to 15 times more tokens per word for some languages,
For example,
The Shan language from Myanmar.
Even more widespread languages,
Such as Portuguese and German,
Have a premium of 50% compared to English.
Greedy tokenization also causes subtle problems with text completion.
In the context of training LLMs,
Datasets are typically cleaned by removing toxic passages,
Discarding low-quality data,
And deduplicating.
Clean datasets can increase training efficiency and lead to improved downstream performance.
A trained LLM can be used to clean datasets for training a further LLM.
With the increasing proportion of LLM-generated content on the web,
Data cleaning in the future may include filtering out such content.
LLM-generated content can pose a problem if the content is similar to human text,
Making filtering difficult,
But of lower quality,
Degrading performance of models trained on it.
Training of the largest language models might need more linguistic data than is naturally available,
Or the naturally occurring data may be of insufficient quality.
In these cases,
Synthetic data might be used.
Microsoft's Phi series of LLMs is trained on textbook-like data generated by another LLM.
Reinforcement learning from human feedback,
Or RLHF,
Through algorithms,
Such as proximal policy optimization,
Is used to further fine-tune a model based on a dataset of human preferences.
Using self-instruct approaches,
LLMs have been able to bootstrap correct responses,
Replacing any naive responses,
Starting from human-generated corrections of a few cases.
For example,
In the instruction,
Write an essay about the main themes represented in Hamlet.
An initial naive completion might be,
If you submit the essay after March 17th,
Your grade will be reduced by 10% for each day of delay,
Based on the frequency of this textual sequence in the corpus.
The largest LLM may be too expensive to train and use directly.
For such models,
Mixture of Experts,
MOE,
Can be applied,
A line of research pursued by Google researchers since 2017,
To train models reaching up to 1 trillion parameters.
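A minimal sketch of the routing idea behind mixture of experts, with illustrative shapes and a hypothetical top-2 gate; real MoE layers are trained end to end and add load-balancing machinery omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_experts, top_k = 8, 4, 2

gate_w = rng.normal(size=(d_model, num_experts))  # gating network weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]

def moe_layer(x):
    scores = x @ gate_w                   # one relevance score per expert
    chosen = np.argsort(scores)[-top_k:]  # indices of the top-k experts
    weights = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()  # softmax over the chosen
    # Only the chosen experts run, so total parameters can grow
    # without a proportional growth in compute per token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

print(moe_layer(rng.normal(size=d_model)).shape)  # (8,)
```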
Most results previously achievable only by costly fine-tuning can be achieved through prompt engineering,
Although limited to the scope of a single conversation,
Or more precisely,
Limited to the scope of a context window.
In order to find out which tokens are relevant to each other within the scope of the context window,
The attention mechanism calculates soft weights for each token,
More precisely for its embedding,
By using multiple attention heads,
Each with its own relevance for calculating its own soft weights.
For example,
The small,
I.e. 117M-parameter,
GPT-2 model has 12 attention heads and a context window of only 1,000 tokens.
In its medium version,
It has 345M parameters and contains 24 layers,
Each with 12 attention heads.
For the training with gradient descent,
A batch size of 512 was utilized.
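A sketch of the soft-weight computation for a single attention head, with random matrices standing in for the learned query, key, and value projections of the token embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_head = 5, 16  # illustrative sizes

# In a real model these come from learned projections of the embeddings.
Q = rng.normal(size=(seq_len, d_head))
K = rng.normal(size=(seq_len, d_head))
V = rng.normal(size=(seq_len, d_head))

scores = Q @ K.T / np.sqrt(d_head)  # relevance of every token to every other token
soft_weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
output = soft_weights @ V           # each token's output is a weighted mix of value vectors

print(soft_weights.shape)  # (5, 5): one soft weight per pair of tokens
print(output.shape)        # (5, 16)
```

Multi-head attention runs several such heads in parallel, each with its own projections, and combines their outputs.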
The largest models,
Such as Google's Gemini 1.5,
Presented in February 2024,
Can have a context window sized up to 1 million tokens.
A context window of 10 million tokens was also successfully tested.
Other models with large context windows include Anthropic's Claude 2.1,
With a context window of up to 200,000 tokens.
Note that this maximum refers to the number of input tokens and that the maximum number of output tokens differs from the input and is often smaller.
For example,
The GPT-4 Turbo model has a maximum output of 4,096 tokens.
The length of a conversation that the model can take into account when generating its next answer is limited by the size of the context window as well.
If the length of a conversation,
For example with ChatGPT,
Is longer than its context window,
Only the parts inside the context window are taken into account when generating the next answer,
Or the model needs to apply some algorithm to summarize the parts of the conversation that are too distant.
The shortcomings of making a context window larger include higher computational cost and possibly diluting the focus on local context,
While making it smaller can cause a model to miss an important long-range dependency.
Balancing them is a matter of experimentation and domain-specific considerations.
A model may be pre-trained either to predict how the segment continues or what is missing in the segment,
Given a segment from its training dataset.
It can be either autoregressive,
I.e.,
Predicting how the segment continues,
The way GPT models do it.
For example,
Given a segment,
I like to eat,
The model predicts ice cream or sushi.
Or masked,
I.e.,
Filling in the parts missing from the segment,
The way BERT does it.
For example,
Given a segment,
I like to [blank] [blank] cream,
The model predicts that eat and ice are missing.
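A toy illustration of the two objectives, reusing the segments above; the stubs simply return the expected answers, since only the shape of each task matters here.

```python
# Autoregressive (GPT-style): hide the future, predict the next token.
def autoregressive_stub(prompt):
    # A real model would score every vocabulary entry as the continuation.
    return prompt + ["ice"]

# Masked (BERT-style): hide interior tokens, predict them from both sides.
def masked_stub(segment, predictions):
    filled = iter(predictions)
    return [next(filled) if tok == "[MASK]" else tok for tok in segment]

print(autoregressive_stub(["I", "like", "to", "eat"]))
# ['I', 'like', 'to', 'eat', 'ice']
print(masked_stub(["I", "like", "to", "[MASK]", "[MASK]", "cream"], ["eat", "ice"]))
# ['I', 'like', 'to', 'eat', 'ice', 'cream']
```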
Models may be trained on auxiliary tasks which test their understanding of the data distribution,
Such as Next Sentence Prediction,
NSP,
In which pairs of sentences are presented,
And the model must predict whether they appear consecutively in the training corpus.
During training,
Regularization loss is also used to stabilize training.
However,
Regularization loss is usually not used during testing and evaluation.
Substantial infrastructure is necessary for training the largest models.
Training Cost.
The qualifier large in large language model is inherently vague,
As there is no definitive threshold for the number of parameters required to qualify as large.
As time goes on,
What was previously considered large may evolve.
GPT-1 of 2018 is usually considered the first LLM,
Even though it has only 0.117 billion parameters.
The tendency towards larger models is visible in the list of large language models.
Advances in software and hardware have reduced the cost substantially since 2020,
Such that in 2023,
Training a 12-billion-parameter LLM has a computational cost of 72,300 A100-GPU-hours,
While in 2020 the cost of training a 1.5-billion-parameter LLM,
Which was two orders of magnitude smaller than the state of the art in 2020,
Was between $80,000 and $1.6 million.
Since 2020,
Large sums were invested in increasingly large models.
For example,
Training of GPT-2,
I.e.,
A 1.5-billion-parameter model,
In 2019 cost $50,000,
While training of PaLM,
I.e.,
A 540-billion-parameter model,
In 2022 cost $8 million,
And Megatron-Turing NLG 530B in 2021 cost around $11 million.
For a transformer-based LLM,
Training cost is much higher than inference cost.
It costs 6 FLOPs per parameter to train on one token,
Whereas it costs 1 to 2 FLOPs per parameter to infer on one token.
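Plugging arbitrary round numbers into those rules of thumb:

```python
params = 12e9            # a 12-billion-parameter model
training_tokens = 300e9  # an illustrative training-set size

train_flops = 6 * params * training_tokens    # 6 FLOPs per parameter per token
print(f"training: {train_flops:.2e} FLOPs")   # 2.16e+22

inference_tokens = 1000  # one thousand-token response
infer_flops = 2 * params * inference_tokens   # upper end of 1-2 FLOPs per parameter per token
print(f"inference: {infer_flops:.2e} FLOPs")  # 2.40e+13
```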
Tool Use
There are certain tasks that in principle cannot be solved by any LLM,
At least not without the use of external tools or additional software.
An example of such a task is responding to the user's input,
354 * 139 =,
Provided that the LLM has not already encountered a continuation of this calculation in its training corpus.
In such cases,
The LLM needs to resort to running program code that calculates the result,
Which can then be included in its response.
Another example is,
What is the time now?
It is,
Where a separate program interpreter would need to execute code to get the system time on the computer,
So that the LLM can include it in its reply.
This basic strategy can be sophisticated with multiple attempts of generated programs and other sampling strategies.
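A sketch of that calculator-style loop; call_llm is a hypothetical stand-in for a real model API, and the control flow, not the stub, is the point.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: a tool-tuned model would decide on its own
    # to emit an expression for the interpreter rather than guess.
    return "TOOL: 354 * 139"

def respond(user_input: str) -> str:
    model_output = call_llm(user_input)
    if model_output.startswith("TOOL: "):
        expression = model_output[len("TOOL: "):]
        # The external "calculator"; never eval untrusted input in practice.
        result = eval(expression)
        return f"{user_input} {result}"
    return model_output

print(respond("354 * 139 ="))  # 354 * 139 = 49206
```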
Generally,
In order to get an LLM to use tools,
One must fine-tune it for tool use.
If the number of tools is finite,
Then fine-tuning may be done just once.
If the number of tools can grow arbitrarily,
As with online API services,
Then the LLM can be fine-tuned to be able to read API documentation and call APIs correctly.
A simpler form of tool use is retrieval-augmented generation,
The augmentation of an LLM with document retrieval.
Given a query,
A document retriever is called to retrieve the most relevant documents.
This is usually done by encoding the query and the documents into vectors,
Then finding the documents with vectors,
Usually stored in a vector database,
Most similar to the vector of the query.
The LLM then generates an output based on both the query and context included from the retrieved documents.
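A minimal retrieval-augmented generation sketch; the bag-of-words embedding is a toy stand-in for a trained encoder, and the final model call is left hypothetical.

```python
import numpy as np

documents = [
    "The transformer architecture was introduced in 2017.",
    "BERT is an encoder-only model.",
    "GPT-2 caught widespread attention in 2019.",
]

def embed(text: str) -> np.ndarray:
    # Toy embedding: hash words into a fixed-size vector, then normalize.
    v = np.zeros(64)
    for word in text.lower().split():
        v[hash(word) % 64] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def retrieve(query: str, k: int = 2):
    # Cosine similarity between the query vector and each document vector;
    # a vector database would do this at scale.
    scores = [float(embed(doc) @ embed(query)) for doc in documents]
    return [documents[i] for i in np.argsort(scores)[-k:]]

query = "When was the transformer introduced?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}"
# answer = call_llm(prompt)  # hypothetical: generate from query plus retrieved context
print(prompt)
```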
Agency.
An LLM is typically not an autonomous agent by itself,
As it lacks the ability to interact with dynamic environments,
Recall past behaviors,
And plan future actions,
But it can be transformed into one by integrating modules like profiling,
Memory,
Planning,
And action.
The ReAct pattern,
A portmanteau of Reason plus Act,
Constructs an agent out of an LLM using the LLM as a planner.
The LLM is prompted to think out loud.
Specifically,
The language model is prompted with a textual description of the environment,
A goal,
A list of possible actions,
And a record of the actions and observations so far.
It generates one or more thoughts before generating an action,
Which is then executed in the environment.
The linguistic description of the environment given to the LLM planner can even be the LaTeX code of a paper describing the environment.
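A sketch of a few reason-act cycles under those assumptions; call_llm and the toy environment are hypothetical stand-ins.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical model call: think out loud, then name an action.
    return "Thought: the door is closed, I should open it.\nAction: open_door"

def execute(action: str) -> str:
    # Toy environment mapping actions to observations.
    return {"open_door": "The door is now open."}.get(action, "Nothing happens.")

goal = "Leave the room."
possible_actions = ["open_door", "wait"]
record = []  # the growing history of actions and observations

for _ in range(3):  # a few reason-act cycles
    prompt = (f"Goal: {goal}\nPossible actions: {possible_actions}\n"
              f"History: {record}\nThink step by step, then choose an action.")
    output = call_llm(prompt)
    action = output.split("Action:")[-1].strip()
    observation = execute(action)  # the action is executed in the environment
    record.append((action, observation))

print(record[-1])  # ('open_door', 'The door is now open.')
```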
In the DEPS,
Describe,
Explain,
Plan,
And select method,
An LLM is first connected to the visual world via image descriptions.
Then it is prompted to produce plans for complex tasks and behaviors based on its pre-trained knowledge and environmental feedback it receives.
The Reflexion method constructs an agent that learns over multiple episodes.
At the end of each episode,
The LLM is given the record of the episode and prompted to think up lessons learned,
Which would help it perform better at a subsequent episode.
These lessons learned are given to the agent in the subsequent episodes.
Monte Carlo tree search can use an LLM as a rollout heuristic.
When a programmatic world model is not available,
An LLM can also be prompted with a description of the environment to act as a world model.
For open-ended exploration,
An LLM can be used to score observations for their interestingness,
Which can be used as a reward signal to guide a normal,
Non-LLM reinforcement learning agent.
Alternatively,
It can propose increasingly difficult tasks for curriculum learning.
Instead of outputting individual actions,
An LLM planner can also construct skills,
Or functions for complex action sequences.
The skills can be stored and later invoked,
Allowing increasing levels of abstraction and planning.
LLM-powered agents can keep a long-term memory of their previous contexts,
And the memory can be retrieved in the same way as retrieval-augmented generation.
Multiple such agents can interact socially.
Compression
Typically,
LLMs are trained with single- or half-precision floating-point numbers,
Float32 and float16.
One float16 number has 16 bits,
Or 2 bytes,
And so 1 billion parameters require 2 gigabytes.
The largest models typically have 100 billion parameters,
Requiring 200 gigabytes to load,
Which places them outside the range of most consumer electronics.
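The arithmetic behind those figures, spelled out:

```python
def load_size_gb(num_params: float, bytes_per_param: float) -> float:
    # Memory needed just to hold the weights, ignoring activations.
    return num_params * bytes_per_param / 1e9

print(load_size_gb(1e9, 2))      # float16, 1B parameters   -> 2.0 GB
print(load_size_gb(100e9, 2))    # float16, 100B parameters -> 200.0 GB
print(load_size_gb(100e9, 0.5))  # hypothetical 4-bit quantization -> 50.0 GB
```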
Post-training quantization aims to decrease the space requirement by lowering precision of the parameters of a trained model,
While preserving most of its performance.
The simplest form of quantization simply truncates all numbers to a given number of bits.
It can be improved by using a different quantization codebook per layer.
Further improvement can be done by applying different precisions to different parameters,
With higher precision for particularly important parameters,
Outlier weights.
While quantized models are typically frozen,
And only pre-quantized models are fine-tuned,
Quantized models can still be fine-tuned.
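A sketch of the simplest scheme described above, rounding weights to 4-bit integers with a single per-tensor scale; per-layer codebooks and special treatment of outlier weights refine the same idea.

```python
import numpy as np

def quantize(weights: np.ndarray, bits: int = 4):
    levels = 2 ** (bits - 1) - 1            # symmetric integer range, here -7..7
    scale = np.abs(weights).max() / levels  # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -levels, levels).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
q, scale = quantize(w)
print(f"mean absolute rounding error: {np.abs(w - dequantize(q, scale)).mean():.4f}")
```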
Multimodality
Multimodality means having several modalities,
And a modality refers to a type of input or output,
Such as video,
Image,
Audio,
Text,
Proprioception,
Etc.
There have been many AI models trained specifically to ingest one modality and output another modality,
Such as AlexNet,
For image to label,
Visual question answering for image-text to text,
And speech recognition for speech-to-text.
A common method to create multimodal models out of an LLM is to tokenize the output of a trained encoder.
Flamingo demonstrated the effectiveness of the tokenization method,
Fine-tuning a pair of pre-trained language model and image encoder to perform better on visual question answering than models trained from scratch.
Google's PaLM model was fine-tuned into the multimodal model PaLM-E using the tokenization method and applied to robotic control.
LLaMA models have also been turned multimodal using the tokenization method to allow image inputs and video inputs.
GPT-4 can use both text and image inputs,
Although the vision component was not released to the public until GPT-4V.
Google DeepMind's Gemini is also multimodal.
Mistral introduced its own multimodal Pixtral 12B model in September 2024.
Properties
Emergent Abilities
Performance of bigger models on various tasks,
When plotted on a log-log scale,
Appears as a linear extrapolation of performance achieved by smaller models.
However,
This linearity may be punctuated by breaks in the scaling law,
Where the slope of the line changes abruptly and where larger models acquire emergent abilities.
They arise from the complex interaction of the model's components and are not explicitly programmed or designed.
Furthermore,
Recent research has demonstrated that AI systems,
Including large language models,
Can employ heuristic reasoning akin to human cognition.
They balance between exhaustive logical processing and the use of cognitive shortcuts,
Heuristics,
Adapting their reasoning strategies to optimize between accuracy and effort.
This behavior aligns with principles of resource-rational human cognition,
As discussed in classical theories of bounded rationality and dual-process theory.
The most intriguing among emergent abilities is in-context learning from example demonstrations.
In-context learning is involved in tasks such as reported arithmetics,
Decoding the International Phonetic Alphabet,
Unscrambling a word's letters,
Disambiguating a word in context,
Converting spatial words,
Cardinal directions,
For example,
Replying northeast upon the input [0, 0, 1, 0, 0, 0, 0, 0],
And color terms represented in text.
Chain-of-thought prompting:
Model outputs are improved by chain-of-thought prompting only when model size exceeds 62B.
Smaller models perform better when prompted to answer immediately,
Without chain-of-thought.
Identifying offensive content in paragraphs of Hinglish,
A combination of Hindi and English,
And generating a similar English equivalent of Kiswahili proverbs.
Schaeffer et al. argue that the emergent abilities are not unpredictably acquired,
But predictably acquired according to a smooth scaling law.
The authors considered a toy statistical model of an LLM solving multiple-choice questions and showed that this statistical model,
Modified to account for other types of tasks,
Applies to these tasks as well.
Let x be the parameter count and y be the performance of the model.
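A sketch, in equations, of how a smooth law can nonetheless look emergent; the functional forms below are illustrative assumptions, not Schaeffer et al.'s exact model.

```latex
% Assume per-token accuracy p(x) improves smoothly with parameter count x:
\[
  p(x) = \exp\!\bigl(-(c/x)^{\alpha}\bigr)
\]
% If success requires all n tokens of an answer to be correct,
% the measured performance is
\[
  y(x) = p(x)^{n} = \exp\!\bigl(-n\,(c/x)^{\alpha}\bigr)
\]
% y(x) stays near zero while p(x) is small and rises sharply once p(x)
% nears 1, so a smooth per-token scaling law can register as an abrupt,
% "emergent" jump under an all-or-nothing metric.
```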
Interpretation.
Large language models by themselves are black boxes,
And it is not clear how they can perform linguistic tasks.
There are several methods for understanding how LLMs work.
Mechanistic interpretability aims to reverse-engineer LLMs by discovering symbolic algorithms that approximate the inference they perform.
One example is Othello-GPT,
Where a small transformer is trained to predict legal Othello moves.
It is found that there is a linear representation of the Othello board,
And modifying the representation changes the predicted legal Othello moves in the correct way.
In another example,
A small transformer is trained on Karel programs.
Similar to the Othello-GPT example,
There is a linear representation of Karel program semantics,
And modifying the representation changes output in the correct way.
The model also generates correct programs that are on average shorter than those in the training set.
In another example,
The authors train small transformers on modular arithmetic addition.
The resulting models were reverse-engineered,
And it turned out they used the discrete Fourier transform.
Understanding and Intelligence
NLP researchers were evenly split when asked in a 2022 survey whether untuned LLMs could ever understand natural language in some non-trivial sense.
Proponents of LLM understanding believe that some LLM abilities,
Such as mathematical reasoning,
Imply an ability to understand certain concepts.
A Microsoft team argued in 2023 that GPT-4 can solve novel and difficult tasks that span mathematics,
Coding,
Vision,
Medicine,
Law,
Psychology,
And more.
And that GPT-4 could reasonably be viewed as an early,
Yet still incomplete version of an artificial general intelligence system.
Can one reasonably say that a system that passes exams for software engineering candidates is not really intelligent?
Ilya Sutskever argues that predicting the next word sometimes involves reasoning and deep insights.
For example,
If the LLM has to predict the name of the criminal in an unknown detective novel after processing the entire story leading up to the revelation.
Some researchers characterize LLMs as alien intelligence.
For example,
Conjecture CEO Connor Leahy considers untuned LLMs to be like inscrutable alien shoggoths and believes that RLHF tuning creates a smiling façade,
Obscuring the inner workings of the LLM.
If you don't push it too far,
The smiley face stays on,
But then you give it an unexpected prompt and suddenly you see this massive underbelly of insanity,
Of weird thought processes,
And clearly non-human understanding.
In contrast,
Some proponents of the LLM's lack of understanding school believe that existing LLMs are simply remixing and recombining existing writing,
A phenomenon known as stochastic parrot,
Or they point to the deficits existing LLMs continue to have in prediction skills,
Reasoning skills,
Agency,
And explainability.
For example,
GPT-4 has natural deficits in planning and in real-time learning.
Generative LLMs have been observed to confidently assert claims of fact which do not seem to be justified by their training data,
A phenomenon which has been termed hallucination.
Specifically,
Hallucinations in the context of LLMs correspond to the generation of text or responses that seem syntactically sound,
Fluent,
And natural,
But are factually incorrect,
Nonsensical,
Or unfaithful to the provided source input.
Neuroscientist Terrence Sejnowski has argued that the diverging opinions of experts on the intelligence of LLMs suggest that our old ideas based on natural intelligence are inadequate.
The matter of LLMs exhibiting intelligence or understanding has two main aspects.
The first is how to model thought and language in a computer system,
And the second is how to enable the computer system to generate human-like language.
These aspects of language as a model of cognition have been developed in the field of cognitive linguistics.
American linguist George Lakoff presented Neural Theory of Language,
NTL,
As a computational basis for using language as a model of learning,
Tasks,
And understanding.
The NTL model outlines how specific neural structures of the human brain shape the nature of thought and language,
And in turn what are the computational properties of such neural systems that can be applied to model thought and language in a computer system.
After a framework for modeling language in a computer system was established,
The focus shifted to establishing frameworks for computer systems to generate language with acceptable grammar.
In his 2014 book titled The Language Myth:
Why Language Is Not an Instinct,
British cognitive linguist and digital communication technologist Vyvyan Evans mapped out the role of probabilistic context-free grammar,
PCFG,
In enabling NLP to model cognitive patterns and generate human-like language.
5.0 (21)
Recent Reviews
Beth
January 21, 2025
Another snoozer, I’m not sure if I lasted 5 minutes! Thank you!! ☺️
Lizzz
January 19, 2025
I've listened to this one about 4 times and it works every time. Thank you, Benjamin!
