
Generative Pre-trained Transformer (GPT)

Concept

  • GPT-1 (125M params): Autoregressively pre-train a multi-layer Transformer decoder, then fine-tune on supervised data for downstream tasks (the pre-training objective is written out at the end of this section).
  • GPT-2 (1.5B params): Scale up GPT-1 and utilize a larger dataset: WebText (in-house).
  • GPT-3 (175B params): Scale up GPT-2 and utilize more datasets.
  • Codex (12B params): Fine-tune GPT-3 on code datasets.
    • Code Fine-tuning on 159 GB of Python code collected and filtered from GitHub.
    • Add an additional set of tokens for representing whitespace runs of different lengths (based on GPT-3's tokenizer).
    • (Codex-S) Supervised Fine-tuning on standalone functions curated from competitive programming websites, interview preparation websites, and GitHub repos/PyPI packages with continuous integration (CI).
    • Applications such as GitHub Copilot use a distinct (production) version of Codex.
  • InstructGPT (175B params): Make GPT-3 more aligned to user instructions by Reinforcement Learning from Human Feedback (RLHF):
    • (Step 1) Supervised Fine-Tuning (SFT) (175B params): Train a supervised policy (i.e., fine-tune GPT-3) on collected demonstration data.
      • Sample prompts from dataset, and ask labelers to compose the desired outputs.
    • (Step 2) Reward Modeling (RM) (6B params): Train a reward model on collected comparison data.
      • Sample prompts & outputs, and ask labelers to rank the outputs.
      • Weights are initialized from (a fine-tuned variant of) the GPT-3 model.
    • (Step 3) Reinforcement Learning (RL): Train a policy against the reward model with reinforcement learning (PPO in Actor-Critic style, bandit environment).
      • The policy is defined as \(P(\mathrm{action}|\mathrm{state})=P(\mathrm{response}|\mathrm{prompt})\)
      • Policy network (175B params) weights are initialized from (a fine-tuned variant of) the GPT-3 model.
      • Value network (6B params) weights are initialized from the Reward Model (RM).
    • Hired ~40 contractors (with an onboarding process) to compose:
      • SFT Data (~14k prompts [3]) consists of labeler demonstrations.
      • RM Data (~51k prompts [3]) consists of labeler rankings.
      • PPO Data (~47k prompts [3]) consists of unlabeled inputs for RLHF fine-tuning.
    • See the pipeline diagram in the official blog post for more information; a toy sketch of the RM loss and the KL-penalized PPO reward follows below.
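
As a rough illustration of Steps 2 and 3 above, the sketch below writes out the pairwise ranking loss used to train the reward model and the KL-penalized reward that PPO optimizes. This is a toy PyTorch sketch with made-up tensor shapes and an assumed KL coefficient, not OpenAI's implementation.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Step 2 (RM): pairwise ranking loss that pushes the scalar reward of the
    labeler-preferred response above the reward of the rejected response."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def ppo_reward(rm_score: torch.Tensor,
               logprob_policy: torch.Tensor,
               logprob_sft: torch.Tensor,
               beta: float = 0.02) -> torch.Tensor:
    """Step 3 (RL): RM score minus a penalty for drifting away from the SFT policy.
    The penalty is the log-probability ratio summed over the sampled response
    tokens, i.e. a sample estimate of the per-response KL term. beta is an
    assumed coefficient, not the value used by OpenAI."""
    kl = (logprob_policy - logprob_sft).sum(dim=-1)
    return rm_score - beta * kl

# Usage with random stand-in values (8 comparison pairs, 32-token responses):
r_chosen, r_rejected = torch.randn(8), torch.randn(8)
print(reward_model_loss(r_chosen, r_rejected))

logprob_policy, logprob_sft = torch.randn(8, 32), torch.randn(8, 32)
print(ppo_reward(torch.randn(8), logprob_policy, logprob_sft))
```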

  • GPT-3.5: A series of GPT-3 and InstructGPT models trained on a blend of text and code.
  • ChatGPT: Uses the same methods as InstructGPT to improve GPT-3.5, but collects supervised data in chat format instead (conversations between a user and an AI assistant); a toy example of such data follows this list.

    ChatGPT Training Process Overview, from OpenAI Blog.

    • Modifies Step 1 in InstructGPT: Collect demonstration data of two roles (user and AI assistant) and ask labelers to compose the desired outputs for each role.
  • ChatGPT Plugins: Support interaction with third-party services.
  • GPT-4: Further improves ChatGPT; few technical details have been disclosed (the GPT-4 Technical Report deliberately omits architecture, hardware, and training details). GPT-4 powers the current ChatGPT Plus.
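
Since ChatGPT's supervised data is collected in chat format, a single demonstration is a multi-turn conversation rather than one prompt/response pair. Below is a minimal, hypothetical example; the schema and field names are assumptions for illustration, as OpenAI has not published the exact training-data format.

```python
# Hypothetical chat-format demonstration record (schema is illustrative, not OpenAI's).
demonstration = {
    "messages": [
        {"role": "user", "content": "How do I reverse a list in Python?"},
        {"role": "assistant", "content": "Use slicing, my_list[::-1], or my_list.reverse() for in-place reversal."},
        {"role": "user", "content": "Which one should I prefer?"},
        {"role": "assistant", "content": "Slicing returns a new list; reverse() mutates the original. Choose based on whether you need to keep the original list."},
    ]
}
```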

Some GPT variants are not included in this post (e.g., WebGPT, Image GPT, Visual ChatGPT), but may be included in the future.
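
For reference, the autoregressive pre-training objective shared by GPT-1/2/3 (referenced in the first bullets above) is plain next-token prediction: maximize the log-likelihood of each token given the preceding context. In the notation of the GPT-1 paper, with tokens \(u_i\), context window \(k\), and parameters \(\Theta\):

\[
L_1(\mathcal{U}) = \sum_{i} \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
\]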

Official Resources

  • (Codex) Evaluating Large Language Models Trained on Code [arxiv] (citations: 740, as of 2023-04-28)
  • (InstructGPT) [NeurIPS 2022] Training language models to follow instructions with human feedback [arxiv][paper][blog] (citations: 730, as of 2023-04-28)
  • (ChatGPT) Optimizing Language Models for Dialogue [blog][demo]
  • (ChatGPT Plugins) ChatGPT plugins [blog]
  • (GPT-4) GPT-4 Technical Report [arxiv][paper][link]
  • GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models [arxiv]

Community Resources

Obsolete Contents

Interesting Posts Regarding ChatGPT.

  1. Please note that the terms one-shot and few-shot here differ from their use in previous literature: no fine-tuning is involved. Instead, the examples are prepended to the input as a prompt. It may be better to call this few-shot prompting rather than few-shot learning to avoid ambiguity.

  2. GPT-3 175B costs 3.64E+03 PF-days (petaflop/s-days) of compute to train, from Table D.1 in the GPT-3 paper; the conversion to total FLOPs is sketched after these footnotes.

  3. The sum of prompts collected from labelers and customers, across the training and validation splits, from Table 6 in the InstructGPT paper.
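
To unpack footnote 2: one PF-day (petaflop/s-day) is \(10^{15}\) floating-point operations per second sustained for one day, i.e. \(\approx 8.64\times 10^{19}\) FLOPs, so

\[
3.64\times 10^{3}\ \text{PF-days} \times 8.64\times 10^{19}\ \tfrac{\text{FLOPs}}{\text{PF-day}} \approx 3.14\times 10^{23}\ \text{FLOPs},
\]

which matches the total training compute reported for GPT-3 175B in the same table.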
