
Generative Pre-trained Transformer (GPT)

Concept

  • GPT-1 (125M params): Autoregressively pre-train a multi-layer Transformer decoder, then fine-tune on supervised data for downstream tasks (the pre-training objective is written out at the end of this section).
  • GPT-2 (1.5B params): Scale up GPT-1 and utilize a larger dataset: WebText (in-house).
  • GPT-3 (175B params): Scale up GPT-2 and utilize more datasets.
  • Codex (12B params): Fine-tune GPT-3 on code datasets.
    • Code Fine-tuning on 159 GB of Python code collected and filtered from GitHub.
    • Add an additional set of tokens for representing whitespace runs of different lengths (based on GPT-3's tokenizer).
    • (Codex-S) Supervised Fine-tuning on standalone functions curated from competitive programming websites, interview preparation websites, and GitHub repos/PyPI packages with continuous integration (CI).
    • Applications such as GitHub Copilot use a distinct (production) version of Codex.
  • InstructGPT (175B params): Make GPT-3 more aligned to user instructions by Reinforcement Learning from Human Feedback (RLHF):
    • (Step 1) Supervised Fine-Tuning (SFT) (175B params): Train a supervised policy (i.e., fine-tune GPT-3) on collected demonstration data.
      • Sample prompts from dataset, and ask labelers to compose the desired outputs.
    • (Step 2) Reward Modeling (RM) (6B params): Train a reward model on collected comparison data.
      • Sample prompts & outputs, and ask labelers to rank the outputs.
      • Weights are initialized from (a fine-tuned variant of) the GPT-3 model.
    • (Step 3) Reinforcement Learning (RL): Train a policy against the reward model with reinforcement learning (PPO in Actor-Critic style, bandit environment).
      • The policy is defined as \(P(\mathrm{action}|\mathrm{state})=P(\mathrm{response}|\mathrm{prompt})\)
      • Policy network (175B params) weights are initialized from (a fine-tuned variant of) the GPT-3 model.
      • Value network (6B params) weights are initialized from the Reward Model (RM).
    • Hired ~40 contractors (with an onboarding process) to compose:
      • SFT Data (~14k prompts [3]) consists of labeler demonstrations.
      • RM Data (~51k prompts [3]) consists of labeler rankings.
      • PPO Data (~47k prompts [3]) consists of unlabeled inputs for RLHF fine-tuning.
    • See the pipeline diagram in the official blog post for more information; a toy sketch of the RM loss and the KL-penalized PPO reward follows below.
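
As a rough illustration of Steps 2 and 3 above, the sketch below writes out the pairwise ranking loss used to train the reward model and the KL-penalized reward that PPO optimizes. This is a toy PyTorch sketch with made-up tensor shapes and an assumed KL coefficient, not OpenAI's implementation.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Step 2 (RM): pairwise ranking loss that pushes the scalar reward of the
    labeler-preferred response above the reward of the rejected response."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def ppo_reward(rm_score: torch.Tensor,
               logprob_policy: torch.Tensor,
               logprob_sft: torch.Tensor,
               beta: float = 0.02) -> torch.Tensor:
    """Step 3 (RL): RM score minus a penalty for drifting away from the SFT policy.
    The penalty is the log-probability ratio summed over the sampled response
    tokens, i.e. a sample estimate of the per-response KL term. beta is an
    assumed coefficient, not the value used by OpenAI."""
    kl = (logprob_policy - logprob_sft).sum(dim=-1)
    return rm_score - beta * kl

# Usage with random stand-in values (8 comparison pairs, 32-token responses):
r_chosen, r_rejected = torch.randn(8), torch.randn(8)
print(reward_model_loss(r_chosen, r_rejected))

logprob_policy, logprob_sft = torch.randn(8, 32), torch.randn(8, 32)
print(ppo_reward(torch.randn(8), logprob_policy, logprob_sft))
```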

  • GPT-3.5: A series of GPT-3 and InstructGPT models trained on a blend of text and code.
  • ChatGPT: Uses the same methods as InstructGPT to improve GPT-3.5, but collects supervised data in chat format instead (conversations between a user and an AI assistant); a toy example of such data follows this list.

    ChatGPT Training Process Overview, from OpenAI Blog.

    • Modifies Step 1 in InstructGPT: Collect demonstration data of two roles (user and AI assistant) and ask labelers to compose the desired outputs for each role.
  • ChatGPT Plugins: Support interaction with third-party services.
  • GPT-4: Further improves ChatGPT; few technical details have been disclosed (the GPT-4 Technical Report deliberately omits architecture, hardware, and training details). GPT-4 powers the current ChatGPT Plus.
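
Since ChatGPT's supervised data is collected in chat format, a single demonstration is a multi-turn conversation rather than one prompt/response pair. Below is a minimal, hypothetical example; the schema and field names are assumptions for illustration, as OpenAI has not published the exact training-data format.

```python
# Hypothetical chat-format demonstration record (schema is illustrative, not OpenAI's).
demonstration = {
    "messages": [
        {"role": "user", "content": "How do I reverse a list in Python?"},
        {"role": "assistant", "content": "Use slicing, my_list[::-1], or my_list.reverse() for in-place reversal."},
        {"role": "user", "content": "Which one should I prefer?"},
        {"role": "assistant", "content": "Slicing returns a new list; reverse() mutates the original. Choose based on whether you need to keep the original list."},
    ]
}
```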

Some GPT variants are not included in this post (e.g., WebGPT, Image GPT, Visual ChatGPT), but may be included in the future.
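
For reference, the autoregressive pre-training objective shared by GPT-1/2/3 (referenced in the first bullets above) is plain next-token prediction: maximize the log-likelihood of each token given the preceding context. In the notation of the GPT-1 paper, with tokens \(u_i\), context window \(k\), and parameters \(\Theta\):

\[
L_1(\mathcal{U}) = \sum_{i} \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
\]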

Official Resources

  • (Codex) Evaluating Large Language Models Trained on Code [arxiv] (citations: 740, as of 2023-04-28)
  • (InstructGPT) [NeurIPS 2022] Training language models to follow instructions with human feedback [arxiv][paper][blog] (citations: 730, as of 2023-04-28)
  • (ChatGPT) Optimizing Language Models for Dialogue [blog][demo]
  • (ChatGPT Plugins) ChatGPT plugins [blog]
  • (GPT-4) GPT-4 Technical Report [arxiv][paper][link]
  • GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models [arxiv]

Community Resources

Obsolete Contents

Interesting Posts Regarding ChatGPT.

  1. Please note that the terms one-shot and few-shot here differ from their use in previous literature: no fine-tuning is involved. Instead, the examples are prepended to the input as a prompt. It may be better to call this few-shot prompting rather than few-shot learning to avoid ambiguity.

  2. GPT-3 175B costs 3.64E+03 PF-days (petaflop/s-days) of compute to train, from Table D.1 in the GPT-3 paper; the conversion to total FLOPs is sketched after these footnotes.

  3. The sum of prompts collected from labelers and customers, across the training and validation splits, from Table 6 in the InstructGPT paper.
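
To unpack footnote 2: one PF-day (petaflop/s-day) is \(10^{15}\) floating-point operations per second sustained for one day, i.e. \(\approx 8.64\times 10^{19}\) FLOPs, so

\[
3.64\times 10^{3}\ \text{PF-days} \times 8.64\times 10^{19}\ \tfrac{\text{FLOPs}}{\text{PF-day}} \approx 3.14\times 10^{23}\ \text{FLOPs},
\]

which matches the total training compute reported for GPT-3 175B in the same table.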
