Deep Q-Network (DQN)
Goal: Scale Q-Learning to high-dimensional state spaces.
Contribution: Achieved performance comparable to a professional human player across a set of 49 games from the Arcade Learning Environment (ALE), receiving only the pixels and the game score as inputs, while using the same algorithm, network architecture, and hyperparameters across all games.
Prerequisites
- Markov Decision Process (MDP)
- Q-Learning
- Neural Network (NN)
- Convolutional Neural Network (CNN)
Concept
- Stacked frames as MDP state
  - If the state is defined as a single Atari frame, the problem becomes a POMDP rather than an MDP (e.g., the ball velocity in Pong cannot be inferred from a single frame).
  - DQN defines the state as a stack of 4 consecutive frames, which empirically makes most Atari games an MDP (see the sketch below).
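A minimal frame-stacking sketch in Python/NumPy; the `FrameStack` class and the 84x84 grayscale preprocessing it assumes are illustrative, not the paper's code.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keeps the last k preprocessed frames; their stack is the MDP state."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Repeat the first frame so the state has a fixed shape from step 0.
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.state()

    def step(self, frame):
        # Appending evicts the oldest frame automatically (deque maxlen).
        self.frames.append(frame)
        return self.state()

    def state(self):
        # Shape (k, 84, 84) for 84x84 grayscale frames, fed to the CNN as one state.
        return np.stack(self.frames, axis=0)
```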
- Approximate the Q-value table with a CNN
  - Before DQN, Q-Learning could not handle high-dimensional state spaces well due to its reliance on a tabular Q-value representation.
  - Function approximation allows storing the Q-values under limited memory and enables generalization (see the sketch below).
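A PyTorch sketch of such a Q-network; the layer sizes follow the Nature-DQN architecture, but treat the exact values and names here as assumptions rather than a faithful reproduction.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """CNN that maps a stack of frames to one Q-value per action."""
    def __init__(self, n_actions, in_frames=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 feature map for 84x84 inputs
            nn.Linear(512, n_actions),              # one Q-value per action
        )

    def forward(self, x):
        # x: (batch, in_frames, 84, 84), pixel values scaled to [0, 1]
        return self.head(self.features(x))
```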
- Model \(Q(\vs,\cdot)\) instead of \(Q(\vs,\va)\)
  - This lets the Q-values of all actions be obtained with a single forward pass of the CNN instead of \(\vert\sA\vert\) forward passes, making the Q-Learning update more efficient (the Bellman optimality operator requires a max over all actions, i.e., \(\max_{\va'}Q(\vs',\va')\)); see the sketch below.
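A small sketch of how that single forward pass is used; `q_net` is assumed to be any network with \(\vert\sA\vert\) outputs, such as the `QNetwork` sketched above.

```python
import torch

def greedy_and_target_max(q_net, states, next_states):
    # One forward pass per batch yields Q(s, a) for *all* actions at once.
    with torch.no_grad():
        greedy_actions = q_net(states).argmax(dim=1)        # exploitation branch of epsilon-greedy
        next_q_max = q_net(next_states).max(dim=1).values   # max_{a'} Q(s', a') for the Bellman target
    return greedy_actions, next_q_max
```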
- Experience Replay
  - Neural networks prefer i.i.d. (independent and identically distributed) data.
  - In Atari games, the collected data is highly correlated due to the game mechanics (e.g., game resets, common trajectories); the data distribution is also non-stationary because the policy keeps changing.
  - The non-i.i.d. issue can be alleviated by sampling mini-batches from a pool of previous transitions (the experience replay buffer), as sketched below.
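A minimal replay-buffer sketch; the class name, capacity, and batch size are placeholders rather than the paper's settings.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity pool of transitions sampled uniformly at random."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the temporal correlation between consecutive transitions.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```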
- Target Network
  - In supervised learning, the target is defined by the ground truth.
  - In Q-Learning, the target is defined by the learner's own output (i.e., bootstrapping). Such approximate dynamic programming (ADP) techniques can become unstable when combined with NN function approximation.
  - The update can be stabilized by an additional target network, obtained by copying and freezing the weights of the policy network and refreshing them only once in a while (preventing the target, and hence the loss landscape, from changing too fast); see the sketch below.
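A target-network sketch in PyTorch; the function names, discount factor, and update period are assumptions for illustration.

```python
import copy
import torch

def make_target(q_net):
    # A frozen copy of the policy network; its weights change only at scheduled updates.
    target_net = copy.deepcopy(q_net)
    target_net.eval()
    return target_net

def td_targets(target_net, rewards, next_states, dones, gamma=0.99):
    # Targets are computed without gradients, so they stay fixed during the update.
    with torch.no_grad():
        next_q_max = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_q_max  # dones: float tensor of 0/1

def hard_update(target_net, q_net):
    # Every C gradient steps, copy the policy weights into the target network.
    target_net.load_state_dict(q_net.state_dict())
```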
- Nature DQN vs. NIPS DQN
  - The concept of the target network was introduced in the Nature paper but not in the NIPS paper.
  - Some older papers emphasize the use of the target network with the term Nature DQN (in contrast to (NIPS) DQN, which does not use a target network).
  - Nowadays, the term DQN generally refers to Nature DQN (the variant with the target network).
Official Resources
- (NIPS-DQN) [NIPS 2013] Playing Atari with Deep Reinforcement Learning [arxiv][preprint] (citations: 11636, 8652, as of 2023-04-29)
- (Nature-DQN) [Nature 2015] Human-level control through deep reinforcement learning [paper][paper][blog][code] (citations: 23908, 19855, as of 2023-04-29)
Community Resources
- Reinforcement Learning (DQN) Tutorial, by PyTorch (with PyTorch)
- Train a Deep Q Network with TF-Agents, by TensorFlow (with TensorFlow)
- tensorflow/agents (with TensorFlow)
- The Deep Q-Network (DQN), by Hugging Face (with PyTorch, Stable Baselines3)
- DQN, by Stable Baselines3
- DLR-RM/stable-baselines3 (with PyTorch)
- Deep Q Networks (DQN), by Ray
- ray-project/ray (with PyTorch, TensorFlow)
- Deep Q-Learning (DQN), by Clean RL
- vwxyzjn/cleanrl (with PyTorch, Jax)
- Deep Q-Network (DQN), by SKRL
- Toni-SM/skrl (with PyTorch)
- pytorch/rl, by PyTorch (with PyTorch)
- google/dopamine, by Google (with TensorFlow, Jax)
- Deep Q Networks (DQN), by labml.ai
- labmlai/annotated_deep_learning_paper_implementations (with PyTorch)
- DQN, by Elegant RL
- AI4Finance-Foundation/ElegantRL (with PyTorch)
- DQNAgent, by Keras-RL
- keras-rl/keras-rl (with Keras)
- Papers with Code