Deep Q-Network (DQN)
Goal: Scale Q-Learning to high-dimensional state spaces.
Contribution: Achieved performance comparable to that of a professional human games tester across a set of 49 Atari 2600 games from the Arcade Learning Environment (ALE), receiving only the pixels and the game score as inputs, and using the same algorithm, network architecture, and hyperparameters across all games.
- Markov Decision Process (MDP)
- Q-Learning
- Neural Network (NN)
- Convolutional Neural Network (CNN)
Stacked frames as MDP state
- If we define the state as a single Atari game frame, the problem becomes a POMDP instead of an MDP (e.g., the ball velocity in Pong cannot be inferred from a single frame).
- DQN defines the state as a stack of 4 consecutive frames, which (empirically) makes most Atari games close enough to an MDP.
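The frame-stacking idea above can be sketched with a fixed-length queue. This is a minimal illustration, not the preprocessing used in the paper: the class name `FrameStack`, its methods, and the choice to repeat the first frame at episode start are all assumptions made for the example, and the "frames" are plain integers standing in for images.

```python
from collections import deque


class FrameStack:
    """Keeps the k most recent frames and exposes them as one state (hypothetical helper)."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Assumption: with no history at episode start, repeat the first frame k times.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.state()

    def step(self, new_frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(new_frame)
        return self.state()

    def state(self):
        # Oldest frame first, newest frame last.
        return tuple(self.frames)


# Toy "frames": each one is just the ball's x-position in a Pong-like game.
stack = FrameStack(k=4)
s0 = stack.reset(10)
s1 = stack.step(12)
s2 = stack.step(14)

# A single frame says nothing about motion, but the stacked state does.
velocity = s2[-1] - s2[-2]
```

With real images, each entry would be a 2-D array and the stacked state would be fed to the network as a multi-channel input.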
Approximate the Q-value table with a CNN
- Before the introduction of DQN, Q-Learning could not handle high-dimensional state spaces well because it relied on a Q-value table with one entry per state-action pair.
- Function approximation allows the Q-values to be stored in limited memory and enables generalization across similar states.
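A back-of-the-envelope calculation gives a feel for why a table is infeasible for pixel inputs. The input shape used here (4 stacked 84x84 grayscale frames with 256 intensity levels) is an assumption for illustration and is not stated in the notes above.

```python
import math

# Assumed input shape: 4 stacked 84x84 grayscale frames, 256 levels per pixel.
height, width, stacked_frames, levels = 84, 84, 4, 256

# Each pixel is one table "coordinate", so the table would need levels ** pixels rows.
pixels = height * width * stacked_frames

# Count the decimal digits of levels ** pixels without building the huge integer.
digits = math.floor(pixels * math.log10(levels)) + 1

print(f"{pixels} pixels -> a state count with {digits} decimal digits")
```

Even if most of these configurations are never reached in play, the count is far beyond what any table can hold, which is why a parametric approximator with a fixed number of weights is used instead.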
Model \(Q(\vs,\cdot)\) instead of \(Q(\vs,\va)\)
- The network takes the state as input and outputs one Q-value per action, so all actions' Q-values are obtained with a single forward pass of the CNN instead of \(\vert\sA\vert\) forward passes. This makes the Q-Learning update more efficient, because the Bellman optimality operator requires a max across all actions, i.e., \(\max_{\va'}Q(\vs',\va')\).
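The benefit of a vector-valued output can be illustrated with a toy stand-in for the CNN. In this sketch the "network" is just one weight vector per action; the function names `q_values` and `td_target`, the weights, and the numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def q_values(weights, state):
    """Toy stand-in for the CNN: one call returns a Q-value for every action."""
    # weights[a] is the weight vector of the output unit for action a.
    return [sum(w * x for w, x in zip(w_a, state)) for w_a in weights]


def td_target(weights, reward, next_state, done, gamma=0.99):
    """Q-Learning target r + gamma * max_a' Q(s', a'); no bootstrapping at terminal states."""
    if done:
        return reward
    return reward + gamma * max(q_values(weights, next_state))


# Hypothetical numbers: 3 actions, 2 state features.
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
next_state = [2.0, 4.0]

qs = q_values(weights, next_state)  # all three action values from one call
target = td_target(weights, reward=1.0, next_state=next_state, done=False, gamma=0.5)
greedy_action = max(range(len(qs)), key=qs.__getitem__)
```

Had the model taken a state-action pair as input, computing the max would require one call per action; here the max and the greedy action both come from the same output vector.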
Experience Replay
- Neural network training with stochastic gradient descent works best on i.i.d. (independent and identically distributed) data.
- In Atari games, the data collected is highly correlated due to the game mechanics (e.g., game resets, common trajectories); the data distribution is also non-stationary because the policy changes during training.
- The non-i.i.d. issue may be alleviated by sampling each mini-batch from a pool of previous transitions (the experience replay buffer).
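A replay buffer of the kind described above can be sketched with a bounded queue and uniform sampling. The class name `ReplayBuffer`, its method names, and the optional `rng` argument are assumptions for this example; real implementations usually store arrays and return batched tensors.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity pool of transitions with uniform sampling (illustrative sketch)."""

    def __init__(self, capacity):
        # Once full, the oldest transition is discarded on each append.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks up temporal correlation
        # between consecutive transitions.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Usage with toy integer states: push 5 transitions into a buffer of capacity 3.
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(state=t, action=0, reward=0.0, next_state=t + 1, done=False)

batch = buffer.sample(2, rng=random.Random(0))
```

Because the buffer holds transitions collected under many past policies, each mini-batch mixes old and new experience, which also smooths out the non-stationarity of the data distribution.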
Target Network
- The target in supervised learning is defined based on the ground truth.
- The target of Q-Learning is defined based on the network's own output (i.e., bootstrapping). Such approximate dynamic programming (ADP) techniques may cause instability when combined with NN approximation.
- The update can be stabilized by an additional target network, obtained by copying and freezing the weights of the online (policy) network and refreshing the copy only once in a while. This prevents the regression targets, and hence the loss landscape, from changing too fast.
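The periodic copy described above can be sketched as follows. The parameter dictionary, the `sync_every` interval of 3 steps, and the "+1.0" stand-in for a gradient step are all hypothetical and chosen only to make the lag between the two networks visible.

```python
import copy


def sync_target(online_params):
    """Return a frozen snapshot of the online parameters."""
    # Deep copy so that later updates to the online network do not leak into the target.
    return copy.deepcopy(online_params)


online = {"w": [0.0, 0.0]}
target = sync_target(online)
sync_every = 3  # hypothetical interval; real values are typically much larger

history = []
for step in range(1, 7):
    # Stand-in for one gradient step on the online network.
    online["w"] = [w + 1.0 for w in online["w"]]

    # The target network is refreshed only every `sync_every` steps.
    if step % sync_every == 0:
        target = sync_target(online)

    history.append(target["w"][0])

print(history)
```

In a full agent, the bootstrapped term \(\max_{\va'}Q(\vs',\va')\) in the target would be evaluated with the `target` parameters, while gradients flow only through the `online` parameters.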
Nature DQN vs. NIPS DQN
- The target network is introduced in the Nature paper; the NIPS paper does not use one.
- Some older papers use the term Nature DQN to emphasize the use of a target network, in contrast to (NIPS) DQN, which does not use one.
- Currently, the term DQN generally refers to Nature DQN (the one with a target network).
Official Resources
- (NIPS-DQN) [NIPS 2013] Playing Atari with Deep Reinforcement Learning [arxiv][preprint] (citations: 11636, 8652, as of 2023-04-29)
- (Nature-DQN) [Nature 2015] Human-level control through deep reinforcement learning [paper][paper][blog][code] (citations: 23908, 19855, as of 2023-04-29)
Community Resources
- Reinforcement Learning (DQN) Tutorial, by PyTorch (with PyTorch)
- Train a Deep Q Network with TF-Agents, by TensorFlow (with TensorFlow)
- tensorflow/agents (with TensorFlow)
- The Deep Q-Network (DQN), by Hugging Face (with PyTorch, Stable Baselines3)
- DQN, by Stable Baselines3
- DLR-RM/stable-baselines3 (with PyTorch)
- Deep Q Networks (DQN), by Ray
- ray-project/ray (with PyTorch, TensorFlow)
- Deep Q-Learning (DQN), by Clean RL
- vwxyzjn/cleanrl (with PyTorch, Jax)
- Deep Q-Network (DQN), by SKRL
- Toni-SM/skrl (with PyTorch)
- pytorch/rl, by PyTorch (with PyTorch)
- google/dopamine, by Google (with TensorFlow, Jax)
- Deep Q Networks (DQN), by
- labmlai/annotated_deep_learning_paper_implementations (with PyTorch)
- DQN, by Elegant RL
- AI4Finance-Foundation/ElegantRL (with PyTorch)
- DQNAgent, by Keras-RL
- keras-rl/keras-rl (with Keras)
- Papers with Code