Lecture presentations (in Slovak)

- Selected Mathematical Titbits
- Introduction to Reinforcement Learning
- Finite Markov Decision Process
- Dynamic Programming Methods
- Monte Carlo Methods (alternative version available)
- Temporal-difference Methods
- Value Function Approximation
- Policy Approximation
- Between MC and TD
- Deep RL (value functions)
- Planning and learning
- Actor-Critic
- RL for LLM training
- Software tools

Final exam topics (2025/2026 version)
- Selected Mathematical Titbits
  - Expected value
  - Probability
  - Markov chain
  - Infinite series (worked example below)
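As a pointer to how these titbits recur later in the course, the discounted return is a direct application of the geometric series; a standard worked example (not taken from the slides):

```latex
% Expected value of a discrete random variable X:
\mathbb{E}[X] = \sum_{x} x \, P(X = x)
% Geometric series with 0 <= \gamma < 1, which makes the discounted return finite:
\sum_{k=0}^{\infty} \gamma^{k} r = \frac{r}{1 - \gamma}
% e.g. a constant reward r = 1 with \gamma = 0.9 gives a return of 10.
```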
- Introduction to Reinforcement Learning (RL)
  - Characterisation of RL
  - Basic elements of RL
  - Interaction between agent and environment (loop sketched below)
  - Exploration vs exploitation
  - Taxonomy of RL methods
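A minimal sketch of the agent-environment interaction loop, assuming the Gymnasium API; `CartPole-v1` is an illustrative choice, not part of the course materials, and the random policy stands in for pure exploration:

```python
import gymnasium as gym

# Minimal agent-environment interaction loop (Gymnasium API assumed).
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
total_return = 0.0
done = False
while not done:
    action = env.action_space.sample()  # random policy: pure exploration
    obs, reward, terminated, truncated, info = env.step(action)
    total_return += reward
    done = terminated or truncated
env.close()
print(f"episode return: {total_return}")
```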
- Finite Markov Decision Process (MDP)
  - Markov decision process (MDP)
  - Environment dynamics modelling
  - Reward and return modelling
  - Value functions and optimal value functions
  - Bellman expectation equation for v (equations sketched below)
  - Bellman expectation equation for q
  - Action selection and optimal action selection
  - Bellman optimality equations
  - Optimal policy retrieval
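In the standard Sutton-Barto notation, the two key equations for v take the following form (the equations for q are analogous):

```latex
% Bellman expectation equation for v_\pi:
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)
           \left[ r + \gamma \, v_\pi(s') \right]
% Bellman optimality equation for v_*:
v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)
         \left[ r + \gamma \, v_*(s') \right]
```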
- Dynamic programming based RL
  - Dynamic programming and RL
  - Policy evaluation
  - Policy improvement and its theorem
  - Policy iteration
  - Value iteration (sketched below)
  - Synchronous and asynchronous DP
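A minimal sketch of value iteration, assuming the dynamics are given as `P[s][a] = [(probability, next_state, reward), ...]`; this layout is hypothetical, not the course's own:

```python
# Value iteration on a small finite MDP.
def value_iteration(P, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_old = V[s]
            # Bellman optimality backup: best one-step lookahead value
            V[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:  # stop once the value function has converged
            return V
```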
- Monte Carlo based RL
  - Action value function estimation (sketched below)
  - General policy iteration and convergence
  - On-policy learning with exploring starts
  - On-policy learning with soft policy
  - Off-policy learning and importance sampling
  - Off-policy action value function estimation
  - Off-policy optimal policy estimation
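A minimal sketch of first-visit Monte Carlo estimation of the action-value function; the trajectory layout `[(state, action, reward), ...]` is an assumption for illustration:

```python
from collections import defaultdict

# First-visit Monte Carlo estimation of q(s, a): average the return observed
# after the first visit to each state-action pair.
def first_visit_mc_q(episodes, gamma=0.99):
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        first_return = {}
        # Walk backwards so G accumulates the discounted return; later
        # overwrites correspond to *earlier* visits, so the value kept
        # for each (s, a) is the return from its first visit.
        for s, a, r in reversed(episode):
            G = gamma * G + r
            first_return[(s, a)] = G
        for sa, g in first_return.items():
            returns[sa].append(g)
    return {sa: sum(gs) / len(gs) for sa, gs in returns.items()}
```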
- Temporal-difference based RL
  - TD vs MC approach
  - Value function estimation
  - Iterative policy learning
  - On-policy Sarsa algorithm
  - Off-policy Q-learning algorithm (update sketched below)
  - Expected Sarsa algorithm
  - Double learning
  - Greedy policy as exploration policy
  - Delayed Q-learning
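A minimal sketch of the tabular Q-learning update with ε-greedy action selection; the state/action encoding is assumed hashable, and all names are illustrative:

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # tabular action-value estimates

def epsilon_greedy(state, actions, eps=0.1):
    # Behaviour policy: explore with probability eps, otherwise act greedily.
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Off-policy: the target bootstraps from max over next actions,
    # i.e. it learns about the greedy policy regardless of behaviour.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```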
- Approximation of value functions
  - Scaling problem
  - Parametric vs memory-based approximator
  - Framework for parametric value approximation
  - SGD minimization
  - Semi-gradient value function estimation (sketched below)
  - Linear approximator
  - Episodic semi-gradient Sarsa
  - Average reward approach for continuing tasks
  - Differential semi-gradient Sarsa
  - Incremental vs mini-batch approach
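A minimal sketch of semi-gradient TD(0) with a linear approximator v̂(s) = w·x(s); the feature map `phi` is a hypothetical stand-in for any fixed featurization:

```python
import numpy as np

# One step of semi-gradient TD(0) with linear function approximation.
def semi_gradient_td0(w, phi, s, r, s_next, done, alpha=0.01, gamma=0.99):
    v_next = 0.0 if done else w @ phi(s_next)
    td_error = r + gamma * v_next - w @ phi(s)
    # "Semi"-gradient: the bootstrapped target is treated as a constant,
    # so the update only differentiates v(s), whose gradient is phi(s).
    w += alpha * td_error * phi(s)
    return w
```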
- Between MC and TD
  - Estimation of cumulative reward
  - TD(n) algorithm
  - Policy learning: Sarsa(n)
  - λ-return
  - Forward and backward view
  - Eligibility traces (sketched below)
  - Policy learning: tabular Sarsa(λ)
  - Semi-gradient TD(λ) algorithm
  - Policy learning: Sarsa(λ)
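A minimal sketch of the backward view: one step of tabular TD(λ) with accumulating eligibility traces. `V` and `traces` are assumed to be `defaultdict(float)` instances that persist across the steps of an episode:

```python
# One step of tabular TD(lambda), backward view with accumulating traces.
def td_lambda_step(V, traces, s, r, s_next, done,
                   alpha=0.1, gamma=0.99, lam=0.9):
    delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
    traces[s] += 1.0  # accumulating trace for the visited state
    for state in list(traces):
        # Every recently visited state receives credit for the TD error,
        # weighted by its trace, which then decays toward zero.
        V[state] += alpha * delta * traces[state]
        traces[state] *= gamma * lam
    return V, traces
```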
- Deep RL (value functions)
  - DNN as a parametric approximator
  - Deep Q-learning and its incompatibilities
  - DQN and stability maintenance (target computation sketched below)
  - DQN enhancements: Rainbow
  - Continuous actions: NAF
  - Partial observability: DRQN
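A minimal PyTorch sketch of the DQN loss with a frozen target network, one of the stability tricks above; `q_net` and `target_net` are assumed `torch.nn.Module`s, and replay-buffer plumbing is omitted:

```python
import torch

# Bootstrapped targets are computed with a periodically synced copy of the
# network, so the regression target does not move with every gradient step.
@torch.no_grad()
def dqn_targets(target_net, rewards, next_states, dones, gamma=0.99):
    next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones.float()) * next_q

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones):
    # Q-values of the actions actually taken in the sampled batch.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    targets = dqn_targets(target_net, rewards, next_states, dones)
    return torch.nn.functional.mse_loss(q_sa, targets)
```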
- Approximation of policy
  - Parametric policy approximator
  - Gradient learning in an episodic environment
  - REINFORCE algorithm (update equations below)
  - REINFORCE with baseline algorithm
  - One-step episodic actor-critic algorithm
  - Learning in a continuing environment
  - One-step continuing actor-critic algorithm
  - Continuous action space
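The episodic policy gradient and the resulting REINFORCE update, in the standard Sutton-Barto form:

```latex
% Policy gradient (episodic case) and the REINFORCE update:
\nabla_\theta J(\theta) = \mathbb{E}_\pi\!\left[ G_t \,
    \nabla_\theta \ln \pi(A_t \mid S_t, \theta) \right],
\qquad
\theta \leftarrow \theta + \alpha \, \gamma^{t} G_t \,
    \nabla_\theta \ln \pi(A_t \mid S_t, \theta)
% With a baseline b(S_t), G_t is replaced by G_t - b(S_t) to reduce variance.
```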
- Planning and learning
  - Model types
  - Unification of planning and learning
  - Dyna-Q algorithm (sketched below)
  - Exploration in the planning context: Dyna-Q+
  - Model sampling strategy
  - Prioritized sweeping
  - NAF model training
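A minimal sketch of one Dyna-Q step, assuming a tabular, deterministic environment and a shared action set; all names are illustrative:

```python
import random
from collections import defaultdict

Q = defaultdict(float)
model = {}  # learned deterministic model: (s, a) -> (r, s_next)

def dyna_q_step(s, a, r, s_next, actions, alpha=0.1, gamma=0.95, n=10):
    # Direct RL update from the real experience.
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions)
                          - Q[(s, a)])
    model[(s, a)] = (r, s_next)  # record the observed transition
    # Planning: replay n simulated transitions sampled from the model.
    for _ in range(n):
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in actions)
                                - Q[(ps, pa)])
```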
- Actor-Critic algorithms
  - Policy gradient and the advantage function
  - Interaction of Actor-Critic with the environment
  - Surrogate objective
  - PPO: clipped objective (formula below)
  - PPO: advantage calculation
  - PPO: algorithm structure
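The clipped surrogate objective as published in the PPO paper (Schulman et al., 2017), with the probability ratio r_t(θ) and the advantage estimate Â_t:

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\qquad
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t \Bigl[ \min\bigl(
  r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\bigl(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\bigr)\,\hat{A}_t
\bigr) \Bigr]
```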
- RL for LLM alignment
  - RL in LLM training
  - RL with human feedback (RLHF)
  - PPO in RLHF
  - GRPO in RLHF (advantage formula below)
  - GSPO in RLHF
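As a pointer for the GRPO item: for each prompt, a group of G responses is sampled and scored by the reward model, and the group-normalized score serves as the advantage, removing the need for a value network (my paraphrase of the formula from the DeepSeekMath paper that introduced GRPO):

```latex
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}
                 {\operatorname{std}(r_1, \ldots, r_G)}
```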