Use the loss function of the Policy Gradient algorithm as the key to understanding several reinforcement learning algorithms: REINFORCE, Actor-Critic, and PPO, which are the theoretical groundwork for understanding the Reinforcement Learning from Human Feedback (RLHF) algorithm used to build ChatGPT.
Studying reinforcement learning can be frustrating because the field is cursed with confusing jargon and algorithms with subtle differences.
I struggled until one day my great colleague Peter Vrancs swiftly wrote down the derivation of the loss function for the Policy Gradient algorithm REINFORCE for me. Using this derivation, this article links the following algorithms together:
- REINFORCE
- The concept of advantage for variance reduction, and the Actor-Critic algorithm
- Proximal Policy Optimisation (PPO)
Although many articles cover these algorithms, this article offers a unique angle: studying them in one go to save you reading time!
In my opinion, understanding these three algorithms is the theoretical bare…