精通機器學習

by Aurélien Géron
April 2020
Intermediate to advanced
816 pages
Chinese
GoTop Information, Inc.
Chapter 18: Reinforcement Learning
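We can reasonably assume that actions with a negative advantage were bad, while actions with a positive advantage were good. Perfect! Now that we have a way to evaluate each action, we are ready to train our first agent using policy gradients. Let's see how.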
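The advantages mentioned here come from the method of the previous section, which this page does not reproduce. As a hedged sketch, assuming that method is the usual one (discount each episode's future rewards, then standardize the results across all episodes), the computation could look like this in plain Python. The function names `discount_rewards` and `discount_and_normalize_rewards` are illustrative, not quoted from this page.

```python
import statistics

def discount_rewards(rewards, discount_factor):
    """For each step, add that step's reward to all later rewards,
    each later reward discounted by discount_factor per step of delay."""
    discounted = [float(r) for r in rewards]
    # Walk backwards so each step adds the discounted return of the step after it.
    for step in range(len(discounted) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_factor
    return discounted

def discount_and_normalize_rewards(all_rewards, discount_factor):
    """Discount every episode, then standardize all returns together
    (mean 0, standard deviation 1) so they can serve as advantages."""
    all_discounted = [discount_rewards(r, discount_factor) for r in all_rewards]
    flat = [x for episode in all_discounted for x in episode]
    mean = statistics.mean(flat)
    std = statistics.pstdev(flat) or 1.0  # avoid dividing by zero if all returns are equal
    return [[(x - mean) / std for x in episode] for episode in all_discounted]
```

With a discount factor of 0.8, `discount_rewards([10, 0, -50], 0.8)` gives [-22.0, -40.0, -50.0]: the early +10 is outweighed by the heavy penalty that follows. Normalizing that episode together with a second one, [10, 20], gives every action in the first episode a negative advantage and every action in the second a positive one, which is the good/bad reading described above.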
Policy Gradients
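As discussed earlier, PG algorithms optimize the parameters of a policy by following the gradients toward higher rewards. One popular PG algorithm, called REINFORCE, was introduced back in 1992 by Ronald Williams (https://homl.info/132).¹¹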
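For reference, the gradient that REINFORCE-style methods estimate is usually written in the score-function form below (standard notation, not taken from this page): $\pi_\theta(a_t \mid s_t)$ is the policy's probability of the action chosen at step $t$, $A_t$ is that action's advantage, $N$ is the total number of steps collected, and $\eta$ is the learning rate.

```latex
\hat{g} = \frac{1}{N} \sum_{t=1}^{N} A_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t),
\qquad
\theta \leftarrow \theta + \eta \, \hat{g}
```

Multiplying each per-step gradient by $A_t$ and then averaging is what the variant below does. Frameworks usually minimize the negated objective, $-\frac{1}{N}\sum_{t} A_t \log \pi_\theta(a_t \mid s_t)$, so the same update becomes an ordinary gradient descent step.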
Here is one common variant:
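1. First, let the neural network policy play the game several times, and at each step, compute the gradients that would make the chosen action even more likely, but don't apply these gradients yet.
2. After running several episodes, compute each action's advantage (using the method described in the previous section).
3. If an action's advantage is positive, the action was probably good, and you want to apply the gradients computed earlier to make it more likely to be chosen in the future. However, if the action's advantage is negative, the action was probably bad, and you want to apply the opposite gradients to make it less likely to be chosen in the future. The solution is simply to multiply each gradient vector by the corresponding action's advantage.
4. Finally, compute the mean of all the resulting gradient vectors, and use it to perform a Gradient Descent step.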
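The four steps above can be sketched without TensorFlow on a deliberately tiny problem. This is a hypothetical toy, not the book's CartPole code: each episode is a single step, the policy has one parameter `theta` with P(action = 1) = sigmoid(theta), action 1 always earns reward 1 and action 0 earns 0, and the advantages are the rewards standardized across the batch. The sketch stores gradients of log π(action) and therefore applies them by gradient ascent; storing the cross-entropy loss gradients instead (their negation) and applying gradient descent is equivalent.

```python
import math
import random
import statistics

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_for(action):
    # Hypothetical toy "game": action 1 is the good action, action 0 the bad one.
    return 1.0 if action == 1 else 0.0

def train_reinforce(n_iterations=200, n_episodes_per_update=10,
                    learning_rate=0.5, seed=42):
    rng = random.Random(seed)
    theta = 0.0  # single policy parameter: P(action=1) = sigmoid(theta)
    for _ in range(n_iterations):
        grads, rewards = [], []
        # Step 1: play several (one-step) episodes, storing the gradient that
        # would make each chosen action more likely, without applying it yet.
        for _ in range(n_episodes_per_update):
            p1 = sigmoid(theta)
            action = 1 if rng.random() < p1 else 0
            grads.append(action - p1)  # d/dtheta of log pi(action)
            rewards.append(reward_for(action))
        # Step 2: advantages = returns standardized across the batch
        # (each episode has one step, so its return is just its reward).
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards)
        if std == 0:
            continue  # every advantage would be 0, so there is nothing to learn
        advantages = [(r - mean) / std for r in rewards]
        # Step 3: scale each stored gradient by its action's advantage.
        weighted = [adv * g for adv, g in zip(advantages, grads)]
        # Step 4: average the scaled gradients and take one ascent step.
        theta += learning_rate * statistics.mean(weighted)
    return sigmoid(theta)
```

In a mixed batch, every scaled gradient pushes `theta` toward the rewarded action (a positive advantage times a positive gradient for action 1, a negative advantage times a negative gradient for action 0), so `train_reinforce()` ends with a high probability of choosing action 1. Batches in which every episode picked the same action are skipped, because standardizing identical rewards would divide by zero.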
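Let's use tf.keras to implement this algorithm. We will train the neural network policy we built earlier so that it learns to balance the pole on the cart.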
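First, we need a function that will play one step. For now, we will pretend that whatever action it takes is the right one, so that we can compute the loss and its gradients (we simply save these gradients for now and will modify them later, depending on how good or bad the action turns out to be):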
def play_one_step(env, ...
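The listing is cut off in this preview, so its body is not reproduced here. To make the "pretend the action was right" idea concrete, here is a hypothetical pure-Python analogue, `play_one_step_sketch`, which is not the book's function: a logistic policy gives the probability of going left, the sampled action is treated as the target, and the gradient of the binary cross-entropy with respect to the logit is simply p_left - y_target. The `step_fn` callback returning `(next_obs, reward, done)` is an assumed stand-in for the environment, and the weights/bias form of the policy is an assumption for illustration.

```python
import math
import random

def play_one_step_sketch(step_fn, obs, weights, bias, rng):
    """Hypothetical pure-Python analogue of play_one_step (not the book's code).

    step_fn(action) -> (next_obs, reward, done) is an assumed environment stand-in.
    The policy is logistic: p_left = sigmoid(weights . obs + bias) is the
    probability of action 0 (left); action 1 means right.
    """
    logit = sum(w * x for w, x in zip(weights, obs)) + bias
    p_left = 1.0 / (1.0 + math.exp(-logit))
    action = 0 if rng.random() < p_left else 1
    # Pretend the sampled action was correct: target 1.0 for left, 0.0 for right.
    y_target = 1.0 - action
    # For binary cross-entropy on a sigmoid output, d(loss)/d(logit) = p_left - y_target.
    dloss_dlogit = p_left - y_target
    # Chain rule: d(logit)/d(weights[i]) = obs[i] and d(logit)/d(bias) = 1.
    grads = ([dloss_dlogit * x for x in obs], dloss_dlogit)
    # Only now act in the environment; the gradients are returned unapplied
    # so the caller can rescale them once the action's outcome is known.
    next_obs, reward, done = step_fn(action)
    return next_obs, reward, done, grads
```

A gradient descent step with these raw gradients would make the sampled action more likely. The REINFORCE loop first multiplies them by the action's advantage, which reverses their direction for actions that turn out badly.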

ISBN: 9789865024345