(Repost) Some RL papers (and notes)

Copied from: https://zhuanlan.zhihu.com/p/25770890

Introductions

Introduction to reinforcement learning
Index of /rowan/files/rl

ICML Tutorials:
http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf

NIPS Tutorials:
CS 294 Deep Reinforcement Learning, Spring 2017
https://drive.google.com/file/d/0B_wzP_JlVFcKS2dDWUZqTTZGalU/view

Deep Q-Learning


DQN:
[1312.5602] Playing Atari with Deep Reinforcement Learning (and its Nature version)

Double DQN
[1509.06461] Deep Reinforcement Learning with Double Q-learning

Bootstrapped DQN
[1602.04621] Deep Exploration via Bootstrapped DQN

Prioritized Experience Replay
http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Applications_files/prioritized-replay.pdf

Dueling DQN
[1511.06581] Dueling Network Architectures for Deep Reinforcement Learning

Classic Literature

Sutton & Barto, Reinforcement Learning: An Introduction
http://people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf

David Silver's thesis
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Publications_files/thesis.pdf

Policy Gradient Methods for Reinforcement Learning with Function Approximation
https://webdocs.cs.ualberta.ca/~sutton/papers/SMSM-NIPS99.pdf
(Policy gradient theorem)

1. A policy-based approach has an advantage over a value-based one: the policy is a smooth function of its parameters, whereas the greedy policy induced by a value function can change discontinuously.

2. Policy gradient method.
The objective is averaged over the stationary state distribution (starting from s0).
For the average-reward formulation, the distribution needs to be truly stationary.
For the start-state (discounted) formulation, if all experience starts from s0, then the objective is averaged over a discounted state-visitation distribution (not necessarily fully stationary); if we start from an arbitrary state, then the objective is averaged over the (discounted) stationary distribution.
Policy gradient theorem: the gradient operator can "pass through" the state distribution, even though that distribution depends on the parameters (and at first glance should be differentiated too).

3. You can replace Q^\pi(s, a) with a function approximator f(s, a); the gradient stays exact only when the approximator satisfies df/dw = (d\pi/d\theta) / \pi, i.e., \nabla_w f = \nabla_\theta \log \pi (the compatibility condition; see the LaTeX sketch after note 4).
If \pi(a|s) is log-linear in some features, then f has to be linear in those features with \sum_a \pi(a|s) f(s, a) = 0 (so f is an advantage function).

4. First result showing that an RL algorithm converges to a local optimum with a relatively free-form function approximator.
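
For reference, a LaTeX sketch of notes 2 and 3 (my notation, following the paper):

% Policy gradient theorem: the gradient passes through the state
% distribution d^\pi even though d^\pi depends on \theta.
\nabla_\theta J(\theta) = \sum_s d^{\pi}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s, a)

% Compatible function approximation: replacing Q^\pi with f_w keeps the
% gradient exact when f_w has the compatible form
\nabla_w f_w(s, a) = \nabla_\theta \log \pi_\theta(a \mid s)
% For the log-linear (softmax) policy in the paper, the compatible f_w is
% mean-zero under \pi, i.e., an advantage function:
\sum_a \pi_\theta(a \mid s)\, f_w(s, a) = 0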

DAgger
https://www.cs.cmu.edu/~sross1/publications/Ross-AIStats10-paper.pdf

Actor-Critic Models

Asynchronous Advantage Actor-Critic Model
[1602.01783] Asynchronous Methods for Deep Reinforcement Learning

Tensorpack's BatchA3C (ppwwyyxx/tensorpack) and GA3C ([1611.06256] Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU)
Instead of using a separate model for each actor (each in its own CPU thread), they push all the data generated by the actors through a single shared model, which is updated regularly via optimization; a rough sketch of this batching pattern is below.
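
A rough Python sketch of that serving pattern (not from either codebase): actor threads enqueue observations, and one predictor thread batches them through a single shared model. It assumes a gym-style env and a callable model that maps a batch of observations to a batch of actions.

import queue
import threading
import numpy as np

request_q = queue.Queue()

def actor_loop(env):
    obs = env.reset()
    while True:
        reply = queue.Queue(maxsize=1)
        request_q.put((obs, reply))
        action = reply.get()                 # block until the shared model answers
        obs, reward, done, _ = env.step(action)
        if done:
            obs = env.reset()

def predictor_loop(model, batch_size=16, timeout=0.01):
    while True:
        batch, replies = [], []
        try:
            while len(batch) < batch_size:   # collect up to a full batch
                obs, reply = request_q.get(timeout=timeout)
                batch.append(obs)
                replies.append(reply)
        except queue.Empty:
            pass                             # a partial batch is fine
        if batch:
            actions = model(np.stack(batch)) # one batched forward pass
            for reply, action in zip(replies, actions):
                reply.put(action)

# Usage (illustrative): one predictor thread, many actor threads, e.g.
#   threading.Thread(target=predictor_loop, args=(model,), daemon=True).start()
#   for env in envs:
#       threading.Thread(target=actor_loop, args=(env,), daemon=True).start()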

On actor-critic algorithms.
http://www.mit.edu/~jnt/Papers/J094-03-kon-actors.pdf
Only read the first part of the paper. It proves that actor-critic converges to a local optimum when the feature space used to linearly represent Q(s, a) covers the space spanned by \nabla log \pi(a|s) (the compatibility condition) and the actor learns on a slower timescale than the critic.

Natural Actor-Critic
https://dev.spline.de/trac/dbsprojekt_51_ss09/export/74/ki_seminar/referenzen/peters-ECML2005.pdf
Natural gradient applied to the actor-critic method. When the compatibility condition proposed in the policy gradient paper is satisfied (i.e., Q(s, a) is linear in \nabla log \pi(a|s), so that the gradient estimate computed with this approximate Q equals the true gradient computed with the unknown exact Q of the current policy), the natural gradient of the policy parameters is simply the linear coefficient of Q.
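
In symbols (my notation), the claim is that with a compatible linear critic, the natural gradient is just the critic's weight vector:

% Compatible critic, linear in the score function
Q_w(s, a) = w^{\top} \nabla_\theta \log \pi_\theta(a \mid s)
% Natural gradient (F is the Fisher information matrix of \pi_\theta)
\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1} \nabla_\theta J(\theta) = w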

A Survey of Actor-Critic Reinforcement Learning Standard and Natural Policy Gradients
https://hal.archives-ouvertes.fr/hal-00756747/document
Covers the above two papers.

Continuous State/Action

Reinforcement Learning with Deep Energy-Based Policies 
Uses the soft-Q formulation proposed by https://arxiv.org/pdf/1702.08892.pdf (in the math section) and naturally incorporates the entropy term into the Q-learning paradigm. In continuous spaces, both training (the soft Bellman update) and sampling from the resulting policy (defined in terms of Q) are intractable. For the former, they propose a surrogate action distribution and compute the gradient with importance sampling. For the latter, they use the Stein variational method to match a deterministic sampler a = f(e, s) to the learned Q-distribution. Performance is comparable to DDPG, but since the learned policy can be diverse (multimodal) under the maximum entropy principle, it can be used as a common initialization for many specific tasks (example: pretrain = learn to run in arbitrary directions, task = run through a maze).
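
The main objects, sketched in LaTeX (\alpha is the entropy temperature; notation roughly follows the soft Q-learning paper):

% Energy-based policy induced by the soft Q-function
\pi(a \mid s) \propto \exp\!\big( \tfrac{1}{\alpha} Q_{\mathrm{soft}}(s, a) \big)
% Soft value function -- the integral that is intractable for continuous actions
V_{\mathrm{soft}}(s) = \alpha \log \int_{\mathcal{A}} \exp\!\big( \tfrac{1}{\alpha} Q_{\mathrm{soft}}(s, a) \big)\, da
% Soft Bellman backup used for training
Q_{\mathrm{soft}}(s, a) \leftarrow r(s, a) + \gamma\, \mathbb{E}_{s'}\big[ V_{\mathrm{soft}}(s') \big]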

Deterministic Policy Gradient Algorithms
http://jmlr.org/proceedings/papers/v32/silver14.pdf
Silver's paper. Learns an actor that predicts a deterministic action (rather than a conditional probability distribution \pi(a|s)) in Q-learning; when training with Q-learning, gradients are propagated through Q into \pi. Analogous to the policy gradient theorem (the gradient operator can "pass through" the parameter-dependent state distribution), there is a deterministic version of the theorem. Also an interesting comparison with a stochastic off-policy actor-critic model (stochastic = \pi(a|s)).
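
For reference, the deterministic policy gradient theorem (\mu_\theta is the deterministic actor, \rho^{\mu} its discounted state distribution):

\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\big[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)} \big]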

Continuous control with deep reinforcement learning (DDPG)
Deep version of DPG (with DQN tricks). A neural network trained on minibatches alone is not stable, so they also add a target network and a replay buffer; a minimal sketch of these two stabilizers follows.
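
A minimal Python sketch of those two stabilizers, not the paper's code; parameters are represented as dicts of arrays (or scalars) purely for illustration.

import random
import collections

Transition = collections.namedtuple("Transition", "state action reward next_state done")

class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        self.capacity = capacity
        self.buffer = []
        self.pos = 0

    def push(self, *transition):
        if len(self.buffer) < self.capacity:
            self.buffer.append(Transition(*transition))
        else:
            self.buffer[self.pos] = Transition(*transition)  # overwrite the oldest entry
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def soft_update(target_params, online_params, tau=1e-3):
    # theta_target <- tau * theta_online + (1 - tau) * theta_target
    for name in target_params:
        target_params[name] = tau * online_params[name] + (1 - tau) * target_params[name]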

Reward Shaping

Policy invariance under reward transformations: theory and application to reward shaping.
http://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf
Andrew Ng's reward shaping paper. It proves that the optimal policy is invariant under reward shaping if and only if the shaping term added to the reward is a difference of a potential function.
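
The shaping form from the theorem, for reference:

% Optimal policies are preserved iff the extra reward is potential-based,
% for some potential function \Phi over states:
F(s, a, s') = \gamma\, \Phi(s') - \Phi(s), \qquad r'(s, a, s') = r(s, a, s') + F(s, a, s')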

Theoretical considerations of potential-based reward shaping for multi-agent systems
Potential-based reward shaping can help a single agent reach the optimal solution without changing that solution (or the Nash equilibrium). This paper extends the result to the multi-agent case.

Reinforcement Learning with Unsupervised Auxiliary Tasks
[1611.05397] Reinforcement Learning with Unsupervised Auxiliary Tasks
ICLR17 oral. Adds auxiliary tasks to improve performance on Atari games and navigation. The auxiliary tasks include maximizing pixel changes (pixel control) and maximizing the activation of individual hidden units (feature control).

Navigation

Learning to Navigate in Complex Environments
https://openreview.net/forum?id=SJMGPrcle&noteId=SJMGPrcle
Raia Hadsell's group at DeepMind. ICLR17 poster; adding depth prediction as an auxiliary task improves navigation performance (the network also takes SLAM results as input).

[1611.05397] Reinforcement Learning with Unsupervised Auxiliary Tasks (in reward shaping)

Deep Reinforcement Learning with Successor Features for Navigation across Similar Environments
Goal: navigation without SLAM.
Learn successor features (the representation of Q/V just before the last layer; these features satisfy a similar Bellman equation) for transfer learning: learn k sets of top-layer weights simultaneously while sharing the successor features, with DQN acting on those features. In addition to the successor features, they also try to reconstruct the input frame.
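
The successor-feature decomposition being exploited, in LaTeX (my notation: \phi are the shared features, w the task-specific top-layer weights):

% Successor features satisfy their own Bellman equation
\psi^{\pi}(s, a) = \phi(s, a) + \gamma\, \mathbb{E}\big[ \psi^{\pi}(s', \pi(s')) \big]
% and the value function is linear in them, so only w changes across tasks
Q^{\pi}(s, a) = w^{\top} \psi^{\pi}(s, a), \qquad r(s, a) \approx w^{\top} \phi(s, a)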

Experiments in simulation.
State: 96x96 x the four most recent frames.
Action: four discrete actions (stand still, turn left, turn right, go straight 1 m).
Baseline: train a CNN to directly predict the action of A*.

Deep Recurrent Q-Learning for Partially Observable MDPs
There is not much performance difference between stacked-frame DQN and DRQN. DRQN may be more robust when game frames are flickered (some frames zeroed out).

Counterfactual Regret Minimization

Dynamic Thresholding
http://www.cs.cmu.edu/~sandholm/dynamicThresholding.aaai17.pdf
With proofs:
http://www.cs.cmu.edu/~ckroer/papers/pruning_agt_at_ijcai16.pdf

Studies game-state abstraction and its effect on Leduc poker.
https://webdocs.cs.ualberta.ca/~bowling/papers/09aamas-abstraction.pdf

https://www.cs.cmu.edu/~noamb/papers/17-AAAI-Refinement.pdf
https://arxiv.org/pdf/1603.01121v2.pdf
http://anytime.cs.umass.edu/aimath06/proceedings/P47.pdf

Decomposition:
Solving Imperfect Information Games Using Decomposition
http://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/viewFile/8407/8476

Safe and Nested Endgame Solving for Imperfect-Information Games
https://www.cs.cmu.edu/~noamb/papers/17-AAAI-Refinement.pdf

Game-specific RL

Atari Game
http://www.readcube.com/articles/10.1038/nature14236

Go
AlphaGo https://gogameguru.com/i/2016/03/deepmind-mastering-go.pdf

DarkForest [1511.06410] Better Computer Go Player with Neural Network and Long-term Prediction

Super Smash Bros
https://arxiv.org/pdf/1702.06230.pdf

Doom
Arnold: [1609.05521] Playing FPS Games with Deep Reinforcement Learning
Intel: [1611.01779] Learning to Act by Predicting the Future
F1: https://openreview.net/forum?id=Hk3mPK5gg&noteId=Hk3mPK5gg

Poker
Limit Texas hold'em
http://ai.cs.unibas.ch/_files/teaching/fs15/ki/material/ki02-poker.pdf

No-limit Texas hold'em
DeepStack: Expert-Level Artificial Intelligence in No-Limit Poker
