This post is about the paper on JD's real-time online bidding system. The paper introduces JD's LADDER system, the first deep reinforcement learning agent to successfully learn control policies for large-scale real-time problems directly from raw inputs containing high-level semantic information. The agent is based on DASQN, an asynchronous stochastic variant of DQN. The system increased ads revenue by more than 50% and significantly improved the advertisers' ROI (return on investment).
Abstract: We present LADDER, the first deep reinforcement learning agent that can successfully learn control policies for large-scale real-world problems directly from raw inputs composed of high-level semantic information. The agent is based on an asynchronous stochastic variant of DQN (Deep Q Network) named DASQN. The inputs of the agent are plain-text descriptions of states of a game of incomplete information, i.e. real-time large scale online auctions, and the rewards are auction profits of very large scale. We apply the agent to an essential portion of JD’s online RTB (real-time bidding) advertising business and find that it easily beats the former state-of-the-art bidding policy that had been carefully engineered and calibrated by human experts: during JD.com’s June 18th anniversary sale, the agent increased the company’s ads revenue from the portion by more than 50%, while the advertisers’ ROI (return on investment) also improved significantly.
Contents:
Problems to solve:
First, the solution space of the auction game is tremendous. JD DSP system is bidding for 100,000s of auctions per second, assume we have 10 actions and each day is an episode (ad plans are usually on a daily basis), simple math shows the solution space is of 10^(10^9). For comparison, the solution space of the game of Go is about 10^360 (Allis and others 1994; Silver et al. 2016).
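The arithmetic behind that figure can be checked directly: with the paper's stated orders of magnitude (~100,000 auctions per second, 10 candidate actions each, one episode per day), the number of deterministic policies is actions raised to the number of decisions per day. A minimal sketch (the specific constants are the paper's illustrative figures, not exact system numbers):

```python
import math

# Orders of magnitude from the paper's own example.
auctions_per_second = 100_000
seconds_per_day = 86_400
decisions_per_day = auctions_per_second * seconds_per_day  # ~8.6e9 decisions
actions = 10

# Deterministic policies: actions ** decisions_per_day = 10^(~10^9),
# vastly larger than Go's ~10^360 solution space.
log10_policies = decisions_per_day * math.log10(actions)
print(f"solution space ~ 10^{log10_policies:.3g}")  # ~ 10^(8.64e9)
```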
Second, state-of-the-art RL algorithms are inherently sequential, hence cannot be applied to large-scale practical problems such as the auction game, for our online service cannot afford the inefficiencies of sequential algorithms.
Third, auction requests are actually triggered by JD users and randomness of human behaviors implies stochastic transitions of states. That’s very different from Atari games, text-based games (Narasimhan, Kulkarni, and Barzilay 2015) and the game of Go (Silver et al. 2016).
Besides, we have widely ranged rewards of which the maximum may be 100,000 times larger than the minimum, which implies only very expressive models are suitable.
Approach:
In this paper, we model the auction game as a partially observable Markov decision process (POMDP) and present the DASQN algorithm, which successfully resolves the inherently sequential nature of RL algorithms and handles the stochastic transitions of the game. We encode each auction request into plain text in a domain-specific natural language, feed the encoded request to a deep convolutional neural network (CNN), and make full use of the high-level semantic information without any sophisticated feature engineering. This results in a lightweight model, both responsive and expressive, which can update in real time and react rapidly to changes in the auction environment. The whole architecture is named LADDER.
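To make the text-in / Q-values-out interface concrete, here is a minimal illustrative sketch, not the paper's actual architecture: it encodes a plain-text auction request as a fixed-size character histogram and maps it to Q-values over 10 discrete bid levels with a tiny linear stand-in for the CNN. All names and the example request string are hypothetical.

```python
import numpy as np

VOCAB = 128       # ASCII character ids for the toy encoding
N_ACTIONS = 10    # discrete bid levels, as in the paper's example

def encode_request(text: str) -> np.ndarray:
    """Bag-of-characters encoding of a plain-text auction request."""
    vec = np.zeros(VOCAB)
    for ch in text:
        vec[ord(ch) % VOCAB] += 1.0
    return vec / max(len(text), 1)

# A random linear map stands in for the paper's text CNN Q-network.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(N_ACTIONS, VOCAB))

def q_values(text: str) -> np.ndarray:
    """Q-value per candidate bid level for one encoded request."""
    return W @ encode_request(text)

request = "user:123 slot:banner_top keyword:phone hour:20"  # hypothetical
q = q_values(request)
action = int(np.argmax(q))  # greedy choice of bid level
print(action, q.shape)
```

The point of the sketch is the interface: no hand-engineered features, just a text description of the auction state mapped to one Q-value per action, which is what lets the model stay lightweight and update online.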
For the details, please see the paper; download link:
https://arxiv.org/abs/1708.05565