This post is about the paper on JD's real-time online bidding system. The paper introduces JD's LADDER system, the first deep reinforcement learning agent to successfully learn control policies for large-scale real-time problems directly from raw inputs containing high-level semantic information. The agent is based on DASQN, an asynchronous stochastic variant of DQN. The system increased ads revenue by more than 50% and significantly improved the advertisers' ROI (return on investment).
Abstract: We present LADDER, the first deep reinforcement learning agent that can successfully learn control policies for large-scale real-world problems directly from raw inputs composed of high-level semantic information. The agent is based on an asynchronous stochastic variant of DQN (Deep Q Network) named DASQN. The inputs of the agent are plain-text descriptions of states of a game of incomplete information, i.e. real-time large scale online auctions, and the rewards are auction profits of very large scale. We apply the agent to an essential portion of JD’s online RTB (real-time bidding) advertising business and find that it easily beats the former state-of-the-art bidding policy that had been carefully engineered and calibrated by human experts: during JD.com’s June 18th anniversary sale, the agent increased the company’s ads revenue from the portion by more than 50%, while the advertisers’ ROI (return on investment) also improved significantly.
Contents:
Problems to solve:
First, the solution space of the auction game is tremendous. JD DSP system is bidding for 100,000s of auctions per second, assume we have 10 actions and each day is an episode (ad plans are usually on a daily basis), simple math shows the solution space is of 10^(10^9). For comparison, the solution space of the game of Go is about 10^360 (Allis and others 1994; Silver et al. 2016).
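The arithmetic behind that figure can be checked directly: with the paper's stated orders of magnitude (~100,000 auctions per second, 10 candidate actions each, one episode per day), the number of deterministic policies is actions raised to the number of decisions per day. A minimal sketch (the specific constants are the paper's illustrative figures, not exact system numbers):

```python
import math

# Orders of magnitude from the paper's own example.
auctions_per_second = 100_000
seconds_per_day = 86_400
decisions_per_day = auctions_per_second * seconds_per_day  # ~8.6e9 decisions
actions = 10

# Deterministic policies: actions ** decisions_per_day = 10^(~10^9),
# vastly larger than Go's ~10^360 solution space.
log10_policies = decisions_per_day * math.log10(actions)
print(f"solution space ~ 10^{log10_policies:.3g}")  # ~ 10^(8.64e9)
```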
Second, state-of-the-art RL algorithms are inherently sequential, hence cannot be applied to large-scale practical problems such as the auction game, for our online service cannot afford the inefficiencies of sequential algorithms.
Third, auction requests are actually triggered by JD users and randomness of human behaviors implies stochastic transitions of states. That’s very different from Atari games, text-based games (Narasimhan, Kulkarni, and Barzilay 2015) and the game of Go (Silver et al. 2016).
Besides, we have widely ranged rewards of which the maximum may be 100,000 times larger than the minimum, which implies only very expressive models are suitable.
Approach:
In this paper, we model the auction game as a partially observable Markov decision process (POMDP) and present the DASQN algorithm, which successfully resolves the inherently sequential nature of RL algorithms and handles the stochastic transitions of the game. We encode each auction request into plain text in a domain-specific natural language, feed the encoded request to a deep convolutional neural network (CNN), and make full use of the high-level semantic information without any sophisticated feature engineering. This results in a lightweight model, both responsive and expressive, which can update in real time and react rapidly to changes in the auction environment. The whole architecture is named LADDER.
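To make the text-in / Q-values-out interface concrete, here is a minimal illustrative sketch, not the paper's actual architecture: it encodes a plain-text auction request as a fixed-size character histogram and maps it to Q-values over 10 discrete bid levels with a tiny linear stand-in for the CNN. All names and the example request string are hypothetical.

```python
import numpy as np

VOCAB = 128       # ASCII character ids for the toy encoding
N_ACTIONS = 10    # discrete bid levels, as in the paper's example

def encode_request(text: str) -> np.ndarray:
    """Bag-of-characters encoding of a plain-text auction request."""
    vec = np.zeros(VOCAB)
    for ch in text:
        vec[ord(ch) % VOCAB] += 1.0
    return vec / max(len(text), 1)

# A random linear map stands in for the paper's text CNN Q-network.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(N_ACTIONS, VOCAB))

def q_values(text: str) -> np.ndarray:
    """Q-value per candidate bid level for one encoded request."""
    return W @ encode_request(text)

request = "user:123 slot:banner_top keyword:phone hour:20"  # hypothetical
q = q_values(request)
action = int(np.argmax(q))  # greedy choice of bid level
print(action, q.shape)
```

The point of the sketch is the interface: no hand-engineered features, just a text description of the auction state mapped to one Q-value per action, which is what lets the model stay lightweight and update online.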
For the details, please see the paper; download link:
https://arxiv.org/abs/1708.05565