1 Introduction
A central challenge in reinforcement learning (RL) is to design algorithms that both scale to enormous or infinite state spaces and efficiently balance exploration and exploitation in such environments. Many of the exciting advances in deep RL that scale to enormous domains employ simple exploration strategies such as ε-greedy, which are often highly inefficient. Though there is a large body of work on efficient exploration that applies when the domain is small enough for the value function to be represented with lookup tables, there has been much less work on scaling exploration. Several papers (Bellemare et al., 2016; Tang et al., 2016) that do combine generalization and strategic exploration use optimism-under-uncertainty, which involves explicit or implicit bonuses over rewards based on uncertainty over the reward, dynamics, or values.
An alternative to optimism-under-uncertainty (Brafman & Tennenholtz, 2003) is Thompson Sampling (TS) (Thompson, 1933). Thompson Sampling is a Bayesian approach that maintains a prior distribution over the environment models (reward and/or dynamics), which is updated to a posterior as observations are made during the interaction with the environment. To choose an action, a sample is drawn from the posterior belief, and an action is selected that maximizes the expected return under the sampled belief. Interestingly, posterior sampling for decision making has also been studied in the field of psychology (Sanborn & Chater, 2016).
In the MDP setting, this involves sampling a reward and dynamics model and then performing MDP planning with the sampled models to compute an optimal action for the current state (Strens, 2000; Osband et al., 2013). Thompson Sampling approaches have often been observed to work significantly better empirically than optimistic approaches in contextual bandit settings (Chapelle & Li, 2011) and small MDPs (Osband et al., 2013), while preserving state-of-the-art performance bounds (Russo & Van Roy, 2014; Agrawal & Goyal, 2012; Osband et al., 2013; Abbasi-Yadkori & Szepesvári, 2015).
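The sample-then-act loop described above is easiest to see in a bandit rather than an MDP. The following is a minimal sketch, assuming a Bernoulli bandit with independent Beta(1, 1) priors; the function name `thompson_bernoulli` and the arm means are illustrative and do not come from the paper.

```python
import random

def thompson_bernoulli(true_means, steps, seed=0):
    """Run Thompson Sampling on a Bernoulli bandit with Beta(1, 1) priors."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [0] * n_arms  # number of observed rewards equal to 1, per arm
    failures = [0] * n_arms   # number of observed rewards equal to 0, per arm
    pulls = [0] * n_arms
    for _ in range(steps):
        # Draw one plausible mean reward per arm from its Beta posterior.
        samples = [rng.betavariate(1 + successes[a], 1 + failures[a])
                   for a in range(n_arms)]
        # Act greedily with respect to the sampled model.
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate posterior update for the pulled arm only.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

# Hypothetical arm means chosen for illustration.
pulls = thompson_bernoulli([0.2, 0.5, 0.8], steps=2000)
```

As the posteriors concentrate, samples for the weaker arms rarely exceed those of the best arm, so exploration of those arms fades without any explicit exploration schedule.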
In large MDPs, sampling a model and then planning for that model is computationally intractable. Therefore, some form of function approximation is required to scale the ideas of Thompson Sampling. To help address this, Osband et al. (2014) introduced randomized least-squares value iteration (RLSVI). RLSVI combines linear value function approximation with Bayesian regression so that the value function weights can be sampled from a distribution. The authors prove strong regret bounds for RLSVI when a tabular basis function set is used, but RLSVI does not scale to large-scale RL with deep neural networks. To combine the benefits of Thompson Sampling style approaches with deep networks for generalization and scale, Osband et al. (2016) introduced a bootstrapped-ensemble approach that trains several models in parallel to approximate the posterior distribution. Other works place a posterior over the parameters of each node in the network and employ a variational approximation (Lipton et al., 2016b) or use noisy networks (Fortunato et al., 2017). However, these approaches have mostly led to modest gains on the Atari benchmarks, not equaling some of the substantial benefits obtained by combining optimism-under-uncertainty with deep neural networks (Bellemare et al., 2016).
Surprisingly, in this paper we show that a simple approach that extends the randomized least-squares value iteration method (Osband et al., 2014) to deep neural networks can yield substantial gains on Atari benchmarks. Specifically, we combine a deep neural network with Bayesian linear regression at the last layer of the network. Our work is also related to a concurrently developed approach by Levine et al. (2017), who perform least-squares temporal difference learning on top of a deep neural network, use ε-greedy exploration on top of the learned Q-function, and also demonstrate modest gains on 5 Atari benchmarks. Our results show that performing Bayesian regression instead, and sampling from the result, can yield a substantial benefit, indicating that the gain comes not just from the higher data efficiency at the last layer but from leveraging an explicit uncertainty representation over the value function.
More specifically, we introduce Bayesian deep Q-networks (BDQN), which combine a Deep Q-network (DQN) (Mnih et al., 2013) with a Bayesian linear regression model on the last layer. Due to linearity and by choosing a Gaussian prior, we derive a closed-form analytical update to the approximated posterior distribution over Q-functions. We can also draw samples efficiently from the Gaussian distribution. Exploration is performed by sampling from the learned Gaussian posterior to instantiate the Q-values, after which the best action is selected. We test BDQN on a wide range of Arcade Learning Environment (Bellemare et al., 2013; Machado et al., 2017) Atari games and compare our results to our own implementation of DDQN (Van Hasselt et al., 2016). BDQN and DDQN share the same architecture and follow the same target objective; they differ only in the way they select actions: DDQN uses ε-greedy, whereas BDQN performs Bayesian linear regression on the last layer, samples the parameters from the resulting distributions, and selects the best action for that sample. We also compare our results to the reported results of a number of state-of-the-art approaches. Our proposed approach has several benefits, including simplicity and targeted exploration, and often yields performance substantially better than existing optimism-based and other state-of-the-art deep RL approaches.
Game  BDQN/DDQN  BDQN/DDQN†  BDQN/Human  Steps

Amidar  558%  788%  325%  100M 
Alien  103%  103%  43%  100M 
Assault  396%  176%  589%  100M 
Asteroids  2517%  1516%  108%  100M 
Asterix  531%  385%  687%  100M 
BeamRider  207%  114%  150%  70M 
BattleZone  281%  253%  172%  50M 
Atlantis  80604%  49413%  11172%  40M 
DemonAttack  292%  114%  326%  40M 
Centipede  114%  178%  61%  40M 
BankHeist  211%  100%  100%  40M 
CrazyClimber  148%  122%  350%  40M 
ChopperCommand  14500%  1576%  732%  40M 
Enduro  295%  350%  361%  30M 
Pong  112%  100%  226%  5M 
2 Related Work
The complexity of the exploration-exploitation trade-off has been deeply investigated in the RL literature (Kearns & Singh, 2002; Brafman & Tennenholtz, 2003; Asmuth et al., 2009). Jaksch et al. (2010) investigate the regret analysis of MDPs where the Optimism in Face of Uncertainty (OFU) principle is deployed to guarantee a high-probability regret upper bound. Azizzadenesheli et al. (2016a) deploy OFU to derive a high-probability regret upper bound for Partially Observable MDPs (POMDPs) using spectral methods (Anandkumar et al., 2014). Furthermore, Bartók et al. (2014) tackle the general case of partial monitoring games and provide a minimax regret guarantee that is polynomial in certain dimensions of the problem.
In multi-arm bandits, there is compelling empirical evidence that Thompson Sampling can provide better results than optimism-under-uncertainty approaches (Chapelle & Li, 2011), while the state-of-the-art performance bounds are preserved (Russo & Van Roy, 2014; Agrawal & Goyal, 2012). A natural adaptation of this algorithm to RL, posterior sampling RL (PSRL), first proposed by Strens (2000), has also been shown to have good frequentist and Bayesian performance guarantees (Osband et al., 2013; Abbasi-Yadkori & Szepesvári, 2015). Even though theoretical RL addresses the exploration and exploitation trade-offs, these problems are still prominent in empirical reinforcement learning research (Mnih et al., 2015; Abel et al., 2016; Azizzadenesheli et al., 2016b). On the empirical side, the recent success in video games has sparked a flurry of research interest. Following the success of deep RL on Atari games (Mnih et al., 2015) and the board game Go (Silver et al., 2017), many researchers have begun exploring practical applications of deep reinforcement learning (DRL). Some investigated applications include robotics (Levine et al., 2016), self-driving cars (Shalev-Shwartz et al., 2016), and safety (Lipton et al., 2016a).
Inevitably for PSRL, posterior sampling over policies or values is computationally intractable in large systems, so PSRL cannot easily be applied to high-dimensional problems. To remedy these failings, Osband et al. (2017) consider the use of randomized value functions to approximate posterior samples of the value function in a computationally efficient manner. They show that, with a suitable linear value function approximation, using approximate Bayesian linear regression for the randomized least-squares value iteration method can remain statistically efficient (Osband et al., 2014), but it still does not scale to large-scale RL with deep neural networks.
To combat these shortcomings, Osband et al. (2016) suggest a bootstrapped-ensemble approach that trains several models in parallel to approximate the posterior distribution. Other works suggest using a variational approximation to the Q-networks (Lipton et al., 2016b) or noisy networks (Fortunato et al., 2017). However, most of these approaches significantly increase the computational cost of DQN, and none of them produced much beyond modest gains on Atari games. Interestingly, the Bayesian approach as a technique for learning a neural network has been deployed for object recognition and image caption generation, where its significant advantage has been verified (Snoek et al., 2015).
In this work we present an alternative approach that extends the randomized least-squares value iteration method (Osband et al., 2014) to deep neural networks: we approximate the posterior by a Bayesian linear regression on only the last layer of the neural network. This approach has several benefits, e.g. simplicity, robustness, and targeted exploration, and, most importantly, we find that this method is much more effective than any of these predecessors in terms of sample complexity and final performance.
Concurrently, Levine et al. (2017) propose a least-squares temporal difference method that learns a linear model on the feature representation in order to estimate the Q-function, while ε-greedy exploration is employed, and report improvement on 5 tested Atari games. Of these 5 games, one is in common with our set of 15 games, and on it BDQN obtains a higher score (w.r.t. the score reported in their paper). Dropout, as another randomized exploration method, is proposed by Gal & Ghahramani (2016), but Osband et al. (2016) investigate the sufficiency of the estimated uncertainty and the hardness of deriving suitable exploitation from it. As stated before, in spite of the novelties of the methods mentioned in this section, none of them, including TS-based approaches, produced much beyond modest gains on Atari games, while BDQN provides significant improvements in terms of both sample complexity and final performance.
3 Thompson Sampling vs ε-greedy
In this section, we enumerate a few benefits of TS over ε-greedy strategies. We show how TS strategies exploit the uncertainties and expected returns to design a randomized exploration, while ε-greedy strategies disregard all of this useful information.
In order to balance exploration and exploitation, TS explores actions with higher estimated return with higher probability. In order to exploit the estimated uncertainties, TS gives an action a higher chance of being explored as its uncertainty increases. Fig. 1(a) shows the agent's estimated values and uncertainties for the available actions at a given state. While the ε-greedy strategy mostly focuses on the greedy action, the TS-based strategy randomizes mostly over the actions with promising estimated returns, using both their approximated expected returns and their uncertainties, and only rarely explores the remaining actions. The ε-greedy strategy, on the other hand, explores those remaining actions, whose expected returns the RL agent is almost sure are low, as frequently as the other sub-greedy actions, which increases its sample complexity. Moreover, an ε-greedy strategy requires the deep network to approximate the values of all the sub-greedy actions equally well; it therefore dedicates network capacity to accurately estimating the values of all sub-greedy actions instead of focusing on the actions with more promising estimated values. As a result, its estimates of the other good actions end up less accurate than its estimate of the greedy action.
In value-based deep RL, e.g. DQN, the network follows a target value that is updated occasionally. Therefore, a TS-based strategy should not estimate the posterior distribution which adaptively follows the target values. A commonly used technique in deep RL is a moving-window replay buffer that stores the recent experiences. The TS-based agent, after a few tries of the poor actions, builds a belief in the low return of these actions given the current target values, while it is possible that the target value later suggests a high expected return for them. Since the replay buffer is a bounded moving window, the lack of samples of these actions pushes their posterior belief back toward the prior over time, and the agent tries them again in order to update its belief. Fig. 1(b) shows that the lack of samples of an action in the replay buffer increases the uncertainty of this action, and a randomized TS strategy starts to explore it again. This means that, due to the adaptive change of the target value (and hence of the objective) and the limited replay buffer, the BDQN agent is never too confident about the expected return of poor actions and keeps exploring them once in a while.
In general, a TS-based strategy advances the exploration-exploitation balance by trading off the expected returns against the uncertainties, while the ε-greedy strategy ignores all of this information.
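The contrast between the two strategies can be illustrated numerically. The sketch below assumes six actions whose values follow independent Gaussian beliefs; the means, standard deviations, and ε = 0.1 are hypothetical numbers chosen for illustration and are not read off Fig. 1.

```python
import random

def epsilon_greedy_action(means, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(means))
    return max(range(len(means)), key=lambda a: means[a])

def thompson_action(means, stds, rng):
    """Sample one value per action from N(mean, std^2) and act greedily on the samples."""
    samples = [rng.gauss(m, s) for m, s in zip(means, stds)]
    return max(range(len(samples)), key=lambda a: samples[a])

def selection_counts(select, n_actions, trials):
    """Count how often each action is returned by the zero-argument callable `select`."""
    counts = [0] * n_actions
    for _ in range(trials):
        counts[select()] += 1
    return counts

# Hypothetical beliefs: actions 0-3 are plausible candidates, while actions 4-5
# are confidently believed to be poor (low mean, small uncertainty).
means = [1.0, 0.9, 0.8, 0.7, -2.0, -2.5]
stds = [0.3, 0.5, 0.6, 0.8, 0.1, 0.1]
rng = random.Random(0)
eg = selection_counts(lambda: epsilon_greedy_action(means, 0.1, rng), 6, 20000)
ts = selection_counts(lambda: thompson_action(means, stds, rng), 6, 20000)
```

Under these assumed beliefs, ε-greedy spends a fixed share of its exploration budget on the two clearly poor actions, whereas the sampling-based rule spreads its choices over the four plausible actions according to their means and uncertainties.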
Another benefit of TS over ε-greedy can be described using Fig. 1(c). Consider a deterministic and episodic maze game, with episode length equal to the length of the shortest path from the start to the destination. The agent is placed at the start point at the beginning of each episode, the goal is to reach the destination, and the reward depends only on whether the destination is reached. Consider an agent that is given a set of Q-functions, where the true Q-function is within the set and is the most optimistic function in the set. The agent is supposed to find a Q-function from this set that maximizes the average return. It is worth noting that the agent's task is to find a good function from a function set.
In this situation, TS randomizes over the Q-functions with highly promising returns and relatively high uncertainty, including the true Q-function. When the TS agent picks the true Q-function, it increases the posterior probability of this Q-function because it matches the observation. When the TS agent chooses other functions, they predict deterministically wrong values, and the posterior probability of those functions is set to zero. Therefore, the agent will not choose these functions again, i.e. TS finds the true Q-function by transferring information through the posterior update, which helps the agent find the optimal Q-function very fast. The ε-greedy agent, even though it chooses the true function at the beginning (it is the optimistic one), randomizes its action with probability ε at each time step. Therefore, it takes exponentially many trials to get to the target in this game.
4 Preliminaries
An infinite-horizon $\gamma$-discounted MDP $M$ is a tuple $\langle \mathcal{X}, \mathcal{A}, T, R, \gamma \rangle$, with state space $\mathcal{X}$, action space $\mathcal{A}$, and transition kernel $T$, accompanied by a reward function $R$, where $0 \le \gamma < 1$. At each time step $t$, the environment is in a state $x_t$, called the current state, where the agent needs to make a decision $a_t$ under its policy. Given the current state and action, the environment stochastically proceeds to a successor state $x_{t+1}$ under the probability distribution $T(x_{t+1} \mid x_t, a_t)$ and provides a stochastic reward $r_t$ with mean $\mathbb{E}[r_t \mid x_t, a_t] = R(x_t, a_t)$. The agent's objective is to optimize the overall expected discounted reward over its policy $\pi$, a stochastic mapping from states to actions, $\pi(a \mid x)$:
$$\eta^* = \eta(\pi^*) = \max_\pi \eta(\pi) = \max_\pi \lim_{N \to \infty} \mathbb{E}_\pi\Big[\sum_{t=0}^{N} \gamma^t r_t\Big]. \tag{1}$$
The expectation in Eq. 1 is with respect to the randomness in the distribution of the initial state, transition probabilities, stochastic rewards, and policy, under the stationary distribution, where $\eta^*$ and $\pi^*$ are the optimal average return and optimal policy, respectively. Let $Q_\pi(x, a)$ denote the average discounted reward under policy $\pi$ when starting from state $x$ and taking action $a$ first.
For a given policy $\pi$ and under the Markovian assumption of the model, we can rewrite the equation for the Q-functions as follows:
$$Q_\pi(x, a) = R(x, a) + \gamma\, \mathbb{E}_{x' \sim T(\cdot \mid x, a),\, a' \sim \pi(\cdot \mid x')}\big[Q_\pi(x', a')\big]. \tag{2}$$
To find the optimal policy, one can solve the linear programming problem in Eq. 1 or follow the corresponding Bellman equation, Eq. 2; both optimization methods solve $Q^*(x, a) = R(x, a) + \gamma\, \mathbb{E}_{x' \sim T(\cdot \mid x, a)}\big[\max_{a'} Q^*(x', a')\big]$, and the optimal policy is a deterministic mapping from states to actions in $\mathcal{A}$, i.e. $x \mapsto \arg\max_a Q^*(x, a)$. In RL, we do not know the transition kernel and the reward function in advance; therefore, we cannot solve the posed Bellman equation directly. To tackle this problem, minimizing the Bellman residual of a given Q-function,
$$\mathcal{L}(Q) = \mathbb{E}_\pi\big[(Q(x, a) - r - \gamma\, Q(x', a'))^2\big], \tag{3}$$
has been proposed (Lagoudakis & Parr, 2003; Antos et al., 2008). Here, the tuple $(x, a, r, x', a')$ consists of consecutive samples under the behavioral policy $\pi$. Furthermore, Mnih et al. (2015) carry the same idea forward and introduce the Deep Q-Network (DQN), where the Q-functions are parameterized by a deep network. To improve the quality of the Q estimate, they use backpropagation on the loss $\mathcal{L}(Q)$ using the TD update (Sutton & Barto, 1998). In the following, we describe the setting used in DDQN. In order to reduce the bias of the estimator, they introduce a target network $Q^{target}$ and a target value $y = r + \gamma\, Q^{target}(x', \hat{a})$, where $\hat{a} = \arg\max_{a'} Q(x', a')$, with a new loss
$$\mathcal{L}(Q, Q^{target}) = \mathbb{E}_\pi\big[(Q(x, a) - y)^2\big]. \tag{4}$$
This regression problem minimizes the estimated loss $\widehat{\mathcal{L}}(Q, Q^{target})$, i.e. the distance between $Q$ and the target $y$. A DDQN agent once in a while updates the $Q^{target}$ network by setting it to the $Q$ network, pursues the regression with the new target value, and provides a biased estimator of the target.
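The DDQN target described above selects the next action with the online network and evaluates it with the target network. A minimal sketch of that computation follows, assuming the Q-values of the successor state are available as plain Python lists indexed by action; the function names and numbers are illustrative and not taken from the authors' code.

```python
def double_q_target(reward, next_q_online, next_q_target, gamma):
    """Return y = r + gamma * Q_target(x', a_hat), with a_hat = argmax_a Q_online(x', a)."""
    # The online network chooses the action...
    a_hat = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ...and the target network supplies the value of that action.
    return reward + gamma * next_q_target[a_hat]

def squared_td_error(q_value, target):
    """Per-sample regression loss (Q(x, a) - y)^2."""
    return (q_value - target) ** 2

# Hypothetical numbers: the online network prefers action 1, so the target
# network's value for action 1 (0.4) is used, not its own maximum (0.9).
y = double_q_target(1.0, next_q_online=[0.2, 0.7, 0.1],
                    next_q_target=[0.5, 0.4, 0.9], gamma=0.9)
loss = squared_td_error(1.0, y)
```

Decoupling action selection from evaluation in this way is what distinguishes the double-Q target from taking the maximum of the target network's own values.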
5 Bayesian Deep Q-Networks
We now show how to extend the randomized least-squares value iteration method (Osband et al., 2014) by combining it with a deep neural network. The result can be viewed as a coarse approximation to representing the uncertainty over the Q-function, which we use to guide exploration.
We utilize the DQN architecture, remove its last layer, and directly build a Bayesian linear regression (BLR) (Rasmussen & Williams, 2006) on the output of the deep network $\phi_\theta(x) \in \mathbb{R}^d$, the feature representation layer, parametrized by $\theta$. We use BLR to efficiently approximate the distribution over the Q-values, where the uncertainty over the values is captured. A common assumption in DNNs is that the feature representation is suitable for linear classification or regression (the same assumption is made in DQN); therefore, building a linear model on the features is a suitable choice (as was done recently in Levine et al. (2017)).
The Q-functions are approximated as a linear transformation of the deep neural network features, i.e. for a given state-action pair, $Q(x, a) = \phi_\theta(x)^\top w_a$, where $w_a \in \mathbb{R}^d$. Consequently, as mentioned in the previous section, the target value is generated using a target model. The target model follows the same structure as the $Q$ model: $\phi_{\theta^{target}}(x) \in \mathbb{R}^d$ denotes the feature representation of the target network, and $w^{target}_a$ denotes the target linear model applied to the target feature representation. Inspired by DDQN, for a given tuple of experience $(x, a, r, x')$, the predicted value of the pair $(x, a)$ is $Q(x, a) = \phi_\theta(x)^\top w_a$, while the target value is
$$y = r + \gamma\, \phi_{\theta^{target}}(x')^\top w^{target}_{\hat{a}}, \qquad \hat{a} = \arg\max_{a'} \phi_\theta(x')^\top w_{a'}.$$
Therefore, by deploying BLR on the space of features, we can approximate the posterior distribution of the model parameter $w_a$, as well as the posterior distribution of the Q-functions, using the corresponding target values. In Gaussian BLR models, a common way to make the posterior update computationally tractable in closed form is to choose the prior and likelihood as conjugates of each other. Therefore, for a given pair $(x, a)$, the vector $w_a$ is drawn from a Gaussian prior $\mathcal{N}(0, \sigma^2 I)$ and, given $w_a$, the target value is generated from the following model:
$$y = w_a^\top \phi_\theta(x) + \epsilon,$$
where $\epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2)$ is an iid noise. Therefore, $y \mid x, a, w_a \sim \mathcal{N}(\phi_\theta(x)^\top w_a, \sigma_\epsilon^2)$. Moreover, the distribution of the target value $y$ is $P(y \mid a) = \int_{w_a} P(y \mid w_a)\, P(w_a)\, dw_a$, which also has a closed form.
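Given an experience replay buffer $\mathcal{D} = \{x_\tau, a_\tau, y_\tau\}_{\tau=1}^{D}$, we construct $|\mathcal{A}|$ (number of actions) disjoint datasets, one for each action, $\mathcal{D} = \bigcup_{a \in \mathcal{A}} \mathcal{D}_a$, where $\mathcal{D}_a$ is the set of tuples $(x_\tau, a_\tau, y_\tau)$ with action $a_\tau = a$ and cardinality $D_a$. We are interested in the approximated posterior distributions of $w_a$ and, correspondingly, of $Q(x, a)$: $P(w_a \mid \mathcal{D}_a)$ and $P(Q(x, a) \mid \mathcal{D}_a)$.
The following are the standard Bayesian linear regression equations adjusted for our setting, and we encourage readers who are familiar with Bayesian linear regression to skip the derivation. For each action $a$ and the corresponding dataset $\mathcal{D}_a$, we construct a matrix $\Phi_a \in \mathbb{R}^{d \times D_a}$, a concatenation of the feature column vectors $\{\phi_\theta(x_i)\}_{i=1}^{D_a}$, and $\mathbf{y}_a \in \mathbb{R}^{D_a}$, a concatenation of the target values in the set $\mathcal{D}_a$. The posterior distribution of $w_a$ is then
$$w_a \sim \mathcal{N}\Big(\tfrac{1}{\sigma_\epsilon^2}\, \Xi_a \Phi_a \mathbf{y}_a,\ \Xi_a\Big), \qquad \Xi_a = \Big(\tfrac{1}{\sigma_\epsilon^2}\, \Phi_a \Phi_a^\top + \tfrac{1}{\sigma^2} I\Big)^{-1}, \tag{5}$$
where $\sigma^2$ and $\sigma_\epsilon^2$ are the variances of the Gaussian prior and of the noise, respectively, and $I \in \mathbb{R}^{d \times d}$ is an identity matrix. The Q-value is $Q(x, a) \mid \mathcal{D}_a = w_a^\top \phi_\theta(x)$, where $w_a$ is drawn from the posterior distribution in Eq. 5. Since the prior and likelihood are conjugates of each other, the posterior distribution of the discounted return has a closed form and is approximated as
$$Q(x, a) \mid \mathcal{D}_a \sim \mathcal{N}\Big(\tfrac{1}{\sigma_\epsilon^2}\, \phi_\theta(x)^\top \Xi_a \Phi_a \mathbf{y}_a,\ \phi_\theta(x)^\top \Xi_a\, \phi_\theta(x)\Big). \tag{6}$$
As TS suggests, for exploration we exploit the expression in Eq. 5. At decision time, we sample a weight vector $w_a$ for each action in order to obtain samples of the Q-values. We then act optimally with respect to these sampled values:
$$a_{TS} = \arg\max_a w_a^\top \phi_\theta(x). \tag{7}$$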
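The per-action posterior and the sampling-based action choice can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the authors' implementation: it uses standard Bayesian linear regression with a zero-mean Gaussian prior of variance `sigma**2` and Gaussian noise of variance `sigma_eps**2`, works with the Cholesky factor of the posterior precision instead of inverting it, and takes fixed toy feature vectors in place of the output of a deep network. All function names and numbers are hypothetical.

```python
import math
import random

def cholesky(A):
    """Lower-triangular L with A = L L^T for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def forward_sub(L, b):
    """Solve L x = b for lower-triangular L."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x

def backward_sub_T(L, b):
    """Solve L^T x = b for lower-triangular L."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def blr_posterior(features, targets, sigma, sigma_eps):
    """Return (mean, L), where L L^T is the posterior precision of the weights."""
    d = len(features[0])
    # Precision = Phi Phi^T / sigma_eps^2 + I / sigma^2
    P = [[(1.0 / sigma**2 if i == j else 0.0) for j in range(d)] for i in range(d)]
    b = [0.0] * d  # accumulates Phi y / sigma_eps^2
    for phi, y in zip(features, targets):
        for i in range(d):
            b[i] += phi[i] * y / sigma_eps**2
            for j in range(d):
                P[i][j] += phi[i] * phi[j] / sigma_eps**2
    L = cholesky(P)
    mean = backward_sub_T(L, forward_sub(L, b))  # solves P mean = b
    return mean, L

def sample_weights(mean, L, rng):
    """Draw w ~ N(mean, (L L^T)^{-1})."""
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    delta = backward_sub_T(L, z)  # L^{-T} z has covariance (L L^T)^{-1}
    return [m + dl for m, dl in zip(mean, delta)]

def thompson_action(phi, posteriors, rng):
    """Sample weights for every action and return argmax_a of w_a^T phi."""
    q = [sum(w * f for w, f in zip(sample_weights(m, L, rng), phi))
         for m, L in posteriors]
    return max(range(len(q)), key=lambda a: q[a])

# Toy data: action 0 has targets generated by weights [2, -1]; action 1 has zero targets.
rng = random.Random(0)
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
post0 = blr_posterior(feats, [2.0 * f[0] - f[1] for f in feats], sigma=10.0, sigma_eps=0.1)
post1 = blr_posterior(feats, [0.0] * len(feats), sigma=10.0, sigma_eps=0.1)
action = thompson_action([1.0, 0.0], [post0, post1], rng)
```

Working with the precision matrix avoids an explicit inverse: the posterior mean comes from two triangular solves, and a posterior sample is obtained by solving one triangular system against standard normal noise.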
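Let $W = \{w_a\}_{a=1}^{|\mathcal{A}|}$, respectively $W^{target} = \{w^{target}_a\}_{a=1}^{|\mathcal{A}|}$, and $Cov = \{\Xi_a\}_{a=1}^{|\mathcal{A}|}$. In BDQN, the agent interacts with the environment by applying the actions proposed by TS, i.e. $a_{TS}$. We utilize a notion of experience replay buffer where the agent stores its recent experiences. The agent draws $W$ (an abbreviation for sampling the vector $w_a$ for each action separately) every $T^{sample}$ steps and acts optimally with respect to the drawn weights. During the inner loop of the algorithm, we draw a minibatch of data from the replay buffer and use the loss
$$\mathcal{L}(\theta) = \mathbb{E}\big[(Q(x_\tau, a_\tau) - y_\tau)^2\big], \tag{8}$$
$$y_\tau = r_\tau + \gamma\, \phi_{\theta^{target}}(x_{\tau+1})^\top w^{target}_{\hat{a}}, \qquad \hat{a} = \arg\max_{a'} \phi_\theta(x_{\tau+1})^\top w_{a'}, \tag{9}$$
and update the weights of the network: $\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(\theta)$, with step size $\alpha$.
We update the target network every $T^{target}$ steps and set $\theta^{target}$ to $\theta$. With the period of $T^{Bayes\,target}$, the agent updates its posterior distribution using a larger minibatch of data drawn from the replay buffer, sets $w^{target}_a$ to the mean of the posterior, and samples $W$ with respect to the updated posterior. Algorithm 1 gives the full description of BDQN.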
6 Experiments
We apply BDQN on a variety of Atari games using the Arcade Learning Environment (Bellemare et al., 2013) through OpenAI Gym (Brockman et al., 2016). Each input frame is a pixel-max of two consecutive frames; we detail the environment settings in the implementation code. As a baseline, we run the DDQN algorithm and evaluate BDQN on the measures of sample complexity and score. All the implementations are coded in the MXNet framework (Chen et al., 2015).
Network architecture:
The network part of BDQN follows the DQN convolutional architecture: the input is a tensor built from rescaled, channel-averaged versions of the last four observations, which passes through three convolution layers followed by a fully connected layer. We add a BLR layer on top of this.
Game  BDQN  DDQN  DDQN†  Bootstrap (Osband et al., 2016)  NoisyNet (Fortunato et al., 2017)  CTS  Pixel  Reactor  Human  SC (human)  SC (DDQN)  Step

Amidar  5.52k  0.99k  0.7k  1.27k  1.5k  1.03k  0.62k  1.18k  1.7k  22.9M  4.4M  100M 
Alien  3k  2.9k  2.9k  2.44k  2.9k  1.9k  1.7k  3.5k  6.9k  -  36.27M  100M 
Assault  8.84k  2.23k  5.02k  8.05k  3.1k  2.88k  1.25k  3.5k  1.5k  1.6M  24.3M  100M 
Asteroids  14.1k  0.56k  0.93k  1.03k  2.1k  3.95k  0.9k  1.75k  13.1k  58.2M  9.7M  100M 
Asterix  58.4k  11k  15.15k  19.7k  11.0  9.55k  1.4k  6.2k  8.5k  3.6M  5.7M  100M 
BeamRider  8.7k  4.2k  7.6k  23.4k  14.7k  7.0k  3k  3.8k  5.8k  4.0M  8.1M  70M 
BattleZone  65.2k  23.2k  24.7k  36.7k  11.9k  7.97k  10k  45k  38k  25.1M  14.9M  50M 
Atlantis  3.24M  39.7k  64.76k  99.4k  7.9k  1.8M  40k  9.5M  29k  3.3M  5.1M  40M 
DemonAttack  11.1k  3.8k  9.7k  82.6k  26.7k  39.3k  1.3k  7k  3.4k  2.0M  19.9M  40M 
Centipede  7.3k  6.4k  4.1k  4.55k  3.35k  5.4k  1.8k  3.5k  12k  -  4.2M  40M 
BankHeist  0.72k  0.34k  0.72k  1.21k  0.64k  1.3k  0.42k  1.1k  0.72k  2.1M  10.1M  40M 
CrazyClimber  124k  84k  102k  138k  121k  112.9k  75k  119k  35.4k  0.12M  2.1M  40M 
ChopperCommand  72.5k  0.5k  4.6k  4.1k  5.3k  5.1k  2.5k  4.8k  9.9k  4.4M  2.2M  40M 
Enduro  1.12k  0.38k  0.32k  1.59k  0.91k  0.69k  0.19k  2.49k  0.31k  0.82M  0.8M  30M 
Pong  21  18.82  21  20.9  21  20.8  17  20  9.3  1.2M  2.4M  5M 
Choice of hyperparameters:
For BDQN, we set the target weights to the mean of the posterior distribution over the BLR weights, with the corresponding covariances, and draw the sampled weights from this posterior. For fixed sampled and target weights, we randomly initialize the parameters of the network part of BDQN and train it using RMSProp, with learning rate, momentum, and discount factor inspired by Mnih et al. (2015). The target network is updated at a fixed period, and the weights are resampled from their posterior distribution at a fixed period. We update the network part of BDQN at a fixed step interval by sampling a minibatch uniformly at random from the replay buffer. We update the posterior distribution of the weight set at its own period using a larger minibatch (if the replay buffer currently holds fewer samples than this minibatch size, we use the minimum of the two), with entries sampled uniformly from the replay buffer. The experience replay contains the most recent transitions. Further hyperparameters are equivalent to the ones in the DQN setting.
For the BLR part of BDQN, the hyperparameters are the noise variance, the variance of the prior over weights, the sample size, the posterior update period, and the posterior sampling period. To optimize this set of hyperparameters, we set up a very simple, fast, cheap, and non-exhaustive hyperparameter tuning procedure using a pre-trained DQN model for the game of Assault. That such simple and cheap tuning suffices suggests the robustness of BDQN, and an exhaustive hyperparameter search is likely to provide even better performance. The details of the hyperparameter tuning are provided in Apx. A.
Baselines:
We implemented DDQN and fixed its architecture to match our BDQN implementation. We also aimed to implement a couple of other deep RL methods that employ strategic exploration. Unfortunately, we encountered several implementation challenges. To illustrate the performance of our approach, we instead extracted the best reported results from a number of state-of-the-art deep RL methods and include them in Table 2. Note that this is not a perfect comparison, as there can be additional details not included in the papers that make it hard to compare the reported results (an issue that has been discussed extensively recently, e.g. Henderson et al. (2017)). To further reproducibility, we released our code and trained models at https://github.com/kazizzad/BDQNMxNetGluon. Since DRL experiments are expensive, we have also released the recorded arrays of returns, to make it possible for others to compare against BDQN without running the experiments again. Moreover, our Bootstrap DQN implementation is also available at https://github.com/kazizzad/BootstrapDQN. Here we tried to report final performance (when it was reported or identifiable from plots). We report results from bootstrapped DQN (Osband et al., 2016), count-based exploration (Bellemare et al., 2016), the Pixel and Reactor results that build on count-based exploration (Ostrovski et al., 2017), and NoisyNet (Fortunato et al., 2017).
Results:
The results are provided in Fig. 2 and Table 2. BDQN performs best across the majority of games at the stated number of samples, typically performing much better than several other methods even when they are trained for much longer. Note that comparisons to Bootstrap, NoisyNet, and CTS should be viewed lightly, since the reported results for those algorithms were generally obtained after training for substantially longer (often 100-200M steps). Reactor (Ostrovski et al., 2017) outperformed our BDQN on three games, Alien, Atlantis, and Enduro, when trained for an identical number of time steps.
Note also that BDQN outperforms the optimism-based approaches (Bellemare et al., 2016; Ostrovski et al., 2017) on all other games we tried, including Amidar, which they classify as one of the harder exploration games (Ostrovski et al., 2017). It is worth noting that the scores of DDQN are reported during the learning phase (not the evaluation phase). For example, DDQN gives a score of 18.82 on Pong during the learning phase, but this changes once ε is set to zero. We also report the sample complexity (SC), i.e. the number of samples BDQN takes to reach the human scores and the DDQN scores (the two SC columns of Table 2); see Apx. A.
For the game Atlantis, gives score of after samples during evaluation time, while BDQN reaches after samples. As shown in Fig. 2, BDQN saturates for Atlantis after 20M samples. We realized that BDQN reaches the internal OpenAI Gym limit of , where relaxing it improves score after steps to .
We observe that BDQN learns significantly better policies in a much shorter period of time, due to its targeted exploration. Since BDQN on the game Atlantis shows a big jump in score partway through training, we ran it five more times to make sure it was not just a coincidence (Apx. A, Fig. 5). For the game Pong, we ran the experiment for a longer period but plotted only the beginning of it in order to make the difference visible. For some games, we stopped the experiment early since the scores had reached their plateau.
7 Conclusion
In this work we proposed BDQN, a practical TS-based RL algorithm that provides targeted exploration in a computationally efficient manner. It involves making simple modifications to the DDQN architecture by replacing the last layer with a Bayesian linear regression. Under the Gaussian prior, we obtained fast closed-form updates for the posterior distribution. We demonstrated significantly faster training and much better performance in many games compared to the reported results of a large number of state-of-the-art baselines. Due to computational limitations, we did not try the algorithm on all games, and it remains an interesting open issue to further explore its performance and to combine it with other advances in deep RL that can be easily extended.
In this work, for BDQN we randomize the last layer of the model, use a Bayesian linear regression framework to train it, and alternate this with training the other layers of the network. An alternative approach is to train it end-to-end using stochastic optimization approaches (Welling & Teh, 2011). This could significantly speed up training while retaining the computational efficiency of DDQN. In this work, we have considered value-based approaches in deep RL, and we plan to explore the advantages of TS-based exploration in policy-gradient-based approaches in the future.
References
 Abbasi-Yadkori & Szepesvári (2015) Abbasi-Yadkori, Yasin and Szepesvári, Csaba. Bayesian optimal control of smoothly parameterized systems. In UAI, pp. 1–11, 2015.

 Abel et al. (2016) Abel, David, Agarwal, Alekh, Diaz, Fernando, Krishnamurthy, Akshay, and Schapire, Robert E. Exploratory gradient boosting for reinforcement learning in complex domains. arXiv, 2016.
 Agrawal & Goyal (2012) Agrawal, Shipra and Goyal, Navin. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT, 2012.
 Anandkumar et al. (2014) Anandkumar, Animashree, Ge, Rong, Hsu, Daniel, Kakade, Sham M, and Telgarsky, Matus. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.
 Antos et al. (2008) Antos, András, Szepesvári, Csaba, and Munos, Rémi. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 2008.

 Asmuth et al. (2009) Asmuth, John, Li, Lihong, Littman, Michael L, Nouri, Ali, and Wingate, David. A Bayesian sampling approach to exploration in reinforcement learning. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 2009.
 Azizzadenesheli et al. (2016a) Azizzadenesheli, Kamyar, Lazaric, Alessandro, and Anandkumar, Animashree. Reinforcement learning of POMDPs using spectral methods. In Proceedings of the 29th Annual Conference on Learning Theory (COLT), 2016a.
 Azizzadenesheli et al. (2016b) Azizzadenesheli, Kamyar, Lazaric, Alessandro, and Anandkumar, Animashree. Reinforcement learning in richobservation mdps using spectral methods. arXiv preprint arXiv:1611.03907, 2016b.
 Bartók et al. (2014) Bartók, Gábor, Foster, Dean P, Pál, Dávid, Rakhlin, Alexander, and Szepesvári, Csaba. Partial monitoring—classification, regret bounds, and algorithms. Mathematics of Operations Research, 2014.
 Bellemare et al. (2016) Bellemare, Marc, Srinivasan, Sriram, Ostrovski, Georg, Schaul, Tom, Saxton, David, and Munos, Remi. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479, 2016.
 Bellemare et al. (2013) Bellemare, Marc G, Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res.(JAIR), 2013.
 Brafman & Tennenholtz (2003) Brafman, Ronen I and Tennenholtz, Moshe. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 3:213–231, 2003.
 Brockman et al. (2016) Brockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie, and Zaremba, Wojciech. Openai gym, 2016.
 Chapelle & Li (2011) Chapelle, Olivier and Li, Lihong. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, 2011.
 Chen et al. (2015) Chen, Tianqi, Li, Mu, Li, Yutian, Lin, Min, Wang, Naiyan, Wang, Minjie, Xiao, Tianjun, Xu, Bing, Zhang, Chiyuan, and Zhang, Zheng. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv, 2015.
 Fortunato et al. (2017) Fortunato, Meire, Azar, Mohammad Gheshlaghi, Piot, Bilal, Menick, Jacob, Osband, Ian, Graves, Alex, Mnih, Vlad, Munos, Remi, Hassabis, Demis, Pietquin, Olivier, et al. Noisy networks for exploration. arXiv, 2017.

 Gal & Ghahramani (2016) Gal, Yarin and Ghahramani, Zoubin. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
 Henderson et al. (2017) Henderson, Peter, Islam, Riashat, Bachman, Philip, Pineau, Joelle, Precup, Doina, and Meger, David. Deep reinforcement learning that matters. arXiv, 2017.
 Jaksch et al. (2010) Jaksch, Thomas, Ortner, Ronald, and Auer, Peter. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 2010.
 Kearns & Singh (2002) Kearns, Michael and Singh, Satinder. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
 Lagoudakis & Parr (2003) Lagoudakis, Michail G and Parr, Ronald. Least-squares policy iteration. Journal of machine learning research, 4(Dec):1107–1149, 2003.
 Levine et al. (2017) Levine, Nir, Zahavy, Tom, Mankowitz, Daniel J, Tamar, Aviv, and Mannor, Shie. Shallow updates for deep reinforcement learning. arXiv, 2017.
 Levine et al. (2016) Levine, Sergey, et al. End-to-end training of deep visuomotor policies. JMLR, 2016.
 Lipton et al. (2016a) Lipton, Zachary C, Gao, Jianfeng, Li, Lihong, Chen, Jianshu, and Deng, Li. Combating reinforcement learning’s sisyphean curse with intrinsic fear. arXiv preprint arXiv:1611.01211, 2016a.
 Lipton et al. (2016b) Lipton, Zachary C, Gao, Jianfeng, Li, Lihong, Li, Xiujun, Ahmed, Faisal, and Deng, Li. Efficient exploration for dialogue policy learning with bbq networks & replay buffer spiking. arXiv preprint arXiv:1608.05081, 2016b.
 Machado et al. (2017) Machado, Marlos C, Bellemare, Marc G, Talvitie, Erik, Veness, Joel, Hausknecht, Matthew, and Bowling, Michael. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. arXiv preprint arXiv:1709.06009, 2017.
 Mnih et al. (2013) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and Riedmiller, Martin. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 2015.
 Osband et al. (2013) Osband, Ian, Russo, Dan, and Van Roy, Benjamin. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, 2013.
 Osband et al. (2014) Osband, Ian, Van Roy, Benjamin, and Wen, Zheng. Generalization and exploration via randomized value functions. arXiv, 2014.
 Osband et al. (2016) Osband, Ian, Blundell, Charles, Pritzel, Alexander, and Van Roy, Benjamin. Deep exploration via bootstrapped dqn. In Advances in Neural Information Processing Systems, 2016.
 Osband et al. (2017) Osband, Ian, Russo, Daniel, Wen, Zheng, and Van Roy, Benjamin. Deep exploration via randomized value functions. arXiv, 2017.
 Ostrovski et al. (2017) Ostrovski, Georg, Bellemare, Marc G, Oord, Aaron van den, and Munos, Rémi. Count-based exploration with neural density models. arXiv, 2017.
 Rasmussen & Williams (2006) Rasmussen, Carl Edward and Williams, Christopher KI. Gaussian processes for machine learning, volume 1. MIT press Cambridge, 2006.
 Russo & Van Roy (2014) Russo, Daniel and Van Roy, Benjamin. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
 Sanborn & Chater (2016) Sanborn, Adam N and Chater, Nick. Bayesian brains without probabilities. Trends in cognitive sciences, 2016.
 Shalev-Shwartz et al. (2016) Shalev-Shwartz, Shai, Shammah, Shaked, and Shashua, Amnon. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv, 2016.
 Silver et al. (2017) Silver, David, Schrittwieser, Julian, Simonyan, Karen, Antonoglou, Ioannis, Huang, Aja, Guez, Arthur, Hubert, Thomas, Baker, Lucas, Lai, Matthew, Bolton, Adrian, et al. Mastering the game of go without human knowledge. Nature, 2017.
 Snoek et al. (2015) Snoek, Jasper, Rippel, Oren, Swersky, Kevin, Kiros, Ryan, Satish, Nadathur, Sundaram, Narayanan, Patwary, Mostofa, Prabhat, Mr, and Adams, Ryan. Scalable bayesian optimization using deep neural networks. In ICML, 2015.
 Strens (2000) Strens, Malcolm. A bayesian framework for reinforcement learning. In ICML, 2000.
 Sutton & Barto (1998) Sutton, Richard S and Barto, Andrew G. Reinforcement learning: An introduction. MIT press Cambridge, 1998.
 Tang et al. (2016) Tang, Haoran, Houthooft, Rein, Foote, Davis, Stooke, Adam, Chen, Xi, Duan, Yan, Schulman, John, De Turck, Filip, and Abbeel, Pieter. #Exploration: A study of count-based exploration for deep reinforcement learning. arXiv, 2016.
 Thompson (1933) Thompson, William R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933.
 Van Hasselt et al. (2016) Van Hasselt, Hado, Guez, Arthur, and Silver, David. Deep reinforcement learning with double q-learning. In AAAI, 2016.
 Welling & Teh (2011) Welling, Max and Teh, Yee W. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.
Appendix A Appendix
Hyperparameter tuning:
For the BLR, the hyperparameters are the noise variance, the variance of the prior over weights, the sample size, the posterior update period, and the posterior sampling period. To optimize this set of hyperparameters we set up a very simple, fast, and cheap tuning procedure, which demonstrates the robustness of BDQN. To find the first three, we ran a simple hyperparameter search. We took a pretrained DQN model for the game Assault and removed its last fully connected layer in order to have access to its already trained feature representation. We then tried combinations of , , and and tested each for episode of the game. We set these parameters to their best .
The above hyperparameter tuning is cheap and fast since it requires only a few times the number of forward passes. For the remaining parameter, we ran BDQN (with randomly initialized weights) on the same game, Assault, for time steps, with a set of and , where BDQN performed better with the choice of . For both choices of , it performed almost equally well, and we chose the higher one. We started with a learning rate of 0.0025 and did not tune it. Thanks to the efficient TS exploration and closed-form BLR, BDQN can learn a better policy in an even shorter period of time. In contrast, it is well known that for DQN-based methods changing the learning rate causes a major degradation in performance (Apx. A). The proposed hyperparameter search is very simple; an exhaustive hyperparameter search is likely to provide even better performance.
Learning rate:
It is well known that DQN and DDQN are sensitive to the learning rate, and a change of learning rate can degrade performance to even worse than a random policy. We tried the same learning rate as BDQN, 0.0025, for DDQN and observed that its performance drops. Fig. 3 shows that DDQN with the higher learning rate learns as well as BDQN at the very beginning, but it cannot maintain the rate of improvement and degrades to even worse than the original DDQN.
Computational and sample cost comparison:
For a given period of game time, the number of backward passes in BDQN and DQN is the same, though for BDQN each pass is cheaper since it has one layer (the last layer) less than DQN. In terms of fairness in sample usage, for example over a duration of , all layers of both BDQN and DQN except the last see the same number of samples, but the last layer of BDQN sees times fewer samples than the last layer of DQN. Over a duration of , the last layer of DQN observes (4 is the backprop period) mini-batches of size , which is , whereas the last layer of BDQN observes only samples of size . As mentioned in Alg. 1, to update the posterior distribution BDQN draws samples from the replay buffer and needs to compute their feature vectors. Therefore, over a duration of decision-making steps, for the learning procedure DDQN does of forward passes and of backward passes, while BDQN does the same number of backward passes (cheaper since there is no backward pass for the final layer) and of forward passes. One can easily relax this by parallelizing this step alongside the main body of BDQN or by deploying online posterior update methods.
Thompson sampling frequency:
The choice of TS update frequency can be crucial from domain to domain. If one chooses the sampling period too short, then the computed gradient for backpropagation of the feature representation is not going to be useful, since the gradient gets noisier and the loss function changes too frequently. In addition, the network then tries to find a feature representation that is suitable for a wide range of different last-layer weights, which results in improper use of model capacity. If the TS update frequency is too low, then it is far from being TS and loses the randomized exploration property. The current choice of is suitable for a variety of Atari games, since the length of each episode is in the range of and is infrequent enough to make the feature representation robust to big changes. For RL problems with a shorter horizon we suggest introducing two more parameters, and , where , the period at which of is sampled out of the posterior, is much smaller than and is used only for making TS actions, while is used for backpropagation of the feature representation. For the game Assault, we tried using and but did not observe much of a difference, and set them to and . For RL settings with a shorter horizon, however, we suggest using them.
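The two-period suggestion for short-horizon problems can be sketched as a schedule that refreshes two separate weight samples at different rates. The names `run_schedule`, `act_period`, and `train_period` are hypothetical, and the sketch only records when each sample would be refreshed; it does not act or train.

```python
def run_schedule(n_steps, act_period, train_period, sample_weights):
    """Record the steps at which the two weight samples are refreshed.

    sample_weights is a caller-supplied function that draws one weight
    sample from the current posterior (a stand-in in this sketch).
    """
    w_act = sample_weights()    # would be used only to choose TS actions
    w_train = sample_weights()  # would be used only in the backprop targets
    act_refresh, train_refresh = [], []
    for t in range(1, n_steps + 1):
        if t % act_period == 0:
            w_act = sample_weights()
            act_refresh.append(t)
        if t % train_period == 0:
            w_train = sample_weights()
            train_refresh.append(t)
        # An agent would act with w_act and train the features against
        # w_train at this point; both are omitted here.
    return act_refresh, train_refresh
```

For example, `run_schedule(10, 2, 5, lambda: 0)` refreshes the acting sample every second step and the training sample every fifth step, so exploration stays randomized while the regression target for the features changes less often.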
Further investigation in Atlantis:
After removing the maximum episode length limit for the game Atlantis, BDQN reaches a score of 62M. This episode is long enough to fill half of the replay buffer, which makes the model near perfect for the later part of the game at the cost of losing the skill it had crafted for the beginning of the game. We observe in Fig. 4 that after losing the game in a long episode, the agent forgets a bit of its skill and loses a few games, but recovers immediately and gets back to a score of . To overcome this issue, one can expand the replay buffer size, stochastically store samples in the replay buffer so that later samples are stored with lower probability, or train new models for the later parts of the episode. There are many possible cures for this interesting observation, but since we are comparing against DDQN, we do not want to advance BDQN structure-wise.
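One of the remedies listed above, storing later samples with lower probability, could look like the following sketch. The class name `DecayingStoreBuffer`, the `half_life` parameter, and the specific decay rule are hypothetical; the text does not specify how the storage probability should fall off.

```python
import random
from collections import deque


class DecayingStoreBuffer:
    """Replay buffer that keeps late-episode transitions less often."""

    def __init__(self, capacity, half_life, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop out first
        self.half_life = half_life
        self.rng = random.Random(seed)

    def store_probability(self, step_in_episode):
        # Hypothetical decay: 1.0 at step 0, 0.5 at step == half_life.
        return self.half_life / (self.half_life + step_in_episode)

    def maybe_store(self, transition, step_in_episode):
        """Store the transition with the decayed probability."""
        if self.rng.random() < self.store_probability(step_in_episode):
            self.buffer.append(transition)
            return True
        return False
```

With such a rule, a single very long episode would occupy a smaller share of the buffer, so transitions from the early part of the game are less likely to be pushed out.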
Further discussion on reproducibility:
In Table 2, we provide the scores of bootstrapped DQN (Osband et al., 2016) and NoisyNet (Fortunato et al., 2017) alongside BDQN. (Fortunato et al. (2017) do not report scores of NoisyNet with the DDQN objective function, only with the DQN objective, and those are the scores reported in Table 2.) These scores are copied directly from the original papers, and we did not make any change to them. We also wanted to report the scores of the count-based method (Ostrovski et al., 2017), but unfortunately that paper contains no table of scores from which to provide them here.
In order to make it easier for readers to compare against the results in Ostrovski et al. (2017), we visually approximated their plotted curves for , and , and added them to Table 2. We added these numbers just for the convenience of the readers. Surely we do not claim any scientific meaning for them and leave it to the readers to interpret them.
Table 2 shows a significant improvement of BDQN over these baselines. Despite the simplicity and negligible computational overhead of BDQN over DDQN, we cannot scientifically claim that BDQN outperforms these baselines just by looking at the scores in Table 2, because we are not aware of the details of their implementations or environments. For example, in this work we implemented DDQN directly by following the implementation details given in the original DDQN paper, and the scores of our DDQN implementation at evaluation time almost match the scores of DDQN reported in the original paper. But the scores of the DDQN implementation reported in Osband et al. (2016) are quite different from the scores reported in the original DDQN paper.