One method that is often used in combination with the RL algorithms is the Beltzmann or softmax exploration strategy.
The action selection strategy is still random, but selection probabilities are weighted by their relative Q-values. This makes it more likely for the agent to choose good actions, whereas two actions that have similar Q-values will have almost the same probability to get selected. Its general form is
P(a)=eQ(s,a)T∑ieQ(s,ai)T
in which P(a) is the probability of selecting action a and T is the temperature parameter. Higher values of T will move the selection more towards a purely random strategy and lower values will move to a fully greedy strategy.
时间: 2024-11-05 06:04:58