Attention in Neural Networks and How to Use It

Adam Kosiorek

Reposted from: http://akosiorek.github.io/ml/2017/10/14/visual-attention.html

Oct 14, 2017

Attention mechanisms in neural networks, otherwise known as neural attention or just attention, have recently attracted a lot of attention (pun intended). In this post, I will try to find a common denominator for different mechanisms and use-cases and I will describe (and implement!) two mechanisms of soft visual attention.

What is Attention?

Informally, a neural attention mechanism equips a neural network with the ability to focus on a subset of its inputs (or features): it selects specific inputs. Let $x \in \mathbb{R}^d$ be an input, $z \in \mathbb{R}^k$ a feature vector, $a \in [0, 1]^k$ an attention vector and $f_\phi(x)$ an attention network. Typically, attention is implemented as

$$a = f_\phi(x), \qquad z_a = a \odot z, \tag{1}$$

where $\odot$ is element-wise multiplication. We can talk about soft attention, which multiplies features with a (soft) mask of values between zero and one, or hard attention, when those values are constrained to be exactly zero or one, namely $a \in \{0, 1\}^k$. In the latter case, we can use the hard attention mask to directly index the feature vector: $z_a = z[a]$ (in Matlab notation), which changes its dimensionality.
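
To make equation (1) concrete, here is a minimal TensorFlow sketch of both variants; the single dense layer standing in for the attention network $f_\phi$ and all names are illustrative assumptions, not code from any particular paper.

import tensorflow as tf

def soft_attention(x, z, k):
  """Soft attention as in Eq. (1): a = f_phi(x), z_a = a * z.
  :param x: tf.Tensor of size (batch_size, d), the inputs.
  :param z: tf.Tensor of size (batch_size, k), the features.
  """
  # a single dense layer stands in for the attention network f_phi;
  # the sigmoid keeps the mask in [0, 1]^k
  a = tf.layers.dense(x, k, activation=tf.nn.sigmoid)
  return a * z  # element-wise multiplication

def hard_attention(z, a):
  """Hard attention: a is a {0, 1} mask; indexing changes dimensionality."""
  return tf.boolean_mask(z, tf.cast(a, tf.bool))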

To understand why attention is important, we have to think about what a neural network really is: a function approximator. Its ability to approximate different classes of functions depends on its architecture. A typical neural net is implemented as a chain of matrix multiplications and element-wise non-linearities, where elements of the input or feature vectors interact with each other only by addition.

Attention mechanisms compute a mask which is used to multiply features. This seemingly innocent extension has profound implications: suddenly, the space of functions that can be well approximated by a neural net is vastly expanded, making entirely new use-cases possible. Why? While I have no proof, the intuition is the following: the theory says that neural networks are universal function approximators and can approximate an arbitrary function to arbitrary precision, but only in the limit of an infinite number of hidden units. In any practical setting, that is not the case: we are limited by the number of hidden units we can use. Consider the following example: we would like to approximate the product of $N$ inputs. A feed-forward neural network can do it only by simulating multiplications with (many) additions (plus non-linearities), and thus it requires a lot of neural-network real estate. If we introduce multiplicative interactions, it becomes simple and compact.
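
As a toy illustration of that argument (my own, not from any reference): a single multiplicative interaction represents the product of $N$ inputs exactly, whereas an additive layer mixes inputs linearly before the non-linearity and can only approximate the product with many hidden units.

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 4])  # N = 4 inputs
# a single multiplicative interaction represents the product exactly
exact_product = tf.reduce_prod(x, axis=1)
# an additive layer combines inputs linearly before the non-linearity,
# so approximating the product this way needs many hidden units
hidden = tf.layers.dense(x, 128, activation=tf.nn.relu)
approx_product = tf.layers.dense(hidden, 1)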

The above definition of attention as multiplicative interactions allows us to consider a broader class of models if we relax the constraints on the values of the attention mask and let $a \in \mathbb{R}^k$. For example, Dynamic Filter Networks (DFN) use a filter-generating network, which computes filters (or weights of arbitrary magnitude) based on inputs and applies them to features, which is effectively a multiplicative interaction. The only difference from soft-attention mechanisms is that the attention weights are not constrained to lie between zero and one. Going further in that direction, it would be very interesting to learn which interactions should be additive and which multiplicative, a concept explored in A Differentiable Transition Between Additive and Multiplicative Neurons. The excellent distill blog provides a great overview of soft-attention mechanisms.
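
In code, the difference from the soft-attention sketch above is just the missing sigmoid. Here is a hedged, heavily simplified sketch; a real DFN generates convolutional filters rather than per-feature weights, and the names below are illustrative.

import tensorflow as tf

def dynamic_weights(x, z):
  """Multiplicative interaction with unconstrained, input-dependent weights.
  :param x: tf.Tensor of size (batch_size, d), inputs that generate the weights.
  :param z: tf.Tensor of size (batch_size, k), features the weights act on.
  """
  k = z.shape.as_list()[-1]
  w = tf.layers.dense(x, k)  # no sigmoid: weights of arbitrary magnitude
  return w * z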

Visual Attention

Attention can be applied to any kind of inputs, regardless of their shape. In the case of matrix-valued inputs, such as images, we can talk about visual attention. Let $I \in \mathbb{R}^{H \times W}$ be an image and $g \in \mathbb{R}^{h \times w}$ an attention glimpse, i.e. the result of applying an attention mechanism to the image $I$.

Hard Attention

Hard attention for images has been known for a very long time: image cropping. It is conceptually very easy, as it only requires indexing. Hard attention can be implemented in Python (or TensorFlow) as

g = I[y:y+h, x:x+w]

The only problem with the above is that it is non-differentiable; to learn the parameters of the model, one must resort to e.g. the score-function estimator, briefly mentioned in my previous post.
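
For completeness, here is a sketch of how the score-function (REINFORCE) estimator could be applied to a discrete crop location; the names, shapes and the categorical parametrisation are illustrative assumptions, not the method of any particular paper.

import tensorflow as tf

def score_function_surrogate(location_logits, task_loss):
  """
  :param location_logits: tf.Tensor of size (batch_size, num_locations),
    scores for candidate crop locations from some upstream network.
  :param task_loss: tf.Tensor of size (batch_size,), loss computed on the
    glimpse cropped at the sampled location.
  """
  dist = tf.distributions.Categorical(logits=location_logits)
  location = dist.sample()               # hard, non-differentiable decision
  reward = tf.stop_gradient(-task_loss)  # lower loss means higher reward
  # minimising this surrogate follows the score-function gradient estimate
  return -dist.log_prob(location) * reward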

Soft Attention

Soft attention, in its simplest variant, is no different for images than for vector-valued features and is implemented exactly as in equation (1). One of the early uses of this type of attention comes from the paper Show, Attend and Tell. The model learns to attend to specific parts of the image while generating the word describing that part.

This type of soft attention is computationally wasteful, however. The blacked-out parts of the input do not contribute to the results but still need to be processed. It is also over-parametrised: sigmoid activations that implement the attention are independent of each other. It can select multiple objects at once, but in practice we often want to be selective and focus only on a single element of the scene. The two following mechanisms, introduced by DRAW and Spatial Transformer Networks, respectively, solve this issue. They can also resize the input, leading to further potential gains in performance.

Gaussian Attention

Gaussian attention works by exploiting parametrised one-dimensional Gaussian filters to create an image-sized attention map. Let $a_y \in \mathbb{R}^h$ and $a_x \in \mathbb{R}^w$ be attention vectors, which specify which part of the image should be attended to along the $y$ and $x$ axes, respectively. The attention mask can then be created as $a = a_y a_x^T$.

In the above figure, the top row shows $a_x$, the column on the right shows $a_y$ and the middle rectangle shows the resulting $a$. Here, for visualisation purposes, the vectors contain only zeros and ones. In practice, they can be implemented as vectors of one-dimensional Gaussians. Typically, the number of Gaussians is equal to the spatial dimension, and each vector is parametrised by three parameters: the centre of the first Gaussian $\mu$, the distance between centres of consecutive Gaussians $d$ and the standard deviation of the Gaussians $\sigma$. With this parametrisation, both the attention and the glimpse are differentiable with respect to the attention parameters, and thus easily learnable.
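
A minimal sketch of that outer-product construction for a single image, assuming the axis-wise vectors have already been built from one-dimensional Gaussians (a full, batched construction follows in gaussian_mask below):

import tensorflow as tf

ay = tf.placeholder(tf.float32, [None])    # attention along the y axis
ax = tf.placeholder(tf.float32, [None])    # attention along the x axis
a = ay[:, tf.newaxis] * ax[tf.newaxis, :]  # the 2-d mask, a = ay ax^T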

Attention in the above form is still wasteful, as it selects only a part of the image while blacking out all the remaining parts. Instead of using the vectors directly, we can cast them into matrices $A_y \in \mathbb{R}^{h \times H}$ and $A_x \in \mathbb{R}^{w \times W}$, respectively. Now, each matrix has one Gaussian per row, and the parameter $d$ specifies the distance (in column units) between the centres of Gaussians in consecutive rows. The glimpse is now implemented as

$$g = A_y I A_x^T.$$

I used this mechanism in HART, my recent paper on biologically-inspired object tracking with attention-equipped RNNs. Here is an example with the input image on the left-hand side and the attention glimpse on the right-hand side; the glimpse shows the box marked in green in the main image:

 

The code below lets you create one of the above matrix-valued masks for a mini-batch of samples in TensorFlow. If you want to create $A_y$, you would call it as Ay = gaussian_mask(u, s, d, h, H), where u, s, d are $\mu$, $\sigma$ and $d$, in that order and specified in pixels.

import numpy as np
import tensorflow as tf

def gaussian_mask(u, s, d, R, C):
  """
  :param u: tf.Tensor, centre of the first Gaussian.
  :param s: tf.Tensor, standard deviation of Gaussians.
  :param d: tf.Tensor, shift between Gaussian centres.
  :param R: int, number of rows in the mask, there is one Gaussian per row.
  :param C: int, number of columns in the mask.
  """
  # indices to create centres
  R = tf.to_float(tf.reshape(tf.range(R), (1, 1, R)))
  C = tf.to_float(tf.reshape(tf.range(C), (1, C, 1)))
  centres = u[np.newaxis, :, np.newaxis] + R * d
  column_centres = C - centres
  mask = tf.exp(-.5 * tf.square(column_centres / s))
  # we add eps for numerical stability
  normalised_mask = mask / (tf.reduce_sum(mask, 1, keep_dims=True) + 1e-8)
  return normalised_mask

We can also write a function to directly extract a glimpse from the image:

def gaussian_glimpse(img_tensor, transform_params, crop_size):
  """
  :param img_tensor: tf.Tensor of size (batch_size, Height, Width, channels)
  :param transform_params: tf.Tensor of size (batch_size, 6), where params are (mean_y, std_y, d_y, mean_x, std_x, d_x) specified in pixels.
  :param crop_size: tuple of 2 ints, size of the resulting crop
  """
  # parse arguments
  h, w = crop_size
  H, W = img_tensor.shape.as_list()[1:3]
  uy, sy, dy, ux, sx, dx = tf.split(transform_params, 6, -1)
  # create Gaussian masks, one for each axis
  Ay = gaussian_mask(uy, sy, dy, h, H)
  Ax = gaussian_mask(ux, sx, dx, w, W)
  # extract glimpse
  glimpse = tf.matmul(tf.matmul(Ay, img_tensor, adjoint_a=True), Ax)
  return glimpse
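
For intuition about what the two matmuls compute, the bare matrix form $g = A_y I A_x^T$ for a single image plane looks as follows; the shapes are illustrative, and note that gaussian_mask returns each mask transposed relative to this convention, which is why gaussian_glimpse uses adjoint_a=True.

import tensorflow as tf

I = tf.placeholder(tf.float32, [100, 120])   # an H x W image plane
Ay = tf.placeholder(tf.float32, [20, 100])   # h x H, one Gaussian per row
Ax = tf.placeholder(tf.float32, [30, 120])   # w x W, one Gaussian per row
g = tf.matmul(tf.matmul(Ay, I), Ax, transpose_b=True)  # h x w glimpse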

Spatial Transformer

The Spatial Transformer (STN) allows for much more general transformations than just differentiable image cropping, but image cropping is one of the possible use cases. It is made of two components: a grid generator and a sampler. The grid generator specifies a grid of points to be sampled from, while the sampler, well, samples. The TensorFlow implementation is particularly easy in Sonnet, a recent neural network library from DeepMind.

import sonnet as snt
import tensorflow as tf

def spatial_transformer(img_tensor, transform_params, crop_size):
  """
  :param img_tensor: tf.Tensor of size (batch_size, Height, Width, channels)
  :param transform_params: tf.Tensor of size (batch_size, 4), where params are (scale_y, shift_y, scale_x, shift_x)
  :param crop_size: tuple of 2 ints, size of the resulting crop
  """
  constraints = snt.AffineWarpConstraints.no_shear_2d()
  img_size = img_tensor.shape.as_list()[1:]
  warper = snt.AffineGridWarper(img_size, crop_size, constraints)
  grid_coords = warper(transform_params)
  glimpse = snt.resampler(img_tensor, grid_coords)
  return glimpse
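
To make the division of labour concrete, here is a conceptual sketch (not Sonnet's API, and the parameter ordering is an assumption) of the sampling grid a grid generator produces for a no-shear affine warp of a single image, in normalised [-1, 1] coordinates; the sampler then bilinearly interpolates the image at those points.

import tensorflow as tf

def affine_grid(scale_y, shift_y, scale_x, shift_x, h, w):
  """Coordinates of an h x w sampling grid for a no-shear affine warp."""
  ys = tf.linspace(-1., 1., h) * scale_y + shift_y  # row coordinates
  xs = tf.linspace(-1., 1., w) * scale_x + shift_x  # column coordinates
  grid_y, grid_x = tf.meshgrid(ys, xs, indexing='ij')
  return tf.stack([grid_x, grid_y], axis=-1)        # (h, w, 2)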

Gaussian Attention vs. Spatial Transformer

Both Gaussian attention and Spatial Transformer can implement a very similar behaviour. How do we choose which to use? There are several nuances:

  • Gaussian attention is an over-parametrised cropping mechanism: it requires six parameters, but there are only four degrees of freedom (y, x, height, width). STN needs only four parameters.
  • I haven’t run any tests yet, but STN should be faster. It relies on linear interpolation at the sampling points, while Gaussian attention has to perform two huge matrix multiplications. STN could be an order of magnitude faster (in terms of the number of pixels in the input image).
  • Gaussian attention should be (no tests run) easier to train. This is because every pixel in the resulting glimpse can be a convex combination of a relatively big patch of pixels of the source image, which (informally) makes it easier to find the cause of any errors. STN, on the other hand, relies on linear interpolation, which means that the gradient at every sampling point is non-zero only with respect to the two nearest pixels.

Closing Thoughts

Attention mechanisms expand the capabilities of neural networks: they allow approximating more complicated functions, or in more intuitive terms, they enable focusing on specific parts of the input. They have led to performance improvements on natural-language benchmarks, as well as to entirely new capabilities such as image captioning, addressing in memory networks and neural programmers.

I believe that the most important cases in which attention is useful have not yet been discovered. For example, we know that objects in videos are consistent and coherent, e.g. they do not disappear into thin air between frames. Attention mechanisms can be used to express this consistency prior. How? Stay tuned.


  • Adam Kosiorek
  • adamk@robots.ox.ac.uk

Generative timeseries modelling, but also attention, memory and other cool tricks.
