An attention mechanism assigns different weights to each part (region) of a model's input, extracting the locally important information so that the model can make more accurate predictions. Attention in deep learning localizes information when making predictions.

The attention mechanism is similar to the way humans observe the world. When we observe something, we generally do not take it in as an undifferentiated whole; instead, we tend to selectively pick out the important parts we need. For example, when we see a person, we usually attend first to the face, and then combine information from different regions to form an overall impression of what we are observing.

The attention mechanism was introduced to improve the performance of RNN-based (LSTM or GRU) encoder-decoder models. It is widely used in machine translation, speech recognition, image captioning, and many other areas. Its success lies in giving the model the ability to discriminate among parts of the input.

For example, in machine translation and speech recognition, assigning a different weight to each word in a sentence makes the model's learning more flexible (soft). At the same time, the attention weights themselves can be read as an alignment between the input and output sentences, helping to explain what the model has actually learned.

With an attention mechanism we no longer try to encode the full source sentence into a fixed-length vector. Rather, we allow the decoder to “attend” to different parts of the source sentence at each step of the output generation. Importantly, we let the model learn what to attend to based on the input sentence and what it has produced so far. So, in language pairs that are fairly well aligned (like English and German) the decoder would probably choose to attend to things sequentially: attending to the first source word when producing the first target word, and so on. That is what was done in Neural Machine Translation by Jointly Learning to Align and Translate, and it looks as follows:

 

Here, the y‘s are the translated words produced by the decoder, and the x‘s are the source sentence words. The above illustration uses a bidirectional recurrent network, but that’s not essential and you can ignore the reverse direction. The important part is that each decoder output word y_t now depends on a weighted combination of all the input states, not just the last state. The a‘s are weights that define how much of each input state should be considered for each output. So, if a_{3,2} is a large number, this means that the decoder pays a lot of attention to the second state in the source sentence while producing the third word of the target sentence. The a's are typically normalized to sum to 1 (so they form a distribution over the input states).
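A minimal numerical sketch of that last point (the states and weights below are made-up values, not from the paper): given attention weights a_{t,i} for one decoder step that sum to 1, the context is a weighted combination of all input states.

```python
import numpy as np

# toy encoder states, one row per source position (illustrative values)
input_states = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [1.0, 1.0]])

# hypothetical attention weights a_{t,i} for one decoder step t,
# normalized to sum to 1 (a distribution over the input states)
a_t = np.array([0.1, 0.7, 0.2])

# the decoder's context: a weighted combination of ALL input states,
# not just the last one
context_t = a_t @ input_states
```

Here the second source position gets the largest weight (0.7), so the context vector is dominated by the second input state.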

 

Computing attention involves three main steps:
1. Compute a weight from the similarity between the query and each key. Common similarity functions include the dot product, concatenation, and a small feed-forward network.
2. Normalize these weights, typically with a softmax function.
3. Combine the normalized weights with the corresponding values to obtain the final attention output.
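The three steps above can be sketched in a few lines of NumPy (the keys, values, and query below are illustrative; dot product is used as the similarity function):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

def attention(query, keys, values):
    scores = keys @ query              # step 1: similarity (dot product)
    weights = softmax(scores)          # step 2: normalize to a distribution
    return weights @ values, weights   # step 3: weighted sum of the values

# toy example: three key/value pairs and one query
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])
values = keys                          # self-attention style: values = keys
query = np.array([0.0, 2.0])

z, w = attention(query, keys, values)
```

The query is most similar to the second and third keys, so they receive equal, larger weights, and the output z leans toward their values.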

 

As shown in Figure 1, the x-axis corresponds to the source sentence in English and the y-axis to the generated translation in French. Each pixel shows how much attention the model paid to a source word when generating a target word: the whiter the area, the more attention; the darker, the less (as in the previous example).

(Attentional interpretation of English-to-French translation. Taken from Dzmitry Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate, 2015.)

Figure 1. Attention visualization in NLP

The figure below shows an image captioning example.
With an attention mechanism, the image is first divided into n parts, and a Convolutional Neural Network (CNN) computes a representation h1,…,hn of each part.
When the RNN generates a new word, the attention mechanism focuses on the relevant part of the image, so the decoder only uses specific parts of the image.

In the figure below (upper row), we can see, for each word of the caption, which part of the image (in white) is used to generate it.

 

 

In broad terms, attention is one component of a network’s architecture, in charge of managing and quantifying interdependence: between the input and output elements (general attention), or within the input elements (self-attention).

Attention = (Fuzzy) Memory?

The basic problem that the attention mechanism solves is that it allows the network to refer back to the input sequence, instead of forcing it to encode all information into one fixed-length vector. The attention mechanism is simply giving the network access to its internal memory, which is the hidden state of the encoder. In this interpretation, instead of choosing what to “attend” to, the network chooses what to retrieve from memory. Unlike typical memory, the memory access mechanism here is soft, which means that the network retrieves a weighted combination of all memory locations, not a value from a single discrete location. Making the memory access soft has the benefit that we can easily train the network end-to-end using backpropagation (though there have been non-fuzzy approaches where the gradients are calculated using sampling methods instead of backpropagation).
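The difference between a discrete memory read and a soft one can be sketched as follows (the memory contents and weights are illustrative): a hard read picks a single location, while a soft read returns a weighted combination of all locations, which is what keeps the whole network trainable by backpropagation.

```python
import numpy as np

# encoder hidden states viewed as "memory", one row per location
memory = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [2.0, 2.0]])

# a learned attention distribution over the memory locations
weights = np.array([0.2, 0.3, 0.5])

# hard (discrete) read: a single location; argmax is not differentiable
hard_read = memory[np.argmax(weights)]

# soft read: weighted combination of ALL locations; differentiable,
# so gradients flow through the weights during backpropagation
soft_read = weights @ memory
```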

The hidden state of a standard Recurrent Neural Network is itself a type of internal memory. However, RNNs suffer from the vanishing gradient problem, which prevents them from learning long-range dependencies. LSTMs improved on this with a gating mechanism that allows explicit memory deletes and updates.

If we have predicted i words, the hidden state of the LSTM is hi. We select the "relevant" part of the image by using hi as the context. The output of the attention model, zi, is a representation of the image filtered so that only its relevant parts remain; it is used as an input to the LSTM, which then predicts a new word and returns a new hidden state hi+1.

The key question is: how does the model know where to focus?
It calculates an alignment score, which quantifies how much attention we should give to each input. Several alignment scores exist; the most popular are additive (also known as concat; Bahdanau et al., 2015), location-based, general, and dot-product (Luong et al., 2015). This distinction has led to the broader categories of global/soft and local/hard attention.
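Two of those score functions can be sketched side by side. This is a toy illustration, not code from either paper: the matrices W1, W2 and vector v stand in for learned parameters and are randomly initialized here, and s plays the role of the decoder state (the query).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
s = rng.standard_normal(d)          # decoder state (the "query")
h = rng.standard_normal((5, d))     # five encoder states

# dot-product score (Luong et al., 2015): score(s, h_i) = s . h_i
dot_scores = h @ s

# additive / concat score (Bahdanau et al., 2015):
# score(s, h_i) = v^T tanh(W1 s + W2 h_i)
W1 = rng.standard_normal((d, d))
W2 = rng.standard_normal((d, d))
v = rng.standard_normal(d)
add_scores = np.tanh(s @ W1.T + h @ W2.T) @ v
```

Either way, the result is one score per encoder state, which is then passed through a softmax to obtain the attention distribution.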

Stochastic Hard Attention vs. Deterministic Soft Attention


An attention model is a function that takes n arguments y1,…, yn (in the preceding example, the yi would be the hi) and a context c. It returns a vector z that is intended as a “summary” of the yi, focusing on the information linked to the context c. More formally, it returns a weighted arithmetic mean of the yi, where the weights are chosen according to the relevance of each yi given the context c.

Hard attention is a stochastic process: instead of using all the hidden states as input for the decoding, the system samples a hidden state yi with probability si. To propagate a gradient through this process, the gradient is estimated by Monte Carlo sampling.

Soft attention is more popular because it is fully differentiable, so backpropagation through it tends to be more effective.
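The contrast can be sketched numerically (the hidden states and probabilities are illustrative): soft attention computes a deterministic expectation over all hidden states, while hard attention samples a single state; averaging many samples recovers the same expectation, which is why sampling-based gradient estimates (e.g. REINFORCE-style Monte Carlo) work, just noisily.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
probs = np.array([0.2, 0.5, 0.3])    # attention probabilities s_i

# soft attention: deterministic weighted mean over all hidden states
soft_z = probs @ hidden

# hard attention: sample ONE hidden state per step; gradients are then
# estimated by Monte Carlo rather than plain backpropagation
samples = [hidden[rng.choice(len(hidden), p=probs)] for _ in range(2000)]
mc_estimate = np.mean(samples, axis=0)   # averages toward soft_z
```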

 

 

References

https://www.itread01.com/content/1548179294.html

Evolution of the attention mechanism:
https://towardsdatascience.com/what-is-attention-mechanism-can-i-have-your-attention-please-3333637f2eac

A good example of the attention mechanism can be found in the paper ‘Show, Attend and Tell’ for image captioning.

Another attention mechanism example can be found in the paper ‘Neural Machine Translation by Jointly Learning to Align and Translate’ by Dzmitry Bahdanau et al., 2015.

https://jhui.github.io/2017/03/15/Soft-and-hard-attention/

https://blog.floydhub.com/attention-mechanism/

https://zhuanlan.zhihu.com/p/31547842

 

Further Reading

 

This section provides additional resources if you would like to learn more about adding attention to LSTMs.

 

Recommendation in TensorFlow: build a simple recommender

https://medium.com/@victorkohler/collaborative-filtering-using-deep-neural-networks-in-tensorflow-96e5d41a39a1

http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/

https://alibaba-cloud.medium.com/self-attention-mechanisms-in-natural-language-processing-9f28315ff905


 

 

 

Posted by Jason, The Dance of Disorder (Fluctuations of Entropy)