
Attention is All You Need is a paper published in 2017 by the Google Brain team. In this paper, the prominent self-attention and multi-headed self-attention mechanisms are proposed to enhance the general capacity of neural networks. The paper also proposes the Transformer, a basic neural network architecture that adopts these mechanisms, and evaluates it on a translation task.

What is Attention?

The attention mentioned in neural networks is similar to the one we talk about in human intelligence: an ability to capture information that seems (subjectively) more important. The "attention mechanism" in human cognition is prominently manifested across various sensory perceptions. For instance, when using our eyes, certain features such as bright colors, rapid movements, or familiar faces tend to capture our gaze. Similarly, during reading, specific names, events, or elegant phrases can leave a lasting impression. This attention mechanism also extends to other senses like smell, touch, and hearing. Furthermore, the stream of consciousness during thought processes is often influenced by particularly striking information. All of this illustrates the attention mechanism in human cognition.

In the context of a neural network, what does the attention mechanism refer to? For example, when we want a neural network to process an image, how does it determine which elements of the image are more important? Consider objects like people, pets, or cars in an image: how does the network decide which one is more noteworthy in the image's "context"? The attention mechanism addresses this problem. When processing an image, the neural network assigns higher weights, or "attention scores", to what it deems more important information. During the subsequent forward pass, the high-weight information is emphasized, enabling the network to capture more detailed and deeper features. This, in turn, aids in handling the semantic importance of the image. Besides image processing, the mechanism is applicable to text, audio, graphs, and other data types. Since data in neural networks exists as discrete tensors, the fundamental principles of calculating attention scores remain consistent across different data types.

Self-Attention Mechanism

The so-called self-attention mechanism refers to a process where the attention scores are computed by comparing the elements within the sequence itself. In other words, the reference for calculating attention scores is the sequence as a whole. For a tensor sequence input into a neural network, the self-attention mechanism requires computing the attention scores between each element and every other element in the sequence. The basic process can be represented in pseudocode as follows:

for (element_x : input_sequence) {
    for (element_y : input_sequence) {
        attention_score[x][y] = compute_score(element_x, element_y);
    }
}

However, for deep learning tasks, using two nested loops to compute these values is highly time-consuming, especially when the input sequence length can be in the tens of thousands. Therefore, researchers use vectorized operations to calculate all pairwise attention scores efficiently. By using vectorized multiplication, we can obtain all attention scores with a single matrix multiplication between the input tensor sequence and its transposed version.

$$ X= \left[ \begin{matrix} x_1, ..., x_n \end{matrix} \right] $$

$$ X^T= \left[ \begin{matrix} x_1 \\ \vdots \\ x_n \\ \end{matrix} \right] $$

$$ AttentionScore = X \cdot X^T = \left[ \begin{matrix} x_1 \cdot x_1 & ... & x_1 \cdot x_n\\ \vdots & & \vdots \\ x_n \cdot x_1 & ... & x_n \cdot x_n\\ \end{matrix} \right] $$

Through this matrix multiplication, combined with parallel computing technology and the parallel processing capabilities of GPU devices, the efficiency of this computation can be greatly enhanced. To ensure that the attention scores are not fixed but can be gradually optimized during training, we use parameter matrices to adjust the calculation of the attention scores. This enables the model to truly capture the information that deserves attention. Thus, the equation for computing the attention scores can be expressed as follows: $$ Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V $$ Here, $ Q, K, V $ are obtained by multiplying each element in the input sequence by three shared weight matrices $ W_q, W_k $ and $ W_v $. The data multiplied by $ W_q $ represents the element that will compute attention scores with other elements in the sequence. The data multiplied by $ W_k $ represents the data being compared against. After comparing each element with every other element, the resulting sequence is the attention score sequence for that element. This sequence is passed through the softmax function and then used to weight the value vectors produced by $ W_v $, performing a weighted sum. The entire process essentially calculates the correlations between all elements and then performs a weighted sum, with the result being the output of the attention mechanism. Through this process, the information that is considered more important is multiplied by greater weights, so its "information" is magnified in the subsequent processing.
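
To make this formula concrete, below is a minimal C++ sketch of single-head scaled dot-product attention on a toy sequence. The input X and the projection matrices Wq, Wk, and Wv are arbitrary placeholders chosen only for illustration (in a real model they are learned parameters), and the helper functions are simplified for readability rather than speed.

// Minimal sketch of single-head scaled dot-product attention.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Multiply an (n x k) matrix by a (k x m) matrix.
Matrix matmul(const Matrix& a, const Matrix& b) {
    size_t n = a.size(), k = b.size(), m = b[0].size();
    Matrix c(n, std::vector<double>(m, 0.0));
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            for (size_t t = 0; t < k; t++)
                c[i][j] += a[i][t] * b[t][j];
    return c;
}

// Transpose an (n x m) matrix into an (m x n) matrix.
Matrix transpose(const Matrix& a) {
    Matrix t(a[0].size(), std::vector<double>(a.size()));
    for (size_t i = 0; i < a.size(); i++)
        for (size_t j = 0; j < a[0].size(); j++)
            t[j][i] = a[i][j];
    return t;
}

// Apply softmax to every row so each row of attention weights sums to 1.
void softmaxRows(Matrix& a) {
    for (auto& row : a) {
        double maxv = row[0];
        for (double v : row) maxv = std::max(maxv, v);
        double sum = 0.0;
        for (double& v : row) { v = std::exp(v - maxv); sum += v; }
        for (double& v : row) v /= sum;
    }
}

// Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Matrix attention(const Matrix& q, const Matrix& k, const Matrix& v) {
    Matrix scores = matmul(q, transpose(k));
    double scale = std::sqrt(static_cast<double>(k[0].size()));  // sqrt(d_k)
    for (auto& row : scores)
        for (double& s : row) s /= scale;
    softmaxRows(scores);          // attention weights, one row per query
    return matmul(scores, v);     // weighted sum of the value vectors
}

int main() {
    // Toy input: a sequence of 3 elements, each a 2-dimensional vector.
    Matrix x  = {{1.0, 0.0}, {0.0, 1.0}, {1.0, 1.0}};
    // Toy projection matrices (in practice these are learned parameters).
    Matrix wq = {{1.0, 0.0}, {0.0, 1.0}};
    Matrix wk = {{0.5, 0.0}, {0.0, 0.5}};
    Matrix wv = {{1.0, 1.0}, {0.0, 1.0}};

    Matrix out = attention(matmul(x, wq), matmul(x, wk), matmul(x, wv));
    for (const auto& row : out)
        printf("%.3f %.3f\n", row[0], row[1]);
    return 0;
}

Each row of the output corresponds to one input element and is a weighted sum of the value vectors, with the weights given by the softmax of the scaled dot products.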

Transformer Model

The Transformer model adopts an encoder-decoder architecture, which is popular in the current deep learning field. Two independent models are used separately as an encoder and a decoder. In most cases, they process different types of data. For example, in image captioning, the encoder processes the image signal and the decoder generates the captions. In this paper, the Transformer is tested on an English-German translation task, so, imprecisely, you can imagine the encoder trying to "understand" a sentence in English and then generating some kind of metaphysical or abstract information; this information contains the concepts, subjects, and objects that are "mentioned" in the original language. This information is then passed to the decoder, whose task is to decode it into German.

Importantly, the comparison I made above is extremely imprecise; this model cannot translate as humans do. In fact, it is still performing prediction or simulation, not understanding.

Given two strings word1 and word2, return the minimum number of operations required to convert word1 to word2.

The following three operations are permitted on a word:

  • Insert a character
  • Delete a character
  • Replace a character

Example:

Input: word1 = "horse", word2 = "ros";
Output: 3;

Constraints:

  • 0 <= len(word1), len(word2) <= 500
  • word1 and word2 consist of lower case English letters

Solution

class EditDistance {
public:
    int minDistance(string word1, string word2) {
        // Prepend a placeholder so that index 0 represents the empty prefix.
        word1 = " " + word1;
        word2 = " " + word2;

        // matrix[i][j] is the edit distance between word1[1..i] and word2[1..j].
        int **matrix = new int*[word1.size()];
        for(int i = 0; i < word1.size(); i++) {
            matrix[i] = new int[word2.size()];
            fill_n(matrix[i], word2.size(), 0);
        }

        for(int i = 0; i < word1.size(); i++) {
            for(int j = 0; j < word2.size(); j++) {
                if(_min_(i, j) == 0) {
                    // One prefix is empty: the distance is the length of the other.
                    matrix[i][j] = _max_(i, j);
                } else {
                    if(word1[i] == word2[j]) {
                        // Last characters match: no extra operation is needed.
                        matrix[i][j] = matrix[i - 1][j - 1];
                    } else {
                        // Take the cheapest of replace, insert, and delete.
                        matrix[i][j] = _trimin_(
                            matrix[i - 1][j - 1],
                            matrix[i][j - 1],
                            matrix[i - 1][j]
                        ) + 1;
                    }
                }
            }
        }

        // Free the DP table before returning the answer.
        int result = matrix[word1.size() - 1][word2.size() - 1];
        for(int i = 0; i < word1.size(); i++) delete[] matrix[i];
        delete[] matrix;
        return result;
    }

private:
    int _min_(int i, int j) {
        return i < j ? i : j;
    }
    int _max_(int i, int j) {
        return i > j ? i : j;
    }
    int _trimin_(int x, int y, int z) {
        return _min_(_min_(x, y), z);
    }
};

Explanation:

Use matrix[i][j] to represent the edit distance between word1[0…i] and word2[0…j]. The string content is indexed starting at 1, so matrix[0][0] represents the case where word1 and word2 are both empty, which means their edit distance is 0. It follows that matrix[0][j] is the case where word1 is empty and the length of word2 is j, so their edit distance is simply j. Similarly, matrix[i][0] is the case where the length of word1 is i and word2 is empty, so their edit distance is i.

Then we can derive the recurrence for the three operations accordingly (a small driver reproducing the example follows the list):

  • matrix[i - 1][j] + 1: delete the last character of word1
  • matrix[i][j - 1] + 1: insert the last character of word2 at the tail of word1
  • matrix[i - 1][j - 1] + 1: replace the last character of word1 with the last character of word2
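
As a quick sanity check, a minimal driver (my own addition, assuming the EditDistance class above is in scope) can reproduce the example from the problem statement:

#include <algorithm>
#include <iostream>
#include <string>
using namespace std;

// ... EditDistance class from the solution above ...

int main() {
    EditDistance solver;
    // "horse" -> "rorse" (replace 'h') -> "rose" (delete 'r') -> "ros" (delete 'e')
    cout << solver.minDistance("horse", "ros") << endl;  // prints 3
    return 0;
}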

Given n pairs of parentheses, write a function to generate all combinations of well-formed parentheses.

Example:

Input: n = 3;
Output: ["((()))", "(()())", "(())()", "()(())", "()()()"]

Constraints:

  • 1 <= n <= 8

Solution

class GenerateParentheses {
public:
    vector<string> res;
    vector<string> generateParentheses(int n) {
        if(n <= 0) return res;
        p("", n, n);
        return res;
    }

    // left/right are the numbers of '(' and ')' still available to place.
    void p(string str, int left, int right) {
        if(left == 0 && right == 0) {
            res.push_back(str);
            return;
        }
        if(left == right) {
            // Equal counts remaining: the next character must be '('.
            p(str + "(", left - 1, right);
        } else {
            // Here left < right, so a ')' can always close an earlier '('.
            if(left > 0) {
                p(str + "(", left - 1, right);
            }
            p(str + ")", left, right - 1);
        }
    }
};

Explanation:

The number of remaining left parentheses must be less than or equal to the number of remaining right parentheses. If the number of remaining left parentheses equals the number of remaining right ones, the next character can only be a left parenthesis. If the number of remaining left parentheses is less than the number of remaining right ones, the next character can be either a left parenthesis or a right parenthesis. This problem can be solved with a recursive algorithm.
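
As a quick sanity check, a minimal driver (my own addition, assuming the GenerateParentheses class above is in scope) can reproduce the example output for n = 3:

#include <iostream>
#include <string>
#include <vector>
using namespace std;

// ... GenerateParentheses class from the solution above ...

int main() {
    GenerateParentheses solver;
    // Prints: ((())) (()()) (())() ()(()) ()()()
    for (const string& s : solver.generateParentheses(3)) {
        cout << s << " ";
    }
    cout << endl;
    return 0;
}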

Given an m * n grid filled with non-negative numbers, find a path from the top left to the bottom right which minimizes the sum of all numbers along its path.

  • You can only move down or right at any point in time.

Example:

Input: grid = [[1, 3, 1], [1, 5, 1], [4, 2, 1]];
Output: 7;

Constraints:

  • 1 <= m, n <= 200
  • 0 <= grid[i, j] <= 200

Solution

class MinimumPathSum {
public:
    int minPathSum(vector<vector<int>>& grid) {
        int n_row = grid.size();
        int n_col = grid[0].size();
        // dp[i][j] is the minimum path sum from the top-left cell to cell (i, j).
        int **dp = new int*[n_row];
        for(int i = 0; i < n_row; i++) {
            dp[i] = new int[n_col];
            fill_n(dp[i], n_col, INT_MAX);
        }
        dp[0][0] = grid[0][0];

        for(int i = 0; i < n_row; i++) {
            for(int j = 0; j < n_col; j++) {
                // A cell can only be reached from above or from the left.
                if(i) {
                    dp[i][j] = min(dp[i][j], dp[i - 1][j] + grid[i][j]);
                }
                if(j) {
                    dp[i][j] = min(dp[i][j], dp[i][j - 1] + grid[i][j]);
                }
            }
        }

        // Free the DP table before returning the answer.
        int result = dp[n_row - 1][n_col - 1];
        for(int i = 0; i < n_row; i++) delete[] dp[i];
        delete[] dp;
        return result;
    }
};

Explanation:

Use a dp array to record the current minimum path sum. The value at a cell can only be the sum of the dp value above it plus the current grid value, or the dp value to its left plus the current grid value. The dp recurrence can be defined as follows:

dp[i, j] := min(dp[i - 1, j] + grid[i, j], dp[i, j - 1] + grid[i, j])
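
As a quick sanity check, a minimal driver (my own addition, assuming the MinimumPathSum class above is in scope) can reproduce the example from the problem statement:

#include <algorithm>
#include <climits>
#include <iostream>
#include <vector>
using namespace std;

// ... MinimumPathSum class from the solution above ...

int main() {
    MinimumPathSum solver;
    vector<vector<int>> grid = {{1, 3, 1}, {1, 5, 1}, {4, 2, 1}};
    cout << solver.minPathSum(grid) << endl;  // prints 7 (path 1 -> 3 -> 1 -> 1 -> 1)
    return 0;
}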

Welcome to my personal blog. I opened this blog when I was about to graduate from university; at the time I really wanted something to document my life and the things I've learned. Now I'm on my way to the United States to start a brand new study experience. I want to keep some memories through this blog, and meanwhile take some notes about the things that might be useful to my future development.

About This Blog

Since my major is computer science/engineering, most of my study notes are related to fields like computer science, computer engineering, software development, etc. I have to admit that I don't excel at these subjects, but I am still keen on them and wish to make something big with my own hands. I hope this blog can witness my improvement :)

This blog does not only focus on such academic matters. Besides that, I will upload some of my daily life records, like photos, music, and maybe some unexpected inspiration and ideas. I will write each blog article in English as much as possible, to improve my English writing skills, and maybe it will benefit my future experience in the US 🙂

Hope you enjoy my blog, and discover something new!

Crescat Scientia Vita Excolatur

益智厚生