@@ -67,13 +67,17 @@ The following diagram illustrates the model architecture.

</center>
-In this diagram, $$\{ s_0, s_1, ..., s_{N-1} \}$$ are the words of the caption
-and $$\{ w_e s_0, w_e s_1, ..., w_e s_{N-1} \}$$ are their corresponding word
-embedding vectors. The outputs $$\{ p_1, p_2, ..., p_N \}$$ of the LSTM are
-probability distributions generated by the model for the next word in the
-sentence. The terms $$\{ \log p_1(s_1), \log p_2(s_2), ..., \log p_N(s_N) \}$$
-are the log-likelihoods of the correct word at each step; the negated sum of
-these terms is the minimization objective of the model.
+In this diagram, \{*s*<sub>0</sub>, *s*<sub>1</sub>, ..., *s*<sub>*N*-1</sub>\}
+are the words of the caption and \{*w*<sub>*e*</sub>*s*<sub>0</sub>,
+*w*<sub>*e*</sub>*s*<sub>1</sub>, ..., *w*<sub>*e*</sub>*s*<sub>*N*-1</sub>\}
+are their corresponding word embedding vectors. The outputs \{*p*<sub>1</sub>,
+*p*<sub>2</sub>, ..., *p*<sub>*N*</sub>\} of the LSTM are probability
+distributions generated by the model for the next word in the sentence. The
+terms \{log *p*<sub>1</sub>(*s*<sub>1</sub>),
+log *p*<sub>2</sub>(*s*<sub>2</sub>), ...,
+log *p*<sub>*N*</sub>(*s*<sub>*N*</sub>)\} are the log-likelihoods of the
+correct word at each step; the negated sum of these terms is the minimization
+objective of the model.
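This objective can be made concrete with a small numeric sketch. The helper below is purely illustrative — the dictionaries standing in for the LSTM's softmax outputs are made up for this example, not produced by the model:

```python
import math

def caption_loss(step_distributions, target_words):
    """Negative sum of log-likelihoods of the correct word at each step.

    step_distributions: p_1 .. p_N, each a dict mapping word -> probability.
    target_words: the caption words s_1 .. s_N the model should predict.
    """
    return -sum(math.log(dist[word])
                for dist, word in zip(step_distributions, target_words))

# Toy example with a three-step caption; these distributions are made up.
dists = [
    {"a": 0.7, "the": 0.3},
    {"giraffe": 0.6, "zebra": 0.4},
    {"</S>": 0.9, "standing": 0.1},
]
loss = caption_loss(dists, ["a", "giraffe", "</S>"])  # ≈ 0.973
```

Lowering the loss means raising the probability the model assigns to each correct next word.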
During the first phase of training the parameters of the *Inception v3* model
are kept fixed: it is simply a static image encoder function. A single trainable
@@ -85,11 +89,11 @@ training, all parameters - including the parameters of *Inception v3* - are
trained to jointly fine-tune the image encoder and the LSTM.
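As a rough sketch of the two phases, training code might partition variables as below. The `InceptionV3/` name prefix and the helper itself are assumptions made for illustration, not the repository's actual variable names or API:

```python
def trainable_variables(phase, all_vars):
    """Pick which variables the optimizer updates in each training phase.

    all_vars: list of (name, variable) pairs. Encoder variables are assumed
    to carry an "InceptionV3/" name prefix -- an assumption for this sketch.
    """
    if phase == 1:
        # Phase 1: the image encoder stays fixed; train only the rest.
        return [(n, v) for n, v in all_vars if not n.startswith("InceptionV3/")]
    # Phase 2: jointly fine-tune the encoder and the LSTM.
    return list(all_vars)

vars_ = [("InceptionV3/conv0/w", 0), ("lstm/kernel", 1), ("word_embedding", 2)]
phase1 = trainable_variables(1, vars_)  # only the LSTM and embedding variables
```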
Given a trained model and an image we use *beam search* to generate captions for
-that image. Captions are generated word-by-word, where at each step $$t$$ we use
-the set of sentences already generated with length $$t-1$$ to generate a new set
-of sentences with length $$t$$. We keep only the top $$k$$ candidates at each
-step, where the hyperparameter $$k$$ is called the *beam size*. We have found
-the best performance with $$k=3$$.
+that image. Captions are generated word-by-word, where at each step *t* we use
+the set of sentences already generated with length *t* - 1 to generate a new set
+of sentences with length *t*. We keep only the top *k* candidates at each step,
+where the hyperparameter *k* is called the *beam size*. We have found the best
+performance with *k* = 3.
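A minimal sketch of this procedure, assuming a hypothetical `step_fn` that stands in for one LSTM decoding step (this is not the repository's actual interface):

```python
import math

def beam_search(step_fn, start_token, end_token, beam_size=3, max_len=20):
    """Keep the top-`beam_size` partial sentences at each step.

    step_fn(sentence) -> list of (next_word, probability) pairs; it stands
    in for one step of the trained language model.
    """
    beams = [([start_token], 0.0)]  # (sentence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for sent, score in beams:
            if sent[-1] == end_token:
                candidates.append((sent, score))  # finished; carry it forward
                continue
            for word, p in step_fn(sent):
                candidates.append((sent + [word], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]  # prune to the beam size k
        if all(sent[-1] == end_token for sent, _ in beams):
            break
    return beams[0][0]  # highest-scoring complete sentence

# Toy next-word model: a transition table keyed on the last word generated.
table = {
    "<S>": [("a", 0.6), ("the", 0.4)],
    "a": [("cat", 0.9), ("dog", 0.1)],
    "the": [("dog", 0.95), ("cat", 0.05)],
    "cat": [("</S>", 1.0)],
    "dog": [("</S>", 1.0)],
}
best = beam_search(lambda sent: table[sent[-1]], "<S>", "</S>")
```

With a beam size of 1 this degenerates to greedy decoding; larger beams let a lower-probability first word survive if it leads to a higher-probability sentence overall.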
## Getting Started