@@ -67,13 +67,17 @@ The following diagram illustrates the model architecture.

</center>
-In this diagram, $$\{ s_0, s_1, ..., s_{N-1} \}$$ are the words of the caption
-and $$\{ w_e s_0, w_e s_1, ..., w_e s_{N-1} \}$$ are their corresponding word
-embedding vectors. The outputs $$\{ p_1, p_2, ..., p_N \}$$ of the LSTM are
-probability distributions generated by the model for the next word in the
-sentence. The terms $$\{ \log p_1(s_1), \log p_2(s_2), ..., \log p_N(s_N) \}$$
-are the log-likelihoods of the correct word at each step; the negated sum of
-these terms is the minimization objective of the model.
+In this diagram, \{*s*<sub>0</sub>, *s*<sub>1</sub>, ..., *s*<sub>*N*-1</sub>\}
+are the words of the caption and \{*w*<sub>*e*</sub>*s*<sub>0</sub>,
+*w*<sub>*e*</sub>*s*<sub>1</sub>, ..., *w*<sub>*e*</sub>*s*<sub>*N*-1</sub>\}
+are their corresponding word embedding vectors. The outputs \{*p*<sub>1</sub>,
+*p*<sub>2</sub>, ..., *p*<sub>*N*</sub>\} of the LSTM are probability
+distributions generated by the model for the next word in the sentence. The
+terms \{log *p*<sub>1</sub>(*s*<sub>1</sub>),
+log *p*<sub>2</sub>(*s*<sub>2</sub>), ...,
+log *p*<sub>*N*</sub>(*s*<sub>*N*</sub>)\} are the log-likelihoods of the
+correct word at each step; the negated sum of these terms is the minimization
+objective of the model.
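This objective can be made concrete with a small numeric sketch. The helper below is purely illustrative — the dictionaries standing in for the LSTM's softmax outputs are made up for this example, not produced by the model:

```python
import math

def caption_loss(step_distributions, target_words):
    """Negative sum of log-likelihoods of the correct word at each step.

    step_distributions: p_1 .. p_N, each a dict mapping word -> probability.
    target_words: the caption words s_1 .. s_N the model should predict.
    """
    return -sum(math.log(dist[word])
                for dist, word in zip(step_distributions, target_words))

# Toy example with a three-step caption; these distributions are made up.
dists = [
    {"a": 0.7, "the": 0.3},
    {"giraffe": 0.6, "zebra": 0.4},
    {"</S>": 0.9, "standing": 0.1},
]
loss = caption_loss(dists, ["a", "giraffe", "</S>"])  # ≈ 0.973
```

Lowering the loss means raising the probability the model assigns to each correct next word.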
During the first phase of training the parameters of the *Inception v3* model
are kept fixed: it is simply a static image encoder function. A single trainable
@@ -85,11 +89,11 @@ training, all parameters - including the parameters of *Inception v3* - are
trained to jointly fine-tune the image encoder and the LSTM.
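As a rough sketch of the two phases, training code might partition variables as below. The `InceptionV3/` name prefix and the helper itself are assumptions made for illustration, not the repository's actual variable names or API:

```python
def trainable_variables(phase, all_vars):
    """Pick which variables the optimizer updates in each training phase.

    all_vars: list of (name, variable) pairs. Encoder variables are assumed
    to carry an "InceptionV3/" name prefix -- an assumption for this sketch.
    """
    if phase == 1:
        # Phase 1: the image encoder stays fixed; train only the rest.
        return [(n, v) for n, v in all_vars if not n.startswith("InceptionV3/")]
    # Phase 2: jointly fine-tune the encoder and the LSTM.
    return list(all_vars)

vars_ = [("InceptionV3/conv0/w", 0), ("lstm/kernel", 1), ("word_embedding", 2)]
phase1 = trainable_variables(1, vars_)  # only the LSTM and embedding variables
```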
Given a trained model and an image we use *beam search* to generate captions for
-that image. Captions are generated word-by-word, where at each step $$t$$ we use
-the set of sentences already generated with length $$t-1$$ to generate a new set
-of sentences with length $$t$$. We keep only the top $$k$$ candidates at each
-step, where the hyperparameter $$k$$ is called the *beam size*. We have found
-the best performance with $$k=3$$.
+that image. Captions are generated word-by-word, where at each step *t* we use
+the set of sentences already generated with length *t* - 1 to generate a new set
+of sentences with length *t*. We keep only the top *k* candidates at each step,
+where the hyperparameter *k* is called the *beam size*. We have found the best
+performance with *k* = 3.
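A minimal sketch of this procedure, assuming a hypothetical `step_fn` that stands in for one LSTM decoding step (this is not the repository's actual interface):

```python
import math

def beam_search(step_fn, start_token, end_token, beam_size=3, max_len=20):
    """Keep the top-`beam_size` partial sentences at each step.

    step_fn(sentence) -> list of (next_word, probability) pairs; it stands
    in for one step of the trained language model.
    """
    beams = [([start_token], 0.0)]  # (sentence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for sent, score in beams:
            if sent[-1] == end_token:
                candidates.append((sent, score))  # finished; carry it forward
                continue
            for word, p in step_fn(sent):
                candidates.append((sent + [word], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]  # prune to the beam size k
        if all(sent[-1] == end_token for sent, _ in beams):
            break
    return beams[0][0]  # highest-scoring complete sentence

# Toy next-word model: a transition table keyed on the last word generated.
table = {
    "<S>": [("a", 0.6), ("the", 0.4)],
    "a": [("cat", 0.9), ("dog", 0.1)],
    "the": [("dog", 0.95), ("cat", 0.05)],
    "cat": [("</S>", 1.0)],
    "dog": [("</S>", 1.0)],
}
best = beam_search(lambda sent: table[sent[-1]], "<S>", "</S>")
```

With a beam size of 1 this degenerates to greedy decoding; larger beams let a lower-probability first word survive if it leads to a higher-probability sentence overall.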
## Getting Started