10 years ago · f3a2f63b73
--- a/swivel/.gitignore
+++ b/swivel/.gitignore
@@ -0,0 +1,12 @@
 
																+*.an.tab
															
 
																+*.pyc
															
 
																+*.ws.tab
															
 
																+MEN.tar.gz
															
 
																+Mtruk.csv
															
 
																+SimLex-999.zip
															
 
																+analogy
															
 
																+fastprep
															
 
																+myz_naacl13_test_set.tgz
															
 
																+questions-words.txt
															
 
																+rw.zip
															
 
																+ws353simrel.tar.gz
															
--- a/swivel/README.md
+++ b/swivel/README.md
@@ -0,0 +1,182 @@
 
																+# Swivel in TensorFlow
															
 
																+
															
 
																+This is a [TensorFlow](http://www.tensorflow.org/) implementation of the
															
 
																+[Swivel algorithm](http://arxiv.org/abs/1602.02215) for generating word
															
 
																+embeddings.
															
 
																+
															
 
																+Swivel works as follows:
															
 
																+
															
 
																+1. Compute the co-occurrence statistics from a corpus; that is, determine how
															
 
																+   often a word *c* appears in the context (e.g., "within ten words") of a focus
															
 
																+   word *f*.  This results in a sparse *co-occurrence matrix* whose rows
															
 
																+   represent the focus words, and whose columns represent the context
															
 
																+   words. Each cell value is the number of times the focus and context words
															
 
																+   were observed together.
															
 
																+2. Re-organize the co-occurrence matrix and chop it into smaller pieces.
															
 
																+3. Assign a random *embedding vector* of fixed dimension (say, 300) to each
															
 
																+   focus word and to each context word.
															
 
																+4. Iteratively attempt to approximate the
															
 
																+   [pointwise mutual information](https://en.wikipedia.org/wiki/Pointwise_mutual_information)
															
 
																+   (PMI) between words with the dot product of the corresponding embedding
															
 
																+   vectors.
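
Steps 1 and 4 above can be illustrated on a toy count table. This is a minimal sketch, not code from `prep.py` or `swivel.py`: the function name `pmi_matrix` and the 2x2 counts are assumptions made for illustration. It computes PMI only for observed cells, which is the quantity the embedding dot products are meant to approximate.

```python
import math


def pmi_matrix(counts):
    """Return PMI for each observed (focus, context) cell of a dense count table.

    `counts[i][j]` is how often context word j appeared near focus word i.
    Hypothetical helper for illustration; not part of the Swivel release.
    """
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]

    pmi = {}
    for i, row in enumerate(counts):
        for j, n_ij in enumerate(row):
            if n_ij > 0:  # log(0) is undefined, so unobserved pairs are skipped
                pmi[(i, j)] = math.log(n_ij * total / (row_sums[i] * col_sums[j]))
    return pmi


# Two focus words that each co-occur with only one of two context words.
counts = [[4, 0], [0, 4]]
pmi = pmi_matrix(counts)
```

The unobserved cells `(0, 1)` and `(1, 0)` are absent from the result, so plain PMI says nothing about them.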
															
 
																+
															
 
																+Note that the resulting co-occurrence matrix is very sparse (i.e., contains many
															
 
																+zeros) since most words won't have been observed in the context of other words.
															
 
																+In the case of very rare words, it seems reasonable to assume that you just
															
 
																+haven't sampled enough data to spot their co-occurrence yet.  On the other hand,
															
 
																+if we've failed to observe two common words co-occurring, it seems likely that
															
 
																+they are *anti-correlated*.
															
 
																+
															
 
																+Swivel attempts to capture this intuition by using both the observed and the
															
 
																+un-observed co-occurrences to inform the way it iteratively adjusts vectors.
															
 
																+Empirically, this seems to lead to better embeddings, especially for rare words.
															
 
																+
															
 
																+# Contents
															
 
																+
															
 
																+This release includes the following programs.
															
 
																+
															
 
																+* `prep.py` is a program that takes a text corpus and pre-processes it for
															
 
																+  training. Specifically, it computes a vocabulary and token co-occurrence
															
 
																+  statistics for the corpus.  It then outputs the information into a format that
															
 
																+  can be digested by the TensorFlow trainer.
															
 
																+* `swivel.py` is a TensorFlow program that generates embeddings from the
															
 
																+  co-occurrence statistics.  It uses the files created by `prep.py` as input,
															
 
																+  and generates two text files as output: the row and column embeddings.
															
 
																+* `text2bin.py` combines the row and column vectors generated by Swivel into a
															
 
																+  flat binary file that can be quickly loaded into memory to perform vector
															
 
																+  arithmetic.  This can also be used to convert embeddings from
															
 
																+  [GloVe](http://nlp.stanford.edu/projects/glove/) and
															
 
																+  [word2vec](https://code.google.com/archive/p/word2vec/) into a form that can
															
 
																+  be used by the following tools.
															
 
																+* `nearest.py` is a program that you can use to manually inspect binary
															
 
																+  embeddings.
															
 
																+* `eval.mk` is a GNU makefile that will retrieve and normalize several common
															
 
																+  word similarity and analogy evaluation data sets.
															
 
																+* `wordsim.py` performs word similarity evaluation of the resulting vectors.
															
 
																+* `analogy` performs analogy evaluation of the resulting vectors.
															
 
																+* `fastprep` is a C++ program that works much more quickly than `prep.py`, but
															
 
																+  also has some additional dependencies to build.
															
 
																+
															
 
																+# Building Embeddings with Swivel
															
 
																+
															
 
																+To build your own word embeddings with Swivel, you'll need the following:
															
 
																+
															
 
																+* A large corpus of text; for example, the
															
 
																+  [dump of English Wikipedia](https://dumps.wikimedia.org/enwiki/).
															
 
																+* A working [TensorFlow](http://www.tensorflow.org/) implementation.
															
 
																+* A machine with plenty of disk space and, ideally, a beefy GPU card.  (We've
															
 
																+  experimented with the
															
 
																+  [Nvidia Titan X](http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-x),
															
 
																+  for example.)
															
 
																+
															
 
																+You'll then run `prep.py` (or `fastprep`) to prepare the data for Swivel and run
															
 
																+`swivel.py` to create the embeddings. The resulting embeddings will be output
															
 
																+into two large text files: one for the row vectors and one for the column
															
 
																+vectors.  You can use those "as is", or convert them into a binary file using
															
 
																+`text2bin.py` and then use the tools here to experiment with the resulting
															
 
																+vectors.
															
 
																+
															
 
																+## Preparing the data for training
															
 
																+
															
 
																+Once you've downloaded the corpus (e.g., to `/tmp/wiki.txt`), run `prep.py` to
															
 
																+prepare the data for training:
															
 
																+
															
 
																+    ./prep.py --output_dir /tmp/swivel_data --input /tmp/wiki.txt
															
 
																+
															
 
																+By default, `prep.py` will make one pass through the text file to compute a
															
 
																+"vocabulary" of the most frequent words, and then a second pass to compute the
															
 
																+co-occurrence statistics.  The following options allow you to control this
															
 
																+behavior:
															
 
																+
															
 
																+| Option | Description |
															
 
																+|:--- |:--- |
															
 
																+| `--min_count <n>` | Only include words in the generated vocabulary that appear at least *n* times. |
															
 
																+| `--max_vocab <n>` | Admit at most *n* words into the vocabulary. |
															
 
																+| `--vocab <filename>` | Use the specified filename as the vocabulary instead of computing it from the corpus.  The file should contain one word per line. |
															
 
																+
															
 
																+The `prep.py` program is pretty simple.  Notably, it does almost no text
															
 
																+processing: it does no case translation and simply breaks text into tokens by
															
 
																+splitting on spaces. Feel free to experiment with the `words` function if you'd
															
 
																+like to do something more sophisticated.
															
 
																+
															
 
																+Unfortunately, `prep.py` is pretty slow.  Also included is `fastprep`, a C++
															
 
																+equivalent that works much more quickly.  Building `fastprep.cc` is a bit more
															
 
																+involved: it requires you to pull and build the TensorFlow source code in order
															
 
																+to provide the libraries and headers that it needs.  See `fastprep.mk` for more
															
 
																+details.
															
 
																+
															
 
																+## Training the embeddings
															
 
																+
															
 
																+When `prep.py` completes, it will have produced a directory containing the data
															
 
																+that the Swivel trainer needs to run.  Train embeddings as follows:
															
 
																+
															
 
																+    ./swivel.py --input_base_path /tmp/swivel_data \
															
 
																+       --output_base_path /tmp/swivel_data
															
 
																+
															
 
																+There are a variety of parameters that you can fiddle with to customize the
															
 
																+embeddings; some that you may want to experiment with include:
															
 
																+
															
 
																+| Option | Description |
															
 
																+|:--- |:--- |
															
 
																+| `--embedding_size <dim>` | The dimensionality of the embeddings that are created.  By default, 300 dimensional embeddings are created. |
															
 
																+| `--num_epochs <n>` | The number of iterations through the data that are performed.  By default, 40 epochs are trained. |
															
 
																+
															
 
																+As mentioned above, access to a beefy GPU will dramatically reduce the amount of
															
 
																+time it takes Swivel to train embeddings.
															
 
																+
															
 
																+When complete, you should find `row_embedding.tsv` and `col_embedding.tsv` in
															
 
																+the directory specified by `--output_base_path`.  These files are tab-delimited
															
 
																+files that contain one embedding per line.  Each line contains the token
															
 
																+followed by *dim* floating point numbers.
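
For readers who want to use the text output directly, here is a minimal sketch of parsing one line. It assumes only the format described above (a token followed by *dim* tab-separated floats); the function name `parse_embedding_line` is hypothetical and not part of the release.

```python
def parse_embedding_line(line):
    """Split one tab-delimited embedding line into (token, vector).

    Assumes the layout described in the README: the token comes first,
    followed by one floating point number per dimension.
    """
    token, *values = line.rstrip("\n").split("\t")
    return token, [float(v) for v in values]


# A made-up two-dimensional line, purely for illustration.
token, vector = parse_embedding_line("dog\t0.5\t-1.25\n")
```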
															
 
																+
															
 
																+## Exploring and evaluating the embeddings
															
 
																+
															
 
																+There are also some simple tools you can use to explore the embeddings.  These tools
															
 
																+work with a simple binary vector format that can be `mmap`-ed into memory along
															
 
																+with a separate vocabulary file.  Use `text2bin.py` to generate these files:
															
 
																+
															
 
																+    ./text2bin.py -o vecs.bin -v vocab.txt /tmp/swivel_data/*_embedding.tsv
															
 
																+
															
 
																+You can do some simple exploration using `nearest.py`:
															
 
																+
															
 
																+    ./nearest.py -v vocab.txt -e vecs.bin
															
 
																+    query> dog
															
 
																+    dog
															
 
																+    dogs
															
 
																+    cat
															
 
																+    ...
															
 
																+    query> man woman king
															
 
																+    king
															
 
																+    queen
															
 
																+    princess
															
 
																+    ...
															
 
																+
															
 
																+To evaluate the embeddings using common word similarity and analogy datasets,
															
 
																+use `eval.mk` to retrieve the data sets and build the tools:
															
 
																+
															
 
																+    make -f eval.mk
															
 
																+    ./wordsim.py -v vocab.txt -e vecs.bin *.ws.tab
															
 
																+    ./analogy --vocab vocab.txt --embeddings vecs.bin *.an.tab
															
 
																+
															
 
																+The word similarity evaluation compares the embeddings' estimate of "similarity"
															
 
																+with human judgement using
															
 
																+[Spearman's rho](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient)
															
 
																+as the measure of correlation.  (Bigger numbers are better.)
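
To make the metric concrete, the sketch below computes Spearman's rho as the Pearson correlation of rank-transformed scores, with tied values sharing their average rank. It is an illustration only and is not taken from `wordsim.py`; the names `average_ranks` and `spearman_rho` are assumptions.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values equal to values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Pearson correlation of the ranks of xs and ys.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up scores: model similarities versus human ratings for three pairs.
model_scores = [0.9, 0.2, 0.5]
human_scores = [9.0, 1.0, 6.0]
rho = spearman_rho(model_scores, human_scores)
```

Only the ordering matters here: the model and human scores rank the three pairs identically, so the correlation is perfect even though the scales differ.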
															
 
																+
															
 
																+The analogy evaluation tests how well the embeddings can predict analogies like
															
 
																+"man is to woman as king is to queen".
															
 
																+
															
 
																+Note that `eval.mk` forces all evaluation data into lower case.  From there,
															
 
																+both the word similarity and analogy evaluations assume that the eval data and
															
 
																+the embeddings use consistent capitalization: if you train embeddings using
															
 
																+mixed case and evaluate them using lower case, things won't work well.
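
The analogy test described above can be sketched in a few lines. This is a simplified illustration rather than the `analogy` tool itself: it mirrors the "C - A + B" arithmetic and excludes the query words from the candidates, but the function name `solve_analogy` and the toy two-dimensional vectors are assumptions. The toy vectors are not normalized, so the dot product here is not a true cosine similarity.

```python
def solve_analogy(embeddings, a, b, c):
    """Return the word whose vector has the largest dot product with C - A + B.

    `embeddings` maps each word to a list of floats. The three query words
    are skipped, so they can never be returned as the answer.
    """
    dim = len(embeddings[a])
    target = [
        embeddings[c][i] - embeddings[a][i] + embeddings[b][i] for i in range(dim)
    ]

    best_word, best_dot = None, float("-inf")
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue
        dot = sum(x * y for x, y in zip(vec, target))
        if dot > best_dot:
            best_word, best_dot = word, dot
    return best_word


# Hand-made vectors: axis 0 loosely encodes "royalty", axis 1 "female".
toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [2.0, 0.0],
    "queen": [2.0, 1.0],
    "apple": [-1.0, -1.0],
}
answer = solve_analogy(toy, "man", "woman", "king")
```

Because lookups are exact string matches, a query for "King" against lower-case vectors would fail, which is the capitalization pitfall noted above.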
															
 
																+
															
 
																+# Contact
															
 
																+
															
 
																+If you have any questions about Swivel, feel free to post to
															
 
																+[swivel-embeddings@googlegroups.com](https://groups.google.com/forum/#!forum/swivel-embeddings)
															
 
																+or contact us directly:
															
 
																+
															
 
																+* Noam Shazeer (`noam@google.com`)
															
 
																+* Ryan Doherty (`portalfire@google.com`)
															
 
																+* Colin Evans (`colinhevans@google.com`)
															
 
																+* Chris Waterson (`waterson@google.com`)
															
 
																+
															
--- a/swivel/analogy.cc
+++ b/swivel/analogy.cc
@@ -0,0 +1,365 @@
 
																+/* -*- Mode: C++ -*- */
															
 
																+
															
 
																+/*
															
 
																+ * Copyright 2016 Google Inc. All Rights Reserved.
															
 
																+ *
															
 
																+ * Licensed under the Apache License, Version 2.0 (the "License");
															
 
																+ * you may not use this file except in compliance with the License.
															
 
																+ * You may obtain a copy of the License at
															
 
																+ *
															
 
																+ *    http://www.apache.org/licenses/LICENSE-2.0
															
 
																+ *
															
 
																+ * Unless required by applicable law or agreed to in writing, software
															
 
																+ * distributed under the License is distributed on an "AS IS" BASIS,
															
 
																+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
															
 
																+ * See the License for the specific language governing permissions and
															
 
																+ * limitations under the License.
															
 
																+ */
															
 
																+
															
 
																+/*
															
 
																+ * Computes embedding performance on analogy tasks.  Accepts as input one or
															
 
																+ * more files containing four words per line (A B C D), and determines if:
															
 
																+ *
															
 
																+ *   vec(C) - vec(A) + vec(B) ~= vec(D)
															
 
																+ *
															
 
																+ * Cosine distance in the embedding space is used to retrieve neighbors. Any
															
 
																+ * missing vocabulary items are scored as losses.
															
 
																+ */
															
 
																+#include <fcntl.h>
															
 
																+#include <math.h>
															
 
																+#include <pthread.h>
															
 
																+#include <stdio.h>
															
 
																+#include <stdlib.h>
															
 
																+#include <string.h>
															
 
																+#include <sys/stat.h>
															
 
																+#include <sys/types.h>
															
 
																+#include <unistd.h>
															
 
																+
															
 
																+#include <fstream>
															
 
																+#include <iostream>
															
 
																+#include <string>
															
 
																+#include <tuple>
															
 
																+#include <unordered_map>
															
 
																+#include <vector>
															
 
																+
															
 
																+static const char usage[] = R"(
															
 
																+Performs analogy testing of embedding vectors.
															
 
																+
															
 
																+Usage:
															
 
																+
															
 
																+  analogy --embeddings <embeddings> --vocab <vocab> eval1.tab ...
															
 
																+
															
 
																+Options:
															
 
																+
															
 
																+  --embeddings <filename>
															
 
																+    The file containing the binary embedding vectors to evaluate.
															
 
																+
															
 
																+  --vocab <filename>
															
 
																+    The vocabulary file corresponding to the embedding vectors.
															
 
																+
															
 
																+  --nthreads <integer>
															
 
																+    The number of evaluation threads to run (default: 8)
															
 
																+)";
															
 
																+
															
 
																+// Reads the vocabulary file into a map from token to vector index.
															
 
																+static std::unordered_map<std::string, int> ReadVocab(
															
 
																+    const std::string& vocab_filename) {
															
 
																+  std::unordered_map<std::string, int> vocab;
															
 
																+  std::ifstream fin(vocab_filename);
															
 
																+
															
 
																+  int index = 0;
															
 
																+  for (std::string token; std::getline(fin, token); ++index) {
															
 
																+    auto n = token.find('\t');
															
 
																+    if (n != std::string::npos) token = token.substr(0, n);
															
 
																+
															
 
																+    vocab[token] = index;
															
 
																+  }
															
 
																+
															
 
																+  return vocab;
															
 
																+}
															
 
																+
															
 
																+// An analogy query: "A is to B as C is to D".
															
 
																+typedef std::tuple<int, int, int, int> AnalogyQuery;
															
 
																+
															
 
																+std::vector<AnalogyQuery> ReadQueries(
															
 
																+    const std::string &filename,
															
 
																+    const std::unordered_map<std::string, int> &vocab, int *total) {
															
 
																+  std::ifstream fin(filename);
															
 
																+
															
 
																+  std::vector<AnalogyQuery> queries;
															
 
																+  int lineno = 0;
															
 
																+  while (1) {
															
 
																+    // Read the four words.
															
 
																+    std::string words[4];
															
 
																+    int nread = 0;
															
 
																+    for (int i = 0; i < 4; ++i) {
															
 
																+      fin >> words[i];
															
 
																+      if (!words[i].empty()) ++nread;
															
 
																+    }
															
 
																+
															
 
																+    if (nread == 0) break;
															
 
																+    ++lineno;
															
 
																+
															
 
																+    if (nread < 4) {
															
 
																+      std::cerr << "expected four words at line " << lineno << std::endl;
															
 
																+      break;
															
 
																+    }
															
 
																+
															
 
																+    // Look up each word's index.
															
 
																+    int ixs[4], nvalid;
															
 
																+    for (nvalid = 0; nvalid < 4; ++nvalid) {
															
 
																+      std::unordered_map<std::string, int>::const_iterator it =
															
 
																+          vocab.find(words[nvalid]);
															
 
																+
															
 
																+      if (it == vocab.end()) break;
															
 
																+
															
 
																+      ixs[nvalid] = it->second;
															
 
																+    }
															
 
																+
															
 
																+    // If we don't have all the words, count it as a loss.
															
 
																+    if (nvalid >= 4)
															
 
																+      queries.push_back(std::make_tuple(ixs[0], ixs[1], ixs[2], ixs[3]));
															
 
																+  }
															
 
																+
															
 
																+  *total = lineno;
															
 
																+  return queries;
															
 
																+}
															
 
																+
															
 
																+
															
 
																+// A thread that evaluates some fraction of the analogies.
															
 
																+class AnalogyEvaluator {
															
 
																+ public:
															
 
																+  // Creates a new Analogy evaluator for a range of analogy queries.
															
 
																+  AnalogyEvaluator(std::vector<AnalogyQuery>::const_iterator begin,
															
 
																+                   std::vector<AnalogyQuery>::const_iterator end,
															
 
																+                   const float *embeddings, const int num_embeddings,
															
 
																+                   const int dim)
															
 
																+      : begin_(begin),
															
 
																+        end_(end),
															
 
																+        embeddings_(embeddings),
															
 
																+        num_embeddings_(num_embeddings),
															
 
																+        dim_(dim) {}
															
 
																+
															
 
																+  // A thunk for pthreads.
															
 
																+  static void* Run(void *param) {
															
 
																+    AnalogyEvaluator *self = static_cast<AnalogyEvaluator*>(param);
															
 
																+    self->Evaluate();
															
 
																+    return nullptr;
															
 
																+  }
															
 
																+
															
 
																+  // Evaluates the analogies.
															
 
																+  void Evaluate();
															
 
																+
															
 
																+  // Returns the number of correct analogies after evaluation is complete.
															
 
																+  int GetNumCorrect() const { return correct_; }
															
 
																+
															
 
																+ protected:
															
 
																+  // The beginning of the range of queries to consider.
															
 
																+  std::vector<AnalogyQuery>::const_iterator begin_;
															
 
																+
															
 
																+  // The end of the range of queries to consider.
															
 
																+  std::vector<AnalogyQuery>::const_iterator end_;
															
 
																+
															
 
																+  // The raw embedding vectors.
															
 
																+  const float *embeddings_;
															
 
																+
															
 
																+  // The number of embedding vectors.
															
 
																+  const int num_embeddings_;
															
 
																+
															
 
																+  // The embedding vector dimensionality.
															
 
																+  const int dim_;
															
 
																+
															
 
																+  // The number of correct analogies.
															
 
																+  int correct_;
															
 
																+};
															
 
																+
															
 
																+
															
 
																+void AnalogyEvaluator::Evaluate() {
															
 
																+  float* sum = new float[dim_];
															
 
																+
															
 
																+  correct_ = 0;
															
 
																+  for (auto query = begin_; query < end_; ++query) {
															
 
																+    const float* vec;
															
 
																+    int a, b, c, d;
															
 
																+    std::tie(a, b, c, d) = *query;
															
 
																+
															
 
																+    // Compute C - A + B.
															
 
																+    vec = embeddings_ + dim_ * c;
															
 
																+    for (int i = 0; i < dim_; ++i) sum[i] = vec[i];
															
 
																+
															
 
																+    vec = embeddings_ + dim_ * a;
															
 
																+    for (int i = 0; i < dim_; ++i) sum[i] -= vec[i];
															
 
																+
															
 
																+    vec = embeddings_ + dim_ * b;
															
 
																+    for (int i = 0; i < dim_; ++i) sum[i] += vec[i];
															
 
																+
															
 
																+    // Find the nearest neighbor that isn't one of the query words.
															
 
																+    int best_ix = -1;
															
 
																+    float best_dot = -1.0;
															
 
																+    for (int i = 0; i < num_embeddings_; ++i) {
															
 
																+      if (i == a || i == b || i == c) continue;
															
 
																+
															
 
																+      vec = embeddings_ + dim_ * i;
															
 
																+
															
 
																+      float dot = 0;
															
 
																+      for (int j = 0; j < dim_; ++j) dot += vec[j] * sum[j];
															
 
																+
															
 
																+      if (dot > best_dot) {
															
 
																+        best_ix = i;
															
 
																+        best_dot = dot;
															
 
																+      }
															
 
																+    }
															
 
																+
															
 
																+    // The fourth word is the answer; did we get it right?
															
 
																+    if (best_ix == d) ++correct_;
															
 
																+  }
															
 
																+
															
 
																+  delete[] sum;
															
 
																+}
															
 
																+
															
 
																+
															
 
																+int main(int argc, char *argv[]) {
															
 
																+  if (argc <= 1) {
															
 
																+    printf("%s", usage);
															
 
																+    return 2;
															
 
																+  }
															
 
																+
															
 
																+  std::string embeddings_filename, vocab_filename;
															
 
																+  int nthreads = 8;
															
 
																+
															
 
																+  std::vector<std::string> input_filenames;
															
 
																+  std::vector<std::tuple<int, int, int, int>> queries;
															
 
																+
															
 
																+  for (int i = 1; i < argc; ++i) {
															
 
																+    std::string arg = argv[i];
															
 
																+    if (arg == "--embeddings") {
															
 
																+      if (++i >= argc) goto argmissing;
															
 
																+      embeddings_filename = argv[i];
															
 
																+    } else if (arg == "--vocab") {
															
 
																+      if (++i >= argc) goto argmissing;
															
 
																+      vocab_filename = argv[i];
															
 
																+    } else if (arg == "--nthreads") {
															
 
																+      if (++i >= argc) goto argmissing;
															
 
																+      if ((nthreads = atoi(argv[i])) <= 0) goto badarg;
															
 
																+    } else if (arg == "--help") {
															
 
																+      std::cout << usage << std::endl;
															
 
																+      return 0;
															
 
																+    } else if (arg[0] == '-') {
															
 
																+      std::cerr << "unknown option: '" << arg << "'" << std::endl;
															
 
																+      return 2;
															
 
																+    } else {
															
 
																+      input_filenames.push_back(arg);
															
 
																+    }
															
 
																+
															
 
																+    continue;
															
 
																+
															
 
																+  argmissing:
															
 
																+    std::cerr << "missing value for '" << argv[i - 1] << "' (--help for help)"
															
 
																+              << std::endl;
															
 
																+    return 2;
															
 
																+
															
 
																+  badarg:
															
 
																+    std::cerr << "invalid value '" << argv[i] << "' for '" << argv[i - 1]
															
 
																+              << "' (--help for help)" << std::endl;
															
 
																+
															
 
																+    return 2;
															
 
																+  }
															
 
																+
															
 
																+  // Read the vocabulary.
															
 
																+  std::unordered_map<std::string, int> vocab = ReadVocab(vocab_filename);
															
 
																+  if (!vocab.size()) {
															
 
																+    std::cerr << "unable to read vocabulary file '" << vocab_filename << "'"
															
 
																+              << std::endl;
															
 
																+    return 1;
															
 
																+  }
															
 
																+
															
 
																+  const int n = vocab.size();
															
 
																+
															
 
																+  // Read the vectors.
															
 
																+  int fd;
															
 
																+  if ((fd = open(embeddings_filename.c_str(), O_RDONLY)) < 0) {
															
 
																+    std::cerr << "unable to open embeddings file '" << embeddings_filename
															
 
																+              << "'" << std::endl;
															
 
																+    return 1;
															
 
																+  }
															
 
																+
															
 
																+  off_t nbytes = lseek(fd, 0, SEEK_END);
															
 
																+  if (nbytes == -1) {
															
 
																+    std::cerr << "unable to determine file size for '" << embeddings_filename
															
 
																+              << "'" << std::endl;
															
 
																+    return 1;
															
 
																+  }
															
 
																+
															
 
																+  if (nbytes % (sizeof(float) * n) != 0) {
															
 
																+    std::cerr << "'" << embeddings_filename
															
 
																+              << "' has a strange file size; expected it to be "
															
 
																+                 "a multiple of the vocabulary size"
															
 
																+              << std::endl;
															
 
																+
															
 
																+    return 1;
															
 
																+  }
															
 
																+
															
 
																+  const int dim = nbytes / (sizeof(float) * n);
															
 
																+  float *embeddings = static_cast<float *>(malloc(nbytes));
															
 
																+  lseek(fd, 0, SEEK_SET);
															
 
																+  if (read(fd, embeddings, nbytes) < nbytes) {
															
 
																+    std::cerr << "unable to read embeddings from " << embeddings_filename
															
 
																+              << std::endl;
															
 
																+    return 1;
															
 
																+  }
															
 
																+
															
 
																+  close(fd);
															
 
																+
															
 
																+  /* Normalize the vectors. */
															
 
																+  for (int i = 0; i < n; ++i) {
															
 
																+    float *vec = embeddings + dim * i;
															
 
																+    float norm = 0;
															
 
																+    for (int j = 0; j < dim; ++j) norm += vec[j] * vec[j];
															
 
																+
															
 
																+    norm = sqrt(norm);
															
 
																+    for (int j = 0; j < dim; ++j) vec[j] /= norm;
															
 
																+  }
															
 
																+
															
 
																+  pthread_attr_t attr;
															
 
																+  if (pthread_attr_init(&attr) != 0) {
															
 
																+    std::cerr << "unable to initialize pthreads" << std::endl;
															
 
																+    return 1;
															
 
																+  }
															
 
																+
															
 
																+  /* Read each input file. */
															
 
																+  for (const auto &filename : input_filenames) {
															
 
																+    int total = 0;
															
 
																+    std::vector<AnalogyQuery> queries =
															
 
																+        ReadQueries(filename.c_str(), vocab, &total);
															
 
																+
															
 
																+    const int queries_per_thread = queries.size() / nthreads;
															
 
																+    std::vector<AnalogyEvaluator*> evaluators;
															
 
																+    std::vector<pthread_t> threads;
															
 
																+
															
 
																+    for (int i = 0; i < nthreads; ++i) {
															
 
																+      auto begin = queries.begin() + i * queries_per_thread;
															
 
																+      auto end = (i + 1 < nthreads)
															
 
																+                     ? queries.begin() + (i + 1) * queries_per_thread
															
 
																+                     : queries.end();
															
 
																+
															
 
																+      AnalogyEvaluator *evaluator =
															
 
																+          new AnalogyEvaluator(begin, end, embeddings, n, dim);
															
 
																+
															
 
																+      pthread_t thread;
															
 
																+      pthread_create(&thread, &attr, AnalogyEvaluator::Run, evaluator);
															
 
																+      evaluators.push_back(evaluator);
															
 
																+      threads.push_back(thread);
															
 
																+    }
															
 
																+
															
 
																+    for (auto &thread : threads) pthread_join(thread, 0);
															
 
																+
															
 
																+    int correct = 0;
															
 
																+    for (const AnalogyEvaluator* evaluator : evaluators) {
															
 
																+      correct += evaluator->GetNumCorrect();
															
 
																+      delete evaluator;
															
 
																+    }
															
 
																+
															
 
																+    printf("%0.3f %s\n", static_cast<float>(correct) / total, filename.c_str());
															
 
																+  }
															
 
																+
															
 
																+  return 0;
															
 
																+}
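
Reviewer note on the thread split in `main` above: each thread receives `queries.size() / nthreads` queries, and the last thread also takes whatever remainder is left. The sketch below isolates that index arithmetic so it can be checked on its own. `PartitionRanges` is a hypothetical helper written for illustration; it is not part of `analogy.cc`, and it works on plain indices rather than the iterators used in the original.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of analogy.cc): splits `num_items` work items
// into `nthreads` contiguous [begin, end) index ranges.  Every range holds
// num_items / nthreads items, except the last one, which also absorbs the
// remainder.
std::vector<std::pair<std::size_t, std::size_t>> PartitionRanges(
    std::size_t num_items, int nthreads) {
  assert(nthreads > 0);
  const std::size_t per_thread = num_items / nthreads;

  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  for (int i = 0; i < nthreads; ++i) {
    const std::size_t begin = i * per_thread;
    // Mirrors the ternary in main(): only the final range runs to the end.
    const std::size_t end =
        (i + 1 < nthreads) ? (i + 1) * per_thread : num_items;
    ranges.emplace_back(begin, end);
  }
  return ranges;
}
```

With 10 queries and 3 threads this yields the ranges [0, 3), [3, 6) and [6, 10). When there are fewer queries than threads, `per_thread` is 0, so every range but the last is empty and the last thread does all of the work.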
															
--- a/swivel/eval.mk
+++ b/swivel/eval.mk
@@ -0,0 +1,98 @@
 
																+# -*- Mode: Makefile -*-
															
 
																+#
															
 
																+# Copyright 2016 Google Inc. All Rights Reserved.
															
 
																+#
															
 
																+# Licensed under the Apache License, Version 2.0 (the "License");
															
 
																+# you may not use this file except in compliance with the License.
															
 
																+# You may obtain a copy of the License at
															
 
																+#
															
 
																+#     http://www.apache.org/licenses/LICENSE-2.0
															
 
																+#
															
 
																+# Unless required by applicable law or agreed to in writing, software
															
 
																+# distributed under the License is distributed on an "AS IS" BASIS,
															
 
																+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
															
 
																+# See the License for the specific language governing permissions and
															
 
																+# limitations under the License.
															
 
																+
															
 
																+# This makefile pulls down the evaluation datasets and formats them uniformly.
															
 
																+# Word similarity evaluations are formatted to contain exactly three columns:
															
 
																+# the two words being compared and the human judgement.
															
 
																+#
															
 
																+# Use wordsim.py and analogy to run the actual evaluations.
															
 
																+
															
 
																+CXXFLAGS=-std=c++11 -m64 -mavx -g -Ofast -Wall
															
 
																+LDLIBS=-lpthread -lm
															
 
																+
															
 
																+WORDSIM_EVALS=	ws353sim.ws.tab \
															
 
																+		ws353rel.ws.tab \
															
 
																+		men.ws.tab	\
															
 
																+		mturk.ws.tab \
															
 
																+		rarewords.ws.tab \
															
 
																+		simlex999.ws.tab \
															
 
																+		$(NULL)
															
 
																+
															
 
																+ANALOGY_EVALS=	mikolov.an.tab \
															
 
																+		msr.an.tab \
															
 
																+		$(NULL)
															
 
																+
															
 
																+all: $(WORDSIM_EVALS) $(ANALOGY_EVALS) analogy
															
 
																+
															
 
																+ws353sim.ws.tab: ws353simrel.tar.gz
															
 
																+	tar Oxfz $^ wordsim353_sim_rel/wordsim_similarity_goldstandard.txt > $@
															
 
																+
															
 
																+ws353rel.ws.tab: ws353simrel.tar.gz
															
 
																+	tar Oxfz $^ wordsim353_sim_rel/wordsim_relatedness_goldstandard.txt > $@
															
 
																+
															
 
																+men.ws.tab: MEN.tar.gz
															
 
																+	tar Oxfz $^ MEN/MEN_dataset_natural_form_full | tr ' ' '\t' > $@
															
 
																+
															
 
																+mturk.ws.tab: Mtruk.csv
															
 
																+	cat $^ | tr -d '\r' | tr ',' '\t' > $@
															
 
																+
															
 
																+rarewords.ws.tab: rw.zip
															
 
																+	unzip -p $^ rw/rw.txt | cut -f1-3 -d $$'\t' > $@
															
 
																+
															
 
																+simlex999.ws.tab: SimLex-999.zip
															
 
																+	unzip -p $^ SimLex-999/SimLex-999.txt \
															
 
																+	| tail -n +2 | cut -f1,2,4 -d $$'\t' > $@
															
 
																+
															
 
																+mikolov.an.tab: questions-words.txt
															
 
																+	egrep -v -E '^:' $^ | tr '[A-Z] ' '[a-z]\t' > $@
															
 
																+
															
 
																+msr.an.tab: myz_naacl13_test_set.tgz
															
 
																+	tar Oxfz $^ test_set/word_relationship.questions | tr ' ' '\t' > /tmp/q
															
 
																+	tar Oxfz $^ test_set/word_relationship.answers | cut -f2 -d ' ' > /tmp/a
															
 
																+	paste /tmp/q /tmp/a > $@
															
 
																+	rm -f /tmp/q /tmp/a
															
 
																+
															
 
																+
															
 
																+# wget commands to fetch the datasets.  Please see the original datasets for
															
 
																+# appropriate references if you use these.
															
 
																+ws353simrel.tar.gz:
															
 
																+	wget http://alfonseca.org/pubs/ws353simrel.tar.gz
															
 
																+
															
 
																+MEN.tar.gz:
															
 
																+	wget http://clic.cimec.unitn.it/~elia.bruni/resources/MEN.tar.gz
															
 
																+
															
 
																+Mtruk.csv:
															
 
																+	wget http://tx.technion.ac.il/~kirar/files/Mtruk.csv
															
 
																+
															
 
																+rw.zip:
															
 
																+	wget http://www-nlp.stanford.edu/~lmthang/morphoNLM/rw.zip
															
 
																+
															
 
																+SimLex-999.zip:
															
 
																+	wget http://www.cl.cam.ac.uk/~fh295/SimLex-999.zip
															
 
																+
															
 
																+questions-words.txt:
															
 
																+	wget http://word2vec.googlecode.com/svn/trunk/questions-words.txt
															
 
																+
															
 
																+myz_naacl13_test_set.tgz:
															
 
																+	wget http://research.microsoft.com/en-us/um/people/gzweig/Pubs/myz_naacl13_test_set.tgz
															
 
																+
															
 
																+analogy: analogy.cc
															
 
																+
															
 
																+clean:
															
 
																+	rm -f *.ws.tab *.an.tab analogy *.pyc
															
 
																+
															
 
																+distclean: clean
															
 
																+	rm -f *.tgz *.tar.gz *.zip Mtruk.csv questions-words.txt
															
--- a/swivel/fastprep.cc
+++ b/swivel/fastprep.cc
@@ -0,0 +1,680 @@
 
																+/* -*- Mode: C++ -*- */
															
 
																+
															
 
																+/*
															
 
																+ * Copyright 2016 Google Inc. All Rights Reserved.
															
 
																+ *
															
 
																+ * Licensed under the Apache License, Version 2.0 (the "License");
															
 
																+ * you may not use this file except in compliance with the License.
															
 
																+ * You may obtain a copy of the License at
															
 
																+ *
															
 
																+ *     http://www.apache.org/licenses/LICENSE-2.0
															
 
																+ *
															
 
																+ * Unless required by applicable law or agreed to in writing, software
															
 
																+ * distributed under the License is distributed on an "AS IS" BASIS,
															
 
																+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
															
 
																+ * See the License for the specific language governing permissions and
															
 
																+ * limitations under the License.
															
 
																+ */
															
 
																+
															
 
																+/*
															
 
																+ * This program starts with a text file (and optionally a vocabulary file) and
															
 
																+ * computes co-occurrence statistics. It emits output in a format that can be
															
 
																+ * consumed by the "swivel" program.  It's functionally equivalent to "prep.py",
															
 
																+ * but works much more quickly.
															
 
																+ */
															
 
																+
															
 
																+#include <assert.h>
															
 
																+#include <fcntl.h>
															
 
																+#include <pthread.h>
															
 
																+#include <stdio.h>
															
 
																+#include <sys/mman.h>
															
 
																+#include <sys/stat.h>
															
 
																+#include <unistd.h>
															
 
																+
															
 
																+#include <algorithm>
															
 
																+#include <fstream>
															
 
																+#include <iomanip>
															
 
																+#include <iostream>
															
 
																+#include <map>
															
 
																+#include <string>
															
 
																+#include <tuple>
															
 
																+#include <unordered_map>
															
 
																+#include <vector>
															
 
																+
															
 
																+#include "google/protobuf/io/zero_copy_stream_impl.h"
															
 
																+#include "tensorflow/core/example/example.pb.h"
															
 
																+#include "tensorflow/core/example/feature.pb.h"
															
 
																+
															
 
																+static const char usage[] = R"(
															
 
																+Prepares a corpus for processing by Swivel.
															
 
																+
															
 
																+Usage:
															
 
																+
															
 
																+  fastprep --output_dir <output-dir> --input <text-file>
															
 
																+
															
 
																+Options:
															
 
																+
															
 
																+  --input <filename>
															
 
																+      The input text.
															
 
																+
															
 
																+  --output_dir <directory>
															
 
																+      Specifies the output directory where the various Swivel data
															
 
																+      files should be placed.  This directory must exist.
															
 
																+
															
 
																+  --shard_size <int>
															
 
																+      Specifies the shard size; default 4096.
															
 
																+
															
 
																+  --min_count <int>
															
 
																+      The minimum number of times a word should appear to be included in the
															
 
																+      generated vocabulary; default 5.  (Ignored if --vocab is used.)
															
 
																+
															
 
																+  --max_vocab <int>
															
 
																+      The maximum vocabulary size to generate from the input corpus; default
															
 
																+      102,400.  (Ignored if --vocab is used.)
															
 
																+
															
 
																+  --vocab <filename>
															
 
																+      Use the specified unigram vocabulary instead of generating
															
 
																+      it from the corpus.
															
 
																+
															
 
																+  --window_size <int>
															
 
																+      Specifies the window size for computing co-occurrence stats;
															
 
																+      default 10.
															
 
																+)";
															
 
																+
															
 
																+struct cooc_t {
															
 
																+  int row;
															
 
																+  int col;
															
 
																+  float cnt;
															
 
																+};
															
 
																+
															
 
																+typedef std::map<long long, float> cooc_counts_t;
															
 
																+
															
 
																+// Retrieves the next word from the input stream, treating words as simply being
															
 
																+// delimited by whitespace.  Returns true if this is the end of a "sentence";
															
 
																+// i.e., a newline.
															
 
																+bool NextWord(std::ifstream &fin, std::string* word) {
															
 
																+  std::string buf;
															
 
																+  char c;
															
 
																+
															
 
																+  if (fin.eof()) {
															
 
																+    word->erase();
															
 
																+    return true;
															
 
																+  }
															
 
																+
															
 
																+  // Skip leading whitespace.
															
 
																+  do {
															
 
																+    c = fin.get();
															
 
																+  } while (!fin.eof() && std::isspace(c));
															
 
																+
															
 
																+  if (fin.eof()) {
															
 
																+    word->erase();
															
 
																+    return true;
															
 
																+  }
															
 
																+
															
 
																+  // Read the next word.
															
 
																+  do {
															
 
																+    buf += c;
															
 
																+    c = fin.get();
															
 
																+  } while (!fin.eof() && !std::isspace(c));
															
 
																+
															
 
																+  *word = buf;
															
 
																+  if (c == '\n' || fin.eof()) return true;
															
 
																+
															
 
																+  // Skip trailing whitespace.
															
 
																+  do {
															
 
																+    c = fin.get();
															
 
																+  } while (!fin.eof() && std::isspace(c));
															
 
																+
															
 
																+  if (fin.eof()) return true;
															
 
																+
															
 
																+  fin.unget();
															
 
																+  return false;
															
 
																+}
															
 
																+
															
 
																+// Creates a vocabulary from the most frequent terms in the input file.
															
 
																+std::vector<std::string> CreateVocabulary(const std::string input_filename,
															
 
																+                                          const int shard_size,
															
 
																+                                          const int min_vocab_count,
															
 
																+                                          const int max_vocab_size) {
															
 
																+  std::vector<std::string> vocab;
															
 
																+
															
 
																+  // Count all the distinct tokens in the file.  (XXX this will eventually
															
 
																+  // consume all memory and should be re-written to periodically trim the data.)
															
 
																+  std::unordered_map<std::string, long long> counts;
															
 
																+
															
 
																+  std::ifstream fin(input_filename, std::ifstream::ate);
															
 
																+
															
 
																+  if (!fin) {
															
 
																+    std::cerr << "couldn't read input file '" << input_filename << "'"
															
 
																+              << std::endl;
															
 
																+
															
 
																+    return vocab;
															
 
																+  }
															
 
																+
															
 
																+  const auto input_size = fin.tellg();
															
 
																+  fin.seekg(0);
															
 
																+
															
 
																+  long long ntokens = 0;
															
 
																+  while (!fin.eof()) {
															
 
																+    std::string word;
															
 
																+    NextWord(fin, &word);
															
 
																+    if (!word.empty()) counts[word] += 1;
															
 
																+
															
 
																+    if (++ntokens % 1000000 == 0) {
															
 
																+      const float pct = 100.0 * static_cast<float>(fin.tellg()) / input_size;
															
 
																+      fprintf(stdout, "\rComputing vocabulary: %0.1f%% complete...", pct);
															
 
																+      std::flush(std::cout);
															
 
																+    }
															
 
																+  }
															
 
																+
															
 
																+  std::cout << counts.size() << " distinct tokens" << std::endl;
															
 
																+
															
 
																+  // Sort the vocabulary from most frequent to least frequent.
															
 
																+  std::vector<std::pair<std::string, long long>> buf;
															
 
																+  std::copy(counts.begin(), counts.end(), std::back_inserter(buf));
															
 
																+  std::sort(buf.begin(), buf.end(),
															
 
																+            [](const std::pair<std::string, long long> &a,
															
 
																+               const std::pair<std::string, long long> &b) {
															
 
																+              return b.second < a.second;
															
 
																+            });
															
 
																+
															
 
																+  // Truncate to the maximum vocabulary size
															
 
																+  if (static_cast<int>(buf.size()) > max_vocab_size) buf.resize(max_vocab_size);
															
 
																+  if (buf.empty()) return vocab;
															
 
																+
															
 
																+  // Eliminate rare tokens and truncate to a multiple of the shard size.
															
 
																+  int vocab_size = buf.size();
															
 
																+  while (vocab_size > 0 && buf[vocab_size - 1].second < min_vocab_count)
															
 
																+    --vocab_size;
															
 
																+
															
 
																+  vocab_size -= vocab_size % shard_size;
															
 
																+  if (static_cast<int>(buf.size()) > vocab_size) buf.resize(vocab_size);
															
 
																+
															
 
																+  // Copy out the tokens.
															
 
																+  for (const auto& pair : buf) vocab.push_back(pair.first);
															
 
																+
															
 
																+  return vocab;
															
 
																+}
															
 
																+
															
 
																+std::vector<std::string> ReadVocabulary(const std::string vocab_filename) {
															
 
																+  std::vector<std::string> vocab;
															
 
																+
															
 
																+  std::ifstream fin(vocab_filename);
															
 
																+  int index = 0;
															
 
																+  for (std::string token; std::getline(fin, token); ++index) {
															
 
																+    auto n = token.find('\t');
															
 
																+    if (n != std::string::npos) token = token.substr(0, n);
															
 
																+
															
 
																+    vocab.push_back(token);
															
 
																+  }
															
 
																+
															
 
																+  return vocab;
															
 
																+}
															
 
																+
															
 
																+void WriteVocabulary(const std::vector<std::string> &vocab,
															
 
																+                     const std::string &output_dirname) {
															
 
																+  for (const std::string filename : {"row_vocab.txt", "col_vocab.txt"}) {
															
 
																+    std::ofstream fout(output_dirname + "/" + filename);
															
 
																+    for (const auto &token : vocab) fout << token << std::endl;
															
 
																+  }
															
 
																+}
															
 
																+
															
 
																+// Manages accumulation of co-occurrence data into temporary disk buffer files.
															
 
																+class CoocBuffer {
															
 
																+ public:
															
 
																+  CoocBuffer(const std::string &output_dirname, const int num_shards,
															
 
																+             const int shard_size);
															
 
																+
															
 
																+  // Accumulate the co-occurrence counts to the buffer.
															
 
																+  void AccumulateCoocs(const cooc_counts_t &coocs);
															
 
																+
															
 
																+  // Read the buffer to produce shard files.
															
 
																+  void WriteShards();
															
 
																+
															
 
																+ protected:
															
 
																+  // The output directory. Also used for temporary buffer files.
															
 
																+  const std::string output_dirname_;
															
 
																+
															
 
																+  // The number of row/column shards.
															
 
																+  const int num_shards_;
															
 
																+
															
 
																+  // The number of elements per shard.
															
 
																+  const int shard_size_;
															
 
																+
															
 
																+  // Parallel arrays of temporary file paths and file descriptors.
															
 
																+  std::vector<std::string> paths_;
															
 
																+  std::vector<int> fds_;
															
 
																+
															
 
																+  // Ensures that only one buffer file is getting written at a time.
															
 
																+  pthread_mutex_t writer_mutex_;
															
 
																+};
															
 
																+
															
 
																+CoocBuffer::CoocBuffer(const std::string &output_dirname, const int num_shards,
															
 
																+                       const int shard_size)
															
 
																+    : output_dirname_(output_dirname),
															
 
																+      num_shards_(num_shards),
															
 
																+      shard_size_(shard_size),
															
 
																+      writer_mutex_(PTHREAD_MUTEX_INITIALIZER) {
															
 
																+  for (int row = 0; row < num_shards_; ++row) {
															
 
																+    for (int col = 0; col < num_shards_; ++col) {
															
 
																+      char filename[256];
															
 
																+      sprintf(filename, "shard-%03d-%03d.tmp", row, col);
															
 
																+
															
 
																+      std::string path = output_dirname + "/" + filename;
															
 
																+      int fd = open(path.c_str(), O_RDWR | O_CREAT | O_TRUNC, 0666);
															
 
																+      assert(fd >= 0);
															
 
																+
															
 
																+      paths_.push_back(path);
															
 
																+      fds_.push_back(fd);
															
 
																+    }
															
 
																+  }
															
 
																+}
															
 
																+
															
 
																+void CoocBuffer::AccumulateCoocs(const cooc_counts_t &coocs) {
															
 
																+  std::vector<std::vector<cooc_t>> bufs(fds_.size());
															
 
																+
															
 
																+  for (const auto &cooc : coocs) {
															
 
																+    const int row_id = cooc.first >> 32;
															
 
																+    const int col_id = cooc.first & 0xffffffff;
															
 
																+    const float cnt = cooc.second;
															
 
																+
															
 
																+    const int row_shard = row_id % num_shards_;
															
 
																+    const int row_off = row_id / num_shards_;
															
 
																+    const int col_shard = col_id % num_shards_;
															
 
																+    const int col_off = col_id / num_shards_;
															
 
																+
															
 
																+    const int top_shard_idx = row_shard * num_shards_ + col_shard;
															
 
																+    bufs[top_shard_idx].push_back(cooc_t{row_off, col_off, cnt});
															
 
																+
															
 
																+    const int bot_shard_idx = col_shard * num_shards_ + row_shard;
															
 
																+    bufs[bot_shard_idx].push_back(cooc_t{col_off, row_off, cnt});
															
 
																+  }
															
 
																+
															
 
																+  // Append each buffer to its temporary file while holding the writer mutex.
															
 
																+  for (int i = 0; i < static_cast<int>(fds_.size()); ++i) {
															
 
																+    int rv = pthread_mutex_lock(&writer_mutex_);
															
 
																+    assert(rv == 0);
															
 
																+    const int nbytes = bufs[i].size() * sizeof(cooc_t);
															
 
																+    int nwritten = write(fds_[i], bufs[i].data(), nbytes);
															
 
																+    assert(nwritten == nbytes);
															
 
																+    pthread_mutex_unlock(&writer_mutex_);
															
 
																+  }
															
 
																+}
															
 
																+
															
 
																+void CoocBuffer::WriteShards() {
															
 
																+  for (int shard = 0; shard < static_cast<int>(fds_.size()); ++shard) {
															
 
																+    const int row_shard = shard / num_shards_;
															
 
																+    const int col_shard = shard % num_shards_;
															
 
																+
															
 
																+    std::cout << "\rwriting shard " << (shard + 1) << "/"
															
 
																+              << (num_shards_ * num_shards_);
															
 
																+    std::flush(std::cout);
															
 
																+
															
 
																+    // Construct the tf::Example proto.  First, we add the global rows and
															
 
																+    // columns that are present in the shard.
															
 
																+    tensorflow::Example example;
															
 
																+
															
 
																+    auto &feature = *example.mutable_features()->mutable_feature();
															
 
																+    auto global_row = feature["global_row"].mutable_int64_list();
															
 
																+    auto global_col = feature["global_col"].mutable_int64_list();
															
 
																+
															
 
																+    for (int i = 0; i < shard_size_; ++i) {
															
 
																+      global_row->add_value(row_shard + i * num_shards_);
															
 
																+      global_col->add_value(col_shard + i * num_shards_);
															
 
																+    }
															
 
																+
															
 
																+    // Next we add co-occurrences as a sparse representation.  Map the
															
 
																+    // co-occurrence counts that we've spooled off to disk: these are in
															
 
																+    // arbitrary order and may contain duplicates.
															
 
																+    const off_t nbytes = lseek(fds_[shard], 0, SEEK_END);
															
 
																+    cooc_t *coocs = static_cast<cooc_t*>(
															
 
																+        mmap(0, nbytes, PROT_READ | PROT_WRITE, MAP_SHARED, fds_[shard], 0));
															
 
																+
															
 
																+    const int ncoocs = nbytes / sizeof(cooc_t);
															
 
																+    cooc_t* cur = coocs;
															
 
																+    cooc_t* end = coocs + ncoocs;
															
 
																+
															
 
																+    auto sparse_value = feature["sparse_value"].mutable_float_list();
															
 
																+    auto sparse_local_row = feature["sparse_local_row"].mutable_int64_list();
															
 
																+    auto sparse_local_col = feature["sparse_local_col"].mutable_int64_list();
															
 
																+
															
 
																+    std::sort(cur, end, [](const cooc_t &a, const cooc_t &b) {
															
 
																+      return a.row < b.row || (a.row == b.row && a.col < b.col);
															
 
																+    });
															
 
																+
															
 
																+    // Accumulate the counts into the protocol buffer.
															
 
																+    int last_row = -1, last_col = -1;
															
 
																+    float count = 0;
															
 
																+    for (; cur != end; ++cur) {
															
 
																+      if (cur->row != last_row || cur->col != last_col) {
															
 
																+        if (last_row >= 0 && last_col >= 0) {
															
 
																+          sparse_local_row->add_value(last_row);
															
 
																+          sparse_local_col->add_value(last_col);
															
 
																+          sparse_value->add_value(count);
															
 
																+        }
															
 
																+
															
 
																+        last_row = cur->row;
															
 
																+        last_col = cur->col;
															
 
																+        count = 0;
															
 
																+      }
															
 
																+
															
 
																+      count += cur->cnt;
															
 
																+    }
															
 
																+
															
 
																+    if (last_row >= 0 && last_col >= 0) {
															
 
																+      sparse_local_row->add_value(last_row);
															
 
																+      sparse_local_col->add_value(last_col);
															
 
																+      sparse_value->add_value(count);
															
 
																+    }
															
 
																+
															
 
																+    munmap(coocs, nbytes);
															
 
																+    close(fds_[shard]);
															
 
																+
															
 
																+    // Write the protocol buffer as a binary blob to disk.
															
 
																+    char filename[256];
															
 
																+    snprintf(filename, sizeof(filename), "shard-%03d-%03d.pb", row_shard,
															
 
																+             col_shard);
															
 
																+
															
 
																+    const std::string path = output_dirname_ + "/" + filename;
															
 
																+    int fd = open(path.c_str(), O_WRONLY | O_TRUNC | O_CREAT, 0666);
															
 
																+    assert(fd != -1);
															
 
																+
															
 
																+    google::protobuf::io::FileOutputStream fout(fd);
															
 
																+    example.SerializeToZeroCopyStream(&fout);
															
 
																+    fout.Close();
															
 
																+
															
 
																+    // Remove the temporary file.
															
 
																+    unlink(paths_[shard].c_str());
															
 
																+  }
															
 
																+
															
 
																+  std::cout << std::endl;
															
 
																+}
															
 
																+
															
 
																+// Counts the co-occurrences in part of the file.
															
 
																+class CoocCounter {
															
 
																+ public:
															
 
																+  CoocCounter(const std::string &input_filename, const off_t start,
															
 
																+              const off_t end, const int window_size,
															
 
																+              const std::unordered_map<std::string, int> &token_to_id_map,
															
 
																+              CoocBuffer *coocbuf)
															
 
																+      : fin_(input_filename, std::ifstream::ate),
															
 
																+        start_(start),
															
 
																+        end_(end),
															
 
																+        window_size_(window_size),
															
 
																+        token_to_id_map_(token_to_id_map),
															
 
																+        coocbuf_(coocbuf),
															
 
																+        marginals_(token_to_id_map.size()) {}
															
 
																+
															
 
																+  // Pthreads-friendly thunk to Count.
															
 
																+  static void* Run(void* param) {
															
 
																+    CoocCounter* self = static_cast<CoocCounter*>(param);
															
 
																+    self->Count();
															
 
																+    return nullptr;
															
 
																+  }
															
 
																+
															
 
																+  // Counts the co-occurrences.
															
 
																+  void Count();
															
 
																+
															
 
																+  const std::vector<double>& Marginals() const { return marginals_; }
															
 
																+
															
 
																+ protected:
															
 
																+  // The input stream.
															
 
																+  std::ifstream fin_;
															
 
																+
															
 
																+  // The range of the file to which this counter should attend.
															
 
																+  const off_t start_;
															
 
																+  const off_t end_;
															
 
																+
															
 
																+  // The window size for computing co-occurrences.
															
 
																+  const int window_size_;
															
 
																+
															
 
																+  // A reference to the mapping from tokens to IDs.
															
 
																+  const std::unordered_map<std::string, int> &token_to_id_map_;
															
 
																+
															
 
																+  // The buffer into which counts are to be accumulated.
															
 
																+  CoocBuffer* coocbuf_;
															
 
																+
															
 
																+  // The marginal counts accumulated by this counter.
															
 
																+  std::vector<double> marginals_;
															
 
																+};
															
 
																+
															
 
																+void CoocCounter::Count() {
															
 
																+  const int max_coocs_size = 16 * 1024 * 1024;
															
 
																+
															
 
																+  // A buffer of co-occurrence counts that we'll periodically sort into
															
 
																+  // shards.
															
 
																+  cooc_counts_t coocs;
															
 
																+
															
 
																+  fin_.seekg(start_);
															
 
																+
															
 
																+  int nlines = 0;
															
 
																+  for (off_t filepos = start_; filepos < end_; filepos = fin_.tellg()) {
															
 
																+    // Buffer a single sentence.
															
 
																+    std::vector<int> sentence;
															
 
																+    bool eos;
															
 
																+    do {
															
 
																+      std::string word;
															
 
																+      eos = NextWord(fin_, &word);
															
 
																+      auto it = token_to_id_map_.find(word);
															
 
																+      if (it != token_to_id_map_.end()) sentence.push_back(it->second);
															
 
																+    } while (!eos);
															
 
																+
															
 
																+    // Generate the co-occurrences for the sentence.
															
 
																+    for (int pos = 0; pos < static_cast<int>(sentence.size()); ++pos) {
															
 
																+      const int left_id = sentence[pos];
															
 
																+
															
 
																+      const int window_extent =
															
 
																+          std::min(static_cast<int>(sentence.size()) - pos, 1 + window_size_);
															
 
																+
															
 
																+      for (int off = 1; off < window_extent; ++off) {
															
 
																+        const int right_id = sentence[pos + off];
															
 
																+        const double count = 1.0 / static_cast<double>(off);
															
 
																+        const long long lo = std::min(left_id, right_id);
															
 
																+        const long long hi = std::max(left_id, right_id);
															
 
																+        const long long key = (hi << 32) | lo;
															
 
																+        coocs[key] += count;
															
 
																+
															
 
																+        marginals_[left_id] += count;
															
 
																+        marginals_[right_id] += count;
															
 
																+      }
															
 
																+
															
 
																+      marginals_[left_id] += 1.0;
															
 
																+      const long long key = (static_cast<long long>(left_id) << 32) |
															
 
																+                            static_cast<long long>(left_id);
															
 
																+
															
 
																+      coocs[key] += 0.5;
															
 
																+    }
															
 
																+
															
 
																+    // Periodically flush the co-occurrences to disk.
															
 
																+    if (coocs.size() > max_coocs_size) {
															
 
																+      coocbuf_->AccumulateCoocs(coocs);
															
 
																+      coocs.clear();
															
 
																+    }
															
 
																+
															
 
																+    if (start_ == 0 && ++nlines % 1000 == 0) {
															
 
																+      const double pct = 100.0 * filepos / end_;
															
 
																+      fprintf(stdout, "\rComputing co-occurrences: %0.1f%% complete...", pct);
															
 
																+      std::flush(std::cout);
															
 
																+    }
															
 
																+  }
															
 
																+
															
 
																+  // Accumulate anything we haven't flushed yet.
															
 
																+  coocbuf_->AccumulateCoocs(coocs);
															
 
																+
															
 
																+  if (start_ == 0) std::cout << "done." << std::endl;
															
 
																+}
															
 
																+
															
 
																+void WriteMarginals(const std::vector<double> &marginals,
															
 
																+                    const std::string &output_dirname) {
															
 
																+  for (const std::string filename : {"row_sums.txt", "col_sums.txt"}) {
															
 
																+    std::ofstream fout(output_dirname + "/" + filename);
															
 
																+    fout.setf(std::ios::fixed);
															
 
																+    for (double sum : marginals) fout << sum << std::endl;
															
 
																+  }
															
 
																+}
															
 
																+
															
 
																+int main(int argc, char *argv[]) {
															
 
																+  std::string input_filename;
															
 
																+  std::string vocab_filename;
															
 
																+  std::string output_dirname;
															
 
																+  bool generate_vocab = true;
															
 
																+  int max_vocab_size = 100 * 1024;
															
 
																+  int min_vocab_count = 5;
															
 
																+  int window_size = 10;
															
 
																+  int shard_size = 4096;
															
 
																+  int num_threads = 4;
															
 
																+
															
 
																+  for (int i = 1; i < argc; ++i) {
															
 
																+    std::string arg(argv[i]);
															
 
																+    if (arg == "--vocab") {
															
 
																+      if (++i >= argc) goto argmissing;
															
 
																+      generate_vocab = false;
															
 
																+      vocab_filename = argv[i];
															
 
																+    } else if (arg == "--max_vocab") {
															
 
																+      if (++i >= argc) goto argmissing;
															
 
																+      if ((max_vocab_size = atoi(argv[i])) <= 0) goto badarg;
															
 
																+    } else if (arg == "--min_count") {
															
 
																+      if (++i >= argc) goto argmissing;
															
 
																+      if ((min_vocab_count = atoi(argv[i])) <= 0) goto badarg;
															
 
																+    } else if (arg == "--window_size") {
															
 
																+      if (++i >= argc) goto argmissing;
															
 
																+      if ((window_size = atoi(argv[i])) <= 0) goto badarg;
															
 
																+    } else if (arg == "--input") {
															
 
																+      if (++i >= argc) goto argmissing;
															
 
																+      input_filename = argv[i];
															
 
																+    } else if (arg == "--output_dir") {
															
 
																+      if (++i >= argc) goto argmissing;
															
 
																+      output_dirname = argv[i];
															
 
																+    } else if (arg == "--shard_size") {
															
 
																+      if (++i >= argc) goto argmissing;
															
 
																+      shard_size = atoi(argv[i]);
															
 
																+    } else if (arg == "--num_threads") {
															
 
																+      if (++i >= argc) goto argmissing;
															
 
																+      num_threads = atoi(argv[i]);
															
 
																+    } else if (arg == "--help") {
															
 
																+      std::cout << usage << std::endl;
															
 
																+      return 0;
															
 
																+    } else {
															
 
																+      std::cerr << "unknown arg '" << arg << "'; try --help?" << std::endl;
															
 
																+      return 2;
															
 
																+    }
															
 
																+
															
 
																+    continue;
															
 
																+
															
 
																+  badarg:
															
 
																+    std::cerr << "'" << argv[i] << "' is not a valid value for '" << arg
															
 
																+              << "'; try --help?" << std::endl;
															
 
																+
															
 
																+    return 2;
															
 
																+
															
 
																+  argmissing:
															
 
																+    std::cerr << arg << " requires an argument; try --help?" << std::endl;
															
 
																+    return 2;
															
 
																+  }
															
 
																+
															
 
																+  if (input_filename.empty()) {
															
 
																+    std::cerr << "please specify the input text with '--input'; try --help?"
															
 
																+              << std::endl;
															
 
																+    return 2;
															
 
																+  }
															
 
																+
															
 
																+  if (output_dirname.empty()) {
															
 
																+    std::cerr << "please specify the output directory with '--output_dir'"
															
 
																+              << std::endl;
															
 
																+
															
 
																+    return 2;
															
 
																+  }
															
 
																+
															
 
																+  struct stat sb;
															
 
																+  if (lstat(output_dirname.c_str(), &sb) != 0 || !S_ISDIR(sb.st_mode)) {
															
 
																+    std::cerr << "output directory '" << output_dirname
															
 
																+              << "' does not exist or is not a directory." << std::endl;
															
 
																+
															
 
																+    return 1;
															
 
																+  }
															
 
																+
															
 
																+  if (lstat(input_filename.c_str(), &sb) != 0 || !S_ISREG(sb.st_mode)) {
															
 
																+    std::cerr << "input file '" << input_filename
															
 
																+              << "' does not exist or is not a file." << std::endl;
															
 
																+
															
 
																+    return 1;
															
 
																+  }
															
 
																+
															
 
																+  // The total size of the input.
															
 
																+  const off_t input_size = sb.st_size;
															
 
																+
															
 
																+  const std::vector<std::string> vocab =
															
 
																+      generate_vocab ? CreateVocabulary(input_filename, shard_size,
															
 
																+                                        min_vocab_count, max_vocab_size)
															
 
																+                     : ReadVocabulary(vocab_filename);
															
 
																+
															
 
																+  if (!vocab.size()) {
															
 
																+    std::cerr << "Empty vocabulary." << std::endl;
															
 
																+    return 1;
															
 
																+  }
															
 
																+
															
 
																+  std::cout << "Generating Swivel co-occurrence data into " << output_dirname
															
 
																+            << std::endl;
															
 
																+
															
 
																+  std::cout << "Shard size: " << shard_size << "x" << shard_size << std::endl;
															
 
																+  std::cout << "Vocab size: " << vocab.size() << std::endl;
															
 
																+
															
 
																+  // Write the vocabulary files into the output directory.
															
 
																+  WriteVocabulary(vocab, output_dirname);
															
 
																+
															
 
																+  const int num_shards = vocab.size() / shard_size;
															
 
																+  CoocBuffer coocbuf(output_dirname, num_shards, shard_size);
															
 
																+
															
 
																+  // Build a mapping from the token to its position in the vocabulary file.
															
 
																+  std::unordered_map<std::string, int> token_to_id_map;
															
 
																+  for (int i = 0; i < static_cast<int>(vocab.size()); ++i)
															
 
																+    token_to_id_map[vocab[i]] = i;
															
 
																+
															
 
																+  // Compute the co-occurrences
															
 
																+  std::vector<pthread_t> threads;
															
 
																+  std::vector<CoocCounter*> counters;
															
 
																+  const off_t nbytes_per_thread = input_size / num_threads;
															
 
																+
															
 
																+  pthread_attr_t attr;
															
 
																+  if (pthread_attr_init(&attr) != 0) {
															
 
																+    std::cerr << "unable to initialize pthreads" << std::endl;
															
 
																+    return 1;
															
 
																+  }
															
 
																+
															
 
																+  for (int i = 0; i < num_threads; ++i) {
															
 
																+    // We could make this smarter and look around for newlines.  But
															
 
																+    // realistically that's not going to change things much.
															
 
																+    const off_t start = i * nbytes_per_thread;
															
 
																+    const off_t end =
															
 
																+        i < num_threads - 1 ? (i + 1) * nbytes_per_thread : input_size;
															
 
																+
															
 
																+    CoocCounter *counter = new CoocCounter(
															
 
																+        input_filename, start, end, window_size, token_to_id_map, &coocbuf);
															
 
																+
															
 
																+    counters.push_back(counter);
															
 
																+
															
 
																+    pthread_t thread;
															
 
																+    pthread_create(&thread, &attr, CoocCounter::Run, counter);
															
 
																+
															
 
																+    threads.push_back(thread);
															
 
																+  }
															
 
																+
															
 
																+  // Wait for threads to finish and collect marginals.
															
 
																+  std::vector<double> marginals(vocab.size());
															
 
																+  for (int i = 0; i < num_threads; ++i) {
															
 
																+    pthread_join(threads[i], 0);
															
 
																+
															
 
																+    const std::vector<double>& counter_marginals = counters[i]->Marginals();
															
 
																+    for (int j = 0; j < static_cast<int>(vocab.size()); ++j)
															
 
																+      marginals[j] += counter_marginals[j];
															
 
																+
															
 
																+    delete counters[i];
															
 
																+  }
															
 
																+
															
 
																+  std::cout << "writing marginals..." << std::endl;
															
 
																+  WriteMarginals(marginals, output_dirname);
															
 
																+
															
 
																+  std::cout << "writing shards..." << std::endl;
															
 
																+  coocbuf.WriteShards();
															
 
																+
															
 
																+  return 0;
															
 
																+}
															
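
Review note on the key and shard arithmetic in `CoocCounter::Count` and `CoocBuffer::AccumulateCoocs` above: each pair of token ids is packed into one 64-bit key with the larger id in the high 32 bits, and the key is later split back into a row and a column, which are routed to a shard by `id % num_shards` with a local offset of `id / num_shards`. The sketch below isolates that arithmetic so it can be checked on its own. The names `PackKey`, `KeyRow`, `KeyCol`, `Locate`, and `ShardLocation` are hypothetical helpers introduced here for illustration; the patch itself inlines these expressions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type; not part of the patch.
struct ShardLocation {
  int shard_index;  // Index into the num_shards x num_shards grid of files.
  int row_offset;   // Row position inside that shard.
  int col_offset;   // Column position inside that shard.
};

// Packs two token ids into one 64-bit key, larger id in the high 32 bits,
// mirroring the (hi << 32) | lo expression in CoocCounter::Count.
inline int64_t PackKey(int a, int b) {
  assert(a >= 0 && b >= 0);
  const int64_t lo = a < b ? a : b;
  const int64_t hi = a < b ? b : a;
  return (hi << 32) | lo;
}

// Recovers the row (high half) and column (low half) from a packed key.
inline int KeyRow(int64_t key) { return static_cast<int>(key >> 32); }
inline int KeyCol(int64_t key) { return static_cast<int>(key & 0xffffffff); }

// Maps a global (row, col) pair to its shard and the offsets inside it.
inline ShardLocation Locate(int row_id, int col_id, int num_shards) {
  assert(num_shards > 0);
  const int row_shard = row_id % num_shards;
  const int col_shard = col_id % num_shards;
  return ShardLocation{row_shard * num_shards + col_shard,
                       row_id / num_shards, col_id / num_shards};
}
```

Under this layout, the global id is recovered as `shard + offset * num_shards`, which matches the `global_row` and `global_col` values that `WriteShards` emits.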
--- a/swivel/fastprep.mk
+++ b/swivel/fastprep.mk
@@ -0,0 +1,87 @@
 
																+# -*- Mode: Makefile -*-
															
 
																+
															
 
																+#
															
 
																+# Copyright 2016 Google Inc. All Rights Reserved.
															
 
																+#
															
 
																+# Licensed under the Apache License, Version 2.0 (the "License");
															
 
																+# you may not use this file except in compliance with the License.
															
 
																+# You may obtain a copy of the License at
															
 
																+#
															
 
																+#     http://www.apache.org/licenses/LICENSE-2.0
															
 
																+#
															
 
																+# Unless required by applicable law or agreed to in writing, software
															
 
																+# distributed under the License is distributed on an "AS IS" BASIS,
															
 
																+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
															
 
																+# See the License for the specific language governing permissions and
															
 
																+# limitations under the License.
															
 
																+
															
 
																+
															
 
																+
															
 
																+# This makefile builds "fastprep", a faster version of prep.py that can be used
															
 
																+# to build training data for Swivel.  Building "fastprep" is a bit more
															
 
																+# involved: you'll need to pull and build the Tensorflow source, and then build
															
 
																+# and install compatible protobuf software.  We've tested this with Tensorflow
															
 
																+# version 0.7.
															
 
																+#
															
 
																+# = Step 1. Pull and Build Tensorflow. =
															
 
																+#
															
 
																+# These instructions are somewhat abridged; for pre-requisites and the most
															
 
																+# up-to-date instructions, refer to:
															
 
																+#
															
 
																+#   <https://www.tensorflow.org/versions/r0.7/get_started/os_setup.html#installing-from-sources>
															
 
																+#
															
 
																+# To build the Tensorflow components required for "fastprep", you'll need to
															
 
																+# install Bazel, Numpy, Swig, and Python development headers as described at
															
 
																+# the above URL.  Run the "configure" script as appropriate for your
															
 
																+# environment and then build the "build_pip_package" target:
															
 
																+#
															
 
																+#   bazel build -c opt [--config=cuda] //tensorflow/tools/pip_package:build_pip_package
															
 
																+#
															
 
																+# This will generate the Tensorflow headers and libraries necessary for
															
 
																+# "fastprep".
															
 
																+#
															
 
																+#
															
 
																+# = Step 2. Build and Install Compatible Protobuf Libraries =
															
 
																+#
															
 
																+# "fastprep" also needs compatible protocol buffer libraries, which you can
															
 
																+# build from the protobuf implementation included with the Tensorflow
															
 
																+# distribution:
															
 
																+#
															
 
																+#   cd ${TENSORFLOW_SRCDIR}/google/protobuf
															
 
																+#   ./autogen.sh
															
 
																+#   ./configure --prefix=${HOME}  # ...or whatever
															
 
																+#   make
															
 
																+#   make install  # ...or maybe "sudo make install"
															
 
																+#
															
 
																+# This will install the headers and libraries appropriately.
															
 
																+#
															
 
																+#
															
 
																+# = Step 3. Build "fastprep". =
															
 
																+#
															
 
																+# Finally, modify this file (if necessary) so that PB_DIR and TF_DIR refer to
															
 
																+# the appropriate locations, and then run:
															
 
																+#
															
 
																+#   make -f fastprep.mk
															
 
																+#
															
 
																+# If all goes well, you should have a program that is "flag compatible" with
															
 
																+# "prep.py" and runs significantly faster.  Use it to generate the co-occurrence
															
 
																+# matrices and other files necessary to train a Swivel matrix.
															
 
																+
															
 
																+
															
 
																+# The root directory where the Google Protobuf software is installed.
															
 
																+# Alternative locations might be "/usr" or "/usr/local".
															
 
																+PB_DIR=$(HOME)
															
 
																+
															
 
																+# Assuming you've got the Tensorflow source unpacked and built in ${HOME}/src:
															
 
																+TF_DIR=$(HOME)/src/tensorflow
															
 
																+
															
 
																+PB_INCLUDE=$(PB_DIR)/include
															
 
																+TF_INCLUDE=$(TF_DIR)/bazel-genfiles
															
 
																+CXXFLAGS=-std=c++11 -m64 -mavx -g -Ofast -Wall -I$(TF_INCLUDE) -I$(PB_INCLUDE)
															
 
																+
															
 
																+PB_LIBDIR=$(PB_DIR)/lib
															
 
																+TF_LIBDIR=$(TF_DIR)/bazel-bin/tensorflow/core
															
 
																+LDFLAGS=-L$(TF_LIBDIR) -L$(PB_LIBDIR)
															
 
																+LDLIBS=-lprotos_all_cc -lprotobuf -lpthread -lm
															
 
																+
															
 
																+fastprep: fastprep.cc
															
--- a/swivel/nearest.py
+++ b/swivel/nearest.py
@@ -0,0 +1,75 @@
 
																+#!/usr/bin/env python
															
 
																+#
															
 
																+# Copyright 2016 Google Inc. All Rights Reserved.
															
 
																+#
															
 
																+# Licensed under the Apache License, Version 2.0 (the "License");
															
 
																+# you may not use this file except in compliance with the License.
															
 
																+# You may obtain a copy of the License at
															
 
																+#
															
 
																+#     http://www.apache.org/licenses/LICENSE-2.0
															
 
																+#
															
 
																+# Unless required by applicable law or agreed to in writing, software
															
 
																+# distributed under the License is distributed on an "AS IS" BASIS,
															
 
																+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
															
 
																+# See the License for the specific language governing permissions and
															
 
																+# limitations under the License.
															
 
																+
															
 
																+"""Simple tool for inspecting nearest neighbors and analogies."""
															
 
																+
															
 
																+import re
															
 
																+import sys
															
 
																+from getopt import GetoptError, getopt
															
 
																+
															
 
																+from vecs import Vecs
															
 
																+
															
 
																+try:
															
 
																+  opts, args = getopt(sys.argv[1:], 'v:e:', ['vocab=', 'embeddings='])
															
 
																+except GetoptError, e:
															
 
																+  print >> sys.stderr, e
															
 
																+  sys.exit(2)
															
 
																+
															
 
																+opt_vocab = 'vocab.txt'
															
 
																+opt_embeddings = None
															
 
																+
															
 
																+for o, a in opts:
															
 
																+  if o in ('-v', '--vocab'):
															
 
																+    opt_vocab = a
															
 
																+  if o in ('-e', '--embeddings'):
															
 
																+    opt_embeddings = a
															
 
																+
															
 
																+vecs = Vecs(opt_vocab, opt_embeddings)
															
 
																+
															
 
																+while True:
															
 
																+  sys.stdout.write('query> ')
															
 
																+  sys.stdout.flush()
															
 
																+
															
 
																+  query = sys.stdin.readline().strip()
															
 
																+  if not query:
															
 
																+    break
															
 
																+
															
 
																+  parts = re.split(r'\s+', query)
															
 
																+
															
 
																+  if len(parts) == 1:
															
 
																+    res = vecs.neighbors(parts[0])
															
 
																+
															
 
																+  elif len(parts) == 3:
															
 
																+    vs = [vecs.lookup(w) for w in parts]
															
 
																+    if any(v is None for v in vs):
															
 
																+      print 'not in vocabulary: %s' % (
															
 
																+          ', '.join(tok for tok, v in zip(parts, vs) if v is None))
															
 
																+
															
 
																+      continue
															
 
																+
															
 
																+    res = vecs.neighbors(vs[2] - vs[0] + vs[1])
															
 
																+
															
 
																+  else:
															
 
																+    print 'use a single word to query neighbors, or three words for analogy'
															
 
																+    continue
															
 
																+
															
 
																+  if not res:
															
 
																+    continue
															
 
																+
															
 
																+  for word, sim in res[:20]:
															
 
																+    print '%0.4f: %s' % (sim, word)
															
 
																+
															
 
																+  print
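
The analogy branch of `nearest.py` above builds the query vector `vs[2] - vs[0] + vs[1]` and ranks the vocabulary by similarity to it. Below is a minimal, self-contained sketch of that arithmetic, written for Python 3 with only the standard library. The toy vocabulary, the three-dimensional vectors, and the helper names `cosine` and `analogy` are assumptions made for illustration. The real `Vecs` class in `vecs.py` is not shown in this part of the diff and may normalize or rank differently.

```python
import math

# Hypothetical toy embeddings; real Swivel vectors are learned and much larger.
TOY_VECS = {
    'man': [1.0, 0.0, 0.0],
    'woman': [1.0, 1.0, 0.0],
    'king': [1.0, 0.0, 1.0],
    'queen': [1.0, 1.0, 1.0],
}


def cosine(u, v):
    """Returns the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vecs=TOY_VECS):
    """Ranks words by similarity to vec(c) - vec(a) + vec(b)."""
    query = [z - x + y for x, y, z in zip(vecs[a], vecs[b], vecs[c])]
    ranked = [(word, cosine(query, vec)) for word, vec in vecs.items()]
    ranked.sort(key=lambda pair: -pair[1])
    return ranked


if __name__ == '__main__':
    # Same output format as the loop in nearest.py.
    for word, sim in analogy('man', 'woman', 'king'):
        print('%0.4f: %s' % (sim, word))
```

With these toy vectors, the query `man woman king` ranks `queen` first, because the query vector coincides with its embedding.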
															
--- a/swivel/prep.py
+++ b/swivel/prep.py
@@ -0,0 +1,317 @@
 
																+#!/usr/bin/env python
															
 
																+#
															
 
																+# Copyright 2016 Google Inc. All Rights Reserved.
															
 
																+#
															
 
																+# Licensed under the Apache License, Version 2.0 (the "License");
															
 
																+# you may not use this file except in compliance with the License.
															
 
																+# You may obtain a copy of the License at
															
 
																+#
															
 
																+#     http://www.apache.org/licenses/LICENSE-2.0
															
 
																+#
															
 
																+# Unless required by applicable law or agreed to in writing, software
															
 
																+# distributed under the License is distributed on an "AS IS" BASIS,
															
 
																+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
															
 
																+# See the License for the specific language governing permissions and
															
 
																+# limitations under the License.
															
 
																+
															
 
																+"""Prepare a corpus for processing by swivel.
															
 
																+
															
 
																+Creates a sharded word co-occurrence matrix from a text file input corpus.
															
 
																+
															
 
																+Usage:
															
 
																+
															
 
																+  prep.py --output_dir <output-dir> --input <text-file>
															
 
																+
															
 
																+Options:
															
 
																+
															
 
																+  --input <filename>
															
 
																+      The input text.
															
 
																+
															
 
																+  --output_dir <directory>
															
 
																+      Specifies the output directory where the various Swivel data
															
 
																+      files should be placed.
															
 
																+
															
 
																+  --shard_size <int>
															
 
																+      Specifies the shard size; default 4096.
															
 
																+
															
 
																+  --min_count <int>
															
 
																+      Specifies the minimum number of times a word should appear
															
 
																+      to be included in the vocabulary; default 5.
															
 
																+
															
 
																+  --max_vocab <int>
															
 
																+      Specifies the maximum vocabulary size; default 262144
															
 
																+      (4096 * 64).
															
 
																+
															
 
																+  --vocab <filename>
															
 
																+      Use the specified unigram vocabulary instead of generating
															
 
																+      it from the corpus.
															
 
																+
															
 
																+  --window_size <int>
															
 
																+      Specifies the window size for computing co-occurrence stats;
															
 
																+      default 10.
															
 
																+
															
 
																+  --bufsz <int>
															
 
																+      The number of co-occurrences that are buffered; default 16M.
															
 
																+
															
 
																+"""
															
 
																+
															
 
																+import itertools
															
 
																+import math
															
 
																+import os
															
 
																+import struct
															
 
																+import sys
															
 
																+
															
 
																+import tensorflow as tf
															
 
																+
															
 
																+flags = tf.app.flags
															
 
																+
															
 
																+flags.DEFINE_string('input', '', 'The input text.')
															
 
																+flags.DEFINE_string('output_dir', '/tmp/swivel_data',
															
 
																+                    'Output directory for Swivel data')
															
 
																+flags.DEFINE_integer('shard_size', 4096, 'The size for each shard')
															
 
																+flags.DEFINE_integer('min_count', 5,
															
 
																+                     'The minimum number of times a word should occur to be '
															
 
																+                     'included in the vocabulary')
															
 
																+flags.DEFINE_integer('max_vocab', 4096 * 64, 'The maximum vocabulary size')
															
 
																+flags.DEFINE_string('vocab', '', 'Vocabulary to use instead of generating one')
															
 
																+flags.DEFINE_integer('window_size', 10, 'The window size')
															
 
																+flags.DEFINE_integer('bufsz', 16 * 1024 * 1024,
															
 
																+                     'The number of co-occurrences to buffer')
															
 
																+
															
 
																+FLAGS = flags.FLAGS
															
 
																+
															
 
																+shard_cooc_fmt = struct.Struct('iif')
															
 
																+
															
 
																+
															
 
																+def words(line):
															
 
																+  """Splits a line of text into tokens."""
															
 
																+  return line.strip().split()
															
 
																+
															
 
																+
															
 
																+def create_vocabulary(lines):
															
 
																+  """Reads text lines and generates a vocabulary."""
															
 
																+  lines.seek(0, os.SEEK_END)
															
 
																+  nbytes = lines.tell()
															
 
																+  lines.seek(0, os.SEEK_SET)
															
 
																+
															
 
																+  vocab = {}
															
 
																+  for lineno, line in enumerate(lines, start=1):
															
 
																+    for word in words(line):
															
 
																+      vocab.setdefault(word, 0)
															
 
																+      vocab[word] += 1
															
 
																+
															
 
																+    if lineno % 100000 == 0:
															
 
																+      pos = lines.tell()
															
 
																+      sys.stdout.write('\rComputing vocabulary: %0.1f%% (%d/%d)...' % (
															
 
																+          100.0 * pos / nbytes, pos, nbytes))
															
 
																+      sys.stdout.flush()
															
 
																+
															
 
																+  sys.stdout.write('\n')
															
 
																+
															
 
																+  vocab = [(tok, n) for tok, n in vocab.iteritems() if n >= FLAGS.min_count]
															
 
																+  vocab.sort(key=lambda kv: (-kv[1], kv[0]))
															
 
																+
															
 
																+  num_words = max(len(vocab), FLAGS.shard_size)
															
 
																+  num_words = min(len(vocab), FLAGS.max_vocab)
															
 
																+  if num_words % FLAGS.shard_size != 0:
															
 
																+    num_words -= num_words % FLAGS.shard_size
															
 
																+
															
 
																+  if not num_words:
															
 
																+    raise Exception('empty vocabulary')
															
 
																+
															
 
																+  print 'vocabulary contains %d tokens' % num_words
															
 
																+
															
 
																+  vocab = vocab[:num_words]
															
 
																+  return [tok for tok, n in vocab]
															
 
																+
															
 
																+
															
 
																+def write_vocab_and_sums(vocab, sums, vocab_filename, sums_filename):
															
 
																+  """Writes vocabulary and marginal sum files."""
															
 
																+  with open(os.path.join(FLAGS.output_dir, vocab_filename), 'w') as vocab_out:
															
 
																+    with open(os.path.join(FLAGS.output_dir, sums_filename), 'w') as sums_out:
															
 
																+      for tok, cnt in itertools.izip(vocab, sums):
															
 
																+        print >> vocab_out, tok
															
 
																+        print >> sums_out, cnt
															
 
																+
															
 
																+
															
 
																+def compute_coocs(lines, vocab):
															
 
																+  """Compute the co-occurrence statistics from the text.
															
 
																+
															
 
																+  This generates a temporary file for each shard that contains the intermediate
															
 
																+  counts from the shard: these counts must be subsequently sorted and collated.
															
 
																+
															
 
																+  """
															
 
																+  word_to_id = {tok: idx for idx, tok in enumerate(vocab)}
															
 
																+
															
 
																+  lines.seek(0, os.SEEK_END)
															
 
																+  nbytes = lines.tell()
															
 
																+  lines.seek(0, os.SEEK_SET)
															
 
																+
															
 
																+  num_shards = len(vocab) / FLAGS.shard_size
															
 
																+
															
 
																+  shardfiles = {}
															
 
																+  for row in range(num_shards):
															
 
																+    for col in range(num_shards):
															
 
																+      filename = os.path.join(
															
 
																+          FLAGS.output_dir, 'shard-%03d-%03d.tmp' % (row, col))
															
 
																+
															
 
																+      shardfiles[(row, col)] = open(filename, 'w+')
															
 
																+
															
 
																+  def flush_coocs():
															
 
																+    for (row_id, col_id), cnt in coocs.iteritems():
															
 
																+      row_shard = row_id % num_shards
															
 
																+      row_off = row_id / num_shards
															
 
																+      col_shard = col_id % num_shards
															
 
																+      col_off = col_id / num_shards
															
 
																+
															
 
																+      # Since we only stored (a, b), we emit both (a, b) and (b, a).
															
 
																+      shardfiles[(row_shard, col_shard)].write(
															
 
																+          shard_cooc_fmt.pack(row_off, col_off, cnt))
															
 
																+
															
 
																+      shardfiles[(col_shard, row_shard)].write(
															
 
																+          shard_cooc_fmt.pack(col_off, row_off, cnt))
															
 
																+
															
 
																+  coocs = {}
															
 
																+  sums = [0.0] * len(vocab)
															
 
																+
															
 
																+  for lineno, line in enumerate(lines, start=1):
															
 
																+    # Computes the word IDs for each word in the sentence.  This has the effect
															
 
																+    # of "stretching" the window past OOV tokens.
															
 
																+    wids = filter(
															
 
																+        lambda wid: wid is not None,
															
 
																+        (word_to_id.get(w) for w in words(line)))
															
 
																+
															
 
																+    for pos in xrange(len(wids)):
															
 
																+      lid = wids[pos]
															
 
																+      window_extent = min(FLAGS.window_size + 1, len(wids) - pos)
															
 
																+      for off in xrange(1, window_extent):
															
 
																+        rid = wids[pos + off]
															
 
																+        pair = (min(lid, rid), max(lid, rid))
															
 
																+        count = 1.0 / off
															
 
																+        sums[lid] += count
															
 
																+        sums[rid] += count
															
 
																+        coocs.setdefault(pair, 0.0)
															
 
																+        coocs[pair] += count
															
 
																+
															
 
																+      sums[lid] += 1.0
															
 
																+      pair = (lid, lid)
															
 
																+      coocs.setdefault(pair, 0.0)
															
 
																+      coocs[pair] += 0.5  # Only add 1/2 since we output (a, b) and (b, a)
															
 
																+
															
 
																+    if lineno % 10000 == 0:
															
 
																+      pos = lines.tell()
															
 
																+      sys.stdout.write('\rComputing co-occurrences: %0.1f%% (%d/%d)...' % (
															
 
																+          100.0 * pos / nbytes, pos, nbytes))
															
 
																+      sys.stdout.flush()
															
 
																+
															
 
																+      if len(coocs) > FLAGS.bufsz:
															
 
																+        flush_coocs()
															
 
																+        coocs = {}
															
 
																+
															
 
																+  flush_coocs()
															
 
																+  sys.stdout.write('\n')
															
 
																+
															
 
																+  return shardfiles, sums
															
 
																+
															
 
																+
															
 
																+def write_shards(vocab, shardfiles):
															
 
																+  """Processes the temporary files to generate the final shard data.
															
 
																+
															
 
																+  Each shard is written to its own file as a serialized tf.Example proto. The
															
 
																+  temporary files are removed from the filesystem once they've been processed.
															
 
																+
															
 
																+  """
															
 
																+  num_shards = len(vocab) / FLAGS.shard_size
															
 
																+
															
 
																+  ix = 0
															
 
																+  for (row, col), fh in shardfiles.iteritems():
															
 
																+    ix += 1
															
 
																+    sys.stdout.write('\rwriting shard %d/%d' % (ix, len(shardfiles)))
															
 
																+    sys.stdout.flush()
															
 
																+
															
 
																+    # Read the entire binary co-occurrence and unpack it into an array.
															
 
																+    fh.seek(0)
															
 
																+    buf = fh.read()
															
 
																+    os.unlink(fh.name)
															
 
																+    fh.close()
															
 
																+
															
 
																+    coocs = [
															
 
																+        shard_cooc_fmt.unpack_from(buf, off)
															
 
																+        for off in range(0, len(buf), shard_cooc_fmt.size)]
															
 
																+
															
 
																+    # Sort and merge co-occurrences for the same pairs.
															
 
																+    coocs.sort()
															
 
																+
															
 
																+    if coocs:
															
 
																+      current_pos = 0
															
 
																+      current_row_col = (coocs[current_pos][0], coocs[current_pos][1])
															
 
																+      for next_pos in range(1, len(coocs)):
															
 
																+        next_row_col = (coocs[next_pos][0], coocs[next_pos][1])
															
 
																+        if current_row_col == next_row_col:
															
 
																+          coocs[current_pos] = (
															
 
																+              coocs[current_pos][0],
															
 
																+              coocs[current_pos][1],
															
 
																+              coocs[current_pos][2] + coocs[next_pos][2])
															
 
																+        else:
															
 
																+          current_pos += 1
															
 
																+          if current_pos < next_pos:
															
 
																+            coocs[current_pos] = coocs[next_pos]
															
 
																+
															
 
																+          current_row_col = (coocs[current_pos][0], coocs[current_pos][1])
															
 
																+
															
 
																+      coocs = coocs[:(1 + current_pos)]
															
 
																+
															
 
																+    # Convert to a TF Example proto.
															
 
																+    def _int64s(xs):
															
 
																+      return tf.train.Feature(int64_list=tf.train.Int64List(value=list(xs)))
															
 
																+
															
 
																+    def _floats(xs):
															
 
																+      return tf.train.Feature(float_list=tf.train.FloatList(value=list(xs)))
															
 
																+
															
 
																+    example = tf.train.Example(features=tf.train.Features(feature={
															
 
																+        'global_row': _int64s(
															
 
																+            row + num_shards * i for i in range(FLAGS.shard_size)),
															
 
																+        'global_col': _int64s(
															
 
																+            col + num_shards * i for i in range(FLAGS.shard_size)),
															
 
																+
															
 
																+        'sparse_local_row': _int64s(cooc[0] for cooc in coocs),
															
 
																+        'sparse_local_col': _int64s(cooc[1] for cooc in coocs),
															
 
																+        'sparse_value': _floats(cooc[2] for cooc in coocs),
															
 
																+    }))
															
 
																+
															
 
																+    filename = os.path.join(FLAGS.output_dir, 'shard-%03d-%03d.pb' % (row, col))
															
 
																+    with open(filename, 'w') as out:
															
 
																+      out.write(example.SerializeToString())
															
 
																+
															
 
																+  sys.stdout.write('\n')
															
 
																+
															
 
																+
															
 
																+def main(_):
															
 
																+  # Create the output directory, if necessary
															
 
																+  if FLAGS.output_dir and not os.path.isdir(FLAGS.output_dir):
															
 
																+    os.makedirs(FLAGS.output_dir)
															
 
																+
															
 
																+  # Read the file once to create the vocabulary.
															
 
																+  if FLAGS.vocab:
															
 
																+    with open(FLAGS.vocab, 'r') as lines:
															
 
																+      vocab = [line.strip() for line in lines]
															
 
																+  else:
															
 
																+    with open(FLAGS.input, 'r') as lines:
															
 
																+      vocab = create_vocabulary(lines)
															
 
																+
															
 
																+  # Now read the file again to determine the co-occurrence stats.
															
 
																+  with open(FLAGS.input, 'r') as lines:
															
 
																+    shardfiles, sums = compute_coocs(lines, vocab)
															
 
																+
															
 
																+  # Sort and merge each temporary shard file and write it out as a .pb file.
															
 
																+  write_shards(vocab, shardfiles)
															
 
																+
															
 
																+  # Now write the marginals.  They're symmetric for this application.
															
 
																+  write_vocab_and_sums(vocab, sums, 'row_vocab.txt', 'row_sums.txt')
															
 
																+  write_vocab_and_sums(vocab, sums, 'col_vocab.txt', 'col_sums.txt')
															
 
																+
															
 
																+  print 'done!'
															
 
																+
															
 
																+
															
 
																+if __name__ == '__main__':
															
 
																+  tf.app.run()
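
Two ideas in `prep.py` above are easy to miss in the full listing. Co-occurrences are weighted by `1.0 / off`, the inverse of the distance between the two tokens. Each word ID is mapped to a shard with `id % num_shards` and to an offset within that shard with `id / num_shards`. The sketch below isolates both in Python 3. The function names `count_coocs`, `shard_location`, and `global_id` are hypothetical. The per-token self-pair that `compute_coocs` also records, the marginal sums, and the buffering to temporary files are left out for brevity.

```python
def count_coocs(wids, window_size):
    """Counts distance-weighted co-occurrences among a list of word IDs."""
    coocs = {}
    for pos, lid in enumerate(wids):
        # Look at most window_size tokens to the right of the focus token.
        window_extent = min(window_size + 1, len(wids) - pos)
        for off in range(1, window_extent):
            rid = wids[pos + off]
            # Store each unordered pair once, smaller ID first.
            pair = (min(lid, rid), max(lid, rid))
            coocs[pair] = coocs.get(pair, 0.0) + 1.0 / off
    return coocs


def shard_location(word_id, num_shards):
    """Maps a global word ID to (shard index, offset within the shard)."""
    return word_id % num_shards, word_id // num_shards


def global_id(shard, offset, num_shards):
    """Inverts shard_location, mirroring `row + num_shards * i` in write_shards."""
    return shard + num_shards * offset


if __name__ == '__main__':
    print(count_coocs([0, 1, 2], window_size=2))
    print(shard_location(5, num_shards=2))
```

The sketch uses `//` because Python 3 needs explicit integer division. The Python 2 source gets the same result from `/` on integers. Since IDs are assigned in descending frequency order, taking them modulo `num_shards` spreads frequent and rare words across all shards.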
															
--- a/swivel/swivel.py
+++ b/swivel/swivel.py
@@ -0,0 +1,347 @@
 
																+#!/usr/bin/env python
															
 
																+#
															
 
																+# Copyright 2016 Google Inc. All Rights Reserved.
															
 
																+#
															
 
																+# Licensed under the Apache License, Version 2.0 (the "License");
															
 
																+# you may not use this file except in compliance with the License.
															
 
																+# You may obtain a copy of the License at
															
 
																+#
															
 
																+#     http://www.apache.org/licenses/LICENSE-2.0
															
 
																+#
															
 
																+# Unless required by applicable law or agreed to in writing, software
															
 
																+# distributed under the License is distributed on an "AS IS" BASIS,
															
 
																+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
															
 
																+# See the License for the specific language governing permissions and
															
 
																+# limitations under the License.
															
 
																+
															
 
																+"""Submatrix-wise Vector Embedding Learner.
															
 
																+
															
 
																+Implementation of SwiVel algorithm described at:
															
 
																+http://arxiv.org/abs/1602.02215
															
 
																+
															
 
																+This program expects an input directory that contains the following files.
															
 
																+
															
 
																+  row_vocab.txt, col_vocab.txt
															
 
																+
															
 
																+    The row and column vocabulary files.  Each file should contain one token per
															
 
																+    line; these will be used to generate a tab-separated file containing the
															
 
																+    trained embeddings.
															
 
																+
															
 
																+  row_sums.txt, col_sums.txt
															
 
																+
															
 
																+    The matrix row and column marginal sums.  Each file should contain one
															
 
																+    decimal floating point number per line which corresponds to the marginal
															
 
																+    count of the matrix for that row or column.
															
 
																+
															
 
																+  shards.recs
															
 
																+
															
 
																+    A file containing the sub-matrix shards, stored as TFRecords.  Each shard is
															
 
																+    expected to be a serialized tf.Example protocol buffer with the following
															
 
																+    properties:
															
 
																+
															
 
																+      global_row: the global row indices contained in the shard
															
 
																+      global_col: the global column indices contained in the shard
															
 
																+      sparse_local_row, sparse_local_col, sparse_value: three parallel arrays
															
 
																+      that are a sparse representation of the submatrix counts.
															
 
																+
															
 
																+It will generate embeddings, training from the input directory for the specified
															
 
																+number of epochs.  When complete, it will output the trained vectors to a
															
 
																+tab-separated file that contains one line per embedding.  Row and column
															
 
																+embeddings are stored in separate files.
															
 
																+
															
 
																+"""
															
 
																+
															
 
																+import argparse
															
 
																+import glob
															
 
																+import math
															
 
																+import os
															
 
																+import sys
															
 
																+import time
															
 
																+import threading
															
 
																+
															
 
																+import numpy as np
															
 
																+import tensorflow as tf
															
 
																+
															
 
																+flags = tf.app.flags
															
 
																+
															
 
																+flags.DEFINE_string('input_base_path', '/tmp/swivel_data',
															
 
																+                    'Directory containing input shards, vocabularies, '
															
 
																+                    'and marginals.')
															
 
																+flags.DEFINE_string('output_base_path', '/tmp/swivel_data',
															
 
																+                    'Path where to write the trained embeddings.')
															
 
																+flags.DEFINE_integer('embedding_size', 300, 'Size of the embeddings')
															
 
																+flags.DEFINE_boolean('trainable_bias', False, 'Biases are trainable')
															
 
																+flags.DEFINE_integer('submatrix_rows', 4096, 'Rows in each training submatrix. '
															
 
																+                     'This must match the training data.')
															
 
																+flags.DEFINE_integer('submatrix_cols', 4096, 'Columns in each training submatrix. '
															
 
																+                     'This must match the training data.')
															
 
																+flags.DEFINE_float('loss_multiplier', 1.0 / 4096,
															
 
																+                   'constant multiplier on loss.')
															
 
																+flags.DEFINE_float('confidence_exponent', 0.5,
															
 
																+                   'Exponent for l2 confidence function')
															
 
																+flags.DEFINE_float('confidence_scale', 0.25, 'Scale for l2 confidence function')
															
 
																+flags.DEFINE_float('confidence_base', 0.1, 'Base for l2 confidence function')
															
 
																+flags.DEFINE_float('learning_rate', 1.0, 'Initial learning rate')
															
 
																+flags.DEFINE_integer('num_concurrent_steps', 2,
															
 
																+                     'Number of threads to train with')
															
 
																+flags.DEFINE_float('num_epochs', 40, 'Number of epochs to train for')
															
 
																+flags.DEFINE_float('per_process_gpu_memory_fraction', 0.25,
															
 
																+                   'Fraction of GPU memory to use')
															
 
																+
															
 
																+FLAGS = flags.FLAGS
															
 
																+
															
 
																+
															
 
																+def embeddings_with_init(vocab_size, embedding_dim, name):
															
 
																+  """Creates and initializes the embedding tensors."""
															
 
																+  return tf.get_variable(name=name,
															
 
																+                         shape=[vocab_size, embedding_dim],
															
 
																+                         initializer=tf.random_normal_initializer(
															
 
																+                             stddev=math.sqrt(1.0 / embedding_dim)))
															
 
																+
															
 
																+
															
 
																+def count_matrix_input(filenames, submatrix_rows, submatrix_cols):
															
 
																+  """Reads submatrix shards from disk."""
															
 
																+  filename_queue = tf.train.string_input_producer(filenames)
															
 
																+  reader = tf.WholeFileReader()
															
 
																+  _, serialized_example = reader.read(filename_queue)
															
 
																+  features = tf.parse_single_example(
															
 
																+      serialized_example,
															
 
																+      features={
															
 
																+          'global_row': tf.FixedLenFeature([submatrix_rows], dtype=tf.int64),
															
 
																+          'global_col': tf.FixedLenFeature([submatrix_cols], dtype=tf.int64),
															
 
																+          'sparse_local_row': tf.VarLenFeature(dtype=tf.int64),
															
 
																+          'sparse_local_col': tf.VarLenFeature(dtype=tf.int64),
															
 
																+          'sparse_value': tf.VarLenFeature(dtype=tf.float32)
															
 
																+      })
															
 
																+
															
 
																+  global_row = features['global_row']
															
 
																+  global_col = features['global_col']
															
 
																+
															
 
																+  sparse_local_row = features['sparse_local_row'].values
															
 
																+  sparse_local_col = features['sparse_local_col'].values
															
 
																+  sparse_count = features['sparse_value'].values
															
 
																+
															
 
																+  sparse_indices = tf.concat(1, [tf.expand_dims(sparse_local_row, 1),
															
 
																+                                 tf.expand_dims(sparse_local_col, 1)])
															
 
																+  count = tf.sparse_to_dense(sparse_indices, [submatrix_rows, submatrix_cols],
															
 
																+                             sparse_count)
															
 
																+
															
 
																+  queued_global_row, queued_global_col, queued_count = tf.train.batch(
															
 
																+      [global_row, global_col, count],
															
 
																+      batch_size=1,
															
 
																+      num_threads=4,
															
 
																+      capacity=32)
															
 
																+
															
 
																+  queued_global_row = tf.reshape(queued_global_row, [submatrix_rows])
															
 
																+  queued_global_col = tf.reshape(queued_global_col, [submatrix_cols])
															
 
																+  queued_count = tf.reshape(queued_count, [submatrix_rows, submatrix_cols])
															
 
																+
															
 
																+  return queued_global_row, queued_global_col, queued_count
															
 
																+
															
 
																+
															
 
																+def read_marginals_file(filename):
															
 
																+  """Reads text file with one number per line to an array."""
															
 
																+  with open(filename) as lines:
															
 
																+    return [float(line) for line in lines]
															
 
																+
															
 
																+
															
 
																+def write_embedding_tensor_to_disk(vocab_path, output_path, sess, embedding):
															
 
																+  """Writes tensor to output_path as tsv"""
															
 
																+  # Fetch the embedding values from the model
															
 
																+  embeddings = sess.run(embedding)
															
 
																+
															
 
																+  with open(output_path, 'w') as out_f:
															
 
																+    with open(vocab_path) as vocab_f:
															
 
																+      for index, word in enumerate(vocab_f):
															
 
																+        word = word.strip()
															
 
																+        embedding = embeddings[index]
															
 
																+        out_f.write(word + '\t' + '\t'.join([str(x) for x in embedding]) + '\n')
															
 
																+
															
 
																+
															
 
																+def write_embeddings_to_disk(config, model, sess):
															
 
																+  """Writes row and column embeddings to disk."""
															
 
																+  # Row Embedding
															
 
																+  row_vocab_path = config.input_base_path + '/row_vocab.txt'
															
 
																+  row_embedding_output_path = config.output_base_path + '/row_embedding.tsv'
															
 
																+  print 'Writing row embeddings to:', row_embedding_output_path
															
 
																+  write_embedding_tensor_to_disk(row_vocab_path, row_embedding_output_path,
															
 
																+                                 sess, model.row_embedding)
															
 
																+
															
 
																+  # Column Embedding
															
 
																+  col_vocab_path = config.input_base_path + '/col_vocab.txt'
															
 
																+  col_embedding_output_path = config.output_base_path + '/col_embedding.tsv'
															
 
																+  print 'Writing column embeddings to:', col_embedding_output_path
															
 
																+  write_embedding_tensor_to_disk(col_vocab_path, col_embedding_output_path,
															
 
																+                                 sess, model.col_embedding)
															
 
																+
															
 
																+
															
 
																+class SwivelModel(object):
															
 
																+  """Small class to gather needed pieces from a Graph being built."""
															
 
																+
															
 
																+  def __init__(self, config):
															
 
																+    """Construct graph for dmc."""
															
 
																+    self._config = config
															
 
																+
															
 
																+    # Create paths to input data files
															
 
																+    print 'Reading model from:', config.input_base_path
															
 
																+    count_matrix_files = glob.glob(config.input_base_path + '/shard-*.pb')
															
 
																+    row_sums_path = config.input_base_path + '/row_sums.txt'
															
 
																+    col_sums_path = config.input_base_path + '/col_sums.txt'
															
 
																+
															
 
																+    # Read marginals
															
 
																+    row_sums = read_marginals_file(row_sums_path)
															
 
																+    col_sums = read_marginals_file(col_sums_path)
															
 
																+
															
 
																+    self.n_rows = len(row_sums)
															
 
																+    self.n_cols = len(col_sums)
															
 
																+    print 'Matrix dim: (%d,%d) SubMatrix dim: (%d,%d) ' % (
															
 
																+        self.n_rows, self.n_cols, config.submatrix_rows, config.submatrix_cols)
															
 
																+    self.n_submatrices = (self.n_rows * self.n_cols /
															
 
																+                          (config.submatrix_rows * config.submatrix_cols))
															
 
																+    print 'n_submatrices: %d' % (self.n_submatrices)
															
 
																+
															
 
																+    # ===== CREATE VARIABLES ======
															
 
																+
															
 
																+    with tf.device('/cpu:0'):
															
 
																+      # embeddings
															
 
																+      self.row_embedding = embeddings_with_init(
															
 
																+          embedding_dim=config.embedding_size,
															
 
																+          vocab_size=self.n_rows,
															
 
																+          name='row_embedding')
															
 
																+      self.col_embedding = embeddings_with_init(
															
 
																+          embedding_dim=config.embedding_size,
															
 
																+          vocab_size=self.n_cols,
															
 
																+          name='col_embedding')
															
 
																+      tf.histogram_summary('row_emb', self.row_embedding)
															
 
																+      tf.histogram_summary('col_emb', self.col_embedding)
															
 
																+
															
 
																+      matrix_log_sum = math.log(np.sum(row_sums) + 1)
															
 
																+      row_bias_init = [math.log(x + 1) for x in row_sums]
															
 
																+      col_bias_init = [math.log(x + 1) for x in col_sums]
															
 
																+      self.row_bias = tf.Variable(row_bias_init,
															
 
																+                                  trainable=config.trainable_bias)
															
 
																+      self.col_bias = tf.Variable(col_bias_init,
															
 
																+                                  trainable=config.trainable_bias)
															
 
																+      tf.histogram_summary('row_bias', self.row_bias)
															
 
																+      tf.histogram_summary('col_bias', self.col_bias)
															
 
																+
															
 
																+    # ===== CREATE GRAPH =====
															
 
																+
															
 
																+    # Get input
															
 
																+    with tf.device('/cpu:0'):
															
 
																+      global_row, global_col, count = count_matrix_input(
															
 
																+          count_matrix_files, config.submatrix_rows, config.submatrix_cols)
															
 
																+
															
 
																+      # Fetch embeddings.
															
 
																+      selected_row_embedding = tf.nn.embedding_lookup(self.row_embedding,
															
 
																+                                                      global_row)
															
 
																+      selected_col_embedding = tf.nn.embedding_lookup(self.col_embedding,
															
 
																+                                                      global_col)
															
 
																+
															
 
																+      # Fetch biases.
															
 
																+      selected_row_bias = tf.nn.embedding_lookup([self.row_bias], global_row)
															
 
																+      selected_col_bias = tf.nn.embedding_lookup([self.col_bias], global_col)
															
 
																+
															
 
																+    # Multiply the row and column embeddings to generate predictions.
															
 
																+    predictions = tf.matmul(
															
 
																+        selected_row_embedding, selected_col_embedding, transpose_b=True)
															
 
																+
															
 
																+    # These binary masks separate zero from non-zero values.
															
 
																+    count_is_nonzero = tf.to_float(tf.cast(count, tf.bool))
															
 
																+    count_is_zero = 1 - tf.to_float(tf.cast(count, tf.bool))
															
 
																+
															
 
																+    objectives = count_is_nonzero * tf.log(count + 1e-30)
															
 
																+    objectives -= tf.reshape(selected_row_bias, [config.submatrix_rows, 1])
															
 
																+    objectives -= selected_col_bias
															
 
																+    objectives += matrix_log_sum
															
 
																+
															
 
																+    err = predictions - objectives
															
 
																+
															
 
																+    # The confidence function scales the L2 loss based on the raw co-occurrence
															
 
																+    # count.
															
 
																+    l2_confidence = (config.confidence_base + config.confidence_scale * tf.pow(
															
 
																+        count, config.confidence_exponent))
															
 
																+
															
 
																+    l2_loss = config.loss_multiplier * tf.reduce_sum(
															
 
																+        0.5 * l2_confidence * err * err * count_is_nonzero)
															
 
																+
															
 
																+    sigmoid_loss = config.loss_multiplier * tf.reduce_sum(
															
 
																+        tf.nn.softplus(err) * count_is_zero)
															
 
																+
															
 
																+    self.loss = l2_loss + sigmoid_loss
															
 
																+
															
 
																+    tf.scalar_summary("l2_loss", l2_loss)
															
 
																+    tf.scalar_summary("sigmoid_loss", sigmoid_loss)
															
 
																+    tf.scalar_summary("loss", self.loss)
															
 
																+
															
 
																+    # Add optimizer.
															
 
																+    self.global_step = tf.Variable(0, name='global_step')
															
 
																+    opt = tf.train.AdagradOptimizer(config.learning_rate)
															
 
																+    self.train_op = opt.minimize(self.loss, global_step=self.global_step)
															
 
																+    self.saver = tf.train.Saver(sharded=True)
															
 
																+
															
 
																+
															
 
																+def main(_):
															
 
																+  # Create the output path.  If this fails, it really ought to fail
															
 
																+  # now. :)
															
 
																+  if not os.path.isdir(FLAGS.output_base_path):
															
 
																+    os.makedirs(FLAGS.output_base_path)
															
 
																+
															
 
																+  # Create and run model
															
 
																+  with tf.Graph().as_default():
															
 
																+    model = SwivelModel(FLAGS)
															
 
																+
															
 
																+    # Create a session for running Ops on the Graph.
															
 
																+    gpu_options = tf.GPUOptions(
															
 
																+        per_process_gpu_memory_fraction=FLAGS.per_process_gpu_memory_fraction)
															
 
																+    sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
															
 
																+
															
 
																+    # Run the Op to initialize the variables.
															
 
																+    sess.run(tf.initialize_all_variables())
															
 
																+
															
 
																+    # Start feeding input
															
 
																+    coord = tf.train.Coordinator()
															
 
																+    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
															
 
																+
															
 
																+    # Calculate how many steps each thread should run
															
 
																+    n_total_steps = int(FLAGS.num_epochs * model.n_rows * model.n_cols) / (
															
 
																+        FLAGS.submatrix_rows * FLAGS.submatrix_cols)
															
 
																+    n_steps_per_thread = n_total_steps / FLAGS.num_concurrent_steps
															
 
																+    n_submatrices_to_train = model.n_submatrices * FLAGS.num_epochs
															
 
																+    t0 = [time.time()]
															
 
																+
															
 
																+    def TrainingFn():
															
 
																+      for _ in range(n_steps_per_thread):
															
 
																+        _, global_step = sess.run([model.train_op, model.global_step])
															
 
																+        n_steps_between_status_updates = 100
															
 
																+        if (global_step % n_steps_between_status_updates) == 0:
															
 
																+          elapsed = float(time.time() - t0[0])
															
 
																+          print '%d/%d submatrices trained (%.1f%%), %.1f submatrices/sec' % (
															
 
																+              global_step, n_submatrices_to_train,
															
 
																+              100.0 * global_step / n_submatrices_to_train,
															
 
																+              n_steps_between_status_updates / elapsed)
															
 
																+          t0[0] = time.time()
															
 
																+
															
 
																+    # Start training threads
															
 
																+    train_threads = []
															
 
																+    for _ in range(FLAGS.num_concurrent_steps):
															
 
																+      t = threading.Thread(target=TrainingFn)
															
 
																+      train_threads.append(t)
															
 
																+      t.start()
															
 
																+
															
 
																+    # Wait for threads to finish.
															
 
																+    for t in train_threads:
															
 
																+      t.join()
															
 
																+
															
 
																+    coord.request_stop()
															
 
																+    coord.join(threads)
															
 
																+
															
 
																+    # Write out vectors
															
 
																+    write_embeddings_to_disk(FLAGS, model, sess)
															
 
																+
															
 
																+    # Shut down the session.
															
 
																+    sess.close()
															
 
																+
															
 
																+
															
 
																+if __name__ == '__main__':
															
 
																+  tf.app.run()
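
The loss built in `SwivelModel.__init__` above can be illustrated for a single matrix cell. The following is a minimal pure-Python sketch, not the TensorFlow graph itself: `swivel_cell_loss` and `softplus` are hypothetical helper names, and the default arguments simply mirror the `confidence_base`, `confidence_scale`, and `confidence_exponent` flag defaults. The `loss_multiplier` factor and the sum over the submatrix are left out.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def swivel_cell_loss(prediction, count, row_sum, col_sum, total,
                     base=0.1, scale=0.25, exponent=0.5):
    """Loss for one cell, following the per-cell terms in the diff above.

    prediction is the dot product of the row and column embeddings;
    row_sum, col_sum, and total are the marginal counts (hypothetical
    argument names chosen for this sketch).
    """
    row_bias = math.log(row_sum + 1)
    col_bias = math.log(col_sum + 1)
    log_total = math.log(total + 1)
    # The log-count term is masked out for unobserved cells.
    log_count = math.log(count) if count > 0 else 0.0
    objective = log_count - row_bias - col_bias + log_total
    err = prediction - objective
    if count > 0:
        # Observed cell: squared error weighted by a count-based confidence.
        confidence = base + scale * count ** exponent
        return 0.5 * confidence * err * err
    # Unobserved cell: softplus penalty on the error.
    return softplus(err)
```

For an observed cell the loss is zero when the prediction equals the objective; for an unobserved cell the penalty shrinks as the prediction falls below the objective and grows roughly linearly above it.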
															
--- a/swivel/text2bin.py
+++ b/swivel/text2bin.py
@@ -0,0 +1,88 @@
 
																+#!/usr/bin/env python
															
 
																+#
															
 
																+# Copyright 2016 Google Inc. All Rights Reserved.
															
 
																+#
															
 
																+# Licensed under the Apache License, Version 2.0 (the "License");
															
 
																+# you may not use this file except in compliance with the License.
															
 
																+# You may obtain a copy of the License at
															
 
																+#
															
 
																+#     http://www.apache.org/licenses/LICENSE-2.0
															
 
																+#
															
 
																+# Unless required by applicable law or agreed to in writing, software
															
 
																+# distributed under the License is distributed on an "AS IS" BASIS,
															
 
																+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
															
 
																+# See the License for the specific language governing permissions and
															
 
																+# limitations under the License.
															
 
																+
															
 
																+"""Converts vectors from text to a binary format for quicker manipulation.
															
 
																+
															
 
																+Usage:
															
 
																+
															
 
																+  text2bin.py -o <out> -v <vocab> vec1.txt [vec2.txt ...]
															
 
																+
															
 
																+Options:
															
 
																+
															
 
																+  -o <filename>, --output <filename>
															
 
																+    The name of the file into which the binary vectors are written.
															
 
																+
															
 
																+  -v <filename>, --vocab <filename>
															
 
																+    The name of the file into which the vocabulary is written.
															
 
																+
															
 
																+Description
															
 
																+
															
 
																+This program merges one or more whitespace separated vector files into a single
															
 
																+binary vector file that can be used by downstream evaluation tools in this
															
 
																+directory ("wordsim.py" and "analogy").
															
 
																+
															
 
																+If more than one vector file is specified, then the files must be aligned
															
 
																+row-wise (i.e., each line must correspond to the same embedding), and they must
															
 
																+have the same number of columns (i.e., be the same dimension).
															
 
																+
															
 
																+"""
															
 
																+
															
 
																+from itertools import izip
															
 
																+from getopt import GetoptError, getopt
															
 
																+import os
															
 
																+import struct
															
 
																+import sys
															
 
																+
															
 
																+try:
															
 
																+  opts, args = getopt(
															
 
																+      sys.argv[1:], 'o:v:', ['output=', 'vocab='])
															
 
																+except GetoptError, e:
															
 
																+  print >> sys.stderr, e
															
 
																+  sys.exit(2)
															
 
																+
															
 
																+opt_output = 'vecs.bin'
															
 
																+opt_vocab = 'vocab.txt'
															
 
																+for o, a in opts:
															
 
																+  if o in ('-o', '--output'):
															
 
																+    opt_output = a
															
 
																+  if o in ('-v', '--vocab'):
															
 
																+    opt_vocab = a
															
 
																+
															
 
																+def go(fhs):
															
 
																+  fmt = None
															
 
																+  with open(opt_vocab, 'w') as vocab_out:
															
 
																+    with open(opt_output, 'wb') as vecs_out:
															
 
																+      for lines in izip(*fhs):
															
 
																+        parts = [line.split() for line in lines]
															
 
																+        token = parts[0][0]
															
 
																+        if any(part[0] != token for part in parts[1:]):
															
 
																+          raise IOError('vector files must be aligned')
															
 
																+
															
 
																+        print >> vocab_out, token
															
 
																+
															
 
																+        vec = [sum(float(x) for x in xs) for xs in zip(*parts)[1:]]
															
 
																+        if not fmt:
															
 
																+          fmt = struct.Struct('%df' % len(vec))
															
 
																+
															
 
																+        vecs_out.write(fmt.pack(*vec))
															
 
																+
															
 
																+if args:
															
 
																+  fhs = [open(filename) for filename in args]
															
 
																+  go(fhs)
															
 
																+  for fh in fhs:
															
 
																+    fh.close()
															
 
																+else:
															
 
																+  go([sys.stdin])
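
The binary layout that `text2bin.py` writes is each vector packed back to back as native 32-bit floats, with no header. A minimal round-trip sketch follows, assuming that layout; `pack_vectors` and `unpack_vectors` are hypothetical names used only for illustration, and the size check mirrors the one `vecs.py` performs on the file.

```python
import io
import struct


def pack_vectors(vectors):
    """Packs equal-length vectors as consecutive native float32 values."""
    dim = len(vectors[0])
    fmt = struct.Struct('%df' % dim)
    buf = io.BytesIO()
    for vec in vectors:
        buf.write(fmt.pack(*vec))
    return buf.getvalue()


def unpack_vectors(data, n):
    """Recovers n vectors; the dimension is inferred from the data size."""
    if n == 0 or len(data) % (4 * n) != 0:
        raise IOError('unexpected size for binary vector data')
    dim = len(data) // (4 * n)
    fmt = struct.Struct('%df' % dim)
    return [list(fmt.unpack_from(data, i * fmt.size)) for i in range(n)]
```

Because the file carries no header, a reader needs the vocabulary size to recover the dimension, which is why the vocabulary file and the vector file have to be kept together.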
															
--- a/swivel/vecs.py
+++ b/swivel/vecs.py
@@ -0,0 +1,90 @@
 
																+# Copyright 2016 Google Inc. All Rights Reserved.
															
 
																+#
															
 
																+# Licensed under the Apache License, Version 2.0 (the "License");
															
 
																+# you may not use this file except in compliance with the License.
															
 
																+# You may obtain a copy of the License at
															
 
																+#
															
 
																+#     http://www.apache.org/licenses/LICENSE-2.0
															
 
																+#
															
 
																+# Unless required by applicable law or agreed to in writing, software
															
 
																+# distributed under the License is distributed on an "AS IS" BASIS,
															
 
																+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
															
 
																+# See the License for the specific language governing permissions and
															
 
																+# limitations under the License.
															
 
																+
															
 
																+import mmap
															
 
																+import numpy as np
															
 
																+import os
															
 
																+import struct
															
 
																+
															
 
																+class Vecs(object):
															
 
																+  def __init__(self, vocab_filename, rows_filename, cols_filename=None):
															
 
																+    """Initializes the vectors from a text vocabulary and binary data."""
															
 
																+    with open(vocab_filename, 'r') as lines:
															
 
																+      self.vocab = [line.split()[0] for line in lines]
															
 
																+      self.word_to_idx = {word: idx for idx, word in enumerate(self.vocab)}
															
 
																+
															
 
																+    n = len(self.vocab)
															
 
																+
															
 
																+    with open(rows_filename, 'rb') as rows_fh:
															
 
																+      rows_fh.seek(0, os.SEEK_END)
															
 
																+      size = rows_fh.tell()
															
 
																+
															
 
																+      # Make sure that the file size seems reasonable.
															
 
																+      if size % (4 * n) != 0:
															
 
																+        raise IOError(
															
 
																+            'unexpected file size for binary vector file %s' % rows_filename)
															
 
																+
															
 
																+      # Memory map the rows.
															
 
																+      dim = size / (4 * n)
															
 
																+      rows_mm = mmap.mmap(rows_fh.fileno(), 0, prot=mmap.PROT_READ)
															
 
																+      rows = np.matrix(
															
 
																+          np.frombuffer(rows_mm, dtype=np.float32).reshape(n, dim))
															
 
																+
															
 
																+      # If column vectors were specified, then open them and add them to the row
															
 
																+      # vectors.
															
 
																+      if cols_filename:
															
 
																+        with open(cols_filename, 'rb') as cols_fh:
															
 
																+          cols_mm = mmap.mmap(cols_fh.fileno(), 0, prot=mmap.PROT_READ)
															
 
																+          cols_fh.seek(0, os.SEEK_END)
															
 
																+          if cols_fh.tell() != size:
															
 
																+            raise IOError('row and column vector files have different sizes')
															
 
																+
															
 
																+          cols = np.matrix(
															
 
																+              np.frombuffer(cols_mm, dtype=np.float32).reshape(n, dim))
															
 
																+
															
 
																+          rows += cols
															
 
																+          cols_mm.close()
															
 
																+
															
 
																+      # Normalize so that dot products are just cosine similarity.
															
 
																+      self.vecs = rows / np.linalg.norm(rows, axis=1).reshape(n, 1)
															
 
																+      rows_mm.close()
															
 
																+
															
 
																+  def similarity(self, word1, word2):
															
 
																+    """Computes the similarity of two tokens."""
															
 
																+    idx1 = self.word_to_idx.get(word1)
															
 
																+    idx2 = self.word_to_idx.get(word2)
															
 
																+    if idx1 is None or idx2 is None:
															
 
																+      return None
															
 
																+
															
 
																+    return float(self.vecs[idx1] * self.vecs[idx2].transpose())
															
 
																+
															
 
																+  def neighbors(self, query):
															
 
																+    """Returns the nearest neighbors to the query (a word or vector)."""
															
 
																+    if isinstance(query, basestring):
															
 
																+      idx = self.word_to_idx.get(query)
															
 
																+      if idx is None:
															
 
																+        return None
															
 
																+
															
 
																+      query = self.vecs[idx]
															
 
																+
															
 
																+    neighbors = self.vecs * query.transpose()
															
 
																+
															
 
																+    return sorted(
															
 
																+      zip(self.vocab, neighbors.flat),
															
 
																+      key=lambda kv: kv[1], reverse=True)
															
 
																+
															
 
																+  def lookup(self, word):
															
 
																+    """Returns the embedding for a token, or None if no embedding exists."""
															
 
																+    idx = self.word_to_idx.get(word)
															
 
																+    return None if idx is None else self.vecs[idx]
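
The normalization step in `Vecs` makes a plain dot product equal to cosine similarity. The sketch below shows the same idea in pure Python without NumPy or `mmap`; `TinyVecs` and `normalize` are hypothetical names for illustration only. Index 0 is a valid vocabulary index, so the lookup results are compared against `None` rather than tested for truthiness.

```python
import math


def normalize(vec):
    # Scale a vector to unit length.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


class TinyVecs(object):
    """In-memory stand-in for Vecs, for illustration only."""

    def __init__(self, vocab, rows):
        self.vocab = list(vocab)
        self.word_to_idx = {w: i for i, w in enumerate(self.vocab)}
        # Unit-length rows, so dot products are cosine similarities.
        self.vecs = [normalize(r) for r in rows]

    def similarity(self, word1, word2):
        idx1 = self.word_to_idx.get(word1)
        idx2 = self.word_to_idx.get(word2)
        # The first vocabulary entry has index 0, which is falsy.
        if idx1 is None or idx2 is None:
            return None
        return sum(a * b for a, b in zip(self.vecs[idx1], self.vecs[idx2]))
```

With this check, a query that involves the first word in the vocabulary returns a score instead of `None`.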
															
--- a/swivel/wordsim.py
+++ b/swivel/wordsim.py
@@ -0,0 +1,90 @@
 
																+#!/usr/bin/env python
															
 
																+#
															
 
																+# Copyright 2016 Google Inc. All Rights Reserved.
															
 
																+#
															
 
																+# Licensed under the Apache License, Version 2.0 (the "License");
															
 
																+# you may not use this file except in compliance with the License.
															
 
																+# You may obtain a copy of the License at
															
 
																+#
															
 
																+#     http://www.apache.org/licenses/LICENSE-2.0
															
 
																+#
															
 
																+# Unless required by applicable law or agreed to in writing, software
															
 
																+# distributed under the License is distributed on an "AS IS" BASIS,
															
 
																+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
															
 
																+# See the License for the specific language governing permissions and
															
 
																+# limitations under the License.
															
 
																+
															
 
																+"""Computes Spearman's rho with respect to human judgements.
															
 
																+
															
 
																+Given a set of row (and potentially column) embeddings, this computes Spearman's
															
 
																+rho between the rank ordering of predicted word similarity and human judgements.
															
 
																+
															
 
																+Usage:
															
 
																+
															
 
																+  wordsim.py --embeddings=<binvecs> --vocab=<vocab> eval1.tab eval2.tab ...
															
 
																+
															
 
																+Options:
															
 
																+
															
 
																+  --embeddings=<filename>: the vectors to test
															
 
																+  --vocab=<filename>: the vocabulary file
															
 
																+
															
 
																+Evaluation files are assumed to be tab-separated files with exactly three
															
 
																+columns.  The first two columns contain the words, and the third column contains
															
 
																+the scored human judgement.
															
 
																+
															
 
																+"""
															
 
																+
															
 
																+import scipy.stats
															
 
																+import sys
															
 
																+from getopt import GetoptError, getopt
															
 
																+
															
 
																+from vecs import Vecs
															
 
																+
															
 
																+try:
															
 
																+  opts, args = getopt(sys.argv[1:], '', ['embeddings=', 'vocab='])
															
 
																+except GetoptError, e:
															
 
																+  print >> sys.stderr, e
															
 
																+  sys.exit(2)
															
 
																+
															
 
																+opt_embeddings = None
															
 
																+opt_vocab = None
															
 
																+
															
 
																+for o, a in opts:
															
 
																+  if o == '--embeddings':
															
 
																+    opt_embeddings = a
															
 
																+  if o == '--vocab':
															
 
																+    opt_vocab = a
															
 
																+
															
 
																+if not opt_vocab:
															
 
																+  print >> sys.stderr, 'please specify a vocabulary file with "--vocab"'
															
 
																+  sys.exit(2)
															
 
																+
															
 
																+if not opt_embeddings:
															
 
																+  print >> sys.stderr, 'please specify the embeddings with "--embeddings"'
															
 
																+  sys.exit(2)
															
 
																+
															
 
																+try:
															
 
																+  vecs = Vecs(opt_vocab, opt_embeddings)
															
 
																+except IOError, e:
															
 
																+  print >> sys.stderr, e
															
 
																+  sys.exit(1)
															
 
																+
															
 
																+def evaluate(lines):
															
 
																+  """Returns Spearman's rho for the word pairs in an iterable of lines."""
															
 
																+  acts, preds = [], []
															
 
																+
															
 
																+  for line in lines:
															
 
																+    w1, w2, act = line.strip().split('\t')
															
 
																+    pred = vecs.similarity(w1, w2)
															
 
																+    if pred is None:
															
 
																+      continue
															
 
																+
															
 
																+    acts.append(float(act))
															
 
																+    preds.append(pred)
															
 
																+
															
 
																+  rho, _ = scipy.stats.spearmanr(acts, preds)
															
 
																+  return rho
															
 
																+
															
 
																+for filename in args:
															
 
																+  with open(filename, 'r') as lines:
															
 
																+    print '%0.3f %s' % (evaluate(lines), filename)
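
Note on `swivel/wordsim.py`: the evaluation described in the docstring (three tab-separated columns, pairs without a prediction skipped, Spearman's rho over what remains) can be sketched without the embedding files. This is a minimal Python 3 illustration, not the script itself: `average_ranks`, `pearson`, `spearman_rho`, the `similarity` callback and the toy data are all hypothetical names introduced here, and the pure-Python rank correlation is only a stand-in for `scipy.stats.spearmanr`.

```python
import math


def average_ranks(values):
  """Returns 1-based ranks; tied values share the mean of their positions."""
  order = sorted(range(len(values)), key=lambda i: values[i])
  ranks = [0.0] * len(values)
  i = 0
  while i < len(order):
    j = i
    # Extend j over the run of values tied with the one at position i.
    while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
      j += 1
    mean_rank = (i + j) / 2.0 + 1.0
    for k in range(i, j + 1):
      ranks[order[k]] = mean_rank
    i = j + 1
  return ranks


def pearson(xs, ys):
  """Pearson correlation of two equal-length, non-constant sequences."""
  n = len(xs)
  mx, my = sum(xs) / n, sum(ys) / n
  cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
  sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
  sy = math.sqrt(sum((y - my) ** 2 for y in ys))
  return cov / (sx * sy)


def spearman_rho(acts, preds):
  """Spearman's rho is the Pearson correlation of the rank vectors."""
  return pearson(average_ranks(acts), average_ranks(preds))


def evaluate(lines, similarity):
  """Scores an iterable of 'w1<TAB>w2<TAB>score' lines.

  `similarity` is an assumed callback that returns None for pairs it
  cannot score (for example, out-of-vocabulary words).
  """
  acts, preds = [], []
  for line in lines:
    w1, w2, act = line.strip().split('\t')
    pred = similarity(w1, w2)
    if pred is None:
      continue  # Skip pairs with no prediction.
    acts.append(float(act))
    preds.append(pred)
  return spearman_rho(acts, preds)


# Toy data (made up for illustration only).
toy_scores = {('cat', 'dog'): 0.8, ('cat', 'car'): 0.1, ('dog', 'car'): 0.3}
toy_lines = [
    'cat\tdog\t9.0\n',
    'cat\tcar\t2.0\n',
    'dog\tcar\t1.5\n',
    'cat\tzebra\t5.0\n',  # No prediction available, so this pair is skipped.
]
rho = evaluate(toy_lines, lambda a, b: toy_scores.get((a, b)))
print('%0.3f' % rho)
```

Passing the similarity function in as an argument, rather than reading a module-level `vecs`, is a choice made for this sketch so that it runs on its own; the script in the diff uses the global `vecs` object instead.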