Vadim Markovtsev 7 years ago
commit b45345ff04
2 changed files with 132 additions and 22 deletions
  1. README.md (+124 -14)
  2. labours.py (+8 -8)

+ 124 - 14
README.md

@@ -3,19 +3,19 @@
 Hercules
 --------
 
-This tool calculates the lines burnout stats in a Git repository.
+This project calculates and plots the lines burndown and other fun stats in Git repositories.
 Exactly the same as what [git-of-theseus](https://github.com/erikbern/git-of-theseus)
 does, but using [go-git](https://github.com/src-d/go-git).
 Why? [source{d}](http://sourced.tech) builds its own data pipeline to
 process every git repository in the world and the calculation of the
-annual burnout ratio will be embedded into it. This project is an
+annual burnout ratio will be embedded into it. `hercules` contains an
 open source implementation of the specific `git blame` flavour on top
-of go-git. Blaming is done incrementally using the custom RB tree tracking
+of go-git. Blaming is performed incrementally using the custom RB tree tracking
 algorithm; only the last modification date is recorded.
 
 There are two tools: `hercules` and `labours.py`. The first is the program
-written in Go which collects the burnout stats from a Git repository.
-The second is the Python script which draws the stack area plot and optionally
+written in Go which collects the burndown and other stats from a Git repository.
+The second is the Python script which draws the stack area plots and optionally
 resamples the time series. These two tools are normally used together through
 a pipe. `hercules` prints results in plain text. The first line is four numbers:
 UNIX timestamp which corresponds to the time the repository was created,
@@ -54,26 +54,136 @@ hercules https://github.com/git/git /tmp/repo-cache | python3 labours.py --resam
 # Now something fun
 # Get the linear history from git rev-list, reverse it
 # Pipe to hercules, produce the snapshots for every 30 days grouped by 30 days
-# Save the raw data to cache.txt, so that later simply cat cache.txt | python3 labours.py
+# Save the raw data to cache.yaml, so that later it is possible to run python3 labours.py -i cache.yaml
 # Pipe the raw data to labours.py, set text font size to 16pt, use Agg matplotlib backend and save the plot to git.png
-git rev-list HEAD | tac | hercules -commits - https://github.com/git/git | tee cache.txt | python3 labours.py --font-size 16 --backend Agg --output git.png
+git rev-list HEAD | tac | hercules -commits - https://github.com/git/git | tee cache.yaml | python3 labours.py --font-size 16 --backend Agg --output git.png
 ```
 
+`labours.py -i /path/to/yaml` reads the output of `hercules` that was previously saved on disk.
+
 ### Extensions
 
-Option `-files` additionally prints the corresponding burndown table for every
-file in the repository. `-people` does the same for the developers; `-people-dict` allows to specify
-the custom identity matching.
+#### Files
+
+```
+hercules -files
+python3 labours.py -m file
+```
+
+Burndown statistics for every file in the repository that is alive in the latest revision.
+
+#### People
+
+```
+hercules -people [-people-dict=/path/to/identities]
+python3 labours.py -m person
+```
+
+Burndown statistics for developers. If `-people-dict` is not specified, the identities are
+discovered by the following algorithm:
+
+0. We start from the root commit towards the HEAD. Emails and names are converted to lower case.
+1. If we process an unknown email and name, record them as a new developer.
+2. If we process a known email but unknown name, match to the developer with the matching email,
+and add the unknown name to the list of that developer's names.
+3. If we process an unknown email but known name, match to the developer with the matching name,
+and add the unknown email to the list of that developer's emails.
+
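+As an illustration only, the discovery steps above can be sketched in Python roughly as follows; the
+real implementation lives in the Go code of `hercules`, and the commit attributes used here are
+hypothetical:
+
+```
+# Rough sketch of the identity discovery, not the actual hercules code.
+def discover_identities(commits):
+    by_email, by_name, people = {}, {}, []
+    for commit in commits:  # from the root commit towards HEAD
+        email = commit.author_email.lower()
+        name = commit.author_name.lower()
+        if email in by_email:
+            dev = by_email[email]
+            if name not in dev["names"]:
+                dev["names"].append(name)
+                by_name[name] = dev
+        elif name in by_name:
+            dev = by_name[name]
+            dev["emails"].append(email)
+            by_email[email] = dev
+        else:
+            dev = {"emails": [email], "names": [name]}
+            people.append(dev)
+            by_email[email] = dev
+            by_name[name] = dev
+    return people
+```
+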
+If `-people-dict` is specified, it should point to a text file with the custom identities. The
+format is: every line describes a single developer and contains all the matching emails and names separated
+by `|`. The case is ignored.
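+
+For example, a hypothetical identities file (all names and emails here are made up):
+
+```
+john doe|john.doe@example.com|jdoe@corp.example.com
+jane smith|jane|jane.smith@example.com
+```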
+
+#### Churn matrix
+
+```
+hercules -people [-people-dict=/path/to/identities]
+python3 labours.py -m churn_matrix
+```
+
+Besides the burndown information, `-people` collects the added and deleted line statistics per
+developer. The churn matrix shows how many lines written by developer A are removed by developer B. It is
+a matrix with N rows and (N+2) columns, where N is the number of developers.
+
+1. The first column is the number of lines the developer wrote.
+2. The second column is how many lines were written by the developer and deleted by unidentified developers
+(if `-people-dict` is not specified, it is always 0).
+3. The rest of the columns show how many lines were written by the developer and deleted by identified
+developers.
+
+The sequence of developers is stored in the `people_sequence` YAML node.
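+
+For plotting, `labours.py` normalizes the rows to the number of individual insertions (the first
+column). A minimal sketch of that normalization, assuming the matrix is already loaded into a numpy
+array:
+
+```
+import numpy
+
+def normalize_churn(matrix):
+    # matrix is an N x (N + 2) churn matrix as described above (a numpy integer array).
+    matrix = matrix.astype(float)
+    written = numpy.copy(matrix[:, 0])
+    written[written == 0] = 1  # avoid division by zero for developers with no insertions
+    return matrix / written[:, None]
+```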
+
+#### Code share
+
+```
+hercules -people [-people-dict=/path/to/identities]
+python3 labours.py -m people
+```
+
+`-people` also allows drawing the code share through time as a stacked area plot. That is,
+how many lines are alive at the sampled moments in time for each identified developer.
+
+#### Couples
+
+```
+hercules -couples [-people-dict=/path/to/identities]
+python3 labours.py -m couples -o <name> [--couples-tmp-dir=/tmp]
+```
+
+The files are coupled if they are changed in the same commit. The developers are coupled if they
+change the same file. `hercules` records the number of couples throughout the whole commit history
+and outputs the two corresponding co-occurrence matrices. `labours.py` then trains
+[Swivel embeddings](https://github.com/src-d/tensorflow-swivel) - dense vectors which reflect the
+co-occurrence probability through the Euclidean distance. The training requires a working
+[Tensorflow](http://tensorflow.org) installation. The intermediate files are stored in the
+system temporary directory or `--couples-tmp-dir` if it is specified. The trained embeddings are
+written to the current working directory with the name depending on `-o`. The output format is TSV
+and matches [Tensorflow Projector](http://projector.tensorflow.org/) so that the files and people
+can be visualized with t-SNE implemented in TF Projector.
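+
+For intuition, the file coupling boils down to counting pairwise co-occurrences within commits. A
+rough Python sketch, not the actual `hercules` code (which is written in Go); `commits` is assumed to
+be an iterable of per-commit file path lists:
+
+```
+from collections import Counter
+from itertools import combinations
+
+def file_cooccurrence(commits):
+    # Count how often every pair of files changes in the same commit.
+    counts = Counter()
+    for files in commits:
+        for a, b in combinations(sorted(set(files)), 2):
+            counts[(a, b)] += 1
+    return counts
+```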
 
-Correspondingly, `labours.py` has `--mode` which allows to plot all the burndowns for files,
-people and the overwrite matrix. The latter shows how much code written by a developer is removed
-by other developers, the rows are normalized to the number of individual insertions.
+#### Everything in a single pass
 
+```
+hercules -files -people -couples [-people-dict=/path/to/identities]
+python3 labours.py -m all
+```
+
+### Bad Unicode errors
+
+YAML does not support the whole range of Unicode characters, and the parser on the `labours.py` side
+may raise exceptions. Filter the output from `hercules` through `fix_yaml_unicode.py` to discard
+such offending characters.
+
+```
+hercules -people https://github.com/... | python3 fix_yaml_unicode.py | python3 labours.py -m people
+```
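+
+A rough Python approximation of such a filter, not the actual `fix_yaml_unicode.py`; the idea is to
+drop the control characters which the YAML parser rejects:
+
+```
+import sys
+
+ALLOWED_CONTROLS = {"\t", "\n", "\r"}
+
+for line in sys.stdin:
+    sys.stdout.write("".join(
+        c for c in line if c in ALLOWED_CONTROLS or ord(c) >= 0x20))
+```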
+
+### Plotting
+
+These options affect all plots:
+
+```
+python3 labours.py [--style=white|black] [--backend=]
+```
+
+`--style` changes the background to be either white ("black" foreground) or black ("white" foreground).
+`--backend` chooses the Matplotlib backend.
+
+These options are effective in burndown charts only:
+
+```
+python3 labours.py [--text-size] [--relative]
+```
 
+`--text-size` changes the font size, and `--relative` activates the stretched burndown layout.
 
 ### Caveats
 
-1. Currently, go-git's "file system" backend is considerably slower than the in-memory one, so you should clone repos instead of reading them from disk whenever possible.
+1. Currently, go-git's file system storage backend is considerably slower than the in-memory one,
+so you should clone repos instead of reading them from disk whenever possible. Please note that the
+in-memory storage may require a lot of RAM; for example, the Linux kernel takes over 200 GB as of 2017.
+2. Parsing YAML in Python is slow when the number of internal objects is large. `hercules`' output
+for the Linux kernel in "couples" mode is 1.5 GB and takes more than an hour and 180 GB of RAM to
+parse. However, most repositories are parsed within a minute.
 
 ### License
 MIT.

+ 8 - 8
labours.py

@@ -36,7 +36,7 @@ def parse_args():
                         help="Occupy 100%% height for every measurement.")
     parser.add_argument("--couples-tmp-dir", help="Temporary directory to work with couples.")
     parser.add_argument("-m", "--mode",
-                        choices=["project", "file", "person", "matrix", "people", "couples",
+                        choices=["project", "file", "person", "churn_matrix", "people", "couples",
                                  "all"],
                         default="project", help="What to plot.")
     parser.add_argument(
@@ -177,7 +177,7 @@ def load_main(header, name, matrix, resample):
     return name, matrix, date_range_sampling, labels, granularity, sampling, resample
 
 
-def load_matrix(contents):
+def load_churn_matrix(contents):
     matrix = numpy.array([numpy.fromstring(line, dtype=int, sep=" ")
                           for line in contents.split("\n")])
     return matrix
@@ -324,7 +324,7 @@ def plot_many(args, target, header, parts):
     sys.stdout.write(stdout.getvalue())
 
 
-def plot_matrix(args, repo, people, matrix):
+def plot_churn_matrix(args, repo, people, matrix):
     matrix = matrix.astype(float)
     zeros = matrix[:, 0] == 0
     matrix[zeros, :] = 1
@@ -471,10 +471,10 @@ def train_embeddings(coocc_tree, tmpdir, shard_size=4096):
         swivel.FLAGS.submatrix_cols = shard_size
         if len(meta_index) < 10000:
             embedding_size = 50
-            num_epochs = 100
+            num_epochs = 200
         elif len(meta_index) < 100000:
             embedding_size = 100
-            num_epochs = 200
+            num_epochs = 250
         elif len(meta_index) < 500000:
             embedding_size = 200
             num_epochs = 300
@@ -541,11 +541,11 @@ def main():
             print(people_warning)
             return
         plot_many(args, "person", header, people_contents)
-    elif args.mode == "matrix":
+    elif args.mode == "churn_matrix":
         if not people_contents:
             print(people_warning)
             return
-        plot_matrix(args, name, people_sequence, load_matrix(people_matrix))
+        plot_churn_matrix(args, name, people_sequence, load_churn_matrix(people_matrix))
     elif args.mode == "people":
         if not people_contents:
             print(people_warning)
@@ -563,7 +563,7 @@ def main():
             plot_many(args, "file", header, files_contents)
         if people_contents:
             plot_many(args, "person", header, people_contents)
-            plot_matrix(args, name, people_sequence, load_matrix(people_matrix))
+            plot_churn_matrix(args, name, people_sequence, load_churn_matrix(people_matrix))
             plot_people(args, name, *load_people(header, people_sequence, people_contents))
         if people_coocc:
             assert files_coocc