radu/hercules: Fast, insightful and highly customizable Git history analysis. @ f7ef7afc6970237bbe668d16a1f1beb730bd596c


Vadim Markovtsev f7ef7afc69 Switch to v2, 9 years ago

Hercules

This project calculates and plots the line burndown and other fun stats in Git repositories. It does exactly what git-of-theseus does, but uses go-git. Why? source{d} builds its own data pipeline to process every Git repository in the world, and the calculation of the annual burnout ratio will be embedded into it. hercules contains an open source implementation of a specific git blame flavour on top of go-git. Blaming is performed incrementally using a custom RB tree tracking algorithm, and only the last modification date is recorded.

There are two tools: hercules and labours.py. The first is a program written in Go which collects the burndown and other stats from a Git repository. The second is a Python script which draws the stacked area plots and optionally resamples the time series. The two tools are normally used together through a pipe. hercules prints results in plain text. The first line contains four numbers: the UNIX timestamp of the repository's creation, the UNIX timestamp of the last commit, the granularity and the sampling. Granularity is the number of days each band in the stack spans. Sampling is the interval, in days, at which the burnout state is snapshotted. The smaller the sampling value, the smoother the plot, but the more work is done.
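To make the header line concrete, here is a minimal Python sketch that reads the four numbers and gives a rough estimate of how many bands and snapshots to expect. It assumes the numbers are separated by whitespace. The names `Header`, `parse_header` and `estimate_sizes` are hypothetical and not part of hercules or labours.py, and the rounding is only an approximation.

```python
import math
from typing import NamedTuple


class Header(NamedTuple):
    start: int        # UNIX timestamp of the repository's creation
    last: int         # UNIX timestamp of the last commit
    granularity: int  # days per band
    sampling: int     # days between snapshots


def parse_header(line: str) -> Header:
    # Assumption: the four numbers are separated by whitespace.
    fields = line.split()
    if len(fields) != 4:
        raise ValueError("expected four numbers, got %d" % len(fields))
    return Header(*(int(f) for f in fields))


def estimate_sizes(header: Header) -> tuple:
    # Rough estimate only: the exact rounding used by hercules may differ.
    days = max(1, math.ceil((header.last - header.start) / 86400))
    bands = math.ceil(days / header.granularity)
    snapshots = math.ceil(days / header.sampling)
    return bands, snapshots
```

For a 90-day history with granularity 30 and sampling 30, this estimate gives three bands and three snapshots. Halving the sampling value doubles the number of snapshots, which is why smaller values cost more work.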

torvalds/linux burndown (granularity 30, sampling 30, resampled by year)

There is an option to resample the bands inside labours.py, so that you can define a very precise distribution and visualize it in different ways. Besides, resampling aligns the bands across periodic boundaries, e.g. months or years. Unresampled bands are not aligned and start from the project's birth date.

There is a presentation available.

Installation

You are going to need Go and Python 2 or 3.

go get gopkg.in/src-d/hercules.v2/cmd/hercules
pip install -r requirements.txt
wget https://github.com/src-d/hercules/raw/master/labours.py

Usage

# Use "memory" go-git backend and display the plot. This is the fastest but the repository data must fit into RAM.
hercules https://github.com/src-d/go-git | python3 labours.py --resample month
# Use "file system" go-git backend and print the raw data.
hercules /path/to/cloned/go-git
# Use "file system" go-git backend, cache the cloned repository to /tmp/repo-cache and display the unresampled plot.
hercules https://github.com/git/git /tmp/repo-cache | python3 labours.py --resample raw

# Now something fun
# Get the linear history from git rev-list, reverse it
# Pipe to hercules, produce the snapshots for every 30 days grouped by 30 days
# Save the raw data to cache.yaml, so that you can later run python3 labours.py -i cache.yaml
# Pipe the raw data to labours.py, set the text font size to 16pt, use the Agg matplotlib backend and save the plot to git.png
git rev-list HEAD | tac | hercules -commits - https://github.com/git/git | tee cache.yaml | python3 labours.py --text-size 16 --backend Agg --output git.png

labours.py -i /path/to/yaml reads hercules output that was saved to disk.

Extensions

Files

hercules -files
python3 labours.py -m files

Burndown statistics for every file in the repository which is alive in the latest revision.

People

hercules -people [-people-dict=/path/to/identities]
python3 labours.py -m person

Burndown statistics for developers. If -people-dict is not specified, the identities are discovered by the following algorithm:

We start from the root commit towards the HEAD. Emails and names are converted to lower case.
If we process an unknown email and name, record them as a new developer.
If we process a known email but unknown name, match to the developer with the matching email, and add the unknown name to the list of that developer's names.
If we process an unknown email but known name, match to the developer with the matching name, and add the unknown email to the list of that developer's emails.
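The discovery steps above can be sketched in Python as follows. This is an illustration, not the code from identity.go. The function name `discover_identities` and the data layout are hypothetical. The text does not say what happens when a known email and a known name point to different developers, so the sketch assumes the email match wins.

```python
def discover_identities(signatures):
    """signatures: iterable of (name, email) pairs, oldest commit first."""
    developers = []   # each entry: {"names": set, "emails": set}
    by_email = {}     # lowercased email -> index into developers
    by_name = {}      # lowercased name -> index into developers
    for name, email in signatures:
        name, email = name.lower(), email.lower()
        if email in by_email:
            # Known email; assumption: it takes precedence over a name match.
            idx = by_email[email]
        elif name in by_name:
            # Unknown email but known name.
            idx = by_name[name]
        else:
            # Both unknown: record a new developer.
            idx = len(developers)
            developers.append({"names": set(), "emails": set()})
        developers[idx]["names"].add(name)
        developers[idx]["emails"].add(email)
        by_email.setdefault(email, idx)
        by_name.setdefault(name, idx)
    return developers
```

Because the walk goes from the root commit towards HEAD, the earliest signature of a developer defines the identity and later aliases accumulate onto it.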

If -people-dict is specified, it should point to a text file with the custom identities. Every line describes a single developer and contains all the matching emails and names separated by |. Case is ignored.
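As an illustration of that file format, the following Python sketch parses such a text into a list of developers and an alias lookup table. The function name `parse_people_dict` is hypothetical. Trimming whitespace around aliases and skipping blank lines are assumptions that the text does not state.

```python
def parse_people_dict(text):
    """Return (developers, lookup); lookup maps each lowercased alias
    (a name or an email) to the index of its developer."""
    developers = []
    lookup = {}
    for line in text.splitlines():
        # Assumption: surrounding whitespace is not significant.
        aliases = [a.strip().lower() for a in line.split("|")]
        aliases = [a for a in aliases if a]
        if not aliases:
            continue  # assumption: blank lines are skipped
        index = len(developers)
        developers.append(aliases)
        for alias in aliases:
            lookup[alias] = index
    return developers, lookup
```

A file with the two lines `Alice|alice@x.org|Alice Smith` and `bob@z.org|Bob` would describe two developers, the first with three aliases.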

Churn matrix

hercules -people [-people-dict=/path/to/identities]
python3 labours.py -m churn_matrix

Besides the burndown information, -people collects the added and deleted line statistics per developer. It shows how many lines written by developer A are removed by developer B. The output is a matrix with N rows and (N+2) columns, where N is the number of developers.

First column is the number of lines the developer wrote.
Second column is how many lines were written by the developer and deleted by unidentified developers (if -people-dict is not specified, it is always 0).
The rest of the columns show how many lines were written by the developer and deleted by identified developers.

The sequence of developers is stored in the people_sequence YAML node.
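The column layout can be unpacked with a short Python sketch like the one below. The function name `describe_churn` and the sample numbers in the usage note are hypothetical. The sketch takes the matrix as a plain list of rows and passes the stored values through unchanged, since the sign convention of the real output is not stated here.

```python
def describe_churn(matrix, people):
    """matrix: N rows of N+2 numbers; people: the N developers in order."""
    n = len(people)
    if len(matrix) != n:
        raise ValueError("the matrix must have one row per developer")
    report = {}
    for author, row in zip(people, matrix):
        if len(row) != n + 2:
            raise ValueError("each row must have N+2 columns")
        report[author] = {
            # Column 1: lines the developer wrote.
            "written": row[0],
            # Column 2: lines deleted by unidentified developers.
            "deleted_by_unidentified": row[1],
            # Remaining N columns: lines deleted by each identified developer.
            "deleted_by": dict(zip(people, row[2:])),
        }
    return report
```

With people `["alice", "bob"]` and the made-up rows `[100, 0, 10, 25]` and `[40, 5, 8, 2]`, the report says that bob removed 25 of the lines alice wrote.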

Code share

hercules -people [-people-dict=/path/to/identities]
python3 labours.py -m people

-people also makes it possible to draw a stacked area plot of the code share through time, that is, how many lines are alive at the sampled moments in time for each identified developer.

Couples

hercules -couples [-people-dict=/path/to/identities]
python3 labours.py -m couples -o <name> [--couples-tmp-dir=/tmp]

The files are coupled if they are changed in the same commit. The developers are coupled if they change the same file. hercules records the number of couples throughout the whole commit history and outputs the two corresponding co-occurrence matrices. labours.py then trains Swivel embeddings - dense vectors which reflect the co-occurrence probability through the Euclidean distance. The training requires a working Tensorflow installation. The intermediate files are stored in the system temporary directory or in --couples-tmp-dir if it is specified. The trained embeddings are written to the current working directory with a name that depends on -o. The output format is TSV and matches Tensorflow Projector, so that the files and people can be visualized with t-SNE as implemented in TF Projector.
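The two coupling definitions can be illustrated with a small Python sketch that counts co-occurrences as sparse pair counts. This is not the code from couples.go. The function names and input shapes are hypothetical, and the exact weighting hercules applies may differ.

```python
from collections import Counter
from itertools import combinations


def file_couples(commits):
    """commits: iterable of collections of file paths changed together."""
    couples = Counter()
    for files in commits:
        # Every unordered pair of files in one commit is one co-occurrence.
        for a, b in combinations(sorted(set(files)), 2):
            couples[(a, b)] += 1
    return couples


def developer_couples(changes):
    """changes: iterable of (developer, file) pairs, one per file change."""
    touched = {}
    for developer, path in changes:
        touched.setdefault(path, set()).add(developer)
    couples = Counter()
    for developers in touched.values():
        # Assumption: each shared file counts once per pair of developers.
        for a, b in combinations(sorted(developers), 2):
            couples[(a, b)] += 1
    return couples
```

Pairs with high counts are the ones an embedding trained on these matrices would be expected to place close to each other.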

Everything in a single pass

hercules -files -people -couples [-people-dict=/path/to/identities]
python3 labours.py -m all

Bad unicode errors

YAML does not support the whole range of Unicode characters and the parser on labours.py side may raise exceptions. Filter the output from hercules through fix_yaml_unicode.py to discard such offending characters.

hercules -people https://github.com/... | python3 fix_yaml_unicode.py | python3 labours.py -m people
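The idea behind such a filter can be sketched in Python as follows. This is not the contents of fix_yaml_unicode.py, which is not shown here. It assumes that the offending characters are those outside the printable set defined by the YAML 1.1 specification, and the function name `strip_non_yaml_chars` is hypothetical.

```python
import re

# Everything outside the YAML 1.1 "printable" character set:
# tab, LF, CR, printable ASCII, NEL, and most of the rest of Unicode
# except C0/C1 controls, surrogates, U+FFFE and U+FFFF.
_NON_PRINTABLE = re.compile(
    "[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD"
    "\U00010000-\U0010FFFF]")


def strip_non_yaml_chars(text):
    """Drop the characters a YAML parser would reject as non-printable."""
    return _NON_PRINTABLE.sub("", text)
```

Applied line by line to standard input, a function like this could sit between hercules and labours.py in the pipe. Legitimate non-ASCII names pass through untouched; only control characters and similar debris are removed.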

Plotting

These options affect all plots:

python3 labours.py [--style=white|black] [--backend=]

--style changes the background to be either white ("black" foreground) or black ("white" foreground). --backend chooses the Matplotlib backend.

(Required on macOS) You can pin the default Matplotlib backend with

echo "backend: TkAgg" > ~/.matplotlib/matplotlibrc

These options are effective in burndown charts only:

python3 labours.py [--text-size] [--relative]

--text-size changes the font size, --relative activates the stretched burndown layout.

Caveats

Currently, go-git's file system storage backend is considerably slower than the in-memory one, so you should clone repos instead of reading them from disk whenever possible. Please note that the in-memory storage may require a lot of RAM; for example, the Linux kernel took over 200 GB in 2017.
Parsing YAML in Python is slow when the number of internal objects is big. hercules' output for the Linux kernel in "couples" mode is 1.5 GB and takes more than an hour and 180 GB of RAM to parse. However, most repositories are parsed within a minute.
To speed up YAML parsing:

```
# Debian, Ubuntu
apt install libyaml-dev

# macOS
brew install yaml-cpp libyaml

# you might need to reinstall pyyaml for the changes to take effect
pip uninstall pyyaml
pip --no-cache-dir install pyyaml
```

License

MIT.
