Fast, insightful and highly customizable Git history analysis.
This project calculates and plots the line burndown and other fun stats in Git repositories.
It does exactly what git-of-theseus
does, but uses go-git.
Why? source{d} builds its own data pipeline to
process every Git repository in the world, and the calculation of the
annual burndown ratio will be embedded into it. hercules contains an
open source implementation of a specific git blame flavour on top
of go-git. Blaming is performed incrementally using a custom RB tree tracking
algorithm; only the last modification date is recorded.
There are two tools: hercules and labours.py. The first is a program
written in Go which collects the burndown and other stats from a Git repository.
The second is a Python script which draws the stacked area plots and optionally
resamples the time series. The two tools are normally used together through
a pipe. hercules prints results in plain text. The first line contains four numbers:
the UNIX timestamp of the time the repository was created,
the UNIX timestamp of the last commit, the granularity and the sampling.
Granularity is the number of days each band in the stack consists of. Sampling
is how often, in days, the burndown state is snapshotted. The smaller the
value, the smoother the plot, but the more work is done.
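As an illustration of the header line described above, here is a minimal Python sketch that parses the four numbers. It assumes the numbers are separated by whitespace, which the text does not state; the `BurndownHeader` and `parse_header` names and the sample timestamps are made up for the example, and the band count is only a rough estimate.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class BurndownHeader(NamedTuple):
    begin: datetime    # time the repository was created
    end: datetime      # time of the last commit
    granularity: int   # days per band
    sampling: int      # days between snapshots


def parse_header(line: str) -> BurndownHeader:
    # Assumption: the four numbers are whitespace-separated.
    begin, end, granularity, sampling = (int(tok) for tok in line.split())
    return BurndownHeader(
        datetime.fromtimestamp(begin, tz=timezone.utc),
        datetime.fromtimestamp(end, tz=timezone.utc),
        granularity,
        sampling,
    )


# Hypothetical header: one year of history, 30-day bands, 30-day snapshots.
header = parse_header("1262304000 1293840000 30 30")
days = (header.end - header.begin).days
# Rough estimate of the band count (ceiling division); hercules may count differently.
bands = -(-days // header.granularity)
```

With a smaller sampling value the same time span yields more snapshots, which is where the extra work comes from.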
torvalds/linux burndown (granularity 30, sampling 30, resampled by year)
There is an option to resample the bands inside labours.py, so that you can
define a very precise distribution and visualize it in different ways. Besides,
resampling aligns the bands across periodic boundaries, e.g. months or years.
Unresampled bands are not aligned and start from the project's birth date.
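The difference between unresampled and resampled band boundaries can be sketched in Python as follows. This only shows where the boundaries fall; it is not how labours.py resamples the data, and the function names and dates are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def raw_band_starts(begin, end, granularity_days):
    """Band boundaries counted from the project's birth date."""
    starts = []
    current = begin
    while current < end:
        starts.append(current)
        current += timedelta(days=granularity_days)
    return starts


def yearly_band_starts(begin, end):
    """Band boundaries aligned to January 1st of each year."""
    starts = [begin]
    for year in range(begin.year + 1, end.year + 1):
        boundary = datetime(year, 1, 1, tzinfo=timezone.utc)
        if boundary < end:
            starts.append(boundary)
    return starts


# Hypothetical project born in the middle of March 2015.
begin = datetime(2015, 3, 17, tzinfo=timezone.utc)
end = datetime(2017, 6, 1, tzinfo=timezone.utc)
raw = raw_band_starts(begin, end, 30)   # drifts away from calendar boundaries
yearly = yearly_band_starts(begin, end)  # snaps to the start of each year
```

The raw boundaries advance in fixed 30-day steps from the birth date, while the yearly ones land on January 1st regardless of when the project started.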
There is a presentation available.
You are going to need Go (>= v1.8) and Python 2 or 3.
go get gopkg.in/src-d/hercules.v2/cmd/hercules
pip install -r requirements.txt
wget https://github.com/src-d/hercules/raw/master/labours.py
Numpy and SciPy are requirements. Install the correct version by downloading the wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#scipy.
# Use "memory" go-git backend and display the plot. This is the fastest but the repository data must fit into RAM.
hercules https://github.com/src-d/go-git | python3 labours.py --resample month
# Use "file system" go-git backend and print the raw data.
hercules /path/to/cloned/go-git
# Use "file system" go-git backend, cache the cloned repository to /tmp/repo-cache, use Protocol Buffers and display the unresampled plot.
hercules -pb https://github.com/git/git /tmp/repo-cache | python3 labours.py -f pb --resample raw
# Now something fun
# Get the linear history from git rev-list, reverse it
# Pipe to hercules, produce the snapshots for every 30 days grouped by 30 days
# Save the raw data to cache.yaml, so that it can be plotted later with python3 labours.py -i cache.yaml
# Pipe the raw data to labours.py, set text font size to 16pt, use Agg matplotlib backend and save the plot to git.png
git rev-list HEAD | tac | hercules -commits - https://github.com/git/git | tee cache.yaml | python3 labours.py --font-size 16 --backend Agg --output git.png
labours.py -i /path/to/yaml reads hercules output that was saved to disk.
It is possible to store the cloned repository on disk. The subsequent analysis can run on the corresponding directory instead of cloning from scratch:
# First time - cache
hercules https://github.com/git/git /tmp/repo-cache
# Second time - use the cache
hercules /tmp/repo-cache
hercules -files
python3 labours.py -m files
Burndown statistics for every file in the repository which is alive in the latest revision.
hercules -people [-people-dict=/path/to/identities]
python3 labours.py -m person
Burndown statistics for developers. If -people-dict is not specified, the identities are
discovered by the following algorithm:
If -people-dict is specified, it should point to a text file with the custom identities. The
format is one developer per line, with all the matching emails and names separated
by |. Case is ignored.
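The identities file format described above can be read with a few lines of Python. This is a sketch rather than the parser hercules uses: the function name is hypothetical, and trimming the whitespace around each alias is an assumption.

```python
def parse_people_dict(text):
    """Map every lower-cased alias (email or name) to its developer index."""
    alias_to_dev = {}
    dev_index = 0
    for line in text.splitlines():
        # Assumption: surrounding whitespace is not significant.
        aliases = [a.strip().lower() for a in line.split("|") if a.strip()]
        if not aliases:
            continue  # skip blank lines
        for alias in aliases:
            alias_to_dev[alias] = dev_index
        dev_index += 1
    return alias_to_dev


# Hypothetical identities file with two developers.
sample = "Jane Doe|jane@example.com|JDoe\nbob@example.com|Bob Smith\n"
mapping = parse_people_dict(sample)
```

Lower-casing every alias on load mirrors the rule that case is ignored when matching.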
Wireshark top 20 devs - churn matrix
hercules -people [-people-dict=/path/to/identities]
python3 labours.py -m churn_matrix
Besides the burndown information, -people collects the added and deleted line statistics per
developer. It shows how many lines written by developer A are removed by developer B. The format is
a matrix with N rows and (N+2) columns, where N is the number of developers.
-people-dict is not specified, it is always 0). The sequence of developers is stored in the people_sequence YAML node.
Ember.js top 20 devs - code ownership
hercules -people [-people-dict=/path/to/identities]
python3 labours.py -m ownership
-people also allows drawing a stacked area plot of the code share through time. That is,
how many lines are alive at the sampled moments in time for each identified developer.
torvalds/linux files' coupling in Tensorflow Projector
hercules -couples [-people-dict=/path/to/identities]
python3 labours.py -m couples -o <name> [--couples-tmp-dir=/tmp]
Important: it requires Tensorflow to be installed, please follow the official instructions.
Files are coupled if they are changed in the same commit. Developers are coupled if they
change the same file. hercules records the number of couples throughout the whole commit history
and outputs the two corresponding co-occurrence matrices. labours.py then trains
Swivel embeddings, dense vectors which reflect the
co-occurrence probability through the Euclidean distance. The training requires a working
Tensorflow installation. The intermediate files are stored in the
system temporary directory or --couples-tmp-dir if it is specified. The trained embeddings are
written to the current working directory with the name depending on -o. The output format is TSV
and matches Tensorflow Projector so that the files and people
can be visualized with t-SNE implemented in TF Projector.
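The file coupling rule (two files are coupled when they change in the same commit) can be sketched in Python as a pair counter. This is an illustration on a hypothetical commit list, not the hercules implementation, which may also count the diagonal or store the matrix differently.

```python
from collections import Counter
from itertools import combinations


def file_cooccurrence(commits):
    """Count how many commits touched each unordered pair of files."""
    counts = Counter()
    for files in commits:
        # Sort so that every pair is always stored as (smaller, larger).
        for a, b in combinations(sorted(set(files)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical history: each inner list holds the files changed by one commit.
commits = [
    ["burndown.go", "burndown_test.go"],
    ["burndown.go", "burndown_test.go", "file.go"],
    ["file.go", "file_test.go"],
]
pairs = file_cooccurrence(commits)
```

The developer matrix follows the same idea with the roles swapped: group by file and count pairs of developers who changed it.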
hercules -files -people -couples [-people-dict=/path/to/identities]
python3 labours.py -m all
YAML does not support the whole range of Unicode characters and the parser on labours.py side
may raise exceptions. Filter the output from hercules through fix_yaml_unicode.py to discard
such offending characters.
hercules -people https://github.com/... | python3 fix_yaml_unicode.py | python3 labours.py -m person
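To show what discarding such characters can look like, here is a minimal Python sketch that drops everything outside the printable character set defined by the YAML specification. It is not the contents of fix_yaml_unicode.py, which may work differently; the function name is hypothetical.

```python
import re

# Everything outside the "printable" character set of the YAML specification.
_NON_PRINTABLE = re.compile(
    r"[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_non_yaml_chars(text):
    """Remove characters that a YAML parser is allowed to reject."""
    return _NON_PRINTABLE.sub("", text)


# Control characters such as NUL and BEL are dropped; ordinary text survives.
cleaned = strip_non_yaml_chars("name: J\x00ane D\x07oe\n")
```

Applied line by line to standard input, a filter like this can sit between the two tools in the pipe.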
These options affect all plots:
python3 labours.py [--style=white|black] [--backend=] [--size=Y,X]
--style changes the background to be either white ("black" foreground) or black ("white" foreground).
--backend chooses the Matplotlib backend.
--size sets the size of the figure in inches. The default is 12,9.
You can pin the default Matplotlib backend (required on macOS) with
echo "backend: TkAgg" > ~/.matplotlib/matplotlibrc
These options are effective in burndown charts only:
python3 labours.py [--text-size] [--relative]
--text-size changes the font size, --relative activates the stretched burndown layout.
It is possible to output all the information needed to draw the plots in JSON format.
Simply append .json to the output (-o) and you are done. The data format is not fully
specified and depends on the Python code which generates it. Each JSON file should
contain "type" which reflects the plot kind.
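Since the format is not fully specified, a consumer can rely only on the "type" key. A minimal Python sketch of reading it is shown below; the payload and the "burndown" value are hypothetical, and the remaining keys depend on the plot kind.

```python
import json


def plot_kind(json_text):
    """Return the value of the top-level "type" key, or None if it is absent."""
    document = json.loads(json_text)
    return document.get("type")


# Hypothetical payload; only the presence of "type" comes from the text above.
sample = '{"type": "burndown", "data": [[1, 2], [3, 4]]}'
kind = plot_kind(sample)
```

Checking the kind first lets a script decide how to interpret the rest of the file, or skip plot kinds it does not understand.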
hercules' output
for the Linux kernel in "couples" mode is 1.5 GB and takes more than an hour and 180 GB of RAM to
parse. However, most repositories are parsed within a minute. Try using Protocol Buffers
instead (hercules -pb and labours.py -f pb).
To speed up YAML parsing:
```
apt install libyaml-dev
brew install yaml-cpp libyaml

# you might need to re-install pyyaml for the changes to take effect
pip uninstall pyyaml
pip --no-cache-dir install pyyaml
```
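After reinstalling, you can check whether PyYAML picked up libyaml. The sketch below uses PyYAML's `CSafeLoader`, which is only present when the C bindings were built; the helper name is hypothetical, and whether labours.py selects its loader this way is an assumption.

```python
def pick_yaml_loader():
    """Return PyYAML's C-accelerated loader if available, else the pure-Python one."""
    try:
        import yaml
    except ImportError:
        return None  # PyYAML is not installed at all
    # CSafeLoader exists only when PyYAML was built against libyaml.
    return getattr(yaml, "CSafeLoader", yaml.SafeLoader)


loader = pick_yaml_loader()
uses_libyaml = loader is not None and loader.__name__.startswith("C")
```

If `uses_libyaml` is false even though libyaml is installed, PyYAML was most likely installed from a cached build, which is what the `--no-cache-dir` reinstall addresses.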
Apache 2.0.