radu/hercules: Fast, insightful and highly customizable Git history analysis. @ v3

Vadim Markovtsev 5a75d8ff07 Merge pull request #92 from vmarkovtsev/v3		7 years ago

Hercules

Amazingly fast and highly customizable Git repository analysis engine written in Go. Batteries included. Powered by go-git and Babelfish.

There are two tools: hercules and labours.py. The first is a program written in Go that takes a Git repository and runs a Directed Acyclic Graph (DAG) of analysis tasks. The second is a Python script that draws some predefined plots. The two are normally used together through a pipe. It is possible to write custom analyses using the plugin system, and to merge several analysis results together. There is a presentation available.

The DAG of burndown and couples analyses with UAST diff refining. Generated with hercules --burndown --burndown-people --couples --feature=uast --dry-run --dump-dag doc/dag.dot https://github.com/src-d/hercules

torvalds/linux line burndown (granularity 30, sampling 30, resampled by year). Generated with hercules --burndown --pb https://github.com/torvalds/linux | python3 labours.py -f pb -m project

Installation

Grab the hercules binary from the Releases page. labours.py requires the Python packages listed in requirements.txt:

pip3 install -r requirements.txt

Numpy and Scipy can be installed on Windows using http://www.lfd.uci.edu/~gohlke/pythonlibs/. Linux releases require libtensorflow.

Build from source

You are going to need Go (>= v1.8), protoc and Python 2 or 3.

go get -d gopkg.in/src-d/hercules.v3/cmd/hercules
cd $GOPATH/src/gopkg.in/src-d/hercules.v3
make

Replace $GOPATH with %GOPATH% on Windows.

Contributions

...are welcome! See CONTRIBUTING and code of conduct.

License

Apache 2.0

Usage

# Use "memory" go-git backend and display the burndown plot. "memory" is the fastest but the repository's git data must fit into RAM.
hercules --burndown https://github.com/src-d/go-git | python3 labours.py -m project --resample month
# Use "file system" go-git backend and print some basic information about the repository.
hercules /path/to/cloned/go-git
# Use "file system" go-git backend, cache the cloned repository to /tmp/repo-cache, use Protocol Buffers and display the burndown plot without resampling.
hercules --burndown --pb https://github.com/git/git /tmp/repo-cache | python3 labours.py -m project -f pb --resample raw

# Now something fun
# Get the linear history from git rev-list, reverse it
# Pipe to hercules, produce burndown snapshots for every 30 days grouped by 30 days
# Save the raw data to cache.yaml, so that later it is possible to run python3 labours.py -i cache.yaml
# Pipe the raw data to labours.py, set text font size to 16pt, use Agg matplotlib backend and save the plot to git.png
git rev-list HEAD | tac | hercules --commits - --burndown https://github.com/git/git | tee cache.yaml | python3 labours.py -m project --font-size 16 --backend Agg --output git.png

labours.py -i /path/to/yaml reads hercules output that was saved to disk.

Caching

It is possible to store the cloned repository on disk. The subsequent analysis can run on the corresponding directory instead of cloning from scratch:

# First time - cache
hercules https://github.com/git/git /tmp/repo-cache

# Second time - use the cache
hercules --some-analysis /tmp/repo-cache

Docker image

docker run --rm srcd/hercules hercules --burndown --pb https://github.com/git/git | docker run --rm -i -v $(pwd):/io srcd/hercules labours.py -f pb -m project -o /io/git_git.png

Built-in analyses

Project burndown

hercules --burndown
python3 labours.py -m project

Line burndown statistics for the whole repository. Exactly what git-of-theseus does, but much faster. Blaming is performed efficiently and incrementally using a custom RB tree tracking algorithm, and only the last modification date is recorded while running the analysis.

All burndown analyses depend on the values of granularity and sampling. Granularity is the number of days each band in the stack consists of. Sampling is the frequency with which the burndown state is snapshotted. The smaller the value, the smoother the plot, but the more work is done.
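
As a rough illustration of how the two values interact, here is a minimal Python sketch. It is not hercules' actual code: the function names are hypothetical, days are assumed to be counted from the first commit, and the exact boundaries hercules uses may differ.

```python
def band_index(day, granularity):
    """Band of the stack that a line written on `day` falls into (assumed layout)."""
    return day // granularity


def snapshot_days(total_days, sampling):
    """Days on which the burndown state would be snapshotted (assumed layout)."""
    return list(range(0, total_days + 1, sampling))


if __name__ == "__main__":
    # Granularity 30: days 0..29 share band 0, day 30 opens band 1.
    print(band_index(29, 30), band_index(30, 30))
    # Sampling 30 over a 90-day history.
    print(snapshot_days(90, 30))
```

With granularity 30 and sampling 30, a 90-day history gives four snapshots in this sketch, and halving the sampling value roughly doubles the number of snapshots to compute.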

There is an option to resample the bands inside labours.py, so that you can define a very precise distribution and visualize it in different ways. Besides, resampling aligns the bands across periodic boundaries, e.g. months or years. Unresampled bands are not aligned and start from the project's birth date.

Files

hercules --burndown --burndown-files
python3 labours.py -m file

Burndown statistics for every file in the repository which is alive in the latest revision.

Note: it will generate a separate graph for every file. You might not want to run it on a repository with many files.

People

hercules --burndown --burndown-people [--people-dict=/path/to/identities]
python3 labours.py -m person

Burndown statistics for the repository's contributors. If --people-dict is not specified, the identities are discovered by the following algorithm:

We start from the root commit towards the HEAD. Emails and names are converted to lower case.
If we process an unknown email and name, record them as a new developer.
If we process a known email but unknown name, match to the developer with the matching email, and add the unknown name to the list of that developer's names.
If we process an unknown email but known name, match to the developer with the matching name, and add the unknown email to the list of that developer's emails.
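
The steps above can be sketched in Python as follows. This is only an illustration, not hercules' actual Go implementation: the function name and data layout are hypothetical. The text does not say what happens when a known email and a known name point at different developers, so the sketch assumes the email match wins.

```python
def discover_identities(signatures):
    """Group (name, email) pairs, given in commit order from the root, into developers."""
    developers = []  # each item: {"names": set(), "emails": set()}
    by_email = {}    # lower-cased email -> developer index
    by_name = {}     # lower-cased name -> developer index
    for name, email in signatures:
        name, email = name.lower(), email.lower()
        index = by_email.get(email)
        if index is None:
            # Assumption: the name is consulted only when the email is unknown.
            index = by_name.get(name)
        if index is None:
            # Unknown email and unknown name: record a new developer.
            index = len(developers)
            developers.append({"names": set(), "emails": set()})
        developers[index]["names"].add(name)
        developers[index]["emails"].add(email)
        by_email.setdefault(email, index)
        by_name.setdefault(name, index)
    return developers


if __name__ == "__main__":
    # Made-up signatures for illustration.
    history = [
        ("Alice", "alice@example.com"),
        ("alice", "ALICE@work.example"),
        ("Al", "alice@work.example"),
        ("Bob", "bob@example.com"),
    ]
    for dev in discover_identities(history):
        print(sorted(dev["names"]), sorted(dev["emails"]))
```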

If --people-dict is specified, it should point to a text file with the custom identities. The format is one developer per line, with all of that developer's matching emails and names separated by |. Case is ignored.
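
For illustration, a file in that format could be read with a few lines of Python. This is a sketch under assumptions, not the parser hercules uses: the function name is hypothetical, and skipping blank lines and trimming whitespace around each alias are guesses about reasonable behaviour.

```python
def parse_people_dict(text):
    """Return one list of lower-cased aliases (names and emails) per developer."""
    developers = []
    for line in text.splitlines():
        # Assumption: surrounding whitespace and empty fields are not significant.
        aliases = [part.strip().lower() for part in line.split("|") if part.strip()]
        if aliases:
            developers.append(aliases)
    return developers


if __name__ == "__main__":
    # Made-up identities for illustration.
    sample = "Alice|alice@example.com|ALICE@work.example\nBob|bob@example.com\n"
    for aliases in parse_people_dict(sample):
        print(aliases)
```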

Churn matrix

Wireshark top 20 devs - churn matrix

hercules --burndown --burndown-people [--people-dict=/path/to/identities]
python3 labours.py -m churn_matrix

Besides the burndown information, --burndown-people collects the added and deleted line statistics per developer. It shows how many lines written by developer A are removed by developer B. The format is a matrix with N rows and (N+2) columns, where N is the number of developers.

The first column is the number of lines the developer wrote.
The second column is how many lines were written by the developer and deleted by unidentified developers (if --people-dict is not specified, it is always 0).
The rest of the columns show how many lines were written by the developer and deleted by identified developers.

The sequence of developers is stored in the people_sequence YAML node.
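
To make the column layout concrete, here is a small Python sketch that labels the cells of such a matrix. The function name and the sample numbers are made up, and how the matrix is extracted from the hercules output is deliberately left out.

```python
def summarize_churn(matrix, people):
    """Label the cells of an N x (N+2) churn matrix using the developer sequence."""
    n = len(people)
    summary = {}
    for name, row in zip(people, matrix):
        if len(row) != n + 2:
            raise ValueError("expected %d columns, got %d" % (n + 2, len(row)))
        summary[name] = {
            "written": row[0],                       # first column
            "deleted_by_unidentified": row[1],       # second column
            "deleted_by": dict(zip(people, row[2:])),  # remaining N columns
        }
    return summary


if __name__ == "__main__":
    people = ["alice", "bob"]              # hypothetical people_sequence
    matrix = [[100, 0, 10, 5],             # hypothetical values
              [50, 0, 2, 20]]
    print(summarize_churn(matrix, people)["alice"]["deleted_by"]["bob"])
```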

Code ownership

Ember.js top 20 devs - code ownership

hercules --burndown --burndown-people [--people-dict=/path/to/identities]
python3 labours.py -m ownership

--burndown-people also makes it possible to draw a stacked area plot of the code share through time, that is, how many lines are alive at the sampled moments in time for each identified developer.

Couples

torvalds/linux files' coupling in Tensorflow Projector

hercules --couples [--people-dict=/path/to/identities]
python3 labours.py -m couples -o <name> [--couples-tmp-dir=/tmp]

Important: it requires Tensorflow to be installed; please follow the official instructions.

The files are coupled if they are changed in the same commit. The developers are coupled if they change the same file. hercules records the number of couples throughout the whole commit history and outputs the two corresponding co-occurrence matrices. labours.py then trains Swivel embeddings - dense vectors which reflect the co-occurrence probability through the Euclidean distance. The training requires a working Tensorflow installation. The intermediate files are stored in the system temporary directory or --couples-tmp-dir if it is specified. The trained embeddings are written to the current working directory with the name depending on -o. The output format is TSV and matches Tensorflow Projector so that the files and people can be visualized with t-SNE implemented in TF Projector.
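
The file coupling rule can be illustrated with a short Python sketch that counts how often two files change in the same commit. It only shows the counting idea under simplifying assumptions: the function name is hypothetical, commits are given as plain lists of paths, and the matrices hercules actually emits may be laid out differently.

```python
from collections import Counter
from itertools import combinations


def file_cooccurrence(commits):
    """Count, for every pair of files, the commits in which both were changed."""
    counts = Counter()
    for files in commits:
        # Each unordered pair is counted once per commit.
        for pair in combinations(sorted(set(files)), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    # Made-up history: each inner list is the set of files touched by one commit.
    history = [["a.go", "b.go"], ["a.go", "b.go", "c.go"], ["c.go"]]
    for pair, count in sorted(file_cooccurrence(history).items()):
        print(pair, count)
```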

Structural hotness

      46  jinja2/compiler.py:visit_Template [FunctionDef]
      42  jinja2/compiler.py:visit_For [FunctionDef]
      34  jinja2/compiler.py:visit_Output [FunctionDef]
      29  jinja2/environment.py:compile [FunctionDef]
      27  jinja2/compiler.py:visit_Include [FunctionDef]
      22  jinja2/compiler.py:visit_Macro [FunctionDef]
      22  jinja2/compiler.py:visit_FromImport [FunctionDef]
      21  jinja2/compiler.py:visit_Filter [FunctionDef]
      21  jinja2/runtime.py:__call__ [FunctionDef]
      20  jinja2/compiler.py:visit_Block [FunctionDef]

Thanks to Babelfish, hercules is able to measure how many times each structural unit has been modified. By default, it looks at functions; refer to the UAST XPath manual to set another query.

hercules --shotness [--shotness-xpath-*]
python3 labours.py -m shotness

Couples analysis automatically loads "shotness" data if available.

hercules --shotness --pb https://github.com/pallets/jinja | python3 labours.py -m couples -f pb

Sentiment (positive and negative code)

hercules --sentiment --pb https://github.com/django/django | python3 labours.py -m sentiment -f pb

We extract new or changed comments from the source code on every commit, apply the BiDiSentiment general-purpose sentiment recurrent neural network and plot the results. Requires libtensorflow. For example, "sadly, we need to hide the rect from the documentation finder for now" is negative and "Theano has a built-in optimization for logsumexp (...) so we can just write the expression directly" is positive. Don't expect too much though - as noted above, the sentiment model is general purpose and code comments are of a different nature, so there is no magic (for now).

Everything in a single pass

hercules --burndown --burndown-files --burndown-people --couples --shotness [--people-dict=/path/to/identities]
python3 labours.py -m all

Plugins

Hercules has a plugin system which allows running custom analyses. See PLUGINS.md.

Merging

hercules combine is the command which joins several analysis results in Protocol Buffers format.

hercules --burndown --pb https://github.com/src-d/go-git > go-git.pb
hercules --burndown --pb https://github.com/src-d/hercules > hercules.pb
hercules combine go-git.pb hercules.pb | python3 labours.py -f pb -m project --resample M

Bad unicode errors

YAML does not support the whole range of Unicode characters and the parser on labours.py side may raise exceptions. Filter the output from hercules through fix_yaml_unicode.py to discard such offending characters.

hercules --burndown --burndown-people https://github.com/... | python3 fix_yaml_unicode.py | python3 labours.py -m person
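
To give an idea of what such a filter does, here is a minimal Python sketch that drops characters outside the printable set defined by the YAML specification. It is not the contents of fix_yaml_unicode.py, which may work differently; the function name is hypothetical.

```python
import re

# Everything outside the YAML "printable" character set.
NON_PRINTABLE = re.compile(
    "[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_non_yaml_chars(text):
    """Remove the characters a YAML parser would reject as non-printable."""
    return NON_PRINTABLE.sub("", text)


if __name__ == "__main__":
    # The NUL and BEL characters are dropped, the accented letter is kept.
    print(repr(strip_non_yaml_chars("ok\x00\x07\u00e9")))
```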

Plotting

These options affect all plots:

python3 labours.py [--style=white|black] [--backend=] [--size=Y,X]

--style changes the background to be either white ("black" foreground) or black ("white" foreground). --backend chooses the Matplotlib backend. --size sets the size of the figure in inches. The default is 12,9.

On macOS (where it is required), you can pin the default Matplotlib backend with

echo "backend: TkAgg" > ~/.matplotlib/matplotlibrc

These options are effective in burndown charts only:

python3 labours.py [--text-size] [--relative]

--text-size changes the font size, --relative activates the stretched burndown layout.

Custom plotting backend

It is possible to output all the information needed to draw the plots in JSON format. Simply append .json to the output (-o) and you are done. The data format is not fully specified and depends on the Python code which generates it. Each JSON file should contain "type" which reflects the plot kind.
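
A custom backend would typically start by reading such a file and dispatching on "type". The Python sketch below shows only that first step. The function name is hypothetical, it assumes the top level of the file is a JSON object, and the sample value and extra field in the demo are made up, since the data format is not fully specified.

```python
import json
import os
import tempfile


def load_plot_data(path):
    """Read a JSON file produced with -o <name>.json and return (type, data)."""
    with open(path, encoding="utf-8") as fin:
        data = json.load(fin)
    kind = data.get("type")
    if kind is None:
        raise ValueError('%s does not contain "type"' % path)
    return kind, data


if __name__ == "__main__":
    # Write a tiny made-up file so that the example runs without hercules.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "plot.json")
        with open(path, "w", encoding="utf-8") as fout:
            json.dump({"type": "example", "hypothetical_field": [1, 2, 3]}, fout)
        kind, data = load_plot_data(path)
        print(kind)
```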

Caveats

Currently, go-git's file system storage backend is considerably slower than the in-memory one, so you should clone repos instead of reading them from disk whenever possible. Please note that the in-memory storage may require a lot of RAM; for example, the Linux kernel took over 200GB in 2017.
Parsing YAML in Python is slow when the number of internal objects is large. hercules' output for the Linux kernel in "couples" mode is 1.5 GB and takes more than an hour and 180GB of RAM to parse. However, most repositories are parsed within a minute. Try using Protocol Buffers instead (hercules --pb and labours.py -f pb).
To speed up YAML parsing:

# Debian, Ubuntu
apt install libyaml-dev
# macOS
brew install yaml-cpp libyaml

# you might need to re-install pyyaml for the changes to take effect
pip uninstall pyyaml
pip --no-cache-dir install pyyaml
