Update readme

Vadim Markovtsev 8 years ago
parent
commit
c03bf606eb
2 changed files with 56 additions and 5 deletions
  1. README.md (51 additions, 5 deletions)
  2. labours.py (5 additions, 0 deletions)

+ 51 - 5
README.md

@@ -1,16 +1,62 @@
 Hercules
 --------
 
-This tool calculates the weekly lines burnout in a Git repository.
+This tool calculates line burnout stats in a Git repository.
+It does exactly what [git-of-theseus](https://github.com/erikbern/git-of-theseus)
+does, but uses [go-git](https://github.com/src-d/go-git).
+Why? [source{d}](http://sourced.tech) is building its own data pipeline to
+process every git repository in the world, and the calculation of the
+annual burnout ratio will be embedded into it. This project is the
+open source implementation of the specific `git blame` flavour on top
+of go-git. The blame is computed incrementally using a custom RB tree tracking
+algorithm, and only the last modification date is recorded.
 
-###Usage
+There are two tools: `hercules` and `labours.py`. The first is a program
+written in Go which collects the burnout stats from a Git repository.
+The second is a Python script which draws the stacked area plot. They
+are normally used together through a pipe. `hercules` prints
+text results. The first line contains three numbers: the UNIX timestamp which
+corresponds to the time the repository was created, *granularity* and *sampling*.
+Granularity is the number of days each band in the stack consists of. For example,
+to get the annual burnout plot, set granularity to 365. Sampling is the
+interval in days between burnout snapshots. The smaller the value,
+the smoother the plot, but the more work is done.
+
+###Installation
+You are going to need Go and Python 2 or 3.
+```
+go get gopkg.in/src-d/hercules.v1/cmd/hercules
+pip install pandas seaborn
+wget https://github.com/src-d/hercules/raw/master/labours.py
+```
 
+###Usage
 ```
+# Use "memory" go-git backend and display the plot. This is the fastest but the repository data must fit into RAM.
 hercules https://github.com/src-d/go-git | python3 labours.py
-hercules /path/to/cloned/go-git | python3 labours.py
-hercules https://github.com/torvalds/linux /tmp/linux_cache | python3 labours.py
-git rev-list HEAD | tac | hercules -commits -sampling 7 - https://github.com/src-d/go-git | python3 labours.py
+# Use "file system" go-git backend and print the raw data.
+hercules /path/to/cloned/go-git
+# Use "file system" go-git backend, cache the cloned repository to /tmp/repo-cache and display the plot.
+hercules https://github.com/git/git /tmp/repo-cache | python3 labours.py
+
+# Now something fun
+# Get the linear history from git rev-list and reverse it
+# Pipe it to hercules, producing a snapshot every 30 days with 1 year grouping
+# Save the raw data to cache.txt, so that later you can simply run: cat cache.txt | python3 labours.py
+# Pipe the raw data to labours.py, set the text font size to 16pt, use the Agg matplotlib backend and save the plot to git.png
+git rev-list HEAD | tac | hercules -commits - -sampling 30 -granularity 365 https://github.com/git/git | tee cache.txt | python3 labours.py --font-size 16 --backend Agg --output git.png
 ```
 
+###Caveats
+1. Currently, the complexity of go-git's "diff tree" algorithm is n log(n), where
+n is the number of files in the tree. Git's and libgit2's complexity
+is sublinear, almost constant, because they compare the hashes of subtrees. go-git
+will have the same complexity in the very near future.
+
+2. Currently, go-git's "file system" backend does not cache anything in memory.
+Every object retrieval operation decompresses the packfiles, parses them, etc.
+Effectively, the performance **slowdown** is **100x**. This will be fixed
+in the near future too.
+
 ###License
 MIT.
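
The README text above describes the first line of `hercules` output as three numbers: the repository start timestamp, granularity and sampling. The following is a minimal Python sketch of how that header could be read and what the two parameters imply for the plot. It assumes the numbers are whitespace-separated integers, which the README does not state; `parse_header`, `plot_dimensions`, the `Header` tuple and the sample header string are hypothetical names and data for illustration, not part of `labours.py`. The band and snapshot counts are rough estimates, and the real tools may round differently.

```python
import math
from collections import namedtuple

# Hypothetical container for the three header numbers named in the README.
Header = namedtuple("Header", ["start", "granularity", "sampling"])


def parse_header(line):
    """Parse the first output line.

    Assumption: three whitespace-separated integers; the README only
    says the line holds three numbers.
    """
    fields = line.split()
    if len(fields) != 3:
        raise ValueError("expected 3 numbers, got %d" % len(fields))
    start, granularity, sampling = (int(f) for f in fields)
    return Header(start, granularity, sampling)


def plot_dimensions(history_days, header):
    """Estimate how many bands and snapshots a history of this length yields.

    Granularity is the number of days per band; sampling is the number
    of days between snapshots.
    """
    bands = int(math.ceil(history_days / float(header.granularity)))
    snapshots = int(math.ceil(history_days / float(header.sampling)))
    return bands, snapshots


# Made-up header: annual bands, one snapshot every 30 days.
header = parse_header("1420070400 365 30")
bands, snapshots = plot_dimensions(730, header)
```

Halving the sampling value roughly doubles the number of snapshots, which is why a smaller value gives a smoother plot at the cost of more work.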

+ 5 - 0
labours.py

@@ -5,6 +5,11 @@ import sys
 import numpy
 
 
+if sys.version_info[0] < 3:
+    # OK, ancients, I will support Python 2, but you owe me a beer
+    input = raw_input
+
+
 def parse_args():
     parser = argparse.ArgumentParser()
     parser.add_argument("--output", default="",