# Run NVIDIA's TensorRT Inference Server on Omnia

Clone the repository:

`git clone https://github.com/NVIDIA/tensorrt-inference-server.git`

Download the example models:

`cd tensorrt-inference-server/docs/examples/`

`./fetch_models.sh`

Copy the models to the shared NFS location:

`cp -rp model_repository ensemble_model_repository /home/k8sSHARE`

Fix permissions on the model files:

`chmod -R a+r /home/k8sSHARE/model_repository`
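The inference server expects a specific layout inside the model repository: one directory per model, each containing a `config.pbtxt` and one numbered subdirectory per model version. A minimal sketch of that layout, using `resnet50_netdef` from the fetched examples and `/tmp` standing in for the NFS share:

```shell
# Sketch of the repository layout the server expects. /tmp stands in
# for /home/k8sSHARE here; resnet50_netdef is one of the example
# models downloaded by fetch_models.sh.
mkdir -p /tmp/model_repository/resnet50_netdef/1          # version 1 of the model
touch /tmp/model_repository/resnet50_netdef/config.pbtxt  # model configuration file
ls /tmp/model_repository/resnet50_netdef
```

The `fetch_models.sh` script produces this structure for you; the sketch is only to show what the server looks for when it scans `modelRepositoryPath`.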
## Deploy Prometheus and Grafana

Prometheus collects metrics for viewing in Grafana. Install the prometheus-operator Helm chart to deploy both components. The `serviceMonitorSelectorNilUsesHelmValues` flag is needed so that Prometheus can find the inference server metrics in the `example` release deployed below:

`helm install --name example-metrics --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false stable/prometheus-operator`

Set up a port-forward to the Grafana service for local access:

`kubectl port-forward service/example-metrics-grafana 8080:80`

Navigate in your browser to localhost:8080 for the Grafana login page.

`username=admin password=prom-operator`
## Set Up the TensorRT Inference Server Deployment

Change to the Helm chart directory:

`cd ~/tensorrt-inference-server/deploy/single_server/`

Modify `values.yaml`, changing `modelRepositoryPath`:

<pre>
image:
  imageName: nvcr.io/nvidia/tensorrtserver:20.01-py3
  pullPolicy: IfNotPresent
  #modelRepositoryPath: gs://tensorrt-inference-server-repository/model_repository
  modelRepositoryPath: /data/model_repository
  numGpus: 1
</pre>
Modify `templates/deployment.yaml` to add the local NFS mount (additions shown in bold):

<pre>
...
  spec:
    containers:
      - name: {{ .Chart.Name }}
        image: "{{ .Values.image.imageName }}"
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        <b style='background-color:yellow'>volumeMounts:
          - mountPath: /data/
            name: work-volume</b>
    ...
    <b>volumes:
      - name: work-volume
        hostPath:
          # directory locally mounted on host
          path: /home/k8sSHARE
          type: Directory</b>
</pre>
### Deploy the inference server

<pre>
cd ~/tensorrt-inference-server/deploy/single_server/
helm install --name example .
</pre>
### Verify the deployment

<pre>
helm ls
NAME             REVISION  UPDATED                   STATUS    CHART                            APP VERSION  NAMESPACE
example          1         Wed Feb 26 15:46:18 2020  DEPLOYED  tensorrt-inference-server-1.0.0  1.0          default
example-metrics  1         Tue Feb 25 17:45:54 2020  DEPLOYED  prometheus-operator-8.9.2        0.36.0       default
</pre>

<pre>
kubectl get pods
NAME                                                READY  STATUS   RESTARTS  AGE
example-tensorrt-inference-server-f45d865dc-62c46   1/1    Running  0        53m
</pre>

<pre>
kubectl get svc
NAME                                TYPE          CLUSTER-IP     EXTERNAL-IP     PORT(S)                                       AGE
...
example-tensorrt-inference-server   LoadBalancer  10.150.77.138  192.168.60.150  8000:31165/TCP,8001:31408/TCP,8002:30566/TCP  53m
</pre>
## Set Up the NGC Login Secret for nvcr.io

`kubectl create secret docker-registry <your-secret-name> --docker-server=<your-registry-server> --docker-username=<your-registry-username> --docker-password=<your-registry-apikey> --docker-email=<your-email>`

Parameter description:

- `<your-secret-name>` – the name you will use for this secret
- `<your-registry-server>` – nvcr.io is the container registry for NGC
- `<your-registry-username>` – for nvcr.io this is `'$oauthtoken'` (single-quoted so the shell does not expand it)
- `<your-registry-apikey>` – the API key you obtained earlier from NGC
- `<your-email>` – your NGC email address

Example (you will need to generate your own API key):

`kubectl create secret docker-registry ngc-secret --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=clkaw309f3jfaJ002EIVCJAC0Cpcklajser90wezxc98wdn09ICJA09xjc09j09JV00JV0JVCLR0WQE8ACZz --docker-email=john@example.com`

Verify that your secret has been stored:

<pre>
kubectl get secrets
NAME         TYPE                             DATA  AGE
...
ngc-secret   kubernetes.io/dockerconfigjson   1     106m
</pre>
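For Kubernetes to use this secret when pulling the server image from nvcr.io, the pod spec must reference it. A sketch of the relevant fragment of the chart's `templates/deployment.yaml`, assuming the secret name `ngc-secret` used above (adjust the surrounding fields to match your chart):

```yaml
# Sketch: add imagePullSecrets to the pod spec so the kubelet
# authenticates to nvcr.io with the secret created above.
spec:
  imagePullSecrets:
    - name: ngc-secret        # the secret name chosen earlier
  containers:
    - name: tensorrt-inference-server
      image: nvcr.io/nvidia/tensorrtserver:20.01-py3
```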
## Run the TensorRT Client

`kubectl apply -f trt-client.yaml`
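The contents of `trt-client.yaml` are not shown in this guide. A minimal sketch of what such a client Pod might look like, assuming the NGC client SDK image tag `20.01-py3-clientsdk` (matching the server release above) and the `ngc-secret` pull secret:

```yaml
# Hypothetical trt-client.yaml: a long-running Pod built from the
# TensorRT Inference Server client SDK image (tag is an assumption).
apiVersion: v1
kind: Pod
metadata:
  name: tensorrt-client
spec:
  imagePullSecrets:
    - name: ngc-secret
  containers:
    - name: tensorrt-client
      image: nvcr.io/nvidia/tensorrtserver:20.01-py3-clientsdk
      command: ["sleep", "infinity"]   # keep the Pod alive for kubectl exec
```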
Verify that it is running:

<pre>
kubectl get pod tensorrt-client
NAME              READY  STATUS   RESTARTS  AGE
tensorrt-client   1/1    Running  0        5m
</pre>

Run an inference test using the client Pod (`-s INCEPTION` applies Inception-style image preprocessing). The inference server's IP address can be found by running `kubectl get svc`:
<pre>
kubectl exec -it tensorrt-client -- /bin/bash -c "image_client -u 192.168.60.150:8000 -m resnet50_netdef -s INCEPTION images/mug.jpg"
Request 0, batch size 1
Image 'images/mug.jpg':
    504 (COFFEE MUG) = 0.723992
</pre>

Run the test again with batch size 2 and print the top 3 classifications:

<pre>
kubectl exec -it tensorrt-client -- /bin/bash -c "image_client -u 192.168.60.150:8000 -m resnet50_netdef -s INCEPTION images/ -c 3 -b 2"
Request 0, batch size 2
Image 'images//mug.jpg':
    504 (COFFEE MUG) = 0.723992
    968 (CUP) = 0.270953
    967 (ESPRESSO) = 0.00115996
Image 'images//mug.jpg':
    504 (COFFEE MUG) = 0.723992
    968 (CUP) = 0.270953
    967 (ESPRESSO) = 0.00115996
</pre>
+
|
|
|
|
+
|
|
|
|
+
|
|
|
|
+
|