
Merge pull request #1 from dellhpc/master

merging
John Lockman, 5 years ago
Parent commit: 6d975fbf57
7 changed files with 420 additions and 51 deletions
  1. .github/ISSUE_TEMPLATE/bug_report.md (+38, -0)
  2. .github/ISSUE_TEMPLATE/feature_request.md (+20, -0)
  3. CODE_OF_CONDUCT.md (+76, -0)
  4. CONTRIBUTING.md (+65, -0)
  5. INSTALL.md (+55, -0)
  6. README.md (+10, -51)
  7. examples/TensorRT-InferenceServer/README.md (+156, -0)

+ 38 - 0
.github/ISSUE_TEMPLATE/bug_report.md

@@ -0,0 +1,38 @@
+---
+name: Bug report
+about: Create a report to help us improve
+title: ''
+labels: ''
+assignees: ''
+
+---
+
+**Describe the bug**
+A clear and concise description of what the bug is.
+
+**To Reproduce**
+Steps to reproduce the behavior:
+1. Go to '...'
+2. Click on '....'
+3. Scroll down to '....'
+4. See error
+
+**Expected behavior**
+A clear and concise description of what you expected to happen.
+
+**Screenshots**
+If applicable, add screenshots to help explain your problem.
+
+**Desktop (please complete the following information):**
+ - OS: [e.g. iOS]
+ - Browser [e.g. chrome, safari]
+ - Version [e.g. 22]
+
+**Smartphone (please complete the following information):**
+ - Device: [e.g. iPhone6]
+ - OS: [e.g. iOS8.1]
+ - Browser [e.g. stock browser, safari]
+ - Version [e.g. 22]
+
+**Additional context**
+Add any other context about the problem here.

+ 20 - 0
.github/ISSUE_TEMPLATE/feature_request.md

@@ -0,0 +1,20 @@
+---
+name: Feature request
+about: Suggest an idea for this project
+title: ''
+labels: ''
+assignees: ''
+
+---
+
+**Is your feature request related to a problem? Please describe.**
+A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
+
+**Describe the solution you'd like**
+A clear and concise description of what you want to happen.
+
+**Describe alternatives you've considered**
+A clear and concise description of any alternative solutions or features you've considered.
+
+**Additional context**
+Add any other context or screenshots about the feature request here.

+ 76 - 0
CODE_OF_CONDUCT.md

@@ -0,0 +1,76 @@
+# Contributor Covenant Code of Conduct
+
+## Our Pledge
+
+In the interest of fostering an open and welcoming environment, we as
+contributors and maintainers pledge to making participation in our project and
+our community a harassment-free experience for everyone, regardless of age, body
+size, disability, ethnicity, sex characteristics, gender identity and expression,
+level of experience, education, socio-economic status, nationality, personal
+appearance, race, religion, or sexual identity and orientation.
+
+## Our Standards
+
+Examples of behavior that contributes to creating a positive environment
+include:
+
+* Using welcoming and inclusive language
+* Being respectful of differing viewpoints and experiences
+* Gracefully accepting constructive criticism
+* Focusing on what is best for the community
+* Showing empathy towards other community members
+
+Examples of unacceptable behavior by participants include:
+
+* The use of sexualized language or imagery and unwelcome sexual attention or
+ advances
+* Trolling, insulting/derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or electronic
+ address, without explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+ professional setting
+
+## Our Responsibilities
+
+Project maintainers are responsible for clarifying the standards of acceptable
+behavior and are expected to take appropriate and fair corrective action in
+response to any instances of unacceptable behavior.
+
+Project maintainers have the right and responsibility to remove, edit, or
+reject comments, commits, code, wiki edits, issues, and other contributions
+that are not aligned to this Code of Conduct, or to ban temporarily or
+permanently any contributor for other behaviors that they deem inappropriate,
+threatening, offensive, or harmful.
+
+## Scope
+
+This Code of Conduct applies both within project spaces and in public spaces
+when an individual is representing the project or its community. Examples of
+representing a project or community include using an official project e-mail
+address, posting via an official social media account, or acting as an appointed
+representative at an online or offline event. Representation of a project may be
+further defined and clarified by project maintainers.
+
+## Enforcement
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported by contacting the project team at luke_wilson@dell.com. All
+complaints will be reviewed and investigated and will result in a response that
+is deemed necessary and appropriate to the circumstances. The project team is
+obligated to maintain confidentiality with regard to the reporter of an incident.
+Further details of specific enforcement policies may be posted separately.
+
+Project maintainers who do not follow or enforce the Code of Conduct in good
+faith may face temporary or permanent repercussions as determined by other
+members of the project's leadership.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
+
+[homepage]: https://www.contributor-covenant.org
+
+For answers to common questions about this code of conduct, see
+https://www.contributor-covenant.org/faq

+ 65 - 0
CONTRIBUTING.md

Changes not shown because the file is too large.


+ 55 - 0
INSTALL.md

@@ -0,0 +1,55 @@
+Dancing to the beat of a different drum.
+
+# Short Version:
+
+Install Kubernetes and all dependencies
+```
+ansible-playbook -i host_inventory_file build-kubernetes-cluster.yml
+```
+
+Initialize K8S cluster
+```
+ansible-playbook -i host_inventory_file build-kubernetes-cluster.yml --tags "init"
+```
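+
+The inventory file itself is not shown in this PR; a minimal sketch of what `host_inventory_file` might look like (YAML inventory form; the group and host names are placeholders and the playbook may expect different group names):
+```
+all:
+  children:
+    master:
+      hosts:
+        headnode.example.com:      # placeholder head node
+    compute:
+      hosts:
+        compute01.example.com:     # placeholder compute nodes
+        compute02.example.com:
+```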
+
+
+# What this does:
+
+## Build/Install
+
+### Add additional repositories:
+
+- Kubernetes (Google)
+- El Repo (nvidia drivers)
+- Nvidia (nvidia-docker)
+- EPEL (Extra Packages for Enterprise Linux)
+
+### Install common packages
+ - gcc
+ - python-pip
+ - docker
+ - kubelet
+ - kubeadm
+ - kubectl
+ - nvidia-detect
+ - kmod-nvidia
+ - nvidia-x11-drv
+ - nvidia-container-runtime
+ - ksonnet (CLI framework for K8S configs)
+
+### Enable GPU Device Plugins (nvidia-container-runtime-hook)
+
+### Modify kubeadm config to allow GPUs as schedulable resource 
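+
+For reference, once the device plugin is active, GPUs can be requested like any other resource. A minimal sketch of a pod requesting one GPU (pod name and image are placeholders, not part of these playbooks):
+```
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-test                    # placeholder name
+spec:
+  containers:
+    - name: cuda
+      image: nvidia/cuda:10.2-base  # placeholder CUDA image
+      resources:
+        limits:
+          nvidia.com/gpu: 1         # GPU resource exposed by the device plugin
+```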
+
+### Start and enable services
+ - Docker
+ - Kubelet
+
+## Initialize Cluster
+### Head/master
+- Start K8S and pass the startup token to compute/slaves
+- Initialize networking (currently using WeaveNet)
+- Set up the K8S Dashboard
+- Create dynamic/persistent volumes (see the sketch below this list)
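+
+The playbooks create these volumes; purely as an illustration, a minimal sketch of an NFS-backed PersistentVolume (name, size, and server address are placeholders, and the actual storage type used by the playbooks may differ):
+```
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+  name: example-pv                  # placeholder name
+spec:
+  capacity:
+    storage: 10Gi                   # placeholder size
+  accessModes:
+    - ReadWriteMany
+  nfs:                              # assumed NFS backing
+    server: 10.0.0.1                # placeholder NFS server address
+    path: /home/k8sSHARE            # shared path also used in the TensorRT example below
+```
+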
+### Compute/slaves
+- Join k8s cluster

+ 10 - 51
README.md

@@ -1,55 +1,14 @@
-Dancing to the beat of a different drum.
+# Omnia
+#### Ansible playbook-based deployment of Slurm and Kubernetes on factory-provisioned Dell EMC PowerEdge servers
 
-# Short Version:
+Omnia (Latin: all or everything) is a deployment tool to turn Dell EMC PowerEdge servers with factory-installed OS images into a functioning Slurm/Kubernetes cluster.
 
-Install Kubernetes and all dependencies
-```
-ansible-playbook -i host_inventory_file build-kubernetes-cluster.yml
-```
+## Installing Omnia
+To install Omnia, see [INSTALL](INSTALL.md)
 
-Initialize K8S cluster
-```
-ansible-playbook -i host_inventory_file build-kubernetes-cluster.yml --tags "init"
-```
+## Contributing
+To contribute to the Omnia project, see [CONTRIBUTING](CONTRIBUTING.md)
 
-
-# What this does:
-
-## Build/Install
-
-### Add additional repositories:
-
-- Kubernetes (Google)
-- El Repo (nvidia drivers)
-- Nvidia (nvidia-docker)
-- EPEL (Extra Packages for Enterprise Linux)
-
-### Install common packages
- - gcc
- - python-pip
- - docker
- - kubelet
- - kubeadm
- - kubectl
- - nvidia-detect
- - kmod-nvidia
- - nvidia-x11-drv
- - nvidia-container-runtime
- - ksonnet (CLI framework for K8S configs)
-
-### Enable GPU Device Plugins (nvidia-container-runtime-hook)
-
-### Modify kubeadm config to allow GPUs as schedulable resource 
-
-### Start and enable services
- - Docker
- - Kubelet
-
-## Initialize Cluster
-### Head/master
-- Start K8S pass startup token to compute/slaves
-- Initialize networking (Currently using WeaveNet)
--Setup K8S Dashboard
-- Create dynamic/persistent volumes
-### Compute/slaves
-- Join k8s cluster
+### Current maintainers:
+* Lucas A. Wilson (Dell Technologies)
+* John Lockman (Dell Technologies)

+ 156 - 0
examples/TensorRT-InferenceServer/README.md

@@ -0,0 +1,156 @@
+# Run NVIDIA's TensorRT Inference Server on Omnia
+
+Clone the repo
+
+`git clone https://github.com/NVIDIA/tensorrt-inference-server.git`
+
+Download models
+
+`cd tensorrt-inference-server/docs/examples/`
+`./fetch_models.sh`
+
+Copy models to shared NFS location
+
+`cp -rp model_repository ensemble_model_repository /home/k8sSHARE`
+
+Fix permissions on model files
+
+`chmod -R a+r /home/k8sSHARE/model_repository`
+
+
+## Deploy Prometheus and Grafana
+
+Prometheus collects metrics for viewing in Grafana. Install the prometheus-operator for these components. The serviceMonitorSelectorNilUsesHelmValues flag is needed so that Prometheus can find the inference server metrics in the example release deployed below:
+
+`helm install --name example-metrics --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false stable/prometheus-operator`
+
+Set up a port-forward to the Grafana service for local access:
+
+`kubectl port-forward service/example-metrics-grafana 8080:80`
+
+Navigate in your browser to localhost:8080 for the Grafana login page. 
+`username=admin password=prom-operator`
+
+## Setup TensorRT Inference Server Deployment
+Change to the helm chart directory:
+`cd ~/tensorrt-inference-server/deploy/single_server/`
+
+Modify `values.yaml`, changing `modelRepositoryPath`:
+
+<pre>
+image:
+  imageName: nvcr.io/nvidia/tensorrtserver:20.01-py3
+  pullPolicy: IfNotPresent
+  #modelRepositoryPath: gs://tensorrt-inference-server-repository/model_repository
+  modelRepositoryPath: /data/model_repository
+  numGpus: 1
+ </pre>
+
+Modify `templates/deployment.yaml` in **bold** to add the local NFS mount:
+<pre>
+...
+    spec:
+      containers:
+        - name: {{ .Chart.Name }}
+          image: "{{ .Values.image.imageName }}"
+          imagePullPolicy: {{ .Values.image.pullPolicy }}
+         <b style='background-color:yellow'> volumeMounts:
+            - mountPath: /data/
+              name: work-volume</b>
+ ...
+   <b>   volumes:
+      - name: work-volume
+        hostPath:
+          # directory locally mounted on host
+          path: /home/k8sSHARE
+          type: Directory
+   </b>
+   </pre>
+
+
+### Deploy the inference server
+
+<pre>
+cd ~/tensorrt-inference-server/deploy/single_server/
+helm install --name example .
+</pre>
+
+### Verify deployment
+<pre>
+helm ls
+NAME           	REVISION	UPDATED                 	STATUS  	CHART                          	APP VERSION	NAMESPACE
+example        	1       	Wed Feb 26 15:46:18 2020	DEPLOYED	tensorrt-inference-server-1.0.0	1.0        	default  
+example-metrics	1       	Tue Feb 25 17:45:54 2020	DEPLOYED	prometheus-operator-8.9.2      	0.36.0     	default  
+</pre>
+
+<pre>
+kubectl get pods
+NAME                                                     READY   STATUS    RESTARTS   AGE
+example-tensorrt-inference-server-f45d865dc-62c46        1/1     Running   0          53m
+</pre>
+
+<pre>
+kubectl get svc
+NAME                                        TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)                                        AGE
+...
+example-tensorrt-inference-server           LoadBalancer   10.150.77.138    192.168.60.150   8000:31165/TCP,8001:31408/TCP,8002:30566/TCP   53m
+</pre>
+
+## Setup NGC login secret for nvcr.io
+
+`kubectl create secret docker-registry <your-secret-name> --docker-server=<your-registry-server> --docker-username=<your-registry-username> --docker-password=<your-registry-apikey> --docker-email=<your-email>`
+
+Parameter description:
+- `docker-registry <your-secret-name>`: the name you will use for this secret
+- `docker-server <your-registry-server>`: nvcr.io is the container registry for NGC
+- `docker-username <your-registry-username>`: for nvcr.io this is '$oauthtoken' (including the quotes)
+- `docker-password <your-registry-apikey>`: the API key you obtained earlier
+- `docker-email <your-email>`: your NGC email address
+
+Example (you will need to generate your own oauth token)
+`kubectl create secret docker-registry ngc-secret --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=clkaw309f3jfaJ002EIVCJAC0Cpcklajser90wezxc98wdn09ICJA09xjc09j09JV00JV0JVCLR0WQE8ACZz --docker-email=john@example.com`
+
+Verify your secret has been stored:
+<pre>
+kubectl get secrets
+NAME                                                          TYPE                                  DATA   AGE
+...
+ngc-secret                                                    kubernetes.io/dockerconfigjson        1      106m
+</pre>
+
+## Run TensorRT Client
+`kubectl apply -f trt-client.yaml`
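+
+The contents of `trt-client.yaml` are not included in this PR; a minimal sketch of what such a client pod could look like, assuming the NGC client SDK image and the `ngc-secret` created above (image tag and pod spec are assumptions, not the actual file):
+<pre>
+apiVersion: v1
+kind: Pod
+metadata:
+  name: tensorrt-client
+spec:
+  imagePullSecrets:
+    - name: ngc-secret                                           # NGC pull secret created above
+  containers:
+    - name: tensorrt-client
+      image: nvcr.io/nvidia/tensorrtserver:20.01-py3-clientsdk   # assumed client SDK image
+      command: ["sleep", "infinity"]                             # keep the pod alive for kubectl exec
+</pre>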
+
+Verify it is running:
+<pre>
+kubectl get pod tensorrt-client 
+NAME              READY   STATUS    RESTARTS   AGE
+tensorrt-client   1/1     Running   0          5m
+</pre>
+
+Run the inception test from the client pod. The TensorRT Inference Server IP address can be found by running `kubectl get svc`:
+<pre>
+kubectl exec -it tensorrt-client -- /bin/bash -c "image_client -u 192.168.60.150:8000 -m resnet50_netdef -s INCEPTION images/mug.jpg"
+Request 0, batch size 1
+Image 'images/mug.jpg':
+    504 (COFFEE MUG) = 0.723992
+</pre>
+
+Run the inception test with batch size 2 and print the top 3 classifications:
+<pre>
+ kubectl exec -it tensorrt-client -- /bin/bash -c "image_client  -u 192.168.60.150:8000 -m resnet50_netdef -s INCEPTION images/ -c 3 -b 2"
+Request 0, batch size 2
+Image 'images//mug.jpg':
+    504 (COFFEE MUG) = 0.723992
+    968 (CUP) = 0.270953
+    967 (ESPRESSO) = 0.00115996
+Image 'images//mug.jpg':
+    504 (COFFEE MUG) = 0.723992
+    968 (CUP) = 0.270953
+    967 (ESPRESSO) = 0.00115996
+</pre>
+
+
+
+