
Merge pull request #1 from dellhpc/master

merging
John Lockman, 5 years ago
Parent commit: 6d975fbf57
7 changed files with 420 additions and 51 deletions
  1. .github/ISSUE_TEMPLATE/bug_report.md (+38, -0)
  2. .github/ISSUE_TEMPLATE/feature_request.md (+20, -0)
  3. CODE_OF_CONDUCT.md (+76, -0)
  4. CONTRIBUTING.md (+65, -0)
  5. INSTALL.md (+55, -0)
  6. README.md (+10, -51)
  7. examples/TensorRT-InferenceServer/README.md (+156, -0)

+ 38 - 0
.github/ISSUE_TEMPLATE/bug_report.md

@@ -0,0 +1,38 @@
+---
+name: Bug report
+about: Create a report to help us improve
+title: ''
+labels: ''
+assignees: ''
+
+---
+
+**Describe the bug**
+A clear and concise description of what the bug is.
+
+**To Reproduce**
+Steps to reproduce the behavior:
+1. Go to '...'
+2. Click on '....'
+3. Scroll down to '....'
+4. See error
+
+**Expected behavior**
+A clear and concise description of what you expected to happen.
+
+**Screenshots**
+If applicable, add screenshots to help explain your problem.
+
+**Desktop (please complete the following information):**
+ - OS: [e.g. iOS]
+ - Browser [e.g. chrome, safari]
+ - Version [e.g. 22]
+
+**Smartphone (please complete the following information):**
+ - Device: [e.g. iPhone6]
+ - OS: [e.g. iOS8.1]
+ - Browser [e.g. stock browser, safari]
+ - Version [e.g. 22]
+
+**Additional context**
+Add any other context about the problem here.

+ 20 - 0
.github/ISSUE_TEMPLATE/feature_request.md

@@ -0,0 +1,20 @@
+---
+name: Feature request
+about: Suggest an idea for this project
+title: ''
+labels: ''
+assignees: ''
+
+---
+
+**Is your feature request related to a problem? Please describe.**
+A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
+
+**Describe the solution you'd like**
+A clear and concise description of what you want to happen.
+
+**Describe alternatives you've considered**
+A clear and concise description of any alternative solutions or features you've considered.
+
+**Additional context**
+Add any other context or screenshots about the feature request here.

+ 76 - 0
CODE_OF_CONDUCT.md

@@ -0,0 +1,76 @@
+# Contributor Covenant Code of Conduct
+
+## Our Pledge
+
+In the interest of fostering an open and welcoming environment, we as
+contributors and maintainers pledge to making participation in our project and
+our community a harassment-free experience for everyone, regardless of age, body
+size, disability, ethnicity, sex characteristics, gender identity and expression,
+level of experience, education, socio-economic status, nationality, personal
+appearance, race, religion, or sexual identity and orientation.
+
+## Our Standards
+
+Examples of behavior that contributes to creating a positive environment
+include:
+
+* Using welcoming and inclusive language
+* Being respectful of differing viewpoints and experiences
+* Gracefully accepting constructive criticism
+* Focusing on what is best for the community
+* Showing empathy towards other community members
+
+Examples of unacceptable behavior by participants include:
+
+* The use of sexualized language or imagery and unwelcome sexual attention or
+ advances
+* Trolling, insulting/derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or electronic
+ address, without explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+ professional setting
+
+## Our Responsibilities
+
+Project maintainers are responsible for clarifying the standards of acceptable
+behavior and are expected to take appropriate and fair corrective action in
+response to any instances of unacceptable behavior.
+
+Project maintainers have the right and responsibility to remove, edit, or
+reject comments, commits, code, wiki edits, issues, and other contributions
+that are not aligned to this Code of Conduct, or to ban temporarily or
+permanently any contributor for other behaviors that they deem inappropriate,
+threatening, offensive, or harmful.
+
+## Scope
+
+This Code of Conduct applies both within project spaces and in public spaces
+when an individual is representing the project or its community. Examples of
+representing a project or community include using an official project e-mail
+address, posting via an official social media account, or acting as an appointed
+representative at an online or offline event. Representation of a project may be
+further defined and clarified by project maintainers.
+
+## Enforcement
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported by contacting the project team at luke_wilson@dell.com. All
+complaints will be reviewed and investigated and will result in a response that
+is deemed necessary and appropriate to the circumstances. The project team is
+obligated to maintain confidentiality with regard to the reporter of an incident.
+Further details of specific enforcement policies may be posted separately.
+
+Project maintainers who do not follow or enforce the Code of Conduct in good
+faith may face temporary or permanent repercussions as determined by other
+members of the project's leadership.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
+
+[homepage]: https://www.contributor-covenant.org
+
+For answers to common questions about this code of conduct, see
+https://www.contributor-covenant.org/faq

+ 65 - 0
CONTRIBUTING.md

Changes not shown because the file is too large.


+ 55 - 0
INSTALL.md

@@ -0,0 +1,55 @@
+Dancing to the beat of a different drum.
+
+# Short Version:
+
+Install Kubernetes and all dependencies
+```
+ansible-playbook -i host_inventory_file build-kubernetes-cluster.yml
+```
+
+Initialize K8S cluster
+```
+ansible-playbook -i host_inventory_file build-kubernetes-cluster.yml --tags "init"
+```
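+
+The inventory file itself is not shown in this PR; a minimal sketch of what `host_inventory_file` might look like (YAML inventory form; the group and host names are placeholders and the playbook may expect different group names):
+```
+all:
+  children:
+    master:
+      hosts:
+        headnode.example.com:      # placeholder head node
+    compute:
+      hosts:
+        compute01.example.com:     # placeholder compute nodes
+        compute02.example.com:
+```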
+
+
+# What this does:
+
+## Build/Install
+
+### Add additional repositories:
+
+- Kubernetes (Google)
+- El Repo (nvidia drivers)
+- Nvidia (nvidia-docker)
+- EPEL (Extra Packages for Enterprise Linux)
+
+### Install common packages
+ - gcc
+ - python-pip
+ - docker
+ - kubelet
+ - kubeadm
+ - kubectl
+ - nvidia-detect
+ - kmod-nvidia
+ - nvidia-x11-drv
+ - nvidia-container-runtime
+ - ksonnet (CLI framework for K8S configs)
+
+### Enable GPU Device Plugins (nvidia-container-runtime-hook)
+
+### Modify kubeadm config to allow GPUs as schedulable resource 
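+
+For reference, once the device plugin is active, GPUs can be requested like any other resource. A minimal sketch of a pod requesting one GPU (pod name and image are placeholders, not part of these playbooks):
+```
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-test                    # placeholder name
+spec:
+  containers:
+    - name: cuda
+      image: nvidia/cuda:10.2-base  # placeholder CUDA image
+      resources:
+        limits:
+          nvidia.com/gpu: 1         # GPU resource exposed by the device plugin
+```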
+
+### Start and enable services
+ - Docker
+ - Kubelet
+
+## Initialize Cluster
+### Head/master
+- Start K8S and pass the startup token to compute/slaves
+- Initialize networking (currently using WeaveNet)
+- Set up the K8S Dashboard
+- Create dynamic/persistent volumes (see the sketch below this list)
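+
+The playbooks create these volumes; purely as an illustration, a minimal sketch of an NFS-backed PersistentVolume (name, size, and server address are placeholders, and the actual storage type used by the playbooks may differ):
+```
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+  name: example-pv                  # placeholder name
+spec:
+  capacity:
+    storage: 10Gi                   # placeholder size
+  accessModes:
+    - ReadWriteMany
+  nfs:                              # assumed NFS backing
+    server: 10.0.0.1                # placeholder NFS server address
+    path: /home/k8sSHARE            # shared path also used in the TensorRT example below
+```
+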
+### Compute/slaves
+- Join k8s cluster

+ 10 - 51
README.md

@@ -1,55 +1,14 @@
-Dancing to the beat of a different drum.
+# Omnia
+#### Ansible playbook-based deployment of Slurm and Kubernetes on factory-provisioned Dell EMC PowerEdge servers
 
-# Short Version:
+Omnia (Latin: all or everything) is a deployment tool to turn Dell EMC PowerEdge servers with factory-installed OS images into a functioning Slurm/Kubernetes cluster.
 
-Install Kubernetes and all dependencies
-```
-ansible-playbook -i host_inventory_file build-kubernetes-cluster.yml
-```
+## Installing Omnia
+To install Omnia, see [INSTALL](INSTALL.md)
 
-Initialize K8S cluster
-```
-ansible-playbook -i host_inventory_file build-kubernetes-cluster.yml --tags "init"
-```
+## Contributing
+To contribute to the Omnia project, see [CONTRIBUTING](CONTRIBUTING.md)
 
-
-# What this does:
-
-## Build/Install
-
-### Add additional repositories:
-
-- Kubernetes (Google)
-- El Repo (nvidia drivers)
-- Nvidia (nvidia-docker)
-- EPEL (Extra Packages for Enterprise Linux)
-
-### Install common packages
- - gcc
- - python-pip
- - docker
- - kubelet
- - kubeadm
- - kubectl
- - nvidia-detect
- - kmod-nvidia
- - nvidia-x11-drv
- - nvidia-container-runtime
- - ksonnet (CLI framework for K8S configs)
-
-### Enable GPU Device Plugins (nvidia-container-runtime-hook)
-
-### Modify kubeadm config to allow GPUs as schedulable resource 
-
-### Start and enable services
- - Docker
- - Kubelet
-
-## Initialize Cluster
-### Head/master
-- Start K8S pass startup token to compute/slaves
-- Initialize networking (Currently using WeaveNet)
--Setup K8S Dashboard
-- Create dynamic/persistent volumes
-### Compute/slaves
-- Join k8s cluster
+### Current maintainers:
+* Lucas A. Wilson (Dell Technologies)
+* John Lockman (Dell Technologies)

+ 156 - 0
examples/TensorRT-InferenceServer/README.md

@@ -0,0 +1,156 @@
+# Run NVIDIA's TensorRT Inference Server on Omnia
+
+Clone the repo
+
+`git clone https://github.com/NVIDIA/tensorrt-inference-server.git`
+
+Download models
+
+`cd tensorrt-inference-server/docs/examples/`
+`./fetch_models.sh`
+
+Copy models to shared NFS location
+
+`cp -rp model_repository ensemble_model_repository /home/k8sSHARE`
+
+Fix permissions on model files
+
+`chmod -R a+r /home/k8sSHARE/model_repository`
+
+
+## Deploy Prometheus and Grafana
+
+Prometheus collects metrics for viewing in Grafana. Install the prometheus-operator for these components. The serviceMonitorSelectorNilUsesHelmValues flag is needed so that Prometheus can find the inference server metrics in the example release deployed below:
+
+`helm install --name example-metrics --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false stable/prometheus-operator`
+
+Set up a port-forward to the Grafana service for local access:
+
+`kubectl port-forward service/example-metrics-grafana 8080:80`
+
+Navigate in your browser to localhost:8080 for the Grafana login page. 
+`username=admin password=prom-operator`
+
+## Setup TensorRT Inference Server Deployment
+Change to the helm chart directory:
+`cd ~/tensorrt-inference-server/deploy/single_server/`
+
+Modify `values.yaml`, changing `modelRepositoryPath`:
+
+<pre>
+image:
+  imageName: nvcr.io/nvidia/tensorrtserver:20.01-py3
+  pullPolicy: IfNotPresent
+  #modelRepositoryPath: gs://tensorrt-inference-server-repository/model_repository
+  modelRepositoryPath: /data/model_repository
+  numGpus: 1
+ </pre>
+
+Modify `templates/deployment.yaml` in **bold** to add the local NFS mount:
+<pre>
+...
+    spec:
+      containers:
+        - name: {{ .Chart.Name }}
+          image: "{{ .Values.image.imageName }}"
+          imagePullPolicy: {{ .Values.image.pullPolicy }}
+         <b style='background-color:yellow'> volumeMounts:
+            - mountPath: /data/
+              name: work-volume</b>
+ ...
+   <b>   volumes:
+      - name: work-volume
+        hostPath:
+          # directory locally mounted on host
+          path: /home/k8sSHARE
+          type: Directory
+   </b>
+   </pre>
+
+
+### Deploy the inference server
+
+<pre>
+cd ~/tensorrt-inference-server/deploy/single_server/
+helm install --name example .
+</pre>
+
+### Verify deployment
+<pre>
+helm ls
+NAME           	REVISION	UPDATED                 	STATUS  	CHART                          	APP VERSION	NAMESPACE
+example        	1       	Wed Feb 26 15:46:18 2020	DEPLOYED	tensorrt-inference-server-1.0.0	1.0        	default  
+example-metrics	1       	Tue Feb 25 17:45:54 2020	DEPLOYED	prometheus-operator-8.9.2      	0.36.0     	default  
+</pre>
+
+<pre>
+kubectl get pods
+NAME                                                     READY   STATUS    RESTARTS   AGE
+example-tensorrt-inference-server-f45d865dc-62c46        1/1     Running   0          53m
+</pre>
+
+<pre>
+kubectl get svc
+NAME                                        TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)                                        AGE
+...
+example-tensorrt-inference-server           LoadBalancer   10.150.77.138    192.168.60.150   8000:31165/TCP,8001:31408/TCP,8002:30566/TCP   53m
+</pre>
+
+## Setup NGC login secret for nvcr.io
+
+`kubectl create secret docker-registry <your-secret-name> --docker-server=<your-registry-server> --docker-username=<your-registry-username> --docker-password=<your-registry-apikey> --docker-email=<your-email>`
+
+Parameter description:
+- `docker-registry <your-secret-name>`: the name you will use for this secret
+- `docker-server <your-registry-server>`: nvcr.io is the container registry for NGC
+- `docker-username <your-registry-username>`: for nvcr.io this is '$oauthtoken' (including the quotes)
+- `docker-password <your-registry-apikey>`: the API key you obtained earlier
+- `docker-email <your-email>`: your NGC email address
+
+Example (you will need to generate your own oauth token)
+`kubectl create secret docker-registry ngc-secret --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=clkaw309f3jfaJ002EIVCJAC0Cpcklajser90wezxc98wdn09ICJA09xjc09j09JV00JV0JVCLR0WQE8ACZz --docker-email=john@example.com`
+
+Verify your secret has been stored:
+<pre>
+kubectl get secrets
+NAME                                                          TYPE                                  DATA   AGE
+...
+ngc-secret                                                    kubernetes.io/dockerconfigjson        1      106m
+</pre>
+
+## Run TensorRT Client
+`kubectl apply -f trt-client.yaml`
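+
+The contents of `trt-client.yaml` are not included in this PR; a minimal sketch of what such a client pod could look like, assuming the NGC client SDK image and the `ngc-secret` created above (image tag and pod spec are assumptions, not the actual file):
+<pre>
+apiVersion: v1
+kind: Pod
+metadata:
+  name: tensorrt-client
+spec:
+  imagePullSecrets:
+    - name: ngc-secret                                           # NGC pull secret created above
+  containers:
+    - name: tensorrt-client
+      image: nvcr.io/nvidia/tensorrtserver:20.01-py3-clientsdk   # assumed client SDK image
+      command: ["sleep", "infinity"]                             # keep the pod alive for kubectl exec
+</pre>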
+
+Verify it is running:
+<pre>
+kubectl get pod tensorrt-client 
+NAME              READY   STATUS    RESTARTS   AGE
+tensorrt-client   1/1     Running   0          5m
+</pre>
+
+Run the inception test from the client pod. The TensorRT Inference Server IP address can be found by running `kubectl get svc`:
+<pre>
+kubectl exec -it tensorrt-client -- /bin/bash -c "image_client -u 192.168.60.150:8000 -m resnet50_netdef -s INCEPTION images/mug.jpg"
+Request 0, batch size 1
+Image 'images/mug.jpg':
+    504 (COFFEE MUG) = 0.723992
+</pre>
+
+Run the inception test with batch size 2 and print the top 3 classifications:
+<pre>
+ kubectl exec -it tensorrt-client -- /bin/bash -c "image_client  -u 192.168.60.150:8000 -m resnet50_netdef -s INCEPTION images/ -c 3 -b 2"
+Request 0, batch size 2
+Image 'images//mug.jpg':
+    504 (COFFEE MUG) = 0.723992
+    968 (CUP) = 0.270953
+    967 (ESPRESSO) = 0.00115996
+Image 'images//mug.jpg':
+    504 (COFFEE MUG) = 0.723992
+    968 (CUP) = 0.270953
+    967 (ESPRESSO) = 0.00115996
+</pre>
+
+
+
+