Issue #944: nfd-worker pods are going to crashloop backoff state

Signed-off-by: Lakshmi-Patneedi <Lakshmi_Patneedi@Dellteam.com>
Lakshmi-Patneedi 3 years ago
parent
commit
97f896eb66

+ 3 - 3
docs/Security/ENABLE_SECURITY_LOGIN_NODE.md

@@ -42,17 +42,17 @@
 
 ## Limiting User Authentication over sshd
 
-Users logging into this host will can be __optionally__ allowed or denied using an access control list. All users to be allowed or denied are to be listed in the variable `user` in `security_vars.yml`. 
+Users logging into this host will can be __optionally__ allowed or denied using an access control list. All users to be allowed or denied are to be listed in the variable `user` in `omnia_security_vars.yml`. 
 
 >> __Note:__ All users on the server will have to be defined manually. Omnia does not create any users by default.
 
 ## Session Timeout
 
-To encourage security, users who have been idle over 3 minutes will be logged out automatically. To adjust this value, update the `session_timeout` variable in `security_vars.yml`. This variable is mandatory. 
+To encourage security, users who have been idle over 3 minutes will be logged out automatically. To adjust this value, update the `session_timeout` variable in `omnia_security_vars.yml`. This variable is mandatory. 
 
 ## Restricting Program Support
 
-Optionally, different communication protocols can be disabled on the management station using the `restrict_program_support` and `restrict_softwares` variables. These protocols include: telnet,lpd,bluetooth,rlogin and rexec. Features that cannot be disabled include: ftp,smbd,nmbd,automount and portmap. 
+Optionally, different communication protocols can be disabled on the management station using the `restrict_program_support` and `restrict_softwares` variables in `omnia_security_vars.yml. These protocols include: telnet,lpd,bluetooth,rlogin and rexec. Features that cannot be disabled include: ftp,smbd,nmbd,automount and portmap. 
 
 
 ## Kernel Lockdown

+ 8 - 2
roles/k8s_start_services/tasks/deploy_k8s_services.yml

@@ -96,11 +96,13 @@
 - name: Helm - add Nvidia k8s-device-plugin (nvdp) repo
   command: "helm repo add nvdp '{{ nvidia_k8s_device_plugin_repo_url }}'"
   changed_when: true
+  when: ansible_local.inventory.nvidia_gpu > 0
   tags: init
 
 - name: Helm - add Nvidia GPU discovery (nvgfd) repo
   command: "helm repo add nvgfd '{{ nvidia_gpu_discovery_repo_url }}'"
   changed_when: true
+  when: ansible_local.inventory.nvidia_gpu > 0
   tags: init
 
 - name: Helm - update repo
@@ -189,13 +191,17 @@
 - name: Install nvidia-device-plugin
   command: "helm install --version='{{ nvidia_device_plugin_version }}' --generate-name --set migStrategy='{{ mig_strategy }}' nvdp/nvidia-device-plugin"
   changed_when: true
-  when: "'nvidia-device-plugin' not in k8s_pods.stdout"
+  when:
+    - "'nvidia-device-plugin' not in k8s_pods.stdout"
+    - ansible_local.inventory.nvidia_gpu > 0
   tags: init
 
 - name: Install GPU Feature Discovery
   command: "helm install --version='{{ gpu_feature_discovery_version }}' --generate-name --set migStrategy='{{ mig_strategy }}' nvgfd/gpu-feature-discovery"
   changed_when: true
-  when: "'node-feature-discovery' not in k8s_pods.stdout"
+  when:
+    - "'node-feature-discovery' not in k8s_pods.stdout"
+    - ansible_local.inventory.nvidia_gpu > 0
   tags: init
 
 - name: Deploy Xilinx Device plugin
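
Review note on the hunks above: the new `when` conditions read `ansible_local.inventory.nvidia_gpu` directly, so they rely on that local fact being present and numeric on every host that runs this role. Ansible populates `ansible_local.<name>` from `<name>.fact` files under `/etc/ansible/facts.d` by default. Whether Omnia always writes such a file before this role runs is not visible in this diff. The sketch below is one possible hardening, not part of this commit. It assumes Ansible 2.8 or later, where `default` can be applied to a nested value whose parents may be undefined, and it treats a missing fact as zero GPUs.

```yaml
# Sketch only, not part of this commit.
# Assumption: the local fact may be missing or stored as a string,
# so fall back to 0 and cast to int before comparing.
- name: Helm - add Nvidia k8s-device-plugin (nvdp) repo
  command: "helm repo add nvdp '{{ nvidia_k8s_device_plugin_repo_url }}'"
  changed_when: true
  when: (ansible_local.inventory.nvidia_gpu | default(0) | int) > 0
  tags: init
```

The same expression could replace the GPU condition in the other three guarded tasks. With it, hosts lacking the fact skip the Nvidia tasks instead of failing on an undefined variable.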

+ 1 - 1
roles/k8s_start_services/vars/main.yml

@@ -91,7 +91,7 @@ prometheus_path_on_host: /var/lib/prometheus-2.23.0.linux-amd64/
 
 spark_operator_repo: https://googlecloudplatform.github.io/spark-on-k8s-operator
 
-operator_image_tag: latest
+operator_image_tag: v1beta2-1.3.3-3.1.1
 
 volcano_scheduling_yaml_url: https://raw.githubusercontent.com/volcano-sh/volcano/v1.3.0/installer/volcano-development.yaml