
Issue #215: .md file update for all defects resolved

Signed-off-by: avinashvishwanath <avinash_vishwanath@dell.com>
Lucas A. Wilson 4 years ago
parent
commit
cb9e260e50

+ 9 - 9
appliance/roles/inventory/files/create_inventory.yml

@@ -69,19 +69,19 @@
       command: grep "{{ inventory_hostname }}" ../../provision/files/new_mapping_file.csv
       delegate_to: localhost
       register: file_present
-      when: mapping_file == "true"
+      when: mapping_file | bool == true
       ignore_errors: true
 
     - name: Set fact if mapping file present
       set_fact:
-        mapping_file_present: file_present.stdout
-      when: mapping_file == "true"
+        mapping_file_present: "{{ file_present.stdout }}"
+      when: mapping_file | bool == true
       ignore_errors: true
 
     - name: Get the static hostname from mapping file
       shell: awk -F',' '$3 == "{{ inventory_hostname }}" { print $2 }' ../../provision/files/new_mapping_file.csv
       delegate_to: localhost
-      when: ('localhost' in hostname_check.stdout) and (mapping_file_present != "" ) and ( mapping_file == "true" )
+      when: ('localhost' in hostname_check.stdout) and (mapping_file_present != "" ) and ( mapping_file | bool == true )
       register: host_name
       ignore_errors: true
 
@@ -89,23 +89,23 @@
       hostname:
         name: "{{ host_name.stdout }}"
       register: result_host_name
-      when: ('localhost' in hostname_check.stdout) and (mapping_file_present != "" ) and  (mapping_file == "true" )
+      when: ('localhost' in hostname_check.stdout) and (mapping_file_present != "" ) and  (mapping_file | bool == true )
       ignore_errors: true
 
     - name: Set the system hostname
       hostname:
         name: "compute{{ inventory_hostname.split('.')[-2] + '-' + inventory_hostname.split('.')[-1] }}"
       register: result_name
-      when: ('localhost' in hostname_check.stdout) and (mapping_file_present == "")
+      when: ('localhost' in hostname_check.stdout) and (mapping_file | bool == false)
       ignore_errors: true
 
-    - name: Add new hostname to /etc/hosts
+    - name: Add new hostname to /etc/hosts from mapping file
       lineinfile:
         dest: /etc/hosts
         regexp: '^127\.0\.0\.1[ \t]+localhost'
         line: "127.0.0.1 localhost {{ host_name.stdout }}"
         state: present
-      when: ('localhost' in hostname_check.stdout) and ( mapping_file_present != "" ) and ( mapping_file == "true" )
+      when: ('localhost' in hostname_check.stdout) and ( mapping_file_present != "" ) and ( mapping_file | bool == true )
       ignore_errors: true
 
     - name: Add new hostname to /etc/hosts
@@ -114,7 +114,7 @@
         regexp: '^127\.0\.0\.1[ \t]+localhost'
         line: "127.0.0.1 localhost 'compute{{ inventory_hostname.split('.')[-2] + '-' + inventory_hostname.split('.')[-1] }}'"
         state: present
-      when: ('localhost' in hostname_check.stdout) and (mapping_file_present == "" )
+      when: ('localhost' in hostname_check.stdout) and (mapping_file | bool == false )
       ignore_errors: true
 
 - name: Update inventory
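
The hunks above replace string comparisons such as `mapping_file == "true"` with the Jinja2 `bool` filter, so the conditions also hold when the variable arrives as a real boolean or a differently cased string. A minimal standalone sketch of the pattern (hypothetical playbook and values; the filter behaviour itself is standard Ansible) is:

```yaml
# Minimal sketch of the `bool` filter used in the when: conditions above.
# `mapping_file` typically arrives as a string when passed via --extra-vars.
- hosts: localhost
  gather_facts: false
  vars:
    mapping_file: "true"   # placeholder; could also be True, "True", "yes", ...
  tasks:
    - name: Runs only when mapping_file evaluates to true
      debug:
        msg: "A mapping file was provided"
      when: mapping_file | bool

    - name: Runs only when mapping_file evaluates to false
      debug:
        msg: "No mapping file was provided"
      when: not (mapping_file | bool)
```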

+ 1 - 1
appliance/roles/inventory/tasks/main.yml

@@ -80,7 +80,7 @@
           command: >-
             ansible-playbook -i {{ role_path }}/files/provisioned_hosts.yml
             {{ role_path }}/files/create_inventory.yml
-            --extra-vars "cobbler_username={{ cobbler_username }} cobbler_password={{ cobbler_password }} mapping_file={{ mapping_file }}"
+            --extra-vars "cobbler_username={{ cobbler_username }} cobbler_password={{ cobbler_password }} mapping_file={{ mapping_file | bool }}"
           no_log: True
           register: register_error
       rescue:

+ 2 - 0
appliance/roles/provision/files/inventory_creation.yml

@@ -28,6 +28,7 @@
     - name: Create the static ip
       shell: awk -F',' 'NR >1{print $3}' omnia/appliance/roles/provision/files/new_mapping_file.csv > static_hosts.yml
       changed_when: false
+      ignore_errors: true
 
     - name: Create the dynamic inventory
       shell: |
@@ -35,6 +36,7 @@
         echo "{{ vars_new }}" > temp.txt
         egrep -o '[1-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' temp.txt >>dynamic_hosts.yml
       changed_when: false
+      ignore_errors: true
 
     - name: Final inventory
       shell: cat dynamic_hosts.yml static_hosts.yml| sort -ur  >> omnia/appliance/roles/inventory/files/provisioned_hosts.yml
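
The two `ignore_errors: true` lines added above let this play continue when the awk/egrep extraction steps have nothing to work with (for example, when no mapping file was supplied). A rough standalone sketch of the same pattern is shown below; `failed_when: false` is mentioned only as a common alternative, not as what this commit uses:

```yaml
# Illustrative sketch: tolerate a missing or empty input file.
- name: Extract static IPs from the mapping file
  shell: awk -F',' 'NR > 1 {print $3}' new_mapping_file.csv > static_hosts.yml
  changed_when: false
  ignore_errors: true   # keep going even if the file is absent or empty
  # Alternative: `failed_when: false` suppresses the failure status entirely.
```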

+ 0 - 105
docs/INSTALL.md

@@ -1,105 +0,0 @@
-## TL;DR Installation
- 
-### Kubernetes Only
-Install Kubernetes and all dependencies
-```
-ansible-playbook -i host_inventory_file omnia.yml --skip-tags "slurm"
-```
-
-Initialize K8s cluster
-```
-ansible-playbook -i host_inventory_file omnia.yml --tags "init"
-```
-
-### Install Kubeflow 
-```
-ansible-playbook -i host_inventory_file platform/kubeflow.yaml
-```
-
-### Slurm Only
-```
-ansible-playbook -i host_inventory_file omnia.yml --skip-tags "k8s"
-```
-
-# Omnia  
-Omnia is a collection of [Ansible](https://www.ansible.com/) playbooks which perform:
-* Installation of [Slurm](https://slurm.schedmd.com/) and/or [Kubernetes](https://kubernetes.io/) on servers already provisioned with a standard [CentOS](https://www.centos.org/) image.
-* Installation of auxiliary scripts for administrator functions such as moving nodes between Slurm and Kubernetes personalities.
-
-Omnia playbooks perform several tasks:
-`common` playbook handles installation of software 
-* Add yum repositories:
-    - Kubernetes (Google)
-    - El Repo (for Nvidia drivers)
-    - EPEL (Extra Packages for Enterprise Linux)
-* Install Packages from repos:
-    - bash-completion
-    - docker
-    - gcc
-    - python-pip
-    - kubelet
-    - kubeadm
-    - kubectl
-    - nfs-utils
-    - nvidia-detect
-    - yum-plugin-versionlock
-* Restart and enable system level services
-    - Docker
-    - Kubelet
-
-`computeGPU` playbook installs Nvidia drivers and nvidia-container-runtime-hook
-* Add yum repositories:
-    - Nvidia (container runtime)
-* Install Packages from repos:
-    - kmod-nvidia
-    - nvidia-container-runtime-hook
-* Restart and enable system level services
-    - Docker
-    - Kubelet
-* Configuration:
-    - Enable GPU Device Plugins (nvidia-container-runtime-hook)
-    - Modify kubeadm config to allow GPUs as schedulable resource 
-* Restart and enable system level services
-    - Docker
-    - Kubelet
-
-`manager` playbook
-* Install Helm v3
-* (optional) add firewall rules for Slurm and kubernetes
-
-Everything from this point on can be called by using the `init` tag
-```
-ansible-playbook -i host_inventory_file kubernetes/kubernetes.yml --tags "init"
-```
-
-`startmanager` playbook
-* turn off swap
-*Initialize Kubernetes
-    * Head/manager
-        - Start K8S pass startup token to compute/slaves
-        - Initialize software defined networking (Calico)
-
-`startworkers` playbook
-* turn off swap
-* Join k8s cluster
-
-`startservices` playbook
-* Setup K8S Dashboard
-* Add `stable` repo to helm
-* Add `jupyterhub` repo to helm
-* Update helm repos
-* Deploy NFS client Provisioner
-* Deploy Jupyterhub
-* Deploy Prometheus
-* Install MPI Operator
-
-
-### Slurm
-* Downloads and builds Slurm from source
-* Install package dependencies
-    - Python3
-    - munge
-    - MariaDB
-    - MariaDB development libraries
-* Build Slurm configuration files
-

+ 186 - 0
docs/INSTALL_OMNIA.md

@@ -0,0 +1,186 @@
+# Install Omnia
+
+## Prerequisites
+Perform the following tasks before installing Omnia:
+* On the management node, install Ansible and Git using the following commands:
+	* `yum install epel-release -y`
+	* `yum install ansible git -y`
+
+__Note:__ Ansible must be installed using __yum__ only. If Ansible was installed using __pip3__, reinstall it using __yum__.
+
+* Ensure a stable Internet connection is available on the management node and on the target nodes. 
+* Ensure CentOS 7.9 (2009) is installed on the management node.
+* To provision the bare metal servers,
+	* Go to http://isoredirect.centos.org/centos/7/isos/x86_64/ and download the **CentOS-7-x86_64-Minimal-2009** ISO file to the following directory on the management node: `omnia/appliance/roles/provision/files`.
+	* Rename the downloaded ISO file to `CentOS-7-x86_64-Minimal-2009.iso`.
+* For DHCP configuration, you can provide a mapping file named mapping_file.csv under __omnia/appliance/roles/provision/files__. Each line of the CSV file must be in the format MAC,Hostname,IP, for example: __xx:xx:4B:C4:xx:44,validation01,172.17.0.81__ and __xx:xx:4B:C5:xx:52,validation02,172.17.0.82__.
+__Note:__ The mapping file must not contain duplicate hostnames, and hostnames must not contain the characters "_" or "."
+* Connect one of the Ethernet cards on the management node to the HPC switch and another Ethernet card to the __global_network__.
+* If SELinux is not disabled on the management node, disable it in /etc/sysconfig/selinux and restart the management node (see the sketch after this list).
+* PXE operates in __UEFI__ mode by default; the __BIOS legacy__ mode is not supported.
+* The default boot order of the bare metal servers must be set to __PXE__.
+* Omnia does not configure __RAID__. If a bare metal server has a __RAID__ controller installed, it is mandatory to create a __VIRTUAL DISK__.
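
As referenced in the SELinux prerequisite above, the change amounts to a single line in /etc/sysconfig/selinux followed by a restart. A minimal Ansible sketch (an illustrative task, not part of Omnia) might look like this:

```yaml
# Illustrative sketch only: disable SELinux as described in the prerequisites.
- name: Set SELINUX=disabled in /etc/sysconfig/selinux
  lineinfile:
    path: /etc/sysconfig/selinux
    regexp: '^SELINUX='
    line: 'SELINUX=disabled'
  # The management node must still be restarted for the change to take effect.
```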
+
+## Steps to install Omnia
+1. On the management node, change the working directory to the directory where you want to clone the Omnia Git repository.
+2. Clone the Omnia repository.
+``` 
+$ git clone https://github.com/dellhpc/omnia.git 
+```
+__Note:__ After the Omnia repository is cloned, a folder named __omnia__ is created. It is recommended that you do not rename this folder.
+
+3. Change the directory to `omnia/appliance`
+4. To provide passwords for Cobbler and AWX, edit the __`appliance_config.yml`__ file.
+* If you want to provide a mapping file for DHCP configuration, set the __mapping_file_exits__ variable in the __appliance_config.yml__ file to __true__; otherwise, set it to __false__.
+
+Omnia considers the following usernames as default:  
+* `cobbler` for Cobbler Server
+* `admin` for AWX
+* `slurm` for Slurm
+
+**Note**: 
+* Passwords must be a minimum of eight characters and a maximum of 30 characters in length.
+* Do not use the following characters in a password: -, \\, "", and \'
+
+5. Using the `appliance_config.yml` file, you can also change the NIC used by the DHCP server under *hpc_nic* and the NIC used to connect to the Internet under *public_nic*. The default values of __hpc_nic__ and __public_nic__ are em1 and em2, respectively.
+6. Set the DHCP range for the HPC cluster using the __Dhcp_start_ip_range__ and __Dhcp_end_ip_range__ variables in the __appliance_config.yml__ file (see the sketch after the notes below).
+7. To provide the passwords for the Slurm database and the Kubernetes CNI, edit the __`omnia_config.yml`__ file.
+
+**Note**:
+* Supported Kubernetes CNIs: calico and flannel; the default is __calico__.
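
A rough sketch of the `appliance_config.yml` variables referred to in steps 4 through 6 is shown below; the variable names come from the text above, while the values are placeholders (the Cobbler and AWX password fields are omitted because their names are not listed here):

```yaml
# Illustrative values only; variable names are taken from the steps above.
mapping_file_exits: true            # false if no mapping file is provided
hpc_nic: "em1"                      # NIC serving DHCP for the HPC cluster (default em1)
public_nic: "em2"                   # NIC connected to the Internet (default em2)
Dhcp_start_ip_range: "172.17.0.10"  # placeholder start of the DHCP range
Dhcp_end_ip_range: "172.17.0.100"   # placeholder end of the DHCP range
```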
+
+To view the passwords set in __`appliance_config.yml`__ at a later time, run the following command from the omnia/appliance directory:
+```
+ansible-vault view appliance_config.yml --vault-password-file .vault_key
+```
+
+To view the passwords set in __`omnia_config.yml`__ at a later time, run the following command:
+```
+ansible-vault view omnia_config.yml --vault-password-file .omnia_vault_key
+```
+
+  
+8. To install Omnia, run the following command:
+```
+ansible-playbook appliance.yml -e "ansible_python_interpreter=/usr/bin/python2"
+```
+   
+Omnia creates a log file which is available at: `/var/log/omnia.log`.
+
+**Provision operating system on the target nodes**  
+Omnia role used: *provision*
+
+To create the Cobbler image, Omnia configures the following:
+* Firewall settings.
+* The Cobbler kickstart file, which enables UEFI PXE boot.
+
+To access the Cobbler dashboard, enter `https://<IP>/cobbler_web` where `<IP>` is the Global IP address of the management node. For example, enter
+`https://100.98.24.225/cobbler_web` to access the Cobbler dashboard.
+
+__Note__: If a mapping file is not provided, the DHCP service assigns IP addresses and hostnames after the Cobbler server provisions the operating system on the nodes. Hostnames are assigned in the format **compute\<xxx>-\<xxx>**, where **xxx** are the last two octets of the host IP address. For example, if the host IP address is 172.17.0.11, the assigned hostname is compute0-11 (see the sketch below).
+__Note__: If a mapping file is provided, the hostnames follow the format provided in the mapping file.
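
As a rough illustration of how such a hostname is derived, the expression below mirrors the one used in `create_inventory.yml` in this commit; the play and variable are hypothetical, only the Jinja2 expression is taken from the playbook:

```yaml
# Minimal sketch: derive the compute hostname from the last two IP octets.
- hosts: localhost
  gather_facts: false
  vars:
    node_ip: "172.17.0.11"   # placeholder address from the example above
  tasks:
    - name: Show the derived hostname
      debug:
        msg: "compute{{ node_ip.split('.')[-2] + '-' + node_ip.split('.')[-1] }}"
      # Prints: compute0-11
```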
+
+**Install and configure Ansible AWX**  
+Omnia role used: *web_ui*  
+AWX repository is cloned from the GitHub path: https://github.com/ansible/awx.git 
+
+
+Omnia performs the following configuration on AWX:
+* The default organization name is set to **Dell EMC**.
+* The default project name is set to **omnia**.
+* Credential: omnia_credential
+* Inventory: omnia_inventory with compute and manager groups
+* Template: DeployOmnia and Dynamic Inventory
+* Schedules: DynamicInventorySchedule, which runs every 10 minutes
+
+To access the AWX dashboard, enter `http://<IP>:8081` where **\<IP>** is the Global IP address of the management node. For example, enter `http://100.98.24.225:8081` to access the AWX dashboard.
+
+**Note**: The AWX configurations are performed automatically by Omnia. Dell Technologies recommends that you do not change the default configurations provided by Omnia, as the functionality may be impacted.
+
+__Note__: Although the AWX UI is accessible, hosts are displayed only after a few nodes have been provisioned by Cobbler, which takes approximately 10-15 minutes. If a server has been provisioned but no host is visible in the AWX UI, run the __provision_report.yml__ playbook from the __omnia__ -> __appliance__ -> __tools__ folder to see which hosts are reachable.
+
+
+## Install Kubernetes and Slurm using AWX UI
+Kubernetes and Slurm are installed by deploying the **DeployOmnia** template on the AWX dashboard.
+
+1. On the AWX dashboard, under __RESOURCES__ __->__ __Inventories__, select __Groups__.
+2. Select either __compute__ or __manager__ group.
+3. Select the __Hosts__ tab.
+4. To add the hosts provisioned by Cobbler, select __Add__ __->__ __Add__ __existing__ __host__, and then select the hosts from the list and click __Save__.
+5. To deploy Omnia, under __RESOURCES__ -> __Templates__, select __DeployOmnia__ and click __LAUNCH__.
+6. By default, no skip tags are selected, and both Kubernetes and Slurm will be deployed. To install only Kubernetes, enter `slurm` in the skip tags field and select **Create "slurm"**. Similarly, to install only Slurm, add `kubernetes` as a skip tag. 
+
+__Note:__
+*	If you would like to skip the NFS client setup, enter `nfs_client` in the skip tag section to skip the __k8s_nfs_client_setup__ role of Kubernetes.
+
+7. Click **Next**.
+8. Review the details in the **Preview** window, and click **Launch** to run the DeployOmnia template. 
+
+To establish the passwordless communication between compute nodes and manager node:
+1. In AWX UI, under __RESOURCES__ -> __Templates__, select __DeployOmnia__ template.
+2. From __Playbook dropdown__ menu, select __appliance/tools/passwordless_ssh.yml__ and __Launch__ the template.
+
+__Note:__ If you want to install both the __jupyterhub__ and __kubeflow__ playbooks, install the __jupyterhub__ playbook first and then install the __kubeflow__ playbook.
+
+__Note:__ To install __jupyterhub__ and __kubeflow__ playbook:
+*	From __AWX UI__, under __RESOURCES__ -> __Templates__, select __DeployOmnia__ template.
+*	From __Playbook dropdown__ menu, select __platforms/jupyterhub.yml__ option and __Launch__ the template to install jupyterhub playbook.
+*	From __Playbook dropdown__ menu, select __platforms/kubeflow.yml__ option and __Launch__ the template to install kubeflow playbook.
+
+
+The DeployOmnia template may not run successfully if:
+- The Manager group contains more than one host.
+- The Compute group does not contain a host. Ensure that the Compute group is assigned at least one host node.
+- Both the kubernetes and slurm skip tags are selected under Skip Tags.
+
+After the **DeployOmnia** template is executed from the AWX UI, the **omnia.yml** file installs Kubernetes and Slurm, or either of them, depending on the selection made in the template on the management node. Additionally, appropriate roles are assigned to the compute and manager groups.
+
+The following __kubernetes__ roles are provided by Omnia when the __omnia.yml__ file is executed:
+- __common__ role:
+	- Installs common packages on the master and compute nodes
+	- Installs Docker
+	- Deploys ntp/chrony for time synchronization
+	- Installs Nvidia drivers and software components
+- __k8s_common__ role: 
+	- Installs the required Kubernetes packages
+	- Starts the docker and kubernetes services.
+- __k8s_manager__ role: 
+	- __helm__ package for Kubernetes is installed.
+- __k8s_firewalld__ role: This role enables the ports required by Kubernetes (a sketch of such a task appears after the note below). 
+	- For __head-node-ports__: 6443, 2379-2380, 10251, 10252
+	- For __compute-node-ports__: 10250, 30000-32767
+	- For __calico-udp-ports__: 4789
+	- For __calico-tcp-ports__: 5473, 179
+	- For __flannel-udp-ports__: 8285, 8472
+- __k8s_nfs_server_setup__ role: 
+	- An __nfs-share__ directory, __/home/k8nfs__, is created; the compute nodes use it to share common files.
+- __k8s_nfs_client_setup__ role
+- __k8s_start_manager__ role: 
+	- Runs __/bin/kubeadm init__ to initialize the Kubernetes services on the manager node and creates the service account for the Kubernetes Dashboard.
+- __k8s_start_workers__ role: 
+	- The compute nodes are initialized and joined to the Kubernetes cluster with the manager node. 
+- __k8s_start_services__ role: 
+	- Deploys Kubernetes services such as the Kubernetes Dashboard, Prometheus, MetalLB, and the NFS client provisioner.
+
+__Note:__ Once Kubernetes is installed and configured, a few Kubernetes and calico/flannel related ports are opened on the manager and compute nodes. This is required for Kubernetes Pod-to-Pod and Pod-to-Service communication. Calico/flannel provides the networking stack for Kubernetes pods.
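
As a rough sketch of the kind of task the __k8s_firewalld__ role might use to open one of the ports listed above, an Ansible `firewalld` task could look like the following (the port value comes from the head-node list; the task itself is illustrative, not taken from the role):

```yaml
# Illustrative only: open the Kubernetes API server port (6443/tcp) listed above.
- name: Open a head-node port required by Kubernetes
  firewalld:
    port: 6443/tcp
    permanent: yes
    immediate: yes
    state: enabled
```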
+
+The following __Slurm__ roles are provided by Omnia when the __omnia.yml__ file is executed:
+- __slurm_common__ role:
+	- Installs the common packages on the manager/head node and the compute nodes.
+- __slurm_manager__ role:
+	- Installs the packages required only on the manager node.
+	- Enables the ports required by Slurm:
+		__tcp_ports__: 6817,6818,6819
+		__udp_ports__: 6817,6818,6819
+	- Creates and updates the Slurm configuration files based on the manager node requirements.
+- __slurm_workers__ role:
+	- Installs the Slurm packages on all compute nodes as per the compute node requirements.
+- __slurm_start_services__ role: 
+	- Starts the Slurm services so that the compute nodes can communicate with the manager node.
+- __slurm_exporter__ role: 
+	- Slurm Exporter is a package that exports metrics collected from the Slurm resource scheduling system to Prometheus.
+	- Slurm Exporter is installed on the host, like Slurm, and installs successfully only if Slurm is already installed.

+ 75 - 0
docs/MONITOR_CLUSTERS.md

@@ -0,0 +1,75 @@
+# Monitor Kubernetes and Slurm
+Omnia provides playbooks to configure additional software components for Kubernetes such as JupyterHub and Kubeflow. For workload management (submitting, controlling, and managing jobs) of HPC, AI, and Data Analytics clusters, you can access the Kubernetes and Slurm dashboards and other supported applications. 
+
+__Note:__ To access the dashboards below, log in to the manager node and open the installed web browser.
+
+__Note:__ If you are connecting remotely, make sure your PuTTY or similar client supports X11 forwarding. If you are using MobaXterm version 8 or later, follow these steps:
+1. `yum install firefox -y`
+2. `yum install xorg-x11-xauth`
+3. Log out and log back in.
+4. To launch Firefox from the terminal, use the following command: 
+   `firefox &`
+
+## Access Kubernetes Dashboard
+1. To verify if the __Kubernetes-dashboard service__ is __running__, run the following command:
+  `kubectl get pods --all-namespaces`
+2. To start the Kubernetes dashboard, run the following command:
+  `kubectl proxy`
+3. From the CLI, run the following command to see the generated tokens: `kubectl get secrets`
+4. Copy the name of the token of type __kubernetes.io/service-account-token__ whose name starts with __prometheus-kube-state-metrics__.
+5. Run the following command: `kubectl describe secret <copied token name>`
+6. Copy the encrypted token value.
+7. On a web browser(installed on the manager node), enter http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/ to access the Kubernetes Dashboard.
+8. Select the authentication method as __Token__.
+9. On the Kubernetes Dashboard, paste the copied encrypted token and click __Sign in__.
+
+## Access Kubeflow Dashboard
+
+__Note:__ Use only a port number between __8000__ and __8999__.
+
+1. To see which ports are in use, use the following command:
+   `netstat -an`
+2. Choose a port number between __8000__ and __8999__ that is not in use.
+3. To run the __kubeflow__ dashboard at the selected port number, run the following command:
+   `kubectl port-forward -n istio-system svc/istio-ingressgateway <selected-port-number>:80`
+4. On a web browser installed on the __manager node__, go to http://localhost:<selected-port-number>/ to launch the Kubeflow central navigation dashboard.
+
+## Access JupyterHub Dashboard
+If you have installed the JupyterHub application for Kubernetes, you can access the dashboard by following these actions:
+1. To verify if the JupyterHub services are running, run the following command: 
+   `kubectl get pods --namespace default`
+2. Ensure that the pod names starting with __hub__ and __proxy__ are in __running__ status.
+3. Run the following command:
+   `kubectl get services`
+4. Copy the **External IP** of __proxy-public__ service.
+5. On a web browser installed on the __manager node__, use the External IP address to access the JupyterHub Dashboard.
+6. Enter any __username__ and __password__ combination to log in to JupyterHub. The __username__ and __password__ can be configured later from the JupyterHub dashboard.
+
+## Access Prometheus
+
+Prometheus is installed in one of two ways:
+  * On the host, when Slurm is installed without Kubernetes.
+  * As a Kubernetes role, when both Slurm and Kubernetes are installed.
+
+If Prometheus is installed as part of the Kubernetes role, run the following commands before starting the Prometheus UI:
+1. `export POD_NAME=$(kubectl get pods --namespace default -l "app=prometheus,component=server" -o jsonpath="{.items[0].metadata.name}")`
+2. `echo $POD_NAME`
+3. `kubectl --namespace default port-forward $POD_NAME 9090`
+
+__Note:__ If Prometheus is installed on the host, start the Prometheus web server as follows:
+* Navigate to the Prometheus folder. The default path is __/var/lib/prometheus-2.23.0.linux-amd64/__.
+* Start the web server: 
+  `./prometheus`
+
+Go to http://localhost:9090 to launch the Prometheus UI in the browser.

The file diff has been suppressed because it is too large
+ 51 - 7
docs/README.md