control_plane.yml fails?
Potential Causes:
Resolution:
Wait until the AWX UI is accessible at http://<management-station-IP>:8081, where management-station-IP is the IP address of the management node, and then run the control_plane.yml file again.
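The wait described above can be scripted as a simple polling loop. This is a minimal sketch, not part of Omnia: wait_until is a hypothetical helper, and the 900-second timeout and 10-second interval in the usage comment are assumed values.

```shell
# wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL seconds until it succeeds or until
# roughly TIMEOUT seconds have passed. Returns 0 on success, 1 on timeout.
# Hypothetical helper, not shipped with Omnia.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Example (replace the placeholder with the management node's IP address;
# timeout and interval are assumed values):
#   wait_until 900 10 curl -sf "http://<management-station-IP>:8081" \
#       && ansible-playbook control_plane.yml
```

Polling with a timeout avoids re-running the playbook too early while still giving up if the UI never comes up.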
Wait for 15 minutes after the Kubernetes cluster reboots. Next, verify the status of the cluster using the following commands:
Run kubectl get nodes on the manager node to get the real-time k8s cluster status.
Run kubectl get pods --all-namespaces on the manager node to check which pods are in the Running state.
Run kubectl cluster-info on the manager node to verify that both the k8s master and kubeDNS are in the Running state.
Run kubectl get pods --all-namespaces to verify that all pods are in the Running state.
If a pod is not in the Running state, delete it using kubectl delete pods <name of pod>
Then run the corresponding playbook: omnia.yml, jupyterhub.yml, or kubeflow.yml.
Run the command kubectl get pods --namespace default to ensure nfs-client pod and all Prometheus server pods are in the Running state.
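To spot pods that are not in the Running state without reading the whole listing, the kubectl output can be filtered. The sketch below assumes the default column order of kubectl get pods --all-namespaces (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE). not_running_pods is a hypothetical helper name.

```shell
# not_running_pods: reads `kubectl get pods --all-namespaces` output on stdin
# and prints "namespace/name status" for every pod whose STATUS column is
# not "Running". The first line (column headers) is skipped.
# Assumes the default column order; hypothetical helper, not part of Omnia.
not_running_pods() {
    awk 'NR > 1 && $4 != "Running" { print $1 "/" $2, $4 }'
}

# Example on the manager node:
#   kubectl get pods --all-namespaces | not_running_pods
```

Pods belonging to finished jobs report a status of Completed and are listed too, so review the output before deleting anything.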
control_plane.yml fail during the Run import command?
Cause:
Resolution:
Run docker rm -f cobbler and rerun control_plane.yml.
Potential Causes:
Resolution:
docker rm -f cobbler and docker image rm -f cobbler.
systemctl restart slurmdbd
systemctl restart slurmctld
systemctl restart prometheus-slurm-exporter
Run systemctl status slurmd on all the compute nodes to check the service, and restart it manually with systemctl restart slurmd if it is not running.
Potential Cause: The slurm.conf file is not configured properly.
Recommended Actions:
slurmdbd -Dvvv
slurmctld -Dvvv
Refer to the /var/lib/log/slurmctld.log file for more information.
Cause: Slurm database connection fails.
Recommended Actions:
slurmdbd -Dvvv
slurmctld -Dvvv
Refer to the /var/lib/log/slurmctld.log file.
Check the output of netstat -antp | grep LISTEN for PIDs in the listening state.
systemctl restart slurmctld on manager node
systemctl restart slurmdbd on manager node
systemctl restart slurmd on compute node
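When checking the netstat output for listening PIDs, it can help to narrow the listing to one port. The following is a sketch under assumptions: listeners_on_port is a hypothetical helper, and ports 6817 (slurmctld) and 6819 (slurmdbd) are the Slurm defaults, which your slurm.conf may override.

```shell
# listeners_on_port PORT: reads `netstat -antp` output on stdin and prints
# the PID/Program column for every socket in the LISTEN state whose local
# address ends in ":PORT". Hypothetical helper, not part of Omnia or Slurm.
listeners_on_port() {
    awk -v port="$1" '$6 == "LISTEN" && $4 ~ (":" port "$") { print $7 }'
}

# Example on the manager node (6817 and 6819 are the Slurm default ports
# for slurmctld and slurmdbd; check slurm.conf for the values in use):
#   netstat -antp | listeners_on_port 6817
#   netstat -antp | listeners_on_port 6819
```

Each output line has the form PID/program-name, which shows whether the expected daemon or some other process holds the port.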
Potential Cause: The host network is faulty causing DNS to be unresponsive
Resolution:
Run kubeadm reset -f on all the nodes.
Edit the omnia_config.yml file to change the Kubernetes Pod Network CIDR. The suggested IP range is 192.168.0.0/16. Ensure that the IP range provided is not in use on your host network.
Run ansible-playbook omnia.yml --skip-tags slurm
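Before changing the Pod Network CIDR, you can check that the chosen range does not overlap your host network. The sketch below does the comparison with plain integer arithmetic. The function names are hypothetical, and the host network 10.0.0.0/8 in the usage comment is only an example value.

```shell
# ip_to_int A.B.C.D: prints the IPv4 address as a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address into four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# cidr_range ADDR/PREFIX: prints "first last" (integers) for the range.
cidr_range() {
    addr=${1%/*}
    prefix=${1#*/}
    base=$(ip_to_int "$addr")
    size=$(( 1 << (32 - prefix) ))
    first=$(( base / size * size ))   # align to the network boundary
    echo "$first $(( first + size - 1 ))"
}

# cidr_overlap CIDR1 CIDR2: exit status 0 if the two ranges overlap.
cidr_overlap() {
    set -- $(cidr_range "$1") $(cidr_range "$2")
    # $1..$2 is the first range, $3..$4 is the second range
    [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

# Example (10.0.0.0/8 stands in for your host network):
#   if cidr_overlap 192.168.0.0/16 10.0.0.0/8; then
#       echo "Pod CIDR overlaps the host network; choose another range"
#   fi
```

Two ranges overlap exactly when each one starts no later than the other ends, which is the comparison cidr_overlap performs.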
Potential Cause: Unstable or slow Internet connectivity.
Resolution:
Change calico to flannel.
Run kubectl rollout restart deployment awx -n awx from the management station and try to re-run the job.
If the above solution doesn't work,
/var/nfs_awx.
omnia/control_plane/roles/webui_awx/files/.tower_cli.cfg.
idrac.yml file or other .yml files from AWX?
Potential Cause: The "PermissionError: [Errno 13] Permission denied" error is displayed if you have used the ansible-vault decrypt or encrypt commands.
Resolution:
chmod 664 <filename>.yml
It is recommended to use the ansible-vault view or edit commands rather than the ansible-vault decrypt or encrypt commands.
racadm getremoteservicesstatus
Error: The specified disk is not available. - Unavailable disk (0.x) in disk range '0.x-x':
show disks
You cannot create a linear disk group when a virtual disk group exists on the system?
At any given time, only one type of disk group can exist on the system. That is, all disk groups on the system must be exclusively linear or exclusively virtual. To fix the issue, either delete the existing disk group or change the type of pool you are creating.
Potential Cause: Older firmware version in PowerEdge servers. Omnia supports only iDRAC 8-based Dell EMC PowerEdge servers with firmware versions 2.75.75.75 and above, and iDRAC 9-based Dell EMC PowerEdge servers with firmware versions 4.40.40.00 and above.
/var/nfs_awx
/<project name>/control_plane/roles/webui_awx/files/.tower_cli.cfg
Once complete, it's safe to re-run control_plane.yml.
Potential Cause: The control_plane playbook does not support hostnames that contain an underscore, such as 'mgmt_station'.
As defined in RFC 822, the only legal characters are the following:
Alphanumeric (a-z and 0-9): Both uppercase and lowercase letters are acceptable, and the hostname is case insensitive. In other words, dvader.empire.gov is identical to DVADER.EMPIRE.GOV and Dvader.Empire.Gov.
Hyphen (-): Neither the first nor the last character in a hostname field should be a hyphen.
Period (.): The period should be used only to delimit fields in a hostname (e.g., dvader.empire.gov)
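The character rules listed above can be checked before running the playbook. This is a minimal sketch that tests only those rules (alphanumeric characters, hyphens not at the start or end of a field, periods as field delimiters). valid_hostname is a hypothetical helper and does not enforce length limits.

```shell
# valid_hostname NAME: exit status 0 if NAME uses only letters, digits,
# hyphens, and periods, with no field starting or ending in a hyphen and
# no empty fields. Hypothetical helper; length limits are not checked.
valid_hostname() {
    printf '%s\n' "$1" | grep -Eq \
        '^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?(\.[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?)*$'
}

# Example: check the current hostname before running control_plane.yml.
#   if ! valid_hostname "$(hostname)"; then
#       echo "Hostname contains unsupported characters" >&2
#   fi
```

With this check, a name such as mgmt_station is rejected because of the underscore, while mgmt-station and dvader.empire.gov pass.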
Potential Cause: Your Docker pull limit has been exceeded.
helm delete jupyterhub -n jupyterhub
Potential Cause: Your Docker pull limit has been exceeded.
kfctl delete -V -f /root/k8s/omnia-kubeflow/kfctl_k8s_istio.v1.0.2.yaml
No. During Cobbler-based deployment, only one OS is supported at a time. To deploy both, deploy one first, unmount /mnt/iso, and then re-run cobbler for the second OS.
Due to the latest catalog.xml file, firmware updates fail for some components on server models R640 and R740. Omnia execution is not interrupted, but an error is logged. For now, download those individual updates manually.