November 13, 2024 in HPC, OpenCHAMI, LANL by Travis Cotton and Alex Lovell-Troy · 3 minutes
This blog post is an abridged version of the training we give internal sysadmins at LANL. It walks you through the whole process of building and deploying OpenCHAMI on a set of small teaching clusters that we maintain for that purpose. For more details and example image configurations, visit our repo at github.com/OpenCHAMI/mini-bootcamp.
To get started, install the packages needed for an OpenCHAMI deployment:

```shell
dnf install -y ansible git podman jq
```

Edit your /etc/hosts file to include entries for your cluster. For example:
```
172.16.0.254 stratus.openchami.cluster
172.16.0.1 st001
# ...additional entries for each node
```

Install powerman and conman for node power and console management:
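If your nodes follow the addressing pattern in the example above (node N at 172.16.0.N), the per-node entries can be generated rather than typed by hand. A small sketch — the IP scheme and node count are assumptions taken from the example:

```shell
# Emit one /etc/hosts line per node, st001 through st009,
# assuming node N lives at 172.16.0.N as in the example above.
for i in $(seq 1 9); do
  printf '172.16.0.%d st%03d\n' "$i" "$i"
done
```

Append the output to /etc/hosts (e.g. `... >> /etc/hosts`) after checking it matches your actual addressing plan.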
```shell
dnf install -y powerman conman jq
```

Configure Powerman: add device and node info to /etc/powerman/powerman.conf using your shortnames.
```
device "ipmi0" "ipmipower" "/usr/sbin/ipmipower -D lanplus -u admin -p Password123! -h pst[001-009] -I 17 -W ipmiping |&"
node "st[001-009]" "ipmi0" "pst[001-009]"
```

Start Powerman:
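Conman needs a configuration of its own before consoles work. A minimal /etc/conman.conf sketch, assuming IPMI serial-over-LAN consoles and the same BMC hostnames as above — check conman.conf(5) for the exact directives your build supports, and add one CONSOLE line per node:

```
SERVER keepalive=ON
SERVER logdir="/var/log/conman"
GLOBAL log="console.%N"
GLOBAL ipmiopts="U:admin,P:Password123!"
CONSOLE name="st001" dev="ipmi:pst001"
# ...one CONSOLE line per node
```

Then start and enable conmand the same way as powerman.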
```shell
systemctl start powerman
systemctl enable powerman
```

Use buildah to create a lightweight test image.
Install buildah:
```shell
dnf install -y buildah
```

Build the base image:
```shell
CNAME=$(buildah from scratch)
MNAME=$(buildah mount $CNAME)
dnf groupinstall -y --installroot=$MNAME --releasever=8 "Minimal Install"
```

Set up the kernel and dependencies:
```shell
dnf install -y --installroot=$MNAME kernel dracut-live fuse-overlayfs cloud-init
```

Rebuild the initrd:
```shell
buildah run --tty $CNAME bash -c 'dracut --add "dmsquash-live livenet network-manager" --kver $(basename /lib/modules/*) -N -f --logfile /tmp/dracut.log 2>/dev/null'
```

Save the image:
```shell
buildah commit $CNAME test-image:v1
```

OpenCHAMI relies on several key microservices, including SMD, BSS, and cloud-init.
Clone the deployment recipes repository:
```shell
git clone https://github.com/OpenCHAMI/deployment-recipes.git
```

Go to the LANL podman-quadlets recipe:
```shell
cd deployment-recipes/lanl/podman-quadlets
```

Inventory Setup: Edit the inventory/01-ochami file to specify your hostname.
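As a sketch, the inventory file can be as simple as a single host in an ochami group (the group name is inferred from the inventory/group_vars/ochami directory; your checkout of the recipe may already contain the group header, in which case only the hostname needs changing):

```
[ochami]
stratus.openchami.cluster
```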
Cluster Configurations: Update inventory/group_vars/ochami/cluster.yaml with your cluster name and shortname.
SSH Key Pair: Generate an SSH key pair and add the public key to inventory/group_vars/ochami/cluster.yaml under cluster_boot_ssh_pub_key.
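A sketch of the resulting cluster.yaml — cluster_boot_ssh_pub_key is the variable named above, while the other key names and all values here are illustrative placeholders; use the variable names already present in the file:

```yaml
# Placeholder key names/values -- match them to the existing cluster.yaml.
cluster_name: demo
cluster_shortname: st
cluster_boot_ssh_pub_key: "ssh-ed25519 AAAAC3... root@stratus"
```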
Run the Playbook:
```shell
ansible-playbook -l $HOSTNAME -c local -i inventory -t configs ochami_playbook.yaml
```

After rebooting, run the full playbook:
```shell
ansible-playbook -l $HOSTNAME -c local -i inventory ochami_playbook.yaml
```

Check that the expected containers are running:
```shell
podman ps | awk '{print $NF}' | sort
```

Verify Services: Ensure SMD, BSS, and cloud-init are populated correctly.
```shell
ochami-cli smd --get-components
ochami-cli bss --get-bootparams
ochami-cli cloud-init --get-ci-data --name compute
```

Boot Nodes: Start and monitor node boots using pm and conman commands.
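The container check above can also be scripted so it fails loudly when something is missing. A sketch — the service names are assumptions based on the services discussed here, so match them to what your deployment actually runs:

```shell
# Report any expected OpenCHAMI container that is absent from a
# space-separated list of running container names.
# Service names below are illustrative; adjust to your deployment.
check_services() {
  local running="$1" svc missing=0
  for svc in smd bss cloud-init; do
    if ! grep -q "$svc" <<<"$running"; then
      echo "missing: $svc"
      missing=1
    fi
  done
  return $missing
}

# Example: feed it the live container list
# check_services "$(podman ps --format '{{.Names}}')"
```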
Logs for Debugging: Open additional terminal windows to monitor logs for DHCP, BSS, and cloud-init.
For more complex deployments, use the image-builder tool to build layered images.
```shell
podman build -t image-builder:test -f dockerfiles/dnf/Dockerfile_interactive .
podman run --device /dev/fuse -it --name image-builder --rm -v $PWD:/data image-builder:test 'image-build --log-level INFO --config /data/image-configs/base.yaml'
```

Use tpm-manager to handle secure data distribution to nodes.

By now, you should have a fully deployed OpenCHAMI environment, equipped with essential microservices and custom-built images, ready to scale. As a final step, consider adding further integrations like Slurm for job scheduling and network-mounted filesystems for additional storage.