Using SkyPilot to deploy a Kubernetes cluster#
Introduction#
SkyPilot makes it easy to deploy a Kubernetes cluster using Lambda Public Cloud on-demand instances. The NVIDIA GPU Operator is preinstalled so you can immediately use your instances' GPUs.
In this tutorial, you'll:
- Configure your Lambda Public Cloud Firewall and a Cloud API key for SkyPilot and Kubernetes.
- Install SkyPilot.
- Configure SkyPilot for Lambda Public Cloud.
- Use SkyPilot to launch two 1x H100 on-demand instances and deploy a 2-node Kubernetes cluster using those instances.
All of the instructions in this tutorial should be followed on your computer.
This tutorial assumes you have already installed:
- python3
- python3-venv
- python3-pip
- curl
- netcat
- socat
You can install these packages by running:
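# Example for Ubuntu or another Debian-based system; package names can differ on other distributions.
sudo apt update && sudo apt -y install python3 python3-venv python3-pip curl netcat socat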
You also need to install kubectl by running:
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" && \
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
Configure your Lambda Public Cloud Firewall and generate a Cloud API key#
Use the Lambda Public Cloud Firewall feature to add rules allowing incoming traffic to ports TCP/443 and TCP/6443.
Generate a Cloud API key for SkyPilot. You can also use an existing Cloud API key.
Install SkyPilot#
Create a directory for this tutorial and change into the directory by running:
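# The directory name skypilot-tutorial is only an example; any name works.
mkdir skypilot-tutorial && cd skypilot-tutorial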
Create and activate a Python virtual environment for this tutorial by running:
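# The environment name .venv is only an example.
python3 -m venv .venv && source .venv/bin/activate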
Then, install SkyPilot in your virtual environment by running:
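# Installs SkyPilot with the extras for Lambda Cloud and Kubernetes support.
pip install "skypilot[lambda,kubernetes]"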
Configure SkyPilot for Lambda Public Cloud#
Download the SkyPilot example cloud_k8s.yaml and launch_k8s.sh files by running:
curl -LO https://raw.githubusercontent.com/skypilot-org/skypilot/master/examples/k8s_cloud_deploy/cloud_k8s.yaml && \
curl -LO https://raw.githubusercontent.com/skypilot-org/skypilot/master/examples/k8s_cloud_deploy/launch_k8s.sh
Edit the cloud_k8s.yaml file.
At the top of the file:
- Set accelerators to H100:1.
- For SKY_K3S_TOKEN, replace mytoken with a strong passphrase.
Warning
It's important that you use a strong passphrase. Otherwise, the Kubernetes cluster can be compromised, especially if your firewall rules allow incoming traffic from all sources.
You can generate a strong passphrase by running:
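# One of several options: use OpenSSL to produce 16 random bytes encoded in Base64.
openssl rand -base64 16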
This command will generate a random string of characters such as zPUlZGe4HRcy+Om04RvGmQ==.
The top of the cloud_k8s.yaml file should look similar to:
resources:
  cloud: lambda
  accelerators: H100:1
  # Uncomment the following line to expose ports on a different cloud
  # ports: 6443

num_nodes: 2

envs:
  SKY_K3S_TOKEN: zPUlZGe4HRcy+Om04RvGmQ== # Can be any string, used to join worker nodes to the cluster
Note
You can set accelerators to a different instance type, for example, A100:8 for an 8x A100 instance or H100:8 for an 8x H100 instance.
Create a directory in your home directory named .lambda_cloud and change into that directory by running:
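mkdir -p ~/.lambda_cloud && cd ~/.lambda_cloud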
Create a file named lambda_keys that contains:
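api_key = API-KEY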
Replace API-KEY with your actual Cloud API key.
Use SkyPilot to launch instances and deploy Kubernetes#
Change into the directory you created for this tutorial by running:
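# Assumes the tutorial directory created earlier is named skypilot-tutorial and is in your home directory.
cd ~/skypilot-tutorial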
Then, launch two 1x H100 instances and deploy a 2-node Kubernetes cluster using those instances by running:
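bash launch_k8s.sh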
You'll begin to see output similar to:
===== SkyPilot Kubernetes cluster deployment script =====
This script will deploy a Kubernetes cluster on the cloud and GPUs specified in cloud_k8s.yaml.
+ CLUSTER_NAME=k8s
+ sky launch -y -c k8s cloud_k8s.yaml
Task from YAML spec: cloud_k8s.yaml
Considered resources (2 nodes):
------------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
------------------------------------------------------------------------------------------------
Lambda gpu_1x_h100_pcie 26 200 H100:1 us-east-1 4.98 ✔
------------------------------------------------------------------------------------------------
Multiple Lambda instances satisfy H100:1. The cheapest Lambda(gpu_1x_h100_pcie, {'H100': 1}) is considered among:
['gpu_1x_h100_pcie', 'gpu_1x_h100_sxm5'].
To list more details, run: sky show-gpus H100
Running task on cluster k8s...
It usually takes about 15 minutes for the Kubernetes cluster to be deployed.
The Kubernetes cluster is successfully deployed once you see:
Checking credentials to enable clouds for SkyPilot.
Kubernetes: enabled
To enable a cloud, follow the hints above and rerun: sky check
If any problems remain, refer to detailed docs at: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html
🎉 Enabled clouds 🎉
✔ Kubernetes
✔ Lambda
+ set +x
===== Kubernetes cluster deployment complete =====
You can now access your k8s cluster with kubectl and skypilot.
• View the list of available GPUs on Kubernetes: sky show-gpus --cloud kubernetes
• To launch a SkyPilot job running nvidia-smi on this cluster: sky launch --cloud kubernetes --gpus <GPU> -- nvidia-smi
To test the Kubernetes cluster, launch a job by running:
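# Launches a managed job requesting one H100 on the new cluster; the flags mirror the output shown below.
sky jobs launch --cloud kubernetes --gpus H100:1 -- nvidia-smi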
You'll see output similar to the following and will be asked if you want to proceed:
Task from command: nvidia-smi
Managed job 'sky-cmd' will be launched on (estimated):
Considered resources (1 node):
----------------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
----------------------------------------------------------------------------------------------------
Kubernetes 2CPU--8GB--1H100 2 8 H100:1 default 0.00 ✔
----------------------------------------------------------------------------------------------------
Launching a managed job 'sky-cmd'. Proceed? [Y/n]: y
Press Enter to proceed.
You should see output similar to the following, indicating the job ran successfully:
Launching managed job 'sky-cmd' from jobs controller...
Considered resources (1 node):
----------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
----------------------------------------------------------------------------------------------
Kubernetes 8CPU--24GB 8 24 - default 0.00 ✔
----------------------------------------------------------------------------------------------
⚙️ Launching managed jobs controller on Kubernetes.
└── Pod is up.
✓ Cluster launched: sky-jobs-controller-4b0a835a. View logs at: ~/sky_logs/sky-2024-11-07-08-48-26-378180/provision.log
⚙️ Mounting files.
Syncing (to 1 node): /tmp/managed-dag-sky-cmd-nza5s28e -> ~/.sky/managed_jobs/sky-cmd-4bfa.yaml
✓ Files synced. View logs at: ~/sky_logs/sky-2024-11-07-08-48-26-378180/file_mounts.log
⚙️ Running setup on managed jobs controller.
Check & install cloud dependencies on controller: done.
✓ Setup completed. View logs at: ~/sky_logs/sky-2024-11-07-08-48-26-378180/setup-*.log
Auto-stop is not supported for Kubernetes and RunPod clusters. Skipping.
⚙️ Job submitted, ID: 1
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(sky-cmd, pid=1362) Thu Nov 7 16:52:15 2024
(sky-cmd, pid=1362) +-----------------------------------------------------------------------------------------+
(sky-cmd, pid=1362) | NVIDIA-SMI 550.90.12 Driver Version: 550.90.12 CUDA Version: 12.4 |
(sky-cmd, pid=1362) |-----------------------------------------+------------------------+----------------------+
(sky-cmd, pid=1362) | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
(sky-cmd, pid=1362) | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
(sky-cmd, pid=1362) | | | MIG M. |
(sky-cmd, pid=1362) |=========================================+========================+======================|
(sky-cmd, pid=1362) | 0 NVIDIA H100 PCIe On | 00000000:08:00.0 Off | 0 |
(sky-cmd, pid=1362) | N/A 32C P0 49W / 350W | 1MiB / 81559MiB | 0% Default |
(sky-cmd, pid=1362) | | | Disabled |
(sky-cmd, pid=1362) +-----------------------------------------+------------------------+----------------------+
(sky-cmd, pid=1362)
(sky-cmd, pid=1362) +-----------------------------------------------------------------------------------------+
(sky-cmd, pid=1362) | Processes: |
(sky-cmd, pid=1362) | GPU GI CI PID Type Process name GPU Memory |
(sky-cmd, pid=1362) | ID ID Usage |
(sky-cmd, pid=1362) |=========================================================================================|
(sky-cmd, pid=1362) | No running processes found |
(sky-cmd, pid=1362) +-----------------------------------------------------------------------------------------+
✓ Managed job finished: 1 (status: SUCCEEDED).
📋 Useful Commands
Managed Job ID: 1
├── To cancel the job: sky jobs cancel 1
├── To stream job logs: sky jobs logs 1
├── To stream controller logs: sky jobs logs --controller 1
├── To view all managed jobs: sky jobs queue
└── To view managed job dashboard: sky jobs dashboard