Deploying Llama 3.2 3B in a Kubernetes (K8s) cluster
Introduction
In this tutorial, you'll:
- Stand up a single-node Kubernetes cluster on an on-demand instance using K3s.
- Install the NVIDIA GPU Operator so your cluster can use your instance's GPUs.
- Deploy Ollama in your cluster to serve the Llama 3.2 3B model.
- Install the Ollama client.
- Interact with the Llama 3.2 3B model.
You don't need a Kubernetes cluster to run Ollama and serve the Llama 3.2 3B model. Part of the purpose of this tutorial is to demonstrate that you can stand up a Kubernetes cluster on on-demand instances.
Stand up a single-node Kubernetes cluster
Install K3s (Kubernetes) by running:
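```bash
curl -sfL https://get.k3s.io | sh -
```

This is the official K3s install script; it sets up K3s as a systemd service and includes a bundled `kubectl`.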
Verify that your Kubernetes cluster is ready by running:
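```bash
sudo kubectl get nodes
```

`sudo` is needed because the K3s kubeconfig at `/etc/rancher/k3s/k3s.yaml` is readable only by root by default.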
You should see output similar to:
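```
NAME          STATUS   ROLES                  AGE   VERSION
my-instance   Ready    control-plane,master   64s   v1.30.4+k3s1
```

The node name, age, and Kubernetes version will vary with your instance and K3s release; the values above are illustrative.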
Install socat by running:
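```bash
sudo apt-get update && sudo apt-get -y install socat
```

The command above assumes an Ubuntu-based instance; use your distribution's package manager otherwise.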
socat is needed to enable port forwarding in a later step.
Install the NVIDIA GPU Operator
Install the NVIDIA GPU Operator in your Kubernetes cluster by running:
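```bash
# Install Helm, which is used to install the GPU Operator (skip if already installed)
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# Add NVIDIA's Helm repository
sudo helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
sudo helm repo update

# Install the GPU Operator into its own namespace, pointing Helm at the K3s kubeconfig
sudo helm install gpu-operator -n gpu-operator --create-namespace \
  --kubeconfig /etc/rancher/k3s/k3s.yaml nvidia/gpu-operator
```

These commands are a sketch of NVIDIA's documented Helm-based install; the release and namespace names (`gpu-operator`) are conventions you can change, and K3s setups may need additional Helm values for K3s's containerd paths (see NVIDIA's GPU Operator documentation).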
In a few minutes, verify that your instance's GPUs are detected by your cluster by running:
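```bash
sudo kubectl describe nodes | grep "nvidia.com/gpu"
```

The GPU Operator's feature discovery adds `nvidia.com/gpu.*` labels to each node, which is what this command inspects.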
You should see output similar to:
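```
nvidia.com/gpu.count=8
nvidia.com/gpu.product=Tesla-V100-SXM2-16GB
```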
`nvidia.com/gpu.count=8` indicates that your cluster detects 8 GPUs. `nvidia.com/gpu.product=Tesla-V100-SXM2-16GB` indicates that the detected GPUs are Tesla V100 SXM2 16GB GPUs.
In this tutorial, Ollama will only use 1 GPU.
Deploy Ollama in your Kubernetes cluster
Start an Ollama server in your Kubernetes cluster by running:
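```bash
cat <<EOF | sudo kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama
          ports:
            - containerPort: 11434 # Ollama's default API port
          resources:
            limits:
              nvidia.com/gpu: 1 # give Ollama a single GPU
EOF
```

This manifest is a minimal sketch: it runs the public `ollama/ollama` image as a single-replica Deployment and requests one GPU via the `nvidia.com/gpu` resource that the GPU Operator exposes.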
In a few minutes, run the following command to verify that the Ollama server is accepting connections and is using a GPU:
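```bash
sudo kubectl logs deployment/ollama
```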
You should see output similar to:
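```
...
msg="Listening on [::]:11434"
...
msg="inference compute" library=cuda compute=7.0 name="Tesla V100-SXM2-16GB"
```

The exact log format varies by Ollama version; the key detail is the `name="Tesla V100-SXM2-16GB"` field on the last line.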
The last line in the example output above shows that Ollama is using a single Tesla V100-SXM2-16GB GPU.

Start a tmux session by running:
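```bash
tmux
```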
Then, run the following command to make Ollama accessible from outside of your Kubernetes cluster:
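```bash
sudo kubectl port-forward deployment/ollama 11434:11434
```

This forwards the instance's local port 11434 (the Ollama client's default) to the Ollama pod; socat, installed earlier, is what enables this port forwarding.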
You should see output similar to:
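```
Forwarding from 127.0.0.1:11434 -> 11434
Forwarding from [::1]:11434 -> 11434
```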
Install the Ollama client
Press Ctrl + B, then press C to open a new tmux window.
Download and install the Ollama client by running:
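```bash
curl -fsSL https://ollama.com/install.sh | sh
```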
Serve and interact with the Llama 3.2 3B model
Serve the Llama 3.2 3B model using Ollama by running:
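```bash
ollama run llama3.2:3b
```

The Ollama client connects to `127.0.0.1:11434` by default, which the port forward from the previous step maps to the Ollama server running in your cluster.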
You can interact with the model once you see the following prompt:
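```
>>> Send a message (/? for help)
```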
Test the model by entering a prompt, for example:
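```
>>> Why is the sky blue?
```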
You should see output similar to:
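```
The sky appears blue because of a phenomenon called Rayleigh scattering:
molecules in Earth's atmosphere scatter shorter (blue) wavelengths of
sunlight more strongly than longer (red) wavelengths, so blue light
reaches your eyes from all directions.
```

Model responses are nondeterministic, so your output will differ.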