Serving Llama 3.1 8B and 70B using vLLM on an NVIDIA GH200 instance#
This tutorial outlines how to serve Llama 3.1 8B and 70B using vLLM on an On-Demand Cloud (ODC) instance backed by the NVIDIA GH200 Grace Hopper Superchip. These two models represent two different ways you might serve a model on an NVIDIA GH200 instance:
- Llama 3.1 8B fits entirely in the GH200 GPU's VRAM. You can serve this model without additional configuration.
- Llama 3.1 70B exceeds the GH200 GPU's available memory. However, you can still serve this model on your GH200 instance by offloading the model weights to the onboard CPU's memory. This technique, known as CPU offloading, effectively expands the available GPU memory for storing model weights, but requires CPU-GPU data transfer during each forward pass. Due to its high-bandwidth chip-to-chip connection, the GH200 is uniquely suited for offloading tasks like this one.
Though the tutorial focuses on these two Llama models, you can use these steps to serve any appropriately sized model vLLM supports on your GH200 instance.
Setting up your environment#
Launch your GH200 instance#
Begin by launching a GH200 instance:
- In the Lambda Cloud console, navigate to the SSH keys page, click Add SSH Key, and then add or generate an SSH key.
- Navigate to the Instances page and click Launch Instance.
- Follow the steps in the instance launch wizard.
- Instance type: Select 1x GH200 (96 GB).
- Region: Select an available region.
- Filesystem: Don't attach a filesystem.
- SSH key: Use the key you created in step 1.
- Click Launch instance.
- Review the EULAs. If you agree to them, click I agree to the above to start launching your new instance. Instances can take up to five minutes to fully launch.
Get access to Llama 3.1 8B and 70B#
Next, obtain a Hugging Face access token and get approval to access the Llama 3.1 8B and Llama 3.1 70B repositories:
- If you don't already have a Hugging Face account, create one.
- Navigate to the Hugging Face Access Tokens page and click Create new token.
- Under Access Type, click Read to give your token read-only access.
- Name your token and then click Create token. A modal dialog opens.
- Click Copy to copy your access token, paste the token somewhere safe for future use, and then click Done to exit the dialog.
- Navigate to the Llama-3.1-8B-Instruct page, and then review and accept the model's license agreement. After you accept the agreement, Hugging Face submits a request to access the model's repository for approval. On approval, you should gain access to all versions of the model (8B, 70B, and 405B).
The approval process tends to be fast. You can see the status of the request in your Hugging Face account settings.
Set up your Python virtual environment#
Create a new Python virtual environment and install the required libraries:
- In the Lambda Cloud console, navigate to the Instances page, find the row for your instance, and then click Launch in the Cloud IDE column. JupyterLab opens in a new window.
- In JupyterLab's Launcher tab, under Other, click Terminal to open a new terminal.
- In your terminal, create a Python virtual environment.
- Activate the virtual environment.
- Make sure that `pip`, `setuptools`, and `wheel` are all up to date.
- Install the latest nightly build of `torch`.
- Install a version of `fsspec` known to be compatible with the version of the Hugging Face `datasets` library used by recent nightly `torch` releases. A sketch of these commands follows this list.
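The following is a minimal sketch of these setup commands, assuming a virtual environment named `venv-vllm` and the public PyTorch nightly index; the exact environment name, CUDA index URL, and `fsspec` pin used in the original steps may differ:

```bash
# Create and activate a Python virtual environment (name is an assumption)
python3 -m venv ~/venv-vllm
source ~/venv-vllm/bin/activate

# Make sure pip, setuptools, and wheel are up to date
pip install --upgrade pip setuptools wheel

# Install the latest nightly build of torch
# (pick the index URL that matches your CUDA version)
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124

# Pin fsspec to a version compatible with the Hugging Face datasets library
# (the exact version to pin is an assumption)
pip install "fsspec==2024.6.1"
```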
Install and configure vLLM#
Now that you've set up your Python virtual environment, you can install and configure vLLM. Begin by installing Triton, which vLLM depends on to run on GH200 instances:
- Clone the Triton GitHub repository.
- Navigate to the cloned repository, and then install Triton's dependencies.
- Install Triton. A sketch of these commands follows this list.
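A minimal sketch of these steps, assuming the upstream `triton-lang/triton` repository; the dependency list and the location of Triton's Python package have changed between releases, so adjust as needed:

```bash
# Clone the Triton repository
git clone https://github.com/triton-lang/triton.git
cd triton

# Install build dependencies (list is an assumption)
pip install ninja cmake wheel pybind11

# Build and install Triton from source; in older releases the Python
# package lives in the python/ subdirectory of the repository
pip install -e python
```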
After Triton finishes installing, you can install and configure vLLM:
- Return to your home directory, and then clone the vLLM GitHub repository.
- Navigate to your cloned repository, and then configure your vLLM requirements files to use the version of `torch` you installed earlier.
- Install the rest of vLLM's dependencies, and then finish installing vLLM. This step can take up to 15 minutes to complete. A sketch of these commands follows this list.
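A minimal sketch of these steps, assuming the upstream vLLM repository and its `use_existing_torch.py` helper, which strips pinned `torch` versions from the requirements files; the requirements file names and build flags may differ between vLLM versions:

```bash
# Return home and clone the vLLM repository
cd ~
git clone https://github.com/vllm-project/vllm.git
cd vllm

# Rewrite the requirements files so they use the torch already installed
# in this environment
python use_existing_torch.py

# Install the remaining build dependencies, then build and install vLLM
# without build isolation so it picks up the nightly torch
pip install -r requirements-build.txt
pip install -e . --no-build-isolation
```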
Serving a model with vLLM#
After vLLM finishes installing, you can start serving models on your GH200 instance.
Serve Llama 3.1 8B using the vLLM API server#
To start a vLLM API server that serves the Llama 3.1 8B model:
- Set the following environment variables. Replace `<HF-TOKEN>` with your Hugging Face user access token.
- Start a `tmux` session for your vLLM server.
- Start the vLLM API server. The server loads your model and then begins serving it. A sketch of these commands follows this list.
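A minimal sketch of these steps. The `HF_TOKEN` variable name, the `tmux` session name, and the server invocation are assumptions; `MODEL_REPO` is the variable the test request later in this tutorial refers to, and the 8B repository name matches the model shown in the sample output below:

```bash
# Hugging Face token and model repository (MODEL_REPO is reused in later steps)
export HF_TOKEN=<HF-TOKEN>
export MODEL_REPO=meta-llama/Meta-Llama-3.1-8B-Instruct

# Start a tmux session for the server (session name is an assumption)
tmux new -s vllm

# Inside the tmux session: re-activate the virtual environment if needed, then
# start the OpenAI-compatible vLLM API server (listens on port 8000 by default)
python -m vllm.entrypoints.openai.api_server --model "${MODEL_REPO}"
```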
When the server is ready, you should see output similar to the following:
INFO: Started server process [12821]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Note: This step might take several minutes to complete.
Test the vLLM API server#
Now that your vLLM API server is up and running, verify that it's working as expected:
- Detach the `tmux` session for your server by pressing Ctrl + B, then D.
- Send a prompt to the server:
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"prompt\": \"What is the name of the capital of France?\",
    \"model\": \"${MODEL_REPO}\",
    \"temperature\": 0.0,
    \"max_tokens\": 1
  }"
You should see output similar to the following:
{"id":"cmpl-d3a33498b5d74d9ea09a7c256733b8df","object":"text_completion","created":1722545598,"model":"meta-llama/Meta-Llama-3.1-8B-Instruct","choices":[{"index":0,"text":"Paris","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":11,"total_tokens":12,"completion_tokens":1}}
To return to the `tmux` session for your server:
- Press Ctrl + B, then D to detach your current session.
- Reattach your vLLM server session, as shown in the sketch below.
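Assuming the `tmux` session name used in the earlier sketch (`vllm`), reattaching might look like this:

```bash
# Reattach the tmux session running the vLLM server
tmux attach -t vllm
```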
Serve Llama 3.1 70B using CPU offloading#
For models that exceed the GH200's VRAM, such as Llama 3.1 70B, you can use CPU offloading to store part of the model weights in the onboard CPU's memory. The GH200 features a high-bandwidth connection between its CPU and GPU, which can provide significant speed gains for tasks that require CPU offloading.
To serve Llama 3.1 70B using CPU offloading:
- If needed, press Ctrl + B, then D to detach your current session.
- Set the `MODEL_REPO` environment variable to Llama 3.1 70B instead of Llama 3.1 8B.
- Reattach your vLLM server session.
- Press Ctrl + C to stop the currently running server.
- Start the server again, this time using the Llama 3.1 70B model and enabling CPU offloading by appending the `--cpu-offload-gb` flag, as shown in the sketch after this list. Note: This step might take up to 20 minutes to complete.
- After the model finishes loading and the server starts serving, detach your `tmux` session again.
- Send a test prompt as described in the Test the vLLM API server section above.
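A minimal sketch of the offloading steps. The 70B repository name is assumed by analogy with the 8B model used earlier, and the value passed to `--cpu-offload-gb` is an assumption; choose it so the offloaded weights plus the weights and KV cache kept on the GPU fit your instance:

```bash
# Point MODEL_REPO at the 70B instruct model (run this where the server runs,
# for example inside the tmux session, so the new value is actually picked up)
export MODEL_REPO=meta-llama/Meta-Llama-3.1-70B-Instruct

# After stopping the old server with Ctrl + C, restart it with part of the
# weights offloaded to CPU memory (the 80 GB figure is an assumption)
python -m vllm.entrypoints.openai.api_server \
    --model "${MODEL_REPO}" \
    --cpu-offload-gb 80
```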
Add a firewall rule#
Optionally, you can add a firewall rule to allow external traffic to your API server. For details, see Firewalls > Creating a firewall rule. vLLM serves on port 8000 by default.
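With such a rule in place, you could send the same test request from your local machine by replacing `localhost` with your instance's public IP address; `<INSTANCE-IP>` below is a placeholder:

```bash
curl -X POST http://<INSTANCE-IP>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the name of the capital of France?", "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "temperature": 0.0, "max_tokens": 1}'
```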
Cleaning up#
When you're done with your instances, terminate them to avoid incurring unnecessary costs:
- In the Lambda Cloud console, navigate to the Instances page.
- Select the checkboxes of the instances you want to delete.
- Click Terminate. A dialog appears.
- Follow the instructions and then click Terminate instances to terminate your instances.
Next steps#
- To learn how to benchmark your GH200 instance against other instances, see Running a PyTorch®-based benchmark on an NVIDIA GH200 instance.
- For details on using vLLM to serve models on other instance types, see Serving the Llama 3.1 8B and 70B models using Lambda Cloud on-demand instances.
- To learn how to use Hugging Face's Diffusers and Transformers libraries on a GH200 instance, see Running Hugging Face Transformers and Diffusers on an NVIDIA GH200 instance.
- For more tips and tutorials, see our Education section.