How to serve the Llama 3.1 8B and 70B models using Lambda Cloud on-demand instances

This tutorial shows you how to use a Lambda Cloud 1x or 8x NVIDIA Tensor Core A100 or H100 instance to serve the Llama 3.1 8B and 70B models. You'll serve the models using vLLM running inside a Docker container.

Prerequisites

This tutorial assumes the following prerequisites:

  1. A Lambda Cloud on-demand instance appropriate for the Llama 3.1 model you want to run.

  2. A Hugging Face user account.

  3. A Hugging Face user access token with read access to the model repository you wish to use. The Llama 3.1 repositories are gated, so your request for access must be approved before you can download the models.
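
If you'd like to confirm that your token works before launching an instance, one option is to query the Hugging Face Hub's whoami endpoint (replace HF-TOKEN with your token); a valid token returns a JSON description of your account:

curl -s -H "Authorization: Bearer HF-TOKEN" https://huggingface.co/api/whoami-v2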

Start the vLLM API server

If you haven't already, use the dashboard or Cloud API to launch an instance. Then, SSH into your instance.
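
For example, assuming your instance's public IP address is INSTANCE-IP (shown in the dashboard) and you're using the default ubuntu user, the SSH command looks roughly like:

ssh ubuntu@INSTANCE-IP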

Run:

export HF_TOKEN=HF-TOKEN HF_HOME="/home/ubuntu/.cache/huggingface" MODEL_REPO=meta-llama/MODEL

Replace HF-TOKEN with your Hugging Face user access token.

Replace MODEL with:

  • If you're serving the 8B model:

Meta-Llama-3.1-8B-Instruct

  • If you're serving the 70B model:

Meta-Llama-3.1-70B-Instruct

This command sets the environment variables needed for the rest of this tutorial.
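
To confirm the non-secret variables are set as expected (avoid echoing your token), you can print them back:

echo "${MODEL_REPO}" "${HF_HOME}"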

Start a tmux session by running tmux.
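
tmux keeps the server running even if your SSH connection drops. If you get disconnected, reconnect to your instance and reattach to the session by running:

tmux attach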

If you're serving the 8B model, run:

sudo docker run \
  --gpus all \
  --ipc=host \
  -v "${HF_HOME}":/root/.cache/huggingface \
  -p 8000:8000 \
  --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
  vllm/vllm-openai --model "${MODEL_REPO}" \
    --disable-log-requests

If you're serving the 70B model, instead run:

sudo docker run \
  --gpus all \
  --ipc=host \
  -v "${HF_HOME}":/root/.cache/huggingface \
  -p 8000:8000 \
  --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
  vllm/vllm-openai --model "${MODEL_REPO}" \
    --disable-log-requests \
    --tensor-parallel-size 8

Both of the commands above:

  • Download the model you're serving.

  • Start vLLM's API server with the chosen model.

The difference between the two commands is that the second one sets --tensor-parallel-size 8, which shards the model across all 8 GPUs; the 70B model's weights are too large to fit on a single GPU. See vLLM's docs to learn more about its distributed inference strategies.
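
Before running the 70B command, it can be worth confirming that your instance actually exposes 8 GPUs, since the tensor parallel size can't exceed the number of visible GPUs. One quick way to check:

nvidia-smi --query-gpu=name --format=csv,noheader | wc -l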

The vLLM API server is running once you see output similar to:

INFO 08-01 19:11:07 api_server.py:292] Available routes are:
INFO 08-01 19:11:07 api_server.py:297] Route: /openapi.json, Methods: GET, HEAD
INFO 08-01 19:11:07 api_server.py:297] Route: /docs, Methods: GET, HEAD
INFO 08-01 19:11:07 api_server.py:297] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 08-01 19:11:07 api_server.py:297] Route: /redoc, Methods: GET, HEAD
INFO 08-01 19:11:07 api_server.py:297] Route: /health, Methods: GET
INFO 08-01 19:11:07 api_server.py:297] Route: /tokenize, Methods: POST
INFO 08-01 19:11:07 api_server.py:297] Route: /detokenize, Methods: POST
INFO 08-01 19:11:07 api_server.py:297] Route: /v1/models, Methods: GET
INFO 08-01 19:11:07 api_server.py:297] Route: /version, Methods: GET
INFO 08-01 19:11:07 api_server.py:297] Route: /v1/chat/completions, Methods: POST
INFO 08-01 19:11:07 api_server.py:297] Route: /v1/completions, Methods: POST
INFO 08-01 19:11:07 api_server.py:297] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
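
If you want a quick readiness check from a second terminal or tmux window, you can query the /health route listed above; the server returns an HTTP 200 response once it's ready to accept requests:

curl -i http://localhost:8000/health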

Test the vLLM API server

To test that the API server is serving the Llama 3.1 model:

Press Ctrl + B, then press C to open a new tmux window.

Then, run:

curl -X POST http://localhost:8000/v1/completions \
     -H "Content-Type: application/json" \
     -d "{
           \"prompt\": \"What is the name of the capital of France?\",
           \"model\": \"${MODEL_REPO}\",
           \"temperature\": 0.0,
           \"max_tokens\": 1
         }"

You should see output similar to:

{"id":"cmpl-d3a33498b5d74d9ea09a7c256733b8df","object":"text_completion","created":

You can make the output more human-readable using jq. To do this, first install jq by running:

sudo apt update && sudo apt install -y jq

Then, append | jq . to the curl command above.

The complete command should be:

curl -X POST http://localhost:8000/v1/completions \
     -H "Content-Type: application/json" \
     -d "{
           \"prompt\": \"What is the name of the capital of France?\",
           \"model\": \"${MODEL_REPO}\",
           \"temperature\": 0.0,
           \"max_tokens\": 1
         }" | jq .

The output should now look similar to:

{
  "id": "cmpl-529d01c83069409fa5c166e1d137e21e",
  "object": "text_completion",
  "created": 1722545913,
  "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": " Paris",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 11,
    "total_tokens": 12,
    "completion_tokens": 1
  }
}
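
The examples above use the /v1/completions route. Because vLLM exposes an OpenAI-compatible API, you can also call the /v1/chat/completions route listed in the server output. As a minimal sketch, a chat request using the OpenAI-style messages format looks like this:

curl -X POST http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d "{
           \"model\": \"${MODEL_REPO}\",
           \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}],
           \"max_tokens\": 32
         }" | jq .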

Acknowledgement

We'd like to thank the vLLM team for their partnership in developing this guide and their pioneering work in streamlining LLM serving.
