Deploying a Llama 3 inference endpoint

Deploy a Llama 3 8B or 70B inference endpoint using NVIDIA A100 or H100 Tensor Core GPUs in Lambda Cloud.

Meta's Llama 3 family of large language models (LLMs) comprises generative text models recognized for their state-of-the-art performance on common industry benchmarks.

This guide covers the deployment of a Meta Llama 3 inference endpoint on Lambda On-Demand Cloud, using the Llama 3 models hosted on Hugging Face.

The model is available in 8B and 70B sizes:

8B (8 billion parameters): More efficient and accessible, suitable for tasks where resources are constrained. The 8B model requires a 1x A100 or H100 GPU node.

70B (70 billion parameters): Superior performance and capabilities, ideal for complex or high-stakes applications. The 70B model requires an 8x A100 or H100 GPU node.

Prerequisites

This tutorial assumes the following prerequisites:

  1. A Lambda On-Demand Cloud instance appropriate for the Llama 3 model size you want to run.

  2. A Hugging Face user account.

  3. An approved Hugging Face user access token with read permission for the Meta Llama 3 model repository you wish to use (for example, meta-llama/Meta-Llama-3-8B).

JSON outputs in this tutorial are formatted using jq.
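
If jq isn't already installed on your instance, you can typically install it from the system package manager. The following assumes an Ubuntu-based instance, which Lambda On-Demand Cloud instances generally are:

sudo apt-get update && sudo apt-get install -y jq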

Set up the inference endpoint

Once you have an appropriate Lambda On-Demand Cloud instance and the required Hugging Face permissions, begin by setting up the inference endpoint.

  1. Add or generate an SSH key to access the instance.

  2. SSH into your instance.

  3. Create a dedicated Python virtual environment and install the required packages:

python3 -m venv Meta-Llama-3-8B
source Meta-Llama-3-8B/bin/activate
python3 -m pip install vllm==0.4.3 huggingface-hub==0.23.2 torch==2.3.0 numpy==1.26.4

  4. Log in to Hugging Face:

huggingface-cli login

  5. Start the model server. The model weights are downloaded and cached automatically on the first run:

python3 -m vllm.entrypoints.openai.api_server \
  --host=0.0.0.0 \
  --port=8000 \
  --model=meta-llama/Meta-Llama-3-8B &> api_server.log & 
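
If you're deploying the 70B model on an 8x A100 or H100 instance instead, the model needs to be sharded across all eight GPUs. With vLLM this is typically done with the --tensor-parallel-size option; the following is a sketch of the equivalent start command, assuming the same vLLM version as above and the meta-llama/Meta-Llama-3-70B repository:

python3 -m vllm.entrypoints.openai.api_server \
  --host=0.0.0.0 \
  --port=8000 \
  --model=meta-llama/Meta-Llama-3-70B \
  --tensor-parallel-size=8 &> api_server.log &

Downloading the weights and starting the server can take several minutes. As a rough readiness check (the exact log messages and routes can vary by vLLM version), follow the server log and then query the endpoint's model list once the server reports that it is running:

tail -f api_server.log
curl -s http://localhost:8000/v1/models | jq .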

Interact with the model

The following request sends a prompt to the Llama 3 model. The max_tokens parameter limits the length of the response:

curl -X POST http://localhost:8000/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
           "prompt": "What is the name of the capital of France?",
           "model": "meta-llama/Meta-Llama-3-8B",
           "temperature": 0.0,
           "max_tokens": 1
         }'

Llama 3 responds to requests in the following format:

{
  "id": "cmpl-d898e2089b7b4855b48e00684b921c95",
  "object": "text_completion",
  "created": 1718221710,
  "model": "meta-llama/Meta-Llama-3-8B",
  "choices": [
    {
      "index": 0,
      "text": " Paris",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 11,
    "total_tokens": 12,
    "completion_tokens": 1
  }
}
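
To extract just the generated text from the response, you can pipe the same request through jq, for example:

curl -s -X POST http://localhost:8000/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
           "prompt": "What is the name of the capital of France?",
           "model": "meta-llama/Meta-Llama-3-8B",
           "temperature": 0.0,
           "max_tokens": 1
         }' | jq -r '.choices[0].text'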
