# Deploying a Llama 3 inference endpoint
Meta's Llama 3 family of large language models (LLMs) comprises generative text models recognized for their state-of-the-art performance on common industry benchmarks.

This guide covers deploying a Meta Llama 3 inference endpoint on Lambda On-Demand Cloud, using the Llama 3 models hosted on Hugging Face.

Llama 3 is available in 8B and 70B sizes:
| Model size | Characteristics |
|---|---|
| 8B (8 billion parameters) | More efficient and accessible; suitable for tasks where resources are constrained. Requires a 1x A100 or H100 GPU node. |
| 70B (70 billion parameters) | Superior performance and capabilities; ideal for complex or high-stakes applications. Requires an 8x A100 or H100 GPU node. |
## Prerequisites
This tutorial assumes the following prerequisites:
- Lambda On-Demand Cloud instances appropriate for the Llama 3 model size you want to run:
    - The 8B model (meta-llama/Meta-Llama-3-8B) requires a 1x A100 or H100 GPU node.
    - The 70B model (meta-llama/Meta-Llama-3-70B) requires an 8x A100 or H100 GPU node.
- A Hugging Face user account.
- An approved Hugging Face user access token with read permissions for the Meta Llama 3 model repository you want to use.
The JSON outputs in this tutorial are formatted using `jq`.
## Set up the inference endpoint
Once you have the appropriate Lambda On-Demand Cloud instances and Hugging Face permissions, begin by setting up an inference endpoint.
- Launch your Lambda On-Demand Cloud instance.
- Add or generate an SSH key to access the instance.
- SSH into your instance.
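    For example (replace the key path and IP address with your own):

    ```bash
    # Connect to the instance; Lambda On-Demand Cloud instances use the ubuntu user
    ssh -i ~/.ssh/my-lambda-key ubuntu@<INSTANCE-IP>
    ```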
- Create a dedicated Python environment:
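    A minimal sketch of this step, assuming vLLM as the serving framework (the response fields shown later in this tutorial, such as `stop_reason`, match vLLM's OpenAI-compatible server); the environment name is arbitrary:

    ```bash
    # Create and activate an isolated Python virtual environment
    python3 -m venv llama3-env
    source llama3-env/bin/activate

    # Install the vLLM inference server into the environment
    pip install vllm
    ```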
- Log in to Hugging Face:
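    For example, using the Hugging Face CLI (installed as a dependency of vLLM; otherwise available via `pip install huggingface_hub`):

    ```bash
    # Authenticate so the gated Llama 3 weights can be downloaded;
    # paste your approved access token when prompted
    huggingface-cli login
    ```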
- Start the model server (the model weights are downloaded and cached as needed):
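    A sketch of this step using vLLM's OpenAI-compatible server; the tensor-parallel setting is an assumption matched to the node sizes given earlier in this tutorial:

    ```bash
    # 8B model on a 1x GPU node (weights are downloaded and cached on first run)
    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Meta-Llama-3-8B

    # 70B model on an 8x GPU node: shard the model across all eight GPUs
    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Meta-Llama-3-70B \
        --tensor-parallel-size 8
    ```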
## Interact with the model
The following request sends a prompt to the Llama 3 model:
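A representative request, assuming the server from the previous section is listening on its default port (8000) on localhost; the prompt and `max_tokens` values here are illustrative, and the model name should match the size you deployed:

```bash
curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Meta-Llama-3-8B",
          "prompt": "What is the capital of France?",
          "max_tokens": 1,
          "temperature": 0
        }' | jq .
```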
Llama 3 responds to requests in the following format. For example, the 8B model returns:

```json
{
  "id": "cmpl-d898e2089b7b4855b48e00684b921c95",
  "object": "text_completion",
  "created": 1718221710,
  "model": "meta-llama/Meta-Llama-3-8B",
  "choices": [
    {
      "index": 0,
      "text": " Paris",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 11,
    "total_tokens": 12,
    "completion_tokens": 1
  }
}
```
The 70B model responds in the same format:

```json
{
  "id": "cmpl-d898e2089b7b4855b48e00684b921c95",
  "object": "text_completion",
  "created": 1718221710,
  "model": "meta-llama/Meta-Llama-3-70B",
  "choices": [
    {
      "index": 0,
      "text": " Paris",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 11,
    "total_tokens": 12,
    "completion_tokens": 1
  }
}
```