Using the Lambda Inference API#

The Lambda Inference API enables you to use the Llama 3.1 405B Instruct large language model (LLM), and fine-tuned versions such as Nous Research's Hermes 3 and Liquid AI's LFM 40.3B MoE (Mixture of Experts), without needing to set up your own vLLM API server on an on-demand instance or 1-Click Cluster (1CC).

Tip

Try Lambda Chat!

Also try Companion, powered by the Lambda Inference API.

Contact us to learn more.

Since the Lambda Inference API is compatible with the OpenAI API, you can use it as a drop-in replacement for applications currently using the OpenAI API. See, for example, our guide on integrating the Lambda Inference API into VS Code.

The Lambda Inference API implements the following endpoints:

  • /chat/completions — creating chat completions
  • /completions — creating completions
  • /models — listing available models

Currently, the following models are available:

  • hermes3-405b
  • hermes3-70b
  • hermes3-8b
  • lfm-40b
  • llama3.1-405b-instruct-fp8
  • llama3.1-70b-instruct-fp8
  • llama3.1-8b-instruct
  • llama3.1-nemotron-70b-instruct-fp8
  • llama3.2-3b-instruct
  • llama3.3-70b-instruct-fp8
  • qwen25-coder-32b-instruct

To use the Lambda Inference API, first generate a Cloud API key from the dashboard. You can also use a Cloud API key that you've already generated.

In the examples below:

  • Replace <MODEL> with one of the models listed above.
  • Replace <API-KEY> with your actual Cloud API key.

Creating chat completions#

The /chat/completions endpoint takes a list of messages that make up a conversation, then outputs a response.

Run:

curl -sS https://api.lambdalabs.com/v1/chat/completions \
  -H "Authorization: Bearer <API-KEY>" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<MODEL>",
        "messages": [
          {
            "role": "system",
            "content": "You are a helpful assistant named Hermes, made by Nous Research."
          },
          {
            "role": "user",
            "content": "Who won the world series in 2020?"
          },
          {
            "role": "assistant",
            "content": "The Los Angeles Dodgers won the World Series in 2020."
          },
          {
            "role": "user",
            "content": "Where was it played?"
          }
        ]
      }' | jq .

You should see output similar to:

{
  "id": "chatcmpl-cbb10ffe2bf24c81a37d86204a3ec835",
  "object": "chat.completion",
  "created": 1733448149,
  "model": "hermes3-8b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The 2020 World Series was played at Globe Life Field in Arlington, Texas, due to the COVID-19 pandemic restrictions. All games were played at this neutral site to minimize travel and potential exposure to the virus."
      },
      "finish_reason": "stop",
      "content_filter_results": {
        "hate": {
          "filtered": false
        },
        "self_harm": {
          "filtered": false
        },
        "sexual": {
          "filtered": false
        },
        "violence": {
          "filtered": false
        },
        "jailbreak": {
          "filtered": false,
          "detected": false
        },
        "profanity": {
          "filtered": false,
          "detected": false
        }
      }
    }
  ],
  "usage": {
    "prompt_tokens": 65,
    "completion_tokens": 45,
    "total_tokens": 110,
    "prompt_tokens_details": null,
    "completion_tokens_details": null
  },
  "system_fingerprint": ""
}

Alternatively, you can use the OpenAI Python SDK. First, create and activate a Python virtual environment. Then, install the OpenAI Python API library by running:

pip install openai

Run, for example:

from openai import OpenAI

openai_api_key = "<API-KEY>"
openai_api_base = "https://api.lambdalabs.com/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "<MODEL>"

chat_completion = client.chat.completions.create(
    messages=[{
        "role": "system",
        "content": "You are a helpful assistant named Hermes, made by Nous Research."
    }, {
        "role": "user",
        "content": "Who won the world series in 2020?"
    }, {
        "role": "assistant",
        "content": "The Los Angeles Dodgers won the World Series in 2020."
    }, {
        "role": "user",
        "content": "Where was it played?"
    }],
    model=model,
)

print(chat_completion)

You should see output similar to:

ChatCompletion(id='chatcmpl-54ecd2c87a114a67a6928614088a7a92', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The 2020 World Series was played at Globe Life Field in Arlington, Texas, which is the home of the Texas Rangers. However, it was not the home field of the teams participating. This was due to the COVID-19 pandemic and the restrictions on travel and gatherings. The Los Angeles Dodgers played the Tampa Bay Rays for the championship.', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None), content_filter_results={'hate': {'filtered': False}, 'self_harm': {'filtered': False}, 'sexual': {'filtered': False}, 'violence': {'filtered': False}, 'jailbreak': {'filtered': False, 'detected': False}, 'profanity': {'filtered': False, 'detected': False}})], created=1733460270, model='llama3.1-8b-instruct', object='chat.completion', service_tier=None, system_fingerprint='', usage=CompletionUsage(completion_tokens=70, prompt_tokens=86, total_tokens=156, completion_tokens_details=None, prompt_tokens_details=None))

Creating completions#

The /completions endpoint takes a single text string (a prompt) as input, then outputs a response. In comparison, the /chat/completions endpoint takes a list of messages as input.

To use the /completions endpoint:

Run:

curl -sS https://api.lambdalabs.com/v1/completions \
  -H "Authorization: Bearer <API-KEY>" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<MODEL>",
        "prompt": "Computers are",
        "temperature": 0
      }' | jq .

You should see output similar to:

{
  "id": "chatcmpl-8e46443e199a446ea8a49ed124cad61b",
  "object": "text_completion",
  "created": 1733448483,
  "model": "hermes3-8b",
  "choices": [
    {
      "text": "1. Electronic devices that process data and perform a wide range of tasks\n2. Calculating machines used for complex mathematical operations\n3. Devices that can store and retrieve information\n4. Tools that enhance communication through email, instant messaging, and video conferencing\n5. Platforms for creating and sharing multimedia content, such as videos, photos, and music\n6. Essential tools for businesses and organizations in managing operations, financial transactions, and customer relations\n7. Systems used in scientific research and data analysis\n8. Devices that can be programmed to perform specific tasks and solve problems\n9. Networked tools that enable collaboration and resource sharing among users\n10. Powerful machines capable of performing complex computations, simulations, and artificial intelligence tasks.",
      "index": 0,
      "finish_reason": "stop",
      "logprobs": {
        "tokens": null,
        "token_logprobs": null,
        "top_logprobs": null,
        "text_offset": null
      }
    }
  ],
  "usage": {
    "prompt_tokens": 23,
    "completion_tokens": 149,
    "total_tokens": 172,
    "prompt_tokens_details": null,
    "completion_tokens_details": null
  }
}

Alternatively, you can use the OpenAI Python SDK. First, create and activate a Python virtual environment. Then, install the OpenAI Python API library by running:

pip install openai

Run, for example:

from openai import OpenAI

openai_api_key = "<API-KEY>"
openai_api_base = "https://api.lambdalabs.com/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "<MODEL>"

response = client.completions.create(
    prompt="Computers are",
    temperature=0,
    model=model,
)

print(response)

You should see output similar to:

Completion(id='chatcmpl-2b9da158a108459cb7e2e9ee61e72e49', choices=[CompletionChoice(finish_reason='stop', index=0, logprobs=Logprobs(text_offset=None, token_logprobs=None, tokens=None, top_logprobs=None), text='electronic devices that can be programmed to perform a variety of tasks, from simple calculations to complex operations. They can process and store vast amounts of data, communicate with other devices, and execute instructions at incredibly high speeds.')], created=1733460512, model='llama3.1-8b-instruct', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=45, prompt_tokens=38, total_tokens=83, completion_tokens_details=None, prompt_tokens_details=None))

Listing models#

The /models endpoint lists the models available for use through the Lambda Inference API.

To use the /models endpoint:

Run:

curl -sS https://api.lambdalabs.com/v1/models -H "Authorization: Bearer <API-KEY>" | jq .

You should see output similar to:

{
 "object": "list",
 "data": [
   {
     "id": "hermes3-405b",
     "object": "model",
     "created": 1724347380,
     "owned_by": "lambda"
   },
   {
     "id": "hermes3-70b",
     "object": "model",
     "created": 1724347380,
     "owned_by": "lambda"
   },
   {
     "id": "hermes3-8b",
     "object": "model",
     "created": 1724347380,
     "owned_by": "lambda"
   },
   {
     "id": "lfm-40b",
     "object": "model",
     "created": 1724347380,
     "owned_by": "lambda"
   },
   {
     "id": "llama3.1-405b-instruct-fp8",
     "object": "model",
     "created": 1724347380,
     "owned_by": "lambda"
   },

   […]

   {
     "id": "qwen25-coder-32b-instruct",
     "object": "model",
     "created": 1724347380,
     "owned_by": "lambda"
   }
  ]
}

Alternatively, you can use the OpenAI Python SDK. First, create and activate a Python virtual environment. Then, install the OpenAI Python API library by running:

pip install openai

Run:

from openai import OpenAI

openai_api_key = "<API-KEY>"
openai_api_base = "https://api.lambdalabs.com/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

print(client.models.list())

You should see output similar to:

SyncPage[Model](data=[Model(id='hermes3-405b', created=1724347380, object='model', owned_by='lambda'), Model(id='hermes3-70b', created=1724347380, object='model', owned_by='lambda'), Model(id='hermes3-8b', created=1724347380, object='model', owned_by='lambda'), Model(id='lfm-40b', created=1724347380, object='model', owned_by='lambda'), Model(id='llama3.1-405b-instruct-fp8', created=1724347380, object='model', owned_by='lambda'), Model(id='llama3.1-70b-instruct-fp8', created=1724347380, object='model', owned_by='lambda'), Model(id='llama3.1-8b-instruct', created=1724347380, object='model', owned_by='lambda'), Model(id='llama3.1-nemotron-70b-instruct-fp8', created=1724347380, object='model', owned_by='lambda'), Model(id='llama3.2-3b-instruct', created=1724347380, object='model', owned_by='lambda'), Model(id='qwen25-coder-32b-instruct', created=1724347380, object='model', owned_by='lambda')], object='list')
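To work with the model list programmatically, for example to extract just the model IDs, you can iterate over the returned page's `data` field. The `model_ids` helper below is illustrative, not part of the SDK:

```python
def model_ids(page):
    """Return the model IDs from a models-list response page."""
    return [m.id for m in page.data]

# Example usage (requires a valid API key):
# from openai import OpenAI
# client = OpenAI(api_key="<API-KEY>", base_url="https://api.lambdalabs.com/v1")
# print(model_ids(client.models.list()))
```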