Using the Lambda Inference API#

The Lambda Inference API enables you to use the Llama 3.1 405B Instruct large language model (LLM), and fine-tuned versions such as Nous Research's Hermes 3 and Liquid AI's LFM 40.3B MoE (Mixture of Experts), without needing to set up your own vLLM API server on an on-demand instance or 1-Click Cluster (1CC).

Tip

Try Lambda Chat!

Also try Companion, powered by the Lambda Inference API.

Contact us to learn more.

Since the Lambda Inference API is compatible with the OpenAI API, you can use it as a drop-in replacement for applications currently using the OpenAI API. See, for example, our guide on integrating the Lambda Inference API into VS Code.

The Lambda Inference API implements the following endpoints:

  • /chat/completions — creating chat completions
  • /completions — creating completions
  • /models — listing available models

Currently, the following models are available:

  • hermes3-405b
  • hermes3-70b
  • hermes3-8b
  • lfm-40b
  • llama3.1-405b-instruct-fp8
  • llama3.1-70b-instruct-fp8
  • llama3.1-8b-instruct
  • llama3.1-nemotron-70b-instruct-fp8
  • llama3.2-3b-instruct
  • llama3.3-70b-instruct-fp8
  • qwen25-coder-32b-instruct

To use the Lambda Inference API, first generate a Cloud API key from the dashboard. You can also use a Cloud API key that you've already generated.

In the examples below:

  • Replace <MODEL> with one of the models listed above.
  • Replace <API-KEY> with your actual Cloud API key.

Creating chat completions#

The /chat/completions endpoint takes a list of messages that make up a conversation, then outputs a response.

Run:

curl -sS https://api.lambdalabs.com/v1/chat/completions \
  -H "Authorization: Bearer <API-KEY>" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<MODEL>",
        "messages": [
          {
            "role": "system",
            "content": "You are a helpful assistant named Hermes, made by Nous Research."
          },
          {
            "role": "user",
            "content": "Who won the world series in 2020?"
          },
          {
            "role": "assistant",
            "content": "The Los Angeles Dodgers won the World Series in 2020."
          },
          {
            "role": "user",
            "content": "Where was it played?"
          }
        ]
      }' | jq .

You should see output similar to:

{
  "id": "chatcmpl-cbb10ffe2bf24c81a37d86204a3ec835",
  "object": "chat.completion",
  "created": 1733448149,
  "model": "hermes3-8b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The 2020 World Series was played at Globe Life Field in Arlington, Texas, due to the COVID-19 pandemic restrictions. All games were played at this neutral site to minimize travel and potential exposure to the virus."
      },
      "finish_reason": "stop",
      "content_filter_results": {
        "hate": {
          "filtered": false
        },
        "self_harm": {
          "filtered": false
        },
        "sexual": {
          "filtered": false
        },
        "violence": {
          "filtered": false
        },
        "jailbreak": {
          "filtered": false,
          "detected": false
        },
        "profanity": {
          "filtered": false,
          "detected": false
        }
      }
    }
  ],
  "usage": {
    "prompt_tokens": 65,
    "completion_tokens": 45,
    "total_tokens": 110,
    "prompt_tokens_details": null,
    "completion_tokens_details": null
  },
  "system_fingerprint": ""
}

Alternatively, you can use the OpenAI Python SDK. First, create and activate a Python virtual environment. Then, install the OpenAI Python API library by running:

pip install openai

Run, for example:

from openai import OpenAI

openai_api_key = "<API-KEY>"
openai_api_base = "https://api.lambdalabs.com/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "<MODEL>"

chat_completion = client.chat.completions.create(
    messages=[{
        "role": "system",
        "content": "You are a helpful assistant named Hermes, made by Nous Research."
    }, {
        "role": "user",
        "content": "Who won the world series in 2020?"
    }, {
        "role": "assistant",
        "content": "The Los Angeles Dodgers won the World Series in 2020."
    }, {
        "role": "user",
        "content": "Where was it played?"
    }],
    model=model,
)

print(chat_completion)

You should see output similar to:

ChatCompletion(id='chatcmpl-54ecd2c87a114a67a6928614088a7a92', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The 2020 World Series was played at Globe Life Field in Arlington, Texas, which is the home of the Texas Rangers. However, it was not the home field of the teams participating. This was due to the COVID-19 pandemic and the restrictions on travel and gatherings. The Los Angeles Dodgers played the Tampa Bay Rays for the championship.', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None), content_filter_results={'hate': {'filtered': False}, 'self_harm': {'filtered': False}, 'sexual': {'filtered': False}, 'violence': {'filtered': False}, 'jailbreak': {'filtered': False, 'detected': False}, 'profanity': {'filtered': False, 'detected': False}})], created=1733460270, model='llama3.1-8b-instruct', object='chat.completion', service_tier=None, system_fingerprint='', usage=CompletionUsage(completion_tokens=70, prompt_tokens=86, total_tokens=156, completion_tokens_details=None, prompt_tokens_details=None))

Creating completions#

The /completions endpoint takes a single text string (a prompt) as input, then outputs a response. In comparison, the /chat/completions endpoint takes a list of messages as input.

To use the /completions endpoint:

Run:

curl -sS https://api.lambdalabs.com/v1/completions \
  -H "Authorization: Bearer <API-KEY>" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<MODEL>",
        "prompt": "Computers are",
        "temperature": 0
      }' | jq .

You should see output similar to:

{
  "id": "chatcmpl-8e46443e199a446ea8a49ed124cad61b",
  "object": "text_completion",
  "created": 1733448483,
  "model": "hermes3-8b",
  "choices": [
    {
      "text": "1. Electronic devices that process data and perform a wide range of tasks\n2. Calculating machines used for complex mathematical operations\n3. Devices that can store and retrieve information\n4. Tools that enhance communication through email, instant messaging, and video conferencing\n5. Platforms for creating and sharing multimedia content, such as videos, photos, and music\n6. Essential tools for businesses and organizations in managing operations, financial transactions, and customer relations\n7. Systems used in scientific research and data analysis\n8. Devices that can be programmed to perform specific tasks and solve problems\n9. Networked tools that enable collaboration and resource sharing among users\n10. Powerful machines capable of performing complex computations, simulations, and artificial intelligence tasks.",
      "index": 0,
      "finish_reason": "stop",
      "logprobs": {
        "tokens": null,
        "token_logprobs": null,
        "top_logprobs": null,
        "text_offset": null
      }
    }
  ],
  "usage": {
    "prompt_tokens": 23,
    "completion_tokens": 149,
    "total_tokens": 172,
    "prompt_tokens_details": null,
    "completion_tokens_details": null
  }
}

Alternatively, you can use the OpenAI Python SDK. First, create and activate a Python virtual environment. Then, install the OpenAI Python API library by running:

pip install openai

Run, for example:

from openai import OpenAI

openai_api_key = "<API-KEY>"
openai_api_base = "https://api.lambdalabs.com/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "<MODEL>"

response = client.completions.create(
    prompt="Computers are",
    temperature=0,
    model=model,
)

print(response)

You should see output similar to:

Completion(id='chatcmpl-2b9da158a108459cb7e2e9ee61e72e49', choices=[CompletionChoice(finish_reason='stop', index=0, logprobs=Logprobs(text_offset=None, token_logprobs=None, tokens=None, top_logprobs=None), text='electronic devices that can be programmed to perform a variety of tasks, from simple calculations to complex operations. They can process and store vast amounts of data, communicate with other devices, and execute instructions at incredibly high speeds.')], created=1733460512, model='llama3.1-8b-instruct', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=45, prompt_tokens=38, total_tokens=83, completion_tokens_details=None, prompt_tokens_details=None))

Listing models#

The /models endpoint lists the models available for use through the Lambda Inference API.

To use the /models endpoint:

Run:

curl -sS https://api.lambdalabs.com/v1/models -H "Authorization: Bearer <API-KEY>" | jq .

You should see output similar to:

{
 "object": "list",
 "data": [
   {
     "id": "hermes3-405b",
     "object": "model",
     "created": 1724347380,
     "owned_by": "lambda"
   },
   {
     "id": "hermes3-70b",
     "object": "model",
     "created": 1724347380,
     "owned_by": "lambda"
   },
   {
     "id": "hermes3-8b",
     "object": "model",
     "created": 1724347380,
     "owned_by": "lambda"
   },
   {
     "id": "lfm-40b",
     "object": "model",
     "created": 1724347380,
     "owned_by": "lambda"
   },
   {
     "id": "llama3.1-405b-instruct-fp8",
     "object": "model",
     "created": 1724347380,
     "owned_by": "lambda"
   },

   […]

   {
     "id": "qwen25-coder-32b-instruct",
     "object": "model",
     "created": 1724347380,
     "owned_by": "lambda"
   }
  ]
}

Alternatively, you can use the OpenAI Python SDK. First, create and activate a Python virtual environment. Then, install the OpenAI Python API library by running:

pip install openai

Run:

from openai import OpenAI

openai_api_key = "<API-KEY>"
openai_api_base = "https://api.lambdalabs.com/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

print(client.models.list())

You should see output similar to:

SyncPage[Model](data=[Model(id='hermes3-405b', created=1724347380, object='model', owned_by='lambda'), Model(id='hermes3-70b', created=1724347380, object='model', owned_by='lambda'), Model(id='hermes3-8b', created=1724347380, object='model', owned_by='lambda'), Model(id='lfm-40b', created=1724347380, object='model', owned_by='lambda'), Model(id='llama3.1-405b-instruct-fp8', created=1724347380, object='model', owned_by='lambda'), Model(id='llama3.1-70b-instruct-fp8', created=1724347380, object='model', owned_by='lambda'), Model(id='llama3.1-8b-instruct', created=1724347380, object='model', owned_by='lambda'), Model(id='llama3.1-nemotron-70b-instruct-fp8', created=1724347380, object='model', owned_by='lambda'), Model(id='llama3.2-3b-instruct', created=1724347380, object='model', owned_by='lambda'), Model(id='qwen25-coder-32b-instruct', created=1724347380, object='model', owned_by='lambda')], object='list')
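To work with the model list programmatically, for example to extract just the model IDs, you can iterate over the returned page's `data` field. The `model_ids` helper below is illustrative, not part of the SDK:

```python
def model_ids(page):
    """Return the model IDs from a models-list response page."""
    return [m.id for m in page.data]

# Example usage (requires a valid API key):
# from openai import OpenAI
# client = OpenAI(api_key="<API-KEY>", base_url="https://api.lambdalabs.com/v1")
# print(model_ids(client.models.list()))
```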