Deploying models with dstack
dstack is an alternative to Kubernetes for orchestrating AI and ML applications. With dstack, you use YAML configuration files to define the Lambda Public Cloud resources your applications need. dstack automatically obtains those resources, that is, launches appropriate on-demand instances, and starts your applications.
In this tutorial, you'll learn how to set up dstack, and use it to deploy vLLM on a Lambda Public Cloud on-demand instance. vLLM will serve the Hermes 3 fine-tuned Llama 3.1 8B large language model (LLM).
All of the instructions in this tutorial should be followed on your computer. This tutorial assumes you have already installed:
python3
python3-venv
pip
git
curl
jq
You can install these packages by running:
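For example, on a Debian- or Ubuntu-based system (an assumption; use your distribution's package manager otherwise):

```bash
sudo apt update && sudo apt install -y python3 python3-venv python3-pip git curl jq
```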
Setting up the dstack server
To set up the dstack server:
Create a directory for this tutorial, and change into the directory by running:
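For example (the directory name hermes-3-tutorial is only a suggestion; any name works):

```bash
mkdir hermes-3-tutorial && cd hermes-3-tutorial
```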
Then, create and activate a Python virtual environment by running:
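For example, using a virtual environment named .venv:

```bash
python3 -m venv .venv && source .venv/bin/activate
```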
Install dstack by running:
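Per dstack's installation instructions, the [all] extra includes the server components:

```bash
pip install "dstack[all]" -U
```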
Create a directory for the dstack server and change into the directory by running:
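By default, the dstack server reads its configuration from ~/.dstack/server, so that's the natural location:

```bash
mkdir -p ~/.dstack/server && cd ~/.dstack/server
```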
In this directory, create a configuration file named config.yml with the following contents:
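A minimal server configuration for the Lambda backend, following dstack's documented config.yml schema, looks like this:

```yaml
projects:
- name: main
  backends:
  - type: lambda
    creds:
      type: api_key
      api_key: API-KEY
```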
Replace API-KEY with your actual Cloud API key.
Then, start the dstack server by running:
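```bash
dstack server
```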
You should see output similar to:
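The exact output depends on your dstack version, but it should resemble:

```
Applying ~/.dstack/server/config.yml...
The admin token is ...
The server is running at http://127.0.0.1:3000/
```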
Deploying vLLM and serving Hermes 3
To deploy vLLM and serve the Hermes 3 model:
Open another terminal. Then, change into the directory you created for this tutorial, and activate the Python virtual environment you created earlier, by running:
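For example, if you used the names from the earlier steps:

```bash
cd hermes-3-tutorial && source .venv/bin/activate
```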
In this directory, create a new directory named task-hermes-3-vllm and change into it by running:
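```bash
mkdir task-hermes-3-vllm && cd task-hermes-3-vllm
```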
In this new directory, create a file named .dstack.yml with the following contents:
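A task configuration along the following lines matches the behavior described below; it follows dstack's task schema, and the inline comments flag the values that are illustrative choices rather than requirements:

```yaml
type: task
name: task-hermes-3-vllm

python: "3.11"

# Install vLLM, then serve the Hermes 3 fine-tune of Llama 3.1 8B.
# The --max-model-len value is illustrative; adjust it to your needs.
commands:
  - pip install vllm
  - vllm serve NousResearch/Hermes-3-Llama-3.1-8B --max-model-len 8192
ports:
  - 8000

resources:
  # Request a GPU with between 40GB and 80GB of VRAM, as described below.
  gpu: 40GB..80GB
```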
Then, initialize and apply the configuration by running:
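```bash
dstack init
dstack apply -f .dstack.yml
```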
You'll see output similar to:
Press Y then Enter to submit the run. The run will take several minutes to complete.
dstack will automatically:
Launch an instance with between 40GB and 80GB of VRAM
Install vLLM and its dependencies using pip
Download the Hermes 3 model
Start vLLM and serve the Hermes 3 model
In the Lambda Public Cloud dashboard, you can see the instance launching.
vLLM is running and serving the Hermes 3 model once you see output similar to:
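The exact lines vary by vLLM version, but the server is ready once uvicorn reports that it's listening, for example:

```
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```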
In another terminal, test that vLLM is serving the Hermes 3 model by running:
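While the run is attached, dstack forwards the task's port to localhost, so you can query vLLM's OpenAI-compatible API locally. For example (the model name must match the one vLLM is serving; the prompt is arbitrary):

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "NousResearch/Hermes-3-Llama-3.1-8B",
        "messages": [{"role": "user", "content": "Who is the Greek god Hermes?"}]
      }' | jq .
```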
You should see output similar to:
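The response follows the OpenAI chat completions schema; abbreviated, it looks something like:

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "NousResearch/Hermes-3-Llama-3.1-8B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "..."
      },
      "finish_reason": "stop"
    }
  ]
}
```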
To quit vLLM and terminate the instance that was launched:
In the previous terminal, that is, the terminal you used to run dstack apply, press Ctrl + C. You'll be asked if you want to stop the run before detaching. Press Y then Enter. You'll see Stopped once the run is stopped.
After 5 minutes, the instance will terminate. Alternatively, you can delete the instance immediately by running:
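For example (replace FLEET-NAME with the name of the fleet dstack created for this run; you can list fleets with dstack fleet):

```bash
dstack fleet delete FLEET-NAME
```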
You'll be asked for confirmation that you want to delete the fleet, that is, the instance launched for this tutorial. Press Y then Enter.
Using the Lambda Public Cloud dashboard, you can confirm that the instance was terminated.
To shut down the dstack server, press Ctrl + C in the terminal running the server.