Before you can download the Llama 3.1 405B model, you need to review and accept the model's license agreement. Once you accept the agreement, a request to access the repository will be submitted for approval; approval tends to be fast. You can see the status of the request in your Hugging Face account settings.
Download the Llama 3.1 405B model and set up a head node
Then SSH into one of your 1CC GPU nodes. You can find the node names in your 1-Click Clusters dashboard. You’ll use this GPU node as a head node for cluster management.
On the head node, set environment variables needed for this tutorial by running:
Run the script to start a Ray cluster for serving the Llama 3.1 405B model using vLLM. The Ray cluster uses your 1CC's InfiniBand fabric for optimal performance.
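The script itself ships with the tutorial; at its core, bringing up the head of a Ray cluster looks roughly like the following sketch (the port and the NCCL InfiniBand setting are assumptions):

```shell
# Sketch: start the Ray head process on this node. The tutorial's
# script also configures the InfiniBand fabric; the NCCL_IB_HCA
# value here is an illustrative assumption.
export NCCL_IB_HCA=mlx5
ray start --head --port=6379
```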
Connect another GPU node to the head node
Next, you'll connect another of your 1CC's GPU nodes to the head node. This second node is referred to below as the worker node.
In a new terminal, SSH into the worker node, then set environment variables needed for this tutorial by running:
The Llama 3.1 405B model is ready to serve requests once you see output similar to:
INFO: Started server process [24469]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Test the Llama 3.1 405B model
Still on the worker node, open a new tmux window (Ctrl + B, then press C).
Download vLLM’s example OpenAI chat completion client.
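As a sketch, downloading the client into the location the next step expects might look like the following (the example's path within the vLLM repository varies between versions, so check the repo if this URL 404s):

```shell
# Download the example client and save it where the next step expects it.
wget -O "${SHARED_DIR}/inference_test.py" \
  https://raw.githubusercontent.com/vllm-project/vllm/main/examples/openai_chat_completion_client.py
```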
Finally, to test the Llama 3.1 405B model, run:
python3 ${SHARED_DIR}/inference_test.py
This command produces output similar to:
Chat completion results:
ChatCompletion(id='chat-8eba7fa7e2f7442aafa82a1683bfc77f', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The 2020 World Series was played at Globe Life Field in Arlington, Texas. This was a neutral site due to COVID-19 restrictions and was also referred to as a "bubble" environment.', role='assistant', function_call=None, tool_calls=[]), stop_reason=None)], created=1721884178, model='/root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/e04e3022cdc89bfed0db69f5ac1d249e21ee2d30', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=41, prompt_tokens=59, total_tokens=100))
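Because vLLM exposes an OpenAI-compatible API, you can also exercise the endpoint directly with curl; a sketch, assuming you run it on the node serving port 8000 (the model id must match what the server reports):

```shell
# List the model id(s) the server is serving.
curl http://localhost:8000/v1/models

# Send a chat completion request; replace <model-id> with the id
# returned by /v1/models.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-id>", "messages": [{"role": "user", "content": "Where was the 2020 World Series played?"}]}'
```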
Acknowledgement
We'd like to thank the vLLM team for their partnership in developing this guide and their pioneering work in streamlining LLM serving.