SGLang
Deploy Distributed Inference Service with SGLang and LWS on GPUs
In this example, we demonstrate how to deploy a distributed inference service using LeaderWorkerSet (LWS) with SGLang on GPU clusters.
SGLang provides native support for distributed tensor-parallel inference and serving, enabling efficient deployment of large language models (LLMs) such as DeepSeek-R1 671B and Llama-3.1-405B across multiple nodes. This example uses the meta-llama/Meta-Llama-3.1-8B-Instruct model to demonstrate multi-node serving capabilities. For implementation details on distributed execution, see "Run Multi-Node Inference" in the SGLang docs.
Since SGLang uses tensor parallelism for multi-node inference, which requires more frequent communication than pipeline parallelism, ensure high-bandwidth networking between nodes to avoid poor performance.
Deploy LeaderWorkerSet of SGLang
We use LeaderWorkerSet to deploy 2 SGLang replicas, each consisting of 2 Pods with 1 GPU per Pod. Set `--tp` to 2 to enable inference across the two Pods.
The leader pod runs the HTTP server, with a ClusterIP Service exposing the port.
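For context, below is a minimal sketch of the launch command the SGLang container in `lws.yaml` typically runs; the exact arguments in the example manifest may differ. `LWS_LEADER_ADDRESS` is the leader address that LWS exposes to the group's Pods, while `NODE_RANK` is an assumed variable (0 on the leader, 1 on the worker), e.g. derived from the `leaderworkerset.sigs.k8s.io/worker-index` Pod label.

```shell
# Sketch of the SGLang server launch command used inside the Pods.
# --tp 2 spreads tensor parallelism across the 2 GPUs (one per Pod);
# --nnodes / --node-rank describe the 2-Pod group;
# --dist-init-addr points every Pod at the leader for distributed init.
python3 -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --tp 2 \
  --nnodes 2 \
  --node-rank "${NODE_RANK}" \
  --dist-init-addr "${LWS_LEADER_ADDRESS}:20000" \
  --host 0.0.0.0 \
  --port 40000
```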
Replace the `HUGGING_FACE_HUB_TOKEN` in `lws.yaml` with your own Hugging Face token. Then deploy `lws.yaml`.
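A typical deployment command, assuming `lws.yaml` is in the current directory:

```shell
kubectl apply -f lws.yaml
```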
Verify the status of the SGLang pods
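For example, list the Pods created for the LeaderWorkerSet (this assumes the LeaderWorkerSet is named `sglang` and carries the standard LWS name label):

```shell
kubectl get pods -l leaderworkerset.sigs.k8s.io/name=sglang
```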
You should get an output similar to this
Verify that the distributed tensor-parallel inference works
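One way to check, assuming the leader Pod of the first replica is named `sglang-0`, is to inspect its logs and confirm that the SGLang HTTP server has started:

```shell
kubectl logs sglang-0
```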
You should get an output similar to this
Access ClusterIP Service
Use `kubectl port-forward` to forward local port 40000 to a pod.
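For example, assuming the leader Service created for the LeaderWorkerSet is named `sglang-leader`:

```shell
kubectl port-forward svc/sglang-leader 40000:40000
```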
The output should be similar to the following
Serve the Model
Open another terminal and send a request
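A sketch of such a request against SGLang's native `/generate` endpoint through the forwarded port (the prompt and sampling parameters here are only examples):

```shell
curl http://localhost:40000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "The capital of France is", "sampling_params": {"temperature": 0, "max_new_tokens": 32}}'
```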
The output should be similar to the following