SGLang
Deploy Distributed Inference Service with SGLang and LWS on GPUs
In this example, we demonstrate how to deploy a distributed inference service using LeaderWorkerSet (LWS) with SGLang on GPU clusters.
SGLang provides native support for distributed tensor-parallel inference and serving, enabling efficient deployment of large language models (LLMs) such as DeepSeek-R1 671B and Llama-3.1-405B across multiple nodes. This example uses the meta-llama/Meta-Llama-3.1-8B-Instruct model to demonstrate multi-node serving capabilities. For implementation details on distributed execution, see "Run Multi-Node Inference" in the SGLang docs.
Since SGLang uses tensor parallelism for multi-node inference, which requires more frequent communication than pipeline parallelism, ensure high-bandwidth networking between nodes to avoid poor performance.
Deploy LeaderWorkerSet of SGLang
We use LeaderWorkerSet to deploy 2 SGLang replicas, each consisting of 2 Pods with 1 GPU per Pod. Set `--tp` to 2 to enable inference across the two Pods.
The leader pod runs the HTTP server, with a ClusterIP Service exposing the port.
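For context, below is a minimal sketch of the launch command the SGLang container in `lws.yaml` typically runs; the exact arguments in the example manifest may differ. `LWS_LEADER_ADDRESS` is the leader address that LWS exposes to the group's Pods, while `NODE_RANK` is an assumed variable (0 on the leader, 1 on the worker), e.g. derived from the `leaderworkerset.sigs.k8s.io/worker-index` Pod label.

```shell
# Sketch of the SGLang server launch command used inside the Pods.
# --tp 2 spreads tensor parallelism across the 2 GPUs (one per Pod);
# --nnodes / --node-rank describe the 2-Pod group;
# --dist-init-addr points every Pod at the leader for distributed init.
python3 -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --tp 2 \
  --nnodes 2 \
  --node-rank "${NODE_RANK}" \
  --dist-init-addr "${LWS_LEADER_ADDRESS}:20000" \
  --host 0.0.0.0 \
  --port 40000
```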
Replace the `HUGGING_FACE_HUB_TOKEN` in `lws.yaml` with your own Hugging Face token. Then deploy `lws.yaml`.
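A typical deployment command, assuming `lws.yaml` is in the current directory:

```shell
kubectl apply -f lws.yaml
```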
Verify the status of the SGLang pods
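For example, list the Pods created for the LeaderWorkerSet (this assumes the LeaderWorkerSet is named `sglang` and carries the standard LWS name label):

```shell
kubectl get pods -l leaderworkerset.sigs.k8s.io/name=sglang
```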
You should get an output similar to this
Verify that the distributed tensor-parallel inference works
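One way to check, assuming the leader Pod of the first replica is named `sglang-0`, is to inspect its logs and confirm that the SGLang HTTP server has started:

```shell
kubectl logs sglang-0
```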
You should get an output similar to this
Access ClusterIP Service
Use `kubectl port-forward` to forward local port 40000 to a pod.
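For example, assuming the leader Service created for the LeaderWorkerSet is named `sglang-leader`:

```shell
kubectl port-forward svc/sglang-leader 40000:40000
```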
The output should be similar to the following
Serve the Model
Open another terminal and send a request
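A sketch of such a request against SGLang's native `/generate` endpoint through the forwarded port (the prompt and sampling parameters here are only examples):

```shell
curl http://localhost:40000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "The capital of France is", "sampling_params": {"temperature": 0, "max_new_tokens": 32}}'
```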
The output should be similar to the following