I. Startup Analysis

vLLM can be used either through its CLI tool or by calling it directly from Python code.

CLI startup path analysis

When vLLM is started with the following command line:

vllm serve Qwen/Qwen3-1.7B --dtype=half

a server is constructed to handle requests. The entry point can be found in pyproject.toml:

[project.scripts]
vllm = "vllm.entrypoints.cli:main"

[tool.pipx]
apps = ["vllm"]

1. CLI entry point analysis

From the pyproject.toml entry, the CLI tool's entry file is vllm/entrypoints/cli/main.py. Its main logic dispatches to different modules depending on the arguments passed:

# List of command modules
CMD_MODULES = [
    vllm.entrypoints.cli.openai,          # chat/complete commands
    vllm.entrypoints.cli.serve,           # serve command
    vllm.entrypoints.cli.benchmark.main,  # bench command
    vllm.entrypoints.cli.collect_env,     # collect-env command
    vllm.entrypoints.cli.run_batch,       # run-batch command
]

# Initialize a subcommand parser for each command module
for cmd_module in CMD_MODULES:
    new_cmds = cmd_module.cmd_init()
    for cmd in new_cmds:
        cmd.subparser_init(subparsers).set_defaults(dispatch_function=cmd.cmd)

if hasattr(args, "dispatch_function"):
    # Each subcommand's cmd was registered as dispatch_function; run it here
    args.dispatch_function(args)
else:
    parser.print_help()
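
The dispatch mechanism above is the standard argparse pattern: each subparser stores its handler in the parsed namespace via set_defaults(dispatch_function=...). A generic standalone sketch of that pattern (not vLLM code; the demo names are illustrative):

import argparse

# Generic argparse dispatch pattern: every subparser registers the function to
# run via set_defaults(), and the top level simply calls it after parsing.
parser = argparse.ArgumentParser(prog="demo")
subparsers = parser.add_subparsers()

serve_parser = subparsers.add_parser("serve")
serve_parser.add_argument("model")
serve_parser.set_defaults(dispatch_function=lambda a: print(f"serving {a.model}"))

args = parser.parse_args(["serve", "Qwen/Qwen3-1.7B"])
if hasattr(args, "dispatch_function"):
    args.dispatch_function(args)
else:
    parser.print_help()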

As the code shows, the CLI also provides the following commands:

  • vllm serve [model] [options] - start an OpenAI-compatible API server
  • vllm chat [options] - interactive chat with a running API server
  • vllm complete [options] - text completion
  • vllm run-batch [options] - batch processing of prompts
  • vllm bench latency - latency benchmark
  • vllm bench throughput - throughput benchmark
  • vllm bench serve - serving benchmark
  • vllm bench startup - startup-time benchmark
  • vllm bench sweep - parameter-sweep benchmark
  • vllm bench mm_processor - multimodal processor benchmark
  • vllm collect-env - collect environment information

Next, let's look at how these subcommands start. Each subcommand package has a cmd_init function that provides its subcommand object(s); here is the one in serve:

def cmd_init() -> list[CLISubcommand]:
    return [ServeSubcommand()]

Every subcommand object inherits from CLISubcommand:

class CLISubcommand:
    """Base class for CLI argument handlers."""

    name: str

    # Entry point of the subcommand; registered as dispatch_function=cmd.cmd
    @staticmethod
    def cmd(args: argparse.Namespace) -> None:
        raise NotImplementedError("Subclasses should implement this method")

    def validate(self, args: argparse.Namespace) -> None:
        # No validation by default
        pass

    # Argument parser initializer that parses each subcommand's arguments
    def subparser_init(
        self, subparsers: argparse._SubParsersAction
    ) -> FlexibleArgumentParser:
        raise NotImplementedError("Subclasses should implement this method")

Ultimately, each feature module implements the CLISubcommand base class to provide its own functionality, as sketched below.
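
To illustrate the pattern, a new subcommand only needs to subclass CLISubcommand, implement cmd and subparser_init, and return an instance from its module's cmd_init. The sketch below is hypothetical: the `hello` command, its --who option, and the exact import path are not part of vLLM and may differ between versions.

import argparse

# Hypothetical example; the import path of CLISubcommand may vary across versions.
from vllm.entrypoints.cli.types import CLISubcommand


class HelloSubcommand(CLISubcommand):
    """Illustrative `vllm hello` subcommand (not part of vLLM)."""

    name = "hello"

    @staticmethod
    def cmd(args: argparse.Namespace) -> None:
        # dispatch_function=cmd.cmd lands here when `vllm hello` is run
        print(f"Hello, {args.who}!")

    def subparser_init(self, subparsers: argparse._SubParsersAction):
        parser = subparsers.add_parser(self.name, help="Say hello")
        parser.add_argument("--who", default="vLLM")
        return parser


def cmd_init() -> list[CLISubcommand]:
    return [HelloSubcommand()]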

2. CLI serve logic

The serve entry point is implemented in vllm/entrypoints/cli/serve.py; the core code is as follows:

class ServeSubcommand(CLISubcommand):
    """The `serve` subcommand for the vLLM CLI."""

    # `vllm serve` on the CLI dispatches to this object
    name = "serve"

    @staticmethod
    def cmd(args: argparse.Namespace) -> None:
        # If model is specified in CLI (as positional arg), it takes precedence
        if hasattr(args, "model_tag") and args.model_tag is not None:
            args.model = args.model_tag

        if args.headless or args.api_server_count < 1:
            run_headless(args)
        else:
            if args.api_server_count > 1:
                run_multi_api_server(args)
            else:
                # Single API server (this process).
                uvloop.run(run_server(args))

As the cmd function shows, vllm serve supports three modes of operation:

Mode 1: headless mode
Trigger condition: headless=True
Behavior:

  • does not start an HTTP API server
  • starts only the inference engine core
  • suitable for:
    • worker nodes in distributed deployments
    • multi-node parallelism (PP/TP)
    • scenarios that need inference but no HTTP interface

Key code:

def run_headless(args: argparse.Namespace):
    # Create the engine config with headless=True
    vllm_config = engine_args.create_engine_config(
        usage_context=usage_context, headless=True
    )
    # Start the data-parallel engine processes
    engine_manager = CoreEngineProcManager(...)

Mode 2: single API server
Trigger condition: headless=False and api_server_count == 1

Behavior:

  • starts a single OpenAI-compatible API server
  • runs the server directly in the current process
  • suitable for single-machine deployments

Key code:

# Single API server (this process)
uvloop.run(run_server(args))

Mode 3: multiple API servers
Trigger condition: headless=False and api_server_count > 1
Behavior:

  • starts multiple API server processes
  • each server process runs independently
  • supports load balancing and high availability
  • suitable for high-concurrency production environments

Key code:

def run_multi_api_server(args: argparse.Namespace):
    # Set up multiprocess Prometheus metrics
    setup_multiprocess_prometheus()
    # Launch multiple API server processes
    api_server_manager = APIServerProcessManager(...)

Usage comparison

Mode                  Typical scenario                        Characteristics
Headless              Distributed deployments, worker nodes   No HTTP interface; inference engine only
Single API server     Development/testing, small deployments  Simple and direct; single process
Multiple API servers  Production, high concurrency            Multi-process load balancing; high availability

Example parameter settings:

# Headless mode (no API server)
vllm serve Qwen/Qwen3-4B --headless
vllm serve Qwen/Qwen3-4B --api-server-count=0

# Single API server mode (default)
vllm serve Qwen/Qwen3-4B
vllm serve Qwen/Qwen3-4B --api-server-count=1

# Multiple API server mode
vllm serve Qwen/Qwen3-4B --api-server-count=4

3. API server logic

In cli/serve.py, whether a single API server or multiple API server processes are launched, the code calls setup_server in vllm/entrypoints/openai/api_server.py to create the listening socket, and eventually reaches run_server_worker, which constructs the FastAPI app and starts the server:

async def run_server_worker(
    listen_address, sock, args, client_config=None, **uvicorn_kwargs
) -> None:
    """Run a single API server worker."""
    async with build_async_engine_client(
        args,
        client_config=client_config,
    ) as engine_client:
        # Build the FastAPI app that exposes the HTTP endpoints
        app = build_app(args)

        await init_app_state(engine_client, app.state, args)

        logger.info(
            "Starting vLLM API server %d on %s",
            engine_client.vllm_config.parallel_config._api_process_rank,
            listen_address,
        )
        shutdown_task = await serve_http(
            app,
            sock=sock,
            enable_ssl_refresh=args.enable_ssl_refresh,
            host=args.host,
            port=args.port,
            log_level=args.uvicorn_log_level,
            # NOTE: When the 'disable_uvicorn_access_log' value is True,
            # no access log will be output.
            access_log=not args.disable_uvicorn_access_log,
            timeout_keep_alive=envs.VLLM_HTTP_TIMEOUT_KEEP_ALIVE,
            ssl_keyfile=args.ssl_keyfile,
            ssl_certfile=args.ssl_certfile,
            ssl_ca_certs=args.ssl_ca_certs,
            ssl_cert_reqs=args.ssl_cert_reqs,
            h11_max_incomplete_event_size=args.h11_max_incomplete_event_size,
            h11_max_header_count=args.h11_max_header_count,
            **uvicorn_kwargs,
        )

    # NB: Await server shutdown only after the backend context is exited
    try:
        await shutdown_task
    finally:
        sock.close()


def build_app(args: Namespace) -> FastAPI:
    if args.disable_fastapi_docs:
        app = FastAPI(
            openapi_url=None, docs_url=None, redoc_url=None, lifespan=lifespan
        )
    elif args.enable_offline_docs:
        app = FastAPI(docs_url=None, redoc_url=None, lifespan=lifespan)
    else:
        app = FastAPI(lifespan=lifespan)

    # Register each module's router on the FastAPI app to bind its HTTP routes
    from vllm.entrypoints.serve import register_vllm_serve_api_routers
    register_vllm_serve_api_routers(app)

    from vllm.entrypoints.openai.chat_completion.api_router import (
        attach_router as register_chat_api_router,
    )
    register_chat_api_router(app)

    # ...
    # Attach the HTTP routers of the remaining modules
    register_responses_api_router(app)
    register_translations_api_router(app)
    register_completion_api_router(app)
    register_anthropic_api_router(app)
    register_models_api_router(app)
    register_sagemaker_routes(router)

    app.include_router(router)
    return app
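
The attach_router / register_* helpers are ordinary FastAPI router registration: each module owns an APIRouter and exposes a small function that the central build_app calls. A generic sketch of that pattern outside vLLM (names are illustrative):

from fastapi import APIRouter, FastAPI

# Each feature module defines its own router with its routes...
router = APIRouter()


@router.get("/ping")
async def ping():
    return {"pong": True}


# ...and exposes a helper that the central app factory calls to mount it.
def attach_router(app: FastAPI) -> None:
    app.include_router(router)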

run_server_worker then calls serve_http in vllm/entrypoints/launcher.py, which takes the already-bound socket and the constructed FastAPI app and starts the service with uvicorn.

async def serve_http(
    app: FastAPI,
    sock: socket.socket | None,  # receives the pre-bound socket
    enable_ssl_refresh: bool = False,
    **uvicorn_kwargs: Any,
):
    # ... configure uvicorn ...
    server = uvicorn.Server(config)
    # Start serving on the pre-bound socket
    server_task = loop.create_task(server.serve(sockets=[sock] if sock else None))
4. Main chat logic

When the /v1/chat/completions endpoint is called, the request enters the corresponding route in vllm/entrypoints/openai/chat_completion/api_router.py:

@router.post(
    "/v1/chat/completions",
    dependencies=[Depends(validate_json_request)],
    responses={
        HTTPStatus.OK.value: {"content": {"text/event-stream": {}}},
        HTTPStatus.BAD_REQUEST.value: {"model": ErrorResponse},
        HTTPStatus.NOT_FOUND.value: {"model": ErrorResponse},
        HTTPStatus.INTERNAL_SERVER_ERROR.value: {"model": ErrorResponse},
    },
)
@with_cancellation
@load_aware_call
async def create_chat_completion(request: ChatCompletionRequest, raw_request: Request):
    metrics_header_format = raw_request.headers.get(
        ENDPOINT_LOAD_METRICS_FORMAT_HEADER_LABEL, ""
    )
    handler = chat(raw_request)
    if handler is None:
        base_server = raw_request.app.state.openai_serving_tokenization
        return base_server.create_error_response(
            message="The model does not support Chat Completions API"
        )

    try:
        generator = await handler.create_chat_completion(request, raw_request)
    except Exception as e:
        return handler.create_error_response(e)

    if isinstance(generator, ErrorResponse):
        return JSONResponse(
            content=generator.model_dump(), status_code=generator.error.code
        )

    elif isinstance(generator, ChatCompletionResponse):
        return JSONResponse(
            content=generator.model_dump(),
            headers=metrics_header(metrics_header_format),
        )

    return StreamingResponse(content=generator, media_type="text/event-stream")

Here, handler.create_chat_completion invokes the actual chat logic and returns the model's inference result in OpenAI format:

# vllm/entrypoints/openai/chat_completion/serving.py
class OpenAIServingChat(OpenAIServing):
    async def create_chat_completion(
        self,
        request: ChatCompletionRequest,
        raw_request: Request | None = None,
    ) -> AsyncGenerator[str, None] | ChatCompletionResponse | ErrorResponse:
        """
        Chat Completion API similar to OpenAI's API.

        See https://platform.openai.com/docs/api-reference/chat/create
        for the API specification. This API mimics the OpenAI
        Chat Completion API.
        """
        # ...
        # Pre-request processing
        engine_request, tokenization_kwargs = await self._process_inputs(
            sub_request_id,
            engine_prompt,
            sampling_params,
            lora_request=lora_request,
            trace_headers=trace_headers,
            priority=request.priority,
            data_parallel_rank=data_parallel_rank,
        )
        # Call the EngineClient's RPC interface
        generator = self.engine_client.generate(
            engine_request,
            sampling_params,
            sub_request_id,
            lora_request=lora_request,
            trace_headers=trace_headers,
            priority=request.priority,
            prompt_text=prompt_text,
            tokenization_kwargs=tokenization_kwargs,
            data_parallel_rank=data_parallel_rank,
        )
        # ...

        # Streaming response
        if request.stream:
            return self.chat_completion_stream_generator(
                request,
                result_generator,
                request_id,
                model_name,
                conversation,
                tokenizer,
                request_metadata,
            )

        try:
            return await self.chat_completion_full_generator(
                request,
                result_generator,
                request_id,
                model_name,
                conversation,
                tokenizer,
                request_metadata,
            )

Within the main chat logic, the request reaches the inference engine over an RPC interface; the call chain is as follows:

OpenAI API layer (FastAPI)

OpenAIServingChat.create_chat_completion()

AsyncLLM.generate()  ← client-side interface

add_request()  ← request preprocessing

_add_request()  ← key fork point
├── output_processor.add_request()    ← local output handling
└── engine_core.add_request_async()   ← crosses process boundary to the core engine

EngineCoreClient  ← IPC/RPC communication layer

vLLM core engine process  ← the actual model inference
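
For reference, a request that exercises this whole chain looks like this from the client side. A minimal sketch assuming a server started with `vllm serve Qwen/Qwen3-1.7B` on the default port 8000 and the `openai` Python client installed:

from openai import OpenAI

# The API key is ignored unless the server was started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# POSTs to /v1/chat/completions and lands in create_chat_completion above
resp = client.chat.completions.create(
    model="Qwen/Qwen3-1.7B",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(resp.choices[0].message.content)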

5. Full path from CLI invocation to the inference engine

Direct code invocation path analysis

When the vLLM library is called directly from Python code, the request follows the core-engine path rather than the CLI path. The main entry points are visible in vllm/__init__.py:

__all__ = [
    "__version__",
    "bc_linter_skip",
    "bc_linter_include",
    "__version_tuple__",
    "LLM",
    "ModelRegistry",
    "PromptType",
    "TextPrompt",
    "TokensPrompt",
    "SamplingParams",
    "RequestOutput",
    "CompletionOutput",
    "PoolingOutput",
    "PoolingRequestOutput",
    "EmbeddingOutput",
    "EmbeddingRequestOutput",
    "ClassificationOutput",
    "ClassificationRequestOutput",
    "ScoringOutput",
    "ScoringRequestOutput",
    "LLMEngine",
    "EngineArgs",
    "AsyncLLMEngine",
    "AsyncEngineArgs",
    "initialize_ray_cluster",
    "PoolingParams",
]

The exposed modules can be called in code as follows:

from vllm import LLM, LLMEngine, AsyncLLMEngine
from vllm import SamplingParams, EngineArgs

# High-level API (recommended)
llm = LLM(model="Qwen/Qwen3-1.7B", dtype="half")
outputs = llm.generate(["Hello, how are you?"])

# Low-level API: build the engine from EngineArgs and drive the loop manually
engine = LLMEngine.from_engine_args(EngineArgs(model="Qwen/Qwen3-1.7B", dtype="half"))
engine.add_request("request-0", "Hello", SamplingParams())
while engine.has_unfinished_requests():
    outputs = engine.step()
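
The API-server path described in section 4 goes through the asynchronous variant of the engine. Below is a minimal sketch of driving it directly; it assumes the AsyncLLMEngine.from_engine_args / generate interface, whose details may differ slightly between vLLM versions:

import asyncio

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams


async def main():
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="Qwen/Qwen3-1.7B", dtype="half")
    )
    # generate() is an async generator that yields partial RequestOutputs as
    # tokens are produced, similar to what the API server consumes through
    # engine_client.generate()
    final = None
    async for output in engine.generate(
        "Hello, how are you?", SamplingParams(max_tokens=64), request_id="req-0"
    ):
        final = output
    print(final.outputs[0].text)


asyncio.run(main())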