CVE-2026-10300
Description
SGLang's inference HTTP endpoint crashes due to an assertion failure when handling concurrent LoRA requests exceeding the batch limit.
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
Vulnerability
A vulnerability exists in SGLang versions that predate the fix for issue #23141, in the Inference HTTP Endpoint component. When the --max-loras-per-batch limit is set and a single scheduling round receives concurrent requests for more distinct LoRA adapters than the limit allows, together with base-model requests, the scheduler builds a batch that exceeds the limit. That batch trips an assertion in the lora_manager.fetch_new_loras() function in python/sglang/srt/lora/lora_manager.py [1].
Exploitation
Exploitation requires a server that was started with a --max-loras-per-batch value lower than the number of configured LoRA adapters. A remote attacker then sends concurrent requests that target more distinct LoRA adapters than the limit allows, alongside requests for the base model, which triggers the crash. The attack requires network access to the server and is rated as high complexity and difficult to exploit [1, 2].
Impact
Successful exploitation of this vulnerability results in a denial of service. The scheduler process raises an unhandled exception, which propagates to the server, causing it to become permanently unresponsive. This leads to a loss of availability for the inference service [1, 2].
Mitigation
A pull request has been submitted to address this issue by enforcing the max_loras_per_batch limit at the scheduler admission stage, preventing the assertion failure. The fix is detailed in reference [2]. The availability of a fixed version and its release date are not yet disclosed in the available references. The vulnerability has been publicly disclosed and may be actively exploited [1].
AI Insight generated on Jun 1, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected products (1)
- Range: =0.5.10.post1
Patches (1)
3171b8683146: Merge a42932696cabd81cdd8b98e0af34f00f0c26f859 into c06220159bca623b139018c8fe5c3822a8ded5de
2 files changed · +176 −1
python/sglang/srt/managers/scheduler.py  +18 −1  modified
@@ -2627,7 +2627,7 @@ def _get_new_batch_prefill_raw(
         self._chunked_req_scheduled_last_iter = False
 
         if self.enable_lora:
-            running_loras = {req.lora_id for req in self.running_batch.reqs}
+            running_loras = self._collect_committed_lora_ids(adder.can_run_list)
 
             if self.lora_drainer:
                 self.lora_drainer.update_draining_state(
@@ -2814,6 +2814,23 @@ def _can_schedule_lora_req(
             new_lora_set
         )
 
+    def _collect_committed_lora_ids(
+        self, prefill_can_run_list: List[Req]
+    ) -> set[Optional[str]]:
+        """Collect lora_ids already committed to the next batch so admission
+        can correctly enforce ``max_loras_per_batch``.
+
+        Includes (a) reqs in the current running decode batch and (b) reqs
+        already placed into the prefill adder's can_run_list — notably the
+        chunked_req, which is admitted unconditionally before the waiting
+        queue is processed. Without (b) a chunked LoRA prefill is invisible
+        to admission, allowing the scheduler to admit N+1 distinct adapters
+        when ``max_loras_per_batch=N``. See sgl-project/sglang#23141.
+        """
+        committed = {req.lora_id for req in self.running_batch.reqs}
+        committed.update(r.lora_id for r in prefill_can_run_list)
+        return committed
+
     def update_running_batch(self, batch: ScheduleBatch) -> Optional[ScheduleBatch]:
         """Update the current running decoding batch."""
         initial_bs = batch.batch_size()
test/registered/unit/managers/test_scheduler_lora_admission.py  +158 −0  added
@@ -0,0 +1,158 @@
+"""Regression tests for LoRA admission when chunked_req is present (#23141)."""
+
+import ast
+import inspect
+import unittest
+from types import SimpleNamespace
+from unittest.mock import MagicMock
+
+from sglang.test.ci.ci_register import register_cpu_ci
+from sglang.test.test_utils import CustomTestCase, maybe_stub_sgl_kernel
+
+maybe_stub_sgl_kernel()
+
+from sglang.srt.managers.schedule_batch import Req
+from sglang.srt.managers.scheduler import Scheduler
+
+register_cpu_ci(est_time=2, suite="stage-a-test-cpu")
+
+
+def _make_req(lora_id):
+    req = Req.__new__(Req)
+    req.lora_id = lora_id
+    req.sampling_params = SimpleNamespace(max_new_tokens=16, ignore_eos=False)
+    return req
+
+
+def _scheduler_with_lora(
+    *,
+    running_batch_loras,
+    max_loras_per_batch=2,
+    enable_overlap_loading=False,
+):
+    s = Scheduler.__new__(Scheduler)
+    s.enable_lora = True
+    s.lora_drainer = None
+    s.enable_lora_overlap_loading = enable_overlap_loading
+
+    s.running_batch = MagicMock()
+    s.running_batch.reqs = [_make_req(l) for l in running_batch_loras]
+
+    lora_mgr = MagicMock()
+    lora_mgr.max_loras_per_batch = max_loras_per_batch
+    lora_mgr.validate_lora_batch = MagicMock(
+        side_effect=lambda ids: len(ids) <= max_loras_per_batch
+    )
+    s.tp_worker = SimpleNamespace(model_runner=SimpleNamespace(lora_manager=lora_mgr))
+    return s
+
+
+class TestCollectCommittedLoraIds(CustomTestCase):
+    """``_collect_committed_lora_ids`` must report every lora_id that will
+    appear in the next batch so admission can enforce ``max_loras_per_batch``."""
+
+    def test_running_batch_only(self):
+        s = _scheduler_with_lora(running_batch_loras=["A", "B"])
+        self.assertEqual(s._collect_committed_lora_ids([]), {"A", "B"})
+
+    def test_includes_chunked_req_from_can_run_list(self):
+        s = _scheduler_with_lora(running_batch_loras=["A"])
+        chunked = _make_req("X")
+        self.assertEqual(s._collect_committed_lora_ids([chunked]), {"A", "X"})
+
+    def test_base_model_request_counted_as_distinct_uid(self):
+        # A base-model request (lora_id=None) is a distinct entry in the
+        # set computed by fetch_new_loras, so admission must see it too.
+        s = _scheduler_with_lora(running_batch_loras=[None])
+        chunked = _make_req("X")
+        self.assertEqual(s._collect_committed_lora_ids([chunked]), {None, "X"})
+
+
+class TestLoraAdmissionWithChunkedReq(CustomTestCase):
+    """End-to-end regression for #23141: when ``max_loras_per_batch=N`` and a
+    chunked LoRA prefill already sits in ``can_run_list``, admission must
+    reject a further distinct LoRA request that would form an (N+1)-th UID.
+    """
+
+    def test_rejects_new_adapter_when_chunked_fills_last_slot(self):
+        s = _scheduler_with_lora(running_batch_loras=["A"], max_loras_per_batch=2)
+        chunked_req = _make_req("X")
+        running_loras = s._collect_committed_lora_ids([chunked_req])
+
+        new_req = _make_req("B")
+        self.assertFalse(s._can_schedule_lora_req(new_req, running_loras))
+
+    def test_admits_request_with_same_adapter_as_chunked(self):
+        s = _scheduler_with_lora(running_batch_loras=["A"], max_loras_per_batch=2)
+        chunked_req = _make_req("X")
+        running_loras = s._collect_committed_lora_ids([chunked_req])
+
+        same_as_chunked = _make_req("X")
+        self.assertTrue(s._can_schedule_lora_req(same_as_chunked, running_loras))
+
+    def test_without_chunked_req_behavior_unchanged(self):
+        s = _scheduler_with_lora(running_batch_loras=["A"], max_loras_per_batch=2)
+        running_loras = s._collect_committed_lora_ids([])
+
+        new_req = _make_req("B")
+        self.assertTrue(s._can_schedule_lora_req(new_req, running_loras))
+
+    def test_base_plus_n_loras_at_cap_rejects_next(self):
+        # Mirrors Yunzez's N adapters + 1 base-model scenario: when the base
+        # request already occupies one UID slot, admission must reject the
+        # (N+1)-th distinct adapter regardless of whether it arrives via
+        # chunked_req or waiting_queue.
+        s = _scheduler_with_lora(running_batch_loras=[None], max_loras_per_batch=2)
+        chunked_req = _make_req("A")
+        running_loras = s._collect_committed_lora_ids([chunked_req])
+
+        new_req = _make_req("B")
+        self.assertFalse(s._can_schedule_lora_req(new_req, running_loras))
+
+    def test_overlap_loading_sees_chunked_lora_in_running_set(self):
+        # The overlap-loading branch consumes the same ``running_loras`` set
+        # that admission computes, so the chunked_req's adapter must be
+        # visible to ``try_overlap_load_lora`` as well.
+        s = _scheduler_with_lora(
+            running_batch_loras=["A"],
+            max_loras_per_batch=2,
+            enable_overlap_loading=True,
+        )
+        s.lora_overlap_loader = MagicMock()
+        s.lora_overlap_loader.try_overlap_load_lora.return_value = False
+
+        running_loras = s._collect_committed_lora_ids([_make_req("X")])
+        self.assertFalse(s._can_schedule_lora_req(_make_req("B"), running_loras))
+        s.lora_overlap_loader.try_overlap_load_lora.assert_called_once_with(
+            "B", {"A", "X"}
+        )
+
+
+class TestAdmissionCallSiteWiring(CustomTestCase):
+    """Source-level guard: the helper is only useful if admission actually
+    calls it with ``adder.can_run_list``. A silent revert of the call site
+    (re-inlining the old ``{req.lora_id for req in self.running_batch.reqs}``)
+    must trip this test even when the helper itself is left intact."""
+
+    def test_admission_invokes_helper_with_can_run_list(self):
+        src = inspect.getsource(Scheduler._get_new_batch_prefill_raw)
+        tree = ast.parse(src)
+        for node in ast.walk(tree):
+            if (
+                isinstance(node, ast.Call)
+                and isinstance(node.func, ast.Attribute)
+                and node.func.attr == "_collect_committed_lora_ids"
+                and len(node.args) == 1
+                and isinstance(node.args[0], ast.Attribute)
+                and node.args[0].attr == "can_run_list"
+            ):
+                return
+        self.fail(
+            "_get_new_batch_prefill_raw must call "
+            "self._collect_committed_lora_ids(adder.can_run_list); regressing "
+            "this call site silently re-introduces sgl-project/sglang#23141."
+        )
+
+
+if __name__ == "__main__":
+    unittest.main()
Vulnerability mechanics
Root cause
"The scheduler's admission check does not count every LoRA adapter already committed to the next batch, so it can admit more distinct adapters than `max_loras_per_batch` allows, leading to an assertion failure."
Attack vector
An attacker can trigger this vulnerability by sending concurrent requests to multiple LoRA adapters, exceeding the configured `--max-loras-per-batch` limit, along with at least one request for a base model without a LoRA path. This scenario causes the scheduler to build a batch with more distinct adapter UIDs than allowed. The attack can be launched remotely due to the exposed HTTP endpoint and requires high complexity to execute successfully.
Affected code
The failing assertion is `assert len(cur_uids) <= self.max_loras_per_batch` in the `lora_manager.fetch_new_loras()` function, in the file `python/sglang/srt/lora/lora_manager.py`. It is reached when the scheduler fails to enforce the LoRA adapter limit during batch construction, which is why the patch changes the admission logic in `python/sglang/srt/managers/scheduler.py` and leaves the LoRA manager untouched.
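To make the failure mode concrete, here is a minimal, self-contained sketch of a check with the same shape as the quoted assertion. `ToyLoraManager` and `batch_crashes` are hypothetical names invented for illustration and are not SGLang's implementation; only the assertion expression comes from the text above. The sketch assumes, as the regression tests in the patch indicate, that a base-model request (`lora_id=None`) counts as its own distinct UID.

```python
from typing import Iterable, Optional, Set


class ToyLoraManager:
    """Hypothetical stand-in for a LoRA manager; not SGLang's real class."""

    def __init__(self, max_loras_per_batch: int) -> None:
        self.max_loras_per_batch = max_loras_per_batch

    def fetch_new_loras(
        self, batch_lora_ids: Iterable[Optional[str]]
    ) -> Set[Optional[str]]:
        # Distinct adapter UIDs in the batch; None stands for the base model.
        cur_uids = set(batch_lora_ids)
        # Same shape as the assertion quoted above. If the scheduler hands
        # over too many distinct UIDs, this raises AssertionError.
        assert len(cur_uids) <= self.max_loras_per_batch
        return cur_uids


def batch_crashes(
    manager: ToyLoraManager, batch_lora_ids: Iterable[Optional[str]]
) -> bool:
    """Return True if the toy manager's assertion fires for this batch."""
    try:
        manager.fetch_new_loras(batch_lora_ids)
    except AssertionError:
        return True
    return False
```

With a limit of 2, a batch of `["A", "B", "A"]` holds two distinct UIDs and passes, while `["A", "B", None]` holds three and trips the assertion. In this toy model nothing catches the error inside `fetch_new_loras`, which mirrors the unhandled exception described in the Impact section.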
What the fix does
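The patch makes the scheduler enforce the `--max-loras-per-batch` limit at admission. It adds a helper, `_collect_committed_lora_ids`, that gathers the `lora_id` of every request in the running decode batch and of every request already placed in the prefill adder's `can_run_list`, notably the chunked request, which is admitted unconditionally before the waiting queue is processed. `_get_new_batch_prefill_raw` now passes this combined set to admission instead of the running batch alone, so the scheduler can no longer admit N+1 distinct adapters when `max_loras_per_batch=N`. As a result, the assertion in `lora_manager.fetch_new_loras()` is not reached, which avoids the scheduler crash and the loss of availability.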
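The effect of the change can be sketched with plain sets. The functions below are hypothetical illustrations, not SGLang code: `committed_before_fix` models the old inline expression shown in the diff, `committed_after_fix` models the new helper's union, and `can_admit` is an assumed admission rule modeled on the `len(ids) <= max_loras_per_batch` stub used in the regression tests.

```python
from typing import Iterable, Optional, Set

LoraId = Optional[str]  # None stands for a base-model request


def committed_before_fix(running: Iterable[LoraId]) -> Set[LoraId]:
    # Old behaviour: only the running decode batch is considered.
    return set(running)


def committed_after_fix(
    running: Iterable[LoraId], can_run_list: Iterable[LoraId]
) -> Set[LoraId]:
    # New behaviour: also count requests already placed in can_run_list,
    # such as a chunked prefill admitted before the waiting queue.
    committed = set(running)
    committed.update(can_run_list)
    return committed


def can_admit(
    new_lora_id: LoraId, committed: Set[LoraId], max_loras_per_batch: int
) -> bool:
    # Assumed rule: admit only if the resulting set of distinct UIDs
    # stays within the configured cap.
    return len(committed | {new_lora_id}) <= max_loras_per_batch


# Scenario from the regression tests: "A" is decoding, a chunked prefill
# for "X" is already in can_run_list, and "B" is waiting; the cap is 2.
running, can_run = ["A"], ["X"]
print(can_admit("B", committed_before_fix(running), 2))          # admitted
print(can_admit("B", committed_after_fix(running, can_run), 2))  # rejected
```

Before the fix the check sees only `{"A"}`, admits "B", and the batch ends up with three distinct adapters against a cap of two. After the fix the check sees `{"A", "X"}` and rejects "B", while a further request for "X" is still admitted because it adds no new UID.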
Preconditions
- config: The server must be started with `--enable-lora` and a `--max-loras-per-batch` value set to a number less than the total number of distinct LoRA adapters being requested concurrently.
- network: The attacker must be able to send requests to the SGLang Inference HTTP Endpoint.
- input: The attacker must send concurrent requests targeting multiple LoRA adapters and at least one base model request without a LoRA path.
Reproduction
1. Generate LoRA adapters using `gen_lora_weights.py`.
2. Start the SGLang server with `--max-loras-per-batch` set to a value lower than the number of generated adapters.
3. Send concurrent requests to more than `max_loras_per_batch` different adapters using `repro_sglang_lora_crash.py`.
Generated on Jun 1, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
References (6)