Red-Teaming a Government AI EduBot: Semantic Guardrails Hold but Structural Attacks Succeed

SentinelOne Labs has published a detailed case study of a black-box red-teaming engagement against "EduBot," a pseudonymous government AI assistant designed to answer residents' education queries. The assessment, scoped against the OWASP Top 10 for LLM Applications, targeted prompt injection, insecure output handling, and jailbreaking. While the system demonstrated strong defenses against semantic manipulation, it fell to structural attacks that bypassed its content filters.

The red team began with standard "front door" attacks, including direct prompt injection and persona adoption. Direct commands to override system instructions were immediately refused, indicating a robust instruction hierarchy. Attempts to frame malicious requests as role-playing or fictional scenarios also failed, indicating that the system evaluated user intent rather than relying solely on keyword blocking. This suggested a safety-first alignment in the underlying foundation model.

Next, the team tried cognitive hacking, exploiting the bot's domain focus by framing requests for toxic content as educational examples for a civics class. The system refused these as well, suggesting that content safety filters were weighted more heavily than helpfulness objectives. Cross-language attacks using Arabic and English inputs also resulted in standard refusals. At this stage, the system appeared highly secure against semantic manipulation.

The breakthrough came when the team pivoted to syntactic attacks. They discovered that the model treated data differently from conversation. By framing a request as a "Developer UI Test" and asking for a JSON object containing HTML code with a malicious URL, the team got the system to generate a functional phishing payload. The safety filters scanned the response text but treated the code block as syntax rather than harmful advice. The system also failed to sanitize the URL or HTML tags, enabling potential cross-site scripting (XSS) attacks.
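The missing output-side control can be illustrated with a minimal Python sketch. It is not EduBot's implementation or a fix described in the case study; the host allowlist, the function names, and the assumption that model output arrives as a JSON string are all hypothetical. The idea is to treat structured output as untrusted data: reject any http(s) URL whose host is not explicitly allowed, then HTML-escape every string before it can reach a browser.

```python
import html
import json
import re
from urllib.parse import urlparse

# Hypothetical allowlist; a real deployment would load this from configuration.
ALLOWED_HOSTS = {"edu.example.gov"}

# Matches http(s) URLs up to the next whitespace, quote, or angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def iter_strings(value):
    """Yield every string (keys and values) inside a parsed JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)
    elif isinstance(value, dict):
        for key, item in value.items():
            yield key
            yield from iter_strings(item)


def find_disallowed_urls(text):
    """Return http(s) URLs in text whose host is not on the allowlist."""
    bad = []
    for url in URL_PATTERN.findall(text):
        host = (urlparse(url).hostname or "").lower()
        if host not in ALLOWED_HOSTS:
            bad.append(url)
    return bad


def escape_strings(value):
    """Recursively HTML-escape every string inside a parsed JSON value."""
    if isinstance(value, str):
        return html.escape(value, quote=True)
    if isinstance(value, list):
        return [escape_strings(item) for item in value]
    if isinstance(value, dict):
        return {html.escape(k, quote=True): escape_strings(v) for k, v in value.items()}
    return value


def check_model_output(raw_json):
    """Parse model output, reject off-allowlist URLs, and escape all markup."""
    data = json.loads(raw_json)
    for text in iter_strings(data):
        bad = find_disallowed_urls(text)
        if bad:
            raise ValueError(f"disallowed URLs in model output: {bad}")
    return escape_strings(data)
```

With this sketch, a response such as {"html": "<a href=\"http://evil.example/login\">Sign in</a>"} would be rejected before rendering, while markup in an accepted response would be displayed as inert text. The URL pattern only covers http(s) links, so other schemes such as javascript: would need separate handling; escaping is what neutralizes the markup itself.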

Further testing revealed a second vulnerability: Base64 obfuscation. The team prompted the model to decode a Base64-encoded string containing forbidden instructions. The model complied, outputting the decoded text without any safety filtering being applied. The bypass likely worked because the filters scanned the input only in its encoded form, or because the model treated the decoded output as a neutral transformation rather than a user-facing response.
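If the first explanation holds, one mitigation is to run the content check on decoded views of the text as well as on the raw text. The Python sketch below illustrates that idea under stated assumptions: the phrase list is a stand-in for a real content classifier, the 16-character threshold for spotting Base64 runs is arbitrary, and none of the names come from the case study.

```python
import base64
import binascii
import re

# Hypothetical stand-in for a real content classifier.
BLOCKED_PHRASES = ("ignore previous instructions",)

# Runs of 16 or more Base64 alphabet characters, with optional padding.
B64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def violates_policy(text):
    """Naive policy check: case-insensitive match on blocked phrases."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKED_PHRASES)


def decoded_views(text):
    """Yield the text itself plus the decoded form of each Base64-looking run."""
    yield text
    for candidate in B64_CANDIDATE.findall(text):
        try:
            raw = base64.b64decode(candidate, validate=True)
            yield raw.decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            # Not valid Base64, or not text once decoded: nothing to scan.
            continue


def is_blocked(text):
    """Apply the policy check to the raw text and to every decoded view."""
    return any(violates_policy(view) for view in decoded_views(text))
```

The same function could be applied to the model's response before it is returned, which would also cover the second explanation, where the decoded text is only ever visible on the output side. A single decoding pass does not handle nested or alternative encodings.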

The case study underscores a critical lesson in AI security: semantic guardrails often fail against structural manipulation. While EduBot's defenses against social engineering and intent-based attacks were robust, the system lacked input sanitization and output validation for structured data formats. The findings highlight the need for comprehensive security measures that include structural input validation, output encoding, and context-aware filtering to prevent such bypasses.

SentinelOne Labs recommends that organizations deploying LLMs implement strict input sanitization for structured formats like JSON and XML, enforce output encoding to prevent injection attacks, and conduct regular red-teaming exercises that include both semantic and syntactic attack vectors. The EduBot case serves as a valuable example for AI security practitioners, demonstrating that even well-defended systems can be compromised through unexpected attack surfaces.