At Symbiotic Security, we’re building more than just tools that automatically fix vulnerable code. We’re building systems that help developers understand why their code is vulnerable by taking an educational approach, complete with AI-driven insights and guidance.
Together with my incredible team, we’ve been working on AI-powered remediation leveraging Large Language Models (LLMs) for the official launch of version 1 of our code security solution - a topic you can (and should) read more about here. In particular, we’ve focused on Infrastructure as Code (IaC) remediation, an area where very little research and data exist. Here’s what we’ve learned so far:
Our first attempt was simple: detect a vulnerability, hand it to an LLM, and ask for a fix. We tried this with tools like Copilot and a variety of local models. The results weren’t terrible: between 65 and 75 percent of the AI-generated fixes were accepted by developers without any changes.
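To make that first pipeline concrete, here is a minimal sketch of the naive approach, assuming an OpenAI-style chat client; the model name, prompt wording, and helper function are illustrative rather than our production code:

```python
# A minimal sketch of the naive approach: hand the vulnerable snippet to an
# LLM and ask for a fix. The client, model name, and prompt wording are
# illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def naive_fix(vulnerable_snippet: str, finding: str) -> str:
    """Ask a general-purpose LLM to rewrite a vulnerable IaC snippet."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a security engineer. Return only the "
                           "corrected Infrastructure as Code snippet.",
            },
            {
                "role": "user",
                "content": f"Finding: {finding}\n\nVulnerable code:\n{vulnerable_snippet}",
            },
        ],
    )
    return response.choices[0].message.content
```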
But in security, “good enough” is never actually good enough. Developers need a high level of confidence that a fix is correct, safe, and won’t cause downstream issues. Anything less adds friction and erodes trust in the system.
If you’re working with an off-the-shelf LLM, results will vary significantly depending on the model’s training data, reasoning capabilities, and fine-tuning.
To test this, we ran an experiment on 13 different LLMs, asking them to fix specific vulnerabilities. Each model got four attempts per vulnerability, and we measured:
Here are our full findings:
Here are a few key takeaways:
When models failed, it often wasn’t because they didn’t try to fix the problem. Instead, they introduced syntax errors, created new vulnerabilities in the process, or suggested solutions that just didn’t hold up in the real world. That’s why we added a post-processing step to sanitize AI-generated code - without it, the failure rate would have been much higher.
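For illustration, the sketch below captures the shape of that check: a syntax gate followed by a re-scan of the proposed fix, with a per-vulnerability attempt budget like the one used in the experiment. The `model_fix` and `rescan` callables are placeholders, it assumes the Terraform CLI is installed, and it is not our actual pipeline:

```python
import pathlib
import subprocess
import tempfile

MAX_ATTEMPTS = 4  # each model gets four attempts per vulnerability

def sanitize_and_check(candidate_fix: str, rescan) -> bool:
    """Reject an AI-generated Terraform fix if it does not parse, or if a
    re-scan still reports (or newly reports) vulnerabilities."""
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / "main.tf"
        path.write_text(candidate_fix)
        # Syntax gate: initialize without a backend, then validate the HCL.
        init = subprocess.run(["terraform", "init", "-backend=false"],
                              cwd=tmp, capture_output=True)
        valid = subprocess.run(["terraform", "validate"],
                               cwd=tmp, capture_output=True)
        if init.returncode != 0 or valid.returncode != 0:
            return False
        # Security gate: the scanner must report no findings on the fix.
        return not rescan(path)

def evaluate(model_fix, vulnerability, rescan) -> bool:
    """True if the model produces an accepted fix within the attempt budget."""
    return any(
        sanitize_and_check(model_fix(vulnerability), rescan)
        for _ in range(MAX_ATTEMPTS)
    )
```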
After a lot of trial and error, here are a few key things that have made a real difference in our results:
Especially in IaC security, we’ve seen clear diminishing returns. Larger LLMs often perform no better than smaller ones, likely because they’re all drawing from the same limited pool of public IaC data.
Instead of relying on generic inference, our approach enhances the LLM with structured context:
• Detailed descriptions of the vulnerability
• Examples of secure code (as close as possible to the vulnerable snippet)
• Guided remediation steps
By augmenting LLMs rather than relying on their pre-trained security knowledge, we saw a major improvement in results.
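As a rough sketch of what that augmentation looks like in practice - the field names and prompt wording here are illustrative, not our exact format:

```python
from dataclasses import dataclass

@dataclass
class RemediationContext:
    """Structured context injected alongside the vulnerable snippet;
    the fields mirror the three ingredients listed above."""
    description: str              # detailed description of the vulnerability
    secure_examples: list[str]    # secure snippets close to the vulnerable one
    remediation_steps: list[str]  # guided, ordered remediation steps

def build_prompt(vulnerable_snippet: str, ctx: RemediationContext) -> str:
    """Assemble an augmented prompt instead of relying on generic inference."""
    examples = "\n---\n".join(ctx.secure_examples)
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(ctx.remediation_steps, 1))
    return (
        "Fix the following Infrastructure as Code vulnerability.\n\n"
        f"Vulnerability description:\n{ctx.description}\n\n"
        f"Secure reference examples:\n{examples}\n\n"
        f"Remediation steps to follow:\n{steps}\n\n"
        f"Vulnerable code:\n{vulnerable_snippet}\n\n"
        "Return only the corrected code."
    )
```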
Rather than relying on a single AI model, we use specialized AI agents, which drastically improves reliability:
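As a simplified illustration of that idea - the agent roles below are hypothetical, not our actual architecture - a remediation pipeline can chain specialized agents instead of calling one model end to end:

```python
from typing import Callable, Optional

def remediation_pipeline(
    vulnerable_snippet: str,
    finding: str,
    analyst: Callable[[str, str], str],
    fixer: Callable[[str, str], str],
    reviewer: Callable[[str, str], bool],
) -> Optional[str]:
    """Chain of specialized agents: one explains the root cause, one proposes
    a fix, and one reviews the fix before it is surfaced to the developer."""
    analysis = analyst(finding, vulnerable_snippet)     # why is this vulnerable?
    candidate = fixer(vulnerable_snippet, analysis)     # propose a targeted fix
    approved = reviewer(vulnerable_snippet, candidate)  # syntax + security review
    return candidate if approved else None
```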
Beyond just automated remediation, we provide an interactive AI chat so developers can:
Below is an example of AI remediation combined with developer interaction, where more context leads to better AI remediation:
To see it in action for yourself, check out the demos here.
For IaC security remediation, larger LLMs don’t necessarily perform better. There’s a clear plateau effect, showing that classical AI models - trained mostly on publicly available data - have reached their limit in security remediation. Since these models rely on generic security knowledge from open-source repositories, documentation, and research papers, they struggle with real-world, context-specific vulnerabilities.
The next leap in AI remediation isn’t about bigger models - it’s about better context. Specific, high-quality contextual data is now the true fuel for improving AI performance. By injecting precise vulnerability descriptions, secure code patterns, and real-world remediation strategies, we can bypass the plateau and unlock significantly better results.