Auto-Remediation Security Agent

Python / CodeBERT / PyGithub / Gemini API

Data Ingestion & Parsing

The pipeline begins by authenticating with PyGithub and targeting popular C repositories (>150 stars). To process this code at scale, we needed to isolate individual function bodies from files that often lack the headers and dependencies a full compiler expects, which renders standard AST tooling (such as Clang) ineffective.
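
A minimal sketch of this ingestion step, assuming a personal access token (the token string and result limit here are placeholders):

from github import Github

# Authenticate; in practice the token comes from an environment variable
gh = Github("<GITHUB_TOKEN>")

# Query popular C repositories matching the star threshold described above
results = gh.search_repositories(query="language:c stars:>150", sort="stars", order="desc")

for repo in results[:10]:
    print(repo.full_name, repo.stargazers_count)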

We implemented a custom state-machine parser. Rather than generating a full syntax tree, this lightweight algorithm iterates through file lines and tracks brace depth. This lets us extract syntactically complete C functions from "dirty" or partial source files in O(N) time.

def parse_functions_from_file(file_path):
    functions = []
    function_body = []
    inside_function = False
    brace_count = 0

    # Read the raw source; ignore encoding errors in "dirty" files
    with open(file_path, errors="ignore") as f:
        lines = f.readlines()

    # Iterate line by line, tracking scope depth
    for line in lines:
        stripped = line.strip()

        # Detect function start: a signature line that opens a block
        if not inside_function and "(" in stripped and stripped.endswith("{"):
            inside_function = True
            brace_count = 1
            function_body = [line]

        elif inside_function:
            function_body.append(line)
            # Track nested blocks to find the true end of the function
            brace_count += line.count("{") - line.count("}")

            if brace_count == 0:
                functions.append("".join(function_body))
                inside_function = False

    return functions
Fig 1. Core logic of the heuristic parser. It utilizes brace-balancing to extract function bodies from non-compilable C source code.

Vector Detection & LLM Remediation

We map extracted functions into a 768-dimensional vector space using Microsoft's CodeBERT base model, and compute embeddings for the Devign vulnerability dataset in the same space. The detection engine uses PyTorch to calculate cosine similarity; if a GitHub function scores above a 0.95 similarity threshold against a known exploit, it is flagged.
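
A sketch of this detection path, assuming the Hugging Face transformers checkpoint microsoft/codebert-base (github_function and devign_samples are placeholders for the parser output and the Devign corpus):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(code):
    # Tokenize, truncating to CodeBERT's 512-token context window
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Take the [CLS] hidden state as the 768-dimensional function vector
    return outputs.last_hidden_state[:, 0, :]

github_vec = embed(github_function)                           # shape (1, 768)
exploit_vecs = torch.cat([embed(v) for v in devign_samples])  # shape (N, 768)

# Cosine similarity against every known exploit; flag on the 0.95 threshold
scores = torch.nn.functional.cosine_similarity(github_vec, exploit_vecs)
if scores.max().item() > 0.95:
    print("Flagged; closest Devign match at index", scores.argmax().item())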

For remediation, we do not simply ask the LLM to fix the bug. We employ a few-shot prompting strategy with explicit reasoning, an approach shown to improve accuracy on various benchmarks. We construct a prompt containing both the potentially vulnerable code and the verified exploit it matched against, then ask the model to reason about how the two are similar. This grounds the Gemini API call: the model compares the logic flow first, then generates a targeted patch for the suspected issue.

"""
Analyze the following two code snippets:

GitHub Code (Potentially Vulnerable):
```c
{github_code}
```

Known Vulnerable Code (For Comparison):
```c
{vulnerable_code}
```

Identify the most probable vulnerability in the GitHub code based on similarity.
Assess severity (Low/Medium/High/Critical).
Provide an improved version of the GitHub code that mitigates the vulnerability.
"""
Fig 2. The dynamic prompt template used for the Gemini API. By injecting the 'Known Vulnerable' code retrieved via vector search, we ground the model's context, significantly reducing hallucinations and improving patch accuracy.
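
End to end, the remediation call might look like the following sketch using the google-generativeai client (the model name and variable names are illustrative; PROMPT_TEMPLATE holds the Fig 2 string):

import google.generativeai as genai

genai.configure(api_key="<GEMINI_API_KEY>")        # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")  # model choice illustrative

prompt = PROMPT_TEMPLATE.format(
    github_code=flagged_function,    # function flagged by the similarity check
    vulnerable_code=devign_match,    # nearest known exploit from the vector search
)

response = model.generate_content(prompt)
print(response.text)  # analysis, severity rating, and the proposed patch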