Python / CodeBERT / PyGithub / Gemini API
The pipeline begins by authenticating with PyGithub and searching for C repositories with more than 150 stars. To process this code at scale, we needed to isolate individual function bodies from files that often lacked their headers or dependencies, which makes compiler-based AST tooling (like Clang) ineffective.
We implemented a custom state-machine parser. Instead of generating a full syntax tree, this lightweight algorithm iterates over the lines of a file and tracks brace depth. This lets us extract syntactically complete C functions from "dirty" or partial source files in a single O(N) pass.
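def parse_functions_from_file(file_path):
    """Extract brace-balanced C functions from a possibly incomplete source file."""
    functions = []
    function_body = []
    inside_function = False
    brace_count = 0

    with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
        lines = f.readlines()

    # Iterate line by line, tracking scope depth
    for line in lines:
        stripped = line.strip()
        # Detect function start: a signature line that opens a block
        if not inside_function and "(" in stripped and stripped.endswith("{"):
            inside_function = True
            function_body = [line]
            brace_count = 1
        elif inside_function:
            function_body.append(line)
            # Track nested blocks to find true end of function
            brace_count += line.count("{") - line.count("}")
            if brace_count == 0:
                functions.append("".join(function_body))
                # Reset state so the next function can be detected
                inside_function = False
                function_body = []
    return functions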
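One limitation of counting raw brace characters is that a `{` or `}` inside a string literal, character literal, or `//` comment shifts the tracked depth. The sketch below shows one way a stricter per-line counter could look. `brace_delta` is a hypothetical helper, not part of the original parser, and it assumes block comments and string literals do not span multiple lines.

```python
def brace_delta(line: str) -> int:
    """Return net brace depth change for one line of C code.

    Hypothetical helper: ignores braces inside string/char literals
    and after a // comment. Multi-line /* ... */ comments are NOT handled.
    """
    delta = 0
    quote = None  # the active quote character while inside a literal
    i = 0
    while i < len(line):
        ch = line[i]
        if quote is not None:
            if ch == "\\":
                i += 2  # skip the escaped character
                continue
            if ch == quote:
                quote = None  # literal closed
        elif ch in ('"', "'"):
            quote = ch  # literal opened
        elif line.startswith("//", i):
            break  # rest of the line is a comment
        elif ch == "{":
            delta += 1
        elif ch == "}":
            delta -= 1
        i += 1
    return delta
```

In a parser like the one above, a call such as `brace_delta(line)` could stand in for `line.count("{") - line.count("}")`; the rest of the state machine would stay the same.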
We map each extracted function to a 768-dimensional vector using Microsoft's CodeBERT base model, and compute embeddings for the Devign vulnerability dataset in the same way. The detection engine uses PyTorch to calculate cosine similarity; if a GitHub function scores above a 0.95 similarity threshold against a known vulnerable function, it is flagged.
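The thresholding step can be illustrated without loading the model. The following is a dependency-free sketch that uses plain Python lists in place of the PyTorch tensors described above; the function names and the toy 3-dimensional vectors are assumptions for illustration only, and real embeddings would have 768 dimensions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def flag_matches(
    candidates: Sequence[Sequence[float]],
    known_vulnerable: Sequence[Sequence[float]],
    threshold: float = 0.95,
) -> List[Tuple[int, int, float]]:
    """Return (candidate_idx, best_match_idx, score) for each candidate
    whose best match scores above the threshold."""
    flagged = []
    for i, cand in enumerate(candidates):
        scores = [cosine_similarity(cand, vuln) for vuln in known_vulnerable]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] > threshold:
            flagged.append((i, best, scores[best]))
    return flagged


# Toy vectors standing in for CodeBERT embeddings (hypothetical values)
github_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0]]
devign_vecs = [[1.0, 0.1, 0.0], [0.0, 0.0, 1.0]]
flagged = flag_matches(github_vecs, devign_vecs)
```

Only the first toy vector is flagged here, because it points in nearly the same direction as the first reference vector. Keeping the index of the best match matters for the remediation step, which needs the specific vulnerable function that triggered the flag.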
For remediation, we do not simply ask the LLM to fix the bug. We use a few-shot-style prompting strategy that requests explicit reasoning, an approach reported to be more accurate on various benchmarks. We construct a prompt containing both the potentially vulnerable code and the known vulnerable function it matched against, then ask the model to reason about how they are similar. This grounds the Gemini model in a concrete reference, allowing it to compare the logic flow and then generate a specific patch for the suspected issue.
"""
Analyze the following two code snippets:
GitHub Code (Potentially Vulnerable):
```c
{github_code}
```
Known Vulnerable Code (For Comparison):
```c
{vulnerable_code}
```
Identify the most probable vulnerability in the GitHub code based on similarity.
Assess severity (Low/Medium/High/Critical).
Provide an improved version of the GitHub code that mitigates the vulnerability.
"""
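As a rough illustration of how the template above might be filled in, the sketch below assembles the prompt for one flagged pair. `build_remediation_prompt` is a hypothetical name, and the call that sends the result to the Gemini API is deliberately left out, since the original does not show it.

```python
def build_remediation_prompt(github_code: str, vulnerable_code: str) -> str:
    """Assemble the comparison prompt for one flagged function pair.

    Hypothetical helper: mirrors the template text, nothing more.
    """
    fence = "`" * 3  # built at runtime to avoid nesting literal fences
    return (
        "Analyze the following two code snippets:\n"
        "GitHub Code (Potentially Vulnerable):\n"
        f"{fence}c\n{github_code.strip()}\n{fence}\n"
        "Known Vulnerable Code (For Comparison):\n"
        f"{fence}c\n{vulnerable_code.strip()}\n{fence}\n"
        "Identify the most probable vulnerability in the GitHub code "
        "based on similarity.\n"
        "Assess severity (Low/Medium/High/Critical).\n"
        "Provide an improved version of the GitHub code that mitigates "
        "the vulnerability.\n"
    )


# Toy inputs (hypothetical) standing in for a flagged pair
prompt = build_remediation_prompt(
    "void copy(char *s) { char buf[8]; strcpy(buf, s); }",
    "void load(char *in) { char tmp[16]; strcpy(tmp, in); }",
)
```

Substituting the code through f-string values rather than a second `str.format` pass means braces inside the C snippets are inserted verbatim and never interpreted as placeholders.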