Python / CodeBERT / PyGithub / Gemini API
The pipeline begins by authenticating with PyGithub to select and download highly starred C repositories. The raw source files are passed through a custom state-machine parser that tracks brace depth and keyword signatures to isolate individual function bodies from the global scope, without building a full abstract syntax tree. We then load the Microsoft CodeBERT base model and tokenizer to convert each function string into a 768-dimensional vector embedding. These embeddings are intended to capture the semantic behavior of the code rather than only its syntactic structure.
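A brace-depth parser of the kind described above might look roughly like the sketch below. It is an illustration, not the project's actual parser: the name `extract_functions` and the signature heuristic in `_SIGNATURE` are assumptions, and it ignores macro line continuations and K&R-style definitions.

```python
import re

# Assumed heuristic: a top-level header ending in "name(...)" is a function.
_SIGNATURE = re.compile(r"[A-Za-z_]\w*\s*\([^;{}]*\)\s*$", re.S)


def extract_functions(source):
    """Return the text of each top-level function definition in C source."""
    functions = []
    depth = 0             # current brace nesting level
    header_start = 0      # where the current top-level declaration begins
    in_function = False   # whether the open top-level block is a function
    i, n = 0, len(source)

    while i < n:
        ch = source[i]
        pair = source[i:i + 2]

        if pair in ("//", "/*") or ch == "#":
            # "fresh" means no declaration text has been seen yet, so the
            # comment or directive can be dropped from the next header.
            fresh = depth == 0 and not source[header_start:i].strip()
            if pair == "/*":
                end = source.find("*/", i + 2)
                i = n if end == -1 else end + 2
            elif pair == "//" or fresh:
                end = source.find("\n", i)
                i = n if end == -1 else end
            else:
                i += 1  # a "#" that does not start a declaration
            if fresh:
                header_start = i
            continue

        if ch in "\"'":
            # Skip string and character literals so braces inside them
            # do not affect the depth counter.
            i += 1
            while i < n and source[i] != ch:
                i += 2 if source[i] == "\\" else 1
            i += 1
            continue

        if ch == "{":
            if depth == 0:
                header = source[header_start:i].strip()
                in_function = bool(_SIGNATURE.search(header))
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                if in_function:
                    functions.append(source[header_start:i + 1].strip())
                header_start = i + 1
        elif ch == ";" and depth == 0:
            header_start = i + 1  # end of a top-level declaration
        i += 1

    return functions
```

Skipping comments and literals before counting braces is the main design choice here: without it, a string such as `"}{"` inside a function would unbalance the depth counter and split or merge function bodies.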
In parallel, we compute embeddings for the Devign vulnerability dataset, which contains 12,000 functions labeled as vulnerable. The detection engine uses PyTorch to compute the cosine similarity between each scraped function vector and the vectors in this vulnerability database. If a similarity score exceeds the tuned threshold of 0.95, the system flags the function as potentially vulnerable. We then construct a structured prompt containing the flagged code and its most similar vulnerable counterpart. The prompt is sent to the Gemini Pro API with instructions to identify the specific exploit type (such as buffer overflow or integer wrap-around) and to generate a rewritten, secure version of the C function. The response is parsed to extract the remediation code block for automated patching.
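The flagging and remediation steps can be sketched as follows. This is a plain-Python stand-in rather than the PyTorch computation the project uses, kept dependency-free so it runs anywhere. The helper names (`cosine`, `best_match`, `build_prompt`, `extract_patch`) and the prompt wording are hypothetical, and the call to the Gemini API is deliberately left out.

```python
import math
import re

THRESHOLD = 0.95   # similarity threshold quoted in the text
FENCE = "`" * 3    # Markdown code-fence delimiter


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(candidate, known_vulnerable):
    """Return (index, score) of the most similar known-vulnerable vector."""
    scores = [cosine(candidate, v) for v in known_vulnerable]
    index = max(range(len(scores)), key=scores.__getitem__)
    return index, scores[index]


def build_prompt(flagged_code, similar_code):
    """Assemble an illustrative prompt; the real wording is not given."""
    return (
        "The first C function below closely resembles the second, which "
        "is labeled vulnerable.\n"
        "1. Name the exploit type (for example buffer overflow or integer "
        "wrap-around).\n"
        "2. Return a secure rewrite of the first function in a fenced "
        "code block.\n\n"
        f"Flagged function:\n{flagged_code}\n\n"
        f"Similar vulnerable function:\n{similar_code}\n"
    )


# Matches the first fenced block, with an optional "c" language tag.
_BLOCK = re.compile(FENCE + r"(?:c|C)?[ \t]*\n(.*?)" + FENCE, re.S)


def extract_patch(response_text):
    """Return the first fenced code block in the reply, or None."""
    match = _BLOCK.search(response_text)
    return match.group(1).strip() if match else None
```

In use, a function would be sent for remediation only when `best_match` returns a score above `THRESHOLD`, and a `None` result from `extract_patch` would signal that the reply contained no usable code block and should not be applied as a patch.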