Large Language Models, LLM, replicate bugs in code instead of fixing them
LLMs often replicate errors present in their training data, reducing the reliability of code completion
Isabella V, 19 March 2025

LLMs struggle to complete code that contains bugs, often replicating errors from their training data. Their accuracy drops sharply in these scenarios, pointing to the need for better models, better post-processing, and tighter integration with development tools.

Key Points:

  • Error Replication: LLMs tend to reproduce historical bugs present in their training data (see the hypothetical sketch after this list).
  • Reduced Accuracy: LLMs’ ability to generate correct code drops significantly in the presence of buggy code.
  • Limitations in Complex Structures: LLMs struggle to handle constructs such as method invocations and return statements.
  • Limited Effectiveness of Post-Processing: Current techniques do not significantly reduce error rates in bug-prone scenarios.

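To make the failure mode concrete, here is a hypothetical sketch (the bug and the completion point are invented for illustration, not taken from the study; Java is used because Defects4J, the study's benchmark, is a Java dataset). The surrounding code invites a classic off-by-one error, and a model that has memorized similar buggy snippets may reproduce it instead of emitting the correct bound:

    public class Example {
        // Sums the elements of an array; the completion point is the loop condition.
        static int sum(int[] items) {
            int total = 0;
            // Correct completion: i < items.length
            // Bug-replicating completion (shown): i <= items.length, which compiles
            // cleanly but throws ArrayIndexOutOfBoundsException on the final iteration.
            for (int i = 0; i <= items.length; i++) {
                total += items[i];
            }
            return total;
        }
    }

Both completions are syntactically valid, which is why such a defect survives naive filtering: the error is semantic, not syntactic.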

Large language models (LLMs) have reshaped natural language processing and shown remarkable capabilities in code completion. A recent study, however, highlights a significant limitation: when exposed to code fragments that contain bugs, LLMs tend to replicate those errors rather than correct them. The behavior is attributed to the presence of faulty code in the training data, which degrades the models’ performance in real-world scenarios. The study evaluated seven LLMs, including GPT-4, GPT-3.5, CodeLlama-13B-hf, Gemma-7B, StarCoder2-15B, and CodeGen-350M, on the Defects4J dataset.

The results show that, on bug-prone tasks, an LLM is almost as likely to generate buggy code as correct code, with accuracy substantially below that of regular code completion (e.g., 12.27% versus 29.85% for GPT-4). On average, each model produced approximately 151 correct completions and 149 buggy ones, underscoring how difficult bug-prone contexts are to handle. Strikingly, 44.44% of the bugs the LLMs introduced were identical to historical bugs in the training data, with GPT-4o reaching 82.61%. This indicates a tendency to memorize and reproduce known bugs rather than generate error-free code.

The code structures that challenge LLMs the most are method invocations and return statements, while simpler constructs such as “if” statements and variable declarations are less problematic. Moreover, existing post-processing techniques, although they can improve the consistency of the generated code, do not significantly reduce error rates in bug-prone scenarios. This underscores the need for models with a deeper understanding of programming syntax and semantics, more effective post-processing strategies, and tighter integration with development tools such as integrated development environments (IDEs).
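As a rough sketch of what such post-processing can look like in practice, the filter below compiles each candidate completion and discards those that fail. All names here (CompileFilter, the __COMPLETION__ placeholder) are illustrative assumptions, not the study's implementation; javax.tools is the standard JDK compiler API. Crucially, a semantic bug like the off-by-one above passes this check, which is consistent with the study's finding that post-processing does little in bug-prone scenarios.

    import javax.tools.JavaCompiler;
    import javax.tools.ToolProvider;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    public class CompileFilter {
        /**
         * Keeps only the candidate completions that compile once spliced into
         * the surrounding file. `template` is assumed to declare a public class
         * named Candidate and to contain the placeholder __COMPLETION__ at the
         * completion point. Requires a JDK (getSystemJavaCompiler() returns
         * null on a JRE-only installation).
         */
        static List<String> filterByCompilation(List<String> candidates, String template)
                throws IOException {
            JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
            List<String> kept = new ArrayList<>();
            for (String candidate : candidates) {
                String source = template.replace("__COMPLETION__", candidate);
                Path dir = Files.createTempDirectory("llm-postproc");
                Path file = dir.resolve("Candidate.java");
                Files.writeString(file, source);
                // run() returns 0 when compilation succeeds.
                if (compiler.run(null, null, null, file.toString()) == 0) {
                    kept.add(candidate);
                }
            }
            return kept;
        }
    }

A filter of this kind only enforces syntactic and type-level validity, which is precisely why it cannot catch the memorized, semantically wrong completions the study describes.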

Despite the progress LLMs have made in code completion, handling buggy code remains a significant challenge, underscoring the need for further research and improvement in this area.