Systematic Literature Review

From Data to Code

Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

A research map of how training data quality issues propagate into generated code defects, detection methods, and governance strategies across the LLM lifecycle.

114 Primary Studies Reviewed
9 Quality Dimensions
18 Propagation Mechanisms

From Data to Code is the project website for Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code. This systematic literature review examines how quality issues in the training data of large language models for code propagate into quality issues in generated code, including correctness bugs, security vulnerabilities, compliance risks, robustness failures, maintainability problems, and efficiency defects.

The review connects data defects, code generation failures, detection methods, and governance strategies across the LLM lifecycle. It provides a taxonomy of quality issues in LLM-generated code, a taxonomy of training data quality issues, and a mapping from data problems to code defects.


📢 News

  • [2026-05] Our paper is now available on arXiv.
  • [2026-04] The From-Data-to-Code repository is officially launched.

📖 Abstract

This paper presents a systematic literature review of 114 primary studies that investigates how training data quality issues propagate into generated code. We establish a unified taxonomy that categorizes quality issues in generated code along nine dimensions and quality issues in training data into code and non-code attributes. Based on this taxonomy, we formalize a causal framework that details 18 typical propagation mechanisms mapping data issues to code defects. Finally, we synthesize state-of-the-art detection and mitigation techniques across the data, model, and generation stages of the LLM lifecycle.

Fig. 1. Overview of the paper collection and filtering process.

Lifecycle of Detection and Governance

Fig. 2. Conceptual Framework of Quality Issues and Mitigation in the LLM Lifecycle.


🤝 Contribution

We welcome contributions from the community. If you have new research to share or have found classic papers that are missing, please follow these steps:

  1. Fork this repository.
  2. Add your paper to the corresponding RQ section.
  3. Submit a Pull Request.

© 2026 SYSUSELab. From Data to Code: Systematic Review of Quality Issues in LLMs for Code.
