🛠️ RQ5: Governance Strategies

We synthesize a Multi-layered Governance Framework spanning the data lifecycle and model inference stages.

💻 1. Code-Level Mitigation

Model-level: Supervised fine-tuning (SFT), preference alignment (RLHF/DPO), and reward-based optimization whose reward combines execution correctness with static metrics.
Generation-level:
- Pre-generation: Prompt Engineering, RAG, and Agent-based workflows.
- In-generation: Adaptive decoding constraints and Iterative Self-reflection.
- Post-generation: Automated AST-level repairs and sandbox execution filtering.
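
The post-generation stage can be illustrated with a small filter. This is a minimal sketch under stated assumptions, not the pipeline of any surveyed paper: the function names (`parses_cleanly`, `passes_in_subprocess`, `filter_candidates`) are hypothetical, candidates are assumed to be Python source strings, and a child interpreter with a timeout stands in for a real sandbox (it provides no file-system or network isolation).

```python
import ast
import subprocess
import sys
import tempfile
from pathlib import Path


def parses_cleanly(code: str) -> bool:
    """AST-level check: reject candidates that are not valid Python."""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def passes_in_subprocess(code: str, test: str, timeout_s: float = 5.0) -> bool:
    """Run candidate plus test in a separate interpreter with a timeout.

    Assumption: a plain child process is used here only as a stand-in
    for a real sandbox (container, VM, or similar).
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n\n" + test + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


def filter_candidates(candidates: list[str], test: str) -> list[str]:
    """Keep only candidates that parse and pass the test when executed."""
    return [
        c for c in candidates
        if parses_cleanly(c) and passes_in_subprocess(c, test)
    ]
```

The cheap syntactic check runs first so that malformed candidates never reach the more expensive execution step.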

Fig. 8. Taxonomy of Code Issue Mitigation Strategies
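
The reward-based optimization listed under model-level mitigation pairs execution correctness with static metrics. The sketch below shows one way such a reward could be composed. Everything in it is an assumption made for illustration: the weights, the function names, and the choice of docstring coverage as the static metric are hypothetical, and the in-process `exec` call would have to be replaced by a sandbox for untrusted model output.

```python
import ast


def execution_score(code: str, tests: list[str]) -> float:
    """Fraction of assertion snippets that pass against the candidate."""
    namespace: dict = {}
    try:
        # Illustration only: untrusted code needs a real sandbox.
        exec(code, namespace)
    except Exception:
        return 0.0
    passed = 0
    for test in tests:
        try:
            exec(test, dict(namespace))
            passed += 1
        except Exception:
            pass
    return passed / len(tests) if tests else 0.0


def static_score(code: str) -> float:
    """Toy static metric: share of function definitions with a docstring."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return 0.0
    funcs = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    if not funcs:
        return 1.0
    documented = sum(1 for f in funcs if ast.get_docstring(f) is not None)
    return documented / len(funcs)


def reward(code: str, tests: list[str],
           w_exec: float = 0.8, w_static: float = 0.2) -> float:
    """Weighted sum of execution correctness and the static metric."""
    return w_exec * execution_score(code, tests) + w_static * static_score(code)
```

A scalar of this kind could serve as the reward signal in an RL-style fine-tuning loop. The weighting decides how much a functionally correct but poorly documented candidate is penalized.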

📊 2. Data-Level Mitigation

Cleaning & Filtering: Execution-feedback elimination and LLM-driven semantic cleaning.
Data Balancing: Stratified resampling across languages and domains to mitigate bias.
Data Enhancement: Refactoring, adding docstrings, and standardizing low-quality code.
Data Augmentation: High-quality synthetic generation and integration of curated open-source repositories.
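
As an illustration of the data balancing item, the sketch below resamples a corpus so that every stratum (for example, a programming language) contributes the same number of samples. It is a minimal example under assumptions: the function name `stratified_resample`, the `per_group` target, and the dictionary-based sample format are hypothetical, and small groups are simply oversampled with replacement.

```python
import random
from collections import defaultdict


def stratified_resample(samples, key, per_group, seed=0):
    """Return a corpus with exactly `per_group` samples per stratum.

    Large groups are downsampled without replacement; small groups are
    kept whole and topped up by sampling with replacement.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sample in samples:
        groups[key(sample)].append(sample)

    balanced = []
    for name in sorted(groups):
        items = groups[name]
        if len(items) >= per_group:
            balanced.extend(rng.sample(items, per_group))
        else:
            balanced.extend(items)
            balanced.extend(rng.choices(items, k=per_group - len(items)))
    return balanced


corpus = (
    [{"lang": "python", "id": i} for i in range(6)]
    + [{"lang": "rust", "id": i} for i in range(2)]
    + [{"lang": "go", "id": 0}]
)
balanced = stratified_resample(corpus, key=lambda s: s["lang"], per_group=3)
```

Oversampling with replacement duplicates rare samples, so in practice it would typically be combined with augmentation or deduplication-aware training rather than used alone.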

Fig. 9. Taxonomy of Training Data Issue Mitigation Strategies

📄 Referenced Papers

- LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion (2024-06)
- Less is More: On the Importance of Data Quality for Unit Test Generation (2025-02)
- Qwen Technical Report (2023-09)
- Qwen2 Technical Report (2024-07)
- DataMan: Data Manager for Pre-training Large Language Models (2025-02)
- Phi-4 Technical Report (2024-12)
- Large Language Models and Simple, Stupid Bugs [SStuBs] (2023-03)
- We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs [package hallucinations] (2024-06)
- Large Language Models for Code: Security Hardening and Adversarial Testing (2023-02)
- On Mitigating Code LLM Hallucinations with API Documentation [CloudAPIBench] (2024-07)
- A Comprehensive Framework for Evaluating API-oriented Code Generation in Large Language Models [AutoAPIEval] (2024-09)
- Improving LLM-Generated Code Quality with GRPO [Codequal Analyzer] (2025-06)
- Training Language Models to Generate Quality Code with Program Analysis Feedback [REAL] (2025-05)
- CIDRe: A Reference-Free Multi-Aspect Criterion for Code Comment Quality Measurement (2025-05)
- Infinite-Instruct: Synthesizing Scaling Code Instruction Data with Bidirectional Synthesis and Static Verification (2025-05)
- Quality In, Quality Out: Investigating Training Data's Role in AI Code Generation (2025-03)
- Rewriting Pre-Training Data Boosts LLM Performance in Math and Code [SwallowCode] (2025-05)
- Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues (2023-07)
- Qwen3 Technical Report (2025-05)
- Qwen2.5 Technical Report (2024-12)
- Technical Report of TeleChat2, TeleChat2.5 and T1 [TeleChat] (2025-07)
- Kimi K2: Open Agentic Intelligence (2025-07)
- ReCode: Updating Code API Knowledge with Reinforcement Learning (2025-06)
- Seed-Coder: Let the Code Model Curate Data for Itself (2025-06)
- Data-efficient LLM Fine-tuning for Code Generation [Data-efficient Fine-tuning] (2025-04)
- CRPE: Expanding The Reasoning Capability of Large Language Model for Code Generation (2025-05)
- DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence (2024-01)
- How Does Code Pretraining Affect Language Model Task Performance? [Code Pretraining] (2024-09)
- StarCoder 2 and The Stack v2: The Next Generation (2024-02)
- How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study [CodeSmellEval] (2024-12)
- Rethinking Repetition Problems of LLMs in Code Generation [RPG] (2025-05)
- Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective (2023-10)
- Brevity is the soul of wit: Pruning long files for code generation (2024-07)
- Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks [Benchmark Builders] (2025-04)
- CodeCipher: Learning to Obfuscate Source Code Against LLMs (2024-10)
- DataComp-LM: In search of the next generation of training sets for language models (2024-06)
- RedStone: Curating General, Code, Math, and QA Data for Large Language Models (2024-12)
- Code Llama: Open Foundation Models for Code (2023-08)
- Evaluating Large Language Models Trained on Code [Codex] (2021-07)
- Assessing LLM code generation quality through path planning tasks [Path Planning Evaluation] (2025-04)
- CODEJUDGE: Evaluating Code Generation with Large Language Models (2024-01)
- Synthetic Data Generation Using Large Language Models: Advances in Text and Code (2025-01)
- Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets (2025-05)
- MG-Verilog: Multi-grained Dataset Towards Enhanced LLM-assisted Verilog Generation (2024-06)
- A Survey on Large Language Models for Code Generation [Code Generation Survey] (2024-08)
- DataRecipe --- How to Cook the Data for CodeLLM? (2024-10)
- aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Processing (2025-04)
- Imperfect Code Generation: Uncovering Weaknesses in Automatic Code Generation by Large Language Models (2024-05)
- On Inter-Dataset Code Duplication and Data Leakage in Large Language Models [Inter-Dataset Code Duplication] (2025-01)
- LLM-ProS: Analyzing Large Language Models’ Performance in Competitive Problem Solving (2025-05)
- Unseen-Codebases-Domain Data Synthesis and Training Based on Code Graphs [UCD-Training] (2026-02)
- ShortCoder: Knowledge-Augmented Syntax Optimization for Token-Efficient Code Generation (2026-01)
- Framework-Aware Code Generation with API Knowledge Graph-Constructed Data: A Study on HarmonyOS [APIKG4SYN] (2025-11)
- A hierarchical and evolvable benchmark for fine-grained code instruction following with multi-turn feedback [MultiCodeIF] (2025-07)
- Beyond functional correctness: Investigating coding style inconsistencies in large language models (2024-06)
- Adadec: Uncertainty-guided adaptive decoding for LLM-based code generation (2025-06)
- Code Copycat Conundrum: Demystifying Repetition in LLM-based Code Generation (2025-04)
- What to retrieve for effective retrieval-augmented code generation? an empirical study and beyond [AllianceCoder] (2025-03)
- RustEvo^2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation (2025-03)
- A Preliminary Study on the Robustness of Code Generation by Large Language Models [RobGen] (2025-03)
- LLM hallucinations in practical code generation: Phenomena, mechanism, and mitigation (2024-09)
- COFFE: A Code Efficiency Benchmark for Code Generation (2025-02)