🔍 RQ4: Detection Methods
Detection techniques are evolving from rigid static analysis to dynamic, model-driven, and hybrid evaluation frameworks. They form the diagnostic foundation of LLM quality governance.
💻 1. Code-Level Detection
Identifies defects in generated code using three main paradigms:
- Dynamic Analysis: Test-based execution and runtime monitoring to assess functional correctness and efficiency (a minimal execution sketch follows Fig. 6).
- Static Analysis: Rule-based detection (SonarQube, Semgrep) for syntax errors and vulnerabilities.
- Model-based Detection: “LLM-as-a-judge” techniques and ML classifiers for semantic filtering.
Fig. 6. Taxonomy of Code Issue Detection Techniques
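As an illustration of the dynamic-analysis paradigm, the sketch below executes LLM-generated code against a small set of checks in an isolated subprocess. The generated snippet, the checks, and the helper names are hypothetical placeholders, not taken from any cited benchmark.

```python
import pathlib
import subprocess
import sys
import tempfile
import textwrap

# Placeholder LLM output and a hand-written check; neither is drawn from
# any of the cited benchmarks.
GENERATED_CODE = textwrap.dedent("""
    def add(a, b):
        return a + b
""")

CHECK_CODE = textwrap.dedent("""
    from solution import add

    assert add(2, 3) == 5
    assert add(-1, 1) == 0
""")


def run_dynamic_check(timeout: int = 10) -> bool:
    """Write generated code and its checks to a temp dir, then execute them in a subprocess."""
    with tempfile.TemporaryDirectory() as workdir:
        wd = pathlib.Path(workdir)
        (wd / "solution.py").write_text(GENERATED_CODE)
        (wd / "check_solution.py").write_text(CHECK_CODE)
        result = subprocess.run(
            [sys.executable, "check_solution.py"],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        # A non-zero exit code (assertion failure, crash, syntax error)
        # signals a functional defect in the generated code.
        return result.returncode == 0


if __name__ == "__main__":
    print("passed" if run_dynamic_check() else "failed")
```

Execution-based efficiency benchmarks in the reference list (e.g., EffiBench, Mercury, COFFE) follow the same execute-and-observe pattern, additionally recording runtime and memory statistics.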
📊 2. Data-Level Detection
Targets the integrity, provenance, and representativeness of training data:
- Dynamic Analysis: Execution-based validation and metric drift monitoring, e.g., to detect data leakage.
- Static Analysis: Rule-based detection and provenance tracing using file hashes (a hash-based sketch follows Fig. 7).
- Model-based Detection: Semantic screening using LLMs to evaluate readability and hazards.
Fig. 7. Taxonomy of Training Data Issue Detection Techniques
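A minimal sketch of the static, hash-based side of data-level detection is shown below: content fingerprints act as a lightweight provenance record and also flag exact train/eval overlap, one symptom of data leakage. The function names and toy corpora are illustrative assumptions; practical pipelines additionally normalize text and use near-duplicate matching (e.g., n-gram or MinHash overlap).

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 fingerprint used as a lightweight provenance record for a sample."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_exact_leaks(train_corpus: list[str], eval_samples: list[str]) -> list[str]:
    """Flag evaluation samples whose content also appears verbatim in the training corpus."""
    train_hashes = {content_hash(doc) for doc in train_corpus}
    return [s for s in eval_samples if content_hash(s) in train_hashes]


if __name__ == "__main__":
    # Toy corpora for illustration only.
    train = ["def add(a, b):\n    return a + b", "print('hello world')"]
    evaluation = ["def add(a, b):\n    return a + b", "def mul(a, b):\n    return a * b"]
    leaks = find_exact_leaks(train, evaluation)
    print(f"{len(leaks)} potentially leaked evaluation sample(s)")
```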
📄 Referenced Papers
LLMs Meet Library Evolution
LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion
Less is More
Less is More: On the Importance of Data Quality for Unit Test Generation
Qwen
Qwen Technical Report
Qwen2
Qwen2 Technical Report
DataMan
DataMan: Data Manager for Pre-training Large Language Models
Phi-4
Phi-4 Technical Report
Copilot Security
Is GitHub’s Copilot as Bad as Humans at Introducing Vulnerabilities in Code?
Copilot Evaluation
An Empirical Evaluation of GitHub Copilot’s Code Suggestions
HalluCode
Exploring and Evaluating Hallucinations in LLM-Powered Code Generation
CodeHalu
CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification
EffiBench
EffiBench: Benchmarking the Efficiency of Automatically Generated Code
Mercury
Mercury: A Code Efficiency Benchmark for Code Large Language Models
SStuBs
Large Language Models and Simple, Stupid Bugs
Package Hallucinations
We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs
HallTrigger
Code Hallucination
Large Language Models for Code
Large Language Models for Code: Security Hardening and Adversarial Testing
Purple Llama CYBERSECEVAL
Purple Llama CYBERSECEVAL: A Secure Coding Benchmark for Language Models
Lost at C
Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants
AI Assistants Security
Do Users Write More Insecure Code with AI Assistants?
The Counterfeit Conundrum
The Counterfeit Conundrum: Can Code Language Models Grasp the Nuances of Their Incorrect Generations?
Bugs in LLM-generated Code
Bugs in Large Language Models Generated Code: An Empirical Study
GitHub Copilot, Amazon CodeWhisperer, ChatGPT
Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT
ChatGPT Code Quality
No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT
CloudAPIBench
On Mitigating Code LLM Hallucinations with API Documentation
CodeMirage
CodeMirage: Hallucinations in Code Generated by Large Language Models
LLM-generated Code Efficiency
On Evaluating the Efficiency of Source Code Generated by LLMs
Syntactic Robustness
Syntactic Robustness for LLM-based Code Generation
DeSec
Decoding Secret Memorization in Code LLMs Through Token-Level Characterization
Bias Unveiled
Bias Unveiled: Investigating Social Bias in LLM-Generated Code
FairCoder
FairCoder: Evaluating Social Bias of LLMs in Code Generation
From Effectiveness to Efficiency
From Effectiveness to Efficiency: Comparative Evaluation of Code Generated by LCGMs for Bilingual Programming Questions
ENAMEL
How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark
DeVAIC
DeVAIC: A Tool for Security Assessment of AI-generated Code
PTMs
Comparing Robustness Against Adversarial Attacks in Code Generation: LLM-Generated vs. Human-Written
Codequal Analyzer
Improving LLM-Generated Code Quality with GRPO
Artificial-Intelligence Generated Code Considered Harmful
Artificial-Intelligence Generated Code Considered Harmful: A Road Map for Secure and High-Quality Code Generation
Unveiling Inefficiencies in LLM-Generated Code
Unveiling Inefficiencies in LLM-Generated Code: Toward a Comprehensive Taxonomy
Python Tests Quality
Quality Assessment of Python Tests Generated by Large Language Models
CoQuIR
CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval
REAL
Training Language Models to Generate Quality Code with Program Analysis Feedback
CIDRe
CIDRe: A Reference-Free Multi-Aspect Criterion for Code Comment Quality Measurement
Infinite-Instruct
Infinite-Instruct: Synthesizing Scaling Code Instruction Data with Bidirectional Synthesis and Static Verification
Quality In, Quality Out
Quality In, Quality Out: Investigating Training Data's Role in AI Code Generation
Security and Quality in LLM-Generated Code
Security and Quality in LLM-Generated Code: A Multi-Language, Multi-Model Analysis
SwallowCode
Rewriting Pre-Training Data Boosts LLM Performance in Math and Code
ROSE
ROSE: Transformer-Based Refactoring Recommendation for Architectural Smells
Refining ChatGPT-Generated Code
Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues
Qwen3
Qwen3 Technical Report
Qwen2.5
Qwen2.5 Technical Report
TeleChat
Technical Report of TeleChat2, TeleChat2.5 and T1
Kimi K2
Kimi K2: Open Agentic Intelligence
ReCode
ReCode: Updating Code API Knowledge with Reinforcement Learning
Seed-Coder
Seed-Coder: Let the Code Model Curate Data for Itself
Data-efficient Fine-tuning
Data-efficient LLM Fine-tuning for Code Generation
CRPE
CRPE: Expanding The Reasoning Capability of Large Language Model for Code Generation
DeepSeek-Coder
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
StarCoder 2 and The Stack v2
StarCoder 2 and The Stack v2: The Next Generation
CodeSmellEval
How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study
RPG
Rethinking Repetition Problems of LLMs in Code Generation
Repetition In Repetition Out
Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective
Every Sample Matters
Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM
WaveCoder
WaveCoder: Widespread And Versatile Enhancement For Code Large Language Models By Instruction Tuning
Brevity is the soul of wit
Brevity is the Soul of Wit: Pruning Long Files for Code Generation
Benchmark Builders
Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks
Beyond Correctness
Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models
Generated Code Diversity
Is Functional Correctness Enough to Evaluate Code Language Models? Exploring Diversity of Generated Codes
CodeMI
Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach
DataComp-LM
DataComp-LM: In search of the next generation of training sets for language models
Codex
Evaluating Large Language Models Trained on Code
Path Planning Evaluation
Assessing LLM Code Generation Quality Through Path Planning Tasks
CODEJUDGE
CODEJUDGE: Evaluating Code Generation with Large Language Models
Datasets for Large Language Models
Datasets for Large Language Models: A Comprehensive Survey
Synthetic Data Generation
Synthetic Data Generation Using Large Language Models: Advances in Text and Code
Cracks in The Stack
Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets
Unseen Horizons
Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the Familiar
MG-Verilog
MG-Verilog: Multi-grained Dataset Towards Enhanced LLM-assisted Verilog Generation
Code Generation Survey
A Survey on Large Language Models for Code Generation
DataRecipe
DataRecipe --- How to Cook the Data for CodeLLM?
Training Data Extraction
Understanding Privacy Risks of Large Language Models in Japanese Based on Training Data Extraction Attacks
aiXcoder-7B
aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Processing
Imperfect Code Generation
Imperfect Code Generation: Uncovering Weaknesses in Automatic Code Generation by Large Language Models
Inter-Dataset Code Duplication
On Inter-Dataset Code Duplication and Data Leakage in Large Language Models
LLM-ProS
LLM-ProS: Analyzing Large Language Models’ Performance in Competitive Problem Solving
ClassEval
Evaluating Large Language Models in Class-Level Code Generation
Uncovering Pretraining Code in LLMs
Uncovering Pretraining Code in LLMs: A Syntax-Aware Attribution Approach
RealSec-Bench
RealSec-Bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories
ShortCoder
ShortCoder: Knowledge-Augmented Syntax Optimization for Token-Efficient Code Generation
APIKG4SYN
Framework-Aware Code Generation with API Knowledge Graph-Constructed Data: A Study on HarmonyOS
MultiCodeIF
A Hierarchical and Evolvable Benchmark for Fine-Grained Code Instruction Following with Multi-Turn Feedback
Beyond Functional Correctness
Beyond Functional Correctness: Investigating Coding Style Inconsistencies in Large Language Models
Adadec
Adadec: Uncertainty-Guided Adaptive Decoding for LLM-Based Code Generation
Code Copycat Conundrum
Code Copycat Conundrum: Demystifying Repetition in LLM-based Code Generation
AllianceCoder
What to Retrieve for Effective Retrieval-Augmented Code Generation? An Empirical Study and Beyond
RustEvo^2
RustEvo^2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation
RobGen
A Preliminary Study on the Robustness of Code Generation by Large Language Models
LLM Hallucinations in Practical Code Generation
LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation
COFFE
COFFE: A Code Efficiency Benchmark for Code Generation
AATK Benchmark
Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions