Paper2Rebuttal:
A Multi-Agent Framework for Transparent Author Response Assistance

Qianli Ma*, Chang Guo*, Zhiheng Tian*, Siyu Wang, Jipeng Xiao, Yuanhao Yue, Zhipeng Zhang†

AutoLab, School of Artificial Intelligence, Shanghai Jiao Tong University

* indicates co-first authors; † indicates corresponding author.

Overview

We introduce REBUTTALAGENT, the first multi-agent framework that reframes rebuttal generation as an evidence-centric planning task. This approach addresses the limitations of current direct-to-text methods, which often lead to hallucination, overlooked critiques, and a lack of verifiable grounding. Our system decomposes complex feedback into atomic concerns, dynamically constructs hybrid contexts by synthesizing compressed summaries with high-fidelity text, and integrates an autonomous external search module to resolve concerns requiring outside literature. Crucially, REBUTTALAGENT generates an inspectable response plan before drafting, ensuring every argument is explicitly anchored in internal or external evidence. Evaluation on our proposed REBUTTALBENCH demonstrates that REBUTTALAGENT outperforms strong baselines in coverage, faithfulness, and strategic coherence, offering a transparent and controllable assistant for the peer review process. Our work is summarized in the following figure, which contrasts our approach with previous methods.

Overview of our work. Given a manuscript and reviews, (a) direct text generation (SFT on peer-review corpora) often fabricates experimental results and is prone to hallucination. (b) Interactive prompting with chat LLMs depends on manual concern feeding and many iterations. (c) RebuttalAgent reframes rebuttal writing as a decision-and-evidence organization problem, performing concern breakdown, query-conditioned internal and external evidence construction, and strategy-level plan verification with human-in-the-loop checkpoints before drafting the final response.

RebuttalAgent

REBUTTALAGENT operates as a multi-agent framework designed to transform the rebuttal process into a structured and inspectable workflow. Our system generates evidence-linked intermediate artifacts before drafting the final text to ensure grounded and controllable output. The architecture, as shown in the figure below, decomposes complex reasoning into specialized agents paired with lightweight checkers. This design exposes critical decision points, allowing authors to retain responsibility for strategic stance and final wording. The pipeline begins by distilling the manuscript into a structured summary and extracting atomic reviewer concerns for stable long-context reasoning. Guided by these concerns, the system constructs evidence bundles by retrieving high-fidelity excerpts from the manuscript and augmenting them with verifiable external literature via web search. The workflow concludes by synthesizing an explicit response plan outlining arguments and evidence links, which authors can refine through a human-in-the-loop mechanism before the system produces a formal rebuttal letter.

Overview of RebuttalAgent. Given a manuscript (PDF) and reviewer comments, the system (1) structures inputs by parsing and compressing the paper with fidelity checks and extracting atomic reviewer concerns with coverage checks; (2) builds concern-conditioned evidence by constructing a query-specific hybrid manuscript context and, when needed, retrieving and summarizing external literature into citation-ready briefs; and (3) generates an inspectable, evidence-linked response plan that is checked for consistency and commitment safety, incorporates optional author feedback, and is then realized into a formal rebuttal draft.
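To make the staged workflow concrete, the following minimal Python sketch mirrors the three stages above: input structuring, concern-conditioned evidence construction, and planning with a human-in-the-loop checkpoint before drafting. The type names, function names, and prompt strings are illustrative simplifications for exposition (the llm argument stands for any text-in/text-out model call), not our actual implementation or released code.

from dataclasses import dataclass, field
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # any prompt -> completion function


@dataclass
class Concern:
    reviewer_id: str
    text: str  # one atomic concern extracted from a review


@dataclass
class EvidenceBundle:
    concern: Concern
    manuscript_excerpts: List[str] = field(default_factory=list)
    external_briefs: List[str] = field(default_factory=list)  # citation-ready summaries


@dataclass
class PlanItem:
    concern: Concern
    argument: str  # intended response strategy
    evidence_refs: List[str] = field(default_factory=list)


def structure_inputs(llm: LLM, manuscript: str, reviews: List[str]) -> Tuple[str, List[Concern]]:
    """Compress the manuscript and split each review into atomic concerns."""
    summary = llm(f"Summarize this manuscript into a structured outline:\n{manuscript}")
    concerns = [
        Concern(reviewer_id=f"R{i + 1}", text=line.strip())
        for i, review in enumerate(reviews)
        for line in llm(f"List each atomic concern on its own line:\n{review}").splitlines()
        if line.strip()
    ]
    return summary, concerns


def build_evidence(llm: LLM, manuscript: str, concern: Concern) -> EvidenceBundle:
    """Build a query-conditioned hybrid context: compressed summary plus high-fidelity excerpts."""
    excerpt = llm(
        f"Quote the manuscript passage most relevant to this concern:\n{concern.text}\n---\n{manuscript}"
    )
    return EvidenceBundle(concern=concern, manuscript_excerpts=[excerpt])


def plan_then_draft(
    llm: LLM,
    summary: str,
    bundles: List[EvidenceBundle],
    author_review: Callable[[List[PlanItem]], List[PlanItem]] = lambda plan: plan,
) -> str:
    """Produce an evidence-linked plan, let the author revise it, then realize the rebuttal letter."""
    plan = [
        PlanItem(
            concern=b.concern,
            argument=llm(f"Propose a response strategy for: {b.concern.text}"),
            evidence_refs=b.manuscript_excerpts,
        )
        for b in bundles
    ]
    plan = author_review(plan)  # human-in-the-loop checkpoint before any drafting
    outline = "\n".join(f"- [{p.concern.reviewer_id}] {p.argument}" for p in plan)
    return llm(f"Write a formal rebuttal letter that follows this plan:\n{outline}\n\nPaper summary:\n{summary}")

The author_review callback marks the checkpoint at which an author can edit or reject plan items before any rebuttal text is generated.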

RebuttalBench

We introduce REBUTTALBENCH, a benchmark built from real OpenReview discussions to close the gap between generic text metrics and what actually makes a rebuttal effective. It curates dense review-response dyads and evaluates what matters most for rebuttals: coverage of atomic concerns and evidence-grounded arguments, not just surface fluency.

The dataset is derived from ICLR OpenReview forums. Each item pairs the initial review with the author rebuttal and the reviewer follow-up, which provides a supervision signal for positive/negative outcomes. This yields REBUTTALBENCH-CORPUS with 9.3K review-rebuttal pairs; a harder REBUTTALBENCH-CHALLENGE subset selects the top 20 papers with over 100 reviewers to maximize concern diversity.
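For concreteness, each benchmark item can be thought of as a small record like the one below; the field names and outcome labels are illustrative rather than the released data format.

from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class RebuttalBenchItem:
    forum_id: str            # OpenReview forum the thread was collected from
    review: str              # initial reviewer comment
    rebuttal: str            # author response under evaluation
    followup: Optional[str]  # reviewer reply posted after the rebuttal, if any
    outcome: Literal["positive", "negative", "unresolved"]  # weak supervision signal from the follow-up


item = RebuttalBenchItem(
    forum_id="iclr-forum-xyz",  # placeholder identifier
    review="The ablation on the proposed module is missing.",
    rebuttal="We have added the requested ablation to the revision.",
    followup="Thanks, this resolves my concern.",
    outcome="positive",
)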

We evaluate responses using an LLM-as-judge rubric on a 0–5 scale across Relevance (R-Score), Argumentation Quality (A-Score), and Communication Quality (C-Score). The word cloud and top-word histogram below highlight recurring reviewer emphases such as clarity, novelty, and reproducibility, which the rubric explicitly targets.

RebuttalBench statistics and rubric design. (a) Word-cloud and top-word histogram of reviews in REBUTTALBENCH-CORPUS, highlighting recurring reviewer emphases (e.g., clarity, novelty, reproducibility). (b) Motivated by these signals, REBUTTALBENCH evaluates rebuttals with a rubric that mirrors these concerns, scoring Relevance, Argumentation Quality, and Communication Quality rather than fluency alone.
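As a sketch of how the rubric can be applied, the snippet below asks a judge model to score the nine sub-metrics on the 0–5 scale and averages each group of three into the R-, A-, and C-Scores. The prompt wording and the mean aggregation are simplifications of ours for illustration, not the exact judging protocol.

import json
from typing import Callable, Dict

SUB_METRICS = {
    "Relevance": ["Coverage", "Semantic Alignment", "Specificity"],
    "Argumentation Quality": ["Logic Consistency", "Evidence Support", "Response Engagement"],
    "Communication Quality": ["Professional Tone", "Statement Clarity", "Suggestion Constructiveness"],
}


def judge_rebuttal(llm: Callable[[str], str], review: str, rebuttal: str) -> Dict[str, float]:
    """Score every sub-metric on a 0-5 scale with a judge model and return dimension-level scores."""
    metric_list = [m for group in SUB_METRICS.values() for m in group]
    prompt = (
        "You are evaluating an author rebuttal against a reviewer comment.\n"
        f"Score each of the following criteria from 0 to 5: {', '.join(metric_list)}.\n"
        "Answer with a JSON object mapping each criterion to a number.\n\n"
        f"Review:\n{review}\n\nRebuttal:\n{rebuttal}\n"
    )
    scores = json.loads(llm(prompt))  # judge is asked to reply with JSON
    # One natural aggregation (an assumption here): average each dimension's three sub-metrics.
    return {dim: sum(scores[m] for m in subs) / len(subs) for dim, subs in SUB_METRICS.items()}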

Experimental Results

We assessed REBUTTALAGENT's efficacy by comparing it with strong closed-source LLM baselines and by ablating key system components. All experiments were conducted in a fully automated mode without human intervention, providing a conservative lower bound on performance. We used four SOTA LLMs (GPT-5-mini, Grok-4.1-fast, Gemini-3-Flash, and DeepSeekV3.2) as baselines and adopted Gemini-3-Flash as a unified LLM judge for evaluation.

REBUTTALAGENT consistently outperforms strong closed-source LLMs across all evaluation dimensions on REBUTTALBENCH when sharing the same base models. The largest gains are observed in Relevance and Argumentation Quality: Coverage improves by up to +0.78 (DeepSeekV3.2) and Specificity by up to +1.33 (GPT5-mini), while Argumentation Quality sub-scores (most notably Evidence Support) rise by up to +0.63. This indicates that the improvements stem from task decomposition and structured intermediate reasoning rather than merely stronger generative capacity. The benefit of REBUTTALAGENT is also larger for weaker base models, suggesting that explicit concern structuring, evidence construction, and response planning can partially compensate for limited base-model capability.

Main results on REBUTTALBENCH (LLM-judge scores on a 0–5 scale; deltas in parentheses are relative to the corresponding base model). Relevance comprises Coverage, Semantic Alignment, and Specificity; Argumentation Quality comprises Logic Consistency, Evidence Support, and Response Engagement; Communication Quality comprises Professional Tone, Statement Clarity, and Suggestion Constructiveness.

| Method | Coverage | Semantic Alignment | Specificity | Logic Consistency | Evidence Support | Response Engagement | Professional Tone | Statement Clarity | Suggestion Constructiveness | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeekV3.2 | 3.65 | 4.44 | 3.28 | 3.44 | 3.01 | 3.16 | 3.37 | 3.96 | 3.81 | 3.57 |
| RebuttalAgent-DeepSeekV3.2 | 4.43 (+0.78) | 4.82 (+0.38) | 4.39 (+1.11) | 3.86 (+0.42) | 3.23 (+0.22) | 3.79 (+0.63) | 3.60 (+0.23) | 4.18 (+0.22) | 4.06 (+0.25) | 4.08 (+0.51) |
| Grok-4.1-fast | 3.98 | 4.58 | 3.72 | 3.73 | 3.32 | 3.60 | 3.48 | 4.05 | 3.92 | 3.82 |
| RebuttalAgent-Grok-4.1-fast | 4.66 (+0.68) | 4.92 (+0.34) | 4.65 (+0.93) | 4.13 (+0.40) | 3.42 (+0.10) | 4.15 (+0.55) | 3.68 (+0.20) | 4.23 (+0.18) | 4.24 (+0.32) | 4.25 (+0.43) |
| Gemini3-Flash | 4.00 | 4.71 | 3.77 | 3.71 | 3.30 | 3.56 | 3.51 | 4.08 | 3.95 | 3.85 |
| RebuttalAgent-Gemini3-Flash | 4.51 (+0.51) | 4.88 (+0.17) | 4.49 (+0.72) | 4.11 (+0.40) | 3.39 (+0.09) | 4.07 (+0.51) | 3.78 (+0.27) | 4.28 (+0.20) | 4.09 (+0.14) | 4.23 (+0.38) |
| GPT5-mini | 3.61 | 4.22 | 2.96 | 3.37 | 2.92 | 3.07 | 3.35 | 3.95 | 3.91 | 3.48 |
| RebuttalAgent-GPT5-mini | 4.34 (+0.73) | 4.84 (+0.62) | 4.29 (+1.33) | 3.78 (+0.41) | 3.31 (+0.39) | 3.70 (+0.63) | 3.60 (+0.25) | 4.21 (+0.26) | 4.24 (+0.33) | 4.05 (+0.57) |

Our ablation studies reveal that Evidence Construction is the most critical intermediate artifact. Removing external evidence briefs leads to the largest and most consistent degradation across dimensions, with clear drops in Relevance and Communication Quality. Input Structuring and Checkers also contribute measurably, indicating that the gains of REBUTTALAGENT arise from the combination of complementary modules. Evidence-centered artifacts are the primary driver, while structuring and verification provide crucial guardrails.

Ablation results (LLM-judge scores on a 0–5 scale; deltas in parentheses are relative to the full RebuttalAgent).

| Dimension | Metric | RebuttalAgent | w/o Structuring | w/o Evidence | w/o Checker |
| --- | --- | --- | --- | --- | --- |
| Relevance | Coverage | 4.51 | 4.49 (-0.02) | 4.26 (-0.25) | 4.54 (+0.03) |
| Relevance | Semantic Alignment | 4.88 | 4.71 (-0.17) | 4.73 (-0.15) | 4.89 (+0.01) |
| Relevance | Specificity | 4.49 | 4.46 (-0.03) | 4.19 (-0.30) | 4.47 (-0.02) |
| Argumentation Quality | Logic Consistency | 4.11 | 4.06 (-0.05) | 4.05 (-0.06) | 4.13 (+0.02) |
| Argumentation Quality | Evidence Support | 3.39 | 3.23 (-0.16) | 3.32 (-0.07) | 3.39 (+0.00) |
| Argumentation Quality | Response Engagement | 4.07 | 4.04 (-0.03) | 3.97 (-0.10) | 4.01 (-0.06) |
| Communication Quality | Professional Tone | 3.78 | 3.69 (-0.09) | 3.74 (-0.04) | 3.73 (-0.05) |
| Communication Quality | Statement Clarity | 4.28 | 4.33 (+0.05) | 4.22 (-0.06) | 4.29 (+0.01) |
| Communication Quality | Suggestion Constructiveness | 4.09 | 4.06 (-0.03) | 3.82 (-0.27) | 4.05 (-0.04) |
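The ablation conditions in the table above amount to disabling one module at a time while keeping the rest of the pipeline fixed. The configuration sketch below illustrates this setup; the flag names are ours for illustration rather than the codebase's.

from dataclasses import dataclass


@dataclass
class PipelineConfig:
    use_input_structuring: bool = True      # paper compression + atomic concern extraction
    use_evidence_construction: bool = True  # hybrid manuscript context + external literature briefs
    use_checkers: bool = True               # fidelity / coverage / consistency verification


ABLATIONS = {
    "RebuttalAgent": PipelineConfig(),
    "w/o Structuring": PipelineConfig(use_input_structuring=False),
    "w/o Evidence": PipelineConfig(use_evidence_construction=False),
    "w/o Checker": PipelineConfig(use_checkers=False),
}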

Case studies show that producing an inspectable plan, which separates interpretative defense from points that require genuine intervention, reduces hallucination and over-commitment. When new experiments or analyses are needed, REBUTTALAGENT outputs concrete deliverables and a scoped to-do list, making the reasoning path and required work explicit before drafting and thereby supporting author decision-making.

Conclusion

We proposed REBUTTALAGENT, a multi-agent framework for transparent rebuttal assistance. Our system constructs structured, evidence-linked intermediate artifacts before drafting text, decomposing rebuttal writing into concern structuring, query-conditioned context building, on-demand external evidence synthesis, and response planning. This decomposition improves traceability and cross-point coherence while keeping authors responsible for final decisions and wording. Using an author-centric benchmark and a rubric-based evaluation that measures relevance, argumentation quality, and communication quality beyond mere text fluency, our experiments show that REBUTTALAGENT better satisfies the key requirements of reliable rebuttal assistance. This highlights the benefits of a transparent “verify-then-write” workflow, which reduces cognitive burden while leaving authors in control of the final wording.

BibTeX

Copy the citation and drop it into your bibliography.

@misc{ma2026paper2rebuttal,
    title={Paper2Rebuttal: A Multi-Agent Framework for Transparent Author Response Assistance}, 
    author={Qianli Ma and Chang Guo and Zhiheng Tian and Siyu Wang and Jipeng Xiao and Yuanhao Yue and Zhipeng Zhang},
    year={2026},
    eprint={2601.14171},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/2601.14171}, 
}