Get detailed AI feedback on your research paper (free!)

Tech Overview

Yixing Jiang and Andrew Ng

Our agentic reviewer provides rapid feedback to researchers on their work, helping them iterate quickly and improve their research.

The inspiration for this project was a conversation that one of us had with a student (not from Stanford) who had their research paper rejected 6 times over 3 years. They got a round of feedback roughly every 6 months from the peer review process, and this commentary formed the basis for their next round of revisions. The 6-month iteration cycle was painfully slow, and the noisy reviews, which were more focused on judging the paper's worth than on providing constructive feedback, gave only a weak signal for where to go next.

Leveraging a basic agentic workflow, our reviewer quickly provides a paper review and also attempts to give actionable feedback. The agent grounds its review in the latest relevant prior work pulled from arXiv. This creates a much faster feedback loop: you can submit, get feedback, run more experiments or edit, and resubmit, helping you improve your work faster.

When we modified the system to output a 1-10 score by training it to mimic ICLR 2025 reviews (which are public), we found that the Spearman correlation (higher is better) between one human reviewer and another is 0.41, whereas the correlation between the AI and one human reviewer is 0.42. This suggests the agentic reviewer is approaching human-level performance.

Notes:

  • Reviews are AI-generated and may contain errors.
  • Because of grounding in arXiv, we expect the results to be more accurate in fields like AI where recent research is freely published there, and less accurate in other fields.
  • We built this for researchers seeking feedback on their work. If you are a reviewer for a conference, we discourage using this in any way that violates the policies of that conference.

Agentic Reviewer workflow

The agentic system takes the paper PDF and optional target venue as input. It first converts the PDF into a Markdown document using LandingAI's Agentic Document Extraction (ADE). It then extracts the paper title and verifies the document is an academic paper as a sanity check.
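To make the pipeline concrete, here is a minimal sketch of this ingestion step in Python. The helpers `pdf_to_markdown` and `call_llm` are stubs standing in for the ADE client and the LLM API; their names, the prompt wording, and the JSON format are illustrative assumptions, not the production code.

```python
import json

def pdf_to_markdown(pdf_path: str) -> str: ...  # stub: wraps LandingAI ADE in practice
def call_llm(prompt: str) -> str: ...           # stub: wraps an LLM chat API

def ingest_paper(pdf_path: str, target_venue: str | None = None) -> dict:
    """Convert the submitted PDF to Markdown and run basic sanity checks."""
    markdown = pdf_to_markdown(pdf_path)

    # Extract the title and confirm the document actually is an academic paper.
    meta = json.loads(call_llm(
        "Extract the paper title from the document below and say whether it is "
        'an academic paper. Reply as JSON: {"title": str, "is_academic_paper": bool}\n\n'
        + markdown[:20000]
    ))
    if not meta["is_academic_paper"]:
        raise ValueError("The uploaded document does not look like a research paper.")

    return {"markdown": markdown, "title": meta["title"], "venue": target_venue}
```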

To ground the paper review in the latest prior work, the agent analyzes the paper to generate web search queries of different levels of specificity. The search phrases aim to cover different perspectives, such as relevant benchmarks/baselines, other papers addressing the same problem, and papers with related techniques. These queries are then executed using the Tavily search API to find relevant papers on arXiv. The agent then downloads the metadata (title, authors, abstracts) of those papers.
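A sketch of this search step is shown below, using the Tavily Python client. The prompt wording, the number of queries, and the exact search parameters (for example, restricting results to arxiv.org via `include_domains`) are assumptions for illustration rather than the system's actual configuration.

```python
from tavily import TavilyClient  # pip install tavily-python

def call_llm(prompt: str) -> str: ...  # stub: wraps an LLM chat API

def find_related_work(paper_markdown: str, tavily_api_key: str) -> list[dict]:
    """Generate search queries at several levels of specificity and run them against arXiv."""
    queries = call_llm(
        "Read the paper below and write 6 short search queries, from very specific "
        "to broad, covering relevant benchmarks/baselines, other papers on the same "
        "problem, and papers using related techniques. One query per line.\n\n"
        + paper_markdown[:30000]
    ).splitlines()

    client = TavilyClient(api_key=tavily_api_key)
    hits = []
    for query in queries:
        # Restrict results to arXiv so the review is grounded in openly available papers.
        response = client.search(query=query, include_domains=["arxiv.org"], max_results=5)
        hits.extend(response["results"])  # each hit includes a title, URL, and snippet
    return hits
```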

To balance coverage and context length, the agent uses this metadata to evaluate the relevance of each related work and select the most relevant papers. For each of these top papers, it then chooses a summarization method: either use the existing abstract from the downloaded metadata or generate a detailed summary from the full text. If it elects to create a detailed summary, the agent also specifies the most salient focus areas, downloads the paper's PDF from arXiv, converts it to Markdown, and uses an LLM to generate the summary given the Markdown and focus areas.
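The sketch below illustrates this relevance-filtering and summarization step, using the same stubbed helpers as before. The candidate dictionary keys (`abstract`, `pdf_url`), the default `top_k`, and the prompts are illustrative assumptions.

```python
import json

def call_llm(prompt: str) -> str: ...          # stub: wraps an LLM chat API
def pdf_to_markdown(pdf_url: str) -> str: ...  # stub: download and convert a PDF

def summarize_related(paper_markdown: str, candidates: list[dict], top_k: int = 10) -> list[str]:
    """Select the most relevant related papers and produce a summary for each."""
    # Rank candidates by relevance using only their metadata (title, authors, abstract).
    ranked = json.loads(call_llm(
        "Given the paper and the candidate related works below, return a JSON list of "
        f"the indices of the {top_k} most relevant candidates.\n\n"
        f"PAPER:\n{paper_markdown[:20000]}\n\nCANDIDATES:\n{json.dumps(candidates)}"
    ))

    summaries = []
    for idx in ranked[:top_k]:
        cand = candidates[idx]
        # Decide whether the abstract suffices or a detailed full-text summary is needed,
        # and if so, which focus areas that summary should emphasize.
        decision = json.loads(call_llm(
            "Is the abstract below sufficient context for a review, or should the full "
            "paper be summarized? Reply as JSON: "
            '{"use_abstract": bool, "focus_areas": [str]}\n\n' + cand["abstract"]
        ))
        if decision["use_abstract"]:
            summaries.append(cand["abstract"])
        else:
            full_text = pdf_to_markdown(cand["pdf_url"])
            summaries.append(call_llm(
                "Summarize the paper below, focusing on: "
                + ", ".join(decision["focus_areas"]) + "\n\n" + full_text
            ))
    return summaries
```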

Finally, the agent uses both the original paper's Markdown and the newly synthesized related work summaries to generate a comprehensive review following a template.
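The final assembly step might look like the sketch below; the template wording here is only illustrative, not the actual review template.

```python
def call_llm(prompt: str) -> str: ...  # stub: wraps an LLM chat API

# Illustrative instructions; the real review template is more detailed.
REVIEW_INSTRUCTIONS = (
    "You are reviewing a research paper. Ground your review in the related work "
    "provided. Cover: summary, strengths, weaknesses, questions for the authors, "
    "and concrete, actionable suggestions for improvement."
)

def write_review(paper_markdown: str, related_summaries: list[str]) -> str:
    """Generate the review from the paper and the related-work summaries."""
    related_block = "\n\n".join(
        f"[Related work {i + 1}]\n{summary}" for i, summary in enumerate(related_summaries)
    )
    return call_llm(
        f"{REVIEW_INSTRUCTIONS}\n\nPAPER:\n{paper_markdown}\n\nRELATED WORK:\n{related_block}"
    )
```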

Towards human-level reviews

We also modified the agent to provide an overall score for the paper. Instead of having the LLM directly generate a final score, we have it score the paper on 7 dimensions of quality: originality, importance of the research question addressed, whether the claims are well supported, soundness of the experiments, clarity of writing, value to the research community, and whether the paper is contextualized appropriately relative to prior work. We then used linear regression to fit a model mapping these 7 scores to a final score.
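A sketch of the dimension-scoring step is below. The 1-10 per-dimension scale, the prompt wording, and the JSON format are assumptions; `call_llm` is again a stub.

```python
import json

def call_llm(prompt: str) -> str: ...  # stub: wraps an LLM chat API

# The 7 quality dimensions described above.
DIMENSIONS = [
    "originality",
    "importance of the research question addressed",
    "how well the claims are supported",
    "soundness of the experiments",
    "clarity of writing",
    "value to the research community",
    "appropriate contextualization relative to prior work",
]

def score_dimensions(paper_markdown: str) -> list[float]:
    """Score the paper on each dimension rather than asking for one overall score."""
    raw = call_llm(
        "Rate the paper below on each of these dimensions (1-10), returning a JSON "
        "object keyed by dimension: " + "; ".join(DIMENSIONS)
        + "\n\nPAPER:\n" + paper_markdown[:30000]
    )
    scores = json.loads(raw)
    return [float(scores[d]) for d in DIMENSIONS]
```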

In detail, we randomly sampled 300 submissions from ICLR 2025 and excluded 3 withdrawn submissions with no human scores. We used 150 submissions to train the linear regression model that combines the 7 individual scores into a single score, and tested on the remaining 147 submissions.
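In code, the score-combination step is ordinary linear regression. The sketch below uses hypothetical array file names; the features are the 7 per-dimension AI scores per submission, and the regression target (here assumed to be the human overall score) comes from the ICLR 2025 reviews.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical placeholders: X_* hold the 7 per-dimension AI scores per submission,
# y_* hold the corresponding human overall scores from ICLR 2025 reviews.
X_train, y_train = np.load("dims_train.npy"), np.load("human_train.npy")  # (150, 7), (150,)
X_test, y_test = np.load("dims_test.npy"), np.load("human_test.npy")      # (147, 7), (147,)

# Fit the mapping from the 7 dimension scores to a single overall score.
model = LinearRegression().fit(X_train, y_train)
ai_scores = model.predict(X_test)  # one overall AI score per held-out submission
```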

The Spearman correlation between two human reviewers is 0.41, while the Spearman correlation between the AI score and one human score is 0.42. This indicates that the agentic system agrees with a human reviewer about as much as two human reviewers agree with each other in their choice of rating.

The AUC for predicting acceptance using one human score is 0.84, while the AUC using the AI score is 0.75. Note, however, that this is not an apples-to-apples comparison: the human scores have an advantage because the actual acceptance decisions were based partly on those scores.
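These two evaluation metrics can be computed as in the sketch below; the arrays are hypothetical placeholders for the held-out test data (AI scores, two human reviewer scores per paper, and the 0/1 acceptance decisions).

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

# Hypothetical placeholders for the 147 held-out submissions.
ai_scores = np.load("ai_scores.npy")                               # AI overall scores
human_a, human_b = np.load("human_a.npy"), np.load("human_b.npy")  # two human reviewers per paper
accepted = np.load("accepted.npy")                                 # 1 if accepted, else 0

rho_human_human, _ = spearmanr(human_a, human_b)  # human vs. human agreement (0.41 reported above)
rho_ai_human, _ = spearmanr(ai_scores, human_a)   # AI vs. human agreement (0.42 reported above)

auc_human = roc_auc_score(accepted, human_a)      # 0.84 reported above
auc_ai = roc_auc_score(accepted, ai_scores)       # 0.75 reported above
```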

The following figure shows the proportion of papers that received a low AI score (≤ 5.5) across papers with different ranges of average human scores. The AI scores are generally well-calibrated.

[Figure: calibration plot of the proportion of papers receiving a low AI score (≤ 5.5), by range of average human scores]

On the website, this score is displayed only if the target venue selected in the paper submission form is ICLR.

Related work and future directions

Some early studies have explored the use of agents to analyze peer review dynamics [1] or to generate more specific and helpful feedback via discussion among agents [2]. An empirical study by Liang et al. [3] demonstrated that GPT-4-generated feedback has substantial overlap with human feedback, but also found that LLMs are significantly less likely than humans to comment on novelty. One recent study focused on fine-grained evaluation of LLM reviewers and found that LLMs are biased toward examining technical validity while largely overlooking novelty assessment [4].

A pilot study for ICLR 2025 showed that LLM-generated feedback on human reviews can enhance review quality by nudging reviewers to make their reviews more specific and actionable [5].

Beyond reviewing, there is an exciting and growing field of using AI assistance throughout the research process. For example, AI has shown promising results in hypothesis generation [6, 7, 8, 9, 10]. Some researchers are going further and carrying out end-to-end automated discovery [11, 12, 13, 14, 15], a line of research that we believe agentic reviewing can accelerate by providing an automated evaluation metric.

We believe AI tools have significant potential to enhance the research process. We are just at the beginning of a long journey to build AI that helps researchers.

References

  [1] Y. Jin et al., "AgentReview: Exploring Peer Review Dynamics with LLM Agents," Oct. 13, 2024, arXiv:2406.12708. doi: 10.48550/arXiv.2406.12708.
  [2] M. D'Arcy, T. Hope, L. Birnbaum, and D. Downey, "MARG: Multi-Agent Review Generation for Scientific Papers," Jan. 08, 2024, arXiv:2401.04259. doi: 10.48550/arXiv.2401.04259.
  [3] W. Liang et al., "Can Large Language Models Provide Useful Feedback on Research Papers? A Large-Scale Empirical Analysis," NEJM AI, vol. 1, no. 8, p. AIoa2400196, July 2024. doi: 10.1056/AIoa2400196.
  [4] H. Shin et al., "Mind the Blind Spots: A Focus-Level Evaluation Framework for LLM Reviews," Nov. 07, 2025, arXiv:2502.17086. doi: 10.48550/arXiv.2502.17086.
  [5] N. Thakkar et al., "Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025," Apr. 13, 2025, arXiv:2504.09737. doi: 10.48550/arXiv.2504.09737.
  [6] K. Swanson, W. Wu, N. L. Bulaong, J. E. Pak, and J. Zou, "The Virtual Lab: AI Agents Design New SARS-CoV-2 Nanobodies with Experimental Validation," Nov. 12, 2024, bioRxiv. doi: 10.1101/2024.11.11.623004.
  [7] C. Si, D. Yang, and T. Hashimoto, "Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers," Sept. 06, 2024, arXiv:2409.04109. doi: 10.48550/arXiv.2409.04109.
  [8] C. Si, T. Hashimoto, and D. Yang, "The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas," June 25, 2025, arXiv:2506.20803. doi: 10.48550/arXiv.2506.20803.
  [9] S. A. Rizvi et al., "Scaling Large Language Models for Next-Generation Single-Cell Analysis," Oct. 26, 2025, bioRxiv. doi: 10.1101/2025.04.14.648850.
  [10] J. Gottweis et al., "Towards an AI co-scientist," Feb. 26, 2025, arXiv:2502.18864. doi: 10.48550/arXiv.2502.18864.
  [11] H. Wang et al., "Scientific discovery in the age of artificial intelligence," Nature, vol. 620, no. 7972, pp. 47–60, Aug. 2023. doi: 10.1038/s41586-023-06221-2.
  [12] C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha, "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery," Sept. 01, 2024, arXiv:2408.06292. doi: 10.48550/arXiv.2408.06292.
  [13] Y. Yamada et al., "The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search," Apr. 10, 2025, arXiv:2504.08066. doi: 10.48550/arXiv.2504.08066.
  [14] L. Mitchener et al., "Kosmos: An AI Scientist for Autonomous Discovery," Nov. 05, 2025, arXiv:2511.02824. doi: 10.48550/arXiv.2511.02824.
  [15] S. Schmidgall et al., "Agent Laboratory: Using LLM Agents as Research Assistants," June 17, 2025, arXiv:2501.04227. doi: 10.48550/arXiv.2501.04227.