Equilibrium effects of LLM reviewing

LLMs have been proposed as an aid to the process of scientific peer review. I argue that (a) different ways of using LLMs in review can have very different consequences and that (b) to reason about these consequences, it crucial to think not just about the immediate accuracy or helpfulness of LLM outputs, but rather about the equilibrium that would be reached by the overall scientific ecosystem. In particular, using LLMs to provide full-fledged reviews of papers could create negative long-run incentives for authors stemming from greater centralization in the evaluation of research.

Several different roles have been explored for LLMs in peer review. First, as a form of feedback to authors, enabling them to anticipate weaknesses or common criticisms before submitting their paper. Second, as an error detection mechanism, checking papers for characteristics like bugs in proofs or statistical mistakes that would be clearly acknowledged as flaws if they were known to reviewers but which are not always caught in practice. Third, to improve the quality of human reviews by providing feedback that human reviewers may choose to implement or not. Finally, as a reviewer themselves. For example, AAAI’s pilot program envisions LLMs as contributing first-stage reviews considered alongside human reviews. While the pilot program does not have LLMs provide a numerical rating, it is easy to imagine future versions that have the LLM generate a “full” review (complete with rating) that, e.g., could allow venues to reduce the number of human reviewers assigned to each paper.

This final option, “LLM as reviewer”, can have very different consequences than the others. In the short term, it may appear easy to vet such a system, for example by checking the correspondence between LLM reviews and human reviews on a set of papers, and whether the same accept/reject decisions would have been reached if the LLM were relied on to supply one of the reviews. This kind of verification, even if it shows very positive results for the LLM, is not sufficient because it neglects how authors may change their own behavior once LLM reviews begin to impact decisions on paper acceptances. These equilibrium effects could prove much more consequential than any direct effects in reducing workload.

A very extreme form of equilibrium behavior is an explicitly adversarial attack on the LLM reviewer. The possibilities are near-endless. On the low-effort end of the spectrum, some authors will no doubt attempt to smuggle new instructions into hidden text in the paper, instructing the LLM to leave a more positive review. This may be explicitly forbidden by conferences, with automated checks implemented to detect such instructions. However, more sophisticated attacks abound. Authors could attempt steganographic attacks, using an LLM of their own to attempt millions of paraphrases of the original paper until discovering a version that is semantically similar but garners a much more positive review. Or, since sophisticated LLM reviewers will employ web search, authors could attempt to poison the search results, introducing papers into arxiv or less-selective venues which exist only to influence the LLM reviewer once retrieved into context. Are these attacks, or others amongst the undoubtedly wide range of options, effective? We will certainly find out once tens of thousands of machine learning researchers are incentivized to try. Given the effort that AI and machine learning conferences have spent combating collusion rings in reviewing, serious attempts at attacking the system seem inevitable.

However, the more worrying equilibrium effects are in how LLM reviewers may impact the conduct of ordinary researchers, who won’t set out to game the system in such extreme ways, but who will nevertheless respond to the incentives that LLM reviewers create. If LLMs are asked to write a review of a paper, spanning from the importance of the question asked to the appropriateness and rigor of the methods, they will undoubtedly have “preferences” about each of those elements. There will be some questions that they find more important, and some less. There will be some methods that they find more rigorous than others. These value judgments are inextricable from the task of reviewing a paper – the very point of being a reviewer is to have substantive opinions about what constitutes good science!

At present, it does not seem like we even know what LLMs’ preferences are. However, if LLM reviewer systems are employed in major publication venues, authors will discover and respond to those preferences. The incentives at play are potentially significant. As a simple thought experiment, imagine a borderline-quality paper in a system with three human reviewers compared to two human reviewers and one LLM. The paper is accepted if two out of three reviews are positive. In a three-human system, each reviewer accepts the paper with 50% probability, leading to a 50% overall acceptance rate. If the author is able to submit a paper that they know the LLM will like, conditioning on one of the reviews being positive, the acceptance probability climbs to 75%. Conversely, if they do work that they know the LLM will review negatively, the acceptance probability drops to 25%. While this is a simple example, not to be taken literally, it qualitatively illustrates that authors who already face strong incentives to get papers accepted might face a correspondingly strong incentive to do work that the LLM approves of. In reality, there would be further equilibrium effects: the acceptance bar of conferences would recalibrate to reflect many authors changing their behavior to ensure positive LLM reviews, further disadvantaging authors who do not.

In this sense, LLM reviewers are qualitatively different from any human reviewer because they are present in the discussion of every single paper, impossible to avoid. Introducing an LLM reviewer inevitably centralizes power in scientific review. In a narrow sense, it centralizes peer review to align much more closely with the taste of the LLM than with any single human. In a longer-run sense, it allows those who create the system (write the prompts, train the model, etc) to have a much larger role in the evaluation of science than any single person or group normally does. Even in our present reviewing system, unconventional ideas that later prove impactful sometimes struggle to pass through a consensus-oriented review process. LLMs seem likely to amplify such tendencies – right now, unusual work can still be accepted with some luck finding amenable reviewers, but there would be no lucky escape from the LLM reviewer.

The extent to which these structural effects manifest is likely to be heavily impacted by the way that LLMs are used. The potential for unanticipated equilibria seems much greater when LLMs participate more fully as reviewers, expressing substantive judgments about the overall quality of papers. It may be lower when LLMs participate in more constrained roles, for example commenting on specific aspects of soundness as a sort of pre-review check that does not substitute for human reviewers.

More broadly, to decide whether and how to use LLMs as part of peer review, I argue that the scientific community should embrace three design goals:

First, there should be transparency, backed through careful empirical study, about what substantive preferences about research objectives and methodology a particular LLM system introduces into evaluations of papers.

Second, we should avoid using LLMs in ways that incentivize research to conform to any single set of preferences, especially in ways that would not command an overwhelming consensus in the community. For example, a preference that papers with mistaken proofs should not be accepted is likely appropriate, but it is undesirable to create additional monoculture in contested judgments about the importance of different research topics.

Third, scientists should demand that any system used as part of peer review be openly accessible, with turn-key replication by outside parties. This is the foundation for any communal attempts to understand or shape the role of LLMs in peer review.

While LLMs raise genuinely new issues, the AI community has existing tools to evaluate and design systems with these goals in mind. Algorithmic monoculture, performative prediction, strategic responses to algorithms, and adversarial attacks have all been topics of significant recent study. Any strategy we take to leverage LLMs’ capabilities in scientific work should build on these foundations and combine additional theory, empirical study, and experimentation to understand how LLMs influence the immediate review process and their overall impact on scientific work. LLMs are an amazing technology that likely has many applications in improving the quality and efficiency of research. However, introducing them into decision making processes is not a step to be taken lightly.

Thank you to Andrew Perrault for a conversation that inspired this piece.