
On the use of LLMs (e.g. ChatGPT) in a JOSS submission #1297

Open

evetion opened this issue Nov 28, 2023 · 12 comments

@evetion

evetion commented Nov 28, 2023

I recently encountered documentation in a JOSS submission that was clearly written by a Large Language Model (LLM). I wondered what JOSS's policy is on the use of LLMs, on authorship, etc.

Personally, I'm not against the use of LLMs in writing either code or documentation. I think we can assume that many submissions (will eventually) make use of LLMs such as GitHub Copilot. This will not always be clear or detectable. But could we request that authors state in the paper that an LLM has been used, and for which purpose?

Policies of Nature and Science on LLMs

Nature

Authors. Corresponding author(s) should be identified with an asterisk. Large Language Models (LLMs), such as ChatGPT, do not currently satisfy our authorship criteria. Notably an attribution of authorship carries with it accountability for the work, which cannot be effectively applied to LLMs. Use of an LLM should be properly documented in the Methods section (and if a Methods section is not available, in a suitable alternative part) of the manuscript.

Science

Artificial intelligence (AI). AI-assisted technologies [such as large language models (LLMs), chatbots, and image creators] do not meet the Science journals’ criteria for authorship and therefore may not be listed as authors or coauthors, nor may sources cited in Science journal content be authored or coauthored by AI tools. Authors who use AI-assisted technologies as components of their research study or as aids in the writing or presentation of the manuscript should note this in the cover letter and in the acknowledgments section of the manuscript. Detailed information should be provided in the methods section: The full prompt used in the production of the work, as well as the AI tool and its version, should be disclosed. Authors are accountable for the accuracy of the work and for ensuring that there is no plagiarism. They must also ensure that all sources are appropriately cited and should carefully review the work to guard against bias that may be introduced by AI. Editors may decline to move forward with manuscripts if AI is used inappropriately. Reviewers may not use AI technology in generating or writing their reviews because this could breach the confidentiality of the manuscript.

@sneakers-the-rat
Contributor

I think that asking reviewers to spend their time reviewing code or documentation that the authors didn't even take the time to write would be a really bad thing for JOSS. I don't want JOSS to be a source of free labor where someone throws together a package carelessly with an LLM and reviewers are then asked to pick through and fix all the bugs created by the LLM.

We can't have a policy like "no LLMs" because it would be unenforceable, but I think reviewers should have the ability to signal to the editor that 'hey, I think most of this is just LLM junk and I don't want to review that' and for that to terminate the review. On the author side, I think requiring disclosure for LLM-generated docs and code would be a really good idea, so that reviewers know what they are reviewing.

@danielskatz
Collaborator

I think the problem here is that some authors use LLMs to improve text, such as translation, smoothing, joining, etc. I don't think we should disallow this. I'm less sure about code. Maybe we should just reemphasize/reinforce that authors are responsible for everything they submit, no matter what tools were used to work on it.

@sneakers-the-rat
Contributor

Yes, I think that the use of LLMs to improve text, specifically in the case of translation, is completely fine, and I agree we shouldn't disallow it. I do think a disclosure statement would be good: if an author uses an LLM for translation, I would hope that our reviewers and editors would understand that in context as a legitimate need, and in that case (if the target language was English and the source language was not) I would want to spend extra time helping with the text. If the text feels odd and no reason has been disclosed, I would not want that to become a source of suspicion for the reviewer. So in both cases I think disclosure (without an unenforceable/undesirable blanket ban) would be a good idea.

@danielskatz
Collaborator

A challenge is deciding which LLMs we would ask people to disclose. The text suggestions in Google Docs and on phones are LLMs, for example. I lean towards not asking what tools people used, as this seems to me to be a limitless pit I don't really want to jump into.

@sneakers-the-rat
Contributor

sneakers-the-rat commented Apr 17, 2024

I see your point, but I also think there is a qualitative distinction to draw between predictive text generally and the kinds of tools being thrust into the world that generate masses of plausible-seeming text, specifically in their capacity to create meaningful labor imbalances that exploit the 'good will' nature of JOSS's review system. The revival of this issue was prompted by (being vague because I'm not sure what's public) the attempted use of a software artifact that does not exist and never did, so reviewers are being asked to spend time evaluating code that the authors didn't spend the time to write. This turns JOSS into a free labor farm for correcting LLM-generated bugs instead of a system for improving the health of FOSS. I appreciate the thought of falling back on holding authors accountable for whatever they submit, because ideally that would be enough, but I think that without a specific policy we might be missing an adversarial case here that could meaningfully impact the operation of the journal.

@danielskatz
Collaborator

In my opinion:

For well-intentioned users, which tools they used shouldn't be an issue, and I don't want to potentially denigrate people who use LLMs for language over those who don't, as I think this would hurt non-English speakers (writers).

For the adversarial case, I don't think that asking authors to volunteer the use of LLMs will work, as I suspect anyone adversarial will not actually disclose this.

I think, but am less sure, that we will see more code written with the help of LLMs over time, and I don't see this as problematic, as long as it's an aid and the code is both understood and tested by the author.

So, while I understand the case you raise and do agree that we want to avoid it and similar ones in the future, I don't see a way to do so that will both be successful and will not harm those who are behaving well.

@sneakers-the-rat
Contributor

I don't think that asking authors to volunteer the use of LLMs will work, as I suspect anyone adversarial will not actually disclose this.

The expectation is that well-intentioned people will describe what they did so that the reviewers are aware of what they are being asked to evaluate. So hopefully our reviewers, when seeing a disclosure that says that LLMs were used to help with translation, see that not as a problem but as something they could potentially help with in their review. In the case that folks (for whatever reason) lie about their use of LLMs, the reviewers then have something to point to and say "you say this code was not LLM-generated, and yet this has few other explanations than being LLM-generated code. plz explain."

In any case we need a policy here, because the lack of one both threatens to poison the reviewer pool and makes us unresponsive as a journal to a pretty important development in FOSS, regardless of whether we are "for it" or "against it."

@danielskatz
Collaborator

Let's see what other people think/say. I have given my opinion, but I'm OK with being outweighed by others.

@logological

I agree with everything @sneakers-the-rat has been saying. And I should add that the policy should not just be some disclosure requirements for authors, but rather something that enables editors and reviewers to expeditiously challenge or walk away from submissions that have apparently been submitted in bad faith without such disclosure, or perhaps even in good faith but without the level of scholarly effort needed to understand and verify the text/code being submitted.

This second part is very important because I have personally spoken with the editor-in-chief of two established (30- to 50-year-old) scholarly journals that have effectively been destroyed by a flood of low-quality submissions. Coinciding with the availability of LLMs such as ChatGPT, the journals saw a tremendous increase in submissions, most of which did not stand up to close scrutiny. In some cases the problems were obvious, such as nonsensical text and fabricated references, but in other cases the deficiencies required more careful examination to expose. Regardless, the sheer number of submissions overwhelmed the editorial apparatus of the journal. The editors did not have enough time to check all the papers carefully enough to determine if they should be desk-rejected, nor to find reviewers for all the papers that passed their inadequate checks. Of the papers that did get assigned to reviewers, a higher proportion were low-quality submissions that the reviewers ended up (sometimes angrily) rejecting. The overall quality of published submissions was therefore lowered, and everyone involved in the journal, from editors to reviewers to good-faith authors to subscribers, was in some way overwhelmed or let down. Finding no support from the publisher – one of the big names in scholarly publishing – the editor-in-chief ended up resigning, leaving along with the editorial assistant and nearly all the associate editors and editorial advisory board members.

I would not want anything like this to happen to JOSS, and so would argue that we should be prepared to deal efficiently with negligent or bad-faith submissions on a large scale. Though I don't think the journals I mentioned above had any policy concerning the use of LLMs, a policy that merely required disclosure would not have helped any. JOSS's more collegial publishing model is particularly vulnerable to abuse, since we see reviewing as a process to collaboratively improve the submissions. But this process needs to begin with a submission of a certain minimum level of quality! People who submit to us LLM-generated material, and who themselves are not willing or able to check that it meets this minimum level of quality, should not expect our editors and reviewers to do so for them, let alone help bring this material up to publication quality. Since reviewers are already pretty much free to refuse or abandon reviews, our policy should probably focus on what measures editors can take to avoid assigning problematic submissions to reviewers in the first place.

@jedbrown
Member

I wrote this earlier on Slack:

I consider it fundamentally disrespectful to inflict language model output on humans, akin to trying to pass counterfeit money or forged artwork. LMs can inflate a small prompt (which may or may not be factual or insightful) into a larger text optimized to deceive the reader into thinking it is conveying understanding. The art of technical writing is to convey a lot of knowledge in a direct, precise, and concise form. The internal representations of LMs have no relation to facts or knowledge, and largely exist on a spectrum from plagiarism to bullshit. (The companies believe "AI" will be a get-out-of-court-free card for behavior that would be a clear IP law violation if a human were given the training content and produced the output.) It is true that LM bans are largely unenforceable, but I believe we should be clear that it is unacceptable for JOSS.

It's true that there is no sharp line between grammar-checking and generating paragraphs of text. The pattern of human perception and behavior is that even when people say they have proof-read/edited paragraphs of generated text, it is still likely to contain errors and false implications. Elsevier has papers published months ago with the text "I'm very sorry, [...], as I am an AI language model" that have been publicly called out, yet remain unedited and unretracted. (I'm delighted that such "papers" are unciteable in Science, and suggest they apply that standard to the entire journal.)

As a reviewer thinking I'm providing feedback to a person, I would feel violated to learn I was reviewing generated text. The same applies if I were to learn that as a reader, and it would cast doubt on the journal's integrity and practices that such a thing could pass review. The intent and value of a publication have to be more than the bean. I think grammar-checking is fine, as well as narrow use in translation, but writing a paper in one language and bulk translating it presents a range of problematic second-order effects as @oliviaguest noted on Slack (besides increased chance of factual errors).

LMs promise a short-cut to forming a coherent mental model and communicating that. That is very much the incentive for plagiarism, but there is a perception that LM-generated content is somehow victimless while plagiarism is theft from the original author. I would stipulate they are of the same cloth, and that both are also subjecting the reader without consent. If a human took verbatim text/code from one or more sources and applied token-level obfuscation (changing variable names, loop structures, synonym substitution, etc.), they are still plagiarizing and the resulting code is considered a derivative work. (It is hard to prove this without historical records, thus courts will scrutinize instances in which humans were sloppy.) LMs automate this obfuscation while shredding the records (which would be evidence of intent in a human system) and promising plausible deniability. Clean-room design is the way to ensure clean IP, but LMs do not and can not work in this way. I consider it poisoning the fabric of society when LM boosters attempt to reduce human cognition to token manipulation or claim that it doesn't matter.

TL;DR: We need to consider the broader ethical and social context as well as JOSS' reputation, not just current law and a blanket statement. Specific affirmations sort out the accidentally-sloppy from the malicious.

@kthyng
Contributor

kthyng commented Apr 18, 2024

Just a thought or two:

This seems very important for us to get ahead of, but it is also different from the typical examples being considered, since as a journal we are not as focused on written text, though we still include and review some. Do we need different rules for text and code, or can they be treated the same? That is, what might be the guideline for using ChatGPT to assist in checking one's writing vs. GitHub Copilot to assist in writing one's code?

Also regarding the example given by @logological, given that we are working with codebases that have at least some history (and often a long history), wouldn't we be less likely to have random AI-generated repositories submitted to us? Or do you think that is the potential future of submissions we will see?

@oliviaguest
Member

@kthyng:

wouldn't we be less likely to have random AI-generated repositories submitted to us? Or do you think that is the potential future of submissions we will see?

I obviously cannot know if we will get random, obviously silly codebases created by AI tools (which are likely plagiarised, or even stolen, or at a minimum do not credit the authors) that we would likely desk-reject for reasons of quality (as well as history, like you hint at: no sensible commit history)... but you raise a very good question. In my teaching, with student work such as essays and other situations where such tools can be used to cheat (I am teaching them to write an essay, and they are not writing an essay), my experience has been that it is glaringly obvious and that they freely admit to it. I am sharing this not to argue for a specific outcome or action, but just so others know what (fellow) educators have been seeing.
