A Linguistic Comparison between Human and ChatGPT-Generated Conversations

Authors: Morgan Sandler, Hyesun Choung, Arun Ross, Prabu David

To appear in the 4th edition of the International Conference on Pattern Recognition and Artificial Intelligence. 3-6 July 2024 in Jeju, Korea.

Paper link (ArXiv). Citations below

BibTex citation

@inproceedings{sandler2024linguistic,
  title={A Linguistic Comparison between Human and ChatGPT-Generated Conversations},
  author={Sandler, Morgan and Choung, Hyesun and Ross, Arun and David, Prabu},
  booktitle={4th International Conference on Pattern Recognition and Artificial Intelligence (ICPRAI)},
  year={2024},
  organization={IAPR}
}

M. Sandler, H. Choung, A. Ross, and P. David, “A Linguistic Comparison between Human and ChatGPT-Generated Conversations,” in the 4th International Conference on Pattern Recognition and Artificial Intelligence (ICPRAI), 2024.

2GPTEmpathicDialogues Dataset, Code, and Analyses

Setup environment

The python/conda environment may be set up via:

conda env create -f environment.yml

Experiment setup from the paper.

Download 2GPTEmpathicDialogues Dataset

To download the human-generated dialogues, refer to the original paper by Rashkin et al, 2019 and their corresponding code repository. The ChatGPT-generated (ChatGPT3.5) dialogues may be download via this link. Corresponding embeddings of the 2GPTEmpathicDialogues dataset can be downloaded here. These were used in the following visualization from the paper:

Example dialogue from the 2GPTEmpathicDialogues dataset (from the paper)

To generate 2GPTEmpathicDialogues from scratch:

Proofread and run 2gpt_empathy_conv_gen.py. Requires an OpenAI API key. Note: the model used was gpt-3.5-turbo. At the time, that was the best available option. GPT-4 now has API key access with more affordable options. Don't forget to update that line in the code if you are intending to use GPT-4.

Obtaining and visualizing the embeddings of the ChatGPT-generated and human-generated dialogues.

To obtain the dialogue embeddings use compute_dialogue_embeddings.py. This code can be reused for the human-generated and ChatGPT-generated dialogues. See TODOs in the file for more.
To visualize the 3-D UMAP viz of the dialogue embeddings and obtain the Dunn index, use vizualize_dialogue_embeddings.py

How to run valence classification experiments:

Run the ValenceClassification.py file. Check the TODOs for the required embeddings file input. Note: this code is currently set up for valence classification of the ChatGPT-generated embeddings, but can be re-used for the human-generated embeddings as well (TODOs explain).

Valence classification results (from the paper)

Linguistic analysis results

Note: separate statistical software was used for the linguistic analysis. Additionally, LIWC is a proprietary software and must be obtained by the appropriate means. See this website for more.

Appendix mentioned in the paper

Summary statistics and statistical significance tests for all 118 linguistic categories from LIWC-22. Accessible here.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
paperfigures		paperfigures
.gitignore		.gitignore
2gpt_empathy_conv_gen.py		2gpt_empathy_conv_gen.py
Appendix1and2.xlsx		Appendix1and2.xlsx
FINAL_gptgenerated_umap_viz_3D.pdf		FINAL_gptgenerated_umap_viz_3D.pdf
FINAL_gptgenerated_umap_viz_3D.svg		FINAL_gptgenerated_umap_viz_3D.svg
FINAL_humangenerated_umap_viz_3D.pdf		FINAL_humangenerated_umap_viz_3D.pdf
FINAL_humangenerated_umap_viz_3D.svg		FINAL_humangenerated_umap_viz_3D.svg
LICENSE		LICENSE
README.md		README.md
ValenceClassification.py		ValenceClassification.py
compute_dialogue_embeddings.py		compute_dialogue_embeddings.py
environment.yml		environment.yml
requirements.txt		requirements.txt
visualize_dialogue_embeddings.py		visualize_dialogue_embeddings.py

License

morganlee123/2GPTEmpathicDialogues

Folders and files

Latest commit

History

Repository files navigation

A Linguistic Comparison between Human and ChatGPT-Generated Conversations

Authors: Morgan Sandler, Hyesun Choung, Arun Ross, Prabu David

Paper link (ArXiv). Citations below

2GPTEmpathicDialogues Dataset, Code, and Analyses

Setup environment

Experiment setup from the paper.

Download 2GPTEmpathicDialogues Dataset

Example dialogue from the 2GPTEmpathicDialogues dataset (from the paper)

To generate 2GPTEmpathicDialogues from scratch:

Obtaining and visualizing the embeddings of the ChatGPT-generated and human-generated dialogues.

How to run valence classification experiments:

Valence classification results (from the paper)

Linguistic analysis results

Appendix mentioned in the paper

About

Topics

Resources

License

Stars

Watchers

Forks

Languages