OpenAI. 2023. Gpt-4 technical report. ArXiv,
abs/2303.08774.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida,
Carroll Wainwright, Pamela Mishkin, Chong Zhang,
Sandhini Agarwal, Katarina Slama, Alex Ray, et al.
2022. Training language models to follow instruc-
tions with human feedback. Advances in Neural
Information Processing Systems, 35:27730–27744.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. Bleu: a method for automatic evalu-
ation of machine translation. In Proceedings of the
40th annual meeting of the Association for Computa-
tional Linguistics, pages 311–318.
Prasanna Parthasarathi, Joelle Pineau, and Sarath Chan-
dar. 2020. How to evaluate your dialogue system:
Probe tasks as an alternative for token-level evalua-
tion metrics. arXiv preprint arXiv:2008.10427.
Adam Paszke, Sam Gross, Francisco Massa, Adam
Lerer, James Bradbury, Gregory Chanan, Trevor
Killeen, Zeming Lin, Natalia Gimelshein, Luca
Antiga, et al. 2019. Pytorch: An imperative style,
high-performance deep learning library. Advances
in neural information processing systems, 32:8026–
8037.
Baolin Peng, Michel Galley, Pengcheng He, Chris
Brockett, Lars Liden, Elnaz Nouri, Zhou Yu, Bill
Dolan, and Jianfeng Gao. 2022. Godel: Large-scale
pre-training for goal-directed dialog. arXiv preprint
arXiv:2206.11309.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine
Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
Wei Li, and Peter J Liu. 2019. Exploring the limits
of transfer learning with a unified text-to-text trans-
former. arXiv preprint arXiv:1910.10683.
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and
Yuxiong He. 2020. Deepspeed: System optimiza-
tions enable training deep learning models with over
100 billion parameters. In Proceedings of the 26th
ACM SIGKDD International Conference on Knowl-
edge Discovery & Data Mining, pages 3505–3506.
Shachar Rosenman, Alon Jacovi, and Yoav Goldberg.
2020. Exposing shallow heuristics of relation ex-
traction models with challenge data. In Proceedings
of the 2020 Conference on Empirical Methods in
Natural Language Processing (EMNLP), pages 3702–
3710.
Chinnadhurai Sankar, Sandeep Subramanian, Christo-
pher Pal, Sarath Chandar, and Yoshua Bengio. 2019.
Do neural dialog systems use the conversation his-
tory effectively? an empirical study. In Proceedings
of the 57th Annual Meeting of the Association for
Computational Linguistics, pages 32–37.
Tal Schuster, Darsh Shah, Yun Jie Serene Yeo, Daniel
Roberto Filizzola Ortiz, Enrico Santus, and Regina
Barzilay. 2019. Towards debiasing fact verification
models. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP), pages
3410–3416.
Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju,
Eric Michael Smith, Stephen Roller, Megan Ung,
Moya Chen, Kushal Arora, Joshua Lane, et al. 2022.
Blenderbot 3: a deployed conversational agent that
continually learns to responsibly engage. arXiv
preprint arXiv:2208.03188.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann
Dubois, Xuechen Li, Carlos Guestrin, Percy Liang,
and Tatsunori B Hashimoto. 2023. Stanford alpaca:
An instruction-following llama model.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier
Martinet, Marie-Anne Lachaux, Timothée Lacroix,
Baptiste Rozière, Naman Goyal, Eric Hambro,
Faisal Azhar, et al. 2023. Llama: Open and effi-
cient foundation language models. arXiv preprint
arXiv:2302.13971.
Jiaan Wang, Yunlong Liang, Fandong Meng, Haoxiang
Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou.
2023. Is chatgpt a good nlg evaluator? a preliminary
study. arXiv preprint arXiv:2303.04048.
Sean Welleck, Jason Weston, Arthur Szlam, and
Kyunghyun Cho. 2019. Dialogue natural language
inference. In Proceedings of the 57th Annual Meet-
ing of the Association for Computational Linguistics,
pages 3731–3741.
Adina Williams, Nikita Nangia, and Samuel Bowman.
2018. A broad-coverage challenge corpus for sen-
tence understanding through inference. In Proceed-
ings of the 2018 Conference of the North American
Chapter of the Association for Computational Lin-
guistics: Human Language Technologies, Volume 1
(Long Papers), pages 1112–1122.
Thomas Wolf, Julien Chaumond, Lysandre Debut, Vic-
tor Sanh, Clement Delangue, Anthony Moi, Pier-
ric Cistac, Morgan Funtowicz, Joe Davison, Sam
Shleifer, et al. 2020. Transformers: State-of-the-
art natural language processing. In Proceedings of
the 2020 Conference on Empirical Methods in Nat-
ural Language Processing: System Demonstrations,
pages 38–45.
Hongbin Ye, Tong Liu, Aijia Zhang, Wei Hua, and
Weiqiang Jia. 2023. Cognitive mirage: A review
of hallucinations in large language models. arXiv
preprint arXiv:2309.06794.
Yi-Ting Yeh, Maxine Eskenazi, and Shikib Mehri. 2021.
A comprehensive assessment of dialog evaluation
metrics. In The First Workshop on Evaluations and
Assessments of Neural Conversation Systems, pages
15–33.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Wein-
berger, and Yoav Artzi. 2019a. Bertscore: Evaluating
text generation with bert. In International Confer-
ence on Learning Representations.