Preprint

Can GPT tell us why these images are synthesized? Empowering Multimodal Large Language Models for Forensics


Abstract

The rapid development of generative AI facilitates content creation but also makes image manipulation both easier to perform and more difficult to detect. While multimodal Large Language Models (LLMs) encode rich world knowledge, they are not inherently tailored for combating AI-generated content (AIGC) and struggle to comprehend local forgery details. In this work, we investigate the application of multimodal LLMs to forgery detection. We propose a framework capable of evaluating image authenticity, localizing tampered regions, providing evidence, and tracing generation methods based on semantic tampering clues. Our method demonstrates that the potential of LLMs in forgery analysis can be effectively unlocked through careful prompt engineering and few-shot learning techniques. In qualitative and quantitative experiments, GPT-4V achieves an accuracy of 92.1% on AutoSplice and 86.3% on LaMa, competitive with state-of-the-art AIGC detection methods. We further discuss the limitations of multimodal LLMs in such tasks and propose potential improvements.
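The full paper is not available from this page, but the pipeline the abstract describes (few-shot prompting of a multimodal LLM to judge authenticity, localize tampering, cite evidence, and trace the generation method) can be sketched against a public multimodal API. The snippet below is a minimal illustration, not the authors' implementation: the prompt wording, the few-shot exemplar images, the file paths, and the use of the OpenAI Chat Completions client are all assumptions.

```python
import base64
from openai import OpenAI  # assumed client; the paper's actual tooling is unspecified

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path: str) -> dict:
    """Pack a local image as a base64 data-URL content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

# Task instruction mirroring the four outputs the abstract lists:
# verdict, tampered-region localization, evidence, and generation-method tracing.
INSTRUCTION = (
    "You are an image-forensics analyst. For the final image, answer: "
    "(1) REAL or FAKE; (2) if fake, which region looks tampered; "
    "(3) the semantic clues supporting your verdict; "
    "(4) the likely generation/editing method (e.g., splicing, inpainting, full synthesis)."
)

# Hypothetical few-shot exemplars: labeled images paired with short analyses.
few_shot = [
    ("examples/spliced.jpg", "FAKE. The inserted object's lighting direction "
     "contradicts the rest of the scene; likely splicing."),
    ("examples/authentic.jpg", "REAL. Consistent lighting, shadows, and textures."),
]

content = [{"type": "text", "text": INSTRUCTION}]
for path, analysis in few_shot:
    content.append(encode_image(path))
    content.append({"type": "text", "text": f"Example analysis: {analysis}"})
content.append(encode_image("query.jpg"))  # the image under investigation
content.append({"type": "text", "text": "Now analyze this final image."})

response = client.chat.completions.create(
    model="gpt-4o",  # stand-in for the GPT-4V model evaluated in the paper
    messages=[{"role": "user", "content": content}],
    temperature=0,   # deterministic verdicts for evaluation
)
print(response.choices[0].message.content)
```

Note that the accuracies reported in the abstract come from the paper's own prompts and evaluation protocol; this sketch only shows the general few-shot multimodal prompting pattern.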

Article
Full-text available
Generative Adversarial Network (GAN) based techniques can generate and synthesize realistic faces that cause profound social concerns and security problems. Existing methods for detecting GAN-generated faces can perform well on limited public datasets. However, images from existing datasets do not represent real-world scenarios well enough in terms of view variations and data distributions, where real faces largely outnumber synthetic ones. The state-of-the-art methods do not generalize well in real-world problems and lack the interpretability of detection results. Performance of existing GAN-face detection models degrades accordingly when facing data imbalance issues. To address these shortcomings, we propose a robust, attentive, end-to-end framework that spots GAN-generated faces by analyzing eye inconsistencies. Our model automatically learns to identify inconsistent eye components by localizing and comparing artifacts between eyes. After the iris regions are extracted by Mask-RCNN, we design a Residual Attention Network (RAN) to examine the consistency between the corneal specular highlights of the two eyes. Our method can effectively learn from imbalanced data using a joint loss function combining the traditional cross-entropy loss with a relaxation of the ROC-AUC loss via Wilcoxon-Mann-Whitney (WMW) statistics. Comprehensive evaluations on a newly created FFHQ-GAN dataset in both balanced and imbalanced scenarios demonstrate the superiority of our method.
Article
State-of-the-art (SOTA) Generative Models (GMs) can synthesize photo-realistic images that are hard for humans to distinguish from genuine photos. Identifying and understanding manipulated media are crucial to mitigate the social concerns on the potential misuse of GMs. We propose to perform reverse engineering of GMs to infer model hyperparameters from the images generated by these models. We define a novel problem, "model parsing", as estimating GM network architectures and training loss functions by examining their generated images - a task seemingly impossible for human beings. To tackle this problem, we propose a framework with two components: a Fingerprint Estimation Network (FEN), which estimates a GM fingerprint from a generated image by training with four constraints to encourage the fingerprint to have desired properties, and a Parsing Network (PN), which predicts network architecture and loss functions from the estimated fingerprints. To evaluate our approach, we collect a fake image dataset with 100K images generated by 116 different GMs. Extensive experiments show encouraging results in parsing the hyperparameters of the unseen models. Finally, our fingerprint estimation can be leveraged for deepfake detection and image attribution, as we show by reporting SOTA results on both the deepfake detection (Celeb-DF) and image attribution benchmarks.
Article
Facial forgery by DeepFake has recently attracted more public attention. Face image contains sensitive personal information, abuse of such technology will grow into a menace. Since the difference between real and fake faces is usually subtle and local, the general detection framework of applying the backbone network to capture the global features of the entire face and then feeding it into the binary classifier is not optimal. In addition, patch-based schemes are widely used in various computer vision tasks, including image classification. However, how to extract features for location-specific and arbitrary-shaped patches while preserving their original information and spoof patterns as much as possible requires further exploration. In this paper, a novel deep forgery detector called Patch-DFD is proposed, which applies a patch-based solution of Facial Patch Mapping (FPM) to obtain several part-based feature maps, preserving original details of each facial patch to the greatest extent. Besides, the BM-pooling module aims to fix the size of the feature maps while reducing quantization errors. The local voting strategy is finally used to fuse the results of parts detectors, so as to more accurately identify the fake faces generated by deep generative models. Compared to typical patch-wise framework that takes patch inputs, our scheme is more efficient due to the absence of repeated convolution operations. Moreover, extensive experiments conducted on publicly available face forensics datasets have proved that the effectiveness of our framework.
Article
In recent years, generative adversarial networks (GANs) have been widely used to generate realistic fake face images, which can easily deceive human beings. To detect these images, some methods have been proposed. However, their detection performance will be degraded greatly when the testing samples are post-processed. In this paper, some experimental studies on detecting post-processed GAN-generated face images find that (a) both the luminance component and chrominance components play an important role, and (b) the RGB and YCbCr color spaces achieve better performance than the HSV and Lab color spaces. Therefore, to enhance the robustness, both the luminance component and chrominance components of dual-color spaces (RGB and YCbCr) are considered to utilize color information effectively. In addition, the convolutional block attention module and multilayer feature aggregation module are introduced into the Xception model to enhance its feature representation power and aggregate multilayer features, respectively. Finally, a robust dual-stream network is designed by integrating dual-color spaces RGB and YCbCr and using an improved Xception model. Experimental results demonstrate that our method outperforms some existing methods, especially in its robustness against different types of post-processing operations, such as JPEG compression, Gaussian blurring, gamma correction, and median filtering.
  • Xiao Bi
  • Deli Chen
  • Guanting Chen
  • Shanhuang Chen
  • Damai Dai
  • Chengqi Deng
  • Honghui Ding
  • Kai Dong
  • Qiushi Du
  • Zhe Fu
Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. 2024. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954 (2024).
Language models are few-shot learners
  • Tom Brown
  • Benjamin Mann
  • Nick Ryder
  • Melanie Subbiah
  • Jared D Kaplan
  • Prafulla Dhariwal
  • Arvind Neelakantan
  • Pranav Shyam
  • Girish Sastry
  • Amanda Askell
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877-1901.
What makes fake images detectable? understanding properties that generalize
  • Lucy Chai
  • David Bau
  • Ser-Nam Lim
  • Phillip Isola
Lucy Chai, David Bau, Ser-Nam Lim, and Phillip Isola. 2020. What makes fake images detectable? understanding properties that generalize. In Computer vision-ECCV 2020: 16th European conference, Glasgow, UK, August 23-28, 2020, proceedings, part XXVI 16. Springer, 103-120.
Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality
  • Wei-Lin Chiang
  • Zhuohan Li
  • Ziqing Lin
  • Ying Sheng
  • Zhanghao Wu
  • Hao Zhang
  • Lianmin Zheng
  • Siyuan Zhuang
  • Yonghao Zhuang
  • Joseph E Gonzalez
Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023) 2, 3 (2023), 6.
On the detection of synthetic images generated by diffusion models
  • Riccardo Corvi
  • Davide Cozzolino
  • Giada Zingarini
  • Giovanni Poggi
  • Koki Nagano
  • Luisa Verdoliva
Riccardo Corvi, Davide Cozzolino, Giada Zingarini, Giovanni Poggi, Koki Nagano, and Luisa Verdoliva. 2023. On the detection of synthetic images generated by diffusion models. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1-5.
Unmasking deepfakes with simple features
  • Ricard Durall
  • Margret Keuper
  • Franz-Josef Pfreundt
  • Janis Keuper
Ricard Durall, Margret Keuper, Franz-Josef Pfreundt, and Janis Keuper. 2019. Unmasking deepfakes with simple features. arXiv preprint arXiv:1911.00686 (2019).
Caltech 10k Web Faces
  • Michael Fink
  • Pietro Perona
Michael Fink and Pietro Perona. 2022. Caltech 10k Web Faces. doi:10.22002/D1. 20132
Leveraging frequency analysis for deep fake image recognition
  • Joel Frank
  • Thorsten Eisenhofer
  • Lea Schönherr
  • Asja Fischer
  • Dorothea Kolossa
  • Thorsten Holz
Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. 2020. Leveraging frequency analysis for deep fake image recognition. In International conference on machine learning. PMLR, 3247-3258.
Guiding instruction-based image editing via multimodal large language models
  • Tsu-Jui Fu
  • Wenze Hu
  • Xianzhi Du
  • William Yang Wang
  • Yinfei Yang
  • Zhe Gan
Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, and Zhe Gan. 2023. Guiding instruction-based image editing via multimodal large language models. arXiv preprint arXiv:2309.17102 (2023).
Llava-med: Training a large language-and-vision assistant for biomedicine in one day
  • Chunyuan Li
  • Cliff Wong
  • Sheng Zhang
  • Naoto Usuyama
  • Haotian Liu
  • Jianwei Yang
  • Tristan Naumann
  • Hoifung Poon
  • Jianfeng Gao
Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. 2023. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems 36 (2023), 28541-28564.
  • Fei-Fei Li
  • Marco Andreeto
  • Marc'aurelio Ranzato
  • Pietro Perona
Fei-Fei Li, Marco Andreeto, Marc'Aurelio Ranzato, and Pietro Perona. 2022. Caltech 101. doi:10.22002/D1.20086
Forgerygpt: Multimodal large language model for explainable image forgery detection and localization
  • Jiawei Li
  • Fanrui Zhang
  • Jiaying Zhu
  • Esther Sun
  • Qiang Zhang
Jiawei Li, Fanrui Zhang, Jiaying Zhu, Esther Sun, Qiang Zhang, and Zheng-Jun Zha. 2024. Forgerygpt: Multimodal large language model for explainable image forgery detection and localization. arXiv preprint arXiv:2410.10238 (2024).
Visual instruction tuning
  • Haotian Liu
  • Chunyuan Li
  • Qingyang Wu
  • Yong Jae Lee
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. Advances in neural information processing systems 36 (2023), 34892-34916.
  • Tingkai Liu
  • Yunzhe Tao
  • Haogeng Liu
  • Qihang Fan
  • Ding Zhou
  • Huaibo Huang
Tingkai Liu, Yunzhe Tao, Haogeng Liu, Qihang Fan, Ding Zhou, Huaibo Huang, Ran He, and Hongxia Yang. 2023. DeVAn: Dense Video Annotation for Video-Language Models. arXiv preprint arXiv:2310.05060 (2023).
Thinking in frequency: Face forgery detection by mining frequency-aware clues
  • Yuyang Qian
  • Guojun Yin
  • Lu Sheng
  • Zixuan Chen
  • Jing Shao
Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. 2020. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In European conference on computer vision. Springer, 86-103.
Improving Language Understanding by Generative Pre-Training
  • Alec Radford
  • Karthik Narasimhan
Alec Radford and Karthik Narasimhan. 2018. Improving Language Understanding by Generative Pre-Training.
Pandagpt: One model to instruction-follow them all
  • Yixuan Su
  • Tian Lan
  • Huayang Li
  • Jialu Xu
  • Yan Wang
  • Deng Cai
Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. 2023. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355 (2023).
  • Ming Wang
  • Yuanzhong Liu
  • Xiaoyu Liang
  • Songlian Li
  • Yijie Huang
  • Xiaoming Zhang
  • Sijia Shen
  • Chaofeng Guan
  • Daling Wang
  • Shi Feng
Ming Wang, Yuanzhong Liu, Xiaoyu Liang, Songlian Li, Yijie Huang, Xiaoming Zhang, Sijia Shen, Chaofeng Guan, Daling Wang, Shi Feng, et al. 2024. LangGPT: Rethinking structured reusable prompt design framework for LLMs from the programming language. arXiv preprint arXiv:2402.16929 (2024).
  • Xuansheng Wu
  • Jiayi Yuan
  • Wenlin Yao
Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming Zhai, and Ninghao Liu. 2025. Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders. arXiv preprint arXiv:2502.15576 (2025).
Fakeshield: Explainable image forgery detection and localization via multimodal large language models
  • Zhipei Xu
  • Xuanyu Zhang
  • Runyi Li
  • Zecheng Tang
  • Qing Huang
  • Jian Zhang
Zhipei Xu, Xuanyu Zhang, Runyi Li, Zecheng Tang, Qing Huang, and Jian Zhang. 2024. Fakeshield: Explainable image forgery detection and localization via multimodal large language models. arXiv preprint arXiv:2410.02761 (2024).
Thinking in patch: Towards generalizable forgery detection with patch transformation
  • Xueqi Zhang
  • Shuo Wang
  • Chenyu Liu
  • Min Zhang
  • Xiaohan Liu
  • Haiyong Xie
Xueqi Zhang, Shuo Wang, Chenyu Liu, Min Zhang, Xiaohan Liu, and Haiyong Xie. 2021. Thinking in patch: Towards generalizable forgery detection with patch transformation. In PRICAI 2021: Trends in Artificial Intelligence: 18th Pacific Rim International Conference on Artificial Intelligence, PRICAI 2021, Hanoi, Vietnam, November 8-12, 2021, Proceedings, Part III 18. Springer, 337-352.
Minigpt-4: Enhancing vision-language understanding with advanced large language models
  • Deyao Zhu
  • Jun Chen
  • Xiaoqian Shen
  • Xiang Li
  • Mohamed Elhoseiny
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023).
Progressive feedbackenhanced transformer for image forgery localization
  • Haochen Zhu
  • Gang Cao
  • Xianglin Huang
Haochen Zhu, Gang Cao, and Xianglin Huang. 2023. Progressive feedbackenhanced transformer for image forgery localization. arXiv preprint arXiv:2311.08910 (2023).