Research reveals the credibility risks of GPT models, with privacy protection and bias issues still needing to be addressed.

2025-08-09 21:00:25

Abstract generation in progress

Research on the Trustworthiness Assessment of Large Language Models Reveals Potential Vulnerabilities

A study conducted in collaboration by institutions such as the University of Illinois at Urbana-Champaign, Stanford University, and the University of California, Berkeley, has comprehensively evaluated the credibility of the generative pre-trained transformer model (GPT). The research team developed a comprehensive assessment platform and detailed their findings in the recently published paper "Decoding Trust: A Comprehensive Assessment of the Credibility of GPT Models."

The research findings reveal some previously undisclosed vulnerabilities related to reliability. For example, the GPT model is prone to generating toxic and biased outputs, and it may also leak private information from training data and conversation history. Although GPT-4 is generally more reliable than GPT-3.5 in standard tests, it is more susceptible to attacks when faced with malicious prompts designed to circumvent security measures. This may be because GPT-4 adheres more strictly to misleading instructions.

The research team conducted a comprehensive evaluation of the GPT model from eight different perspectives, including adversarial robustness, toxicity and bias, privacy leakage, and more. For example, when assessing the robustness against text adversarial attacks, the researchers designed three scenarios: standard benchmark tests, tests under different task instructions, and self-constructed more challenging adversarial text tests.

In terms of toxicity and bias, research has found that GPT models generally exhibit little bias towards most stereotype topics. However, under misleading system prompts, the model may be induced to agree with biased content. Compared to GPT-3.5, GPT-4 is more susceptible to targeted misleading prompts. The degree of bias in the model also depends on the sensitivity of the specific groups and topics mentioned by the user.

In terms of privacy protection, research has found that GPT models may leak sensitive information from training data, such as email addresses. In some cases, leveraging supplementary knowledge can significantly improve the accuracy of information extraction. Additionally, the model may also leak private information injected into the conversation history. Overall, GPT-4 performs better than GPT-3.5 in protecting personal identity information, but both encounter issues when faced with privacy leak demonstrations.

The research team stated that this assessment work aims to encourage more researchers to participate and work together to create more robust and trustworthy models. To promote collaboration, they have made the evaluation benchmark code publicly available, which is highly scalable and user-friendly. At the same time, researchers have also shared their findings with relevant companies to take timely measures to address potential vulnerabilities.

This research provides a comprehensive perspective on the credibility assessment of GPT models, revealing the strengths and weaknesses of existing models. With the widespread application of large language models in various fields, these findings are significant for enhancing the safety and reliability of AI systems.

GPT0.19%

View Original

This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.

7 Likes