
The Fragility of AI Trust
How minor prompt changes dramatically affect ChatGPT's classification results
This research empirically demonstrates the inconsistency of GPT-4o mini when performing sentiment analysis, raising serious concerns about reliability in production environments.
- Testing 100,000 Spanish-language comments about Latin American presidents revealed significant classification variations from only slight prompt modifications (see the sketch after this list)
- The findings call into question the robustness of LLMs for consistent classification tasks
- Highlights critical security implications for organizations relying on LLMs for decision-making
- Underscores the need for standardized prompting protocols to ensure reliable AI outputs
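As a rough illustration of the prompt-sensitivity effect described above, the sketch below sends the same comment to GPT-4o mini under two nearly identical prompt wordings and compares the labels. The prompt phrasings, the sample comment, and the `classify` helper are illustrative assumptions, not the study's actual materials or code.

```python
# Minimal sketch of a prompt-sensitivity probe for sentiment classification.
# Assumes the openai Python SDK is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

COMMENT = "El presidente está haciendo un buen trabajo con la economía."

# Two prompts that differ only in minor phrasing.
PROMPT_VARIANTS = [
    "Classify the sentiment of this comment as positive, negative, or neutral: ",
    "Classify the sentiment of the following comment as positive, negative or neutral:\n",
]

def classify(prompt_prefix: str, comment: str) -> str:
    """Ask gpt-4o-mini for a one-word sentiment label."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # deterministic settings still do not guarantee stable labels
        messages=[{"role": "user", "content": prompt_prefix + comment}],
    )
    return response.choices[0].message.content.strip().lower()

labels = [classify(p, COMMENT) for p in PROMPT_VARIANTS]
print(labels)  # semantically identical prompts; the labels may still disagree
```

Run over a large corpus, disagreements between such variants give a simple measure of how fragile the classifier's outputs are to prompt wording.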
For security professionals, this research serves as a crucial warning: LLMs may deliver substantially different results based on subtle prompt differences, potentially introducing dangerous inconsistencies into automated systems.