Unveiling AI Agent Vulnerabilities Part III: Data Exfiltration

In the third part of our series we demonstrate how risk intensifies in multi-modal AI agents, where hidden instructions embedded within innocuous-looking images or documents can trigger sensitive data exfiltration without any user interaction.

Technical BriefDownload the research paper

By Sean Park (Principal Threat Researcher)

Can an AI agent serve as a gateway for cyberattacks? Could attackers hijack a Large Language Model (LLM) capable of executing code to run harmful commands? Is it possible for hidden instructions in Microsoft Office documents to trick an AI Agent into leaking sensitive data? How easily can attackers manipulate database queries to extract restricted information?

These are fundamental challenges AI agents face today. This blog series sheds light on critical vulnerabilities in AI agents, offering a comprehensive examination of the threats that lurk beneath seemingly intelligent responses.

Key Takeaways

  • Indirect prompt injection uses external sources like web pages, images, and documents to covertly manipulate AI agents.
  • AI agents capable of interpreting multiple modes of input, such as text or images, are increasingly susceptible to prompt-based attacks hidden within content. These vulnerabilities can lead to sensitive data exfiltration without user interaction.
  • The Pandora Proof-of-Concept AI Agent demonstrates how advanced AI systems can process embedded malicious payloads, such as those in MS Word documents, to execute harmful code—highlighting the critical need for service-level safeguards.
  • Organizations must enforce strict security protocols, including access controls, advanced filtering, and real-time monitoring systems, to mitigate risks of data leakage and unauthorized actions in AI-enabled systems.

As AI systems become increasingly integrated into everyday life, indirect prompt injection has emerged as a major threat. Unlike direct attacks, these injections often originate from seemingly benign external sources, such as web pages or downloaded documents, and can manipulate AI agents into executing harmful or unintended actions.

This vulnerability stems from a limitation in large language models (LLMs): their inability to distinguish between authentic user input and malicious injected prompts. As a result, LLMs and LLM-powered AI agents are particularly susceptible to indirect prompt injection attacks.

In this entry, we demonstrate how this risk intensifies in multi-modal AI agents, where hidden instructions embedded within innocuous-looking images or documents can trigger sensitive data exfiltration without any user interaction. This highlights the pressing need for agent level safeguards and proactive security strategies against covert prompt-based attacks hidden within content.

Our research paper offers a comprehensive analysis of these issues and their broader implications.

This is the third part of a series that examines real-world AI agent vulnerabilities and evaluates their potential impact. The initial two installments cover:

  • Part I: Unveiling AI Agent Vulnerabilities: Introduces key security risks in AI agents, such as prompt injection and unauthorized code execution, and outlines the structure for subsequent discussions on data exfiltration, database exploitation, and mitigation strategies.
  • Part II: Code Execution Vulnerabilities: Explores how adversaries can exploit weaknesses in LLM-driven services to execute unauthorized code, bypass sandbox restrictions, and exploit flaws in error-handling mechanisms, which lead to data breaches, unauthorized data transfers, and persistent access within execution environments.

Using webpages, images, and documents for indirect prompt injection

Indirect prompt injection is a stealth attack method where malicious instructions are hidden in external data — such as web content, images, or documents — to manipulate AI behavior without the user’s knowledge. These attacks typically fall into three categories: web-based, image-based, and document-based.

Web-based attacks

AI agents parsing websites can interpret embedded malicious prompts (e.g., exfiltrating memory-stored data to an attacker email). If the service permits outbound traffic and email functionality, sensitive information like API keys or contact details can be leaked.

Image-based attacks (Zero-click exploits)

Multi-modal models like GPT-4o can extract hidden text from images — even those appearing blank. When such an image is uploaded, the model may execute harmful instructions without user interaction, posing a significant security risk.

Figure 1. The injected prompt within a seemingly blank image (top), but contains hidden text (bottom)

Figure 1. The injected prompt within a seemingly blank image (top), but contains hidden text (bottom)

Figure 1. The injected prompt within a seemingly blank image (top), but contains hidden text (bottom)

Document-based attacks

ChatGPT supports document uploads and can extract hidden text from applications like Microsoft Word. The model can even read and act on text marked as “hidden” using formatting (such as CTRL+SHIFT+H), enabling covert prompt injection.

These scenarios emphasize the importance of implementing strict access controls, monitoring, and prompt validation to prevent data exfiltration via indirect prompt injection.

What’s at risk?

Sensitive information stored within LLM services presents a valuable target for adversaries. This includes content from user conversations, uploaded documents, and persistent memory that the system was asked to retain. Common data types targeted in these attacks include:

  • Personal data: Names, email addresses, phone numbers, Social Security Numbers (SSNs)
  • Financial information: Bank details, credit card numbers
  • Health records: Protected health information (PHI)
  • Business secrets: Trade secrets, strategic plans, financial reports
  • Authentication credentials: API keys, access tokens, passwords
  • Uploaded files: Confidential business documents, government records, proprietary research

ChatGPT Data Analyst

Crafted documents containing hidden instructions present a serious threat to AI Agents and tools that support code execution, such as ChatGPT’s Data Analyst feature. When such a file is uploaded, the service may interpret embedded payloads as legitimate prompts and unknowingly execute them. This can lead to unintended script execution, granting attackers unauthorized access to user data that may then be encoded and transmitted to external destinations.

For example, a Microsoft Word document embedded with a malicious payload can cause ChatGPT to run code and extract the contents of the document. This execution occurred because the feature for code execution was manually enabled in the system settings. Under normal circumstances, this functionality is disabled by default to limit exposure.

Although AI Agents usually implement safeguards such as blocking access to dynamically generated URLs through abuse detection systems and rule-based filters, these measures can be circumvented. Attackers may exploit well-known or compromised domains that evade standard blocklists, allowing payloads to connect externally and leak data.

The possibility of adversaries exploiting compromised, high-prevalence domains to bypass existing restrictions raises significant concerns about data security. In the absence of effective service-level guardrails, attackers could successfully leverage indirect prompt injection to exfiltrate sensitive data. This would allow them to covertly access and extract confidential information without detection.

Figure 2. An example of a data exfiltration prompt

Figure 2. An example of a data exfiltration prompt

Figure 3. ChatGPT executing instructions within the payload of a crafted MS Word document

Figure 3. ChatGPT executing instructions within the payload of a crafted MS Word document

Pandora

Pandora is a proof-of-concept (PoC) AI Agent created by Trend Micro’s Forward-looking Threat Research (FTR) team to explore emerging security risks associated with advanced AI systems. It includes capabilities like unrestricted code execution and internet access, all within a Docker-based sandbox. These features allow it to dynamically run code and interact with external URLs.

In one test scenario we conducted, Pandora triggered a MULTIMODAL_CONTEXT event upon the upload of a malicious Word document titled CV – Actor.docx. This event simulated how AI systems like ChatGPT process uploaded files by extracting their contents and generating a context-aware prompt.

The document was embedded with a Python payload, which was subsequently extracted and shown, indicating that the system processed and understood the contents as part of its operational logic.

Figure 4. A MULTIMODAL_CONTEXT event (the text within the red box is the content retrieved from the uploaded file)

Figure 4. A MULTIMODAL_CONTEXT event (the text within the red box is the content retrieved from the uploaded file)

Once the malicious payload was activated by a crafted user query, the embedded Python code executed within Pandora’s environment. This resulted in the exfiltration of sensitive data to a command-and-control (C&C) server.

Figure 5. Triggering the Python tool

Figure 5. Triggering the Python tool

The interaction concluded with Pandora responding with a “Task complete” message — an output also dictated by the injected code. The incident highlights a key insight: even when using a robust foundational model like GPT-4o, vulnerabilities can still arise if service-level protections are not in place. Pandora serves as a stark reminder of the need for comprehensive guardrails beyond what the base model provides.

Conclusion

While AI Agents and LLMs offer immense potential, they can also be abused for malicious attacks if not properly secured. One of the most pressing concerns involves indirect prompt injection, particularly when combined with multi-modal capabilities that allow malicious content to bypass traditional defenses.

To address these challenges, organizations must take a proactive approach by enforcing robust access controls, applying advanced content filtering, and deploying real-time monitoring systems. These measures are essential to reducing the risk of data leakage, unauthorized actions, and other forms of AI-enabled exploitation.

As AI systems grow more sophisticated, so too must the security measures surrounding them. Ensuring safe and responsible deployment depends not just on the capabilities of the models themselves, but on the strength of the service-level safeguards that control their use.

Addressing the threat of indirect prompt injection requires a comprehensive, layered approach.

Organizations should consider the following countermeasures:

  • Implement network-level restrictions to prevent connections to unverified or potentially harmful URLs.
  • Deploy sophisticated filters to analyze uploaded content for concealed instructions.
  • Use optical character recognition (OCR) and image enhancement techniques to detect hidden text within visual content.
  • Employ moderation systems and threat detection models to identify and neutralize embedded command attempts.
  • Clean and pre-process user input to strip out or isolate potentially dangerous prompt content.
  • Record all interactions and actively monitor for irregular or suspicious language model behaviors that may indicate exploitation.
HIDE

Like it? Add this infographic to your site:
1. Click on the box below.   2. Press Ctrl+A to select all.   3. Press Ctrl+C to copy.   4. Paste the code into your page (Ctrl+V).

Image will appear the same size as you see above.

OSZAR »