Background
On December 4, 2025, the Apache Software Foundation disclosed CVE-2025-66516, a critical XML External Entity (XXE) injection vulnerability in Apache Tika rated CVSS 10.0. The vulnerability affects multiple core components of Apache Tika, including tika-core (versions 1.13-3.2.1), tika-parser-pdf-module (versions 2.0.0-3.2.1), and tika-parsers (versions 1.13-1.28.5).
This advisory expands upon the previously disclosed CVE-2025-54988 (CVSS 8.4) from August 2025, clarifying the full scope of affected artifacts. While the original report identified the PDF parser module as the entry point, the underlying vulnerability and its fix reside in tika-core, meaning organizations that only patched the PDF module remain vulnerable.
The attack vector is unauthenticated and remote: exploitation requires only that Tika process a specially crafted PDF file. Given Apache Tika's widespread use in document processing pipelines, search indexing systems, content analysis platforms, and security tools, the potential impact spans numerous sectors, including finance, legal, government, and media.
Technical Details
The vulnerability stems from improper XML entity processing in Apache Tika's handling of XFA (XML Forms Architecture) content embedded within PDF documents. XFA is an XML-based specification used to define form elements and data within PDFs, and Tika parses this content during document analysis and metadata extraction.
An attacker can craft a malicious PDF whose XFA data contains external XML entity references. When Tika processes the document, it resolves these entities without proper validation, which enables three classes of attack. First, the attacker can read arbitrary files from the server filesystem, potentially exposing sensitive configuration files, credentials, or application data. Second, the attacker can perform Server-Side Request Forgery (SSRF): probing internal networks, reaching cloud metadata services, or interacting with internal APIs that are not exposed to the internet. Third, entity expansion attacks can consume excessive memory or CPU and cause Denial of Service, particularly in resource-constrained environments.
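A common defense against this class of bug, recommended in OWASP's XXE guidance, is to refuse document type declarations outright so that no entity can be declared in the first place. The sketch below illustrates that principle with Python's standard-library expat bindings. It is not Tika's actual patch, which lives in tika-core's Java code and is not reproduced here; `parse_without_dtd` and `DoctypeForbidden` are hypothetical names chosen for this example.

```python
import xml.parsers.expat


class DoctypeForbidden(ValueError):
    """Raised when an XML document contains a DOCTYPE declaration."""


def parse_without_dtd(xml_bytes: bytes) -> list:
    """Parse XML, rejecting any DOCTYPE; return element names in document order."""
    names = []
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Entities can only be declared inside a DTD, so refusing the DTD
        # removes both external-entity and entity-expansion attacks.
        raise DoctypeForbidden(f"DOCTYPE {name!r} is not allowed")

    def on_start_element(name, attrs):
        names.append(name)

    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start_element
    parser.Parse(xml_bytes, True)  # True marks this as the final chunk
    return names
```

Well-formed XML without a DOCTYPE parses normally, while any document that declares one fails before its content is read. The trade-off is that legitimate documents relying on a DTD are rejected too, which is usually acceptable for untrusted input.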
The critical nature of this vulnerability is amplified by Apache Tika's typical deployment pattern. Many applications automatically process uploaded documents for indexing, preview generation, or content extraction. In such environments, simply uploading a malicious PDF is sufficient to trigger the vulnerability; no user interaction or authentication beyond upload access is required.
Affected Products
Vulnerable Versions:
- Apache Tika core (org.apache.tika:tika-core): 1.13 through 3.2.1
- Apache Tika PDF parser module (org.apache.tika:tika-parser-pdf-module): 2.0.0 through 3.2.1
- Apache Tika parsers (org.apache.tika:tika-parsers): 1.13 through 1.28.5 (all releases from 1.13 up to, but not including, 2.0.0)
Patched Versions:
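- Apache Tika core: 3.2.2 or later
- Apache Tika PDF parser module: 3.2.2 or later, deployed together with tika-core 3.2.2 or later
- Apache Tika parsers: 2.0.0 or later (for 1.x users), with tika-core at 3.2.2 or later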
Critical Clarification:
The scope expansion in CVE-2025-66516 addresses two key oversights from the original CVE-2025-54988 disclosure. First, while the PDF parser module was identified as the entry point, the actual vulnerability exists in tika-core. Organizations that upgraded tika-parser-pdf-module without also upgrading tika-core to version 3.2.2 or later remain vulnerable. Second, in Tika's 1.x release series, the PDF parser was bundled within the tika-parsers module rather than shipped as a separate artifact. These legacy deployments were not explicitly called out in the initial advisory.
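The version ranges above can be encoded as a small check. The following is a minimal sketch that assumes plain dotted version strings; the range table is copied from this advisory, while `parse_version` and `is_vulnerable` are hypothetical helper names. Pre-release qualifiers such as "-BETA" are ignored, which is a simplification.

```python
import re

# Inclusive vulnerable ranges, as listed in this advisory.
VULNERABLE_RANGES = {
    "tika-core": ((1, 13, 0), (3, 2, 1)),
    "tika-parser-pdf-module": ((2, 0, 0), (3, 2, 1)),
    "tika-parsers": ((1, 13, 0), (1, 28, 5)),
}


def parse_version(version: str) -> tuple:
    """Turn '3.2.1' or '2.0.0-BETA' into a three-part integer tuple."""
    numeric = re.match(r"\d+(?:\.\d+)*", version)
    if numeric is None:
        raise ValueError(f"unrecognized version: {version!r}")
    parts = [int(p) for p in numeric.group(0).split(".")]
    parts += [0] * (3 - len(parts))  # pad '1.13' to (1, 13, 0)
    return tuple(parts[:3])


def is_vulnerable(artifact: str, version: str) -> bool:
    """Return True if the artifact version falls in a listed vulnerable range."""
    bounds = VULNERABLE_RANGES.get(artifact)
    if bounds is None:
        return False  # artifact is not covered by this advisory
    low, high = bounds
    return low <= parse_version(version) <= high
```

A deployment is only safe when every Tika artifact it resolves passes the check. In particular, a patched tika-parser-pdf-module paired with tika-core 3.2.1 still fails on the tika-core entry.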
Indicators of Compromise
Organizations should immediately review logs for exploitation attempts targeting document processing endpoints. The vulnerability is exploited through malicious PDF uploads containing crafted XFA content.
Attack Signatures:
- PDF uploads with XFA content containing external entity declarations (DOCTYPE with SYSTEM or PUBLIC identifiers)
- Unusual file access patterns from Tika processes, particularly reads to sensitive files like /etc/passwd, configuration files, or credential stores
- Outbound network connections from Tika processes to unexpected destinations, especially cloud metadata endpoints or internal network ranges
- Resource exhaustion patterns indicating entity expansion attacks, such as memory spikes or CPU saturation during PDF processing
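The first signature in the list above can be approximated with a triage scan over stored uploads. The sketch below is a heuristic, not a PDF parser: it assumes XFA content sits either uncompressed or in Flate-compressed streams, and it will miss encrypted documents, other stream filters, and object streams. `find_external_declarations` is a hypothetical helper name.

```python
import re
import zlib

# Body of each PDF stream object (simplified: the /Length entry is ignored).
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A DOCTYPE or ENTITY declaration that carries a SYSTEM or PUBLIC identifier.
EXTERNAL_DECL_RE = re.compile(rb"<!(?:DOCTYPE|ENTITY)\b[^>]*?\b(?:SYSTEM|PUBLIC)\b")


def find_external_declarations(pdf_bytes: bytes) -> list:
    """Return the external-entity declarations found in raw or inflated streams."""
    blobs = [pdf_bytes]  # uncompressed streams are visible in the raw bytes
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the trailing newline before 'endstream'.
            blobs.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # not Flate data; already covered by the raw scan
    hits = []
    for blob in blobs:
        hits.extend(m.group(0) for m in EXTERNAL_DECL_RE.finditer(blob))
    return hits
```

A non-empty result is a reason to quarantine the file and inspect it manually. An empty result does not prove the file is benign.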
Log Search Queries:
- Application logs: Errors mentioning "EntityExpansionException", "DOCTYPE", or "ENTITY" during PDF processing
- Network logs: Outbound HTTP requests from application servers to internal IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) or cloud metadata services (169.254.169.254)
- File system audit logs: Unexpected file access by the Tika process or application server user, particularly to system files or configuration directories
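The application and network log searches above could be scripted along the following lines. This is a sketch under stated assumptions: the keyword list and address ranges are taken from this advisory, the function names are hypothetical, and extracting the destination address from a log record is left to whatever log format is in use.

```python
import ipaddress
from typing import Optional

# Keywords and ranges taken from the log search guidance in this advisory.
LOG_KEYWORDS = ("EntityExpansionException", "DOCTYPE", "ENTITY")
WATCHED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "169.254.169.254/32")
]


def keywords_in(log_line: str) -> list:
    """Return the watched keywords that appear in an application log line."""
    return [keyword for keyword in LOG_KEYWORDS if keyword in log_line]


def watched_network(destination: str) -> Optional[str]:
    """Return the watched range containing an outbound destination, if any."""
    try:
        address = ipaddress.ip_address(destination)
    except ValueError:
        return None  # a hostname, not an IP literal; resolve it separately
    for network in WATCHED_NETWORKS:
        if address in network:
            return str(network)
    return None
```

Matches are leads rather than proof: internal destinations can be legitimate for some deployments, so results should be compared against the expected behavior of the document processing service.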
Post-Exploitation Indicators:
- Data exfiltration through external entity references to attacker-controlled servers
- Lateral movement attempts from compromised document processing infrastructure
- Modified or newly created files in web directories without corresponding deployments
- Unauthorized access to internal services discovered through SSRF exploitation
Organizations identifying these indicators should immediately initiate incident response procedures and conduct comprehensive forensic analysis.
Recommendations
Organizations using Apache Tika must take immediate action to address this critical vulnerability.
Immediate Actions:
- Upgrade tika-core to version 3.2.2 or later. This is mandatory regardless of your tika-parser-pdf-module version.
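- For organizations running Tika 1.x, upgrade tika-parsers to version 2.0.0 or later, and confirm that the tika-core resolved alongside it is 3.2.2 or later, since tika-core releases through 3.2.1 are affected.
- Verify that all three artifacts (tika-core, tika-parser-pdf-module, tika-parsers) resolve to patched versions, and upgrade them together rather than one at a time.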
- If using Apache Tika as a transitive dependency through other libraries, audit your dependency tree and ensure all Tika components are updated.
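For Maven builds, one way to audit transitive dependencies is to parse the text output of `mvn dependency:tree` (optionally narrowed with `-Dincludes=org.apache.tika`). The sketch below assumes the usual `groupId:artifactId:type:version:scope` line format; entries that carry a classifier are skipped, other build tools need a different parser, and `tika_artifacts` is a hypothetical helper name.

```python
import re

# Matches coordinates such as org.apache.tika:tika-core:jar:3.2.1:compile
TIKA_COORD_RE = re.compile(
    r"org\.apache\.tika:(?P<artifact>[A-Za-z0-9_.-]+)"
    r":[A-Za-z0-9_-]+"                      # packaging type, e.g. jar
    r":(?P<version>\d[A-Za-z0-9_.-]*)"      # version must start with a digit
    r":(?:compile|runtime|provided|test|system)\b"
)


def tika_artifacts(tree_output: str) -> list:
    """Return unique (artifact, version) pairs for Tika entries, in order seen."""
    found = []
    for line in tree_output.splitlines():
        match = TIKA_COORD_RE.search(line)
        if match:
            pair = (match.group("artifact"), match.group("version"))
            if pair not in found:
                found.append(pair)
    return found
```

Each returned pair can then be compared against the affected version ranges listed in this advisory. Seeing the same artifact at two different versions usually means separate modules resolve Tika independently and each needs attention.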
Risk Assessment:
- Identify all applications processing untrusted PDF documents, particularly public-facing upload endpoints, email attachment processors, and document management systems.
- Map Tika deployments to understand potential blast radius, including systems with access to sensitive data or internal networks.
- Review network segmentation to determine if exploited Tika instances could reach critical internal resources.
Immediate Workarounds (if patching is not immediately feasible):
- Disable PDF parsing capability entirely by removing or excluding the PDF parser from your Tika configuration.
- Implement strict input validation and sanitization for uploaded files, though note this is not a complete mitigation.
- Deploy network-level controls to prevent Tika processes from making outbound connections or accessing sensitive internal resources.
- Consider processing untrusted documents in isolated, sandboxed environments with minimal privileges and network access.
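Where the PDF parser cannot be removed right away, an upload-layer gate can divert PDFs before they reach Tika. The sketch below is a partial stopgap under stated assumptions: it looks for the `%PDF-` marker near the start of the file, it does not see PDFs embedded inside archives, emails, or other container formats that Tika unpacks, and the function names and the "quarantine"/"parse" labels are hypothetical.

```python
PDF_MAGIC = b"%PDF-"
# Assumption: lenient PDF readers accept the header anywhere in the first
# 1024 bytes, so the gate checks that whole window rather than offset 0 only.
HEADER_WINDOW = 1024


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the PDF header marker appears near the start of the file."""
    return PDF_MAGIC in data[:HEADER_WINDOW]


def route_upload(data: bytes) -> str:
    """Decide whether an upload goes to the parser or is held for later."""
    return "quarantine" if looks_like_pdf(data) else "parse"
```

Quarantined files can be reprocessed once the patched versions are deployed, which keeps the workaround from silently dropping user content.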
Detection and Monitoring:
- Implement real-time monitoring for PDF processing operations, with alerts for unusual patterns such as external entity references or unexpected network activity.
- Deploy file system monitoring to detect unauthorized access from Tika processes.
- Enable comprehensive logging for all document processing workflows, including full request payloads where feasible.
- Establish baseline behavior for legitimate document processing to identify anomalous activity.
Response Planning:
- Prepare incident response procedures for potential exploitation, including system isolation steps and forensic log collection requirements.
- Review privilege levels for Tika processes and implement least-privilege principles to limit potential impact.
- Consider implementing defense-in-depth measures including Web Application Firewalls (WAF) with XXE detection rules, though these should not replace patching.
Conclusion
CVE-2025-66516 represents a maximum-severity vulnerability in one of the most widely deployed document processing frameworks. With a CVSS score of 10.0, unauthenticated remote exploitation, and broad impact across multiple Apache Tika artifacts, this vulnerability demands immediate attention from every organization using Tika for document processing.
The expansion of scope beyond the original CVE-2025-54988 highlights the complexity of modern dependency chains and the critical importance of comprehensive patching strategies. Organizations cannot assume that addressing a single component in a modular framework provides complete protection; the underlying shared libraries must also be secured.
Security teams should recognize that document processing represents a critical attack surface in modern applications. Systems that automatically parse, analyze, or extract content from user-supplied files are inherently exposed to content-based attacks like XXE. This vulnerability should serve as a catalyst for broader security improvements in document processing pipelines, including proper input validation, sandboxing, network segmentation, and comprehensive monitoring.
