When your XML parser is a little too helpful.
XML lets a document define custom entities (essentially named variables) in its document type declaration (`DOCTYPE`). The problem is that many standard XML parsing libraries also resolve external entities by default, meaning entities whose value is loaded from a URI instead of being written inline. An uploaded XML file can therefore instruct the server's parser to read a local file from the server's disk or to make a network request to an internal system.
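To make the "entities are variables" idea concrete, here is a small sketch of a harmless internal entity being expanded by Python's standard-library parser. The document content (`note`, `company`, "Example Corp") is invented for illustration.

```python
import xml.etree.ElementTree as ET

# An internal entity: its value is written inline in the DOCTYPE,
# so expanding it never touches the filesystem or the network.
DOC = """<!DOCTYPE note [
  <!ENTITY company "Example Corp">
]>
<note>Invoice issued by &company;</note>"""

root = ET.fromstring(DOC)
print(root.text)  # Invoice issued by Example Corp
```

An external entity uses the same `&name;` reference syntax, but its declaration points at a URI with the `SYSTEM` keyword, which is what turns a convenience feature into a file-read or request primitive.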
To prevent XXE, configure your XML parser to reject Document Type Definitions (DTDs) entirely or, if your application genuinely needs DTDs, at least to disable the resolution of external entities.
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- MALICIOUS XML PAYLOAD (exploits a vulnerable parser) -->
<!DOCTYPE foo [
<!ELEMENT foo ANY >
<!-- This tells the parser: assign the contents of /etc/passwd to &xxe; -->
<!ENTITY xxe SYSTEM "file:///etc/passwd" >
]>
<foo>Hello &xxe;</foo>
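# SECURE: Python lxml example
from lxml import etree
# DANGER: older lxml releases resolve entities, including external ones, by default
# parser = etree.XMLParser()
# SECURE: Explicitly disable entity resolution, DTD loading, and network access
parser = etree.XMLParser(resolve_entities=False, load_dtd=False, no_network=True)
# xml_data should be bytes: lxml rejects str input that carries an encoding declaration
root = etree.fromstring(xml_data, parser)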
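When the application never needs DTDs, the stricter option is to reject any document that declares a `DOCTYPE` at all. Below is a minimal sketch of that using only the Python standard library. The names `DoctypeForbidden` and `parse_without_dtd` are hypothetical, and the two-pass design is one assumed way to do it, not the only one.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class DoctypeForbidden(ValueError):
    """Raised when a document contains a DOCTYPE declaration (hypothetical name)."""


def _reject_doctype(name, system_id, public_id, has_internal_subset):
    # expat calls this as soon as it sees "<!DOCTYPE", before any entity
    # declared inside the DTD has been processed.
    raise DoctypeForbidden(f"DOCTYPE {name!r} is not allowed")


def parse_without_dtd(xml_bytes: bytes) -> ET.Element:
    """Parse XML, refusing any document that declares a DOCTYPE."""
    # First pass: a bare expat parser whose only job is to fail on a DOCTYPE.
    checker = expat.ParserCreate()
    checker.StartDoctypeDeclHandler = _reject_doctype
    checker.Parse(xml_bytes, True)
    # Second pass: the document has no DTD, so build the tree as usual.
    return ET.fromstring(xml_bytes)


if __name__ == "__main__":
    payload = (
        b'<?xml version="1.0"?>'
        b'<!DOCTYPE foo [<!ENTITY xxe SYSTEM "file:///etc/passwd">]>'
        b"<foo>Hello &xxe;</foo>"
    )
    try:
        parse_without_dtd(payload)
    except DoctypeForbidden as exc:
        print("rejected:", exc)
```

Parsing twice costs some time on large uploads, but it keeps the check independent of whichever tree-building library does the real work.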
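A B2B application allows customers to upload XML invoices for automated processing. The parsing script doesn't disable DTDs. A malicious customer uploads an invoice with an external entity pointing to `http://169.254.169.254/latest/meta-data/iam/security-credentials/` (the AWS instance metadata endpoint). When the server parses the XML, it requests that URL itself and writes the response into the generated PDF invoice, which the attacker then downloads. That path returns the name of the IAM role attached to the instance; a second upload with the role name appended to the URL returns that role's temporary credentials (not root credentials, but often enough to reach other AWS resources).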
How does an attacker actually see the stolen data (like `/etc/passwd`) in an XXE attack?