XML Format Reference
A technical reference for the XML data format: structure rules, elements versus attributes, namespaces, encoding, and strategies for converting XML to and from JSON.
What Is XML?
XML (Extensible Markup Language) is a W3C Recommendation, first published in 1998 as a simplified subset of SGML. Unlike JSON or CSV — which are pure data formats — XML is a markup language: it was designed to annotate text with structural tags, making it suitable for both documents (XHTML, DocBook) and data interchange (SOAP, RSS, sitemaps).
XML's core design principle is that the format must be both human-readable and machine-parsable. Every XML document is a tree of elements, each of which can carry attributes and contain text, child elements, or both (mixed content).
XML Structure
A well-formed XML document follows these rules:
- Prolog (optional). The <?xml version="1.0" encoding="UTF-8"?> declaration appears first, specifying the XML version and character encoding.
- Single root element. Exactly one element encloses all others. No sibling roots.
- Closing tags mandatory. Every opening <tag> must have a matching </tag>. Self-closing tags like <br/> are allowed.
- Proper nesting. Elements must close in reverse order of opening: <a><b></b></a> is well-formed; <a><b></a></b> is not.
- Case-sensitive. <Title> and <title> are different elements.
- Attribute values quoted. All attribute values must be in single or double quotes: type="book".
Here is a minimal well-formed XML document:
<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book category="fiction">
    <title lang="en">The Hobbit</title>
    <author>J.R.R. Tolkien</author>
    <year>1937</year>
    <price currency="USD">14.99</price>
  </book>
  <book category="non-fiction">
    <title lang="en">Sapiens</title>
    <author>Yuval Noah Harari</author>
    <year>2011</year>
    <price currency="USD">18.99</price>
  </book>
</library>
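The well-formedness rules can be checked mechanically by handing a document to any conforming parser. The following is a minimal sketch in Python using the standard library's xml.etree.ElementTree; the helper name is_well_formed and the sample documents are illustrative assumptions, not part of any particular tool.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: bytes) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


good = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<library><book category="fiction"><title>The Hobbit</title></book></library>'
)
bad_nesting = b"<a><b></a></b>"     # elements closed in the wrong order
two_roots = b"<a/><b/>"             # sibling roots are not allowed
unquoted = b"<book type=fiction/>"  # attribute value is not quoted
wrong_case = b"<Title></title>"     # names are case-sensitive

results = {
    "good": is_well_formed(good),
    "bad_nesting": is_well_formed(bad_nesting),
    "two_roots": is_well_formed(two_roots),
    "unquoted": is_well_formed(unquoted),
    "wrong_case": is_well_formed(wrong_case),
}
print(results)
```

Only the first document should be accepted; each of the others breaks one rule from the list above.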
Elements vs. Attributes
XML offers two ways to attach data to a node: as child elements or as attributes. This choice is among the most important design decisions in an XML vocabulary, and it directly affects XML→JSON conversion.
| Property | Element | Attribute |
|---|---|---|
| Can contain child nodes | Yes | No (text only) |
| Can repeat on same parent | Yes | No (must be unique per element) |
| Order-preserving | Yes | No (unordered by spec) |
| Good for | Structured data, long text, repeating values | Metadata, identifiers, flags, units |
| JSON equivalent | Object property or array | No direct equivalent |
When to use attributes: For short, atomic metadata that qualifies the element — id, type, lang, currency. Attributes are unordered by the XML spec, so don't rely on attribute order.
When to use child elements: For data that can be multi-valued, has internal structure, or contains text that may need to be extended later. Elements can nest; attributes cannot.
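The difference also shows up in how a parser exposes the two kinds of data. The sketch below uses Python's xml.etree.ElementTree; the repeated genre element is a made-up illustration of a multi-valued field.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    '<book category="fiction">'
    '<title lang="en">The Hobbit</title>'
    "<genre>fantasy</genre>"
    "<genre>adventure</genre>"
    "</book>"
)

# Attributes: one atomic string per name, looked up like a dictionary.
category = book.get("category")
title_lang = book.find("title").get("lang")

# Child elements: may repeat, keep document order, and can nest further.
title = book.findtext("title")
genres = [genre.text for genre in book.findall("genre")]

print(category, title_lang, title, genres)
```

A value such as genre could not be modeled as an attribute, because an attribute name may appear only once per element.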
Namespaces
XML namespaces prevent name collisions when combining XML from different vocabularies. A namespace is declared with xmlns and identified by a URI:
<root xmlns:bk="http://example.com/books"
      xmlns:auth="http://example.com/authors">
  <bk:title>The Hobbit</bk:title>
  <auth:name>J.R.R. Tolkien</auth:name>
</root>
Namespaces matter for XML→JSON conversion because the prefix (bk:, auth:) is part of the element's qualified name. Most converters preserve the prefix in the JSON key, producing output like "bk:title": "The Hobbit". Some converters strip prefixes and use just the local name; others expand to the full namespace URI. There is no single standard — check your converter's behavior before relying on it.
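As one concrete data point, Python's xml.etree.ElementTree takes the "expand to the full URI" route: it reports each tag as {uri}local. The sketch below shows that form and one possible way to reduce it to local names; the local_name helper is a hypothetical name introduced for illustration.

```python
import xml.etree.ElementTree as ET

doc = (
    '<root xmlns:bk="http://example.com/books" '
    'xmlns:auth="http://example.com/authors">'
    "<bk:title>The Hobbit</bk:title>"
    "<auth:name>J.R.R. Tolkien</auth:name>"
    "</root>"
)
root = ET.fromstring(doc)

# ElementTree replaces each prefix with its namespace URI in braces.
expanded_tags = [child.tag for child in root]


def local_name(tag: str) -> str:
    """Drop a leading '{uri}' part, keeping only the local name."""
    return tag.rsplit("}", 1)[-1]


by_local_name = {local_name(child.tag): child.text for child in root}

# Lookups can still use prefixes if a prefix-to-URI map is supplied.
prefixes = {"bk": "http://example.com/books"}
title = root.findtext("bk:title", namespaces=prefixes)

print(expanded_tags)
print(by_local_name)
```

Stripping to local names is lossy: two elements with the same local name in different namespaces would collide on the same key.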
Encoding
XML documents declare their encoding in the prolog: <?xml version="1.0" encoding="UTF-8"?>. UTF-8 is the most common encoding and the one a parser assumes when a document has neither an encoding declaration nor a byte order mark. UTF-16 must also be supported by every conforming parser and is recognized by its byte order mark. Any other encoding, such as ISO-8859-1, must be declared explicitly. A mismatch between the declared encoding and the actual byte stream is a common source of parse failures: the XML parser will reject the document if it encounters bytes that are invalid in the declared encoding.
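The effect of a mismatched declaration can be reproduced in a few lines. This is a minimal sketch with Python's xml.etree.ElementTree; the sample element and text are illustrative.

```python
import xml.etree.ElementTree as ET

# "caf\u00e9" is "café"; in ISO-8859-1 the last letter is the single byte 0xE9.
latin1_body = "<name>caf\u00e9</name>".encode("iso-8859-1")

mismatched = b'<?xml version="1.0" encoding="UTF-8"?>' + latin1_body
matched = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_body

try:
    ET.fromstring(mismatched)
    mismatch_accepted = True
except ET.ParseError:
    # 0xE9 followed by "<" is not a valid UTF-8 byte sequence.
    mismatch_accepted = False

name = ET.fromstring(matched).text

print(mismatch_accepted, name)
```

The same bytes parse cleanly once the declaration names the encoding they were actually written in.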
XML 1.0 also defines five predefined entities: &lt; for <, &gt; for >, &amp; for &, &quot; for ", and &apos; for '. Numeric character references (&#8364; or &#x20AC; for €) are also valid.
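When generating XML by hand, escaping is usually delegated to a library rather than done with string replacement. The sketch below uses escape and quoteattr from Python's xml.sax.saxutils; the item element and label attribute are hypothetical names chosen for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

raw = 'Fish & Chips <"large">'

# escape() replaces &, < and > in text content.
escaped_text = escape("a & b < c")

# quoteattr() escapes the value and wraps it in suitable quotes.
document = f"<item label={quoteattr(raw)}>{escape(raw)}</item>"

# Parsing the result should give back the original, unescaped strings.
item = ET.fromstring(document)
round_trip_text = item.text
round_trip_attr = item.get("label")

# Decimal and hexadecimal references name the same character.
euro = ET.fromstring("<p>&#8364;&#x20AC;</p>").text

print(escaped_text)
print(document)
```

The parser resolves entities and character references on input, so application code sees plain characters.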
XML vs. JSON: Structural Differences
| XML Concept | JSON Equivalent | Notes |
|---|---|---|
| Element <name>Alice</name> | "name": "Alice" | Direct mapping |
| Attribute <book type="fiction"> | "@type": "fiction" | Convention-based, not native |
| Text content of element with attributes | Object with "#text" or "$" key | Converter-specific |
| Repeating elements | JSON array | Detected by finding siblings with same name |
| Mixed content (text + child elements) | No clean equivalent | Typically lost or flattened |
| Comments <!-- ... --> | Dropped | JSON has no comment syntax |
| Processing instructions <? ... ?> | Dropped | Not data |
| CDATA sections | String value | Content preserved, wrapper discarded |
XML→JSON Conversion Strategies
Because XML has concepts JSON lacks (attributes, mixed content, namespaces), every XML→JSON converter makes opinionated choices. The three most common strategies:
- @-attribute convention. Attributes become JSON properties prefixed with @: <book type="fiction">...</book> → {"@type": "fiction", ...}. This is the approach used by our XML to JSON converter.
- Attributes as nested object. All attributes go under a dedicated key: {"_attributes": {"type": "fiction"}, ...}. Cleaner separation but adds nesting depth.
- Attributes as elements. Attributes are converted to child elements. Loses the element/attribute distinction entirely but produces flatter JSON.
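To make the first strategy concrete, here is a minimal sketch of the @-attribute convention in Python. It is an illustration written for this reference, not the implementation of any specific converter; the function name element_to_obj and the "#text" key are assumptions, and text that follows a child element (mixed content) is deliberately dropped.

```python
import json
import xml.etree.ElementTree as ET


def element_to_obj(elem):
    """Convert one element to a JSON-compatible value (@-attribute convention)."""
    # Attributes become "@name" properties.
    obj = {f"@{name}": value for name, value in elem.attrib.items()}

    for child in elem:
        value = element_to_obj(child)
        if child.tag in obj:
            # A repeated sibling name is promoted to a list.
            if not isinstance(obj[child.tag], list):
                obj[child.tag] = [obj[child.tag]]
            obj[child.tag].append(value)
        else:
            obj[child.tag] = value

    text = (elem.text or "").strip()
    if not obj:
        return text  # leaf without attributes: plain string
    if text:
        obj["#text"] = text  # text of an element that also has attributes
    return obj


xml_doc = (
    '<book type="fiction">'
    '<title lang="en">The Hobbit</title>'
    "<year>1937</year>"
    "<tag>classic</tag><tag>fantasy</tag>"
    "</book>"
)
root = ET.fromstring(xml_doc)
result = {root.tag: element_to_obj(root)}
print(json.dumps(result, indent=2))
```

Note that a single tag element would come out as a string and two as a list, so the JSON shape depends on the data; real converters often offer a way to force arrays for known element names.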
For JSON→XML conversion, the reverse decisions apply: the converter must decide whether a JSON property becomes an XML element or attribute. Our JSON to XML converter uses the @ prefix convention for round-trip compatibility.
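The reverse direction can be sketched the same way. The code below is an illustrative assumption about how an @-prefix round trip might work, not a description of a particular converter; obj_to_element is a hypothetical helper, and scalar values are simply passed through str().

```python
import xml.etree.ElementTree as ET


def obj_to_element(tag, value):
    """Build an element from a JSON-like value using the @-prefix convention."""
    elem = ET.Element(tag)
    if not isinstance(value, dict):
        elem.text = str(value)
        return elem

    for key, item in value.items():
        if key.startswith("@"):
            elem.set(key[1:], str(item))  # "@type" -> attribute type
        elif key == "#text":
            elem.text = str(item)         # text beside attributes
        else:
            # A list becomes repeated sibling elements with the same name.
            entries = item if isinstance(item, list) else [item]
            for entry in entries:
                elem.append(obj_to_element(key, entry))
    return elem


data = {
    "@type": "fiction",
    "title": {"@lang": "en", "#text": "The Hobbit"},
    "tag": ["classic", "fantasy"],
}
book = obj_to_element("book", data)
xml_text = ET.tostring(book, encoding="unicode")
print(xml_text)
```

JSON types are not recoverable from the resulting XML: a number and a numeric string serialize to the same text.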
Common XML Pitfalls
- Unescaped ampersands. An & in text content must be written as &amp;. An unescaped & makes the document not well-formed, and a conforming XML parser will reject it. This is one of the most frequent XML errors.
- Missing closing tags. Unlike HTML, XML requires every tag to close. A missing </tag> produces a fatal parse error, not a best-effort render.
- Namespace prefix without declaration. Using <ns:tag> without a corresponding xmlns:ns="..." declaration makes the document not namespace-well-formed.
- Encoding mismatch. If the prolog says encoding="UTF-8" but the file is actually Latin-1, characters above U+007F will typically cause parse failures. Always verify the actual byte encoding matches the declaration.
- Duplicate attributes. The same attribute name cannot appear twice on the same element: <book type="fiction" type="non-fiction"> is a fatal error.
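Most parsers report where a fatal error occurred, which makes these pitfalls easy to locate. The sketch below feeds four of the broken inputs above to Python's xml.etree.ElementTree and collects the reported positions; first_error is a hypothetical helper name, and the exact message wording comes from the underlying parser and may vary.

```python
import xml.etree.ElementTree as ET

samples = {
    "unescaped ampersand": "<p>Fish & Chips</p>",
    "missing closing tag": "<a><b></a>",
    "undeclared prefix": "<ns:tag/>",
    "duplicate attribute": '<book type="fiction" type="non-fiction"/>',
}


def first_error(document):
    """Return a 'line, column: message' string, or None if the input parses."""
    try:
        ET.fromstring(document)
    except ET.ParseError as exc:
        line, column = exc.position
        return f"line {line}, column {column}: {exc}"
    return None


errors = {label: first_error(doc) for label, doc in samples.items()}
for label, message in errors.items():
    print(f"{label}: {message}")
```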
Frequently Asked Questions
Does XML-to-JSON conversion preserve attributes?
It depends on the converter. Most converters preserve attribute values by using an @-prefix convention: the attribute category="book" on an element becomes a JSON property "@category": "book". Some converters place attributes under a dedicated key like "@attributes" or "_attributes". In both cases the values survive the conversion, but the distinction between an element and an attribute survives only as a naming convention, since JSON has no native equivalent to XML attributes. If you need to round-trip XML→JSON→XML, pick a converter that documents its attribute strategy and stick with it.
What happens to XML comments during conversion?
Most XML-to-JSON converters silently drop XML comments (<!-- ... -->). Comments are metadata, not data, and JSON has no comment syntax — the JSON specification (RFC 8259) does not include comments. If your XML comments contain information you need to preserve (version numbers, author notes, suppression markers), extract them before conversion. Some specialized XML tooling can serialize comments as JSON properties (e.g., "__comment": "..."), but this is not standard behavior.
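Comments can be pulled out at parse time before any conversion step. As a sketch, Python's xml.etree.ElementTree drops comments by default but keeps them when the tree builder is created with insert_comments=True (available from Python 3.8); the sample document is illustrative.

```python
import xml.etree.ElementTree as ET

doc = "<config><!-- schema v2 --><item>1</item></config>"

# Default parser: the comment does not appear in the tree at all.
default_root = ET.fromstring(doc)
default_children = [child.tag for child in default_root]

# Opt-in parser: comments become nodes whose tag is the ET.Comment factory.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
kept_root = ET.fromstring(doc, parser=parser)
comments = [
    node.text.strip() for node in kept_root if node.tag is ET.Comment
]

print(default_children, comments)
```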
Why does my XML have namespace prefixes in JSON output?
Namespace prefixes (e.g., <ns:title>) appear in JSON output when the converter preserves the qualified name as-is, producing keys like "ns:title". This happens because XML namespaces don't map neatly to JSON — there's no JSON equivalent to xmlns declarations. Some converters strip prefixes entirely and use the local name; others include the full namespace URI in a separate key. If you need clean JSON output, consider pre-processing your XML to remove namespace prefixes or use a converter that can resolve qualified names to their local-name equivalents.