XML Format Reference

A technical reference for the XML data format: structure rules, elements versus attributes, namespaces, encoding, and strategies for converting XML to and from JSON.

What Is XML?

XML (Extensible Markup Language) is a W3C Recommendation, first published in 1998 as a simplified subset of SGML. Unlike JSON or CSV — which are pure data formats — XML is a markup language: it was designed to annotate text with structural tags, making it suitable for both documents (XHTML, DocBook) and data interchange (SOAP, RSS, sitemaps).

XML's core design principle is that the format must be both human-readable and machine-parsable. Every XML document is a tree of elements, each of which can carry attributes and contain text, child elements, or both (mixed content).

XML Structure

A well-formed XML document follows these rules:

Prolog (optional). The <?xml version="1.0" encoding="UTF-8"?> declaration appears first, specifying the XML version and character encoding.
Single root element. Exactly one element encloses all others. No sibling roots.
Closing tags mandatory. Every opening <tag> must have a matching </tag>. Self-closing tags like <br/> are allowed.
Proper nesting. Elements must close in reverse order of opening: <a><b></b></a> is valid; <a><b></a></b> is not.
Case-sensitive. <Title> and <title> are different elements.
Attribute values quoted. All attribute values must be in single or double quotes: type="book".

Here is a minimal well-formed XML document:

<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book category="fiction">
    <title lang="en">The Hobbit</title>
    <author>J.R.R. Tolkien</author>
    <year>1937</year>
    <price currency="USD">14.99</price>
  </book>
  <book category="non-fiction">
    <title lang="en">Sapiens</title>
    <author>Yuval Noah Harari</author>
    <year>2011</year>
    <price currency="USD">18.99</price>
  </book>
</library>
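
As a rough illustration of how a program consumes this tree, the following sketch reads the sample document with Python's standard xml.etree.ElementTree module. The function name summarize_books and the shape of the returned dictionaries are assumptions made for this example, not part of any standard.

```python
import xml.etree.ElementTree as ET

# The sample document from above, as bytes so the declared encoding applies.
LIBRARY_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book category="fiction">
    <title lang="en">The Hobbit</title>
    <author>J.R.R. Tolkien</author>
    <year>1937</year>
    <price currency="USD">14.99</price>
  </book>
  <book category="non-fiction">
    <title lang="en">Sapiens</title>
    <author>Yuval Noah Harari</author>
    <year>2011</year>
    <price currency="USD">18.99</price>
  </book>
</library>"""


def summarize_books(xml_bytes):
    """Return one dict per <book>, reading both attributes and child elements."""
    root = ET.fromstring(xml_bytes)
    books = []
    for book in root.findall("book"):
        price = book.find("price")
        books.append({
            "category": book.get("category"),   # attribute on <book>
            "title": book.findtext("title"),    # text of a child element
            "year": int(book.findtext("year")),
            "price": float(price.text),
            "currency": price.get("currency"),  # attribute on <price>
        })
    return books


books = summarize_books(LIBRARY_XML)
print(books[0]["title"], books[0]["currency"])
```

Note that every value arrives as a string; converting year and price to numbers is a decision the consuming code has to make.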

Elements vs. Attributes

XML offers two ways to attach data to a node: as child elements or as attributes. This is the single most important design decision in any XML schema, and it has direct consequences for XML→JSON conversion.

Property	Element	Attribute
Can contain child nodes	Yes	No (text only)
Can repeat on same parent	Yes	No (must be unique per element)
Order-preserving	Yes	No (unordered by spec)
Good for	Structured data, long text, repeating values	Metadata, identifiers, flags, units
JSON equivalent	Object property or array	No direct equivalent

When to use attributes: For short, atomic metadata that qualifies the element — id, type, lang, currency. Attributes are unordered by the XML spec, so don't rely on attribute order.

When to use child elements: For data that can be multi-valued, has internal structure, or contains text that may need to be extended later. Elements can nest; attributes cannot.
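
The difference shows up directly in parser APIs. In this minimal sketch (the record and its author names are hypothetical), Python's ElementTree exposes attributes as a flat name-to-string mapping, while child elements form an ordered sequence in which the same name may repeat.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: two identifiers as attributes, one multi-valued field as elements.
BOOK_XML = (
    '<book id="b42" lang="en">'
    "<author>First Author</author>"
    "<author>Second Author</author>"
    "</book>"
)

book = ET.fromstring(BOOK_XML)

# Attributes: at most one value per name, no nesting.
attributes = dict(book.attrib)

# Child elements: document order is kept and names may repeat.
authors = [child.text for child in book.findall("author")]

print(attributes)
print(authors)
```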

Namespaces

XML namespaces prevent name collisions when combining XML from different vocabularies. A namespace is declared with xmlns and identified by a URI:

<root xmlns:bk="http://example.com/books"
      xmlns:auth="http://example.com/authors">
  <bk:title>The Hobbit</bk:title>
  <auth:name>J.R.R. Tolkien</auth:name>
</root>

Namespaces matter for XML→JSON conversion because the prefix (bk:, auth:) is part of the element's qualified name. Most converters preserve the prefix in the JSON key, producing output like "bk:title": "The Hobbit". Some converters strip prefixes and use just the local name; others expand to the full namespace URI. There is no single standard — check your converter's behavior before relying on it.
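
To make the "expand to the full namespace URI" behavior concrete, here is a small sketch using Python's standard ElementTree parser, which is one tool that takes that approach. It is shown only as an illustration; the prefix b in the lookup map is an arbitrary choice for this example.

```python
import xml.etree.ElementTree as ET

NS_XML = """<root xmlns:bk="http://example.com/books"
      xmlns:auth="http://example.com/authors">
  <bk:title>The Hobbit</bk:title>
  <auth:name>J.R.R. Tolkien</auth:name>
</root>"""

root = ET.fromstring(NS_XML)

# ElementTree replaces each prefix with its URI in {uri}local form.
tags = [child.tag for child in root]

# Lookups use a caller-supplied prefix map; the prefix need not match the document's.
ns = {"b": "http://example.com/books"}
title = root.find("b:title", ns).text

print(tags)
print(title)
```

The prefix itself (bk, auth) is not kept in the tag name, which is why two tools can disagree on the key they emit for the same element.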

Encoding

XML documents can declare their encoding in the XML declaration: <?xml version="1.0" encoding="UTF-8"?>. UTF-8 is the default and the most common encoding. UTF-16 is also supported and is recognized by its byte order mark; any other encoding (ISO-8859-1, for example) must be declared explicitly. A mismatch between the declared encoding and the actual byte stream is a common source of parse failures: the XML parser will reject the document if it encounters bytes that are invalid in the declared encoding.

XML 1.0 also defines five predefined character entities: &lt; for <, &gt; for >, &amp; for &, &quot; for ", and &apos; for '. Numeric character references (&#8364; or &#x20AC; for €) are also valid.
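
Escaping is easy to get wrong by hand, so it is usually delegated to a library. The sketch below uses Python's standard xml.sax.saxutils.escape and ElementTree to show the round trip; the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = 'Fish & Chips <"large">'

# escape() rewrites &, < and > so the text is safe inside element content.
escaped_text = escape(raw)

# Parsing the escaped text gives back the original characters.
parsed = ET.fromstring("<note>{}</note>".format(escaped_text)).text

# Decimal and hexadecimal character references, plus a predefined entity.
euro = ET.fromstring("<price>&#8364;9 &#x20AC;9 &apos;ok&apos;</price>").text

print(escaped_text)
print(parsed == raw)
```

By default escape() leaves quotation marks alone, which is fine for element content; attribute values need the quote character escaped as well.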

XML vs. JSON: Structural Differences

XML Concept	JSON Equivalent	Notes
Element `<name>Alice</name>`	`"name": "Alice"`	Direct mapping
Attribute `<book type="fiction">`	`"@type": "fiction"`	Convention-based, not native
Text content of element with attributes	Object with `"#text"` or `"$"` key	Converter-specific
Repeating elements	JSON array	Detected by finding siblings with same name
Mixed content (text + child elements)	No clean equivalent	Typically lost or flattened
Comments `<!-- ... -->`	Dropped	JSON has no comment syntax
Processing instructions `<? ... ?>`	Dropped	Not data
CDATA sections	String value	Content preserved, wrapper discarded
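
The mixed-content and CDATA rows can be seen in a parser's data model. In this sketch with Python's ElementTree, text that follows a child element is stored on that child's tail, which is the part a simple element-to-property mapping tends to lose.

```python
import xml.etree.ElementTree as ET

# Mixed content: text, then a child element, then more text.
para = ET.fromstring("<p>Read <em>this</em> first.</p>")
pieces = (para.text, para[0].text, para[0].tail)

# Flattening keeps the characters but drops the <em> markup.
full_text = "".join(para.itertext())

# CDATA: the parser returns the content as ordinary text, without the wrapper.
code = ET.fromstring("<code><![CDATA[if (a < b && b < c)]]></code>")
cdata_text = code.text

print(pieces)
print(full_text)
print(cdata_text)
```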

XML→JSON Conversion Strategies

Because XML has concepts JSON lacks (attributes, mixed content, namespaces), every XML→JSON converter makes opinionated choices. The three most common strategies:

@-attribute convention. Attributes become JSON properties prefixed with @: <book type="fiction">...</book> → {"@type": "fiction", ...}. This is the approach used by our XML to JSON converter.
Attributes as nested object. All attributes go under a dedicated key: {"_attributes": {"type": "fiction"}, ...}. Cleaner separation but adds nesting depth.
Attributes as elements. Attributes are converted to child elements. Loses the element/attribute distinction entirely but produces flatter JSON.
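
The first strategy can be sketched in a few lines of Python. This is an illustrative implementation written for this reference, not the code of any particular converter; the function name element_to_dict and the "#text" key are assumptions, and text that follows a child element (mixed content) is deliberately ignored.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert one element using the @-attribute convention (illustrative only)."""
    node = {"@" + name: value for name, value in elem.attrib.items()}
    for child in elem:
        value = element_to_dict(child)
        if child.tag in node:
            # Repeated sibling names collapse into a JSON array.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    text = (elem.text or "").strip()
    if not node:
        return text            # leaf without attributes: plain string
    if text:
        node["#text"] = text   # text alongside attributes or children
    return node


root = ET.fromstring(
    '<book type="fiction"><title lang="en">The Hobbit</title><year>1937</year></book>'
)
converted = {root.tag: element_to_dict(root)}
print(json.dumps(converted, indent=2))
```

A single child and a list of one child look different in the output (value versus array), which is a known weakness of detecting arrays from repeated siblings.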

For JSON→XML conversion, the reverse decisions apply: the converter must decide whether a JSON property becomes an XML element or attribute. Our JSON to XML converter uses the @ prefix convention for round-trip compatibility.
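
The reverse direction under the same convention might look like the following sketch. The helper name dict_to_element and the "#text" key are assumptions for this example; a real converter would also need to validate that keys are legal XML names.

```python
import xml.etree.ElementTree as ET


def dict_to_element(tag, value):
    """Build an element from a JSON-like value using the @ prefix convention."""
    elem = ET.Element(tag)
    if not isinstance(value, dict):
        elem.text = str(value)
        return elem
    for key, item in value.items():
        if key.startswith("@"):
            elem.set(key[1:], str(item))      # "@type" becomes attribute type
        elif key == "#text":
            elem.text = str(item)             # text content next to attributes
        else:
            # A list becomes repeated sibling elements with the same name.
            items = item if isinstance(item, list) else [item]
            for entry in items:
                elem.append(dict_to_element(key, entry))
    return elem


data = {"book": {"@type": "fiction",
                 "title": {"@lang": "en", "#text": "The Hobbit"},
                 "year": "1937"}}
(root_tag, root_value), = data.items()
xml_text = ET.tostring(dict_to_element(root_tag, root_value), encoding="unicode")
print(xml_text)
```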

Common XML Pitfalls

Unescaped ampersands. An & in text content must be written as &amp;. An unescaped & makes the document not well-formed, and a conforming XML parser will reject it. This is the single most frequent XML error.
Missing closing tags. Unlike HTML, XML requires every tag to close. A missing </tag> produces a fatal parse error, not a best-effort render.
Namespace prefix without declaration. Using <ns:tag> without a corresponding xmlns:ns="..." declaration makes the document not namespace-well-formed.
Encoding mismatch. If the prolog says encoding="UTF-8" but the file is actually Latin-1, characters above U+007F are stored as single bytes that are usually not valid UTF-8, and the parser fails on them. Always verify the actual byte encoding matches the declaration.
Duplicate attributes. The same attribute name cannot appear twice on the same element: <book type="fiction" type="non-fiction"> is a fatal error.
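
Each of these pitfalls can be reproduced with a few lines of Python. The sketch below runs small hand-written samples through the standard ElementTree parser and records whether each one parses; the helper name is_well_formed is an assumption for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_data):
    """Return True if the parser accepts the input, False on a parse error."""
    try:
        ET.fromstring(xml_data)
    except ET.ParseError:
        return False
    return True


samples = {
    "unescaped ampersand": "<p>Fish & Chips</p>",
    "missing closing tag": "<a><b>text</a>",
    "undeclared prefix": "<ns:tag>text</ns:tag>",
    # Declared as UTF-8, but the byte 0xE9 is Latin-1 for an accented e.
    "encoding mismatch": b'<?xml version="1.0" encoding="UTF-8"?><p>caf\xe9</p>',
    "duplicate attribute": '<book type="fiction" type="non-fiction"/>',
    # Control sample: correctly escaped, so it should parse.
    "escaped ampersand": "<p>Fish &amp; Chips</p>",
}

results = {name: is_well_formed(data) for name, data in samples.items()}
for name, ok in results.items():
    print(name, "->", "ok" if ok else "rejected")
```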

Frequently Asked Questions

Does XML-to-JSON conversion preserve attributes?

It depends on the converter. Most converters preserve attribute values by using an @-prefix convention: the attribute category="book" on an element becomes a JSON property "@category": "book". Some converters place attributes under a dedicated key like "@attributes" or "_attributes". In both cases the values survive the conversion — but the distinction between an element and an attribute is lost, since JSON has no native equivalent to XML attributes. If you need to round-trip XML→JSON→XML, pick a converter that documents its attribute strategy and stick with it.

What happens to XML comments during conversion?

Most XML-to-JSON converters silently drop XML comments (<!-- ... -->). Comments are metadata, not data, and JSON has no comment syntax: the JSON specification (RFC 8259) does not include comments. If your XML comments contain information you need to preserve (version numbers, author notes, suppression markers), extract them before conversion. Some specialized XML tooling can serialize comments as JSON properties (e.g., "__comment": "..."), but this is not standard behavior.
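
One way to extract comments before conversion is sketched below with Python's standard library. It assumes Python 3.8 or later, where ElementTree's TreeBuilder accepts insert_comments=True; the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

XML_WITH_COMMENT = "<config><!-- schema version 2 --><mode>fast</mode></config>"

# Default parsing: the comment never reaches the tree.
plain = ET.fromstring(XML_WITH_COMMENT)
plain_tags = [child.tag for child in plain]

# Python 3.8+: ask the tree builder to keep comments as special nodes.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
kept = ET.fromstring(XML_WITH_COMMENT, parser=parser)

# Comment nodes are marked by having ET.Comment as their tag.
comments = [node.text.strip() for node in kept.iter() if node.tag is ET.Comment]

print(plain_tags)
print(comments)
```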

Why does my XML have namespace prefixes in JSON output?

Namespace prefixes (e.g., <ns:title>) appear in JSON output when the converter preserves the qualified name as-is, producing keys like "ns:title". This happens because XML namespaces don't map neatly to JSON — there's no JSON equivalent to xmlns declarations. Some converters strip prefixes entirely and use the local name; others include the full namespace URI in a separate key. If you need clean JSON output, consider pre-processing your XML to remove namespace prefixes or use a converter that can resolve qualified names to their local-name equivalents.
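
A pre-processing step that reduces qualified names to local names could look like this sketch, which relies on ElementTree's {uri}local tag format. The helper names are assumptions for this example, and only element names are rewritten, not attribute names. Stripping namespaces reintroduces the name collisions they exist to prevent, so it is only safe when the local names are known to be unique.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """'{http://example.com/books}title' -> 'title'; plain names pass through."""
    return tag.rsplit("}", 1)[-1]


def strip_namespaces(root):
    """Rewrite every element tag in place to its local name (attributes untouched)."""
    for node in root.iter():
        node.tag = local_name(node.tag)
    return root


doc = ET.fromstring(
    '<root xmlns:bk="http://example.com/books" xmlns:auth="http://example.com/authors">'
    "<bk:title>The Hobbit</bk:title><auth:name>J.R.R. Tolkien</auth:name></root>"
)
stripped_tags = [child.tag for child in strip_namespaces(doc)]
print(stripped_tags)
```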