In this post I explain the rules for entity definition and references in natural XML documents. These rules are defined in the XML specs (XML 1.0 and XML 1.1). You can read the specification yourself, but reading standards documents is rarely fun. This post will illustrate the rules by example.
In section 4.4 XML Processor Treatment of Entities and References of the spec, we are presented with …
| Context / Entity Type | Parameter | Internal General | External Parsed General | Unparsed | Character |
| Reference in Content | Not recognized | Included | Included if validating | Forbidden | Included |
| Reference in Attribute Value | Not recognized | Included in literal | Forbidden | Forbidden | Included |
| Occurs as Attribute Value | Not recognized | Forbidden | Forbidden | Notify | Not recognized |
| Reference in EntityValue | Included in literal | Bypassed | Bypassed | Error | Included |
| Reference in DTD | Included as PE | Forbidden | Forbidden | Forbidden | Forbidden |
The approach
The rule in each cell of the above matrix will be illustrated with a case study (labelled “test case”). The inputs of the test cases will be grouped together in natural XML documents and passed through an XSLT identity transform. A test case, then, is the combination of the input document and the result of the transform (either an error or a resultant document). The transformation engine chosen was Saxon-HE 9.5.1.1N from Saxonica. The central idea of this approach is that the identity transform is a transparent window into how Saxon’s input XML processor processes and views the test case input documents. Listing 1 shows our identity transform.
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes" encoding="utf-8" omit-xml-declaration="yes" />
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
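For readers without Saxon to hand, a rough analogue of this "transparent window" can be sketched in Python. This is not the setup used in this post: it assumes Python's standard-library ElementTree (which sits on the non-validating expat parser), and the helper name `round_trip` and the sample document are made up for illustration. It parses a document and serialises it straight back, so whatever the parser did with the entity references shows up in the output.

```python
import xml.etree.ElementTree as ET


def round_trip(document: str) -> str:
    """Parse a document and serialise it unchanged (hypothetical helper)."""
    root = ET.fromstring(document)
    return ET.tostring(root, encoding="unicode")


# Illustrative input: one internal general entity, referenced in an
# attribute value and in content.
SAMPLE = """<!DOCTYPE sample [
<!ENTITY internal-ge "internal-ge-value">
]>
<sample attri="&internal-ge;">&internal-ge;</sample>"""

if __name__ == "__main__":
    print(round_trip(SAMPLE))
```

Unlike the XSLT identity transform, this sketch drops comments and the DOCTYPE, so it is only a stand-in for seeing how references are resolved.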
Informational resources
The following entities (informational resources) will be available to all test cases via the indicated system identifiers.
<?xml version="1.0" encoding="UTF-8"?>banana
<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY % where-is-it "It is at ">
<!-- %colour; is defined by whatever is using this external doc-type.
This works because the internal subset is treated as occurring BEFORE the
external one, so %colour; is already declared when this line is parsed.
A parameter entity reference in an entity value is INCLUDED in the literal, not bypassed. -->
<!ENTITY % some-place "%colour;">
<!-- The parameter entity is included in the definition of &test-case-14; -->
<!ENTITY test-case-14 "%where-is-it; %some-place;">
<!-- In the processor's symbol map: test-case-14 ==> 'x'. -->
Entity references in content
Let’s look at some entity references in content. This means references anywhere after the start-tag and before the end-tag of an element, and corresponds to the nonterminal content.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE a-doc-type [
<!ENTITY % internal-pe "internal-pe-value">
<!ENTITY % external-pe SYSTEM "source-text.txt">
<!ENTITY internal-ge "internal-ge-value">
<!ENTITY internal-ge-with-mark-up "<solo-element recursive1='&internal-ge;'/>">
<!ENTITY external-parsed-ge SYSTEM "source-text.txt">
]>
<container>
<!-- Test case 1: Parameter entities are not recognised in content. -->
<test-case1>%internal-pe; %external-pe;</test-case1>
<!-- Test case 2: Internal general entities are included in content. -->
<test-case2 attri="&internal-ge;">&internal-ge; &internal-ge-with-mark-up;</test-case2>
<!-- Test case 3: Character references are included. -->
<test-case3>&#169; Sean B. Durkin, 2014</test-case3>
<!-- Test case 4: An external parsed general entity in content.
A non-validating processor MIGHT include it.
A validating one WILL. -->
<test-case4>&external-parsed-ge;</test-case4>
</container>
… transforms to …
<container>
<!-- Test case 1: Parameter entities are not recognised in content. -->
<test-case1>%internal-pe; %external-pe;</test-case1>
<!-- Test case 2: Internal general entities are included in content. -->
<test-case2 attri="internal-ge-value">internal-ge-value <solo-element recursive1="internal-ge-value"/>
</test-case2>
<!-- Test case 3: Character references are included. -->
<test-case3>آ© Sean B. Durkin, 2014</test-case3>
<!-- Test case 4: An external parsed general entity in content.
A non-validating processor MIGHT include it.
A validating one WILL. -->
<test-case4>banana</test-case4>
</container>
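Test case 1 shows us that parameter entity references are not recognised in content. That should be expected by the reader: they exist purely for the purpose of building DTDs. Test case 2 shows that references to internal general entities are included, even when the replacement text contains mark-up.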
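The first two test cases can also be tried outside Saxon. The sketch below is an assumption-laden stand-in, not the original experiment: it uses Python's standard-library ElementTree (non-validating expat underneath) on a cut-down copy of the input, and the variable names are mine.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the test input: one parameter entity and two
# internal general entities, the second of which carries mark-up.
DOC = """<!DOCTYPE container [
<!ENTITY % internal-pe "internal-pe-value">
<!ENTITY internal-ge "internal-ge-value">
<!ENTITY internal-ge-with-mark-up "<solo-element recursive1='&internal-ge;'/>">
]>
<container>
<test-case1>%internal-pe;</test-case1>
<test-case2>&internal-ge; &internal-ge-with-mark-up;</test-case2>
</container>"""

root = ET.fromstring(DOC)
case1 = root.find("test-case1")
case2 = root.find("test-case2")
solo = case2.find("solo-element")

# The parameter entity reference survives as plain text.
print(repr(case1.text))
# The general entity reference is replaced by its text ...
print(repr(case2.text))
# ... and the mark-up inside the second entity becomes a real element.
print(solo.attrib)
```

With this parser, `%internal-pe;` stays as literal text, while `&internal-ge-with-mark-up;` turns into a genuine child element whose attribute has itself been expanded.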
Test case 3 shows that character references are included in content. They are replaced as soon as the XML processor sees the character reference. I am not sure why we are getting the extra آ character in listing 5. If you can explain this, please leave a comment.
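One possible explanation, offered as a hypothesis rather than a diagnosis: the transform wrote © correctly as the two UTF-8 bytes C2 A9, and whatever displayed the result decoded those bytes with a single-byte Arabic code page such as Windows-1256, in which C2 is آ and A9 is ©. The Python sketch below reproduces the symptom under that assumption. Nothing in the test run confirms which code page, if any, was actually involved.

```python
COPYRIGHT = "\u00a9"  # U+00A9 COPYRIGHT SIGN

# What a transform with encoding="utf-8" would write for that character.
utf8_bytes = COPYRIGHT.encode("utf-8")

# ASSUMPTION: the viewer decoded the bytes as Windows-1256 instead of UTF-8.
misread = utf8_bytes.decode("cp1256")

print(utf8_bytes)  # the two bytes C2 A9
print(misread)     # ALEF WITH MADDA ABOVE followed by the copyright sign
```

If that is what happened, the processor's output is fine and only the display is wrong.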
Test case 4 is an interesting one. The treatment of a reference to an external parsed general entity in content depends on whether or not the processor is validating. If validating, the replacement text is included, as shown in listing 5, test case 4. Apparently the outcome of listing 5 does not depend on whether the input document of listing 4 is marked as stand-alone or not. You would think that marking standalone=”yes” would cause the processor NOT to attempt to load external entities. The spec says that a non-validating processor gets to choose whether or not to include the replacement text. It is a rather inconvenient rule, because it makes the behaviour of non-validating processors unpredictable in general. Will &external-parsed-ge; be replaced by ‘banana’? If the vendor does not publish the rule, only the vendor knows.
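To make the "processor gets to choose" point concrete, here is a minimal sketch using Python's standard-library expat binding, which is non-validating and leaves the choice to the application. It is not what Saxon does internally. The `RESOURCES` dictionary is a stand-in I made up for the file behind the system identifier, and `text_content` is a hypothetical helper.

```python
import xml.parsers.expat as expat

# Stand-in for the file behind the system identifier "source-text.txt".
RESOURCES = {"source-text.txt": b'<?xml version="1.0" encoding="UTF-8"?>banana'}

DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE container [
<!ENTITY external-parsed-ge SYSTEM "source-text.txt">
]>
<container>&external-parsed-ge;</container>"""


def text_content(document: bytes, include_external: bool) -> str:
    """Return the character data a non-validating parser reports."""
    chunks = []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append

    def load_external(context, base, system_id, public_id):
        # Parse the external entity with a child parser. The child copies
        # its parent's handlers, so its text lands in the same list.
        child = parser.ExternalEntityParserCreate(context)
        child.Parse(RESOURCES[system_id], True)
        return 1  # non-zero tells expat the entity was handled

    if include_external:
        parser.ExternalEntityRefHandler = load_external
    parser.Parse(document, True)
    return "".join(chunks)


if __name__ == "__main__":
    print(repr(text_content(DOC, include_external=True)))
    print(repr(text_content(DOC, include_external=False)))
```

The same document yields different content depending on a switch the document author cannot see, which is exactly the unpredictability complained about above.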
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE a-doc-type [
<!NOTATION my-notation SYSTEM "www.dat-files.org">
<!ENTITY unparsed-entity SYSTEM "source-file.dat" NDATA my-notation>
]>
<container>
<!-- Test case 5: An unparsed entity reference in content is a fatal error. -->
<test-case5>&unparsed-entity;</test-case5>
</container>
The document in listing 6 raises a fatal error when we attempt to ident transform it. Test case 5 (listing 6) illustrates that we cannot refer to an unparsed entity within content. Moreover, such references are forbidden pretty much everywhere. The only place an unparsed entity can be used is as the value of an attribute of type ENTITY, as we will see below.
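The same failure can be observed with another parser. This sketch assumes Python's standard-library ElementTree rather than Saxon, and `parses` is a hypothetical helper that simply reports whether the document got through.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE container [
<!NOTATION my-notation SYSTEM "www.dat-files.org">
<!ENTITY unparsed-entity SYSTEM "source-file.dat" NDATA my-notation>
]>
<container><test-case5>&unparsed-entity;</test-case5></container>"""


def parses(document: str) -> bool:
    """Report whether the parser accepts the document (hypothetical helper)."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # With the reference to the unparsed entity: rejected.
    print(parses(DOC))
    # Same declarations, but no reference in content: accepted.
    print(parses(DOC.replace("&unparsed-entity;", "")))
```

Declaring the unparsed entity is harmless. It is only the reference in content that stops the parse.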
References within attribute values
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE a-doc-type [
<!ENTITY % internal-pe "internal-pe-value">
<!ENTITY internal-ge "internal-ge-value">
<!ELEMENT test-case6 EMPTY>
<!ATTLIST test-case6 attri2 CDATA "%internal-pe;"><!-- 6.A: Parameter entities are not recognised in attribute values. -->
<!ELEMENT test-case7 EMPTY>
<!ATTLIST test-case7 attri2 CDATA "&internal-ge;"><!-- 7.A: Internal general entities are included in attribute values. -->
<!ELEMENT test-case8 EMPTY>
<!ATTLIST test-case8 attri2 CDATA "&#169;"><!-- 8.A: Character references are included in attribute values. -->
]>
<container>
<!-- Test case 6.B: Parameter entities are not recognised in attribute values. -->
<test-case6 attri1="%internal-pe;" />
<!-- Test case 7.B: Internal general entities are included in attribute values. -->
<test-case7 attri1="&internal-ge;" />
<!-- Test case 8.B: Character references are included in attribute values. -->
<test-case8 attri1="&#169; Sean B. Durkin, 2014" />
</container>
… transforms into …
<container>
<!-- Test case 6.B: Parameter entities are not recognised in attribute values. -->
<test-case6 attri1="%internal-pe;" attri2="%internal-pe;"/>
<!-- Test case 7.B: Internal general entities are included in attribute values. -->
<test-case7 attri1="internal-ge-value" attri2="internal-ge-value"/>
<!-- Test case 8.B: Character references are included in attribute values. -->
<test-case8 attri1="آ© Sean B. Durkin, 2014" attri2="آ©"/>
</container>
As we can see from listings 7 and 8, parameter entity references are not recognised in attribute values (just as they are not recognised in content). But internal general entities and character references are fine. Note that test case 7 only works because the entity is internal: a reference to an external parsed general entity in an attribute value is forbidden, as listing 9 shows.
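Here is the attribute-value experiment again as a small Python sketch, assuming the standard-library ElementTree parser rather than Saxon. The character reference is written as `&#169;`, and the `attributes` dictionary is just my way of collecting what the parser reports, including the defaults it picks up from the ATTLIST declarations.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE container [
<!ENTITY % internal-pe "internal-pe-value">
<!ENTITY internal-ge "internal-ge-value">
<!ATTLIST test-case6 attri2 CDATA "%internal-pe;">
<!ATTLIST test-case7 attri2 CDATA "&internal-ge;">
<!ATTLIST test-case8 attri2 CDATA "&#169;">
]>
<container>
<test-case6 attri1="%internal-pe;"/>
<test-case7 attri1="&internal-ge;"/>
<test-case8 attri1="&#169; Sean B. Durkin, 2014"/>
</container>"""

root = ET.fromstring(DOC)

# Element name -> attributes as the parser reports them, with attri1
# taken from the start-tag and attri2 from the declared default.
attributes = {child.tag: dict(child.attrib) for child in root}

for tag, values in attributes.items():
    print(tag, values)
```

Both the explicit attribute and the defaulted one follow the same rule in each case: the parameter entity reference is left alone, the other two are replaced.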
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE a-doc-type [
<!ENTITY external-parsed-ge SYSTEM "source-text.txt">
<!ELEMENT test-case9 EMPTY>
<!ATTLIST test-case9 attri2 CDATA "&external-parsed-ge;"><!-- 9.A: External entity reference forbidden in attribute value. -->
]>
<container>
<!-- Test case 9.B: An external entity, (be it either an external parsed general entity or an unparsed general entity)
is forbidden as a reference in an attribute value. This is a fatal error. -->
<test-case9 attri1="&external-parsed-ge;" />
</container>
Oh oh! The document of listing 9 blows up when we try to ident transform it. It has two fatal errors. Both illustrate that it is forbidden to refer to an external entity in an attribute value, be it in an attribute of an element or in an attribute default.
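A quick way to see this rule fire without Saxon is sketched below. It assumes Python's standard-library ElementTree, and `parse_error` is a hypothetical helper. The wording of the error message belongs to that parser and will differ from Saxon's.

```python
import xml.etree.ElementTree as ET
from typing import Optional

DOC = """<!DOCTYPE container [
<!ENTITY external-parsed-ge SYSTEM "source-text.txt">
]>
<container><test-case9 attri1="&external-parsed-ge;"/></container>"""


def parse_error(document: str) -> Optional[str]:
    """Return the parser's error message, or None if the document parses."""
    try:
        ET.fromstring(document)
    except ET.ParseError as error:
        return str(error)
    return None


if __name__ == "__main__":
    # Reference to an external entity inside an attribute value: an error.
    print(parse_error(DOC))
    # The same document with ordinary text in the attribute: no error.
    print(parse_error(DOC.replace("&external-parsed-ge;", "plain text")))
```

Note that the external resource never needs to exist for this to fail. The reference is rejected on sight.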
Entities AS attribute values
When an attribute value is declared to be of type ENTITY, we can list the entity directly as the value, as opposed to making an entity reference.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE a-doc-type [
<!ENTITY internal "stuff and nonsense">
<!NOTATION jpg SYSTEM "image/jpeg">
<!ENTITY file_pic SYSTEM "file.jpg" NDATA jpg>
<!ENTITY source-text SYSTEM "source-text.txt">
<!ELEMENT test-case EMPTY>
<!ATTLIST test-case source-entity ENTITY #REQUIRED>
]>
<container>
<!-- Test case 10. An internal general entity as an entity value is forbidden. -->
<test-case case-number="10" source-entity="internal"/>
<!-- Test case 11. An external parsed general entity as an entity value is forbidden. -->
<test-case case-number="11" source-entity="source-text"/>
<!-- Test case 12. When an unparsed entity occurs as an entity value, and the processor is validating,
the processor must inform the application of the system and public (if any) identifiers for
both the entity and its associated notation. (In this case "file.jpg" and "image/jpeg"). -->
<test-case case-number="12" source-entity="file_pic"/>
<!-- Test case 13. Character references are not recognised in attribute values of type ENTITY. -->
<test-case case-number="13" source-entity="&#169;"/>
</container>
… transforms into …
<container>
<!-- Test case 10. An internal general entity as an entity value is forbidden. -->
<test-case case-number="10" source-entity="internal"/>
<!-- Test case 11. An external parsed general entity as an entity value is forbidden. -->
<test-case case-number="11" source-entity="source-text"/>
<!-- Test case 12. When an unparsed entity occurs as an entity value, and the processor is validating,
the processor must inform the application of the system and public (if any) identifiers for
both the entity and its associated notation. (In this case "file.jpg" and "image/jpeg"). -->
<test-case case-number="12" source-entity="file_pic"/>
<!-- Test case 13. Character references are not recognised in attribute values of type ENTITY. -->
<test-case case-number="13" source-entity="آ©"/>
</container>
What the heck is happening in test cases 10, 11 and 13 (listing 11)? According to the “Processor Treatment of Entities and References” matrix in the spec, it is forbidden to specify a parsed general entity as the value of an attribute typed as ENTITY. Yet the XML processor used by Saxon allows these to pass without comment. I expected a fatal error. One possible explanation is that the spec expresses this requirement as a validity constraint (Entity Name), which only a validating processor is obliged to report, and that the parse here was not validating. I posted a question about this on StackOverflow. If you can explain this odd behaviour, feel free to leave a comment on this post.
An unparsed entity is fine (as you can see from test case 12).
Really, what is the point of this rule? What is wrong with an ENTITY attribute taking the value of a parsed entity? It feels like this is a typographical error in the spec.
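To show what "Notify" can look like from the application's side, here is a minimal sketch using Python's standard-library expat binding. This is an assumption on my part about a different parser, not a description of Saxon. The parser reports the notation and unparsed entity declarations through callbacks, and it is then up to the application to connect the attribute value `file_pic` to those identifiers. The three dictionaries and the handler names are mine.

```python
import xml.parsers.expat as expat

DOC = b"""<!DOCTYPE container [
<!NOTATION jpg SYSTEM "image/jpeg">
<!ENTITY file_pic SYSTEM "file.jpg" NDATA jpg>
<!ATTLIST test-case source-entity ENTITY #REQUIRED>
]>
<container><test-case source-entity="file_pic"/></container>"""

notations = {}   # notation name -> system identifier
unparsed = {}    # entity name -> (system identifier, notation name)
attributes = {}  # element name -> attributes as reported


def on_notation(name, base, system_id, public_id):
    notations[name] = system_id


def on_unparsed_entity(name, base, system_id, public_id, notation_name):
    unparsed[name] = (system_id, notation_name)


def on_start(tag, attrs):
    attributes[tag] = attrs


parser = expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.UnparsedEntityDeclHandler = on_unparsed_entity
parser.StartElementHandler = on_start
parser.Parse(DOC, True)

# The application joins the pieces: attribute value -> entity -> notation.
entity_name = attributes["test-case"]["source-entity"]
system_id, notation_name = unparsed[entity_name]
print(entity_name, system_id, notations[notation_name])
```

Because expat is non-validating, it hands over the attribute value as a plain string and does not check that it names an unparsed entity. That check is left to the application here.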
Reference in entity value
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE a-doc-type SYSTEM "extra-doc-type.dtp" [
<!ENTITY % colour "green">
<!ENTITY external-parsed-ge SYSTEM "source-text.txt">
<!ENTITY unparsed-entity SYSTEM "source-file.dat" NDATA my-notation>
<!-- When a general entity reference appears in the EntityValue in an entity declaration, it must be bypassed and left as is. -->
<!ENTITY test-case-15 "&test-case-14;">
<!-- In the processor's symbol map: test-case-15 ==> '&test-case-14;'. -->
<!ENTITY test-case-16 "&external-parsed-ge;">
<!-- In the processor's symbol map: test-case-16 ==> '&external-parsed-ge;'. -->
<!ENTITY test-case-17 "&#169;">
<!-- In the processor's symbol map: test-case-17 ==> '©'. -->
<!ENTITY test-case-18 "&unparsed-entity;">
<!-- Test case 18 is an error because references to unparsed entities are forbidden in entity values.
However, the processor may choose to recover from this error. -->
]>
<container>
<!-- Test case 15: Mapping of @attri applies recursively until we reach a value of 'x'. -->
<test-case-15 attri="&test-case-15;" />
<!-- Test case 16: An external parsed general entity reference (&external-parsed-ge;)
in the entity value definition above, is bypassed, and stored as-is.
When we come to normalisation of content (below), replacement values of entities
are recursively applied. -->
<test-case-16>&test-case-16;</test-case-16>
<test-case-17 attri="&test-case-17;" />
</container>
… transforms into …
<container>
<!-- Test case 15: Mapping of @attri applies recursively until we reach a value of 'x'. -->
<test-case-15 attri="It is at green"/>
<!-- Test case 16: An external parsed general entity reference (&external-parsed-ge;)
in the entity value definition above, is bypassed, and stored as-is.
When we come to normalisation of content (below), replacement values of entities
are recursively applied. -->
<test-case-16>banana</test-case-16>
<test-case-17 attri="آ©"/>
</container>
From listings 12 and 13, we can see that references to parsed entities within an entity value declaration are bypassed. Take a look at the entity declaration for test-case-15. When the declaration is parsed, the processor stores the value ‘&test-case-14;’ literally. It does not store the translation of &test-case-14; as the value of &test-case-15;. When it comes to parsing content, then at that point ‘&test-case-15;’ will be recursively replaced by a computed replacement value. By examining both listings 13 and 3, you can see that this comes out to be ‘It is at green’.
A reference to an unparsed entity within an entity value definition is an error, but not a fatal one. Test case 18, listing 12, is an example. Our processor chooses to swallow and recover from this error, but another processor might baulk at it.
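The difference between "bypassed" and "included" inside an entity value can be observed directly if the parser reports what it stored for each declaration. The sketch below assumes Python's standard-library expat binding, not Saxon. The entity names `outer`, `inner` and `copyright` are made up for illustration, and `outer` deliberately refers to `inner` before `inner` is declared.

```python
import xml.parsers.expat as expat

DOC = b"""<!DOCTYPE container [
<!ENTITY outer "[&inner;]">
<!ENTITY inner "banana">
<!ENTITY copyright "&#169;">
]>
<container>&outer; &copyright;</container>"""

stored = {}  # entity name -> value as stored at declaration time
chunks = []  # character data reported for the document content


def on_entity_decl(name, is_parameter, value, base,
                   system_id, public_id, notation_name):
    stored[name] = value


parser = expat.ParserCreate()
parser.EntityDeclHandler = on_entity_decl
parser.CharacterDataHandler = chunks.append
parser.Parse(DOC, True)
content = "".join(chunks)

print(stored)   # the general entity reference is kept as-is in "outer"
print(content)  # expansion happens only when content is parsed
```

The forward reference is harmless precisely because the reference is bypassed at declaration time. The character reference, in contrast, is already resolved in the stored value.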
Entity references within the DTD
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE a-doc-type [
<!ENTITY % internal-pe "<!ELEMENT test-case-19 EMPTY>
<!ATTLIST test-case-19 magic CDATA 'waffle'>">
<!ENTITY % external-pe SYSTEM "source-text.txt">
<!-- %internal-pe; below is included. That is to say, it is parsed as its replacement text -->
%internal-pe;
]>
<container>
<test-case-19/>
</container>
… transforms into …
<container> <test-case-19 magic="waffle"/> </container>
In listings 14 and 15, we can see that a reference to a parameter entity is included in the DTD. The rules for where parameter entity references can occur depend on whether the DTD subset is internal or external. For the internal subset, the rules are restrictive and fairly simple. You can only have parameter entity references at the outer-most declarative level, and the replacement text may only contain a whole number of markup-decls. But when the references are in the external subset, the rules are much more liberal. We can have them within mark-up.
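A parameter entity being "included as PE" can be reproduced with a small Python sketch. It assumes the standard-library expat binding, which, as far as I understand it, only expands parameter entity references once `SetParamEntityParsing` has been switched on. That is a detail of this parser, not of Saxon, and `attributes_seen` is a hypothetical helper.

```python
import xml.parsers.expat as expat

DOC = b"""<!DOCTYPE container [
<!ENTITY % internal-pe "<!ATTLIST test-case-19 magic CDATA 'waffle'>">
%internal-pe;
]>
<container><test-case-19/></container>"""


def attributes_seen(document: bytes, expand_parameter_entities: bool) -> dict:
    """Return element name -> attributes, as reported by the parser."""
    seen = {}

    def on_start(tag, attrs):
        seen[tag] = attrs

    parser = expat.ParserCreate()
    if expand_parameter_entities:
        parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
    parser.StartElementHandler = on_start
    parser.Parse(document, True)
    return seen


if __name__ == "__main__":
    # Reference included: the ATTLIST inside the entity takes effect.
    print(attributes_seen(DOC, expand_parameter_entities=True))
    # Reference skipped: the default attribute never appears.
    print(attributes_seen(DOC, expand_parameter_entities=False))
```

When the reference is included, the declaration hidden inside the entity behaves as if it had been written out in the DTD, and the default attribute shows up on the element.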
In contrast, listing 16 explodes when we try to ident transform it. It contains 4 fatal errors.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE a-doc-type [
<!ENTITY internal-ge "<!ELEMENT test-case-20 EMPTY>
<!ATTLIST test-case-20 magic CDATA 'waffle'>">
<!ENTITY external-parsed-ge SYSTEM "source-text.txt">
<!ENTITY unparsed-entity SYSTEM "source-file.dat" NDATA my-notation>
<!-- Test case 20: Internal general entity references are forbidden in the DTD,
(outside of an entity value definition, an attribute value, a processing instruction,
a comment, a system literal or a public id literal).
&internal-ge; below is forbidden in the DTD. This is a fatal error. -->
&internal-ge;
<!-- Test case 21: External parsed general entity references are forbidden in the DTD,
(outside of an entity value definition, an attribute value, a processing instruction,
a comment, a system literal or a public id literal).
&external-parsed-ge; below is forbidden in the DTD. This is a fatal error. -->
&external-parsed-ge;
<!-- Test case 22: Unparsed entity references are forbidden in the DTD,
(outside of an entity value definition, an attribute value, a processing instruction,
a comment, a system literal or a public id literal).
&unparsed-entity; below is forbidden in the DTD. This is a fatal error. -->
&unparsed-entity;
<!-- Test case 23: Character references are forbidden in the DTD,
(outside of an entity value definition, an attribute value).
&#109;, which is the letter 'm', is forbidden in the context shown below.
This is a fatal error. -->
<!ELEMENT test-case-23 EMPTY>
<!ATTLIST test-case-23 magic CDATA 'waffle'>"
]>
<container>
<test-case-20/>
<test-case-23/>
</container>
We are not permitted to put general entity references or character references within DTDs, except in attribute and entity values, as covered by previous rules.
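As a final check, the sketch below puts a general entity reference at the declaration level of a DTD and confirms that the document is rejected. It assumes Python's standard-library ElementTree rather than Saxon, and `parses` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE container [
<!ENTITY internal-ge "<!ELEMENT test-case-20 EMPTY>">
&internal-ge;
]>
<container/>"""


def parses(document: str) -> bool:
    """Report whether the parser accepts the document (hypothetical helper)."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # General entity reference between declarations: rejected.
    print(parses(DOC))
    # Same DTD without the reference: accepted.
    print(parses(DOC.replace("&internal-ge;\n", "")))
```

The declaration of the entity is fine. Only its use at the declaration level of the DTD is forbidden, which is the job parameter entities exist to do.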