An XML Document Type Declaration (DOCTYPE) section defines the structure of the XML data contained within the XML document. It is a set of rules that determine what can and can't go into the document. Rules that define the structure of data are referred to as a Schema, and XML Schemas are the point at which XML starts to get complicated.
The Document Type Declaration node must appear near the top of the XML document, after the XML Declaration but before the XML Document Element. Broadly speaking, the Document Type Declaration node can take two forms: a reference to an external file which contains the DTD Schema, or an inline DTD Schema description.
The XML Document Type Declaration can reference an external file which contains the actual DTD schema.
The following example demonstrates this.
ExternalDtdSample.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE people_list SYSTEM "example.dtd">
    <people_list>
      <person>
        <name>Fred Bloggs</name>
        <birthdate>27/11/2008</birthdate>
        <gender>Male</gender>
      </person>
    </people_list>
Let's look at the DOCTYPE tag in more detail.
The first argument is the "name", in our example "people_list". It tells the parser that these rules apply to a document whose root element (the Document Element) is named "people_list".
The next argument has only two possible values, "SYSTEM" or "PUBLIC".
If the value is "SYSTEM", then the parser expects the next argument to be a SystemIdentifier, a quoted string which it treats as a URI.
<!DOCTYPE Name SYSTEM "SystemIdentifier">
The SystemIdentifier URI is used to retrieve the DTD schema, which is then used to validate the document.
example.dtd

    <!ELEMENT people_list (person*)>
    <!ELEMENT person (name, birthdate?, gender?, socialsecuritynumber?)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT birthdate (#PCDATA)>
    <!ELEMENT gender (#PCDATA)>
    <!ELEMENT socialsecuritynumber (#PCDATA)>
The contents of the DTD for this example are shown above; for a full description of how DTDs work, see DTD Schemas.
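To see what a parser actually pulls out of a SYSTEM DOCTYPE, here is a minimal sketch using Python's standard `xml.parsers.expat` module. The helper name `read_doctype` and the shortened sample document are assumptions made for illustration. Note that expat only reports the identifiers here; it does not fetch example.dtd or validate the document against it.

```python
import xml.parsers.expat

# Shortened copy of ExternalDtdSample.xml (illustrative only).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE people_list SYSTEM "example.dtd">
<people_list>
  <person>
    <name>Fred Bloggs</name>
  </person>
</people_list>
"""


def read_doctype(xml_text):
    """Return the DOCTYPE name and identifiers as reported by expat."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # expat calls this once, when it reaches the <!DOCTYPE ...> node.
        found["name"] = name
        found["system_id"] = system_id
        found["public_id"] = public_id
        found["has_internal_subset"] = bool(has_internal_subset)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return found


info = read_doctype(SAMPLE)
print(info["name"], info["system_id"], info["public_id"])
```

With the SYSTEM form there is no PublicIdentifier, so the handler receives `None` for it.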
If the second argument in the DOCTYPE is "PUBLIC", then the next two arguments are the PublicIdentifier and a SystemIdentifier.
<!DOCTYPE Name PUBLIC "PublicIdentifier" "SystemIdentifier">
The PublicIdentifier is just an identifier. It has no specific meaning; it's just an ID.
The SystemIdentifier is treated as a URI, just as in the previous example.
The parser can use any combination of data taken from the PublicIdentifier and SystemIdentifier in order to resolve the DTD Schema.
HTML 4.01 DOCTYPE

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
A parser reading this DOCTYPE node may look at the PublicIdentifier "-//W3C//DTD HTML 4.01 Transitional//EN" and decide to use the HTML 4.01 Transitional DTD Schema loaded from its local cache, rather than going across the internet to read http://www.w3.org/TR/html4/loose.dtd.
Equally, another parser may look at the PublicIdentifier "-//W3C//DTD HTML 4.01 Transitional//EN" and have no idea what it is, so it would be forced to read the DTD schema from the URI held in the SystemIdentifier (http://www.w3.org/TR/html4/loose.dtd).
The XML Standard does not provide any rules for how to resolve an external DTD schema given a public and system identifier.
From the W3C spec: "[Definition: In addition to a system identifier, an external identifier may include a public identifier.] An XML processor attempting to retrieve the entity's content may use any combination of the public and system identifiers as well as additional information outside the scope of this specification to try to generate an alternative URI reference. If the processor is unable to do so, it MUST use the URI reference specified in the system literal."
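Since the standard leaves the resolution strategy open, one possible approach is a simple lookup table keyed on the PublicIdentifier, falling back to the SystemIdentifier when the ID is unknown. The sketch below is only an illustration of that idea; the `LOCAL_CATALOG` table, the `dtd-cache/loose.dtd` path and the `resolve_dtd` function are all hypothetical names, not part of any parser's API.

```python
# Hypothetical catalog: PublicIdentifier -> local copy of the DTD.
LOCAL_CATALOG = {
    "-//W3C//DTD HTML 4.01 Transitional//EN": "dtd-cache/loose.dtd",
}


def resolve_dtd(public_id, system_id, catalog=LOCAL_CATALOG):
    """Decide where to load a DTD from.

    A PublicIdentifier found in the catalog maps to a local file.
    Anything else falls back to the URI in the SystemIdentifier,
    which is the behaviour the spec requires as a last resort.
    """
    if public_id is not None and public_id in catalog:
        return catalog[public_id]
    return system_id


# Known PublicIdentifier: served from the local copy.
print(resolve_dtd("-//W3C//DTD HTML 4.01 Transitional//EN",
                  "http://www.w3.org/TR/html4/loose.dtd"))

# SYSTEM-only DOCTYPE: no PublicIdentifier, so the URI is used as-is.
print(resolve_dtd(None, "example.dtd"))
```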
In a production system, files should never be loaded directly from the W3C web site.
In the past, so many applications were reading schema data directly from the http://www.w3.org domain that it was accounting for the majority of the W3C's traffic. To solve this problem the W3C started to introduce a delay. Now, if you download a schema file directly from this domain it will be served, but it could take up to 2 minutes to get it (that's 120 seconds to read a file that is 2-3 KBytes). When you consider that some of these schemas include others, reading one file may cause many more to be read, each one taking up to 2 minutes. The result is that your application becomes unusably slow.
As a result, you have to cache the W3C schemas you use in order to avoid reading them from the W3C web site.
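One way to do that caching is a small download-once helper: the first request for a SystemIdentifier goes to the network, and every later request is served from disk. This is a rough sketch rather than a production design; `cached_dtd`, `fetch_over_http` and the cache layout are hypothetical names chosen for this example, and there is no expiry or error handling.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch_over_http(uri):
    """Download the DTD bytes from its SystemIdentifier URI."""
    with urllib.request.urlopen(uri, timeout=30) as response:
        return response.read()


def cached_dtd(system_id, cache_dir, fetch=fetch_over_http):
    """Return the DTD bytes, touching the network only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Hash the URI so that any SystemIdentifier maps to a safe file name.
    key = hashlib.sha256(system_id.encode("utf-8")).hexdigest()
    cached_file = cache_dir / (key + ".dtd")

    if not cached_file.exists():
        cached_file.write_bytes(fetch(system_id))
    return cached_file.read_bytes()
```

The `fetch` parameter is there so the download step can be swapped out, for example to read from a directory of DTDs shipped with the application instead of the internet.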
Using the DOCTYPE node, it is possible to specify a set of DTD Schema rules that are defined inline within the XML Document.
The square brackets [...] contain the DTD Schema rules.
InlineSchema.xml

    <?xml version="1.0"?>
    <!DOCTYPE note [
      <!ELEMENT note (to,from,heading,body)>
      <!ELEMENT to (#PCDATA)>
      <!ELEMENT from (#PCDATA)>
      <!ELEMENT heading (#PCDATA)>
      <!ELEMENT body (#PCDATA)>
    ]>
    <note>
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don't forget me this weekend</body>
    </note>
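Because the inline rules are part of the document itself, a parser sees them without loading anything else. As a rough illustration, the sketch below uses the `ElementDeclHandler` hook of Python's standard `xml.parsers.expat` module to list each `<!ELEMENT>` rule and the child element names it mentions. The `declared_elements` helper is a hypothetical name, and expat is only reporting the declarations here, not validating the note against them.

```python
import xml.parsers.expat

# Copy of InlineSchema.xml.
NOTE = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to,from,heading,body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend</body>
</note>
"""


def declared_elements(xml_text):
    """Map each declared element name to the child names in its rule."""
    rules = {}

    def on_element_decl(name, model):
        # model is a (type, quantifier, name, children) tuple; each child
        # is itself a model tuple whose third item is the child's name.
        rules[name] = [child[2] for child in model[3]]

    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = on_element_decl
    parser.Parse(xml_text, True)
    return rules


rules = declared_elements(NOTE)
for element_name, children in rules.items():
    print(element_name, children)
```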
The syntax for a Document Type Declaration section is described by the W3C using EBNF as follows.
    [28]  doctypedecl  ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>'
    [28a] DeclSep      ::= PEReference | S
    [28b] intSubset    ::= (markupdecl | DeclSep)*
    [29]  markupdecl   ::= elementdecl | AttlistDecl | EntityDecl | NotationDecl | PI | Comment
    [45]  elementdecl  ::= '<!ELEMENT' S Name S contentspec S? '>'
    [52]  AttlistDecl  ::= '<!ATTLIST' S Name AttDef* S? '>'
    [70]  EntityDecl   ::= GEDecl | PEDecl
    [71]  GEDecl       ::= '<!ENTITY' S Name S EntityDef S? '>'
    [72]  PEDecl       ::= '<!ENTITY' S '%' S Name S PEDef S? '>'
    [75]  ExternalID   ::= 'SYSTEM' S SystemLiteral | 'PUBLIC' S PubidLiteral S SystemLiteral
    [82]  NotationDecl ::= '<!NOTATION' S Name S (ExternalID | PublicID) S? '>'
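To make productions [28] and [75] a little more concrete, here is a simplified regular-expression rendering of them in Python. It is only a sketch: the Name pattern is far looser than the real Name production, the internal subset is captured as raw text rather than parsed, and a `]` inside a quoted string in the subset would confuse it. The `parse_doctype` function and the group names are hypothetical. A real XML parser should be used for anything beyond experimentation.

```python
import re

# SystemLiteral / PubidLiteral: a string in double or single quotes.
QUOTED = r'(?:"[^"]*"' + r"|'[^']*')"

# [75] ExternalID ::= 'SYSTEM' S SystemLiteral
#                   | 'PUBLIC' S PubidLiteral S SystemLiteral
EXTERNAL_ID = (
    r"(?:SYSTEM\s+(?P<system>" + QUOTED + r")"
    r"|PUBLIC\s+(?P<public>" + QUOTED + r")\s+(?P<system2>" + QUOTED + r"))"
)

# [28] doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S?
#                      ('[' intSubset ']' S?)? '>'
DOCTYPE_RE = re.compile(
    r"<!DOCTYPE\s+(?P<name>[^\s\[>]+)"
    r"(?:\s+" + EXTERNAL_ID + r")?"
    r"\s*(?:\[(?P<subset>.*?)\]\s*)?>",
    re.DOTALL,
)


def parse_doctype(text):
    """Split the first DOCTYPE in text into its parts, or return None."""
    match = DOCTYPE_RE.search(text)
    if match is None:
        return None
    system = match.group("system") or match.group("system2")
    public = match.group("public")
    return {
        "name": match.group("name"),
        # Strip the surrounding quotes from the literals.
        "public_id": public[1:-1] if public else None,
        "system_id": system[1:-1] if system else None,
        "internal_subset": match.group("subset"),
    }


print(parse_doctype('<!DOCTYPE people_list SYSTEM "example.dtd">'))
```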