xmlCIF

A Proposal for Faithful Representation
of
Extensible Markup Language (XML) Documents
within
Crystallographic Information File (CIF) Data Sets

Draft of 2 March 2000
Revised 19 July 2000

Herbert J. Bernstein yaya@bernstein-plus-sons.com

Introduction

This is a work in progress. It is incomplete, but it seems to be far enough along to provide a framework for further discussion.

The Extensible Markup Language (XML) [Bray, Paoli, Sperberg-McQueen 98] is "a subset of SGML ... Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML." The Crystallographic Information File (CIF) [Hall, Allen, Brown 91] is "the standard crystallographic data exchange prescribed by the International Union of Crystallography." Scientific papers, as well as data, are written in CIF format, and it is common practice to include text formatted according to various standards within CIFs. The purpose of this document is to propose extensions to the definition of CIF that allow XML documents to be included within CIF data sets without loss of any of the information contained in the XML document.

Use of CIF in crystallography is well-established. Use of XML is being widely discussed. Translation between the two representations is an important issue. As we shall see, translation from CIF to XML is relatively simple. Translation from XML to CIF is not. A preliminary version of cif2xml is ready for testing. Design of xml2cif is in progress.

Both XML and CIF have a similar "flavor", providing information associated with tags, but differ in significant details. XML allows a highly nested, order-dependent presentation of information. XML also allows various attributes of tags to be assigned values. There are many alternate ways in which documents with such features could be embedded as data within a CIF document. However, there is much to be gained if the embedding is such that the values of tags within embedded documents can be unambiguously parsed and associated with the values of tags elsewhere in the CIF document.

In order to achieve maximum functionality with a clean, uniform syntax, we propose to define an extended CIF (xCIF) format to be used in parsing certain data values within a CIF. While similar in style to CIF, xCIF also permits recursive nesting, optional order-dependence, optional use of associated parameters, optional preservation of white space, and other extensions which facilitate a faithful representation of XML documents within CIF documents.

xCIF is the syntactic base upon which we propose to build xmlCIF, a dictionary and software API specification for a rigorous mapping between XML and CIF.

The Basics of XML

The full definition of XML is given in [Bray, Paoli, Sperberg-McQueen 98]. We present a simplified description of some of the features of XML to assist in understanding the mappings to and from xmlCIF.

An XML document consists of character data intermingled with "markup". The ampersand ("&"), percent ("%"), and angle brackets ("<" and ">") are highly significant in XML and are used to help distinguish the character data of the document from its markup.

XML Markup consists of:

start-tags	<name> <name attribute=value attribute=value ...>	marks the beginning of an XML element. The attribute-value pairs are optional and no attribute may appear more than once
end-tags	</name>	marks the end of the XML element begun by the start-tag with a matching name
empty-element tags	<name/> <name attribute=value attribute=value ... />	this is a special form equivalent to <name></name> or <name attribute=value attribute=value ...></name> which is used when a tag has no content
entity references	&name; %name;	Entity references refer to objects by name. The symbols "&", "<", ">", "'", and the double quote are represented by "&amp;", "&lt;", "&gt;", "&apos;", and "&quot;" respectively.
character references	&#nnn;	-- specifies a character with decimal unicode value nnn
	&#xhhh;	-- specifies a character with hexadecimal unicode value hhh
comments	<!-- comment -->	This special markup is used to include comment text
CDATA sections	<![CDATA[ character_data ]]>	This special markup is used to embed text which might otherwise be interpreted as markup.
document type declarations	<?xml version="1.0"?>	This optional special markup unambiguously identifies an XML document.
	<!DOCTYPE name ... >	This optional special markup provides information on the markup declarations that define the grammar of the document.
element type declarations	<!ELEMENT name contents>
attribute list declarations	<!ATTLIST name elementname type default ... >
entity declarations	<!ENTITY name entity_definition > <!ENTITY % name parsed_entity_definition >
notation declarations	<!NOTATION name id >
processing instructions	<?program_name parameters ?>

The term "whitespace" in XML (as well as in CIF) refers to any non-empty sequence of spaces, tabs or line-terminators.

An XML name is a string beginning with a letter, underscore ("_") or colon (":") and consisting of letters, digits, hyphens, underscores, colons or periods ("."). Names beginning with "xml" (case-insensitive) and names containing the colon are reserved. They should be accepted by parsers, but authors of documents should not generate such names except for the reserved purposes.
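The simplified naming rule above can be sketched as a small check. This is only an illustration of the rule as stated in this section: the function names are hypothetical, and the pattern assumes ASCII letters, whereas the full XML definition admits a much wider set of letters.

```python
import re

# Simplified rule from the text: first character is a letter, "_" or ":";
# the rest are letters, digits, hyphens, underscores, colons or periods.
# Assumption: only ASCII letters are considered here.
_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")

def is_xml_name(name):
    """Return True if `name` satisfies the simplified naming rule."""
    return _NAME.fullmatch(name) is not None

def is_reserved(name):
    """Return True for names an author should not generate:
    those beginning with "xml" (any case) or containing a colon."""
    return name.lower().startswith("xml") or ":" in name

if __name__ == "__main__":
    for candidate in ("cell.entry_id", "1crn", "xml:lang", "atomArray"):
        print(candidate, is_xml_name(candidate), is_reserved(candidate))
```

A parser would accept every name for which is_xml_name is true, while a writer would additionally avoid the names flagged by is_reserved.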

An XML "system" literal string is quoted either with a single quote ("'") or a double quote, and may not contain the character chosen as the quote mark. There are other literals which have additional restrictions on the characters that may be included in those literal strings. In order to allow quote marks within strings, the special escape sequences "&apos;" and "&quot;" may be used to represent the single and double quotes within character data.
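As an illustration of these escape sequences, the following sketch uses the Python standard library module xml.sax.saxutils. The helper name escape_all is hypothetical; escape and quoteattr are existing library functions.

```python
from xml.sax.saxutils import escape, quoteattr

def escape_all(text):
    """Escape the five characters discussed above.

    escape() handles "&", "<" and ">" by default; the extra mapping
    adds the two quote marks.
    """
    return escape(text, {"'": "&apos;", '"': "&quot;"})

if __name__ == "__main__":
    print(escape_all('a < b & "c"'))
    # quoteattr() picks a quote mark the value does not contain and
    # falls back to &quot; when the value contains both kinds.
    print(quoteattr("it's"))
    print(quoteattr('say "hi"'))
```

Choosing the quote mark that does not occur in the value, as quoteattr does, keeps most attribute values free of escape sequences.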

XML has been used as a framework for the definition of a Chemical Markup Language (CML) [Murray-Rust, Rzepa 99]. The program Jmol [Gezelter 99] is able to display CML datasets. A typical fragment of a CML dataset presents atomic coordinates by columns, as in this methanol example distributed with the Jmol release:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE molecule SYSTEM "cml.dtd" []>
<molecule id="METHANOL">
  <atomArray>
     <stringArray builtin="id">a1 a2 a3 a4 a5 a6</stringArray>
     <stringArray builtin="elementType">C O H H H H</stringArray>
     <floatArray builtin="x3" units="pm">-0.748 0.558 -1.293 -1.263 -0.699 0.716</floatArray>
     <floatArray builtin="y3" units="pm">-0.015 0.420 0.202 0.754 -0.934 1.404</floatArray>
     <floatArray builtin="z3" units="pm">0.024 -0.278 -0.901 0.600 0.609  0.137</floatArray>
  </atomArray>
</molecule>
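To make the column-wise layout concrete, here is a minimal sketch that reads the atom arrays of the methanol fragment with the Python standard library and regroups them into one record per atom. The function name atoms_by_row is hypothetical, the embedded text is a copy of the arrays above without the XML and DOCTYPE declarations, and this is not how Jmol itself reads CML.

```python
import xml.etree.ElementTree as ET

CML = """<molecule id="METHANOL">
  <atomArray>
    <stringArray builtin="id">a1 a2 a3 a4 a5 a6</stringArray>
    <stringArray builtin="elementType">C O H H H H</stringArray>
    <floatArray builtin="x3" units="pm">-0.748 0.558 -1.293 -1.263 -0.699 0.716</floatArray>
    <floatArray builtin="y3" units="pm">-0.015 0.420 0.202 0.754 -0.934 1.404</floatArray>
    <floatArray builtin="z3" units="pm">0.024 -0.278 -0.901 0.600 0.609 0.137</floatArray>
  </atomArray>
</molecule>"""

def atoms_by_row(cml_text):
    """Turn the column-wise arrays into a list of per-atom dictionaries."""
    root = ET.fromstring(cml_text)
    columns = {}
    for array in root.find("atomArray"):
        values = array.text.split()          # whitespace-separated column
        if array.tag == "floatArray":
            values = [float(v) for v in values]
        columns[array.get("builtin")] = values
    names = list(columns)
    # zip(*...) transposes the columns into rows
    return [dict(zip(names, row)) for row in zip(*columns.values())]

if __name__ == "__main__":
    for atom in atoms_by_row(CML):
        print(atom)
```

The transposition in the last line of atoms_by_row is the step that converts the "naturally" column-based presentation into the row-based one that CIF loops use.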

The Basics of CIF

The full definition of CIF is given in [Hall, Allen, Brown 91]. In simplified terms, a CIF is a collection of data blocks. Each data block contains data names (tags) and their values. Whitespace (in the same sense as with XML) is used to delimit the tokens of the language. Tags are marked with a leading underscore ("_") to distinguish them from values. Values which might be confused with data names or which contain whitespace are quoted in one of three ways: with single quotes, with double quotes, or with a semicolon as the first character of a line. An unusual aspect of CIF is that a terminal quote mark is not meaningful unless followed by whitespace. The single and double quotes may only be used to quote strings that are confined to a single line. In addition to the underscore and the three quote marks, three other characters have special meaning: the period ("."), the question mark ("?") and the hash mark ("#"). The period is used when no value is specified. The question mark is used when a value is desired but not available. The hash mark indicates that the remaining characters on a line are part of a comment.

There are a small number of reserved words: "global_", "data_", "loop_", "stop_", and "save_". The last two reserved words are not used by CIF but are reserved to prevent conflict with the language from which CIF is derived (STAR). "global_" and "data_" mark the start of a data block. "data_" should be followed immediately with the name of the block, without intervening whitespace. If "loop_" appears, it is followed by a sequence of tags without intervening data values. Those tags are considered as the column headings of a table. These are followed by rows of data values corresponding to those column headings. Outside of a table, tags and data values appear in simple alternation.

Within a data block a given tag may appear only once. The meaning of a CIF document is not altered by changing the order of presentation of data blocks nor is it altered by changing the order of presentation of tags within a block.
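The simplified description of CIF given in this section can be sketched as a very small reader. This is an illustrative sketch only, not CIFtbx and not a conforming CIF parser: the function names are hypothetical, "global_", "save_" and "stop_" are not handled, error checking is minimal, and tags are folded to lower case on the assumption that tag comparison is case-insensitive.

```python
import re

# A closing quote only counts when followed by whitespace or end of line.
_TOKEN = re.compile(r"""'(.*?)'(?=\s|$)|"(.*?)"(?=\s|$)|(#.*)|(\S+)""")

def tokenize(text):
    """Return (kind, text) pairs; kind is data, loop, tag or value."""
    tokens = []
    lines = iter(text.splitlines())
    for line in lines:
        if line.startswith(";"):             # semicolon-quoted text field
            body = [line[1:]]
            for more in lines:
                if more.startswith(";"):
                    break
                body.append(more)
            tokens.append(("value", "\n".join(body)))
            continue
        for m in _TOKEN.finditer(line):
            if m.lastindex == 3:             # comment: skip rest of line
                break
            tok = m.group(m.lastindex)
            if m.lastindex in (1, 2):
                tokens.append(("value", tok))
            elif tok.startswith("_"):
                tokens.append(("tag", tok.lower()))
            elif tok.lower().startswith("data_"):
                tokens.append(("data", tok[5:]))
            elif tok.lower() == "loop_":
                tokens.append(("loop", tok))
            else:
                tokens.append(("value", tok))
    return tokens

def parse_cif(text):
    """Return {block name: {tag: value or list of column values}}."""
    blocks, current = {}, None
    toks = tokenize(text)
    i = 0
    while i < len(toks):
        kind, tok = toks[i]
        if kind == "data":
            current = blocks.setdefault(tok, {})
            i += 1
        elif kind == "loop":
            i += 1
            tags, values = [], []
            while i < len(toks) and toks[i][0] == "tag":
                tags.append(toks[i][1])
                i += 1
            while i < len(toks) and toks[i][0] == "value":
                values.append(toks[i][1])
                i += 1
            for col, tag in enumerate(tags):  # de-interleave the rows
                current[tag] = values[col::len(tags)]
        elif kind == "tag":
            current[tok] = toks[i + 1][1]
            i += 2
        else:
            raise ValueError("unexpected value: " + tok)
    return blocks

EXAMPLE = """data_1CRN
_cell.entry_id   1CRN      # a comment
_cell.length_a   40.96
_struct.title
;Crambin
;
loop_
_atom_site.id
_atom_site.type_symbol
1 N
2 C
"""

if __name__ == "__main__":
    print(parse_cif(EXAMPLE))
```

Because each block is stored as a dictionary keyed by tag, the result does not depend on the order in which tags were presented, which mirrors the order independence described above.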

Converting between CIF and XML

The conversion from CIF to XML is relatively simple. Translation from XML to CIF is not. The major differences are:

CIF	XML
Table-oriented "naturally" row-based	Tree-oriented "naturally" column-based
Case-insensitive tags	Case-sensitive entity names
Two levels of nesting	Unlimited nesting
Order independent	Order dependent
Dictionary-based tag parametrization	DTD and dynamic entity parameters

In addition, the rules for writing tag names in XML are slightly more restrictive than they are in CIF. Quoted strings have slightly different syntax.

cif2xml

cif2xml is a program which converts from CIF to XML using the CIF toolbox, CIFtbx [Hall, Bernstein 96]. The basic approach is to map categories into an outer level of XML tags and individual tags into the next level down the tree. Three new dictionary tags are defined to allow for mapping of CIF categories and tags to XML entity names:

_xml_mapping.token	gives the CIF token to be mapped
_xml_mapping.token_type	gives the type of CIF token
_xml_mapping.target	gives the string to be used in xml

The mapping is optionally by rows or by columns. Mapping by columns is the default because it allows a much higher density of data relative to tags.
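The difference between the two mappings can be sketched as follows. This is not the cif2xml implementation: the function name loop_to_xml, the "row" wrapper element, and the element names are all hypothetical, chosen only to show why the column mapping carries fewer tags per data value.

```python
import xml.etree.ElementTree as ET

def loop_to_xml(category, columns, by_columns=True):
    """Serialize one looped category either by columns or by rows.

    `columns` maps a tag name to the list of values in that column.
    """
    root = ET.Element(category)
    if by_columns:
        # one element per tag, holding the whole column
        for tag, values in columns.items():
            ET.SubElement(root, tag).text = " ".join(values)
    else:
        # one wrapper element per row, one child element per value
        for row in zip(*columns.values()):
            row_el = ET.SubElement(root, "row")
            for tag, value in zip(columns, row):
                ET.SubElement(row_el, tag).text = value
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    atom_site = {"id": ["1", "2"], "type_symbol": ["N", "C"]}
    print(loop_to_xml("atom_site", atom_site, by_columns=True))
    print(loop_to_xml("atom_site", atom_site, by_columns=False))
```

With by_columns=True each tag is written once regardless of the number of rows, while the row mapping repeats every tag for every row.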

Here is the beginning of the cell information from 1crn as mapped by cif2xml:

<crystal>
<cell.entry_id>                 1CRN
</cell.entry_id>
<float builtin="gacell">         40.96
</float>
<float builtin="gbcell">         18.65
</float>
<float builtin="gccell">         22.52
</float>
<float builtin="galpha">         90.
</float>
<float builtin="gbeta">          90.77
</float>
<float builtin="ggamma">         90.
</float>
...

Note that the non-CML tag cell.entry_id is included. cif2xml allows for request lists so that such tags may be excluded, but, for use with Jmol, there is no need to exclude them.

The output of cif2xml when used to produce data by columns agrees with the output of the BioDOM program pdb2xml [Moore 99] for such non-looped data. For coordinate lists the higher information density of the cif2xml output results in faster dataset reading and display when used with Jmol.

The Proposed xCIF Format

We propose to extend the CIF syntax to create an extended CIF format to be used in parsing certain data values within a CIF. Each parsed data value is treated as an "xCIF" document.

An xCIF document is a valid CIF document, within which certain tags are used which have values intended to be parsed according to xCIF syntax. We say that such values are either of "xCIF type" or are of other types and have the "xCIF attribute". We define two initial tags of type xCIF, _xCIF.doc and _xCIF.doc_params, which are used to bring xCIF documents into a CIF as values. If these tags are used within a loop, multiple xCIF documents may be included. In order to allow an ordering of these top-level included xCIF documents, and to allow for multiple instances of the same xCIF document, we define an additional tag, _xCIF.doc_ordinal, to optionally specify a document ordinal. Each document is given as a value of _xCIF.doc. The corresponding value of _xCIF.doc_params specifies the top level parameters applicable to that document.

Within an xCIF document, the following tags are defined:

_params	The value is an xCIF document assigning values to tags. In the context of xmlCIF, these are equivalent to assigning values to attributes for this block of character data, and, if order is being preserved, should appear first in the block.
_cmnt	The value is a comment. This is an alternative to "#" delimited comments.
_text	The value is a block of text exempt from nested parsing.
_prog _prog_params	The value of _prog is the name of a program. The value of _prog_params is a string representing parameters for the "associated" _prog tag. If order is being preserved, the "associated" _prog tag must be the immediately preceding tag. If order is not preserved, a loop_ must be used to create the association. This pair of tags is used to represent the XML "<?" constructs.
_doctype _element _attlist _entity _notation	These tags are used to carry information from the equivalent XML "<!" constructs.

The Top Level Parameters

The parameters controlling the syntax and semantics of an xCIF document are specified in a character string (or text field) containing values for the following tags:

_xCIF_doc.parse_contents	"yes" (the default) if the xCIF document is to be parsed, "no" if not.
_xCIF_doc.preserve_spacing	"yes" if white space within the xCIF document is significant, "no" (the default) if not.
_xCIF_doc.preserve_order	"yes" if the ordering of tags within the xCIF document is significant, "no" (the default) if not.
_xCIF_doc.repeat_tags	"yes" if tags may be repeated, "no" (the default) if not.
_xCIF_doc.recursion	"yes" (the default) if the values of parsed tags are themselves to be parsed, "no" if not. The combination of "_xCIF_doc.parse_contents no" and "_xCIF_doc.recursion yes" is not meaningful.
_xCIF_doc.extensions	"yes" (the default) if the xCIF extensions to CIF parsing rules are to be enabled in parsing the xCIF document, "no" if not.
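The defaults in the table above can be collected in a short sketch that reads a parameter string of the kind carried by _xCIF.doc_params. The function names parse_doc_params and is_meaningful are hypothetical, and the sketch assumes the string is a plain whitespace-separated alternation of tags and "yes"/"no" values.

```python
# Defaults as listed in the table above.
DEFAULTS = {
    "parse_contents": "yes",
    "preserve_spacing": "no",
    "preserve_order": "no",
    "repeat_tags": "no",
    "recursion": "yes",
    "extensions": "yes",
}
_PREFIX = "_xcif_doc."

def parse_doc_params(params):
    """Return the full settings dictionary for a parameter string."""
    settings = dict(DEFAULTS)
    tokens = params.split()
    if len(tokens) % 2:
        raise ValueError("tags and values must alternate")
    for tag, value in zip(tokens[0::2], tokens[1::2]):
        key = tag.lower()
        if not key.startswith(_PREFIX) or key[len(_PREFIX):] not in DEFAULTS:
            raise ValueError("unknown parameter tag: " + tag)
        if value not in ("yes", "no"):
            raise ValueError("value must be yes or no: " + value)
        settings[key[len(_PREFIX):]] = value
    return settings

def is_meaningful(settings):
    """False for the combination the text singles out as not meaningful."""
    return not (settings["parse_contents"] == "no"
                and settings["recursion"] == "yes")

if __name__ == "__main__":
    s = parse_doc_params(
        "_xCIF_doc.preserve_order no _xCIF_doc.repeat_tags yes")
    print(s, is_meaningful(s))
```

The check is kept separate from the parse so that a caller can decide how to react to a combination that is not meaningful.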

For example, a simple framework for specifying an xCIF document which is to be parsed for information to be used in creating an XML document might begin:


     data_xmlDATA  
     _xCIF.doc_params  "_xCIF_doc.preserve_order no _xCIF_doc.repeat_tags yes"
     _xCIF.doc     
; _prog xml 
  _prog_params "version=&quot;1.0&quot;"       
  _doctype "html ..."       ...     
;

The Extended CIF Syntax

The parsing rules for xCIF are similar to those for CIF, with some added flexibility. An xCIF document consists of lines of text representing a continuous string of characters, from which the parser extracts substrings as tokens if parsing has been enabled.

The syntax is very similar to CIF. The body of the document consists of tags and values, either in directly associated pairs or in loops. Two constructs which are illegal in CIF outside of loops are permitted in xCIF: multiple tags in sequence and multiple values in sequence. Constructs of the form

_tag₁ _tag₂ … _tag_n value₁ value₂ … value_m

are equivalent to

_tag₁
; _tag₂
\; …
\…\;_tag_n " value₁ value₂ … value_m"
\…\;
;

nesting the uses of the tags and concatenating the values. This convention does not change the ordinary CIF handling of loop headers and bodies.
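The equivalence can be pictured with nested dictionaries standing in for nested text fields. This is only a sketch of the idea: the function name nest is hypothetical, and the values are joined with single blanks as an assumption about what concatenation means when whitespace is not being preserved.

```python
def nest(tags, values):
    """Nest a run of tags around the concatenation of a run of values.

    The last tag receives the concatenated values; each earlier tag
    receives the structure built from the tags that follow it.
    """
    if not tags:
        raise ValueError("at least one tag is required")
    result = " ".join(values)
    for tag in reversed(tags):
        result = {tag: result}
    return result

if __name__ == "__main__":
    print(nest(["_tag1", "_tag2", "_tag3"], ["value1", "value2"]))
```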

The ability to imply nesting by concatenating tags is supplemented with two additional special constructs. The tag "_", consisting of just an underscore, may be used to return one or more levels of nesting or to function as a multiple-level bracket. If the value associated with the "_" tag is a non-negative whole number, the parse returns that many levels of nesting. If the value associated with the "_" tag is symbolic and begins with "}", the parse returns to the level at which it most recently encountered the "_" tag with a value beginning with "{" and matching in the remaining characters (if any).

The combinations "_ 0" and "_ ." are no-ops for the parse.

To understand the impact of these extensions, consider the following HTML fragment:

<CENTER>
<TABLE BORDER="2" WIDTH="380">
<TR><TD>A</TD>       <TD>24.4</TD></TR>
<TR><TD>B</TD>       <TD>38.9</TD></TR>
<TR><TD>C</TD>       <TD>34.7</TD></TR>
<TR><TD>&alpha;</TD> <TD>88.0</TD></TR>
<TR><TD>&beta;</TD>  <TD>108.0</TD></TR>
<TR><TD>&gamma;</TD> <TD>111.0</TD></TR>
</TABLE>
</CENTER>

This fragment might be translated as


_center
_table
_params "_border 2 _width 380"
_tr _td A         _td 24.4     _ 1
_tr _td B         _td 38.9     _ 1
_tr _td C         _td 34.7     _ 1
_tr _td &alpha;   _td 88.0     _ 1
_tr _td &beta;    _td 108.0    _ 1
_tr _td &gamma;   _td 111.0    _ 1

or as


_center
_table
_params "_border 2 _width 380"
_ { _tr _td A       _td 24.4      _ }
_ { _tr _td B       _td 38.9      _ }
_ { _tr _td C       _td 34.7      _ }
_ { _tr _td &alpha; _td 88.0      _ }
_ { _tr _td &beta;  _td 108.0     _ }
_ { _tr _td &gamma; _td 111.0     _ }

both of which are equivalent to

 
_center
; _table
\;
  _params '_border 2 _width 380'
  _tr '_td A       _td 24.4'
  _tr '_td B       _td 38.9'
  _tr '_td C       _td 34.7'
  _tr '_td &alpha; _td 88.0'
  _tr '_td &beta;  _td 108.0'
  _tr '_td &gamma; _td 111.0'
\;
;
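One reading of these rules can be sketched as a small tree builder, which shows that the numeric form and the bracket form yield the same structure. The sketch rests on assumptions that go beyond the text: a tag that directly follows another tag opens a new level inside it, a tag that follows a value is added at the current level, and the names lex, new_node and build_tree are hypothetical. Escapes, loops and text fields are ignored.

```python
import re

# Quoted strings are always values; unquoted tokens starting with "_" are tags.
_TOKEN = re.compile(r"""'([^']*)'|"([^"]*)"|(\S+)""")

def lex(text):
    for m in _TOKEN.finditer(text):
        if m.lastindex == 3 and m.group(3).startswith("_"):
            yield "tag", m.group(3)
        else:
            yield "value", m.group(m.lastindex)

def new_node(tag):
    return {"tag": tag, "values": [], "children": []}

def build_tree(text):
    root = new_node("root")
    stack = [root]     # stack[-1] is the level currently being filled
    marks = []         # (label, depth) pairs recorded by "_ {label"
    last = None        # most recent tag node
    awaiting = False   # True while `last` has not yet received a value
    tokens = list(lex(text))
    i = 0
    while i < len(tokens):
        kind, tok = tokens[i]
        if kind == "tag" and tok == "_":
            i += 1
            arg = tokens[i][1]
            awaiting = False
            if arg.isdigit():                 # return that many levels
                del stack[max(1, len(stack) - int(arg)):]
            elif arg.startswith("{"):         # remember the current level
                marks.append((arg[1:], len(stack)))
            elif arg.startswith("}"):         # return to the matching mark
                while marks:
                    label, depth = marks.pop()
                    if label == arg[1:]:
                        del stack[depth:]
                        break
            # anything else, such as ".", is a no-op
        elif kind == "tag":
            if awaiting:                      # tag after tag: go one level in
                stack.append(last)
            last = new_node(tok)
            stack[-1]["children"].append(last)
            awaiting = True
        else:
            if last is None:
                raise ValueError("value before any tag")
            last["values"].append(tok)
            awaiting = False
        i += 1
    return root

NUMERIC = """_center _table _params "_border 2 _width 380"
_tr _td A _td 24.4 _ 1
_tr _td B _td 38.9 _ 1"""

BRACKETS = """_center _table _params "_border 2 _width 380"
_ { _tr _td A _td 24.4 _ }
_ { _tr _td B _td 38.9 _ }"""

if __name__ == "__main__":
    print(build_tree(NUMERIC) == build_tree(BRACKETS))
```

Keeping an explicit stack of open levels makes both forms of the "_" tag a matter of truncating that stack, either by a count or back to a remembered depth.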

The Lexical Scan

Tokens and whitespace are identified in a preliminary lexical scan according to the following rules:

0. The scan identifies tokens as reserved words, tags, values, whitespace or comments. The reserved words are: "data_", "global_", "loop_", "stop_" and "save_". Tags are flagged with "_" followed by a non-empty string. Values are any other string. Whitespace is any combination of line terminators, blanks or tabs. Comments are flagged by a leading "#" and continue to the end of the line.
1. Lines of text may come from the original, outer-most document, or may themselves be extracted from tokens identified by the parser at some higher level as being available for parsing.
2. Each scan begins with the left-most character of the first line of text to be parsed, with an empty string for the tentative token.
3. If the parse has been instructed to preserve whitespace, whitespace is accumulated both as a token and as the verbatim string which was parsed as whitespace.
4. An unquoted "#" causes the rest of the line to be treated as a comment. If the parse has been instructed to preserve whitespace, comments are preserved as well.
5. An unquoted "\" causes special processing of the following character or characters, either by quoting or by generating escape sequences.
6. An unquoted "&" causes special processing of the following characters as an escape sequence.
7. Backslash quoting: A backslash "\" before a non-alphabetic, non-numeric, non-white-space character disables any special meaning the following character may otherwise have for the parser. Thus "\\" may be used to include a backslash, or a backslash may be used before an underscore, semicolon, ampersand, single quote or double quote to disable their special meanings.
8. Ampersand and backslash escapes: The HTML-style ampersand escape sequences and the C-style backslash escapes are both acceptable. For example, "\n" represents an embedded line terminator, "&copy;" represents the copyright symbol, and either "&quot;" or "\q" may be used to indicate a double quote as an alternative to backslash quoting of the double quote.
9. Backslash concatenation: A line which is to be concatenated with the next line is indicated by an unquoted "\" as the last non-whitespace character on the line. The backslash and the trailing whitespace are removed in the concatenation. Only backslash quoting is significant in suppressing this scan. This allows strings quoted by single and double quotes to be continued across lines.
10. Quoted strings. In any context where a string is required, the recognition of leading and trailing underscores and whitespace may be suppressed by enclosing the string in matching single quote, double quote or "\n;" pairs, provided the second member of the pair is followed by whitespace. In the xCIF parse, this parse rule applies to a string which names a tag, allowing constructions such as

_"name with blanks"

The quote marks are removed in defining the relevant token, but information about which quote mark was used is preserved if the parse has been instructed to preserve whitespace.
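Rules 7 through 9 can be sketched with two small helpers. This is a partial illustration rather than a complete xCIF scanner: the names join_continued and decode_escapes are hypothetical, "\t" is included as an assumed C-style escape alongside the "\n" and "\q" mentioned above, unknown alphanumeric escapes are left untouched, and ampersand escapes are resolved with the standard library function html.unescape.

```python
import html
import re

def join_continued(lines):
    """Rule 9: join a line ending in an unquoted backslash with the next."""
    out, buf = [], ""
    for line in lines:
        stripped = line.rstrip()
        trailing = len(stripped) - len(stripped.rstrip("\\"))
        if trailing % 2 == 1:          # odd count: the last "\" is unquoted
            buf += stripped[:-1]       # drop the backslash and trailing blanks
        else:
            out.append(buf + line)
            buf = ""
    if buf:
        out.append(buf)
    return out

_C_ESCAPES = {"n": "\n", "t": "\t", "q": '"'}
_ESCAPE = re.compile(r"\\(\S)|&#?\w+;")

def decode_escapes(text):
    """Rules 7 and 8: backslash quoting plus ampersand and C-style escapes."""
    def repl(m):
        ch = m.group(1)
        if ch is None:                 # an ampersand escape such as &copy;
            return html.unescape(m.group(0))
        if ch.isalnum():               # C-style escape; unknown ones are kept
            return _C_ESCAPES.get(ch, m.group(0))
        return ch                      # backslash quoting of a special char
    return _ESCAPE.sub(repl, text)

if __name__ == "__main__":
    print(join_continued(["'a string continued \\", "on a second line'"]))
    print(decode_escapes(r"\_not_a_tag &copy; \q quoted \q"))
```

In a full scanner, line joining would run before tokenization and escape decoding would run on each token afterwards, so that a quoted special character is never mistaken for a delimiter.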

Specifying Parameters in General

Within an xCIF document, parameters for any given tag may be specified either by defining a specific associated tag whose value carries the parameters for the original tag, as we do with _xCIF.doc and _xCIF.doc_params, or by use of the _params tag within the xCIF document.

References

[Bernstein et al. 98] H. J. Bernstein, F. C. Bernstein, P. E. Bourne, "pdb2cif: Translating PDB Entries into mmCIF Format", J. Appl. Cryst. 31, 282-295 (1998), software available from http://www.iucr.org/iucr-top/CIF and http://ndbserver.rutgers.edu

[Bray, Paoli, Sperberg-McQueen 98] T. Bray, J. Paoli, C. M. Sperberg-McQueen, eds., "Extensible Markup Language (XML)", W3C Recommendation 10-Feb-98, REC-xml-19980210, http://www.w3.org/TR/1998/REC-xml-19980210

[Hall, Allen, Brown 91] S. R. Hall, F. H. Allen, I. D. Brown, "The Crystallographic Information File (CIF): A New Standard Archive File for Crystallography", Acta Cryst. A47, 655-685 (1991), http://www.us.iucr.org/iucr-top/cif/standard/cifstd1.html

[Murray-Rust, Rzepa 99] P. Murray-Rust, H. Rzepa, "Chemical markup, XML and the WWW, Part I: Basic principles", J. Chem. Inf. Comp. Sci. 39 (6), 928-942 (1999). See http://www.xml-cml.org

Document Updated 19 July 2000

Herbert J. Bernstein
yaya@bernstein-plus-sons.com
