xmlCIF


A Proposal for Faithful Representation
of
Extensible Markup Language (XML) Documents
within
Crystallographic Information File (CIF) Data Sets

Draft of 2 March 2000
Revised 19 July 2000

by

Herbert J. Bernstein yaya@bernstein-plus-sons.com

Introduction

This is a work in progress. It is incomplete, but it seems to be far enough along to provide a framework for further discussion.

The Extensible Markup Language (XML) [Bray, Paoli, Sperberg-McQueen 98] is "a subset of SGML ... Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML." The Crystallographic Information File (CIF) [Hall, Allen, Brown 91] is "the standard crystallographic data exchange prescribed by the International Union of Crystallography." Scientific papers, as well as data, are written in CIF format, and it is common practice to include text formatted according to various standards within CIFs. The purpose of this document is to propose extensions to the definition of CIF to allow for the inclusion of XML documents within CIF data sets without the loss of any of the information contained within the XML document.

Use of CIF in crystallography is well-established. Use of XML is being widely discussed. Translation between the two representations is an important issue. As we shall see, translation from CIF to XML is relatively simple. Translation from XML to CIF is not. A preliminary version of cif2xml is ready for testing. Design of xml2cif is in progress.

Both XML and CIF have a similar "flavor", providing information associated with tags, but differ in significant details. XML allows a highly nested, order-dependent presentation of information. XML also allows various attributes of tags to be assigned values. There are many alternative ways in which documents with such features could be embedded as data within a CIF document. However, there is much to be gained if the embedding is such that the values of tags within embedded documents can be unambiguously parsed and associated with the values of tags elsewhere in the CIF document.

In order to achieve maximum functionality with a clean, uniform syntax, we propose to define an extended CIF (xCIF) format to be used in parsing certain data values within a CIF which, while similar in style to CIF, also permits recursive nesting, optional order-dependence, optional use of associated parameters, optional preservation of white space, and other extensions that facilitate a faithful representation of XML documents within CIF documents.

xCIF is the syntactic base upon which we propose to build xmlCIF, a dictionary and software API specification for a rigorous mapping between XML and CIF.

The Basics of XML

The full definition of XML is given in [Bray, Paoli, Sperberg-McQueen 98]. We present a simplified description of some of the features of XML to assist in understanding the mappings to and from xmlCIF.

An XML document consists of character data intermingled with "markup". The ampersand ("&"), percent ("%"), and angle brackets ("<" and ">") are highly significant in XML and are used to help distinguish the character data of the document from its markup.

XML Markup consists of:

start-tags

<name>
<name attribute=value attribute=value ...>

Marks the beginning of an XML element. The attribute-value pairs are optional and no attribute may appear more than once.

end-tags

</name>

Marks the end of the XML element begun by the start-tag with a matching name.

empty-element tags

<name/>
<name attribute=value attribute=value ... />

This is a special form, equivalent to <name></name> or <name attribute=value attribute=value ...></name>, which is used when an element has no content.

entity references

&name;
%name;

Entity references refer to objects by name. The symbols "&", "<", ">", "'", and the double quote are represented by "&amp;", "&lt;", "&gt;", "&apos;", "&quot;" respectively.

character references

&#nnn;

-- specifies a character with decimal unicode value nnn

 

&#xhhh;

-- specifies a character with hexadecimal unicode value hhh

comments

<!-- comment -->

This special markup is used to include comment text

CDATA sections

<![CDATA[ character_data ]]>

This special markup is used to embed text which might otherwise be interpreted as markup.

document type declarations

<?xml version="1.0"?>

This optional special markup unambiguously identifies an XML document.

 

<!DOCTYPE name ... >

This optional special markup provides information on the markup declarations that define the grammar of the document.

element type declarations

<!ELEMENT name contents>

 

attribute list declarations

<!ATTLIST elementname attributename type default ... >

 

entity declarations

<!ENTITY name entity_definition >
<!ENTITY % name parsed_entity_definition >

 

notation declarations

<!NOTATION name id >

 

processing instructions

<?program_name parameters ?>

 

The term "whitespace" in XML (as well as in CIF) refers to any non-empty sequence of spaces, tabs or line-terminators.

An XML name is a string beginning with a letter, underscore ("_") or colon (":") and consisting of letters, digits, hyphens, underscores, colons or periods ("."). Names beginning with "xml" (case-insensitive) and names containing the colon are reserved. They should be accepted by parsers, but authors of documents should not generate such names except for the reserved purposes.
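The simplified naming rule above can be sketched as a small check. This is a rough illustration only: it assumes ASCII letters and digits (the full XML definition admits a much larger set of characters), and the function name classify_name is hypothetical.

```python
import re

# Simplified XML name rule from the text, restricted to ASCII (an assumption).
NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*\Z")


def classify_name(name):
    """Return 'invalid', 'reserved' or 'ok' for a candidate XML name."""
    if not NAME_RE.match(name):
        return "invalid"
    # Names starting with "xml" (any case) or containing a colon are reserved.
    if name.lower().startswith("xml") or ":" in name:
        return "reserved"
    return "ok"


results = {n: classify_name(n) for n in ["atomArray", "cell.entry_id", "xmlns:cml", "2theta"]}
```

A converter from CIF to XML could use a check of this kind to decide when a CIF tag needs an explicit mapping to a different XML name.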

An XML "system" literal string is quoted either with single quote ("'") or a double quote, and may not contain the character chosen as the quote mark. There are other literals which have additional restrictions on the characters that may be included in those literal strings. In order to allow quote marks within strings, the special escape sequences "&apos;" and "&quot;" may be used to represent the single and double quotes within character data.
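As a minimal sketch of these escape sequences, the Python standard library can encode and decode the five predefined entities, and a character reference can be built from a code point. The helper names encode_text, decode_text and char_ref are illustrative, not part of any proposal here.

```python
from xml.sax.saxutils import escape, unescape

# escape() handles &, < and > by itself; the quote marks are added explicitly.
QUOTES = {"'": "&apos;", '"': "&quot;"}
UNQUOTES = {entity: char for char, entity in QUOTES.items()}


def encode_text(text):
    """Replace the five special characters with their entity references."""
    return escape(text, QUOTES)


def decode_text(text):
    """Inverse of encode_text for the five predefined entities."""
    return unescape(text, UNQUOTES)


def char_ref(char, hexadecimal=False):
    """Build a decimal or hexadecimal character reference for one character."""
    code = ord(char)
    return f"&#x{code:x};" if hexadecimal else f"&#{code};"


encoded = encode_text('He said "5 < 6" & left')
```

Decoding the encoded string returns the original text, which is the property needed if quote marks are to survive a round trip between the two formats.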

XML has been used as a framework for definition of a Chemical Markup Language (CML) [Murray-Rust, Rzepa 99]. The program Jmol [Gezelter 99] is able to display CML datasets. A typical fragment of a CML dataset presents atomic coordinates by columns, as seen in this methanol dataset distributed as an example in the Jmol release:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE molecule SYSTEM "cml.dtd" []>
<molecule id="METHANOL">
  <atomArray>
     <stringArray builtin="id">a1 a2 a3 a4 a5 a6</stringArray>
     <stringArray builtin="elementType">C O H H H H</stringArray>
     <floatArray builtin="x3" units="pm">-0.748 0.558 -1.293 -1.263 -0.699 0.716</floatArray>
     <floatArray builtin="y3" units="pm">-0.015 0.420 0.202 0.754 -0.934 1.404</floatArray>
     <floatArray builtin="z3" units="pm">0.024 -0.278 -0.901 0.600 0.609  0.137</floatArray>
  </atomArray>
</molecule>
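
To illustrate what "by columns" means for a reader of such a dataset, the following sketch parses a trimmed, well-formed copy of the methanol fragment with Python's standard xml.etree.ElementTree module and reassembles one record per atom. The XML declaration and DOCTYPE are omitted from the embedded string for simplicity, and the function name read_columns is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the methanol fragment (declaration and DOCTYPE omitted).
CML = """<molecule id="METHANOL">
  <atomArray>
    <stringArray builtin="id">a1 a2 a3 a4 a5 a6</stringArray>
    <stringArray builtin="elementType">C O H H H H</stringArray>
    <floatArray builtin="x3" units="pm">-0.748 0.558 -1.293 -1.263 -0.699 0.716</floatArray>
    <floatArray builtin="y3" units="pm">-0.015 0.420 0.202 0.754 -0.934 1.404</floatArray>
    <floatArray builtin="z3" units="pm">0.024 -0.278 -0.901 0.600 0.609 0.137</floatArray>
  </atomArray>
</molecule>"""


def read_columns(xml_text):
    """Return {builtin name: list of values} for each array under atomArray."""
    root = ET.fromstring(xml_text)
    columns = {}
    for array in root.find("atomArray"):
        items = array.text.split()
        if array.tag == "floatArray":
            items = [float(item) for item in items]
        columns[array.get("builtin")] = items
    return columns


columns = read_columns(CML)
# Transpose the columns into one tuple per atom (the "row" view of the same data).
atoms = list(zip(*(columns[key] for key in ("id", "elementType", "x3", "y3", "z3"))))
```

Each element holds a whole column, so a row-oriented consumer has to transpose the data after reading it.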


The Basics of CIF

The full definition of CIF is given in [Hall, Allen, Brown 91]. In simplified terms, a CIF is a collection of data blocks. Each data block contains data names (tags) and their values. Whitespace (in the same sense as with XML) is used to delimit the tokens of the language. Tags are marked with a leading underscore ("_") to distinguish them from values. Values which might be confused with data names or which contain whitespace are quoted in one of three ways: with single or double quotes or with semicolon as the first character of a line. An unusual aspect of CIF is that a terminal quote mark is not meaningful unless followed by whitespace. The single and double quote may only be used to quote strings that are confined to a single line. In addition to the underscore, and the three quote marks, three other characters have special meaning: the period ("."), the question mark ("?") and the hash mark ("#"). The period is used when no value is specified. The question mark is used when a value is desired but not available. The hash mark indicates that the remaining characters on a line are part of a comment.

There are a small number of reserved words: "global_", "data_", "loop_", "stop_", and "save_". The last two reserved words are not used by CIF but are reserved to prevent conflict with the language from which CIF is derived (STAR). "global_" and "data_" mark the start of a data block. "data_" should be followed immediately by the name of the block, without intervening whitespace. If "loop_" appears, it is followed by a sequence of tags without intervening data values. Those tags are considered as the column headings of a table. These are followed by rows of data values corresponding to those column headings. Outside of a table, tags and data values appear in simple alternation.

Within a data block a given tag may appear only once. The meaning of a CIF document is not altered by changing the order of presentation of data blocks nor is it altered by changing the order of presentation of tags within a block.
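
The simplified description above can be turned into a small reader. The following is a sketch under stated assumptions, not a conforming CIF parser: it handles only "data_" and "loop_", ignores "global_", "save_" and "stop_", discards anything after the semicolon that closes a text field, and lower-cases tags to reflect their case-insensitivity. All function names are hypothetical.

```python
import re


def tokenize(text):
    """Yield (kind, value) pairs; kind is 'word' or 'string'."""
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(";"):
            # Semicolon text field: runs to the next line starting with ';'.
            body = [line[1:]]
            i += 1
            while i < len(lines) and not lines[i].startswith(";"):
                body.append(lines[i])
                i += 1
            yield ("string", "\n".join(body))
            i += 1  # skip the closing ';' line
            continue
        pos = 0
        while pos < len(line):
            ch = line[pos]
            if ch.isspace():
                pos += 1
            elif ch == "#":
                break  # comment runs to end of line
            elif ch in "'\"":
                # A closing quote only counts when followed by whitespace or end of line.
                end = pos + 1
                while end < len(line) and not (
                    line[end] == ch and (end + 1 == len(line) or line[end + 1].isspace())
                ):
                    end += 1
                yield ("string", line[pos + 1:end])
                pos = end + 1
            else:
                word = re.match(r"\S+", line[pos:]).group()
                yield ("word", word)
                pos += len(word)
        i += 1


def is_tag(tok):
    return tok[0] == "word" and tok[1].startswith("_")


def is_keyword(tok):
    low = tok[1].lower()
    return tok[0] == "word" and (low == "loop_" or low.startswith("data_"))


def store(block, tag, value):
    if tag in block:  # a tag may appear only once per data block
        raise ValueError(f"duplicate tag {tag}")
    block[tag] = value


def parse_cif(text):
    """Return {block name: {tag: value, or list of column values for looped tags}}."""
    toks = list(tokenize(text))
    blocks, block, i = {}, None, 0
    while i < len(toks):
        kind, val = toks[i]
        if kind == "word" and val.lower().startswith("data_"):
            block = blocks.setdefault(val[5:], {})
            i += 1
        elif block is None:
            raise ValueError("content before the first data_ block")
        elif kind == "word" and val.lower() == "loop_":
            i += 1
            tags, values = [], []
            while i < len(toks) and is_tag(toks[i]):
                tags.append(toks[i][1].lower())
                i += 1
            while i < len(toks) and not is_tag(toks[i]) and not is_keyword(toks[i]):
                values.append(toks[i][1])
                i += 1
            if not tags or len(values) % len(tags):
                raise ValueError("loop_ values do not fill whole rows")
            for col, tag in enumerate(tags):
                store(block, tag, values[col::len(tags)])
        elif (is_tag(toks[i]) and i + 1 < len(toks)
              and not is_tag(toks[i + 1]) and not is_keyword(toks[i + 1])):
            store(block, val.lower(), toks[i + 1][1])
            i += 2
        else:
            raise ValueError(f"unexpected token {val!r}")
    return blocks


SAMPLE = """data_demo
_cell.length_a  40.96   # trailing comment
_note 'it's quoted'
loop_
_atom.id _atom.symbol
a1 C
a2 O
_remark
;line one
line two
;
"""
cif = parse_cif(SAMPLE)
```

The value of _note shows the terminal-quote rule at work: the apostrophe inside "it's" is followed by a letter rather than whitespace, so it does not end the string.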

Converting between CIF and XML

The conversion from CIF to XML is relatively simple. Translation from XML to CIF is not. The major differences are:

CIF                                     XML
Table-oriented,                         Tree-oriented,
"naturally" row-based                   "naturally" column-based
Case-insensitive tags                   Case-sensitive entity names
Two levels of nesting                   Unlimited nesting
Order independent                       Order dependent
Dictionary-based tag parametrization    DTD and dynamic entity parameters

In addition, the rules for writing tag names in XML are slightly more restrictive than they are in CIF. Quoted strings have slightly different syntax.

cif2xml

cif2xml is a program which converts from CIF to XML using the CIF toolbox, CIFtbx [Hall, Bernstein 96]. The basic approach is to map categories into an outer level of XML tags and individual tags into the next level down the tree. Three new dictionary tags are defined to allow for mapping of CIF categories and tags to XML entity names:

_xml_mapping.token gives the CIF token to be mapped
_xml_mapping.token_type gives the type of CIF token
_xml_mapping.target gives the string to be used in XML

The mapping is optionally by rows or by columns. Mapping by columns is the default because it allows a much higher density of data versus tags.
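
The difference in density can be illustrated with a small sketch. This is not the cif2xml implementation: the element names (atom_site, id, symbol, and the wrapper element row) are hypothetical, and the code only shows the general shape of the two mappings for one looped category.

```python
import xml.etree.ElementTree as ET


def loop_by_columns(category, tags, rows):
    """One child element per tag, holding that tag's whole column of values."""
    parent = ET.Element(category)
    for col, tag in enumerate(tags):
        ET.SubElement(parent, tag).text = " ".join(row[col] for row in rows)
    return parent


def loop_by_rows(category, tags, rows):
    """One wrapper element per row, with one child element per value."""
    parent = ET.Element(category)
    for row in rows:
        record = ET.SubElement(parent, "row")  # hypothetical wrapper element
        for tag, value in zip(tags, row):
            ET.SubElement(record, tag).text = value
    return parent


TAGS = ["id", "symbol"]
ROWS = [["a1", "C"], ["a2", "O"], ["a3", "H"]]

by_columns = loop_by_columns("atom_site", TAGS, ROWS)
by_rows = loop_by_rows("atom_site", TAGS, ROWS)

columns_xml = ET.tostring(by_columns, encoding="unicode")
# Number of elements (including the outer one) needed by each mapping.
element_counts = (len(list(by_columns.iter())), len(list(by_rows.iter())))
```

For this three-row, two-tag loop the column mapping needs 3 elements and the row mapping needs 10; the gap widens as the number of rows grows.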

Here is the beginning of the cell information from 1crn as mapped by cif2xml:

<crystal>
<cell.entry_id>                 1CRN
</cell.entry_id>
<float builtin="gacell">         40.96
</float>
<float builtin="gbcell">         18.65
</float>
<float builtin="gccell">         22.52
</float>
<float builtin="galpha">         90.
</float>
<float builtin="gbeta">          90.77
</float>
<float builtin="ggamma">         90.
</float>
...

Note that the non-CML tag cell.entry_id is included. cif2xml allows for request lists so that such tags may be excluded, but, for use with Jmol, there is no need to exclude them.

The output of cif2xml when used to produce data by columns agrees with the output of the BioDOM program pdb2xml [Moore 99] for such non-looped data. For coordinate lists the higher information density of the cif2xml output results in faster dataset reading and display when used with Jmol.

The Proposed xCIF Format

We propose to extend the CIF syntax to create an extended CIF format to be used in parsing certain data values within a CIF. Each parsed data value is treated as an "xCIF" document.

An xCIF document is a valid CIF document, within which certain tags are used which have values intended to be parsed according to xCIF syntax. We say that such values are either of "xCIF type" or are of other types and have the "xCIF attribute". We define two initial tags of type xCIF, _xCIF.doc and _xCIF.doc_params, which are used to bring xCIF documents into a CIF as values. If these tags are used within a loop, multiple xCIF documents may be included. In order to allow an ordering of these top-level included xCIF documents, and to allow for multiple instances of the same xCIF document, we define an additional tag, _xCIF.doc_ordinal, to optionally specify a document ordinal. Each document is given as a value of _xCIF.doc. The corresponding value of _xCIF.doc_params specifies the top level parameters applicable to that document.

Within an xCIF document, the following tags are defined:

_params

The value is an xCIF document assigning values to tags. In the context of xmlCIF, these are equivalent to assigning values to attributes for this block of character data, and, if order is being preserved, should appear first in the block.

_cmnt

The value is a comment. This is an alternative to "#" delimited comments.

_text

The value is a block of text exempt from nested parsing.

_prog
_prog_params

The value of _prog is the name of a program. The value of _prog_params is a string representing parameters for the "associated" _prog tag. If order is being preserved, the "associated" _prog tag must be the immediately preceding tag. If order is not preserved, a loop_ must be used to create the association. This pair of tags is used to represent the XML "<?" constructs.

_doctype
_element
_attlist
_entity
_notation

These tags are used to carry information from the equivalent XML "<!" constructs.
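
As an illustration of the _prog and _prog_params pair, the following sketch turns an XML "<?" construct into the two tag-value pairs, with order preserved so that _prog_params immediately follows its _prog. It is one possible reading of the description above, not a defined part of the proposal: the function name pi_to_xcif is hypothetical, and the choice to escape double quotes as "&quot;" and wrap the parameters in double quotes is an assumption.

```python
import re
from xml.sax.saxutils import escape

# "<?" target, optional whitespace, parameters, "?>"
PI_RE = re.compile(r"<\?([^\s?]+)\s*(.*?)\s*\?>", re.DOTALL)


def pi_to_xcif(pi):
    """Return the _prog / _prog_params lines for one processing instruction."""
    match = PI_RE.fullmatch(pi.strip())
    if match is None:
        raise ValueError(f"not a processing instruction: {pi!r}")
    target, params = match.groups()
    # Assumption: double quotes become &quot; so the value can itself be double-quoted.
    # escape() also converts &, < and > in the parameters to entity references.
    quoted = escape(params, {'"': "&quot;"})
    return f'_prog {target}\n_prog_params "{quoted}"'


xml_decl = pi_to_xcif('<?xml version="1.0"?>')
```

If order is not being preserved, the two values would instead have to be placed in the same row of a loop_ to keep them associated.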

 

The Top Level Parameters

The parameters controlling the syntax and semantics of an xCIF document are specified in a character string (or text field) containing values for the following tags:

_xCIF_doc.parse_contents

"yes" (the default) if the xCIF document is to be parsed, "no" if not.

_xCIF_doc.preserve_spacing

"yes" if white space within the xCIF document is significant, "no" (the default) if not.

_xCIF_doc.preserve_order

"yes" if the ordering of tags within the xCIF document is significant, "no" (the default) if not.

_xCIF_doc.repeat_tags

"yes" if tags may be repeated, "no" (the default) if not.

_xCIF_doc.recursion

"yes" (the default) if the values of parsed tags are themselves to be parsed, "no" if not. The combination of "_xCIF_doc.parse_contents no" and "_xCIF_doc.recursion yes" is not meaningful.

_xCIF_doc.extensions

"yes" (the default) if the xCIF extensions to CIF parsing rules are to be enabled in parsing the xCIF document, "no" if not.

For example, a simple framework for specifying an xCIF document which is to be parsed for information to be used in creating an XML document might begin


     data_xmlDATA  
     _xCIF.doc_params  "_xCIF_doc.preserve_order no _xCIF_doc.repeat_tags yes"
     _xCIF.doc     
; _prog xml 
  _prog_params "version=&quot;1.0&quot;"       
  _doctype "html ..."       ...     
; 
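
A reader of such a document needs to combine the parameter string with the defaults listed above. The following is a minimal sketch of that step, assuming the string contains only unquoted tags and "yes"/"no" values separated by whitespace. The function name parse_doc_params is hypothetical, and treating an explicit "parse_contents no" with an explicit "recursion yes" as an error is an assumption; the text above only says the combination is not meaningful.

```python
# Defaults as listed in the section on top level parameters.
DEFAULTS = {
    "parse_contents": "yes",
    "preserve_spacing": "no",
    "preserve_order": "no",
    "repeat_tags": "no",
    "recursion": "yes",
    "extensions": "yes",
}
PREFIX = "_xcif_doc."  # compared in lower case, since CIF tags are case-insensitive


def parse_doc_params(text):
    """Return the full parameter set for one xCIF document."""
    params = dict(DEFAULTS)
    explicit = set()
    tokens = text.split()
    if len(tokens) % 2:
        raise ValueError("expected alternating tags and values")
    for tag, value in zip(tokens[0::2], tokens[1::2]):
        lowered = tag.lower()
        name = lowered[len(PREFIX):] if lowered.startswith(PREFIX) else None
        if name not in DEFAULTS:
            raise ValueError(f"unknown parameter tag {tag!r}")
        if value not in ("yes", "no"):
            raise ValueError(f"{tag} must be 'yes' or 'no', not {value!r}")
        params[name] = value
        explicit.add(name)
    # Assumption: reject the combination only when recursion was requested explicitly.
    if (params["parse_contents"] == "no"
            and "recursion" in explicit and params["recursion"] == "yes"):
        raise ValueError("recursion requires parse_contents to be 'yes'")
    return params


params = parse_doc_params("_xCIF_doc.preserve_order no _xCIF_doc.repeat_tags yes")
```

With the parameter string from the example, only repeat_tags differs from its default.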

 

The Extended CIF Syntax

The parsing rules for xCIF are similar to those for CIF, with some added flexibility. An xCIF document consists of lines of text representing a continuous string of characters, from which the parser extracts substrings as tokens if parsing has been enabled.

The syntax is very similar to CIF. The body of the document consists of tags and values, either in directly associated pairs or in loops. Two constructs which are illegal in CIF outside of loops are permitted in xCIF: multiple tags in sequence and multiple values in sequence. Constructs of the form

 

_tag1 _tag2 … _tagn value1 value2 … valuem

are equivalent to

_tag1
; _tag2
\; …
\…\;_tagn " value1 value2 … valuem"
\…\;
\;
;

 

nesting the uses of the tags and concatenating the values. This convention does not change the ordinary CIF handling of loop headers and bodies.

The ability to imply nesting by concatenating tags is supplemented with two additional special constructs. The tag "_", consisting of just an underscore, may be used to return up one or more levels of nesting or to function as a multiple-level bracket. If the value associated with the "_" tag is numeric and a non-negative whole number, the parse returns up that many levels of nesting. If the value associated with the "_" tag is symbolic and begins with "}", the parse returns to the level at which it most recently encountered the "_" tag with a value that begins with "{" and matches in the remaining characters (if any).

The combinations "_ 0" and "_ ." are no-ops for the parse.

To understand the impact of these extensions, consider the following HTML fragment:

<CENTER>
<TABLE BORDER="2" WIDTH="380">
<TR><TD>A</TD>       <TD>24.4</TD></TR>
<TR><TD>B</TD>       <TD>38.9</TD></TR>
<TR><TD>C</TD>       <TD>34.7</TD></TR>
<TR><TD>&alpha;</TD> <TD>88.0</TD></TR>
<TR><TD>&beta;</TD>  <TD>108.0</TD></TR>
<TR><TD>&gamma;</TD> <TD>111.0</TD></TR>
</TABLE>
</CENTER>

This fragment might be translated as


_center
_table
_params "_border 2 _width 380"
_tr _td A _td 24.4 _ 1
_tr _td B _td 38.9 _ 1
_tr _td C _td 34.7 _ 1
_tr _td &alpha; _td 88.0 _ 1
_tr _td &beta; _td 108.0 _ 1
_tr _td &gamma; _td 111.0 _ 1

or as


_center
_table
_params "_border 2 _width 380"
_ { _tr _td A       _td 24.4      _ }
_ { _tr _td B       _td 38.9      _ }
_ { _tr _td C       _td 34.7      _ }
_ { _tr _td &alpha; _td 88.0      _ }
_ { _tr _td &beta;  _td 108.0     _ }
_ { _tr _td &gamma; _td 111.0     _ }

both of which are equivalent to

 
_center
; _table
\;
  _params '_border 2 _width 380'
  _tr '_td A       _td 24.4'
  _tr '_td B       _td 38.9'
  _tr '_td C       _td 34.7'
  _tr '_td &alpha; _td 88.0'
  _tr '_td &beta;  _td 108.0'
  _tr '_td &gamma; _td 111.0'
\;
;
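
One reading of these nesting rules can be sketched as a small stack-based parser. It is an interpretation, not a definition: it assumes that a tag followed directly by another tag opens a new level, that a tag followed by values is a leaf at the current level with its values joined by single spaces, and that quoted tokens are always values. Quoting is simplified (no semicolon text fields, and the CIF rule about whitespace after a closing quote is ignored). The names tokenize and parse_nested are hypothetical.

```python
import re

# A double-quoted token, a single-quoted token, or a bare run of non-whitespace.
TOKEN_RE = re.compile(r'''"([^"]*)"|'([^']*)'|(\S+)''')


def tokenize(text):
    """Return ('tag', s) / ('value', s) pairs; quoted tokens are always values."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        double, single, bare = match.groups()
        if bare is not None:
            tokens.append(("tag" if bare.startswith("_") else "value", bare))
        else:
            tokens.append(("value", double if double is not None else single))
    return tokens


def parse_nested(text):
    """Return a nested list of (name, value) pairs; value is a string or a list."""
    tokens = tokenize(text)
    root = []
    stack = [root]   # stack[-1] receives new entries
    marks = []       # (label, depth) recorded by "_ {label"
    i = 0
    while i < len(tokens):
        kind, tag = tokens[i]
        i += 1
        if kind != "tag":
            raise ValueError(f"value {tag!r} has no tag")
        values = []
        while i < len(tokens) and tokens[i][0] == "value":
            values.append(tokens[i][1])
            i += 1
        if tag == "_":
            arg = values[0] if values else "."
            if arg.startswith("{"):
                marks.append((arg[1:], len(stack)))
            elif arg.startswith("}"):
                while marks:
                    label, depth = marks.pop()
                    if label == arg[1:]:
                        del stack[depth:]   # return to the bracketed level
                        break
                else:
                    raise ValueError("unmatched '_ }'")
            elif arg.isdigit():
                del stack[max(1, len(stack) - int(arg)):]  # "_ 0" is a no-op
            # any other value, such as ".", is treated as a no-op
        elif values:
            stack[-1].append((tag[1:], " ".join(values)))
        else:
            children = []
            stack[-1].append((tag[1:], children))
            stack.append(children)
    return root


FLAT = '_center _table _params "_border 2 _width 380" _tr _td B _td 38.9 _ 1 _tr _td C _td 34.7 _ 1'
BRACED = """_center _table _params "_border 2 _width 380"
_ { _tr _td B _td 38.9 _ }
_ { _tr _td C _td 34.7 _ }"""

flat_tree = parse_nested(FLAT)
braced_tree = parse_nested(BRACED)
```

Under these assumptions the numeric form and the bracketed form of two rows of the table produce the same tree, with both _tr entries nested inside _table alongside _params.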

The Lexical Scan

Tokens and whitespace are identified in a preliminary lexical scan according to the following rules:

_"name with blanks"

The quote marks are removed in defining the relevant token, but information about which quote mark was used is preserved if the parse has been instructed to preserve whitespace.

Specifying Parameters in General

Within an xCIF document, parameters for any given tag may be specified either by defining a specific associated tag, the value of which will carry the parameters for the original tag, as we do with _xCIF.doc and _xCIF.doc_params, or by use of the _params tag within the xCIF document.

References


Document Updated 19 July 2000

Herbert J. Bernstein
yaya@bernstein-plus-sons.com