A Generalised and Comprehensive Solution to CSV to XML and XML to CSV Transformations

Originally posted FRIDAY, FEBRUARY 17. 2012
This is a repost of an entry from my previous, but now defunct blog. The original URL was http://pascaliburnus.seanbdurkin.id.au/index.php?/archives/17-A-Generalised-and-Comprehensive-Solution-to-CSV-to-XML-and-XML-to-CSV-Transformations.html, and the entry can still be seen in its original rendering via the Way Back Machine


Abstract:This entry provides a generalised and comprehensive solution to the conversion of Comma Separated Values (CSV) file to XML. CSV is a very common file format for tabular data. It is clear from the frequency of related questions on StackOverflow, that there is a strong public demand for a generalised solution to CSV to XML and XML to CSV conversion. I believe that this entry marks the first published solution to the said transformations, which defines the model of CSV that it applies to, is generalised to work with all conformant CSV, in the XML domain defines the schema of the XML representation of the CSV file, and is error free. A feature of this solution is that the XML representation is constrained by an XML Schema. This is as opposed to other possible transformations which might derive the XML node names from the CSV header values, which in a generalised sense are unknown.

A quick example

csv data

Colour," Make", model
Lime ,"GMH","23 "
Black,Ford,14

xml equivalent

<?xml version="1.0" encoding="UTF-8"?>
<xcsv:comma-separated-single-line-values
  xmlns:xcsv="http://www.seanbdurkin.id.au/xslt/xcsv.xsd"
  xcsv-version="1.0">
   <xcsv:notice xml:lang="en">The xcsv format was developed by Sean B. Durkin&#133;www.seanbdurkin.id.au</xcsv:notice>
   <xcsv:row>
      <xcsv:value>Colour</xcsv:value>
      <xcsv:value> Make</xcsv:value>
      <xcsv:value> model</xcsv:value>
   </xcsv:row>
   <xcsv:row>
      <xcsv:value>Lime </xcsv:value>
      <xcsv:value>GMH</xcsv:value>
      <xcsv:value>23 </xcsv:value>
   </xcsv:row>
   <xcsv:row>
      <xcsv:value>Black</xcsv:value>
      <xcsv:value>Ford</xcsv:value>
      <xcsv:value>14</xcsv:value>
   </xcsv:row>
</xcsv:comma-separated-single-line-values>

Statement of copying permission

Excluding any code written by Andrew Welch, readers are permitted to copy the style-sheets and schema listed in this post, without let or hinderance.

What is CSV?

For the purposes of defining this pair of transformations, I will define the CSV format. My definition is based on RFC 4180 October 2005 “Common Format and MIME Type for CSV Files” but with some minor changes to suite the needs of XSLT based transformations.

  • The CSV format is a text stream encoded in either UTF-8 or UTF-16. If encoded in UTF-16, it must be prefixed by a BOM. A BOM is optional for UTF-8. No other encoding is valid. In any case, the csv file must be correctly readable by the XSLT 2.0 function unparsed-text()
  • A CSV stream is a contiguous sequence of text lines, also known as rows.
  • Lines are separated by line terminators. The last line can be terminated by either a line terminator or the end of the stream. If the last sequence of characters is a line terminator, then this line terminator does not signify the separtion of the previous line and a following empty line, but rather it follows and terminates the last line. Refer to sections 2.1 and 2.2 of RFC 4180 for examples.
  • Valid line terminators are:
  • The sequence U+00000D (CR) U+00000A (LF) or
  • the character U+00000A (LF).

Other line terminators which might exist in some file-systems or recognised as line terminators by XML 1.1, are not recognised. Counter-examples include:

  • The sequence U+00000A (LF) U+00000D (CR),
  • the character U+00000D (CR),
  • U+000085 (NEL) or
  • U+002028 (LS)
  • The first line is the header row. Subsequent rows are data rows. The values of the header row are respectively positioned headings for the values of the data rows. In contrast to RFC 4180, there is always exactly one header row.
  • Headings are not required to be unique, but they cannot be empty or consist of entirely of white-space. For this purpose, white-space is defined as per XML 1.1 .
  • The heading row has at least one heading (cell in the header row).
  • Each row is a list of cells. Cells are encoded string values, and are separated by the comma (,) character.
  • There is at least one cell per row. If the line is empty of characters, it signifies a row with one cell, whose value is empty.
  • Two output options are available for transforming to xml format: One encapsulated with node >xcsv:comma-separated-single-line-values/< and the other >xcsv:comma-separated-multi-line-values/<
  • In the case of >xcsv:comma-separated-single-line-values/< output, cells may not contain a line terminator (LF or the CR LF pattern), but may contain a CR not followed by an LF, or may contain NEL or LS. If content developers require multi-line cells, it is recommended that the LS character be used as a content-line separator within cells.
  • Cells are either encoded as quoted values or unquoted values.
  • An unquoted value represents the same value as its encoding string. An unquoted value may not contain either the comma (,) character or the quote (“) character.
  • A quoted value represents the same value as its encoding string, except that it is delimited on both sides by quote (“) marks (which are part of the encoding, not the value), and any value-bearing quote marks are escaped in the representation as a double occurance of the quote character (“).
  • White space, not as a line separator, is always significant and a part of the cell value.
  • There is no requirement for the count of cells in any data row to match the count of cells in the header. In other words, ragged csv is permissible.

Specification of the schema of the XML representation of CSV

The XML schema for the XML representation of CSV is as follows. It can be referenced in binary form or downloaded as indicated in the xs:schema targetNamespace.

(Click the word Schema below to expand, if you need to read.)

Schema
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:xcsv="http://www.seanbdurkin.id.au/xslt/xcsv.xsd"
    elementFormDefault="qualified"
    targetNamespace="http://www.seanbdurkin.id.au/xslt/xcsv.xsd"
    version="1.0">

  <xs:import
      namespace="http://www.w3.org/XML/1998/namespace" 
      schemaLocation="xml.xsd"/>

  <xs:element name="comma-separated-single-line-values">
   <xs:annotation><xs:documentation xml:lang="en">
    This schema describes an XML representation of a subset of csv content.
    The format described by this schema, here-after referred to as "xcsv"
    is part of a generalised solution to the problem of converting
    general csv files into suitable XML, and the reverse transform.

    The restrictions on the csv content (in relation to
	xcsv:comma-separated-single-line-values) are:
      <strong> The csv file is encoded either in UTF-8 or UTF16. If UTF-16, a BOM
        is required.
      </strong> The cell values of the csv may not contain the CR or LF characters.
        Essentially, we are restricted to single-line values.

	A multi-line version follows.
	
    The xcsv format was developed by Sean B. Durkin&#x85;www.seanbdurkin.id.au
   </xs:documentation></xs:annotation>

   <xs:complexType>
    <xs:sequence>

     <xs:element ref="xcsv:notice" minOccurs="0" maxOccurs="1"/>
     <xs:element name="row" minOccurs="0" maxOccurs="unbounded">
      <xs:annotation><xs:documentation xml:lang="en">
       A row element represents a "row" or "line" in the csv file. Rows contain values.
       </xs:documentation>
       <xs:appinfo>
        <example>
         <csv-line>apple,"banana","red, white and blue","quote this("")"</csv-line>
         <row>
          <value>apple</value>
          <value>banana</value>
          <value>red, white and blue</value>
          <value>quote this(")</value>
         </row>
        </example>
       </xs:appinfo>
      </xs:annotation> 
        <xs:choice minOccurs="1" maxOccurs="unbounded">
       <xs:annotation><xs:documentation xml:lang="en">
         Empty rows are not possible in csv. We must have at least one value or one error.
       </xs:documentation></xs:annotation>
       <xs:element name="value">
        <xs:annotation><xs:documentation xml:lang="en">
         A value element represents a decoded (model) csv "value" or "cell".
         If the encoded value in the lexical csv was of a quoted form, then
         the element content here is the decoded or model form. In other words,
         the delimiting double-quote marks are striped out and the internal
         escaped double-quotes are de-escaped.
        </xs:documentation></xs:annotation>
        <xs:simpleType>
         <xs:restriction base="xs:string">
          <xs:pattern value="[^\n]*"/>
          <xs:whiteSpace value="preserve"/>
          <xs:annotation><xs:documentation xml:lang="en">
           Cell values must fit this pattern because of the single-line restriction
           that we placed on the csv values.
          </xs:documentation></xs:annotation>
         </xs:restriction>
        </xs:simpleType>
       </xs:element>
       <xs:group ref="xcsv:errorGroup">
        <xs:annotation><xs:documentation xml:lang="en">
          An error can be recorded here as a child of row, if there was an encoding
          error in the csv for that row.
        </xs:documentation></xs:annotation>
       </xs:group>
        </xs:choice>
     </xs:element>

     <xs:group ref="xcsv:errorGroup">
      <xs:annotation><xs:documentation xml:lang="en">
       An error can be recorded here as a child of the comma-separated-values element,
       if there was an i/o error in the transformational process. For example:
        CSV file not found.
      </xs:documentation></xs:annotation>
     </xs:group>

    </xs:sequence>

    <xs:attribute name="xcsv-version" type="xs:decimal"
        fixed="1.0" use="required"/>
   </xs:complexType>
  </xs:element>

  <xs:element name="comma-separated-multiline-values">
   <xs:annotation><xs:documentation xml:lang="en">
    Similar to xcsv:comma-separated-multi-line-values but allows multi-line values.
   </xs:documentation></xs:annotation>

   <xs:complexType>
    <xs:sequence>

     <xs:element ref="xcsv:notice" minOccurs="0" maxOccurs="1"/>
     <xs:element name="row" minOccurs="0" maxOccurs="unbounded">
       <xs:choice minOccurs="1" maxOccurs="unbounded">
       <xs:element name="value">
        <xs:simpleType>
         <xs:restriction base="xs:string">
          <xs:whiteSpace value="preserve"/>
         </xs:restriction>
        </xs:simpleType>
       </xs:element>
       <xs:group ref="xcsv:errorGroup">
       </xs:group>
        </xs:choice>
     </xs:element>

     <xs:group ref="xcsv:errorGroup">
     </xs:group>

    </xs:sequence>

    <xs:attribute name="xcsv-version" type="xs:decimal"
        fixed="1.0" use="required"/>
   </xs:complexType>
  </xs:element>

 <xs:element name="notice" type="xcsv:notice-en" />
      <xs:annotation><xs:documentation xml:lang="en">
       This is an optional element below comma-separated-single-line-values or
        comma-separated-multiline-values that looks like the example.
       </xs:documentation>
      <xs:appinfo>
       <example>
        <notice xml:lang="en">The xcsv format was developed by Sean B. Durkin&#x85;www.seanbdurkin.id.au</notice>
       </example>
      </xs:appinfo></xs:annotation>
      <xs:complexType name="notice-en">
        <xs:simpleContent>
          <xs:extension base="xcsv:notice-content-en">
           <xs:attribute ref="xml:lang" use="required" fixed="en" />
          </xs:extension>
        </xs:simpleContent>
      </xs:complexType>
      <xs:simpleType name="notice-content-en">
       <xs:restriction base="xs:string">
         <xs:enumeration value="The xcsv format was developed by Sean B. Durkin&#x85;www.seanbdurkin.id.au"/>
       </xs:restriction>
      </xs:simpleType>
 <xs:element />

   <xs:group name="errorGroup">
      <xs:annotation><xs:documentation xml:lang="en">
       This is an error node/message in one or more languages.
      </xs:documentation>
      <xs:appinfo>
       <example>
        <error error-code="2">
         <message xml:lang="en">Quoted value not terminated.</message>
         <message xml:lang="ru">процитированная строка не закрыта.</message>
         <error-data>"</error-data>
        </error>
       </example> 
       <example>
        <error error-code="3">
         <message xml:lang="en">Quoted value incorrectly terminated.</message>
         <message xml:lang="ru">процитированная строка закрыта неправильно.</message>
        </error>
       </example>
      </xs:appinfo> 
      </xs:annotation>
   <xs:element name="error">
    <xs:element name="message" minOccurs="1" maxOccurs="unbounded" type="xcsv:string-with-lang" />
      <xs:annotation><xs:documentation xml:lang="en">
       Although there can be multiple messages, there should only be at most one per language.
      </xs:documentation></xs:annotation>
    <xs:element name="error-data" minOccurs="0" maxOccurs="1" >
     <xs:simpleContent>
      <xs:restriction base="xs:string">
       <xs:whiteSpace value="preserve"/>
      </xs:restriction>
     </xs:simpleContent>
    </xs:element>
    <xs:attribute name="error-code" type="xs:positiveInteger" default="1" />
      <xs:annotation><xs:documentation xml:lang="en">
       Each different kind of error should be associated with a unique error code.
       A map for the error codes is outside the scope of this schema, except to say the following:
         <strong> one (1) means a general or uncategorised error. (Try to avoid this!)
      </xs:documentation></xs:annotation>
   </xs:element>
  </xs:group>

  <xs:complexType name="string-with-lang">
      <xs:annotation><xs:documentation xml:lang="en">
       This is an element with text content in some language as indicated
       by the xml:lang attribute.
      </xs:documentation></xs:annotation>
   <xs:simpleContent>
    <xs:extension base="xs:string">
     <xs:attribute ref="xml:lang" use="required" default="en" />
    </xs:extension>
   </xs:simpleContent>
  </xs:complexType>


</xs:schema> 

Rules for Transformation

When transforming from CSV to XML:

  • The transformation should only output UTF-8 encoded XML.
  • The transformation should check for validity of the CSV. If invalid, an element should be written directly underneath the element. Refer to the schema.
  • The schema does not require that the xcsv:comma-separated-single-line-values or xcsv:comma-separated-multiline-values elements be the document element, but if transforming one CSV file to one XML file, this should be made the case.

XSLT solutions

Here is a style-sheet to convert from csv (single line values case only) to xcsv format (output type is root node). A significant part of this style-sheet is derived from or copied from Andrew Welch’s solution. A binary copy is located at ‘http://www.seanbdurkin.id.au/xslt/csv-to-xml.xslt’. The style-sheet takes two parameters: url-of-csv and lang. url-of-csv is the filename or reference to the csv file to be converted. ‘lang’, a parameter defaulting to ‘en’ specifies the human language of error messages. If non-english language is chosen, the transform needs to be supported by a language file (param ‘url-of-messages’) with the localised error messages.

I plan to make a style sheet for the multi-line csv case in a future post, as well as the inverse transforms (xcsv to csv). If you are interested, either watch this blog, or take it as an exercise for the reader to design these style-sheets. If you design such style-sheets, let me know in the comments for this post.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE constants[
  <!ENTITY notice "The xcsv format was developed by Sean B. Durkin&#x85;www.seanbdurkin.id.au">
  ]>  
<xsl:stylesheet 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
  xmlns:fn="http://www.w3.org/2005/xpath-functions" 
  xmlns:local="http://www.seanbdurkin.id.au/xslt/csv-to-xml.xslt" 
  xmlns:xs="http://www.w3.org/2001/XMLSchema" 
  xmlns:xcsv="http://www.seanbdurkin.id.au/xslt/xcsv.xsd"
  version="2.0"
  exclude-result-prefixes="xsl xs fn local">
<xsl:output indent="yes" encoding="UTF-8" method="xml"/>
<xsl:import-schema schema-location="http://www.seanbdurkin.id.au/xslt/xcsv.xsd"
                   use-when="system-property('xsl:is-schema-aware')='yes'" />
				   
<!-- Read Me ! -->
<!-- ************************************************************************</strong> -->
<!-- A significant part of this style-sheet is derived from or copied from     --> 
<!-- Andrew Welch's solution. Refer: http://andrewjwelch.com/code/xslt/csv/csv-to-xml_v2.html -->
<!-- Modifications have been made by me (Sean B. Durkin) in order to meet the design         -->
<!-- goals as described here:                                                                -->
<!-- http://pascaliburnus.seanbdurkin.id.au/index.php?/archives/17-A-Generalised-and-Compreh -->
<!--ensive-Solution-to-CSV-to-XML-and-XML-to-CSV-Transformations.html				         -->

<!-- Stylesheet parameters -->
<!-- ************************************************************************<strong> -->
<xsl:param name="url-of-csv" as="xs:string" />
<xsl:param name="lang" as="xs:string" select="'en'" />
<xsl:param name="url-of-messages" as="xs:string" />

<!-- Configurable constants -->
<!-- ************************************************************************</strong> -->
  <xsl:variable name="quote" as="xs:string">"</xsl:variable>
  <xsl:variable name="error-messages-i18n">
   <xcsv:error error-code="1">
    <xcsv:message xml:lang="en">Uncategorised error.</xcsv:message>
   </xcsv:error>
   <xcsv:error error-code="2">
    <xcsv:message xml:lang="en">Quoted value not terminated.</xcsv:message>
   </xcsv:error>
   <xcsv:error error-code="3">
    <xcsv:message xml:lang="en">Quoted value incorrectly terminated.</xcsv:message>
   </xcsv:error>
   <xcsv:error error-code="5">
    <xcsv:message xml:lang="en">Could not open CSV resource.</xcsv:message>
   </xcsv:error>
  </xsl:variable> 
  
<!-- Non-configurable constants -->
<!-- ************************************************************************<strong> -->
 <xsl:variable name="error-messages">
   <xsl:apply-templates select="$error-messages-i18n" mode="messages" />
  </xsl:variable>
    
<xsl:template match="@*|node()" mode="messages" >
 <xsl:copy>
  <xsl:apply-templates select="@*|node()" mode="messages" />
 </xsl:copy>
</xsl:template>

<xsl:template match="xcsv:message[
      not(@xml:lang=$lang) and
     (not(@xml:lang='en') or ../xcsv:message[@xml:lang=$lang])]" mode="messages" />
	 


<xsl:function name="local:unparsed-text-lines" as="xs:string+">
 <xsl:param name="href" as="xs:string" />
 <xsl:sequence use-when="function-available('unparsed-text-lines')" select="fn:unparsed-text-lines($href)" />
 <xsl:sequence use-when="not(function-available('unparsed-text-lines'))" select="tokenize(unparsed-text($href), '\r\n|\r|\n')[not(position()=last() and .='')]" />
</xsl:function>

<xsl:function name="local:error-node" as="node()">
 <xsl:param name="error-code" as="xs:integer" />
 <xsl:param name="data" as="xs:string" />
 <xcsv:error error-code="{$error-code}">
  <xcsv:message
    xml:lang="{$error-messages/xcsv:error[@error-code=$error-code]/xcsv:message/@xml:lang}">
    <xsl:value-of select="$error-messages/xcsv:error[@error-code=$error-code]/xcsv:message"/>
  </xcsv:message>
  <xcsv:error-data><xsl:value-of select="$data"/></xcsv:error-data>
 </xcsv:error>
</xsl:function>

<xsl:function name="local:csv-to-xml" as="node()+">
 <xsl:param name="href" as="xs:string" />
 <xcsv:comma-separated-single-line-values xcsv-version="1.0">
  <xcsv:notice xml:lang="en">&notice;</xcsv:notice>
  <xsl:choose>
   <xsl:when test="fn:unparsed-text-available($href)">
    <xsl:for-each select="local:unparsed-text-lines($href)">
	 <xcsv:row>
     <xsl:analyze-string select="fn:concat(., ',')" regex='(("[^"]*")+|[^,"]*),'>
      <xsl:matching-substring>
	   <xcsv:value>
	    <xsl:choose>
         <xsl:when test="fn:starts-with( fn:regex-group(1), $quote)">
          <xsl:value-of select='fn:replace(fn:regex-group(1), "^""|""$|("")""", "$1" )' />
	     </xsl:when>
	     <xsl:otherwise>
          <xsl:value-of select='regex-group(1)' />
	     </xsl:otherwise>
	    </xsl:choose>
	   </xcsv:value>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
       <xsl:copy-of select="local:error-node(3,.)"/>
       <!-- Quoted value incorrectly terminated. -->
      </xsl:non-matching-substring>
     </xsl:analyze-string>
	 </xcsv:row>
    </xsl:for-each>
   </xsl:when>
   <xsl:otherwise>
    <xsl:copy-of select="local:error-node(5,$href)"/>
    <!-- Could not open CSV resource. -->
   </xsl:otherwise>
  </xsl:choose>
 </xcsv:comma-separated-single-line-values>
</xsl:function>

<xsl:template match="/" name="local:main">
 <xsl:copy-of select="local:csv-to-xml($url-of-csv)" />
</xsl:template>


</xsl:stylesheet>

Unit testing

Please refer to http://codereview.stackexchange.com/questions/10180/please-review-my-xml-schema-for-an-xml-representation-of-csv. The Use cases can be used for unit testing.

Case Study

Let’s say we have a csv file like this:

1st Name,2nd Name
Sean B.,Durkin
"Chris ""Nemora""",Durkin

… and we want to convert it to an xml format like this …

<?xml version="1.0" encoding="UTF-8"?>
<people>
   <person>
      <first-name>Sean B.</first-name>
      <last-name>Durkin</last-name>
   </person>
   <person>
      <first-name>Chris "Nemora"</first-name>
      <last-name>Durkin</last-name>
   </person>
</people>

Assumptions

We will assume:

  • The csv file is encoded in utf-8 without BOM.
  • The csv file uses CR.LF for its end of line marker.
  • The csv file may or may not have an CR.LF at the end of the last line (right after “Durkin”).
  • The aforementioned xcsv style-sheet has been set-up as a standard library, whose location is known by your XSLT engine (probably via your configured OASIS catalog for the XSLT engine).
  • The location of the csv file has been passed in to parameter ‘url-of-csv’

So the solution is …

<xsl:stylesheet 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
  xmlns:fn="http://www.w3.org/2005/xpath-functions" 
  xmlns:xcsvt="http://www.seanbdurkin.id.au/xslt/csv-to-xml.xslt" 
  xmlns:xs="http://www.w3.org/2001/XMLSchema" 
  xmlns:xcsv="http://www.seanbdurkin.id.au/xslt/xcsv.xsd"
  version="2.0"
  exclude-result-prefixes="xsl xs fn xcsvt xcsv">
<xsl:import href="csv-to-xml.xslt" />
<xsl:output indent="yes" encoding="UTF-8" method="xml"/>
<xsl:import-schema schema-location="http://www.seanbdurkin.id.au/xslt/xcsv.xsd"
                   use-when="system-property('xsl:is-schema-aware')='yes'" />
<xsl:param name="url-of-csv" as="xs:string" />

<xsl:import-schema schema-location="http://www.seanbdurkin.id.au/xslt/xcsv.xsd"
                   use-when="system-property('xsl:is-schema-aware')='yes'" />


<xsl:template match="/">
  <xsl:variable name="phase-1-result">
    <xsl:call-template name="xcsvt:main" /><!-- phase 1 -->
  </xsl:variable>
  <xsl:apply-templates select="$phase-1-result" mode="phase-2"/>
 <people>
   <xsl:apply-templates select="
     $phase-1-result/
	 xcsv:comma-separated-single-line-values/
	 (xcsv:row[position() ge 2])" mode="phase-2"/>
 </people>
</xsl:template>

<xsl:template match="@*|</strong>" mode="phase-2" />

<xsl:template match="xcsv:row" mode="phase-2">
 <person>
  <first-name><xsl:value-of select="xcsv:value[1]" /></first-name>
  <last-name> <xsl:value-of select="xcsv:value[2]" /></last-name>
 </person>
</xsl:template>

</xsl:stylesheet>				   

This entry was posted in XSLT 2.0. Bookmark the permalink.