www.openlinksw.com
docs.openlinksw.com

Book Home

Contents
Preface

XML Support

Rendering SQL Queries as XML (FOR XML Clause)
XML Composing Functions in SQL Statements (SQLX)
Virtuoso XML Services
Querying Stored XML Data
XPATH_CONTAINS SQL Predicate Using xpath_eval() External Entity References in Stored XML XML Schema & DTD Functions Using XML and Free Text XCONTAINS predicate text-contains XPath Predicate XML Free Text Indexing Rules XML Processing & Free Text Encoding Issues
Using UpdateGrams to Modify Data
XML Templates
XML DTD and XML Schemas
XQuery 1.0 Support
XSLT Transformation
XMLType
Changing XML entities in DOM style

14.4. Querying Stored XML Data

14.4.1. XPATH_CONTAINS SQL Predicate

XPath expressions can be used in SQL statements to decompose and match XML data stored in columns. The xpath_contains SQL predicate can be used either to test for an XML value matching a path expression or to extract one or more entities from the XML value. These values can then be used later in the query as contexts for other XPath expressions.

xpath_contains (xml_column, xp_expression[, query_variable]);

The first argument, xml_column is the name of the column on which to perform the XPath search. The second argument, xp_expression, takes an XPath expression.

The third argument is an optional query variable that gets bound to each result entity value of the xpath expression. If this variable is omitted the xpath_contains predicate will qualify the query by returning true for matches. In this case the result will only return one row per match. If the variable is present, the result set could contain multiple rows per result set row of the base table, one row for each match.

Consider the example:

select xt_file, t from xml_text
  where xpath_contains (xt_text, '//chapter/title[position () = 1]', t);

This SQL statement will select the first title child of any chapter entities in the XML documents in the xt_text column of the table xml_text. There can be several matching entities per row of xml_text. The result set will contain a row for each matching entity.

In XPath terms the path expression of xpath_contains is evaluated with the context node set to the root node of the XML tree represented by the value of the column that is the first argument of xpath_contains. This node is the only element of the context node set.

Note:

The 't' variable in the above example gets bound to XML entities, not to their string values or other representations. One can thus use these values as context nodes for other expressions.

The XPATH expression can have a list of options in the beginning. The list of options is surrounded by square brackets. Options in the list are delimited by spaces. The most popular option is __quiet that allows to process a set of rows if not all stored documents are valid XMLs; if an error is signalled by the XML parser when it prepares a content document for the XPATH in question and the XPATH contains __quiet then the error is suppressed and the row is silently ignored as if XPATH found nothing. One can configure the DTD validator of the parser by placing its configuration parameters in the list of XPATH options.

The following example is almost identical to the previous one but it works even if not all values of xt_text are valid XMLs, and the resulting values of the 't' variable are standalone entities even if source documents in xt_text contain external generic entities.

select xt_file, t from xml_text
  where xpath_contains (xt_text, '[__quiet BuildStandalone=ENABLE]//chapter/title[position () = 1]', t);

14.4.2. Using xpath_eval()

The xpath_eval() function is used to filter out parts of an XML fragment that match a given XPATH expression. It can be used to retrieve multiple-node answers to queries, as it is often the case that more than one node-set matches. Consider the following statements that create a table with XML stored inside.

CREATE TABLE t_articles (
	article_id int NOT NULL,
	article_title varchar(255) NOT NULL,
	article_xml long varchar
	);

insert into t_articles (article_id, article_title) values (1, 'a');
insert into t_articles (article_id, article_title) values (2, 'b');

UPDATE t_articles SET article_xml = '
<beatles id = "b1">
<beatle instrument = "guitar" alive = "no">john lennon</beatle>
<beatle instrument = "guitar" alive = "no">george harrison</beatle>
</beatles>'
WHERE article_id = 1;

UPDATE t_articles SET article_xml = '
<beatles id = "b2">
<beatle instrument = "bass" alive = "yes">paul mccartney</beatle>
<beatle instrument = "drums" alive = "yes">ringo starr</beatle>
</beatles>'
WHERE article_id = 2;

Now we make a query that will return a vector of results, each vector element corresponding to a node-set of the result.

SELECT xpath_eval('//beatle/@instrument', xml_tree_doc (article_xml), 0)
	AS beatle_instrument FROM t_articles WHERE article_id = 2;

The repeating nodes are returned as part of a vector, the third argument to xpath_eval() is set to 0, which means that it is to return all nodes.

Otherwise, we can select the first node-set by supplying 1 as the third parameter to xpath_eval():

SELECT xpath_eval('//beatle/@instrument', xml_tree_doc (article_xml), 1)
	AS beatle_instrument FROM t_articles WHERE article_id = 2;
See Also:

xpath_eval()

xquery_eval()

xmlupdate()


14.4.3. External Entity References in Stored XML

When an XML document is stored as either text or in persistent XML format it can contain references to external parsed entities with the <!entity ...> declaration and the &xx; syntax. These are stored as references and not expanded at storage time if the entity is external. Such references are transparently followed by XPath and XSLT. A run-time error occurs if the referenced resource cannot be accessed when needed. The reference is only followed if the actual subtree is selected by XPath or XSLT. The resource is retrieved at most once for each XPath or XSLT operation referencing it, regardless of the number of times the link is traversed. This is transparent, so that the document node of the referenced entity appears as if it were in the place of the reference.

External entity references have an associated URI, which is either absolute, with protocol identifier and full path, or relative. Virtuoso resolves relative references with respect to the base URI of the referencing document. If the document is stored as a column value in a table it does not have a natural base URI; therefore, the application must supply one if relative references are to be supported. This is done by specifying an extra column of the same table to contain a path, in the form of collections delimited by slashes, like the path of a DAV resource or a Unix file system path. This base URI is associated with an XML column with the IDENTIFIED BY declaration:

create table XML_TEXT (
	XT_ID integer,
	XT_FILE varchar,
	XT_TEXT long varchar identified by xt_file,
		primary key (XT_ID)
	);

create index XT_FILE on XML_TEXT (XT_FILE);

Thus, each time the value of xt_text is retrieved for XML processing by xpath_contains() or xcontains() the base URI is taken from xt_file. The complete URI for the xt_text of a column of the sample table would be:

virt://<qualified table name>.<uri column>.<text column>:<uri column value>

An example would be:

"virt://DB.DBA.XML_TEXT.XT_FILE.XT_TEXT:sqlreference.xml"

The '..' and '.' in relative paths are treated like file names when combining relative references to base URIs. A relative reference without a path just replaces the last part of the path in the base URI.

See Also:

xml_uri_get() and xml_uri_merge() for more details.


14.4.4. XML Schema & DTD Functions

The following functions can be used to generate XML Schema or DTD information about a given SQL query:

Generating XML Schema and DTD Data

This example shows trivial use of the two functions xml_auto_schema() and xml_auto_dtd().

SQL> select xml_auto_schema('select u_name from sys_users', 'root');
callret
VARCHAR
_______________________________________________________________________________

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

 <xsd:annotation>
  <xsd:documentation>
   Schema for output of the following SQL statement:

   <![CDATA[select u_name from sys_users]]>

  </xsd:documentation>
 </xsd:annotation>

 <xsd:element name="root" type="root__Type"/>

 <xsd:complexType name="root__Type">
  <xsd:sequence>
   <xsd:element name="SYS_USERS" type="SYS_USERS_Type" minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
 </xsd:complexType>

 <xsd:complexType name="SYS_USERS_Type">
  <xsd:attribute name="U_NAME" type="xsd:string"/>
 </xsd:complexType>

</xsd:schema>

1 Rows. -- 1843 msec.
SQL> select xml_auto_dtd('select u_name from sys_users', 'root');
callret
VARCHAR
_______________________________________________________________________________

<!-- dtd for output of the following SQL statement:
select u_name from sys_users
-->
<!ELEMENT root (#PCDATA | SYS_USERS)* >
<!ELEMENT SYS_USERS (#PCDATA)* >
<!ATTLIST SYS_USERS
        U_NAME  CDATA   #IMPLIED        >

1 Rows. -- 411 msec.

14.4.5. Using XML and Free Text

Virtuoso integrates classic free text retrieval and XML semi-structured query features to offer a smart, scalable XML repository. When a column is declared as indexed XML with the CREATE TEXT XML INDEX statement the text is checked for well-formedness at time of storage. The specific XML structure of the text is also considered when making the free text index entries. This XML-aware free text index is used for processing XPath queries in the xcontains SQL predicate. This predicate is only applicable to columns for which there is an XML free text index.

Arbitrary free text criteria can appear inside the XPath expression of xcontains. These are introduced by the XPath extension function text-contains(), which may only be used within xcontains as it relies on the underlying free text index.

Note

xpath_contains() does not require the existence of a free text index and can thus apply to any well-formed XML content.


14.4.6. XCONTAINS predicate

This predicate is used in a SQL statement, it returns "true" if a free text indexed column with XML content matches an XPATH expression. Optionally produces the matching node set as a result set.

Syntax

xcontains_pred:
	xcontains (column, expr [, result_var [, opt_or_value ...]])

opt_or_value:
	  DESCENDING
	  | START_ID ',' scalar_exp
	  | END_ID ',' scalar_exp
	  | SCORE_LIMIT ',' scalar_exp
	  | OFFBAND column

result_var:
	  IDENTIFIER
	  | NULL

The column must refer to a column for which there exists a free text index.

The expr must be a narrow or wide string expression whose syntax matches the rules in 'XPATH Query Syntax'.

The result_var variable is a query variable which, if present, will be successively bound to each element of the node set selected by the XPATH expression. if the value is not a node set and is true, the variable will be once bound to this value. The scope of the variable is the containing select and its value is a scalar or an XML entity. The result_var can be not an identifier but a NULL keyword to explicitly indicate that no query variable is required.

The START_ID is the first allowed document ID to be selected by the expression in its traversal order, e.g. least or equal for ascending and greatest or equal for descending.

END_ID is the last allowed id in the traversal order. For descending order the START_ID must be >= END_ID for hits to be able to exist. For ascending order the START_ID must be <= END_ID for hits to be able to exist.

DESCENDING specifies that the search will produce the hit with the greatest ID first, as defined by integer or composite collation. This has nothing to do with a possible ORDER BY of the enclosing statement. Even if there is an ORDER BY in the enclosing statement the DESCENDING keyword of xcontains has an effect in the interpretation of the STRT_ID and END_ID xcontains options.

RANGES specifies that the query variable following the RANGES keyword will be bound to the word position ranges of the hits of the expression inside the document. The variable is in scope inside the enclosing SELECT statement.

SCORE_LIMIT specifies a minimum score that hits must have or exceed to be considered matches of the predicate.

OFFBAND specifies that the following column will be retrieved from the free text index instead of the actual table. For this to be possible the column must have been declared as offband with the CLUSTERED WITH option of the CREATE TEXT INDEX statement.

If the select statement containing the xcontains predicate does not specify an exact match of the primary key of the table having the xcontains predicate, then the contains predicate will be the 'driving' condition, meaning that rows come in ascending or descending order of the free text document ID. If there is a full equality match of the primary key of the table, this will be the driving predicate and xcontains will only be used to check if the text expression matches the single row identified by the full match of the primary key.

The xcontains predicate may not appear outside of a select statement and may only reference a column for which a free text index has been declared. The first argument must be a column for which there is such an index. The text expression may be variable and computed, although it must be constant during the evaluation of the select statement containing it.

The xcontains predicate must be a part of the top level AND of the WHERE clause of the containing select. It may not for example be a term of an OR predicate in the select but can be AND'ed with an OR expression.

Selecting Title Elements Called 'Key'
select xt_file from xml_text2 where
 xcontains (xt_text, '//title = "Key"');

The query retrieves the xt_file for rows whose xt_text is an XML document containing 'Key' as the text value of a title element.

If not all values in xt_text are valid XMLs then '__quiet' option can be useful to disable error signalling. It is unusual to get an incorrect XML stored in a column that has free text XML index because both on insert and on update the text is parsed by an free text indexing routine, but the error is possible if e.g. a non-standalone document is stored and an important external entity was available at indexing time but disappeared later. Thus a modified example might be better for a column with non-standalone documents; select xt_file from xml_text2 where xcontains (xt_text, '[__quiet] //title = "Key"');

Selecting Title Element that Contains a Specified Text
select n from xml_text2 where
 xcontains(xt_text,
 '//title[. = "AS Declaration - Column Aliasing"]',0,n);

The query retrieves each title element from each row of xml_text2 where the xt_text contains title elements with the text value "AS Declaration - Column Aliasing."

Note

The equality test is case- and whitespace-sensitive, as normal in XPath. The free text index is used for the search but the final test is done according to XPath rules.

See Also:

The CONTAINS Predicate.


14.4.7. text-contains XPath Predicate

text-contains (node-set, text-expression)

This XPath predicate is true if any of the nodes in node-set have text values matching the text-expression. The text-expression should be a constant string whose syntax corresponds to the top production of the free text syntax for patterns in contains(). The string also may not consist exclusively of spaces or noise words.

See Also:

"Noise Words" in the Free Text Search chapter.

Selecting All Titles About Aliasing
select n from xml_text2 where
  xcontains (xt_text,
  '//title[text-contains (., "Aliasing")]', 0, n);

This selects all title elements that contain the word "Aliasing" using free text match rules: case insensitive and whole word.

Select All Trees with Elements Containing "sql reference"
select n from xml_text2 where
  xcontains (xt_text,
  '//*[text-contains (., ''"sql reference"'')]',
  0, n);

This selects all elements whose text value contains the phrase "sql reference". Free text matching rules apply. This produces all nodes in document order for all documents which contains the phrase, starting with the document node and following downward including all paths to the innermost element(s) whose text contains the phrase.


14.4.8. XML Free Text Indexing Rules

XML documents are inserted into the free text index as follows:

From these rules follows that:

<html>
  <body>
   <title>Title of Document</title>
   <p>Some <b>bold</b> text </p>
  </body>
</html>

will be indexed as follows:

<html>		0
<body>		0
<title>		0
Title		1
of		- no number, noise word
Document		2
</title>		3
   <p>		3
Some		4
 <b>		4
bold		5
</b>		6
 text		6
</p>		6
  </body>		6
</html>		6

As a result, the phrase "some bold text" is the string value of the <p> tag and will match the free text expression "some bold text" even though there is mark-up in it. Conversely, the phrase "Document some bold" does not match. Words will not considered adjacent if there is a mix of opening and closing tags. They will only be considered adjacent if there are solely one or more either opening or closing tags between them. This can be circumvented by using the NEAR connective instead of the phrase construct.

A free text condition will only be true of an element if all the words needed to satisfy the condition are part of the element's string value. This string value includes text children of descendants.


14.4.9. XML Processing & Free Text Encoding Issues

XML document may be written in a variety of encodings, and it may cause errors if an incorrect encoding is used for reading a document. Most common errors can easily be eliminated by writing proper XML prologs in documents, but this is not always possible, e.g. if documents are composed by third-party applications. Virtuoso provides various tools to support different types of encodings and to specify encodings to use if a given document has no XML prolog.

14.4.9.1. Encodings: The Difference Between Encodings & Character Sets

Not all documents may be converted to Unicode by using simple character sets. Some of them are stored in so-called "multibyte" encodings. It means that every letter (or ideograph) is represented as a sequence of one or more bytes, not by exactly one byte. The conversion from such representation to Unicode and back is usually significantly slower than simple transformation via character sets, so these representations are supported by data import operations only, but not by internal RDBMS routines.

The Virtuoso Server "knows" some number of built-in encodings, such as UTF-8, UTF-16BE and UTF-16LE. It can load additional encoding descriptions from a "UCM" file, and can automatically create a new encoding from a known charset with the same name. See the UCM Encodings section for more details.

An encoding may be used in the following places:

You can only use character sets, not encodings as an ODBC connection character set, as a character set attribute of a column of a database table, as an output encoding of the built-in XSLT processor (it is for future versions). UTF-8 is an exception, it is supported in many places where other encodings are not.

Security Note:

Two strings converted to Unicode may be identical, but this does not guarantee that their source strings were equal byte-by-byte due to the nature of some encodings. For this reason you should avoid processing authorization data that are neither in Unicode nor in one of the standard character sets (single-byte encodings). Multibyte encodings and user-defined character sets may be unsafe for such purposes.

14.4.9.1.1. UCM Encodings

The description of a multibyte encodings is much longer than the description of a character set. It is inconvenient to keep such amounts of data inside the executable. Virtuoso can load descriptions of required encodings from external files in UCM format. Every UCM file describes one encoding.

Virtuoso loads UCM files at system initialization. The list of UCM files is kept in the Virtuoso INI file under a section called [Ucms]. This section should contain a UcmPath parameter and one or more parameters with names Ucm1, Ucm2, Ucm3 and so on (up to Ucm99).

The UcmPath parameter specifies the directory where UCM files are located, and every UcmNN parameter specifies the name of a UCM file to load and a list of names that the encoding can be identified by the <?xml ... encoding="..." ?> XML preamble. A vertical bar character is used to delimit names in the list.

Sample [Ucms] Section
[Ucms]
UcmPath = /usr/local/javalib/ucm
Ucm1 = java-Cp933-1.3-P.ucm,Cp933
Ucm2 = java-Cp949-1.3-P.ucm,Cp949|Korean

This section describes two UCM files located in /usr/local/javalib/ucm directory: data from java-Cp933-1.3-P.ucm will be used for documents in the 'Cp933' encoding; data from java-Cp949-1.3-P.ucm will be used for documents in the 'Cp949' encoding and for documents in the 'Korean' encoding (because these two names refers to the same encoding).

Note:

The encoding name specified inside the UCM file itself is not used.

The Virtuoso server will log the results of processing each UCM file specified in the Virtuoso INI file. If a UCM file specified is not found or contains syntax errors, the error is logged, otherwise only the type and name(s) of the encoding are logged.

Note:

If the virtuoso.ini contains a misspelled name of a parameter or section, the parameter (or a whole section) is ignored without being reported as an error. It is always wise to verify that the log contains a record about the encoding(s) you load.

See Also:

UCM files can be found freely from various sites concerning the "International Components for Unicode" project, such as: IBM ICU Homepage or the IBM UCM files directory.

The C Interface chapter contains further information regarding user customizable support for new encodings and languages. For almost all tasks, it is enough to define a new charset or to load an additional UCM file, but some special tasks may require writing additional C code.



14.4.9.2. The Encoding Attribute

If an XML document contains the encoding parameter in its

<?xml ... ?>

prolog declaration, it will be properly decoded and converted into UTF-8, so the application code is free from encoding problems. If the value of this attribute is the name of a pre-set or user-defined character set, that character set will be used. Virtuoso will recognize names such as UTF-8 and UTF8 as multi-character or special encodings. Virtuoso recognizes both official names and aliases.

If an encoding is not specified in an XML prolog, or if the document contains no prolog, the default encoding will be used to read the document. If a built-in SQL function invokes the XML parser, it will have an optional argument parser_mode to specify whether source text should be parsed as strict XML or as HTML. If the source text is 8-bit, then UTF-8 will be used as the default encoding for "XML mode", and ISO-8859-1 (Latin-1) will be the default for "HTML mode". If the source text is of some wide-character type, Unicode is the default. To make another encoding the default, you may specify its official or alias name as the content_encoding argument of a built-in function you call.


14.4.9.3. Encoding in XPath Expressions

Sometimes applications should perform XPath queries using the encoding specified by a client. For example, a search engine may ask a user to specify a pattern to search and use the browser's current encoding as a hint to parse the pattern properly. In such cases you may wish to use the __enc XPath option to specify the encoding used for the rest of XPath string:

Specifying Search Encodings in XPath

Create a sample table and store an XML with non-Latin-1 characters

create table ENC_XML_SAMPLE (
  ID integer,
  XPER long varchar,
  primary key (ID)
);

insert into ENC_XML_SAMPLE (ID, XPER)
values (
  1,
  xml_persistent ('<?xml version="1.0" encoding="WINDOWS-1251" ?>
    <book><cit>Îí äîáàâèë
      êàðòîøêè,
    ïîñîëèë è
    ïîñòàâèë
    àêâàðèóì íà
    îãîíü
    (Ì.Æâàíåöêèé
    )</cit></book>'
  )
);
...
  

Find the IDs of all XML documents whose texts contain a specified phrase. Note that there are pairs of single quotes (not double quotes) around KOI8-R. The encoding name should be in single quotes, but because it is inside a string constant the quotes must be duplicated.

select ID from ENC_XML_SAMPLE where
  xcontains (XPER, '[__enc ''KOI8-R''] //cit[text-contains(.,
  "''ÐÏÓÔÁ×ÉÌ
    ÁË×ÁÒÉÕÍ
    ÎÁ ÏÇÏÎØ''")]');


14.4.9.4. Encoding in Free Text Search Indexes & Patterns

Like XML applications, free text searching may have encoding problems, and Virtuoso offers a similar solution for them.

Both the CREATE TEXT INDEX statement and vt_create_text_index() Virtuoso/PL procedure have an optional argument to specify the encoding of the indexed data. The specified encoding will be applied to all source text documents (if the TEXT INDEX was created), or to all XML documents that have no encoding attribute of the sort <?xml ... encoding="..." ?> (if the TEXT XML INDEX was created).

The option __enc may be specified at the beginning of free text search pattern, even if the pattern is inside an XPath statement:

Specifying an Encoding for Free Text Searching

Create a sample table and store a sample of text with non-Latin-1 characters (assuming that client encoding is Windows-1251)

create table ENC_TEXT_SAMPLE (
  ID integer,
  TEXT long nvarchar,
  primary key (ID)
);

insert into ENC_TEXT_SAMPLE (ID, XPER)
values (
  1,
  '<?xml version="1.0" encoding="WINDOWS-1251" ?>
Îí äîáàâèë
    êàðòîøêè,
    ïîñîëèë è
    ïîñòàâèë
    àêâàðèóì
    íà îãîíü
    (Ì.Æâàíåöêèé')
);
...

Find the IDs of all text documents whose texts contain a specified phrase.

select ID from ENC_SAMPLE where
  contains (TEXT, '[__enc ''KOI8-R'']
    "ÐÏÓÔÁ×ÉÌ
    ÁË×ÁÒÉÕÍ
    ÎÁ ÏÇÏÎØ"'
  );

Encoding may be applied locally to an argument of the text-search predicate. It may be used if the document contains citations in different encodings or if the XML document contains non-ASCII characters in names of tags or attributes, or if the encoding affects character codes of ASCII symbols such as '/' or '['.

select ID from ENC_XML_SAMPLE where
  xcontains (XPER, '//cit[text-contains(., "[__enc ''KOI8-R'']
    ''ÐÏÓÔÁ×ÉÌ
    ÁË×ÁÒÉÕÍ ÎÁ
    ÏÇÏÎØ''")]'
  );
Note:

You may have free-text a expression written as a literal constant: e.g. if the argument of text-contains XPath function is a literal constant. Be careful to not declare the __enc twice, once in the beginning of the whole XPath expression and then again in the beginning of the free-text expression constant, because words of the text expression will thus be converted twice.