Data flow transformation that extracts information from an input column containing XML documents, using XPath expressions. More than one expression can be provided, and the number of output column names must match the number of expressions.
If the transformation runs in merging mode (the MergeResults parameter is true), the output is synchronous with the input: all the matches from each XPath expression are joined with the given match separator and put in the corresponding column.
If the MergeResults parameter is false, the output is asynchronous, and each match from each XPath expression occupies a separate row in the corresponding column. The results from each expression are stacked together and sent to the output, which means that for every XML document from the input (i.e. for each input row), the number of rows in the output equals the maximum number of matches across all of the XPath expressions.
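The stacking behaviour of the non-merging mode can be sketched in a few lines of Python. This is only an illustration of the row-shaping rule described above, not the component's actual code: the function name `stack_matches` is hypothetical, and padding the shorter columns with `None` is an assumption, since the description does not say what value unmatched cells receive.

```python
from itertools import zip_longest


def stack_matches(matches_per_expression):
    """Stack per-expression match lists into output rows.

    Row i holds the i-th match of every expression. Expressions with
    fewer matches are padded with None (an assumption; the description
    does not state how such cells are filled).
    """
    return [list(row) for row in zip_longest(*matches_per_expression, fillvalue=None)]


# Hypothetical matches for two expressions on one input row.
first_column = ["A", "B"]
second_column = ["X", "Y", "Z"]

rows = stack_matches([first_column, second_column])
for row in rows:
    print(row)
```

The number of rows produced equals the length of the longest match list, which mirrors the "maximum number of matches" rule.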
Here is an example XML document, which corresponds to one row from the input:
If we've set up two XPath expressions, corresponding to two columns:
Title is filled from this expression:
Author is filled from this expression:
The resulting output rows, sent for this input row, will be these:
|Everyday Italian||Giada De Laurentiis|
|Harry Potter||J K. Rowling|
|XQuery Kick Start||James McGovern|
If we run with the same setup, but with MergeResults set to true and ResultSeparator set to a comma (,), the output would be:
|Everyday Italian,Harry Potter,XQuery Kick Start||Giada De Laurentiis,J K. Rowling,James McGovern,Per Bothner,Kurt Cagle,James Linn,Vaidyanathan Nagarajan|
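The merging mode can be approximated in Python as follows. This is a sketch under stated assumptions, not the transformation's implementation: the XML document, the expressions `.//title` and `.//author`, and the function name `merge_matches` are all hypothetical stand-ins chosen for illustration. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in document; not the sample from this page.
SAMPLE_XML = """<library>
  <book><title>Book One</title><author>Ann</author></book>
  <book><title>Book Two</title><author>Bob</author><author>Cy</author></book>
</library>"""


def merge_matches(xml_text, expressions, separator):
    """Return one merged value per expression for a single document.

    All matches of an expression are joined with the separator, so one
    input row yields exactly one output row (synchronous output).
    """
    root = ET.fromstring(xml_text)
    merged = []
    for expression in expressions:
        # findall returns matching elements in document order.
        texts = [(node.text or "") for node in root.findall(expression)]
        merged.append(separator.join(texts))
    return merged


merged_row = merge_matches(SAMPLE_XML, [".//title", ".//author"], ",")
print(merged_row)  # ['Book One,Book Two', 'Ann,Bob,Cy']
```

Because every expression collapses to a single string, the columns stay aligned with the input row even when the expressions match different numbers of nodes.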
The script has the following parameters:
- DocumentColumn - the input column containing XML documents to process.
- XPathExpressions - the list of XPath expressions for extraction, specified one per line.
- XPathNamespaces - the namespaces of the elements that are referenced in the XPath expressions. The format is
[namespace]. Multiple namespaces are separated by newlines.
- ErrorRowDisposition - what happens when an error occurs during processing, usually while parsing the document. The possible values are
RedirectRow. If the latter is selected, an error output is added, which is synchronous with the input, and has the
- ResultColumns - the comma-separated list of output column names, one for each XPath expression provided.
- MergeResults - whether to merge multiple results for an XPath expression, or run in asynchronous mode.
- ResultSeparator - if the merging mode is on, specifies the separator used to join multiple matches.
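The relationship between XPathExpressions (one per line) and ResultColumns (comma separated) can be checked with a small helper. The sketch below is an assumption about how such validation might look, not the component's own logic: the function name `parse_settings` and the example expressions are hypothetical.

```python
def parse_settings(xpath_expressions, result_columns):
    """Pair each output column name with its XPath expression.

    xpath_expressions: expressions separated by newlines.
    result_columns: column names separated by commas.
    Raises ValueError when the two counts differ.
    """
    expressions = [line.strip() for line in xpath_expressions.splitlines() if line.strip()]
    columns = [name.strip() for name in result_columns.split(",") if name.strip()]
    if len(expressions) != len(columns):
        raise ValueError(
            f"{len(expressions)} XPath expression(s) but {len(columns)} column name(s)"
        )
    # Columns and expressions are matched by position.
    return dict(zip(columns, expressions))


# Hypothetical parameter values.
mapping = parse_settings("//book/title\n//book/author", "Title, Author")
print(mapping)
```

Matching by position means the first column name receives the results of the first expression, the second column those of the second expression, and so on.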