Data flow transformation to extract columns error information.
Most of the SSIS data flow components have an error output that allows you to redirect rows that caused errors to a error logging destination. The quality of the information provided by the error output is rather limited. You get: the id of the column that's causing the error, the id of the error and all the input columns of the component. So what’s wrong with that?
- ErrorCode: this looks something like “-1073442796”. You then have to look up the error message that corresponds to this errorcode. It would be better if you simply got the error message.
- ErrorColumn: this is the id of the column causing the error. To find out which column this actually is you have to open the advanced editor of the component and run through all the columns to find the name of the offending column. Obviously we would prefer to get the name of the column that is causing the error.
- All the input columns. At first glance these contain useful information. They allow you to identify the row that is causing the error (ie by a identifier column) and will give you the value of the ErrorColumn (that is once you've found out what column that actually is). There is however a problem with this setting. Because most components will have different inputs every error output will contain different columns and at design time you don't know which one you are going to need to identify the error value.
What we would actually want is to have a column ErrorValue that contains the value of the column causing the error and to have to columns IdColumn and IdColumnValue that uniquely identify the row causing the error.
This last point is the more fundamental one because it makes it very difficult to set up an efficient centralized error-logging. Lets dwell on this a bit.
Good development practices for SSIS would be to redirect all error rows of all data flow components of all packages to one error-log destination (ie. an SQL server table). This would allow you to set up a centralized logging/reporting system so that you can check the errors of all the packages that have run ie the previous night.
It would also allow you to create a standard way of handling the final "all or nothing copy from staging to the data warehouse" package step:
- Best practice data warehouse building technique is to do all the transformation activities in a staging area (ie a separate database) and when that is finished to copy/replace the result to the actual data warehouse. In SSIS this would mean that the final step of a SSIS package would be a stored procedure that does this within one transaction.
- Of course, this stored procedure should check the error-log to verify that all rows have been processed correctly.
- To allow to set up this kind of centralized error-logging we not only need the ErrorMessage, ErrorColumn (name), ErrorValue, IdColumn and IdColumnValue, but also a means to identify the package, the execution instance of the package, the dataflow and the dataflow component that cause the error.
This script provides all this by transforming the error output of all the components of one data flow to a set of generic input columns for the error-log.
Transformation script output
The basic idea is that you connect all the error outputs of a data flow to the transformation script and then add the data destination for your error-log-destination. The output of the script component is:
- PackageId: identifies the package. This could be the name of the package or the System::PackageID variable that uniquely identifies the package
- ExecutionId: identifies the execution instance of the package. You can use the System::ExecutionInstanceGUID variable for this purpose or, alternatively your own id.
- StepName: Identifies the data flow and data flow component causing the error: "data flow name" – "data flow component name"
- RowIdentifierName: name of the column that uniquely identifies a row in the data flow. i.e. an identity column from the data source.
- RowIdentifierIs: the value of the RowIdentifierName
- ErrorColumnName: the name of the column that is causing the error
- ErrorColumnValue: the value of ErrorColumnName
- ErrorMessage: the description of the error
All these column are NVARCHAR column, the length of which you can set yourself.
Component properties for the transformation component
Before setting the component properties connect the error outputs of the data flow component. The following properties can be set:
- ExecutionId and PackageId are, by default, set to their SSIS counterparts. You can change them to the value of another package variable by choosing one from the drop down list.
- StandardDestinationColumnLength: this is the length of the NVARCHAR output column, by default its 250 but you can change it to another value if you like.
- Finally you can set the RowIdentifier. You do this by choosing an input (from the script component thus one of the error outputs of the other data flow components) and then picking the column name in the RowIdentifierName - drop down box.
The script component only has information about the input columns of the data flow components that you connect. This means that ie connection the error output of the "Calculate Q" component doesn’t give us any information about the column Q itself (because it doesn’t exists yet in the input). The information regarding Q is only provided when the error output of the next component is connected. If we don’t connect the next error output (or there isn’t any) the script component will not be able to translate the Id of the ErrorColumn to a name. In that case the Id will be reported.