Parquet

COZYROC Parquet components, part of COZYROC SSIS+ suite, are third-party plug-ins for Microsoft SSIS that make it easy to parse and generate Apache Parquet files. The toolkit is easy to use and follows the same guidelines and principles used by the standard out-of-the-box SSIS components.

The Apache Parquet integration package consists of a Parquet Source and Parquet Destination components that enable reading/generating Parquet files.

Parquet file schema

The description below is only relevant to 2.1 and 2.2 versions of the component.

In 2.3 Parquet components work with a more universal schema format. For more information, visit Components Metadata Schema.

Both the Source and the Destination components can deduce the data schema from the provided sample file. However, if for some reason a sample file is not available at the time the package is designed, the schema can be entered into the Destination editor in JSON format.

Each element of the schema is represented by two mandatory attributes: name, type and one optional fields. The type property can represent a primitive data type (possible types and their corresponding SSIS types are given in the table below) or complex types such as: struct meaning nested object and list meaning collection (which can also consist of structure type objects). Only elements of a complex type such as a struct or listhave a fieldsattribute in which the description of nested objects is stored.

For example, а file containing a list of objects composed of id properties and a list of objects of type struct containing properties: nameand country as pictured below

can be presented with a schema string like this:

[{

"Name" : "cities.list",

"Type" : "list",

"Fields" : [{

"Name" : "cities.list.element",

"Type" : "struct",

"Fields" : [{

"Name" : "cities.list.element.country",

"Type" : "string"

}, {

"Name" : "cities.list.element.name",

"Type" : "string"

}]

}, {

"Name" : "id",

"Type" : "int64"

}]

Parquet Data Types and SSIS Data Types

Parquet Data Type	SSIS Data Type
unspecified	DT_WSTR
boolean	DT_BOOL
byte	DT_UI1
signedbyte	DT_I1
short	DT_I2
int16	DT_I2
unsignedint16	DT_UI2
int32	DT_I8
int64	DT_I8
int96	DT_I8
bytearray	DT_BYTES
string	DT_WSTR
float	DT_R4
double	DT_R8
decimal	DT_DECIMAL
timestamp	DT_DBTIMESTAMPOFFSET
time	DT_DBTIME2
date	DT_DATE
interval	DT_BYTES

Parquet Source

Consume data from Parquet file

In this section we will show you how to set up a Parquet Source component.

In the SSIS Toolbox, locate the COZYROC's Parquet Source component and drag it onto the Data Flow canvas.

Double-click on the component on the canvas.
Once the component editor opens, select the fIle connection from File Connection menu. If the file is in the correct format, its data schema will be visualized in the Schema text editor. Note that the schema can be entered manually in the editor and the corresponding metadata will be generated even without a file currently available.

When clicking on Columns tab the component would prepare the outputs and external columns by analyzing the existing data in the Schema text editor. Please note that the Parquet Source can have multiple outputs (see the article about composite records), which columns you can see. The data in these outputs can be processed by downstream transformation and destination components(e.g. multiple OLE DB Destinations can store the data in SQL Server database).
Click OK to close the component editor.

Congratulations! You have successfully configured the Parquet Source component.

Parquet Destination

Generate Parquet file

In this section we will show you how to set up a Parquet Destination component.

In the SSIS Toolbox, locate the COZYROC's Parquet Destination component and drag it onto the Data Flow canvas.

Double-click on the component on the canvas.
Once the component editor opens, select the destination where generated JSON data will be stored: then provide Parquet sample file or directly write the schema into Schema text editor. You can also change the size of the groups of rows into which the parquet file will be divided internally.
.

When clicking on Mapping tab the component would prepare the inputs and external columns by analyzing the scehma in the Schema text editor. Please note that the Parquet Destination can have multiple inputs (see the article about composite records), which columns you can see. The data in these inputs can be processed by upstream transformation and source components (e.g. a Query Transformation can be used to retrieve the necessary data from SQL Server database).
Click OK to close the component editor.

Congratulations! You have successfully configured the Parquet Destination component.

Overview

Parquet Source Component is SSIS Data Flow Component for retrieving data from Apache Parquet file that supports multiple outputs via the composite records pattern.

Supports reading the Apache Parquet files.
Component metadata is automatically retrieved from the provided Parquet file.
Supports the following Parquet sources: File and Variable.
Supports composite outputs. Besides the root Parquet Source Output that contains the top-level fields, for any nested arrays, corresponding composite outputs get populated.
Supports an error output for redirecting problematic records (in case of a failure processing the field values).

Quick Start

Consume data from Parquet file

In this section we will show you how to set up a Parquet Source component.

In the SSIS Toolbox, locate the COZYROC's Parquet Source component and drag it onto the Data Flow canvas.

Double-click on the component on the canvas.
Once the component editor opens, select the fIle connection from File Connection menu. If the file is in the correct format, its data schema will be visualized in the Schema text editor. Note that the schema can be entered manually in the editor and the corresponding metadata will be generated even without a file currently available.

When clicking on Columns tab the component would prepare the outputs and external columns by analyzing the existing data in the Schema text editor. Please note that the Parquet Source can have multiple outputs (see the article about composite records), which columns you can see. The data in these outputs can be processed by downstream transformation and destination components(e.g. multiple OLE DB Destinations can store the data in SQL Server database).
Click OK to close the component editor.

Congratulations! You have successfully configured the Parquet Source component.

Contribute

Parameters

Configuration

Use the parameters below to configure the component.

Source
2.2 SR-2

Indicates the source of Parquet data. The following options are available:

Value Description

File Select an existing File Connection Manager or create a new one.

Variable The Parquet data is available in a variable. Select a variable or create a new one.
Variable
2.2 SR-2

A variable that contains Parquet data.
Schema

JSON string representing the schema of the Parquet file.
In 2.3 release new schema format is introduced. For more information visit Components Metadata Schema.

Value	Description
File	Select an existing File Connection Manager or create a new one.
Variable	The Parquet data is available in a variable. Select a variable or create a new one.

Knowledge Base

Where can I find the documentation for the Parquet Source?

What's New

2.3

New: Improved flat data processing speed.

2.2 SR-2

New: Support for stream input.

2.1 SR-1

Fixed: Incorrect lower-case headers (Thank you, Romain).
Fixed: Failed with error "System.IO.EndOfStreamException: Unable to read beyond the end of the stream." (Thank you, Naveen).
New: Considerable performance improvements.
Fixed: Data mismatch when reading from certain files (Thank you, Jessica).

2.1

New: Introduced component.

Overview

Parquet Destination Component is SSIS Data Flow Component for generating Apache Parquet files.

The component metadata is either automatically retrieved from a sample Parquet file or can be manually specified in JSON format.
The generated Parquet file can contain nested arrays of objects following the composite records pattern), where the fields for the arrays are fed via separate inputs.
The generated Parquet content can be written to a file or stored in a variable.

Demonstration

Quick Start

Generate Parquet file

In this section we will show you how to set up a Parquet Destination component.

In the SSIS Toolbox, locate the COZYROC's Parquet Destination component and drag it onto the Data Flow canvas.

Double-click on the component on the canvas.
Once the component editor opens, select the destination where generated JSON data will be stored: then provide Parquet sample file or directly write the schema into Schema text editor. You can also change the size of the groups of rows into which the parquet file will be divided internally.
.

When clicking on Mapping tab the component would prepare the inputs and external columns by analyzing the scehma in the Schema text editor. Please note that the Parquet Destination can have multiple inputs (see the article about composite records), which columns you can see. The data in these inputs can be processed by upstream transformation and source components (e.g. a Query Transformation can be used to retrieve the necessary data from SQL Server database).
Click OK to close the component editor.

Congratulations! You have successfully configured the Parquet Destination component.

Contribute

Parameters

Configuration

Use the parameters below to configure the component.

Destination

2.2 SR-2

Indicates the destination of Parquet data. The following options are available:

Value	Description
File	The Parquet data will be stored in a file. Select an existing File Connection Manager or create a new one.
Variable	The Parquet data will be stored in a variable. Select a variable or create a new one.

Variable
2.2 SR-2

A variable to which the Parquet data will be written.

Compression

2.3

Specify the compression method applied to columns in the destination file. The following options are available:

Value	Description
None	None compression means no compression is applied to the data.
Snappy	Snappy compression is a fast algorithm developed by Google, optimized for speed over compression ratio. Used in real-time processing systems where low latency is critical.
Gzip	Gzip compression uses the DEFLATE algorithm and is one of the most widely supported formats. General-purpose compression, especially for files and web traffic.
LZ4	LZ4 compression is designed for extremely fast compression and decompression with minimal CPU usage. Used in high-throughput systems such as streaming or databases.
Zstd	Zstandard compression is a modern algorithm developed by Facebook, offering a strong balance between speed and compression ratio. Used in systems needing both efficient storage and good performance.
Lz4Raw	LZ4 Raw compression is the raw/block version of LZ4 compression without headers or framing metadata.

Level

2.3

Specify the compression level applied to columns in the destination file. The following options are available:

Value	Description
Optimal	Optimal compression prioritizes reducing file size as much as possible, even if the process takes longer.
Fastest	Fastest compression prioritizes speed over file size, producing compressed data quickly but with less efficiency.
NoCompression	No compression means the data is stored as-is, without any size reduction.

Encoding
2.3

Specifies the encoding that should be used in the destination file.

Dictionary Encoding stores repeated values as references to a dictionary of unique values. Very effective when a column has many repeated or low-cardinality values (e.g., country codes, boolean flags).

Delta Binary Packed Encoding stores differences between consecutive values rather than the full values. Works best on numeric sequences or sorted data where values change gradually (e.g., timestamps, sequential IDs).

When both are enabled, Parquet can first apply dictionary encoding to reduce repeated values, followed by delta-binary encoding to compress numeric sequences within the dictionary. This can maximize compression for the appropriate data. Either dictionary or delta-binary encoding can be chosen individually depending on the data characteristics. When no special encoding is applied, data is stored in raw form, which may be faster to write but results in larger file sizes.

OutputSplitMethod

2.3

Specify whether a new file or a new row group will be created when the split condition is met.

2.3

Value	Description
Row Groups	A new row group is created. A row group is a logical horizontal partition of the data into rows. It contains serialized (and compressed) arrays of column entries.
Files	A new file is created.

OutputSplitTriggerMode

2.3

Specify the trigger for splitting the data. The following options are available:

Value	Description
Fixed Row Count	Splitting is triggered by the specified number of rows from the main input.
Input Data Size	Splitting is triggered when the total size of the input data reaches the specified number of megabytes.
Delimiter Row	Splitting is triggered by a delimiter row in the main input, where all columns contain null values.

OutputSplitThreshold
2.3

Specify the splitting threshold. In Fixed Row Count mode, this is the number of rows. In Input Data Size mode, this is the number of megabytes. This property is not used in Delimiter Row mode.
Schema

JSON string representing the schema of the Parquet file.
In 2.3 release new schema format is introduced. For more information visit Components Metadata Schema.

Knowledge Base

Where can I find the documentation for the Parquet Destination?

What's New

2.3

New: Improved flat data processing speed.
New: A new Compression parameter for specifying the compression algorithm.
New: A new Level parameter for specifying the compression level.
New: A new Encoding parameter for configuring the value sequence handling algorithm.
New: A new OutputSplitMethod parameter for specifying whether the data is split into row groups or files.
New: A new OutputSplitTriggerMode parameter for specifying what trigger to use for data splitting.
New: A new OutputSplitThreshold parameter for specifying additional information related to the split trigger mode.

2.2 SR-2

New: Support for stream output.

2.1 SR-3

Fixed: Errors when creating Parquet file in a dynamic data flow (Thank you, Sam!).

2.1 SR-2

New: Automatic schema generation when attaching to upstream component.
New: Support for dynamic data flow.

2.1 SR-1

Fixed: Incorrect lower-case headers (Thank you, Romain).
Fixed: Missing record in each batch of records (Thank you, Romain).

2.1

New Introduced component.

Parquet Destination

2.3

New: Improved flat data processing speed.
New: A new Compression parameter for specifying the compression algorithm.
New: A new Level parameter for specifying the compression level.
New: A new Encoding parameter for configuring the value sequence handling algorithm.
New: A new OutputSplitMethod parameter for specifying whether the data is split into row groups or files.
New: A new OutputSplitTriggerMode parameter for specifying what trigger to use for data splitting.
New: A new OutputSplitThreshold parameter for specifying additional information related to the split trigger mode.

2.2 SR-2

New: Support for stream output.

2.1 SR-3

Fixed: Errors when creating Parquet file in a dynamic data flow (Thank you, Sam!).

2.1 SR-2

New: Automatic schema generation when attaching to upstream component.
New: Support for dynamic data flow.

2.1 SR-1

Fixed: Incorrect lower-case headers (Thank you, Romain).
Fixed: Missing record in each batch of records (Thank you, Romain).

2.1

New Introduced component.

Parquet Source

2.3

New: Improved flat data processing speed.

2.2 SR-2

New: Support for stream input.

2.1 SR-1

Fixed: Incorrect lower-case headers (Thank you, Romain).
Fixed: Failed with error "System.IO.EndOfStreamException: Unable to read beyond the end of the stream." (Thank you, Naveen).
New: Considerable performance improvements.
Fixed: Data mismatch when reading from certain files (Thank you, Jessica).

2.1

New: Introduced component.

Knowledge Base

SSIS+ Components Suite

SSIS NoW

Excel Add-in for SAS

Parquet file schema

Parquet Source

Parquet Destination

Overview

Quick Start

Parameters

Configuration

Knowledge Base

What's New

Overview

Demonstration

Quick Start

Parameters

Configuration

Knowledge Base

What's New

Parquet Destination

Parquet Source

Knowledge Base

Newsletter

Contact Us

Follow Us

Support

SSIS+ Components Suite

SSIS NoW

Excel Add-in for SAS

Parquet file schema

Parquet Source

Parquet Destination

Overview

Quick Start

Parameters

Configuration

Knowledge Base

What's New

Related documentation

Overview

Demonstration

Quick Start

Parameters

Configuration

Knowledge Base

What's New

Related documentation

Parquet Destination

Parquet Source

Knowledge Base