Overview
The PDF Source Component is an SSIS Data Flow Component for consuming tabular data from PDF files. The component detects tables in the PDF file and allows processing a single table or multiple consecutive tables (across several pages), assuming they have the same structure. All output columns are of data type DT_WSTR.
Quick Start
In this section we will show you how to set up a PDF Source component.
- In the SSIS Toolbox, locate COZYROC's PDF Source component and drag it onto the Data Flow canvas.
- Double-click it to open its editor.
- Choose the location of the PDF file and specify the following parameters to describe its table structure and which table to process (the file may contain multiple tables).
Parameters
General
Use the General page of the PDF Source dialog to specify the source PDF file, which table to process, and how to process it.
Select a file via a standard FILE connection manager.
Specify the PDF file password, if necessary.
Specify whether the PDF table to be processed has a header row with column names.
Specify whether consecutive tables need to be treated as one. This is useful for a table that spans several pages. Consecutive tables are "merged", i.e. processed like a single table, only if they have the same number of columns.
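The merge rule for consecutive tables can be illustrated with a short Python sketch. This is not the component's implementation: the function name `merge_consecutive`, the list-of-rows representation of a table, and the use of the first row's length as the column count are all assumptions made for illustration. How the component treats header rows repeated on each page is not covered here.

```python
def merge_consecutive(tables, start):
    """Join the table at `start` with the tables that follow it,
    for as long as their column count matches (illustrative only)."""
    first = tables[start]
    if not first:
        return []
    width = len(first[0])  # assumption: column count = length of the first row
    merged = list(first)
    for following in tables[start + 1:]:
        # Stop at the first table with a different structure.
        if not following or len(following[0]) != width:
            break
        merged.extend(following)
    return merged


# Hypothetical sample: a two-column table continued on the next page,
# followed by an unrelated three-column table.
sample_tables = [
    [["A", "B"], ["1", "2"]],
    [["3", "4"]],
    [["x", "y", "z"]],
    [["5", "6"]],
]
merged_rows = merge_consecutive(sample_tables, 0)
```

In this sketch the last two-column table is not merged, because the three-column table breaks the run of consecutive tables.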
Select how to locate a table in the PDF document. This property (TableFindType) has the following options:
- Index: Locate a table by its zero-based index (default strategy).
- RowRegex: Locate a table by a regular expression on a row representation, where the row values are comma-separated.
- IndexAndRowRegex: Locate a table by index and then locate its first row by a regular expression.
Specify the PDF table location strategy criteria (TableFind), according to TableFindType:
- Index: A zero-based index (e.g. "0").
- RowRegex: A regular expression to match the first row across all tables in the PDF document (e.g. `^#` would match the first row that starts with `#`).
- IndexAndRowRegex: A regular expression to match the first row within a specified table in the PDF document (e.g. `1|^#` would match the first row in the second table that starts with `#`).
Specifies how many rows to skip at the end of the table. This is mainly useful when there are one or more summary rows at the end.
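The three table location strategies can be pictured with the following Python sketch. It is a hedged illustration under stated assumptions, not the component's code: the function `find_table`, the list-of-rows table model, and the `(table_index, row_index)` return value are hypothetical. The sketch assumes each row is matched against its values joined with commas, and that an IndexAndRowRegex value has the form `index|regex`, as in the `1|^#` example.

```python
import re


def find_table(tables, find_type, find):
    """Return (table_index, first_row_index) for the located table,
    or None when nothing matches (illustrative only)."""
    if find_type == "Index":
        index = int(find)
        return (index, 0) if 0 <= index < len(tables) else None

    if find_type == "RowRegex":
        candidates = range(len(tables))  # search every table
        pattern = find
    elif find_type == "IndexAndRowRegex":
        # Assumption: the text before the first "|" is the table index.
        index_text, pattern = find.split("|", 1)
        index = int(index_text)
        candidates = [index] if 0 <= index < len(tables) else []
    else:
        raise ValueError(f"unknown find type: {find_type}")

    regex = re.compile(pattern)
    for table_index in candidates:
        for row_index, row in enumerate(tables[table_index]):
            # Row representation: values joined with commas.
            if regex.search(",".join(row)):
                return (table_index, row_index)
    return None


# Hypothetical sample document with two detected tables.
pdf_tables = [
    [["Name", "Qty"], ["Bolt", "4"]],
    [["Report", "2024"], ["#", "Item"], ["1", "Nut"]],
]
by_index = find_table(pdf_tables, "Index", "0")
by_regex = find_table(pdf_tables, "RowRegex", "^#")
by_both = find_table(pdf_tables, "IndexAndRowRegex", "1|^#")
```

With this sample, both regex-based lookups land on the second row of the second table, since `#,Item` is the first row representation that starts with `#`.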
Specifies whether to skip rows in a table that have fewer values than the table has columns. Incomplete rows are sometimes an indication that the rows don't really belong to the table (for example, when the PDF parsing has not been precise about part of the content). The options are:
- None: Don't skip incomplete rows (pad with NULL values instead).
- Bottom Rows: Skip incomplete rows only at the bottom of the table (default).
- All: Skip all incomplete rows.
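The difference between the three options for incomplete rows can be sketched in Python as follows. The function `apply_skip_incomplete` and the list-of-rows model are hypothetical, and the sketch assumes that incomplete rows which are kept (for example, rows in the middle of the table under Bottom Rows) are padded with NULL values, represented here by `None`. The source states the padding only for the None option.

```python
def apply_skip_incomplete(rows, column_count, mode="Bottom Rows"):
    """Apply one of the three incomplete-row options to a table
    (illustrative only)."""

    def is_incomplete(row):
        return len(row) < column_count

    if mode == "None":
        kept = list(rows)
    elif mode == "Bottom Rows":
        # Drop incomplete rows only from the end of the table.
        end = len(rows)
        while end > 0 and is_incomplete(rows[end - 1]):
            end -= 1
        kept = rows[:end]
    elif mode == "All":
        kept = [row for row in rows if not is_incomplete(row)]
    else:
        raise ValueError(f"unknown mode: {mode}")

    # Assumption: any incomplete row that survives is padded with None.
    return [row + [None] * (column_count - len(row)) for row in kept]


# Hypothetical two-column table with one short row in the middle
# and one short summary row at the bottom.
raw_rows = [["a", "1"], ["b"], ["c", "3"], ["Total"]]
keep_all = apply_skip_incomplete(raw_rows, 2, "None")
skip_bottom = apply_skip_incomplete(raw_rows, 2, "Bottom Rows")
skip_all = apply_skip_incomplete(raw_rows, 2, "All")
```

In this sample, only the trailing `Total` row is dropped under Bottom Rows, while the short row in the middle is kept and padded.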
What's New
- New: A new parameter 'Skip incomplete data rows'.
- New: Find table by index, regex or both (replace `TableIndex` with `TableFindBy` and `TableFind`).
- New: Introduced component.
Related documentation
COZYROC SSIS+ Components Suite is free for testing in your development environment.
A licensed version can be deployed on-premises, on Azure-SSIS IR and on COZYROC Cloud.