ConvertPdfToImage
Description
Converts a PDF file into a series of images, one for each page.
Tags
image, jpeg, jpg, ocr, pdf, png, tesseract
Properties
In the table below, required properties are marked with an asterisk (*); all other properties are optional. The table also lists default values and notes whether a property supports the NiFi Expression Language.
Display Name | API Name | Default Value | Allowable Values | Description |
---|---|---|---|---|
Output Format * | Output Format | PNG | | The format to use when writing the image. |
Output Image Type * | Output Image Type | Color | | The type of image to use when writing the image. |
Max File Size * | Max File Size | 10 MB | | Because the entire contents of the PDF file must be loaded into memory in order to parse it, this property is used to limit the size of the PDF file that can be processed. If a PDF file is larger than this value, it will be routed to failure. |
Dots Per Inch * | Dots Per Inch | 144 | | The Dots Per Inch (DPI) to use when rendering the image. Larger values can result in higher quality images, but also larger file sizes and slower processing. Supports Expression Language, using FlowFile attributes and Environment variables. |
Max Page Size in Inches | Max Page Size in Inches | 11.0 | | The maximum width or height of the image, in inches. The limit is applied to the larger of the page's width and height: if that dimension exceeds this value, the image is scaled down to fit within it. Used together with the Dots Per Inch property, this helps control the size of the resulting image and keeps memory usage and processing time in check. The default value of 11.0 accommodates a standard 8.5 x 11 inch page. |
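
The interaction between Dots Per Inch and Max Page Size in Inches can be illustrated with a little arithmetic. The sketch below is not the processor's implementation: the function name `rendered_size` is hypothetical, and it assumes that the aspect ratio is preserved, that pages are never scaled up, and that pixel counts are rounded to the nearest integer.

```python
def rendered_size(width_in, height_in, dpi=144, max_page_in=11.0):
    """Return ((scaled_w_in, scaled_h_in), (w_px, h_px)) for one page.

    Assumptions (not confirmed by the processor documentation): the aspect
    ratio is preserved, pages are never scaled up, and pixel counts are
    rounded to the nearest integer.
    """
    longest = max(width_in, height_in)
    # Scale down only when the longer side exceeds the configured limit.
    scale = min(1.0, max_page_in / longest)
    scaled_w = width_in * scale
    scaled_h = height_in * scale
    return (scaled_w, scaled_h), (round(scaled_w * dpi), round(scaled_h * dpi))


# US Letter (8.5 x 11 in) fits within the default limit, so it is not scaled.
letter_inches, letter_pixels = rendered_size(8.5, 11.0)
print(letter_pixels)   # (1224, 1584)

# Tabloid (11 x 17 in) is scaled so that its longer side becomes 11 inches.
tabloid_inches, tabloid_pixels = rendered_size(11.0, 17.0)
print(tabloid_pixels)  # (1025, 1584)
```

Under these assumptions, the longer side of any page never exceeds `dpi * max_page_in` pixels (1584 with the defaults), which is what bounds the size of each rendered image.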
Dynamic Properties
This component does not support dynamic properties.
Relationships
Name | Description |
---|---|
failure | If a FlowFile cannot be converted into an image for any reason, it will be routed to this relationship. |
images | The resulting images, one per page, are routed to this relationship. |
original | The original PDF file is routed to this relationship when processing is successful. |
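
The routing described in this table, together with the Max File Size check, can be summarized in a small model. This is only a sketch of the documented behavior, not the processor's code: the function `route` and the `parse_ok` flag are hypothetical, and it assumes that 10 MB is interpreted as 10 * 1024 * 1024 bytes.

```python
def route(pdf_size_bytes, page_count,
          max_file_size_bytes=10 * 1024 * 1024, parse_ok=True):
    """Return (relationship, content) pairs for one incoming PDF FlowFile.

    parse_ok is a hypothetical stand-in for "the PDF could be converted";
    the 1024-based reading of "10 MB" is an assumption.
    """
    # Oversized or unconvertible PDFs are routed to failure.
    if pdf_size_bytes > max_file_size_bytes or not parse_ok:
        return [("failure", "original PDF")]
    # One image per page, numbered from 1, goes to the images relationship.
    transfers = [("images", f"page {n}") for n in range(1, page_count + 1)]
    # On success the unmodified PDF goes to the original relationship.
    transfers.append(("original", "original PDF"))
    return transfers


print(route(1024, 2))
# [('images', 'page 1'), ('images', 'page 2'), ('original', 'original PDF')]
```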
Reads Attributes
This processor does not read attributes.
Writes Attributes
Name | Description |
---|---|
image.height.original.inches | The height of the image in inches before scaling to the configured maximum size. |
image.height.pixels | The height of the image in pixels. |
image.height.scaled.inches | The height of the image in inches after scaling to the configured maximum size. |
image.width.original.inches | The width of the image in inches before scaling to the configured maximum size. |
image.width.pixels | The width of the image in pixels. |
image.width.scaled.inches | The width of the image in inches after scaling to the configured maximum size. |
pageNumber | The page number in the PDF Document that the image represents, with the first page having a value of 1. |
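
A downstream script could use these attributes to tell whether a page was scaled down and to recover the effective resolution of the image. The sketch below assumes the attribute values are numeric strings (NiFi attributes are always strings); the helper `describe_page` and the sample values are illustrative, not output captured from the processor.

```python
def describe_page(attrs):
    """Summarize one image FlowFile's attributes (a dict of str -> str)."""
    orig_w = float(attrs["image.width.original.inches"])
    scaled_w = float(attrs["image.width.scaled.inches"])
    width_px = float(attrs["image.width.pixels"])
    return {
        "page": int(attrs["pageNumber"]),
        # The page was scaled if its width shrank to fit the maximum size.
        "was_scaled": scaled_w < orig_w,
        # Pixels per inch actually present in the rendered image.
        "effective_dpi": width_px / scaled_w,
    }


# Illustrative values for an unscaled US Letter page rendered at 144 DPI.
sample = {
    "pageNumber": "1",
    "image.width.original.inches": "8.5",
    "image.width.scaled.inches": "8.5",
    "image.width.pixels": "1224",
}
print(describe_page(sample))
# {'page': 1, 'was_scaled': False, 'effective_dpi': 144.0}
```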
State Management
This component does not store state.
Restricted
This component is not restricted.
Input Requirement
This component requires an incoming relationship.
System Resource Considerations
Scope | Description |
---|---|
MEMORY | Parsing a PDF requires random access to the data. As such, the entire PDF must be read into Java's heap. |