ParsePdfDocument

Description

Parses a PDF file and extracts its text and additional information into a structured JSON document. Any images or tables found in the document are also extracted and routed to the images and tables relationships. Depending on the embedding strategy, their contents can also be stored in the JSON document itself as a Base64-encoded image, as text extracted from the image, or both. The Processor extracts the information in a way that preserves the original document's layout, including the hierarchy of its sections.

Properties

In the table below, required properties are marked with an asterisk (*); all other properties are optional. The table also lists any default values and indicates whether a property supports the NiFi Expression Language.

Display Name	API Name	Default Value	Allowable Values	Description
Service Location Strategy *	Service Location Strategy	Default	Default, Custom	Determines how Service Locations are configured within this processor for the Datavolo Document Element Detection Service.
Custom Element Detection Service URL *	Custom Element Detection Service URL			The custom URL of the Datavolo Document Element Detection Service. This property is only considered if Service Location Strategy is set to Custom.
OCR Service *	OCR Service		Controller Service: OCRService; Implementations: StandardOCRService	An OCR Service for reading files to output text.
Table Embedding Strategy *	Table Embedding Strategy	Extract as Text	Extract as Image, Extract as Text, Extract as Image and Text, Skip	When a table is found in the document, this property specifies how the table should be embedded into the document.
Image Embedding Strategy *	Image Embedding Strategy	Extract as Text	Extract as Image, Extract as Text, Extract as Image and Text, Skip	When an image is found in the document, this property specifies how the image should be embedded into the document.
Communication Timeout *	Communication Timeout	60 sec		The amount of time to wait for a response from the microservices before timing out.
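
The rules in the table above can be sketched as a small configuration check. This is an illustrative sketch only, not part of the processor: the function name `validate_properties` and the use of a plain dictionary keyed by property name are assumptions, as is the "number followed by a unit" pattern used for the timeout (inferred from the default value of 60 sec).

```python
import re

# Allowable values copied from the Properties table.
LOCATION_STRATEGIES = {"Default", "Custom"}
EMBEDDING_STRATEGIES = {
    "Extract as Image",
    "Extract as Text",
    "Extract as Image and Text",
    "Skip",
}

# Default values copied from the Properties table.
DEFAULTS = {
    "Service Location Strategy": "Default",
    "Table Embedding Strategy": "Extract as Text",
    "Image Embedding Strategy": "Extract as Text",
    "Communication Timeout": "60 sec",
}


def validate_properties(props):
    """Return a list of problems in a property dict (hypothetical helper)."""
    merged = {**DEFAULTS, **props}
    problems = []

    strategy = merged["Service Location Strategy"]
    if strategy not in LOCATION_STRATEGIES:
        problems.append(f"Service Location Strategy: unknown value {strategy!r}")
    elif strategy == "Custom" and not merged.get("Custom Element Detection Service URL"):
        # The custom URL is only considered when the strategy is Custom.
        problems.append("Custom Element Detection Service URL: required when strategy is Custom")

    # OCR Service is required and has no default value.
    if not merged.get("OCR Service"):
        problems.append("OCR Service: required")

    for name in ("Table Embedding Strategy", "Image Embedding Strategy"):
        if merged[name] not in EMBEDDING_STRATEGIES:
            problems.append(f"{name}: unknown value {merged[name]!r}")

    # Assumption: the timeout is written as "<number> <unit>", like "60 sec".
    if not re.fullmatch(r"\d+\s*[A-Za-z]+", merged["Communication Timeout"]):
        problems.append("Communication Timeout: expected a number followed by a unit")

    return problems
```

With only the OCR Service set, every other property falls back to its default and the check reports no problems; switching the strategy to Custom without supplying a URL produces one.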

Dynamic Properties

This component does not support dynamic properties.

Relationships

Name	Description
comms.failure	If the processor is unable to communicate with one of the necessary services, the input FlowFile will be routed to this relationship.
failure	If the text of a FlowFile cannot be extracted for any reason, the input FlowFile will be routed to this relationship.
images	If an image is found in the document, the image will be routed to this relationship.
success	The structured JSON document containing the text of the PDF is routed to this relationship.
tables	If a table is found in the document, the image of the table will be routed to this relationship.

Reads Attributes

This processor does not read attributes.

Writes Attributes

Name	Description
container.scope	The scope of the container is set to DOCUMENT for the JSON Document, TABLE for tables, and FIGURE for any figures/images identified
document.id	A unique UUID for the document
fragment.count	The total number of fragments
fragment.index	The index of the fragment
mime.type	The MIME type is set to 'application/json' for the JSON document and 'image/png' for any extracted images.
page.count	The number of pages in the PDF file is added to the JSON document.
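
As an illustration of how these attributes might be used downstream, the sketch below groups FlowFile attribute maps by document.id and then by container.scope. It assumes the attributes are available as plain Python dictionaries of strings; the function name `group_by_document` is hypothetical, and whether fragment.index is present on every FlowFile is not stated above, so a missing index is treated as 0.

```python
from collections import defaultdict


def group_by_document(flowfiles):
    """Group attribute dicts by document.id, then by container.scope.

    `flowfiles` is an iterable of dicts holding the attributes listed
    in the Writes Attributes table (hypothetical representation).
    """
    grouped = defaultdict(lambda: {"DOCUMENT": [], "TABLE": [], "FIGURE": []})

    for attrs in flowfiles:
        scopes = grouped[attrs["document.id"]]
        # setdefault tolerates a scope value not listed in the table.
        scopes.setdefault(attrs["container.scope"], []).append(attrs)

    # Order the entries within each scope by fragment.index.
    # Assumption: a missing fragment.index is treated as 0.
    for scopes in grouped.values():
        for items in scopes.values():
            items.sort(key=lambda a: int(a.get("fragment.index", 0)))

    return dict(grouped)


if __name__ == "__main__":
    sample = [
        {"document.id": "a", "container.scope": "TABLE", "fragment.index": "2", "mime.type": "image/png"},
        {"document.id": "a", "container.scope": "DOCUMENT", "mime.type": "application/json"},
        {"document.id": "a", "container.scope": "TABLE", "fragment.index": "1", "mime.type": "image/png"},
    ]
    for doc_id, scopes in group_by_document(sample).items():
        print(doc_id, {scope: len(items) for scope, items in scopes.items()})
```

Grouping on document.id keeps the extracted tables and figures associated with the JSON document they came from, even when they arrive on different relationships.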

State Management

This component does not store state.

Restricted

This component is not restricted.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

This component does not specify system resource considerations.
