ChunkText

Description

Chunks text, with options for recursively splitting by delimiters and for limiting chunks to a maximum character length. Each chunk is given the following attributes: fragment.identifier, fragment.index, fragment.count, and segment.original.filename. The MergeContent processor can then use these attributes to reconstitute the original FlowFile.
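
As a rough illustration of how the fragment attributes allow reassembly, the following Python sketch groups chunks by fragment.identifier, checks fragment.count, and joins the contents in fragment.index order. It does not show how MergeContent is implemented; the `reassemble` function and the (attributes, content) pair representation are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def reassemble(chunks):
    """Rebuild texts from (attributes, content) pairs, keyed by fragment.identifier.

    Hypothetical helper for illustration only; it is not part of NiFi.
    """
    groups = defaultdict(list)
    for attrs, content in chunks:
        groups[attrs["fragment.identifier"]].append((attrs, content))

    rebuilt = {}
    for identifier, members in groups.items():
        expected = int(members[0][0]["fragment.count"])
        if len(members) != expected:
            # Some fragments are missing, so skip this group.
            continue
        # fragment.index starts at 0 and gives the original order.
        members.sort(key=lambda m: int(m[0]["fragment.index"]))
        rebuilt[identifier] = "".join(content for _, content in members)
    return rebuilt


chunks = [
    ({"fragment.identifier": "abc", "fragment.index": "1", "fragment.count": "2"}, "world"),
    ({"fragment.identifier": "abc", "fragment.index": "0", "fragment.count": "2"}, "hello "),
]
print(reassemble(chunks))  # {'abc': 'hello world'}
```

If whitespace was trimmed from the chunks, a plain join like this may not reproduce the original text exactly.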

Properties

In the table below, required properties are marked with an asterisk (*). All other properties are optional. The table also lists any default values and indicates whether a property supports the NiFi Expression Language.

Display Name	API Name	Default Value	Allowable Values	Description
Chunking Strategy *	Chunking Strategy	Recursive Delimiters	Max Chunk Length, Recursive Delimiters, Sentence, Semantic	Strategy used to chunk text. 'Recursive Delimiters' chunks text using the recursive split-by-character algorithm: the input text is split by the first delimiter, and the resulting splits are merged back into chunks that do not exceed 'Max Chunk Length'. Any split that exceeds 'Max Chunk Length' is then recursively split using the next delimiter. 'Max Chunk Length' chunks text by creating chunks that are 'Max Chunk Length' characters long.
Language *	Language	English	English, Dutch, French, Italian, German	Language to use for parsing sentences. This property is only considered if the Chunking Strategy property has a value of SENTENCE or SEMANTIC.
Sentence Similarity Threshold *	Sentence Similarity Threshold	0.6		Threshold for determining whether two sentences are similar enough to occupy the same chunk. A value of 1.0 indicates that the sentences are identical. A value of 0.0 indicates that the sentences are completely dissimilar. Supports Expression Language, using FlowFile attributes and Environment variables. This property is only considered if the Chunking Strategy property has a value of SEMANTIC.
Max Chunk Length *	Max Chunk Length	4000		Maximum number of characters to include in each output chunk. Setting this number too high can result in an out-of-memory error. Supports Expression Language, using FlowFile attributes and Environment variables.
Trim Whitespace *	Trim Whitespace	true	true, false	Trim whitespace surrounding each output text chunk. This property is only considered if the Chunking Strategy property has a value of RECURSIVE_DELIMITERS or MAX_CHUNK_LENGTH.
Chunk Delimiters *	Chunk Delimiters	\n\n,\n,., ,		Specifies a comma-separated list of character sequences. The meta-characters \n, \r and \t are automatically un-escaped. Delimiters are applied recursively, in order, to chunk the text. This property is only considered if the Chunking Strategy property has a value of RECURSIVE_DELIMITERS.
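
The 'Recursive Delimiters' strategy described in the table can be sketched in Python roughly as follows. This is a minimal illustration of the split, merge, and recurse idea, not the processor's actual code. The function names are hypothetical. The sketch also makes three assumptions: empty entries in the delimiter list are dropped, each delimiter stays attached to the split that precedes it, and text is cut at fixed positions once no delimiters remain.

```python
def parse_delimiters(spec):
    """Parse a comma-separated delimiter list, un-escaping \\n, \\r and \\t.

    Assumption: empty entries (such as a trailing one) are ignored.
    """
    delimiters = []
    for raw in spec.split(","):
        value = raw.replace("\\n", "\n").replace("\\r", "\r").replace("\\t", "\t")
        if value:
            delimiters.append(value)
    return delimiters


def recursive_chunk(text, delimiters, max_len):
    """Split by the first delimiter, merge splits up to max_len, recurse on oversized splits."""
    if len(text) <= max_len:
        return [text] if text else []
    if not delimiters:
        # Assumption: with no delimiters left, cut at fixed positions.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    head, rest = delimiters[0], delimiters[1:]
    pieces = text.split(head)
    chunks, current = [], ""
    for i, piece in enumerate(pieces):
        if i < len(pieces) - 1:
            piece += head  # keep the delimiter so no characters are lost
        if len(piece) > max_len:
            if current:
                chunks.append(current)
                current = ""
            # Oversized split: try again with the next delimiter.
            chunks.extend(recursive_chunk(piece, rest, max_len))
        elif len(current) + len(piece) <= max_len:
            current += piece  # merge small splits back together
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def trim_chunks(chunks):
    """Mimic a whitespace-trimming step, dropping chunks that become empty."""
    return [c.strip() for c in chunks if c.strip()]


delims = parse_delimiters("\\n\\n,\\n,., ,")
raw_chunks = recursive_chunk("aaa bbb ccc\n\nddd eee", delims, 8)
print(trim_chunks(raw_chunks))
```

In this sketch no chunk exceeds `max_len`, and joining the untrimmed chunks gives back the input text. Chunks returned from a recursive call are not merged with their neighbours, a simplification that a real implementation may handle differently.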

Dynamic Properties

This component does not support dynamic properties.

Relationships

Name	Description
original	The input FlowFile is routed to the original relationship.
success	Text chunks are routed to the success relationship.

Reads Attributes

This processor does not read attributes.

Writes Attributes

Name	Description
chunk.delimiters	Comma-separated list of delimiters used to chunk text. This attribute is added only when the 'Recursive Delimiters' chunking strategy is used.
chunk.end.offsets	The chunk.end.offsets attribute is added only to the original incoming FlowFile. It is a comma-separated list of end offsets for each chunk that gets generated. For example, if the FlowFile is chunked into 3 child FlowFiles, it might have a value of `183,365,548` indicating that the first chunk ends at offset 183, the second chunk ends at offset 365, and the third chunk ends at offset 548. Offsets are based on the number of characters.
chunk.language	Language used for parsing sentences. This attribute is added only when the 'Sentence' or 'Semantic' chunking strategy is used.
chunk.max.chars	Maximum number of characters to include in each chunk.
chunk.semantic.threshold	Threshold for determining if two sentences are similar enough to occupy the same chunk. This attribute is added only when the 'Semantic' chunking strategy is used.
chunk.start.offsets	The chunk.start.offsets attribute is added only to the original incoming FlowFile. It is a comma-separated list of start offsets for each chunk that gets generated. For example, if the FlowFile is chunked into 3 child FlowFiles, it might have a value of `0,183,365` indicating that the first chunk starts at offset 0, the second chunk starts at offset 183, and the third chunk starts at offset 365. Offsets are based on the number of characters.
chunk.strategy	Strategy used to chunk text. One of 'Max Chunk Length', 'Recursive Delimiters', 'Sentence', 'Semantic'.
fragment.count	The total number of FlowFile chunks produced.
fragment.identifier	ID of the parent FlowFile used to generate each chunk.
fragment.index	Index of the current FlowFile chunk, starting at 0.
segment.original.filename	Original filename of the input FlowFile.
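
To show how the chunk.start.offsets and chunk.end.offsets attributes might be used downstream, the Python sketch below slices the original text back into its chunk ranges. The `slice_by_offsets` helper is hypothetical. It assumes that end offsets are exclusive, which is how the examples above read (one chunk ends at 183 and the next starts at 183), and that offsets count characters rather than bytes.

```python
def slice_by_offsets(text, start_offsets, end_offsets):
    """Return the character ranges named by two comma-separated offset lists.

    Assumption: each range is half-open, [start, end).
    """
    starts = [int(value) for value in start_offsets.split(",")]
    ends = [int(value) for value in end_offsets.split(",")]
    if len(starts) != len(ends):
        raise ValueError("start and end offset lists differ in length")
    return [text[start:end] for start, end in zip(starts, ends)]


original = "alpha beta gamma"
print(slice_by_offsets(original, "0,6,11", "5,10,16"))  # ['alpha', 'beta', 'gamma']
```

Because the offsets refer to the original text, the slices can differ from the chunk contents if whitespace was trimmed from the chunks.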

State Management

This component does not store state.

Restricted

This component is not restricted.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

This component does not specify system resource considerations.
