Parse
type: "io.kestra.plugin.tika.Parse"
Parse a document and extract content and metadata
Examples
Extract text and embedded images from a file
id: "parse"
type: "io.kestra.plugin.tika.Parse"
from: '{{ inputs.file }}'
extractEmbedded: true
store: false
Extract text from an image using OCR
id: "parse"
type: "io.kestra.plugin.tika.Parse"
from: '{{ inputs.file }}'
ocrOptions:
  strategy: OCR_AND_TEXT_EXTRACTION
store: true
Properties
contentType
- Type: string
- Dynamic: ❌
- Required: ❌
- Default:
XHTML
- Possible Values:
TEXT
XHTML
XHTML_NO_HEADER
The content type of extracted text
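As a minimal sketch, a task that only needs plain text could override the default content type. The task id `parse_text` is a hypothetical name chosen for illustration; the property values are taken from the list above.

```yaml
id: "parse_text"  # hypothetical task id
type: "io.kestra.plugin.tika.Parse"
from: '{{ inputs.file }}'
contentType: TEXT  # plain text instead of the default XHTML
```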
extractEmbedded
- Type: boolean
- Dynamic: ❌
- Required: ❌
- Default:
false
Whether to extract documents embedded in the file, such as images
from
- Type: string
- Dynamic: ✔️
- Required: ❌
The file to parse
Must be a Kestra internal storage URI
ocrOptions
- Type: OcrOptions
- Dynamic: ❌
- Required: ❌
- Default:
{strategy=NO_OCR}
Enable or disable OCR capture
OCR processing requires Tesseract to be installed
store
- Type: boolean
- Dynamic: ❌
- Required: ❌
- Default:
true
Whether to store the parsed result in an Ion-serialized data file
Outputs
result
- Type: Parsed
uri
- Type: string
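The following sketch shows one way a downstream task might read the parsed content. Several details are assumptions and not stated on this page: the flow id, namespace, and task ids are placeholders; the input declaration syntax and the Log task type can differ between Kestra versions; and it assumes that `result` is populated when `store` is false, while `uri` is populated when `store` is true.

```yaml
# Sketch of a complete flow; ids and namespace are placeholders.
id: tika_parse_example
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    store: false  # assumed to populate `result` rather than `uri`

  - id: log_content
    # Assumed core Log task; the type name may differ between Kestra versions.
    type: io.kestra.plugin.core.log.Log
    message: "{{ outputs.parse.result.content }}"
```

With `store: true`, a later task would presumably reference `{{ outputs.parse.uri }}` to read the stored file.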
Definitions
OcrOptions
enableImagePreprocessing
- Type: boolean
- Dynamic: ❌
- Required: ❌
Enable image preprocessing
When enabled, Tika preprocesses images (rotation detection and image normalization with ImageMagick) before sending them to Tesseract. This requires the preprocessing dependencies, including ImageMagick, to be installed.
language
- Type: string
- Dynamic: ✔️
- Required: ❌
Language used for OCR
strategy
- Type: string
- Dynamic: ❌
- Required: ❌
- Default:
NO_OCR
- Possible Values:
AUTO
NO_OCR
OCR_ONLY
OCR_AND_TEXT_EXTRACTION
Enable or disable OCR capture
OCR processing requires Tesseract and the Tesseract language pack for the chosen language to be installed
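A task combining the three `OcrOptions` properties might look like the sketch below. The task id is hypothetical, and the value `fra` assumes that `language` takes a Tesseract language code whose language pack is installed on the worker.

```yaml
id: "parse_scan"  # hypothetical task id
type: "io.kestra.plugin.tika.Parse"
from: '{{ inputs.file }}'
ocrOptions:
  strategy: OCR_ONLY
  language: fra  # assumed Tesseract language code; needs the matching language pack
  enableImagePreprocessing: true
```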
Parsed
content
- Type: string
- Dynamic: ❓
- Required: ❌
embedded
- Type: object
- SubType: string
- Dynamic: ❓
- Required: ❌
metadata
- Type: object
- Dynamic: ❓
- Required: ❌