Parse
Parse a document and extract its content and metadata.
yaml
type: "io.kestra.plugin.tika.Parse"
Extract text from a file.
yaml
id: tika_parse
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: '{{ inputs.file }}'
    extractEmbedded: true
    store: false
Extract text from an image using OCR.
yaml
id: tika_parse
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: '{{ inputs.file }}'
    ocrOptions:
      strategy: OCR_AND_TEXT_EXTRACTION
    store: true
Dynamic: YES
Default: XHTML
Possible Values: TEXT, XHTML, XHTML_NO_HEADER
The content type of the extracted text.
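As a minimal sketch, the flow below requests plain text instead of the default XHTML. The property name `contentType` does not appear in this section and is an assumption here, as is the flow id; check the plugin schema before relying on either.

```yaml
id: tika_parse_text
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    # Assumed property name for the content type described above.
    # TEXT is one of the listed possible values.
    contentType: TEXT
```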
Dynamic: YES
Default: false
Dynamic: YES
The file to parse.
Must be an internal storage URI.
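Because the file must already live in internal storage, one possible pattern is to fetch it with an earlier task and pass that task's output URI to the parser. The sketch below assumes the core `io.kestra.plugin.core.http.Download` task and its `uri` output; the flow id and the download URL are placeholders.

```yaml
id: tika_parse_download
namespace: company.team

tasks:
  - id: download
    type: io.kestra.plugin.core.http.Download
    # Placeholder URL, replace with a real document location.
    uri: https://example.com/sample.pdf

  - id: parse
    type: io.kestra.plugin.tika.Parse
    # The download task's output is assumed to be an internal storage URI.
    from: "{{ outputs.download.uri }}"
```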
Dynamic: NO
Default: { "strategy": "NO_OCR" }
Custom options for OCR processing.
You need to install Tesseract to enable OCR processing.
Dynamic: YES
Default: true
Format: uri
Dynamic: YES
Dynamic: YES
Language used for OCR.
Dynamic: YES
Default: NO_OCR
Possible Values: AUTO, NO_OCR, OCR_ONLY, OCR_AND_TEXT_EXTRACTION
The strategy to use for OCR processing.
OCR processing requires Tesseract and the matching Tesseract language pack to be installed.
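A hedged sketch of an OCR-only configuration with an explicit language follows. The `strategy` key and the `OCR_ONLY` value come from this page; the `language` key name and the Tesseract-style code `fra` are assumptions, and the matching language pack would need to be installed on the worker.

```yaml
id: tika_parse_ocr_language
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    ocrOptions:
      strategy: OCR_ONLY
      # Assumed property name and value: a Tesseract language code.
      language: fra
```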
Dynamic: NO
SubType: string
Dynamic: NO
Dynamic: NO