Parse
Parse a document and extract its content and metadata.
yaml
type: "io.kestra.plugin.tika.Parse"
Extract text from a file.
yaml
id: tika_parse
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: '{{ inputs.file }}'
    extractEmbedded: true
    store: false
Extract text from an image using OCR.
yaml
id: tika_parse
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: '{{ inputs.file }}'
    ocrOptions:
      strategy: OCR_AND_TEXT_EXTRACTION
    store: true
contentType
Dynamic YES
Default XHTML
Possible Values
TEXT, XHTML, XHTML_NO_HEADER
The content type of the extracted text.
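As a minimal sketch, the flow below requests plain text instead of the default XHTML. It assumes the option is set under the key `contentType`; the flow id `tika_parse_text` is illustrative only.

```yaml
id: tika_parse_text
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    # Assumed property key; TEXT is one of the listed possible values.
    contentType: TEXT
    store: false
```

`XHTML_NO_HEADER` would be set the same way.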
extractEmbedded
Dynamic YES
Default false
from
Dynamic YES
The file to parse.
Must be an internal storage URI.
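Because the file must already be in internal storage, a document hosted elsewhere has to be fetched by an earlier task first. The sketch below assumes the core `io.kestra.plugin.core.http.Download` task and its `uri` output are available in your Kestra version; the URL is a placeholder, not a real document.

```yaml
id: tika_parse_remote
namespace: company.team

tasks:
  - id: download
    type: io.kestra.plugin.core.http.Download
    # Placeholder URL; replace with the document you want to parse.
    uri: https://example.com/report.pdf

  - id: parse
    type: io.kestra.plugin.tika.Parse
    # The download task's output is an internal storage URI.
    from: "{{ outputs.download.uri }}"
```

A `FILE` input, as in the examples above, is stored internally in the same way, so `{{ inputs.file }}` can be passed directly.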
ocrOptions
Dynamic NO
Default { "strategy": "NO_OCR" }
Custom options for OCR processing.
You need to install Tesseract to enable OCR processing.
store
Dynamic YES
Default true
Format uri
Dynamic YES
language
Dynamic YES
Language used for OCR.
strategy
Dynamic YES
Default NO_OCR
Possible Values
AUTO, NO_OCR, OCR_ONLY, OCR_AND_TEXT_EXTRACTION
OCR strategy to use for OCR processing.
You need to install Tesseract, along with a Tesseract language pack, to enable OCR processing.
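The two OCR options can be combined in one task. This is a sketch under assumptions: it supposes the language option is set under the key `language` inside `ocrOptions`, and that it takes a Tesseract language code such as `eng`, with the matching language pack installed on the worker.

```yaml
id: tika_parse_ocr_language
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    ocrOptions:
      # Run OCR only, skipping regular text extraction.
      strategy: OCR_ONLY
      # Assumed key and value: a Tesseract language code.
      language: eng
```

With the default `NO_OCR` strategy, the language setting would presumably have no effect.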
Dynamic NO
SubType string
Dynamic NO
Dynamic NO
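To illustrate how a later task might read the parsed text, the sketch below logs it. The output path `result.content` and the behavior of returning the result inline when `store` is `false` are assumptions that this listing does not state; the core `io.kestra.plugin.core.log.Log` task is used only for demonstration.

```yaml
id: tika_parse_outputs
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    # Assumed: with store set to false, the result is returned inline
    # instead of being written to an internal storage file.
    store: false

  - id: log_content
    type: io.kestra.plugin.core.log.Log
    # Assumed output path for the extracted text.
    message: "{{ outputs.parse.result.content }}"
```

When `store` is left at `true`, the stored file would instead be referenced through the task's `uri` output.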