Parse

yaml
type: "io.kestra.plugin.tika.Parse"

Parse a document and extract content and metadata

Examples

Extract text and embedded images from a file

yaml
id: "parse"
type: "io.kestra.plugin.tika.Parse"
from: '{{ inputs.file }}'
extractEmbedded: true
store: false

Extract text from an image using OCR

yaml
id: "parse"
type: "io.kestra.plugin.tika.Parse"
from: '{{ inputs.file }}'
ocrOptions:
  strategy: OCR_AND_TEXT_EXTRACTION
store: true

Properties

contentType

  • Type: string
  • Dynamic:
  • Required:
  • Default: XHTML
  • Possible Values:
    • TEXT
    • XHTML
    • XHTML_NO_HEADER

The content type of extracted text
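As a minimal sketch of overriding the default, the task below requests plain text instead of XHTML. The task id is hypothetical; every property shown is one documented on this page.

```yaml
id: "parse_plain_text"   # hypothetical task id
type: "io.kestra.plugin.tika.Parse"
from: '{{ inputs.file }}'
# TEXT asks for plain text rather than the default XHTML markup
contentType: TEXT
store: false
```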

extractEmbedded

  • Type: boolean
  • Dynamic:
  • Required:
  • Default: false

Whether to extract documents embedded in the file (such as images)

Extracted documents are returned in the `embedded` field of the output

from

  • Type: string
  • Dynamic: ✔️
  • Required:

The file to parse

Must be a Kestra internal storage URI

ocrOptions

  • Type: OcrOptions
  • Dynamic:
  • Required:
  • Default: {strategy=NO_OCR}

Enable or disable OCR capture

Tesseract must be installed for OCR processing to work

store

  • Type: boolean
  • Dynamic:
  • Required:
  • Default: true

Whether to store the parsed result in an Ion-serialized data file

Outputs

result

uri

  • Type: string
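The flow below is a rough sketch of how these outputs might be consumed downstream. It assumes that with `store: false` the parsed data is returned inline under `result`, and that with `store: true` it would instead be referenced through `uri`. The flow id, namespace, input declaration, and the Log task type are assumptions that depend on the Kestra version and are not part of this plugin's documentation.

```yaml
id: "parse_and_log"          # hypothetical flow id
namespace: "dev"             # hypothetical namespace
inputs:
  - id: "file"               # assumption: some Kestra releases use `name:` instead of `id:`
    type: FILE
tasks:
  - id: "parse"
    type: "io.kestra.plugin.tika.Parse"
    from: '{{ inputs.file }}'
    contentType: TEXT
    store: false             # assumption: keeps the parsed data inline in `result`
  - id: "log_content"
    # assumption: Log task type name; it differs between Kestra releases
    type: "io.kestra.plugin.core.log.Log"
    message: '{{ outputs.parse.result.content }}'
```

With `store: true`, a downstream task would presumably reference `{{ outputs.parse.uri }}` instead.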

Definitions

OcrOptions

enableImagePreprocessing

  • Type: boolean
  • Dynamic:
  • Required:

Enable image preprocessing

When enabled, Tika preprocesses images (rotation detection and image normalization with ImageMagick) before sending them to Tesseract. This only takes effect if the required preprocessing dependencies are installed.

language

  • Type: string
  • Dynamic: ✔️
  • Required:

Language used for OCR

strategy

  • Type: string
  • Dynamic:
  • Required:
  • Default: NO_OCR
  • Possible Values:
    • AUTO
    • NO_OCR
    • OCR_ONLY
    • OCR_AND_TEXT_EXTRACTION

Enable or disable OCR capture

OCR processing requires Tesseract and the matching Tesseract language pack to be installed
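The three `OcrOptions` properties can be combined in a single task, sketched here under a few assumptions: the task id is hypothetical, and `fra` assumes `language` takes a Tesseract language code whose language pack is installed on the worker.

```yaml
id: "parse_scanned"          # hypothetical task id
type: "io.kestra.plugin.tika.Parse"
from: '{{ inputs.file }}'
ocrOptions:
  strategy: OCR_ONLY
  language: "fra"            # assumption: Tesseract language code (French)
  enableImagePreprocessing: true
store: true
```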

Parsed

content

  • Type: string
  • Dynamic:
  • Required:

embedded

  • Type: object
  • SubType: string
  • Dynamic:
  • Required:

metadata

  • Type: object
  • Dynamic:
  • Required: