Parse

type: "io.kestra.plugin.tika.Parse"

Parse a document and extract content and metadata

# Examples

Extract text and embedded images from a file

id: "parse"
type: "io.kestra.plugin.tika.Parse"
from: '{{ inputs.file }}'
extractEmbedded: true
store: false

Extract text from an image using OCR

id: "parse"
type: "io.kestra.plugin.tika.Parse"
from: '{{ inputs.file }}'
ocrOptions:
  strategy: OCR_AND_TEXT_EXTRACTION
store: true

# Properties

# contentType

  • Type: ContentType
  • Dynamic:
  • Required:
  • Default: XHTML

The content type of extracted text
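As a rough sketch, the output format could be changed from the default by setting contentType on the task. The value TEXT below is an assumption about the ContentType enum; this page only confirms XHTML.

```yaml
id: "parse"
type: "io.kestra.plugin.tika.Parse"
from: '{{ inputs.file }}'
# TEXT is an assumed ContentType value (plain text instead of the default XHTML)
contentType: TEXT
store: false
```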

# extractEmbedded

  • Type: boolean
  • Dynamic:
  • Required:
  • Default: false

Whether to extract documents embedded in the file, such as images

# from

  • Type: string
  • Dynamic: ✔️
  • Required:

The file to parse

Must be a Kestra internal storage URI

# ocrOptions

  • Type: OcrOptions
  • Dynamic:
  • Required:
  • Default: {strategy=NO_OCR}

Enable or disable OCR capture

You need to install Tesseract to enable OCR processing

# store

  • Type: boolean
  • Dynamic:
  • Required:
  • Default: true

Whether to store the parsed result in an ion-serialized data file

# Outputs

# result

# uri

  • Type: string
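A minimal sketch of reading the uri output from a downstream task, assuming store is true so that the parsed result is written to internal storage. The logging task type and the task id log_uri are illustrative assumptions and may differ in your Kestra version.

```yaml
tasks:
  - id: "parse"
    type: "io.kestra.plugin.tika.Parse"
    from: '{{ inputs.file }}'
    store: true
  - id: "log_uri"
    # Assumed core logging task; the type name may differ across Kestra versions
    type: "io.kestra.core.tasks.log.Log"
    message: 'Parsed output stored at {{ outputs.parse.uri }}'
```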

# Definitions

# OcrOptions

# enableImagePreprocessing

  • Type: boolean
  • Dynamic:
  • Required:

Enable image preprocessing

If the required dependencies are installed and this option is enabled, Tika preprocesses images (rotation detection and image normalization with ImageMagick) before sending them to Tesseract.

# language

  • Type: string
  • Dynamic: ✔️
  • Required:

Language used for OCR

# strategy

  • Type: OCR_STRATEGY
  • Dynamic:
  • Required:
  • Default: NO_OCR

Enable or disable OCR capture

You need to install Tesseract and the Tesseract language pack for the target language to enable OCR processing
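The three OcrOptions fields can be combined in one task, sketched below. It assumes the language value is handed to Tesseract as a language code and that the matching language pack is installed; "fra" is used purely for illustration.

```yaml
id: "parse"
type: "io.kestra.plugin.tika.Parse"
from: '{{ inputs.file }}'
ocrOptions:
  strategy: OCR_AND_TEXT_EXTRACTION
  # "fra" is Tesseract's code for French; assumes that language pack is installed
  language: fra
  # Assumes ImageMagick and the other preprocessing dependencies are available
  enableImagePreprocessing: true
store: true
```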

# Parsed

# content

  • Type: string
  • Dynamic:
  • Required:

# embedded

  • Type: object
  • Dynamic:
  • Required:

# metadata

  • Type: object
  • Dynamic:
  • Required:
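As an illustration of how the Parsed fields might be consumed, the sketch below assumes that the result output holds a Parsed object when store is false. The logging task type and the task id log_content are assumptions, not confirmed by this page.

```yaml
tasks:
  - id: "parse"
    type: "io.kestra.plugin.tika.Parse"
    from: '{{ inputs.file }}'
    # Keep the parsed result inline instead of writing it to internal storage
    store: false
  - id: "log_content"
    # Assumed core logging task; the type name may differ across Kestra versions
    type: "io.kestra.core.tasks.log.Log"
    message: '{{ outputs.parse.result.content }}'
```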