Parse

yaml

type: "io.kestra.plugin.tika.Parse"

Parse a document and extract its content and metadata.

Examples

Extract text from a file.

yaml

id: tika_parse
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: '{{ inputs.file }}'
    extractEmbedded: true
    store: false

Extract text from an image using OCR.

yaml

id: tika_parse
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: '{{ inputs.file }}'
    ocrOptions:
      strategy: OCR_AND_TEXT_EXTRACTION
    store: true

Properties

`contentType`

Type: string
Dynamic: ❌
Required: ❌
Default: XHTML
Possible Values:
- TEXT
- XHTML
- XHTML_NO_HEADER

The content type of the extracted text.
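As a minimal sketch, the flow below sets `contentType` to `TEXT` to request plain text instead of the default XHTML. The flow id `tika_parse_text` is an illustrative name; the remaining keys come from the properties documented on this page.

```yaml
id: tika_parse_text          # illustrative flow id
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: '{{ inputs.file }}'
    contentType: TEXT        # one of TEXT, XHTML, XHTML_NO_HEADER
```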

`extractEmbedded`

Type: boolean
Dynamic: ❌
Required: ❌
Default: false

Whether to extract documents embedded in the parsed file.

`from`

Type: string
Dynamic: ✔️
Required: ❌

The file to parse.

Must be an internal storage URI.

`ocrOptions`

Type: Parse-OcrOptions
Dynamic: ❌
Required: ❌
Default: {strategy=NO_OCR}

Custom options for OCR processing.

You need to install Tesseract to enable OCR processing.

`store`

Type: boolean
Dynamic: ❌
Required: ❌
Default: true

Whether to store the parsed result as an Ion-serialized file in Kestra internal storage.
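The sketch below shows one way to consume the parsed text directly in a later task. It assumes that with `store: false` the parsed document is exposed on the `result` output, and it uses the `io.kestra.plugin.core.log.Log` task from Kestra's core plugin, whose type name may differ in older Kestra versions. The ids `tika_parse_inline` and `log_content` are illustrative.

```yaml
id: tika_parse_inline        # illustrative flow id
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: '{{ inputs.file }}'
    contentType: TEXT
    store: false             # assumption: keeps the parsed document in the `result` output

  - id: log_content          # illustrative downstream task
    type: io.kestra.plugin.core.log.Log
    message: '{{ outputs.parse.result.content }}'
```

Keeping the result inline is convenient for small documents; for large files, storing the result is likely the safer choice.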

Outputs

`result`

Type: Parse-Parsed
Required: ❌

`uri`

Type: string
Required: ❌
Format: uri
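As a hedged illustration of the `uri` output, the flow below keeps `store` at `true` and passes the resulting URI to a later task. It assumes that `uri` points to the stored file when `store` is enabled. The `io.kestra.plugin.core.log.Log` task and the ids `tika_parse_stored` and `log_uri` are used here for illustration only.

```yaml
id: tika_parse_stored        # illustrative flow id
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: '{{ inputs.file }}'
    store: true              # default value, shown for clarity

  - id: log_uri              # illustrative downstream task
    type: io.kestra.plugin.core.log.Log
    message: 'Parsed result stored at {{ outputs.parse.uri }}'
```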

Definitions

`io.kestra.plugin.tika.Parse-OcrOptions`

Properties

`enableImagePreprocessing`

Type: boolean
Dynamic: ❌
Required: ❌

Whether to enable image preprocessing.

When this option is enabled and the required dependencies are installed, Apache Tika preprocesses images (rotation detection and image normalization with ImageMagick) before sending them to Tesseract.
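A minimal sketch of enabling preprocessing together with an OCR strategy follows. It assumes Tesseract and ImageMagick are available on the worker that runs the task; the flow id `tika_parse_preprocess` is illustrative.

```yaml
id: tika_parse_preprocess    # illustrative flow id
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: '{{ inputs.file }}'
    ocrOptions:
      strategy: OCR_ONLY
      enableImagePreprocessing: true   # assumes ImageMagick is installed on the worker
```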

`language`

Type: string
Dynamic: ✔️
Required: ❌

Language used for OCR.
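Because `language` is dynamic, it can be rendered from a flow input. The sketch below assumes the value is a Tesseract language code such as `eng` and that the matching language pack is installed; the `language` input, its `defaults` value, and the flow id are illustrative.

```yaml
id: tika_parse_ocr_language  # illustrative flow id
namespace: company.team

inputs:
  - id: file
    type: FILE
  - id: language             # illustrative input
    type: STRING
    defaults: eng            # assumption: a Tesseract language code

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: '{{ inputs.file }}'
    ocrOptions:
      strategy: OCR_AND_TEXT_EXTRACTION
      language: '{{ inputs.language }}'
```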

`strategy`

Type: string
Dynamic: ❌
Required: ❌
Default: NO_OCR
Possible Values:
- AUTO
- NO_OCR
- OCR_ONLY
- OCR_AND_TEXT_EXTRACTION

The OCR strategy to apply.

You need to install Tesseract, along with the relevant Tesseract language pack, to enable OCR processing.

`io.kestra.plugin.tika.Parse-Parsed`

Properties

`content`

Type: string
Dynamic: ❓
Required: ❓

`embedded`

Type: object
SubType: string
Dynamic: ❓
Required: ❓

`metadata`

Type: object
Dynamic: ❓
Required: ❓
