Download a PDF file and extract text from it using Apache Tika

Source

id: parse-pdf
namespace: company.team

tasks:
  - id: download_pdf
    type: io.kestra.plugin.core.http.Download
    uri: https://huggingface.co/datasets/kestra/datasets/resolve/main/pdf/app_store.pdf

  - id: parse_text
    type: io.kestra.plugin.tika.Parse
    from: "{{ outputs.download_pdf.uri }}"
    contentType: TEXT
    store: false

  - id: log_extracted_text
    type: io.kestra.plugin.core.log.Log
    message: "{{ outputs.parse_text.result.content }}"

About this blueprint

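This flow downloads a PDF file with the HTTP Download task, then extracts its text with the Apache Tika Parse task. Because store is set to false, the parsed content is returned directly in the task output rather than written to a file in internal storage. The Log task then reads that content from outputs.parse_text.result.content and writes it to the execution logs.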
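As a variation on the flow above, the sketch below keeps the parsed result in Kestra's internal storage instead of returning it inline, which may be preferable for large PDFs. It assumes that setting store to true makes the Parse task expose a uri output in place of result; the flow id parse-pdf-stored and the task id log_result_location are hypothetical names chosen for illustration.

```yaml
id: parse-pdf-stored   # hypothetical flow id
namespace: company.team

tasks:
  - id: download_pdf
    type: io.kestra.plugin.core.http.Download
    uri: https://huggingface.co/datasets/kestra/datasets/resolve/main/pdf/app_store.pdf

  - id: parse_text
    type: io.kestra.plugin.tika.Parse
    from: "{{ outputs.download_pdf.uri }}"
    contentType: TEXT
    # Assumption: with store set to true, the parsed result is written to
    # internal storage and exposed as a URI instead of inline content.
    store: true

  - id: log_result_location
    type: io.kestra.plugin.core.log.Log
    # Logs where the parsed output was stored, not the text itself.
    message: "Parsed output stored at {{ outputs.parse_text.uri }}"
```

With this variant, downstream tasks would pass the stored file's URI along rather than embedding the full extracted text in an expression.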
