DeduplicateItems

yaml
type: "io.kestra.core.tasks.storages.DeduplicateItems"

Deduplicate a file by retaining only the latest item for each extracted key.

The task reads the input file twice instead of loading the entire file into memory. The first pass builds an in-memory deduplication map containing the last line observed for each key. The second pass rewrites the file without the duplicates. Because the map holds one entry per distinct key, memory use grows with the number of distinct keys rather than with the size of the file, so take key cardinality into account when using this task.
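The two passes described above can be sketched in Python as follows. This is a minimal illustration of the technique, not the task's actual implementation. It assumes a newline-delimited JSON file (one item per line) and a hypothetical `key_of` callable standing in for the `expr` expression; storing the line index of the last occurrence, rather than the line itself, is also an assumption. The returned counters are named after the task's outputs only for readability.

```python
import json
from typing import Any, Callable, Dict, Hashable


def deduplicate_items(
    src_path: str,
    dst_path: str,
    key_of: Callable[[Any], Hashable],  # hypothetical stand-in for `expr`
) -> Dict[str, int]:
    """Keep only the last item seen for each key, reading the file twice."""
    last_line_for_key: Dict[Hashable, int] = {}
    processed = 0

    # Pass 1: record the index of the last line observed for each key.
    with open(src_path, encoding="utf-8") as src:
        for line_no, line in enumerate(src):
            if not line.strip():
                continue  # ignore blank lines
            last_line_for_key[key_of(json.loads(line))] = line_no
            processed += 1

    # Pass 2: copy a line only if it is the last occurrence of its key.
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line_no, line in enumerate(src):
            if not line.strip():
                continue
            if last_line_for_key[key_of(json.loads(line))] == line_no:
                dst.write(line if line.endswith("\n") else line + "\n")
                kept += 1

    return {
        "processedItemsTotal": processed,
        "numKeys": len(last_line_for_key),
        "droppedItemsTotal": processed - kept,
    }
```

Calling `deduplicate_items(src, dst, lambda item: item["key"])` mirrors an `expr` of `{{ key }}`. Only the map of keys is held in memory; the items themselves are streamed from disk in both passes.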

Examples

yaml
id: "deduplicate_items"
type: "io.kestra.core.tasks.storages.DeduplicateItems"
tasks:
   - id: deduplicate
     type: io.kestra.core.tasks.storages.DeduplicateItems
     from: "{{ inputs.uri }}"
     expr: "{{ key }}"

Properties

expr

  • Type: string
  • Dynamic:
  • Required: ✔️

The Pebble expression used to extract the deduplication key from each item.

The expression can combine several fields of an item to build a composite key.

from

  • Type: string
  • Dynamic: ✔️
  • Required: ✔️

The file to be deduplicated.

Must be a kestra:// internal storage URI.

Outputs

droppedItemsTotal

  • Type: integer
  • Dynamic:
  • Required:

The total number of items that were dropped by the task.

numKeys

  • Type: integer
  • Dynamic:
  • Required:

The number of distinct keys observed by the task.

processedItemsTotal

  • Type: integer
  • Dynamic:
  • Required:

The total number of items that were processed by the task.

uri

  • Type: string
  • Dynamic:
  • Required:
  • Format: uri

The deduplicated file URI.