IonToParquet

yaml
type: "io.kestra.plugin.serdes.parquet.IonToParquet"

Read a provided file containing Ion-serialized data and convert it to Parquet.

Examples

Read a CSV file, transform it, and store the transformed data as a Parquet file.

yaml
id: ion_to_parquet
namespace: company.team

tasks:
  - id: download_csv
    type: io.kestra.plugin.core.http.Download
    description: salaries of data professionals from 2020 to 2023 (source ai-jobs.net)
    uri: https://huggingface.co/datasets/kestra/datasets/raw/main/csv/salaries.csv

  - id: avg_salary_by_job_title
    type: io.kestra.plugin.jdbc.duckdb.Query
    inputFiles:
      data.csv: "{{ outputs.download_csv.uri }}"
    sql: |
      SELECT
        job_title,
        ROUND(AVG(salary),2) AS avg_salary
      FROM read_csv_auto('{{ workingDir }}/data.csv', header=True)
      GROUP BY job_title
      HAVING COUNT(job_title) > 10
      ORDER BY avg_salary DESC;
    store: true

  - id: result
    type: io.kestra.plugin.serdes.parquet.IonToParquet
    from: "{{ outputs.avg_salary_by_job_title.uri }}"
    schema: |
      {
        "type": "record",
        "name": "Salary",
        "namespace": "com.example.salary",
        "fields": [
          {"name": "job_title", "type": "string"},
          {"name": "avg_salary", "type": "double"}
        ]
      }

Properties

from

  • Type: string
  • Dynamic: ✔️
  • Required: ✔️

Source file URI

schema

  • Type: string
  • Dynamic: ✔️
  • Required: ✔️

The Avro schema associated with the data

compressionCodec

  • Type: string
  • Dynamic:
  • Required:
  • Default: GZIP
  • Possible Values:
    • UNCOMPRESSED
    • SNAPPY
    • GZIP
    • ZSTD

The compression codec to use
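
As a sketch of how the codec might be overridden, the conversion task from the example flow could be declared with ZSTD in place of the default GZIP. The task id `result_zstd` is illustrative; the other values are taken from the example flow.

yaml
# Fragment of a flow's tasks list; assumes the avg_salary_by_job_title task from the example flow.
- id: result_zstd  # illustrative id
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.avg_salary_by_job_title.uri }}"
  compressionCodec: ZSTD  # one of UNCOMPRESSED, SNAPPY, GZIP, ZSTD
  schema: |
    {
      "type": "record",
      "name": "Salary",
      "namespace": "com.example.salary",
      "fields": [
        {"name": "job_title", "type": "string"},
        {"name": "avg_salary", "type": "double"}
      ]
    }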

dateFormat

  • Type: string
  • Dynamic: ✔️
  • Required:
  • Default: yyyy-MM-dd[XXX]

Format to use when parsing date

datetimeFormat

  • Type: string
  • Dynamic: ✔️
  • Required:
  • Default: yyyy-MM-dd'T'HH:mm[:ss][.SSSSSS][XXX]

Format to use when parsing datetime values
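
The default patterns read like Java `DateTimeFormatter` patterns, with square brackets marking optional sections. Assuming that syntax, a source that stores dates as day/month/year might be handled as sketched below. The upstream task `extract_employees` and the field names are hypothetical; the schema uses the standard Avro `date` and `timestamp-millis` logical types.

yaml
# Fragment of a flow's tasks list; extract_employees is a hypothetical upstream task.
- id: employees_to_parquet  # illustrative id
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.extract_employees.uri }}"
  dateFormat: "dd/MM/yyyy"               # e.g. 31/12/2023
  datetimeFormat: "dd/MM/yyyy HH:mm:ss"  # e.g. 31/12/2023 23:59:00
  schema: |
    {
      "type": "record",
      "name": "Employee",
      "fields": [
        {"name": "hired_on", "type": {"type": "int", "logicalType": "date"}},
        {"name": "updated_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
      ]
    }

Because the datetime pattern above carries no zone offset, the zone would come from `timeZoneId`.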

decimalSeparator

  • Type: string
  • Dynamic: ✔️
  • Required:
  • Default: .

Character to recognize as decimal point (e.g. use ‘,’ for European data).

Default value is '.'
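
A minimal sketch for data that writes amounts such as `1234,56`, assuming the values reach the task as strings. The upstream task `extract_invoices` and the `amount` field are hypothetical.

yaml
# Fragment of a flow's tasks list; extract_invoices is a hypothetical upstream task.
- id: invoices_to_parquet  # illustrative id
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.extract_invoices.uri }}"
  decimalSeparator: ","  # quoted so YAML reads it as a one-character string
  schema: |
    {
      "type": "record",
      "name": "Invoice",
      "fields": [
        {"name": "amount", "type": "double"}
      ]
    }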

dictionaryPageSize

  • Type: integer
  • Dynamic:
  • Required:
  • Default: 1048576

Max dictionary page size

falseValues

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required:
  • Default: [f, false, disabled, 0, off, no, ]

Values to consider as False

inferAllFields

  • Type: boolean
  • Dynamic:
  • Required:
  • Default: false

Try to infer all fields

If true, we try to infer all fields using trueValues, falseValues & nullValues. If false, we infer bool & null only on fields declared in the schema as null and bool.

nullValues

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required:
  • Default: [, #N/A, #N/A N/A, #NA, -1.#IND, -1.#QNAN, -NaN, 1.#IND, 1.#QNAN, NA, n/a, nan, null]

Values to consider as null

pageSize

  • Type: integer
  • Dynamic:
  • Required:
  • Default: 1048576

Target page size

rowGroupSize

  • Type: integer
  • Dynamic:
  • Required:
  • Default: 134217728

Target row group size
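
The three size settings (`dictionaryPageSize`, `pageSize`, `rowGroupSize`) can be set together. The sketch below halves the default row group size and doubles the default page size. It assumes the values are byte counts, as is usual for Parquet writer settings; the numbers are illustrative, not recommendations. The task id is illustrative and the rest reuses the example flow.

yaml
# Fragment of a flow's tasks list; assumes the avg_salary_by_job_title task from the example flow.
- id: result_tuned  # illustrative id
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.avg_salary_by_job_title.uri }}"
  rowGroupSize: 67108864       # 64 * 1048576, half of the default 134217728
  pageSize: 2097152            # 2 * 1048576, twice the default
  dictionaryPageSize: 1048576  # left at the default
  schema: |
    {
      "type": "record",
      "name": "Salary",
      "namespace": "com.example.salary",
      "fields": [
        {"name": "job_title", "type": "string"},
        {"name": "avg_salary", "type": "double"}
      ]
    }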

strictSchema

  • Type: boolean
  • Dynamic:
  • Required:
  • Default: false

Whether to consider a field present in the data but not declared in the schema as an error

Default value is false
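
To make the conversion fail when the data carries undeclared fields, the flag might be set as sketched here. The task id is illustrative and the rest reuses the example flow; with this schema, a source record containing any field other than `job_title` and `avg_salary` would be treated as an error.

yaml
# Fragment of a flow's tasks list; assumes the avg_salary_by_job_title task from the example flow.
- id: result_strict  # illustrative id
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.avg_salary_by_job_title.uri }}"
  strictSchema: true  # undeclared fields in the data become errors
  schema: |
    {
      "type": "record",
      "name": "Salary",
      "namespace": "com.example.salary",
      "fields": [
        {"name": "job_title", "type": "string"},
        {"name": "avg_salary", "type": "double"}
      ]
    }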

timeFormat

  • Type: string
  • Dynamic: ✔️
  • Required:
  • Default: HH:mm[:ss][.SSSSSS][XXX]

Format to use when parsing time

timeZoneId

  • Type: string
  • Dynamic:
  • Required:
  • Default: Etc/UTC

Time zone to use when no time zone can be parsed from the source.

If null, UTC is used.
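
A sketch of interpreting zone-less timestamps as Paris local time. `Europe/Paris` is a standard IANA zone identifier; the upstream task `extract_events` and the `occurred_at` field are hypothetical.

yaml
# Fragment of a flow's tasks list; extract_events is a hypothetical upstream task.
- id: events_to_parquet  # illustrative id
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.extract_events.uri }}"
  timeZoneId: Europe/Paris  # applied only when the source value carries no zone
  schema: |
    {
      "type": "record",
      "name": "Event",
      "fields": [
        {"name": "occurred_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
      ]
    }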

trueValues

  • Type: array
  • SubType: string
  • Dynamic: ✔️
  • Required:
  • Default: [t, true, enabled, 1, on, yes]

Values to consider as True
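
The three value lists (`trueValues`, `falseValues`, `nullValues`) can be overridden together. The sketch below treats `Y`/`N` as booleans and `-` or an empty string as null. It assumes a custom list replaces the default list rather than extending it. The upstream task `extract_survey` and the `answered` field are hypothetical; the field is declared as a nullable boolean so that inference applies with `inferAllFields` left at its default.

yaml
# Fragment of a flow's tasks list; extract_survey is a hypothetical upstream task.
- id: survey_to_parquet  # illustrative id
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.extract_survey.uri }}"
  # Quote the entries: unquoted Y and N may be read as booleans by YAML parsers.
  trueValues: ["Y"]
  falseValues: ["N"]
  nullValues: ["-", ""]
  schema: |
    {
      "type": "record",
      "name": "SurveyAnswer",
      "fields": [
        {"name": "answered", "type": ["null", "boolean"]}
      ]
    }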

version

  • Type: string
  • Dynamic:
  • Required:
  • Default: V2
  • Possible Values:
    • V1
    • V2

Parquet writer version to use

Outputs

uri

  • Type: string
  • Required:
  • Format: uri

URI of a temporary result file
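
A downstream task can reference the generated file through the task's output expression. The sketch below logs the URI produced by the `result` task from the example flow, using the core `Log` task; the id `log_parquet_uri` is illustrative.

yaml
# Fragment of a flow's tasks list, placed after the result task from the example flow.
- id: log_parquet_uri  # illustrative id
  type: io.kestra.plugin.core.log.Log
  message: "Parquet file written to {{ outputs.result.uri }}"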
