IonToParquet

Convert an ION file into Parquet.

yaml
type: "io.kestra.plugin.serdes.parquet.IonToParquet"

Examples

Read a CSV file, transform it, and store the transformed data as a Parquet file.

yaml
id: ion_to_parquet
namespace: company.team

tasks:
  - id: download_csv
    type: io.kestra.plugin.core.http.Download
    description: salaries of data professionals from 2020 to 2023 (source ai-jobs.net)
    uri: https://huggingface.co/datasets/kestra/datasets/raw/main/csv/salaries.csv

  - id: avg_salary_by_job_title
    type: io.kestra.plugin.jdbc.duckdb.Query
    inputFiles:
      data.csv: "{{ outputs.download_csv.uri }}"
    sql: |
      SELECT
        job_title,
        ROUND(AVG(salary),2) AS avg_salary
      FROM read_csv_auto('{{ workingDir }}/data.csv', header=True)
      GROUP BY job_title
      HAVING COUNT(job_title) > 10
      ORDER BY avg_salary DESC;
    store: true

  - id: result
    type: io.kestra.plugin.serdes.parquet.IonToParquet
    from: "{{ outputs.avg_salary_by_job_title.uri }}"
    schema: |
      {
        "type": "record",
        "name": "Salary",
        "namespace": "com.example.salary",
        "fields": [
          {"name": "job_title", "type": "string"},
          {"name": "avg_salary", "type": "double"}
        ]
      }

Properties

from *string

Source file URI

Pebble expression referencing an Internal Storage URI, e.g. {{ outputs.mytask.uri }}.

compressionCodec string

Default GZIP

Possible Values

UNCOMPRESSED, SNAPPY, GZIP, ZSTD

The compression codec to use

dateFormat string

Default yyyy-MM-dd[XXX]

Format to use when parsing date

datetimeFormat string

Default yyyy-MM-dd'T'HH:mm[:ss][.SSSSSS][XXX]

Format to use when parsing datetime

Default value is yyyy-MM-dd'T'HH:mm[:ss][.SSSSSS][XXX]
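
As an illustration, the task fragment below overrides the date and datetime patterns for day-first source data. It is a sketch that assumes the patterns follow Java's DateTimeFormatter syntax (square brackets marking optional sections), which the default values suggest but this page does not state. The task id, the upstream task mytask, and the field names are hypothetical.

```yaml
  - id: day_first_dates                    # hypothetical task id
    type: io.kestra.plugin.serdes.parquet.IonToParquet
    from: "{{ outputs.mytask.uri }}"       # hypothetical upstream task
    # Assumed to be DateTimeFormatter-style patterns
    dateFormat: "dd/MM/yyyy"               # e.g. 31/12/2023
    datetimeFormat: "dd/MM/yyyy HH:mm:ss"  # e.g. 31/12/2023 23:59:00
    schema: |
      {
        "type": "record",
        "name": "Event",
        "fields": [
          {"name": "event_date", "type": {"type": "int", "logicalType": "date"}},
          {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
        ]
      }
```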

decimalSeparator string

Default .

Character to recognize as decimal point (e.g. use ‘,’ for European data).

Default value is '.'
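
A minimal sketch of a task that reads numbers written with a decimal comma. The task id, the upstream task mytask, and the field name are hypothetical, and the comment about the parsed value describes the expected behavior rather than a verified result.

```yaml
  - id: european_numbers                # hypothetical task id
    type: io.kestra.plugin.serdes.parquet.IonToParquet
    from: "{{ outputs.mytask.uri }}"    # hypothetical upstream task
    decimalSeparator: ","               # a value such as "1234,56" should be read as 1234.56
    schema: |
      {
        "type": "record",
        "name": "Price",
        "fields": [
          {"name": "amount", "type": "double"}
        ]
      }
```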

dictionaryPageSize integer | string

Default 1048576

Max dictionary page size

falseValues array

SubType string

Default ["f","false","disabled","0","off","no",""]

Values to consider as False

inferAllFields boolean | string

Default false

Try to infer all fields

If true, the task tries to infer all fields using trueValues, falseValues, and nullValues. If false, booleans and nulls are inferred only for fields declared in the schema as bool or null.
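
The fragment below sketches how inference might be widened to every field with custom marker values. It assumes that setting trueValues, falseValues, or nullValues replaces the default list rather than extending it, which this page does not confirm. The task id and the upstream task mytask are hypothetical.

```yaml
  - id: infer_everything                # hypothetical task id
    type: io.kestra.plugin.serdes.parquet.IonToParquet
    from: "{{ outputs.mytask.uri }}"    # hypothetical upstream task
    inferAllFields: true                # apply the lists below to all fields
    trueValues: ["true", "yes", "Y"]    # assumed to replace the default list
    falseValues: ["false", "no", "N"]
    nullValues: ["", "NULL", "n/a"]
```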

nullValues array

SubType string

Default ["","#N/A","#N/A N/A","#NA","-1.#IND","-1.#QNAN","-NaN","1.#IND","1.#QNAN","NA","n/a","nan","null"]

Values to consider as null

numberOfRowsToScan integer | string

Default 100

Number of rows that will be scanned while inferring the schema. The larger the number, the more precise the output schema will be.

Only used when the 'schema' property is empty

pageSize integer | string

Default 1048576

Target page size

parquetVersion string

Default V2

Possible Values

V1, V2

Target Parquet version

rowGroupSize integer | string

Default 134217728

Target row group size
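
The Parquet writer settings can be combined in a single task. The sketch below uses only the properties listed on this page. The sizes are presumably expressed in bytes, since the defaults match 128 MiB and 1 MiB, but this page does not state the unit. The task id and the upstream task mytask are hypothetical.

```yaml
  - id: tuned_parquet                   # hypothetical task id
    type: io.kestra.plugin.serdes.parquet.IonToParquet
    from: "{{ outputs.mytask.uri }}"    # hypothetical upstream task
    compressionCodec: ZSTD              # one of UNCOMPRESSED, SNAPPY, GZIP, ZSTD
    parquetVersion: V2                  # one of V1, V2
    rowGroupSize: 67108864              # 64 MiB, assuming the unit is bytes
    pageSize: 1048576                   # 1 MiB (the default)
    dictionaryPageSize: 1048576         # 1 MiB (the default)
```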

schema string

The Avro schema associated with the data

If empty, the task tries to infer the schema from the data. Use the 'numberOfRowsToScan' property to control how many rows are scanned.
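
For example, a task that relies on schema inference could omit 'schema' and scan more rows than the default of 100. This is a sketch; the task id and the upstream task mytask are hypothetical.

```yaml
  - id: inferred_schema                 # hypothetical task id
    type: io.kestra.plugin.serdes.parquet.IonToParquet
    from: "{{ outputs.mytask.uri }}"    # hypothetical upstream task
    # No 'schema' property: the task infers it from the data
    numberOfRowsToScan: 1000            # scan more rows than the default 100
```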

strictSchema boolean | string

Default false

Whether to consider a field present in the data but not declared in the schema as an error

Default value is false
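
A sketch of strict mode, assuming source records that also carry an avg_salary field: because the schema below declares only job_title, the extra field would be treated as an error instead of being tolerated. The task id and the upstream task mytask are hypothetical.

```yaml
  - id: strict_conversion               # hypothetical task id
    type: io.kestra.plugin.serdes.parquet.IonToParquet
    from: "{{ outputs.mytask.uri }}"    # hypothetical upstream task
    strictSchema: true                  # undeclared fields in the data are errors
    schema: |
      {
        "type": "record",
        "name": "Salary",
        "fields": [
          {"name": "job_title", "type": "string"}
        ]
      }
```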

timeFormat string

Default HH:mm[:ss][.SSSSSS][XXX]

Format to use when parsing time

timeZoneId string

Default Etc/UTC

Timezone to use when no timezone can be parsed on the source.

If null, the time zone defaults to UTC.
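
As a sketch, source values without an offset can be interpreted in a specific zone by setting an IANA zone ID. The task id and the upstream task mytask are hypothetical.

```yaml
  - id: paris_local_times               # hypothetical task id
    type: io.kestra.plugin.serdes.parquet.IonToParquet
    from: "{{ outputs.mytask.uri }}"    # hypothetical upstream task
    timeZoneId: "Europe/Paris"          # used only when the source value carries no time zone
```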

trueValues array

SubType string

Default ["t","true","enabled","1","on","yes"]

Values to consider as True

Outputs

uri string

Format uri

URI of a temporary result file
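
A downstream task can reference this output with a Pebble expression. The fragment below is a sketch that logs the URI produced by the task named result in the example flow above. The log task id is hypothetical.

```yaml
  - id: log_parquet_uri                 # hypothetical task id
    type: io.kestra.plugin.core.log.Log
    message: "Parquet file stored at {{ outputs.result.uri }}"
```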
