IonToParquet
Read a provided file containing Ion-serialized data and convert it to Parquet.
type: "io.kestra.plugin.serdes.parquet.IonToParquet"
Read a CSV file, transform it, and store the transformed data as a Parquet file.
id: ion_to_parquet
namespace: company.team

tasks:
  - id: download_csv
    type: io.kestra.plugin.core.http.Download
    description: salaries of data professionals from 2020 to 2023 (source ai-jobs.net)
    uri: https://huggingface.co/datasets/kestra/datasets/raw/main/csv/salaries.csv

  - id: avg_salary_by_job_title
    type: io.kestra.plugin.jdbc.duckdb.Query
    inputFiles:
      data.csv: "{{ outputs.download_csv.uri }}"
    sql: |
      SELECT
        job_title,
        ROUND(AVG(salary),2) AS avg_salary
      FROM read_csv_auto('{{ workingDir }}/data.csv', header=True)
      GROUP BY job_title
      HAVING COUNT(job_title) > 10
      ORDER BY avg_salary DESC;
    store: true

  - id: result
    type: io.kestra.plugin.serdes.parquet.IonToParquet
    from: "{{ outputs.avg_salary_by_job_title.uri }}"
    schema: |
      {
        "type": "record",
        "name": "Salary",
        "namespace": "com.example.salary",
        "fields": [
          {"name": "job_title", "type": "string"},
          {"name": "avg_salary", "type": "double"}
        ]
      }
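
If some rows may lack an average salary, the schema can declare that field as nullable. The fragment below is a minimal sketch of the same `result` task using a standard Avro union type; that the upstream query can produce missing values is an assumption made purely for illustration.

```yaml
  - id: result
    type: io.kestra.plugin.serdes.parquet.IonToParquet
    from: "{{ outputs.avg_salary_by_job_title.uri }}"
    # Assumption: avg_salary may be missing, so it is declared as a
    # ["null", "double"] union with a null default (standard Avro syntax).
    schema: |
      {
        "type": "record",
        "name": "Salary",
        "namespace": "com.example.salary",
        "fields": [
          {"name": "job_title", "type": "string"},
          {"name": "avg_salary", "type": ["null", "double"], "default": null}
        ]
      }
```

In an Avro union, the type of the default value must match the first branch, which is why `"null"` is listed first here.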
YES
Source file URI
YES
The Avro schema associated with the data
YES
GZIP
UNCOMPRESSED
SNAPPY
GZIP
ZSTD
The compression to use
YES
yyyy-MM-dd[XXX]
Format to use when parsing date
YES
yyyy-MM-dd'T'HH:mm[:ss][.SSSSSS][XXX]
Format to use when parsing datetime
Default value is yyyy-MM-dd'T'HH:mm[:ss][.SSSSSS]XXX
YES
.
YES
1048576
YES
["f","false","disabled","0","off","no",""]
Values to consider as False
YES
false
YES
["","#N/A","#N/A N/A","#NA","-1.#IND","-1.#QNAN","-NaN","1.#IND","1.#QNAN","NA","n/a","nan","null"]
Values to consider as null
YES
1048576
YES
134217728
YES
false
YES
HH:mm[:ss][.SSSSSS][XXX]
Format to use when parsing time
YES
Etc/UTC
Time zone to use when none can be parsed from the source.
If null, the time zone will be UTC.
Default value is the system time zone.
YES
["t","true","enabled","1","on","yes"]
Values to consider as True
YES
V2
V1
V2
Target row group size
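
The optional settings listed above could be combined on a single task roughly as follows. This is a sketch only: the property names `compressionCodec`, `timeZoneId`, and `nullValues` are assumptions inferred from the descriptions, since the names themselves do not appear in this listing, and should be checked against the plugin's property reference. Only the values are taken from the defaults and allowed values shown above.

```yaml
  - id: result
    type: io.kestra.plugin.serdes.parquet.IonToParquet
    from: "{{ outputs.avg_salary_by_job_title.uri }}"
    # Assumed property names (not confirmed by this page).
    compressionCodec: ZSTD          # allowed values listed: UNCOMPRESSED, SNAPPY, GZIP, ZSTD
    timeZoneId: Etc/UTC             # used when the source carries no time zone
    nullValues: ["", "NA", "null"]  # a subset of the default null markers
    schema: |
      {
        "type": "record",
        "name": "Salary",
        "namespace": "com.example.salary",
        "fields": [
          {"name": "job_title", "type": "string"},
          {"name": "avg_salary", "type": "double"}
        ]
      }
```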
uri
URI of a temporary result file
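
A downstream task can pick up the generated file through this output. The fragment below is a minimal sketch that assumes the converting task is named `result`, as in the example flow, and simply logs the URI; the task id `log_parquet_uri` is hypothetical.

```yaml
  - id: log_parquet_uri             # hypothetical task id
    type: io.kestra.plugin.core.log.Log
    # Reads the `uri` output of the IonToParquet task with id `result`.
    message: "Parquet file stored at {{ outputs.result.uri }}"
```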