IonToParquet
Read a provided file containing Ion-serialized data and convert it to Parquet.
type: "io.kestra.plugin.serdes.parquet.IonToParquet"
Read a CSV file, transform it, and store the transformed data as a Parquet file.
id: ion_to_parquet
namespace: company.team
tasks:
  - id: download_csv
    type: io.kestra.plugin.core.http.Download
    description: salaries of data professionals from 2020 to 2023 (source ai-jobs.net)
    uri: https://huggingface.co/datasets/kestra/datasets/raw/main/csv/salaries.csv
  - id: avg_salary_by_job_title
    type: io.kestra.plugin.jdbc.duckdb.Query
    inputFiles:
      data.csv: "{{ outputs.download_csv.uri }}"
    sql: |
      SELECT
        job_title,
        ROUND(AVG(salary),2) AS avg_salary
      FROM read_csv_auto('{{ workingDir }}/data.csv', header=True)
      GROUP BY job_title
      HAVING COUNT(job_title) > 10
      ORDER BY avg_salary DESC;
    store: true
  - id: result
    type: io.kestra.plugin.serdes.parquet.IonToParquet
    from: "{{ outputs.avg_salary_by_job_title.uri }}"
    schema: |
      {
        "type": "record",
        "name": "Salary",
        "namespace": "com.example.salary",
        "fields": [
          {"name": "job_title", "type": "string"},
          {"name": "avg_salary", "type": "double"}
        ]
      }
Source file URI
The Avro schema associated with the data
The compression to use
Format to use when parsing date
Format to use when parsing datetime
Default value is yyyy-MM-dd'T'HH:mm[:ss][.SSSSSS]XXX
Values to consider as False
Values to consider as null
Format to use when parsing time
Timezone to use when no timezone can be parsed from the source.
If null, the timezone will be UTC.
Default value is the system timezone.
Values to consider as True
Target row group size
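The parsing options described above could be set on the task roughly as follows. This is a minimal sketch, not a definitive configuration: only `from` and `schema` appear in the example on this page, while the property names `dateFormat`, `timeZoneId`, `nullValues`, `trueValues` and `falseValues`, and all the values given to them, are assumptions that should be checked against the task's property reference.

```yaml
- id: result
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.avg_salary_by_job_title.uri }}"
  schema: |
    {
      "type": "record",
      "name": "Salary",
      "namespace": "com.example.salary",
      "fields": [
        {"name": "job_title", "type": "string"},
        {"name": "avg_salary", "type": ["null", "double"]}
      ]
    }
  # Assumed property names and illustrative values; verify before use.
  dateFormat: "yyyy-MM-dd"      # format used when parsing dates
  timeZoneId: "Europe/Paris"    # used when the source carries no timezone
  nullValues: ["", "N/A"]       # values to consider as null
  trueValues: ["yes", "1"]      # values to consider as True
  falseValues: ["no", "0"]      # values to consider as False
```

The `["null", "double"]` union is standard Avro syntax for a nullable field. It is used here on the assumption that a value treated as null needs a schema field that accepts null.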
URI of a temporary result file
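A downstream task can reference the result file through the task's outputs, in the same way the example above reads `outputs.avg_salary_by_job_title.uri`. The sketch below assumes the output property is named `uri` and adds a hypothetical `log_result` task to the flow's `tasks` list, using the core `Log` task only to print the location of the Parquet file.

```yaml
# Hypothetical follow-up task appended to the flow's tasks list.
- id: log_result
  type: io.kestra.plugin.core.log.Log
  # Assumes the IonToParquet task with id "result" exposes its file as "uri".
  message: "Parquet file written to {{ outputs.result.uri }}"
```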