IonToParquet
Convert an ION file into Parquet.
type: "io.kestra.plugin.serdes.parquet.IonToParquet"
Examples
Read a CSV file, transform it, and store the transformed data as a Parquet file.
id: ion_to_parquet
namespace: company.team

tasks:
  - id: download_csv
    type: io.kestra.plugin.core.http.Download
    description: salaries of data professionals from 2020 to 2023 (source ai-jobs.net)
    uri: https://huggingface.co/datasets/kestra/datasets/raw/main/csv/salaries.csv

  - id: avg_salary_by_job_title
    type: io.kestra.plugin.jdbc.duckdb.Query
    inputFiles:
      data.csv: "{{ outputs.download_csv.uri }}"
    sql: |
      SELECT
        job_title,
        ROUND(AVG(salary),2) AS avg_salary
      FROM read_csv_auto('{{ workingDir }}/data.csv', header=True)
      GROUP BY job_title
      HAVING COUNT(job_title) > 10
      ORDER BY avg_salary DESC;
    store: true

  - id: result
    type: io.kestra.plugin.serdes.parquet.IonToParquet
    from: "{{ outputs.avg_salary_by_job_title.uri }}"
    schema: |
      {
        "type": "record",
        "name": "Salary",
        "namespace": "com.example.salary",
        "fields": [
          {"name": "job_title", "type": "string"},
          {"name": "avg_salary", "type": "double"}
        ]
      }
Properties
from * Required string
Source file URI
Pebble expression referencing an Internal Storage URI, e.g. {{ outputs.mytask.uri }}.
compressionCodec string
Default: GZIP
Possible values: UNCOMPRESSED, SNAPPY, GZIP, ZSTD
The compression codec to use
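As a minimal sketch, the codec is set on the task like any other property. The task id and the upstream output reference are reused from the example flow above; choosing ZSTD is only illustrative.

```yaml
- id: result
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.avg_salary_by_job_title.uri }}"
  # One of the listed values: UNCOMPRESSED, SNAPPY, GZIP (default), ZSTD
  compressionCodec: ZSTD
```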
dateFormat string
yyyy-MM-dd[XXX]
Format to use when parsing date
datetimeFormat string
yyyy-MM-dd'T'HH:mm[:ss][.SSSSSS][XXX]
Format to use when parsing datetime
Default value is yyyy-MM-dd'T'HH:mm[:ss][.SSSSSS]XXX
decimalSeparator string
.
Character to recognize as decimal point (e.g. use ‘,’ for European data).
Default value is '.'
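The following sketch shows how the parsing properties might be combined for European-style input. It assumes the source file stores dates and decimals as strings, and that the patterns follow Java's DateTimeFormatter syntax, which the default values resemble. The patterns themselves are illustrative, not taken from the plugin.

```yaml
- id: result
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.avg_salary_by_job_title.uri }}"
  # Assumed input: dates such as 31/12/2023 and decimals such as 1234,56
  dateFormat: "dd/MM/yyyy"
  datetimeFormat: "dd/MM/yyyy HH:mm:ss"
  decimalSeparator: ","
```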
dictionaryPageSize integer | string
1048576
Max dictionary page size
falseValues array
["f","false","disabled","0","off","no",""]
Values to consider as False
inferAllFields boolean | string
false
Try to infer all fields
If true, we try to infer all fields with trueValues, falseValues & nullValues. If false, we will infer bool & null only on fields declared in the schema as null and bool.
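A hedged sketch of enabling inference on every field while overriding the value lists. The override values are illustrative only; the defaults are listed under each property.

```yaml
- id: result
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.avg_salary_by_job_title.uri }}"
  # Apply boolean and null detection to every field, not only those
  # declared as bool or null in the schema
  inferAllFields: true
  # Illustrative overrides of the default value lists
  trueValues: ["yes", "y"]
  falseValues: ["no", "n"]
  nullValues: ["", "NA", "null"]
```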
nullValues array
["","#N/A","#N/A N/A","#NA","-1.#IND","-1.#QNAN","-NaN","1.#IND","1.#QNAN","NA","n/a","nan","null"]
Values to consider as null
numberOfRowsToScan integer | string
100
Number of rows that will be scanned while inferring the schema. The larger the number, the more precise the output schema will be.
Only used when the 'schema' property is empty
pageSize integer | string
1048576
Target page size
parquetVersion string
Default: V2
Possible values: V1, V2
Target Parquet format version
rowGroupSize integer | string
134217728
Target row group size
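The writer-tuning properties can be set together, as in this sketch. It assumes the sizes are byte counts, which the defaults suggest (1048576 is 1 MiB and 134217728 is 128 MiB). The smaller row group size is purely illustrative.

```yaml
- id: result
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.avg_salary_by_job_title.uri }}"
  parquetVersion: V1
  # Assumed to be bytes: 64 MiB row groups instead of the 128 MiB default
  rowGroupSize: 67108864
  # Page sizes left at their documented defaults (1 MiB)
  pageSize: 1048576
  dictionaryPageSize: 1048576
```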
schema string
The Avro schema associated with the data
If empty, the task will try to infer the schema from the current data; use the 'numberOfRowsToScan' property to control how many rows are scanned.
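A minimal sketch of relying on schema inference instead of declaring an Avro schema. The value 500 is an arbitrary illustration, not a recommendation.

```yaml
- id: result
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.avg_salary_by_job_title.uri }}"
  # No 'schema' property: the task infers one from the data.
  # Scan more rows than the default of 100 before settling on field types.
  numberOfRowsToScan: 500
```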
strictSchema boolean | string
false
Whether to treat a field that is present in the data but not declared in the schema as an error
Default value is false
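As an illustration, strict checking could be paired with an explicit schema so that undeclared fields are reported as errors. The schema is the one from the example flow.

```yaml
- id: result
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.avg_salary_by_job_title.uri }}"
  # Treat any field not declared in the schema below as an error
  strictSchema: true
  schema: |
    {
      "type": "record",
      "name": "Salary",
      "namespace": "com.example.salary",
      "fields": [
        {"name": "job_title", "type": "string"},
        {"name": "avg_salary", "type": "double"}
      ]
    }
```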
timeFormat string
HH:mm[:ss][.SSSSSS][XXX]
Format to use when parsing time
timeZoneId string
Etc/UTC
Timezone to use when no timezone can be parsed on the source.
If null, the timezone will be UTC
Default value is system timezone
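A short sketch of setting the fallback timezone explicitly. It assumes the property accepts IANA zone identifiers, as the default Etc/UTC suggests; Europe/Paris is only an example.

```yaml
- id: result
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.avg_salary_by_job_title.uri }}"
  # Used only when a parsed value carries no timezone of its own
  timeZoneId: Europe/Paris
```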
trueValues array
["t","true","enabled","1","on","yes"]
Values to consider as True
Outputs
uri string
uri
URI of a temporary result file
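The output can be referenced from a later task with a Pebble expression, in the same way the example flow passes URIs between tasks. In this sketch the task id log_result is hypothetical, and result is the id of the IonToParquet task in the example flow.

```yaml
- id: log_result
  type: io.kestra.plugin.core.log.Log
  # 'result' is the id of the IonToParquet task
  message: "Parquet file stored at {{ outputs.result.uri }}"
```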