
IonToParquet
Convert an ION file into Parquet.
type: "io.kestra.plugin.serdes.parquet.IonToParquet"Examples
Read a CSV file, transform it, and store the transformed data as a Parquet file.
id: ion_to_parquet
namespace: company.team

tasks:
  - id: download_csv
    type: io.kestra.plugin.core.http.Download
    description: salaries of data professionals from 2020 to 2023 (source ai-jobs.net)
    uri: https://huggingface.co/datasets/kestra/datasets/raw/main/csv/salaries.csv

  - id: avg_salary_by_job_title
    type: io.kestra.plugin.jdbc.duckdb.Query
    inputFiles:
      data.csv: "{{ outputs.download_csv.uri }}"
    sql: |
      SELECT
        job_title,
        ROUND(AVG(salary), 2) AS avg_salary
      FROM read_csv_auto('{{ workingDir }}/data.csv', header=True)
      GROUP BY job_title
      HAVING COUNT(job_title) > 10
      ORDER BY avg_salary DESC;
    store: true

  - id: result
    type: io.kestra.plugin.serdes.parquet.IonToParquet
    from: "{{ outputs.avg_salary_by_job_title.uri }}"
    schema: |
      {
        "type": "record",
        "name": "Salary",
        "namespace": "com.example.salary",
        "fields": [
          {"name": "job_title", "type": "string"},
          {"name": "avg_salary", "type": "double"}
        ]
      }
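
The schema property is optional; when it is omitted, the task infers the Avro schema from the data. A minimal sketch of the final task relying on inference (the task id and row count below are illustrative):

- id: result_inferred
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.avg_salary_by_job_title.uri }}"
  # No schema given: it is inferred by scanning the first rows of the data.
  numberOfRowsToScan: 200
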
Properties
from (string, required)
Source file URI.
Pebble expression referencing an Internal Storage URI, e.g. {{ outputs.mytask.uri }}.
compressionCodec (string)
The compression codec to use.
Default value is GZIP. Possible values: UNCOMPRESSED, SNAPPY, GZIP, ZSTD.
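
For example, to write Snappy-compressed Parquet instead of the default GZIP (illustrative task snippet):

- id: result
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.avg_salary_by_job_title.uri }}"
  compressionCodec: SNAPPY
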
dateFormat (string)
Format to use when parsing dates.
Default value is yyyy-MM-dd[XXX].
datetimeFormat (string)
Format to use when parsing datetimes.
Default value is yyyy-MM-dd'T'HH:mm[:ss][.SSSSSS][XXX].
decimalSeparator (string)
Character to recognize as the decimal point (e.g. use ',' for European data).
Default value is '.'.
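
A sketch combining the parsing properties for source data that uses European-style decimals and offset-free timestamps (the format strings and values are illustrative):

- id: result
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.avg_salary_by_job_title.uri }}"
  dateFormat: "yyyy-MM-dd"
  datetimeFormat: "yyyy-MM-dd'T'HH:mm:ss"
  decimalSeparator: ","
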
dictionaryPageSize (integer or string)
Max dictionary page size.
Default value is 1048576.
falseValues (array)
Values to consider as False.
Default value is ["f","false","disabled","0","off","no",""].
inferAllFields (boolean or string)
Try to infer all fields.
If true, the task tries to infer all fields using trueValues, falseValues, and nullValues. If false, it infers booleans and nulls only on fields declared in the schema as null or bool.
Default value is false.
nullValues (array)
Values to consider as null.
Default value is ["","#N/A","#N/A N/A","#NA","-1.#IND","-1.#QNAN","-NaN","1.#IND","1.#QNAN","NA","n/a","nan","null"].
numberOfRowsToScan (integer or string)
Number of rows to scan while inferring the schema. The more rows scanned, the more precise the inferred schema.
Only used when the schema property is empty.
Default value is 100.
onBadLines (string)
How to handle bad records (e.g., null values in non-nullable fields or type mismatches).
Default value is ERROR. Possible values: ERROR, WARN, SKIP.
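
For example, to drop bad records instead of failing the conversion (illustrative):

- id: result
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.avg_salary_by_job_title.uri }}"
  onBadLines: SKIP
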
pageSize (integer or string)
Target page size.
Default value is 1048576.
parquetVersion (string)
The Parquet version to use.
Default value is V2. Possible values: V1, V2.
rowGroupSize (integer or string)
Target row group size.
Default value is 134217728.
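
A sketch of the file-layout properties tuned together; the sizes below are only examples, not recommendations:

- id: result
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.avg_salary_by_job_title.uri }}"
  parquetVersion: V1
  rowGroupSize: 67108864       # 64 MiB row groups
  pageSize: 524288             # 512 KiB data pages
  dictionaryPageSize: 524288   # 512 KiB dictionary pages
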
schema (string)
The Avro schema associated with the data.
If empty, the task tries to infer the schema from the data; use the numberOfRowsToScan property if needed.
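
Columns that may contain nulls should be declared as a union with "null" in the schema; otherwise a null value makes the record a bad record, handled according to onBadLines. A sketch of such a schema:

schema: |
  {
    "type": "record",
    "name": "Salary",
    "namespace": "com.example.salary",
    "fields": [
      {"name": "job_title", "type": "string"},
      {"name": "avg_salary", "type": ["null", "double"], "default": null}
    ]
  }
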
strictSchema (boolean or string)
Whether to treat a field present in the data but not declared in the schema as an error.
Default value is false.
timeFormat (string)
Format to use when parsing times.
Default value is HH:mm[:ss][.SSSSSS][XXX].
timeZoneId (string)
Timezone to use when no timezone can be parsed from the source.
If not set, the timezone defaults to UTC. Default value is Etc/UTC.
trueValues (array)
Values to consider as True.
Default value is ["t","true","enabled","1","on","yes"].
Outputs
uri (string, uri format)
URI of a temporary result file.
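
The output URI can be referenced by downstream tasks, for example (a minimal sketch using the core Log task):

- id: log_parquet_uri
  type: io.kestra.plugin.core.log.Log
  message: "Parquet file stored at {{ outputs.result.uri }}"
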
Metrics
records (counter)
Number of records converted.