LoadFromGcs
type: "io.kestra.plugin.gcp.bigquery.LoadFromGcs"
Load data from GCS (Google Cloud Storage) to BigQuery
Examples
Load an Avro file from a GCS bucket
id: "load_from_gcs"
type: "io.kestra.plugin.gcp.bigquery.LoadFromGcs"
from:
  - "{{ outputs['avro-to-gcs'] }}"
destinationTable: "my_project.my_dataset.my_table"
format: AVRO
avroOptions:
  useAvroLogicalTypes: true
Load a CSV file with a defined schema
- id: load_files_test
  type: io.kestra.plugin.gcp.bigquery.LoadFromGcs
  destinationTable: "myDataset.myTable"
  ignoreUnknownValues: true
  schema:
    fields:
      - name: colA
        type: STRING
      - name: colB
        type: NUMERIC
      - name: colC
        type: STRING
  format: CSV
  csvOptions:
    allowJaggedRows: true
    encoding: UTF-8
    fieldDelimiter: ","
  from:
    - gs://myBucket/myFile.csv
Properties
autodetect
- Type: boolean
- Dynamic: ❌
- Required: ❌
Experimental. Automatic inference of the options and schema for CSV and JSON sources.
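For example, a minimal sketch that lets BigQuery infer the schema from a CSV source instead of declaring one (the bucket, file, and table names are placeholders):
id: load_with_autodetect
type: io.kestra.plugin.gcp.bigquery.LoadFromGcs
destinationTable: "my_project.my_dataset.my_table"
format: CSV
autodetect: true
from:
  - gs://myBucket/myFile.csv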
avroOptions
- Type: AvroOptions
- Dynamic: ❓
- Required: ❌
Avro parsing options
clusteringFields
- Type: array
- SubType: string
- Dynamic: ✔️
- Required: ❌
The clustering specification for the destination table
createDisposition
- Type: string
- Dynamic: ❌
- Required: ❌
- Possible Values:
CREATE_IF_NEEDED
CREATE_NEVER
Whether the job is allowed to create tables
csvOptions
- Type: CsvOptions
- Dynamic: ❓
- Required: ❌
CSV parsing options
destinationTable
- Type: string
- Dynamic: ✔️
- Required: ❌
The table where to load the data
If not provided, a new table is created.
format
- Type: string
- Dynamic: ❌
- Required: ❌
- Possible Values:
CSV
JSON
AVRO
PARQUET
ORC
The source format, and possibly some parsing options, of the external data
from
- Type: array
- SubType: string
- Dynamic: ✔️
- Required: ❌
Google Cloud Storage source data
The fully-qualified URIs that point to source data in Google Cloud Storage (e.g. gs://bucket/path). Each URI can contain one '*' wildcard character and it must come after the 'bucket' name.
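For instance, a single wildcard can match every file under a hypothetical prefix:
from:
  - "gs://myBucket/exports/*.csv"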
ignoreUnknownValues
- Type: boolean
- Dynamic: ❌
- Required: ❌
Whether BigQuery should allow extra values that are not represented in the table schema
If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. By default, unknown values are not allowed.
location
- Type: string
- Dynamic: ✔️
- Required: ❌
The geographic location where the dataset should reside
This property is experimental and might change or be removed.
See Dataset Location
maxBadRecords
- Type: integer
- Dynamic: ❌
- Required: ❌
The maximum number of bad records that BigQuery can ignore when running the job
If the number of bad records exceeds this value, an invalid error is returned in the job result. By default, no bad records are ignored.
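For example, a tolerant load sketch combining this with ignoreUnknownValues (the threshold of 10 is illustrative):
ignoreUnknownValues: true
maxBadRecords: 10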
projectId
- Type: string
- Dynamic: ✔️
- Required: ❌
The GCP project id
retryAuto
- Type: Constant, Exponential, or Random
- Dynamic: ❓
- Required: ❌
The automatic retry policy applied when a job fails with one of the retryable reasons or messages below.
retryMessages
- Type: array
- SubType: string
- Dynamic: ✔️
- Required: ❌
- Default:
[due to concurrent update, Retrying the job may solve the problem]
The messages that are valid for an automatic retry.
A message matches if it is a case-insensitive substring of the full error message.
retryReasons
- Type: array
- SubType: string
- Dynamic: ✔️
- Required: ❌
- Default:
[rateLimitExceeded, jobBackendError, internalError, jobInternalError]
The reasons that are valid for an automatic retry.
schema
- Type: object
- Dynamic: ❌
- Required: ❌
The schema for the destination table
The schema can be omitted if the destination table already exists, or if you're loading data from a Google Cloud Datastore backup (i.e. DATASTORE_BACKUP format option).
schema:
  fields:
    - name: colA
      type: STRING
    - name: colB
      type: NUMERIC
See StandardSQLTypeName for the possible type values.
schemaUpdateOptions
- Type: array
- SubType: string
- Dynamic: ❌
- Required: ❌
Experimental. Options allowing the schema of the destination table to be updated as a side effect of the load job.
Schema update options are supported in two cases: when writeDisposition is WRITE_APPEND; when writeDisposition is WRITE_TRUNCATE and the destination table is a partition of a table, specified by partition decorators. For normal tables, WRITE_TRUNCATE will always overwrite the schema.
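For example, a sketch that appends to an existing table while permitting new columns (ALLOW_FIELD_ADDITION and ALLOW_FIELD_RELAXATION are the schema update options BigQuery accepts):
writeDisposition: WRITE_APPEND
schemaUpdateOptions:
  - ALLOW_FIELD_ADDITION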
scopes
- Type: array
- SubType: string
- Dynamic: ✔️
- Required: ❌
- Default:
[https://www.googleapis.com/auth/cloud-platform]
The GCP scopes to use
serviceAccount
- Type: string
- Dynamic: ✔️
- Required: ❌
The GCP service account key
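Because this property is dynamic, the key can be rendered from a Kestra secret rather than inlined (assuming a secret named GCP_SERVICE_ACCOUNT exists):
serviceAccount: "{{ secret('GCP_SERVICE_ACCOUNT') }}"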
timePartitioningField
- Type: string
- Dynamic: ✔️
- Required: ❌
The time partitioning field for the destination table
timePartitioningType
- Type: string
- Dynamic: ✔️
- Required: ❌
- Default:
DAY
- Possible Values:
DAY
HOUR
MONTH
YEAR
The time partitioning type specification for the destination table
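A combined sketch partitioning the destination table by day on a timestamp column and clustering it for cheaper filtering (the table and column names are hypothetical):
destinationTable: "my_project.my_dataset.my_events"
timePartitioningType: DAY
timePartitioningField: created_at
clusteringFields:
  - country
  - user_id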
writeDisposition
- Type: string
- Dynamic: ❌
- Required: ❌
- Possible Values:
WRITE_TRUNCATE
WRITE_APPEND
WRITE_EMPTY
The action that should occur if the destination table already exists
Outputs
destinationTable
- Type: string
Destination table
jobId
- Type: string
The job id
rows
- Type: integer
Output rows count
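Downstream tasks can reference these values with Kestra's output expressions; for example, the row count of the first example task above would be:
"{{ outputs.load_from_gcs.rows }}"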
Definitions
Constant
interval
- Type: string
- Dynamic: ❓
- Required: ✔️
- Format:
duration
maxAttempt
- Type: integer
- Dynamic: ❓
- Required: ❌
- Minimum:
>= 1
maxDuration
- Type: string
- Dynamic: ❓
- Required: ❌
- Format:
duration
warningOnRetry
- Type: boolean
- Dynamic: ❓
- Required: ❌
- Default:
false
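For example, a minimal retryAuto sketch using the Constant policy, assuming retryAuto follows Kestra's standard retry syntax (the ISO 8601 duration values are illustrative):
retryAuto:
  type: constant
  interval: PT30S
  maxAttempt: 3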
Random
maxInterval
- Type: string
- Dynamic: ❓
- Required: ✔️
- Format:
duration
minInterval
- Type: string
- Dynamic: ❓
- Required: ✔️
- Format:
duration
maxAttempt
- Type: integer
- Dynamic: ❓
- Required: ❌
- Minimum:
>= 1
maxDuration
- Type: string
- Dynamic: ❓
- Required: ❌
- Format:
duration
warningOnRetry
- Type: boolean
- Dynamic: ❓
- Required: ❌
- Default:
false
CsvOptions
allowJaggedRows
- Type: boolean
- Dynamic: ❌
- Required: ❌
Whether BigQuery should accept rows that are missing trailing optional columns
If true, BigQuery treats missing trailing columns as null values. If false, records with missing trailing columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. By default, rows with missing trailing columns are considered bad records.
allowQuotedNewLines
- Type: boolean
- Dynamic: ✔️
- Required: ❌
Whether BigQuery should allow quoted data sections that contain newline characters in a CSV file
By default, quoted newlines are not allowed.
encoding
- Type: string
- Dynamic: ✔️
- Required: ❌
The character encoding of the data
The supported values are UTF-8 or ISO-8859-1. The default value is UTF-8. BigQuery decodes the data after the raw, binary data has been split using the values set in the quote and fieldDelimiter properties.
fieldDelimiter
- Type: string
- Dynamic: ✔️
- Required: ❌
The separator for fields in a CSV file
BigQuery converts the string to ISO-8859-1 encoding, and then uses the first byte of the encoded string to split the data in its raw, binary state. BigQuery also supports the escape sequence "\t" to specify a tab separator. The default value is a comma (',').
quote
- Type: string
- Dynamic: ✔️
- Required: ❌
The value that is used to quote data sections in a CSV file
BigQuery converts the string to ISO-8859-1 encoding, and then uses the first byte of the encoded string to split the data in its raw, binary state. The default value is a double-quote ('"'). If your data does not contain quoted sections, set the property value to an empty string. If your data contains quoted newline characters, you must also set the allowQuotedNewLines property to true.
skipLeadingRows
- Type: integer
- Dynamic: ❌
- Required: ❌
The number of rows at the top of a CSV file that BigQuery will skip when reading the data
The default value is 0. This property is useful if you have header rows in the file that should be skipped.
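Putting these options together, a csvOptions sketch for a hypothetical tab-separated file with one header row:
csvOptions:
  fieldDelimiter: "\t"
  skipLeadingRows: 1
  encoding: UTF-8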
Exponential
interval
- Type: string
- Dynamic: ❓
- Required: ✔️
- Format:
duration
maxInterval
- Type: string
- Dynamic: ❓
- Required: ✔️
- Format:
duration
delayFactor
- Type: number
- Dynamic: ❓
- Required: ❌
maxAttempt
- Type: integer
- Dynamic: ❓
- Required: ❌
- Minimum:
>= 1
maxDuration
- Type: string
- Dynamic: ❓
- Required: ❌
- Format:
duration
warningOnRetry
- Type: boolean
- Dynamic: ❓
- Required: ❌
- Default:
false
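Similarly, a hedged Exponential retryAuto sketch that starts at 10 seconds and backs off up to a 5-minute ceiling (values illustrative):
retryAuto:
  type: exponential
  interval: PT10S
  maxInterval: PT5M
  maxAttempt: 5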
AvroOptions
useAvroLogicalTypes
- Type: boolean
- Dynamic: ❌
- Required: ❌
If the format option is set to AVRO, logical types can be interpreted into their corresponding types (such as TIMESTAMP) instead of only using their raw types (such as INTEGER).