LoadFromGcs
Load data from GCS to BigQuery.

yaml
type: "io.kestra.plugin.gcp.bigquery.LoadFromGcs"

Examples

Load an Avro file from a GCS bucket

yaml
id: gcp_bq_load_from_gcs
namespace: company.team

tasks:
  - id: http_download
    type: io.kestra.plugin.core.http.Download
    uri: https://huggingface.co/datasets/kestra/datasets/raw/main/csv/orders.csv

  - id: csv_to_ion
    type: io.kestra.plugin.serdes.csv.CsvToIon
    from: "{{ outputs.http_download.uri }}"
    header: true

  - id: ion_to_avro
    type: io.kestra.plugin.serdes.avro.IonToAvro
    from: "{{ outputs.csv_to_ion.uri }}"
    schema: |
      {
        "type": "record",
        "name": "Order",
        "namespace": "com.example.order",
        "fields": [
          {"name": "order_id", "type": "int"},
          {"name": "customer_name", "type": "string"},
          {"name": "customer_email", "type": "string"},
          {"name": "product_id", "type": "int"},
          {"name": "price", "type": "double"},
          {"name": "quantity", "type": "int"},
          {"name": "total", "type": "double"}
        ]
      }

  - id: load_from_gcs
    type: io.kestra.plugin.gcp.bigquery.LoadFromGcs
    from:
      - "{{ outputs.ion_to_avro.uri }}"
    destinationTable: "my_project.my_dataset.my_table"
    format: AVRO
    avroOptions:
      useAvroLogicalTypes: true

Load a CSV file with a defined schema

yaml
id: gcp_bq_load_files_test
namespace: company.team

tasks:
  - id: load_files_test
    type: io.kestra.plugin.gcp.bigquery.LoadFromGcs
    destinationTable: "myDataset.myTable"
    ignoreUnknownValues: true
    schema:
      fields:
        - name: colA
          type: STRING
        - name: colB
          type: NUMERIC
        - name: colC
          type: STRING
    format: CSV
    csvOptions:
      allowJaggedRows: true
      encoding: UTF-8
      fieldDelimiter: ","
    from:
      - gs://myBucket/myFile.csv

Properties

autodetect (boolean | string)

Experimental: automatic inference of the options and schema for CSV and JSON sources.
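
As a minimal sketch of schema inference, the flow below loads a CSV file without declaring a schema and lets BigQuery infer it. The flow id, bucket, file, and table names are hypothetical placeholders.

```yaml
id: gcp_bq_load_autodetect
namespace: company.team

tasks:
  - id: load_with_autodetect
    type: io.kestra.plugin.gcp.bigquery.LoadFromGcs
    from:
      - gs://myBucket/myFile.csv          # hypothetical bucket and file
    destinationTable: "myDataset.myTable" # hypothetical dataset and table
    format: CSV
    autodetect: true                      # infer the schema and CSV options
```

Because the property is marked experimental, an explicit schema is likely the safer choice for tables that other jobs depend on.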

avroOptions

Avro parsing options.

Definitions

io.kestra.plugin.gcp.bigquery.AbstractLoad-AvroOptions

useAvroLogicalTypes (boolean | string)

If format is set to AVRO, you can interpret logical types into their corresponding types (such as TIMESTAMP) instead of only using their raw types (such as INTEGER).

The value may be null.

clusteringFields (array)

SubType: string

The clustering specification for the destination table.

createDisposition (string)

Possible Values

CREATE_IF_NEEDED, CREATE_NEVER

Whether the job is allowed to create tables.

csvOptions

CSV parsing options.

Definitions

io.kestra.plugin.gcp.bigquery.AbstractLoad-CsvOptions

allowJaggedRows (boolean | string)

Whether BigQuery should accept rows that are missing trailing optional columns.

If true, BigQuery treats missing trailing columns as null values. If false, records with missing trailing columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. By default, rows with missing trailing columns are considered bad records.

allowQuotedNewLines (boolean | string)

Whether BigQuery should allow quoted data sections that contain newline characters in a CSV file.

By default, quoted newlines are not allowed.

encoding (string)

The character encoding of the data.

The supported values are UTF-8 or ISO-8859-1. The default value is UTF-8. BigQuery decodes the data after the raw, binary data has been split using the values set in the quote and fieldDelimiter properties.

fieldDelimiter (string)

The separator for fields in a CSV file.

BigQuery converts the string to ISO-8859-1 encoding, and then uses the first byte of the encoded string to split the data in its raw, binary state. BigQuery also supports the escape sequence "\t" to specify a tab separator. The default value is a comma (',').

quote (string)

The value that is used to quote data sections in a CSV file.

BigQuery converts the string to ISO-8859-1 encoding, and then uses the first byte of the encoded string to split the data in its raw, binary state. The default value is a double-quote ('"'). If your data does not contain quoted sections, set the property value to an empty string. If your data contains quoted newline characters, you must also set the allowQuotedNewLines property to true.

skipLeadingRows (integer | string)

The number of rows at the top of a CSV file that BigQuery will skip when reading the data.

The default value is 0. This property is useful if you have header rows in the file that should be skipped.
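
The CSV options above can be combined in a single task. The sketch below assumes a semicolon-delimited file with one header row and quoted fields that may contain newlines, loaded into a table that already exists. The flow id, bucket, file, and table names are hypothetical.

```yaml
id: gcp_bq_load_csv_options
namespace: company.team

tasks:
  - id: load_csv_with_header
    type: io.kestra.plugin.gcp.bigquery.LoadFromGcs
    from:
      - gs://myBucket/myFile.csv          # hypothetical bucket and file
    destinationTable: "myDataset.myTable" # assumed to exist already
    format: CSV
    csvOptions:
      skipLeadingRows: 1          # skip the header row
      fieldDelimiter: ";"         # first byte is used as the separator
      quote: '"'                  # the default, shown for clarity
      allowQuotedNewLines: true   # required when quoted fields span lines
      encoding: UTF-8
```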

destinationTable (string)

The table in which to store the loaded data.

If not provided, a new table is created.

format (string)

Possible Values

CSV, JSON, AVRO, PARQUET, ORC

The source format, and possibly some parsing options, of the external data.

from (array)

SubType: string

Google Cloud Storage source data

The fully-qualified URIs that point to source data in Google Cloud Storage (e.g. gs://bucket/path). Each URI can contain one '*' wildcard character and it must come after the 'bucket' name.
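
To illustrate the wildcard rule, the sketch below loads every object that matches one pattern, with the single '*' placed after the bucket name. The flow id, bucket, path, and table names are hypothetical, and the destination table is assumed to exist.

```yaml
id: gcp_bq_load_wildcard
namespace: company.team

tasks:
  - id: load_many_files
    type: io.kestra.plugin.gcp.bigquery.LoadFromGcs
    from:
      # One '*' per URI, and only after the bucket name
      - gs://myBucket/exports/orders-*.csv
    destinationTable: "myDataset.myTable" # assumed to exist already
    format: CSV
```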

ignoreUnknownValues (boolean | string)

Whether BigQuery should allow extra values that are not represented in the table schema.

If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. By default unknown values are not allowed.

impersonatedServiceAccount (string)

The GCP service account to impersonate.

location (string)

The geographic location where the dataset should reside.

This property is experimental and might be subject to change or removed.

See Dataset Location

maxBadRecords (integer | string)

The maximum number of bad records that BigQuery can ignore when running the job.

If the number of bad records exceeds this value, an invalid error is returned in the job result. By default, no bad record is ignored.
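
A tolerant load might pair maxBadRecords with ignoreUnknownValues, as in the sketch below. The threshold of 10 is an arbitrary illustration, and the flow id, bucket, file, and table names are hypothetical.

```yaml
id: gcp_bq_load_tolerant
namespace: company.team

tasks:
  - id: load_tolerant
    type: io.kestra.plugin.gcp.bigquery.LoadFromGcs
    from:
      - gs://myBucket/myFile.csv          # hypothetical bucket and file
    destinationTable: "myDataset.myTable" # assumed to exist already
    format: CSV
    ignoreUnknownValues: true  # drop values not present in the table schema
    maxBadRecords: 10          # fail only when more than 10 records are bad
```

The bad.records metric listed under Metrics reports how many records were actually rejected.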

projectId (string)

The GCP project ID.

retryAuto

Automatic retry for retryable BigQuery exceptions.

Some exceptions (especially rate-limit errors) are not retried by default by the BigQuery client, so the task applies a transparent retry (not the Kestra one) to handle this case. The default is an exponential retry with a 5-second interval, a maximum duration of 15 minutes, and ten attempts.

Definitions

io.kestra.core.models.tasks.retrys.Constant

interval (string, required)

Format: duration

type (object, required)

behavior (string)

Default: RETRY_FAILED_TASK

Possible Values

RETRY_FAILED_TASK, CREATE_NEW_EXECUTION

maxAttempts (integer)

Minimum: >= 1

maxDuration (string)

Format: duration

warningOnRetry (boolean)

Default: false

io.kestra.core.models.tasks.retrys.Exponential

interval (string, required)

Format: duration

maxInterval (string, required)

Format: duration

type (object, required)

behavior (string)

Default: RETRY_FAILED_TASK

Possible Values

RETRY_FAILED_TASK, CREATE_NEW_EXECUTION

delayFactor (number)

maxAttempts (integer)

Minimum: >= 1

maxDuration (string)

Format: duration

warningOnRetry (boolean)

Default: false

io.kestra.core.models.tasks.retrys.Random

maxInterval (string, required)

Format: duration

minInterval (string, required)

Format: duration

type (object, required)

behavior (string)

Default: RETRY_FAILED_TASK

Possible Values

RETRY_FAILED_TASK, CREATE_NEW_EXECUTION

maxAttempts (integer)

Minimum: >= 1

maxDuration (string)

Format: duration

warningOnRetry (boolean)

Default: false
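
An explicit retryAuto block using the Exponential definition might look like the sketch below. The interval, maximum duration, and attempt count mirror the defaults described for retryAuto. The `type: exponential` value follows Kestra's usual retry syntax and is an assumption here, as is the `maxInterval` of one minute, since the source gives no default for it. The flow id, bucket, file, and table names are hypothetical.

```yaml
id: gcp_bq_load_retry_auto
namespace: company.team

tasks:
  - id: load_with_retry
    type: io.kestra.plugin.gcp.bigquery.LoadFromGcs
    from:
      - gs://myBucket/myFile.csv          # hypothetical bucket and file
    destinationTable: "myDataset.myTable" # assumed to exist already
    format: CSV
    retryAuto:
      type: exponential   # assumed discriminator value for the Exponential definition
      interval: PT5S      # ISO 8601 duration: initial 5-second wait
      maxInterval: PT1M   # assumption: cap on the wait between attempts
      maxDuration: PT15M  # stop retrying after 15 minutes overall
      maxAttempts: 10
```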

retryMessages (array)

SubType: string

Default: ["due to concurrent update","Retrying the job may solve the problem","Retrying may solve the problem"]

The messages that trigger an automatic retry.

Each message is tested as a case-insensitive substring of the full error message.

retryReasons (array)

SubType: string

Default: ["rateLimitExceeded","jobBackendError","backendError","internalError","jobInternalError"]

The reasons which would trigger an automatic retry.
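
The two lists can be narrowed when only some failures should be retried. The sketch below keeps a subset of the default values and assumes that an explicit list replaces the defaults rather than extending them, which the source does not state. The flow id, bucket, file, and table names are hypothetical.

```yaml
id: gcp_bq_load_retry_filters
namespace: company.team

tasks:
  - id: load_with_retry_filters
    type: io.kestra.plugin.gcp.bigquery.LoadFromGcs
    from:
      - gs://myBucket/myFile.csv          # hypothetical bucket and file
    destinationTable: "myDataset.myTable" # assumed to exist already
    format: CSV
    retryReasons:           # subset of the documented defaults
      - rateLimitExceeded
      - backendError
    retryMessages:          # matched as a case-insensitive substring
      - "due to concurrent update"
```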

schema (object)

The schema for the destination table.

The schema can be omitted if the destination table already exists, or if you're loading data from a Google Cloud Datastore backup (i.e. DATASTORE_BACKUP format option).

text

schema: 
  fields: 
    - name: colA
      type: STRING
    - name: colB
      type: NUMERIC

See the available types in StandardSQLTypeName.

schemaUpdateOptions (array)

SubType: string

Possible Values

ALLOW_FIELD_ADDITION, ALLOW_FIELD_RELAXATION

Experimental: options allowing the schema of the destination table to be updated as a side effect of the job.

Schema update options are supported in two cases: when writeDisposition is WRITE_APPEND; when writeDisposition is WRITE_TRUNCATE and the destination table is a partition of a table, specified by partition decorators. For normal tables, WRITE_TRUNCATE will always overwrite the schema.
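
The first supported case, appending with WRITE_APPEND, can be sketched as follows. The example assumes the source files carry a new column that the existing table lacks, and the flow id, bucket, file, and table names are hypothetical.

```yaml
id: gcp_bq_load_schema_update
namespace: company.team

tasks:
  - id: append_with_new_columns
    type: io.kestra.plugin.gcp.bigquery.LoadFromGcs
    from:
      - gs://myBucket/myFile.avro         # hypothetical bucket and file
    destinationTable: "myDataset.myTable" # assumed to exist already
    format: AVRO
    writeDisposition: WRITE_APPEND   # schema updates require append here
    schemaUpdateOptions:
      - ALLOW_FIELD_ADDITION         # permit new columns from the source
```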

scopes (array)

SubType: string

Default: ["https://www.googleapis.com/auth/cloud-platform"]

The GCP scopes to be used.

serviceAccount (string)

The GCP service account.
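
One way to supply credentials is to read the service account from a Kestra secret, as sketched below. The secret name GCP_SERVICE_ACCOUNT_JSON and the project ID are hypothetical, and the example assumes the secret holds the service account key. The flow id, bucket, file, and table names are placeholders too.

```yaml
id: gcp_bq_load_with_credentials
namespace: company.team

tasks:
  - id: load_with_credentials
    type: io.kestra.plugin.gcp.bigquery.LoadFromGcs
    projectId: my_project   # hypothetical project ID
    # Hypothetical secret name; assumed to hold the service account key
    serviceAccount: "{{ secret('GCP_SERVICE_ACCOUNT_JSON') }}"
    from:
      - gs://myBucket/myFile.csv          # hypothetical bucket and file
    destinationTable: "myDataset.myTable" # assumed to exist already
    format: CSV
```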

timePartitioningField (string)

The time partitioning field for the destination table.

timePartitioningType (string)

Default: DAY

Possible Values

DAY, HOUR, MONTH, YEAR

The time partitioning type specification for the destination table.
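
Partitioning and clustering can be declared together when the task creates the destination table. In the sketch below, the column names order_date and customer_email are hypothetical, as are the flow id, bucket, file, and table names.

```yaml
id: gcp_bq_load_partitioned
namespace: company.team

tasks:
  - id: load_partitioned
    type: io.kestra.plugin.gcp.bigquery.LoadFromGcs
    from:
      - gs://myBucket/myFile.csv          # hypothetical bucket and file
    destinationTable: "myDataset.myTable" # hypothetical dataset and table
    format: CSV
    autodetect: true
    timePartitioningField: order_date   # hypothetical date column
    timePartitioningType: DAY           # the default, shown for clarity
    clusteringFields:
      - customer_email                  # hypothetical clustering column
```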

writeDisposition (string)

Possible Values

WRITE_TRUNCATE, WRITE_TRUNCATE_DATA, WRITE_APPEND, WRITE_EMPTY

The action that should occur if the destination table already exists.

Outputs

destinationTable (string)

Destination table

jobId (string)

The job ID

rows (integer)

Output rows count
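
Downstream tasks can read these outputs with the usual `outputs.<taskId>` expressions. The sketch below logs them with the core Log task. The flow id, task ids, bucket, file, and table names are hypothetical.

```yaml
id: gcp_bq_load_and_log
namespace: company.team

tasks:
  - id: load_from_gcs
    type: io.kestra.plugin.gcp.bigquery.LoadFromGcs
    from:
      - gs://myBucket/myFile.csv          # hypothetical bucket and file
    destinationTable: "myDataset.myTable" # assumed to exist already
    format: CSV

  - id: log_result
    type: io.kestra.plugin.core.log.Log
    # References the three outputs of the load task above
    message: "Job {{ outputs.load_from_gcs.jobId }} loaded {{ outputs.load_from_gcs.rows }} rows into {{ outputs.load_from_gcs.destinationTable }}"
```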

Metrics

bad.records (counter)

Unit: records

The number of bad records reported in a job.

duration (timer)

The time it took for the task to run.

input.bytes (counter)

Unit: bytes

The number of bytes of source data in a load job.

input.files (counter)

Unit: files

The number of source files in a load job.

output.bytes (counter)

Unit: bytes

The size of the data loaded by a load job so far, in bytes.

output.rows (counter)

Unit: records

The number of rows loaded by a load job so far.
