Load data into a Snowflake database with a Singer target.

Full documentation can be found on the GitHub Repo.

yaml
type: "io.kestra.plugin.singer.targets.pipelinewisesnowflake"

Min length 1

Snowflake account name.

(e.g. rtXXXXX.eu-central-1)

Min length 1

The database name.

The raw data from a tap.

Min length 1

The database user.

Min length 1

Snowflake virtual warehouse name.

Default false

Metadata columns add extra row-level information about data ingestion (e.g. when the row was read from the source, and when it was inserted or deleted in Snowflake). Metadata columns are created automatically by adding extra columns to the tables with the column prefix _SDC_. The column names follow the Stitch naming conventions documented at https://www.stitchdata.com/docs/data-structure/integration-schemas#sdc-columns. Enabling metadata columns flags deleted rows by setting the _SDC_DELETED_AT metadata column. Without the add_metadata_columns option, rows deleted by Singer taps will not be recognisable in Snowflake.

Default false

When enabled, the files loaded to Snowflake will also be stored in archive_load_files_s3_bucket under the key /{archive_load_files_s3_prefix}/{schema_name}/{table_name}/. All archived files will have tap, schema, table and archived-by as S3 metadata keys. When incremental replication is used, the archived files will also have the following S3 metadata keys: incremental-key, incremental-key-min and incremental-key-max.

(Default: Value of s3_bucket) When archive_load_files is enabled, the archived files will be placed in this bucket.

(Default: archive) When archive_load_files is enabled, the archived files will be placed in the archive S3 bucket under this prefix.

S3 Access Key ID. If not provided, AWS_ACCESS_KEY_ID environment variable or IAM role will be used.

AWS profile name for profile based authentication. If not provided, AWS_PROFILE environment variable will be used.

S3 Secret Access Key. If not provided, AWS_SECRET_ACCESS_KEY environment variable or IAM role will be used.

AWS Session token. If not provided, AWS_SESSION_TOKEN environment variable will be used.

Default 100000

Maximum number of rows in each batch. At the end of each batch, the rows in the batch are loaded into Snowflake.

Format duration

Maximum time to wait for batch to reach batch_size_rows.

When this is defined, client-side encryption is enabled and the data in S3 is encrypted. No third parties, including Amazon AWS and any ISPs, can see the data in the clear. The Snowflake COPY command decrypts the data once it is in Snowflake. The master key must be 256 bits long and encoded as a base64 string.
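
One way to produce a key of the required shape is sketched below in Python. The function name is hypothetical and this is only an illustration of the "256-bit, base64-encoded" requirement, not a prescribed procedure.

```python
import base64
import secrets


def generate_master_key() -> str:
    """Return a random 256-bit key encoded as a base64 string."""
    raw_key = secrets.token_bytes(32)  # 32 bytes = 256 bits
    return base64.b64encode(raw_key).decode("ascii")


key = generate_master_key()
# Decoding the string again should give back exactly 32 bytes.
print(len(base64.b64decode(key)) * 8, "bits")
```

Store the generated value as a secret rather than in plain text in a flow definition.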

Override default singer command.

Default python:3.10.12

The task runner container image, only used if the task runner is container-based.

Default 0

(Default: 0) Object type RECORD items from taps can be loaded into VARIANT columns as JSON (default), or the schema can be flattened by creating columns automatically.

When the value is 0 (default), flattening is turned off.
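
The effect of a flattening level can be pictured with the following Python sketch. It is not the target's actual implementation: the function name, the "__" column separator and the choice to keep deeper objects as JSON strings are assumptions made for illustration.

```python
import json


def flatten_record(record: dict, max_level: int, sep: str = "__",
                   _level: int = 0, _prefix: str = "") -> dict:
    """Flatten nested objects into columns, up to max_level levels deep."""
    flat = {}
    for key, value in record.items():
        name = f"{_prefix}{sep}{key}" if _prefix else key
        if isinstance(value, dict) and _level < max_level:
            # Still within the allowed depth: expand into child columns.
            flat.update(flatten_record(value, max_level, sep, _level + 1, name))
        elif isinstance(value, dict):
            # Depth limit reached: keep the object as a single JSON value.
            flat[name] = json.dumps(value)
        else:
            flat[name] = value
    return flat


row = {"id": 1, "address": {"city": "Berlin", "geo": {"lat": 52.5}}}
print(flatten_record(row, max_level=0))  # "address" stays one JSON column
print(flatten_record(row, max_level=1))  # "address__city" becomes a column
```

With level 0 the nested object is kept whole, which corresponds to loading it into a single VARIANT column.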

Name of the schema where the tables will be created, without database prefix. If schema_mapping is not defined then every stream sent by the tap is loaded into this schema.

Grant USAGE privilege on newly created schemas and grant SELECT privilege on newly created tables to a specific role or a list of roles. If schema_mapping is not defined then every stream sent by the tap is granted accordingly.

Default false

By default, the connector caches the available table structures in Snowflake at startup, so it doesn't need to run additional queries during ingestion to check whether the target tables must be altered. The disable_table_cache option turns this caching off: you will always see the most recent table structures, at the cost of extra query runtime.

Deprecated, use 'taskRunner' instead

Name of the named file format created in the pre-requirements section. It has to be a fully qualified name including the schema name.

Default false

Flush and load every stream into Snowflake when one batch is full. Warning: this may cause the COPY command to load files with a low number of records, which may cause performance problems.

Default false

When the hardDelete option is true, DELETE SQL commands are run in Snowflake to delete rows from tables. This is achieved by continuously checking the _SDC_DELETED_AT metadata column sent by the Singer tap. Because deleting rows requires metadata columns, the hard_delete option automatically enables the add_metadata_columns option as well.

Default false

Generate uncompressed files when loading to Snowflake. By default, GZIP-compressed files are generated.

Default 0

The number of threads used to flush tables. 0 will create a thread for each stream, up to parallelism_max. -1 will create a thread for each CPU core. Any other positive number will create that number of threads, up to parallelism_max.
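
The rules above can be restated as a small Python sketch. The function name and the stream_count argument are hypothetical, and the behaviour for values below -1 is not described in this reference, so the sketch rejects them.

```python
import os


def flush_thread_count(parallelism: int, parallelism_max: int, stream_count: int) -> int:
    """Resolve the number of flush threads from the parallelism settings."""
    if parallelism < -1:
        raise ValueError("parallelism must be -1, 0 or a positive number")
    if parallelism == 0:
        # One thread per stream, capped at parallelism_max.
        return min(stream_count, parallelism_max)
    if parallelism == -1:
        # One thread per CPU core.
        return os.cpu_count() or 1
    # Explicit thread count, capped at parallelism_max.
    return min(parallelism, parallelism_max)


print(flush_thread_count(0, 16, stream_count=40))   # capped by parallelism_max
print(flush_thread_count(8, 16, stream_count=40))   # explicit value below the cap
```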

Default 16

Max number of parallel threads to use when flushing tables.

The database user's password.

SubType string

Override default pip packages to use a specific version.

Default true

Log-based and incremental replication on tables with no primary key causes duplicates when merging UPDATE events. When set to true, loading stops if no primary key is defined.

Optional string to tag executed queries in Snowflake. Replaces the tokens {{database}}, {{schema}} and {{table}} with the appropriate values. The tags are displayed in the output of the Snowflake QUERY_HISTORY and QUERY_HISTORY_BY_* functions.
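
As an illustration of the token replacement, the Python sketch below substitutes the three tokens in a tag template. The function name and the example template are hypothetical; only the token names come from the description above.

```python
def render_query_tag(template: str, database: str, schema: str, table: str) -> str:
    """Substitute the {{database}}, {{schema}} and {{table}} tokens in a query tag."""
    values = {"database": database, "schema": schema, "table": table}
    for token, value in values.items():
        template = template.replace("{{" + token + "}}", value)
    return template


print(render_query_tag("load:{{database}}.{{schema}}.{{table}}",
                       "ANALYTICS", "PUBLIC", "ORDERS"))
# prints: load:ANALYTICS.PUBLIC.ORDERS
```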

Snowflake role to use. If not defined then the user's default role will be used.

S3 ACL name to set on the uploaded files.

S3 bucket name. Required if you use an S3 external stage. When this is defined, stage has to be defined as well.

The complete URL to use for the constructed client. This allows the use of a non-native S3 account.

A static prefix before the generated S3 key names. Using prefixes you can upload files into specific directories in the S3 bucket.

Default region when creating new connections.

Useful if you want to load multiple streams from one tap to multiple Snowflake schemas.

If the tap sends the stream_id in <schema_name>-<table_name> format, this option overrides the default_target_schema value. Note that with schema_mapping you can also override the default_target_schema_select_permission value to grant SELECT permissions to different groups per schema, or optionally create indices automatically for the replicated tables.

Note: This is an experimental feature, and it is recommended to use it via PipelineWise YAML files, which generate the object mapping in the right JSON format. For further info, check a PipelineWise YAML example.
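
The lookup this option performs can be pictured with the Python sketch below. The mapping shape ({source_schema: {"target_schema": ...}}), the function name and the example schema names are assumptions for illustration; check the PipelineWise documentation for the exact JSON format.

```python
from typing import Optional


def resolve_target_schema(stream_id: str, default_target_schema: str,
                          schema_mapping: Optional[dict] = None) -> str:
    """Pick the Snowflake schema for a stream named <schema_name>-<table_name>."""
    if not schema_mapping or "-" not in stream_id:
        return default_target_schema
    source_schema = stream_id.split("-", 1)[0]
    entry = schema_mapping.get(source_schema)
    # Assumed mapping shape: {source_schema: {"target_schema": "..."}}
    return entry["target_schema"] if entry else default_target_schema


mapping = {"public": {"target_schema": "repl_public"}}
print(resolve_target_schema("public-orders", "raw", mapping))  # mapped schema
print(resolve_target_schema("sales-orders", "raw", mapping))   # falls back to default
```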

Name of the named external stage created in the pre-requirements section. It has to be a fully qualified name including the schema name. If not specified, table internal stages are used. When this is defined, s3_bucket has to be defined as well.

Default singer-state

The name of the Singer state file stored in the KV Store.

The task runner to use.

Task runners are provided by plugins, each with its own properties.

Default false

Validate every single record message to the corresponding JSON schema. This option is disabled by default and invalid RECORD messages will fail only at load time by Snowflake. Enabling this option will detect invalid records earlier but could cause performance degradation.

Key of the state in KV Store

Default busybox

The image used for the file sidecar container.

The maximum amount of CPU resources a container can use.

Make sure to set this to a numeric value, e.g. cpus: "1.5" or cpus: "4". For instance, if the host machine has two CPUs and you set cpus: "1.5", the container is guaranteed at most one and a half of the CPUs.

The registry authentication.

The auth field is a base64-encoded authentication string of username:password or a token.

The identity token.

The registry password.

The registry URL.

If not defined, the registry will be extracted from the image name.

The registry token.

The registry username.

The ARM resource ID of the user assigned identity.

Extra boot disk size for each task.

The milliCPU count.

Defines the amount of CPU resources per task in milliCPU units. For example, 1000 corresponds to 1 vCPU per task. If undefined, the default value is 2000. If you also define the VM's machine type using the machineType property in InstancePolicy field or inside the instanceTemplate in the InstancePolicyOrTemplate field, make sure the CPU resources for both fields are compatible with each other and with how many tasks you want to allow to run on the same VM at the same time.

For example, if you specify the n2-standard-2 machine type, which has 2 vCPUs, you can set the cpu to no more than 2000. Alternatively, you can run two tasks on the same VM if you set the cpu to 1000 or less.

Memory in MiB.

Defines the amount of memory per task in MiB units. If undefined, the default value is 2048. If you also define the VM's machine type using the machineType in InstancePolicy field or inside the instanceTemplate in the InstancePolicyOrTemplate field, make sure the memory resources for both fields are compatible with each other and with how many tasks you want to allow to run on the same VM at the same time.

For example, if you specify the n2-standard-2 machine type, which has 8 GiB of memory, you can set the memory to no more than 8192.
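
The compatibility check described above comes down to simple arithmetic, sketched here in Python. The function and parameter names are hypothetical; the defaults of 2000 milliCPU and 2048 MiB and the n2-standard-2 figures are taken from the descriptions above.

```python
def max_tasks_per_vm(vm_vcpus: int, vm_memory_mib: int,
                     task_milli_cpu: int = 2000, task_memory_mib: int = 2048) -> int:
    """Return how many tasks fit on one VM given per-task CPU and memory requests."""
    vm_milli_cpu = vm_vcpus * 1000  # 1000 milliCPU corresponds to 1 vCPU
    by_cpu = vm_milli_cpu // task_milli_cpu
    by_memory = vm_memory_mib // task_memory_mib
    # The scarcer resource decides how many tasks can share the VM.
    return min(by_cpu, by_memory)


# n2-standard-2: 2 vCPUs and 8 GiB (8192 MiB) of memory
print(max_tasks_per_vm(2, 8192))                       # prints 1
print(max_tasks_per_vm(2, 8192, task_milli_cpu=1000))  # prints 2
```

A result of 0 means the per-task request exceeds what the machine type offers, which is the incompatible case the description warns about.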

Default default

The namespace where the pod will be created.

Default true

Whether to reconnect to the current pod if it already exists.

Default PT5S

Format duration

The additional duration to wait for logs to arrive after pod completion.

As logs are not retrieved in real time, we cannot guarantee that all logs have been fetched when the pod completes; therefore, we wait for a fixed amount of time to fetch late logs.

Default PT10M

Format duration

The maximum duration to wait until the pod is created.

This timeout is the maximum time that the Kubernetes scheduler can take to schedule the pod, pull the pod image, and start the pod.

The configuration of the target Kubernetes cluster.

Additional YAML spec for the container.

Default true

Whether the pod should be deleted upon completion.

Additional YAML spec for the sidecar container.

Default

{
  "image": "busybox"
}

The configuration of the file sidecar container that handles the download and upload of files.

The pod custom labels

Kestra will add default labels to the pod with execution and flow identifiers.

Node selector for pod scheduling

Kestra will assign the pod to the nodes you want (see Assign Pod Nodes)

Additional YAML spec for the pod.

Default ALWAYS

Possible Values

IF_NOT_PRESENT, ALWAYS, NEVER

The image pull policy for a container image and the tag of the image, which affect when Docker attempts to pull (download) the specified image.

The pod custom resources

The name of the service account.

Validation RegExp \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+)

The version of the plugin to use.

Default PT1H

Format duration

The maximum duration to wait for the pod completion unless the task timeout property is set which will take precedence over this property.

The Batch access key.

The Batch account name.

The blob service endpoint.

Id of the pool on which to run the job.

Default true

Whether to reconnect to the current job if it already exists.

Default PT5S

Format duration

Determines how often Kestra should poll the container for completion. By default, the task runner checks every 5 seconds whether the job is completed. You can set this to a lower value (e.g. PT0.1S = every 100 milliseconds) for quick jobs and to a higher value (e.g. PT1M = every minute) for long-running jobs. Setting this property to a higher value will reduce the number of API calls Kestra makes to the remote service; keep that in mind in case you see API rate limit errors.
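
To get a feel for how the check interval affects API usage, the Python sketch below converts ISO 8601 time durations to seconds and computes an upper bound on the number of completion checks during a wait. The function names are hypothetical, and the parser only handles the hour, minute and second parts used in this reference.

```python
import re
from decimal import Decimal

_NUMBER = r"(\d+(?:\.\d+)?)"
_DURATION = re.compile(rf"^PT(?:{_NUMBER}H)?(?:{_NUMBER}M)?(?:{_NUMBER}S)?$")


def duration_seconds(value: str) -> Decimal:
    """Convert a time-only ISO 8601 duration such as PT5S or PT1H30M to seconds."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (Decimal(g) if g else Decimal(0) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def max_status_checks(max_wait: str, check_interval: str) -> int:
    """Upper bound on completion checks made while waiting for a job."""
    return int(duration_seconds(max_wait) // duration_seconds(check_interval))


print(max_status_checks("PT1H", "PT5S"))  # prints 720
print(max_status_checks("PT1H", "PT1M"))  # prints 60
```

Over a one-hour wait, moving from a 5-second to a 1-minute interval cuts the number of checks from 720 to 60.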

Default true

Whether the job should be deleted upon completion.

Warning, if the job is not deleted, a retry of the task could resume an old failed attempt of the job.

The private registry which contains the container image.

Validation RegExp \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+)

The version of the plugin to use.

Default PT1H

Format duration

The maximum duration to wait for the job completion unless the task timeout property is set which will take precedence over this property.

Azure Batch will automatically time out the job upon reaching this duration, and the task will be marked as failed.

SubType integer

Exit codes of a task execution.

If there is more than one exit code, the condition is met and the action is executed when the task exits with any of the exit codes in the list.

The GCP region.

Default true

Whether to reconnect to the current job if it already exists.

Google Cloud Storage Bucket to use to upload (inputFiles and namespaceFiles) and download (outputFiles) files.

It's mandatory to provide a bucket if you want to use such properties.

Default PT5S

Format duration

Default true

Whether the job should be deleted upon completion.

The GCP project ID.

SubType string

Default ["https://www.googleapis.com/auth/cloud-platform"]

The GCP scopes to be used.

The GCP service account key.

Validation RegExp \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+)

The version of the plugin to use.

Default PT5S

Format duration

Additional time after the job ends to wait for late logs.

Default PT1H

Format duration

The maximum duration to wait for the job completion unless the task timeout property is set which will take precedence over this property.

Google Cloud Run will automatically time out the job upon reaching this duration, and the task will be marked as failed.

Possible Values

ACTION_UNSPECIFIED, RETRY_TASK, FAIL_TASK, UNRECOGNIZED

Action on task failures based on different conditions.

Conditions for actions to deal with task failures.

Network identifier with the format projects/HOST_PROJECT_ID/global/networks/NETWORK.

Subnetwork identifier in the format projects/HOST_PROJECT_ID/regions/REGION/subnetworks/SUBNET

Default v1

The API version

CA certificate as data

CA certificate as file path

Client certificate as data

Client certificate as a file path

Default RSA

Client key encryption algorithm

The default is RSA.

Client key as data

Client key as a file path

Client key passphrase

Disable hostname verification

Key store file

Key store passphrase

Default https://kubernetes.default.svc

The URL of the Kubernetes API

The namespace used

OAuth token

OAuth token provider

Password

Trust all certificates

Truststore file

Truststore passphrase

Username

The URL of the blob container the compute node should use.

Mandatory if you want to use namespaceFiles, inputFiles or outputFiles properties.

Connection string of the Storage Account.

The blob service endpoint.

Shared Key access key for authenticating requests.

Shared Key account name for authenticating requests.

Validation RegExp \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+)

The version of the plugin to use.

Default e2-medium

The GCP machine type.

See https://cloud.google.com/compute/docs/machine-types

Default true

Whether to reconnect to the current job if it already exists.

Google Cloud Storage Bucket to use to upload (inputFiles and namespaceFiles) and download (outputFiles) files.

It's mandatory to provide a bucket if you want to use such properties.

Default PT5S

Format duration

Compute resource requirements.

ComputeResource defines the amount of resources required for each task. Make sure your tasks have enough compute resources to successfully run. If you also define the types of resources for a job to use with the InstancePolicyOrTemplate field, make sure both fields are compatible with each other.

Default true

Whether the job should be deleted upon completion.

Warning, if the job is not deleted, a retry of the task could resume an old failed attempt of the job.

SubType string

Container entrypoint to use.

SubType

Lifecycle management schema when any task in a task group is failed.

Currently, only one lifecycle policy is supported. When the lifecycle policy condition is met, the action in the policy is executed. If the task execution result does not match the defined lifecycle policy, the default policy applies: if the exit code is 0, the task exits; if the task ends with a non-zero exit code, it is retried up to max_retry_count times.
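
The decision described above can be sketched in Python as follows. This is an illustration rather than the Batch service's implementation: the function name, its parameters and the "EXIT" label for a normal exit are hypothetical, while RETRY_TASK and FAIL_TASK are the action values listed in this reference.

```python
from typing import Iterable, Optional


def action_for_exit_code(exit_code: int, policy_action: Optional[str] = None,
                         policy_exit_codes: Iterable[int] = ()) -> str:
    """Choose what happens to a task based on its exit code and one lifecycle policy."""
    if policy_action and exit_code in set(policy_exit_codes):
        # The policy condition is met: run the policy's action.
        return policy_action
    # Default policy: exit code 0 ends the task, anything else is retried
    # (subject to the configured maximum retry count).
    return "EXIT" if exit_code == 0 else "RETRY_TASK"


print(action_for_exit_code(137, "FAIL_TASK", [137, 143]))  # policy matches
print(action_for_exit_code(1, "FAIL_TASK", [137, 143]))    # default policy
```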

Default 2

Minimum >= 0

Maximum <= 10

Maximum number of retries on failures.

A value of 0 means never retry.

SubType

Network interfaces.

The GCP project ID.

The GCP region.

Compute reservation.

SubType string

Default ["https://www.googleapis.com/auth/cloud-platform"]

The GCP scopes to be used.

The GCP service account key.

Validation RegExp \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+)

The version of the plugin to use.

Default PT5S

Format duration

Additional time after the job ends to wait for late logs.

Default PT1H

Format duration

The maximum duration to wait for the job completion unless the task timeout property is set which will take precedence over this property.

Google Cloud Batch will automatically time out the job upon reaching this duration, and the task will be marked as failed.

The maximum amount of kernel memory the container can use.

The minimum allowed value is 4MB. Because kernel memory cannot be swapped out, a container which is starved of kernel memory may block host machine resources, which can have side effects on the host machine and on other containers. See the kernel-memory docs for more details.

The maximum amount of memory resources the container can use.

Make sure to use the format number + unit (regardless of the case) without any spaces. The unit can be KB (kilobytes), MB (megabytes), GB (gigabytes), etc.

Given that it's case-insensitive, the following values are equivalent:

"512MB"
"512Mb"
"512mb"
"512000KB"
"0.5GB"

It is recommended that you allocate at least 6MB.

Allows you to specify a soft limit smaller than memory which is activated when Docker detects contention or low memory on the host machine.

If you use memoryReservation, it must be set lower than memory for it to take precedence. Because it is a soft limit, it does not guarantee that the container doesn’t exceed the limit.

The total amount of memory and swap that can be used by a container.

If memory and memorySwap are set to the same value, this prevents containers from using any swap. This is because memorySwap includes both the physical memory and swap space, while memory is only the amount of physical memory that can be used.

A setting which controls the likelihood of the kernel to swap memory pages.

By default, the host kernel can swap out a percentage of anonymous pages used by a container. You can set memorySwappiness to a value between 0 and 100 to tune this percentage.

By default, if an out-of-memory (OOM) error occurs, the kernel kills processes in a container.

To change this behavior, use the oomKillDisable option. Only disable the OOM killer on containers where you have also set the memory option. If the memory flag is not set, the host can run out of memory, and the kernel may need to kill the host system’s processes to free the memory.

The reference to the user assigned identity to use to access the Azure Container Registry instead of username and password.

The password to log into the registry server.

The registry server URL.

If omitted, the default is "docker.io".

The user name to log into the registry server.

Min length 1

Docker image to use.

Docker configuration file.

Docker configuration file that can set access credentials to private container registries. Usually located in ~/.docker/config.json.

Limits the CPU usage to a given maximum threshold value.

By default, each container’s access to the host machine’s CPU cycles is unlimited. You can set various constraints to limit a given container’s access to the host machine’s CPU cycles.

SubType

A list of device requests to be sent to device drivers.

SubType string

Docker entrypoint to use.

SubType string

Extra hostname mappings to the container network interface configuration.

Docker API URI.

Limits memory usage to a given maximum threshold value.

Docker can enforce hard memory limits, which allow the container to use no more than a given amount of user or system memory, or soft limits, which allow the container to use as much memory as it needs unless certain conditions are met, such as when the kernel detects low memory or contention on the host machine. Some of these options have different effects when used alone or when more than one option is set.

Docker network mode to use e.g. host, none, etc.

Give extended privileges to this container.

Default IF_NOT_PRESENT

Possible Values

IF_NOT_PRESENT, ALWAYS, NEVER

The image pull policy for a container image and the tag of the image, which affect when Docker attempts to pull (download) the specified image.

Size of /dev/shm in bytes.

The size must be greater than 0. If omitted, the system uses 64MB.

User in the Docker container.

SubType string

List of volumes to mount.

Must be a valid mount expression as a string, for example: /home/user:/app.

Volume mounts are disabled by default for security reasons; you must enable them in the server configuration by setting kestra.tasks.scripts.docker.volume-enabled to true.

Default VOLUME

Possible Values

MOUNT, VOLUME

File handling strategy.

How to handle local files (input files, output files, namespace files, ...). By default, we create a volume and copy the file into the volume bind path. Configuring it to MOUNT will mount the working directory instead.

Docker configuration file.

Docker configuration file that can set access credentials to private container registries. Usually located in ~/.docker/config.json.

Limits the CPU usage to a given maximum threshold value.

By default, each container’s access to the host machine’s CPU cycles is unlimited. You can set various constraints to limit a given container’s access to the host machine’s CPU cycles.

Default true

Whether the container should be deleted upon completion.

SubType

A list of device requests to be sent to device drivers.

SubType string

Default

[
  ""
]

Docker entrypoint to use.

SubType string

Extra hostname mappings to the container network interface configuration.

Docker API URI.

Default PT0S

Format duration

When a task is killed, this property sets the grace period before killing the container.

By default, we kill the container immediately when a task is killed. Optionally, you can configure a grace period so the container is stopped with a grace period instead.

Limits memory usage to a given maximum threshold value.

Docker network mode to use e.g. host, none, etc.

SubType string

List of port bindings.

Corresponds to the --publish (-p) option of the docker run CLI command, using the format ip:dockerHostPort:containerPort/protocol. Possible examples:

8080:80/udp
127.0.0.1:8080:80
127.0.0.1:8080:80/udp
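
To make the parts of a port binding explicit, the Python sketch below splits a binding written as ip:dockerHostPort:containerPort/protocol into its components. The function name and the returned keys are hypothetical, IPv6 addresses are not handled, and the protocol falls back to tcp, which is Docker's default when none is given.

```python
def parse_port_binding(binding: str) -> dict:
    """Split a docker-run style port binding into its components."""
    ports, _, protocol = binding.partition("/")
    parts = ports.split(":")
    if len(parts) == 3:
        ip, host_port, container_port = parts
    elif len(parts) == 2:
        ip = None  # no host IP given: bind on all interfaces
        host_port, container_port = parts
    else:
        raise ValueError(f"unsupported port binding: {binding!r}")
    return {
        "ip": ip,
        "host_port": int(host_port),
        "container_port": int(container_port),
        "protocol": protocol or "tcp",
    }


print(parse_port_binding("8080:80/udp"))
print(parse_port_binding("127.0.0.1:8080:80"))
```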

Give extended privileges to this container.

Default IF_NOT_PRESENT

Possible Values

IF_NOT_PRESENT, ALWAYS, NEVER

The pull policy for a container image.

Use the IF_NOT_PRESENT pull policy to avoid pulling already existing images. Use the ALWAYS pull policy to pull the latest version of an image even if an image with the same tag already exists.

Size of /dev/shm in bytes.

The size must be greater than 0. If omitted, the system uses 64MB.

User in the Docker container.

Validation RegExp \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+)

The version of the plugin to use.

SubType string

List of volumes to mount.

Make sure to provide a map of a local path to a container path in the format: /home/local/path:/app/container/path. Volume mounts are disabled by default for security reasons — if you are sure you want to use them, enable that feature in the plugin configuration by setting volume-enabled to true.

Here is how you can add that setting to your Kestra configuration:

text

kestra: 
  plugins: 
    configurations: 
      - type: io.kestra.plugin.scripts.runner.docker.Docker
        values: 
          volume-enabled: true

Default true

Whether to wait for the container to exit.

SubType array

A list of capabilities; an OR list of AND lists of capabilities.

SubType string

Driver-specific options, specified as key/value pairs.

These options are passed directly to the driver.

Compute environment in which to run the job.

AWS region with which the SDK should communicate.

Default

{
  "request": {
    "memory": "2048",
    "cpu": "1"
  }
}

Custom resources for the ECS Fargate container.

See the AWS documentation for more details.

Access Key Id in order to connect to AWS.

If no credentials are defined, we will use the default credentials provider chain to fetch credentials.

S3 Bucket to upload (inputFiles and namespaceFiles) and download (outputFiles) files.

It's mandatory to provide a bucket if you want to use such properties.

Default PT5S

Format duration

Default true

Whether the job should be deleted upon completion.

Warning, if the job is not deleted, a retry of the task could resume an old failed attempt of the job.

The endpoint with which the SDK should communicate.

This property allows you to use a different S3 compatible storage backend.

Execution role for the AWS Batch job.

Mandatory if the compute environment is ECS Fargate. See the AWS documentation for more details.

Job queue to use to submit jobs (ARN). If not specified, the task runner will create a job queue — keep in mind that this can lead to a longer execution.

Default true

Whether to reconnect to the current job if it already exists.

Secret Key Id in order to connect to AWS.

If no credentials are defined, we will use the default credentials provider chain to fetch credentials.

AWS session token, retrieved from an AWS token service, used for authenticating that this user has received temporary permissions to access a given resource.

If no credentials are defined, we will use the default credentials provider chain to fetch credentials.

The AWS STS endpoint with which the SDKClient should communicate.

AWS STS Role.

The Amazon Resource Name (ARN) of the role to assume. If set the task will use the StsAssumeRoleCredentialsProvider. If no credentials are defined, we will use the default credentials provider chain to fetch credentials.

AWS STS External Id.

A unique identifier that might be required when you assume a role in another account. This property is only used when an stsRoleArn is defined.

Default PT15M

Format duration

AWS STS Session duration.

The duration of the role session (default: 15 minutes, i.e., PT15M). This property is only used when an stsRoleArn is defined.

AWS STS Session name.

This property is only used when an stsRoleArn is defined.

Task role to use within the container.

Needed if you want to authenticate with AWS CLI within your container.

Validation RegExp \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+)

The version of the plugin to use.

Default PT1H

Format duration

The maximum duration to wait for the job completion unless the task timeout property is set which will take precedence over this property.

AWS Batch will automatically time out the job upon reaching that duration, and the task will be marked as failed.