Load data into a Redshift database with a Singer target.

Full documentation can be found on the GitHub Repo.

yaml
type: "io.kestra.plugin.singer.targets.pipelinewiseredshift"

Min length 1

Name of the schema where the tables will be created.

If schema_mapping is not defined then every stream sent by the tap is loaded into this schema.

The raw data from a tap.

Min length 1

The database hostname.

The database port.

Min length 1

The S3 bucket name.

Min length 1

The database user.

S3 Access Key ID.

Used for S3 and Redshift copy operations.

Default false

Add metadata columns.

Metadata columns add extra row-level information about data ingestion (i.e. when the row was read from the source, when it was inserted or deleted in Redshift, etc.). Metadata columns are created automatically by adding extra columns to the tables with the column prefix _SDC_. The metadata columns are documented here. Enabling metadata columns will flag deleted rows by setting the _SDC_DELETED_AT metadata column. Without the addMetadataColumns option, rows deleted by Singer taps will not be recognisable in Redshift.

Default 100000

Maximum number of rows in each batch.

At the end of each batch, the rows in the batch are loaded into Redshift.

Override default singer command.

Default bzip2

Possible Values

gzip, bzip2

The compression method to use when writing files to S3 and running Redshift COPY.

Default python:3.10.12

The task runner container image, only used if the task runner is container-based.

COPY options.

Parameters to use in the COPY command when loading data to Redshift. Some basic file formatting parameters are fixed, and overriding them with custom values is not recommended: CSV GZIP DELIMITER ',' REMOVEQUOTES ESCAPE.

Default 0

Object type RECORD items from taps can be transformed to flattened columns by creating columns automatically.

When the hardDelete option is true, DELETE SQL commands are performed in Redshift to delete rows in tables. This is achieved by continuously checking the _SDC_DELETED_AT metadata column sent by the Singer tap. Because deleting rows requires metadata columns, the hardDelete option automatically enables the addMetadataColumns option as well.

The database name.

Grant USAGE privilege on newly created schemas and grant SELECT privilege on newly created tables to a specific list of users or groups.

If schemaMapping is not defined then every stream sent by the tap is granted accordingly.

Default false

Disable table cache.

By default, the connector caches the available table structures in Redshift at startup, so it doesn't need to run additional queries during ingestion to check whether the target tables must be altered. The disable_table_cache option turns this caching off: you will always see the most recent table structures, at the cost of extra query runtime.

Deprecated, use 'taskRunner' instead

Default false

Flush and load every stream into Redshift when one batch is full.

Warning: This may trigger the COPY command to use files with a low number of records.

Default false

Delete rows on Redshift.

Default 16

Max number of parallel threads to use when flushing tables.

Default 0

The number of threads used to flush tables.

0 will create a thread for each stream, up to parallelism_max. -1 will create a thread for each CPU core. Any other positive number will create that number of threads, up to parallelism_max.
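The rules above can be sketched as a small function. This is only an illustration of the documented behaviour, not the connector's actual code; the function name `flush_thread_count` and its parameter names are hypothetical.

```python
import os


def flush_thread_count(parallelism: int, parallelism_max: int, n_streams: int) -> int:
    """Resolve the number of flush threads from the rules described above."""
    if parallelism == 0:
        # One thread per stream, capped at parallelism_max.
        return min(n_streams, parallelism_max)
    if parallelism == -1:
        # One thread per CPU core; os.cpu_count() may return None.
        return os.cpu_count() or 1
    # Any other positive number is used as given, capped at parallelism_max.
    return min(parallelism, parallelism_max)


print(flush_thread_count(0, 16, 40))   # 16 (capped by parallelism_max)
print(flush_thread_count(0, 16, 5))    # 5  (one thread per stream)
print(flush_thread_count(32, 16, 40))  # 16 (capped by parallelism_max)
```

With the defaults shown on this page (0 threads requested, maximum 16), a tap that emits 40 streams would therefore be flushed with 16 threads.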

The database user's password.

SubType string

Override default pip packages to use a specific version.

Default true

Log based and Incremental replications on tables with no Primary Key cause duplicates when merging UPDATE events.

When set to true, stop loading data if no Primary Key is defined.

AWS Redshift COPY role ARN.

AWS Role ARN to be used for the Redshift COPY operation. Used instead of the given AWS keys for the COPY operation if provided - the keys are still used for other S3 operations.

AWS S3 ACL.

S3 Object ACL.

S3 Key Prefix.

A static prefix before the generated S3 key names. Using prefixes you can upload files into specific directories in the S3 bucket. Default: none.

Schema mapping.

Useful if you want to load multiple streams from one tap into multiple Redshift schemas. If the tap sends the stream_id in <schema_name>-<table_name> format, this option overrides the default_target_schema value. Note that with schema_mapping you can also override the default_target_schema_select_permissions value to grant SELECT permissions to different groups per schema, or optionally create indices automatically for the replicated tables.
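A rough sketch of how a stream could be routed to a schema under this option. The shape of the mapping (a dictionary keyed by source schema name with a `target_schema` entry) and the function name `resolve_target_schema` are assumptions made for illustration; check the connector's documentation for the exact structure.

```python
def resolve_target_schema(stream_id, default_target_schema, schema_mapping=None):
    """Pick the Redshift schema for a stream named <schema_name>-<table_name>."""
    schema_mapping = schema_mapping or {}
    # Split on the first '-' only: everything before it is the source schema.
    source_schema, separator, _table = stream_id.partition("-")
    if separator and source_schema in schema_mapping:
        # Assumed mapping shape: {"<source schema>": {"target_schema": "..."}}
        return schema_mapping[source_schema]["target_schema"]
    # No mapping applies, so the stream falls back to the default schema.
    return default_target_schema


mapping = {"public": {"target_schema": "repl_public"}}
print(resolve_target_schema("public-orders", "analytics", mapping))  # repl_public
print(resolve_target_schema("sales-orders", "analytics", mapping))   # analytics
```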

S3 Secret Access Key.

Used for S3 and Redshift copy operations.

AWS S3 Session Token.

S3 AWS STS token for temporary credentials.

Default false

Do not update existing records when Primary Key is defined.

Useful to improve performance when records are immutable, e.g. events.

Default 1

Number of slices to split files into prior to running COPY on Redshift.

This should be set to the number of Redshift slices. The number of slices per node depends on the node size of the cluster - run SELECT COUNT(DISTINCT slice) slices FROM stv_slices to calculate this.

Default singer-state

The name of Singer state file stored in KV Store.

The task runner to use.

Task runners are provided by plugins, each with its own properties.

Default false

Validate every single record message to the corresponding JSON schema.

This option is disabled by default, so invalid RECORD messages will fail only at load time in Redshift. Enabling this option will detect invalid records earlier but could cause performance degradation.

Key of the state in KV Store

Default busybox

The image used for the file sidecar container.

The maximum amount of CPU resources a container can use.

Make sure to set this to a numeric value, e.g. cpus: "1.5" or cpus: "4". For instance, if the host machine has two CPUs and you set cpus: "1.5", the container is guaranteed at most one and a half of the CPUs.

The registry authentication.

The auth field is a base64-encoded authentication string of username:password or a token.
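As a minimal illustration of that encoding, the snippet below builds such a string with Python's standard library. The helper name `registry_auth` and the credentials are placeholders, not real values.

```python
import base64


def registry_auth(username: str, password: str) -> str:
    """Return the base64 encoding of 'username:password'."""
    raw = f"{username}:{password}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


encoded = registry_auth("user", "pass")
print(encoded)                                    # dXNlcjpwYXNz
print(base64.b64decode(encoded).decode("utf-8"))  # user:pass
```

Base64 is an encoding, not encryption, so the resulting value should be treated as a secret just like the password itself.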

The identity token.

The registry password.

The registry URL.

If not defined, the registry will be extracted from the image name.

The registry token.

The registry username.

The ARM resource ID of the user assigned identity.

Extra boot disk size for each task.

The milliCPU count.

Defines the amount of CPU resources per task in milliCPU units. For example, 1000 corresponds to 1 vCPU per task. If undefined, the default value is 2000. If you also define the VM's machine type using the machineType property in InstancePolicy field or inside the instanceTemplate in the InstancePolicyOrTemplate field, make sure the CPU resources for both fields are compatible with each other and with how many tasks you want to allow to run on the same VM at the same time.

For example, if you specify the n2-standard-2 machine type, which has 2 vCPUs, you can set the cpu to no more than 2000. Alternatively, you can run two tasks on the same VM if you set the cpu to 1000 or less.
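The arithmetic in this example can be checked with a few lines of code. This sketch only considers CPU; memory and other scheduling constraints are ignored, and the function name `max_tasks_per_vm` is hypothetical.

```python
def max_tasks_per_vm(vm_vcpus: int, task_milli_cpu: int = 2000) -> int:
    """How many tasks fit on one VM when only CPU is considered."""
    vm_milli_cpu = vm_vcpus * 1000  # 1 vCPU corresponds to 1000 milliCPU
    return vm_milli_cpu // task_milli_cpu


# n2-standard-2 has 2 vCPUs according to the text above.
print(max_tasks_per_vm(2, 2000))  # 1
print(max_tasks_per_vm(2, 1000))  # 2
print(max_tasks_per_vm(2, 3000))  # 0, the task does not fit on this machine type
```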

Memory in MiB.

Defines the amount of memory per task in MiB units. If undefined, the default value is 2048. If you also define the VM's machine type using the machineType in InstancePolicy field or inside the instanceTemplate in the InstancePolicyOrTemplate field, make sure the memory resources for both fields are compatible with each other and with how many tasks you want to allow to run on the same VM at the same time.

For example, if you specify the n2-standard-2 machine type, which has 8 GiB of memory, you can set the memory to no more than 8192.

Default default

The namespace where the pod will be created.

Default true

Whether to reconnect to the current pod if it already exists.

Default PT5S

Format duration

The additional duration to wait for logs to arrive after pod completion.

As logs are not retrieved in real time, we cannot guarantee that we have fetched all logs when the pod completes; therefore we wait for a fixed amount of time to fetch late logs.

Default PT10M

Format duration

The maximum duration to wait until the pod is created.

This timeout is the maximum time that the Kubernetes scheduler can take to schedule the pod, pull the pod image, and start the pod.

The configuration of the target Kubernetes cluster.

Additional YAML spec for the container.

Default true

Whether the pod should be deleted upon completion.

Additional YAML spec for the sidecar container.

Default

{
  "image": "busybox"
}

The configuration of the file sidecar container that handles download and upload of files.

The pod custom labels

Kestra will add default labels to the pod with execution and flow identifiers.

Node selector for pod scheduling

Kestra will assign the pod to the nodes you want (see Assign Pod Nodes)

Additional YAML spec for the pod.

Default ALWAYS

Possible Values

IF_NOT_PRESENT, ALWAYS, NEVER

The image pull policy for a container image and the tag of the image, which affect when Docker attempts to pull (download) the specified image.

The pod custom resources

The name of the service account.

Validation RegExp \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+)

The version of the plugin to use.

Default PT1H

Format duration

The maximum duration to wait for the pod completion unless the task timeout property is set which will take precedence over this property.

The Batch access key.

The Batch account name.

The blob service endpoint.

Id of the pool on which to run the job.

Default true

Whether to reconnect to the current job if it already exists.

Default PT5S

Format duration

Determines how often Kestra should poll the container for completion. By default, the task runner checks every 5 seconds whether the job is completed. You can set this to a lower value (e.g. PT0.1S = every 100 milliseconds) for quick jobs and to a higher value (e.g. PT1M = every minute) for long-running jobs. Setting this property to a higher value will reduce the number of API calls Kestra makes to the remote service; keep that in mind in case you see API rate limit errors.
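To make the trade-off concrete, the sketch below converts ISO 8601 time durations such as PT5S or PT1M into seconds and derives the number of completion checks per hour. It handles only the hour, minute and second components and is not the parser Kestra uses; both function names are hypothetical.

```python
import re

# Matches time-only ISO 8601 durations such as PT5S, PT0.1S, PT1M or PT1H30M.
_DURATION = re.compile(
    r"^PT(?:(?P<h>\d+(?:\.\d+)?)H)?(?:(?P<m>\d+(?:\.\d+)?)M)?(?:(?P<s>\d+(?:\.\d+)?)S)?$"
)


def duration_seconds(value: str) -> float:
    """Convert a time-only ISO 8601 duration into seconds."""
    match = _DURATION.match(value)
    if match is None or not any(match.groupdict().values()):
        raise ValueError(f"unsupported duration: {value!r}")
    parts = {key: float(val) if val else 0.0 for key, val in match.groupdict().items()}
    return parts["h"] * 3600 + parts["m"] * 60 + parts["s"]


def polls_per_hour(interval: str) -> float:
    """Number of completion checks made in one hour at the given interval."""
    return 3600 / duration_seconds(interval)


print(polls_per_hour("PT5S"))  # 720.0
print(polls_per_hour("PT1M"))  # 60.0
```

Moving from the default of PT5S to PT1M cuts the number of completion checks for a one-hour job from 720 to 60.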

Default true

Whether the job should be deleted upon completion.

Warning, if the job is not deleted, a retry of the task could resume an old failed attempt of the job.

The private registry which contains the container image.

Validation RegExp \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+)

The version of the plugin to use.

Default PT1H

Format duration

The maximum duration to wait for the job completion unless the task timeout property is set which will take precedence over this property.

Azure Batch will automatically time out the job upon reaching this duration and the task will fail.

SubType integer

Exit codes of a task execution.

If the list contains more than one exit code, the condition is met and the action is executed when the task exits with any code in the list.

The GCP region.

Default true

Whether to reconnect to the current job if it already exists.

Google Cloud Storage Bucket to use to upload (inputFiles and namespaceFiles) and download (outputFiles) files.

It's mandatory to provide a bucket if you want to use such properties.

Default PT5S

Format duration

Default true

Whether the job should be deleted upon completion.

The GCP project ID.

SubType string

Default ["https://www.googleapis.com/auth/cloud-platform"]

The GCP scopes to be used.

The GCP service account key.

Validation RegExp \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+)

The version of the plugin to use.

Default PT5S

Format duration

Additional time after the job ends to wait for late logs.

Default PT1H

Format duration

The maximum duration to wait for the job completion unless the task timeout property is set which will take precedence over this property.

Google Cloud Run will automatically time out the job upon reaching this duration and the task will fail.

Possible Values

ACTION_UNSPECIFIED, RETRY_TASK, FAIL_TASK, UNRECOGNIZED

Action on task failures based on different conditions.

Conditions for actions to deal with task failures.

Network identifier with the format projects/HOST_PROJECT_ID/global/networks/NETWORK.

Subnetwork identifier in the format projects/HOST_PROJECT_ID/regions/REGION/subnetworks/SUBNET

Default v1

The API version

CA certificate as data

CA certificate as file path

Client certificate as data

Client certificate as a file path

Default RSA

Client key encryption algorithm

default is RSA

Client key as data

Client key as a file path

Client key passphrase

Disable hostname verification

Key store file

Key store passphrase

Default https://kubernetes.default.svc

The url to the Kubernetes API

The namespace used

Oauth token

Oauth token provider

Password

Trust all certificates

Truststore file

Truststore passphrase

Username

The URL of the blob container the compute node should use.

Mandatory if you want to use namespaceFiles, inputFiles or outputFiles properties.

Connection string of the Storage Account.

The blob service endpoint.

Shared Key access key for authenticating requests.

Shared Key account name for authenticating requests.

Validation RegExp \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+)

The version of the plugin to use.

Default e2-medium

The GCP machine type.

See https://cloud.google.com/compute/docs/machine-types

Default true

Whether to reconnect to the current job if it already exists.

Google Cloud Storage Bucket to use to upload (inputFiles and namespaceFiles) and download (outputFiles) files.

It's mandatory to provide a bucket if you want to use such properties.

Default PT5S

Format duration

Compute resource requirements.

ComputeResource defines the amount of resources required for each task. Make sure your tasks have enough compute resources to successfully run. If you also define the types of resources for a job to use with the InstancePolicyOrTemplate field, make sure both fields are compatible with each other.

Default true

Whether the job should be deleted upon completion.

Warning, if the job is not deleted, a retry of the task could resume an old failed attempt of the job.

SubType string

Container entrypoint to use.

SubType

Lifecycle management schema when any task in a task group is failed.

Currently we only support one lifecycle policy. When the lifecycle policy condition is met, the action in the policy will execute. If the task execution result does not match the defined lifecycle policy, the default policy applies: if the exit code is 0, the task exits; if the task ends with a non-zero exit code, the task is retried with max_retry_count.

Default 2

Minimum >= 0

Maximum <= 10

Maximum number of retries on failures.

A value of 0 means never retry.

SubType

Network interfaces.

The GCP project ID.

The GCP region.

Compute reservation.

SubType string

Default ["https://www.googleapis.com/auth/cloud-platform"]

The GCP scopes to be used.

The GCP service account key.

Validation RegExp \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+)

The version of the plugin to use.

Default PT5S

Format duration

Additional time after the job ends to wait for late logs.

Default PT1H

Format duration

The maximum duration to wait for the job completion unless the task timeout property is set which will take precedence over this property.

Google Cloud Batch will automatically time out the job upon reaching this duration and the task will fail.

The maximum amount of kernel memory the container can use.

The minimum allowed value is 4MB. Because kernel memory cannot be swapped out, a container which is starved of kernel memory may block host machine resources, which can have side effects on the host machine and on other containers. See the kernel-memory docs for more details.

The maximum amount of memory resources the container can use.

Make sure to use the format number + unit (regardless of the case) without any spaces. The unit can be KB (kilobytes), MB (megabytes), GB (gigabytes), etc.

Given that it's case-insensitive, the following values are equivalent:

"512MB"
"512Mb"
"512mb"
"512000KB"
"0.5GB"

It is recommended that you allocate at least 6MB.
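The format rule (number + unit, no spaces, case-insensitive) can be illustrated with a small normaliser. This sketch only checks the syntax and upper-cases the unit; it deliberately does not convert between units, and the function name `normalize_memory` is hypothetical.

```python
import re

# A number (integer or decimal) immediately followed by a unit, no spaces.
_MEMORY = re.compile(r"^(\d+(?:\.\d+)?)([A-Za-z]+)$")


def normalize_memory(value: str) -> tuple[float, str]:
    """Split a memory string into (amount, upper-cased unit)."""
    match = _MEMORY.match(value)
    if match is None:
        raise ValueError(f"expected number + unit without spaces, got {value!r}")
    number, unit = match.groups()
    return float(number), unit.upper()


print(normalize_memory("512MB"))  # (512.0, 'MB')
print(normalize_memory("512Mb"))  # (512.0, 'MB')
print(normalize_memory("0.5GB"))  # (0.5, 'GB')
```

A value such as "512 MB" is rejected by this check because of the space between the number and the unit.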

Allows you to specify a soft limit smaller than memory which is activated when Docker detects contention or low memory on the host machine.

If you use memoryReservation, it must be set lower than memory for it to take precedence. Because it is a soft limit, it does not guarantee that the container doesn’t exceed the limit.

The total amount of memory and swap that can be used by a container.

If memory and memorySwap are set to the same value, this prevents containers from using any swap. This is because memorySwap includes both the physical memory and swap space, while memory is only the amount of physical memory that can be used.
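Because memorySwap is a combined total, the swap actually available to the container is the difference between the two settings. A minimal sketch of that relationship, with values expressed in megabytes for simplicity and a hypothetical function name:

```python
def swap_available_mb(memory_mb: int, memory_swap_mb: int) -> int:
    """Swap space left for the container once physical memory is accounted for."""
    if memory_swap_mb < memory_mb:
        raise ValueError("memorySwap includes memory, so it cannot be smaller than memory")
    return memory_swap_mb - memory_mb


print(swap_available_mb(512, 512))   # 0, the container cannot use swap
print(swap_available_mb(512, 1024))  # 512
```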

A setting which controls the likelihood of the kernel to swap memory pages.

By default, the host kernel can swap out a percentage of anonymous pages used by a container. You can set memorySwappiness to a value between 0 and 100 to tune this percentage.

By default, if an out-of-memory (OOM) error occurs, the kernel kills processes in a container.

To change this behavior, use the oomKillDisable option. Only disable the OOM killer on containers where you have also set the memory option. If the memory flag is not set, the host can run out of memory, and the kernel may need to kill the host system’s processes to free the memory.

The reference to the user assigned identity to use to access the Azure Container Registry instead of username and password.

The password to log into the registry server.

The registry server URL.

If omitted, the default is "docker.io".

The user name to log into the registry server.

Min length 1

Docker image to use.

Docker configuration file.

Docker configuration file that can set access credentials to private container registries. Usually located in ~/.docker/config.json.

Limits the CPU usage to a given maximum threshold value.

By default, each container’s access to the host machine’s CPU cycles is unlimited. You can set various constraints to limit a given container’s access to the host machine’s CPU cycles.

SubType

A list of device requests to be sent to device drivers.

SubType string

Docker entrypoint to use.

SubType string

Extra hostname mappings to the container network interface configuration.

Docker API URI.

Limits memory usage to a given maximum threshold value.

Docker can enforce hard memory limits, which allow the container to use no more than a given amount of user or system memory, or soft limits, which allow the container to use as much memory as it needs unless certain conditions are met, such as when the kernel detects low memory or contention on the host machine. Some of these options have different effects when used alone or when more than one option is set.

Docker network mode to use e.g. host, none, etc.

Give extended privileges to this container.

Default IF_NOT_PRESENT

Possible Values

IF_NOT_PRESENT, ALWAYS, NEVER

The image pull policy for a container image and the tag of the image, which affect when Docker attempts to pull (download) the specified image.

Size of /dev/shm in bytes.

The size must be greater than 0. If omitted, the system uses 64MB.

User in the Docker container.

SubType string

List of volumes to mount.

Must be a valid mount expression as a string, for example /home/user:/app.

Volume mounts are disabled by default for security reasons; you must enable them in the server configuration by setting kestra.tasks.scripts.docker.volume-enabled to true.

Default VOLUME

Possible Values

MOUNT, VOLUME

File handling strategy.

How to handle local files (input files, output files, namespace files, ...). By default, we create a volume and copy the file into the volume bind path. Configuring it to MOUNT will mount the working directory instead.

Docker configuration file.

Docker configuration file that can set access credentials to private container registries. Usually located in ~/.docker/config.json.

Limits the CPU usage to a given maximum threshold value.

By default, each container’s access to the host machine’s CPU cycles is unlimited. You can set various constraints to limit a given container’s access to the host machine’s CPU cycles.

Default true

Whether the container should be deleted upon completion.

SubType

A list of device requests to be sent to device drivers.

SubType string

Default

[
  ""
]

Docker entrypoint to use.

SubType string

Extra hostname mappings to the container network interface configuration.

Docker API URI.

Default PT0S

Format duration

When a task is killed, this property sets the grace period before killing the container.

By default, we kill the container immediately when a task is killed. Optionally, you can configure a grace period so the container is stopped with a grace period instead.

Limits memory usage to a given maximum threshold value.

Docker network mode to use e.g. host, none, etc.

SubType string

List of port bindings.

Corresponds to the --publish (-p) option of the docker run CLI command using the format ip:dockerHostPort:containerPort/protocol. Possible examples:

8080:80/udp
127.0.0.1:8080:80
127.0.0.1:8080:80/udp

Give extended privileges to this container.

Default IF_NOT_PRESENT

Possible Values

IF_NOT_PRESENT, ALWAYS, NEVER

The pull policy for a container image.

Use the IF_NOT_PRESENT pull policy to avoid pulling already existing images. Use the ALWAYS pull policy to pull the latest version of an image even if an image with the same tag already exists.

Size of /dev/shm in bytes.

The size must be greater than 0. If omitted, the system uses 64MB.

User in the Docker container.

Validation RegExp \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+)

The version of the plugin to use.

SubType string

List of volumes to mount.

Make sure to provide a map of a local path to a container path in the format: /home/local/path:/app/container/path. Volume mounts are disabled by default for security reasons — if you are sure you want to use them, enable that feature in the plugin configuration by setting volume-enabled to true.

Here is how you can add that setting to your kestra configuration:

text

kestra: 
  plugins: 
    configurations: 
      - type: io.kestra.plugin.scripts.runner.docker.Docker
        values: 
          volume-enabled: true

Default true

Whether to wait for the container to exit.

SubType array

A list of capabilities; an OR list of AND lists of capabilities.

SubType string

Driver-specific options, specified as key/value pairs.

These options are passed directly to the driver.

Compute environment in which to run the job.

AWS region with which the SDK should communicate.

Default

{
  "request": {
    "memory": "2048",
    "cpu": "1"
  }
}

Custom resources for the ECS Fargate container.

See the AWS documentation for more details.

Access Key Id in order to connect to AWS.

If no credentials are defined, we will use the default credentials provider chain to fetch credentials.

S3 Bucket to upload (inputFiles and namespaceFiles) and download (outputFiles) files.

It's mandatory to provide a bucket if you want to use such properties.

Default PT5S

Format duration

Default true

Whether the job should be deleted upon completion.

Warning, if the job is not deleted, a retry of the task could resume an old failed attempt of the job.

The endpoint with which the SDK should communicate.

This property allows you to use a different S3 compatible storage backend.

Execution role for the AWS Batch job.

Mandatory if the compute environment is ECS Fargate. See the AWS documentation for more details.

Job queue to use to submit jobs (ARN). If not specified, the task runner will create a job queue — keep in mind that this can lead to a longer execution.

Default true

Whether to reconnect to the current job if it already exists.

Secret Key Id in order to connect to AWS.

If no credentials are defined, we will use the default credentials provider chain to fetch credentials.

AWS session token, retrieved from an AWS token service, used for authenticating that this user has received temporary permissions to access a given resource.

If no credentials are defined, we will use the default credentials provider chain to fetch credentials.

The AWS STS endpoint with which the SDKClient should communicate.

AWS STS Role.

The Amazon Resource Name (ARN) of the role to assume. If set the task will use the StsAssumeRoleCredentialsProvider. If no credentials are defined, we will use the default credentials provider chain to fetch credentials.

AWS STS External Id.

A unique identifier that might be required when you assume a role in another account. This property is only used when an stsRoleArn is defined.

Default PT15M

Format duration

AWS STS Session duration.

The duration of the role session (default: 15 minutes, i.e., PT15M). This property is only used when an stsRoleArn is defined.

AWS STS Session name.

This property is only used when an stsRoleArn is defined.

Task role to use within the container.

Needed if you want to authenticate with AWS CLI within your container.

Validation RegExp \d+\.\d+\.\d+(-[a-zA-Z0-9-]+)?|([a-zA-Z0-9]+)

The version of the plugin to use.

Default PT1H

Format duration

The maximum duration to wait for the job completion unless the task timeout property is set which will take precedence over this property.

AWS Batch will automatically time out the job upon reaching that duration and the task will be marked as failed.