IngestDocument

Currently supports text documents (TXT, HTML, Markdown).

Ingest documents into an embedding store

yaml
type: "io.kestra.plugin.ai.rag.IngestDocument"

Ingest documents into a KV embedding store. WARNING: the KV embedding store is for quick prototyping only; it stores embedding vectors in a KV store and loads them all into memory.

yaml
id: document_ingestion
namespace: company.ai

tasks:
  - id: ingest
    type: io.kestra.plugin.ai.rag.IngestDocument
    provider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ kv('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.ai.embeddings.KestraKVStore
    drop: true
    fromExternalURLs:
      - https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/content/blogs/release-0-24.md
Properties

Embedding store provider

Definitions
baseUrl (string, required)

The database base URL

collectionName (string, required)
type (object, required)

connection (required)
hosts (array, required)
SubType: string
Min items: 1

List of HTTP Elasticsearch servers

Must be a URI with scheme and port, such as https://example.com:9200

basicAuth

Basic authorization configuration

password (string)

Basic authorization password

username (string)

Basic authorization username

headers (array)
SubType: string

List of HTTP headers to be sent with every request

Each item is a key: value string, e.g., Authorization: Token XYZ

pathPrefix (string)

Path prefix for all HTTP requests

If set to /my/path, each client request becomes /my/path/ + endpoint. Useful when Elasticsearch is behind a proxy providing a base path; do not use otherwise.

strictDeprecationMode (boolean | string)

Treat responses with deprecation warnings as failures

trustAllSsl (boolean | string)

Trust all SSL CA certificates

Use this if the server uses a self-signed SSL certificate

indexName (string, required)

The name of the index to store embeddings
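
As a rough sketch of how the connection properties above might fit together, the fragment below configures an Elasticsearch-backed store inside the task's embeddings block. The type identifier, the nesting of basicAuth under connection, the KV key names, and the index name are assumptions made for illustration and are not confirmed by this page.

```yaml
embeddings:
  # Assumed type identifier; check the plugin catalog for the exact name
  type: io.kestra.plugin.ai.embeddings.Elasticsearch
  connection:
    hosts:
      # Each host must include both scheme and port
      - https://example.com:9200
    # Assumed to be nested under connection, following the property order above
    basicAuth:
      username: "{{ kv('ES_USERNAME') }}"   # hypothetical KV key
      password: "{{ kv('ES_PASSWORD') }}"   # hypothetical KV key
  # Hypothetical index name
  indexName: embeddings
```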

type (object, required)

type (object, required)
kvName (string)
Default: {{flow.id}}-embedding-store

The name of the KV pair to use
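
A minimal sketch of overriding kvName, reusing the flow from the example at the top of this page. The flow id and the value docs-embedding-store are hypothetical names chosen for illustration.

```yaml
id: document_ingestion_named_kv
namespace: company.ai

tasks:
  - id: ingest
    type: io.kestra.plugin.ai.rag.IngestDocument
    provider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ kv('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.ai.embeddings.KestraKVStore
      # Hypothetical KV pair name; when omitted, the default is derived from the flow id
      kvName: docs-embedding-store
    fromExternalURLs:
      - https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/content/blogs/release-0-24.md
```

Because the KV store loads all vectors into memory, this setup is best kept to prototyping, as the warning above notes.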

createTable (boolean | string, required)

Whether to create the table if it doesn't exist

databaseUrl (string, required)

Database URL of the MariaDB database (e.g., jdbc:mariadb://host:port/dbname)

fieldName (string, required)

Name of the column used as the unique ID in the database

password (string, required)
tableName (string, required)

Name of the table where embeddings will be stored

type (object, required)
username (string, required)
columnDefinitions (array)
SubType: string

Metadata Column Definitions

List of SQL column definitions for metadata fields (e.g., 'text TEXT', 'source TEXT'). Required only when using COLUMN_PER_KEY storage mode.

indexes (array)
SubType: string

Metadata Index Definitions

List of SQL index definitions for metadata columns (e.g., 'INDEX idx_text (text)'). Used only with COLUMN_PER_KEY storage mode.

metadataStorageMode (string)

Metadata Storage Mode

Determines how metadata is stored:

  • COLUMN_PER_KEY: Use individual columns for each metadata field (requires columnDefinitions and indexes).
  • COMBINED_JSON (default): Store metadata as a JSON object in a single column.

If columnDefinitions and indexes are provided, COLUMN_PER_KEY must be used.
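
To illustrate the COLUMN_PER_KEY mode, here is a hedged sketch of a MariaDB store block. The type identifier, database URL, table name, ID column name, and KV key names are assumptions; the column and index definitions are the ones quoted in the property descriptions above.

```yaml
embeddings:
  # Assumed type identifier; check the plugin catalog for the exact name
  type: io.kestra.plugin.ai.embeddings.MariaDB
  databaseUrl: jdbc:mariadb://localhost:3306/embeddings   # hypothetical database
  username: "{{ kv('MARIADB_USER') }}"                     # hypothetical KV key
  password: "{{ kv('MARIADB_PASSWORD') }}"                 # hypothetical KV key
  tableName: document_embeddings                           # hypothetical table
  fieldName: id                                            # hypothetical ID column
  createTable: true
  # One column per metadata key, so columnDefinitions and indexes are supplied
  metadataStorageMode: COLUMN_PER_KEY
  columnDefinitions:
    - "text TEXT"
    - "source TEXT"
  indexes:
    - "INDEX idx_text (text)"
```

With the default COMBINED_JSON mode, the last three properties would simply be left out.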

token (string, required)

Token

Milvus auth token. Required if authentication is enabled; omit for local deployments without auth.

type (object, required)
autoFlushOnDelete (boolean | string)

Auto flush on delete

If true, flush after delete operations.

autoFlushOnInsert (boolean | string)

Auto flush on insert

If true, flush after insert operations. Setting it to false can improve throughput.

collectionName (string)

Collection name

Target collection. Created automatically if it does not exist. Default: "default".

consistencyLevel (string)

Read/write consistency level. Common values include STRONG, BOUNDED, or EVENTUALLY (depends on client/version).

databaseName (string)

Database name

Logical database to use. If not provided, the default database is used.

host (string)

Milvus host name (used when uri is not set). Default: "localhost".

idFieldName (string)

ID field name

Field name for document IDs. Default depends on collection schema.

indexType (string)

Index type

Vector index type (e.g., IVF_FLAT, IVF_SQ8, HNSW). Depends on Milvus deployment and dataset.

metadataFieldName (string)

Field name for metadata. Default depends on collection schema.

metricType (string)

Metric type

Similarity metric (e.g., L2, IP, COSINE). Should match the embedding provider’s expected metric.

password (string)

Password

port (integer | string)

Milvus port (used when uri is not set). Typical: 19530 (gRPC) or 9091 (HTTP). Default: 19530.

retrieveEmbeddingsOnSearch (boolean | string)

Retrieve embeddings on search

If true, return stored embeddings along with matches. Default: false.

textFieldName (string)

Text field name

Field name for original text. Default depends on collection schema.

uri (string)

URI

Connection URI. Use either uri OR host/port (not both). Examples:

  • gRPC (typical): "milvus://host:19530"
  • HTTP: "http://host:9091"

username (string)

Username

Required when authentication/TLS is enabled. See https://milvus.io/docs/authenticate.md

vectorFieldName (string)

Vector field name

Field name for the embedding vector. Must match the index definition and embedding dimensionality.
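
The "uri OR host/port" rule for Milvus can be sketched as two alternative store blocks, shown here as separate YAML documents. The type identifier, KV key name, collection name, and the chosen metric and index values are assumptions for illustration.

```yaml
# Option A: connect with a single URI (do not set host or port)
embeddings:
  type: io.kestra.plugin.ai.embeddings.Milvus   # assumed type identifier
  token: "{{ kv('MILVUS_TOKEN') }}"             # hypothetical KV key
  uri: "http://localhost:9091"
  collectionName: docs                          # hypothetical collection
  metricType: COSINE
---
# Option B: connect with host and port (do not set uri)
embeddings:
  type: io.kestra.plugin.ai.embeddings.Milvus   # assumed type identifier
  token: "{{ kv('MILVUS_TOKEN') }}"             # hypothetical KV key
  host: localhost
  port: 19530
  collectionName: docs                          # hypothetical collection
  metricType: COSINE
  autoFlushOnInsert: false                      # may improve throughput, per the description above
```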

collectionName (string, required)
host (string, required)

The host

indexName (string, required)
scheme (string, required)

The scheme (e.g., mongodb+srv)

type (object, required)
createIndex (boolean | string)

Create the index

database (string)

The database

metadataFieldNames (array)
SubType: string

The metadata field names

options (object)

The connection string options

password (string)

The password

username (string)

The username

database (string, required)

The database name

host (string, required)
password (string, required)

The database password

port (integer | string, required)
table (string, required)

The table to store embeddings in

type (object, required)
user (string, required)

The database user

useIndex (boolean | string)
Default: false

Whether to use an IVFFlat index

An IVFFlat index divides vectors into lists, and then searches a subset of those lists closest to the query vector. It has faster build times and uses less memory than HNSW but has lower query performance (in terms of speed-recall tradeoff).
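
A hedged sketch of a PostgreSQL-backed store with the IVFFlat index enabled. The type identifier, connection values, and KV key names are assumptions for illustration only.

```yaml
embeddings:
  # Assumed type identifier; check the plugin catalog for the exact name
  type: io.kestra.plugin.ai.embeddings.PGVector
  host: localhost
  port: 5432
  database: postgres
  user: "{{ kv('PG_USER') }}"           # hypothetical KV key
  password: "{{ kv('PG_PASSWORD') }}"   # hypothetical KV key
  table: embeddings                     # hypothetical table name
  # Opt in to the IVFFlat index described above; the default is false
  useIndex: true
```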

apiKey (string, required)
cloud (string, required)

The cloud provider

index (string, required)

The index

region (string, required)

The cloud provider region

type (object, required)
namespace (string)

The namespace (default will be used if not provided)
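
For orientation, a store block using the cloud, region, index, and namespace properties above might look like the sketch below. The type identifier and every value shown (cloud, region, index, namespace, KV key) are assumptions for illustration.

```yaml
embeddings:
  # Assumed type identifier; check the plugin catalog for the exact name
  type: io.kestra.plugin.ai.embeddings.Pinecone
  apiKey: "{{ kv('PINECONE_API_KEY') }}"   # hypothetical KV key
  cloud: AWS                               # illustrative cloud provider value
  region: us-east-1                        # illustrative region
  index: embeddings                        # hypothetical index name
  # Optional; the default namespace is used when this is omitted
  namespace: docs
```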

apiKey (string, required)

The API key

collectionName (string, required)

The collection name

host (string, required)
port (integer | string, required)
type (object, required)

host (string, required)

The database server host

port (integer | string, required)

The database server port

type (object, required)
indexName (string)
Default: embedding-index

The index name

accessKeyId (string, required)

Access Key ID

The access key ID used for authentication with the database.

accessKeySecret (string, required)

Access Key Secret

The access key secret used for authentication with the database.

endpoint (string, required)

The base URL for the Tablestore database endpoint.

instanceName (string, required)

Instance Name

The name of the Tablestore database instance.

type (object, required)
metadataSchemaList (array)

Metadata Schema List

Optional list of metadata field schemas for the collection.

analyzer (string)
Possible values: SingleWord, MaxWord, MinWord, Split, Fuzzy
analyzerParameter
dateFormats (array)
SubType: string
enableHighlighting (boolean)
enableSortAndAgg (boolean)
fieldName (string)
fieldType (string)
Possible values: LONG, DOUBLE, BOOLEAN, KEYWORD, TEXT, NESTED, GEO_POINT, DATE, VECTOR, FUZZY_KEYWORD, IP, JSON, UNKNOWN
index (boolean)
indexOptions (string)
Possible values: DOCS, FREQS, POSITIONS, OFFSETS
isArray (boolean)
jsonType (string)
Possible values: FLATTEN, NESTED
sourceFieldNames (array)
SubType: string
store (boolean)
subFieldSchemas (array)
  analyzer (string)
  Possible values: SingleWord, MaxWord, MinWord, Split, Fuzzy
  analyzerParameter
  dateFormats (array)
  SubType: string
  enableHighlighting (boolean)
  enableSortAndAgg (boolean)
  fieldName (string)
  fieldType (string)
  Possible values: LONG, DOUBLE, BOOLEAN, KEYWORD, TEXT, NESTED, GEO_POINT, DATE, VECTOR, FUZZY_KEYWORD, IP, JSON, UNKNOWN
  index (boolean)
  indexOptions (string)
  Possible values: DOCS, FREQS, POSITIONS, OFFSETS
  isArray (boolean)
  jsonType (string)
  Possible values: FLATTEN, NESTED
  sourceFieldNames (array)
  SubType: string
  store (boolean)
  subFieldSchemas (array)
  vectorOptions
vectorOptions
  dataType (string)
  dimension (integer)
  metricType (string)
  Possible values: EUCLIDEAN, COSINE, DOT_PRODUCT

apiKey (string, required)

API key

Weaviate API key. Omit for local deployments without auth.

host (string, required)

Host

Cluster host name without protocol, e.g., "abc123.weaviate.network".

type (object, required)
avoidDups (boolean | string)

Avoid duplicates

If true (default), a hash-based ID is derived from each text segment to prevent duplicates. If false, a random ID is used.

consistencyLevel (string)
Possible values: ONE, QUORUM, ALL

Consistency level

Write consistency: ONE, QUORUM (default), or ALL.

grpcPort (integer | string)

gRPC port

Port for gRPC if enabled (e.g., 50051).

metadataFieldName (string)

Metadata field name

Field used to store metadata. Defaults to "_metadata" if not set.

metadataKeys (array)
SubType: string

Metadata keys

The list of metadata keys to store. Defaults to an empty list if not provided.

objectClass (string)

Object class

Weaviate class to store objects in (must start with an uppercase letter). Defaults to "Default" if not set.

port (integer | string)

Port

Optional port (e.g., 443 for https, 80 for http). Leave unset to use provider defaults.

scheme (string)

Scheme

Cluster scheme: "https" (recommended) or "http".

securedGrpc (boolean | string)

Secure gRPC

Whether the gRPC connection is secured (TLS).

useGrpcForInserts (boolean | string)

Use gRPC for batch inserts

If true, use gRPC for batch inserts. HTTP remains required for search operations.
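
Putting several of the Weaviate properties together, a store block might look like the following sketch. The type identifier, KV key name, object class, and metadata key are assumptions; the host value is the one quoted in the host description above.

```yaml
embeddings:
  # Assumed type identifier; check the plugin catalog for the exact name
  type: io.kestra.plugin.ai.embeddings.Weaviate
  apiKey: "{{ kv('WEAVIATE_API_KEY') }}"   # hypothetical KV key
  # Host name only, without protocol
  host: abc123.weaviate.network
  scheme: https
  # Hypothetical class name; it must start with an uppercase letter
  objectClass: Documents
  consistencyLevel: QUORUM
  # Derive IDs from a hash of each text segment to prevent duplicates
  avoidDups: true
  metadataKeys:
    - source                               # hypothetical metadata key
```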

Language model provider

Must be configured with an embedding model.

Definitions
accessKeyId (string, required)

AWS Access Key ID

modelName (string, required)
secretAccessKey (string, required)

AWS Secret Access Key

type (object, required)
baseUrl (string)
caPem (string)
clientPem (string)
modelType (string)
Default: COHERE
Possible values: COHERE, TITAN

Amazon Bedrock Embedding Model Type

apiKey (string, required)
modelName (string, required)
type (object, required)
baseUrl (string)
caPem (string)
clientPem (string)
maxTokens (integer | string)

Maximum Tokens

Specifies the maximum number of tokens that the model is allowed to generate in its response.

endpoint (string, required)

API endpoint

The Azure OpenAI endpoint in the format: https://{resource}.openai.azure.com/

modelName (string, required)
type (object, required)
apiKey (string)
baseUrl (string)
caPem (string)
clientId (string)

Client ID

clientPem (string)
clientSecret (string)

Client secret

serviceVersion (string)

API version

tenantId (string)

Tenant ID
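
As an illustration of the Azure OpenAI properties above, a provider block using API-key authentication might be sketched as follows. The type identifier, resource name, model name, and KV key name are assumptions; only the endpoint format comes from the description above.

```yaml
provider:
  # Assumed type identifier; check the plugin catalog for the exact name
  type: io.kestra.plugin.ai.provider.AzureOpenAI
  # Follows the documented format https://{resource}.openai.azure.com/
  endpoint: https://my-resource.openai.azure.com/
  # Hypothetical embedding model name; the provider must be configured with an embedding model
  modelName: text-embedding-3-small
  apiKey: "{{ kv('AZURE_OPENAI_API_KEY') }}"   # hypothetical KV key
```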

apiKey (string, required)
modelName (string, required)
type (object, required)
baseUrl (string)
Default: https://dashscope-intl.aliyuncs.com/api/v1

If you use a model in the China (Beijing) region, replace the URL with https://dashscope.aliyuncs.com/api/v1; otherwise, use the Singapore region URL, https://dashscope-intl.aliyuncs.com/api/v1. The default value is computed based on the system timezone.

caPem (string)
clientPem (string)
enableSearch (boolean | string)

Whether the model uses Internet search results for reference when generating text

maxTokens (integer | string)
repetitionPenalty (number | string)

Repetition in a continuous sequence during model generation

Increasing repetition_penalty reduces repetition in model generation; 1.0 means no penalty. Value range: (0, +inf)

apiKey (string, required)
modelName (string, required)
type (object, required)
baseUrl (string)
Default: https://api.deepseek.com/v1
caPem (string)
clientPem (string)

gitHubToken (string, required)

GitHub Token

Personal Access Token (PAT) used to access GitHub Models.

modelName (string, required)
type (object, required)
baseUrl (string)
caPem (string)
clientPem (string)

apiKey (string, required)
modelName (string, required)
type (object, required)
baseUrl (string)
caPem (string)
clientPem (string)

endpoint (string, required)

Endpoint URL

location (string, required)

Project location

modelName (string, required)
project (string, required)

Project ID

type (object, required)
baseUrl (string)
caPem (string)
clientPem (string)

apiKey (string, required)
modelName (string, required)
type (object, required)
baseUrl (string)
Default: https://router.huggingface.co/v1
caPem (string)
clientPem (string)

baseUrl (string, required)
modelName (string, required)
type (object, required)
caPem (string)
clientPem (string)

apiKey (string, required)
modelName (string, required)
type (object, required)
baseUrl (string)
caPem (string)
clientPem (string)

compartmentId (string, required)

OCID of OCI Compartment with the model

modelName (string, required)
region (string, required)

OCI Region to connect the client to

type (object, required)
authProvider (string)

OCI SDK Authentication provider

baseUrl (string)
caPem (string)
clientPem (string)

endpoint (string, required)

Model endpoint

modelName (string, required)
type (object, required)
baseUrl (string)
caPem (string)
clientPem (string)

apiKey (string, required)
modelName (string, required)
type (object, required)
baseUrl (string)
Default: https://api.openai.com/v1
caPem (string)
clientPem (string)

apiKey (string, required)
modelName (string, required)
type (object, required)
baseUrl (string)
caPem (string)
clientPem (string)

accountId (string, required)

Account Identifier

Unique identifier assigned to an account

apiKey (string, required)
modelName (string, required)
type (object, required)
baseUrl (string)

Base URL

Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).

caPem (string)
clientPem (string)

apiKey (string, required)

API Key

modelName (string, required)

Model name

type (object, required)
baseUrl (string)
Default: https://open.bigmodel.cn/

API base URL

The base URL for ZhiPu API (defaults to https://open.bigmodel.cn/)

caPem (string)

CA PEM certificate content

CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.

clientPem (string)

Client PEM certificate content

PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.

maxRetries (integer | string)

Maximum number of retries for a request

maxToken (integer | string)

The maximum number of tokens returned by this request

stops (array)
SubType: string

With the stop parameter, the model will automatically stop generating text when it is about to contain the specified string or token_id

Document splitter

Definitions
maxOverlapSizeInChars (integer, required)

Maximum overlap size (characters). Only full sentences are considered for overlap.

maxSegmentSizeInChars (integer, required)

Maximum segment size (characters)

splitter (string)
Default: RECURSIVE
Possible values: RECURSIVE, PARAGRAPH, LINE, SENTENCE, WORD

DocumentSplitter type

Recommended: RECURSIVE for generic text. It splits into paragraphs first and fits as many as possible into a single TextSegment. If paragraphs are too long, they are recursively split into lines, then sentences, then words, then characters until they fit into a segment.
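
The splitter settings above could be combined as in this small sketch. The property name documentSplitter is an assumption (this page does not show the key under which the splitter is configured), and the two size values are arbitrary illustrations, not recommended defaults.

```yaml
# Assumed property name on the IngestDocument task
documentSplitter:
  # Split by paragraph first, then fall back to lines, sentences, words, and characters
  splitter: RECURSIVE
  # Illustrative sizes only; tune them for the embedding model in use
  maxSegmentSizeInChars: 1000
  maxOverlapSizeInChars: 100
```

A non-zero overlap repeats whole sentences from the end of one segment at the start of the next, which may help retrieval when relevant context straddles a segment boundary.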

Default: false

Drop the store before ingestion (useful for testing)

List of inline documents

Definitions
content*Requiredstring

Document content

metadataobject

Document metadata
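
A sketch of an inline document entry using the content and metadata fields above. The property name fromDocuments is an assumption (the key is not shown on this page), and the text and metadata values are placeholders.

```yaml
# Assumed property name on the IngestDocument task
fromDocuments:
  - content: "This is a short inline document to ingest."
    # Optional per-document metadata; the key and value are hypothetical
    metadata:
      source: inline-note
```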

SubType: string

List of document URLs from external sources

SubType: string

List of internal storage URIs for documents

Path in the task working directory containing documents to ingest

Each document in the directory will be ingested into the embedding store. Ingestion is recursive and protected against path traversal (CWE-22).

SubType: string

Additional metadata to add to all ingested documents

Additional outputs from the embedding store

Number of ingested documents

Input token count

Output token count

Total token count

Unit: records

Number of indexed documents

Unit: token

Large Language Model (LLM) input token count

Unit: token

Large Language Model (LLM) output token count

Unit: token

Large Language Model (LLM) total token count