IngestDocument
Ingest documents into an embedding store

Currently supports text documents (TXT, HTML, Markdown).

```yaml
type: "io.kestra.plugin.ai.rag.IngestDocument"
```

Examples

Ingest documents into a KV embedding store. WARNING: the KV embedding store is for quick prototyping only; it stores embedding vectors in a KV store and loads them all into memory.

```yaml
id: document_ingestion
namespace: company.ai

tasks:
  - id: ingest
    type: io.kestra.plugin.ai.rag.IngestDocument
    provider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ kv('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.ai.embeddings.KestraKVStore
    drop: true
    fromExternalURLs:
      - https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/content/blogs/release-0-24.md
```

Properties

embeddings*

Embedding store provider

Definitions

Chroma Embedding Store

baseUrl*string

The database base URL

collectionName*string

type*object

Elasticsearch Embedding Store

connection*

io.kestra.plugin.ai.embeddings.Elasticsearch-ElasticsearchConnection

hosts*array

SubTypestring

Min items1

List of HTTP Elasticsearch servers

Must be a URI like https://example.com:9200 with scheme and port

basicAuth

Basic authorization configuration

io.kestra.plugin.ai.embeddings.Elasticsearch-ElasticsearchConnection-BasicAuth

passwordstring

Basic authorization password

usernamestring

Basic authorization username

headersarray

SubTypestring

List of HTTP headers to be sent with every request

Each item is a key: value string, e.g., Authorization: Token XYZ

pathPrefixstring

Path prefix for all HTTP requests

If set to /my/path, each client request becomes /my/path/ + endpoint. Useful when Elasticsearch is behind a proxy providing a base path; do not use otherwise.

strictDeprecationModebooleanstring

Treat responses with deprecation warnings as failures

trustAllSslbooleanstring

Trust all SSL CA certificates

Use this if the server uses a self-signed SSL certificate

indexName*string

The name of the index to store embeddings

type*object
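As an illustration of the connection properties above, a minimal sketch of an Elasticsearch-backed `embeddings` block might look like the following. The `type` value is an assumption inferred from the connection class name shown above, and the host, index name, and KV keys are placeholders rather than values from this page.

```yaml
embeddings:
  # Assumed type name, inferred from the ElasticsearchConnection class name above
  type: io.kestra.plugin.ai.embeddings.Elasticsearch
  indexName: embeddings                     # placeholder index name
  connection:
    hosts:
      - https://example.com:9200            # scheme and port must both be present
    basicAuth:
      username: "{{ kv('ES_USERNAME') }}"   # hypothetical KV keys
      password: "{{ kv('ES_PASSWORD') }}"
```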

In-memory embedding store that stores data as Kestra KV pairs

type*object

kvNamestring

Default: {{flow.id}}-embedding-store

The name of the KV pair to use
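A short sketch of overriding the default KV pair name, using the store type from the example at the top of this page. The `kvName` value is a placeholder chosen for illustration.

```yaml
embeddings:
  type: io.kestra.plugin.ai.embeddings.KestraKVStore
  # Placeholder name; when omitted, the default is {{flow.id}}-embedding-store
  kvName: docs-embedding-store
```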

MariaDB Embedding Store

createTable*booleanstring

Whether to create the table if it doesn't exist

databaseUrl*string

Database URL of the MariaDB database (e.g., jdbc:mariadb://host:port/dbname)

fieldName*string

Name of the column used as the unique ID in the database

password*string

tableName*string

Name of the table where embeddings will be stored

type*object

username*string

columnDefinitionsarray

SubTypestring

Metadata Column Definitions

List of SQL column definitions for metadata fields (e.g., 'text TEXT', 'source TEXT'). Required only when using COLUMN_PER_KEY storage mode.

indexesarray

SubTypestring

Metadata Index Definitions

List of SQL index definitions for metadata columns (e.g., 'INDEX idx_text (text)'). Used only with COLUMN_PER_KEY storage mode.

metadataStorageModestring

Metadata Storage Mode

Determines how metadata is stored:

- COLUMN_PER_KEY: use an individual column for each metadata field (requires columnDefinitions and indexes).
- COMBINED_JSON (default): store metadata as a JSON object in a single column.

If columnDefinitions and indexes are provided, COLUMN_PER_KEY must be used.
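To make the COLUMN_PER_KEY mode concrete, here is a hedged sketch of a MariaDB `embeddings` block. The `type` value is an assumption (it is not stated on this page), the connection values and KV keys are placeholders, and the column and index definitions reuse the examples given in the property descriptions above.

```yaml
embeddings:
  type: io.kestra.plugin.ai.embeddings.MariaDB        # assumed type name
  databaseUrl: jdbc:mariadb://localhost:3306/kestra   # placeholder URL
  username: "{{ kv('MARIADB_USER') }}"                # hypothetical KV keys
  password: "{{ kv('MARIADB_PASSWORD') }}"
  tableName: embeddings
  fieldName: id
  createTable: true
  # One column per metadata key instead of a single JSON column
  metadataStorageMode: COLUMN_PER_KEY
  columnDefinitions:
    - text TEXT
    - source TEXT
  indexes:
    - INDEX idx_text (text)
```

With the default COMBINED_JSON mode, `columnDefinitions` and `indexes` would be left out.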

Milvus Embedding Store

token*string

Token

Milvus auth token. Required if authentication is enabled; omit for local deployments without auth.

type*object

autoFlushOnDeletebooleanstring

Auto flush on delete

If true, flush after delete operations.

autoFlushOnInsertbooleanstring

Auto flush on insert

If true, flush after insert operations. Setting it to false can improve throughput.

collectionNamestring

Collection name

Target collection. Created automatically if it does not exist. Default: "default".

consistencyLevelstring

Read/write consistency level. Common values include STRONG, BOUNDED, or EVENTUALLY (depends on client/version).

databaseNamestring

Database name

Logical database to use. If not provided, the default database is used.

hoststring

Milvus host name (used when uri is not set). Default: "localhost".

idFieldNamestring

ID field name

Field name for document IDs. Default depends on collection schema.

indexTypestring

Index type

Vector index type (e.g., IVF_FLAT, IVF_SQ8, HNSW). Depends on Milvus deployment and dataset.

metadataFieldNamestring

Field name for metadata. Default depends on collection schema.

metricTypestring

Metric type

Similarity metric (e.g., L2, IP, COSINE). Should match the embedding provider’s expected metric.

passwordstring

Password

portintegerstring

Milvus port (used when uri is not set). Typical: 19530 (gRPC) or 9091 (HTTP). Default: 19530.

retrieveEmbeddingsOnSearchbooleanstring

Retrieve embeddings on search

If true, return stored embeddings along with matches. Default: false.

textFieldNamestring

Text field name

Field name for original text. Default depends on collection schema.

uristring

URI

Connection URI. Use either uri OR host/port (not both). Examples:

gRPC (typical): "milvus://host:19530"
HTTP: "http://host:9091"

usernamestring

Username

Required when authentication/TLS is enabled. See https://milvus.io/docs/authenticate.md

vectorFieldNamestring

Vector field name

Field name for the embedding vector. Must match the index definition and embedding dimensionality.
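The rule "either uri or host/port, not both" can be sketched as follows. The `type` value is an assumption, and the collection name, index type, metric, and KV key are illustrative placeholders chosen from the values listed above.

```yaml
embeddings:
  type: io.kestra.plugin.ai.embeddings.Milvus   # assumed type name
  token: "{{ kv('MILVUS_TOKEN') }}"             # hypothetical KV key
  uri: "milvus://localhost:19530"
  # Alternative to uri (do not set both):
  # host: localhost
  # port: 19530
  collectionName: kestra_docs                   # placeholder collection
  indexType: HNSW
  metricType: COSINE
```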

MongoDB Atlas Embedding Store

collectionName*string

host*string

The host

indexName*string

scheme*string

The scheme (e.g., mongodb+srv)

type*object

createIndexbooleanstring

Create the index

databasestring

The database

metadataFieldNamesarray

SubTypestring

The metadata field names

optionsobject

The connection string options

passwordstring

The password

usernamestring

The username

PGVector Embedding Store

database*string

The database name

host*string

password*string

The database password

port*integerstring

table*string

The table to store embeddings in

type*object

user*string

The database user

useIndexbooleanstring

Default: false

Whether to use an IVFFlat index

An IVFFlat index divides vectors into lists, and then searches a subset of those lists closest to the query vector. It has faster build times and uses less memory than HNSW but has lower query performance (in terms of speed-recall tradeoff).
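A minimal sketch of a PGVector `embeddings` block with the IVFFlat index enabled. The `type` value is an assumption, and the host, database, table, and KV keys are placeholders.

```yaml
embeddings:
  type: io.kestra.plugin.ai.embeddings.PGVector   # assumed type name
  host: localhost
  port: 5432
  database: postgres
  user: "{{ kv('PG_USER') }}"                     # hypothetical KV keys
  password: "{{ kv('PG_PASSWORD') }}"
  table: embeddings
  useIndex: true   # use an IVFFlat index; the default is false
```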

Pinecone Embedding Store

apiKey*string

cloud*string

The cloud provider

index*string

The index

region*string

The cloud provider region

type*object

namespacestring

The namespace (default will be used if not provided)

Qdrant Embedding Store

apiKey*string

The API key

collectionName*string

The collection name

host*string

port*integerstring

type*object

Redis Embedding Store

host*string

The database server host

port*integerstring

The database server port

type*object

indexNamestring

Default: embedding-index

The index name

Tablestore Embedding Store

accessKeyId*string

Access Key ID

The access key ID used for authentication with the database.

accessKeySecret*string

Access Key Secret

The access key secret used for authentication with the database.

endpoint*string

The base URL for the Tablestore database endpoint.

instanceName*string

Instance Name

The name of the Tablestore database instance.

type*object

metadataSchemaListarray

Metadata Schema List

Optional list of metadata field schemas for the collection.

com.alicloud.openservices.tablestore.model.search.FieldSchema

analyzerstring

Possible Values

SingleWord, MaxWord, MinWord, Split, Fuzzy

analyzerParameter

com.alicloud.openservices.tablestore.model.search.analysis.AnalyzerParameter

dateFormatsarray

SubTypestring

enableHighlightingboolean

enableSortAndAggboolean

fieldNamestring

fieldTypestring

Possible Values

LONG, DOUBLE, BOOLEAN, KEYWORD, TEXT, NESTED, GEO_POINT, DATE, VECTOR, FUZZY_KEYWORD, IP, JSON, UNKNOWN

indexboolean

indexOptionsstring

Possible Values

DOCS, FREQS, POSITIONS, OFFSETS

isArrayboolean

jsonTypestring

Possible Values

FLATTEN, NESTED

sourceFieldNamesarray

SubTypestring

storeboolean

subFieldSchemasarray

com.alicloud.openservices.tablestore.model.search.FieldSchema

Each sub-field schema accepts the same properties as the FieldSchema listed above.

vectorOptions

com.alicloud.openservices.tablestore.model.search.vector.VectorOptions

dataTypestring

dimensioninteger

metricTypestring

Possible Values

EUCLIDEAN, COSINE, DOT_PRODUCT

Weaviate Embedding Store

apiKey*string

API key

Weaviate API key. Omit for local deployments without auth.

host*string

Host

Cluster host name without protocol, e.g., "abc123.weaviate.network".

type*object

avoidDupsbooleanstring

Avoid duplicates

If true (default), a hash-based ID is derived from each text segment to prevent duplicates. If false, a random ID is used.

consistencyLevelstring

Possible Values

ONE, QUORUM, ALL

Consistency level

Write consistency: ONE, QUORUM (default), or ALL.

grpcPortintegerstring

gRPC port

Port for gRPC if enabled (e.g., 50051).

metadataFieldNamestring

Metadata field name

Field used to store metadata. Defaults to "_metadata" if not set.

metadataKeysarray

SubTypestring

Metadata keys

List of metadata keys to store. Defaults to an empty list if not provided.

objectClassstring

Object class

Weaviate class to store objects in (must start with an uppercase letter). Defaults to "Default" if not set.

portintegerstring

Port

Optional port (e.g., 443 for https, 80 for http). Leave unset to use provider defaults.

schemestring

Scheme

Cluster scheme: "https" (recommended) or "http".

securedGrpcbooleanstring

Secure gRPC

Whether the gRPC connection is secured (TLS).

useGrpcForInsertsbooleanstring

Use gRPC for batch inserts

If true, use gRPC for batch inserts. HTTP remains required for search operations.
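Putting several of the Weaviate properties together, a hedged sketch might look like this. The `type` value is an assumption, the host reuses the example from the property description, and the class name, metadata key, and KV key are placeholders.

```yaml
embeddings:
  type: io.kestra.plugin.ai.embeddings.Weaviate   # assumed type name
  apiKey: "{{ kv('WEAVIATE_API_KEY') }}"          # hypothetical KV key
  host: abc123.weaviate.network                   # host name without protocol
  scheme: https
  objectClass: KestraDocs                         # must start with an uppercase letter
  avoidDups: true                                 # hash-based IDs to prevent duplicates
  consistencyLevel: QUORUM
  metadataKeys:
    - source
```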

provider*

Language model provider

Must be configured with an embedding model.

Definitions

Amazon Bedrock Model Provider

accessKeyId*string

AWS Access Key ID

modelName*string

secretAccessKey*string

AWS Secret Access Key

type*object

baseUrlstring

caPemstring

clientPemstring

modelTypestring

Default: COHERE

Possible Values

COHERE, TITAN

Amazon Bedrock Embedding Model Type

Anthropic AI Model Provider

apiKey*string

modelName*string

type*object

baseUrlstring

caPemstring

clientPemstring

maxTokensintegerstring

Maximum Tokens

Specifies the maximum number of tokens that the model is allowed to generate in its response.

Azure OpenAI Model Provider

endpoint*string

API endpoint

The Azure OpenAI endpoint in the format: https://{resource}.openai.azure.com/

modelName*string

type*object

apiKeystring

baseUrlstring

caPemstring

clientIdstring

Client ID

clientPemstring

clientSecretstring

Client secret

serviceVersionstring

API version

tenantIdstring

Tenant ID

DashScope (Qwen) Model Provider from Alibaba Cloud

apiKey*string

modelName*string

type*object

baseUrlstring

Default: https://dashscope-intl.aliyuncs.com/api/v1

For models in the China (Beijing) region, set the URL to https://dashscope.aliyuncs.com/api/v1; otherwise, use the Singapore region URL, https://dashscope-intl.aliyuncs.com/api/v1.
The default value is computed based on the system timezone.

caPemstring

clientPemstring

enableSearchbooleanstring

Whether the model uses internet search results as a reference when generating text

maxTokensintegerstring

repetitionPenaltynumberstring

Penalty for repetition in a continuous sequence during model generation

Increasing repetition_penalty reduces repetition in the generated text; 1.0 means no penalty. Value range: (0, +inf)

Deepseek Model Provider

apiKey*string

modelName*string

type*object

baseUrlstring

Default: https://api.deepseek.com/v1

caPemstring

clientPemstring

GitHub Models AI Model Provider

gitHubToken*string

GitHub Token

Personal Access Token (PAT) used to access GitHub Models.

modelName*string

type*object

baseUrlstring

caPemstring

clientPemstring

Google Gemini Model Provider

apiKey*string

modelName*string

type*object

baseUrlstring

caPemstring

clientPemstring

Google VertexAI Model Provider

endpoint*string

Endpoint URL

location*string

Project location

modelName*string

project*string

Project ID

type*object

baseUrlstring

caPemstring

clientPemstring

HuggingFace Model Provider

apiKey*string

modelName*string

type*object

baseUrlstring

Default: https://router.huggingface.co/v1

caPemstring

clientPemstring

LocalAI Model Provider

baseUrl*string

modelName*string

type*object

caPemstring

clientPemstring

Mistral AI Model Provider

apiKey*string

modelName*string

type*object

baseUrlstring

caPemstring

clientPemstring

OciGenAI Model Provider

compartmentId*string

OCID of OCI Compartment with the model

modelName*string

region*string

OCI Region to connect the client to

type*object

authProviderstring

OCI SDK Authentication provider

baseUrlstring

caPemstring

clientPemstring

Ollama Model Provider

endpoint*string

Model endpoint

modelName*string

type*object

baseUrlstring

caPemstring

clientPemstring
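For a locally hosted embedding model, a `provider` block for Ollama could be sketched as below. The `type` value is an assumption, the endpoint uses Ollama's usual local port, and the model name is a placeholder for whichever embedding model is installed.

```yaml
provider:
  type: io.kestra.plugin.ai.provider.Ollama   # assumed type name
  endpoint: http://localhost:11434            # Ollama's usual local address
  modelName: nomic-embed-text                 # placeholder; must be an embedding model
```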

OpenAI Model Provider

apiKey*string

modelName*string

type*object

baseUrlstring

Default: https://api.openai.com/v1

caPemstring

clientPemstring

OpenRouter Model Provider

apiKey*string

modelName*string

type*object

baseUrlstring

caPemstring

clientPemstring

Watsonx AI Model Provider

apiKey*string

modelName*string

projectId*string

Project Id

type*object

baseUrlstring

caPemstring

clientPemstring

WorkersAI Model Provider

accountId*string

Account Identifier

Unique identifier assigned to an account

apiKey*string

modelName*string

type*object

baseUrlstring

Base URL

Custom base URL to override the default endpoint (useful for local tests, WireMock, or enterprise gateways).

caPemstring

clientPemstring

ZhiPu AI Model Provider

apiKey*string

API Key

modelName*string

Model name

type*object

baseUrlstring

Default: https://open.bigmodel.cn/

API base URL

The base URL for the ZhiPu API (defaults to https://open.bigmodel.cn/)

caPemstring

CA PEM certificate content

CA certificate as text, used to verify SSL/TLS connections when using custom endpoints.

clientPemstring

Client PEM certificate content

PEM client certificate as text, used to authenticate the connection to enterprise AI endpoints.

maxRetriesintegerstring

The maximum number of retries for a request

maxTokenintegerstring

The maximum number of tokens returned by this request

stopsarray

SubTypestring

The model stops generating text when the output is about to contain one of the specified strings or token IDs

documentSplitter

Document splitter

Definitions

io.kestra.plugin.ai.rag.IngestDocument-DocumentSplitter

maxOverlapSizeInChars*integer

Maximum overlap size (characters). Only full sentences are considered for overlap.

maxSegmentSizeInChars*integer

Maximum segment size (characters)

splitterstring

Default: RECURSIVE

Possible Values

RECURSIVE, PARAGRAPH, LINE, SENTENCE, WORD

DocumentSplitter type

Recommended: RECURSIVE for generic text. It splits into paragraphs first and fits as many as possible into a single TextSegment. If paragraphs are too long, they are recursively split into lines, then sentences, then words, then characters until they fit into a segment.
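As a sketch of how the splitter properties fit together, the fragment below would sit under the IngestDocument task. The size values are illustrative assumptions, not recommendations from this page.

```yaml
documentSplitter:
  splitter: RECURSIVE           # paragraphs first, then lines, sentences, words, characters
  maxSegmentSizeInChars: 1000   # illustrative segment size
  maxOverlapSizeInChars: 100    # illustrative overlap; only full sentences are considered
```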

dropbooleanstring

Default: false

Drop the store before ingestion (useful for testing)

fromDocumentsarray

List of inline documents

Definitions

io.kestra.plugin.ai.rag.IngestDocument-InlineDocument

content*string

Document content

metadataobject

Document metadata
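A minimal sketch of ingesting inline documents, reusing the provider and store from the example at the top of this page. The flow id, document text, and metadata keys are made up for illustration.

```yaml
id: inline_document_ingestion
namespace: company.ai

tasks:
  - id: ingest
    type: io.kestra.plugin.ai.rag.IngestDocument
    provider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ kv('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.ai.embeddings.KestraKVStore
    fromDocuments:
      # Each item needs content; metadata is optional
      - content: Kestra is an open-source orchestration platform.
        metadata:
          source: inline        # hypothetical metadata keys
          topic: overview
```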

fromExternalURLsarray

SubTypestring

List of document URLs from external sources

fromInternalURIsarray

SubTypestring

List of internal storage URIs for documents

Pebble expression referencing an internal storage URI, e.g., {{ outputs.mytask.uri }}.
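The following sketch shows one way to feed an internal storage URI into the task. It assumes the core `io.kestra.plugin.core.http.Download` task, which is not part of this plugin, is used to fetch the file first; the flow and task ids are placeholders.

```yaml
id: internal_uri_ingestion
namespace: company.ai

tasks:
  # Assumed helper task: downloads the file into Kestra's internal storage
  - id: download
    type: io.kestra.plugin.core.http.Download
    uri: https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/content/blogs/release-0-24.md

  - id: ingest
    type: io.kestra.plugin.ai.rag.IngestDocument
    provider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ kv('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.ai.embeddings.KestraKVStore
    fromInternalURIs:
      # Pebble expression resolving to the downloaded file's internal URI
      - "{{ outputs.download.uri }}"
```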

fromPathstring

Path in the task working directory containing documents to ingest

Each document in the directory will be ingested into the embedding store. Ingestion is recursive and protected against path traversal (CWE-22).

metadataobject

SubTypestring

Additional metadata to add to all ingested documents

Outputs

embeddingStoreOutputsobject

Additional outputs from the embedding store

ingestedDocumentsinteger

Number of ingested documents

inputTokenCountinteger

Input token count

outputTokenCountinteger

Output token count

totalTokenCountinteger

Total token count
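Downstream tasks can read these outputs through Pebble expressions. The sketch below assumes the ingestion task has the id `ingest`, as in the example at the top of this page, and uses the core Log task, which is not part of this plugin.

```yaml
tasks:
  # ... the `ingest` task goes here ...
  - id: log_ingestion
    type: io.kestra.plugin.core.log.Log
    message: "Ingested {{ outputs.ingest.ingestedDocuments }} documents"
```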

Metrics

indexed.documentscounter

Unitrecords

Number of indexed documents

input.token.countcounter

Unittoken

Large Language Model (LLM) input token count

output.token.countcounter

Unittoken

Large Language Model (LLM) output token count

total.token.countcounter

Unittoken

Large Language Model (LLM) total token count

1.3.0