IngestDocument

Ingest documents into an embedding store.

Only text documents (TXT, HTML, Markdown) are supported for now.

yaml
type: "io.kestra.plugin.ai.rag.IngestDocument"

Ingest documents into a KV embedding store. WARNING: the KV embedding store is for quick prototyping only, as it stores the embedding vectors in a K/V store and loads them all into memory.

yaml
id: document-ingestion
namespace: company.team

tasks:
  - id: ingest
    type: io.kestra.plugin.ai.rag.IngestDocument
    provider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ secret('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.ai.embeddings.KestraKVStore
    drop: true
    fromExternalURLs:
      - https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/content/blogs/release-0-22.md
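
To make the prototyping warning concrete, here is a minimal Python sketch of a store that keeps every vector under a single key and loads all of them for each search. It is an illustration under assumptions, not the plugin's implementation: `kv_store`, `kv_save`, `kv_search`, the JSON layout, and the linear cosine scan are all hypothetical.

```python
import json
import math

# Hypothetical stand-in for a K/V store: one entry holds every embedding.
kv_store = {}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def kv_save(entries, key="demo-embedding-store"):
    """Serialize all entries into a single K/V value."""
    kv_store[key] = json.dumps(entries)


def kv_search(query, key="demo-embedding-store", top_k=1):
    """Load the whole store into memory, then scan it linearly."""
    entries = json.loads(kv_store[key])
    ranked = sorted(entries, key=lambda e: cosine(e["vector"], query), reverse=True)
    return [e["text"] for e in ranked[:top_k]]


kv_save([
    {"text": "release notes", "vector": [1.0, 0.0]},
    {"text": "pricing page", "vector": [0.0, 1.0]},
])
print(kv_search([0.9, 0.1]))
```

Memory use and search time both grow with the number of stored vectors, which is why a dedicated vector database is the better fit beyond small experiments.
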
Properties

Embedding Store Provider

Language Model Provider

This provider must be configured with an embedding model.

The document splitter

Default false

Whether to drop the store before ingestion. Useful for testing purposes.

A list of inline documents

SubType string

A list of document URLs from external sources

SubType string

A list of internal storage URIs representing documents

A path inside the task working directory that contains documents to ingest

Each document inside the directory will be ingested into the embedding store. The directory is traversed recursively, and the path is protected against path traversal (CWE-22).
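
The recursive listing and the path traversal guard can be sketched in Python as follows. This is only an illustration of the general technique, assuming the guard works by resolving the path and checking that it stays inside the working directory; `resolve_inside` and `list_documents` are hypothetical names, not part of the plugin.

```python
from pathlib import Path


def resolve_inside(working_dir: Path, relative: str) -> Path:
    """Resolve `relative` against `working_dir`, refusing paths that escape it."""
    base = working_dir.resolve()
    candidate = (base / relative).resolve()
    if not candidate.is_relative_to(base):  # requires Python 3.9+
        raise ValueError(f"path escapes the working directory: {relative}")
    return candidate


def list_documents(working_dir: Path, relative: str) -> list[Path]:
    """Return every file under the given directory, including subdirectories."""
    root = resolve_inside(working_dir, relative)
    return sorted(p for p in root.rglob("*") if p.is_file())
```

With this check, a value such as `../outside` is rejected before any file is read, while `docs` yields every file below `docs`, at any depth.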

SubType string

Additional metadata that will be added to all ingested documents

Additional outputs from the embedding store.

The number of ingested documents

The input token count

The output token count

The total token count

API endpoint

The Azure OpenAI endpoint in the format: https://{resource}.openai.azure.com/

Model name

API Key

Client ID

Client secret

API version

Tenant ID

The database name

The database server host

The database password

The database server port

The table to store embeddings in

The database user

Default false

Whether to use an IVFFlat index

An IVFFlat index divides vectors into lists, and then searches a subset of those lists closest to the query vector. It has faster build times and uses less memory than HNSW but has lower query performance (in terms of speed-recall tradeoff).
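
The list-and-probe idea described above can be sketched in plain Python. This is an illustration only, not the database's implementation: the centroids are supplied by hand (a real index typically derives them by clustering), and `build_lists`, `ivf_search`, and `probes` are hypothetical names.

```python
import math


def nearest_centroid(centroids, vector):
    """Index of the centroid closest to `vector`."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], vector))


def build_lists(vectors, centroids):
    """Assign each vector to the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for vector in vectors:
        lists[nearest_centroid(centroids, vector)].append(vector)
    return lists


def ivf_search(query, centroids, lists, probes=1):
    """Search only the `probes` lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(centroids[i], query))
    candidates = [v for i in order[:probes] for v in lists[i]]
    return min(candidates, key=lambda v: math.dist(v, query), default=None)


vectors = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids = [(0, 0.5), (10, 10.5)]
lists = build_lists(vectors, centroids)
print(ivf_search((9, 9), centroids, lists, probes=1))
```

Probing fewer lists makes a query cheaper but can miss a neighbour stored in a list that was not searched, which is the speed-recall tradeoff mentioned above.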

The API key

The collection name

The database server host

The database server port

Endpoint URL

Project location

Model name

Project ID

API Key

Model name

The collection name

The host

The index name

The scheme (e.g. mongodb+srv)

Create the index

The database

SubType string

The metadata field names

The connection string options

The password

The username

API Key

Model name

API base URL

Default {{flow.id}}-embedding-store

The name of the K/V entry to use

The database base URL

The collection name

Basic auth password.

Basic auth username.

The token

Whether to auto flush on delete

Whether to auto flush on insert

The collection name

If there is no such collection yet, it will be created automatically. Default value: "default".

The consistency level

The database name

If not provided, the default database will be used.

The host

Default value: "localhost"

The id field name

The index type

The metadata field name

The metric type

The password

If user authentication and TLS are enabled, this parameter is required. See: https://milvus.io/docs/authenticate.md

The port

Default value: "19530"

Whether to retrieve embeddings on search

The text field name

The uri

The username

If user authentication and TLS are enabled, this parameter is required. See: https://milvus.io/docs/authenticate.md

The vector field name

API Key

Model name

Default https://api.deepseek.com/v1

API base URL

The API key

The cloud provider

The index

The cloud provider region

The namespace (default will be used if not provided)

API Key

Model name

The maximum size of the overlap, defined in characters. Only full sentences are considered for the overlap.

The maximum size of the segment, defined in characters.
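
One way to read the overlap rule is that the next segment starts with the trailing full sentences of the previous one, as many as fit within the overlap limit. The Python sketch below illustrates that reading. It is an assumption about the behaviour, not the splitter's actual code, and `sentence_overlap` and the simple punctuation-based sentence detection are hypothetical.

```python
import re


def sentence_overlap(previous_segment: str, max_overlap_chars: int) -> str:
    """Return the trailing full sentences of a segment that fit in the overlap."""
    # Naive sentence detection: split after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", previous_segment.strip())
    overlap = ""
    for sentence in reversed(sentences):
        candidate = sentence if not overlap else sentence + " " + overlap
        if len(candidate) > max_overlap_chars:
            break  # never cut a sentence in half
        overlap = candidate
    return overlap


print(sentence_overlap("First sentence. Second one. Third.", 20))
```

Under this reading, an overlap limit smaller than the last sentence produces no overlap at all, because partial sentences are never carried over.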

Default RECURSIVE
Possible Values
RECURSIVE, PARAGRAPH, LINE, SENTENCE, WORD

The type of the DocumentSplitter

We recommend using a RECURSIVE DocumentSplitter for generic text. It tries to split the document into paragraphs first and fits as many paragraphs into a single TextSegment as possible. If some paragraphs are too long, they are recursively split into lines, then sentences, then words, and then characters until they fit into a segment.
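
The recursive strategy described above can be sketched in a few lines of Python. This is a simplified illustration, not the library's implementation: it ignores overlap, treats `". "` as the only sentence boundary, and drops the separator at segment boundaries. `split_recursive` and `SEPARATORS` are hypothetical names.

```python
# Separators tried in order: paragraph, line, sentence, word.
SEPARATORS = ["\n\n", "\n", ". ", " "]


def split_recursive(text: str, max_chars: int, level: int = 0) -> list[str]:
    """Split `text` into segments of at most `max_chars` characters."""
    if len(text) <= max_chars:
        return [text] if text else []
    if level == len(SEPARATORS):
        # Last resort: cut by characters.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep = SEPARATORS[level]
    segments, current = [], ""
    for part in text.split(sep):
        joined = part if not current else current + sep + part
        if len(joined) <= max_chars:
            current = joined  # keep packing parts into the current segment
            continue
        if current:
            segments.append(current)
            current = ""
        if len(part) <= max_chars:
            current = part
        else:
            # The part alone is too long: split it with the next, finer separator.
            segments.extend(split_recursive(part, max_chars, level + 1))
    if current:
        segments.append(current)
    return segments


print(split_recursive("aaa bbb\n\nccc ddd eee", max_chars=8))
```

Here the first paragraph fits into one segment, while the second is too long and falls through to word-level splitting.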

Weaviate API key

Your Weaviate API key. Not required for local deployment.

Weaviate host

The host of the cluster URL, e.g. "ai-4jw7ufd9.weaviate.network". Find it under Details of your Weaviate cluster.

Weaviate avoid dups

If true (default), WeaviateEmbeddingStore generates a hashed ID based on the provided text segment, which avoids duplicate entries in the database. If false, a random ID is generated.
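
The effect of a content-derived ID can be sketched in Python. The hash function and ID format used here (SHA-256 truncated into a UUID) are assumptions for illustration, not necessarily what WeaviateEmbeddingStore does, and `segment_id` is a hypothetical helper.

```python
import hashlib
import uuid


def segment_id(text: str, avoid_dups: bool = True) -> str:
    """Derive an ID from the text, or generate a random one."""
    if avoid_dups:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return str(uuid.UUID(bytes=digest[:16]))  # same text, same ID
    return str(uuid.uuid4())  # new ID on every call


store = {}
for text in ["hello", "hello", "world"]:
    store[segment_id(text)] = text  # the second "hello" overwrites the first

print(len(store))
```

Because identical text always maps to the same ID, ingesting the same segment twice updates one entry instead of creating a second one.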

Possible Values
ONE, QUORUM, ALL

Weaviate consistency level

Consistency level: ONE, QUORUM (default) or ALL.

gRPC port if used

Weaviate metadata field name

The name of the metadata field to store. If not provided, will default to "_metadata".

SubType string

Weaviate metadata keys

The list of metadata keys to store. If not provided, will default to an empty list.

Weaviate object class

The object class you want to store, e.g. "MyGreatClass". Must start with an uppercase letter. If not provided, will default to "Default".

Weaviate port

The port, e.g. 8080. This parameter is optional.

Weaviate scheme

The scheme of the cluster URL, e.g. "https". Find it under Details of your Weaviate cluster.

The gRPC connection is secured

Use gRPC for inserts

Use gRPC instead of HTTP for batch inserts only. You still need HTTP configured for search.

Model endpoint

Model name

API Key

Model name

API base URL

The content of the document

The metadata of the document

SubType string
Min items 1

List of HTTP Elasticsearch servers.

Each entry must be a URI with scheme and port, such as https://elasticsearch.com:9200.
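
A quick way to check a server entry before using it is to parse it and confirm that both the scheme and the port are present. The Python sketch below shows one such check. `validate_server` is a hypothetical helper, and restricting the scheme to `http` or `https` is an assumption.

```python
from urllib.parse import urlsplit


def validate_server(uri: str) -> tuple[str, str, int]:
    """Return (scheme, host, port), or raise if the scheme or port is missing."""
    parts = urlsplit(uri)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"missing or unsupported scheme: {uri}")
    if not parts.hostname or parts.port is None:
        raise ValueError(f"host and port are required: {uri}")
    return parts.scheme, parts.hostname, parts.port


print(validate_server("https://elasticsearch.com:9200"))
```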

Basic auth configuration.

SubType string

List of HTTP headers to be sent on every request.

Each entry must be a string with the key and value separated by a colon, e.g. Authorization: Token XYZ.
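
The expected header format can be illustrated with a small Python parser. Splitting on the first colon only, so that values may themselves contain colons, is an assumption about how such strings are usually handled. `parse_headers` is a hypothetical helper, not the plugin's code.

```python
def parse_headers(lines: list[str]) -> dict[str, str]:
    """Turn 'Name: value' strings into a header dictionary."""
    headers = {}
    for line in lines:
        name, sep, value = line.partition(":")  # split on the first colon only
        if not sep or not name.strip():
            raise ValueError(f"expected 'Name: value', got: {line!r}")
        headers[name.strip()] = value.strip()
    return headers


print(parse_headers(["Authorization: Token XYZ"]))
```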

Sets the path's prefix for every request used by the HTTP client.

For example, if this is set to /my/path, then any client request will become /my/path/ + endpoint. In essence, every request's endpoint is prefixed by this pathPrefix. The path prefix is useful when Elasticsearch is behind a proxy that provides a base path or a proxy that requires all paths to start with '/'; it is not intended for other purposes and should not be supplied in other scenarios.
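
The prefixing rule amounts to a simple path join, sketched here in Python. `with_path_prefix` is a hypothetical helper that only mirrors the description above, and normalizing duplicate slashes is an assumption.

```python
from typing import Optional


def with_path_prefix(prefix: Optional[str], endpoint: str) -> str:
    """Prepend the path prefix, if any, to a request endpoint."""
    if not prefix:
        return "/" + endpoint.lstrip("/")
    return "/" + prefix.strip("/") + "/" + endpoint.lstrip("/")


print(with_path_prefix("/my/path", "_search"))
```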

Whether the REST client should return any response containing at least one warning header as a failure.

Trust all SSL CA certificates.

Use this if the server is using a self-signed SSL certificate.

The name of the index to store embeddings

AWS Access Key ID

Model name

AWS Secret Access Key

Default COHERE
Possible Values
COHERE, TITAN

Amazon Bedrock Embedding Model Type