IngestDocument

This plugin is currently in beta. While it is considered safe for use, its API may change in backward-incompatible ways in future releases, or the plugin may become unsupported.

Ingest documents into an embedding store.

Only text documents (TXT, HTML, Markdown) are supported for now.

yaml
type: "io.kestra.plugin.langchain4j.rag.IngestDocument"

Examples

Ingest documents into a KV embedding store.

WARNING: the KV embedding store is for quick prototyping only, as it stores the embedding vectors in a K/V store and loads them all into memory.

yaml
id: document-ingestion
namespace: company.team

tasks:
  - id: ingest
    type: io.kestra.plugin.langchain4j.rag.IngestDocument
    provider:
      type: io.kestra.plugin.langchain4j.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ secret('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.langchain4j.embeddings.KestraKVStore
    drop: true
    fromExternalURLs:
      - https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/content/blogs/release-0-22.md

Properties

embeddings *Elasticsearch KestraKVStore PGVector

Embedding Store Provider

provider *AmazonBedrock Anthropic AzureOpenAI DeepSeek GoogleGemini GoogleVertexAI MistralAI Ollama OpenAI

Language Model Provider

This provider must be configured with an embedding model.

documentSplitter IngestDocument-DocumentSplitter

The document splitter

drop boolean string

Default false

Whether to drop the store before ingestion. Useful for testing purposes.

fromDocuments array

SubType IngestDocument-InlineDocument

A list of inline documents
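
As an illustration, inline documents could be declared as in the sketch below. It reuses the provider and embedding store from the example above; the flow id, document contents, and metadata key are hypothetical.

```yaml
id: inline-document-ingestion   # hypothetical flow id
namespace: company.team

tasks:
  - id: ingest
    type: io.kestra.plugin.langchain4j.rag.IngestDocument
    provider:
      type: io.kestra.plugin.langchain4j.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ secret('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.langchain4j.embeddings.KestraKVStore
    fromDocuments:
      # Each entry requires `content`; `metadata` is optional.
      - content: "Kestra is an orchestration platform."
        metadata:
          source: handbook   # hypothetical metadata key
      - content: "Flows are defined in YAML."
```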

fromExternalURLs array

SubType string

A list of document URLs from external sources

fromInternalURIs array

SubType string

A list of internal storage URIs representing documents

Pebble expression referencing an Internal Storage URI e.g. {{ outputs.mytask.uri }}.
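
A minimal sketch of this pattern, assuming the core `io.kestra.plugin.core.http.Download` task is used to place a file in internal storage first. The flow id and task ids are hypothetical; the provider and embedding store are taken from the example above.

```yaml
id: ingest-from-internal-storage   # hypothetical flow id
namespace: company.team

tasks:
  # Assumption: the Download task stores the file in internal storage
  # and exposes its location as the `uri` output.
  - id: download
    type: io.kestra.plugin.core.http.Download
    uri: https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/content/blogs/release-0-22.md

  - id: ingest
    type: io.kestra.plugin.langchain4j.rag.IngestDocument
    provider:
      type: io.kestra.plugin.langchain4j.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ secret('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.langchain4j.embeddings.KestraKVStore
    fromInternalURIs:
      - "{{ outputs.download.uri }}"
```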

fromPath string

A path inside the task working directory that contains documents to ingest

Each document inside the directory will be ingested into the embedding store. The directory is traversed recursively, and the path is protected against path traversal (CWE-22).

metadata object

SubType string

Additional metadata that will be added to all ingested documents
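
For instance, the same metadata could be attached to every document ingested by a task, as sketched below. The metadata keys and values are hypothetical; values are written as strings to match the declared subtype.

```yaml
id: document-ingestion-with-metadata   # hypothetical flow id
namespace: company.team

tasks:
  - id: ingest
    type: io.kestra.plugin.langchain4j.rag.IngestDocument
    provider:
      type: io.kestra.plugin.langchain4j.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ secret('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.langchain4j.embeddings.KestraKVStore
    fromExternalURLs:
      - https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/content/blogs/release-0-22.md
    metadata:
      # Hypothetical keys, added to all documents ingested by this task.
      team: docs
      release: "0.22"
```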

Outputs

embeddingStoreOutputs object

Additional outputs from the embedding store.

ingestedDocuments integer

The number of ingested documents

inputTokenCount integer

The input token count

outputTokenCount integer

The output token count

totalTokenCount integer

The total token count
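
These outputs can be read by downstream tasks through Pebble expressions. The sketch below assumes the core `io.kestra.plugin.core.log.Log` task and an ingestion task with the id `ingest`; the flow id and message wording are hypothetical, and token counts may be empty if the provider does not report them.

```yaml
id: ingestion-report   # hypothetical flow id
namespace: company.team

tasks:
  - id: ingest
    type: io.kestra.plugin.langchain4j.rag.IngestDocument
    provider:
      type: io.kestra.plugin.langchain4j.provider.GoogleGemini
      modelName: gemini-embedding-exp-03-07
      apiKey: "{{ secret('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.langchain4j.embeddings.KestraKVStore
    fromExternalURLs:
      - https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/content/blogs/release-0-22.md

  # Log the outputs of the `ingest` task above.
  - id: report
    type: io.kestra.plugin.core.log.Log
    message: "Ingested {{ outputs.ingest.ingestedDocuments }} documents ({{ outputs.ingest.totalTokenCount }} tokens)"
```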

Definitions

Google VertexAI Model Provider

endpoint *string

Endpoint URL

location *string

Project location

modelName *string

Model name

project *string

Project ID

type *object

Azure OpenAI Model Provider

endpoint *string

API endpoint

The Azure OpenAI endpoint in the format: https://{resource}.openai.azure.com/

modelName *string

Model name

type *object

apiKey string

API Key

clientId string

Client ID

clientSecret string

Client secret

serviceVersion string

API version

tenantId string

Tenant ID

DeepSeek Model Provider

apiKey *string

API Key

modelName *string

Model name

type *object

baseUrl string

Default https://api.deepseek.com/v1

API base URL

io.kestra.plugin.langchain4j.embeddings.Elasticsearch-ElasticsearchConnection

hosts *array

SubType string

Min items 1

List of HTTP Elasticsearch servers.

Must be a URI like https://elasticsearch.com:9200, with scheme and port.

basicAuth Elasticsearch-ElasticsearchConnection-BasicAuth

Basic auth configuration.

headers array

SubType string

List of HTTP headers to be sent on every request.

Each header must be a string with the key and value separated by a colon, e.g. Authorization: Token XYZ.

pathPrefix string

Sets the path's prefix for every request used by the HTTP client.

For example, if this is set to /my/path, then any client request will become /my/path/ + endpoint. In essence, every request's endpoint is prefixed by this pathPrefix. The path prefix is useful when Elasticsearch is behind a proxy that provides a base path or a proxy that requires all paths to start with '/'; it is not intended for other purposes and should not be supplied in other scenarios.

strictDeprecationMode boolean string

Whether the REST client should return any response containing at least one warning header as a failure.

trustAllSsl boolean string

Trust all SSL CA certificates.

Use this if the server is using a self-signed SSL certificate.

Anthropic AI Model Provider

apiKey *string

API Key

modelName *string

Model name

type *object

OpenAI Model Provider

apiKey *string

API Key

modelName *string

Model name

type *object

baseUrl string

API base URL

Ollama Model Provider

endpoint *string

Model endpoint

modelName *string

Model name

type *object
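
A possible provider block for a local Ollama server is sketched below. The type name is an assumption inferred from the naming of `io.kestra.plugin.langchain4j.provider.GoogleGemini`; the endpoint uses Ollama's default local port, and the model name is only an example of an embedding model.

```yaml
# Fragment of an IngestDocument task.
provider:
  # Assumption: type name inferred from the GoogleGemini provider's package.
  type: io.kestra.plugin.langchain4j.provider.Ollama
  endpoint: http://localhost:11434   # Ollama's default local address
  modelName: nomic-embed-text        # example embedding model
```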

io.kestra.plugin.langchain4j.embeddings.Elasticsearch-ElasticsearchConnection-BasicAuth

password string

Basic auth password.

username string

Basic auth username.

In-memory embedding store that then stores its serialized form as a Kestra K/V pair

type *object

kvName string

Default {{flow.id}}-embedding-store

The name of the K/V entry to use
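
Because the default entry name includes the flow id, an explicit name may be convenient when more than one flow needs to refer to the same entry. A minimal sketch, with a hypothetical entry name:

```yaml
# Fragment of an IngestDocument task.
embeddings:
  type: io.kestra.plugin.langchain4j.embeddings.KestraKVStore
  kvName: docs-embedding-store   # hypothetical K/V entry name
```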

Google Gemini Model Provider

apiKey *string

API Key

modelName *string

Model name

type *object

Amazon Bedrock Model Provider

accessKeyId *string

AWS Access Key ID

modelName *string

Model name

secretAccessKey *string

AWS Secret Access Key

type *object

modelType string

Default COHERE

Possible Values

COHERE TITAN

Amazon Bedrock Embedding Model Type
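
The sketch below shows how the model type might be switched to TITAN. The type name is an assumption inferred from the naming of the GoogleGemini provider, the secret names are hypothetical, and the model name is only an example of a Titan embedding model identifier.

```yaml
# Fragment of an IngestDocument task.
provider:
  # Assumption: type name inferred from the GoogleGemini provider's package.
  type: io.kestra.plugin.langchain4j.provider.AmazonBedrock
  accessKeyId: "{{ secret('AWS_ACCESS_KEY_ID') }}"           # hypothetical secret name
  secretAccessKey: "{{ secret('AWS_SECRET_ACCESS_KEY') }}"   # hypothetical secret name
  modelName: "amazon.titan-embed-text-v2:0"                  # example Titan embedding model
  modelType: TITAN                                           # overrides the COHERE default
```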

io.kestra.plugin.langchain4j.rag.IngestDocument-InlineDocument

content *string

The content of the document

metadata object

The metadata of the document

PGVector Embedding Store

database *string

The database name

host *string

The database server host

password *string

The database password

port *integer string

The database server port

table *string

The table to store embeddings in

type *object

user *string

The database user

useIndex boolean string

Default false

Whether to use an IVFFlat index

An IVFFlat index divides vectors into lists, and then searches a subset of those lists closest to the query vector. It has faster build times and uses less memory than HNSW but has lower query performance (in terms of speed-recall tradeoff).
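
A PGVector store could be configured along these lines. The type name is an assumption inferred from the naming of `io.kestra.plugin.langchain4j.embeddings.KestraKVStore`; the host, database, table, and secret names are hypothetical.

```yaml
# Fragment of an IngestDocument task.
embeddings:
  # Assumption: type name inferred from the KestraKVStore package.
  type: io.kestra.plugin.langchain4j.embeddings.PGVector
  host: localhost                                   # hypothetical host
  port: 5432                                        # PostgreSQL default port
  database: postgres                                # hypothetical database name
  user: "{{ secret('POSTGRES_USER') }}"             # hypothetical secret name
  password: "{{ secret('POSTGRES_PASSWORD') }}"     # hypothetical secret name
  table: embeddings                                 # hypothetical table name
  useIndex: true                                    # enable the IVFFlat index
```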

Mistral AI Model Provider

apiKey *string

API Key

modelName *string

Model name

type *object

baseUrl string

API base URL

io.kestra.plugin.langchain4j.rag.IngestDocument-DocumentSplitter

maxOverlapSizeInChars *integer

The maximum size of the overlap, defined in characters. Only full sentences are considered for the overlap.

maxSegmentSizeInChars *integer

The maximum size of the segment, defined in characters.

splitter string

Default RECURSIVE

Possible Values

RECURSIVE PARAGRAPH LINE SENTENCE WORD

The type of the DocumentSplitter

We recommend using a RECURSIVE DocumentSplitter for generic text. It tries to split the document into paragraphs first and fits as many paragraphs into a single TextSegment as possible. If some paragraphs are too long, they are recursively split into lines, then sentences, then words, and then characters until they fit into a segment.
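
A splitter configuration might look like the sketch below. The segment and overlap sizes are hypothetical values chosen for illustration, not recommendations from the plugin.

```yaml
# Fragment of an IngestDocument task.
documentSplitter:
  splitter: RECURSIVE            # the default; paragraphs first, then lines, sentences, words
  maxSegmentSizeInChars: 1000    # hypothetical segment size, in characters
  maxOverlapSizeInChars: 100     # hypothetical overlap size, in characters
```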

Elasticsearch Embedding Store

connection *Elasticsearch-ElasticsearchConnection

indexName *string

The name of the index to store embeddings

type *object
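
An Elasticsearch store with basic authentication could be sketched as follows. The type name is taken from the prefix of the connection definition above; the host reuses the documented example URI, while the index name and secret names are hypothetical.

```yaml
# Fragment of an IngestDocument task.
embeddings:
  type: io.kestra.plugin.langchain4j.embeddings.Elasticsearch
  connection:
    hosts:
      - https://elasticsearch.com:9200              # must include scheme and port
    basicAuth:
      username: "{{ secret('ES_USERNAME') }}"       # hypothetical secret name
      password: "{{ secret('ES_PASSWORD') }}"       # hypothetical secret name
  indexName: kestra-embeddings                      # hypothetical index name
```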

