Call the Hugging Face Inference API.

The Serverless Inference API offers a fast and free way to explore thousands of models for a variety of tasks. Whether you’re prototyping a new application or experimenting with ML capabilities, this API gives you instant access to high-performing models across multiple domains:

- Text Generation: Generate and experiment with high-quality responses from large language models, including tool-calling prompts.
- Image Generation: Easily create customized images, including LoRAs for your own styles.
- Document Embeddings: Build search and retrieval systems with SOTA embeddings.
- Classical AI Tasks: Ready-to-use models for text classification, image classification, speech recognition, and more.
```yaml
type: "io.kestra.plugin.huggingface.Inference"
```

Use inference for text classification.

```yaml
id: huggingface_inference_text
namespace: company.team

tasks:
- id: huggingface_inference
  type: io.kestra.plugin.huggingface.Inference
  model: cardiffnlp/twitter-roberta-base-sentiment-latest
  apiKey: "{{ secret('HUGGINGFACE_API_KEY') }}"
  inputs: "I want a refund"
```

Use inference for image classification.

```yaml
id: huggingface_inference
namespace: company.team

tasks:
- id: huggingface_inference_image
  type: io.kestra.plugin.huggingface.Inference
  model: google/vit-base-patch16-224
  apiKey: "{{ secret('HUGGINGFACE_API_KEY') }}"
  inputs: "{{ read('my-base64-image.txt') }}"
  parameters:
    function_to_apply: sigmoid
    top_k: 3
  waitForModel: true
  useCache: false
```
Properties

API Key

Hugging Face API key (e.g., hf_********)

Inputs

The inputs required by the specific model.

Model

Model used for the Inference API (e.g., cardiffnlp/twitter-roberta-base-sentiment-latest, google/gemma-2-2b-it)

Endpoint

Default https://api-inference.huggingface.co/models

The Hugging Face Inference API endpoint.
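To call a different deployment, such as a dedicated Inference Endpoint, the default can be overridden. A minimal sketch, assuming the property is named `endpoint` and the deployment exposes a models-style base URL (the URL below is a placeholder):

```yaml
- id: huggingface_inference_custom_endpoint
  type: io.kestra.plugin.huggingface.Inference
  # Placeholder URL: replace with your own deployment's base URL.
  endpoint: https://my-dedicated-endpoint.example.com/models
  model: cardiffnlp/twitter-roberta-base-sentiment-latest
  apiKey: "{{ secret('HUGGINGFACE_API_KEY') }}"
  inputs: "I want a refund"
```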

Options

The options used to customize the HTTP client (detailed below).

Parameters

A map of optional parameters, depending on the model.
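As an illustration, a text-generation model takes its generation settings through this map. A minimal sketch using standard Hugging Face text-generation parameters (max_new_tokens, temperature):

```yaml
- id: huggingface_generate
  type: io.kestra.plugin.huggingface.Inference
  model: google/gemma-2-2b-it
  apiKey: "{{ secret('HUGGINGFACE_API_KEY') }}"
  inputs: "Summarize the benefits of workflow orchestration in one sentence."
  parameters:
    # Standard Hugging Face text-generation parameters.
    max_new_tokens: 100
    temperature: 0.7
```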

Use cache

Default true

There is a cache layer on the Inference API to speed up requests when the inputs are exactly the same. Deterministic models, such as classifiers and embedding models, can reuse those cached results as-is because repeated calls return the same output. For a nondeterministic model, however, you can disable the cache so that each call issues a genuinely new query.
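For example, when re-running a generative model where each execution should produce a fresh response:

```yaml
- id: huggingface_fresh_generation
  type: io.kestra.plugin.huggingface.Inference
  model: google/gemma-2-2b-it
  apiKey: "{{ secret('HUGGINGFACE_API_KEY') }}"
  inputs: "Write a short motivational quote."
  # Bypass the Inference API cache so repeated runs return new generations.
  useCache: false
```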

Wait for model

Default false

When a model is warm, it is ready to be used and you will get a response relatively quickly. Some models, however, are cold and need to be loaded before they can be used; in that case the API returns a 503 error. Enable this property to wait for the model to load instead of failing, as the image classification example above does.
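To tolerate a cold start instead of failing on the 503:

```yaml
- id: huggingface_cold_start
  type: io.kestra.plugin.huggingface.Inference
  model: google/vit-base-patch16-224
  apiKey: "{{ secret('HUGGINGFACE_API_KEY') }}"
  inputs: "{{ read('my-base64-image.txt') }}"
  # Block until the model has loaded rather than failing with a 503.
  waitForModel: true
```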

Outputs

Output

The output returned by the Hugging Face API.

HTTP client options

The following options can be set through the Options property.

Connect timeout

Format duration

The time allowed to establish a connection to the server before failing.

Connection pool idle timeout

Default PT0S
Format duration

The time an idle connection can remain in the client's connection pool before being closed.

Default charset

Default UTF-8

The default charset for the request.

Max content length

Default 10485760

The maximum content length of the response, in bytes.

Read idle timeout

Default PT5M
Format duration

The time allowed for a read connection to remain idle before closing it.

Read timeout

Default PT10S
Format duration

The maximum time allowed for reading data from the server before failing.
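A minimal sketch of tuning the client through the Options map, assuming the keys mirror camel-cased versions of the option names above (an assumption, not confirmed by this page):

```yaml
- id: huggingface_inference_tuned
  type: io.kestra.plugin.huggingface.Inference
  model: cardiffnlp/twitter-roberta-base-sentiment-latest
  apiKey: "{{ secret('HUGGINGFACE_API_KEY') }}"
  inputs: "I want a refund"
  options:
    # Assumed key names; values are ISO-8601 durations like the defaults above.
    connectTimeout: PT10S
    readTimeout: PT30S
```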