Use BigQuery and Python script running in Docker to analyze Wikipedia page views

Source

yaml

id: wikipedia-top10-python-pandas
namespace: company.team
description: Analyze top 10 Wikipedia pages

tasks:
  - id: query
    type: io.kestra.plugin.gcp.bigquery.Query
    sql: |
      SELECT DATETIME(datehour) as date, title, views FROM
      `bigquery-public-data.wikipedia.pageviews_2024`
      WHERE DATE(datehour) = current_date() and wiki = 'en'
      ORDER BY datehour desc, views desc
      LIMIT 10
    store: true
    projectId: test-project
    serviceAccount: "{{ secret('GCP_SERVICE_ACCOUNT_JSON') }}"

  - id: write_csv
    type: io.kestra.plugin.serdes.csv.IonToCsv
    from: "{{ outputs.query.uri }}"

  - id: pandas
    type: io.kestra.plugin.scripts.python.Script
    taskRunner:
      type: io.kestra.plugin.scripts.runner.docker.Docker
    containerImage: ghcr.io/kestra-io/pydata:latest
    inputFiles:
      data.csv: "{{ outputs.write_csv.uri }}"
    script: |
      import pandas as pd
      from kestra import Kestra

      df = pd.read_csv("data.csv")
      df.head(10)
      views = df['views'].max()
      Kestra.outputs({'views': int(views)})

About this blueprint

Python GCP

This flow will do the following:

Use bigquery.Query task to query the top 10 wikipedia pages for the current day
Use IonToCsv to store the results in a CSV file.
Use python.Script task to read the CSV file and use pandas to find the maximum number of views.
Use Kestra outputs to track the maximum number of views over time.

The Python script will run in a Docker container based on the public image ghcr.io/kestra-io/pydata:latest.

The BigQuery task exposes (by default) a variety of metrics such as:

total.bytes.billed
total.partitions.processed
number of rows processed
query duration

You can view those metrics on the Execution page in the Metrics tab.

Query

Ion To Csv

Script

Docker

More Related Blueprints

PythonGCP

Run Python script in a Docker container based on Google Artifact Registry container image

PythonGCP

Ingest Google Analytics data into a DuckDB database using dlt

New to Kestra?

Use blueprints to kickstart your first workflows.

Get started with Kestra