Source

```yaml
id: wikipedia-top10-python-pandas
namespace: company.team
description: Analyze top 10 Wikipedia pages

tasks:
  - id: query
    type: io.kestra.plugin.gcp.bigquery.Query
    sql: >
      SELECT DATETIME(datehour) as date, title, views
      FROM `bigquery-public-data.wikipedia.pageviews_2024`
      WHERE DATE(datehour) = current_date() AND wiki = 'en'
      ORDER BY datehour desc, views desc
      LIMIT 10
    store: true
    projectId: test-project
    serviceAccount: "{{ secret('GCP_SERVICE_ACCOUNT_JSON') }}"

  - id: write_csv
    type: io.kestra.plugin.serdes.csv.IonToCsv
    from: "{{ outputs.query.uri }}"

  - id: pandas
    type: io.kestra.plugin.scripts.python.Script
    warningOnStdErr: false
    taskRunner:
      type: io.kestra.plugin.scripts.runner.docker.Docker
    containerImage: ghcr.io/kestra-io/pydata:latest
    inputFiles:
      data.csv: "{{ outputs.write_csv.uri }}"
    script: |
      import pandas as pd
      from kestra import Kestra

      df = pd.read_csv("data.csv")
      views = df['views'].max()
      Kestra.outputs({'views': int(views)})
```
About this blueprint

Tags: Metrics, Python, BigQuery, Outputs
This flow will do the following:

1. Use the `bigquery.Query` task to query the top 10 Wikipedia pages for the current day.
2. Use `IonToCsv` to store the results in a CSV file.
3. Use the `python.Script` task to read the CSV file and use pandas to find the maximum number of views.
4. Use Kestra `outputs` to track the maximum number of views over time.

The Python script will run in a Docker container based on the public image `ghcr.io/kestra-io/pydata:latest`.

The BigQuery task exposes (by default) a variety of metrics, such as:

- `total.bytes.billed`
- `total.partitions.processed`
- number of rows processed
- query duration

You can view those metrics on the Execution page in the Metrics tab.
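To see what the `pandas` task does without running the full flow, here is a minimal standalone sketch of the same logic. The column names (`date`, `title`, `views`) mirror the BigQuery query above; the sample rows are made up for illustration, and `Kestra.outputs` is replaced with a plain `print` since the `kestra` package is only available inside a Kestra task run.

```python
# Standalone sketch of the flow's pandas step, outside Kestra.
# Sample rows are invented for illustration; in the real flow the
# CSV comes from the write_csv task via inputFiles.
import io
import pandas as pd

csv_data = io.StringIO(
    "date,title,views\n"
    "2024-05-01T10:00:00,Main_Page,18000\n"
    "2024-05-01T10:00:00,Python_(programming_language),5400\n"
    "2024-05-01T10:00:00,Pandas_(software),1200\n"
)

df = pd.read_csv(csv_data)

# Same computation as the flow's script: the maximum view count.
max_views = int(df["views"].max())
print(max_views)  # → 18000
```

In the actual flow, the final line would be `Kestra.outputs({'views': max_views})`, which is what lets you chart the value over time across executions.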