Source
yaml
id: wikipedia-top10-python-pandas
namespace: company.team
description: Analyze top 10 Wikipedia pages
tasks:
  - id: query
    type: io.kestra.plugin.gcp.bigquery.Query
    sql: |
      SELECT DATETIME(datehour) as date, title, views FROM
      `bigquery-public-data.wikipedia.pageviews_2024`
      WHERE DATE(datehour) = current_date() and wiki = 'en'
      ORDER BY datehour desc, views desc
      LIMIT 10
    store: true
    projectId: test-project
    serviceAccount: "{{ secret('GCP_SERVICE_ACCOUNT_JSON') }}"
  - id: write_csv
    type: io.kestra.plugin.serdes.csv.IonToCsv
    from: "{{ outputs.query.uri }}"
  - id: pandas
    type: io.kestra.plugin.scripts.python.Script
    warningOnStdErr: false
    taskRunner:
      type: io.kestra.plugin.scripts.runner.docker.Docker
    containerImage: ghcr.io/kestra-io/pydata:latest
    inputFiles:
      data.csv: "{{ outputs.write_csv.uri }}"
    script: |
      import pandas as pd
      from kestra import Kestra
      # Load the CSV file produced by the write_csv task
      df = pd.read_csv("data.csv")
      # Print a preview so it shows up in the task logs
      print(df.head(10))
      # Find the highest view count and expose it as a task output
      views = df['views'].max()
      Kestra.outputs({'views': int(views)})
About this blueprint
Metrics Python BigQuery Outputs
This flow will do the following:
- Use the bigquery.Query task to query the top 10 Wikipedia pages for the current day.
- Use IonToCsv to store the results in a CSV file.
- Use the python.Script task to read the CSV file and use pandas to find the maximum number of views.
- Use Kestra outputs to track the maximum number of views over time.
The Python script will run in a Docker container based on the public image ghcr.io/kestra-io/pydata:latest.
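To make the computation in the Python step concrete, here is a minimal sketch that can be run locally without Kestra, BigQuery, or pandas. It swaps pandas for the standard library csv module, and the sample rows and the max_views helper are hypothetical, chosen only to mirror the date, title, and views columns selected by the query.

```python
import csv
import io

# Hypothetical sample rows in the same shape as the query result
# (date, title, views); the values are made up for illustration.
SAMPLE_CSV = """date,title,views
2024-01-01T10:00:00,Main_Page,1200
2024-01-01T10:00:00,Special:Search,800
2024-01-01T09:00:00,Main_Page,1500
"""


def max_views(csv_text: str) -> int:
    """Return the largest value in the 'views' column of a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Cast each value to int so the result is a plain Python integer,
    # mirroring the int(views) cast in the flow's script.
    return max(int(row["views"]) for row in reader)


if __name__ == "__main__":
    # In the flow, this dictionary is what gets passed to Kestra.outputs().
    print({"views": max_views(SAMPLE_CSV)})
```

In the flow itself, pandas does the same job with df['views'].max(), and the resulting dictionary is handed to Kestra.outputs() instead of being printed.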
By default, the BigQuery task exposes a variety of metrics, such as:
- total.bytes.billed
- total.partitions.processed
- number of rows processed
- query duration
You can view those metrics on the Execution page in the Metrics tab.