Create ML Models with BigQuery ML Challenge Lab Walkthrough

Klaus
5 min read · Jan 30, 2021

Introduction

While working through the quest Create ML with BigQuery ML, I got stuck on the challenge lab Create ML Models with BigQuery ML: Challenge Lab because of overly strict checkpoints that did not let me pass even though I had the correct answer.

I am writing this to help anyone else who gets stuck, and I would also like to share some of my thoughts at the end.

Before you begin

The challenge uses the Austin Bike dataset. It is worth having a look at the tables first to get a basic idea of the data before you start working with it.

Preview of the bikeshare_stations table
Preview of the bikeshare_trips table

The models we train in this challenge aim to predict the duration_minutes of the bike trips.

Task 1: Create a dataset to store your machine learning models

This dataset will be used to store our BQML models for the challenge. The name of the dataset does not matter. Let’s just call it bike.

You can do this part either from the UI or with the bq command.

Select your project and click ‘Create Dataset’
Just enter a dataset name and click ‘Create dataset’

If you would like to use the CLI, open Cloud Shell from the upper-right corner and run the command bq mk bike

Task 2: Create a forecasting BigQuery machine learning model

This task asks us to train a model based on:

  • starting station name
  • the hour the trip started
  • the weekday of the trip
  • location of the start station
  • data in the year 2018

To get the location of the start station, we have to join the bikeshare_stations table.

I named this model location_model but the name does not really matter. The query looks like this:

CREATE OR REPLACE MODEL bike.location_model
OPTIONS
  (model_type='linear_reg', labels=['duration_minutes']) AS
SELECT
  start_station_name,
  EXTRACT(HOUR FROM start_time) AS start_hour,
  EXTRACT(DAYOFWEEK FROM start_time) AS day_of_week,
  location,
  duration_minutes
FROM
  `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS trips
JOIN
  `bigquery-public-data.austin_bikeshare.bikeshare_stations` AS stations
ON
  trips.start_station_name = stations.name
WHERE
  EXTRACT(YEAR FROM start_time) = 2018
  AND duration_minutes > 0
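If you want to sanity-check that the model actually trained, BigQuery ML exposes per-iteration training statistics through ML.TRAINING_INFO. A quick sketch, assuming you kept the bike.location_model name:

```sql
-- Sketch: inspect the training run of the model created above.
-- Each row is one training iteration with its loss and duration.
SELECT
  iteration,
  loss,
  eval_loss,
  duration_ms
FROM
  ML.TRAINING_INFO(MODEL bike.location_model)
ORDER BY
  iteration
```

A steadily decreasing loss is a good sign that the linear regression converged.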

Task 3: Create the second machine learning model

Similar to task 2, this task requires you to train a second model based on:

  • starting station name
  • the bike share subscriber type
  • the hour the trip started
  • data in the year 2018

I named the model subscriber_model and the query looks like this:

CREATE OR REPLACE MODEL bike.subscriber_model
OPTIONS
  (model_type='linear_reg', labels=['duration_minutes']) AS
SELECT
  start_station_name,
  EXTRACT(HOUR FROM start_time) AS start_hour,
  subscriber_type,
  duration_minutes
FROM
  `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS trips
WHERE
  EXTRACT(YEAR FROM start_time) = 2018
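Since a linear regression is interpretable, you can also peek at the learned coefficients with ML.WEIGHTS to see how much each input shifts the predicted duration. A sketch, assuming the bike.subscriber_model name from above:

```sql
-- Sketch: list the learned weights of the linear model.
-- Categorical inputs (e.g. subscriber_type) expand into one weight
-- per category, returned in the nested category_weights column.
SELECT
  processed_input,
  weight,
  category_weights
FROM
  ML.WEIGHTS(MODEL bike.subscriber_model)
```

This is not required to pass the check, but it is a useful habit when comparing two models trained on different features.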

Task 4: Evaluate the two machine learning models

This task requires you to get the evaluation metrics for both models:

  • Mean Absolute Error
  • Root Mean Square Error

Just run the following queries:

-- Evaluation metrics for location_model
SELECT
  SQRT(mean_squared_error) AS rmse,
  mean_absolute_error
FROM
  ML.EVALUATE(MODEL bike.location_model, (
    SELECT
      start_station_name,
      EXTRACT(HOUR FROM start_time) AS start_hour,
      EXTRACT(DAYOFWEEK FROM start_time) AS day_of_week,
      location,
      duration_minutes
    FROM
      `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS trips
    JOIN
      `bigquery-public-data.austin_bikeshare.bikeshare_stations` AS stations
    ON
      trips.start_station_name = stations.name
    WHERE
      EXTRACT(YEAR FROM start_time) = 2019))

and

-- Evaluation metrics for subscriber_model
SELECT
  SQRT(mean_squared_error) AS rmse,
  mean_absolute_error
FROM
  ML.EVALUATE(MODEL bike.subscriber_model, (
    SELECT
      start_station_name,
      EXTRACT(HOUR FROM start_time) AS start_hour,
      subscriber_type,
      duration_minutes
    FROM
      `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS trips
    WHERE
      EXTRACT(YEAR FROM start_time) = 2019))
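The lab only asks for MAE and RMSE, but ML.EVALUATE computes several more regression metrics (R², median absolute error, explained variance). If you are curious, a SELECT * works too; as a sketch, when you omit the input query, BigQuery ML evaluates on the evaluation data it held out during training:

```sql
-- Sketch: all evaluation metrics for the model, computed on the
-- evaluation data from training (no separate input query supplied).
SELECT *
FROM
  ML.EVALUATE(MODEL bike.subscriber_model)
```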

Task 5: Use the subscriber type machine learning model to predict average trip durations

In this task, we are required to calculate the average predicted trip duration for the busiest station, using 2019 trips where subscriber_type is Single Trip.

According to the reviews, this is where most people got stuck. The key to this task is to keep two separate queries:

  1. Finding the busiest station in 2019
  2. Calculating the average trip time

Some challengers for the quest spent time combining both queries into one and did not pass the check. All you have to do is hard-code the station name in the query that calculates the average trip time, and you will be fine.

First, we find the busiest station with this query:

SELECT
  start_station_name,
  COUNT(*) AS trips
FROM
  `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE
  EXTRACT(YEAR FROM start_time) = 2019
GROUP BY
  start_station_name
ORDER BY
  trips DESC
The busiest station is 21st & Speedway @PCL

As you can see, the busiest station is 21st & Speedway @PCL. Note that there are two rows for what appears to be the same station, with slightly different names. Just use the first one in the query in order to pass the check.

SELECT
  AVG(predicted_duration_minutes) AS average_predicted_trip_length
FROM
  ML.PREDICT(MODEL bike.subscriber_model, (
    SELECT
      start_station_name,
      EXTRACT(HOUR FROM start_time) AS start_hour,
      subscriber_type,
      duration_minutes
    FROM
      `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    WHERE
      EXTRACT(YEAR FROM start_time) = 2019
      AND subscriber_type = 'Single Trip'
      AND start_station_name = '21st & Speedway @PCL'))

After running this query, you should pass the check of task 5.

Thoughts

It is not a good idea to key training and querying on station names, as Task 2, Task 3, and Task 5 require. Names can change over time or contain typos, which introduces errors into the data we use.

Problem of duplicated stations caused by using names for grouping

As we saw above in the busiest-station query, there are two rows for 21st & Speedway due to the different characters used in the name. It would be much better to use start_station_id.
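As a sketch of that alternative, the busiest-station query could be keyed on the ID instead, joining the stations table purely to attach a readable name (this assumes the start_station_id and station_id columns present in the public dataset):

```sql
-- Sketch: count trips per station ID, which is stable, and join the
-- stations table only to display a human-readable name.
SELECT
  trips.start_station_id,
  stations.name AS station_name,
  COUNT(*) AS num_trips
FROM
  `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS trips
JOIN
  `bigquery-public-data.austin_bikeshare.bikeshare_stations` AS stations
ON
  trips.start_station_id = stations.station_id
WHERE
  EXTRACT(YEAR FROM start_time) = 2019
GROUP BY
  trips.start_station_id, stations.name
ORDER BY
  num_trips DESC
```

Grouping by the ID collapses the near-duplicate name variants into a single row per station.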

Despite the issue above, along with some ambiguous instructions and overly strict checks, the challenge still helped me practice what I have learnt in the quest. I hope someone finds this walkthrough useful, and thank you for reading.



Klaus

A data engineer from Hong Kong. I share tools and experience for better productivity. Follow to get my latest stories!