Unleashing Machine Learning with SQL Queries in BigQuery

BigQuery, Google’s fully managed data warehouse, has gained immense popularity for its speed, scalability, and ease of use. One of its most attractive features is a large catalog of public datasets, which provide a treasure trove of information for a wide range of use cases. Here we’ll explore BigQuery’s public datasets and dive into writing BigQuery Machine Learning (BQML) queries, with examples.

Understanding BigQuery Public Datasets

BigQuery Public Datasets are an invaluable resource. These datasets cover a wide range of topics, including public health, climate, finance, transportation, and more. Google curates and hosts these datasets, making them accessible to anyone with a Google Cloud account.

These datasets are maintained by Google or other trusted sources and are made available for free, enabling users to run SQL queries on large-scale data without the need to host or maintain the data themselves. Leveraging BigQuery Public Datasets can save you both time and resources, as the data is ready for analysis and machine learning tasks.
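
For example, the following query runs against the shakespeare table in the bigquery-public-data.samples dataset and sums word counts per corpus, with no data loading or hosting on your side:

#standardSQL
SELECT
  corpus,
  SUM(word_count) AS total_words
FROM
  `bigquery-public-data.samples.shakespeare`
GROUP BY
  corpus
ORDER BY
  total_words DESC
LIMIT 10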

Accessing BigQuery Public Datasets

To access BigQuery Public Datasets, follow these simple steps:

  1. Create a Google Cloud Account: If you don’t already have one, create a Google Cloud account. You can use the free tier, which includes a $300 credit to get started.
  2. Enable the BigQuery API: Go to the Google Cloud Console and enable the BigQuery API for your project.
  3. Access Public Datasets: Once the API is enabled, navigate to the BigQuery web UI, where you will find the public datasets under the “Resources” section; you can also verify access from code, as shown below.
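
A minimal sketch for verifying access with the google-cloud-bigquery client, assuming your credentials are already configured:

from google.cloud import bigquery

# Uses Application Default Credentials and your default project;
# run `gcloud auth application-default login` first if needed.
client = bigquery.Client()

# List the tables in the public "samples" dataset
for table in client.list_tables("bigquery-public-data.samples"):
    print(table.table_id)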

Running BQML Queries Using Python

We’ll create a classification model to predict whether an email is spam, using an email table from BigQuery’s public “samples” dataset as an illustrative example (the queries assume a table with id, message, and label columns).

from google.cloud import bigquery

# Set your Google Cloud project ID
project_id = "YOUR_PROJECT_ID"

# Create a BigQuery client
client = bigquery.Client(project=project_id)

# Define the dataset and model names (the dataset must already exist)
dataset_name = "your_dataset_name"
model_name = "email_spam_model"

# SQL query to create a logistic regression model.
# input_label_cols is required because the label column is aliased to "spam";
# id is excluded because an identifier is not a meaningful feature.
# The email table is used illustratively; substitute any table that has a
# text column and a binary label column.
sql_query = f"""
#standardSQL
CREATE OR REPLACE MODEL `{project_id}.{dataset_name}.{model_name}`
OPTIONS(model_type='logistic_reg',
        input_label_cols=['spam']) AS
SELECT
  message,
  label AS spam
FROM
  `bigquery-public-data.samples.email`
"""
# Execute the SQL query to create and train the model.
# query_job.result() blocks until the CREATE MODEL statement finishes,
# so training is complete once it returns.
query_job = client.query(sql_query)
query_job.result()

# Fetch the model's metadata to confirm training ran
model = client.get_model(f"{project_id}.{dataset_name}.{model_name}")
print(f"Model {model.model_id} is trained ({len(model.training_runs)} training run(s)).")
# Evaluate the model (here on the training data; a real project should
# hold out a separate test split)
evaluation_query = f"""
#standardSQL
SELECT
  *
FROM
  ML.EVALUATE(MODEL `{project_id}.{dataset_name}.{model_name}`,
    (SELECT
      message,
      label AS spam
    FROM
      `bigquery-public-data.samples.email`))
"""
evaluation_query_job = client.query(evaluation_query)
evaluation = evaluation_query_job.result().to_dataframe()
print("Model Evaluation:")
print(evaluation)
# Retrain the model with auto_class_weights to compensate for class
# imbalance. Note that BigQuery ML has no separate deployment step: as soon
# as CREATE MODEL finishes, the model can be queried with ML.PREDICT.
retrain_query = f"""
#standardSQL
CREATE OR REPLACE MODEL `{project_id}.{dataset_name}.{model_name}`
OPTIONS(model_type='logistic_reg',
        input_label_cols=['spam'],
        auto_class_weights=TRUE) AS
SELECT
  message,
  label AS spam
FROM
  `bigquery-public-data.samples.email`
"""
retrain_query_job = client.query(retrain_query)
retrain_query_job.result()
print(f"Model {model_name} is retrained and ready for predictions.")

Before running this code, replace 'YOUR_PROJECT_ID' with your actual Google Cloud project ID and point dataset_name at a dataset that already exists in your project (the model is created inside it). This code performs the following steps:

  1. Creates a logistic regression model in BigQuery with a CREATE MODEL statement.
  2. Confirms the model exists and reports its training runs.
  3. Evaluates the model’s performance with ML.EVALUATE.
  4. Retrains the model with auto_class_weights=TRUE to handle class imbalance; the result is immediately available for predictions via ML.PREDICT.

Please ensure that you have the necessary permissions and the required Python libraries installed (google-cloud-bigquery, plus pandas and db-dtypes for to_dataframe()). Additionally, make sure that the dataset and model names do not conflict with existing resources in your Google Cloud project.
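
If the target dataset might not exist yet, you can create it idempotently before running the CREATE MODEL statement; a minimal sketch using the same client:

# Create the dataset if it does not already exist
# (exists_ok avoids a "dataset already exists" error)
client.create_dataset(f"{project_id}.{dataset_name}", exists_ok=True)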
