This is my second blog post about running machine learning / deep learning models in BigQuery. This time, I will use the latest (still in preview) capabilities of BigQuery ML that allow you to run ONNX models within BigQuery itself.

In this blogpost, I dig into this new feature to explore its capabilities and limitations (there are a few!). As an example, I will deploy a DistilBERT-based classifier to analyse the sentiment of tweets stored in a BigQuery table.

If you’re interested in deploying custom models to BigQuery (that cannot be converted to ONNX), check out my previous blogpost: Deploy MLflow Models On BigQuery.

TL;DR

You will learn how to deploy ONNX models to BigQuery ML & understand the limitations.

What is not in the scope of this blogpost

For brevity, I’m not training any models here - I’m reusing a publicly available model.

Prerequisites

You will need:

  • Python (I’m using 3.10)
  • access to BigQuery
  • access to Cloud Functions
  • access to Google Cloud Storage bucket
  • Libraries: transformers, datasets, pandas, pandas-gbq, onnxruntime, onnx

Dataset

To make this blogpost both self-contained and complete, here’s how you can easily upload a portion of the tweet_eval dataset into BigQuery. If you already have some text in BigQuery, you can skip this part.

import pandas as pd
from datasets import load_dataset

dataset = load_dataset('tweet_eval', 'sentiment')
validation_subset = dataset['validation'] # <-- use any split you want, here it's just for the demo purposes

df = pd.DataFrame(validation_subset)
df.to_gbq('onnx_demo.twitter_sentiment_validation', '<gcp-project-id>', if_exists='replace')
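
If you want to double-check the upload, you can read a few rows back with pandas-gbq (a quick sanity check; adjust the project id):

import pandas as pd

# Read a few rows back to confirm the table exists and contains data
check_df = pd.read_gbq(
    'SELECT text, label FROM `onnx_demo.twitter_sentiment_validation` LIMIT 5',
    project_id='<gcp-project-id>',
)
print(check_df)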

Preparing the model

The task is simple - find or train a model and convert it to the ONNX format. I’m using an existing one: distilbert-base-uncased-finetuned-sst-2-english. It’s trained on 2 classes (positive and negative) and not on tweets, but for tutorial purposes that’s irrelevant - the goal is to deploy the model and run inference, and the steps for any other model will be the same.

⚠️ Limitation #1
BigQuery ML inference engine only supports ONNX models that are smaller than 450 MB. This is a serious limitation - to put it into context, deploying the popular RoBERTa-based sentiment-analysis model cardiffnlp/twitter-roberta-base-sentiment is NOT possible, because after the conversion to the ONNX format its size is 476 MB.

As of 2023-05-14.
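
Because of this limit, it’s worth checking the exported file size before uploading anything to BigQuery. A minimal sketch (it assumes the model has already been exported to ./onnx/model.onnx, which is done in the next step):

import os

size_mb = os.path.getsize('./onnx/model.onnx') / (1024 * 1024)
print(f'Model size: {size_mb:.0f} MB')
assert size_mb < 450, 'Too large for the BigQuery ML inference engine'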

To convert the model into the ONNX format, you can simply run the module provided by Huggingface’s transformers library:

python -m transformers.onnx --model=distilbert-base-uncased-finetuned-sst-2-english --feature=sequence-classification onnx/ --opset 17

This command outputs an onnx/model.onnx file. Let’s run it locally first to check that it works:

import onnxruntime as rt
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
sess = rt.InferenceSession("./onnx/model.onnx")
text = ['This is a positive tweet!', "Let's put some hate into this text", "Hello world!"]
inputs = tokenizer(text, return_tensors='np', padding=True)
output_name = sess.get_outputs()[0].name
output = sess.run([output_name], dict(inputs))
output[0].argmax(1)
# Output:
# array([1, 0, 1])

It works fine locally - but it will not work on BigQuery…

⚠️ Limitation #2
BigQuery ML inference does not support multiple dynamic axes in ONNX models; the only allowed dynamic axis is the one responsible for the batch size (i.e. the first one).

As of 2023-05-14.

Dynamic axes in ONNX allow you to pass variable-length tensors into the model - this is especially useful when you want to pass batches of various sizes instead of having them fixed to 1, 2, etc. Models converted to ONNX with Huggingface’s transformers.onnx module have 2 dynamic axes - for NLP models these are the batch size and the input sequence length. This is visible if you inspect the model:

import onnx
model = onnx.load('./onnx/model.onnx')
model.graph.input[0]
# Output:
name: "input_ids"
type {
  tensor_type {
    elem_type: 7
    shape {
      dim {
        dim_param: "batch" # <--- first dynamic axis
      }
      dim {
        dim_param: "sequence" # <--- second dynamic axis
      }
    }
  }
}

Trying to deploy such a model in BigQuery results in the following error:

ONNX Model input ‘input_ids’ has unknown tensor dimension at index: 1, which is not supported.
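
You can check for this programmatically before uploading. Here is a small sketch that lists every dynamic input dimension of a model (the dynamic_dims helper is mine, not part of the onnx library):

import onnx

def dynamic_dims(path):
    """Return (input name, axis index) pairs for every dynamic (named) dimension."""
    model = onnx.load(path)
    found = []
    for inp in model.graph.input:
        for i, dim in enumerate(inp.type.tensor_type.shape.dim):
            if dim.dim_param:  # a symbolic name instead of a fixed value
                found.append((inp.name, i))
    return found

print(dynamic_dims('./onnx/model.onnx'))
# [('input_ids', 0), ('input_ids', 1), ('attention_mask', 0), ('attention_mask', 1)]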

How to fix it? - take 1

The model I’m using - DistilBERT - was trained with the max_position_embeddings config set to 512, so let’s first try to set this value as the fixed size of the second axis of the ONNX model’s inputs (spoiler alert - it will not work).

import onnx
model = onnx.load('./onnx/model.onnx')
model.graph.input[0].type.tensor_type.shape.dim[1].dim_value = 512 # <--- for input_ids
model.graph.input[1].type.tensor_type.shape.dim[1].dim_value = 512 # <--- for attention_mask
onnx.save(model, "./onnx/model_fixed.onnx")

Now, when I import the model into BigQuery, it still complains - this time about something else:

ONNX model cannot be parsed within the memory limit; try reducing the model size

This happens because ONNX needs to pre-allocate memory for the input tensors, which also affects the internal layer sizes, exhausting the memory limit on the BigQuery side.

How to fix it? - take 2

I had to shrink the model inputs even further, down to 256 (which is still fine for short tweets):

import onnx
model = onnx.load('./onnx/model.onnx')
model.graph.input[0].type.tensor_type.shape.dim[1].dim_value = 256 # <--- for input_ids
model.graph.input[1].type.tensor_type.shape.dim[1].dim_value = 256 # <--- for attention_mask
onnx.save(model, "./onnx/model_fixed_take_2.onnx")

With this version, the import passes and BigQuery is able to create the model.

Creating BigQuery ONNX model

In order to import an ONNX model into the BigQuery inference engine, first upload the model (the fixed one :) ) to GCS:

gsutil cp ./onnx/model_fixed_take_2.onnx gs://<your bucket>/models/distilbert-base-uncased-finetuned-sst-2-english.onnx

Then, open BigQuery UI and execute the following query:

CREATE OR REPLACE MODEL
onnx_demo.sentiment_classifier
OPTIONS(MODEL_TYPE = 'ONNX', MODEL_PATH = "gs://<your bucket>/models/distilbert-base-uncased-finetuned-sst-2-english.onnx");

Once executed, you should see the model in the UI:

👀 Spoiler alert - you will have to fix one more thing in the ONNX model, follow the guide below.

Huggingface tokenizers in BigQuery

Wait… The model accepts input_ids and attention_mask, but our BigQuery table has text 😱 We need to tokenize the text before passing it to the model, but tokenizers are Python code that cannot easily be converted to the ONNX format. How to solve this problem? Use BigQuery Remote Functions!

Huggingface tokenizer as BigQuery Remote Function

BigQuery allows you to create a UDF that calls either Cloud Functions or Cloud Run. Let’s leverage that. For more details, also see my previous blog post here.

Full code of the Cloud Function:

import json
from transformers import AutoTokenizer
import functions_framework
import os

tokenizer_name = os.environ.get("TOKENIZER_NAME", "distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

def expand(input_dict):
    input_ids = input_dict["input_ids"]
    attention_mask = input_dict["attention_mask"]
    result = []
    for i in range(len(input_ids)):
        obj = {"input_ids": input_ids[i], "attention_mask": attention_mask[i]}
        result.append(obj)
    return result

@functions_framework.http
def tokenize_text(request):
    request_json = request.get_json()
    calls = request_json["calls"]
    texts = [row[0] for row in calls]
    tokenized = tokenizer(texts, truncation=True, max_length=256, padding="max_length")
    result = expand(tokenized)
    return json.dumps({"replies": [json.dumps(r) for r in result]}) # double serialization, because we need to return STR in Remote Function

First, I’m using functions_framework so that I can test the code locally. Secondly, I’m pre-loading the distilbert-base-uncased-finetuned-sst-2-english tokenizer on function start. Every call to the deployed function uses the BigQuery Remote Function request format, which looks like this:

{
  "request_id": "string",
  "caller": "string",
  "sessionUser": "string",
  "userDefinedContext": {},
  "calls": [
    ["values", "from", "row", 1],
    ["values", "from", "row", 2]
  ]
}

I’m invoking the tokenizer for the whole batch and then expanding the output from a nested list into a flat list of objects - each row is in exactly the form that the ONNX model expects, with input_ids and attention_mask fields. Finally, the whole response is serialized; note the double serialization of the JSON objects in each row. Right now BigQuery Remote Functions do not support returning complex types, so I will de-serialize this JSON afterwards in BigQuery SQL.
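
Before deploying, you can also test the function locally with functions-framework. A quick sketch (it assumes the code above is saved as main.py and the requests library is installed):

# Start the local server first:
#   functions-framework --target tokenize_text --debug
import json
import requests

payload = {"calls": [["This is a positive tweet!"], ["Hello world!"]]}
resp = requests.post("http://localhost:8080", json=payload)

replies = json.loads(resp.text)["replies"]
first_row = json.loads(replies[0])  # undo the double serialization
print(list(first_row.keys()))       # ['input_ids', 'attention_mask']
print(len(first_row["input_ids"]))  # 256 (padded to max_length)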

Make sure that you create requirements.txt file with all dependencies:

transformers~=4.29.1
functions-framework

Deploy the function using CLI:

gcloud functions deploy tokenize_text --runtime python311 --trigger-http --min-instances 0 --no-allow-unauthenticated --region us-east1

Then, create a connection between BigQuery and Cloud Functions and assign the Cloud Functions Invoker role to the service account of this connection:

# 1. Create connection:
bq mk --connection --location=US --project_id=<project id> \
    --connection_type=CLOUD_RESOURCE huggingface-tokenizer
# 2. Copy the serviceAccountId from the results of:
bq show --format=prettyjson --connection <project id>.US.huggingface-tokenizer
# 3. Add Cloud Function Invoker role
gcloud functions add-iam-policy-binding tokenize_text --member="serviceAccount:bqcx-<redacted>@gcp-sa-bigquery-condel.iam.gserviceaccount.com" --role="roles/cloudfunctions.invoker" --region=us-east1

Once completed, add the remote function…

CREATE OR REPLACE FUNCTION `onnx_demo`.tokenize(text STRING) RETURNS STRING
REMOTE WITH CONNECTION `<connection id>`
OPTIONS (
  endpoint = 'https://<cloud function url>.cloudfunctions.net/tokenize_text',
  max_batching_rows = 10 -- <-- adjust this depending on the memory limits of your Cloud Function
)

…and invoke it:

select `onnx_demo`.tokenize(text) as t, text, label
from `onnx_demo.twitter_sentiment_validation` 
LIMIT 100

Result:

Great, we have deployed the model and invoked the tokenizer - let’s put everything together!

DistilBERT inference on BigQuery

For clarity, I’ve split the inference code into 4 parts (CTEs):

  1. Tokenizing the text
  2. Transforming the tokenized JSONs into BigQuery arrays
  3. Running the inference using the ONNX model
  4. Parsing results to determine the text sentiment

Let’s start with #1. This query is simple - it just calls the tokenizer. Note that I’m explicitly limiting it to 10 rows; see below for why 👀

with tokenized as (
  select `onnx_demo`.tokenize(text) as t, text, label
  from `onnx_demo.twitter_sentiment_validation` 
  LIMIT 10
)

The next part, #2, parses the JSONs from the tokenizer into BigQuery’s ARRAY<INT64> format:

expanded as (
  select 
  (SELECT ARRAY_AGG(CAST(i AS INT64)) FROM UNNEST(JSON_EXTRACT_ARRAY(t, '$.input_ids')) AS i) as input_ids, 
  (SELECT ARRAY_AGG(CAST(a AS INT64)) FROM UNNEST(JSON_EXTRACT_ARRAY(t, '$.attention_mask')) AS a) as attention_mask, 
  text, label
  from tokenized
)

Calling the model (#3), while also passing the text and labels from the original table:

predictions as (
  SELECT
    * EXCEPT(input_ids, attention_mask)
  FROM
    ML.PREDICT(MODEL `onnx_demo.sentiment_classifier`, (
      SELECT
        input_ids,
        attention_mask,
        text,
        label
      FROM
        expanded
      )
    )
)

Finally, #4 parses the results (logits) from the model (if your model has more than 2 classes, you will effectively have to implement something similar to argmax to get the predicted class).

select text, (CASE WHEN logits[OFFSET(0)] > logits[OFFSET(1)] THEN 'negative' ELSE 'positive' END) as predicted, label, logits
from predictions
See the full query ⬇️
with tokenized as (
  select `onnx_demo`.tokenize(text) as t, text, label
  from `onnx_demo.twitter_sentiment_validation` 
  LIMIT 10
), expanded as (
  select 
  (SELECT ARRAY_AGG(CAST(i AS INT64)) FROM UNNEST(JSON_EXTRACT_ARRAY(t, '$.input_ids')) AS i) as input_ids, 
  (SELECT ARRAY_AGG(CAST(a AS INT64)) FROM UNNEST(JSON_EXTRACT_ARRAY(t, '$.attention_mask')) AS a) as attention_mask, 
  text, label
  from tokenized
), predictions as (
  SELECT
    * EXCEPT(input_ids, attention_mask)
  FROM
    ML.PREDICT(MODEL `onnx_demo.sentiment_classifier`, (
      SELECT
        input_ids,
        attention_mask,
        text,
        label
      FROM
        expanded
      )
    )
)
select text, (CASE WHEN logits[OFFSET(0)] > logits[OFFSET(1)] THEN 'negative' ELSE 'positive' END) as predicted, label, logits
from predictions

Result:

Scaling the inference

In the section above, I’ve deliberately put a LIMIT 10 in the initial query to limit the number of rows. Why? Try running the query without the limit and see what happens:

Resources exceeded during query execution: UDF out of memory

⚠️ Limitation #3
BigQuery ML inference does not allow you to set the batch size during model creation, which makes scaling up the inference challenging - you effectively don’t know how many rows will be sent to the ONNX model at runtime, which very often results in a UDF out of memory exception.

As of 2023-05-14.

Fortunately, there is a workaround. Remember that the originally exported ONNX model had 2 dynamic axes - the first being the batch size and the second the sequence length. We already reduced the second one to a fixed value of 256. Now it’s time to also set the batch size to 1, which forces BigQuery to send only one input example at a time to the model, making the inference possible for any number of rows. 💡 If you have a suggestion for fixing this in any other way, let me know in the comments below.

import onnx
model = onnx.load('./onnx/model.onnx')
model.graph.input[0].type.tensor_type.shape.dim[0].dim_value = 1 # <--- force batch size to 1
model.graph.input[0].type.tensor_type.shape.dim[1].dim_value = 256
model.graph.input[1].type.tensor_type.shape.dim[0].dim_value = 1 # <--- force batch size to 1
model.graph.input[1].type.tensor_type.shape.dim[1].dim_value = 256
onnx.save(model, "./onnx/model_fixed_final.onnx")

Once you upload the model to GCS and re-create it using CREATE OR REPLACE MODEL, the inference should start working at a larger scale! 🎉

ONNX models vs Remote Functions

As you’ve seen, the ONNX feature is still in the preview stage (Pre-GA), with some important limitations that you should be aware of. The biggest of them is the model size limit, which makes it impossible to deploy even medium-sized models like the base versions of BERT/RoBERTa. At the same time, a lot of the data that customers store in BigQuery is tabular, and those limits might be enough to deploy smaller models, e.g. the TabNet model family. The API for ONNX models is also very simple, which makes it easy to use.

For NLP models that also require tokenization before inference, Remote Functions are still a crucial building block. They allow you to run virtually any Python code from the BigQuery side, meaning that you can use any Python library. The downside of this approach is that you need to write some code and deploy it separately as either a Cloud Function or a Cloud Run service.

Using Remote Functions for everything currently seems to have fewer limitations than ONNX models - you can deploy models on instances with up to 8 CPUs and 32 GB of RAM (in preview on Cloud Run), which will fit much larger models, with the possibility of using quantized versions too.

If this post gets enough interest, I’ll be happy to prepare a side-by-side performance comparison of both approaches 📊.

Summary

I hope that this post helps you deploy your own custom model in BigQuery ML. You also now know the current limitations of ONNX models in the BigQuery ML inference engine. Let me know what you think in the comments!
