Home

the case for (date) partitioned table

tags: #gcp #bigquery #article

source: Official Docs [link]

  • Query 1: 5 rows returned and 1.74 GB processed
    SELECT DISTINCT
    fullVisitorId,
    date,
    city,
    pageTitle
    FROM `data-to-insights.ecommerce.all_sessions_raw`
    WHERE date = '20170708'
    LIMIT 5
    
  • Query 2: 0 Rows returned still 1.74 GB processed
    SELECT DISTINCT
    fullVisitorId,
    date,
    city,
    pageTitle
    FROM `data-to-insights.ecommerce.all_sessions_raw`
    WHERE date = '20180708'
    LIMIT 5
    
  • Why?
    • Before the query runs, the query engine does not know whether 2018 data exists to satisfy the WHERE clause condition and it needs to scan through all records in a non-partitioned table.
  • Scanning the entire dataset to apply a WHERE condition is inefficient, particularly when focusing on records from specific timeframes, such as the last year, last 7 days, or last month. To optimize this process, it is advisable to use a date-partitioned table. This approach enables the system to skip irrelevant partitions, streamlining the querying process and improving performance.
  • Creating Date Partitioned table
    CREATE OR REPLACE TABLE ecommerce.partition_by_day
    PARTITION BY date_formatted
    OPTIONS(
    description="a table partitioned by date"
    ) AS
    SELECT DISTINCT
    PARSE_DATE("%Y%m%d", date) AS date_formatted,
    fullvisitorId
    FROM `data-to-insights.ecommerce.all_sessions_raw`
    
  • Auto Expiring:
    • Auto-expiring partitioned tables help meet data privacy regulations and reduce unnecessary storage costs.
    • By assigning expiration dates to partitions, you can create a rolling window of data that automatically deletes partitions once they are no longer needed.
       CREATE OR REPLACE TABLE ecommerce.days_with_rain
       PARTITION BY date
       OPTIONS (
       partition_expiration_days=60,
       description="weather stations with precipitation, partitioned by day"
       ) AS
       SELECT
       DATE(CAST(year AS INT64), CAST(mo AS INT64), CAST(da AS INT64)) AS date,
       (SELECT ANY_VALUE(name) FROM `bigquery-public-data.noaa_gsod.stations` AS stations
        WHERE stations.usaf = stn) AS station_name,  -- Stations may have multiple names
       prcp
       FROM `bigquery-public-data.noaa_gsod.gsod*` AS weather
       WHERE prcp < 99.9  -- Filter unknown values
       AND prcp > 0      -- Filter
       AND _TABLE_SUFFIX >= '2018'