Example: Using the source Data Type with Transforms

You can use Vega transform aggregation and formula expressions to automate the process of gathering statistical information about a rendered query. By doing so, you do not have to run an SQL prequery to get the information, thereby reducing the time it takes to process and render a chart.

The following heatmap example demonstrates the benefits of Vega tranforms in terms of performance and reducing redundancy:

  • First, you will see an SQL expression used to render a heatmap, as well as an additional expression used to color the hexagonal bins according to the min and max of the cnt value of the aggregated bins from the query.
  • Then, the example shows how to render the heatmap and color the bins directly in Vega by using a source data set and performing aggregation transforms on that data, decreasing chart rendering time and redundancy.

SQL Queries

The following SQL query is typically used to render a hexagonal heatmap:

SELECT
        reg_hex_horiz_pixel_bin_x(conv_4326_900913_x(lon),conv_4326_900913_x(-157),conv_4326_900913_x(157),conv_4326_900913_y(lat),conv_4326_900913_y(-63),conv_4326_900913_y(81),9.9667,11.5085,0,0,897,647) as x,
        reg_hex_horiz_pixel_bin_y(conv_4326_900913_x(lon),conv_4326_900913_x(-157),conv_4326_900913_x(157),conv_4326_900913_y(lat),conv_4326_900913_y(-63),conv_4326_900913_y(81),9.9667,11.5085,0,0,897,647) as y,
        count(*) as cnt
FROM tweets_nov_feb
WHERE ((lon >= -157 AND lon <= 157) AND (lat >= -63 AND lat <= 81))
GROUP BY x, y;

To color the hexagonal bins according to the min and max of the cnt value of the bins from the query, you need to run a prequery to gather these statistics manually:

    SELECT
             min(cnt), max(cnt)
    FROM (SELECT
       reg_hex_horiz_pixel_bin_x(conv_4326_900913_x(lon),conv_4326_900913_x(-157),conv_4326_900913_x(157),conv_4326_900913_y(lat),conv_4326_900913_y(-63),conv_4326_900913_y(81),9.9667,11.5085,0,0,897,647) as x,
       reg_hex_horiz_pixel_bin_y(conv_4326_900913_x(lon),conv_4326_900913_x(-157),conv_4326_900913_x(157),conv_4326_900913_y(lat),conv_4326_900913_y(-63),conv_4326_900913_y(81),9.9667,11.5085,0,0,897,647) as y,
       count(*) as cnt
FROM tweets_nov_feb
WHERE ((lon >= -157 AND lon <= 157) AND (lat >= -63 AND lat <= 81))
GROUP BY x, y
    );

The second query does an aggregation over the query, effectively running the query twice.

To avoid the redundancy and expense of running the query twice, you can instead specify this data-gathering in Vega.

Render the Heatmap in Vega

The following Vega code renders the heatmap using Vega-aggregated statistics.

{
  "width": 897,
  "height": 647,
  "data": [
        {
          "name": "heatmap_query",
          "sql": "SELECT reg_hex_horiz_pixel_bin_x(conv_4326_900913_x(lon),conv_4326_900913_x(-157),conv_4326_900913_x(157),conv_4326_900913_y(lat),conv_4326_900913_y(-63),conv_4326_900913_y(81),9.9667,11.5085,0,0,897,647) as x, reg_hex_horiz_pixel_bin_y(conv_4326_900913_x(lon),conv_4326_900913_x(-157),conv_4326_900913_x(157),conv_4326_900913_y(lat),conv_4326_900913_y(-63),conv_4326_900913_y(81),9.9667,11.5085,0,0,897,647) as y, count(*) as cnt FROM tweets_nov_feb WHERE ((lon >= -157 AND lon <= 157) AND (lat >= -63 AND lat <= 81)) GROUP BY x, y"
        },
        {
          "name": "heatmap_stats",
          "source": "heatmap_query",
          "transform": [
                {
                  "type": "aggregate",
                  "fields": ["cnt", "cnt"],
                  "ops":    ["min", "max"],
                  "as":     ["mincnt", "maxcnt"]
                }
          ]
        }
  ],
  "scales": [
        {
          "name": "x",
          "type": "linear",
          "domain": [-17477160.052146, 17477160.052146],
          "range": "width"
        },
        {
          "name": "y",
          "type": "linear",
          "domain": [-9100250.905852, 16213801.065472],
          "range": "height"
        },
        {
          "name": "heat_color",
          "type": "quantize",
          "domain": {"data": "heatmap_stats", "fields": ["mincnt", "maxcnt"]},
          "range": ["#115f9a", "#1984c5", "#22a7f0", "#48b5c4", "#76c68f",
                                "#a6d75b", "#c9e52f", "#d0ee11", "#d0f400"
          ]
        }
  ],
  "marks": [
        {
          "type": "symbol",
          "from": {
                "data": "heatmap_query"
          },
          "properties": {
                "shape": "hexagon-horiz",
                "xc": {
                  "field": "x"
                },
                "yc": {
                  "field": "y"
                },
                "width": 9.9667,
                "height": 11.5085,
                "fillColor": {
                  "scale": "heat_color",
                  "field": "cnt"
                }
          }
        }
  ]
}

The data section named heatmap_stats has a source data table defined by the "source": "heatmap_query" line:

"name": "heatmap_stats",
"source": "heatmap_query",

The "heatmap_stats" data takes as input the "heatmap_query" data, which is the data supplied by the SQL code. You can use the source data type to apply intermediary steps or expressions (transforms) to the input source data.

For information about syntax and requirements for the source and transform properties, see Data property information in Reference.

To color the data according to two standard deviations from the mean, edit the "heatmap_stats" section as follows to:

  • Aggregate the minimum, maximum, average, and sampled standard deviation of the count column.
  • Use forumla expressions to calculate values that are two standard deviations from the average.
{
   "name": "heatmap_stats",
   "source": "heatmap_query",
   "transform": [
     {
       "type": "aggregate",
       "fields": ["cnt", "cnt", "cnt", "cnt"],
       "ops":    ["min", "max", "avg", "stddev"],
       "as":     ["mincnt", "maxcnt", "avgcnt", "stdcnt"]
     },
     {
       "type": "formula",
       "expr": "max(mincnt, avgcnt-2*stdcnt)",
       "as": "mincnttouse"
     },
     {
       "type": "formula",
       "expr": "min(maxcnt, avgcnt+2*stdcnt)",
       "as": "maxcnttouse"
     }
   ]
 }

Then, reference these values in the scale domain:

{
   "name": "heat_color",
   "type": "quantize",
   "domain": {"data": "heatmap_stats", "fields": ["mincnttouse", "maxcnttouse"]},
   "range": ["#115f9a", "#1984c5", "#22a7f0", "#48b5c4", "#76c68f",
             "#a6d75b", "#c9e52f", "#d0ee11", "#d0f400"
   ]
 }

Performing these calculations in Vega improves performance because the SQL query is only run once and the aggregated statistics are done “on the fly.” Because the query is not repeated in a statistical prequery step, you can reduce the full render time by half by performing the statistics step in Vega at render time.