Views vs. Materialized Views vs. Rollup Tables with PostgreSQL

For applications it’s often really annoying to aggregate data in memory with a hand-written command/program instead of letting the storage engine that already holds the data do this work. The point is, it’s really cumbersome to fetch data from a database, aggregate it within a program and write it to another table, which then serves as a data source for yet another program. When you see something like this, please always try to get rid of this aggregation work and let your storage engine handle it. So let’s clarify how materialized views can help in this situation.

What are Views?

First of all we need to clarify: what is a view? A view is something like a virtual table/structure that pulls data from one or more tables and that you can query like a regular table.

Views help you avoid copying & pasting a complex query around. Yes, this can also be solved by introducing a repository in your application, but it’s better to keep your application as simple as possible. The disadvantage of a view is that the data is recomputed for every query hitting the view, which might affect your application’s performance.
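As a quick sketch, a plain view over the ad_clicks table we create below could look like this (the view name is made up for illustration):

CREATE VIEW clicks_per_page AS
SELECT page,
       count(*) as clicks
FROM ad_clicks
GROUP BY page;

-- the aggregation runs again on every query against the view:
SELECT * FROM clicks_per_page;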

What are materialized Views?

To fix the recomputation problem with views, PostgreSQL offers materialized views. In general it’s a disk-stored view that can be refreshed whenever you need it and that also supports indices. Since version 9.4, materialized views can be refreshed concurrently, i.e. without locking out reads on the view.

Project setup

Let’s see how we can profit from a materialized view by building a daily report for advertisement clicks on a website. We create a table to track all the ad clicks on a website. I prepared a PostgreSQL 12 Docker setup with the pg_cron extension enabled: https://github.com/stefpe/postgres_cron_rollup

So let’s get our hands dirty and boot our PostgreSQL instance and a pgAdmin interface:

### postgresql 12 with pg_cron extension ###
git clone https://github.com/stefpe/postgres_cron_rollup.git
cd postgres_cron_rollup && docker build -t postgre .
docker run -e POSTGRES_PASSWORD=test -p 5432:5432 -d postgre

### pgAdmin 4 to execute queries and check the db ###
docker run -e PGADMIN_DEFAULT_EMAIL=example@example.com -e PGADMIN_DEFAULT_PASSWORD=test -p 8080:80 -d dpage/pgadmin4

### use the host: host.docker.internal to connect to the booted postgresql instance ###

Create the DB schema:

CREATE TABLE ad_clicks (
    id bigserial PRIMARY KEY,
    page text,
    click_time timestamp,
    user_session_id BIGINT
);
CREATE INDEX ON ad_clicks (click_time, page);

Fill up the table with 10000 rows of fake data:

INSERT INTO ad_clicks(page, click_time, user_session_id)
SELECT
    (CASE (RANDOM() * 2)::INT
        WHEN 0 THEN 'medium.com'
        WHEN 1 THEN 'mywristbands.de'
        WHEN 2 THEN 'google.com'
    END) AS page,
    NOW() as click_time,
    (floor(random() * (111111111 - 1000000 + 1) + 1000000))::INT as user_session_id
FROM GENERATE_SERIES(1, 10000) seq;
Added fake click data for testing
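To sanity-check the inserted data, here is an illustrative query (just for verification, not part of the setup):

SELECT page, count(*) FROM ad_clicks GROUP BY page;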

Now we create a daily report view:

CREATE MATERIALIZED VIEW ad_clicks_report AS
SELECT date_trunc('day', click_time) as day,
       page,
       count(*) as clicks
FROM ad_clicks
GROUP BY day, page;

With this view, we can see how many ad clicks were performed per page and day. To refresh the view you just need to run:

REFRESH MATERIALIZED VIEW ad_clicks_report;

The keyword CONCURRENTLY in the refresh statement allows queries to keep running against the view while it refreshes, but it requires a unique index on the view. Being able to have indices on a materialized view, compared to a normal view, is also a huge benefit.
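A minimal sketch (the choice of index columns is an assumption; (day, page) uniquely identifies a row in our report):

CREATE UNIQUE INDEX ON ad_clicks_report (day, page);
REFRESH MATERIALIZED VIEW CONCURRENTLY ad_clicks_report;

Let’s query the view:

SELECT * FROM ad_clicks_report ORDER BY day, clicks DESC;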

Output of materialized view

How to get realtime data?

The computation of this aggregated view gets heavier as the amount of data in the ad_clicks table grows. To work around this, you can create a materialized view per day and refresh it e.g. every 10 minutes with these queries:

CREATE MATERIALIZED VIEW daily_ad_clicks AS
SELECT date_trunc('day', click_time) as day,
       page,
       count(*) as cnt
FROM ad_clicks
WHERE click_time >= date_trunc('day', now()) AND click_time < TIMESTAMP 'tomorrow'
GROUP BY day, page;
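Since we want to refresh this view with CONCURRENTLY below, it also needs a unique index; assuming (day, page) is unique per row:

CREATE UNIQUE INDEX ON daily_ad_clicks (day, page);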

Refreshing the View:

REFRESH MATERIALIZED VIEW CONCURRENTLY daily_ad_clicks;

Of course you should append a date to the view name if you generate a new one on a daily basis.
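One way to sketch that is a DO block that builds the dated view name with format() (the naming scheme here is just an example):

DO $$
BEGIN
    EXECUTE format($sql$
        CREATE MATERIALIZED VIEW daily_ad_clicks_%s AS
        SELECT date_trunc('day', click_time) as day,
               page,
               count(*) as cnt
        FROM ad_clicks
        WHERE click_time >= date_trunc('day', now())
          AND click_time < TIMESTAMP 'tomorrow'
        GROUP BY day, page
    $sql$, to_char(now(), 'YYYY_MM_DD'));
END $$;

This creates e.g. daily_ad_clicks_2020_07_01, and the block itself could be scheduled with pg_cron.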

Rollup

The other way to avoid heavy computations on the DB server is to implement a delta approach. So let’s build a realtime table that shows the click count per page and day (if you created the daily_ad_clicks materialized view above, drop it first, since tables and materialized views share the same namespace):

CREATE TABLE daily_ad_clicks (
    day timestamp,
    page text,
    cnt BIGINT
);
CREATE UNIQUE INDEX ON daily_ad_clicks (day, page);

The rollup logic works with an UPSERT query that inserts new rows and updates existing rows on a constraint conflict.
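In isolation, the upsert pattern looks like this (the values are made up):

INSERT INTO daily_ad_clicks (day, page, cnt)
VALUES (date_trunc('day', now()), 'medium.com', 42)
ON CONFLICT (day, page) DO UPDATE
SET cnt = daily_ad_clicks.cnt + EXCLUDED.cnt;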

To ensure only one job runs at a time, and to know from which point in time the next batch for the daily_ad_clicks table has to be processed, we create a helper table. This table stores the last start time, the length of the time window we want to process, and the name of the destination table. During the data update process we lock the related config row with a row-level lock (SELECT … FOR UPDATE), so another job that starts in parallel and tries to read the last start time will block until the first one has finished.

CREATE TABLE aggregation_job_config (
    start_time timestamp,
    delta_minutes integer default 0,
    table_name text not null
);

We add our job that will fill daily_ad_clicks like this:

INSERT INTO aggregation_job_config(start_time, delta_minutes, table_name) VALUES(CURRENT_TIMESTAMP, 5, 'daily_ad_clicks');

And now a function that performs the aggregation:

CREATE OR REPLACE FUNCTION aggregation_job(tbl_name text)
RETURNS void
LANGUAGE plpgsql
AS $function$
DECLARE
    st_time timestamp;
    ed_time timestamp;
BEGIN
    -- lock the config row and compute the time window to process
    SELECT start_time, start_time + (delta_minutes * interval '1 minute')
    INTO st_time, ed_time
    FROM aggregation_job_config
    WHERE table_name = tbl_name FOR UPDATE;

    -- upsert the click counts for that window into the rollup table
    INSERT INTO daily_ad_clicks
    SELECT date_trunc('day', click_time) as click_day,
           page,
           count(*) as click_count
    FROM ad_clicks
    WHERE click_time >= st_time AND click_time < ed_time
    GROUP BY click_day, page
    ON CONFLICT (day, page) DO UPDATE
    SET cnt = daily_ad_clicks.cnt + EXCLUDED.cnt;

    -- advance the window so the next run picks up where we left off
    UPDATE aggregation_job_config SET start_time = ed_time WHERE table_name = tbl_name;
END;
$function$;

You can execute the function with the following statement:

SELECT aggregation_job('daily_ad_clicks');
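Afterwards you can inspect the rollup table, for example:

SELECT * FROM daily_ad_clicks ORDER BY day, page;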

Automate the delta rollup function with pg_cron

To get the function call automatically scheduled we can use pg_cron. To add a cronjob we need to execute this query:

SELECT cron.schedule('*/10 * * * *',$$SELECT aggregation_job('daily_ad_clicks');$$);

If you are not familiar with the cronjob scheduling format: */10 * * * * simply means “every 10 minutes”.

Now check the available cronjobs:

SELECT * FROM cron.job;

The output should list our successfully registered job.

Now our daily_ad_clicks table gets updated every 10 minutes with new click counts. You can check the log of the database with Docker like this:

docker logs $(docker ps --format '{{.Image}} {{.ID}}' | grep postgre | cut -f2 -d ' ')
Log output for the cronjob
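If you ever need to remove the job again, pg_cron provides cron.unschedule, which takes the jobid from cron.job (the LIKE filter is just one way to find our job):

SELECT cron.unschedule(jobid) FROM cron.job WHERE command LIKE '%aggregation_job%';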

Summary

We tested two approaches to implement realtime data sources for something like a dashboard. The first is the materialized view setup, which is simple to implement. The second is the more complex rollup approach, which in return avoids heavy recurring computations on the DB. Now it’s your turn to decide which strategy best suits your needs, but one thing is clear: PostgreSQL will help you anyways ;)
