DRY HiveQL

Share on Facebook0Share on Google+0Tweet about this on TwitterShare on LinkedIn0

DRY (don’t repeat yourself) is one of the fundamental principles of software engineering. The main idea is to avoid duplicating business/processing logic throughout the code. However, I rarely see it being applied when writing SQL queries; making it difficult to understand and maintain them.

Below are few tips on making HiveQL DRY.

Quick Summary

  1. Use “Macro” to express transformation logic that can be easily expressed in SQL.
  2. Consider using “With Clause” to express subqueries.
  3. Use variables to define constants.

Macros

Macros allows to assign an alias to a reusable processing logic that can be expressed in SQL. In simple terms, its like defining a function purely in SQL (although it doesn’t operate that way. It will do an inline expansion but we don’t have to worry about it for now).

For instance, in the below table we have two duration fields where the duration value is expressed in different units (such as milliseconds, seconds, minutes, etc).

UUID duration1 duration2
1 10ms 20us
2 16s 20ms
3 5m 2us

Below is a typical way of writing the HiveQL for this. Its bad because we have duplicated (once for each field) the logic of converting the duration expressed as string to duration in seconds. Anytime we make changes to it we will have to make sure to update the logic everywhere in the code.

SELECT
UUID,
CASE
WHEN duration1 like '%us'
THEN CAST(REPLACE(duration1, 'us', '') AS DOUBLE) / 1.0E6
WHEN duration1 like '%ms'
THEN CAST(REPLACE(duration1, 'ms', '') AS DOUBLE) / 1000.0
WHEN duration1 like '%s'
THEN CAST(REPLACE(duration1, 's', '') AS DOUBLE)
WHEN duration1 like '%m'
THEN CAST(REPLACE(duration1, 'm', '') AS DOUBLE) * 60
ELSE NULL
END as duration1_seconds,
CASE
WHEN duration2 like '%us'
THEN CAST(REPLACE(duration2, 'us', '') AS DOUBLE) / 1.0E6
WHEN duration2 like '%ms'
THEN CAST(REPLACE(duration2, 'ms', '') AS DOUBLE) / 1000.0
WHEN duration2 like '%s'
THEN CAST(REPLACE(duration2, 's', '') AS DOUBLE)
WHEN duration2 like '%m'
THEN CAST(REPLACE(duration2, 'm', '') AS DOUBLE) * 60
ELSE NULL
END as duration2_seconds
FROM
(
SELECT 1 AS UUID, '10ms' as duration1, '20us' as duration2
UNION ALL
SELECT 2 AS UUID, '16s' as duration1, '20ms' as duration2
UNION ALL
SELECT 3 AS UUID, '5m' as duration1, '2us' as duration2
) A

DRY way to rewrite the above query is to utilize “macro”. We first define a macro “DURATION_IN_SECONDS” and use it convert all the duration fields as shown below.

-- define macro to convert duration string to duration in seconds
CREATE TEMPORARY MACRO DURATION_IN_SECONDS (t string)
CASE
WHEN t like '%us'
THEN CAST(REPLACE(t, 'us', '') AS DOUBLE) / 1.0E6
WHEN t like '%ms'
THEN CAST(REPLACE(t, 'ms', '') AS DOUBLE) / 1000.0
WHEN t like '%s'
THEN CAST(REPLACE(t, 's', '') AS DOUBLE)
WHEN t like '%m'
THEN CAST(REPLACE(t, 'm', '') AS DOUBLE) * 60
ELSE NULL
END;

SELECT
UUID,
-- use macro to convert first duration field
DURATION_IN_SECONDS(duration1) duration1_seconds,
-- use macro to convert second duration field
DURATION_IN_SECONDS(duration2) duration2_seconds
FROM
(
SELECT 1 AS UUID, '10ms' as duration1, '20us' as duration2
UNION ALL
SELECT 2 AS UUID, '16s' as duration1, '20ms' as duration2
UNION ALL
SELECT 3 AS UUID, '5m' as duration1, '2us' as duration2
) A

With Clause

Below is an example of another typical query. In the SQL below, we use tableC to filter tableA and tableB and then join the two together. The logic on how to filter tableC itself has been duplicated.

SELECT *
FROM
(
SELECT TableA.*
FROM TableA
JOIN TableC
ON (TableA.id = TableC.id)
WHERE
TableA.datestr >= '2017-01-01'
-- filters on table C
AND TableC.datestr >= '2017-01-01'
AND TableC.status != 0
) A
JOIN
(
SELECT TableB.*
FROM TableB
JOIN TableC
ON (TableB.id = TableC.id)
WHERE
TableB.datestr >= '2017-01-01'
-- filters on table C
AND TableC.datestr >= '2017-01-01'
AND TableC.status != 0

) B
On (A.id = B.id)

Here, using “With” clause can help us make this query DRY. We first express the logic of filtering table C and assign it an alias. Next we join tableA and tableB to this alias.

-- express logic to filter table C over here
WITH FilteredTableC AS
(
SELECT *
FROM TableC
WHERE
datestr >= '2017-01-01'
AND status != 0
)

SELECT *
FROM
(
SELECT TableA.*
FROM TableA
JOIN FilteredTableC
ON (TableA.id = FilteredTableC.id)
WHERE
TableA.datestr >= '2017-01-01'
) A
JOIN
(
SELECT TableB.*
FROM TableB
JOIN FilteredTableC
ON (TableB.id = FilteredTableC.id)
WHERE
TableB.datestr >= '2017-01-01'
) B
On (A.id = B.id)

“With” Clause not only helps with making a SQL DRY, but is also very useful in breaking a big sql involving many joins into smaller easy self summarizing chunks. For instance below is an example of a query that joins three tables together. Even in this simple query it becomes difficult to understand the goal as there is a list of filters that we are applying to different tables.

SELECT drivers.*, riders.*
FROM trips
JOIN drivers ON drivers.driver_id = trips.driver_id
JOIN riders ON riders.rider_id = trips.rider_id
WHERE
trips.datestr >= '2017-01-01'
AND trips.status = 0
AND trips.city = 'SF'
AND drivers.joined >= '2017-01-01'
AND drivers.status = 'active'
AND riders.joined >= '2017-01-01'
AND riders.name like 'XYZ%'

Using “With Clause” allows to rewrite the above query in much more legible way. Each table is separately filtered and assigned a readable alias which is then used in the main query.

WITH SuccessfulTrips as
(
SELECT *
FROM trips
WHERE
trips.status = 0
AND trips.datestr >= '2017-01-01'
),
ActiveDrivers as
(
SELECT *
FROM drivers
WHERE
drivers.status = 'active'
AND drivers.joined >= '2017-01-01'
),
XYZRiders as
(
SELECT *
FROM riders
WHERE
riders.name like 'XYZ%'
AND riders.joined >= '2017-01-01'
)

SELECT ActiveDrivers.*, XYZRiders.*
FROM SuccessfulTrips
JOIN ActiveDrivers ON (SuccessfulTrips.driver_id = ActiveDrivers.driver_id)
JOIN XYZRiders ON (SuccessfulTrips.rider_id = XYZRiders.rider_id)

Variables

Often we use same constant values in multiple places. Instead of copying these constant values all over the place we can easily define a variable and use the variable.

SET start_date = '2017-01-01';
SET end_date = '2017-05-01';

SELECT A.*, B.*
FROM A.*
JOIN B.* ON (A.id = B.id)
WHERE
A.datestr >= ${hiveconf:start_date}
AND A.datestr <= ${hiveconf:end_date}
AND B.datestr >= ${hiveconf:start_date}
AND B.datestr <= ${hiveconf:end_date}

There are few different options for setting variables in hive. Make sure to read comments on this stackoverflow post.

Advertisements

Share on Facebook0Share on Google+0Tweet about this on TwitterShare on LinkedIn0