Data Sampling In Presto

Share on Facebook0Share on Google+0Tweet about this on TwitterShare on LinkedIn0

The importance of “Sampling” cannot be overstated. The conclusions we draw from the data as well as the quality of the machine learned model significantly depends on how we sample the data. However there are many different ways to sample the data and expressing these different ways of sampling in SQL can often be tricky. Below are examples of few sampling techniques that can be easily expressed using Presto query engine.

1.0 Random Sampling

In random sampling, the probability of selecting any given row is same. In other words all rows are equally weighted. As shown below random sampling can be easily achieved in Presto using TABLESAMPLE operator along with BERNOULLI method for sampling.

WITH dataset AS (
    SELECT *
    FROM (
        VALUES 
            (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D'),
            (5, 'A'), (6, 'B'), (7, 'C'),
            (9, 'A'), (10, 'B'),(11, 'C'),
            (13, 'A'),(14, 'B'),(15, 'C'),
            (17, 'A'),(18, 'B'),(19, 'C'),
            (21, 'A'),(22, 'B'),(23, 'C'),
            (25, 'A'),(26, 'B')
    ) AS t
)

SELECT * 
-- assuming we want to sample 25% of records
FROM dataset TABLESAMPLE BERNOULLI(25)

Since all rows are equally weighted, one of the problems with random sampling is that we might not see rare events in our sample data. For instance, above there is only record related to letter ‘D’ and most likely it won’t appear in our sampled data. This is where stratified sampling comes handy.

2.0 Stratified Sampling

The idea of stratified sampling is to partition the data into different groups and then select records from each of these groups. There can be many different strategies on how many records are selected from each group. For instance below are two different strategies:

  1. Select N records from each group. If N is greater than the number of records available in a group then select all the records. Loosely this is known as reservoir sampling. Technically, “reservoir sampling” is defined as group of algorithms for selecting N records from a list whose length is unknown.
  2. Select X% of records from each group. So if one group has 100 records and another group has 10K records and we want to select 10% records from each group then the sample output should contain 10 records from the first group and 1K records from the second group.

2.1 Stratified Reservoir Sampling

Assume the same dataset as above (containing letters and numbers), we want to select 3 rows from each letter group. Thus the sample dataset should contain 3 random records from group A, B, C as well as the only record related to group D.

WITH dataset AS (
    SELECT *
    FROM (
        VALUES 
            (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D'),
            (5, 'A'), (6, 'B'), (7, 'C'),
            (9, 'A'), (10, 'B'),(11, 'C'),
            (13, 'A'),(14, 'B'),(15, 'C'),
            (17, 'A'),(18, 'B'),(19, 'C'),
            (21, 'A'),(22, 'B'),(23, 'C'),
            (25, 'A'),(26, 'B')
    ) AS t(number, letter)
)

SELECT letter, number
FROM
(
    SELECT *, ROW_NUMBER() OVER (PARTITION BY letter ORDER BY rnd) as rnk
    FROM (
        SELECT letter, number, RANDOM() AS rnd 
        FROM dataset 
    ) bucketed
) sampled
-- assuming we want 3 records from each group
WHERE rnk <= 3 

The idea above is to assign a random value to each record and then select top N records in each group based on the random value.

2.2 Stratified Constant Proportion

The problem becomes little tricky if we intent to select a fixed proportion of records from each group. One naive solution might be to use two queries. First, to compute the number of records in each group and then use this information to select top N records from each group (just like in 2.1) but where N changes for each group. But this requires two passes through the data and this can be computational expensive.

However if we assume/confirm that there are more than 100 records in each group, then we can select the X% records from each group in a single pass using NTILE function as shown below.

WITH dataset AS (
    SELECT *
    FROM (
        VALUES 
            (1,'A'),(2,'A'),...,(100,'A'),
            (1,'B'),(2,'B'),...,(100,'B'),
            (1,'C'),(2,'C'),...,(100,'C'),
            (1,'D'),(2,'D'),...,(100,'D')
    ) AS t(number, letter)
)

SELECT letter, number
FROM
(
    SELECT 
        *, 
        NTILE(100) OVER (PARTITION BY letter ORDER BY rnd) as tile
    FROM (
        SELECT d.letter, number, RANDOM() AS rnd
        FROM dataset d
    ) bucketed
) sampled
-- assuming we want 10% records from each letter group
WHERE tile <= 10
Share on Facebook0Share on Google+0Tweet about this on TwitterShare on LinkedIn0