[Stratascratch - HARD] Rank Variance Per Country (NULL값이 연산이 안될 경우를 대비하기, RANK()/DENSE_RANK()차이, CTEs)

Stratascratch Meta/Facebook Interview Question - Rank Variance Per Country

Problem

  • Difficulty level : Hard
  • Find the countries have risen in the rankings based on the number of comments between Dec 2019 vs Jan 2020? (Avoid gaps between ranks when ranking countries.)

Table : fb_comments_count

user_id int
created_at datetime
number_of_comments int

Table : fb_active_users

user_id int
name varchar
status varchar
country varchar

My Solution

1. 3 CTEs + 2 JOINs

  • 두 테이블을 조인하면서 필요한 정보만 필터링한 기본 테이블 CTE1
  • 두 개의 다른 시기의 랭킹 변화를 구하기 위해 각 시기의 랭킹을 구하는 CTE2 CTE3
  • CTE2 CTE3 를 조인 후, CTE3의 정보를 기준으로 랭킹이 증가한 국가를 출력한다.
    • 랭킹이 증가한 국가 조건을 WHERE절에 작성했다.
      1. CTE3에서만 등장한 국가 혹은 CTE2에서는 등장하지 않은 국가인 경우
      2. CTE3CTE2 ranking 변수의 차이가 0보다 큰 국가인 경우
        ( 1의 조건에 해당하는 레코드가 있는 경우를 대비해 NULL값에도 수식이 성립하도록 COALESCE(CTE3.ranking,0) )
-- 1. LEFT JOIN `fb_comments_count` <- `fb_active_users`
-- 2. Filter records Dec 2019 - Jan 2020
-- 3. RANKING WINDOW FUNCTIONS - FOR DEC2019 JAN2020 SEPARATELY
-- 4. FIND THE COUNTRIES THAT HAVE RISEN IN RANKINGS (2CASES)

WITH CTE1 AS (
    SELECT 
        u.country,

        -- EXTRACT THE YEAR-MONTH FROM DATE COLUMN
        EXTRACT(YEAR_MONTH FROM created_at) AS `period`,
        -- AGGREGATE : SUM BY country & period(year-month)
        SUM(number_of_comments) AS `number_of_comments`

    FROM fb_comments_count c

        -- JOIN TYPE : INNER JOIN 
        -- fb_active_users : there are records which don't have country information
        JOIN fb_active_users u
            ON c.user_id = u.user_id

    -- Filter the records : Dec 2019, Jan 2020
    WHERE created_at between '2019-12-01' AND '2020-01-31'
    GROUP BY 1,2
)

-- 201912 RANKING INFORMATION
, CTE2 AS (
    SELECT 
        country,
        -- Avoid gaps between ranks when ranking countries : DENSE_RANK()
        DENSE_RANK() OVER (ORDER BY number_of_comments DESC) AS ranking
    FROM CTE1
    WHERE period="201912"
)

-- 202001 RANKING INFORMATION
, CTE3 AS (
    SELECT 
        country,
        -- Avoid gaps between ranks when ranking countries : DENSE_RANK()
        DENSE_RANK() OVER (ORDER BY number_of_comments DESC) AS ranking
    FROM CTE1
    WHERE period="202001"
)

-- RIGHT JOIN : WE NEED THE CHANGE ON RANKING >> CTE3
-- CTE3.country (NOT CTE2.country)
SELECT CTE3.country
FROM CTE2 -- 201912ranking
    RIGHT JOIN CTE3 -- 202001ranking 
    ON CTE2.country = CTE3.country

-- countries which have RISEN in the rankings
-- NO RANKING INFORMATION ON 201912 PERIOD
-- OR CHANGE ON RANKING IS GREATER THAN 0
WHERE ISNULL(CTE2.ranking) 
    OR COALESCE(CTE3.ranking,0) - CTE2.ranking < 0

✔️ CTE를 3개를 사용하는 쿼리보다 효율적인 쿼리를 고민해봐야할 것 같다.

2. 2 CTEs + 3 JOINs

WITH CTE1 AS (
    SELECT 
        country, 
        -- AGGREGATE : count(*) FOR SPECIFIC PERIOD
         DENSE_RANK() OVER (ORDER BY count(*) DESC) AS `ranking`

    FROM fb_comments_count c
        -- JOIN TYPE : INNER JOIN 
        -- fb_active_users : there are records which don't have country information
        JOIN fb_active_users u
            ON c.user_id = u.user_id

    -- Filter the records : Jan 2020
    -- EXTRACT THE YEAR-MONTH FROM DATE COLUMN : `EXTRACT(YEAR_MONTH FROM created_at)`
    WHERE EXTRACT(YEAR_MONTH FROM created_at) = "201912"
    -- TO REMOVE DUPLICATE ROWS 
    GROUP BY 1

), CTE2 AS (

    SELECT 
        country, 
        -- EXTRACT THE YEAR-MONTH FROM DATE COLUMN : `EXTRACT(YEAR_MONTH FROM created_at)`
        -- AGGREGATE : count(*) FOR SPECIFIC PERIOD
         DENSE_RANK() OVER (ORDER BY count(*) DESC) AS `ranking`

    FROM fb_comments_count c
        -- JOIN TYPE : INNER JOIN 
        -- fb_active_users : there are records which don't have country information
        JOIN fb_active_users u
            ON c.user_id = u.user_id

    -- Filter the records : Jan 2020
    WHERE EXTRACT(YEAR_MONTH FROM created_at) = "202001"
    -- TO REMOVE DUPLICATE ROWS 
    GROUP BY 1
)


SELECT CTE2.country
FROM CTE2 -- 202001ranking
    LEFT JOIN CTE1 -- 201912ranking
    ON CTE2.country = CTE1.country
WHERE ISNULL(CTE1.ranking)
    OR CTE2.ranking - COALESCE(CTE1.ranking, 0) < 0
  • Submission
    image

Lessons Learned

NULL값이 연산이 안될 경우를 대비하기 COALESCE()

WHERE ISNULL(CTE1.ranking) OR CTE2.ranking - COALESCE(CTE1.ranking, 0) < 0

To avoid "gaps" between ranks : DENSE_RANK()

RANK() vs. DENSE_RANK()

테이블 조인타입 → 누락된 정보 확인

fb_comments_countfb_active_usersLEFT JOIN 했다. fb_active_users 테이블에 유저 정보 특히 country 가 없는 경우가 있었고, country 변수로 집계하는 문제이므로, 해당 정보가 없는 경우는 의미가 없기 때문이다. (문제에서 특별한 조건이 없는 경우)

-- there are records which don't have country information 
SELECT *
FROM fb_comments_count c
    LEFT JOIN fb_active_users u
        ON c.user_id = u.user_id 

수식 실수 주의하기

랭킹이 더 높아진 경우를 출력하는 문제이다. 랭킹이 높아졌다 : 후자의 랭킹 - 전자의 랭킹 < 0

COALESCE(CTE3.ranking,0) - CTE2.ranking < 0