NOT IN vs LEFT ANTI JOIN: A Performance Comparison

When filtering data based on exclusion criteria, the choice between NOT IN and LEFT ANTI JOIN can significantly impact query performance. This post demonstrates why LEFT ANTI JOIN is typically the better choice.

< Revised and generated with help of Claude >

Original Approach (Inefficient)

SELECT product_id, product_category
FROM products_dim
WHERE region_id = 100
    AND product_id NOT IN (
        SELECT product_id
        FROM products_dim
        WHERE region_id = 200
    )
    AND product_category IS NOT NULL
SELECT a.product_id, a.product_category
FROM products_dim a
LEFT ANTI JOIN (
    SELECT DISTINCT product_id
    FROM products_dim
    WHERE region_id = 200
) b ON a.product_id = b.product_id
WHERE a.region_id = 100
    AND a.product_category IS NOT NULL

Why This Works

Both queries return exactly the same result: products from region 100 that don’t exist in region 200.

Key Differences

Aspect NOT IN LEFT ANTI JOIN
Performance Slower, less optimized Faster, better optimized by Spark
Broadcast Risk Can trigger unwanted broadcasts Better control, prevents large broadcasts
Execution Plan Subquery execution Efficient join strategy
NULL Handling Unpredictable with NULLs Predictable behavior

Bottom Line

LEFT ANTI JOIN prevents broadcast errors while delivering the same results faster. When working with large datasets, this optimization can make a substantial difference in query execution time and resource utilization.

Written on December 27, 2025