Difference Between Rows with PostgreSQL EXCEPT


PostgreSQL’s EXCEPT: Unveiling the Differences Between Rows

In the realm of relational databases, comparing datasets and identifying differences between them is a fundamental operation. PostgreSQL, a powerful and versatile open-source database system, provides the EXCEPT operator as a crucial tool for precisely this task. EXCEPT allows you to retrieve rows present in one query’s result set but not present in another’s. This capability is invaluable for data analysis, auditing, reconciliation, and various other scenarios where pinpointing discrepancies is essential.

This article delves deep into PostgreSQL’s EXCEPT operator, covering its syntax, behavior, use cases, performance considerations, and comparisons with alternative methods. We’ll explore numerous examples, demonstrating its practical application and clarifying its nuances.

1. The Fundamentals of EXCEPT

1.1 Syntax and Basic Usage

The core syntax of EXCEPT is straightforward:

```sql
SELECT column1, column2, ...
FROM table1
[WHERE condition1]

EXCEPT

SELECT column1, column2, ...
FROM table2
[WHERE condition2];
```

Let’s break this down:

  • SELECT ... FROM table1 [WHERE condition1]: This is the first query. It selects rows from table1, potentially filtering them with an optional WHERE clause.
  • EXCEPT: This is the keyword that performs the set difference operation.
  • SELECT ... FROM table2 [WHERE condition2]: This is the second query. It selects rows from table2, also potentially with a WHERE clause.

The EXCEPT operator returns all rows that are present in the result set of the first query but not present in the result set of the second query. In essence, it subtracts the second result set from the first.
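As a minimal sketch of this subtraction, the query below uses generate_series to produce two small sets of integers instead of real tables. The aliases a, b, and n are placeholders chosen for illustration.

```sql
-- First set: 1..5, second set: 4..8
SELECT n FROM generate_series(1, 5) AS a(n)
EXCEPT
SELECT n FROM generate_series(4, 8) AS b(n)
ORDER BY n;
-- returns 1, 2, 3
```

The values 4 and 5 occur in both sets and are removed; 6, 7, and 8 occur only in the second set and never enter the result.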

1.2 Important Considerations:

  • Column Compatibility: Both queries must return the same number of columns, and corresponding columns must have compatible data types. You can’t, for example, compare a text column with a numeric column directly; PostgreSQL raises an error if the types cannot be reconciled. Columns are matched by position, not by name: column 1 of the first query is compared to column 1 of the second, and so on.
  • Duplicate Handling: EXCEPT inherently removes duplicate rows from the final result. If the first query returns multiple identical rows, and the second query does not contain that row, EXCEPT will only return one instance of that row. This behavior aligns with the mathematical concept of set difference. There is an EXCEPT ALL operator that we’ll discuss later.
  • NULL Values: NULL values are treated as equal to other NULL values for the purpose of EXCEPT. If a row in the first query has NULL in a particular column, and a row in the second query also has NULL in the corresponding column, those rows are considered a match, and the row from the first query will not be included in the EXCEPT result.
  • Ordering: The EXCEPT operator itself does not guarantee any specific order of the results. If you need the results sorted, use an ORDER BY clause after the second query:

    ```sql
    SELECT ...
    FROM ...
    EXCEPT
    SELECT ...
    FROM ...
    ORDER BY column1, column2;
    ```
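The duplicate and NULL rules from the list above can be checked with throwaway inline data. This is only a sketch: the VALUES lists and the aliases a, b, id, note, and n are made up for the demonstration.

```sql
-- NULLs compare as equal in EXCEPT, so the (1, NULL) row is removed.
SELECT * FROM (VALUES (1, NULL::text), (2, 'b')) AS a(id, note)
EXCEPT
SELECT * FROM (VALUES (1, NULL::text)) AS b(id, note);
-- returns one row: (2, 'b')

-- Duplicates in the first query collapse to a single row.
SELECT * FROM (VALUES (1), (1), (2)) AS a(n)
EXCEPT
SELECT * FROM (VALUES (2)) AS b(n);
-- returns one row: 1
```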

2. Practical Examples: Illustrating EXCEPT’s Power

Let’s solidify our understanding with concrete examples. Assume we have two tables:

  • employees: Contains information about all current employees.
  • former_employees: Contains information about employees who have left the company.

Both tables have the following structure (for simplicity):

```sql
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    first_name VARCHAR(255),
    last_name VARCHAR(255),
    department VARCHAR(255)
);

CREATE TABLE former_employees (
    employee_id INT PRIMARY KEY,
    first_name VARCHAR(255),
    last_name VARCHAR(255),
    department VARCHAR(255)
);

-- Insert some sample data
INSERT INTO employees (employee_id, first_name, last_name, department) VALUES
(1, 'Alice', 'Smith', 'Sales'),
(2, 'Bob', 'Johnson', 'Marketing'),
(3, 'Charlie', 'Brown', 'Engineering'),
(4, 'David', 'Lee', 'Sales'),
(5, 'Eve', 'Williams', 'HR');

INSERT INTO former_employees (employee_id, first_name, last_name, department) VALUES
(1, 'Alice', 'Smith', 'Sales'),
(3, 'Charlie', 'Brown', 'Finance'), -- Note: Different department
(6, 'Frank', 'Garcia', 'IT');
```

2.1 Finding Active Employees Not in former_employees

This is the most straightforward application of EXCEPT. We want every row in employees that has no identical row in former_employees, as a first approximation of “currently employed and never left the company.”

```sql
SELECT employee_id, first_name, last_name, department
FROM employees

EXCEPT

SELECT employee_id, first_name, last_name, department
FROM former_employees;
```

Result (row order is not guaranteed without ORDER BY):

| employee_id | first_name | last_name | department |
| :---------- | :--------- | :-------- | :---------- |
| 2 | Bob | Johnson | Marketing |
| 3 | Charlie | Brown | Engineering |
| 4 | David | Lee | Sales |
| 5 | Eve | Williams | HR |

Explanation:

  • Alice Smith (employee_id 1) has an identical row in both tables, so she’s excluded.
  • Charlie Brown (employee_id 3) is present in both tables, but the department differs (Engineering in employees, Finance in former_employees). Because all four columns are compared, the two rows are not equal, and Charlie’s row from employees is returned. To exclude everyone whose employee_id appears in former_employees regardless of the other columns, select only employee_id in both queries.
  • Bob Johnson, David Lee, and Eve Williams are only in employees, so they are included.
  • Frank Garcia is only in former_employees and is not considered.
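Because EXCEPT compares every selected column, the choice of columns decides what counts as "the same employee". The sketch below compares on employee_id alone. To stay self-contained it shadows the two tables with CTEs of the same names that hold a reduced copy of the sample data (IDs and departments only).

```sql
WITH employees(employee_id, department) AS (
    VALUES (1, 'Sales'), (2, 'Marketing'), (3, 'Engineering'),
           (4, 'Sales'), (5, 'HR')
),
former_employees(employee_id, department) AS (
    VALUES (1, 'Sales'), (3, 'Finance'), (6, 'IT')
)
-- Only the key is compared, so a differing department no longer matters.
SELECT employee_id FROM employees
EXCEPT
SELECT employee_id FROM former_employees
ORDER BY employee_id;
-- returns 2, 4, 5
```

The trade-off is that the result can contain only the columns that were compared. To get the names back, join this result to employees.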

2.2 Finding Employees Who Changed Departments (More Complex Example)

Suppose we want to find employees who were once employed (are in former_employees) but are currently employed in a different department. This requires a more nuanced approach. We can use a combination of EXCEPT and INTERSECT (which finds common rows) or subqueries.

Method 1: Using INTERSECT and EXCEPT

```sql
-- Find employees who are in both tables (i.e., were once employed)
SELECT employee_id
FROM employees
INTERSECT
SELECT employee_id
FROM former_employees

EXCEPT

-- Find employees who are in both tables AND have the same department
SELECT e.employee_id
FROM employees e
JOIN former_employees fe ON e.employee_id = fe.employee_id
WHERE e.department = fe.department;
```

Explanation:

  1. INTERSECT part: The first part of the query uses INTERSECT to find employee_id values that exist in both the employees and former_employees tables. This identifies employees who were once employed and are currently employed.
  2. JOIN and WHERE part: The second part uses a JOIN and WHERE clause to find employees who are in both tables and have the same department in both tables.
  3. EXCEPT: The EXCEPT operator then subtracts the second result set (employees with the same department) from the first result set (employees in both tables). This leaves us with only the employee_id values of employees who were once employed and are currently employed in a different department.
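Method 1 works without parentheses because INTERSECT binds more tightly than EXCEPT, so the INTERSECT is evaluated first. When the intended grouping is different, parentheses are needed. The sketch below uses made-up inline sets a, b, and c to show how the grouping changes the result.

```sql
-- Parsed as: a EXCEPT (b INTERSECT c)
SELECT n FROM (VALUES (1), (2), (3)) AS a(n)
EXCEPT
SELECT n FROM (VALUES (2), (3)) AS b(n)
INTERSECT
SELECT n FROM (VALUES (3)) AS c(n);
-- returns 1 and 2

-- Explicit grouping: (a EXCEPT b) INTERSECT c
(SELECT n FROM (VALUES (1), (2), (3)) AS a(n)
 EXCEPT
 SELECT n FROM (VALUES (2), (3)) AS b(n))
INTERSECT
SELECT n FROM (VALUES (3)) AS c(n);
-- returns no rows
```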

Method 2: Using Subqueries and NOT IN

```sql
SELECT e.employee_id, e.first_name, e.last_name, e.department
FROM employees e
WHERE e.employee_id IN (SELECT employee_id FROM former_employees)
AND NOT (e.employee_id, e.department) IN (SELECT employee_id, department FROM former_employees);
```

Explanation:

  1. e.employee_id IN (SELECT employee_id FROM former_employees): This checks if the employee_id from the employees table exists in the former_employees table. This identifies employees who were once employed.
  2. NOT (e.employee_id, e.department) IN (SELECT employee_id, department FROM former_employees): This is the crucial part. It checks if the combination of employee_id and department from the employees table exists in the former_employees table. The NOT negates this, ensuring we only select employees whose current employee_id and department combination is not found in former_employees. This effectively identifies department changes.

Result (with the sample data as inserted above):

Method 1 returns only the ID:

| employee_id |
| :---------- |
| 3 |

Method 2 returns the full current row:

| employee_id | first_name | last_name | department |
| :---------- | :--------- | :-------- | :---------- |
| 3 | Charlie | Brown | Engineering |

Charlie Brown is listed in former_employees under Finance but currently works in Engineering, so he qualifies. Alice Smith is in both tables with the same department and is filtered out.

To see the opposite case, let’s change Charlie Brown’s department in employees to ‘Finance’:

```sql
UPDATE employees SET department = 'Finance' WHERE employee_id = 3;
```

Now, re-run the queries above. Both return an empty result set, because Charlie’s department is the same in both tables and no employee has changed departments.
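When the old and new departments should appear side by side, a plain join is a possible alternative, since set operators can return only the compared columns. The following is a sketch, not one of the two methods above. It shadows the tables with CTEs holding a reduced copy of the original sample data, and the column aliases old_department and new_department are made up for illustration.

```sql
WITH employees(employee_id, department) AS (
    VALUES (1, 'Sales'), (3, 'Engineering'), (4, 'Sales')
),
former_employees(employee_id, department) AS (
    VALUES (1, 'Sales'), (3, 'Finance'), (6, 'IT')
)
SELECT e.employee_id,
       fe.department AS old_department,
       e.department  AS new_department
FROM employees e
JOIN former_employees fe ON e.employee_id = fe.employee_id
-- IS DISTINCT FROM treats NULL as a comparable value, unlike <>
WHERE e.department IS DISTINCT FROM fe.department;
-- returns one row: 3, Finance, Engineering
```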

2.3 Identifying New Employees

To find employees who are new (present in employees but not in former_employees), we can simply use the initial EXCEPT example. However, let’s say we have a third table, new_hires, that contains employees hired within the last month. We want to identify employees who are new hires and are not former employees.

```sql
CREATE TABLE new_hires (
    employee_id INT PRIMARY KEY,
    first_name VARCHAR(255),
    last_name VARCHAR(255),
    department VARCHAR(255),
    hire_date DATE
);
INSERT INTO new_hires (employee_id, first_name, last_name, department, hire_date) VALUES
(2, 'Bob', 'Johnson', 'Marketing', '2024-01-15'),
(7, 'Grace', 'Hopper', 'Engineering', '2024-01-20');
```

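Now a single EXCEPT gives the answer. We select only the four columns the two tables share, because hire_date has no counterpart in former_employees: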

```sql
SELECT employee_id, first_name, last_name, department
FROM new_hires

EXCEPT

SELECT employee_id, first_name, last_name, department
FROM former_employees;
```

Result:

| employee_id | first_name | last_name | department |
| :---------- | :--------- | :-------- | :---------- |
| 2 | Bob | Johnson | Marketing |
| 7 | Grace | Hopper | Engineering |

This correctly identifies Bob Johnson and Grace Hopper as new hires who are not former employees. If Alice Smith were also in new_hires with the same four column values as her row in former_employees, she would not appear in this result.

2.4. EXCEPT ALL

The EXCEPT ALL operator is a variant of EXCEPT that does consider duplicates. If a row appears n times in the first result set and m times in the second result set, and n > m, then the row will appear n - m times in the final result. If n <= m, the row will not appear in the result.

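Let’s add a duplicate row to employees. Because employee_id is the primary key, the table would reject a second row with employee_id 2, so for this demonstration we first drop the constraint (employees_pkey is the name PostgreSQL assigns to it by default):

```sql
ALTER TABLE employees DROP CONSTRAINT employees_pkey;

INSERT INTO employees (employee_id, first_name, last_name, department) VALUES
(2, 'Bob', 'Johnson', 'Marketing'); -- Duplicate Bob Johnson
```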

Now, let’s use EXCEPT ALL:

```sql
SELECT employee_id, first_name, last_name, department
FROM employees

EXCEPT ALL

SELECT employee_id, first_name, last_name, department
FROM former_employees;
```

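Result:

| employee_id | first_name | last_name | department |
| :---------- | :--------- | :-------- | :---------- |
| 2 | Bob | Johnson | Marketing |
| 2 | Bob | Johnson | Marketing |
| 4 | David | Lee | Sales |
| 5 | Eve | Williams | HR |

The result now includes two rows for Bob Johnson, because his row appears twice in employees and not at all in former_employees (n = 2, m = 0). The EXCEPT operator (without ALL) would only return one row for Bob Johnson. Charlie Brown is absent because, after the UPDATE in section 2.2, his row is identical in both tables.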
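The n - m rule is easier to see when the same value occurs on both sides. The following sketch uses made-up inline data (aliases a, b, and v are placeholders) in which 'x' appears three times on the left and once on the right, and 'y' appears once on the left and twice on the right.

```sql
SELECT v FROM (VALUES ('x'), ('x'), ('x'), ('y')) AS a(v)
EXCEPT ALL
SELECT v FROM (VALUES ('x'), ('y'), ('y')) AS b(v);
-- returns 'x' twice (3 - 1 = 2); 'y' is dropped because 1 <= 2
```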

3. Comparing EXCEPT with Other Techniques

While EXCEPT is a powerful tool, there are alternative ways to achieve similar results in PostgreSQL. Understanding these alternatives and their trade-offs is crucial for choosing the best approach for a given situation.

3.1. NOT EXISTS

The NOT EXISTS clause, used with a correlated subquery, is a common alternative to EXCEPT.

```sql
SELECT e.employee_id, e.first_name, e.last_name, e.department
FROM employees e
WHERE NOT EXISTS (
    SELECT 1
    FROM former_employees fe
    WHERE e.employee_id = fe.employee_id
    AND e.first_name = fe.first_name
    AND e.last_name = fe.last_name
    AND e.department = fe.department
);
```

Explanation:

  • SELECT e.employee_id, ... FROM employees e: This is the outer query, selecting from the employees table.
  • WHERE NOT EXISTS (...): This clause filters the outer query’s results. It only includes rows from employees for which the subquery returns no rows.
  • SELECT 1 FROM former_employees fe WHERE ...: This is the correlated subquery. It’s correlated because it references the e alias from the outer query. The SELECT 1 is a common practice; we only care if any row exists, not the specific value.
  • e.employee_id = fe.employee_id AND ...: This is the correlation condition. It compares all the columns from the employees row (e) to the corresponding columns in the former_employees table (fe).

NOT EXISTS often performs very well: PostgreSQL can plan it as an anti-join, and indexes on the correlated columns can be used to probe former_employees. It’s generally a good choice when you need to compare entire rows and have indexes on the join columns.
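One caveat: the NOT EXISTS query above is not an exact drop-in for EXCEPT. The = operator never matches two NULLs, and duplicate rows from the outer query are kept. A closer equivalent, sketched here with made-up inline data (the CTE names a and b are placeholders), uses IS NOT DISTINCT FROM together with DISTINCT.

```sql
WITH a(id, note) AS (
    VALUES (1, NULL::text), (2, 'b'), (2, 'b')
),
b(id, note) AS (
    VALUES (1, NULL::text)
)
SELECT DISTINCT a.id, a.note          -- DISTINCT mimics EXCEPT's de-duplication
FROM a
WHERE NOT EXISTS (
    SELECT 1
    FROM b
    -- IS NOT DISTINCT FROM treats two NULLs as equal, like EXCEPT does
    WHERE a.id   IS NOT DISTINCT FROM b.id
      AND a.note IS NOT DISTINCT FROM b.note
);
-- returns one row: (2, 'b'), the same result as a EXCEPT b
```

With plain = in the subquery, the (1, NULL) row would also be returned. The NULL-safe comparison may prevent the planner from using ordinary indexes, so it is worth reserving for columns that can actually contain NULL.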

3.2. LEFT JOIN … WHERE … IS NULL

Another common technique involves using a LEFT JOIN and filtering for NULL values in the right-hand table.

```sql
SELECT e.employee_id, e.first_name, e.last_name, e.department
FROM employees e
LEFT JOIN former_employees fe
    ON e.employee_id = fe.employee_id
    AND e.first_name = fe.first_name
    AND e.last_name = fe.last_name
    AND e.department = fe.department
WHERE fe.employee_id IS NULL;
```

Explanation:

  • LEFT JOIN: This join includes all rows from the left table (employees) and matching rows from the right table (former_employees). If there’s no match in former_employees, the columns from former_employees will be filled with NULL values.
  • ON ...: The join condition specifies how rows from the two tables are matched.
  • WHERE fe.employee_id IS NULL: This is the crucial part. It filters the results to include only rows where fe.employee_id is NULL. This indicates that there was no matching row in former_employees, effectively achieving the same result as EXCEPT.

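This method can also be efficient, especially with indexes. Note that after the IS NULL filter every column from former_employees is NULL in the output, so the practical advantage over EXCEPT is that you can return additional columns from employees that are not part of the comparison. As with NOT EXISTS, duplicates in employees are kept and NULL values in the compared columns never match, so the result can differ from EXCEPT in those cases.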

3.3. NOT IN (with Subquery)

While NOT IN can sometimes be used, it’s generally not recommended as a replacement for EXCEPT or NOT EXISTS, especially with large datasets.

```sql
SELECT e.employee_id, e.first_name, e.last_name, e.department
FROM employees e
WHERE (e.employee_id, e.first_name, e.last_name, e.department) NOT IN (
    SELECT employee_id, first_name, last_name, department
    FROM former_employees
);
```

Explanation:

  • This query uses NOT IN with a subquery to select rows from employees where the combination of all columns is not present in the result of the subquery (which selects all rows from former_employees).

Performance Considerations:

  • NOT IN with NULL Values: NOT IN can behave unexpectedly if the subquery returns any NULL values. With a single-column NOT IN, one NULL in the subquery’s result means the condition can never evaluate to true: it is either false or UNKNOWN, and WHERE discards the row, so the outer query returns nothing. With a multi-column comparison like the one above, the same thing happens to every outer row that matches a subquery row on its non-NULL columns. This is a common source of errors and confusion. NOT EXISTS does not have this problem.
  • Performance Degradation: NOT IN can lead to poor performance with large subqueries. PostgreSQL can speed it up with a hashed subplan when the subquery result fits in work_mem; otherwise it may scan the subquery result for each row of the outer query. Because of the NULL semantics described above, NOT IN generally cannot be planned as an anti-join the way NOT EXISTS can.
  • Composite NOT IN: Comparing several columns at once, as the query above does, adds work to every comparison, so it can be slower still.
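The NULL pitfall is easy to reproduce. In this sketch the inline sets a and b are made up for illustration, and b contains one NULL.

```sql
-- NOT IN: the NULL in b makes the condition UNKNOWN for n = 1,
-- and n = 2 is a real match, so nothing qualifies.
WITH a(n) AS (VALUES (1), (2)),
     b(m) AS (VALUES (2), (NULL::int))
SELECT n FROM a
WHERE n NOT IN (SELECT m FROM b);
-- returns no rows

-- NOT EXISTS: the NULL simply fails to match anything.
WITH a(n) AS (VALUES (1), (2)),
     b(m) AS (VALUES (2), (NULL::int))
SELECT n FROM a
WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.m = a.n);
-- returns 1
```

EXCEPT applied to the same two sets also returns 1.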

4. Performance Tuning for EXCEPT

Optimizing queries using EXCEPT (or its alternatives) often involves the same principles as optimizing other SQL queries:

  • Indexes: PostgreSQL typically evaluates EXCEPT by reading both inputs in full and then hashing or sorting them, so indexes on the compared columns matter less here than they do for a join. Indexes help most when a WHERE clause inside either query can use them to shrink the input, or when the query is rewritten as NOT EXISTS or a LEFT JOIN. For our example, indexes on employees.employee_id and former_employees.employee_id, and potentially composite indexes on (employee_id, first_name, last_name, department), mainly benefit those rewrites, where the planner can use them to locate matching rows quickly. Confirm any expected gain with EXPLAIN ANALYZE.
  • Query Structure: In some cases, rewriting the query using NOT EXISTS or a LEFT JOIN with a NULL check might be more efficient than EXCEPT, even with indexes. Experiment with different approaches and use EXPLAIN ANALYZE (see below) to compare their performance.
  • EXPLAIN ANALYZE: This is your most valuable tool for performance analysis. Run your query with EXPLAIN ANALYZE prepended:

    ```sql
    EXPLAIN ANALYZE
    SELECT ...
    FROM ...
    EXCEPT
    SELECT ...
    FROM ...;
    ```

    EXPLAIN ANALYZE will execute the query and provide detailed information about the execution plan, including the time spent in each step, the number of rows processed, and whether indexes were used. This allows you to identify bottlenecks and optimize your query accordingly.

  • Statistics: Ensure that PostgreSQL has up-to-date statistics about your tables. Run ANALYZE table_name; periodically, especially after significant data changes. Accurate statistics help the query planner choose the most efficient execution plan.

  • Avoid SELECT *: Explicitly list the columns you need in your SELECT statements. SELECT * can retrieve unnecessary data, slowing down the query. This is particularly important with EXCEPT because the column matching is crucial.
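To compare approaches on something larger than five rows, a scratch setup along the following lines can help. Everything here is hypothetical: the temporary tables t_current and t_former, their generated contents, and the index name are invented for the experiment, and the plans and timings will vary with PostgreSQL version and configuration.

```sql
-- Scratch tables with generated rows; half of the IDs overlap.
CREATE TEMP TABLE t_current AS
SELECT g AS id, 'name_' || g AS name
FROM generate_series(1, 100000) AS g;

CREATE TEMP TABLE t_former AS
SELECT g AS id, 'name_' || g AS name
FROM generate_series(50001, 150000) AS g;

-- Index that the NOT EXISTS variant may use to probe t_former.
CREATE INDEX t_former_id_name_idx ON t_former (id, name);

-- Refresh planner statistics after the bulk load.
ANALYZE t_current;
ANALYZE t_former;

EXPLAIN ANALYZE
SELECT id, name FROM t_current
EXCEPT
SELECT id, name FROM t_former;

EXPLAIN ANALYZE
SELECT c.id, c.name
FROM t_current c
WHERE NOT EXISTS (
    SELECT 1 FROM t_former f
    WHERE f.id = c.id AND f.name = c.name
);
```

Comparing the two plans shows which node types the planner chose and where the time went. Both queries return the same rows here because the generated data contains no duplicates and no NULLs.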

5. Advanced Use Cases and Considerations

  • Data Auditing: EXCEPT is excellent for comparing snapshots of data taken at different times to identify changes. For example, you could compare a table of customer data from yesterday with a table of customer data from today to find new customers, deleted customers, or customers whose information has changed.

  • Data Reconciliation: When integrating data from different sources, EXCEPT can help identify discrepancies. For example, if you’re merging data from two different CRM systems, you could use EXCEPT to find customers who are present in one system but not the other.

  • Finding Missing Data: If you expect a certain set of data to be present, EXCEPT can help you find missing rows.

  • Combining with Other Set Operators: EXCEPT can be combined with other set operators like UNION (combines results and removes duplicates), UNION ALL (combines results and keeps duplicates), and INTERSECT (finds common rows).

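  • Common Table Expressions (CTEs): For complex queries involving multiple EXCEPT operations or subqueries, using CTEs can improve readability. CTEs allow you to define named subqueries that can be reused within the main query. Their effect on performance depends on the version: since PostgreSQL 12, a non-recursive, side-effect-free CTE that is referenced only once is inlined into the main query, while earlier versions always materialize each CTE.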

    ```sql
    WITH
    CurrentEmployees AS (
        SELECT employee_id, first_name, last_name, department
        FROM employees
    ),
    FormerEmployees AS (
        SELECT employee_id, first_name, last_name, department
        FROM former_employees
    )
    SELECT *
    FROM CurrentEmployees
    EXCEPT
    SELECT *
    FROM FormerEmployees;
    ```
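For the auditing and reconciliation cases above, differences in both directions are usually wanted. One possible pattern, sketched below, runs EXCEPT each way and labels the two halves before combining them with UNION ALL. The snapshot CTEs customers_yesterday and customers_today, their contents, and the side column are invented for illustration.

```sql
WITH customers_yesterday(customer_id, email) AS (
    VALUES (1, 'a@example.com'), (2, 'b@example.com'), (3, 'c@example.com')
),
customers_today(customer_id, email) AS (
    VALUES (1, 'a@example.com'), (2, 'b2@example.com'), (4, 'd@example.com')
)
-- Rows that existed yesterday but have no identical row today
SELECT 'only_in_yesterday' AS side, d.*
FROM (SELECT * FROM customers_yesterday
      EXCEPT
      SELECT * FROM customers_today) AS d
UNION ALL
-- Rows that exist today but had no identical row yesterday
SELECT 'only_in_today' AS side, d.*
FROM (SELECT * FROM customers_today
      EXCEPT
      SELECT * FROM customers_yesterday) AS d
ORDER BY customer_id, side;
```

This returns four rows. Customer 2 appears on both sides because the email changed, customer 3 appears only in yesterday's snapshot (deleted), and customer 4 appears only in today's (new). Customer 1 is unchanged and does not appear.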

6. Conclusion

PostgreSQL’s EXCEPT operator is a powerful and versatile tool for finding differences between rows in two result sets. Its concise syntax and clear semantics make it a valuable asset for various data manipulation and analysis tasks. By understanding its behavior, performance considerations, and alternatives like NOT EXISTS and LEFT JOIN, you can effectively leverage EXCEPT to gain valuable insights from your data. Remember to utilize indexing and EXPLAIN ANALYZE to ensure optimal performance, and consider using CTEs to improve the readability of complex queries. With careful planning and optimization, EXCEPT can be a key component of your SQL toolkit.
