How to Break a MySQL Left Join
Posted on February 19th, 2008 by Greg Allard in MySQL |
Care must be taken when placing conditions on the results of the right-hand table of a LEFT JOIN because it could easily become a normal JOIN since MySQL is using a NULL row for every column in the right-hand table when no records exist.
Since I like to use a soft delete on all tables (hidden but not deleted), I need to put a condition to alter the results from the right-hand table. The proper way to place a condition on this table is to use:
... WHERE ( right_table.foo = 'bar' -- the condition you want OR right_table.foo IS NULL -- it's cool if it's empty ) ...
Here is an Example
Here are some example SQL statements to see exactly what is happening. First we create two tables with a one-to-many relationship.
CREATE TABLE tasks ( -- left task_id MEDIUMINT( 5 ) NOT NULL AUTO_INCREMENT PRIMARY KEY , task_text TEXT NOT NULL , task_deleted TINYINT( 1 ) NOT NULL , task_last_mod INT( 11 ) NOT NULL , task_created INT( 11 ) NOT NULL ) ENGINE = MYISAM; CREATE TABLE logs ( -- right log_id MEDIUMINT( 6 ) NOT NULL AUTO_INCREMENT PRIMARY KEY , task_id MEDIUMINT( 5 ) NOT NULL , log_text TEXT NOT NULL , log_length INT( 11 ) NOT NULL , log_deleted TINYINT( 1 ) NOT NULL , log_last_mod INT( 11 ) NOT NULL , log_created INT( 11 ) NOT NULL ) ENGINE = MYISAM;
After creating the tables we are ready to add some testing data.
INSERT INTO tasks ( task_id , task_text , task_deleted , task_last_mod , task_created ) VALUES ( 1 , 'Get me a beer.', 0, 1203372686 , 1203372686 ), ( 2 , 'Something ridiculous.', 0, 1203372686 , 1203372686 ), ( 3 , 'a logless task', 0, 1203372686, 1203372686 ); INSERT INTO logs ( log_id , task_id , log_text , log_length , log_deleted , log_last_mod , log_created ) VALUES ( 1, 1, 'got a beer', 120, 0, 1203372686, 1203372686 ), ( 2, 2, 'this took forever', 900, 0, 1203372686, 1203372686 ), ( 3, 1, 'a deleted one', 900, 1, 1203372686, 1203372686 ), ( 4, 1, 'moar beer', 120, 0, 1203372686, 1203372686 );
Simple SELECT
This will get all tasks, count the logs, and sum the logged time whether there is a log or not for the task.
SELECT task_text, count(tasks.task_id) AS log_count, sum(log_length) AS total_time FROM tasks LEFT JOIN logs ON tasks.task_id = logs.task_id WHERE task_deleted = 0 GROUP BY logs.task_id;
This query produces expected results as seen below.
task_text log_count total_time a logless task 1 NULL Get me a beer. 3 1140 Something ridiculous. 1 900
Improper Conditional SELECT
If we were to place a condition on the right-hand table to limit the results to only count and sum the logs that are active, at first we might try:
SELECT task_text, count(tasks.task_id) AS log_count, sum(log_length) AS total_time FROM tasks LEFT JOIN logs ON tasks.task_id = logs.task_id WHERE task_deleted = 0 AND log_deleted = 0 -- this breaks the left join GROUP BY logs.task_id;
This produces the results in the following table. Notice that the logless task has disappeared and that the log count and sum for the Get me a beer task has changed to the desired values. This is the same as a normal JOIN statement.
task_text log_count total_time Get me a beer. 2 240 Something ridiculous. 1 900
The Proper Way
The proper way to achieve the desired results is to use the following query:
SELECT task_text, count(tasks.task_id) AS log_count, sum(log_length) AS total_time FROM tasks LEFT JOIN logs ON tasks.task_id = logs.task_id WHERE task_deleted = 0 AND ( log_deleted = 0 OR log_deleted IS NULL ) GROUP BY logs.task_id;
Producing:
task_text log_count total_time a logless task 1 NULL Get me a beer. 2 240 Something ridiculous. 1 900
As you can see the logless task is back and the values for Get me a beer are accurate.
7 Responses
[...] up correctly and when I realized that I was doing a LEFT JOIN improperly, I fixed that and then wrote a post about it on Code [...]
Why not do this?
[...]FROM tasks LEFT JOIN logs ON tasks.task_id = logs.task_id AND logs.log_deleted = 0
WHERE tasks.task_deleted = 0[...]
You want to restrict which rows are pulled for the right-hand side without changing the total number of rows, which is determined by the left-hand side criteria. Adding additional conditions in ON seems perfect for that.
Thanks for the input, Brian. That is a great idea and works the same. I didn’t think of adding the condition there.
My experience with this sort of join, in MSSQL is that putting an or on the join will cause the query to lose a LOT of performance. Not sure how you could measure this in mysql. Of course this was on large datasets.
ORs typically are very expensive.
Oh, it might be better to write the query like this:
SELECT task_text, count(tasks.task_id) AS log_count,
sum(log_length) AS total_time
FROM tasks LEFT JOIN logs ON tasks.task_id = logs.task_id
WHERE task_deleted = 0
AND isnull(log_deleted, 0) = 0
GROUP BY logs.task_id;
Thanks for the responses. I’ll have to test out the performance of using OR in MySQL and see if it has similar performance loss.
It looks like that would only be an alternative for MSSQL since the ISNULL function only has one parameter for MySQL. I’ll keep an eye out for something similar in MySQL.
Ah it would be COALESCE(log_deleted, 0) = 0 for MySQL