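MySQL Notes - Subqueries
Subqueries are fairly easy to understand,
but they are also a construct that is easy to get wrong.
Before MySQL 5.6, subquery performance was poor.
In practice, IN is almost the only subquery form in everyday use; ANY, SOME, and ALL are rarely used and only show up in specific scenarios.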
operand comparison_operator ANY (subquery)
operand IN (subquery)
operand comparison_operator SOME (subquery)
operand comparison_operator ALL (subquery)
Using subqueries
The ANY keyword means: return TRUE if the comparison is TRUE for any one of the values in the column returned by the subquery.
select s1 from t1 where s1 > any (select s1 from t2);
SOME is an alias for ANY.
IN is equivalent to = ANY:
select s1 from t1 where s1 = any (select s1 from t2);
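The relationships between these operators can be checked on a tiny data set. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL. SQLite does not support ANY, SOME, or ALL, so each quantified comparison is written as the EXISTS form it is logically equivalent to when no NULLs are involved. The table contents (t1 = 1, 5, 10 and t2 = 3, 5) are made up for illustration.

```python
import sqlite3

# Hypothetical data: t1 holds 1, 5, 10 and t2 holds 3, 5 (no NULLs).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (s1 INTEGER);
    CREATE TABLE t2 (s1 INTEGER);
    INSERT INTO t1 VALUES (1), (5), (10);
    INSERT INTO t2 VALUES (3), (5);
""")

def column(sql):
    """Run a query and return its first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

# s1 > ANY (subquery): true if s1 is greater than at least one t2 value.
gt_any = column(
    "SELECT s1 FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t1.s1 > t2.s1)")

# s1 = ANY (subquery): true if s1 equals at least one t2 value ...
eq_any = column(
    "SELECT s1 FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t1.s1 = t2.s1)")
# ... which is exactly what IN tests.
in_rows = column("SELECT s1 FROM t1 WHERE s1 IN (SELECT s1 FROM t2)")

# s1 > ALL (subquery): true if no t2 value is greater than or equal to s1.
gt_all = column(
    "SELECT s1 FROM t1 WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t1.s1 <= t2.s1)")

print(gt_any)   # [5, 10]
print(eq_any)   # [5]
print(in_rows)  # [5]
print(gt_all)   # [10]
```

On MySQL itself the same checks can be written directly with > ANY, = ANY, and > ALL.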
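Differences between a subquery and an inner join
The examples assume table a holds x = 1, 2, 3 and table b initially holds y = 1, 2. The statements used are: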
insert into b select 2;
select x from a where x in (select y from b);
select x from a,b where a.x = b.y;
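IN effectively deduplicates the values returned by the subquery, (1,2,2) -> (1,2), and then tests whether x is in (1,2) rather than in (1,2,2). So what matters is whether y in b is unique: if it is, rewriting the IN as a join is fine; if it is not, the join returns duplicate rows.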
select x from a where x in (select y from b);
+------+
| x |
+------+
| 1 |
| 2 |
+------+
select x from a,b where a.x = b.y;
+------+
| x |
+------+
| 1 |
| 2 |
+------+
insert into b select 2;
Query OK, 1 row affected (0.05 sec)
Records: 1 Duplicates: 0 Warnings: 0
select x from a where x in (select y from b);
+------+
| x |
+------+
| 1 |
| 2 |
+------+
select x from a,b where a.x = b.y;
+------+
| x |
+------+
| 1 |
| 2 |
| 2 |
+------+
A derived table fixes the join: a subquery with the DISTINCT keyword first produces the deduplicated derived table c, which is then joined.
select * from a,(select distinct y from b) c where a.x = c.y;
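The three variants can be compared side by side. This is a minimal sketch using Python's built-in sqlite3 module instead of MySQL; the table contents (a.x = 1, 2, 3 and b.y = 1, 2, 2) are inferred from the outputs above, and the helper name column is made up for the example.

```python
import sqlite3

# Assumed setup, inferred from the outputs above: a.x = 1, 2, 3 and b.y = 1, 2, 2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (y INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (2), (2);
""")

def column(sql):
    """Run a query and return its first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

# IN is a membership test: each row of a appears at most once.
in_rows = column("SELECT x FROM a WHERE x IN (SELECT y FROM b)")

# A plain join emits one row per matching pair, so x = 2 appears twice.
join_rows = column("SELECT x FROM a, b WHERE a.x = b.y")

# Joining against a DISTINCT derived table restores the IN result.
derived_rows = column(
    "SELECT x FROM a, (SELECT DISTINCT y FROM b) c WHERE a.x = c.y")

print(in_rows)       # [1, 2]
print(join_rows)     # [1, 2, 2]
print(derived_rows)  # [1, 2]
```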
insert into b select NULL;
select * from a where a.x not in (select y from b);
Empty set (0.00 sec) -- once b contains a NULL, the row 3 from a is no longer returned
delete from b where y is NULL;
select * from a where a.x not in (select y from b);
+------+
| x |
+------+
| 3 |
+------+
select a.x from a left join (select distinct y from b) c on a.x = c.y where c.y is null; -- returns the values that are in a but not in b, even when b contains 1, 2, 2, NULL
+------+
| x |
+------+
| 3 |
+------+
select * from a where a.x not in (select y from b where y is not null);
+------+
| x |
+------+
| 3 |
+------+
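So a column that defaults to NULL in the table definition can hide pitfalls, for example:
select 3 not in (1,2,2,NULL); -- returns NULL, which is why the NOT IN query above returned no rows
select 3 not in (1,2,2); -- returns 1, i.e. TRUE, because the list contains no NULL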
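The NULL result comes from SQL's three-valued logic: x NOT IN (v1, v2, ...) expands to x <> v1 AND x <> v2 AND ..., and any comparison with NULL is itself NULL. The sketch below checks this with Python's built-in sqlite3 module, used here as a stand-in for MySQL on the assumption that both follow the standard semantics for these expressions; SQL NULL is returned to Python as None.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def scalar(sql):
    """Evaluate a one-row, one-column query; SQL NULL comes back as None."""
    return conn.execute(sql).fetchone()[0]

# No value matches 3, but the NULL makes the overall answer unknown.
with_null = scalar("SELECT 3 NOT IN (1, 2, 2, NULL)")

# Without the NULL the answer is a definite TRUE.
without_null = scalar("SELECT 3 NOT IN (1, 2, 2)")

# The expanded form of the first expression gives the same NULL.
expanded = scalar("SELECT 3 <> 1 AND 3 <> 2 AND 3 <> NULL")

# A definite match still yields FALSE, even with a NULL in the list.
definite_match = scalar("SELECT 2 NOT IN (1, 2, 2, NULL)")

print(with_null)       # None
print(without_null)    # 1
print(expanded)        # None
print(definite_match)  # 0
```

A WHERE clause keeps a row only when its condition is TRUE, so a NULL result filters the row out just like FALSE does.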
The EXISTS predicate
Returns only TRUE or FALSE.
UNKNOWN is returned as FALSE.
It never returns NULL, whereas NOT IN can return 0, 1, or NULL.
SELECT customerid, companyname
FROM customers AS A
WHERE country = 'Spain'
AND EXISTS
(SELECT * FROM orders AS B
WHERE A.customerid = B.customerid);
EXISTS rewritten as IN
SELECT customerid, companyname
FROM customers AS A
WHERE country = 'Spain'
AND customerid IN (SELECT customerid FROM orders);
select * from a where a.x in (select y from b);
=>
select * from a where exists (select * from b where a.x = b.y);
+------+
| x |
+------+
| 1 |
| 2 |
+------+
The biggest advantage of subqueries is that they are easy to understand. One more key point to understand:
select * from a where exists (select 1 from b where a.x = b.y);
select * from a where exists (select NULL from b where a.x = b.y);
The result set is still the same:
+------+
| x |
+------+
| 1 |
| 2 |
+------+
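EXISTS only asks whether the subquery returns any row. Whether that row contains 1, several columns, or a single NULL makes no difference: each case still counts as one returned row. If the condition does not match, no row is returned at all, whereas selecting NULL still yields a row (containing NULL), so EXISTS holds. The outcome depends only on the matching condition; the result set just has to be non-empty. Even so, select * is still not recommended inside EXISTS.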
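That the select list inside EXISTS is irrelevant can be checked directly. This is a small sketch using Python's built-in sqlite3 module rather than MySQL, with a.x = 1, 2, 3 and b.y = 1, 2, 2 assumed as in the earlier examples; the dictionary and variable names are made up for the example.

```python
import sqlite3

# Assumed setup: a.x = 1, 2, 3 and b.y = 1, 2, 2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (y INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (2), (2);
""")

# Three different select lists inside the same correlated EXISTS.
select_lists = ["*", "1", "NULL"]

results = {}
for select_list in select_lists:
    sql = (
        "SELECT x FROM a WHERE EXISTS "
        f"(SELECT {select_list} FROM b WHERE a.x = b.y)"
    )
    results[select_list] = sorted(row[0] for row in conn.execute(sql))

for select_list, rows in results.items():
    print(select_list, rows)  # every variant prints [1, 2]
```

Only the correlation condition a.x = b.y decides whether a row of a is kept.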
NOT EXISTS
NOT EXISTS likewise returns only 0 or 1.
select * from b;
+------+
| y |
+------+
| 1 |
| 2 |
| 2 |
| NULL |
+------+
select * from a where a.x not in (select y from b);
Empty set (0.01 sec)
select * from a where not exists (select * from b where a.x = b.y);
+------+
| x |
+------+
| 3 |
+------+
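IN and EXISTS, and likewise NOT IN and NOT EXISTS, behave essentially the same, but once NULL values are involved they diverge. That is the difference between the two.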
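The divergence can be reproduced with a few lines. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL, with a.x = 1, 2, 3 and b.y = 1, 2, 2, NULL as shown above; the helper name column is made up for the example.

```python
import sqlite3

# Setup as shown above: a.x = 1, 2, 3 and b.y = 1, 2, 2, NULL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (y INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (2), (2), (NULL);
""")

def column(sql):
    """Run a query and return its first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

# NOT IN evaluates to NULL for x = 3 because of the NULL in b, so no row survives.
not_in = column("SELECT x FROM a WHERE x NOT IN (SELECT y FROM b)")

# NOT EXISTS only checks whether a matching row exists; the NULL row never matches.
not_exists = column(
    "SELECT x FROM a WHERE NOT EXISTS (SELECT 1 FROM b WHERE a.x = b.y)")

# Filtering NULLs out of the subquery makes NOT IN agree with NOT EXISTS.
not_in_filtered = column(
    "SELECT x FROM a WHERE x NOT IN (SELECT y FROM b WHERE y IS NOT NULL)")

print(not_in)           # []
print(not_exists)       # [3]
print(not_in_filtered)  # [3]
```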
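IN vs. EXISTS performance
Before 5.6, MySQL's optimization of IN subqueries was incomplete: every IN subquery was rewritten by the optimizer into a correlated EXISTS, and this rewrite performed poorly.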
SELECT ... FROM t1 WHERE t1.a IN (SELECT b FROM t2);
SELECT ... FROM t1 WHERE
EXISTS (SELECT 1 FROM t2 WHERE t2.b = t1.a);
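Subquery optimization: in this case the performance gap between IN and EXISTS is large.
-- Orders placed on the last actual order date of each month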
SELECT
*
FROM
dbt3.orders
WHERE
o_orderdate IN (SELECT
MAX(o_orderdate)
FROM
dbt3.orders b
GROUP BY (DATE_FORMAT(o_orderdate, '%Y%M')));
-- Here the GROUP BY inside IN runs only once
-- EXISTS form
SELECT
*
FROM
dbt3.orders a
WHERE
EXISTS( SELECT
MAX(o_orderdate)
FROM
dbt3.orders b
GROUP BY (DATE_FORMAT(o_orderdate, '%Y%M'))
HAVING MAX(o_orderdate) = a.o_orderdate);
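-- Here the GROUP BY inside EXISTS runs many times, and that is the problem:
-- with 100,000 rows in the outer table the GROUP BY also runs 100,000 times, that is 100,000 scans of the data.
-- The EXISTS form is a correlated subquery, so every execution of the subquery has to be correlated with a row of the outer table.
-- IN rewritten as a join against a derived table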
SELECT
*
FROM
dbt3.orders a,
(SELECT
MAX(o_orderdate) o_orderdate
FROM
dbt3.orders
GROUP BY (DATE_FORMAT(o_orderdate, '%Y%M'))) b
WHERE
a.o_orderdate = b.o_orderdate;
So here IN performs much better than EXISTS.
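The three forms differ in cost, not in result. The sketch below checks the result equivalence on a five-row table using Python's built-in sqlite3 module; it does not measure performance. The table is a made-up, two-column stand-in for dbt3.orders, and SQLite's strftime('%Y%m', ...) replaces MySQL's DATE_FORMAT.

```python
import sqlite3

# Hypothetical miniature of dbt3.orders: two months, last order dates
# 1995-01-28 (two orders) and 1995-02-17 (one order).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (o_orderkey INTEGER, o_orderdate TEXT);
    INSERT INTO orders VALUES
        (1, '1995-01-10'), (2, '1995-01-28'), (3, '1995-01-28'),
        (4, '1995-02-03'), (5, '1995-02-17');
""")

def keys(sql):
    """Run a query and return the order keys it selects, sorted."""
    return sorted(row[0] for row in conn.execute(sql))

# Uncorrelated IN: the grouped subquery does not depend on the outer row.
in_form = keys("""
    SELECT o_orderkey FROM orders
    WHERE o_orderdate IN (SELECT MAX(o_orderdate) FROM orders
                          GROUP BY strftime('%Y%m', o_orderdate))
""")

# Correlated EXISTS: the HAVING clause refers to the outer row a.
exists_form = keys("""
    SELECT a.o_orderkey FROM orders a
    WHERE EXISTS (SELECT MAX(b.o_orderdate) FROM orders b
                  GROUP BY strftime('%Y%m', b.o_orderdate)
                  HAVING MAX(b.o_orderdate) = a.o_orderdate)
""")

# Join against a derived table that is computed once.
derived_form = keys("""
    SELECT a.o_orderkey
    FROM orders a,
         (SELECT MAX(o_orderdate) AS o_orderdate FROM orders
          GROUP BY strftime('%Y%m', o_orderdate)) b
    WHERE a.o_orderdate = b.o_orderdate
""")

print(in_form)       # [2, 3, 5]
print(exists_form)   # [2, 3, 5]
print(derived_form)  # [2, 3, 5]
```

The derived-table join returns no duplicates here because each month contributes exactly one maximum date.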
Some examples
Find the current level, title, and salary of each employee in employees
SELECT
CONCAT(e.first_name, ' ', e.last_name) AS name,
d.dept_name,
s.salary,
t.title
FROM
employees e
LEFT JOIN
dept_manager dm ON e.emp_no = dm.emp_no
INNER JOIN
dept_emp de ON e.emp_no = de.emp_no
INNER JOIN
departments d ON d.dept_no = de.dept_no
INNER JOIN
salaries s ON s.emp_no = e.emp_no
INNER JOIN
titles t ON e.emp_no = t.emp_no
WHERE
dm.emp_no IS NULL;
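-- This is incorrect: salaries holds historical data, with one row per salary change, so different time ranges have different salary values and the join becomes one-to-many; dept_name from departments has the same problem
Get the row with the largest to_date for each employee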
SELECT
emp_no, title
FROM
titles
WHERE
(emp_no , to_date) IN (SELECT
emp_no, MAX(to_date)
FROM
titles
GROUP BY emp_no)
ORDER BY emp_no
LIMIT 10;
+--------+--------------------+
| emp_no | title |
+--------+--------------------+
| 10013 | Senior Staff |
| 10048 | Engineer |
| 10064 | Staff |
| 10070 | Technique Leader |
| 10363 | Assistant Engineer |
| 10364 | Senior Engineer |
| 10372 | Technique Leader |
| 10426 | Technique Leader |
| 10469 | Senior Engineer |
| 10632 | Technique Leader |
+--------+--------------------+
SELECT
emp_no, salary
FROM
salaries
WHERE
(emp_no , to_date) IN (SELECT
emp_no, MAX(to_date)
FROM
salaries
GROUP BY emp_no)
LIMIT 10;
+--------+--------+
| emp_no | salary |
+--------+--------+
| 13049 | 60266 |
| 14688 | 42041 |
| 15509 | 43807 |
| 16012 | 76142 |
| 18061 | 57737 |
| 20869 | 40000 |
| 21610 | 81589 |
| 24040 | 40000 |
| 24673 | 49838 |
| 24861 | 68066 |
+--------+--------+
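The pattern used here, a row-value IN against a grouped MAX, picks the latest row per employee. Below is a minimal sketch on made-up data using Python's built-in sqlite3 module; it assumes the bundled SQLite is 3.15 or newer, which is needed for row values. The four salary rows are hypothetical and only mimic the shape of the salaries table.

```python
import sqlite3

# Hypothetical miniature of salaries: two employees, two salary periods each.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE salaries (emp_no INTEGER, salary INTEGER, to_date TEXT);
    INSERT INTO salaries VALUES
        (10001, 60000, '1999-01-01'),
        (10001, 65000, '9999-01-01'),
        (10002, 70000, '2000-01-01'),
        (10002, 72000, '9999-01-01');
""")

# The subquery yields one (emp_no, latest to_date) pair per employee;
# the outer query keeps only the rows matching one of those pairs.
latest = conn.execute("""
    SELECT emp_no, salary
    FROM salaries
    WHERE (emp_no, to_date) IN (SELECT emp_no, MAX(to_date)
                                FROM salaries
                                GROUP BY emp_no)
    ORDER BY emp_no
""").fetchall()

print(latest)  # [(10001, 65000), (10002, 72000)]
```

Each employee contributes exactly one row, so joining this result to employees no longer multiplies rows.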
Final version
SELECT
e.emp_no,
CONCAT(last_name, ' ', first_name) AS name,
t.title,
dp.dept_name,
s.salary
FROM
employees e
LEFT JOIN
dept_manager d ON e.emp_no = d.emp_no
LEFT JOIN
(SELECT
emp_no, title
FROM
titles
WHERE
(emp_no , to_date) IN (SELECT
emp_no, MAX(to_date)
FROM
titles
GROUP BY emp_no)) t ON t.emp_no = e.emp_no
LEFT JOIN
(SELECT
dept_no, emp_no
FROM
dept_emp
WHERE
(emp_no , to_date) IN (SELECT
emp_no, MAX(to_date)
FROM
dept_emp
GROUP BY emp_no)) de ON de.emp_no = e.emp_no
LEFT JOIN
(SELECT
emp_no, salary
FROM
salaries
WHERE
(emp_no , to_date) IN (SELECT
emp_no, MAX(to_date)
FROM
salaries
GROUP BY emp_no)) s ON s.emp_no = e.emp_no
LEFT JOIN
departments dp ON dp.dept_no = de.dept_no
WHERE
d.emp_no IS NULL;
Summary
IN vs. EXISTS: rewriting IN as a join may require deduplication
IN can return NULL
EXISTS returns only TRUE or FALSE