数据库count函数能不能和mysql count distinctt

社区像册:
QQ群1:3336901(满)
文档翻译平台:
邮件列表:
     
PostgreSQL携手阿里云
distinct xx和count(distinct xx)的变态递归优化方法原作者:digoal/德哥  创作时间: 17:18:30+08  
doudou586 发布于 17:18:30  
评论: 2  
浏览: 308  
distinct xx和count(distinct xx)的变态递归优化方法 - 索引收敛(skip scan)扫描
作者: digoal
标签: PostgreSQL , 递归去重 , 递归优化 , count(distinct ), 稀疏列 , 统计
今天要说的这个优化是从前面一篇讲解《performance tuning case :use cursor or trigger replace group by and order by》
,/digoal@126/blog/static// 的延展.
例如一个表中有一个字段是性别, 这个表不管有多少条记录, 性别这个字段一般来说也就2个值
select count(distinct sex)
得到的结果当然是2. 但是如果数据量很大的情况下, 这种运算就非常耗时, 需要排序,去重。那么有什么优化手段呢?
PostgreSQL
digoal=& create table sex (sex char(1), otherinfo text);
CREATE TABLE
digoal=& insert into sex select 'm', generate_series(1,)||'this is test';
digoal=& insert into sex select 'w', generate_series(1,)||'this is test';
digoal=& \timing on
digoal=& select count(distinct sex)
digoal=& select sex fro
digoal=& explain select count(distinct sex)
QUERY PLAN
---------------------------------------------------------------------
(cost=.. rows=1 width=2)
Seq Scan on sex
(cost=0.00.. rows= width=2)
digoal=& explain select sex fro
QUERY PLAN
---------------------------------------------------------------------
HashAggregate
(cost=.. rows=2 width=2)
Seq Scan on sex t
(cost=0.00.. rows= width=2)
digoal=& create index idx_sex_1 on sex(sex);
CREATE INDEX
digoal=& set enable_seqscan=
使用索引后的执行计划, PostgreSQL可以使用Index Only Scan.
digoal=& explain select count(distinct sex)
QUERY PLAN
-------------------------------------------------------------------------------------------
(cost=.. rows=1 width=2)
Index Only Scan using idx_sex_1 on sex
(cost=0.00.. rows= width=2)
digoal=& explain select sex fro
QUERY PLAN
------------------------------------------------------------------------------------------
(cost=0.00.. rows=2 width=2)
Index Only Scan using idx_sex_1 on sex t
(cost=0.00.. rows= width=2)
创建索引后SQL耗时
digoal=& select count(distinct sex)
digoal=& select sex fro
SQL& create table sex(sex char(1), otherinfo varchar2(64));
Table created.
SQL& insert into sex select 'm', rownum||'this is test' from dual connect by level &=000001 rows created. SQL&
Commit complete.
SQL& insert into sex select 'w', rownum||'this is test' from dual connect by level &=000001 rows created. SQL&
Commit complete.
SQL& set autotrace on
SQL& set timing on
SQL& select count(distinct sex)
COUNT(DISTINCTSEX)
------------------
Elapsed: 00:00:03.62
Execution Plan
----------------------------------------------------------
Plan hash value:
----------------------------------------------------------
| Operation
| Name | Rows
| Bytes | Cost (%CPU)| Time
---------------------------------------------------------------------------
0 | SELECT STATEMENT
(3)| 00:02:38 |
SORT GROUP BY
TABLE ACCESS FULL| SEX
69M| 13106
(3)| 00:02:38 |
---------------------------------------------------------------------------
- dynamic sampling used for this statement
Statistics
----------------------------------------------------------
recursive calls
db block gets
consistent gets
physical reads
bytes sent via SQL*Net to client
bytes received via SQL*Net from client
SQL*Net roundtrips to/from client
sorts (memory)
sorts (disk)
rows processed
SQL& select sex fro
Elapsed: 00:00:03.23
Execution Plan
----------------------------------------------------------
Plan hash value:
---------------------------------------------------------------------------
| Operation
| Name | Rows
| Bytes | Cost (%CPU)| Time
---------------------------------------------------------------------------
0 | SELECT STATEMENT
69M| 14908
(14)| 00:02:59 |
HASH GROUP BY
69M| 14908
(14)| 00:02:59 |
TABLE ACCESS FULL| SEX
69M| 13106
(3)| 00:02:38 |
---------------------------------------------------------------------------
- dynamic sampling used for this statement
Statistics
----------------------------------------------------------
recursive calls
db block gets
consistent gets
physical reads
bytes sent via SQL*Net to client
bytes received via SQL*Net from client
SQL*Net roundtrips to/from client
sorts (memory)
sorts (disk)
rows processed
SQL& create index idx_sex_1 on sex(sex);
Index created.
Elapsed: 00:00:33.40
创建索引后的测试, 执行时间没有明显变化.
SQL& select count(distinct sex)
COUNT(DISTINCTSEX)
------------------
Elapsed: 00:00:04.32
Execution Plan
----------------------------------------------------------
Plan hash value:
-----------------------------------------------------------------------------------
| Operation
| Bytes | Cost (%CPU)| Time
-----------------------------------------------------------------------------------
0 | SELECT STATEMENT
(3)| 00:01:18 |
SORT GROUP BY
INDEX FAST FULL SCAN| IDX_SEX_1 |
(3)| 00:01:18 |
-----------------------------------------------------------------------------------
- dynamic sampling used for this statement
Statistics
----------------------------------------------------------
recursive calls
db block gets
consistent gets
physical reads
bytes sent via SQL*Net to client
bytes received via SQL*Net from client
SQL*Net roundtrips to/from client
sorts (memory)
sorts (disk)
rows processed
SQL& select sex fro
Elapsed: 00:00:03.21
Execution Plan
----------------------------------------------------------
Plan hash value:
---------------------------------------------------------------------------
| Operation
| Name | Rows
| Bytes | Cost (%CPU)| Time
---------------------------------------------------------------------------
0 | SELECT STATEMENT
69M| 14908
(14)| 00:02:59 |
HASH GROUP BY
69M| 14908
(14)| 00:02:59 |
TABLE ACCESS FULL| SEX
69M| 13106
(3)| 00:02:38 |
---------------------------------------------------------------------------
- dynamic sampling used for this statement
Statistics
----------------------------------------------------------
recursive calls
db block gets
consistent gets
physical reads
bytes sent via SQL*Net to client
bytes received via SQL*Net from client
SQL*Net roundtrips to/from client
sorts (memory)
sorts (disk)
rows processed
对比以上测试, Oracle的性能要明显优于PostgreSQL.
将count(distinct sex)修改如下后PostgreSQL的执行速度有明显改善, 但是性能还是低于O一截, 约一半.
digoal=& select count(*) from (select sex from sex t group by sex)
那么如何优化呢?
在PostgreSQL中的递归SQL在这里就派上大用场了, 结合btree索引扫描. 性能可以提升几万倍.
来看如下优化过程 :
创建测试表 :
create table user_download_log (user_id int not null, listid int not null, apkid int not null,
get_time timestamp(0) not null, otherinfo text);
插入测试数据
insert into user_download_log
select generate_series(0,),generate_series(0,),generate_series(0,),
generate_series(clock_timestamp(),clock_timestamp()+interval ' min',interval '1 min'),
'this is test';
创建索引 :
create index i1 on user_download_log (user_id);
create index i2 on user_download_log (otherinfo);
查看数据分布 :
用来说明递归SQL适合哪种场景的优化.
select count(distinct user_id), count(distinct otherinfo) from user_download_
----------+-------
查看未优化时以下SQL的执行计划以及耗时.
digoal=& explain analyze select count(distinct otherinfo) from user_download_
QUERY PLAN
-------------------------------------------------------------------------------------------------------
(cost=.. rows=1 width=13) (actual time=95.494 rows=1 loops=1)
Seq Scan on user_download_log
(cost=0.00.. rows= width=13)
(actual time=0.014.. rows= loops=1)
Total runtime:
优化后的SQL :
digoal=& with recursive skip as (
select min(t.otherinfo) as otherinfo from user_download_log t where t.otherinfo is not null
select (select min(t.otherinfo) from user_download_log t where t.otherinfo & s.otherinfo and
t.otherinfo is not null)
from skip s where s.otherinfo is not null
-- 这里的where s.otherinfo is not null 一定要加,否则就死循环了.
digoal(& )
digoal-& select count(distinct otherinfo)
优化后的SQL执行计划以及耗时, 性能提升了36390倍, 相比O也提升了上万倍.
digoal=& explain analyze with recursive skip as (
select min(t.otherinfo) as otherinfo from user_download_log t where t.otherinfo is not null
select (select min(t.otherinfo) from user_download_log t where t.otherinfo & s.otherinfo and
t.otherinfo is not null)
from skip s where s.otherinfo is not null
-- 这里的where s.otherinfo is not null 一定要加,否则就死循环了.
select count(distinct otherinfo)
QUERY PLAN
-------------------------------------------------------------------------------------------------
(cost=10.55..10.56 rows=1 width=32) (actual time=0.094..0.094 rows=1 loops=1)
Recursive Union
(cost=0.03..8.28 rows=101 width=32) (actual time=0.044..0.073 rows=2 loops=1)
(cost=0.03..0.04 rows=1 width=0) (actual time=0.042..0.042 rows=1 loops=1)
InitPlan 1 (returns $1)
(cost=0.00..0.03 rows=1 width=13) (actual time=0.038..0.039 rows=1 loops=1)
-& Index Only Scan using i2 on user_download_log t (cost=0.00.. rows= width=13)
(actual time=0.037..0.037 rows=1 loops=1)
Index Cond: (otherinfo IS NOT NULL)
Heap Fetches: 1
WorkTable Scan on skip s
(cost=0.00..0.62 rows=10 width=32) (actual time=0.013..0.013 rows=0 loops=2)
Filter: (otherinfo IS NOT NULL)
Rows Removed by Filter: 0
(cost=0.03..0.04 rows=1 width=0) (actual time=0.018..0.018 rows=1 loops=1)
InitPlan 2 (returns $3)
(cost=0.00..0.03 rows=1 width=13) (actual time=0.017..0.017 rows=0 loops=1)
Index Only Scan using i2 on user_download_log t (cost=0.00.. rows=3333343 width=13)
(actual time=0.015..0.015 rows=0 loops=1)
Index Cond: ((otherinfo & s.otherinfo) AND (otherinfo IS NOT NULL))
Heap Fetches: 0
CTE Scan on skip
(cost=0.00..2.02 rows=101 width=32) (actual time=0.047..0.077 rows=2 loops=1)
Total runtime: 0.173 ms
换一个字段, 数据分布广泛的字段上使用以上优化方法, 看是否妥当, 以下是原始SQL的执行计划以及耗时 :
digoal=& explain analyze select count(distinct user_id) from user_download_
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
(cost=.. rows=1 width=4) (actual time=08.858 rows=1 loops=1)
Seq Scan on user_download_log
(cost=0.00.. rows= width=4)
(actual time=0.014.. rows= loops=1)
Total runtime:
换一个字段, 数据分布广泛的字段上使用以上优化方法, 看是否妥当, 以下是采用递归SQL后的执行计划以及耗时 :
显然性能是下降的, 所以使用递归SQL不适合数据分布广泛的字段的group by或者count(distinct)操作.
digoal=& explain analyze with recursive skip as (
select min(t.user_id) as user_id from user_download_log t where t.user_id is not null
select (select min(t.user_id) from user_download_log t where t.user_id & s.user_id and
t.user_id is not null)
from skip s where s.user_id is not null
-- 这里的where s.user_id is not null 一定要加,否则就死循环了.
select count(distinct user_id)
QUERY PLAN
-----------------------------------------------------------------------------------------------------
(cost=10.44..10.45 rows=1 width=4) (actual time=.. rows=1 loops=1)
Recursive Union
(cost=0.03..8.17 rows=101 width=4) (actual time=0.047.. rows= loops=1)
(cost=0.03..0.04 rows=1 width=0) (actual time=0.046..0.046 rows=1 loops=1)
InitPlan 1 (returns $1)
(cost=0.00..0.03 rows=1 width=4) (actual time=0.042..0.042 rows=1 loops=1)
Index Only Scan using i1 on user_download_log t (cost=0.00.. rows= width=4)
(actual time=0.040..0.040 rows=1 loops=1)
Index Cond: (user_id IS NOT NULL)
Heap Fetches: 1
WorkTable Scan on skip s
(cost=0.00..0.61 rows=10 width=4) (actual time=0.017..0.017 rows=1 loops=)
Filter: (user_id IS NOT NULL)
Rows Removed by Filter: 0
(cost=0.03..0.04 rows=1 width=0) (actual time=0.016..0.016 rows=1 loops=)
InitPlan 2 (returns $3)
(cost=0.00..0.03 rows=1 width=4) (actual time=0.015..0.015 rows=1 loops=)
Index Only Scan using i1 on user_download_log t (cost=0.00.. rows=3333343 width=4)
(actual time=0.014..0.014 rows=1 loops=)
Index Cond: ((user_id & s.user_id) AND (user_id IS NOT NULL))
Heap Fetches:
CTE Scan on skip
(cost=0.00..2.02 rows=101 width=4) (actual time=0.050.. rows= loops=1)
Total runtime:
以下是同样的数据结构以及测试数据在O下的测试.
SQL& create table test (id int, otherinfo varchar2(32))
Table created.
SQL& insert into test select rownum,'this is test' from dual connect by level &=000001 rows created. SQL&
SQL& create index i1 on test(id);
SQL& create index i2 on test(otherinfo);
SQL& explain plan for select count(distinct id)
Explained.
SQL& select * from table(dbms_xplan.display());
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------
Plan hash value:
------------------------------------------------------------------------------
| Operation
| Name | Rows
| Bytes | Cost (%CPU)| Time
------------------------------------------------------------------------------
0 | SELECT STATEMENT
(3)| 00:00:51 |
SORT GROUP BY
INDEX FAST FULL SCAN| I1
(3)| 00:00:51 |
------------------------------------------------------------------------------
- dynamic sampling used for this statement
13 rows selected.
SQL& explain plan for select count(distinct otherinfo)
Explained.
SQL& select * from table(dbms_xplan.display());
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value:
---------------------------------------------------------------------------
| Operation
| Name | Rows
| Bytes | Cost (%CPU)| Time
---------------------------------------------------------------------------
0 | SELECT STATEMENT
(3)| 00:01:11 |
SORT GROUP BY
TABLE ACCESS FULL| TEST |
(3)| 00:01:11 |
---------------------------------------------------------------------------
- dynamic sampling used for this statement
13 rows selected.
SQL& set timing on
SQL& select count(distinct otherinfo)
COUNT(DISTINCTOTHERINFO)
------------------------
Elapsed: 00:00:02.49
SQL& select count(distinct id)
COUNT(DISTINCTID)
-----------------
Elapsed: 00:00:07.13
从执行耗时可以看出PostgreSQL在数据分布稀疏的字段上使用递归SQL优化后的性能相比Oracle有41213倍的性能提升.
递归查询中不允许使用聚合函数 :
with recursive skip as (
select min(t.otherinfo) as otherinfo from user_download_log t where t.otherinfo is not null
select min(t.otherinfo) from user_download_log t, skip s
where t.otherinfo & s.otherinfo
and t.otherinfo is not null
and s.otherinfo is not null
-- 这里的where s.otherinfo is not null 一定要加,否则就死循环了.
aggregate functions not allowed in a recursive query's recursive term
select min(t.otherinfo) from user_download_log t, skip s...
Time: 0.581 ms
修改如下即可 :
with recursive skip as (
select min(t.otherinfo) as otherinfo from user_download_log t where t.otherinfo is not null
select (select min(t.otherinfo) from user_download_log t where t.otherinfo & s.otherinfo and
t.otherinfo is not null)
from skip s where s.otherinfo is not null
-- 这里的where s.otherinfo is not null 一定要加,否则就死循环了.
细心的朋友发现Oracle测试中未对表进行分析, 以下是分析后的结果, 执行计划无变化 :
SQL& analyze table sex estimate statistics for all columns sample 10
Table analyzed.
SQL& analyze index idx_sex_1 estimate statistics sample 10
Index analyzed.
SQL& select sex fro
Elapsed: 00:00:03.17
Execution Plan
----------------------------------------------------------
Plan hash value:
---------------------------------------------------------------------------
| Operation
| Name | Rows
| Bytes | Cost (%CPU)| Time
---------------------------------------------------------------------------
0 | SELECT STATEMENT
(12)| 00:02:55 |
HASH GROUP BY
(12)| 00:02:55 |
TABLE ACCESS FULL| SEX
19M| 13062
(2)| 00:02:37 |
---------------------------------------------------------------------------
Statistics
----------------------------------------------------------
recursive calls
db block gets
consistent gets
physical reads
bytes sent via SQL*Net to client
bytes received via SQL*Net from client
SQL*Net roundtrips to/from client
sorts (memory)
sorts (disk)
rows processed
SQL& select count(distinct sex)
COUNT(DISTINCTSEX)
------------------
Elapsed: 00:00:03.85
Execution Plan
----------------------------------------------------------
Plan hash value:
-----------------------------------------------------------------------------------
| Operation
| Bytes | Cost (%CPU)| Time
-----------------------------------------------------------------------------------
0 | SELECT STATEMENT
(3)| 00:01:18 |
SORT GROUP BY
INDEX FAST FULL SCAN| IDX_SEX_1 |
(3)| 00:01:18 |
-----------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
recursive calls
db block gets
consistent gets
physical reads
bytes sent via SQL*Net to client
bytes received via SQL*Net from client
SQL*Net roundtrips to/from client
sorts (memory)
sorts (disk)
rows processed
Oracle在这类应用场景中还有一个选择,使用位图索引。
摘录一段O位图索引的介绍
位图索引 Bitmap index
场合:列的基数很少,可枚举,重复值很多,数据不会被经常更新
原理:一个键值对应很多行(rowid), 格式:键值 start_rowid end_rowid 位图
优点:OLAP 例如报表类数据库 重复率高的数据 特定类型的查询例如count、or、and等逻辑操作因为只需要进行位运算即可得到我们需要的结果
缺点:不适合重复率低的字段,还有经常DML操作(insert,update,delete),因为位图索引的锁代价极高,修改一个位图索引段影响整个位图段,例如修改一个键值,会影响同键值的多行,所以对于OLTP 系统位图索引基本上是不适用的因bitmap在OLTP使用场景较少,PostgreSQL 没有实现这个类型的索引。
想了解更多PG索引的情况,请参考
评论:2  
请在后发表评论,否则用户名称只能显示为GUEST。
GUEST 回答于
03:54:20+08
Felt so hopeless looking for answers to my
GUEST 回答于
04:11:34+08
Lovely post, Roxy!
You probably gave Judy the best retirement gift she could have wished for. Great lessons in there. Hmg80#m23&;.mets me thinking about who my mentors are.
发表评论:Mysql中count(*),DISTINCT的使用方法和效率研究
Mysql中count(*),DISTINCT的使用方法和效率研究
在处理一个大数据量的时候
突然发现对于count(*)的不同处理会造成不同的结果
SELECT count(*) FROM tablename
即使对于千万级别的数据mysql也能非常迅速的返回结果
SELECT count(*) FROM tablename WHERE&..
mysql的查询时间开始攀升
仔细查阅累下手册,发现当没有WHERE语句对于整个mysql的表进行count运算的时候
MyISAM类型的表中保存有总的行数,而当添加有WHERE限定语句的时候Mysql需要对整个表进行检索
从而得出count的数值
突然又想起来看到的不少新兴的php程序对于count的处理并没有很好的意识到这点
顺便提下mysql的DISTINCT的关键字有很多你想不到的用处
1.在count 不重复的记录的时候能用到
比如SELECT COUNT( DISTINCT id ) FROM
就是计算talbebname表中id不同的记录有多少条
2,在需要返回记录不同的id的具体值的时候可以用
比如SELECT DISTINCT id FROM
返回talbebname表中不同的id的具体的值
3.上面的情况2对于需要返回mysql表中2列以上的结果时会有歧义
比如SELECT DISTINCT id, type FROM
实际上返回的是 id与type同时不相同的结果,也就是DISTINCT同时作用了两个字段,必须得id与tyoe都相同的才被排除了,与我们期望的结果不一样
4.这时候可以考虑使用group_concat函数来进行排除,不过这个mysql函数是在mysql4.1以上才支持的
5.其实还有另外一种解决方式,就是使用
SELECT id, type, count(DISTINCT id) FROM tablename
虽然这样的返回结果多了一列无用的count数据(或许你就需要这个我说的无用数据)
返回的结果是 只有id不同的所有结果和上面的4类型可以互补使用,就是看你需要什么样的数据了
(window.slotbydup=window.slotbydup || []).push({
id: '2467140',
container: s,
size: '1000,90',
display: 'inlay-fix'
(window.slotbydup=window.slotbydup || []).push({
id: '2467141',
container: s,
size: '1000,90',
display: 'inlay-fix'
(window.slotbydup=window.slotbydup || []).push({
id: '2467142',
container: s,
size: '1000,90',
display: 'inlay-fix'
(window.slotbydup=window.slotbydup || []).push({
id: '2467143',
container: s,
size: '1000,90',
display: 'inlay-fix'
(window.slotbydup=window.slotbydup || []).push({
id: '2467148',
container: s,
size: '1000,90',
display: 'inlay-fix'SQL COUNT() 函数
COUNT() 函数返回匹配指定条件的行数。
SQL COUNT(column_name) 语法
COUNT(column_name) 函数返回指定列的值的数目(NULL 不计入):
SELECT COUNT(column_name) FROM table_
SQL COUNT(*) 语法
COUNT(*) 函数返回表中的记录数:
SELECT COUNT(*) FROM table_
SQL COUNT(DISTINCT column_name) 语法
COUNT(DISTINCT column_name) 函数返回指定列的不同值的数目:
SELECT COUNT(DISTINCT column_name) FROM table_
注释:COUNT(DISTINCT) 适用于 ORACLE 和 Microsoft SQL Server,但是无法用于 Microsoft Access。
演示数据库
在本教程中,我们将使用 RUNOOB 样本数据库。
下面是选自 "access_log" 表的数据:
+-----+---------+-------+------------+
| aid | site_id | count | date
+-----+---------+-------+------------+
+-----+---------+-------+------------+
SQL COUNT(column_name) 实例
下面的 SQL 语句计算 "access_log" 表中 "site_id"=3 的总访问量:
SELECT COUNT(count) AS nums FROM access_logWHERE
site_id=3;
SQL COUNT(*) 实例
下面的 SQL 语句计算 "access_log" 表中总记录数:
SELECT COUNT(*) AS nums FROM access_
执行以上 SQL 输出结果如下:
SQL COUNT(DISTINCT column_name) 实例
下面的 SQL 语句计算 "access_log" 表中不同 site_id 的记录数:
SELECT COUNT(DISTINCT site_id) AS nums FROM access_
执行以上 SQL 输出结果如下:
反馈内容(*必填)
截图标记颜色
联系方式(邮箱)
联系邮箱:
投稿页面:
记住登录状态
重复输入密码

我要回帖

更多关于 count distinct 的文章

 

随机推荐