qscatterqwilt qb seriess: how to find out how many points it contains

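&img src=&/50/v2-3c735da0a0623aacfdc160_b.jpg& data-rawwidth=&1000& data-rawheight=&666& class=&origin_image zh-lightbox-thumb& width=&1000& data-original=&/50/v2-3c735da0a0623aacfdc160_r.jpg&&&p&I have been in a data role for almost two months, and most of that time has still been spent in Excel. Last week I tried to write some Python and found I had gone rusty, which was alarming. It reminded me of an article by someone learning Python who forced themselves to stop using Excel and do every operation in Python instead. The goal is to consolidate Python and to strengthen data-handling skills, and that is also why I am writing this article. Enough preamble, let's get started.&/p&&p&&br&&/p&&p&The data is a sales dataset found online. It looks like this:&/p&&img src=&/50/v2-5ef2a4a06_b.jpg& data-caption=&& data-rawwidth=&1245& data-rawheight=&736& class=&origin_image zh-lightbox-thumb& width=&1245& data-original=&/50/v2-5ef2a4a06_r.jpg&&&p&&br&&/p&&h2&&b&1. Lookup formula: VLOOKUP&/b&&/h2&&p&VLOOKUP is probably the most commonly used Excel formula, typically for joining two tables. So I first split the sheet into two tables.&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&df1=sale[['订单明细号','单据日期','地区名称', '业务员名称','客户分类', '存货编码', '客户名称', '业务员编码', '存货名称', '订单号',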
'客户编码', '部门名称', '部门编码']]
df2=sale[['订单明细号','存货分类', '税费', '不含税金额', '订单金额', '利润', '单价','数量']]
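&/code&&/pre&&/div&&p&&br&&/p&&p&&b&Requirement: find the profit for each order in df1.&/b&&/p&&p&The profit column (利润) lives in df2. In Excel, you would first confirm that the order line ID (订单明细号) is unique, then add a column to df1 containing =vlookup(a2,df2!a:h,6,0) and fill it down. (I will not spell out the Excel version for the remaining 13 operations.)&/p&&p&&br&&/p&&p&How is this done in Python?&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&# Check whether the order line IDs contain duplicates; here there are none.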
df1["订单明细号"].duplicated().value_counts()
df2["订单明细号"].duplicated().value_counts()
df_c=pd.merge(df1,df2,on="订单明细号",how="left")
&/code&&/pre&&/div&&p&&br&&/p&&h2&2. Pivot tables&/h2&&p&&b&Requirement: find the total and the mean profit earned by each salesperson in each region.&/b&&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&pd.pivot_table(sale,index="地区名称",columns="业务员名称",values="利润",aggfunc=["sum","mean"])
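&/code&&/pre&&/div&&p&&br&&/p&&h2&3. Comparing two columns&/h2&&p&Every column in this table measures something different, so comparing them directly is not meaningful. Instead, I first made a modified copy of the order line ID column and compared against that.&/p&&p&Requirement: show the rows where 订单明细号 and 订单明细号2 differ.&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&sale["订单明细号2"]=sale["订单明细号"]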
# Add 1 to the first 10 values of 订单明细号2.
sale.loc[sale.index[:10],"订单明细号2"]+=1
# Keep rows whose 订单明细号 does not appear anywhere in 订单明细号2.
result=sale.loc[~sale["订单明细号"].isin(sale["订单明细号2"])]
&/code&&/pre&&/div&&p&&br&&/p&&h2&4. Removing duplicates&/h2&&p&&b&Requirement: remove rows with duplicate salesperson codes (业务员编码).&/b&&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&sale.drop_duplicates(subset="业务员编码",inplace=True)
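&/code&&/pre&&/div&&p&&br&&/p&&h2&5. Handling missing values&/h2&&p&First check which columns of the sales data have missing values.&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&# A column with fewer non-null values than index rows has missing values; here 客户名称 has 329 non-null values out of 335 rows.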
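&/code&&/pre&&/div&&img src=&/50/v2-bf9313ddccf8c_b.jpg& data-caption=&& data-rawwidth=&637& data-rawheight=&563& class=&origin_image zh-lightbox-thumb& width=&637& data-original=&/50/v2-bf9313ddccf8c_r.jpg&&&p&&b&Requirement: fill missing values with 0, or drop the rows where the customer code (客户编码) is missing.&/b& Handling missing values properly can be quite involved; only simple approaches are covered here. For numeric variables, the mean, median, or mode is most commonly used, and a more elaborate option is to predict the missing value from the other columns with a model such as a random forest. For categorical variables, filling according to business logic tends to be more accurate. &b&For example, to fill missing customer names (客户名称) here, you could use the customer name associated with the most frequent item in the same inventory category (存货分类).&/b&&/p&&p&Here we take the simple route: fill missing values with 0, or drop the rows where the customer code is missing.&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&# Fill missing values with 0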
sale["客户名称"]=sale["客户名称"].fillna(0)
# Drop rows where the customer code is missing (dropna returns a new frame, so assign it back)
sale=sale.dropna(subset=["客户编码"])
&/code&&/pre&&/div&&p&&br&&/p&&h2&6. Filtering on multiple conditions&/h2&&p&&b&Requirement: find the orders sold by salesperson 张爱 in the 北京 (Beijing) region with an order amount greater than 6000.&/b&&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&sale.loc[(sale["地区名称"]=="北京")&(sale["业务员名称"]=="张爱")&(sale["订单金额"]>6000)]
&/code&&/pre&&/div&&p&&br&&/p&&h2&7. Fuzzy filtering&/h2&&p&&b&Requirement: select the rows whose item name (存货名称) contains "三星" (Samsung) or "索尼" (Sony).&/b&&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&sale.loc[sale["存货名称"].str.contains("三星|索尼")]
&/code&&/pre&&/div&&p&&br&&/p&&h2&8. Subtotals&/h2&&p&&b&Requirement: total profit of each salesperson in the Beijing region.&/b&&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&sale.groupby(["地区名称","业务员名称"])["利润"].sum()
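&/code&&/pre&&/div&&p&&br&&/p&&h2&9. Conditional calculations&/h2&&p&&b&Requirement: how many orders have an item name containing "三星" and a tax (税费) of at least 1000? What are the total and mean profit of those orders (or the minimum, maximum, quartiles, and standard deviation)?&/b&&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&sale.loc[sale["存货名称"].str.contains("三星")&(sale["税费"]>=1000)][["订单明细号","利润"]].describe()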
&/code&&/pre&&/div&&img src=&/50/v2-81f995bbdeb6cecfc4198_b.jpg& data-caption=&& data-rawwidth=&422& data-rawheight=&344& class=&origin_image zh-lightbox-thumb& width=&422& data-original=&/50/v2-81f995bbdeb6cecfc4198_r.jpg&&&h2&10. Trimming whitespace&/h2&&p&&b&Requirement: remove the whitespace on both sides of the item name.&/b&&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&sale["存货名称"]=sale["存货名称"].str.strip()
&/code&&/pre&&/div&&p&&br&&/p&&h2&11. Splitting a column&/h2&&img src=&/50/v2-76d82b057ad519a85a6a_b.jpg& data-caption=&& data-rawwidth=&764& data-rawheight=&283& class=&origin_image zh-lightbox-thumb& width=&764& data-original=&/50/v2-76d82b057ad519a85a6a_r.jpg&&&p&Requirement: split the date and the time into separate columns.&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&sale=pd.merge(sale,pd.DataFrame(sale["单据日期"].str.split(" ",expand=True)),how="inner",left_index=True,right_index=True)
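&/code&&/pre&&/div&&p&&br&&/p&&h2&12. Replacing outliers&/h2&&p&First use describe() for a quick look at whether the data contains outliers.&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&# The output tax shows a negative value, which normally should not happen, so treat it as an outlier.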
sale.describe()
&/code&&/pre&&/div&&img src=&/50/v2-781fc89d41e7e2feab760b_b.jpg& data-caption=&& data-rawwidth=&1260& data-rawheight=&334& class=&origin_image zh-lightbox-thumb& width=&1260& data-original=&/50/v2-781fc89d41e7e2feab760b_r.jpg&&&p&&b&Requirement: replace the outlier with 0.&/b&&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&sale["订单金额"]=sale["订单金额"].replace(sale["订单金额"].min(),0)
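&/code&&/pre&&/div&&p&&br&&/p&&h2&13. Binning&/h2&&p&&b&Requirement: based on the distribution of profit, bin the regions into "较差" (poor), "中等" (average), "较好" (good), and "非常好" (excellent).&/b&&/p&&p&The first step, of course, is to look at the distribution of profit. Here we use the quartiles to decide.&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&sale.groupby("地区名称")["利润"].sum().describe()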
&/code&&/pre&&/div&&img src=&/50/v2-3621cec37acb_b.jpg& data-caption=&& data-rawwidth=&361& data-rawheight=&211& class=&content_image& width=&361&&&p&Based on the quartiles, regions whose total profit falls in the interval [-9,7091] are labeled "较差", and those in the interval (] are labeled "中等";&/p&&p&(] is labeled "较好" and (] is labeled "非常好".&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&# First build a DataFrame
sale_area=pd.DataFrame(sale.groupby("地区名称")["利润"].sum()).reset_index()
# Set the bin edges and the group labels
bins=[-10,,]
groups=["较差","中等","较好","非常好"]
# Assign groups with cut
#sale_area["分组"]=pd.cut(sale_area["利润"],bins,labels=groups)
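&/code&&/pre&&/div&&p&&br&&/p&&h2&14. Labeling by business rules&/h2&&p&&b&Requirement: mark items with a profit margin (profit / order amount) above 30% as "优质商品" (premium items) and those below 5% as "一般商品" (ordinary items).&/b&&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&sale.loc[(sale["利润"]/sale["订单金额"])>0.3,"label"]="优质商品"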
sale.loc[(sale["利润"]/sale["订单金额"])<0.05,"label"]="一般商品"
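&/code&&/pre&&/div&&p&&br&&/p&&p&Excel has many more common operations; I only listed the 14 I use most. If there are others you would like to see, leave a comment and we can discuss. I also know my Python is not very concise and that I reach for loc out of habit (query would often be shorter). If you have a better way to write any of these, please tell me in the comments. Thanks!&/p&&p&&br&&/p&&p&Finally, I do not think it is useful to pit Excel against Python to decide which is better. Both are tools. Excel is the most widely used data-processing tool and has dominated for years, so it is clearly very good at data handling. Some operations really are simpler in Python, but plenty are simpler in Excel. One simple example is summing every column and showing the totals in a bottom row: in Excel you write one SUM() under a column and drag it across, while in Python you have to define a function (because Python checks types, and non-numeric data raises an error).&/p&&p&In short: &b&whichever tool you use, a good data analyst is one who solves the problem!&/b&&/p&&p&&br&&/p&&p&&b&--------------------------------&/b&&/p&&p&If you bookmark this, a like would be appreciated. Thanks!&/p&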
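The column-total example from the closing remarks can be sketched in pandas roughly as follows. This is a minimal sketch on a made-up three-row table: the `demo` frame, the two extra salesperson names, and the "合计" (total) label are illustrative assumptions, not part of the original data. It leans on `sum(numeric_only=True)` to skip text columns instead of a hand-written type check.

```python
import pandas as pd

# Hypothetical miniature of the sales table; the real data has many more columns.
demo = pd.DataFrame({
    "业务员名称": ["张爱", "李强", "王芳"],
    "订单金额": [1200.0, 800.0, 500.0],
    "利润": [300.0, 120.0, 80.0],
})

# Sum only the numeric columns; the text column is skipped instead of raising.
numeric_totals = demo.sum(numeric_only=True).to_dict()

# Build the total row, with a label in the text column.
total_row = {"业务员名称": "合计", **numeric_totals}

# Append the totals as one extra row at the bottom.
with_total = pd.concat([demo, pd.DataFrame([total_row])], ignore_index=True)
print(with_total)
```

Keeping the totals in a separate row dictionary makes it easy to choose what goes into the non-numeric cells, which is the part Excel leaves implicit.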
&img src=&/50/v2-75cd7e2adbeaa7ba8b5deb8_b.png& data-rawwidth=&1366& data-rawheight=&714& class=&origin_image zh-lightbox-thumb& width=&1366& data-original=&/50/v2-75cd7e2adbeaa7ba8b5deb8_r.png&&&blockquote&Make Python a little more comfortable and free, while staying compatible with native Python.&/blockquote&&p&That is why I wrote flowpython. The address is below.&/p&&p&&a href=&/?target=https%3A///thautwarm/flowpython& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&thautwarm/flowpython&i class=&icon-external&&&/i&&/a&&/p&&p&flowpython aims to make writing Python feel like a flow: you can write a whole file as a single expression, something like the following (a scenario I made up on the spot; this code is not flowpython as it stands, but future versions should look a lot like it):&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&n&&ret&/span& &span class=&o&&=&/span& &span class=&n&&container1&/span& &span class=&n&&merge&/span& &span class=&n&&container2&/span&
&span class=&n&&then&/span& &span class=&o&&.&/span&&span class=&n&&sort&/span&&span class=&p&&()&/span&&span class=&o&&.&/span&&span class=&n&&reduce&/span&&span class=&p&&(&/span&&span class=&n&&f1&/span&&span class=&p&&)&/span&
&span class=&n&&where&/span&
&span class=&n&&f1&/span& &span class=&o&&=&/span& &span class=&o&&...&/span&
&span class=&n&&match&/span&
&span class=&nb&&int&/span&
&span class=&o&&=&&/span& &span class=&o&&...&/span&
&span class=&nb&&float&/span&
&span class=&o&&=&&/span&
&span class=&o&&.&/span&&span class=&n&&fsplit&/span&&span class=&p&&(&/span&&span class=&n&&x&/span&&span class=&o&&-&&/span&&span class=&n&&x&/span&&span class=&o&&//&/span&&span class=&mi&&10&/span&&span class=&p&&)&/span& \
&span class=&o&&.&/span&&span class=&n&&connectWith&/span&&span class=&p&&(&/span&
&span class=&p&&(&/span&&span class=&n&&x&/span&&span class=&o&&-&&/span& &span class=&n&&x&/span&&span class=&o&&&&/span&&span class=&mi&&10&/span&&span class=&p&&,&/span& &span class=&n&&x&/span&&span class=&o&&-&&/span& &span class=&o&&...&/span& &span class=&p&&),&/span&
&span class=&o&&...&/span& &span class=&c1&&#省略一些情况&/span&
&span class=&p&&(&/span&&span class=&n&&x&/span&&span class=&o&&-&&/span& &span class=&n&&x&/span&&span class=&o&&&&/span&&span class=&mi&&0&/span& &span class=&p&&,&/span& &span class=&n&&x&/span&&span class=&o&&-&...&/span&&span class=&p&&)&/span&
&span class=&p&&(&/span&&span class=&n&&x&/span&&span class=&o&&-&&/span& &span class=&o&&...&/span&&span class=&p&&)&/span&
&span class=&p&&)&/span&
&span class=&n&&callable&/span& &span class=&o&&=&&/span& &span class=&o&&...&/span&
&/code&&/pre&&/div&&p&In native Python, this code would look roughly like this:&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&n&&f1&/span& &span class=&o&&=&/span& &span class=&o&&...&/span&
&span class=&n&&test&/span& &span class=&o&&=&/span& &span class=&n&&reduce&/span&&span class=&p&&(&/span&&span class=&n&&f1&/span&&span class=&p&&,&/span& &span class=&n&&merge&/span&&span class=&p&&(&/span&&span class=&n&&container1&/span&&span class=&p&&,&/span&&span class=&n&&container2&/span&&span class=&p&&)&/span&&span class=&o&&.&/span&&span class=&n&&sort&/span&&span class=&p&&())&/span&
&span class=&k&&if&/span&
&span class=&nb&&isinstance&/span&&span class=&p&&(&/span&&span class=&n&&test&/span&&span class=&p&&,&/span&&span class=&nb&&int&/span&&span class=&p&&):&/span&
&span class=&n&&ret&/span&
&span class=&o&&=&/span& &span class=&o&&...&/span&
&span class=&k&&elif&/span& &span class=&nb&&isinstance&/span&&span class=&p&&(&/span&&span class=&n&&test&/span&&span class=&p&&,&/span&&span class=&nb&&float&/span&&span class=&p&&):&/span&
&span class=&n&&test2&/span& &span class=&o&&=&/span& &span class=&n&&test&/span&&span class=&o&&//&/span&&span class=&mi&&10&/span&
&span class=&k&&if&/span&
&span class=&n&&test2&/span&&span class=&o&&&&/span&&span class=&mi&&10&/span&&span class=&p&&:&/span&
&span class=&n&&ret&/span& &span class=&o&&=&/span& &span class=&p&&(&/span&&span class=&n&&x&/span&&span class=&o&&-&...&/span&&span class=&p&&)(&/span&&span class=&n&&test&/span&&span class=&p&&)&/span&
&span class=&o&&...&/span&
&span class=&c1&&#省略一些情况&/span&
&span class=&k&&elif&/span&
&span class=&n&&test2&/span&&span class=&o&&&&/span&&span class=&mi&&0&/span&&span class=&p&&:&/span&
&span class=&n&&ret&/span& &span class=&o&&=&/span& &span class=&p&&(&/span&&span class=&n&&x&/span&&span class=&o&&-&...&/span&&span class=&p&&)(&/span&&span class=&n&&test&/span&&span class=&p&&)&/span&
&span class=&k&&else&/span&&span class=&p&&:&/span&
&span class=&n&&ret&/span& &span class=&o&&=&/span& &span class=&p&&(&/span&&span class=&n&&x&/span&&span class=&o&&-&...&/span&&span class=&p&&)(&/span&&span class=&n&&test&/span&&span class=&p&&)&/span&
&span class=&k&&elif&/span& &span class=&nb&&callable&/span&&span class=&p&&(&/span&&span class=&n&&test&/span&&span class=&p&&):&/span&
&span class=&n&&ret&/span& &span class=&o&&=&/span& &span class=&o&&...&/span&
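&/code&&/pre&&/div&&p&When a project is small, the native Python version is not really verbose, and it is simple, although its meaning is not as obvious at a glance as the first version.&/p&&p&But &b&it does not match how people think&/b& (for example, if f1 is a function written as a compound statement, it has to be defined before the reduce call), and if-else, whether written as the expression test if else test or as the compound statement if test : suite else suite , still reads as rather sleep-inducing. &/p&&p&In other words, it is &b&not flexible enough&/b&, and that is quite harmful.&/p&&p&Writing code takes up a large share of my life. I do not want my life to turn into one long, monotonous journey, like a road trip with 10 hours in the car every day.&/p&&p&I want it to keep its color and its surprises, like the joy I have felt since I started programming whenever an idea was fulfilled and made real.&/p&&p&That is how flowpython came about, in the hope that writing Python can feel like a flow from now on.&/p&&p&Of course, understanding a language at the C level and creating language features is a great pleasure in itself. When my where clause finally passed its tests, I visibly broke into a silly grin, which was pretty funny...&/p&&p&Below is an example of using where to improve readability.&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&c1&&# 圆柱面积 / surface area of a cylinder&/span&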
&span class=&kn&&from&/span& &span class=&nn&&math&/span& &span class=&k&&import&/span& &span class=&n&&pi&/span&
&span class=&n&&r&/span& &span class=&o&&=&/span& &span class=&mi&&1&/span&
&span class=&c1&&# the radius&/span&
&span class=&n&&h&/span& &span class=&o&&=&/span& &span class=&mi&&10&/span& &span class=&c1&&# the height&/span&
&span class=&n&&S&/span& &span class=&o&&=&/span& &span class=&p&&(&/span&&span class=&mi&&2&/span&&span class=&o&&*&/span&&span class=&n&&S_top&/span& &span class=&o&&+&/span& &span class=&n&&S_side&/span&&span class=&p&&)&/span& &span class=&n&&where&/span&&span class=&p&&:&/span&
&span class=&n&&S_top&/span&
&span class=&o&&=&/span& &span class=&n&&pi&/span&&span class=&o&&*&/span&&span class=&n&&r&/span&&span class=&o&&**&/span&&span class=&mi&&2&/span&
&span class=&n&&S_side&/span& &span class=&o&&=&/span& &span class=&n&&C&/span& &span class=&o&&*&/span& &span class=&n&&h&/span& &span class=&n&&where&/span&&span class=&p&&:&/span&
&span class=&n&&C&/span& &span class=&o&&=&/span& &span class=&mi&&2&/span&&span class=&o&&*&/span&&span class=&n&&pi&/span&&span class=&o&&*&/span&&span class=&n&&r&/span&
&/code&&/pre&&/div&&p&Here is the address once more. Why not take a look?&/p&&p&&a href=&/?target=https%3A///thautwarm/flowpython& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&thautwarm/flowpython&i class=&icon-external&&&/i&&/a&&/p&&p&In the project itself, everything is written in C except the management tool, which I wrote in Python.&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&n&&project&/span&&span class=&o&&.&/span&&span class=&n&&filter&/span&&span class=&p&&(&/span&&span class=&n&&x&/span&&span class=&o&&-&&/span&&span class=&n&&x&/span&&span class=&o&&.&/span&&span class=&n&&type&/span&&span class=&o&&!=&/span&&span class=&n&&管理工具&/span&&span class=&p&&)&/span&&span class=&o&&.&/span&&span class=&n&&all&/span&&span class=&p&&(&/span&&span class=&n&&x&/span&&span class=&o&&-&&/span&&span class=&n&&x&/span&&span class=&o&&.&/span&&span class=&n&&lang&/span&&span class=&o&&==&/span&&span class=&s1&&'c'&/span&&span class=&p&&)&/span& &span class=&o&&==&/span& &span class=&kc&&True&/span&
&span class=&o&&&&&/span& &span class=&kc&&True&/span&
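&/code&&/pre&&/div&&p&Modest as my abilities are, I now know Python's grammar, ast, compile and related modules inside out (after a week of reading the C source 18+ hours a day, allow me to show off a little...)&/p&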
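For comparison, the cylinder example above can be written in plain Python, where every helper value has to be bound before the expression that uses it. This is only a sketch of the native equivalent under that reading of the example; it does not use flowpython's `where` syntax.

```python
from math import pi

r = 1   # the radius
h = 10  # the height

# Without `where`, the pieces are defined first, bottom-up.
C = 2 * pi * r        # circumference of the base
S_side = C * h        # lateral surface
S_top = pi * r ** 2   # area of one end cap

# Only now can the headline formula be written.
S = 2 * S_top + S_side
print(S)
```

The arithmetic is identical; what changes is the reading order, which is the point the `where` clause is meant to address.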
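&p&Once you have learned basic Python syntax, I think getting started with web crawlers is easy.&/p&&p&Writing my first crawler took only about 10 minutes. I taught myself Scrapy, and reading the official documentation took about 20 minutes,&/p&&p&because my English is not very good and I had to look up a lot of words (official documentation: &a href=&///?target=https%3A//docs.scrapy.org/en/latest/intro/tutorial.html& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&https://&/span&&span class=&visible&&docs.scrapy.org/en/late&/span&&span class=&invisible&&st/intro/tutorial.html&/span&&span class=&ellipsis&&&/span&&i class=&icon-external&&&/i&&/a&). Scrapy is not required for getting started, so you can finish reading this answer before deciding whether to pick it up.&/p&&p&After that I came across requests and lxml. Together with the basic libraries urllib and urllib2, they can do almost anything.&/p&&p&Later someone recommended libraries such as BeautifulSoup, but the principles are much the same.&/p&&p&I. Practical basics for getting started &/p&&p&0. The basic approach of a crawler &/p&&p& a. Fetch the page, from a URL or from a file&/p&&p& b. Work out where the target content sits in the page&/p&&p& c. Use element selectors to extract the raw target content quickly&/p&&p& d. Process the extracted content (usually by assembling it into JSON) &/p&&p& e. Store the processed content (for example in a database such as MongoDB, or in a file) &/p&&br&&p&1. Why did I pick up crawling so fast? Am I just showing off? &/p&&p&Answer: looking back, before I ever touched crawlers:&/p&&p& a. I understood the HTTP protocol well (I had read "HTTP: The Definitive Guide"),&/p&&p& b. I had written back ends with the Flask framework (about three years ago @萧井陌 recommended Flask on Zhihu, so I taught myself using the book "Flask Web Development") &/p&&p& c. I had written front ends (HTML+CSS+JS), knew what the DOM is, and knew a little jQuery. &/p&&p& d. My regular expressions were just about good enough. &/p&&p& e. I majored in computer science and studied seriously. &/p&&p& f. So it was a case of long preparation paying off. &/p&&br&&p&2. What about people with no technical background and no front-end or back-end experience? &/p&&p&Answer: it will certainly take more than half an hour. Spend some time first getting a rough idea of the following: &/p&&p& a. HTTP request methods, request headers, and request bodies&/p&&p& b. Roughly what a cookie is &/p&&p& c. A little HTML and element selectors &/p&&p& d. How to use Chrome's developer tools &/p&&p&Sharpening the axe does not delay the woodcutting. With someone to guide you, 1 to 2 hours is enough to get these to a usable level. Without a guide, search and learn online; that is quick too, probably ten hours at most. &/p&&p&PS: 阮一峰's introductory technical blog posts are very well written, and 博客园 (cnblogs) also has plenty of good material.&/p&&br&&p&3. Here is a freshly written piece of code. If you can follow it, you have got started:&/p&&img src=&/v2-80ea38ae0ede4bf068396_b.png& data-rawwidth=&759& data-rawheight=&727& class=&origin_image zh-lightbox-thumb& width=&759& data-original=&/v2-80ea38ae0ede4bf068396_r.png&&&br&&p&4. Which libraries are commonly used for Python crawlers? Which should a beginner learn?&/p&&p&Answer: there is a lot of material on this online, but I think a beginner neither needs to nor should take on every library at once. A toddler learning to talk should not be taught Mandarin, Cantonese, Hokkien, and English all at the same time. &/p&&p&In my view, once you know requests and lxml you can get started with crawling. Search for the other common libraries yourself, but do not bite off more than you can chew. (Anonymous accounts attacked the list I had compiled, which upset me, so I saved my copy and deleted it.)&/p&&br&&p&II. A small step toward more advanced crawling &/p&&p&0. A lot of crawler code on Zhihu has functions dozens of lines long, which is bad. Repeated code should be kept to a minimum. &/p&&br&&p&1. Important things get said three times:&/p&&p&Longer functions are not better. Good code is simple, readable, and easy to maintain!&/p&&p&Longer functions are not better. Good code is simple, readable, and easy to maintain! &/p&&p&Longer functions are not better. Good code is simple, readable, and easy to maintain! &/p&&p&(This is filed under advanced because not much crawler code manages it; a lot of it is a tangled mess that traps whoever inherits it.)&/p&&br&&p&2. A distributed crawler system built on Scrapy + MongoDB + Redis is actually not complicated.&/p&&p&a). Redis stores the queue of pages to crawl, that is, the task queue &/p&&p&b). MongoDB stores the crawled results.&/p&&p&c). Scrapy holds the crawlers, each of which crawls different page content.&/p&&p&PS: "distributed" sounds scary, but taken apart it is no more than this. So there is no need to be afraid.&/p&&p&*************************&/p&&p&----- story-time divider ----- &/p&&p&*************************&/p&&p&At a startup I was once thrown in at the deep end (I started out as a back-end programmer; my role is more mixed now, and hard to sum up). Within one week I had to build a distributed crawler for the major e-commerce sites (more than ten of them, including Taobao, Tmall, and JD; Scrapy + MongoDB + Redis).&/p&&p&It nearly scared me to death, because I had never written a crawler.&/p&&p&The only lead my team lead gave me was that the basic framework was Scrapy.&/p&&p&Perhaps ignorance is bliss. It did not occur to me to ask anyone, so I read the Scrapy documentation myself and had something written within half an hour.&/p&&p&After that, setting up the distributed crawler system went smoothly. We also crawled large amounts of data relevant to the company's business from Google, Baidu, Bing, Pinterest, Instagram, and others. &/p&&p&And that is how I did it.&/p&&p&With a fair amount of overtime, of course.&/p&&p&PS: using many machines means there is a lot of data to crawl, which is not necessarily related to how complex the project is. So do not be afraid. Being afraid does not help anyway: when the requirement arrives, you finish the code, trembling and working overtime if you must.&/p&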
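The basic approach in Part I (steps a to e) can be sketched as one small pipeline. This is a hedged illustration, not the code from the screenshot above: it swaps in the standard library (`urllib.request`, `html.parser`) for requests and lxml so that it runs without extra installs, and the sample HTML, the `fetch`, `parse_links`, and `save` helpers, and the link-list target are all assumptions made up for the example. `fetch` is defined but not called, so the demo needs no network.

```python
import json
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch(url, timeout=10):
    """Step a: download a page and return its text (not called in this demo)."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class LinkExtractor(HTMLParser):
    """Steps b and c: collect the href and text of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Step d: tidy the raw text into a JSON-friendly record.
            title = "".join(self._text).strip()
            self.links.append({"url": self._href, "title": title})
            self._href = None


def parse_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def save(items, path):
    """Step e: store the processed records in a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)


# Hypothetical page content standing in for a fetched page.
SAMPLE_HTML = (
    '<ul><li><a href="/post/1"> First post </a></li>'
    '<li><a href="/post/2">Second post</a></li></ul>'
)
items = parse_links(SAMPLE_HTML)
print(json.dumps(items, ensure_ascii=False))
```

With requests and lxml the same five steps apply; only the fetch and the selector code get shorter.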
&p&Tip 1: &b&you-get (&a href=&///?target=https%3A///soimort/you-get/releases& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Releases · soimort/you-get · GitHub&i class=&icon-external&&&/i&&/a&,&/b& which lists the released versions&b&).&/b& What, you have not heard of it? To grab videos from video sites and pictures from image-sharing sites, do you really have to reinvent the wheel and write a crawler? No, all you need is:&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span class=&err&&$&/span& &span class=&n&&pip3&/span& &span class=&n&&install&/span& &span class=&n&&you&/span&&span class=&o&&-&/span&&span class=&n&&get&/span&
&/code&&/pre&&/div&&p&What can it do? Here are a few examples:&/p&&br&&p&&b&1. Downloading a Youku video&/b&&/p&&div class=&highlight&&&pre&&code class=&language-text&&$ you-get /v_show/id_XMTc1MDQwODMxNg\=\=.html
优酷 (Youku)
container:
video-profile: 超清
552.9 MiB ( bytes) # download-with: you-get --format=hd2 [URL]
Downloading 麻雀 51.flv ...
100% (552.9/552.9MB) ├██████████████████████████████████████████████████████████████████████████┤[16/16] 9 MB/s
Merging video parts... Merged into 麻雀 51.mp4
&/code&&/pre&&/div&&p&In the comments, &a class=&member_mention& href=&///people/3c7cca22& data-hash=&3c7cca22& data-hovercard=&p$b$3c7cca22&&@xavierskip&/a& pointed out that -p lets you watch Youku videos without ads. On a Mac I use mplayer;&/p&&p&it is installed and used as follows:&/p&&div class=&highlight&&&pre&&code class=&language-text&&$ brew install mplayer
$ you-get -p /usr/local/Cellar/mplayer/1.3.0/bin/mplayer /v_show/id_XMTc1MDQwODMxNg==.html
&/code&&/pre&&/div&&p&The video then plays in the local player.&/p&&br&&p&&b&2. Bilibili&/b&&/p&&div class=&highlight&&&pre&&code class=&language-text&&$ you-get /video/av6543659/
【张继科】论张继科的CP是如何被国胖队手撕的
Flash video (video/x-flv)
28.2 MiB ( Bytes)
Downloading 【张继科】论张继科的CP是如何被国胖队手撕的.flv ...
100% ( 28.2/ 28.2MB) ├████████████████████████████████████████████████████████████████████████████┤[1/1] 7 MB/s
Downloading 【张继科】论张继科的CP是如何被国胖队手撕的.cmt.xml ...
&/code&&/pre&&/div&&p&&b&3. NetEase Cloud Music&/b&&/p&&div class=&highlight&&&pre&&code class=&language-text&&$ you-get /\#/song\?id\=
5. Rolling in the deep
MP3 (audio/mpeg)
5.86 MiB (6144044 Bytes)
Downloading 5. Rolling in the deep.mp3 ...
5.9MB) ├████████████████████████████████████████████████████████████████████████████┤[1/1] 16 MB/s
Saving 5. Rolling in the deep.lrc ...Done.
&/code&&/pre&&/div&&p&&b&4. Huaban boards&/b&&/p&&div class=&highlight&&&pre&&code class=&language-text&&$ you-get /boards//
花瓣 (Huaban)
JPEG Image (image/jpeg)
inf MiB (inf Bytes)
Downloading .jpeg ...
infMB) ├────────────────────────────────────────────────────────────────────────────┤[1/1] 114 kB/s
Downloading .jpeg ...
infMB) ├────────────────────────────────────────────────────────────────────────────┤[1/1] 152 kB/s
Downloading .jpeg ...
infMB) ├────────────────────────────────────────────────────────────────────────────┤[1/1] 344 kB/s
Downloading .jpeg ...
infMB) ├────────────────────────────────────────────────────────────────────────────┤[1/1] 119 kB/s
Downloading .jpeg ...
infMB) ├────────────────────────────────────────────────────────────────────────────┤[1/1] 248 kB/s
.... (there are too many images, so the listing stops here)
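&/code&&/pre&&/div&&br&&p&Too many sites are supported to list here; see &a href=&///?target=https%3A///soimort/you-get%23supported-sites& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&GitHub - soimort/you-get: Dumb downloader that scrapes the web&i class=&icon-external&&&/i&&/a&. If you need a site that is not covered, you are welcome to add support with a PR so that others can benefit later.&/p&&br&&p&&b&&u&Go and use it!&/u&&/b&&/p&&br&&p&you-get's extensible crawler implementation is well worth studying. I believe that contributing code to it, or even just reading its source, will improve your crawling skills.&/p&&br&&p&Tip 2: &b&do not look only at the desktop website. There are also mobile sites, apps, and H5 pages, which generally have fewer anti-crawling measures. For any social-network crawler, crawl the mobile version first. &/b&Everyone seems to skip right past this one... which is a little sad.&/p&&p&&b&You are welcome to follow my WeChat official account for more Python content (you can also search for 「Python之美」 directly):&/b& &/p&&p&&a href=&///?target=http%3A///r/D0zH35LE_s_Frda89xkd& class=& external& target=&_blank& rel=&nofollow noreferrer&&&span class=&invisible&&http://&/span&&span class=&visible&&/r/D0zH35L&/span&&span class=&invisible&&E_s_Frda89xkd&/span&&span class=&ellipsis&&&/span&&i class=&icon-external&&&/i&&/a&&/p&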
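Tip 2 can be sketched with the standard library: build the request so that it presents a mobile browser User-Agent. Whether a given site actually serves a lighter mobile page for this header has to be checked per site; the URL, the User-Agent string, and the `mobile_request` helper below are placeholders invented for illustration.

```python
from urllib.request import Request

# Illustrative mobile User-Agent string (a placeholder, not tied to a real device build).
MOBILE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Mobile/15E148 Safari/604.1"
)


def mobile_request(url):
    """Build a request that identifies itself as a mobile browser."""
    return Request(url, headers={"User-Agent": MOBILE_UA})


req = mobile_request("https://m.example.com/feed")  # hypothetical mobile URL
print(req.get_header("User-agent"))
# Pass `req` to urllib.request.urlopen(req) to actually fetch the page.
```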
&img src=&/50/v2-277d38cefb9c636ac157a8ddd44af3d5_b.jpg& data-rawwidth=&1920& data-rawheight=&1080& class=&origin_image zh-lightbox-thumb& width=&1920& data-original=&/50/v2-277d38cefb9c636ac157a8ddd44af3d5_r.jpg&&&p&Translated and compiled by: 西西, wally21st&/p&&blockquote&&i&Do not repost without permission&/i&&/blockquote&&p&&b&Original link: &a href=&/?target=https%3A//mp./s%3F__biz%3DMzAxNTc0Mjg0Mg%3D%3D%26mid%3D%26idx%3D1%26sn%3Df8f0eaf9bd8a0%26chksm%3D802e2d23b759a435aa1fc3a4ce26c69a7a4ce0a769c8d9f873caf2a9d15%23rd& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Using pandas with large data: tips for cutting memory usage by up to 90%&i class=&icon-external&&&/i&&/a&&/b&&/p&&p&&br&&/p&&p&In general, performance is not a problem when using pandas on data smaller than 100 MB. With data from 100 MB up to a few GB, processing can become slow, and the program may fail because it runs out of memory.&/p&&p&&br&&/p&&p&Tools like Spark can handle large datasets from 100 GB up to several TB, but taking full advantage of them usually requires fairly expensive hardware. They also lack the rich feature set pandas offers for high-quality data cleaning, exploration, and analysis. For medium-sized data, we would rather get more out of pandas than switch to a different tool.&/p&&p&&br&&/p&&p&In this article we look at memory usage in pandas and show how simply choosing appropriate data types for columns can reduce a dataframe's memory footprint by nearly 90%.&/p&&p&&br&&/p&&h2&&b&Working with baseball game logs&/b&&/h2&&p&We will work with 130 years of Major League Baseball game data, originally sourced from &a href=&/?target=http%3A//www.retrosheet.org/gamelogs/index.html& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Retrosheet&i class=&icon-external&&&/i&&/a&&/p&&p&The raw data came as 127 separate CSV files. We have already merged them with &u&&a href=&/?target=https%3A//csvkit.readthedocs.io/en/1.0.2/& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&csvkit&i class=&icon-external&&&/i&&/a&&/u& and added a header row. If you want to download our version of the data to run the code in this article, it is available at this &u&&a href=&/?target=https%3A//data.world/dataquest/mlb-game-logs& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&download link&i class=&icon-external&&&/i&&/a&&/u&.&/p&&p&We start by importing the data and printing the first five rows:&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&kn&&import&/span& &span class=&nn&&pandas&/span& &span class=&kn&&as&/span& &span class=&nn&&pd&/span&
&span class=&n&&gl&/span& &span class=&o&&=&/span& &span class=&n&&pd&/span&&span class=&o&&.&/span&&span class=&n&&read_csv&/span&&span class=&p&&(&/span&&span class=&s1&&'game_logs.csv'&/span&&span class=&p&&)&/span&
&span class=&n&&gl&/span&&span class=&o&&.&/span&&span class=&n&&head&/span&&span class=&p&&()&/span&
&/code&&/pre&&/div&&img src=&/50/v2-9d9f6d9f87062fdffc50a7e156c3e564_b.png& data-rawwidth=&685& data-rawheight=&185& class=&origin_image zh-lightbox-thumb& width=&685& data-original=&/50/v2-9d9f6d9f87062fdffc50a7e156c3e564_r.png&&&p&&br&&/p&&p&Some of the important fields are listed below:&/p&&ul&&li&&b&date &/b&- date of the game&/li&&li&&b&v_name &/b&- visiting team name&/li&&li&&b&v_league&/b& - visiting team league&/li&&li&&b&h_name &/b&- home team name&/li&&li&&b&h_league -&/b& home team league&/li&&li&&b&v_score&/b& - visiting team score&/li&&li&&b&h_score &/b&- home team score&/li&&li&&b&v_line_score&/b& - visiting team line score, e.g. )00.&/li&&li&&b&h_line_score&/b&- home team line score, e.g. )0X.&/li&&li&&b&park_id&/b& - ID of the park where the game was held&/li&&li&&b&attendance&/b&- game attendance&/li&&/ul&&p&We can use the &b&DataFrame.info()&/b& method to get some high-level information about our dataframe, such as its size, data types, and memory usage.&/p&&p&&br&&/p&&p&By default this method reports an approximate memory usage. Here we set the &b&memory_usage&/b& parameter to &b&'deep'&/b& to get an accurate figure:&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&gl&/span&&span class=&o&&.&/span&&span class=&n&&info&/span&&span class=&p&&(&/span&&span class=&n&&memory_usage&/span&&span class=&o&&=&/span&&span class=&s1&&'deep'&/span&&span class=&p&&)&/span&
&/code&&/pre&&/div&&p&&br&&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&o&&&&/span&&span class=&k&&class&/span& &span class=&err&&'&/span&&span class=&nc&&pandas&/span&&span class=&o&&.&/span&&span class=&n&&core&/span&&span class=&o&&.&/span&&span class=&n&&frame&/span&&span class=&o&&.&/span&&span class=&n&&DataFrame&/span&&span class=&s1&&'&&/span&
&span class=&n&&RangeIndex&/span&&span class=&p&&:&/span& &span class=&mi&&171907&/span& &span class=&n&&entries&/span&&span class=&p&&,&/span& &span class=&mi&&0&/span& &span class=&n&&to&/span& &span class=&mi&&171906&/span&
&span class=&n&&Columns&/span&&span class=&p&&:&/span& &span class=&mi&&161&/span& &span class=&n&&entries&/span&&span class=&p&&,&/span& &span class=&n&&date&/span& &span class=&n&&to&/span& &span class=&n&&acquisition_info&/span&
&span class=&n&&dtypes&/span&&span class=&p&&:&/span& &span class=&n&&float64&/span&&span class=&p&&(&/span&&span class=&mi&&77&/span&&span class=&p&&),&/span& &span class=&n&&int64&/span&&span class=&p&&(&/span&&span class=&mi&&6&/span&&span class=&p&&),&/span& &span class=&nb&&object&/span&&span class=&p&&(&/span&&span class=&mi&&78&/span&&span class=&p&&)&/span&
&span class=&n&&memory&/span& &span class=&n&&usage&/span&&span class=&p&&:&/span& &span class=&mf&&861.6&/span& &span class=&n&&MB&/span&
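&/code&&/pre&&/div&&p&We can see that it has 171907 rows and 161 columns. pandas has detected the data types for us automatically: 83 numeric columns and 78 object columns. Object columns are used for strings or for columns that contain mixed data types.&/p&&p&To understand how we can reduce memory usage, let's look at how pandas stores data in memory.&/p&&p&&br&&/p&&h2&&b&The internal representation of a dataframe&/b&&/h2&&p&Under the hood, pandas groups columns into blocks of values of the same type. The figure below shows how pandas stores the first twelve columns of our dataframe:&/p&&img src=&/50/v2-19c2ec4fbfb1b995a3d9_b.png& data-rawwidth=&2263& data-rawheight=&1003& class=&origin_image zh-lightbox-thumb& width=&2263& data-original=&/50/v2-19c2ec4fbfb1b995a3d9_r.png&&&p&&br&&/p&&p&Note that the blocks do not keep references to the column names, because blocks are optimized for storing the actual values in the dataframe. The &u&&a href=&/?target=https%3A///pandas-dev/pandas/blob/master/pandas/core/internals.py%23L2691& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&BlockManager class&i class=&icon-external&&&/i&&/a&&/u& is responsible for maintaining the mapping between the row and column indexes and the actual blocks. It acts as an API that provides access to the underlying data. Whenever we select, edit, or delete values, the dataframe class uses the BlockManager interface to translate our requests into function and method calls.&/p&&p&&br&&/p&&p&Each data type has a specialized class in the pandas.core.internals module. pandas uses the ObjectBlock class to represent blocks containing string columns and the FloatBlock class to represent blocks containing float columns. For blocks of numeric values such as integers and floats, pandas combines the columns and stores them as a NumPy ndarray. The NumPy ndarray is built around a C array, and its values are stored in a contiguous block of memory. Thanks to this storage scheme, accessing a slice of values is very fast.&/p&&p&&br&&/p&&p&Because each data type is stored separately, we will examine memory usage by data type. Let's start with the average memory usage per data type:&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&k&&for&/span& &span class=&n&&dtype&/span& &span class=&ow&&in&/span& &span class=&p&&[&/span&&span class=&s1&&'float'&/span&&span class=&p&&,&/span&&span class=&s1&&'int'&/span&&span class=&p&&,&/span&&span class=&s1&&'object'&/span&&span class=&p&&]:&/span&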
    &span class=&n&&selected_dtype&/span& &span class=&o&&=&/span& &span class=&n&&gl&/span&&span class=&o&&.&/span&&span class=&n&&select_dtypes&/span&&span class=&p&&(&/span&&span class=&n&&include&/span&&span class=&o&&=&/span&&span class=&p&&[&/span&&span class=&n&&dtype&/span&&span class=&p&&])&/span&
    &span class=&n&&mean_usage_b&/span& &span class=&o&&=&/span& &span class=&n&&selected_dtype&/span&&span class=&o&&.&/span&&span class=&n&&memory_usage&/span&&span class=&p&&(&/span&&span class=&n&&deep&/span&&span class=&o&&=&/span&&span class=&bp&&True&/span&&span class=&p&&)&/span&&span class=&o&&.&/span&&span class=&n&&mean&/span&&span class=&p&&()&/span&
    &span class=&n&&mean_usage_mb&/span& &span class=&o&&=&/span& &span class=&n&&mean_usage_b&/span& &span class=&o&&/&/span& &span class=&mi&&1024&/span& &span class=&o&&**&/span& &span class=&mi&&2&/span&
    &span class=&k&&print&/span&&span class=&p&&(&/span&&span class=&s2&&&Average memory usage for {} columns: {:03.2f} MB&&/span&&span class=&o&&.&/span&&span class=&n&&format&/span&&span class=&p&&(&/span&&span class=&n&&dtype&/span&&span class=&p&&,&/span&&span class=&n&&mean_usage_mb&/span&&span class=&p&&))&/span&
&/code&&/pre&&/div&&p&&br&&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&Average&/span& &span class=&n&&memory&/span& &span class=&n&&usage&/span& &span class=&k&&for&/span& &span class=&nb&&float&/span& &span class=&n&&columns&/span&&span class=&p&&:&/span& &span class=&mf&&1.29&/span& &span class=&n&&MB&/span&
&span class=&n&&Average&/span& &span class=&n&&memory&/span& &span class=&n&&usage&/span& &span class=&k&&for&/span& &span class=&nb&&int&/span& &span class=&n&&columns&/span&&span class=&p&&:&/span& &span class=&mf&&1.12&/span& &span class=&n&&MB&/span&
&span class=&n&&Average&/span& &span class=&n&&memory&/span& &span class=&n&&usage&/span& &span class=&k&&for&/span& &span class=&nb&&object&/span& &span class=&n&&columns&/span&&span class=&p&&:&/span& &span class=&mf&&9.53&/span& &span class=&n&&MB&/span&
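&/code&&/pre&&/div&&p&We can see that the 78 object columns use the most memory. We will come back to them later; first let's see whether we can improve the memory usage of the numeric columns.&/p&&p&&br&&/p&&h2&&b&Understanding subtypes&/b&&/h2&&p&As mentioned above, under the hood pandas represents numeric values as NumPy ndarrays and stores them in a contiguous block of memory. This storage model uses less space and lets us access the values quickly. Because pandas represents every value of the same type with the same number of bytes, and a NumPy ndarray stores the number of values, pandas can report the number of bytes a numeric column uses quickly and accurately.&/p&&p&&br&&/p&&p&Many types in pandas have multiple subtypes that can use fewer bytes to represent each value. For example, the float type has the subtypes float16, float32, and float64. The number in the type name indicates how many bits the type uses to represent a value, so the subtypes just listed use 2, 4, and 8 bytes respectively. The following table shows the subtypes of the most common pandas types:&/p&&img src=&/50/v2-5e6aaa3b096cd4a821d1f462de4c45de_b.png& data-rawwidth=&450& data-rawheight=&188& class=&origin_image zh-lightbox-thumb& width=&450& data-original=&/50/v2-5e6aaa3b096cd4a821d1f462de4c45de_r.png&&&p&&br&&/p&&p&An int8 value uses 1 byte (8 bits) of storage and can represent 256 (2^8) distinct values. This means we can use this subtype to represent values from -128 to 127, including 0.&/p&&p&We can use the numpy.iinfo class to verify the minimum and maximum values of each integer subtype, as follows:&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&kn&&import&/span& &span class=&nn&&numpy&/span& &span class=&kn&&as&/span& &span class=&nn&&np&/span&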
&span class=&n&&int_types&/span& &span class=&o&&=&/span& &span class=&p&&[&/span&&span class=&s2&&&uint8&&/span&&span class=&p&&,&/span& &span class=&s2&&&int8&&/span&&span class=&p&&,&/span& &span class=&s2&&&int16&&/span&&span class=&p&&]&/span&
&span class=&k&&for&/span& &span class=&n&&it&/span& &span class=&ow&&in&/span& &span class=&n&&int_types&/span&&span class=&p&&:&/span&
    &span class=&k&&print&/span&&span class=&p&&(&/span&&span class=&n&&np&/span&&span class=&o&&.&/span&&span class=&n&&iinfo&/span&&span class=&p&&(&/span&&span class=&n&&it&/span&&span class=&p&&))&/span&
&/code&&/pre&&/div&&p&&br&&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&Machine&/span& &span class=&n&&parameters&/span& &span class=&k&&for&/span& &span class=&n&&uint8&/span&
&span class=&o&&-----------------------------------------------------&/span&
&span class=&nb&&min&/span& &span class=&o&&=&/span& &span class=&mi&&0&/span&
&span class=&nb&&max&/span& &span class=&o&&=&/span& &span class=&mi&&255&/span&
&span class=&o&&-----------------------------------------------------&/span&
&span class=&n&&Machine&/span& &span class=&n&&parameters&/span& &span class=&k&&for&/span& &span class=&n&&int8&/span&
&span class=&o&&-----------------------------------------------------&/span&
&span class=&nb&&min&/span& &span class=&o&&=&/span& &span class=&o&&-&/span&&span class=&mi&&128&/span&
&span class=&nb&&max&/span& &span class=&o&&=&/span& &span class=&mi&&127&/span&
&span class=&o&&-----------------------------------------------------&/span&
&span class=&n&&Machine&/span& &span class=&n&&parameters&/span& &span class=&k&&for&/span& &span class=&n&&int16&/span&
&span class=&o&&-----------------------------------------------------&/span&
&span class=&nb&&min&/span& &span class=&o&&=&/span& &span class=&o&&-&/span&&span class=&mi&&32768&/span&
&span class=&nb&&max&/span& &span class=&o&&=&/span& &span class=&mi&&32767&/span&
&span class=&o&&-----------------------------------------------------&/span&
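&/code&&/pre&&/div&&p&Here we can also see the difference between uint (unsigned integers) and int (signed integers). Both use the same amount of storage, but because unsigned integers hold only non-negative values, they cover a larger non-negative range and so store columns without negative values more efficiently.&/p&&p&&br&&/p&&h2&&b&Optimizing Numeric Columns with Subtypes&/b&&/h2&&p&We can use the function pd.to_numeric() to downcast numeric types. We use DataFrame.select_dtypes to select only the integer columns, then optimize their types and compare the memory usage.&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&c1&&# We're going to be calculating memory usage a lot,&/span&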
&span class=&c1&&# so we'll create a function to save us some time!&/span&
&span class=&k&&def&/span& &span class=&nf&&mem_usage&/span&&span class=&p&&(&/span&&span class=&n&&pandas_obj&/span&&span class=&p&&):&/span&
    &span class=&k&&if&/span& &span class=&nb&&isinstance&/span&&span class=&p&&(&/span&&span class=&n&&pandas_obj&/span&&span class=&p&&,&/span&&span class=&n&&pd&/span&&span class=&o&&.&/span&&span class=&n&&DataFrame&/span&&span class=&p&&):&/span&
        &span class=&n&&usage_b&/span& &span class=&o&&=&/span& &span class=&n&&pandas_obj&/span&&span class=&o&&.&/span&&span class=&n&&memory_usage&/span&&span class=&p&&(&/span&&span class=&n&&deep&/span&&span class=&o&&=&/span&&span class=&bp&&True&/span&&span class=&p&&)&/span&&span class=&o&&.&/span&&span class=&n&&sum&/span&&span class=&p&&()&/span&
    &span class=&k&&else&/span&&span class=&p&&:&/span& &span class=&c1&&# we assume if not a df it's a series&/span&
        &span class=&n&&usage_b&/span& &span class=&o&&=&/span& &span class=&n&&pandas_obj&/span&&span class=&o&&.&/span&&span class=&n&&memory_usage&/span&&span class=&p&&(&/span&&span class=&n&&deep&/span&&span class=&o&&=&/span&&span class=&bp&&True&/span&&span class=&p&&)&/span&
    &span class=&n&&usage_mb&/span& &span class=&o&&=&/span& &span class=&n&&usage_b&/span& &span class=&o&&/&/span& &span class=&mi&&1024&/span& &span class=&o&&**&/span& &span class=&mi&&2&/span& &span class=&c1&&# convert bytes to megabytes&/span&
    &span class=&k&&return&/span& &span class=&s2&&&{:03.2f} MB&&/span&&span class=&o&&.&/span&&span class=&n&&format&/span&&span class=&p&&(&/span&&span class=&n&&usage_mb&/span&&span class=&p&&)&/span&
&span class=&n&&gl_int&/span& &span class=&o&&=&/span& &span class=&n&&gl&/span&&span class=&o&&.&/span&&span class=&n&&select_dtypes&/span&&span class=&p&&(&/span&&span class=&n&&include&/span&&span class=&o&&=&/span&&span class=&p&&[&/span&&span class=&s1&&'int'&/span&&span class=&p&&])&/span&
&span class=&n&&converted_int&/span& &span class=&o&&=&/span& &span class=&n&&gl_int&/span&&span class=&o&&.&/span&&span class=&n&&apply&/span&&span class=&p&&(&/span&&span class=&n&&pd&/span&&span class=&o&&.&/span&&span class=&n&&to_numeric&/span&&span class=&p&&,&/span&&span class=&n&&downcast&/span&&span class=&o&&=&/span&&span class=&s1&&'unsigned'&/span&&span class=&p&&)&/span&
&span class=&k&&print&/span&&span class=&p&&(&/span&&span class=&n&&mem_usage&/span&&span class=&p&&(&/span&&span class=&n&&gl_int&/span&&span class=&p&&))&/span&
&span class=&k&&print&/span&&span class=&p&&(&/span&&span class=&n&&mem_usage&/span&&span class=&p&&(&/span&&span class=&n&&converted_int&/span&&span class=&p&&))&/span&
&span class=&n&&compare_ints&/span& &span class=&o&&=&/span& &span class=&n&&pd&/span&&span class=&o&&.&/span&&span class=&n&&concat&/span&&span class=&p&&([&/span&&span class=&n&&gl_int&/span&&span class=&o&&.&/span&&span class=&n&&dtypes&/span&&span class=&p&&,&/span&&span class=&n&&converted_int&/span&&span class=&o&&.&/span&&span class=&n&&dtypes&/span&&span class=&p&&],&/span&&span class=&n&&axis&/span&&span class=&o&&=&/span&&span class=&mi&&1&/span&&span class=&p&&)&/span&
&span class=&n&&compare_ints&/span&&span class=&o&&.&/span&&span class=&n&&columns&/span& &span class=&o&&=&/span& &span class=&p&&[&/span&&span class=&s1&&'before'&/span&&span class=&p&&,&/span&&span class=&s1&&'after'&/span&&span class=&p&&]&/span&
&span class=&n&&compare_ints&/span&&span class=&o&&.&/span&&span class=&n&&apply&/span&&span class=&p&&(&/span&&span class=&n&&pd&/span&&span class=&o&&.&/span&&span class=&n&&Series&/span&&span class=&o&&.&/span&&span class=&n&&value_counts&/span&&span class=&p&&)&/span&
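# Added sketch (not part of the original article): a standalone toy Series,
# with made-up values, showing how pd.to_numeric picks the smallest subtype
# that can hold every value in the column.
import pandas as pd

toy = pd.Series([0, 10, 200], dtype='int64')
toy_unsigned = pd.to_numeric(toy, downcast='unsigned')  # every value fits in 0..255
toy_signed = pd.to_numeric(toy, downcast='integer')     # 200 is above int8's max of 127
print(toy.dtype, toy_unsigned.dtype, toy_signed.dtype)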
&/code&&/pre&&/div&&p&&br&&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&mf&&7.87&/span& &span class=&n&&MB&/span&
&span class=&mf&&1.48&/span& &span class=&n&&MB&/span&
&/code&&/pre&&/div&&img src=&/50/v2-eaafbae097e_b.png& data-rawwidth=&188& data-rawheight=&127& class=&content_image& width=&188&&&p&Memory usage drops from 7.9 MB to 1.5 MB, a reduction of more than 80%. The effect on the original dataframe is limited, though, because it contains very few integer columns.&/p&&p&Let's do the same with the float columns:&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&gl_float&/span& &span class=&o&&=&/span& &span class=&n&&gl&/span&&span class=&o&&.&/span&&span class=&n&&select_dtypes&/span&&span class=&p&&(&/span&&span class=&n&&include&/span&&span class=&o&&=&/span&&span class=&p&&[&/span&&span class=&s1&&'float'&/span&&span class=&p&&])&/span&
&span class=&n&&converted_float&/span& &span class=&o&&=&/span& &span class=&n&&gl_float&/span&&span class=&o&&.&/span&&span class=&n&&apply&/span&&span class=&p&&(&/span&&span class=&n&&pd&/span&&span class=&o&&.&/span&&span class=&n&&to_numeric&/span&&span class=&p&&,&/span&&span class=&n&&downcast&/span&&span class=&o&&=&/span&&span class=&s1&&'float'&/span&&span class=&p&&)&/span&
&span class=&k&&print&/span&&span class=&p&&(&/span&&span class=&n&&mem_usage&/span&&span class=&p&&(&/span&&span class=&n&&gl_float&/span&&span class=&p&&))&/span&
&span class=&k&&print&/span&&span class=&p&&(&/span&&span class=&n&&mem_usage&/span&&span class=&p&&(&/span&&span class=&n&&converted_float&/span&&span class=&p&&))&/span&
&span class=&n&&compare_floats&/span& &span class=&o&&=&/span& &span class=&n&&pd&/span&&span class=&o&&.&/span&&span class=&n&&concat&/span&&span class=&p&&([&/span&&span class=&n&&gl_float&/span&&span class=&o&&.&/span&&span class=&n&&dtypes&/span&&span class=&p&&,&/span&&span class=&n&&converted_float&/span&&span class=&o&&.&/span&&span class=&n&&dtypes&/span&&span class=&p&&],&/span&&span class=&n&&axis&/span&&span class=&o&&=&/span&&span class=&mi&&1&/span&&span class=&p&&)&/span&
&span class=&n&&compare_floats&/span&&span class=&o&&.&/span&&span class=&n&&columns&/span& &span class=&o&&=&/span& &span class=&p&&[&/span&&span class=&s1&&'before'&/span&&span class=&p&&,&/span&&span class=&s1&&'after'&/span&&span class=&p&&]&/span&
&span class=&n&&compare_floats&/span&&span class=&o&&.&/span&&span class=&n&&apply&/span&&span class=&p&&(&/span&&span class=&n&&pd&/span&&span class=&o&&.&/span&&span class=&n&&Series&/span&&span class=&o&&.&/span&&span class=&n&&value_counts&/span&&span class=&p&&)&/span&
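# Added sketch (not part of the original article): a standalone toy Series,
# with made-up values, showing that downcast='float' goes from float64
# (8 bytes per value) to float32 (4 bytes per value).
import pandas as pd

toy_float = pd.Series([1.5, 2.25, 3.0], dtype='float64')
toy_float32 = pd.to_numeric(toy_float, downcast='float')
print(toy_float.dtype, toy_float.nbytes)
print(toy_float32.dtype, toy_float32.nbytes)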
&/code&&/pre&&/div&&p&We can see that all the float columns were converted from float64 to float32, which cuts their memory usage by 50%.&/p&&p&&br&&/p&&p&Let's create a copy of the original dataframe, assign the optimized numeric columns in place of the originals, and see the overall effect on memory usage.&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&optimized_gl&/span& &span class=&o&&=&/span& &span class=&n&&gl&/span&&span class=&o&&.&/span&&span class=&n&&copy&/span&&span class=&p&&()&/span&
&span class=&n&&optimized_gl&/span&&span class=&p&&[&/span&&span class=&n&&converted_int&/span&&span class=&o&&.&/span&&span class=&n&&columns&/span&&span class=&p&&]&/span& &span class=&o&&=&/span& &span class=&n&&converted_int&/span&
&span class=&n&&optimized_gl&/span&&span class=&p&&[&/span&&span class=&n&&converted_float&/span&&span class=&o&&.&/span&&span class=&n&&columns&/span&&span class=&p&&]&/span& &span class=&o&&=&/span& &span class=&n&&converted_float&/span&
&span class=&k&&print&/span&&span class=&p&&(&/span&&span class=&n&&mem_usage&/span&&span class=&p&&(&/span&&span class=&n&&gl&/span&&span class=&p&&))&/span&
&span class=&k&&print&/span&&span class=&p&&(&/span&&span class=&n&&mem_usage&/span&&span class=&p&&(&/span&&span class=&n&&optimized_gl&/span&&span class=&p&&))&/span&
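&/code&&/pre&&/div&&p&Although we have greatly reduced the memory used by the numeric columns, the overall memory usage of the dataframe falls by only 7%. Most of the remaining gains will come from optimizing the object types.&/p&&p&&br&&/p&&p&Before we do that, let's look at how pandas stores strings compared with numeric types.&/p&&p&&br&&/p&&h2&&b&Comparing Numeric and String Storage&/b&&/h2&&p&The object type represents values using Python string objects, partly because NumPy lacks support for missing string values. Because Python is a high-level, interpreted language, it does not offer fine-grained control over how values are stored in memory.&/p&&p&&br&&/p&&p&This limitation causes strings to be stored in a fragmented way that consumes more memory and is slower to access. Each element in an object column is really a pointer holding the address of the actual value in memory.&/p&&p&&br&&/p&&p&The diagram below contrasts how numeric data is stored in NumPy data types with how strings are stored using Python's built-in types.&/p&&img src=&/50/v2-a9aa408a5c3a78bd3fba18_b.png& data-rawwidth=&871& data-rawheight=&486& class=&origin_image zh-lightbox-thumb& width=&871& data-original=&/50/v2-a9aa408a5c3a78bd3fba18_r.png&&&p&&i&Diagram adapted from &u&&a href=&/?target=https%3A//jakevdp.github.io/blog//why-python-is-slow/& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Why Python Is Slow&i class=&icon-external&&&/i&&/a&&/u&&/i&&/p&&p&&br&&/p&&p&You may have noticed that the table above describes object types as using a variable amount of memory. Each pointer takes a fixed number of bytes (8 on a typical 64-bit build), while each string value takes the same amount of memory it would use if stored on its own in Python. Let's use sys.getsizeof() to show this, looking first at individual Python strings and then at the same strings in a pandas series.&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&kn&&from&/span& &span class=&nn&&sys&/span& &span class=&kn&&import&/span& &span class=&n&&getsizeof&/span&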
&span class=&n&&s1&/span& &span class=&o&&=&/span& &span class=&s1&&'working out'&/span&
&span class=&n&&s2&/span& &span class=&o&&=&/span& &span class=&s1&&'memory usage for'&/span&
&span class=&n&&s3&/span& &span class=&o&&=&/span& &span class=&s1&&'strings in python is fun!'&/span&
&span class=&n&&s4&/span& &span class=&o&&=&/span& &span class=&s1&&'strings in python is fun!'&/span&
&span class=&k&&for&/span& &span class=&n&&s&/span& &span class=&ow&&in&/span& &span class=&p&&[&/span&&span class=&n&&s1&/span&&span class=&p&&,&/span& &span class=&n&&s2&/span&&span class=&p&&,&/span& &span class=&n&&s3&/span&&span class=&p&&,&/span& &span class=&n&&s4&/span&&span class=&p&&]:&/span&
    &span class=&k&&print&/span&&span class=&p&&(&/span&&span class=&n&&getsizeof&/span&&span class=&p&&(&/span&&span class=&n&&s&/span&&span class=&p&&))&/span&
&/code&&/pre&&/div&&p&&br&&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&mi&&60&/span&
&span class=&mi&&65&/span&
&span class=&mi&&74&/span&
&span class=&mi&&74&/span&
&/code&&/pre&&/div&&p&&br&&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&obj_series&/span& &span class=&o&&=&/span& &span class=&n&&pd&/span&&span class=&o&&.&/span&&span class=&n&&Series&/span&&span class=&p&&([&/span&&span class=&s1&&'working out'&/span&&span class=&p&&,&/span&
&span class=&s1&&'memory usage for'&/span&&span class=&p&&,&/span&
&span class=&s1&&'strings in python is fun!'&/span&&span class=&p&&,&/span&
&span class=&s1&&'strings in python is fun!'&/span&&span class=&p&&])&/span&
&span class=&n&&obj_series&/span&&span class=&o&&.&/span&&span class=&n&&apply&/span&&span class=&p&&(&/span&&span class=&n&&getsizeof&/span&&span class=&p&&)&/span&
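# Added sketch (not part of the original article): for an object column,
# memory_usage(deep=False) counts only the pointers, while deep=True also
# adds the size of every Python string. index=False leaves the index out.
# dtype='object' is set explicitly as an assumption, so the sketch does not
# depend on how a given pandas version infers string columns.
import pandas as pd
from sys import getsizeof

toy_obj = pd.Series(['working out', 'memory usage for'], dtype='object')
pointer_bytes = toy_obj.memory_usage(index=False, deep=False)
total_bytes = toy_obj.memory_usage(index=False, deep=True)
string_bytes = sum(getsizeof(s) for s in toy_obj)
print(pointer_bytes, string_bytes, total_bytes)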
&/code&&/pre&&/div&&p&&br&&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&mi&&0&/span&
&span class=&mi&&60&/span&
&span class=&mi&&1&/span&
&span class=&mi&&65&/span&
&span class=&mi&&2&/span&
&span class=&mi&&74&/span&
&span class=&mi&&3&/span&
&span class=&mi&&74&/span&
&span class=&n&&dtype&/span&&span class=&p&&:&/span& &span class=&n&&int64&/span&
&/code&&/pre&&/div&&p&You can see that each string has the same size in a pandas series as it does as a standalone Python string.&/p&&p&&br&&/p&&h2&&b&Optimizing object Types with Categoricals&/b&&/h2&&p&Pandas introduced Categoricals in version 0.15. The category type uses integer values under the hood to represent the values in a column, rather than the raw values. Pandas keeps a separate mapping from these integers to the raw values. This arrangement works well whenever a column contains a limited set of values. When we convert a column to the category type, pandas picks the most space-efficient int subtype that can represent all of the unique values in the column.&/p&&img src=&/50/v2-80ed135daedc_b.png& data-rawwidth=&405& data-rawheight=&316& class=&content_image& width=&405&&&p&&br&&/p&&p&To see where we might use this type to reduce memory, let's look at the number of unique values in each of our object columns.&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&gl_obj&/span& &span class=&o&&=&/span& &span class=&n&&gl&/span&&span class=&o&&.&/span&&span class=&n&&select_dtypes&/span&&span class=&p&&(&/span&&span class=&n&&include&/span&&span class=&o&&=&/span&&span class=&p&&[&/span&&span class=&s1&&'object'&/span&&span class=&p&&])&/span&&span class=&o&&.&/span&&span class=&n&&copy&/span&&span class=&p&&()&/span&
&span class=&n&&gl_obj&/span&&span class=&o&&.&/span&&span class=&n&&describe&/span&&span class=&p&&()&/span&
&/code&&/pre&&/div&&img src=&/50/v2-28edfa5aff92_b.png& data-rawwidth=&583& data-rawheight=&163& class=&origin_image zh-lightbox-thumb& width=&583& data-original=&/50/v2-28edfa5aff92_r.png&&&p&We can see that many columns have only a few unique values across the nearly 172,000 games in our data set.&/p&&p&Let's first pick a single object column and see what happens when we convert it to the category type. We will use the second column of our data set, day_of_week.&/p&&p&From the table above we can see that it contains only 7 unique values. We convert it to the category type with the .astype() method.&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&dow&/span& &span class=&o&&=&/span& &span class=&n&&gl_obj&/span&&span class=&o&&.&/span&&span class=&n&&day_of_week&/span&
&span class=&k&&print&/span&&span class=&p&&(&/span&&span class=&n&&dow&/span&&span class=&o&&.&/span&&span class=&n&&head&/span&&span class=&p&&())&/span&
&span class=&n&&dow_cat&/span& &span class=&o&&=&/span& &span class=&n&&dow&/span&&span class=&o&&.&/span&&span class=&n&&astype&/span&&span class=&p&&(&/span&&span class=&s1&&'category'&/span&&span class=&p&&)&/span&
&span class=&k&&print&/span&&span class=&p&&(&/span&&span class=&n&&dow_cat&/span&&span class=&o&&.&/span&&span class=&n&&head&/span&&span class=&p&&())&/span&
&/code&&/pre&&/div&&p&&br&&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&mi&&0&/span&
&span class=&n&&Thu&/span&
&span class=&mi&&1&/span&
&span class=&n&&Fri&/span&
&span class=&mi&&2&/span&
&span class=&n&&Sat&/span&
&span class=&mi&&3&/span&
&span class=&n&&Mon&/span&
&span class=&mi&&4&/span&
&span class=&n&&Tue&/span&
&span class=&n&&Name&/span&&span class=&p&&:&/span& &span class=&n&&day_of_week&/span&&span class=&p&&,&/span& &span class=&n&&dtype&/span&&span class=&p&&:&/span& &span class=&nb&&object&/span&
&span class=&mi&&0&/span&
&span class=&n&&Thu&/span&
&span class=&mi&&1&/span&
&span class=&n&&Fri&/span&
&span class=&mi&&2&/span&
&span class=&n&&Sat&/span&
&span class=&mi&&3&/span&
&span class=&n&&Mon&/span&
&span class=&mi&&4&/span&
&span class=&n&&Tue&/span&
&span class=&n&&Name&/span&&span class=&p&&:&/span& &span class=&n&&day_of_week&/span&&span class=&p&&,&/span& &span class=&n&&dtype&/span&&span class=&p&&:&/span& &span class=&n&&category&/span&
&span class=&n&&Categories&/span& &span class=&p&&(&/span&&span class=&mi&&7&/span&&span class=&p&&,&/span& &span class=&nb&&object&/span&&span class=&p&&):&/span& &span class=&p&&[&/span&&span class=&n&&Fri&/span&&span class=&p&&,&/span& &span class=&n&&Mon&/span&&span class=&p&&,&/span& &span class=&n&&Sat&/span&&span class=&p&&,&/span& &span class=&n&&Sun&/span&&span class=&p&&,&/span& &span class=&n&&Thu&/span&&span class=&p&&,&/span& &span class=&n&&Tue&/span&&span class=&p&&,&/span& &span class=&n&&Wed&/span&&span class=&p&&]&/span&
&/code&&/pre&&/div&&p&&br&&/p&&p&As you can see, apart from the column's type changing, the data looks exactly the same. Let's look at what is happening under the hood.&/p&&p&In the code below, we use the Series.cat.codes attribute to return the integer values the category type uses to represent each value.&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&dow_cat&/span&&span class=&o&&.&/span&&span class=&n&&head&/span&&span class=&p&&()&/span&&span class=&o&&.&/span&&span class=&n&&cat&/span&&span class=&o&&.&/span&&span class=&n&&codes&/span&
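# Added sketch (not part of the original article): a standalone toy column,
# with made-up values, showing the category-to-code mapping and the -1 code
# that stands for a missing value.
import pandas as pd

toy_days = pd.Series(['Thu', 'Fri', None, 'Thu'], dtype='object').astype('category')
print(list(toy_days.cat.categories))  # the unique values, sorted
print(list(toy_days.cat.codes))       # one integer code per row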
&/code&&/pre&&/div&&p&&br&&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&mi&&0&/span&
&span class=&mi&&4&/span&
&span class=&mi&&1&/span&
&span class=&mi&&0&/span&
&span class=&mi&&2&/span&
&span class=&mi&&2&/span&
&span class=&mi&&3&/span&
&span class=&mi&&1&/span&
&span class=&mi&&4&/span&
&span class=&mi&&5&/span&
&span class=&n&&dtype&/span&&span class=&p&&:&/span& &span class=&n&&int8&/span&
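&/code&&/pre&&/div&&p&You can see that each unique value has been assigned an integer, and that the underlying type of the column is now int8. This column has no missing values, but if it did, the category type would represent them with the code -1.&/p&&p&Finally, let's look at the memory usage of this column before and after converting it to the category type.&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&k&&print&/span&&span class=&p&&(&/span&&span class=&n&&mem_usage&/span&&span class=&p&&(&/span&&span class=&n&&dow&/span&&span class=&p&&))&/span&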
&span class=&k&&print&/span&&span class=&p&&(&/span&&span class=&n&&mem_usage&/span&&span class=&p&&(&/span&&span class=&n&&dow_cat&/span&&span class=&p&&))&/span&
&span class=&mf&&9.84&/span& &span class=&n&&MB&/span&
&span class=&mf&&0.16&/span& &span class=&n&&MB&/span&
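&/code&&/pre&&/div&&p&&br&&/p&&p&Memory usage drops from 9.8 MB to 0.16 MB, a reduction of about 98%! Note that this particular column is probably one of our best cases: a column of nearly 172,000 items with only 7 unique values.&/p&&p&&br&&/p&&p&It sounds appealing to convert every such column to the category type, but there are trade-offs to weigh. The biggest one is the loss of numerical computation: we cannot do arithmetic on category columns, or use methods such as Series.min() and Series.max(), without first converting them to a true numeric type.&/p&&p&&br&&/p&&p&We should use the category type primarily for object columns where fewer than 50% of the values are unique. If every value in a column is unique, the category type will end up using more memory, because the column then stores all of the raw string values in addition to the integer category codes. For more on the limitations of the category type, see the &u&&a href=&/?target=http%3A//pandas.pydata.org/pandas-docs/stable/categorical.html%23gotchas& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&pandas documentation&i class=&icon-external&&&/i&&/a&&/u&.&/p&&p&&br&&/p&&p&Below we write a loop that iterates over each object column, checks whether its share of unique values is below 50%, and if so converts it to the category type.&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&converted_obj&/span& &span class=&o&&=&/span& &span class=&n&&pd&/span&&span class=&o&&.&/span&&span class=&n&&DataFrame&/span&&span class=&p&&()&/span&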
&span class=&k&&for&/span& &span class=&n&&col&/span& &span class=&ow&&in&/span& &span class=&n&&gl_obj&/span&&span class=&o&&.&/span&&span class=&n&&columns&/span&&span class=&p&&:&/span&
    &span class=&n&&num_unique_values&/span& &span class=&o&&=&/span& &span class=&nb&&len&/span&&span class=&p&&(&/span&&span class=&n&&gl_obj&/span&&span class=&p&&[&/span&&span class=&n&&col&/span&&span class=&p&&]&/span&&span class=&o&&.&/span&&span class=&n&&unique&/span&&span class=&p&&())&/span&
    &span class=&n&&num_total_values&/span& &span class=&o&&=&/span& &span class=&nb&&len&/span&&span class=&p&&(&/span&&span class=&n&&gl_obj&/span&&span class=&p&&[&/span&&span class=&n&&col&/span&&span class=&p&&])&/span&
    &span class=&k&&if&/span& &span class=&n&&num_unique_values&/span& &span class=&o&&/&/span& &span class=&n&&num_total_values&/span& &span class=&o&&&lt;&/span& &span class=&mf&&0.5&/span&&span class=&p&&:&/span&
        &span class=&n&&converted_obj&/span&&span class=&o&&.&/span&&span class=&n&&loc&/span&&span class=&p&&[:,&/span&&span class=&n&&col&/span&&span class=&p&&]&/span& &span class=&o&&=&/span& &span class=&n&&gl_obj&/span&&span class=&p&&[&/span&&span class=&n&&col&/span&&span class=&p&&]&/span&&span class=&o&&.&/span&&span class=&n&&astype&/span&&span class=&p&&(&/span&&span class=&s1&&'category'&/span&&span class=&p&&)&/span&
    &span class=&k&&else&/span&&span class=&p&&:&/span&
        &span class=&n&&converted_obj&/span&&span class=&o&&.&/span&&span class=&n&&loc&/span&&span class=&p&&[:,&/span&&span class=&n&&col&/span&&span class=&p&&]&/span& &span class=&o&&=&/span& &span class=&n&&gl_obj&/span&&span class=&p&&[&/span&&span class=&n&&col&/span&&span class=&p&&]&/span&
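# Added sketch (not part of the original article): the unique-value ratio
# that the loop checks can also be computed for every column at once with
# DataFrame.nunique(). The toy frame and its column names are made up for
# illustration.
import pandas as pd

toy_frame = pd.DataFrame({'day': ['Thu', 'Fri', 'Thu', 'Thu', 'Fri'],
                          'game_id': ['a1', 'b2', 'c3', 'd4', 'e5']}, dtype='object')
unique_ratio = toy_frame.nunique() / len(toy_frame)
print(unique_ratio)  # day falls below the 0.5 threshold, game_id does not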
&/code&&/pre&&/div&&p&We compare the results as before:&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&k&&print&/span&&span class=&p&&(&/span&&span class=&n&&mem_usage&/span&&span class=&p&&(&/span&&span class=&n&&gl_obj&/span&&span class=&p&&))&/span&
&span class=&k&&print&/span&&span class=&p&&(&/span&&span class=&n&&mem_usage&/span&&span class=&p&&(&/span&&span class=&n&&converted_obj&/span&&span class=&p&&))&/span&
&span class=&n&&compare_obj&/span& &span class=&o&&=&/span& &span class=&n&&pd&/span&&span class=&o&&.&/span&&span class=&n&&concat&/span&&span class=&p&&([&/span&&span class=&n&&gl_obj&/span&&span class=&o&&.&/span&&span class=&n&&dtypes&/span&&span class=&p&&,&/span&&span class=&n&&converted_obj&/span&&span class=&o&&.&/span&&span class=&n&&dtypes&/span&&span class=&p&&],&/span&&span class=&n&&axis&/span&&span class=&o&&=&/span&&span class=&mi&&1&/span&&span class=&p&&)&/span&
&span class=&n&&compare_obj&/span&&span class=&o&&.&/span&&span class=&n&&columns&/span& &span class=&o&&=&/span& &span class=&p&&[&/span&&span class=&s1&&'before'&/span&&span class=&p&&,&/span&&span class=&s1&&'after'&/span&&span class=&p&&]&/span&
&span class=&n&&compare_obj&/span&&span class=&o&&.&/span&&span class=&n&&apply&/span&&span class=&p&&(&/span&&span class=&n&&pd&/span&&span class=&o&&.&/span&&span class=&n&&Series&/span&&span class=&o&&.&/span&&span class=&n&&value_counts&/span&&span class=&p&&)&/span&
&span class=&mf&&752.72&/span& &span class=&n&&MB&/span&
&span class=&mf&&51.67&/span& &span class=&n&&MB&/span&
&/code&&/pre&&/div&&img src=&/50/v2-566f529e161e7ca45ff7c33b43ac6b7e_b.png& data-rawwidth=&190& data-rawheight=&105& class=&content_image& width=&190&&&p&In this case, all of our object columns were converted to the category type, but that will not be true of every data set, so you should still run the check above.&/p&&p&The highlight is that memory usage for the object columns drops from 752.72 MB to 51.67 MB, a reduction of 93%. Let's combine this with the rest of our dataframe and see where we stand relative to the 861 MB we started with.&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&optimized_gl&/span&&span class=&p&&[&/span&&span class=&n&&converted_obj&/span&&span class=&o&&.&/span&&span class=&n&&columns&/span&&span class=&p&&]&/span& &span class=&o&&=&/span& &span class=&n&&converted_obj&/span&
&span class=&n&&mem_usage&/span&&span class=&p&&(&/span&&span class=&n&&optimized_gl&/span&&span class=&p&&)&/span&
&span class=&s1&&'103.64 MB'&/span&
&/code&&/pre&&/div&&p&&br&&/p&&p&Nice, we have made good progress! There is one more optimization we can make. If you recall the table of types from earlier, the first column of our data set can also be represented with the datetime type.&br&&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&date&/span& &span class=&o&&=&/span& &span class=&n&&optimized_gl&/span&&span class=&o&&.&/span&&span class=&n&&date&/span&
&span class=&k&&print&/span&&span class=&p&&(&/span&&span class=&n&&mem_usage&/span&&span class=&p&&(&/span&&span class=&n&&date&/span&&span class=&p&&))&/span&
&span class=&n&&date&/span&&span class=&o&&.&/span&&span class=&n&&head&/span&&span class=&p&&()&/span&
&span class=&mf&&0.66&/span& &span class=&n&&MB&/span&
&span class=&mi&&0&/span&
&span class=&mi&&&/span&
&span class=&mi&&1&/span&
&span class=&mi&&&/span&
&span class=&mi&&2&/span&
&span class=&mi&&&/span&
&span class=&mi&&3&/span&
&span class=&mi&&&/span&
&span class=&mi&&4&/span&
&span class=&mi&&&/span&
&span class=&n&&Name&/span&&span class=&p&&:&/span& &span class=&n&&date&/span&&span class=&p&&,&/span& &span class=&n&&dtype&/span&&span class=&p&&:&/span& &span class=&n&&uint32&/span&
&/code&&/pre&&/div&&p&&br&&/p&&p&You may remember that this column was read in as an integer type and has already been optimized to uint32. Converting it to datetime will therefore double its memory usage, since the datetime type is 64 bits wide. The conversion is still worthwhile because it makes time series analysis easier.&/p&&p&&br&&/p&&p&We convert with the pandas.to_datetime() function, using the format parameter to tell it that the dates are stored as YYYYMMDD.&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&optimized_gl&/span&&span class=&p&&[&/span&&span class=&s1&&'date'&/span&&span class=&p&&]&/span& &span class=&o&&=&/span& &span class=&n&&pd&/span&&span class=&o&&.&/span&&span class=&n&&to_datetime&/span&&span class=&p&&(&/span&&span class=&n&&date&/span&&span class=&p&&,&/span&&span class=&n&&format&/span&&span class=&o&&=&/span&&span class=&s1&&'%Y%m&/span&&span class=&si&&%d&/span&&span class=&s1&&'&/span&&span class=&p&&)&/span&
&span class=&k&&print&/span&&span class=&p&&(&/span&&span class=&n&&mem_usage&/span&&span class=&p&&(&/span&&span class=&n&&optimized_gl&/span&&span class=&p&&))&/span&
&span class=&n&&optimized_gl&/span&&span class=&o&&.&/span&&span class=&n&&date&/span&&span class=&o&&.&/span&&span class=&n&&head&/span&&span class=&p&&()&/span&
&span class=&mf&&104.29&/span& &span class=&n&&MB&/span&
&span class=&mi&&0&/span&
&span class=&mi&&1871&/span&&span class=&o&&-&/span&&span class=&mo&&05&/span&&span class=&o&&-&/span&&span class=&mo&&04&/span&
&span class=&mi&&1&/span&
&span class=&mi&&1871&/span&&span class=&o&&-&/span&&span class=&mo&&05&/span&&span class=&o&&-&/span&&span class=&mo&&05&/span&
&span class=&mi&&2&/span&
&span class=&mi&&1871&/span&&span class=&o&&-&/span&&span class=&mo&&05&/span&&span class=&o&&-&/span&&span class=&mo&&06&/span&
&span class=&mi&&3&/span&
&span class=&mi&&1871&/span&&span class=&o&&-&/span&&span class=&mo&&05&/span&&
