Why awk can be used to analyze large nginx logs




Source: web crawl (WebSpider) | Time: 2018-01-23 19:58 | Tags: python, awk, log analysis

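HTTP Server access log format
Defining the log format
In the HTTP Server configuration file we can either use a predefined classic format or define a custom access log format. Unless noted otherwise, the rest of this article assumes the log uses the classic format named combined.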
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
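A brief description of each field:
%h = IP address of the client that made the request. The address recorded here is not necessarily that of the real user's machine; it may be the public (NAT) address of a private-network client, or the address of a proxy server.
%l = The client's RFC 1413 identity. It is available only from clients that implement RFC 1413.
%u = ID of the requesting user
%t = Time the request was received
%r = Request line sent by the client
%>s = Status code the server returned to the client
%b = Size of the response in bytes, excluding the response headers
%{Referer}i = Referring page
%{User-Agent}i = Browser type
The following is a sample log entry (a single record, shown here wrapped onto two lines):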
202.189.63.115 - - [31/Aug/:31 +0800] "GET / HTTP/1.1" 200 1365 "-"
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/ Firefox/15.0.1"
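Setting up log rotation
Because a web server may receive a very large number of requests every day, the access log should be written to separate files so that no single file grows too large to open in an editor. For example, the configuration can start a new log file every 5 MB.
Linux server:
TransferLog "|/opt/data/HTTPServer/bin/rotatelogs /opt/data/HTTPServer/logs/access_log 5M"
Windows server:
CustomLog "|C:/data/HTTPServer/bin/rotatelogs.exe C:/data/HTTPServer/logs/access%Y_%m_%d_%H_%M_%S.log 5M" combined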
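AWK is a "pattern scanning and processing language". It lets you write short programs that read input files, sort data, process data, perform calculations on the input, and generate reports. Its name comes from the first letters of the surnames of its creators: Alfred Aho, Peter Weinberger, and Brian Kernighan.
The awk command discussed in this article mainly refers to /bin/gawk, which ships with most Linux systems and is the GNU version of the Unix awk program. The command reads and runs programs written in the AWK language. On Windows, awk can be run in the emulated environment provided by Cygwin.
Basically, awk scans its input (standard input, or one or more files) for records (that is, lines of text) that match a given pattern. Each time a match is found, it performs the associated action (for example, writing to standard output or to an external file).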
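AWK language basics
To make AWK programs easier to follow, here is an overview of the basics. An AWK program consists of one or more lines of text, and its core is a combination of a pattern and an action.
pattern { action }
The pattern is matched against every line of input. For each line that matches, awk performs the corresponding action. The action is enclosed in curly braces, which separate it from the pattern. awk scans the text sequentially, using the record separator (normally a newline) to treat each line it reads as a record, and the field separator (normally a space or a tab) to split each line into fields, which are referred to as $1, $2, … $n. $1 is the first field, $2 the second, and $n the nth; $0 is the entire record. Either the pattern or the action may be omitted. With no pattern, every line matches. With no action, awk performs {print}, which prints the entire record.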
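The three forms described above (pattern only, action only, pattern plus action) can be tried on a tiny input. This is a minimal sketch; the three-line sample and the variable names are made up for illustration and do not come from the article.

```shell
# Hypothetical three-line input; fields are separated by single spaces.
sample='alice 30 admin
bob 25 user
carol 41 admin'

# Pattern only: the default action {print} prints each matching record.
admins=$(printf '%s\n' "$sample" | awk '$3 == "admin"')

# Action only: runs for every record. $1 is the first field, NF the field count.
names=$(printf '%s\n' "$sample" | awk '{ print $1, NF }')

# Pattern and action together: print the name when the second field exceeds 28.
older=$(printf '%s\n' "$sample" | awk '$2 > 28 { print $1 }')

printf '%s\n' "$admins" "$names" "$older"
```

Only POSIX awk features are used, so the sketch should behave the same under gawk and other common awk implementations.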
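Using awk to extract information from the log
Because the access log format is fixed in the HTTP Server configuration file, awk can easily parse the log and extract the data we need.
Take the following sample log entry: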
202.189.63.115 - - [31/Aug/:31 +0800] "GET / HTTP/1.1" 200 1365 "-"
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/ Firefox/15.0.1"
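$0 is the entire record
$1 is the client IP, "202.189.63.115"
$4 is the first half of the request time, "[31/Aug/:31"
$5 is the second half of the request time, "+0800]"
and so on.
With the default field separator, we can extract the following kinds of information from the log:
awk '{print $1}' access.log       # client IP address (%h)
awk '{print $2}' access.log       # RFC 1413 identity (%l)
awk '{print $3}' access.log       # user ID (%u)
awk '{print $4,$5}' access.log    # date and time (%t)
awk '{print $7}' access.log       # requested URI
awk '{print $9}' access.log       # status code (%>s)
awk '{print $10}' access.log      # response size (%b)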
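With the default field separator alone, it is awkward to extract the request line, the referring page, and the browser type, because these values contain a variable number of spaces. Changing the field separator to the double quote (") makes them easy to read:
awk -F\" '{print $2}' access.log    # request line (%r)
awk -F\" '{print $4}' access.log    # referring page (%{Referer}i)
awk -F\" '{print $6}' access.log    # browser type (%{User-Agent}i)
Note: the backslash escapes the " so that the Unix/Linux shell does not treat it as the start of a string.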
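To see how the two separators carve up one record, the sketch below runs both on a single combined-format line. The line is hypothetical (the IP, timestamp, referrer, and agent are invented), and -F'"' is used as an equivalent way of writing -F\".

```shell
# Hypothetical combined-format record; every value in it is made up.
line='192.0.2.10 - - [31/Aug/2012:14:00:01 +0800] "GET /index.html HTTP/1.1" 200 1365 "http://www.example.com/" "Mozilla/5.0 (X11; Linux x86_64)"'

# Default separator (whitespace): client IP, URI, status code, response size.
by_space=$(printf '%s\n' "$line" | awk '{ print $1 "|" $7 "|" $9 "|" $10 }')

# Double quote as separator: request line, referring page, browser type.
by_quote=$(printf '%s\n' "$line" | awk -F'"' '{ print $2 "|" $4 "|" $6 }')

printf '%s\n' "$by_space" "$by_quote"
```

The quoted fields land on the even-numbered positions ($2, $4, $6) because the text between two consecutive quotes alternates between quoted values and the unquoted gaps around them.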
We now know the basics of awk and how it parses the log. It is time to start "adventuring" in the real world.
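awk usage scenarios
Counting browser types
To find out which types of browsers have visited the site, listed in descending order of occurrence, use the following command:
awk -F\" '{print $6}' access.log | sort | uniq -c | sort -nr
The command first extracts the browser field and pipes it into the first sort, which groups identical values so that uniq -c can count how often each browser appears. The final sort -nr orders those counts numerically, largest first.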
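The same tally can also be produced inside awk with an associative array, which avoids sorting the whole log before counting. This is a sketch of that alternative, not the article's own command; the three log records are hypothetical.

```shell
# Hypothetical combined-format records (IPs, paths, and agents are made up).
log='192.0.2.1 - - [31/Aug/2012:14:00:01 +0800] "GET / HTTP/1.1" 200 1365 "-" "Firefox/15.0"
192.0.2.2 - - [31/Aug/2012:14:00:02 +0800] "GET /a HTTP/1.1" 200 512 "-" "Chrome/21.0"
192.0.2.3 - - [31/Aug/2012:14:00:03 +0800] "GET /b HTTP/1.1" 200 734 "-" "Firefox/15.0"'

# n[agent] counts each distinct User-Agent; only the short summary is sorted.
ua_counts=$(printf '%s\n' "$log" |
  awk -F'"' '{ n[$6]++ } END { for (ua in n) print n[ua], ua }' |
  sort -nr)

printf '%s\n' "$ua_counts"
```

The trade-off is memory: the array holds one entry per distinct agent string, whereas the sort | uniq -c pipeline streams the data but has to sort every line.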
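Spotting problems in the system
The following command counts the status codes returned by the server, which helps reveal problems the system may have.
awk '{print $9}' access.log | sort | uniq -c | sort -nr
Under normal conditions, status codes 200 and 30x should be the most frequent. 40x generally indicates a problem with the client's request; 50x generally indicates a server-side problem.
Some common status codes:
200 - The request succeeded; the response headers or body it asked for are returned with this response.
206 - The server successfully processed a partial GET request.
301 - The requested resource has been permanently moved to a new location.
302 - The requested resource is temporarily served from a different URI.
400 - Bad request. The server could not understand the request.
401 - Unauthorized. The request requires user authentication.
403 - Forbidden. The server understood the request but refuses to carry it out.
404 - Not found. The resource does not exist on the server.
500 - The server encountered an unexpected condition that prevented it from fulfilling the request.
503 - The server is temporarily unable to handle the request because of maintenance or overload.
The full set of status codes is defined in the HTTP protocol specification.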
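Since the text reasons about status codes by class (200/30x versus 40x and 50x), a per-class summary can be handy. The sketch below groups on the first digit of the status code; the five log records are hypothetical.

```shell
# Hypothetical records; only the status code (field 9) matters here.
log='192.0.2.1 - - [31/Aug/2012:14:00:01 +0800] "GET / HTTP/1.1" 200 1365
192.0.2.1 - - [31/Aug/2012:14:00:02 +0800] "GET /a HTTP/1.1" 200 512
192.0.2.2 - - [31/Aug/2012:14:00:03 +0800] "GET /b HTTP/1.1" 304 -
192.0.2.3 - - [31/Aug/2012:14:00:04 +0800] "GET /c HTTP/1.1" 404 208
192.0.2.4 - - [31/Aug/2012:14:00:05 +0800] "GET /d HTTP/1.1" 500 310'

# substr($9, 1, 1) is the leading digit, so 200 and 206 both fall under "2xx".
class_counts=$(printf '%s\n' "$log" |
  awk '{ c[substr($9, 1, 1) "xx"]++ } END { for (k in c) print k, c[k] }' |
  sort)

printf '%s\n' "$class_counts"
```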
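awk command examples involving status codes:
1. Find and display all requests with status code 404
awk '($9 ~ /404/)' access.log
2. List the status code and URI of every request with status code 404, sorted
awk '($9 ~ /404/)' access.log | awk '{print $9,$7}' | sort
Now suppose a particular request (for example, URI: /path/to/notfound) is producing a large number of 404 errors. The following command shows which referring page and which browser the request came from.
awk -F\" '($2 ~ "^GET /path/to/notfound "){print $4,$6}' access.log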
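To decide which URI deserves that kind of follow-up, it helps to rank the URIs by how many 404 responses they produced. A minimal sketch, assuming the combined format and using hypothetical log records:

```shell
# Hypothetical records; /missing.html fails twice, /old.css once.
log='192.0.2.1 - - [31/Aug/2012:14:00:01 +0800] "GET /missing.html HTTP/1.1" 404 208
192.0.2.2 - - [31/Aug/2012:14:00:02 +0800] "GET /old.css HTTP/1.1" 404 208
192.0.2.3 - - [31/Aug/2012:14:00:03 +0800] "GET /missing.html HTTP/1.1" 404 208
192.0.2.4 - - [31/Aug/2012:14:00:04 +0800] "GET / HTTP/1.1" 200 1365'

# Count 404 responses per URI ($7), most frequent first.
not_found=$(printf '%s\n' "$log" |
  awk '$9 == 404 { n[$7]++ } END { for (u in n) print n[u], u }' |
  sort -nr)

printf '%s\n' "$not_found"
```

The comparison $9 == 404 is an exact numeric match, which is slightly stricter than the regular expression /404/ used in the commands above.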
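Tracking down who is hotlinking site images
System administrators sometimes find that other websites are, for whatever reason, displaying images hosted on their own site. To find out who is using your site's images without authorization, use the following command:
awk -F\" '($2 ~ /\.(jpg|gif|png)/ && $4 !~ /^http:\/\/www\.example\.com/) {print $4}' access.log \
  | sort | uniq -c | sort -n
Note: before using it, replace www.example.com with your own site's domain name.
The command works as follows:
Split each line on ";
The request line must contain ".jpg", ".gif", or ".png";
The referring page must not start with your site's domain (www.example.com in this example);
Print all such referring pages and count how many times each occurs.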
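The steps above report full referring URLs, so one offending site may show up as many separate pages. A variant that groups by referring host is sketched below; the host names and log records are hypothetical, and skipping the empty referrer ("-") is an added assumption rather than part of the original command.

```shell
# Hypothetical records: two image hits from other.example.net, one from
# third.example.org, one from the site itself, and one non-image request.
log='198.51.100.1 - - [31/Aug/2012:14:00:01 +0800] "GET /img/a.jpg HTTP/1.1" 200 900 "http://other.example.net/page1" "UA"
198.51.100.2 - - [31/Aug/2012:14:00:02 +0800] "GET /img/b.png HTTP/1.1" 200 800 "http://other.example.net/page2" "UA"
198.51.100.3 - - [31/Aug/2012:14:00:03 +0800] "GET /img/a.jpg HTTP/1.1" 200 900 "http://www.example.com/home" "UA"
198.51.100.4 - - [31/Aug/2012:14:00:04 +0800] "GET /img/c.gif HTTP/1.1" 200 700 "http://third.example.org/x" "UA"
198.51.100.5 - - [31/Aug/2012:14:00:05 +0800] "GET /index.html HTTP/1.1" 200 500 "http://other.example.net/page1" "UA"'

# split() on "/" turns "http://host/path" into "http:", "", "host", ...
hotlinkers=$(printf '%s\n' "$log" |
  awk -F'"' '
    $2 ~ /\.(jpg|gif|png)/ && $4 != "-" && $4 !~ /^https?:\/\/www\.example\.com/ {
        split($4, part, "/")
        n[part[3]]++
    }
    END { for (host in n) print n[host], host }' |
  sort -nr)

printf '%s\n' "$hotlinkers"
```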
Commands related to client IP addresses
Count how many distinct IPs have visited:
awk '{print $1}' access.log | sort | uniq | wc -l
Count how many pages each IP requested:
awk '{++S[$1]} END {for (a in S) print a,S[a]}' log_file
Sort the per-IP page counts in ascending order:
awk '{++S[$1]} END {for (a in S) print S[a],a}' log_file | sort -n
See which pages a given IP (for example, 202.106.19.100) requested:
grep '^202\.106\.19\.100 ' access.log | awk '{print $1,$7}'
Count how many IPs visited during the 14:00 hour on 31 August 2012:
awk '{print $4,$1}' access.log | grep 31/Aug/2012:14 | awk '{print $2}' | sort | uniq | \
wc -l
List the ten IP addresses with the most requests:
awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -10
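Commands related to response size
List the largest transfers:
cat access.log | awk '{print $10 " " $1 " " $4 " " $7}' | sort -nr | head -100
List the pages whose response is larger than 204800 bytes (200 KB), with the number of times each occurred:
cat access.log | awk '($10 > 204800){print $7}' | sort | uniq -c | sort -nr | head -100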
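Beyond single large responses, the total traffic per URI is often what matters. The sketch below sums the %b field per URI and skips records where %b is "-" (which the combined format writes when no body bytes were sent). The log records are hypothetical.

```shell
# Hypothetical records; the 304 response carries "-" instead of a byte count.
log='192.0.2.1 - - [31/Aug/2012:14:00:01 +0800] "GET /big.zip HTTP/1.1" 200 300000
192.0.2.2 - - [31/Aug/2012:14:00:02 +0800] "GET /big.zip HTTP/1.1" 200 250000
192.0.2.3 - - [31/Aug/2012:14:00:03 +0800] "GET /small.txt HTTP/1.1" 200 100
192.0.2.4 - - [31/Aug/2012:14:00:04 +0800] "GET /cached.css HTTP/1.1" 304 -'

# Sum bytes per URI, counting only records whose size field is numeric.
bytes_per_uri=$(printf '%s\n' "$log" |
  awk '$10 ~ /^[0-9]+$/ { bytes[$7] += $10 }
       END { for (u in bytes) print bytes[u], u }' |
  sort -nr)

printf '%s\n' "$bytes_per_uri"
```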
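Commands related to page response time
If the last column of the log records the time taken to serve the request (%T), for example with a custom log format such as:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %T" combined
the following command lists all log records whose response time exceeds 3 seconds.
awk '($NF > 3){print $0}' access.log
Note: NF is the number of fields in the current record, so $NF is the last field.
List the requests whose response time exceeds 5 seconds:
awk '($NF > 5){print $0}' access.log | awk -F\" '{print $2}' | sort |
uniq -c | sort -nr | head -20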
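A threshold shows the slow outliers; an average per URI shows which pages are slow in general. The sketch below assumes, as the text does, that %T is the last field, and uses hypothetical log records.

```shell
# Hypothetical records whose last field is the response time in seconds (%T).
log='192.0.2.1 - - [31/Aug/2012:14:00:01 +0800] "GET /slow HTTP/1.1" 200 900 "-" "UA" 6
192.0.2.2 - - [31/Aug/2012:14:00:02 +0800] "GET /fast HTTP/1.1" 200 300 "-" "UA" 1
192.0.2.3 - - [31/Aug/2012:14:00:03 +0800] "GET /slow HTTP/1.1" 200 900 "-" "UA" 4'

# Output columns: average seconds, request count, URI; slowest average first.
avg_time=$(printf '%s\n' "$log" |
  awk '{ sum[$7] += $NF; cnt[$7]++ }
       END { for (u in sum) printf "%.2f %d %s\n", sum[u] / cnt[u], cnt[u], u }' |
  sort -nr)

printf '%s\n' "$avg_time"
```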
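awk is very well suited to processing text files. In particular, its built-in arrays make it possible to correlate data across multiple lines, for example to work out the number of requests of each kind and their processing times from an application log.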
1. Log file to process, a.trc
-------------------------------------------------------------------------
request=100001,amt=30,time=00
request=100002,amt=30,time=00
response=100001,time=10
request=100003,amt=30,time=00
response=100002,time=20
request=100004,amt=30,time=00
response=100003,time=30
request=100005,amt=30,time=00
response=100004,time=40
request=100006,amt=30,time=00
response=100005,time=50
request=100007,amt=30,time=00
response=100006,time=50
request=100008,amt=30,time=00
response=100007,time=50
-------------------------------------------------------------------------
2. Commands: the same logic is implemented in three ways. Choose among them as needed by comparing memory use, processing efficiency, and flexibility.
-------------------------------------------------------------------------
awk -f a1.awk a.trc
awk -f a2.awk a.trc
awk -f a3.awk a.trc
-------------------------------------------------------------------------
3. Result:
-------------------------------------------------------------------------
reqid:100001 amt:30 bgn:00 end:10 ela:10
reqid:100002 amt:30 bgn:00 end:20 ela:20
reqid:100003 amt:30 bgn:00 end:30 ela:30
reqid:100004 amt:30 bgn:00 end:40 ela:40
reqid:100005 amt:30 bgn:00 end:50 ela:50
reqid:100006 amt:30 bgn:00 end:50 ela:50
reqid:100007 amt:30 bgn:00 end:50 ela:50
resp:7 sumamt:240
-------------------------------------------------------------------------
4.1 a1.awk
-------------------------------------------------------------------------
# Prints each request as soon as its response arrives. Requires gawk (mktime).
function myprintf(v1,v2,v3,v4){
    t1 = v3;
    t1f = substr(t1,1,4) " " substr(t1,5,2) " " substr(t1,7,2) " " substr(t1,9,2) " " substr(t1,11,2) " " substr(t1,13,2);
    t2 = v4;
    t2f = substr(t2,1,4) " " substr(t2,5,2) " " substr(t2,7,2) " " substr(t2,9,2) " " substr(t2,11,2) " " substr(t2,13,2);
    if(length(t1)>0 && length(t2)>0) {ela = mktime(t2f)-mktime(t1f);} else {ela = 0;}
    printf("reqid:%s amt:%s bgn:%s end:%s ela:%d \n",v1,v2,v3,v4,ela);
}
BEGIN {FS="="; }
/request/ {reqp++;
    vbid=$2;sub(",amt","",vbid);
    # The right-hand sides of the next two assignments are missing in the source;
    # vb (the amount) is assumed, matching the amt and sumamt values in the result.
    vb=$3;sub(",time","",vb);reqa[vbid,1]=vb;
    sumamt=sumamt+vb;
    reqa[vbid,2]=$4;
}
/response/{vb=$2;sub(",time","",vb);
    myprintf(vb,reqa[vb,1],reqa[vb,2],$3);
    # Free both entries stored for this request id.
    delete reqa[vb,1]; delete reqa[vb,2];
    resp++;
}
END{
    printf("resp:%d sumamt:%d \n",resp,sumamt);
}
-------------------------------------------------------------------------
4.2 a2.awk
-------------------------------------------------------------------------
# Stores requests by arrival order and uses reqida[] to map request id to position.
BEGIN {FS="=";}
/request/ {reqp++;
    #print($0);
    # Right-hand sides missing in the source are assumed from how the arrays are used later.
    vb=$2;sub(",amt","",vb);reqa[reqp,1]=vb;
    reqida[vb]=reqp;
    vb=$3;sub(",time","",vb);reqa[reqp,2]=vb;sumamt=sumamt+vb;
    reqa[reqp,3]=$4;
}
/response/{vb=$2;sub(",time","",vb);
    #printf("%s %s \n",vb,$0);
    vb3 = reqida[vb];
    if(vb3>0){reqa[vb3,4]=$3;}
    resp++;
}
END{
    # The loop bound is missing in the source; reqp (number of requests read) is assumed.
    for(i=1;i<=reqp;i++){
        t1 = reqa[i,3];
        t1f = substr(t1,1,4) " " substr(t1,5,2) " " substr(t1,7,2) " " substr(t1,9,2) " " substr(t1,11,2) " " substr(t1,13,2);
        t2 = reqa[i,4];
        t2f = substr(t2,1,4) " " substr(t2,5,2) " " substr(t2,7,2) " " substr(t2,9,2) " " substr(t2,11,2) " " substr(t2,13,2);
        if(length(t1)>0 && length(t2)>0) {ela = mktime(t2f)-mktime(t1f);} else {ela = 0;}
        printf("reqid:%s amt:%s bgn:%s end:%s ela:%d \n",reqa[i,1],reqa[i,2],reqa[i,3],reqa[i,4],ela);
    }
    printf("resp:%d sumamt:%d \n",resp,sumamt);
}
-------------------------------------------------------------------------
4.3 a3.awk
-------------------------------------------------------------------------
# Stores requests by arrival order and finds each response's request by linear search.
BEGIN {FS="=";}
/request/ {reqp++;
    #print($0);
    # Right-hand sides missing in the source are assumed from how the arrays are used later.
    vb=$2;sub(",amt","",vb);reqa[reqp,1]=vb;
    vb=$3;sub(",time","",vb);reqa[reqp,2]=vb;sumamt=sumamt+vb;
    reqa[reqp,3]=$4;
}
/response/{vb=$2;sub(",time","",vb);
    #printf("%s %s \n",vb,$0);
    # The loop bound is missing in the source; reqp (number of requests read) is assumed.
    for(i=1;i<=reqp;i++){
        if(reqa[i,1] == vb) {
            vb2=$3;
            reqa[i,4]=vb2;
            resp++;
        }
    }
}
END{
    for(i=1;i<=reqp;i++){
        t1 = reqa[i,3];
        t1f = substr(t1,1,4) " " substr(t1,5,2) " " substr(t1,7,2) " " substr(t1,9,2) " " substr(t1,11,2) " " substr(t1,13,2);
        t2 = reqa[i,4];
        t2f = substr(t2,1,4) " " substr(t2,5,2) " " substr(t2,7,2) " " substr(t2,9,2) " " substr(t2,11,2) " " substr(t2,13,2);
        if(length(t1)>0 && length(t2)>0) {ela = mktime(t2f)-mktime(t1f);} else {ela = 0;}
        printf("reqid:%s amt:%s bgn:%s end:%s ela:%d \n",reqa[i,1],reqa[i,2],reqa[i,3],reqa[i,4],ela);
    }
    printf("resp:%d sumamt:%d \n",resp,sumamt);
}
-------------------------------------------------------------------------
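The core idea shared by the three scripts (remember each request in an array keyed by its id, then look it up when the response arrives) can be shown in a few lines. This is a simplified sketch, not one of the author's scripts: it assumes the time values are plain numbers of seconds so that elapsed time is a subtraction, which sidesteps the gawk-only mktime(), and it splits on both "=" and "," so no sub() calls are needed. The input lines are hypothetical.

```shell
# Hypothetical trace in the same layout as a.trc, with numeric times in seconds.
trace='request=100001,amt=30,time=100
request=100002,amt=30,time=105
response=100001,time=110
response=100002,time=125
request=100003,amt=30,time=130'

# With FS set to "=" or ",", a request line splits into
# request / id / amt / amount / time / start, and a response line into
# response / id / time / end.
result=$(printf '%s\n' "$trace" | awk -F'[=,]' '
$1 == "request" { amt[$2] = $4; bgn[$2] = $6; total += $4 }
$1 == "response" && ($2 in bgn) {
    printf "reqid:%s amt:%s ela:%d\n", $2, amt[$2], $4 - bgn[$2]
    delete amt[$2]; delete bgn[$2]   # a finished request no longer needs memory
    resp++
}
END { printf "resp:%d sumamt:%d\n", resp, total }')

printf '%s\n' "$result"
```

Deleting entries once a response is matched keeps memory proportional to the number of requests still in flight, which is the trade-off the first script makes against the two that keep everything until END.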
Using awk to find the maximum value in a column
Find the lines of the file data that hold the maximum value in the second column.
Sample of the data file:
Contents of the script:
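#!/usr/bin/awk -f
# Keep every line whose second column equals the largest value seen so far.
{
    if (NR == 1 || maxnum < $2) {
        delete arr
        maxnum = $2
        arr[NR] = $0
    } else if (maxnum == $2)
        arr[NR] = $0
}
END {
    for (i in arr)
        print arr[i]
}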
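The same selection can be tried from the command line. The sketch below is a variant rather than the script above: it collects the matching lines in a string instead of an array, so they come out in input order (the order of a for-in loop over an array is unspecified). The four-line data sample is hypothetical.

```shell
# Hypothetical data file contents: a label and a numeric second column.
data='a 3
b 7
c 5
d 7'

# A new maximum replaces the collected lines; a tie appends to them.
max_lines=$(printf '%s\n' "$data" | awk '
NR == 1 || $2 > max { max = $2 + 0; out = $0; next }
$2 == max           { out = out "\n" $0 }
END                 { if (NR) print out }')

printf '%s\n' "$max_lines"
```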
I am currently dealing with log files that are approximately 5 GB in size.
I'm quite new to parsing log files and using UNIX bash, so I'll try to be as precise as possible.
While searching through log files, I do the following: provide the request number to look for, then optionally to provide the action as a secondary filter.
A typical command looks like this:
fgrep '' example.log | fgrep 'action: example'
This is fine dealing with smaller files, but with a log file that is 5gb, it's unbearably slow.
I've read online it's great to use sed or awk to improve performance (or possibly even combination of both), but I'm not sure how this is accomplished.
For example, using awk, I have a typical command:
awk '// {print}' example.log
Basically, my ultimate goal is to be able to efficiently print/return the records (or line numbers) in a log file that contain the strings to match (there could be up to 4-5, and I've read that piping is bad).
On a side note, in bash shell, if I want to use awk and do some processing, how is that achieved?
For example:
BEGIN { print "File\tOwner" }
{ print $8, "\t", \
$3}
END { print " - DONE -" }
That is a pretty simple awk script, and I would assume there's a way to put this into a one liner bash command?
But I'm not sure how the structure is.
Thanks in advance for the help.
Solution
You need to perform some tests to find out where your bottlenecks are, and how fast your various tools perform.
Try some tests like this:
time fgrep '' example.log >/dev/null
time egrep '' example.log >/dev/null
time sed -e '//!d' example.log >/dev/null
time awk '// {print}' example.log >/dev/null
Traditionally, egrep should be the fastest of the bunch (yes, faster than fgrep), but some modern implementations are adaptive and automatically switch to the most appropriate searching algorithm. If you have bmgrep (which uses the Boyer-Moore search algorithm), try that.
Generally, sed and awk will be slower because they're designed as more general-purpose text manipulation tools rather than being tuned for the specific job of searching.
But it really depends on the implementation, and the correct way to find out is to run tests.
Run them each several times so you don't get messed up by things like caching and competing processes.
As @Ron pointed out, your search process may be disk I/O bound.
If you will be searching the same log file a number of times, it may be faster to compress the log file first; this makes it faster to read off disk, but then requires more CPU time to process because it has to be decompressed first.
Try something like this:
compress -c example2.log >example2.log.Z
time zgrep '' example2.log.Z >/dev/null
gzip -c example2.log >example2.log.gz
time zgrep '' example2.log.gz >/dev/null
bzip2 -k example.log
time bzgrep '' example.log.bz2 >/dev/null
I just ran a quick test with a fairly compressible text file, and found that bzip2 compressed best, but then took far more CPU time to decompress, so the gzip option wound up being fastest overall.
Your computer will have different disk and CPU performance than mine, so your results may be different.
If you have any other compressors lying around, try them as well, and/or try different levels of gzip compression, etc.
Speaking of preprocessing: if you're searching the same log over and over, is there a way to preselect out just the log lines that you might be interested in?
If so, grep them out into a smaller (maybe compressed) file, then search that instead of the whole thing.
As with compression, you spend some extra time up front, but then each individual search runs faster.
A note about piping: other things being equal, piping a huge file through multiple commands will be slower than having a single command do all the work.
But all things are not equal here, and if using multiple commands in a pipe (which is what zgrep and bzgrep do) buys you better overall performance, go for it.
Also, consider whether you're actually passing all of the data through the entire pipe.
In the example you gave, fgrep '' example.log | fgrep 'action: example', the first fgrep will discard most of the file; the pipe and second command only have to process the small fraction of the log that contains '', so the slowdown will likely be negligible.
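If you do want a single command to apply several fixed-string filters, awk can do it in one pass. This is only a sketch: the log layout, the request id, and the variable names a and b are made up for illustration, and whether it beats the two-stage fgrep pipeline on your file is something to measure rather than assume.

```shell
# Hypothetical log lines; the layout and the request id are invented.
log='req=1001 action: start
req=1002 action: example
req=1001 action: example'

# index() does a plain substring search (no regex), like fgrep.
# A line is printed, with its line number, only if it contains both strings.
matches=$(printf '%s\n' "$log" |
  awk -v a='req=1001' -v b='action: example' \
      'index($0, a) && index($0, b) { print NR ": " $0 }')

printf '%s\n' "$matches"
```

More filters can be added by chaining further index() tests with &&.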
TL;DR: TEST ALL THE THINGS!
EDIT: if the log file is "live" (i.e. new entries are being added), but the bulk of it is static, you may be able to use a partial preprocess approach: compress (& maybe prescan) the log, then when scanning use the compressed (&/prescanned) version plus a tail of the part of the log added since you did the prescan.
Something like this:
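# Precompress:
gzip -v -c example.log >example.log.gz
# Column 2 of "gzip -l" is the uncompressed size, i.e. how much of example.log the .gz covers.
compressedsize=$(gzip -l example.log.gz | awk '{if(NR==2) print $2}')
# Search the compressed file + recent additions (tail -c +N starts at byte N, hence the +1):
{ gzip -cdfq example.log.gz; tail -c +$((compressedsize + 1)) example.log; } | egrep ''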
If you're going to be doing several related searches (e.g. a particular request, then specific actions with that request), you can save prescanned versions:
# Prescan for a particular request (repeat for each request you'll be working with):
gzip -cdfq example.log.gz | egrep '' > prescan-.log
# Search the prescanned file + recent additions (tail -c +N starts at byte N, hence the +1):
{ cat prescan-.log; tail -c +$((compressedsize + 1)) example.log | egrep ''; } | egrep 'action: example'
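On the side question of running a multi-line awk script from the shell: the whole program can be passed as one single-quoted argument, or saved to a file and run with -f. The sketch below uses the script from the question; the input line is a hypothetical eight-field listing entry, chosen so that $8 is the file name and $3 the owner.

```shell
# Hypothetical eight-field input line shaped like a long-format file listing.
listing='-rw-r--r-- 1 alice staff 120 Jan 5 notes.txt'

# Inline: the whole program is one single-quoted argument.
one_liner=$(printf '%s\n' "$listing" |
  awk 'BEGIN { print "File\tOwner" } { print $8, "\t", $3 } END { print " - DONE -" }')

# From a file: save the same program and run it with -f.
script=$(mktemp)
cat > "$script" <<'EOF'
BEGIN { print "File\tOwner" }
{ print $8, "\t", $3 }
END { print " - DONE -" }
EOF
from_file=$(printf '%s\n' "$listing" | awk -f "$script")
rm -f "$script"

printf '%s\n' "$one_liner"
```

Single quotes keep the shell from expanding $8 and $3 before awk sees them, which is the usual stumbling block when moving a script onto the command line.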