How to Happily Write a Small Parser


Source: WebSpider crawl. Time: 2017-04-18 06:10. Tags: writing ini files with iniparser

MBTileParser, a small game engine
Overview:
MBTileParser is a small game engine that loads TMX files and TexturePacker files directly into UIKit.
Xinzhiyuan (新智元) Notes: Nitpicking and Cross-Examining Li's Parser
Keywords: Chinese, parsing, robustness, hibernation, structural analysis
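Hong offers a verse: "伟爷强弩上大弓，先吃豆腐显轻松。NLP parser天天弄，小菜十碟拌饭红。" (Roughly: Master Wei strings the great bow with a strong crossbow and eats the tofu first to show how easy it is; he works on the NLP parser day after day, ten small dishes to go with the rice.)

Me: What is semantic resolution for? If it is not tied to a domain, to customer needs, to pragmatics or to an application, it is general-purpose semantic resolution, and that is a place where it is very easy to sink in and never get out. Once it is connected to an application, things are very different; many problems turn out to be far less complicated and deep than imagined.

Song: Right. General-purpose semantic resolution is a trap, but contextual constraints help. Looking at a phrase in isolation and deciding which slots (坑, syntactic or semantic) it has is hard. Context does more than fill the slots; it also constrains what kinds of slots there are.

Me: Replying to Prof. Song: the dependency labels are basically self-evident. They are just logical subject (actor), logical object (under), verbal object (cinf), predicate complement (buyu), complement (comp), relative clause (relmod), modifier (mod), adverbial (adv), appositive (equiv), topic (dummysubj), coordination (conj), prepositional object (pobj), the take-it-or-leave-it relation (dummy), serial-verb or continuation relation (nextoken), and so on.

Song: What is the "take-it-or-leave-it relation"?

Me: Just adjuncts, or little words that don't matter much. Attach them and they may be useful later; leave them unattached and drop them, and that is fine too.

Song: Thanks, Liwei. Your parser really can solve a lot of problems.

Gu: How do you represent the application context in a program? As an application-specific rule set?

Me: Parsing gives you a tree structure, and the application does pattern matching on that tree. That is far more powerful than traditional matching over a linear sequence (ngram matching, for example). It is not one-for-ten; this kind of matching is one-for-a-hundred. And it is not only a matter of quantity: patterns that linear matching cannot reach, the so-called long-distance phenomena, structure can reach. So for NLP applications, if keywords and ngrams are millet and rifles, parsing is the nuclear weapon. That is what we have been preaching all along: subtree matching is way more powerful than linear pattern matching. Most applications rest on either search or extraction. Extraction can be seen as offline search (call it structure indexing), and search is real-time extraction. Search handles information retrieval that cannot be pre-computed; extraction gives more thorough service for predefined information needs. Both can use parsing and both can use keywords, but the results and quality are worlds apart.

Gu: But the problem is that different readings of the semantics give different trees. If the tree is inaccurate, matching is hard too, isn't it?

Me: On the contrary. When the tree is inaccurate, semantic conditions in the application can compensate, because at application time you have a pragmatic setting that is already much narrowed or simplified, plus domain knowledge.

Gu: Is one key point that in a specific application you generally don't need to parse every sentence correctly?

Me: Of course. If every sentence had to be parsed right before it could be used, there would be no applications of parsing in the world. In fact, humans also run into difficulties and errors in parsing and understanding.

Gu: I'm just asking how this domain knowledge is implemented in a program. It's a fairly elementary question; I have always admired your nuclear weapon.

Me: Information extraction has been done for nearly 20 years, and over 90% of extraction applications don't even use parsing, never mind whether the parse is right. Domain knowledge helps and compensates from several angles. First, the definition of the extraction target. Extraction is filling in blanks, but before filling you must define the semantic names and relations; the data fields in a database are predefined too, and defining an extraction template as the semantic target is similar. For example, to extract conference information you would define conference name, time, venue, keynote speaker and so on; all of these are domain semantic relations. With the domain semantics defined, your target is focused, and information and sentences unrelated to those relations are ruled out. Moreover, because all these definitions revolve around one domain topic (the "conference event"), your tolerance for errors in general syntactic relations increases greatly. Second, the so-called domain ontology can also be put to use.

Gu: My understanding is that Liwei has several target syntax trees for each class of application, and when text is parsed it is pulled toward those trees, without considering analyses for non-target trees.

Me: Quite the opposite. It is not the syntax tree that leans toward the applications; it is the extraction modules directly supporting an application that lean toward the syntax tree. That is meeting all changes with constancy. The parser is independent of applications, and the core engine is not lightly changed for a specific application, because applications keep changing while the engine is stable and lightweight.

Qing: My gut feeling is that stability of results matters more than correctness.

Me: Robustness matters more than correctness. Without robustness you cannot reach the information. With robustness there is at least a path; even if the path is wrong, as long as the wrong path is consistent or predictable, it is entirely possible for extraction to get the right information from a wrong path: two negatives make a positive. Once there is a path, semantic coherence between nodes can make up for the shortcomings of syntax.

Qing: From now on each of us should restrain ourselves. Speak nicely and we can chat happily. We should consciously conform to Liwei's grammar rules. As Prof. Bai says: no quibbling, no nitpicking.

Me: Where did that come from? Nitpicking and quibbling are not a problem. As the fashionable phrase has it, 不zuo不活 (no stirring up trouble, no life). Only what withstands nitpicking and quibbling is robust and strong.

Song: Yes, I believe these strategies of yours are certainly very useful. For the 接桩 (stub-joining) problem I mentioned just now, I also thought of using a parser to choose among joining results, but I couldn't find a parser with good performance, so I could only build a learning strategy with an ngram-like method to find the best joining result, and the outcome was merely passable. Liwei, could you do this with your parser? The source text is: 西藏银行部门去年新增贷款十四点四一亿元，比上年增加八亿多元。农业生产贷款比上年新增四点三八亿元。(Tibet's banking sector added 1.441 billion yuan in new loans last year, over 800 million yuan more than the year before. Agricultural production loans rose by 438 million yuan over the previous year.) The latter two punctuation-delimited clauses both lack constituents at the front, which need to be supplied. After supplying them, the possible results for the second clause are:
比上年增加八亿多元。
西藏银行部门比上年增加八亿多元。
西藏银行部门去年比上年增加八亿多元。
西藏银行部门去年新增比上年增加八亿多元。
西藏银行部门去年新增贷款比上年增加八亿多元。
西藏银行部门去年新增贷款十四点四一亿元比上年增加八亿多元。
Use your parser to pick the best of these 6, then match that best one against the third clause with the strategy above, or match the top three candidates each against the third clause, pick the best again with your parser, and see what the final result is.

Me: Look, aren't you making things hard for me?

Gu: You said nitpicking and quibbling were welcome, so Prof. Song's problem isn't out of line.

Song: Sorry to trouble you, Liwei. Try it if you have time and interest; no matter either way.

Me: I only handle structural parsing, which gives your goal a reliable foundation. With that foundation, your own system may be able to choose the joining better. OK, I've corrected the parser's error and strengthened it; the structure should be fine now, and the rest is at the application and domain level. This should be right now; if it still isn't, I'll smash the machine: [parse diagram]

Hu: Prof. Li's parser is strong. Is there a trial version?

Me: No trial. It's for internal product use, not open to the public. There are only a handful of users (which is rather stifling, really; a great talent put to small use), though they are all big names.

Hu: A secret art.

Bai: 农业生产贷款 ("agricultural production loans"; read as a clause, "agriculture produces loans") is a typical subject-verb-object. Why isn't it analysed that way?

Me: 农业生产贷款 can't count as a typical SVO, because 农业生产 ("agricultural production") is a typical compound.

Hu: Is there manual intervention in the analysis?

Me: The analysis is fully automatic; how would one intervene? If by intervention you mean continual maintenance and improvement (fine tuning) during development, that is similar to training or retraining in machine learning. Isn't that how every automatic computer system is developed? Unless you package it as a black box now and freeze the engine.

Song: Liwei, you haven't answered my question directly. My question is: of the 6 options for completing the second clause, which should be chosen? And how should the third clause be completed?

Me: I can't answer that. Semantics isn't my line; I only know a bit of syntax. I don't think your question is within syntax; it's a post-syntactic semantic question. Syntax can provide the foundation, but the real semantic work is for your own module. I provide a fairly reliable syntactic structure; what semantic tricks are played on top is the job of the semanticist or domain specialist. Each to his own post.

Song: Liwei, I'm not asking for semantic analysis, only for using the parser to pick the best of 6 candidate sentences. Of course, this may differ from your goal. Your parser finds the best among different parses of the same sentence; I don't know whether it can be used to find the best among parses of different sentences, and so decide which sentence is most reasonable.

Me: No. Syntax does not cross sentences; it basically does not even cross clauses. Cross-sentence NLP needs another mechanism.

Bai: Then what about 压力测试环境 ("stress-test environment")? Try: 压力测试环境已经准备好了。("The stress-test environment is ready.")

Me: OK: [parse diagram]

Hu: That's fast.

Me: It has always been "line speed". Converting to a graphic just takes a bit more effort. I know what Prof. Bai has in mind.

Hu: This exceeds expectations.

Me: As I said before, the most likely reading should come out first and the rest should hibernate.

Bai: 橱柜消毒喷剂昨天买到了 ("The cabinet disinfectant spray was bought yesterday.")

Me: That one exposed a gap. 喷剂 ("spray") wasn't in the lexicon after all; I thought the lexicon was complete enough. [parse diagram] After adding it to the lexicon: [parse diagram]

Bai: 炸药投射装置在操场北侧。("The explosive-launching device is on the north side of the playground.")

Hu: This example is awesome.

Me: [parse diagram] However awesome the example, is it more awesome than my parser? Kidding...

Hu: This analyser really is strong.

Me: That reminds me of something funny.

Bai: 拿社保养老保险吗？

Me: Let me tell a story first; call it NLP lore. Nearly 30 years ago I was chatting with a senior master of the field. He was not only unfathomably learned but also especially humorous and frank, frank enough to leave me dumbfounded. After years of effort he had just formally launched a product backed by his NLP system and was feeling proud, probably in a state like mine now. He mentioned his two works, one his son and the other the system: "that's my real baby". He said the son didn't count as a real work, it took no effort at all, more of a natural disaster; the system did, that was years of painstaking labour. The comparison cracked me up and has amused me for years.

Bai: Try it: 拿社保养老保险吗？

Me: Prof. Bai seems determined not to stop today until the machine dies of anger. Here it is: [parse diagram]

Bai: 养老保险 ("pension insurance") was joined at the lexical level. Without a hibernation mechanism, 保险 ("safe, reliable") cannot serve as the predicate. It can't be rescued.

Me: Not for now. The hibernation mechanism is still under consideration... But it can still be rescued. When I get the chance I'll explain specifically how; it isn't hard. It's word-driven reparsing. If reparsing is done on its own, you could for example do reparsing just for NPs. Implementing an NP re-parser to provide more parse options is not hard; if you want to do it, you can. The issue is that until the interface at the application level has been worked out, this work is on hold. The specific strategy for waking hibernating parses I'll discuss in detail another time.

Bai: Reparsing needs no-go zones marked out. Try: 媒体挖掘真相的速度比不上谣言制造工厂的速度 ("The speed at which the media dig out the truth can't match the speed of the rumour factory.")

Me: The second half of this sentence dropped the ball: [parse diagram]

Me: A small protest. I just reread Prof. Bai's sentence. I'm a native speaker and you've confused me too. Is that second half "human speech" or "quasi-human speech"? What is 谣言制造工厂的速度 ("the speed of the rumour-manufacturing factory")? I parsed it in my head several times and still couldn't make sense of it. It's clearly a pit. If I parse to the left you can say it's wrong; if I parse to the right you can still say it's wrong. If people can't agree on it, how do you teach it to a machine? And the earlier sentence, 拿社保养老保险么: my mental parse matches the machine output exactly. Yet apparently that's wrong too. Perhaps I lack background knowledge and don't know China's social security system. I take it to mean "(Are you) using social security as your pension insurance?" Is that not right?

Bai: 谣言制造工厂 ("rumour-manufacturing factory") and 炸药投射装置 ("explosive-launching device") are completely parallel in structure, but not parallel to 媒体挖掘真相 ("the media dig out the truth"). The latter is SVO; the former two are NPs.

Me: I understand that. But you tacked on 的速度 ("'s speed") like a dog's tail on a sable coat, and my head can no longer parse it.

Bai: Can't a factory have a speed?

Me: It just feels so awkward. Generally, things don't have speed; actions do.

Bai: Besides, at this point we haven't even got to semantics; we're still doing syntactic analysis. As long as the syntax isn't awkward, it's OK.

Hu: 刘翔的速度 ("Liu Xiang's speed").

Bai: Information production, whether spreading or refuting rumours, can be talked about in terms of speed.

Hu: Still, cases like this with low semantic cohesion are hard for a machine to analyse.

Bai: And then the speed can be projected onto the producer.

Hu: I gather that @wei's analyser uses semantics in its lexicon.

Me: If Chinese parsing didn't even use lexical semantics, wouldn't the parser be unable to move an inch?

Hu: Yes.

Me: Think how many spurious parses there would be. If you examine Chinese only by POS class, it's practically an explosion, of nuclear yield, impossible to clean up. Almost any POS can combine with any other.

Hu: I sincerely think @wei's analyser is powerful. By Prof. Bai's Occam's razor, POS purists ought to be X'd. Lately I've been subconsciously thinking that the movement theory in Chomsky's minimalist program has something to it. What syntax needs to solve is semantic combination and its order, constituent sharing, and the expression of focus.

Me: Oh, I get it. 拿社保养老保险么 means 【拿社保养老】保险么 ("Is relying on social security for old age a safe bet?")

Bai: Exactly. 保险 can be a predicate; 养老保险 can't.

Me: But there is also the parse 【拿社保】【养老保险】. The structure differs, but the core meaning is the same. I think it should count as correct.

Hu: Prof. Bai's is an ambiguous sentence: one reading has 保险 as the head, the other has 拿 as the head. The two readings mean completely different things. @wei, I think your sentence has two meanings too.

Me: That's because my parser treats 【养老保险】 as the predicate and 【拿社保】 as an adverbial of means. I have seen examples before where the logical structure differs but the meaning is the same or similar. Especially English sentences with PP-attachment ambiguity: if the PP is for+NP, you can find quite a few sentences whose core meaning is the same whether the PP is analysed as a modifier of the NP or an adverbial of the predicate: last year we built 20 schools for blind children.

Bai: Taking 养老保险 as the predicate is a big stretch. Not impossible, but not to be adopted unless as a last resort.

Me: To some extent, it already is the last resort. It's like a PP serving as predicate: there's no stronger candidate for predicate in the sentence.

Bai: The reality is that without a hibernation mechanism you have to make do. With one, you wouldn't see it that way. 谣言制造工厂 has to be combined into a whole through reasonably powerful word formation before it can hope to be immune to interference from 速度.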
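The contrast drawn in the conversation between matching on a parse tree and matching on a linear token sequence can be illustrated with a small sketch. This is not the parser discussed above: the tree is built by hand, the `Node` class and both matching functions are hypothetical, and only the relation labels (actor, under, mod, pobj) are borrowed from the list given in the chat. The attachment of "for blind children" to "schools" is just one of the two readings mentioned.

```java
import java.util.ArrayList;
import java.util.List;

final class SubtreeMatchDemo {
    /** A node in a hand-built dependency tree; label is its relation to its head. */
    static final class Node {
        final String word;
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String word, String label) {
            this.word = word;
            this.label = label;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Linear matching: true only if the two words occur as adjacent tokens. */
    static boolean bigramMatch(String[] tokens, String first, String second) {
        for (int i = 0; i + 1 < tokens.length; i++) {
            if (tokens[i].equals(first) && tokens[i + 1].equals(second)) {
                return true;
            }
        }
        return false;
    }

    /** Structural matching: true if some node `head` has a direct child `dep` with the given label. */
    static boolean edgeMatch(Node node, String head, String label, String dep) {
        for (Node child : node.children) {
            if (node.word.equals(head) && child.label.equals(label) && child.word.equals(dep)) {
                return true;
            }
            if (edgeMatch(child, head, label, dep)) {
                return true;
            }
        }
        return false;
    }

    /** Hand-built tree for "we built 20 schools for blind children" (an assumed analysis). */
    static Node exampleTree() {
        Node built = new Node("built", "root");
        built.add(new Node("we", "actor"));
        Node schools = new Node("schools", "under").add(new Node("20", "mod"));
        Node children = new Node("children", "pobj").add(new Node("blind", "mod"));
        schools.add(new Node("for", "mod").add(children));
        built.add(schools);
        return built;
    }

    public static void main(String[] args) {
        String[] tokens = {"we", "built", "20", "schools", "for", "blind", "children"};
        // The numeral sits between verb and object, so the adjacent-token pattern misses it.
        System.out.println("bigram built+schools: " + bigramMatch(tokens, "built", "schools"));
        // The tree pattern still finds the verb-object relation.
        System.out.println("edge built -under-> schools: "
                + edgeMatch(exampleTree(), "built", "under", "schools"));
    }
}
```

The bigram pattern fails as soon as any modifier intervenes, while the edge pattern is unaffected by how many modifiers hang off either node, which is the "long-distance" advantage claimed in the chat.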
To republish this article, please contact the original author for permission and credit Li Wei's ScienceNet (科学网) blog.
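I came across an article on the WeChat public account 程序人生 (Programmer Life).
The author introduces several text-processing tools (lex/yacc, instaparse for Clojure, and so on).
Of these, instaparse caught my interest, for the following reasons:
The author rates it as follows: "First, instaparse, the wonder tool for Clojure. instaparse is the tool you must try if you are asked to build a parser and are free to pick the language. What takes a day with other tools, instaparse gets done in an hour." Appealing, isn't it? Simple is beautiful.
I have had little exposure to FP and wanted to use this as a chance to try it and get some practice.
Clojure runs on the JVM, so as a Java programmer I should be able to pick it up fairly quickly.
2. About functional programming
Not covered further here.
3. Setting up a Clojure development environment
Install Leiningen
Put simply, Leiningen is the Maven of the Clojure world.
lein new helloworld
This creates a new project named helloworld; its project.clj plays a role similar to the pom in Maven.
Install the Cursive plugin in IntelliJ IDEA
Download the plugin from the Cursive website, install it locally, and restart IDEA.
Clojure HelloWorld
This article does not go into the details of Clojure syntax; readers unfamiliar with Clojure can look them up elsewhere.
4. Now let's happily write a small parser (∩_∩)
4.1 A brief introduction to instaparse
From the instaparse project description: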
What if context-free grammars were as easy to use as regular expressions?
Note that it handles context-free grammars only.
The goal is for it to be as easy to use as regular expressions.
Category          Notations                              Example
Rule              : := ::= =
End of rule       ; . (optional) newline
Concatenation     whitespace or ,
Alternation       |
One or more       +
Zero or more      * {}
String terminal   "" ''                                  'a' "a"
Regex terminal    #"" #''                                #'a' #"a"
Epsilon           Epsilon epsilon EPSILON eps ε "" ''    S = 'a' S | Epsilon
Comment           (* *)                                  (* This is a comment *)
4.2 A brief introduction to context-free grammars
If you have forgotten them, please review on your own.
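As a quick refresher, here is a minimal sketch of what a context-free rule buys you over a classic regular expression: arbitrarily deep nesting. It is written in plain Java as a hand-rolled recursive-descent recognizer, not with instaparse, and the grammar `S = '(' S ')' S | epsilon` (balanced parentheses) and the class name `BalancedParens` are chosen here purely for illustration.

```java
final class BalancedParens {
    private final String input;
    private int pos = 0;

    private BalancedParens(String input) {
        this.input = input;
    }

    // Implements the rule  S = '(' S ')' S | epsilon
    private boolean parseS() {
        if (pos < input.length() && input.charAt(pos) == '(') {
            pos++;                                   // consume '('
            if (!parseS()) {
                return false;                        // nested S
            }
            if (pos >= input.length() || input.charAt(pos) != ')') {
                return false;                        // missing ')'
            }
            pos++;                                   // consume ')'
            return parseS();                         // trailing S
        }
        return true;                                 // epsilon alternative
    }

    /** True if the whole string is derivable from S. */
    static boolean accepts(String s) {
        BalancedParens parser = new BalancedParens(s);
        return parser.parseS() && parser.pos == s.length();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "()", "(()())", "(()", ")("}) {
            System.out.println("\"" + s + "\" -> " + accepts(s));
        }
    }
}
```

Each method corresponds to one nonterminal and each branch to one alternative, which is the same correspondence a grammar-driven tool automates.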
4.4 do it!
The htmlparser encoding problem
From the ITeye technical site
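When htmlparser extracts content from a website, it sometimes produces garbled text or fails to switch encodings. This is a small bug in htmlparser, which, as an open-source project, has not been updated for a long time. A typical error is:
org.htmlparser.util.EncodingChangeException: character mismatch (new: 中 [0x4e2d] != old: [0xd6?]) for encoding change from ISO-8859-1 to GB2312 at character offset 23
In other cases the page simply comes out garbled.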
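The mismatch reported in that message can be reproduced without htmlparser. The sketch below is only an illustration using the JDK's own charset support: the two bytes are the GB2312 encoding of 中, and the class and method names are made up for this example. It assumes the running JDK ships the GB2312 charset, which is checked at run time.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingMismatchDemo {
    /** The two bytes that encode the character 中 (U+4E2D) in GB2312. */
    static final byte[] ZHONG_GB2312 = {(byte) 0xD6, (byte) 0xD0};

    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Decoded as ISO-8859-1, every byte becomes its own character.
        String asLatin1 = decode(ZHONG_GB2312, StandardCharsets.ISO_8859_1);
        System.out.printf("ISO-8859-1: %d chars, first = 0x%x%n",
                asLatin1.length(), (int) asLatin1.charAt(0));

        // Decoded as GB2312 (if this JDK provides it), the same bytes form one character.
        if (Charset.isSupported("GB2312")) {
            String asGb2312 = decode(ZHONG_GB2312, Charset.forName("GB2312"));
            System.out.printf("GB2312:     %d chars, first = 0x%x%n",
                    asGb2312.length(), (int) asGb2312.charAt(0));
        }
    }
}
```

The first character is 0xd6 under one decoding and 0x4e2d under the other, which is exactly the pair of values the exception message compares.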
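To avoid the problem altogether, we can modify two classes in the htmlparser source: Page and InputStreamSource, both in package org.htmlparser.lexer. We also use CodepageDetectorProxy (which analyses the page's byte stream to determine its encoding) to work out the page encoding in advance.
The encoding is normally set in htmlparser like this:
Parser parser = new Parser(url);
parser.setEncoding("bianma"); // "bianma" stands for the charset name
But this approach has a flaw.
How htmlparser determines the encoding: it compares the charset given in the response headers returned by the server with the charset in the page's meta tag. If the headers do not name a charset, htmlparser falls back to ISO-8859-1 and then compares that with the charset in the meta tag; when the two disagree it tries to switch encodings, which is where the error above arises.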
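The header side of that comparison amounts to pulling the `charset` parameter out of a Content-Type value and falling back to ISO-8859-1 when none is present. The following is a standalone sketch of that step, not htmlparser's code; the class `ContentTypeCharset` and its method are hypothetical names used for illustration.

```java
import java.util.Locale;

final class ContentTypeCharset {
    static final String DEFAULT = "ISO-8859-1";

    /** Returns the charset parameter of a Content-Type value, or ISO-8859-1 if absent. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return DEFAULT;
        }
        // Header values are ASCII in practice, so lower-casing keeps indexes aligned.
        int idx = contentType.toLowerCase(Locale.ROOT).indexOf("charset");
        if (idx < 0) {
            return DEFAULT;
        }
        String rest = contentType.substring(idx + "charset".length()).trim();
        if (!rest.startsWith("=")) {
            return DEFAULT;
        }
        rest = rest.substring(1).trim();
        int semi = rest.indexOf(';');
        if (semi >= 0) {
            rest = rest.substring(0, semi).trim();   // drop any following parameters
        }
        // Strip one pair of matching single or double quotes.
        if (rest.length() > 1) {
            char first = rest.charAt(0);
            char last = rest.charAt(rest.length() - 1);
            if ((first == '"' || first == '\'') && first == last) {
                rest = rest.substring(1, rest.length() - 1);
            }
        }
        return rest.isEmpty() ? DEFAULT : rest;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=GB2312"));
        System.out.println(charsetOf("text/html"));
    }
}
```

A server that sends plain `text/html` therefore yields ISO-8859-1 here, even when the page itself declares GB2312 in its meta tag, and that disagreement is the starting point of the problem.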
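The idea for the fix: use CodepageDetectorProxy.jar to analyse the page and obtain its encoding, then make htmlparser use that detected encoding wherever it would otherwise fall back to the default for a missing server charset. The encoding htmlparser starts with should then agree with the one in the meta tag.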
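Reduced to its decision rule, the idea is: trust the header charset unless it is missing or is the ISO-8859-1 default, in which case prefer the detected one. The sketch below shows only that rule. The detection step itself (cpdetector) is not called; its result is passed in as a parameter, and the class and method names are hypothetical.

```java
final class CharsetChoice {
    static final String DEFAULT = "ISO-8859-1";

    /**
     * Picks the charset to decode a page with.
     *
     * @param headerCharset   charset from the Content-Type header, or null if none was sent
     * @param detectedCharset charset guessed from the page bytes, or null if detection failed
     */
    static String effectiveCharset(String headerCharset, String detectedCharset) {
        boolean headerIsDefault =
                headerCharset == null || headerCharset.equalsIgnoreCase(DEFAULT);
        if (headerIsDefault && detectedCharset != null) {
            return detectedCharset;              // override the uninformative default
        }
        return headerCharset != null ? headerCharset : DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(effectiveCharset(null, "GB2312"));
        System.out.println(effectiveCharset("utf-8", "GB2312"));
    }
}
```

One consequence of this rule is that a page genuinely served as ISO-8859-1 will also be overridden whenever detection returns something else, so the approach leans on the detector being right.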
The complete modification is as follows:
import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
import info.monitorenter.cpdetector.io.ParsingDetector;
import java.net.MalformedURLException;
import java.net.URL;
import org.htmlparser.lexer.Page;

public class WebEncoding {
    // Detects the encoding of the page at `path` with cpdetector and stores it
    // in Page.GaoBinDEFAULT_CHARSET, which the modified Page class reads later.
    public String AnalyEnconding(String path) {
        URL url = null;
        try {
            url = new URL(path);
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
        CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
        detector.add(new ParsingDetector(false));
        java.nio.charset.Charset charset = null;
        try {
            charset = detector.detectCodepage(url);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        // Null check added: detection may have failed above. Anything other than UTF-8 is treated as gb2312.
        if (charset != null && charset.name().equalsIgnoreCase("utf-8")) {
            Page.GaoBinDEFAULT_CHARSET = "utf-8";
        } else {
            Page.GaoBinDEFAULT_CHARSET = "gb2312";
        }
        return Page.getGaoBinDEFAULT_CHARSET();
    }
}
package org.htmlparser.lexer;

import java.io.*;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.net.*;
import java.util.zip.*;
import org.htmlparser.http.ConnectionManager;
import org.htmlparser.util.ParserException;

// Referenced classes of package org.htmlparser.lexer:
//   InputStreamSource, PageIndex, StringSource, Cursor,
//   Stream, Source
public class Page
implements Serializable
public Page() {
    this("");
}

public Page(URLConnection connection) throws ParserException {
    if (null == connection)
        throw new IllegalArgumentException("connection cannot be null");
    setConnection(connection);
    mBaseUrl = null;
}

public Page(InputStream stream, String charset) throws UnsupportedEncodingException {
    if (null == stream)
        throw new IllegalArgumentException("stream cannot be null");
    if (null == charset)
        charset = "ISO-8859-1";
    mSource = new InputStreamSource(stream, charset);
    mIndex = new PageIndex(this);
    mConnection = null;
    mUrl = null;
    mBaseUrl = null;
}

public Page(String text, String charset) {
    if (null == text)
        throw new IllegalArgumentException("text cannot be null");
    if (null == charset)
        charset = "ISO-8859-1";
    mSource = new StringSource(text, charset);
    mIndex = new PageIndex(this);
    mConnection = null;
    mUrl = null;
    mBaseUrl = null;
}

public Page(String text) {
    this(text, null);
}

public Page(Source source) {
    if (null == source)
        throw new IllegalArgumentException("source cannot be null");
    mSource = source;
    mIndex = new PageIndex(this);
    mConnection = null;
    mUrl = null;
    mBaseUrl = null;
}

public static ConnectionManager getConnectionManager() {
    return mConnectionManager;
}

public static void setConnectionManager(ConnectionManager manager) {
    mConnectionManager = manager;
}
// Extracts the charset parameter from a Content-Type value; falls back to the
// source's current encoding (or ISO-8859-1 when there is no source yet).
public String getCharset(String content) {
    String CHARSET_STRING = "charset";
    String ret;
    if (null == mSource)
        ret = "ISO-8859-1";
    else
        ret = mSource.getEncoding();
    if (null != content) {
        int index = content.indexOf(CHARSET_STRING);
        if (index != -1) {
            content = content.substring(index + CHARSET_STRING.length()).trim();
            if (content.startsWith("=")) {
                content = content.substring(1).trim();
                index = content.indexOf(";");
                if (index != -1)
                    content = content.substring(0, index);
                // strip surrounding double quotes, then single quotes
                if (content.startsWith("\"") && content.endsWith("\"") && 1 < content.length())
                    content = content.substring(1, content.length() - 1);
                if (content.startsWith("'") && content.endsWith("'") && 1 < content.length())
                    content = content.substring(1, content.length() - 1);
                ret = findCharset(content, ret);
            }
        }
    }
    return ret;
}

// Resolves a charset name to its canonical form via reflection on java.nio.charset.Charset.
public static String findCharset(String name, String fallback) {
    String ret;
    try {
        Class cls = Class.forName("java.nio.charset.Charset");
        Method method = cls.getMethod("forName", new Class[] {
            java.lang.String.class
        });
        Object object = method.invoke(null, new Object[] {
            name
        });
        method = cls.getMethod("name", new Class[0]);
        object = method.invoke(object, new Object[0]);
        ret = (String) object;
    } catch (ClassNotFoundException cnfe) {
        ret = name; // reflection unavailable: assume the name is correct
    } catch (NoSuchMethodException nsme) {
        ret = name;
    } catch (IllegalAccessException ia) {
        ret = name;
    } catch (InvocationTargetException ita) {
        // illegal or unsupported charset name: use the fallback
        ret = fallback;
        System.out.println("unable to determine cannonical charset name for " + name + " - using " + fallback);
    }
    return ret;
}
private void writeObject(ObjectOutputStream out)
throws IOException
if(null != getConnection())
out.writeBoolean(true);
out.writeInt(mSource.offset());
String href = getUrl();
out.writeObject(href);
setUrl(getConnection().getURL().toExternalForm());
Source source = getSource();
PageIndex index = mI
out.defaultWriteObject();
out.writeBoolean(false);
String href = getUrl();
out.writeObject(href);
setUrl(null);
out.defaultWriteObject();
setUrl(href);
private void readObject(ObjectInputStream in)
throws IOException, ClassNotFoundException
boolean fromurl = in.readBoolean();
if(fromurl)
int offset = in.readInt();
String href = (String)in.readObject();
in.defaultReadObject();
if(null != getUrl())
URL url = new URL(getUrl());
setConnection(url.openConnection());
catch(ParserException pe)
throw new IOException(pe.getMessage());
Cursor cursor = new Cursor(this, 0);
for(int i = 0; i & i++)
getCharacter(cursor);
catch(ParserException pe)
throw new IOException(pe.getMessage());
setUrl(href);
String href = (String)in.readObject();
in.defaultReadObject();
setUrl(href);
public void reset() {
    getSource().reset();
    mIndex = new PageIndex(this);
}

public void close() throws IOException {
    if (null != getSource())
        getSource().destroy();
}

protected void finalize() throws Throwable {
    close();
}
public URLConnection getConnection() {
    return mConnection;
}

public void setConnection(URLConnection connection) throws ParserException {
    Stream stream;

    mConnection = connection;
    // Added: limit both the connect and the read phase to 6 seconds.
    mConnection.setConnectTimeout(6000);
    mConnection.setReadTimeout(6000);
    try {
        getConnection().connect();
    } catch (UnknownHostException uhe) {
        throw new ParserException("Connect to " + mConnection.getURL().toExternalForm() + " failed.", uhe);
    } catch (IOException ioe) {
        throw new ParserException("Exception connecting to " + mConnection.getURL().toExternalForm() + " (" + ioe.getMessage() + ").", ioe);
    }
    String type = getContentType();
    String charset = getCharset(type);
    try {
        String contentEncoding = connection.getContentEncoding();
        System.out.println("contentEncoding=" + contentEncoding);
        if (null != contentEncoding && -1 != contentEncoding.indexOf("gzip"))
            stream = new Stream(new GZIPInputStream(getConnection().getInputStream()));
        else if (null != contentEncoding && -1 != contentEncoding.indexOf("deflate"))
            stream = new Stream(new InflaterInputStream(getConnection().getInputStream(), new Inflater(true)));
        else
            stream = new Stream(getConnection().getInputStream());
        try {
            /*
             * Reason: when getCharset(type) returns ISO-8859-1 (the fallback used when
             * the server names no charset), substitute the charset detected beforehand.
             * The null check is an added guard for the case where no detection was run.
             */
            if (charset.indexOf("ISO-8859-1") != -1 && null != getGaoBinDEFAULT_CHARSET()) {
                charset = getGaoBinDEFAULT_CHARSET();
            }
            mSource = new InputStreamSource(stream, charset);
        } catch (UnsupportedEncodingException uee) {
            charset = "ISO-8859-1";
            mSource = new InputStreamSource(stream, charset);
        }
    } catch (IOException ioe) {
        throw new ParserException("Exception getting input stream from " + mConnection.getURL().toExternalForm() + " (" + ioe.getMessage() + ").", ioe);
    }
    mUrl = connection.getURL().toExternalForm();
    mIndex = new PageIndex(this);
}
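public String getUrl() { return mUrl; }

public void setUrl(String url) { mUrl = url; }

public String getBaseUrl() { return mBaseUrl; }

public void setBaseUrl(String url) { mBaseUrl = url; }

public Source getSource() { return mSource; }

// Returns the Content-Type header of the connection, or "text/html" if unavailable.
public String getContentType() {
    String ret = "text/html";
    URLConnection connection = getConnection();
    if (null != connection) {
        String content = connection.getHeaderField("Content-Type");
        if (null != content)
            ret = content;
    }
    return ret;
}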
public char getCharacter(Cursor cursor)
throws ParserException
int i = cursor.getPosition();
int offset = mSource.offset();
if(offset == i)
i = mSource.read();
if(-1 == i)
ret = '\uFFFF';
ret = (char)i;
cursor.advance();
catch(IOException ioe)
throw new ParserException("problem reading a character at position " + cursor.getPosition(), ioe);
if(offset & i)
ret = mSource.getCharacter(i);
catch(IOException ioe)
throw new ParserException("can't read a character at position " + i, ioe);
cursor.advance();
throw new ParserException("attempt to read future characters from source " + i + " & " + mSource.offset());
if('\r' == ret)
ret = '\n';
if(mSource.offset() == cursor.getPosition())
i = mSource.read();
if(-1 != i)
if('\n' == (char)i)
cursor.advance();
mSource.unread();
catch(IOException ioe)
throw new ParserException("can't unread a character at position " + cursor.getPosition(), ioe);
catch(IOException ioe)
throw new ParserException("problem reading a character at position " + cursor.getPosition(), ioe);
if('\n' == mSource.getCharacter(cursor.getPosition()))
cursor.advance();
catch(IOException ioe)
throw new ParserException("can't read a character at position " + cursor.getPosition(), ioe);
if('\n' == ret)
mIndex.add(cursor);
public void ungetCharacter(Cursor cursor)
throws ParserException
cursor.retreat();
int i = cursor.getPosition();
char ch = mSource.getCharacter(i);
if('\n' == ch && 0 != i)
ch = mSource.getCharacter(i - 1);
if('\r' == ch)
cursor.retreat();
catch(IOException ioe)
throw new ParserException("can't read a character at position " + cursor.getPosition(), ioe);
public String getEncoding() {
    return getSource().getEncoding();
}

public void setEncoding(String character_set) throws ParserException {
    // Added: remember the requested charset as the fallback as well.
    Page.GaoBinDEFAULT_CHARSET = character_set;
    getSource().setEncoding(character_set);
}
public URL constructUrl(String link, String base)
throws MalformedURLException
return constructUrl(link, base, false);
public URL constructUrl(String link, String base, boolean strict)
throws MalformedURLException
if(!strict && '?' == link.charAt(0))
if(-1 != (index = base.lastIndexOf('?')))
base = base.substring(0, index);
url = new URL(base + link);
url = new URL(new URL(base), link);
String path = url.getFile();
boolean modified =
boolean absolute = link.startsWith("/");
if(!absolute)
if(!path.startsWith("/."))
if(path.startsWith("/../"))
path = path.substring(3);
modified =
if(!path.startsWith("/./") && !path.startsWith("/."))
path = path.substring(2);
modified =
} while(true);
while(-1 != (index = path.indexOf("/\\")))
path = path.substring(0, index + 1) + path.substring(index + 2);
modified =
if(modified)
url = new URL(url, path);
public String getAbsoluteURL(String link)
return getAbsoluteURL(link, false);
public String getAbsoluteURL(String link, boolean strict)
if(null == link || "".equals(link))
String base = getBaseUrl();
if(null == base)
base = getUrl();
if(null == base)
URL url = constructUrl(link, base, strict);
ret = url.toExternalForm();
catch(MalformedURLException murle)
public int row(Cursor cursor)
return mIndex.row(cursor);
public int row(int position)
return mIndex.row(position);
public int column(Cursor cursor)
return mIndex.column(cursor);
public int column(int position)
return mIndex.column(position);
public String getText(int start, int end)
throws IllegalArgumentException
ret = mSource.getString(start, end - start);
catch(IOException ioe)
throw new IllegalArgumentException("can't get the " + (end - start) + "characters at position " + start + " - " + ioe.getMessage());
public void getText(StringBuffer buffer, int start, int end)
throws IllegalArgumentException
if(mSource.offset() & start || mSource.offset() & end)
throw new IllegalArgumentException("attempt to extract future characters from source" + start + "|" + end + " & " + mSource.offset());
if(end & start)
length = end -
mSource.getCharacters(buffer, start, length);
catch(IOException ioe)
throw new IllegalArgumentException("can't get the " + (end - start) + "characters at position " + start + " - " + ioe.getMessage());
public String getText()
return getText(0, mSource.offset());
public void getText(StringBuffer buffer)
getText(buffer, 0, mSource.offset());
public void getText(char array[], int offset, int start, int end)
throws IllegalArgumentException
if(mSource.offset() & start || mSource.offset() & end)
throw new IllegalArgumentException("attempt to extract future characters from source");
if(end & start)
length = end -
mSource.getCharacters(array, offset, start, end);
catch(IOException ioe)
throw new IllegalArgumentException("can't get the " + (end - start) + "characters at position " + start + " - " + ioe.getMessage());
public String getLine(Cursor cursor)
int line = row(cursor);
int size = mIndex.size();
if(line & size)
start = mIndex.elementAt(line);
if(++line &= size)
end = mIndex.elementAt(line);
end = mSource.offset();
start = mIndex.elementAt(line - 1);
end = mSource.offset();
return getText(start, end);
public String getLine(int position)
return getLine(new Cursor(this, position));
public String toString()
if(mSource.offset() & 0)
StringBuffer buffer = new StringBuffer(43);
int start = mSource.offset() - 40;
if(0 & start)
start = 0;
buffer.append("...");
getText(buffer, start, mSource.offset());
ret = buffer.toString();
ret = super.toString();
public static final String DEFAULT_CHARSET = "ISO-8859-1";
// Added: charset to use instead of ISO-8859-1 when the server names none.
public static String GaoBinDEFAULT_CHARSET;
public static final String DEFAULT_CONTENT_TYPE = "text/html";
public static final char EOF = 65535;
protected String mUrl;
protected String mBaseUrl;
protected Source mSource;
protected PageIndex mIndex;
protected transient URLConnection mConnection;
protected static ConnectionManager mConnectionManager = new ConnectionManager();

public static String getGaoBinDEFAULT_CHARSET() {
    return GaoBinDEFAULT_CHARSET;
}

public static void setGaoBinDEFAULT_CHARSET(String gaoBinDEFAULT_CHARSET) {
    GaoBinDEFAULT_CHARSET = gaoBinDEFAULT_CHARSET;
}
package org.htmlparser.lexer;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UnsupportedEncodingException;
import org.htmlparser.util.EncodingChangeException;
import org.htmlparser.util.ParserException;
public class InputStreamSource
* An initial buffer size.
* Has a default value of {16384}.
public static int BUFFER_SIZE = 16384;
* The stream of bytes.
* Set to &code&null&/code& when the source is closed.
protected transient InputStream mS
* The character set in use.
protected String mE
* The converter from bytes to characters.
protected transient InputStreamReader mR
* The characters read so far.
protected char[] mB
* The number of valid bytes in the buffer.
protected int mL
* The offset of the next byte returned by read().
protected int mO
* The bookmark.
protected int mM
* Create a source of characters using the default character set.
* @param stream The stream of bytes to use.
* @exception UnsupportedEncodingException If the default character set
* is unsupported.
public InputStreamSource (InputStream stream)
UnsupportedEncodingException
this (stream, null, BUFFER_SIZE);
public InputStreamSource (InputStream stream, String charset)
UnsupportedEncodingException
this (stream, charset, BUFFER_SIZE);
* Create a source of characters.
* @param stream The stream of bytes to use.
* @param charset The character set used in encoding the stream.
* @param size The initial character buffer size.
* @exception UnsupportedEncodingException If the character set
* is unsupported.
public InputStreamSource (InputStream stream, String charset, int size)
UnsupportedEncodingException
if (null == stream)
stream = new Stream (null);
// bug #1044707 mark()/reset() issues
if (!stream.markSupported ())
// wrap the stream so we can reset
stream = new Stream (stream);
if (null == charset)
mReader = new InputStreamReader (stream);
mEncoding = mReader.getEncoding ();
mEncoding =
mReader = new InputStreamReader (stream, charset);
mBuffer = new char[size];
mLevel = 0;
mOffset = 0;
mMark = -1;
* Serialization support.
* @param out Where to write this object.
* @exception IOException If serialization has a problem.
private void writeObject (ObjectOutputStream out)
IOException
if (null != mStream)
// remember the offset, drain the input stream, restore the offset
offset = mO
buffer = new char[4096];
while (EOF != read (buffer))
out.defaultWriteObject ();
* Deserialization support.
* @param in Where to read this object from.
* @exception IOException If deserialization has a problem.
private void readObject (ObjectInputStream in)
IOException,
ClassNotFoundException
in.defaultReadObject ();
if (null != mBuffer) // buffer is null when destroy's been called
// pretend we're open, mStream goes null when exhausted
mStream = new ByteArrayInputStream (new byte[0]);
* Get the input stream being used.
* @return The current input stream.
public InputStream getStream ()
return (mStream);
* Get the encoding being used to convert characters.
* @return The current encoding.
public String getEncoding ()
return (mEncoding);
/**
 * Begins reading from the source with the given character set.
 * If the current encoding is the same as the requested encoding,
 * this method is a no-op. Otherwise any subsequent characters read from
 * this page will have been decoded using the given character set.<p>
 * Some magic happens here to obtain this result if characters have already
 * been consumed from this source.
 * Since a Reader cannot be dynamically altered to use a different character
 * set, the underlying stream is reset, a new Source is constructed
 * and a comparison made of the characters read so far with the newly
 * read characters up to the current position.
 * If a difference is encountered, or some other problem occurs,
 * an exception is thrown.
 * @param character_set The character set to use to convert bytes into
 * characters.
 * @exception ParserException If a character mismatch occurs between
 * characters already provided and those that would have been returned
 * had the new character set been in effect from the beginning. An
 * exception is also thrown if the underlying stream won't put up with
 * these shenanigans.
 */
public void setEncoding (String character_set)
    throws
        ParserException
{
    String encoding;
    InputStream stream;
    char[] buffer;
    int offset;
    char[] new_chars;

    encoding = getEncoding ();
    // Added by the fix: once an encoding is in place, keep it. The request then
    // equals the current encoding and the re-read and comparison below are skipped.
    if(encoding!=null){
        character_set=encoding;
    }
    if (!encoding.equalsIgnoreCase (character_set))
    {
        stream = getStream ();
        try
        {
            buffer = mBuffer;
            offset = mOffset;
            stream.reset ();
            try
            {
                mEncoding = character_set;
                mReader = new InputStreamReader (stream, character_set);
                mBuffer = new char[mBuffer.length];
                mLevel = 0;
                mOffset = 0;
                mMark = -1;
                if (0 != offset)
                {
                    new_chars = new char[offset];
                    if (offset != read (new_chars))
                        throw new ParserException ("reset stream failed");
                    for (int i = 0; i < offset; i++)
                        if (new_chars[i] != buffer[i])
                            throw new EncodingChangeException ("character mismatch (new: "
                                + new_chars[i]
                                + " [0x"
                                + Integer.toString (new_chars[i], 16)
                                + "] != old: "
                                + " [0x"
                                + Integer.toString (buffer[i], 16)
                                + buffer[i]
                                + "]) for encoding change from "
                                + encoding
                                + " to "
                                + character_set
                                + " at character offset "
                                + i);
                }
            }
            catch (IOException ioe)
            {
                throw new ParserException (ioe.getMessage (), ioe);
            }
        }
        catch (IOException ioe)
        {   // bug #1044707 mark()/reset() issues
            throw new ParserException ("Stream reset failed ("
                + ioe.getMessage ()
                + "), try wrapping it with a org.htmlparser.lexer.Stream",
                ioe);
        }
    }
}
* Fetch more characters from the underlying reader.
* Has no effect if the underlying reader has been drained.
* @param min The minimum to read.
* @exception IOException If the underlying reader read() throws one.
protected void fill (int min)
IOException
if (null != mReader) // mReader goes null when it's been sucked dry
size = mBuffer.length - mL // available space
if (size & min) // oops, better get some buffer space
// unknown length... keep doubling
size = mBuffer.length * 2;
read = mLevel +
if (size & read) // or satisfy min, whichever is greater
min = size - mL // read the max
buffer = new char[size];
buffer = mB
// read into the end of the 'new' buffer
read = mReader.read (buffer, mLevel, min);
if (EOF == read)
mReader.close ();
if (mBuffer != buffer)
// copy the bytes previously read
System.arraycopy (mBuffer, 0, buffer, 0, mLevel);
// todo, should repeat on read shorter than original min
* Does nothing.
* It's supposed to close the source, but use destroy() instead.
* @exception IOException &em&not used&/em&
* @see #destroy
public void close () throws IOException
* Read a single character.
* This method will block until a character is available,
* an I/O error occurs, or the end of the stream is reached.
* @return The character read, as an integer in the range 0 to 65535
* (&tt&0x00-0xffff&/tt&), or {@link #EOF EOF} if the end of the stream has
* been reached
* @exception IOException If an I/O error occurs.
public int read () throws IOException
if (mLevel - mOffset & 1)
if (null == mStream)
throw new IOException ("source is closed");
if (mOffset &= mLevel)
ret = EOF;
ret = mBuffer[mOffset++];
ret = mBuffer[mOffset++];
return (ret);
* Read characters into a portion of an array.
This method will block
* until some input is available, an I/O error occurs, or the end of the
* stream is reached.
* @param cbuf Destination buffer
* @param off Offset at which to start storing characters
* @param len Maximum number of characters to read
* @return The number of characters read, or {@link #EOF EOF} if the end of
* the stream has been reached
* @exception IOException If an I/O error occurs.
public int read (char[] cbuf, int off, int len) throws IOException
if (null == mStream)
throw new IOException ("source is closed");
if ((null == cbuf) || (0 & off) || (0 & len))
throw new IOException ("illegal argument read ("
+ ((null == cbuf) ? "null" : "cbuf")
+ ", " + off + ", " + len + ")");
if (mLevel - mOffset & len)
fill (len - (mLevel - mOffset)); // minimum to satisfy this request
if (mOffset &= mLevel)
ret = EOF;
ret = Math.min (mLevel - mOffset, len);
System.arraycopy (mBuffer, mOffset, cbuf, off, ret);
mOffset +=
return (ret);
* Read characters into an array.
* This method will block until some input is available, an I/O error occurs,
* or the end of the stream is reached.
* @param cbuf Destination buffer.
* @return The number of characters read, or {@link #EOF EOF} if the end of
* the stream has been reached.
* @exception IOException If an I/O error occurs.
public int read (char[] cbuf) throws IOException
return (read (cbuf, 0, cbuf.length));
* Reset the source.
* Repositions the read point to begin at zero.
* @exception IllegalStateException If the source has been closed.
public void reset ()
IllegalStateException
if (null == mStream)
throw new IllegalStateException ("source is closed");
if (-1 != mMark)
mOffset = mM
mOffset = 0;
* Tell whether this source supports the mark() operation.
* @return &code&true&/code&.
public boolean markSupported ()
return (true);
* Mark the present position in the source.
* Subsequent calls to {@link #reset()}
* will attempt to reposition the source to this point.
readAheadLimit &em&Not used.&/em&
* @exception IOException If the source is closed.
public void mark (int readAheadLimit) throws IOException
if (null == mStream)
throw new IOException ("source is closed");
mMark = mO
/**
 * Tell whether this source is ready to be read.
 * @return <code>true</code> if the next read() is guaranteed not to block
 * for input, <code>false</code> otherwise.
 * Note that returning false does not guarantee that the next read will block.
 * @exception IOException If the source is closed.
 */
public boolean ready () throws IOException
{
    if (null == mStream)
        throw new IOException ("source is closed");
    return (mOffset < mLevel); // buffered characters remain unread
}
/**
 * Skip characters.
 * This method will block until some characters are available,
 * an I/O error occurs, or the end of the stream is reached.
 * <em>Note: n is treated as an int</em>
 * @param n The number of characters to skip.
 * @return The number of characters actually skipped
 * @exception IllegalArgumentException If <code>n</code> is negative.
 * @exception IOException If an I/O error occurs.
 */
public long skip (long n)
    throws
        IOException,
        IllegalArgumentException
{
    long ret;

    if (null == mStream)
        throw new IOException ("source is closed");
    if (0 > n)
        throw new IllegalArgumentException ("cannot skip backwards");
    else
    {
        if (mLevel - mOffset < n)
            fill ((int)(n - (mLevel - mOffset))); // minimum to satisfy this request
        if (mOffset >= mLevel)
            ret = EOF;
        else
        {
            ret = Math.min (mLevel - mOffset, n);
            mOffset += ret;
        }
    }

    return (ret);
}
/**
 * Undo the read of a single character.
 * @exception IOException If the source is closed or no characters have
 * been read.
 */
public void unread () throws IOException
{
    if (null == mStream)
        throw new IOException ("source is closed");
    if (0 < mOffset)
        mOffset--;
    else
        throw new IOException ("can't unread no characters");
}
/**
 * Retrieve a character again.
 * @param offset The offset of the character.
 * @return The character at <code>offset</code>.
 * @exception IOException If the offset is beyond {@link #offset()} or the
 * source is closed.
 */
public char getCharacter (int offset) throws IOException
{
    char ret;

    if (null == mStream)
        throw new IOException ("source is closed");
    if (offset >= mBuffer.length)
        throw new IOException ("illegal read ahead");
    else
        ret = mBuffer[offset];

    return (ret);
}
/**
 * Retrieve characters again.
 * @param array The array of characters.
 * @param offset The starting position in the array where characters are to be placed.
 * @param start The starting position, zero based.
 * @param end The ending position
 * (exclusive, i.e. the character at the ending position is not included),
 * zero based.
 * @exception IOException If the start or end is beyond {@link #offset()}
 * or the source is closed.
 */
public void getCharacters (char[] array, int offset, int start, int end) throws IOException
{
    if (null == mStream)
        throw new IOException ("source is closed");
    System.arraycopy (mBuffer, start, array, offset, end - start);
}
/**
 * Retrieve a string.
 * @param offset The offset of the first character.
 * @param length The number of characters to retrieve.
 * @return A string containing the <code>length</code> characters at <code>offset</code>.
 * @exception IOException If the offset or (offset + length) is beyond
 * {@link #offset()} or the source is closed.
 */
public String getString (int offset, int length) throws IOException
{
    String ret;

    if (null == mStream)
        throw new IOException ("source is closed");
    if (offset + length > mBuffer.length)
        throw new IOException ("illegal read ahead");
    else
        ret = new String (mBuffer, offset, length);

    return (ret);
}
/**
 * Append characters already read into a <code>StringBuffer</code>.
 * @param buffer The buffer to append to.
 * @param offset The offset of the first character.
 * @param length The number of characters to retrieve.
 * @exception IOException If the offset or (offset + length) is beyond
 * {@link #offset()} or the source is closed.
 */
public void getCharacters (StringBuffer buffer, int offset, int length) throws IOException
{
    if (null == mStream)
        throw new IOException ("source is closed");
    buffer.append (mBuffer, offset, length);
}
/**
 * Close the source.
 * Once a source has been closed, further {@link #read() read},
 * {@link #ready ready}, {@link #mark mark}, {@link #reset reset},
 * {@link #skip skip}, {@link #unread unread},
 * {@link #getCharacter getCharacter} or {@link #getString getString}
 * invocations will throw an IOException.
 * Closing a previously-closed source, however, has no effect.
 * @exception IOException If an I/O error occurs
 */
public void destroy () throws IOException
{
    mStream = null; // every other method treats a null stream as "closed"
    if (null != mReader)
        mReader.close ();
    mLevel = 0;
    mOffset = 0;
    mMark = -1;
}
/**
 * Get the position (in characters).
 * @return The number of characters that have already been read, or
 * {@link #EOF EOF} if the source is closed.
 */
public int offset ()
{
    int ret;
    if (null == mStream)
        ret = EOF;
    else
        ret = mOffset;
    return (ret);
}
/**
 * Get the number of available characters.
 * @return The number of characters that can be read without blocking or
 * zero if the source is closed.
 */
public int available ()
{
    int ret;
    if (null == mStream)
        ret = 0;
    else
        ret = mLevel - mOffset;
    return (ret);
}
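
The listing above keeps every character it has read in one growing buffer, so that `mark`, `reset`, `unread` and `getString` only move or inspect an offset. The following is a minimal, self-contained sketch of that idea, not the original class: `MiniSource` and all of its members are hypothetical names chosen for illustration, it wraps `IOException` in `UncheckedIOException` to keep the example short, and its `getString` checks against the number of buffered characters rather than the buffer capacity.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Hypothetical miniature of a buffered, re-readable character source. */
final class MiniSource {
    static final int EOF = -1;

    private final Reader reader;          // underlying input
    private char[] buffer = new char[4];  // every character read so far is kept here
    private int level = 0;                // number of valid characters in buffer
    private int offset = 0;               // index of the next character to hand out
    private int mark = -1;                // -1 means no mark has been set

    MiniSource(Reader reader) {
        this.reader = reader;
    }

    /** Try to pull more characters from the reader, growing the buffer when it is full. */
    private void fill() {
        try {
            if (level == buffer.length)
                buffer = Arrays.copyOf(buffer, buffer.length * 2);
            int n = reader.read(buffer, level, buffer.length - level);
            if (n > 0)
                level += n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Return the next character, or EOF when the input is exhausted. */
    int read() {
        if (offset >= level)
            fill();
        return (offset < level) ? buffer[offset++] : EOF;
    }

    /** Step back one character; only the offset moves, the buffer is untouched. */
    void unread() {
        if (offset <= 0)
            throw new IllegalStateException("can't unread no characters");
        offset--;
    }

    void mark() {
        mark = offset;
    }

    /** Go back to the mark, or to the start if no mark was set. */
    void reset() {
        offset = (mark != -1) ? mark : 0;
    }

    /** Re-read characters that are already buffered. */
    String getString(int start, int length) {
        if (start + length > level)
            throw new IllegalStateException("illegal read ahead");
        return new String(buffer, start, length);
    }

    int offset() {
        return offset;
    }

    public static void main(String[] args) {
        MiniSource s = new MiniSource(new StringReader("<b>hi"));
        s.read();                               // consumes '<'
        s.mark();                               // remember position 1
        s.read();
        s.read();                               // consumes 'b' and '>'
        s.reset();                              // back to position 1
        System.out.println((char) s.read());    // prints b
        s.unread();
        System.out.println(s.getString(0, 3));  // prints <b>
    }
}
```

Because nothing is ever discarded from the buffer, a lexer built on such a source can backtrack to any earlier position at the cost of holding the whole input in memory.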