How to Happily Write a Small Parser

MBTileParser, a small game engine
Overview:
MBTileParser is a small game engine that supports loading TMX files and TexturePacker files directly into UIKit.
New Zhiyuan Notes: Nitpicking and Grilling Li's Parser (《新智元笔记:找茬拷问立氏parser》)
Keywords: Chinese, parsing, robustness, dormancy (休眠), structural analysis
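Hong has a verse: "Wei draws his great bow with a mighty crossbow, and starts on the tofu to show how easy it is. He tinkers with his NLP parser day after day: ten small side dishes to go with the rice."

Me: What is semantic resolution for? If it is not tied to a domain, to customer needs, to pragmatics, or to an application, it is general-purpose semantic resolution, and that is a place where it is very easy to get stuck and never come out. Once it is connected to an application, things are very different; many problems turn out to be far less complicated and deep than imagined.

Song: Right. General-purpose semantic resolution is a trap, though contextual constraints help. Looking at a phrase in isolation and deciding which slots (坑, syntactic or semantic) it has is hard. Context does not only fill slots; it also constrains the types of slots.

Me: To answer Prof. Song: the dependency labels are basically self-evident. They are logical subject (actor), logical object (under), verbal object (cinf), predicate complement (buyu), complement (comp), relative clause (relmod), modifier (mod), adverbial (adv), appositive (equiv), topic (dummysubj), coordination (conj), prepositional object (pobj), the take-it-or-leave-it relation (dummy), the serial-verb or continuation relation (nextoken), and so on.

Song: What is a "take-it-or-leave-it relation"?

Me: Adjuncts, or small words that do not matter much. Attach them and they may come in handy later; drop them and nothing much is lost.

Song: Thanks, Liwei. Your parser really can solve a lot of problems.

Gu: How do you represent the application context in a program? As application-specific rule sets?

Me: Parsing provides a tree structure, and the application does pattern matching on that tree. This is far more powerful than traditional matching on a linear sequence (ngram matching, for example). One tree pattern does the work of not ten but a hundred linear patterns. And it is not only a matter of quantity: patterns that linear matching cannot reach, the so-called long-distance phenomena, can be reached through structure. So for NLP applications, compared with the millet-and-rifles of keywords and ngrams, parsing is a nuclear weapon. This is what we have been preaching all along: subtree matching is way more powerful than linear pattern matching. The foundation of the vast majority of applications is either search or extraction. Extraction can be seen as offline search (call it structure indexing), and search is real-time extraction. Search handles information retrieval that cannot be pre-computed; extraction gives more thorough service to predefined information needs. Both can use parsing and both can use keywords, but the results and quality are worlds apart.

Gu: But the problem is that different understandings of the semantics produce different trees. If the tree is inaccurate, matching will be hard too, won't it?

Me: On the contrary. When the tree is inaccurate, the semantic conditions in the application can compensate, because at application time you have a greatly narrowed or simplified pragmatic setting, plus domain knowledge.

Gu: Is one key point that in a specific application you generally do not need every sentence to parse correctly?

Me: Of course. If every sentence had to parse correctly before it could be used, there would be no parsing applications in the world. In fact, people also run into difficulties and errors in parsing and understanding.

Gu: What I am asking is how this domain knowledge is implemented in the program. It is a fairly elementary question; I have always admired your nuclear weapon.

Me: Information extraction has been done for nearly 20 years, and more than 90% of extraction applications do not use parsing at all, never mind whether the parse is right. Domain knowledge helps and compensates from several angles. The first is the definition of the extraction target. Extraction is filling in blanks, but before filling them you must define the semantic names and relations. The data fields in a database are predefined too, and defining an extraction template as the semantic target is similar. For example, to extract conference information you define conference name, conference time, conference venue, speaker and so on, all of which are semantic relations of the domain. Once the domain semantics are defined, your target is in focus, and information and sentences unrelated to these relations are ruled out. Moreover, because all these definitions revolve around one domain topic (the "conference event"), your tolerance for errors in general syntactic relations increases greatly. The second is that the so-called domain ontology can also be put to use.

Gu: My understanding is that for each class of application Liwei has several target syntax trees, and when text is parsed it is pulled toward those trees, with no need to consider non-target analyses.

Me: On the contrary. It is not the syntax tree that leans toward different applications; it is the extraction module directly supporting the application that leans toward the syntax tree. This is meeting all changes with constancy (以不变应万变). The parser is independent of the application, and the core engine is not lightly changed for a specific application, because applications keep changing while the engine stays stable and lightweight.

Qing: My personal intuition is that stability of results matters more than correctness.

Me: Robustness matters more than correctness. Without robustness you cannot reach the information. With robustness there is at least a path, and even if the path is wrong, as long as the wrong paths are consistent or predictable, it is entirely possible to extract correct information from a wrong path: two wrongs make a right. Because there is a path, semantic coherence between the nodes can make up for the shortcomings of syntax.

Qing: From now on each of us should restrain ourselves. Speak properly and we can chat happily. We should consciously conform to Liwei's grammar rules. As Prof. Bai says: no quibbling, no nitpicking.

Me: Where did that come from? Nitpicking and quibbling are not a problem. As the trendy saying goes, no zuo, no life. Only what withstands nitpicking and quibbling can be robust and strong.

Song: Right, I believe these strategies of yours are very useful. In the "stub attachment" (接桩) problem I just mentioned, I once wanted to use a parser to select the attachment result, but I could not find a parser that performed well, so I could only build a learning strategy with an ngram-like method to find the best attachment, and the results are merely passable. Liwei, could you use your parser to do the following? The original text is: 西藏银行部门去年新增贷款十四点四一亿元,比上年增加八亿多元。农业生产贷款比上年新增四点三八亿元。 The last two punctuation-delimited clauses both lack leading constituents, which need to be supplied. After supplying them, the possible results for the second clause are:
比上年增加八亿多元。
西藏银行部门比上年增加八亿多元。
西藏银行部门去年比上年增加八亿多元。
西藏银行部门去年新增比上年增加八亿多元。
西藏银行部门去年新增贷款比上年增加八亿多元。
西藏银行部门去年新增贷款十四点四一亿元比上年增加八亿多元。
Use your parser to pick the best of these six, then match that best one against the third clause with the strategy above, or match each of the top three against the third clause, pick the best again with your parser, and see what the final result is.

Me: Look, aren't you putting me on the spot?

Gu: You said nitpicking and quibbling were allowed, so Prof. Song's question is not excessive.

Song: Sorry to trouble you, Liwei. Try it if you have the time and interest; it does not matter either way.

Me: I only do structural parsing, which gives your goal a reliable foundation. With that foundation, your own system may be able to choose the attachment better. All right, I have fixed the parser's error and strengthened it. The structure should be fine now; everything else is at the application and domain level. This should be right now. If it is still wrong, I will smash the machine:

Hu: Prof. Li's parser is strong. Is there a trial version?

Me: No trial version. It is used in internal products and is not public. The users can be counted on one hand (which is rather stifling, a big talent put to small use), although all of them are big clients.

Hu: A secret art.

Bai: 农业生产贷款 is a typical subject-verb-object. Why was it not chosen?

Me: 农业生产贷款 cannot count as a typical SVO, because 农业生产 is a typical compound.

Hu: Is there manual intervention in the analysis?

Me: The analysis is fully automatic; how could anyone intervene? If intervention means the continual maintenance and fine tuning during development, that is similar to training or retraining in machine learning. Isn't every automatic system developed that way? Unless you package it as a black box right now and seal the engine.

Song: Liwei, you did not answer my question directly. Which of the six ways of completing the second clause should be chosen? And how should the third clause be completed?

Me: I cannot answer that. I am no good at semantics; I only know a little syntax. I do not think your question falls within syntax. It is a post-syntactic semantic problem. Syntax can provide a foundation, but the real semantic work has to be done by your own module. I provide a fairly reliable syntactic structure; how to play semantic tricks on top of it is the job of semanticists or domain specialists. Each to his own duty.

Song: Liwei, I am not asking for semantic analysis, only for the parser to pick the best of six candidate sentences. Of course, this may differ from your goal. Your parser finds the best among different parses of the same sentence. I wonder whether it can be used to find the best among parses of different sentences, and so decide which sentence is the most reasonable.

Me: No. Syntax does not cross sentences, and basically does not even cross clauses. Cross-sentence NLP needs a different mechanism.

Bai: Then how about 压力测试环境? Try: 压力测试环境已经准备好了。

Me: OK: [parse tree image]

Hu: That was fast.

Me: It runs at "line speed" to begin with. Only the conversion to graphics takes a little effort. I know what Prof. Bai has in mind.

Hu: This exceeds expectations.

Me: As I said before, the most probable reading should come out first, and the rest should lie dormant (休眠).

Bai: 橱柜消毒喷剂昨天买到了

Me: This one exposed a gap. 喷剂 (spray) turns out not to be in the lexicon. I had thought the lexicon was complete enough. [parse tree image] Now it has been added to the lexicon: [parse tree image]

Bai: 炸药投射装置在操场北侧。

Hu: This example is awesome.

Me: [parse tree image] However awesome the example is, is it more awesome than my parser? kidding ......

Hu: This analyzer really is strong.

Me: That reminds me of something funny.

Bai: 拿社保养老保险吗?

Me: Let me tell a story first; call it an NLP anecdote. Nearly 30 years ago I was chatting with a senior guru in the field. He was not only bottomlessly learned but also remarkably humorous and blunt, blunt enough to leave me dumbfounded. After years of effort he had just formally launched a product built on his NLP system and was feeling proud, probably in a state much like mine now. He mentioned his two creations, one being his son and the other the system: "that's my real baby". He said the son did not count as a real creation, since it had taken no effort at all and was more of a natural disaster. The system counted; that was years of blood and sweat. The comparison cracked me up and has amused me for years.

Bai: Try it: 拿社保养老保险吗?

Me: Prof. Bai seems determined today not to stop until the machine dies of rage. Here it is: [parse tree image]

Bai: 养老保险 is merged at the lexical level. Without a dormancy mechanism, 保险 has no way to become the predicate. It cannot be revived.

Me: Not at the moment. The dormancy mechanism is still under consideration ....... But it can still be revived. When I have the chance I will explain how; it is not hard. It is word-driven reparsing. Reparsing can be done as a separate step; for instance, one could build reparsing for NPs only. Implementing an NP re-parser to provide more parse options is not difficult; it can be done whenever we want. The issue is that this work is on hold until the interface at the application level has been thought through. I will find time later to discuss the specific strategies for waking dormant readings.

Bai: Reparsing needs off-limits zones to be marked out. Try: 媒体挖掘真相的速度比不上谣言制造工厂的速度

Me: The second half of this sentence dropped the ball: [parse tree image]

Me: A small protest. I just reread Prof. Bai's sentence. I am a native speaker and you have confused me too. Is the second half "human language" or "quasi-human language"? What is 谣言制造工厂的速度 supposed to mean? I parsed it in my head several times and still could not make sense of it. It is obviously a trap: if I parse to the left you can say it is wrong, and if I parse to the right you can still say it is wrong. If people cannot agree on it, how can we teach it to a machine? As for the earlier sentence 拿社保养老保险么, the parse in my head is exactly the same as the machine's output, and now that sounds wrong as well. Maybe I lack the background knowledge and do not understand the social security system in China. I take it to mean "(Are you) using social security as your pension insurance?" Is that not right?

Bai: 谣言制造工厂 and 炸药投射装置 are completely parallel in structure, but neither is parallel to 媒体挖掘真相. The latter is SVO; the first two are both NPs.

Me: I understand that. But you tacked 的速度 onto the end, and then my head could no longer parse it.

Bai: Can a factory not have a speed?

Me: It feels so awkward. Generally speaking, things do not have speed; actions do.

Bai: Besides, semantics has not even come into it yet; this is all still syntactic analysis. If the syntax is not awkward, it is OK.

Hu: Liu Xiang's speed.

Bai: Information production, whether spreading rumors or refuting them, can be discussed in terms of speed.

Hu: Still, combinations with such weak semantic cohesion are genuinely hard for a machine to analyze.

Bai: And the speed can then be projected onto the producer.

Hu: My analysis is that @wei's analyzer uses semantics in its lexicon.

Me: If Chinese parsing did not even use lexical semantics, wouldn't the parser be unable to take a single step?

Hu: Yes.

Me: Think how many pseudo-parses there would be. If you look at Chinese only in terms of POS classes, it is practically an explosion, of nuclear yield, impossible to contain. Almost any POS can combine with any other.

Hu: I truly think @wei's analyzer is powerful. By Prof. Bai's Occam's razor, POS purists ought to be X-ed out. Lately I have subconsciously come to feel that movement theory in Chomsky's minimalist theory makes some sense. What syntax has to solve is semantic combination and its order, constituent sharing, and the expression of focus.

Me: Oh, now I see. 拿社保养老保险么 means 【拿社保养老】保险么? ("Is relying on social security for retirement safe?")

Bai: Exactly. 保险 can be a predicate; 养老保险 cannot.

Me: But consider the parse 【拿社保】【养老保险】. The structure is different, yet the core meaning is the same. I think it should count as correct.

Hu: Prof. Bai's sentence is ambiguous: one reading has 保险 as the head, the other has 拿 as the head, and the two readings mean completely different things. @wei, I think your reading also has two meanings.

Me: That is because my parser treats 【养老保险】 as the predicate, with 【拿社保】 as an adverbial of means. I have seen other examples where the logical structure differs but the meaning is the same or similar. This is especially true of PP-attachment ambiguity in English: when the PP is for+NP, you can find quite a few sentences whose core meaning is the same whether the PP is analyzed as a modifier of the NP or as an adverbial of the predicate: last year we built 20 schools for blind children.

Bai: Taking 养老保险 as the predicate is a big departure from the norm. It is not impossible, but it should only be adopted as a last resort.

Me: To some extent it already is the last resort. It is like a PP serving as the predicate: there is no stronger predicate candidate in the sentence.

Bai: The reality is that without a dormancy mechanism we have to make do. With a dormancy mechanism one would not see it this way. 谣言制造工厂 can only hope to be combined into a single unit, undisturbed by 速度, through relatively powerful word-formation rules.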
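
The contrast drawn in the discussion between linear matching and matching on a dependency tree can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Node` class, the hand-built sample tree, and the `extract` pattern are hypothetical and are not the parser discussed above. Only the label names (actor, under, adv, mod) are taken from the list in the conversation.

```java
import java.util.ArrayList;
import java.util.List;

final class TreeMatchDemo {
    // Hypothetical dependency node: a word plus labeled children.
    static final class Node {
        final String word;
        final List<String> labels = new ArrayList<>();
        final List<Node> children = new ArrayList<>();

        Node(String word) { this.word = word; }

        Node add(String label, Node child) {
            labels.add(label);
            children.add(child);
            return this;
        }

        // Word of the first child attached with the given label, or null.
        String childWord(String label) {
            for (int i = 0; i < labels.size(); i++) {
                if (labels.get(i).equals(label)) return children.get(i).word;
            }
            return null;
        }
    }

    // Token sequence in which an adverbial separates the actor from the verb.
    static final String[] SAMPLE_TOKENS = {"bank", "last", "year", "issued", "loans"};

    // Hand-built tree for the same tokens (an assumed parse, not parser output).
    static Node sampleTree() {
        return new Node("issued")
                .add("actor", new Node("bank"))
                .add("adv", new Node("year").add("mod", new Node("last")))
                .add("under", new Node("loans"));
    }

    // Structural pattern: the given verb with both an actor and a logical object.
    static String extract(Node node, String verb) {
        if (node.word.equals(verb)) {
            String actor = node.childWord("actor");
            String object = node.childWord("under");
            if (actor != null && object != null) return actor + " -> " + object;
        }
        for (Node child : node.children) {
            String hit = extract(child, verb);
            if (hit != null) return hit;
        }
        return null;
    }

    // Linear pattern: the actor must immediately precede the verb (a bigram).
    static boolean bigramMatch(String[] tokens, String actor, String verb) {
        for (int i = 0; i + 1 < tokens.length; i++) {
            if (tokens[i].equals(actor) && tokens[i + 1].equals(verb)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(bigramMatch(SAMPLE_TOKENS, "bank", "issued")); // false
        System.out.println(extract(sampleTree(), "issued"));             // bank -> loans
    }
}
```

The bigram pattern misses the actor because two tokens intervene, while the tree pattern finds it regardless of distance, which is the "long-distance" point made in the conversation.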
To repost this article, please contact the original author for permission and state that it comes from Li Wei's ScienceNet blog.
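I read an article on the WeChat public account 程序人生 (programmer_life).
The author introduced several text-processing tools (lex/yacc, Clojure's wonder tool instaparse, ...).
Of these, instaparse caught my interest for the following reasons:
The author rated it as follows: "First, Clojure's wonder tool instaparse. instaparse is the kind of tool you simply must try if you are asked to build a parser and the language is not prescribed. What other tools achieve in a day, instaparse gets done in an hour." Tempting, isn't it? Simple is beautiful.
I have had little exposure to FP, and this is a chance to try it and get some practice.
Clojure runs on the JVM(TM), so as a Java programmer I should be able to pick it up fairly quickly.
2. About functional programming
I will not go into more detail here.
3. Setting up a Clojure development environment
Install Leiningen
Put simply, Leiningen is the Maven of the Clojure world.
lein new helloworld
This creates a new project named helloworld, in which project.clj plays the role of Maven's pom.
Install the Cursive plugin in IntelliJ IDEA
Download the plugin from the Cursive website, install it locally, and restart IDEA.
Clojure HelloWorld
This article does not discuss the details of Clojure syntax. Readers who are not familiar with Clojure can look up the relevant details separately.
4. Now let's happily write a small parser (∩_∩)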
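4.1 Introduction to instaparse
The instaparse project describes itself as follows:
What if context-free grammars were as easy to use as regular expressions?
It only handles context-free grammars.
The goal of the project is to be as simple to use as regular expressions.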
Rule              : := ::= =
End of rule       ; . (optional) newline
Concatenation     whitespace or ,
Alternation       |
One or more       +
Zero or more      *
String terminal   'a' "a"
Regex terminal    #'a' #"a"
Epsilon           Epsilon epsilon EPSILON eps ε "" ''     e.g. S = 'a' S | Epsilon
Comment           (* This is a comment *)
4.2 Introduction to context-free grammars
Readers who have forgotten the topic should review it on their own.
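
As a quick refresher, here is a minimal sketch of a recognizer for one context-free grammar that a regular expression cannot express: balanced parentheses, S = '(' S ')' S | Epsilon. It is a hand-written recursive-descent recognizer in Java, not instaparse and not the approach of the original article. The class name `BalancedParens` and the choice of grammar are assumptions made for illustration.

```java
final class BalancedParens {
    private final String input;
    private int pos = 0; // index of the next unread character

    private BalancedParens(String input) {
        this.input = input;
    }

    // One method per nonterminal: S = '(' S ')' S | Epsilon
    private boolean parseS() {
        if (pos < input.length() && input.charAt(pos) == '(') {
            pos++;                                   // consume '('
            if (!parseS()) return false;             // nested S
            if (pos >= input.length() || input.charAt(pos) != ')') return false;
            pos++;                                   // consume ')'
            return parseS();                         // trailing S
        }
        return true;                                 // Epsilon: consume nothing
    }

    // The input is accepted only if S matches and nothing is left over.
    static boolean accepts(String s) {
        BalancedParens parser = new BalancedParens(s);
        return parser.parseS() && parser.pos == s.length();
    }

    public static void main(String[] args) {
        System.out.println(accepts("(()())")); // true
        System.out.println(accepts("(()"));    // false
    }
}
```

Each grammar rule becomes one method and each alternative one branch, which is the bookkeeping that a grammar-driven tool such as instaparse is meant to take off your hands.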
4.4 do it!
htmlparser encoding problems (ITeye)
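When htmlparser extracts content from a website, the text is sometimes garbled, or the encoding cannot be converted. This is a small bug in htmlparser, which as an open-source project has not been updated for a long time. A typical error is:
org.htmlparser.util.EncodingChangeException: character mismatch (new: 中 [0x4e2d] != old:  [0xd6Ö]) for encoding change from ISO-8859-1 to GB2312 at character offset 23
In other cases the page simply comes out garbled.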
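
The two characters named in this message can be reproduced with a minimal sketch: the same two bytes decode to different characters under the two charsets, so text already read as ISO-8859-1 no longer matches once the page is re-read as GB2312. The class `DecodeMismatchDemo` and its helper are hypothetical names for illustration. The sketch assumes a JDK that ships the GB2312 charset, which standard full JDK distributions do.

```java
import java.nio.charset.Charset;

final class DecodeMismatchDemo {
    // Decode the bytes with the named charset and return the first character.
    static char firstChar(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName)).charAt(0);
    }

    public static void main(String[] args) {
        // The two bytes that encode the character 中 in GB2312.
        byte[] bytes = {(byte) 0xD6, (byte) 0xD0};
        // ISO-8859-1 maps every byte to one character, so the first is U+00D6.
        System.out.printf("ISO-8859-1: 0x%x%n", (int) firstChar(bytes, "ISO-8859-1"));
        // GB2312 consumes both bytes and yields U+4E2D.
        System.out.printf("GB2312:     0x%x%n", (int) firstChar(bytes, "GB2312"));
    }
}
```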
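To avoid these problems completely, we can modify two classes in the htmlparser source: the Page class and the InputStreamSource class in the org.htmlparser.lexer package. In addition, we use CodepageDetectorProxy (which analyzes the page encoding from the binary stream) to work out the page encoding in advance.
In htmlparser the encoding is normally set like this, where "bianma" stands for the encoding name:
    Parser parser = new Parser(url);
    parser.setEncoding("bianma");
But this approach has a flaw.
How htmlparser determines the encoding: it compares the encoding in the header returned by the server with the encoding in the page's meta tag. If the server's header carries no encoding, htmlparser assumes ISO-8859-1 by default and compares that with the charset in the meta tag.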
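
The header side of that comparison can be sketched as follows. This is a simplified standalone illustration, not htmlparser's own code: the class `ContentTypeCharset` and the method `charsetOf` are hypothetical names, and only `java.nio.charset.Charset.forName` is a real JDK API. It shows how a missing or unknown charset in the Content-Type header ends up as the fallback value.

```java
import java.nio.charset.Charset;
import java.util.Locale;

final class ContentTypeCharset {
    // Extract the charset parameter of a Content-Type value, or return the fallback.
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) return fallback;
        int index = contentType.toLowerCase(Locale.ROOT).indexOf("charset");
        if (index == -1) return fallback;            // header names no charset
        String value = contentType.substring(index + "charset".length()).trim();
        if (!value.startsWith("=")) return fallback;
        value = value.substring(1).trim();
        int semicolon = value.indexOf(';');
        if (semicolon != -1) value = value.substring(0, semicolon).trim();
        // Strip one pair of surrounding quotes, if present.
        boolean doubleQuoted = value.startsWith("\"") && value.endsWith("\"");
        boolean singleQuoted = value.startsWith("'") && value.endsWith("'");
        if (value.length() > 1 && (doubleQuoted || singleQuoted)) {
            value = value.substring(1, value.length() - 1);
        }
        try {
            return Charset.forName(value).name();    // canonical name, e.g. UTF-8
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=utf-8", "ISO-8859-1"));
        System.out.println(charsetOf("text/html", "ISO-8859-1"));
    }
}
```

When the fallback is ISO-8859-1 and the page's meta tag later names a different charset, the re-decoding comparison is what raises the mismatch error.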
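The idea for the improvement: use CodepageDetectorProxy.jar to analyze the page and obtain its encoding, then set htmlparser's default for the server-returned encoding to the encoding that CodepageDetectorProxy detected. That way this encoding and the meta tag's encoding always agree. The code is as follows.
The complete modification is as follows: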
import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
import info.monitorenter.cpdetector.io.ParsingDetector;
import java.net.MalformedURLException;
import java.net.URL;
import org.htmlparser.lexer.Page;

public class WebEncoding {
    // Detect the page encoding and store it as Page's default charset.
    public String AnalyEnconding(String path) {
        URL url = null;
        try {
            url = new URL(path);
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
        CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
        detector.add(new ParsingDetector(false));
        java.nio.charset.Charset charset = null;
        try {
            charset = detector.detectCodepage(url);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        // A failed detection leaves charset null, so guard against it
        // and fall through to gb2312.
        if (charset != null && charset.name().equalsIgnoreCase("utf-8")) {
            Page.GaoBinDEFAULT_CHARSET = "utf-8";
        } else {
            Page.GaoBinDEFAULT_CHARSET = "gb2312";
        }
        return Page.getGaoBinDEFAULT_CHARSET();
    }
}
package org.htmlparser.lexer;

import java.io.*;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.net.*;
import java.util.zip.*;
import org.htmlparser.http.ConnectionManager;
import org.htmlparser.util.ParserException;

// Referenced classes of package org.htmlparser.lexer:
//     InputStreamSource, PageIndex, StringSource, Cursor,
//     Stream, Source

public class Page
    implements Serializable
public Page()
public Page(URLConnection connection)
throws ParserException
if(null == connection)
throw new IllegalArgumentException("connection cannot be null");
setConnection(connection);
mBaseUrl =
public Page(InputStream stream, String charset)
throws UnsupportedEncodingException
if(null == stream)
throw new IllegalArgumentException("stream cannot be null");
if(null == charset)
charset = "ISO-8859-1";
mSource = new InputStreamSource(stream, charset);
mIndex = new PageIndex(this);
mConnection =
mBaseUrl =
public Page(String text, String charset)
if(null == text)
throw new IllegalArgumentException("text cannot be null");
if(null == charset)
charset = "ISO-8859-1";
mSource = new StringSource(text, charset);
mIndex = new PageIndex(this);
mConnection =
mBaseUrl =
public Page(String text)
this(text, null);
public Page(Source source)
if(null == source)
throw new IllegalArgumentException("source cannot be null");
mIndex = new PageIndex(this);
mConnection =
mBaseUrl =
public static ConnectionManager getConnectionManager()
return mConnectionM
public static void setConnectionManager(ConnectionManager manager)
mConnectionManager =
public String getCharset(String content)
String CHARSET_STRING = "charset";
if(null == mSource)
ret = "ISO-8859-1";
ret = mSource.getEncoding();
if(null != content)
int index = content.indexOf("charset");
if(index != -1)
content = content.substring(index + "charset".length()).trim();
if(content.startsWith("="))
content = content.substring(1).trim();
index = content.indexOf(";");
if(index != -1)
content = content.substring(0, index);
if(content.startsWith("\"") && content.endsWith("\"") && 1 & content.length())
content = content.substring(1, content.length() - 1);
if(content.startsWith("'") && content.endsWith("'") && 1 & content.length())
content = content.substring(1, content.length() - 1);
ret = findCharset(content, ret);
public static String findCharset(String name, String fallback)
Class cls = Class.forName("java.nio.charset.Charset");
Method method = cls.getMethod("forName", new Class[] {
java.lang.String.class
Object object = method.invoke(null, new Object[] {
method = cls.getMethod("name", new Class[0]);
object = method.invoke(object, new Object[0]);
ret = (String)
catch(ClassNotFoundException cnfe)
catch(NoSuchMethodException nsme)
catch(IllegalAccessException ia)
catch(InvocationTargetException ita)
System.out.println("unable to determine cannonical charset name for " + name + " - using " + fallback);
private void writeObject(ObjectOutputStream out)
throws IOException
if(null != getConnection())
out.writeBoolean(true);
out.writeInt(mSource.offset());
String href = getUrl();
out.writeObject(href);
setUrl(getConnection().getURL().toExternalForm());
Source source = getSource();
PageIndex index = mI
out.defaultWriteObject();
out.writeBoolean(false);
String href = getUrl();
out.writeObject(href);
setUrl(null);
out.defaultWriteObject();
setUrl(href);
private void readObject(ObjectInputStream in)
throws IOException, ClassNotFoundException
boolean fromurl = in.readBoolean();
if(fromurl)
int offset = in.readInt();
String href = (String)in.readObject();
in.defaultReadObject();
if(null != getUrl())
URL url = new URL(getUrl());
setConnection(url.openConnection());
catch(ParserException pe)
throw new IOException(pe.getMessage());
Cursor cursor = new Cursor(this, 0);
for(int i = 0; i & i++)
getCharacter(cursor);
catch(ParserException pe)
throw new IOException(pe.getMessage());
setUrl(href);
String href = (String)in.readObject();
in.defaultReadObject();
setUrl(href);
public void reset()
getSource().reset();
mIndex = new PageIndex(this);
public void close()
throws IOException
if(null != getSource())
getSource().destroy();
protected void finalize()
throws Throwable
public URLConnection getConnection()
    public void setConnection(URLConnection connection)
        throws ParserException
    {
        mConnection = connection;
        // Connect and read timeouts, in milliseconds.
        mConnection.setConnectTimeout(6000);
        mConnection.setReadTimeout(6000);
        try
        {
            getConnection().connect();
        }
        catch(UnknownHostException uhe)
        {
            throw new ParserException("Connect to " + mConnection.getURL().toExternalForm() + " failed.", uhe);
        }
        catch(IOException ioe)
        {
            throw new ParserException("Exception connecting to " + mConnection.getURL().toExternalForm() + " (" + ioe.getMessage() + ").", ioe);
        }
        String type = getContentType();
        String charset = getCharset(type);
        try
        {
            String contentEncoding = connection.getContentEncoding();
            System.out.println("contentEncoding="+contentEncoding);
            Stream stream;
            if(null != contentEncoding && -1 != contentEncoding.indexOf("gzip"))
                stream = new Stream(new GZIPInputStream(getConnection().getInputStream()));
            else if(null != contentEncoding && -1 != contentEncoding.indexOf("deflate"))
                stream = new Stream(new InflaterInputStream(getConnection().getInputStream(), new Inflater(true)));
            else
                stream = new Stream(getConnection().getInputStream());
            try
            {
                // Reason: when getCharset(type) returns ISO-8859-1,
                // substitute the charset stored in GaoBinDEFAULT_CHARSET.
                if(charset.indexOf("ISO-8859-1")!=-1){
                    charset = getGaoBinDEFAULT_CHARSET();
                }
                mSource = new InputStreamSource(stream, charset);
            }
            catch(UnsupportedEncodingException uee)
            {
                charset = "ISO-8859-1";
                mSource = new InputStreamSource(stream, charset);
            }
        }
        catch(IOException ioe)
        {
            throw new ParserException("Exception getting input stream from " + mConnection.getURL().toExternalForm() + " (" + ioe.getMessage() + ").", ioe);
        }
        mUrl = connection.getURL().toExternalForm();
        mIndex = new PageIndex(this);
    }
public String getUrl()
public void setUrl(String url)
public String getBaseUrl()
return mBaseU
public void setBaseUrl(String url)
mBaseUrl =
public Source getSource()
public String getContentType()
String ret = "text/html";
URLConnection connection = getConnection();
if(null != connection)
String content = connection.getHeaderField("Content-Type");
if(null != content)
public char getCharacter(Cursor cursor)
throws ParserException
int i = cursor.getPosition();
int offset = mSource.offset();
if(offset == i)
i = mSource.read();
if(-1 == i)
ret = '\uFFFF';
ret = (char)i;
cursor.advance();
catch(IOException ioe)
throw new ParserException("problem reading a character at position " + cursor.getPosition(), ioe);
if(offset & i)
ret = mSource.getCharacter(i);
catch(IOException ioe)
throw new ParserException("can't read a character at position " + i, ioe);
cursor.advance();
throw new ParserException("attempt to read future characters from source " + i + " & " + mSource.offset());
if('\r' == ret)
ret = '\n';
if(mSource.offset() == cursor.getPosition())
i = mSource.read();
if(-1 != i)
if('\n' == (char)i)
cursor.advance();
mSource.unread();
catch(IOException ioe)
throw new ParserException("can't unread a character at position " + cursor.getPosition(), ioe);
catch(IOException ioe)
throw new ParserException("problem reading a character at position " + cursor.getPosition(), ioe);
if('\n' == mSource.getCharacter(cursor.getPosition()))
cursor.advance();
catch(IOException ioe)
throw new ParserException("can't read a character at position " + cursor.getPosition(), ioe);
if('\n' == ret)
mIndex.add(cursor);
public void ungetCharacter(Cursor cursor)
throws ParserException
cursor.retreat();
int i = cursor.getPosition();
char ch = mSource.getCharacter(i);
if('\n' == ch && 0 != i)
ch = mSource.getCharacter(i - 1);
if('\r' == ch)
cursor.retreat();
catch(IOException ioe)
throw new ParserException("can't read a character at position " + cursor.getPosition(), ioe);
public String getEncoding()
return getSource().getEncoding();
    public void setEncoding(String character_set)
        throws ParserException
    {
        // Remember the requested charset as the default as well.
        Page.GaoBinDEFAULT_CHARSET = character_set;
        getSource().setEncoding(character_set);
    }
public URL constructUrl(String link, String base)
throws MalformedURLException
return constructUrl(link, base, false);
public URL constructUrl(String link, String base, boolean strict)
throws MalformedURLException
if(!strict && '?' == link.charAt(0))
if(-1 != (index = base.lastIndexOf('?')))
base = base.substring(0, index);
url = new URL(base + link);
url = new URL(new URL(base), link);
String path = url.getFile();
boolean modified =
boolean absolute = link.startsWith("/");
if(!absolute)
if(!path.startsWith("/."))
if(path.startsWith("/../"))
path = path.substring(3);
modified =
if(!path.startsWith("/./") && !path.startsWith("/."))
path = path.substring(2);
modified =
} while(true);
while(-1 != (index = path.indexOf("/\\")))
path = path.substring(0, index + 1) + path.substring(index + 2);
modified =
if(modified)
url = new URL(url, path);
public String getAbsoluteURL(String link)
return getAbsoluteURL(link, false);
public String getAbsoluteURL(String link, boolean strict)
if(null == link || "".equals(link))
String base = getBaseUrl();
if(null == base)
base = getUrl();
if(null == base)
URL url = constructUrl(link, base, strict);
ret = url.toExternalForm();
catch(MalformedURLException murle)
public int row(Cursor cursor)
return mIndex.row(cursor);
public int row(int position)
return mIndex.row(position);
public int column(Cursor cursor)
return mIndex.column(cursor);
public int column(int position)
return mIndex.column(position);
public String getText(int start, int end)
throws IllegalArgumentException
ret = mSource.getString(start, end - start);
catch(IOException ioe)
throw new IllegalArgumentException("can't get the " + (end - start) + "characters at position " + start + " - " + ioe.getMessage());
public void getText(StringBuffer buffer, int start, int end)
throws IllegalArgumentException
if(mSource.offset() & start || mSource.offset() & end)
throw new IllegalArgumentException("attempt to extract future characters from source" + start + "|" + end + " & " + mSource.offset());
if(end & start)
length = end -
mSource.getCharacters(buffer, start, length);
catch(IOException ioe)
throw new IllegalArgumentException("can't get the " + (end - start) + "characters at position " + start + " - " + ioe.getMessage());
public String getText()
return getText(0, mSource.offset());
public void getText(StringBuffer buffer)
getText(buffer, 0, mSource.offset());
public void getText(char array[], int offset, int start, int end)
throws IllegalArgumentException
if(mSource.offset() & start || mSource.offset() & end)
throw new IllegalArgumentException("attempt to extract future characters from source");
if(end & start)
length = end -
mSource.getCharacters(array, offset, start, end);
catch(IOException ioe)
throw new IllegalArgumentException("can't get the " + (end - start) + "characters at position " + start + " - " + ioe.getMessage());
public String getLine(Cursor cursor)
int line = row(cursor);
int size = mIndex.size();
if(line & size)
start = mIndex.elementAt(line);
if(++line &= size)
end = mIndex.elementAt(line);
end = mSource.offset();
start = mIndex.elementAt(line - 1);
end = mSource.offset();
return getText(start, end);
public String getLine(int position)
return getLine(new Cursor(this, position));
public String toString()
if(mSource.offset() & 0)
StringBuffer buffer = new StringBuffer(43);
int start = mSource.offset() - 40;
if(0 & start)
start = 0;
buffer.append("...");
getText(buffer, start, mSource.offset());
ret = buffer.toString();
ret = super.toString();
    public static final String DEFAULT_CHARSET = "ISO-8859-1";
    // Added field: the charset to use when the server header yields ISO-8859-1.
    public static String GaoBinDEFAULT_CHARSET;
    public static final String DEFAULT_CONTENT_TYPE = "text/html";
    public static final char EOF = 65535;
    protected String mUrl;
    protected String mBaseUrl;
    protected Source mSource;
    protected PageIndex mIndex;
    protected transient URLConnection mConnection;
    protected static ConnectionManager mConnectionManager = new ConnectionManager();

    public static String getGaoBinDEFAULT_CHARSET() {
        return GaoBinDEFAULT_CHARSET;
    }

    public static void setGaoBinDEFAULT_CHARSET(String gaoBinDEFAULT_CHARSET) {
        GaoBinDEFAULT_CHARSET = gaoBinDEFAULT_CHARSET;
    }
package org.htmlparser.lexer;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UnsupportedEncodingException;
import org.htmlparser.util.EncodingChangeException;
import org.htmlparser.util.ParserException;

public class InputStreamSource
    extends Source
* An initial buffer size.
* Has a default value of {16384}.
public static int BUFFER_SIZE = 16384;
* The stream of bytes.
* Set to &code&null&/code& when the source is closed.
protected transient InputStream mS
* The character set in use.
protected String mE
* The converter from bytes to characters.
protected transient InputStreamReader mR
* The characters read so far.
protected char[] mB
* The number of valid bytes in the buffer.
protected int mL
* The offset of the next byte returned by read().
protected int mO
* The bookmark.
protected int mM
* Create a source of characters using the default character set.
* @param stream The stream of bytes to use.
* @exception UnsupportedEncodingException If the default character set
* is unsupported.
public InputStreamSource (InputStream stream)
UnsupportedEncodingException
this (stream, null, BUFFER_SIZE);
public InputStreamSource (InputStream stream, String charset)
UnsupportedEncodingException
this (stream, charset, BUFFER_SIZE);
* Create a source of characters.
* @param stream The stream of bytes to use.
* @param charset The character set used in encoding the stream.
* @param size The initial character buffer size.
* @exception UnsupportedEncodingException If the character set
* is unsupported.
public InputStreamSource (InputStream stream, String charset, int size)
UnsupportedEncodingException
if (null == stream)
stream = new Stream (null);
// bug #1044707 mark()/reset() issues
if (!stream.markSupported ())
// wrap the stream so we can reset
stream = new Stream (stream);
if (null == charset)
mReader = new InputStreamReader (stream);
mEncoding = mReader.getEncoding ();
mEncoding =
mReader = new InputStreamReader (stream, charset);
mBuffer = new char[size];
mLevel = 0;
mOffset = 0;
mMark = -1;
* Serialization support.
* @param out Where to write this object.
* @exception IOException If serialization has a problem.
private void writeObject (ObjectOutputStream out)
IOException
if (null != mStream)
// remember the offset, drain the input stream, restore the offset
offset = mO
buffer = new char[4096];
while (EOF != read (buffer))
out.defaultWriteObject ();
* Deserialization support.
* @param in Where to read this object from.
* @exception IOException If deserialization has a problem.
private void readObject (ObjectInputStream in)
IOException,
ClassNotFoundException
in.defaultReadObject ();
if (null != mBuffer) // buffer is null when destroy's been called
// pretend we're open, mStream goes null when exhausted
mStream = new ByteArrayInputStream (new byte[0]);
* Get the input stream being used.
* @return The current input stream.
public InputStream getStream ()
return (mStream);
* Get the encoding being used to convert characters.
* @return The current encoding.
public String getEncoding ()
return (mEncoding);
* Begins reading from the source with the given character set.
* If the current encoding is the same as the requested encoding,
* this method is a no-op. Otherwise any subsequent characters read from
* this page will have been decoded using the given character set.&p&
* Some magic happens here to obtain this result if characters have already
* been consumed from this source.
* Since a Reader cannot be dynamically altered to use a different character
* set, the underlying stream is reset, a new Source is constructed
* and a comparison made of the characters read so far with the newly
* read characters up to the current position.
* If a difference is encountered, or some other problem occurs,
* an exception is thrown.
* @param character_set The character set to use to convert bytes into
* characters.
* @exception ParserException If a character mismatch occurs between
* characters already provided and those that would have been returned
* had the new character set been in effect from the beginning. An
* exception is also thrown if the underlying stream won't put up with
* these shenanigans.
    public void setEncoding (String character_set)
        throws
            ParserException
    {
        String encoding;
        InputStream stream;
        char[] buffer;
        int offset;
        char[] new_chars;

        encoding = getEncoding ();
        if(encoding!=null){
            // Modification. The right-hand side of this assignment was lost in
            // extraction; "encoding" is inferred. It makes the requested charset
            // equal to the current one, so the re-decode-and-compare branch
            // below is skipped.
            character_set = encoding;
        }
        if (!encoding.equalsIgnoreCase (character_set))
        {
            stream = getStream ();
            try
            {
                buffer = mBuffer;
                offset = mOffset;
                stream.reset ();
                try
                {
                    mEncoding = character_set;
                    mReader = new InputStreamReader (stream, character_set);
                    mBuffer = new char[mBuffer.length];
                    mLevel = 0;
                    mOffset = 0;
                    mMark = -1;
                    if (0 != offset)
                    {
                        // Re-read what was consumed and compare it with the old decoding.
                        new_chars = new char[offset];
                        if (offset != read (new_chars))
                            throw new ParserException ("reset stream failed");
                        for (int i = 0; i < offset; i++)
                            if (new_chars[i] != buffer[i])
                                throw new EncodingChangeException ("character mismatch (new: "
                                    + new_chars[i]
                                    + " [0x"
                                    + Integer.toString (new_chars[i], 16)
                                    + "] != old: "
                                    + " [0x"
                                    + Integer.toString (buffer[i], 16)
                                    + buffer[i]
                                    + "]) for encoding change from "
                                    + encoding
                                    + " to "
                                    + character_set
                                    + " at character offset "
                                    + i);
                    }
                }
                catch (IOException ioe)
                {
                    throw new ParserException (ioe.getMessage (), ioe);
                }
            }
            catch (IOException ioe)
            {   // bug #1044707 mark()/reset() issues
                throw new ParserException ("Stream reset failed ("
                    + ioe.getMessage ()
                    + "), try wrapping it with a org.htmlparser.lexer.Stream",
                    ioe);
            }
        }
    }
* Fetch more characters from the underlying reader.
* Has no effect if the underlying reader has been drained.
* @param min The minimum to read.
* @exception IOException If the underlying reader read() throws one.
protected void fill (int min)
IOException
if (null != mReader) // mReader goes null when it's been sucked dry
size = mBuffer.length - mL // available space
if (size & min) // oops, better get some buffer space
// unknown length... keep doubling
size = mBuffer.length * 2;
read = mLevel +
if (size & read) // or satisfy min, whichever is greater
min = size - mL // read the max
buffer = new char[size];
buffer = mB
// read into the end of the 'new' buffer
read = mReader.read (buffer, mLevel, min);
if (EOF == read)
mReader.close ();
if (mBuffer != buffer)
// copy the bytes previously read
System.arraycopy (mBuffer, 0, buffer, 0, mLevel);
// todo, should repeat on read shorter than original min
* Does nothing.
* It's supposed to close the source, but use destroy() instead.
* @exception IOException &em&not used&/em&
* @see #destroy
public void close () throws IOException
* Read a single character.
* This method will block until a character is available,
* an I/O error occurs, or the end of the stream is reached.
* @return The character read, as an integer in the range 0 to 65535
* (&tt&0x00-0xffff&/tt&), or {@link #EOF EOF} if the end of the stream has
* been reached
* @exception IOException If an I/O error occurs.
public int read () throws IOException
if (mLevel - mOffset & 1)
if (null == mStream)
throw new IOException ("source is closed");
if (mOffset &= mLevel)
ret = EOF;
ret = mBuffer[mOffset++];
ret = mBuffer[mOffset++];
return (ret);
* Read characters into a portion of an array.
This method will block
* until some input is available, an I/O error occurs, or the end of the
* stream is reached.
* @param cbuf Destination buffer
* @param off Offset at which to start storing characters
* @param len Maximum number of characters to read
* @return The number of characters read, or {@link #EOF EOF} if the end of
* the stream has been reached
* @exception IOException If an I/O error occurs.
public int read (char[] cbuf, int off, int len) throws IOException
if (null == mStream)
throw new IOException ("source is closed");
if ((null == cbuf) || (0 & off) || (0 & len))
throw new IOException ("illegal argument read ("
+ ((null == cbuf) ? "null" : "cbuf")
+ ", " + off + ", " + len + ")");
if (mLevel - mOffset & len)
fill (len - (mLevel - mOffset)); // minimum to satisfy this request
if (mOffset &= mLevel)
ret = EOF;
ret = Math.min (mLevel - mOffset, len);
System.arraycopy (mBuffer, mOffset, cbuf, off, ret);
mOffset +=
return (ret);
* Read characters into an array.
* This method will block until some input is available, an I/O error occurs,
* or the end of the stream is reached.
* @param cbuf Destination buffer.
* @return The number of characters read, or {@link #EOF EOF} if the end of
* the stream has been reached.
* @exception IOException If an I/O error occurs.
public int read (char[] cbuf) throws IOException
return (read (cbuf, 0, cbuf.length));
* Reset the source.
* Repositions the read point to begin at zero.
* @exception IllegalStateException If the source has been closed.
public void reset ()
IllegalStateException
if (null == mStream)
throw new IllegalStateException ("source is closed");
if (-1 != mMark)
mOffset = mM
mOffset = 0;
* Tell whether this source supports the mark() operation.
* @return &code&true&/code&.
public boolean markSupported ()
return (true);
* Mark the present position in the source.
* Subsequent calls to {@link #reset()}
* will attempt to reposition the source to this point.
readAheadLimit &em&Not used.&/em&
* @exception IOException If the source is closed.
public void mark (int readAheadLimit) throws IOException
if (null == mStream)
throw new IOException ("source is closed");
mMark = mO
* Tell whether this source is ready to be read.
* @return &code&true&/code& if the next read() is guaranteed not to block
* for input, &code&false&/code& otherwise.
* Note that returning false does not guarantee that the next read will block.
* @exception IOException If the source is closed.
public boolean ready () throws IOException
if (null == mStream)
throw new IOException ("source is closed");
return (mOffset & mLevel);
* Skip characters.
* This method will block until some characters are available,
* an I/O error occurs, or the end of the stream is reached.
* &em&Note: n is treated as an int&/em&
* @param n The number of characters to skip.
* @return The number of characters actually skipped
* @exception IllegalArgumentException If &code&n&/code& is negative.
* @exception IOException If an I/O error occurs.
public long skip (long n)
IOException,
IllegalArgumentException
if (null == mStream)
throw new IOException ("source is closed");
if (0 & n)
throw new IllegalArgumentException ("cannot skip backwards");
if (mLevel - mOffset & n)
fill ((int)(n - (mLevel - mOffset))); // minimum to satisfy this request
if (mOffset &= mLevel)
ret = EOF;
ret = Math.min (mLevel - mOffset, n);
mOffset +=
return (ret);
* Undo the read of a single character.
* @exception IOException If the source is closed or no characters have
* been read.
public void unread () throws IOException
if (null == mStream)
throw new IOException ("source is closed");
if (0 & mOffset)
mOffset--;
throw new IOException ("can't unread no characters");
* Retrieve a character again.
* @param offset The offset of the character.
* @return The character at &code&offset&/code&.
* @exception IOException If the offset is beyond {@link #offset()} or the
* source is closed.
public char getCharacter (int offset) throws IOException
if (null == mStream)
throw new IOException ("source is closed");
if (offset &= mBuffer.length)
throw new IOException ("illegal read ahead");
ret = mBuffer[offset];
return (ret);
* Retrieve characters again.
* @param array The array of characters.
* @param offset The starting position in the array where characters are to be placed.
* @param start The starting position, zero based.
* @param end The ending position
* (exclusive, i.e. the character at the ending position is not included),
* zero based.
* @exception IOException If the start or end is beyond {@link #offset()}
* or the source is closed.
public void getCharacters (char[] array, int offset, int start, int end) throws IOException
if (null == mStream)
throw new IOException ("source is closed");
System.arraycopy (mBuffer, start, array, offset, end - start);
* Retrieve a string.
* @param offset The offset of the first character.
* @param length The number of characters to retrieve.
* @return A string containing the &code&length&/code& characters at &code&offset&/code&.
* @exception IOException If the offset or (offset + length) is beyond
* {@link #offset()} or the source is closed.
public String getString (int offset, int length) throws IOException
if (null == mStream)
throw new IOException ("source is closed");
if (offset + length & mBuffer.length)
throw new IOException ("illegal read ahead");
ret = new String (mBuffer, offset, length);
return (ret);
* Append characters already read into a &code&StringBuffer&/code&.
* @param buffer The buffer to append to.
* @param offset The offset of the first character.
* @param length The number of characters to retrieve.
* @exception IOException If the offset or (offset + length) is beyond
* {@link #offset()} or the source is closed.
public void getCharacters (StringBuffer buffer, int offset, int length) throws IOException
if (null == mStream)
throw new IOException ("source is closed");
buffer.append (mBuffer, offset, length);
* Close the source.
* Once a source has been closed, further {@link #read() read},
* {@link #ready ready}, {@link #mark mark}, {@link #reset reset},
* {@link #skip skip}, {@link #unread unread},
* {@link #getCharacter getCharacter} or {@link #getString getString}
* invocations will throw an IOException.
* Closing a previously-closed source, however, has no effect.
* @exception IOException If an I/O error occurs
public void destroy () throws IOException
if (null != mReader)
mReader.close ();
mLevel = 0;
mOffset = 0;
mMark = -1;
* Get the position (in characters).
* @return The number of characters that have already been read, or
* {@link #EOF EOF} if the source is closed.
public int offset ()
if (null == mStream)
ret = EOF;
return (ret);
* Get the number of available characters.
* @return The number of characters that can be read without blocking or
* zero if the source is closed.
public int available ()
if (null == mStream)
ret = mLevel - mO
return (ret);