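1. Read the contents of the Sogou corpus (that is, the contents of sougou.txt). The file contains many records, each in the following format:
<doc><docno>page ID</docno><url>page URL</url>raw page content</doc>
The raw page content is everything from the line after the <url> tag up to the closing </doc> tag; its HTML markup is preserved.
2. Store what was read in a MySQL database (mainly the main body of each record).
3. Reference program: http://www.xmsydw.com
4. Implementation language: Java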
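The database structure is as follows:
The database and its tables: sohunews_all holds the full corpus of Sohu-classified news, and souhunews_reduced holds the reduced version of the same corpus.
The table fields are as follows: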
The Sogou data format is as follows:
<doc>
<url>page URL</url>
<docno>page ID</docno>
<contenttitle>page title</contenttitle>
<content>page content</content>
</doc>
The resulting output looks like this:
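The program is explained below.
The main program is invoked as follows: once the database configuration is set, simply adding the corpus directory is enough to import the Sogou corpus into the database.
The SogouCSProcessor class controls the overall data processing. The main regular-expression matching is done in the getDocBean() function; after the fields are extracted, the <> tags and everything inside the angle brackets are removed from them.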
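The extraction and tag-stripping step described above can be sketched roughly as follows. This is a minimal sketch and not the author's getDocBean(): the class name DocRecordParser, the nested Doc holder, and the exact regular expressions are assumptions made for illustration, based only on the record format shown earlier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names here are not from the original program.
final class DocRecordParser {

    // Hypothetical value holder for one corpus record.
    static final class Doc {
        final String url;
        final String docno;
        final String title;
        final String content;

        Doc(String url, String docno, String title, String content) {
            this.url = url;
            this.docno = docno;
            this.title = title;
            this.content = content;
        }
    }

    // One <doc>...</doc> record. DOTALL lets '.' span line breaks;
    // CASE_INSENSITIVE tolerates tags written with different casing.
    private static final Pattern DOC = Pattern.compile(
            "<doc>\\s*<url>(.*?)</url>\\s*<docno>(.*?)</docno>\\s*"
                    + "<contenttitle>(.*?)</contenttitle>\\s*"
                    + "<content>(.*?)</content>\\s*</doc>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Any <...> tag, including its contents.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Removes every <...> tag and trims surrounding whitespace.
    static String stripTags(String s) {
        return TAG.matcher(s).replaceAll("").trim();
    }

    // Extracts all records found in the given text.
    static List<Doc> parse(String text) {
        List<Doc> docs = new ArrayList<>();
        Matcher m = DOC.matcher(text);
        while (m.find()) {
            docs.add(new Doc(m.group(1).trim(), m.group(2).trim(),
                    stripTags(m.group(3)), stripTags(m.group(4))));
        }
        return docs;
    }

    public static void main(String[] args) {
        String sample = "<doc>\n<url>http://sports.sohu.com/a.shtml</url>"
                + "<docno>abc123</docno><contenttitle>Title</contenttitle>"
                + "<content>Some <b>bold</b> text</content></doc>";
        for (Doc d : parse(sample)) {
            System.out.println(d.url + " | " + d.title + " | " + d.content);
        }
    }
}
```

The reluctant quantifier (.*?) keeps each group from running past its own closing tag when a file holds many records back to back.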
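The category detector is shown below. It finds the category that corresponds to a given URL. Note that the domestic, international, and society sections are all merged into the society category, for two reasons:
1. For some of the sites in the corpus, these three topics cannot be told apart by URL.
2. The content of the three topics is similar to begin with, so separating them adds little.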
public final class CategoryDetector {

    private CategoryDetector() {
    }

    /** Maps a page URL to a category by URL prefix; returns null when no prefix matches. */
    public static String detectCategory(String url) {
        // Local variable: a shared static field would leak the previous call's result.
        String category = null;

        // xinhuanet
        if (url.startsWith("http://www.xinhuanet.com/auto/")) { category = "car"; }
        else if (url.startsWith("http://www.xinhuanet.com/fortune")) { category = "finance"; }
        else if (url.startsWith("http://www.xinhuanet.com/internet/")) { category = "IT"; }
        else if (url.startsWith("http://www.xinhuanet.com/health/")) { category = "health"; }
        else if (url.startsWith("http://www.xinhuanet.com/sports")) { category = "sports"; }
        else if (url.startsWith("http://www.xinhuanet.com/travel")) { category = "travel"; }
        else if (url.startsWith("http://www.xinhuanet.com/edu")) { category = "education"; }
        else if (url.startsWith("http://www.xinhuanet.com/employment")) { category = "employment"; }
        else if (url.startsWith("http://www.xinhuanet.com/life")) { category = "culture"; }
        else if (url.startsWith("http://www.xinhuanet.com/mil")) { category = "military"; }
        else if (url.startsWith("http://www.xinhuanet.com/olympics/")) { category = "olympics"; }
        else if (url.startsWith("http://www.xinhuanet.com/society")
                || url.startsWith("http://www.xinhuanet.com/local/")
                || url.startsWith("http://www.xinhuanet.com/world")) { category = "society"; }
        else if (url.startsWith("http://www.xinhuanet.com/house")) { category = "house"; }
        else if (url.startsWith("http://www.xinhuanet.com/ent")) { category = "ent"; }
        else if (url.startsWith("http://www.xinhuanet.com/lady")) { category = "lady"; }
        else if (url.startsWith("http://www.xinhuanet.com/school")) { category = "school"; }

        // china
        if (url.startsWith("http://auto.china.com/")) { category = "car"; }
        else if (url.startsWith("http://caifu.china.com/")) { category = "finance"; }
        else if (url.startsWith("http://tech.china.com/zh_cn/news/net/")) { category = "IT"; }
        else if (url.startsWith("http://health.china.com/")) { category = "health"; }
        else if (url.startsWith("http://sports.china.com/")) { category = "sports"; }
        else if (url.startsWith("http://goo66.china.com/")) { category = "travel"; }
        // school must be tested before education: its prefix extends the education prefix.
        else if (url.startsWith("http://edu.533.com/news/xiaoyuan/")) { category = "school"; }
        else if (url.startsWith("http://edu.533.com/")) { category = "education"; }
        else if (url.startsWith("http://culture.china.com/")) { category = "culture"; }
        else if (url.startsWith("http://military.china.com/")) { category = "military"; }
        else if (url.startsWith("http://2008.china.com/")) { category = "olympics"; }
        else if (url.startsWith("http://news.china.com/zh_cn/social/")
                || url.startsWith("http://news.china.com/zh_cn/domestic/")
                || url.startsWith("http://news.china.com/zh_cn/international/")) { category = "society"; }
        else if (url.startsWith("http://china.soufun.com/")) { category = "house"; }
        else if (url.startsWith("http://fun.china.com/zh_cn/star/")) { category = "ent"; }
        else if (url.startsWith("http://meirong.533.com/")) { category = "lady"; }

        // sina.com.cn
        if (url.startsWith("http://auto.sina.com.cn/")) { category = "car"; }
        else if (url.startsWith("http://finance.sina.com.cn/")) { category = "finance"; }
        else if (url.startsWith("http://tech.sina.com.cn/it/")) { category = "IT"; }
        else if (url.startsWith("http://sina.kangq.com/")) { category = "health"; }
        else if (url.startsWith("http://sports.sina.com.cn/")) { category = "sports"; }
        else if (url.startsWith("http://tour.sina.com.cn/")) { category = "travel"; }
        // employment and school must be tested before education: their prefixes extend it.
        else if (url.startsWith("http://edu.sina.com.cn/j/")) { category = "employment"; }
        else if (url.startsWith("http://edu.sina.com.cn/y/")) { category = "school"; }
        else if (url.startsWith("http://edu.sina.com.cn/")) { category = "education"; }
        else if (url.startsWith("http://cul.book.sina.com.cn/")) { category = "culture"; }
        else if (url.startsWith("http://mil.news.sina.com.cn/")) { category = "military"; }
        else if (url.startsWith("http://2008.sina.com.cn/")) { category = "olympics"; }
        else if (url.startsWith("http://news.sina.com.cn/society/")
                || url.startsWith("http://news.sina.com.cn/china/")
                || url.startsWith("http://news.sina.com.cn/world/")) { category = "society"; }
        else if (url.startsWith("http://house.sina.com.cn/")) { category = "house"; }
        else if (url.startsWith("http://ent.sina.com.cn/")) { category = "ent"; }
        else if (url.startsWith("http://eladies.sina.com.cn/")) { category = "lady"; }

        // 163.com
        if (url.startsWith("http://auto.163.com/")) { category = "car"; }
        else if (url.startsWith("http://money.163.com/")) { category = "finance"; }
        else if (url.startsWith("http://tech.163.com/it/")) { category = "IT"; }
        else if (url.startsWith("http://163.39.net/")) { category = "health"; }
        else if (url.startsWith("http://sports.163.com/")) { category = "sports"; }
        else if (url.startsWith("http://war.163.com/")) { category = "military"; }
        else if (url.startsWith("http://2008.163.com/")) { category = "olympics"; }
        else if (url.startsWith("http://news.163.com/shehui/")
                || url.startsWith("http://news.163.com/domestic/")
                || url.startsWith("http://news.163.com/world/")) { category = "society"; }
        else if (url.startsWith("http://house.163.com/")) { category = "house"; }
        else if (url.startsWith("http://ent.163.com/")) { category = "ent"; }
        else if (url.startsWith("http://lady.163.com/")) { category = "lady"; }

        // qq.com
        if (url.startsWith("http://auto.qq.com/")) { category = "car"; }
        else if (url.startsWith("http://finance.qq.com/")) { category = "finance"; }
        else if (url.startsWith("http://tech.qq.com/a/")) { category = "IT"; }
        else if (url.startsWith("http://sports.qq.com/")) { category = "sports"; }
        // employment must be tested before education: its prefix extends the education prefix.
        else if (url.startsWith("http://edu.qq.com/job/")) { category = "employment"; }
        else if (url.startsWith("http://edu.qq.com/")) { category = "education"; }
        else if (url.startsWith("http://cul.qq.com/")) { category = "culture"; }
        else if (url.startsWith("http://mil.qq.com/")) { category = "military"; }
        else if (url.startsWith("http://news.qq.com/a/")) { category = "society"; }
        else if (url.startsWith("http://2008.qq.com/")) { category = "olympics"; }
        else if (url.startsWith("http://house.qq.com/")) { category = "house"; }
        else if (url.startsWith("http://ent.qq.com/")) { category = "ent"; }
        else if (url.startsWith("http://lady.qq.com/")) { category = "lady"; }
        else if (url.startsWith("http://campus.qq.com/")) { category = "school"; }

        // sohu.com
        if (url.startsWith("http://auto.sohu.com/")) { category = "car"; }
        else if (url.startsWith("http://business.sohu.com/")) { category = "finance"; }
        else if (url.startsWith("http://it.sohu.com/")) { category = "IT"; }
        else if (url.startsWith("http://health.sohu.com/")) { category = "health"; }
        else if (url.startsWith("http://sports.sohu.com/")) { category = "sports"; }
        else if (url.startsWith("http://travel.sohu.com/")) { category = "travel"; }
        else if (url.startsWith("http://learning.sohu.com/")) { category = "education"; }
        else if (url.startsWith("http://career.sohu.com/")) { category = "employment"; }
        else if (url.startsWith("http://cul.sohu.com/")) { category = "culture"; }
        else if (url.startsWith("http://news.sohu.com/")) { category = "society"; }
        else if (url.startsWith("http://mil.news.sohu.com/")) { category = "military"; }
        else if (url.startsWith("http://2008.sohu.com/")) { category = "olympics"; }
        else if (url.startsWith("http://house.sohu.com/")) { category = "house"; }
        else if (url.startsWith("http://yule.sohu.com/")) { category = "ent"; }
        else if (url.startsWith("http://women.sohu.com/")) { category = "lady"; }

        return category;
    }

    public static void main(String[] args) {
        String category = CategoryDetector.detectCategory("http://edu.sina.com.cn/");
        System.out.println(category);
    }
}
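As the employment/education comments in the detector show, the if/else chain depends on testing longer prefixes before the shorter prefixes they extend. One possible alternative, sketched here under assumptions and not taken from the original program, keeps the prefixes in a table and picks the longest matching one, so insertion order no longer matters. The class name PrefixCategoryTable and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical table-driven variant of the prefix lookup.
final class PrefixCategoryTable {

    private final Map<String, String> prefixToCategory = new LinkedHashMap<>();

    // Registers a URL prefix; returns this so calls can be chained.
    PrefixCategoryTable add(String prefix, String category) {
        prefixToCategory.put(prefix, category);
        return this;
    }

    // Returns the category of the longest registered prefix that the URL
    // starts with, or null when no prefix matches.
    String detect(String url) {
        String best = null;
        int bestLength = -1;
        for (Map.Entry<String, String> entry : prefixToCategory.entrySet()) {
            String prefix = entry.getKey();
            if (url.startsWith(prefix) && prefix.length() > bestLength) {
                best = entry.getValue();
                bestLength = prefix.length();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        PrefixCategoryTable table = new PrefixCategoryTable()
                .add("http://edu.sina.com.cn/", "education")
                .add("http://edu.sina.com.cn/j/", "employment")
                .add("http://edu.sina.com.cn/y/", "school");
        System.out.println(table.detect("http://edu.sina.com.cn/y/2008/x.html"));
    }
}
```

The linear scan is adequate for the roughly one hundred prefixes listed here; a trie would only be worth the extra code for a much larger table.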
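The DocBean class is as follows:
The other classes are all supporting classes and are not described one by one here. The code is as follows:
The FileListViewer class: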
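For orientation, a class with this role presumably walks the corpus directory and collects the files to import. The following is only a sketch of that idea and not the author's FileListViewer: the class name CorpusFileLister, the method listTextFiles, and the choice to keep only .txt files are all assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for collecting corpus files.
final class CorpusFileLister {

    // Recursively lists regular files under root whose names end in ".txt"
    // (an assumed extension), sorted by path for a stable processing order.
    static List<Path> listTextFiles(Path root) throws IOException {
        // Files.walk holds directory handles, so the stream is closed here.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".txt"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```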
The FileUtil class:
The DBUtil and SogouDBManager classes:
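The database side presumably comes down to a parameterized INSERT into sohunews_all or souhunews_reduced. The following is a minimal sketch of that step using plain JDBC, not the author's DBUtil or SogouDBManager. The class name InsertSketch and the column names (docno, url, category, title, content) are assumptions, since the table fields are not listed above.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for writing one record to the database.
final class InsertSketch {

    // Assumed column names; replace them with the real table fields.
    static final List<String> COLUMNS =
            Arrays.asList("docno", "url", "category", "title", "content");

    // Builds "INSERT INTO t (a, b) VALUES (?, ?)" for the given columns.
    static String buildInsertSql(String table, List<String> columns) {
        String placeholders =
                String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + table + " (" + String.join(", ", columns)
                + ") VALUES (" + placeholders + ")";
    }

    // Inserts one row; values must be given in the same order as COLUMNS.
    static void insert(Connection conn, String table, String... values)
            throws SQLException {
        if (values.length != COLUMNS.size()) {
            throw new IllegalArgumentException(
                    "expected " + COLUMNS.size() + " values");
        }
        try (PreparedStatement ps =
                conn.prepareStatement(buildInsertSql(table, COLUMNS))) {
            for (int i = 0; i < values.length; i++) {
                ps.setString(i + 1, values[i]); // JDBC parameters are 1-based
            }
            ps.executeUpdate();
        }
    }
}
```

Binding values through PreparedStatement keeps page text that contains quotes from breaking the statement. The table name is concatenated into the SQL, so it should come only from a fixed set such as the two table names above.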
Complete source code: http://apenny.taobao.com