Processing the Sohu news corpus sogou.txt with Java, regular expressions, and a MySQL database
Published: 2019-06-27


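1. Read the contents of the Sogou corpus (that is, the contents of sogou.txt). The file holds many records, each in the following format:

The part from the line after the <url> tag up to the closing </doc> tag is the original page content, with its HTML markup preserved.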

<doc>
<docno>page ID</docno>
<url>page URL</url>
original page content
</doc>

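2. Store the content that was read in a MySQL database (mainly the main body content of each record).

3. Reference program: http://www.xmsydw.com

4. Implementation language: Java

The database structure is as follows:

These are the database and its tables: sohunews_all holds the full corpus of Sohu's categorized news, and souhunews_reduced corresponds to its reduced version.

The table fields are as follows:

The Sogou data format is as follows: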

<doc>

<url>page URL</url>
<docno>page ID</docno>
<contenttitle>page title</contenttitle>
<content>page content</content>
</doc>
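
Before any field can be extracted, the corpus file has to be cut into individual records. A minimal sketch of that step, assuming every <doc> and </doc> tag sits on its own line as in the format above. The class name DocRecordSplitter and its method are hypothetical and are not part of the original program:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not from the original program): groups corpus lines into <doc> records. */
final class DocRecordSplitter {

    /** Returns one string per record, from its <doc> line through its </doc> line. */
    static List<String> split(List<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null; // null while we are outside a <doc> block
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.equalsIgnoreCase("<doc>")) {
                current = new StringBuilder(); // a new record starts here
            }
            if (current != null) {
                current.append(line).append('\n');
                if (trimmed.equalsIgnoreCase("</doc>")) {
                    records.add(current.toString()); // record complete
                    current = null;
                }
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "<doc>", "<url>http://sports.sohu.com/a.shtml</url>", "<docno>1</docno>", "</doc>",
                "<doc>", "<url>http://it.sohu.com/b.shtml</url>", "<docno>2</docno>", "</doc>");
        for (String record : split(lines)) {
            System.out.println("--- record ---");
            System.out.print(record);
        }
    }
}
```

For a real corpus file the lines could come from java.nio.file.Files.readAllLines(path, charset), using whatever charset the file is actually encoded in. For very large files a streaming reader would be the safer choice, since this sketch keeps every record in memory.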

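The resulting output looks like this:

Below is an explanation of the program:

The main program is invoked as follows. Once the database configuration is set, importing the Sogou corpus into the database only takes passing in the corpus directory.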

public static void main(String[] args) {
    SogouCSProcessor pro = new SogouCSProcessor();
    resultToFile("data/Sogou.txt");
    // Directory that holds the corpus files to import.
    pro.processor("D:\\语料库\\Sogou语料库\\SogouCA.reduced");
}
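
At the JDBC level, writing the extracted records into MySQL might look roughly like the sketch below. The table name sohunews_all comes from this article, but the column names (url, docno, title, content, category), the class name NewsInserter, and its methods are all assumptions made for illustration. They are not the original DBUtil or SogouDBManager code:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

/** Hypothetical sketch of a batch insert; column names are assumed, not taken from the article. */
final class NewsInserter {

    static final int COLUMN_COUNT = 5;

    // One placeholder per assumed column: url, docno, title, content, category.
    static final String INSERT_SQL =
            "INSERT INTO sohunews_all (url, docno, title, content, category) VALUES (?, ?, ?, ?, ?)";

    /** Rejects rows that do not carry exactly one value per column. */
    static void checkRow(String[] row) {
        if (row == null || row.length != COLUMN_COUNT) {
            throw new IllegalArgumentException("a row needs exactly " + COLUMN_COUNT + " values");
        }
    }

    /** Inserts all rows in one batch and returns how many statements were executed. */
    static int insertAll(Connection conn, List<String[]> rows) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            for (String[] row : rows) {
                checkRow(row);
                for (int i = 0; i < row.length; i++) {
                    ps.setString(i + 1, row[i]); // JDBC parameters are 1-based
                }
                ps.addBatch();
            }
            return ps.executeBatch().length;
        }
    }
}
```

A Connection would typically be obtained with DriverManager.getConnection and a jdbc:mysql:// URL, with the MySQL Connector/J driver on the classpath. Using a PreparedStatement rather than string concatenation matters here because page titles and bodies routinely contain quote characters.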

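The SogouCSProcessor class provides overall control of the data processing. The main regular-expression matching happens in the getDocBean() function; after the fields have been extracted, the <...> tags and whatever sits inside the angle brackets are stripped from them.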
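
The extract-then-strip step described above can be sketched as follows. This is not the original getDocBean(); the class name DocFieldExtractor, its methods, and the sample record are hypothetical, and the record is assumed to follow the <url>/<docno>/<contenttitle>/<content> format shown earlier:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not the original getDocBean()): pulls fields out of one <doc> record. */
final class DocFieldExtractor {

    // DOTALL lets '.' cross line breaks; CASE_INSENSITIVE tolerates tags written in mixed case.
    private static final int FLAGS = Pattern.DOTALL | Pattern.CASE_INSENSITIVE;
    private static final Pattern URL = Pattern.compile("<url>(.*?)</url>", FLAGS);
    private static final Pattern TITLE = Pattern.compile("<contenttitle>(.*?)</contenttitle>", FLAGS);
    private static final Pattern CONTENT = Pattern.compile("<content>(.*?)</content>", FLAGS);
    // Any leftover markup: a '<', then everything up to the next '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Returns the trimmed text captured by the pattern, or null when the field is absent. */
    static String field(Pattern pattern, String record) {
        Matcher m = pattern.matcher(record);
        return m.find() ? m.group(1).trim() : null;
    }

    /** Removes every <...> element, keeping only the text between tags. */
    static String stripTags(String text) {
        return TAG.matcher(text).replaceAll("").trim();
    }

    static String url(String record) {
        return field(URL, record);
    }

    static String title(String record) {
        String raw = field(TITLE, record);
        return raw == null ? null : stripTags(raw);
    }

    static String content(String record) {
        String raw = field(CONTENT, record);
        return raw == null ? null : stripTags(raw);
    }

    public static void main(String[] args) {
        String record = "<doc>\n<url>http://sports.sohu.com/a.shtml</url>\n<docno>1</docno>\n"
                + "<contenttitle>Match report</contenttitle>\n"
                + "<content><p>Final score <b>2:1</b></p></content>\n</doc>";
        System.out.println(url(record));
        System.out.println(title(record));
        System.out.println(content(record));
    }
}
```

The reluctant quantifier (.*?) stops at the first closing tag, so a greedy match cannot run on into a later field of the same record.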

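The category detector is shown below. It maps a URL to its category. Note that the domestic, international, and society categories are all merged into the society category here, for two reasons:

First, for some of the sites in the corpus, the URL alone cannot tell these three topics apart. Second, the content of the three topics is quite similar to begin with, so there is little to distinguish them.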

 
package zju.dawn.ai.corpus.sogou;

 

 

public final class CategoryDetector {

private static String category = null;

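public static String detectCategory(String url) {
// Reset the shared static field, so an unmatched URL returns null
// instead of the category left over from a previous call.
category = null;
// xinhuanet.com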
if (url.startsWith("http://www.xinhuanet.com/auto/")) {
category = "car";
} else if (url.startsWith("http://www.xinhuanet.com/fortune")) {
category = "finance";
} else if (url.startsWith("http://www.xinhuanet.com/internet/")) {
category = "IT";
} else if (url.startsWith("http://www.xinhuanet.com/health/")) {
category = "health";
} else if (url.startsWith("http://www.xinhuanet.com/sports")) {
category = "sports";
} else if (url.startsWith("http://www.xinhuanet.com/travel")) {
category = "travel";
} else if (url.startsWith("http://www.xinhuanet.com/edu")) {
category = "education";
} else if (url.startsWith("http://www.xinhuanet.com/employment")) {
category = "employment";
} else if (url.startsWith("http://www.xinhuanet.com/life")) {
category = "culture";
} else if (url.startsWith("http://www.xinhuanet.com/mil")) {
category = "military";
} else if (url.startsWith("http://www.xinhuanet.com/olympics/")) {
category = "olympics";
} else if (url.startsWith("http://www.xinhuanet.com/society")
|| url.startsWith("http://www.xinhuanet.com/local/")
|| url.startsWith("http://www.xinhuanet.com/world")) {
category = "society";
} else if (url.startsWith("http://www.xinhuanet.com/house")) {
category = "house";
} else if (url.startsWith("http://www.xinhuanet.com/ent")) {
category = "ent";
} else if (url.startsWith("http://www.xinhuanet.com/lady")) {
category = "lady";
} else if (url.startsWith("http://www.xinhuanet.com/school")) {
category = "school";
}
// china
if (url.startsWith("http://auto.china.com/")) {
category = "car";
} else if (url.startsWith("http://caifu.china.com/")) {
category = "finance";
} else if (url.startsWith("http://tech.china.com/zh_cn/news/net/")) {
category = "IT";
} else if (url.startsWith("http://health.china.com/")) {
category = "health";
} else if (url.startsWith("http://sports.china.com/")) {
category = "sports";
} else if (url.startsWith("http://goo66.china.com/")) {
category = "travel";
} else if (url.startsWith("http://edu.533.com/news/xiaoyuan/")) {
// Tested before the general edu.533.com prefix; otherwise this branch is unreachable.
category = "school";
} else if (url.startsWith("http://edu.533.com/")) {
category = "education";
} else if (url.startsWith("http://culture.china.com/")) {
category = "culture";
} else if (url.startsWith("http://military.china.com/")) {
category = "military";
} else if (url.startsWith("http://2008.china.com/")) {
category = "olympics";
} else if (url.startsWith("http://news.china.com/zh_cn/social/")
|| url.startsWith("http://news.china.com/zh_cn/domestic/")
|| url.startsWith("http://news.china.com/zh_cn/international/")) {
category = "society";
} else if (url.startsWith("http://china.soufun.com/")) {
category = "house";
} else if (url.startsWith("http://fun.china.com/zh_cn/star/")) {
category = "ent";
} else if (url.startsWith("http://meirong.533.com/")) {
category = "lady";
}
// sina.com.cn
if (url.startsWith("http://auto.sina.com.cn/")) {
category = "car";
} else if (url.startsWith("http://finance.sina.com.cn/")) {
category = "finance";
} else if (url.startsWith("http://tech.sina.com.cn/it/")) {
category = "IT";
} else if (url.startsWith("http://sina.kangq.com/")) {
category = "health";
} else if (url.startsWith("http://sports.sina.com.cn/")) {
category = "sports";
} else if (url.startsWith("http://tour.sina.com.cn/")) {
category = "travel";
} else if (url.startsWith("http://edu.sina.com.cn/j/")) {
// The more specific edu.sina.com.cn prefixes (employment, school) must be
// tested before the general education prefix, or they are never reached.
category = "employment";
} else if (url.startsWith("http://edu.sina.com.cn/y/")) {
category = "school";
} else if (url.startsWith("http://edu.sina.com.cn/")) {
category = "education";
} else if (url.startsWith("http://cul.book.sina.com.cn/")) {
category = "culture";
} else if (url.startsWith("http://mil.news.sina.com.cn/")) {
category = "military";
} else if (url.startsWith("http://2008.sina.com.cn/")) {
category = "olympics";
} else if (url.startsWith("http://news.sina.com.cn/society/")
|| url.startsWith("http://news.sina.com.cn/china/")
|| url.startsWith("http://news.sina.com.cn/world/")) {
category = "society";
} else if (url.startsWith("http://house.sina.com.cn/")) {
category = "house";
} else if (url.startsWith("http://ent.sina.com.cn/")) {
category = "ent";
} else if (url.startsWith("http://eladies.sina.com.cn/")) {
category = "lady";
}
// 163.com
if (url.startsWith("http://auto.163.com/")) {
category = "car";
} else if (url.startsWith("http://money.163.com/")) {
category = "finance";
} else if (url.startsWith("http://tech.163.com/it/")) {
category = "IT";
} else if (url.startsWith("http://163.39.net/")) {
category = "health";
} else if (url.startsWith("http://sports.163.com/")) {
category = "sports";
} else if (url.startsWith("http://war.163.com/")) {
category = "military";
} else if (url.startsWith("http://2008.163.com/")) {
category = "olympics";
} else if (url.startsWith("http://news.163.com/shehui/")
|| url.startsWith("http://news.163.com/domestic/")
|| url.startsWith("http://news.163.com/world/")) {
category = "society";
} else if (url.startsWith("http://house.163.com/")) {
category = "house";
} else if (url.startsWith("http://ent.163.com/")) {
category = "ent";
} else if (url.startsWith("http://lady.163.com/")) {
category = "lady";
}
// qq.com
if (url.startsWith("http://auto.qq.com/")) {
category = "car";
} else if (url.startsWith("http://finance.qq.com/")) {
category = "finance";
} else if (url.startsWith("http://tech.qq.com/a/")) {
category = "IT";
} else if (url.startsWith("http://sports.qq.com/")) {
category = "sports";
} else if (url.startsWith("http://edu.qq.com/job/")) {
// The employment check must come before the education check; the two cannot be swapped.
category = "employment";
} else if (url.startsWith("http://edu.qq.com/")) {
category = "education";
} else if (url.startsWith("http://cul.qq.com/")) {
category = "culture";
} else if (url.startsWith("http://mil.qq.com/")) {
category = "military";
} else if (url.startsWith("http://news.qq.com/a/")) {
category = "society";
} else if (url.startsWith("http://2008.qq.com/")) {
category = "olympics";
} else if (url.startsWith("http://house.qq.com/")) {
category = "house";
} else if (url.startsWith("http://ent.qq.com/")) {
category = "ent";
} else if (url.startsWith("http://lady.qq.com/")) {
category = "lady";
} else if (url.startsWith("http://campus.qq.com/")) {
category = "school";
}
// sohu.com
if (url.startsWith("http://auto.sohu.com/")) {
category = "car";
} else if (url.startsWith("http://business.sohu.com/")) {
category = "finance";
} else if (url.startsWith("http://it.sohu.com/")) {
category = "IT";
} else if (url.startsWith("http://health.sohu.com/")) {
category = "health";
} else if (url.startsWith("http://sports.sohu.com/")) {
category = "sports";
} else if (url.startsWith("http://travel.sohu.com/")) {
category = "travel";
} else if (url.startsWith("http://learning.sohu.com/")) {
category = "education";
} else if (url.startsWith("http://career.sohu.com/")) {
category = "employment";
} else if (url.startsWith("http://cul.sohu.com/")) {
category = "culture";
} else if (url.startsWith("http://news.sohu.com/")) {
category = "society";
} else if (url.startsWith("http://mil.news.sohu.com/")) {
category = "military";
} else if (url.startsWith("http://2008.sohu.com/")) {
category = "olympics";
} else if (url.startsWith("http://house.sohu.com/")) {
category = "house";
} else if (url.startsWith("http://yule.sohu.com/")) {
category = "ent";
} else if (url.startsWith("http://women.sohu.com/")) {
category = "lady";
}
return category;
}

public static void main(String args[]) {

category = CategoryDetector.detectCategory("http://edu.sina.com.cn/");
System.out.println(category);
}
}
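
The long if/else chain can also be written as a lookup table, which keeps the ordering rule (specific prefixes before general ones) visible in one place and avoids shared state between calls. The sketch below is an alternative, not the author's implementation; the class name PrefixCategoryTable is hypothetical and only a handful of the prefixes from the listing above are included:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical table-driven variant of the category detector (a subset of prefixes only). */
final class PrefixCategoryTable {

    // LinkedHashMap preserves insertion order, so specific prefixes listed first win.
    private static final Map<String, String> PREFIXES = new LinkedHashMap<>();

    static {
        PREFIXES.put("http://edu.sina.com.cn/j/", "employment");
        PREFIXES.put("http://edu.sina.com.cn/y/", "school");
        PREFIXES.put("http://edu.sina.com.cn/", "education");
        PREFIXES.put("http://news.sina.com.cn/society/", "society");
        PREFIXES.put("http://news.sina.com.cn/china/", "society");
        PREFIXES.put("http://news.sina.com.cn/world/", "society");
        PREFIXES.put("http://sports.sohu.com/", "sports");
    }

    /** Returns the category of the first matching prefix, or null when nothing matches. */
    static String detect(String url) {
        for (Map.Entry<String, String> entry : PREFIXES.entrySet()) {
            if (url.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null; // local result only: nothing is remembered between calls
    }

    public static void main(String[] args) {
        System.out.println(detect("http://edu.sina.com.cn/y/2008/a.html"));
        System.out.println(detect("http://edu.sina.com.cn/a.html"));
        System.out.println(detect("http://example.com/"));
    }
}
```

Adding a site then means adding rows to the table rather than another branch, and the three merged topics (domestic, international, society) simply map to the same value.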

 

 

 

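The DocBean class is as follows:

The remaining classes are all supporting classes, so they are not described one by one here. Their code is as follows:

The FileListViewer class:

The FileUtil class:

The DBUtil and SogouDBManager classes: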


Reposted from: https://www.cnblogs.com/sourcecode2014/p/3297648.html
