How to Use C# to Fetch Baidu Search Results and Resolve the Target URLs

Uploaded by: 0*** · IP location: Hubei · Uploaded: 2022-01-07 · Format: DOCX · Pages: 7 · Size: 377.48KB

Document Summary

First, we analyze Baidu's search results page. The marked part of the figure shows that all search results sit inside the element with id="content_left", that each result item is an element with class="result c-container", and that the title of each item is wrapped in an h3 tag, as shown in the figure below:

This gives us the approach:

1. Fetch the full HTML of the Baidu results page for the keyword.
2. Use a regular expression to match the HTML of the results container.
3. Use a regular expression to match the HTML of each result item.
4. Extract the title and the link from each item.

Let's get straight to it. See the code below:

    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Web;
    using System.Net;
    using System.IO;

    namespace BaiduSearchTest
    {
        struct BaiduEntry
        {
            public string title, brief, link;
        }

        class Program
        {
            static string GetHtml(string keyword)
            {
                string url = "...";  // the base search URL is missing from the source text
                string encodedKeyword = HttpUtility.UrlEncode(keyword, Encoding.GetEncoding(936));
                // Baidu uses code page 936 to encode the query string; it clearly focuses on Chinese search,
                // not to mention that it is rather fond of Microsoft.
                // Google correctly recognizes both UTF-8 and code-page encodings,
                // although its own pages declare UTF-8 in the HTTP headers.
                // Presumably Google does not dislike Microsoft (or Microsoft's proprietary specifications) either.
                string query = "s?wd=" + encodedKeyword;

                HttpWebRequest req;
                HttpWebResponse response;
                Stream stream;
                req = (HttpWebRequest)WebRequest.Create(url + query);
                response = (HttpWebResponse)req.GetResponse();
                stream = response.GetResponseStream();

                int count = 0;
                byte[] buf = new byte[8192];
                string decodedString = null;
                StringBuilder sb = new StringBuilder();
                try
                {
                    Console.WriteLine("Reading the contents of {0}", url + query);
                    do
                    {
                        count = stream.Read(buf, 0, buf.Length);
                        if (count > 0)
                        {
                            // Caveat: decoding chunk by chunk can split a multi-byte
                            // UTF-8 character that straddles a buffer boundary.
                            decodedString = Encoding.GetEncoding("utf-8").GetString(buf, 0, count);
                            sb.Append(decodedString);
                        }
                    } while (count > 0);
                }
                catch
                {
                    Console.WriteLine("Network connection failed. Please check your network settings.");
                }
                return sb.ToString();
            }

            static void PrintResult(List<BaiduEntry> entries)
            {
                int count = 0;
                entries.ForEach(delegate(BaiduEntry entry)
                {
                    Console.WriteLine("Found Baidu search result number {0}:", count += 1);
                    if (entry.link != null)
                    {
                        Console.WriteLine("Found a link:");
                        Console.WriteLine(entry.link);
                    }
                    if (entry.title != null)
                    {
                        Console.WriteLine("Title:");
                        Console.WriteLine(entry.title);
                    }
                    if (entry.brief != null)
                    {
                        Console.WriteLine("Summary:");
                        Console.WriteLine(entry.brief);
                    }
                    Program.Cut();
                });
            }

            static void simpleOutput()
            {
                string html = "<table><tr><td><font>test</font><a>hello</a><br></td></tr></table>";
                Console.WriteLine(RemoveSomeTags(html));
            }

            // Removes void tags, which have no closing counterpart.
            static string RemoveVoidTag(string html)
            {
                string[] filter = { "<br>" };
                foreach (string tag in filter)
                {
                    html = html.Replace(tag, "");
                }
                return html;
            }

            // Removes selected opening and closing tags but keeps the text between them.
            static string ReleaseXmlTags(string html)
            {
                string[] filter = { "<a.*?>", "</a>", "<em>", "</em>", "<b>", "</b>", "<font.*?>", "</font>" };
                foreach (string tag in filter)
                {
                    html = Regex.Replace(html, tag, "");
                }
                return html;
            }

            static string RemoveSomeTags(string html)
            {
                html = RemoveVoidTag(html);
                html = ReleaseXmlTags(html);
                return html;
            }

            static void Cut()
            {
                Console.WriteLine("");
            }

            static void MainProc(string input)
            {
                MainProc(input, false);
            }

            static void MainProc(string input, bool tagsForBrief)
            {
                // Each result title is wrapped in an <h3> element.
                Regex r = new Regex(@"<h3[\s\S]*?</h3>", RegexOptions.IgnoreCase);
                MatchCollection matchCollection = r.Matches(input);
                List<string> collection = new List<string>();
                foreach (Match m in matchCollection)
                {
                    // The title is the inner content of the <a> tag.
                    string textReg = @"<a[\s]*[^>]*>([\s\S]+?)</a>";
                    MatchCollection textMatchCollection = Regex.Matches(m.Value, textReg, RegexOptions.IgnoreCase);
                    foreach (Match match in textMatchCollection)
                    {
                        if (match.Success)
                        {
                            Console.Write(match.Result("$1"));
                        }
                    }

                    // The link is the first http:// URL inside the <h3> element.
                    string LinkReg = @"http://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
                    MatchCollection linkMatchCollection = Regex.Matches(m.Value, LinkReg, RegexOptions.IgnoreCase);
                    foreach (Match match in linkMatchCollection)
                    {
                        if (match.Success)
                        {
                            Console.Write(match.Groups[0].Value);
                        }
                    }
                }
            }

            public static void Main(string[] args)
            {
                Console.WriteLine("Please enter a keyword.");
                string keyword;
                keyword = Console.ReadLine();
                Console.WriteLine("Fetching results from Baidu, please wait");
                string input;
                input = GetHtml(keyword);
                // Keep only the results container.
                Regex r = new Regex(@"<div id=""content_left""[\s\S]*</div><div style=""clear:both;height:0;""></div>", RegexOptions.IgnoreCase);
                input = r.Match(input).Value;
                MainProc(input);
                Console.ReadKey(true);
            }
        }
    }

The program output is shown in the figure below:

The example above shows how to fetch Baidu search result items with .NET/C#. The program can be used as is. If it returns no results, the structure of Baidu's results page has changed, so adjust the regular expressions following the same approach.

At this point, the URLs in the search results are all Baidu redirect URLs, and they all begin with the same prefix [prefix missing in the source]. A little more work is needed to get the real URL. Issuing a GET request from C# shows that the redirect URL actually serves a page that contains the target URL. One more regular expression then extracts the URL of the real target page.

    public static string GetTheRedirectUrl(string originalAddress)
    {
        string redirectUrl = "";
        HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(originalAddress);
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        redirectUrl = (String)response.ResponseUri.AbsoluteUri;
        Stream ReceiveStream = response.GetResponseStream();
        Encoding encode = System.Text.Encoding.GetEncoding("utf-8");
        StreamReader readStream = new StreamReader(ReceiveStream, encode);
        Char[] read = new Char[256];
        // Read 256 characters at a time.
        int count = readStream.Read(read, 0, 256);
        Console.WriteLine("HTML...\r\n");
        while (count > 0)
        {
            String str = new String(read, 0, count);
            count = readStream.Read(read, 0, 256);
            String patternstr = @"url=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))""></noscript>";
            Regex pattern = new Regex(patternstr, RegexOption

[The source text breaks off here.]
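Steps 3 and 4 of the approach, together with the final redirect-resolution step, can be tried offline on a hard-coded HTML string before touching the network. The following is only a minimal sketch under stated assumptions, not the original author's code: the class `ResultParser` and its methods `ParseTitlesAndLinks` and `ExtractRedirectTarget` are hypothetical names, the hosts `redirect.example` and `target.example` are placeholders, and the `URL='...'` shape of the intermediate page is assumed for illustration rather than taken from a live Baidu response.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original program.
public static class ResultParser
{
    // One <h3>...</h3> block per search result title.
    static readonly Regex H3Block =
        new Regex(@"<h3[\s\S]*?</h3>", RegexOptions.IgnoreCase);

    // Group 1: the href value. Group 2: the inner HTML of the <a> element.
    static readonly Regex Anchor = new Regex(
        @"<a\s[^>]*?href\s*=\s*""([^""]*)""[^>]*>([\s\S]+?)</a>",
        RegexOptions.IgnoreCase);

    // Any remaining tag inside the title, such as <em>.
    static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    // Assumed shape of the intermediate page: ... URL='target' ...
    static readonly Regex MetaRefresh =
        new Regex(@"URL='([^']+)'", RegexOptions.IgnoreCase);

    // Returns (title, link) pairs, one per <h3> block that contains a link.
    public static List<KeyValuePair<string, string>> ParseTitlesAndLinks(string html)
    {
        var results = new List<KeyValuePair<string, string>>();
        foreach (Match block in H3Block.Matches(html))
        {
            Match a = Anchor.Match(block.Value);
            if (!a.Success) continue;
            string title = AnyTag.Replace(a.Groups[2].Value, "").Trim();
            results.Add(new KeyValuePair<string, string>(title, a.Groups[1].Value));
        }
        return results;
    }

    // Returns the target URL found in the intermediate page,
    // or an empty string when the assumed pattern is absent.
    public static string ExtractRedirectTarget(string pageHtml)
    {
        Match m = MetaRefresh.Match(pageHtml);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}

public static class Demo
{
    public static void Main()
    {
        // Made-up sample markup that mimics the structure described in the text.
        string sample =
            "<div id=\"content_left\">" +
            "<h3 class=\"t\"><a href=\"http://redirect.example/link?url=abc\" target=\"_blank\">" +
            "<em>C#</em> tutorial</a></h3>" +
            "<h3 class=\"t\"><a href=\"http://redirect.example/link?url=def\">Regex basics</a></h3>" +
            "</div>";

        foreach (KeyValuePair<string, string> entry in ResultParser.ParseTitlesAndLinks(sample))
        {
            Console.WriteLine("{0} -> {1}", entry.Key, entry.Value);
        }

        // Made-up intermediate page in the assumed meta-refresh form.
        string wrapper =
            "<noscript><META http-equiv=\"refresh\" " +
            "content=\"0;URL='http://target.example/page'\"></noscript>";
        Console.WriteLine(ResultParser.ExtractRedirectTarget(wrapper));
    }
}
```

Keeping the parsing separate from the HTTP calls means the regular expressions can be adjusted against saved pages when the page structure changes, which is the failure case the text warns about.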
