Big Data Collection and Preprocessing (Li Junhan): Exercise Answers

Uploaded by: 大***; IP location: Sichuan; upload date: 2024-08-14; format: DOCX; pages: 18; size: 40.18 KB


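Chapter 1

1. Create a project in PyCharm with a name of your choice, and implement a "Welcome to Python!" program in it.

Create the project in PyCharm. Double-click the PyCharm icon on the desktop to start PyCharm and choose "File" -> "New Project". In the Location text box of the dialog that opens, enter your project name and the path where the project should be stored. Then move the mouse to the project root node, right-click, and choose "New" -> "Python File". This creates a Python file in PyCharm that uses the Python 3.6 base interpreter as its programming environment. Give the file a name of your choice, type the following in the code editor on the right, and run the file:

print("Welcome to Python!")

2. Create a project in PyCharm with a name of your choice. In it, define a list, add data to the list with the list method append(), and finally print the elements with a for loop.

list1 = ['hello', 'world', 2020]
list1.append('python')  # add an element with append()
print("list1[3]:", list1[3])  # print the element that was just added
for e in list1:  # iterate over the list elements
    print(e)

Chapter 2

1. Import the requests library and use it to fetch the page data of the official Python website.

import requests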

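# The host of this URL is missing in the source; use the full address of the site.
req = requests.get('/')
req.encoding = 'utf-8'
print(req.text)

2. Import lxml and BeautifulSoup and use them to parse the page data fetched from the official Python website.

import requests
from bs4 import BeautifulSoup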

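# The host of this URL is missing in the source; use the full address of the site.
req = requests.get('/')
req.encoding = 'utf-8'
soup = BeautifulSoup(req.text, 'lxml')  # parse with the lxml parser
print(soup.title.string)  # text of the <title> tag
# select the link inside the highlighted item of the top navigation bar
item = soup.select('#top > nav > ul > li.python-meta.current_item.selectedcurrent_branch.selected > a')
print(item)

Chapter 3

1. Read and write CSV and JSON data with Python.

Read a CSV file (the data is user-defined).

import csv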

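file_to_use = '学生信息.csv'  # student information file
with open(file_to_use, 'r', encoding='utf-8', newline='') as f:
    r = csv.reader(f)
    file_header = next(r)  # the first row is the header
    print(file_header)
    # print each column index with its header name
    for index, file_header_col in enumerate(file_header):
        print(index, file_header_col)
    # print the rows whose third column equals '学号' (student ID)
    for row in r:
        if row[2] == '学号':
            print(row)

Write a CSV file (the data is user-defined).

import csv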

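# newline='' keeps the csv module from writing blank rows on Windows
with open('学生信息.csv', 'a', encoding='utf-8', newline='') as f:
    wr = csv.writer(f)
    # writerows() writes several rows at once
    wr.writerows([['大数据运维', 'hadoop', '高级技术员', '张三'], ['大数据开发', 'python', '中级技术员', '李四']])
    # writerow() writes one row per call
    wr.writerow(['大数据运维', 'hadoop', '高级技术员', '张三'])
    wr.writerow(['大数据开发', 'python', '中级技术员', '李四'])

2. Connect to MySQL with Python, create a database and a table, and implement insert, delete, query, and update operations.

import pymysql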

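# PyMySQL 1.0 and later accept keyword arguments only
db = pymysql.connect(host="localhost", user="root", password="your_password", database="your_database")
cursor = db.cursor()
cursor.execute("DROP TABLE IF EXISTS employee")

# create the table
sql = """CREATE TABLE `employee` (
  `id` int(10) NOT NULL AUTO_INCREMENT,
  `first_name` char(20) NOT NULL,
  `last_name` char(20) DEFAULT NULL,
  `age` int(11) DEFAULT NULL,
  `sex` char(1) DEFAULT NULL,
  `income` float DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;"""
cursor.execute(sql)
print("Created table successfully.")

# insert (the table name is written in lower case throughout,
# because table names are case-sensitive on many MySQL installations)
sql2 = """INSERT INTO employee (first_name,
    last_name, age, sex, income)
    VALUES ('Mac', 'Su', 20, 'M', 5000)"""
cursor.execute(sql2)
db.commit()  # InnoDB changes are not saved until they are committed
print("Inserted row successfully.")

# query
sql3 = """SELECT * FROM employee"""
cursor.execute(sql3)
print(cursor.fetchall())  # show the rows that were found
print("Selected rows successfully.")

# update
sql4 = """UPDATE employee SET first_name = 'Sam' WHERE id = 3"""
cursor.execute(sql4)
db.commit()
print("Updated table successfully.")

# delete
sql5 = """DELETE FROM employee WHERE first_name = 'Mac'"""
cursor.execute(sql5)
db.commit()
print("Deleted rows successfully.")
db.close()

Chapter 4

1. Use the API provided by a business website to collect, clean, and store data.

import requests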

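import pymysql

# The host of this URL is missing in the source; use the full address of the API.
api_url = '/search/repositories?q=spider'
req = requests.get(api_url)
print('Status code:', req.status_code)
req_dic = req.json()
print('Total number of repositories related to spider:', req_dic['total_count'])
print('Results incomplete:', req_dic['incomplete_results'])
req_dic_items = req_dic['items']
print('Number of items returned on this page:', len(req_dic_items))

# collect the repository names and sort them
names = []
for key in req_dic_items:
    names.append(key['name'])
sorted_names = sorted(names)

# create the database (your_database and your_table are placeholders)
db = pymysql.connect(host='localhost', user='root', password='your_password', port=3306)
cursor = db.cursor()
cursor.execute("CREATE DATABASE IF NOT EXISTS your_database DEFAULT CHARACTER SET utf8mb4")
db.close()

# reconnect to the new database and create the table
db2 = pymysql.connect(host="localhost", user="root", password="your_password", database="your_database", port=3306)
cursor2 = db2.cursor()
cursor2.execute("DROP TABLE IF EXISTS your_table")
# varchar(255) leaves room for names longer than 20 characters
sql1 = """CREATE TABLE `your_table` (
  `id` int(10) NOT NULL AUTO_INCREMENT,
  `full_name` varchar(255) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;"""
cursor2.execute(sql1)
print("Created table successfully.")

# store each name; the index starts at 1 so that it can serve as the row id
for index, name in enumerate(sorted_names, start=1):
    print('Project index:', index, 'Project name:', name)
    sql2 = 'INSERT INTO your_table (id, full_name) VALUES (%s, %s)'
    try:
        cursor2.execute(sql2, (index, name))
        db2.commit()
    except pymysql.MySQLError:
        db2.rollback()
db2.close()

2. Analyze the structure and data of a specific page, use Python to collect AJAX data, and store the results in a MySQL database.

from urllib.parse import urlencode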

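import requests
import pymysql

# The hosts of the URL and of the Referer are missing in the source; use the full addresses of the site.
original_url = '/ashx/AjaxIndexHotCarByDsj.ashx?'
requests_headers = {
    'Referer': '/beijing/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

# create the database
db = pymysql.connect(host='localhost', user='root', password='your_password', port=3306)
cursor = db.cursor()
cursor.execute("CREATE DATABASE IF NOT EXISTS AJAX DEFAULT CHARACTER SET utf8mb4")
db.close()

# reconnect to the new database and create the table
db2 = pymysql.connect(host="localhost", user="root", password="your_password", database="AJAX", port=3306)
cursor2 = db2.cursor()
cursor2.execute("DROP TABLE IF EXISTS ajax")
sql1 = """CREATE TABLE `ajax` (
  `car_name` char(20) NOT NULL,
  `id` int(10) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;"""
cursor2.execute(sql1)
print("Created table successfully.")


def get_one(cityid):
    """Request the AJAX data for one city and return the parsed JSON."""
    p = {
        'cityid': cityid
    }
    complete_url = original_url + urlencode(p)
    try:
        # the dictionary holds request headers, so it is passed as headers=
        response = requests.get(url=complete_url, headers=requests_headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Error', e.args)


def parse_three(data):
    """Print every car series and store its name and id in MySQL."""
    if data:
        for i in data:
            for b in i.get('SeriesList'):
                item_list = b.get('Name')
                item_list2 = b.get('Id')
                print(item_list + ':' + str(item_list2))
                sql2 = 'INSERT INTO ajax (car_name, id) VALUES (%s, %s)'
                try:
                    cursor2.execute(sql2, (item_list, item_list2))
                    db2.commit()
                except pymysql.MySQLError:
                    db2.rollback()


if __name__ == '__main__':
    city_list = [{'北京': '110100'}, {'重庆': '500100'}, {'上海': '310100'}]
    for city in city_list:
        # each dictionary maps one city name to its city id
        for cityid in city.values():
            jo = get_one(cityid)
            parse_three(jo)
    db2.close()

# Commented-out test calls and simpler parsers:
# jo = get_one(110100)
# parse_one(jo)
# parse_two(jo)
# parse_three(jo)

# def parse_one(json):
#     if json:
#         for i in json:
#             item_list = i.get('Name')
#             print(item_list)

# def parse_two(json):
#     if json:
#         for i in json:
#             for b in i.get('SeriesList'):
#                 item_list = b.get('Name')
#                 print(item_list)

Chapter 5

I. Multiple choice

1. What is the main purpose of the Selenium library? ( )
A. Storing data
B. Automating browser operations and web page access
C. Data visualization
D. Writing front-end web page code

II. True or false

2. WebDriverWait is one of the ways Selenium implements wait conditions; it can wait for a specific element to appear. ( )
3. When automating web pages with Selenium, there is no need to consider page load time or the order in which elements appear. ( )

Answers: 1. B; 2. True; 3. False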
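
Question 2 above describes WebDriverWait as a way to wait for a condition. The idea behind such an explicit wait can be sketched without a browser: call a condition repeatedly until it returns a truthy value or a deadline passes. The following is a minimal sketch under that assumption; `wait_until`, `WaitTimeout`, and `FakePage` are hypothetical names made up for illustration and are not part of Selenium.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy before the deadline."""


def wait_until(condition, timeout=2.0, poll_interval=0.05):
    """Call condition() repeatedly until it returns a truthy value.

    Hypothetical helper that mimics the polling idea of an explicit wait;
    it is not Selenium code.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


class FakePage:
    """Stand-in for a page whose element appears only after a few polls."""

    def __init__(self, appears_after):
        self.polls = 0
        self.appears_after = appears_after

    def find_search_box(self):
        # Return None until the element has "loaded", then return it.
        self.polls += 1
        return "search-box" if self.polls >= self.appears_after else None


page = FakePage(appears_after=3)
element = wait_until(page.find_search_box)
print(element, page.polls)
```

With Selenium itself the same pattern is usually written with `WebDriverWait(browser, timeout).until(...)` and a condition from the `expected_conditions` module, which is why question 3 is false: page load time and element order do matter.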

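III. Practice

Write Python code that uses Selenium to open the home page of a business website, type the keyword "Python编程" (Python programming) into the search box, and simulate a click on the search button.

from selenium import webdriver
from selenium.webdriver.common.by import By

# create the browser driver
browser = webdriver.Chrome()
# open the Baidu home page (the URL is missing in the source; supply the full address)
browser.get("")
# locate the search box and type the keyword
search_box = browser.find_element(By.ID, "kw")
search_box.send_keys("Python编程")
# locate the search button and click it
search_button = browser.find_element(By.ID, "su")
search_button.click()

Chapter 6

Create a project with Scrapy, crawl the page data of a website, and save it to a MySQL database (the website is your choice).

Main spider code, SpiderDemo.py:

import scrapy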

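# import the project's item class
from DemoAuto.items import DemoautoItem


class MyScr(scrapy.Spider):
    # globally unique spider name
    name = 'DemoAuto'
    # address to crawl (the host is missing in the source; use the full address)
    start_urls = ['/all/#pvareaid=3311229']

    # crawl method
    def parse(self, response):
        # Select the data with XPath; the expressions depend on the page structure.
        # First get the <a> element of each entry.
        for div in response.xpath('//*[@id="auto-channel-lazyload-article"]/ul/li/a'):
            # create a new container for each entry so that items are not overwritten
            item = DemoautoItem()
            # get the title and the summary of the entry
            item['title'] = div.xpath('.//h3/text()').get(default='').strip()
            item['content'] = div.xpath('.//p/text()').get(default='').strip()
            # return the item
            yield item

Code of Items.py:

import scrapy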

class DemoautoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # title and body text of each entry
    title = scrapy.Field()
    content = scrapy.Field()

Code of Middlewares.py:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# /en/latest/topics/spider-middleware.html

from scrapy import signals


class DemoautoSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class DemoautoDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

Code of Pipelines.py:

# -*- coding: utf-8 -*-

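# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: /en/latest/topics/item-pipeline.html

import json

import pymysql


class DemoautoPipeline(object):
    def __init__(self):
        # open the output file
        self.file = open('data.json', 'w', encoding='utf-8')

    # process each item
    def process_item(self, item, spider):
        # serialize the item data as one JSON line
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        # write it to the file
        self.file.write(line)
        # return the item so that later pipelines receive it
        return item

    # called when the spider is opened
    def open_spider(self, spider):
        pass

    # called when the spider is closed
    def close_spider(self, spider):
        # close the output file so that buffered lines are flushed
        self.file.close()


def dbHandle():
    # PyMySQL 1.0 and later accept keyword arguments only
    conn = pymysql.connect(host="localhost", user="root", password="your_database_password", database="test")
    return conn


class MySQLPipeline(object):
    def process_item(self, item, spider):
        dbObject = dbHandle()
        cursor = dbObject.cursor()
        sql = 'insert into table123 (title, content) values (%s, %s)'
        try:
            cursor.execute(sql, (item['title'], item['content']))
            dbObject.commit()
        except pymysql.MySQLError:
            dbObject.rollback()
        finally:
            # release the connection opened for this item
            dbObject.close()
        return item

Code of Settings.py:

# -*- coding: utf-8 -*-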

# Scrapy settings for DemoAuto project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     /en/latest/topics/settings.html
#     /en/latest/topics/downloader-middleware.html
#     /en/latest/topics/spider-middleware.html

BOT_NAME = 'DemoAuto'

SPIDER_MODULES = ['DemoAuto.spiders']
NEWSPIDER_MODULE = 'DemoAuto.spiders'

ITEM_PIPELINES = {
    'DemoAuto.pipelines.DemoautoPipeline': 300,  # save to a file
    'DemoAuto.pipelines.MySQLPipeline': 300,  # save to the MySQL database
}

MYSQL_HOST = 'localhost'
MYSQL_DATABASE = 'test'
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'your_database_password'
MYSQL_PORT = 3306

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'DemoAuto (+)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
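
The integer values in ITEM_PIPELINES decide the order in which items pass through the pipelines: Scrapy runs them from the lowest value to the highest, and values are customarily chosen in the 0 to 1000 range. Both pipelines in the settings file share the value 300, so the numbers alone do not say which one runs first. The sketch below illustrates the ordering rule in plain Python; it is not Scrapy's own loader, `run_order` is a hypothetical helper, and the value 400 is an assumption chosen only to make the order explicit.

```python
# Plain-Python illustration of priority ordering; not Scrapy's own loader.
item_pipelines = {
    'DemoAuto.pipelines.MySQLPipeline': 400,      # assumed value, for illustration only
    'DemoAuto.pipelines.DemoautoPipeline': 300,
}


def run_order(pipelines):
    """Return the pipeline paths sorted from lowest to highest priority value."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = run_order(item_pipelines)
for position, path in enumerate(order, start=1):
    print(position, path)
```

Under these assumed values the file pipeline receives each item first and the MySQL pipeline second, regardless of the order in which the dictionary lists them.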

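Chapter 7

1. Write web crawler code that collects second-hand housing data from a business website, clean the data, and store it in a database.
2. Use ECharts and Flask to build a front-end and back-end visualization; the chart types are your choice.

Answer: see the Chapter 7 case study in the main text.

Chapter 1

I. Multiple choice

1. Which of the following is not a category of programming language? ( )
A. Machine language
B. Assembly language
C. High-level language
D. Interpreted language

2. Which of the following Python statements is correct? ( )
A. min = x if x < y else y
B. max = x > y ? x : y
C. if (x > y) print x
D. while True: pass

3. Which of the following statements cannot create a dictionary? ( )
A. dict1 = {}
B. dict2 = {3: 5}
C. dict3 = {[1, 2, 3]: "uestc"}
D. dict4 = {(1, 2, 3): "uestc"}

II. True or false

4. The extension of a module file is not necessarily .py. ( )
5. Both strings and lists support the membership operator (in) and the length function (len()). ( )

Answers: 1. D; 2. D (option A is also syntactically valid Python); 3. C; 4. False; 5. True

Chapter 2

I. Multiple choice

1. Which of the following are possible risks of crawler technology? ( )
A. Heavy consumption of the crawled website's resources
B. Harmful consequences of obtaining a website's sensitive information
C. Violating the website's crawling rules
D. All of the above

2. Which of the following denotes a text type? ( )
A. <head>
B. <html>
C. <meta>
D. <title>

3. A Tag has many methods and attributes. Which of the following is not one of the most important attributes of a Tag? ( )
A. name
B. attributes
C. string
D. type

II. True or false

4. Vertical crawlers mainly target the crawling of precise information over a wide range. ( )
5. A URL contains information that indicates the location of a file and how the browser should handle it; every file on the Internet has a unique URL. ( )

Answers: 1. D; 2. B; 3. D; 4. False; 5. True

Chapter 3

I. Multiple choice

1. Which of the following four items is not a characteristic of a database system? ( )
A. Data sharing
B. Data integrity
C. High data redundancy
D. High data independence

2. The data independence of a database system means that ( )
A. changes in the data do not affect application programs
B. changes in the system's data storage structure and logical data structure do not affect application programs
C. changes in the storage strategy do not affect the storage structure
D. changes in some storage structures do not affect other storage structures

3. Logically, data structures can be divided into ( )
A. dynamic structures and static structures
B. compact structures and non-compact structures
C. linear structures and nonlinear structures
D. internal structures and external structures

II. True or false

4. Data stores reflect the data at rest in a system and exhibit the characteristics of static data. ( )
5. Every data structure supports three basic operations: insertion, deletion, and search. ( )

Answers: 1. C; 2. B; 3. C; 4. True; 5. False

Chapter 4

I. Multiple choice

1. Which statement about strings is correct? ( )
A. A string is a primitive data type
B. String values are stored in stack memory
C. A string value can be changed after it is initialized
D. A string value never changes once it is initialized

2. Which method can split a string? ( )
A. indexOf()
B. substring()
C. split()
D. trim()

3. The purpose of the getBytes() method of the String class is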
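
Questions 2 and 3 of the Chapter 1 quiz can be checked directly in Python. The sketch below is one way to do so and is not part of the original answers: `ast.parse` reports whether each option of question 2 is syntactically valid, and evaluating each literal of question 3 shows which one fails because a list is unhashable and so cannot serve as a dictionary key. The helper names `is_valid_syntax` and `can_build` are hypothetical.

```python
import ast


def is_valid_syntax(source):
    """Return True if the source text parses as Python code."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# Options of Chapter 1, question 2.
q2_options = {
    'A': 'min = x if x < y else y',
    'B': 'max = x > y ? x : y',
    'C': 'if (x > y) print x',
    'D': 'while True: pass',
}
syntax_ok = {label: is_valid_syntax(src) for label, src in q2_options.items()}
print(syntax_ok)


def can_build(expression):
    """Return True if the dictionary literal can actually be created."""
    try:
        eval(expression)
        return True
    except TypeError:  # raised for unhashable keys such as lists
        return False


# Right-hand sides of the options of Chapter 1, question 3.
q3_options = {
    'A': '{}',
    'B': '{3: 5}',
    'C': '{[1, 2, 3]: "uestc"}',
    'D': '{(1, 2, 3): "uestc"}',
}
buildable = {label: can_build(src) for label, src in q3_options.items()}
print(buildable)
```

Options A and D of question 2 parse, while B and C raise SyntaxError; in question 3 only option C fails, because a tuple is hashable and a list is not.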
