Big Data Collection and Preprocessing: In-Class Exercises and Answers

Uploaded by: 熊***   IP location: Shandong   Upload date: 2024-10-17   Format: DOCX   Pages: 14   Size: 28.16 KB


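Chapter 1

1. Create a project in PyCharm with a name of your choice, and implement a "Welcome to Python!" program in it.

Creating the project in PyCharm: double-click the PyCharm icon on the desktop to start PyCharm and choose "File" -> "New Project". In the Location text box of the dialog that opens, enter your project name together with the path where the project should be stored. Then move the mouse to the project root node, right-click, and choose "New" -> "Python File". This creates a Python file in PyCharm that uses Python 3.6 as the base interpreter. Give the file a name of your choice, type the following line in the code editor on the right, and run the file:

print("Welcome to Python!")

2. Create a project in PyCharm with a name of your choice. In it, define a list, add data to the list with the list method append(), and finally traverse and print the list with a for loop.

list1 = ['hello', 'world', 2020]
list1.append('python')  # add an element with append()
print("list1[3]:", list1[3])  # print the element that was added
for e in list1:  # loop over the list elements
    print(e)

Chapter 2

1. Import the requests library and use it to fetch the page data of the official Python website.

import requests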

req = requests.get('***.python***/')
req.encoding = 'utf-8'
print(req.text)

2. Import lxml and BeautifulSoup and use them to parse the page data fetched from the official Python website.

import requests
from bs4 import BeautifulSoup

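req = requests.get('***.python***/')
req.encoding = 'utf-8'
soup = BeautifulSoup(req.text, 'lxml')  # parse the HTML with the lxml parser
print(soup.title.string)  # text of the <title> tag
# CSS selector for the link inside the selected top navigation item
item = soup.select('#top > nav > ul > li.python-meta.current_item.selectedcurrent_branch.selected > a')
print(item)

Chapter 3

1. Use Python to read and write CSV and JSON data.

Reading CSV (use your own data):

import csv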

file_to_use = '学生信息.csv'
with open(file_to_use, 'r', encoding='utf-8', newline='') as f:
    r = csv.reader(f)
    file_header = next(r)  # the first row is the header
    print(file_header)
    for idx, file_header_col in enumerate(file_header):
        print(idx, file_header_col)  # column index and column name
    for row in r:
        if row[2] == '学号':
            print(row)

Writing CSV (use your own data):

import csv

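with open('学生信息.csv', 'a', encoding='utf-8', newline='') as f:
    wr = csv.writer(f)
    # writerows() writes several rows in one call
    wr.writerows([['大数据运维', 'hadoop', '高级技术员', '张三'], ['大数据开发', 'python', '中级技术员', '李四']])
    # writerow() writes one row per call
    wr.writerow(['大数据运维', 'hadoop', '高级技术员', '张三'])
    wr.writerow(['大数据开发', 'python', '中级技术员', '李四'])

2. Use Python to connect to MySQL, create a database and a table, and implement insert, delete, query, and update operations.

import pymysql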

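# Replace the password and the database name with your own
db = pymysql.connect(host="localhost", user="root", password="your_password", database="your_database")
cursor = db.cursor()
cursor.execute("DROP TABLE IF EXISTS employee")

# Create the table
sql = """CREATE TABLE `employee` (
    `id` int(10) NOT NULL AUTO_INCREMENT,
    `first_name` char(20) NOT NULL,
    `last_name` char(20) DEFAULT NULL,
    `age` int(11) DEFAULT NULL,
    `sex` char(1) DEFAULT NULL,
    `income` float DEFAULT NULL,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;"""
cursor.execute(sql)
print("Created table successfully.")

# Insert (the table name is written in lowercase throughout, because
# MySQL table names are case-sensitive on some platforms)
sql2 = """INSERT INTO employee (first_name,
    last_name, age, sex, income)
    VALUES ('Mac', 'Su', 20, 'M', 5000)"""
cursor.execute(sql2)
print("Inserted into table successfully.")

# Query
sql3 = """SELECT * FROM employee"""
cursor.execute(sql3)
print(cursor.fetchall())
print("Selected from table successfully.")

# Update
sql4 = """UPDATE employee SET first_name = 'Sam' WHERE id = 3"""
cursor.execute(sql4)
print("Updated table successfully.")

# Delete
sql5 = """DELETE FROM employee WHERE first_name = 'Mac'"""
cursor.execute(sql5)
print("Deleted from table successfully.")

db.commit()  # pymysql does not autocommit by default
db.close()

Chapter 4

1. Use an API provided by a business website to collect, clean, and store data.

import requests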

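import pymysql

api_url = '***//api.github***/search/repositories?q=spider'
req = requests.get(api_url)
print('Status code:', req.status_code)
req_dic = req.json()
print('Total number of repositories related to spider:', req_dic['total_count'])
print('Results incomplete:', req_dic['incomplete_results'])
req_dic_items = req_dic['items']
print('Number of items returned on this page:', len(req_dic_items))

# Cleaning: keep only the repository names and sort them
names = []
for key in req_dic_items:
    names.append(key['name'])
sorted_names = sorted(names)

# Create the database (replace the password and the names with your own)
db = pymysql.connect(host='localhost', user='root', password='your_password', port=3306)
cursor = db.cursor()
cursor.execute("CREATE DATABASE your_database DEFAULT CHARACTER SET utf8mb4")
db.close()

db2 = pymysql.connect(host='localhost', user='root', password='your_password', database='your_database', port=3306)
cursor2 = db2.cursor()
cursor2.execute("DROP TABLE IF EXISTS your_table")
sql1 = """CREATE TABLE `your_table` (
    `id` int(10) NOT NULL AUTO_INCREMENT,
    `full_name` char(20) NOT NULL,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;"""
cursor2.execute(sql1)
print("Created table successfully.")

# Count from 1: for an AUTO_INCREMENT column, MySQL replaces an id of 0
# with a generated value, which would then collide with the next id
for index, name in enumerate(sorted_names, start=1):
    print('Project index:', index, 'Project name:', name)
    sql2 = 'INSERT INTO your_table (id, full_name) VALUES (%s, %s)'
    try:
        cursor2.execute(sql2, (index, name))
        db2.commit()
    except pymysql.MySQLError:
        db2.rollback()
db2.close()

2. Analyze the structure and data of a specific page, use Python to collect its AJAX data, and store the results in a MySQL database.

from urllib.parse import urlencode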

import requests
import pymysql

original_url = '***.autohome***.cn/ashx/AjaxIndexHotCarByDsj.ashx?'
requests_headers = {
    'Referer': '***.autohome***.cn/beijing/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

# Create the database (replace the password with your own)
db = pymysql.connect(host='localhost', user='root', password='your_password', port=3306)
cursor = db.cursor()
cursor.execute("CREATE DATABASE AJAX DEFAULT CHARACTER SET utf8mb4")
db.close()

db2 = pymysql.connect(host='localhost', user='root', password='your_password', database='AJAX', port=3306)
cursor2 = db2.cursor()
cursor2.execute("DROP TABLE IF EXISTS ajax")
sql1 = """CREATE TABLE `ajax` (
    `car_name` char(20) NOT NULL,
    `id` int(10) NOT NULL AUTO_INCREMENT,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;"""
cursor2.execute(sql1)
print("Created table successfully.")


def get_one(cityid):
    # Request the AJAX endpoint for one city and return the parsed JSON
    p = {
        'cityid': cityid
    }
    complete_url = original_url + urlencode(p)
    try:
        # Send the browser-like headers along with the request
        response = requests.get(url=complete_url, headers=requests_headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Error', e.args)


def parse_three(json):
    # Print the name and id of every car series and store them in the ajax table
    if json:
        for i in json:
            for b in i.get('SeriesList'):
                item_list = b.get('Name')
                item_list2 = b.get('Id')
                print(item_list + ':' + str(item_list2))
                sql2 = 'INSERT INTO ajax (car_name, id) VALUES (%s, %s)'
                try:
                    cursor2.execute(sql2, (item_list, item_list2))
                    db2.commit()
                except pymysql.MySQLError:
                    db2.rollback()


if __name__ == '__main__':
    city_list = [{'北京': '110100'}, {'重庆': '500100'}, {'上海': '310100'}]
    for city in city_list:
        # Each dictionary maps a city name to its city id
        for city_id in city.values():
            jo = get_one(city_id)
            parse_three(jo)
    db2.close()

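# jo = get_one(110100)
# parse_one(jo)
# parse_two(jo)
# parse_three(jo)

# def parse_one(json):
#     if json:
#         for i in json:
#             item_list = i.get('Name')
#             print(item_list)

# def parse_two(json):
#     if json:
#         for i in json:
#             for b in i.get('SeriesList'):
#                 item_list = b.get('Name')
#                 print(item_list)

Chapter 5

I. Multiple-choice question

1. What is the main purpose of the Selenium library? ( )
A. Storing data
B. Automating browser operations and web page access
C. Visualizing data
D. Writing front-end web page code

II. True/false questions

2. WebDriverWait is one of the ways Selenium implements waiting for a condition; it can wait for a specific element to appear. ( )
3. When automating web pages with Selenium, there is no need to consider page load time or the order in which elements appear. ( )

Answers: 1. B   2. True   3. False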
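
Questions 2 and 3 both rest on the idea of an explicit wait: keep checking a condition until it holds or a time limit runs out. The sketch below illustrates that polling idea in plain Python so it can run without a browser. It is not Selenium's WebDriverWait; the names wait_until, WaitTimeout, and FakePage are hypothetical and exist only for this illustration.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition does not become true before the timeout."""


def wait_until(condition, timeout=5.0, poll_interval=0.1):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


class FakePage:
    """Hypothetical stand-in for a page whose element appears after a few polls."""

    def __init__(self, polls_until_ready):
        self.polls_left = polls_until_ready

    def find_search_box(self):
        if self.polls_left > 0:
            self.polls_left -= 1
            return None  # element not rendered yet
        return "search-box"


page = FakePage(polls_until_ready=3)
element = wait_until(page.find_search_box, timeout=2.0, poll_interval=0.01)
print(element)
```

A script that skips this kind of wait and looks an element up immediately may fail whenever the page is slower than the script, which is why statement 3 is false.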

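III. Practice question

Write Python code that uses Selenium to open the home page of a business website, type the keyword "Python编程" into the search box, and simulate a click on the search button.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Create the browser driver
browser = webdriver.Chrome()
# Open the Baidu home page
browser.get("***.baidu***")
# Locate the search box and type the keyword
search_box = browser.find_element(By.ID, "kw")
search_box.send_keys("Python编程")
# Locate the search button and click it
search_button = browser.find_element(By.ID, "su")
search_button.click()

Chapter 6

Use Scrapy to create a project, crawl the page data of a website, and save it to a MySQL database (the website is your choice).

SpiderDemo.py, the main spider code:

import scrapy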

# Import the item class defined in this project
from DemoAuto.items import DemoautoItem


class MyScr(scrapy.Spider):
    # Set a globally unique name
    name = 'DemoAuto'
    # URL to crawl
    start_urls = ['***.autohome***.cn/all/#pvareaid=3311229']

    # Parsing method
    def parse(self, response):
        # Select the data with XPath; the expressions depend on the page structure
        # First get the link element of each article entry
        for div in response.xpath('//*[@id="auto-channel-lazyload-article"]/ul/li/a'):
            # Create a fresh container for each entry
            item = DemoautoItem()
            # Title and summary text of the entry
            item['title'] = div.xpath('.//h3/text()').extract()[0].strip()
            item['content'] = div.xpath('.//p/text()').extract()[0].strip()
            # Return the item
            yield item

Items.py code:

import scrapy



class DemoautoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Stores the title
    title = scrapy.Field()
    # Stores the content
    content = scrapy.Field()

Middlewares.py code:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# ***//doc.scrapy***/en/latest/topics/spider-middleware.html

from scrapy import signals


class DemoautoSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)



class DemoautoDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

Pipelines.py code:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: ***//doc.scrapy***/en/latest/topics/item-pipeline.html

import json

import pymysql


class DemoautoPipeline(object):
    def __init__(self):
        # Open the output file
        self.file = open('data.json', 'w', encoding='utf-8')

    # Processes each item
    def process_item(self, item, spider):
        # Serialize the item's data as one JSON line
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        # Write it to the file
        self.file.write(line)
        # Return the item
        return item

    # Called when the spider is opened
    def open_spider(self, spider):
        pass

    # Called when the spider is closed
    def close_spider(self, spider):
        # Close the output file so that buffered lines are written out
        self.file.close()


def dbHandle():
    # Replace the password with your own
    conn = pymysql.connect(host="localhost", user="root", password="your_database_password", database="test")
    return conn


class MySQLPipeline(object):
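
The listing breaks off at the MySQLPipeline class header, and its body is not reproduced here. As an illustration only of how a storage pipeline is commonly laid out around open_spider(), process_item(), and close_spider(), the sketch below stores the title and content fields in a database. It is not the original class: it swaps MySQL for the standard-library sqlite3 module so that it runs without a server, and the class name SQLitePipelineSketch and the table name articles are assumptions made for the example.

```python
import sqlite3


class SQLitePipelineSketch:
    """Hypothetical storage pipeline; uses sqlite3 in place of MySQL."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Open the connection once and make sure the table exists
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS articles ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, content TEXT)"
        )

    def process_item(self, item, spider):
        # Parameterized insert; roll back if the row cannot be stored
        try:
            self.conn.execute(
                "INSERT INTO articles (title, content) VALUES (?, ?)",
                (item["title"], item["content"]),
            )
            self.conn.commit()
        except sqlite3.Error:
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        # Release the connection when the crawl ends
        self.conn.close()


# Drive the pipeline by hand with plain dictionaries instead of a real crawl
pipeline = SQLitePipelineSketch()
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "t1", "content": "c1"}, spider=None)
pipeline.process_item({"title": "t2", "content": "c2"}, spider=None)
rows = pipeline.conn.execute(
    "SELECT title, content FROM articles ORDER BY id"
).fetchall()
print(rows)
pipeline.close_spider(spider=None)
```

Opening the connection in open_spider() and closing it in close_spider() keeps one connection for the whole crawl instead of reconnecting for every item. A MySQL version would presumably obtain its connection from dbHandle() and use %s placeholders rather than ?.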
