版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、Chapter 22: Advanced Querying and Information Retrieval陕西师范大学计算机科学学院数据库原理与应用课程组1Copyright: Silberschatz, Korth and SudarshanChapter 22: Advanced Querying Chapter 22: Advanced Querying and Information RetrievalDecision-Support SystemsData Analysis OLAPExtended aggregation features in SQLWindowing and
2、 rankingData Mining Data Warehousing Information-Retrieval Systems Including Web search2Copyright: Silberschatz, Korth and SudarshanChapter 22: Advanced Querying Decision Support SystemsDecision-support systems are used to make business decisions often based on data collected by on-line transaction-
3、processing systems.Examples of business decisions:what items to stock?What insurance premium to change?Who to send advertisements to?Examples of data used for making decisionsRetail sales transaction detailsCustomer profiles (income, age, sex, etc.)3Copyright: Silberschatz, Korth and SudarshanDecisi
4、on Support SystemsDecisiDecision-Support Systems: OverviewData analysis tasks are simplified by specialized tools and SQL extensionsExample tasksFor each product category and each region, what were the total sales in the last quarter and how do they compare with the same quarter last yearAs above, f
5、or each product category and each customer categoryStatistical analysis packages (e.g., : S+) can be interfaced with databasesStatistical analysis is a large field will not study it hereData mining seeks to discover knowledge automatically in the form of statistical rules and patterns from Large dat
6、abases.A data warehouse archives information gathered from multiple sources, and stores it under a unified schema, at a single site.Important for large businesses which generate data from multiple divisions, possibly at multiple sitesData may also be purchased externally4Copyright: Silberschatz, Kor
7、th and SudarshanDecision-Support Systems: OverData Analysis and OLAPAggregate functions summarize large volumes of dataOnline Analytical Processing (OLAP)Interactive analysis of data, allowing data to be summarized and viewed in different ways in an online fashion (with negligible delay)Data that ca
8、n be modeled as dimension attributes and measure attributes are called multidimensional data.Given a relation used for data analysis, we can identify some of its attributes as measure attributes, since they measure some value, and can be aggregated upon. For instance, the attribute number of the sal
9、es relation is a measure attribute, since it measures the number of units sold.Some of the other attributes of the relation are identified as dimension attributes, since they define the dimensions on which measure attributes, and summaries of measure attributes, are viewed.5Copyright: Silberschatz,
10、Korth and SudarshanData Analysis and OLAPAggregatCross Tabulation of sales by item-name and colorThe table above is an example of a cross-tabulation (cross-tab), also referred to as a pivot-table.A cross-tab is a table where values for one of the dimension attributes form the row headers, values for
11、 another dimension attribute form the column headersOther dimension attributes are listed on topValues in individual cells are (aggregates of) the values of the dimension attributes that specify the cell.6Copyright: Silberschatz, Korth and SudarshanCross Tabulation of sales by iRelational Representa
12、tion of CrosstabsCrosstabs can be represented as relationsThe value all is used to represent aggregatesThe SQL:1999 standard actually uses null values in place of allMore on this later. 7Copyright: Silberschatz, Korth and SudarshanRelational Representation of CThree-Dimensional Data CubeA data cube
13、is a multidimensional generalization of a crosstabCannot view a three-dimensional object in its entirety but crosstabs can be used as views on a data cube8Copyright: Silberschatz, Korth and SudarshanThree-Dimensional Data CubeA dOnline Analytical ProcessingThe operation of changing the dimensions us
14、ed in a cross-tab is called pivotingSuppose an analyst wishes to see a cross-tab on item-name and color for a fixed value of size, for example, large, instead of the sum across all sizes. Such an operation is referred to as slicing. The operation is sometimes called dicing, particularly when values
15、for multiple dimensions are fixed.The operation of moving from finer-granularity data to a coarser granularity is called a rollup.The opposite operation - that of moving from coarser-granularity data to finer-granularity data is called a drill down.9Copyright: Silberschatz, Korth and SudarshanOnline
16、 Analytical ProcessingThHierarchies on DimensionsHierarchy on dimension attributes: lets dimensions to be viewed at different levels of detailE.g. the dimension DateTime can be used to aggregate by hour of day, date, day of week, month, quarter or year10Copyright: Silberschatz, Korth and SudarshanHi
17、erarchies on DimensionsHieraCross Tabulation With HierarchyCrosstabs can be easily extended to deal with hierarchiesCan drill down or roll up on a hierarchy11Copyright: Silberschatz, Korth and SudarshanCross Tabulation With HierarchOLAP ImplementationThe earliest OLAP systems used multidimensional a
18、rrays in memory to store data cubes, and are referred to as multidimensional OLAP (MOLAP) systems.OLAP implementations using only relational database features are called relational OLAP (ROLAP) systemsHybrid systems, which store some summaries in memory and store the base data and other summaries in
19、 a relational database, are called hybrid OLAP (HOLAP) systems.12Copyright: Silberschatz, Korth and SudarshanOLAP ImplementationThe earliesOLAP Implementation (Cont.)Early OLAP systems precomputed all possible aggregates in order to provide online responseSpace and time requirements for doing so can
20、 be very high2n combinations of group byIt suffices to precompute some aggregates, and compute others on demand from one of the precomputed aggregatesCan compute aggregate on (item-name, color) from an aggregate on (item-name, color, size) For all but a few “non-decomposable” aggregates such as medi
21、anis cheaper than computing it from scratch Several optimizations available for computing multiple aggregatesCan compute aggregate on (item-name, color) from an aggregate on (item-name, color, size)Can compute aggregates on (item-name, color, size), (item-name, color) and (item-name) using a single
22、sorting of the base data13Copyright: Silberschatz, Korth and SudarshanOLAP Implementation (Cont.)EarExtended AggregationSQL-92 aggregation quite limitedMany useful aggregates are either very hard or impossible to specifyData cubeComplex aggregates (median, variance)binary aggregates (correlation, re
23、gression curves)ranking queries (“assign each student a rank based on the total marks”SQL:1999 OLAP extensions provide a variety of aggregation functions to address above limitationsSupported by several databases, including Oracle and IBM DB214Copyright: Silberschatz, Korth and SudarshanExtended Agg
24、regationSQL-92 aggExtended Aggregation in SQL:1999The cube operation computes union of group bys on every subset of the specified attributesE.g. consider the queryselect item-name, color, size, sum(number)from salesgroup by cube(item-name, color, size) This computes the union of eight different grou
25、pings of the sales relation: (item-name, color, size), (item-name, color), (item-name, size), (color, size), (item-name), (color), (size), ( ) where ( ) denotes an empty group by list.For each grouping, the result contains the null value for attributes not present in the grouping. 15Copyright: Silbe
26、rschatz, Korth and SudarshanExtended Aggregation in SQL:19Extended Aggregation (Cont.)Relational representation of crosstab that we saw earlier, but with null in place of all, can be computed byselect item-name, color, sum(number)from salesgroup by cube(item-name, color)The function grouping() can b
27、e applied on an attributeReturns 1 if the value is a null value representing all, and returns 0 in all other cases. select item-name, color, size, sum(number),grouping(item-name) as item-name-flag,grouping(color) as color-flag,grouping(size) as size-flag,from salesgroup by cube(item-name, color, siz
28、e)Can use the function decode() in the select clause to replace such nulls by a value such as allE.g. replace item-name in first query by decode( grouping(item-name), 1, all, item-name)16Copyright: Silberschatz, Korth and SudarshanExtended Aggregation (Cont.)ReExtended Aggregation (Cont.)The rollup
29、construct generates union on every prefix of specified list of attributes E.g. select item-name, color, size, sum(number)from salesgroup by rollup(item-name, color, size)Generates union of four groupings: (item-name, color, size), (item-name, color), (item-name), ( ) Rollup can be used to generate a
30、ggregates at multiple levels of ahierarchy.E.g., suppose table itemcategory(item-name, category) gives the category of each item. Then select category, item-name, sum(number) from sales, itemcategory where sales.item-name = itemcategory.item-name group by rollup(category, item-name)would give a hier
31、archical summary by item-name and by category.17Copyright: Silberschatz, Korth and SudarshanExtended Aggregation (Cont.)ThExtended Aggregation (Cont.)Multiple rollups and cubes can be used in a single group by clauseEach generates set of group by lists, cross product of sets gives overall set of gro
32、up by listsE.g., select item-name, color, size, sum(number) from sales group by rollup(item-name), rollup(color, size) generates the groupings item-name, () X (color, size), (color), () = (item-name, color, size), (item-name, color), (item-name), (color, size), (color), ( ) 18Copyright: Silberschatz
33、, Korth and SudarshanExtended Aggregation (Cont.)MuRankingRanking is done in conjunction with an order by specification. Given a relation student-marks(student-id, marks) find the rank of each student.select student-id, rank( ) over (order by marks desc) as s-rankfrom student-marksAn extra order by
34、clause is needed to get them in sorted orderselect student-id, rank ( ) over (order by marks desc) as s-rankfrom student-marks order by s-rankRanking may leave gaps: e.g. if 2 students have the same top mark, both have rank 1, and the next rank is 3dense_rank does not leave gaps, so next dense rank
35、would be 219Copyright: Silberschatz, Korth and SudarshanRankingRanking is done in conjRanking (Cont.)Ranking can be done within partition of the data.“Find the rank of students within each section.”select student-id, section,rank ( ) over (partition by section order by marks desc) as sec-rankfrom st
36、udent-marks, student-sectionwhere student-marks.student-id = student-section.student-idorder by section, sec-rankMultiple rank clauses can occur in a single select clauseRanking is done after applying group by clause/aggregationExercises:Find students with top n ranksMany systems provide special (no
37、n-standard) syntax for “top-n” queriesRank students by sum of their marks in different courses given relation student-course-marks(student-id, course, marks)20Copyright: Silberschatz, Korth and SudarshanRanking (Cont.)Ranking can be Ranking (Cont.)Other ranking functions: percent_rank (within partit
38、ion, if partitioning is done)cume_dist (cumulative distribution) fraction of tuples with preceding valuesrow_number (non-deterministic in presence of duplicates)SQL:1999 permits the user to specify nulls first or nulls last select student-id, rank ( ) over (order by marks desc nulls last) as s-rankf
39、rom student-marks21Copyright: Silberschatz, Korth and SudarshanRanking (Cont.)Other ranking fRanking (Cont.)For a given constant n, the ranking the function ntile(n) takes the tuples in each partition in the specified order, and divides them into n buckets with qual numbers of tuples. For instance,
40、we an sort employees by salary, and use ntile(3) to find which range (bottom third, middle third, or top third) each employee is in, and compute the total salary earned by employees in each range:select threetile, sum(salary)from (select salary, ntile(3) over (order by salary) as threetilefrom emplo
41、yee) as sgroup by threetile22Copyright: Silberschatz, Korth and SudarshanRanking (Cont.)For a given conWindowingE.g.: “Given sales values for each date, calculate for each date the average of the sales on that day, the previous day, and the next day”Such moving average queries are used to smooth out
42、 random variations.In contrast to group by, the same tuple can exist in multiple windowsWindow specification in SQL:Ordering of tuples, size of window for each tuple, aggregate functionE.g. given relation sales(date, value) select date, sum(value) over (order by date between rows 1 preceding and 1 f
43、ollowing) from salesExamples of other window specifications:between rows unbounded preceding and currentrows unbounded precedingrange between 10 preceding and current rowAll rows with values between current row value 10 to current valuerange interval 10 day precedingNot including current row23Copyri
44、ght: Silberschatz, Korth and SudarshanWindowingE.g.: “Given sales vaWindowing (Cont.)Can do windowing within partitionsE.g. Given a relation transaction(account-number, date-time, value), where value is positive for a deposit and negative for a withdrawal“Find total balance of each account after eac
45、h transaction on the account”select account-number, date-time, sum(value) over(partition by account-number order by date-timerows unbounded preceding) as balancefrom transactionorder by account-number, date-time24Copyright: Silberschatz, Korth and SudarshanWindowing (Cont.)Can do windowData Mining25
46、Copyright: Silberschatz, Korth and SudarshanData Mining25Copyright: SilberData MiningBroadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patternsLike knowledge discovery in artificial intelligence data mining discovers statistical rules and pat
47、terns Differs from machine learning in that it deals with large volumes of data stored primarily on disk.Some types of knowledge discovered from a database can be represented by a set of rules.e.g.,: “Young women with annual incomes greater than $50,000 are most likely to buy sports cars”Other types
48、 of knowledge represented by equations, or by prediction functionsSome manual intervention is usually requiredPre-processing of data, choice of which type of pattern to find, postprocessing to find novel patterns26Copyright: Silberschatz, Korth and SudarshanData MiningBroadly speaking, dApplications
49、 of Data MiningPrediction based on past historyPredict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, .) and past historyPredict if a customer is likely to switch brand loyaltyPredict if a customer is likely to respond to “junk mail”Predict if a
50、 pattern of phone calling card usage is likely to be fraudulentSome examples of prediction mechanisms:ClassificationGiven a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs toRegression formulae given a set of para
51、meter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value27Copyright: Silberschatz, Korth and SudarshanApplications of Data MiningPreApplications of Data Mining (Cont.)Descriptive PatternsAssociationsFind books that are often bought by the
52、 same customers. If a new customer buys one such book, suggest that he buys the others too.Other similar applications: camera accessories, clothes, etc.Associations may also be used as a first step in detecting causationE.g. association between exposure to chemical X and cancer, or new medicine and
53、cardiac problemsClustersE.g. typhoid cases were clustered in an area surrounding a contaminated wellDetection of clusters remains important in detecting epidemics28Copyright: Silberschatz, Korth and SudarshanApplications of Data Mining (CClassification RulesClassification rules help assign new objec
54、ts to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?Classification rules for above example could use a variety of knowledge, such as educational level of applicant, salary of applicant, age of applicant, etc.
55、person P, P.degree = masters and P.income 75,000 P.credit = excellent person P, P.degree = bachelors and (P.income 25,000 and P.income 75,000) P.credit = goodRules are not necessarily exact: there may be some misclassificationsClassification rules can be compactly shown as a decision tree.29Copyrigh
56、t: Silberschatz, Korth and SudarshanClassification RulesClassificaDecision Tree30Copyright: Silberschatz, Korth and SudarshanDecision Tree30Copyright: SilbConstruction of Decision TreesTraining set: a data sample in which the grouping for each tuple is already known.Consider credit risk example: Sup
57、pose degree is chosen to partition the data at the root. Since degree has a small number of possible values, one child is created for each value.At each child node of the root, further classification is done if required. Here, partitions are defined by income. Since income is a continuous attribute,
58、 some number of intervals are chosen, and one child created for each interval.Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.In generalDifferent branches of the tree could grow to different levels.
59、Different nodes at the same level may use different partitioning attributes.31Copyright: Silberschatz, Korth and SudarshanConstruction of Decision TreesConstruction of Decision Trees (Cont.)Greedy top down generation of decision trees.Each internal node of the tree partitions the data into groups ba
60、sed on a partitioning attribute, and a partitioning condition for the nodeMore on choosing partioning attribute/condition shortlyAlgorithm is greedy: the choice is made once and not revisited as more of the tree is constructedThe data at a node is not partitioned further if either all (or most) of t
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 五年级品德与社会上册教案
- 地方公务员贵州申论41
- 四川申论真题2006年四川
- 北师大版数学七年级上册 第4章 第38课时 角习题课件
- 四川省申论模拟44
- 2024年印制CDR光盘合同
- 2024年酒店承包协议书
- 2024年借贷合同范本中英文
- 浙江行政职业能力104
- 2024年提供合作申购系统协议
- 2023公司法修订解读
- (高清版)TDT 1032-2011 基本农田划定技术规程
- 2024全国职业院校技能大赛ZZ060母婴照护赛项规程+赛题
- 小学科学课堂有效提问研究的中期报告
- 电气线路设备检修作业中安全保证规程培训
- 《声音的产生与传播》说课课件
- 小学高年级学生数学自主学习能力培养现状的调查问卷(学生卷)
- 社工创业计划书
- 黑龙江省哈尔滨市第九中学2023-2024学年高三年级上册期中考试语文试题(解析版)
- 《工装夹具设计》课程标准
- 医院药房岗前培训
评论
0/150
提交评论