The Ten Most Frequently Asked Questions About Data Warehousing


Although there are various approaches to data mining that seem to offer distinct features and benefits, many may not be powerful enough to meet your corporate knowledge discovery needs. In fact, just a few fundamental questions can quickly clarify the business benefits and the power of a data mining system, putting its advantages in clear perspective. These questions need to be asked from the viewpoints of both business and technical users. Note, however, that these questions refer to data mining -- please also see the many benefits of the knowledge access paradigm, which uses the patterns discovered by data mining within a PatternWarehouse™.

Here are two sets of "Top Ten Data Mining Questions", from the business and technical perspectives. Each question has three parts that together highlight one specific aspect of a data mining system's power and capability.

The Top Ten Data Mining Business Questions

The top ten business questions should be asked by business users about the benefits, quality and usability of the system. They are:

Question 1: Business Benefits
a) How will this system help us?
b) How well does this system work for our industry-specific applications?
c) What information can we get that we do not already have?

It is essential to ask this question again and again. You should, of course, get new refined information, but it is not enough just to know something -- you should have information that allows you to "act" within the context of your industry. You should also measure the bottom-line dollar benefits delivered by a data mining system. See the paper "Measuring the Dollar Value of Mined Information" for a framework for this.

Question 2: Technical Know-how
a) How technically sophisticated do we need to be to use it?
b) Can business users operate it without calling the IS group all the time?
c) Is it as easy to use as an internet browser?

Business users should be empowered with direct, on-demand access to refined knowledge. They should not have to know statistics, yet should be given consistent and correct answers. The system interface should be as easy to use as a web browser.

Question 3: Understandability and Explanations
a) Are the results intuitive or difficult to understand?
b) Do we get clear explanations for any information item presented?
c) Will the explanations be in technical statistical terms or in a form that we can understand?

Results should be presented to business users in plain English, accompanied by graphs. The system should be able to explain each piece of information it presents in clear, English-like terms that business users can easily comprehend and use.

Question 4: Follow-up Questions
a) What kinds of follow-up questions can we ask the system?
b) Do we need to go to an analyst to get further questions answered?
c) How fast can we drill down on the fly to see more patterns?

Responses to follow-up questions must be immediate. Business users should not need intermediaries such as analysts to get more information after they have seen some results. If follow-up questions take time and involve intermediaries, the business users' effectiveness will suffer. Business users should get refined information as they need it, when they need it.

Question 5: Business Users
a) How many business users can this system support?
b) Can the business users tailor their own questions for the system?
c) Can users utilize the knowledge for day-to-day decision making?

The system should be able to use the same fundamental knowledge to support a few hundred business users, each with a different group perspective. Yet all of these users must be given consistent answers as they ask their own questions. The information must be presented so that it can be used for day-to-day actions.

Question 6: Accuracy, Completeness and Consistency
a) How accurate are the results the system delivers?
b) Can some patterns be missed by the system?
c) Are the results always consistent or can 100 users get 100 different answers?

The system must cover a wide range of patterns and should provide high-quality information. The knowledge provided to business users should be derived from the entire data set (and not samples) in order to increase accuracy. All business users should access the same knowledge so that they all receive consistent answers, increasing the quality of corporate information.

Question 7: Incremental Analysis
a) Can we automatically analyze weekly / monthly data as it becomes available?
b) Can the system compare the "month to month" results and patterns by itself?
c) Can we get automatic pattern detection over time, every week or month?

The system should analyze data as it becomes available every week or month and perform ongoing trend analysis, highlighting the key items and influence factors behind significant changes. The incremental analysis should be performed automatically in the background, informing the user of significant trends and their underlying causes.

Question 8: Data Handling
a) How much data can the system deal with?
b) Can it work directly on our database, or do we need to extract data?
c) If it works on extracts, how do we know that some patterns are not missed?

The system should handle moderate to large volumes of data on a powerful server -- of course, large data volumes should not be expected to be managed on small servers. The system should work directly on the SQL database, without extracts, so that patterns are not missed and performance is improved.

Question 9: Integration
a) How will it integrate into our computing environment?
b) Will it just work on our existing SQL database?
c) How easily will the system work on our intranet?

The system should run smoothly on existing open server platforms (e.g. Unix) and popular DBMS engines (e.g. Oracle, Sybase, Informix, etc.) on the server. The system should present results to users on the corporate intranet. The absence of data conditioning requirements and extract files will make integration much easier.

Question 10: Support Staff
a) What staff do I need to keep this system installed and running?
b) How do we get support and training to get started?
c) What happens after we install the system?

After the initial system design, the support personnel for the system should be kept to a minimum. One database administrator should be able to manage the DBMS, and one analyst should occasionally help in setting up discovery models, etc. Thereafter, business users should be able to use the system on their own. There should be no need for a large number of resident support analysts to act as intermediaries for the business users.

The Top Ten Data Mining Technical Questions

The top ten technical questions should be asked by technical users about the architecture, power and scalability of the system. They are:

Question 1: Architecture
a) How are computations distributed between the client and the server?
b) Is any data brought from the server to the client?
c) Can the system run in a three-tiered architecture?

The best option is for the discovery to take place entirely on the server. Any attempt to bring data to the client will seriously limit the applicability of the system to larger databases. The best architecture is a thin-client, three-tiered system that uses the power of a large server-based SQL engine but operates on an intranet.

Question 2: Access to Real Data
a) Does the system work on the real SQL database or on samples and extracts?
b) If it samples or extracts, how do we know that it is accurate?
c) If it builds flat files, who manages this activity and cleans up for ongoing analyses, and how can it sample across several tables?

The best option is for a data mining system to work on the real databases and not on samples, extracts and/or flat files. Working on the real database uses the SQL engine's power (e.g. parallel execution) and provides much more accurate results. The system should also be able to access database tables in their native form, reaching across tables by itself.

Question 3: Performance and Scalability
a) How large a database can the system analyze?
b) How long does it take to perform discovery on a large database?
c) Can the system run in parallel on a multi-processor server?

The system should work on databases with a large number of records. It should derive its capabilities from the power of the server and the SQL engine whenever possible. The system should be able to use the built-in parallelism of the SQL engine, but should also be able to use multiple processors for its own parallel non-SQL computations.

Question 4: Multi-Table Databases
a) Does the system work on a single table only or can it analyze multiple tables?
b) Does the system need to perform a huge join to access all of our tables?
c) If it works on a single table, how can we feed it our existing data schema?

The real world is full of multi-table databases which cannot be joined and meshed into a single view. In fact, the theory of normalization came about because data needs to be in more than one table. Using single tables is an affront to a decade of work on database design. If you challenge the DBA of a really large database to put things in a single table you will get either a laugh or a blank stare -- in many cases the database size will balloon beyond control. The system should be able to mine large multi-table databases directly by itself on the server.

Question 5: Multi-Dimensional Analysis
a) Does the system analyze data along a single dimension only?
b) How are multi-dimensional patterns discovered and expressed by the system?
c) How do we specify the dimensional structure of our data to the system?

The OLAP phenomenon has conclusively demonstrated that the business world's data is not single-dimensional. Hence a data mining system should be able to automatically discover patterns along multiple dimensions. In fact, there are many cases where no single-dimensional view can correctly represent the semantics of influence, because the influence ratios will always be off regardless of how one aggregates. See the paper "OLAP & Data Mining: Bridging the Gap" for a detailed discussion of this.

Question 6: Types and Classes of Patterns Discovered
a) How powerful and general are the patterns the system can discover and express?
b) Can the system mix different pattern types, e.g. influence and affinity patterns?
c) Can the system discover time-based patterns and trends?

The format of the patterns discovered by the system is very general and goes far beyond decision trees or simple affinities. The advantage of this is that the general rules discovered are far more powerful than decision trees. Decision trees are very limited in that they cannot find all the information in a database. Being rule-based keeps the system from being constrained to one part of a search space and makes sure that many more clusters and patterns are found -- allowing the system to provide more information and better predictions.

Question 7: System Initiative
a) Does the system use its own initiative to perform discovery or is it guided by the user?
b) Can the system discover unexpected patterns by itself?
c) Can the system start up by itself on a weekly or monthly basis and perform discovery?

In some cases the user has to interact with and guide the system, e.g. to build a decision tree. However, a better approach is for the system to use its own initiative in the data mining process, forming hypotheses automatically based on the character of the data. The system should start up by itself, select the significant patterns in the data and filter out the unimportant trends. The analyses should be done routinely on a weekly or monthly basis.

Question 8: Treatment of Data Types
a) Are all data types handled in their own form or translated to other types?
b) Can the system find numeric ranges in data by itself?
c) Do a large number of non-numeric values cause problems for the system?

The system should manage all data types in a uniform manner and in their native formats, i.e. numbers, dates and constants should remain numbers, dates and constants internally. Interesting ranges in the data should be discovered by the system, without requiring "number bin" construction by the user. A large number of constant values in the database should not choke the system.

Question 9: Data Dependencies and Hierarchies
a) Can the system be told about the functional dependencies in our database?
b) Does the system understand the concept of data hierarchy?
c) How does the system use dependencies and/or hierarchies for discovery?

The system should be capable of using the functional (and other) dependencies that exist in a database. The use of these dependencies can significantly enhance the power of discovery -- in fact, ignoring them can lead to confusion. The system should understand the concept of hierarchy and should be able to use it for discovery along multiple dimensions.

Question 10: Flexibility and Noise Sensitivity
a) How brittle is the system when dealing with noisy data?
b) How well does the system cope with data exceptions and low-quality data?
c) Can the system provide statements with flexible numeric ranges discovered by itself in the data?

The system should not be sensitive to noise and should internally use fuzzy logic to smooth data brittleness. As the data gathers noise, the system should only reduce the level of confidence associated with the results provided, not suddenly change direction in discovery. The system should still produce the most significant findings from the data set, even if noise is present.
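
Technical Questions 2 and 4 above ask whether discovery can run inside the SQL engine and across several tables, rather than on an extracted flat file. The sketch below is one minimal illustration of that idea and not the method of any particular product. It uses Python's built-in sqlite3 module and lets the database compute the support and confidence of a single candidate rule over two tables. The tables (customers, orders), their columns, the sample rows and the rule itself are all assumptions made for this example.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE orders (customer_id INTEGER, product TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "West"), (2, "West"), (3, "East"), (4, "West")],
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, "widget"), (1, "widget"), (2, "gadget"), (3, "widget"), (4, "widget")],
    )
    return conn


def rule_stats(conn, region, product):
    """Return (support, confidence) for the illustrative rule
    'customer is in region' -> 'customer ordered product'.

    All counting happens inside the SQL engine; only three numbers
    come back to the client, and no flat-file extract is built.
    """
    total, in_region, both = conn.execute(
        """
        SELECT
            COUNT(*),
            SUM(c.region = ?),
            SUM(c.region = ? AND EXISTS (
                SELECT 1 FROM orders o
                WHERE o.customer_id = c.id AND o.product = ?
            ))
        FROM customers c
        """,
        (region, region, product),
    ).fetchone()
    # SUM() yields NULL (None) on an empty table, so default to 0.
    in_region = in_region or 0
    both = both or 0
    support = both / total if total else 0.0
    confidence = both / in_region if in_region else 0.0
    return support, confidence


if __name__ == "__main__":
    conn = build_demo_db()
    support, confidence = rule_stats(conn, "West", "widget")
    print(f"support={support:.2f} confidence={confidence:.2f}")
```

The EXISTS subquery reaches into the orders table for each customer without materializing one large joined table, so repeat orders by the same customer are not double-counted. A real system would generate and test many such rules; this shows only the in-database evaluation of one.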

Time: 2024-09-30 22:48:36
