Data | Questions
Although various approaches to data mining seem to offer distinct features and benefits, many may not be powerful enough to meet your corporate knowledge discovery needs. In fact, just a few fundamental questions can quickly clarify the business benefits and the power of a data mining system, putting its advantages in clear perspective. These questions need to be asked from the viewpoints of both business and technical users. Note, however, that these questions refer to data mining; please also see the many benefits of the knowledge access paradigm, which uses the patterns discovered by data mining within a PatternWarehouse™.

Here are two sets of "Top Ten Data Mining Questions", from business and technical perspectives. Each question has three parts that together highlight one specific aspect of a data mining system's power and capability.

The Top Ten Data Mining Business Questions

The top ten business questions should be asked by business users about the benefits, quality and usability of the system. They are:

Question 1: Business Benefits
a) How will this system help us?
b) How well does this system work for our industry-specific applications?
c) What information can we get that we do not already have?

It is essential to ask this question again and again. You should, of course, get new refined information, but it is not enough just to know something: you should have information that allows you to "act" within the context of your industry. You should also measure the bottom-line dollar benefits delivered by a data mining system. See the paper "Measuring the Dollar Value of Mined Information" for a framework for this.

Question 2: Technical Know-how
a) How technically sophisticated do we need to be to use it?
b) Can business users operate it without calling the IS group all the time?
c) Is it as easy to use as an internet browser?

Business users should be empowered with direct, on-demand access to refined knowledge.
They should not have to know statistics, yet should be given consistent and correct answers. The system interface should be as easy to use as a web browser.

Question 3: Understandability and Explanations
a) Are the results intuitive or difficult to understand?
b) Do we get clear explanations for any information item presented?
c) Will the explanations be in technical statistical terms or in a form that we can understand?

Results should be presented to business users in plain English, accompanied by graphs. The system should be able to explain each piece of information it presents in clear, English-like terms that business users can easily comprehend and use.

Question 4: Follow-up Questions
a) What kinds of follow-up questions can we ask of the system?
b) Do we need to go to an analyst to have further questions answered?
c) How fast can we drill down on the fly to see more patterns?

Responses to follow-up questions must be immediate. Business users should not need intermediaries such as analysts to get more information after they have seen some results. If follow-up questions take time and involve intermediaries, the business users' effectiveness will suffer. Business users should get refined information as they need it, when they need it.

Question 5: Business Users
a) How many business users can this system support?
b) Can the business users tailor their own questions for the system?
c) Can users utilize the knowledge for day-to-day decision making?

The system should be able to use the same fundamental knowledge to support a few hundred business users, each with a different group perspective. Yet all of these users must be given consistent answers as they ask their own questions. The information must be presented such that it can be used for day-to-day actions.

Question 6: Accuracy, Completeness and Consistency
a) How accurate are the results the system delivers?
b) Can some patterns be missed by the system?
c) Are the results always consistent, or can 100 users get 100 different answers?

The system must cover a wide range of patterns and should provide high-quality information. The knowledge provided to business users should be derived from the entire data set (and not from samples) in order to increase accuracy. All business users should access the same knowledge so that they all receive consistent answers, increasing the quality of corporate information.

Question 7: Incremental Analysis
a) Can we automatically analyze weekly / monthly data as it becomes available?
b) Can the system compare the "month to month" results and patterns by itself?
c) Can we get automatic pattern detection over time, every week or month?

The system should analyze data as it becomes available every week or month and perform ongoing trend analysis, highlighting the key items and influence factors behind significant changes. The incremental analysis should be performed automatically in the background, informing the user of significant trends and their underlying causes.

Question 8: Data Handling
a) How much data can the system deal with?
b) Can it work directly on our database, or do we need to extract data?
c) If it works on extracts, how do we know that some patterns are not missed?

The system should handle moderate to large volumes of data on a powerful server; of course, large data volumes should not be expected to be managed on small servers. The system should work directly on the SQL database, without extracts, so that patterns are not missed and performance is improved.

Question 9: Integration
a) How will it integrate into our computing environment?
b) Will it just work on our existing SQL database?
c) How easily will the system work on our intranet?

The system should run smoothly on existing open server platforms (e.g. Unix) and popular DBMS engines (e.g. Oracle, Sybase, Informix) on the server. The system should present results to users on the corporate intranet.
The absence of data conditioning requirements and extract files will make integration much easier.

Question 10: Support Staff
a) What staff do I need to keep this system installed and running?
b) How do we get support and training to get started?
c) What happens after we install the system?

After the initial system design, the support personnel for the system should be kept minimal. One database administrator should be able to manage the DBMS, and one analyst should occasionally help in setting up discovery models, etc. Thereafter, business users should be able to use the system on their own. There should be no need for a large number of resident support analysts to act as intermediaries for the business users.

The Top Ten Data Mining Technical Questions

The top ten technical questions should be asked by technical users about the architecture, power and scalability of the system. They are:

Question 1: Architecture
a) How are computations distributed between the client and the server?
b) Is any data brought from the server to the client?
c) Can the system run in a three-tiered architecture?

The best option is for the discovery to take place entirely on the server. Any attempt to bring data to the client will seriously limit the applicability of the system to larger databases. The best architecture is a thin-client, three-tiered system that uses the power of a large server-based SQL engine but operates on an intranet.

Question 2: Access to Real Data
a) Does the system work on the real SQL database or on samples and extracts?
b) If it samples or extracts, how do we know that it is accurate?
c) If it builds flat files, who manages this activity and cleans up for ongoing analyses, and how can it sample across several tables?

The best option is for a data mining system to work on the real databases and not on samples, extracts and/or flat files. Working on the real database uses the SQL engine's power (e.g. parallel execution) and provides much more accurate results.
The system should also be able to access database tables in their native form, reaching across tables by itself.

Question 3: Performance and Scalability
a) How large a database can the system analyze?
b) How long does it take to perform discovery on a large database?
c) Can the system run in parallel on a multi-processor server?

The system should work on databases with a large number of records. It should derive its capabilities from the power of the server and the SQL engine whenever possible. The system should be able to use the built-in parallelism of the SQL engine, but should also be able to use multiple processors for its own parallel non-SQL computations.

Question 4: Multi-Table Databases
a) Does the system work on a single table only or can it analyze multiple tables?
b) Does the system need to perform a huge join to access all of our tables?
c) If it works on a single table, how can we feed it our existing data schema?

The real world is full of multi-table databases that cannot be joined and meshed into a single view. In fact, the theory of normalization came about because data needs to be in more than one table. Using single tables is an affront to a decade of work on database design. If you challenge the DBA of a really large database to put things in a single table, you will get either a laugh or a blank stare; in many cases the database size would balloon beyond control. The system should be able to mine large multi-table databases directly by itself on the server.

Question 5: Multi-Dimensional Analysis
a) Does the system analyze data along a single dimension only?
b) How are multi-dimensional patterns discovered and expressed by the system?
c) How do we specify the dimensional structure of our data to the system?

The OLAP phenomenon has conclusively demonstrated that the business world's data is not single-dimensional. Hence a data mining system should be able to automatically discover patterns along multiple dimensions.
In fact, there are many cases where no single-dimensional view can correctly represent the semantics of influence, because the influence ratios will always be off regardless of how one aggregates. See the paper "OLAP & Data Mining: Bridging the Gap" for a detailed discussion of this.

Question 6: Types and Classes of Patterns Discovered
a) How powerful and general are the patterns the system can discover and express?
b) Can the system mix different pattern types, e.g. influence and affinity patterns?
c) Can the system discover time-based patterns and trends?

The format of the patterns discovered by the system is very general and goes far beyond decision trees or simple affinities. The advantage of this is that the general rules discovered are far more powerful than decision trees. Decision trees are very limited in that they cannot find all the information in a database. Being rule-based keeps the system from being constrained to one part of a search space and ensures that many more clusters and patterns are found, allowing the system to provide more information and better predictions.

Question 7: System Initiative
a) Does the system use its own initiative to perform discovery or is it guided by the user?
b) Can the system discover unexpected patterns by itself?
c) Can the system start up by itself on a weekly or monthly basis and perform discovery?

In some cases the user has to interact with and guide the system, e.g. to build a decision tree. However, a better approach is for the system to use its own initiative in the data mining process, forming hypotheses automatically based on the character of the data. The system should start up by itself, select the significant patterns in the data and filter out the unimportant trends. The analyses should be done routinely on a weekly or monthly basis.

Question 8: Treatment of Data Types
a) Are all data types handled in their own form or translated to other types?
b) Can the system find numeric ranges in data by itself?
c) Do a large number of non-numeric values cause problems for the system?

The system should manage all data types in a uniform manner and in their native formats, i.e. numbers, dates and constants should remain numbers, dates and constants internally. Interesting ranges in the data should be discovered by the system, without requiring the user to construct "number bins". A large number of constant values in the database should not choke the system.

Question 9: Data Dependencies and Hierarchies
a) Can the system be told about the functional dependencies in our database?
b) Does the system understand the concept of data hierarchy?
c) How does the system use dependencies and/or hierarchies for discovery?

The system should be capable of using the functional (and other) dependencies that exist in a database. The use of these dependencies can significantly enhance the power of discovery; in fact, ignoring them can lead to confusion. The system should understand the concept of hierarchy and should be able to use it for discovery along multiple dimensions.

Question 10: Flexibility and Noise Sensitivity
a) How brittle is the system when dealing with noisy data?
b) How well does the system cope with data exceptions and low-quality data?
c) Can the system provide statements with flexible numeric ranges discovered by itself in the data?

The system should not be sensitive to noise and should internally use fuzzy logic to smooth data brittleness. As the data gathers noise, the system should only reduce the level of confidence associated with the results provided, not suddenly change direction in discovery. However, the system should still produce the most significant findings from the data set, even if noise is present.
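
The graceful-degradation requirement in Question 10 can be made concrete with a small sketch. The Python below is illustrative only and does not describe any particular data mining product: the `Record` type, the sample data and the `with_noise` helper are hypothetical, and confidence is estimated by plain counting rather than by the fuzzy logic mentioned above. It shows a single rule ("customers in the West region buy") keeping its direction while its confidence falls in step with the amount of noise injected.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Record:
    """Hypothetical customer record used only for this illustration."""
    region: str
    bought: bool


def rule_confidence(
    records: Sequence[Record],
    antecedent: Callable[[Record], bool],
    consequent: Callable[[Record], bool],
) -> float:
    """Estimate confidence = P(consequent | antecedent) by counting."""
    matching = [r for r in records if antecedent(r)]
    if not matching:
        return 0.0
    hits = sum(1 for r in matching if consequent(r))
    return hits / len(matching)


def with_noise(records: Sequence[Record], flip_every: int) -> List[Record]:
    """Return a copy in which every `flip_every`-th record has its
    `bought` flag flipped. A value of 0 (or less) injects no noise."""
    if flip_every <= 0:
        return list(records)
    return [
        Record(r.region, not r.bought) if (i + 1) % flip_every == 0 else r
        for i, r in enumerate(records)
    ]


def is_west(r: Record) -> bool:
    return r.region == "West"


def did_buy(r: Record) -> bool:
    return r.bought


# Assumed toy data: 100 West customers who bought, 100 East customers who did not.
clean = [Record("West", True)] * 100 + [Record("East", False)] * 100

for flip_every in (0, 10, 5):
    noisy = with_noise(clean, flip_every)
    conf = rule_confidence(noisy, is_west, did_buy)
    print(f"flip_every={flip_every:>2}  confidence(West -> bought) = {conf:.2f}")
```

With this toy data the rule is reported at every noise level; only the confidence attached to it changes. A real system would also need to decide at what confidence a finding stops being worth reporting, which this sketch deliberately leaves out.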