Data Mining :: Ontario College of Technology, Toronto

Data mining is the principle of sorting through large amounts of data and picking out relevant information. It is usually used by business intelligence organizations, and financial analysts, but it is increasingly used in the sciences to extract information from the enormous data sets generated by modern experimental and observational methods. It has been described as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and "the science of extracting useful information from large data sets or databases".

Background

Traditionally, analysts have performed the task of extracting useful information from recorded data, but the increasing volume of data in modern business and science calls for computer-based approaches. As data sets have grown in size and complexity, there has been a shift away from direct hands-on data analysis toward indirect, automatic data analysis using more complex and sophisticated tools. The modern technologies of computers, networks, and sensors have made data collection and organization much easier. However, the captured data needs to be converted into information and knowledge to become useful. Data mining is the entire process of applying computer-based methodology, including new techniques for knowledge discovery, from data.

Data mining identifies trends within data that go beyond simple analysis. Through the use of sophisticated algorithms, users have the ability to identify key attributes of business processes and target opportunities.

Although data mining is a relatively new term, the technology is not. Companies for a long time have used powerful computers to sift through volumes of data such as supermarket scanner data to produce market research reports. Continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy and usefulness of analysis.The term data mining is often used to apply to the two separate processes of knowledge discovery and prediction. Knowledge discovery provides explicit information that has a readable form and can be understood by a user. Forecasting, or predictive modeling provides predictions of future events and may be transparent and readable in some approaches (e.g. rule based systems) and opaque in others such as neural networks. Moreover, some data mining systems such as neural networks are inherently geared towards prediction and pattern recognition, rather than knowledge discovery.

Metadata, or data about a given data set, are often expressed in a condensed data mine-able format, or one that facilitates the practice of data mining. Common examples include executive summaries and scientific abstracts.

Data mining relies on the use of real world data. This data is extremely vulnerable to collinearity precisely because data from the real world may have unknown interrelations. An unavoidable weakness of data mining is that the critical data that may explain the relationships is never observed. Alternative approaches using an experiment based approach such as Choice Modelling for human generated data may be used. Inherent correlations are either controlled for or removed altogether through the construction of an experimental design.

Recently, there were some efforts to define a standard for data mining, for example the CRISP-DM standard for analysis processes or the Java Data Mining Standard. Independent of these standardization efforts, freely available open-source software systems like RapidMiner and Weka have become an informal standard for defining data mining processes.

Privacy concerns

There are also privacy and human rights concerns associated with data mining, specifically regarding the source of the data analyzed. Data mining provides information that would not be available otherwise, and it must be properly interpreted to be useful. When the data
collected involves individual people, there are many questions concerning privacy, legality, and ethics.In particular, data mining
government or commercial data sets for national security or law enforcement purposes has raised privacy concerns.

There are many legitimate uses of data mining. For example, a database of prescription drugs taken by a group of people could be used to find combinations of drugs exhibiting harmful interactions. Since any particular combination may occur in only 1 out of 1000 people, a great deal of data would need to be examined to discover such an interaction. A project involving pharmacies could reduce the number of drug reactions and potentially save lives. Unfortunately, there is also a huge potential for abuse of such a database.

Notable uses of data mining

Politics
Data mining has been cited as the method by which the U.S. Army unit Able Danger had identified the September 11, 2001 attacks leader, Mohamed Atta, and three other 9/11 hijackers as possible members of an Al Qaeda cell operating in the U.S. more than a year before the attack.

It has been suggested that both the Central Intelligence Agency and their Canadian counterparts, Canadian Security Intelligence Service, have put this method of interpreting data to work for them as well[7], although they have not said how.

Games
Since the early 1960s, with the availability of oracles for certain combinatorial games, also called tablebases (e.g. for 3x3-chess) with any beginning configuration, small-board dots-and-boxes, small-board-hex, and certain endgames in chess, dots-and-boxes, and hex; a new area for data mining has been opened up. This is the extraction of human-usable strategies from these oracles. Current pattern recognition approaches do not seem to fully have the required high level of abstraction in order to be applied successfully. Instead, extensive experimentation with the tablebases combined with an intensive study of tablebase-answers to well designed problems and with knowledge of prior art i.e. pre-tablebase knowledge is used to yield insightful patterns. Berlekamp in dots-and-boxes etc. and John Nunn in chess endgames are notable examples of researchers doing this work, though they were not and are not involved in tablebase generation.

Business
Data mining in customer relationship management applications can contribute significantly to the bottom line. Rather than contacting a prospect or customer through a call center or sending mail, only prospects that are predicted to have a high likelihood of responding to an offer are contacted. More sophisticated methods may be used to optimize across campaigns so that we can predict which channel and which offer an individual is most likely to respond to - across all potential offers. Finally, in cases where many people will take an action without an offer, uplift modeling can be used to determine which people will have the greatest increase in responding if given an offer.

Businesses employing data mining quickly see a return on investment, but also they recognize that the number of predictive models can quickly become very large. Rather than one model to predict which customers will churn, a business could build a separate model for each region and customer type. Then instead of sending an offer to all people that are likely to churn, it may only want to send offers to customers that will likely take to offer. And finally, it may also want to determine which customers are going to be profitable over a window of time and only send the offers to those that are likely to be profitable. In order to maintain this quantity of models, they need to manage model versions and move to automated data mining.Another example of data mining, often called the market basket analysis, relates to its use in retail sales. If a clothing store records the purchases of customers, a data mining system could identify those customers who favour silk shirts over cotton ones.

Although some explanations of relationships may be difficult, taking advantage of it is easier. The example deals with association rules within transaction-based data. Not all data are transaction based and logical or inexact rules may also be present within a database. In a manufacturing application, an inexact rule may state that 73% of products which have a specific defect or problem, will develop a secondary problem within the next six months.

Related to an integrated circuit production line, an example of data-mining is described in the paper "Mining IC Test Data to Optimize VLSI Testing"[8]. In this paper the application of data mining and decision analysis to the problem of die-level functional test is described. Experiments mentioned in this paper demonstrate the ability of applying a system of mining historical die test data to create a probabilistic model of patterns of die failure which are then utilized to decide in real time which die to test next and when to stop testing. This system has been shown, based on experiments with historical test data, to have the potential to improve profits on mature IC products.