Data Mining

Data Mining is a process of discovering interesting patterns and knowledge from large amounts of data. The data sources can include databases, data warehouses, the web, and other information repositories or data that are streamed into the system dynamically.

Why Do Businesses Need Data Extraction?

With the advent of Big Data, data mining has become more prevalent. Big data refers to extremely large data sets that computers can analyze to reveal patterns, associations, and trends that humans can understand. Big data carries extensive information of varied types and content.

With this amount of data, simple statistics with manual intervention no longer work. The data mining process fills this need, which leads to a shift from simple data statistics to complex data mining algorithms.

The data mining process extracts relevant information from raw data such as transactions, photos, videos, and flat files, and automatically processes that information to generate reports that businesses can act on.

Thus, the data mining process helps businesses make better decisions by discovering patterns and trends in data, summarizing the data, and extracting the relevant information.

Data Extraction As A Process

Solving any business problem starts with examining the raw data to build a model that describes the information and produces the reports the business will use. Building a model from data sources and data formats is an iterative process, because the raw data is available from many different sources and in many forms. The volume of data grows day by day, so a newly found data source can change the results.

Cross-Industry Standard Process for Data Mining (CRISP-DM)

CRISP-DM is a reliable data mining process model consisting of six phases. It is a cyclical process that provides a structured approach to data mining. The order of the six phases is not strict: the work often requires backtracking to earlier phases and repeating actions.

The six phases of CRISP-DM include:

1) Business Understanding: In this step, the goals of the businesses are set and the important factors that will help in achieving the goal are discovered.

2) Data Understanding: This step collects all the data and loads it into the tool (if a tool is used). The data is listed with its source, its location, how it was acquired, and any issues encountered. The data is visualized and queried to check its completeness.

3) Data Preparation: This step involves selecting the appropriate data, cleaning it, constructing attributes from it, and integrating data from multiple databases.

4) Modeling: This step covers selecting the data mining technique (such as a decision tree), generating a test design to evaluate the selected model, building models from the dataset, and assessing the built model with experts to discuss the results.

5) Evaluation: This step will determine the degree to which the resulting model meets the business requirements. Evaluation can be done by testing the model on real applications. The model is reviewed for any mistakes or steps that should be repeated.

6) Deployment: In this step, a deployment plan is made, a strategy is formed to monitor and maintain the data mining model results and check their usefulness, final reports are written, and the whole process is reviewed to check for mistakes and to see whether any step needs to be repeated.

Types of Data Mining Processes

The data mining processes can be classified into two types: data preparation (or data pre-processing) and data mining. The first four processes (data cleaning, data integration, data selection, and data transformation) are considered data preparation. The last three (data mining, pattern evaluation, and knowledge representation) are grouped into one process called data mining.

1. Data Cleaning

Data cleaning is the process of correcting incomplete, noisy, and inconsistent data, which is what real-world data normally looks like. The data available in the sources might lack attribute values, data of interest, and so on. For example, if you want demographic data about customers but the available data has no attributes for their gender or age, the data is incomplete.

Data cleaning involves a number of techniques, including filling in missing values manually and combined computer and human inspection. The output of the data cleaning process is adequately cleaned data.
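As a rough illustration of filling in missing values, the sketch below replaces a missing age with the mean of the known ages and a missing gender with the most common known value. The records, the column names, and the choice of mean and mode as fill strategies are assumptions made for this example; manual inspection may be the better option for real data.

```python
from statistics import mean, mode

# Hypothetical customer records; None marks a missing value.
customers = [
    {"customer_id": 1, "age": 34, "gender": "F"},
    {"customer_id": 2, "age": None, "gender": "M"},
    {"customer_id": 3, "age": 46, "gender": None},
    {"customer_id": 4, "age": 40, "gender": "F"},
]


def fill_missing(records, column, strategy):
    """Return new records with missing values in `column` filled in."""
    present = [r[column] for r in records if r[column] is not None]
    if strategy == "mean":
        fill_value = mean(present)
    elif strategy == "mode":
        fill_value = mode(present)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Build new dicts so the original records stay untouched.
    return [
        {**r, column: fill_value if r[column] is None else r[column]}
        for r in records
    ]


cleaned = fill_missing(customers, "age", "mean")
cleaned = fill_missing(cleaned, "gender", "mode")
print(cleaned[1]["age"], cleaned[2]["gender"])  # prints: 40 F
```

Returning new records instead of editing in place keeps the raw data available for later inspection.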

2. Data Integration

Data integration is the process of combining data from different sources into one. Data lies in different formats in different locations: it could be stored in databases, text files, spreadsheets, documents, data cubes, the Internet, and so on. Data integration is a complex and tricky task because data from different sources normally does not match. Suppose a table A contains an attribute named customer_id, whereas another table B contains an attribute named number. It is difficult to tell whether both attributes refer to the same value. Metadata can be used effectively to reduce errors in the data integration process.
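The customer_id and number case can be sketched as follows. The idea is that metadata records which source column corresponds to which shared name, so the tables can be renamed before they are combined. The table contents, the column_metadata mapping, and both helper functions are hypothetical and exist only for this illustration.

```python
# Hypothetical source tables: A uses "customer_id", B uses "number".
table_a = [
    {"customer_id": 101, "name": "Asha"},
    {"customer_id": 102, "name": "Ben"},
]
table_b = [
    {"number": 101, "city": "Pune"},
    {"number": 103, "city": "Oslo"},
]

# Metadata (assumed): maps each source column to a shared schema name.
column_metadata = {
    "table_a": {"customer_id": "customer_id", "name": "name"},
    "table_b": {"number": "customer_id", "city": "city"},
}


def to_shared_schema(rows, mapping):
    """Rename the columns of every row according to the metadata mapping."""
    return [{mapping[col]: value for col, value in row.items()} for row in rows]


def integrate(left, right, key):
    """Combine rows that share the same key value into one record."""
    merged = {}
    for row in left + right:
        merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


integrated = integrate(
    to_shared_schema(table_a, column_metadata["table_a"]),
    to_shared_schema(table_b, column_metadata["table_b"]),
    key="customer_id",
)
for record in integrated:
    print(record)
```

Customers that appear in only one source are kept with the attributes that are known, which leaves their gaps visible for the cleaning step.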

3. Data Selection

The data mining process requires large volumes of historical data for analysis, so the repository of integrated data usually contains much more data than is actually required. From the available data, the data of interest needs to be selected and stored. Data selection is the process of retrieving the data relevant to the analysis from the database.
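A minimal sketch of data selection, assuming the integrated data sits in a relational table: only the columns and rows relevant to the analysis are retrieved. The sales table, its columns, and the filter values are invented for this example, and an in-memory SQLite database stands in for the real repository.

```python
import sqlite3

# In-memory database standing in for the integrated repository (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "customer_id INTEGER, region TEXT, amount REAL, sale_year INTEGER, notes TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (1, "north", 120.0, 2022, "gift"),
        (2, "south", 80.5, 2023, ""),
        (1, "north", 42.0, 2023, "return"),
        (3, "north", 310.0, 2023, ""),
    ],
)

# Select only the attributes and rows of interest for the analysis.
selected = conn.execute(
    "SELECT customer_id, amount FROM sales "
    "WHERE region = ? AND sale_year = ? ORDER BY customer_id",
    ("north", 2023),
).fetchall()
conn.close()

print(selected)  # prints: [(1, 42.0), (3, 310.0)]
```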

4. Data Transformation

Data transformation is the process of transforming and consolidating the data into forms that are suitable for mining. Data transformation normally involves normalization, aggregation, generalization, and so on. For example, a data set available as “-5, 37, 100, 89, 78” can be transformed into “-0.05, 0.37, 1.00, 0.89, 0.78”. In this form the data is more suitable for data mining. After data transformation, the available data is ready for data mining.
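The normalization example above is consistent with dividing every value by the largest absolute value in the set (here 100). The text does not name the method, so treating it as max-abs scaling is an assumption; the function name below is likewise chosen for this sketch.

```python
def max_abs_scale(values):
    """Scale values into [-1, 1] by dividing by the largest absolute value."""
    largest = max(abs(v) for v in values)
    if largest == 0:
        # All values are zero; nothing to scale.
        return [0.0 for _ in values]
    return [v / largest for v in values]


raw = [-5, 37, 100, 89, 78]
scaled = max_abs_scale(raw)
print([f"{v:.2f}" for v in scaled])
# prints: ['-0.05', '0.37', '1.00', '0.89', '0.78']
```

Scaling all attributes to a common range keeps attributes with large raw values from dominating distance-based mining methods.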

5. Data Mining

Data mining is the core process where a number of complex and intelligent methods are applied to extract patterns from data. Data mining process includes a number of tasks such as association, classification, prediction, clustering, time series analysis and so on.
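To make one of these tasks concrete, the sketch below illustrates association in its simplest form: counting which pairs of items appear together in at least a given fraction of transactions. The transactions, the frequent_pairs function, and the 0.5 support threshold are assumptions for illustration. Production algorithms such as Apriori also handle larger itemsets and prune candidates.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"eggs", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
]


def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (share of transactions) meets the threshold."""
    counts = Counter()
    for basket in transactions:
        # Sort so each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(transactions)
    return {
        pair: n / total for pair, n in counts.items() if n / total >= min_support
    }


pairs = frequent_pairs(transactions, min_support=0.5)
for pair, support in sorted(pairs.items()):
    print(pair, support)
# prints:
# ('bread', 'butter') 0.6
# ('bread', 'milk') 0.6
```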

6. Pattern Evaluation

Pattern evaluation identifies the truly interesting patterns that represent knowledge, based on different types of interestingness measures. A pattern is considered interesting if it is potentially useful, easily understood by humans, valid on new data with some degree of certainty, or confirms a hypothesis that someone wants to validate.
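The text does not name specific interestingness measures. As one common choice, the sketch below scores association rules by support, confidence, and lift, and keeps a rule only when its lift exceeds 1, meaning the items occur together more often than chance would suggest. The transactions, the rule_measures function, and the lift threshold are assumptions for this example.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"eggs", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
]


def rule_measures(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    total = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_consequent = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / total
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    lift = confidence / (n_consequent / total) if n_consequent else 0.0
    return support, confidence, lift


candidate_rules = [({"butter"}, {"bread"}), ({"milk"}, {"bread"})]
interesting = []
for antecedent, consequent in candidate_rules:
    support, confidence, lift = rule_measures(transactions, antecedent, consequent)
    if lift > 1.0:  # assumed cut-off for "interesting"
        interesting.append((antecedent, consequent))

print(interesting)  # prints: [({'butter'}, {'bread'})]
```

Here every basket with butter also contains bread, so that rule passes, while milk buyers purchase bread slightly less often than customers overall, so that rule is dropped.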

7. Knowledge Representation

The information mined from the data needs to be presented to the user in an appealing way. Different knowledge representation and visualization techniques are applied to provide the output of data mining to the users.

Conclusion

Data Mining is an iterative process where the mining process can be refined, and new data can be integrated to get more efficient results. Data Mining meets the requirement of effective, scalable and flexible data analysis.

Data mining can be performed on many kinds of data, from conventional database data to advanced data such as time series. The data mining process comes with its own challenges as well.