CodeBalance: A Theoretical Introduction to Data Mining

Posted on 2010-12-14 by CB

This article introduces the aim of data mining and explains basic concepts and terms.

Data Mining (also known as knowledge discovery from data): Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.

Data Warehouse : A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context. [Barry Devlin] Data warehouses are a common source of data for data mining.

Potential Uses : Web information mining, spam filtering, medical data mining, weather data mining, market sales strategies, etc.

Data Mining Related Operations

Preprocessing:

Handling Noisy Data : Handling missing, duplicate or erroneous data before data mining. Noisy data can be removed, or corrected by a specific approach (e.g. correlation analysis).
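As a minimal sketch of the "remove" option, assuming records are plain dictionaries, the following drops records with missing values and exact duplicates. The function name clean_records and the sample fields are made up for illustration; correcting values (rather than removing them) is not shown.

```python
def clean_records(records, required_fields):
    """Drop records with missing required fields, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        # Skip records where a required field is absent or None.
        if any(record.get(field) is None for field in required_fields):
            continue
        # A sorted tuple of items serves as a hashable fingerprint of the record.
        fingerprint = tuple(sorted(record.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(record)
    return cleaned


# Hypothetical sample data.
raw = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},  # missing value
    {"id": 1, "age": 34},    # duplicate of the first record
    {"id": 3, "age": 51},
]
cleaned = clean_records(raw, required_fields=["id", "age"])
print(cleaned)
```

Only the records with id 1 and id 3 survive, each once.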

Integration : Combining data from multiple sources.
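One simple form of integration is matching records from two sources on a shared key. The sketch below assumes two hypothetical sources, a customer list and an order list, that both carry a customer_id field; real sources usually also need schema and unit reconciliation, which is not covered here.

```python
def integrate(customers, orders):
    """Attach each customer's order amounts to the customer record."""
    # Group order amounts by the shared key (customer_id is an assumed field name).
    amounts_by_customer = {}
    for order in orders:
        amounts_by_customer.setdefault(order["customer_id"], []).append(order["amount"])
    # Customers without any orders get an empty list.
    return [
        {**customer, "order_amounts": amounts_by_customer.get(customer["customer_id"], [])}
        for customer in customers
    ]


# Hypothetical data from two separate sources.
customers = [{"customer_id": 1, "name": "Ada"}, {"customer_id": 2, "name": "Bob"}]
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 5.5},
    {"customer_id": 3, "amount": 9.0},  # no matching customer, so it is ignored
]
integrated = integrate(customers, orders)
print(integrated)
```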

Normalization : Scaling data to a specified range. For example, scaling 750 from the range [500, 1000] to the range [0, 1] gives 0.5.
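The example above corresponds to min-max scaling: subtract the old minimum, divide by the old range, then stretch to the new range. A small sketch follows; the function name min_max_scale is chosen here for illustration.

```python
def min_max_scale(value, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly map value from [old_min, old_max] to [new_min, new_max]."""
    if old_max == old_min:
        raise ValueError("old_min and old_max must differ")
    # Position of the value inside the old range, as a fraction between 0 and 1.
    fraction = (value - old_min) / (old_max - old_min)
    return new_min + fraction * (new_max - new_min)


scaled = min_max_scale(750, 500, 1000)
print(scaled)  # 0.5
```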

Feature Selection : Selecting only the useful features of the data (i.e. attributes, for record data).
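What counts as "useful" depends on the task. As one very simple illustration (an assumption of this sketch, not the only criterion), an attribute that has the same value in every record carries no information and can be dropped:

```python
def drop_constant_features(records):
    """Keep only attributes that take more than one distinct value."""
    if not records:
        return []
    useful = [
        name for name in records[0]
        if len({record[name] for record in records}) > 1
    ]
    return [{name: record[name] for name in useful} for record in records]


# Hypothetical records: "country" is identical everywhere, so it is removed.
rows = [
    {"age": 25, "country": "TR", "buys": True},
    {"age": 40, "country": "TR", "buys": False},
    {"age": 31, "country": "TR", "buys": True},
]
selected = drop_constant_features(rows)
print(selected)
```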

Data Mining:

Classification: Finding a model that predicts the value of a class attribute from the values of the other attributes. (An example class attribute: CustomerBuysProduct (bool))

Different methods can be used for classification:

Decision Trees: Builds a tree of attribute tests from the training data and classifies new data by following the tree from the root to a leaf.
Rule-Based Classification: Deduces rules from the data (e.g. if X = Y and Z < T, then the result is W).
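Bayesian Classification: Uses prior probabilities estimated from the training data, together with Bayes' theorem, to choose the most probable class.
K-Nearest Neighbor Classification: Classifies new data by the classes of the k training records closest to it, according to a distance measure.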
...
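To make one of the methods above concrete, here is a minimal k-nearest neighbor sketch. It assumes numeric features, Euclidean distance and a simple majority vote; the training data (age, income in thousands) and the function name knn_classify are invented for illustration, with the label playing the role of CustomerBuysProduct.

```python
import math
from collections import Counter


def knn_classify(training, new_point, k=3):
    """Return the majority label among the k training points nearest to new_point.

    training is a list of (features, label) pairs.
    """
    by_distance = sorted(training, key=lambda item: math.dist(item[0], new_point))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: (age, income in thousands) -> CustomerBuysProduct
training = [
    ((25, 20), False),
    ((30, 25), False),
    ((45, 80), True),
    ((50, 90), True),
    ((48, 75), True),
]
prediction = knn_classify(training, (47, 85), k=3)
print(prediction)  # True
```

In practice the features would usually be normalized first, so that an attribute with a large numeric range does not dominate the distance.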

Clustering: Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Different methods can be used for clustering:

K-means Clustering: Splits data into a number of clusters (k) chosen in advance, assigning each object to the cluster whose mean is nearest.
Hierarchical Clustering: Produces a set of nested clusters organized as a hierarchical tree.
...
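A rough sketch of k-means follows, assuming numeric points, Euclidean distance and initial centroids supplied by the caller (real implementations usually pick them randomly or with a seeding heuristic). The names k_means and initial_centroids are illustrative only.

```python
import math


def k_means(points, initial_centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing means."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = []
        for centroid, cluster in zip(centroids, clusters):
            if cluster:
                mean = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroid)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: the centroids no longer move
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two visible groups; k = 2 is chosen in advance.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, initial_centroids=[(1, 1), (8, 8)])
print(clusters)
```

The number of clusters is fixed by the length of initial_centroids, which reflects the point that k has to be known beforehand.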

Association (Rule) Discovery: Producing dependency rules that predict the occurrence of a feature (i.e. attribute) of the data based on the occurrences of other features.
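Candidate rules are commonly judged by two measures that the definition above does not name: support (how often the whole rule occurs in the data) and confidence (how often the consequence holds when the condition holds). The sketch below computes both for a single rule over hypothetical shopping baskets; it does not search for rules the way algorithms such as Apriori do.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    if not transactions:
        raise ValueError("transactions must not be empty")
    antecedent, consequent = set(antecedent), set(consequent)
    # Transactions that contain every item of the antecedent.
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    # Of those, the transactions that also contain every item of the consequent.
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
support, confidence = rule_stats(baskets, {"bread"}, {"milk"})
print(support, confidence)
```

Here the rule "bread implies milk" appears in 2 of 4 baskets (support 0.5) and holds in 2 of the 3 baskets that contain bread.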

Pattern Discovery: Deducing patterns from the results of classification, clustering, association discovery, etc.

Postprocessing: Evaluating and selecting interesting patterns, interpreting and visualizing them as an information report.

3 Responses to A Theoretical Introduction to Data Mining

Kasper Sørensen says:

12/16/2010 7:47 AM

Nice sum-up of a lot of otherwise confusing or hard-to-grasp concepts. Also I would like to add data profiling as a discipline in-between preprocessing and data mining. In general profiling is about applying standard metrics to help you discover where to look when you want to do a deeper analysis or processing of data.

loginworks says:

12/12/2012 2:04 AM

Data mining can be defined as the process of harvesting and discovering useful and valuable information by analyzing enormous amounts of data found in databases, websites or data warehouses, using techniques from fields such as artificial intelligence, statistics and machine learning. It is a relatively new and promising technology.
