## Abstract

Normalization of feature vectors of datasets is widely used in a number of fields of data mining, in particular in cluster analysis, where it is used to prevent features with large numerical values from dominating in distance-based objective functions. In this study, a unified statistical approach to normalization of all attributes of mixed databases, when different metrics are used for numerical and categorical data, is proposed. After the proposed normalization, the contributions of both numerical and categorical attributes to a specified objective function are statistically the same. Formulae for the statistically normalized Minkowski mixed *p*-metrics are given in an explicit way. It is shown that the classic *z*-score standardization and the min–max normalization are particular cases of the statistical normalization, when the objective function is, respectively, based on the Euclidean or the Tchebycheff (Chebyshev) metrics. Finally, clustering of several benchmark datasets is performed with non-normalized and introduced normalized mixed metrics using either the *k*-prototypes (for *p*=2) or another algorithm (for *p*≠2).

## 1. Introduction

Data mining (DM) is a process of extracting relations and patterns from data; therefore, DM transforms raw collections of data into information. It is well known (Jain & Dubes 1988; Cios *et al.* 2007) that data can have diverse formats and can be stored through a variety of different storage models. In a number of fields of machine intelligence, e.g. in texture clustering, image retrieval, speech recognition and clustering of databases, an object is often represented by a vector variable, namely the feature vector **A**. The collection of objects described by the same features is called a dataset. The goal of clustering is to discover similarities and patterns within a large dataset by splitting data into clusters (groups). Because it is assumed that the data are unlabelled, clustering is often considered as the most important unsupervised learning problem (Cios *et al.* 2007).

Depending on the clustering goals, different formalization and different methods for clustering are involved (Mirkin 2005). Clustering techniques can be divided into three main categories: hierarchical clustering (HC); model-based clustering; and objective function-based clustering (Cios *et al.* 2007; Gan *et al.* 2007). HC is based on creating a hierarchical (multi-level) decomposition of the set of data points using some criteria or models. The clustering results are represented in a form of a graph (dendrogram). The standard HC methods can handle data with numeric and categorical values. However, it is generally accepted (Jain & Dubes 1988; Cios *et al.* 2007) that the quadratic computational cost makes HC unacceptable for clustering large datasets.

In model-based clustering methods, it is assumed that the data are generated by a mixture of underlying probability distributions in which each component represents a different cluster (Fraley & Raftery 2002; Cios *et al.* 2007). The expectation–maximization (EM) algorithm is a popular approach in model-based clustering when the observations can be viewed as incomplete data (Dempster *et al.* 1977). It admits both categorical and continuous attributes. The distance-based techniques, including *k*-prototypes; and model-based techniques, including the EM algorithm, have their advantages and drawbacks. For example, the *k*-prototypes algorithm is computationally easy, fast and memory-efficient; however, its procedure does not guarantee the convergence to a global extremum (Mirkin 2005) and it works well mainly when resulting clusters are close to hyperspherical in shape. The model-based clustering is more flexible and EM algorithms do not require the specification of distance measures; however, there are various reasons that may make the EM algorithm not practical for models with very large numbers of components and its rate of convergence can be very slow (Fraley & Raftery 2002). Because the distance-based techniques, including *k*-prototypes, are currently the major clustering techniques, this study will deal only with distance-based objective function clustering.

Normalization (standardization) of feature vectors is a long-standing problem of DM. Normalization of attributes has been discussed by many authors (Kaufman & Rousseeuw 2005; Larose 2005; Gan *et al.* 2007; Kamath 2009). A direct application of geometric measures (distances) to attributes with large ranges will implicitly assign bigger contributions to the metrics than the application to attributes with small ranges. In addition, the attributes should be dimensionless because the numerical values of the ranges of dimensional attributes depend on the units of measurements and, therefore, the choice of the units of measurements may greatly affect the results of clustering. Hence, one should not use distance measures such as the Euclidean distance without normalization of data (Aksoy & Haralick 2001; Larose 2005). In fact, two kinds of normalizations have been discussed in the overwhelming majority of papers, namely the min–max normalization and *z*-score standardization. Aksoy & Haralick (2001) gave a review of normalization techniques that have to be applied to numerical data before conducting a cluster analysis. In essence, the goal of all normalization procedures reviewed was the normalization of each feature component to the [0,1] range, i.e. the min–max normalization. Furthermore, one needs to apply normalization not only to numerical attributes but also to categorical attributes. As noted by Kamath (2009), normalization is not a trivial task, nevertheless it has to be discussed because ‘in the real world, dirty data sets need cleaning; raw data need to be normalized; outliers need to be checked’ (Larose 2005, p. xiii). Normalization relies on the use of various mathematical concepts, and, hence, it is important to develop appropriate mathematical tools for this procedure.

In general, unsupervised DM with no *a priori* information about the importance of attributes should follow the ‘principle of equal importance of features’ (Mirkin 2005, p. 218). Therefore, normalization should give all attributes equal influence on characterizing overall dissimilarity between pairs of objects (Hastie *et al.* 2001; Pham *et al.* 2006). However, Hastie *et al.* (2001), after introducing a correct interpretation of the normalization procedure, gave an example where standardization obscured the two well-separated groups. They argued that giving all attributes equal influence in this case would ‘tend to obscure the groups to the point where a clustering algorithm cannot uncover them’ (Hastie *et al.* 2001, p. 458). In certain cases, clustering without normalization has given good results (Jain & Dubes 1988; Hastie *et al.* 2001). However, the fact that the attributes had proper weights was purely by chance.

A statistical approach for normalization of mixed metrics will be presented in this study. Although statistical approaches to DM are very popular and it is well known that cluster analysis can be based on probability and statistical models (Bock 1996; Mirkin 1996; Fraley & Raftery 2002), these statistical approaches were not targeted to normalization of metrics. Almost no papers were devoted to discussion of normalization of categorical and mixed datasets. For example, the *k*-prototypes algorithm was applied to a non-normalized metric by Huang (1998). Although Larose (2005) underlined that when measuring distance the numerical attribute values should be normalized, he suggested to use for categorical attributes the matching dissimilarity measure without normalization or as he mentioned, ‘perhaps, when mixing categorical and continuous variables, the min–max normalization may be preferred’ (Larose 2005, p. 101). It seems to the authors that only Mirkin (1996, 1997, 1998, 2005) discussed standardization of mixed features based on their contributions to the quadratic data scatter. He said that ‘with feature rescaling, feature scales become balanced according to the principle of equal importance of each feature brought into the data table’ (Mirkin 2005, p. 64). In spite of Mirkin's work, normalization was neither mentioned in application to mixed data (Huang 1998; Bachner 2000) nor used in heuristic mixed metrics (Gibert & Cortés 1997). In 1998, Mirkin stated that methods for analysis of data in mixed feature space are still an issue.

It is natural to follow the principle of equal importance of features and assume that the average contribution of the *j*th feature component to the total similarity measure is equal to its mean, and, therefore, the goal of a normalization procedure is the equalization of the attribute contributions to the measure. If it is known *a priori* that some attributes are irrelevant to the problem under consideration, then they can be removed from the feature vector. A unified statistical treatment to both numerical and categorical features of mixed datasets is used in this study. New normalized metrics are introduced such that the average contributions of all attributes to the measures are equal to each other from a statistical point of view. Although this idea has been recently discussed in the literature (Hastie *et al.* 2001; Pham *et al.* 2006), nothing was said about the statistical consistency of the proposed estimators. In addition, Hastie *et al.* (2001) used biased estimators. Finally, the previously proposed approaches were not applicable to some metrics. Here, mathematically rigorous treatment of the normalization procedure is presented and examples of normalized metrics (e.g. the Minkowski and Tchebycheff metrics) are given in an explicit way. Besides the proposed approach is extended to the case of mixed metrics, i.e. when different metrics are used for numerical and categorical data, respectively. Advantages of the introduced normalized metrics are demonstrated on examples when the *k*-prototypes algorithm is applied to both non-normalized and properly normalized mixed metrics. It will be shown that the accuracy usually increases when clustering is performed using normalized metrics. The questions concerning the mathematical proof of the consistency and unbiasedness of the proposed estimators are discussed elsewhere (Suarez-Alvarez 2010).

## 2. Preliminaries

### (a) Mathematical representation of datasets

Datasets may be stored as flat files. Flat (rectangular) files are the most common way to store the datasets and further only flat files are considered. The rows represent objects (also known as records, individuals, patterns, data points) and the columns represent features. In application to databases, the features are called attributes, and hence **A** can be defined as a set of attributes, **A**=(*A*_{1},*A*_{2},…*A*_{m+l}). In many traditional applications, it is assumed usually that all the features are the same type. However, real-life datasets are often mixed; they consist of both numerical and categorical types, i.e. each attribute can take on a finite (categorical attribute) or infinite (continuous) number of possible values (numerical attribute). Only these two data types of attributes are considered here because other types of attributes can be mapped to these two types. In this study, it is accepted the term *categorical* as a general term that can be split into nominal and ordinal. If categorical variables have ordered scales they are called ordinal variables, whereas the variables having no ordered scales are called nominal variables. Binary variables are considered here as categorical.

From the mathematical point of view, the vector of features **A** for mixed data can be split into **A**=(**A**^{n},**A**^{c}), namely the vector of numerical features and the vector of categorical features . For numerical or quantitative features, the feature domain Dom(*A*_{j}) can be represented on the real line, whereas for categorical features (sometimes these features are also called nominal or qualitative), the domain is a finite set of different states. Evidently, categorical features may be represented by numerical codes of possible different states of the feature.

A database can be represented as a matrix of size *N*×(*m*+*l*) where *N* is the number of records, and (*m*+*l*) is the total number of attributes, i.e. the *i*th row of the matrix represents the *i*th record of the database (1≤*i*≤*N*). This row is a vector (*x*_{i1},…,*x*_{im},*y*_{i1},…,*y*_{il}), whose values *x*_{i1},…,*x*_{im} are numerical, whereas the values *y*_{i1},…,*y*_{il} are categorical.

### (b) Clustering, objective functions and similarity measures

A very general category of clustering is concerned with building partitions (clusters) of datasets on the basis of some performance index known also as an objective or cost function. This is a function associated with an optimization problem where the best element from some set of available alternatives is chosen to minimize or maximize the function (Zhigljavsky 2008). The value of this function determines how good the chosen solution is. There are various clustering algorithms for objective function-based clustering (MacQueen 1967; Huang 1997; Pham *et al.* 2011) because it is practically unfeasible to find a global optimum for the objective function by considering all possible combinations of elements (exhaustive search).

Here, one needs to distinguish hard and fuzzy cluster methods. The clustering is hard if the obtained groups satisfy the following requirements of a partition (i) each group must contain at least one object and (ii) each object must belong to exactly one group (Kaufman & Rousseeuw 2005). In this case, data are divided into distinct clusters, where each data element belongs to exactly one cluster. In fuzzy clustering, data elements can belong to more than one cluster, and associated with each element is a set of membership levels (Bezdek 1980).

Any measure of the degree of closeness is called similarity measure. In clustering analysis of numerical datasets, it is very common to calculate the similarity or dissimilarity between two feature vectors **x**_{1}=(*x*_{11},…,*x*_{1m}) and **x**_{2}=(*x*_{21},…,*x*_{2m}) using a square distance measure. Indeed, it is very natural to use the Euclidean metric (distance) *ρ*_{E} (or *L*_{2} metric)
2.1as a measure for continuous numerical features because this metric is in everyday use. Often, other similarity measures are also used. For example, the city block distance (or *L*_{1} metric)
2.2the Minkowski distance *ρ*_{p} (or *L*_{p} metric)
2.3where *p* is a positive number, ; and the Tchebycheff (Chebyshev) or maximum norm metric
2.4The Euclidean metric and city block distance are particular cases of the Minkowski metric for *p*=2 and *p*=1, respectively. The Tchebycheff metric can be obtained from the Minkowski metric as the following limit . Other metrics are also applied to numerical datasets. However, there are no such natural similarity measures for categorical data and for mixed (numeric and categorical) data. Therefore, two different similarity measures are often combined for clustering of mixed data (Gibert & Cortés 1997).

#### (i) Objective functions

If the objective (cost) function *J* is based on the Euclidean distance, then the *k*-means algorithm (MacQueen 1967) can be used in application to numerical data, while an extension of this algorithm, namely the *k*-prototypes algorithm (Huang 1997) can be used in application to mixed data. These clustering algorithms minimize the cost function (Bezdek 1980; Huang 1997) for ‘hard’ (non-fuzzy) *k*-partition of a dataset into *k* clusters
2.5Here, *ρ*_{E} is a similarity metric and *u*_{ij} is an element of the partition matrix. The condition *u*_{ij}=1 means that the record *A*_{i} is assigned to cluster *j* with prototype *Q*_{j}.

If the cost function *J* is based on the general case of the Minkowski distance, then the *k*-means algorithm is not applicable and instead of the above standard cost function , the cost function will be used
2.6Here, *ρ*_{p} is the Minkowski mixed *p*-metric.

#### (ii) Statistical treatment of feature vectors

With geometric similarity measures, usually no assumption is made about the probability distribution of the attributes and similarity (dissimilarity) is based on the distances between feature vectors in the feature space (Aksoy & Haralick 2001). However, each record (the row) of a database may be regarded as a random sample of a population under consideration (Mirkin 2005), i.e. one has a database of *N* observations (samples) and each sample (record) is a realization of possible values of the feature vector **A**. Hence, it was suggested (Chen 1973) to use a weighted Euclidean distance that may take into account the dispersion of samples within a cluster.

For statistical treatment of feature vectors, one needs to know the probability distributions of their attributes. For a numerical attribute , the probability distribution identifies the probability of the attribute value falling within a particular interval within the range of possible values. For a categorical attribute , the probability distribution identifies the probability of certain states occurring.

It is assumed usually in the literature that each numerical feature has a normal (Gaussian) distribution with mean *μ*_{j} and variance *σ*_{j}. Their values may be estimated using standard statistical methods. However, in the general case, distribution functions are not known in advance and another function may be a better model for the attributes than the Gaussian distribution.

### (c) Classic normalization of numerical attributes

The normalization procedure can be implemented in different ways. The most cited of these approaches normalizes the data by dividing the attribute value *x*_{ij} by its range using scaling with a shift (the min–max normalization)
2.7Here, is the normalized attribute value in the database, and and are the maximum and the minimum values of attribute *A*_{j}, respectively. The results scaled by (2.7) do not depend on the original units of data measurements, and this linear scaling will transform the data to the range [0,1]. However, in the general case, this normalization procedure does not achieve equalization of the attribute means. Hence, the application of the transformation (2.7) for normalization of real-world datasets and consequent clustering using the Minkowski norm (Doherty *et al.* 2004) does not give equal contributions of variables to the similarity measures because the means of the different normalized attributes are not necessarily equal to each other.

For numerical databases with attributes having normal (Gaussian) distributions, the most common normalization procedure is to transform the attribute to a random variable with zero mean and unit variance (the *z*-score standardization) by
2.8where *μ*_{j} and *σ*_{j} are the mean and standard deviation for values of the *j*th attribute , respectively. As it will be shown, this scaling provides equal contributions of variables to the Euclidean similarity measure.

The standard preliminary transformation of the feature columns of arbitrary type of attributes *a*_{ij} can be written as
2.9where *m*_{j} is the grand mean, i.e. the average over the entire entity set, and *b*_{j} ‘it can be either the standard deviation or range or other quantity reflecting the variables spread’ (Mirkin 1997, p. 475, 2005, p. 65). Evidently, (2.8) is a particular case of (2.9). As Mirkin (2005) noted, the standardization (2.9) does not guarantee equal contributions of all features to the results.

In addition to the above methods, Milligan & Cooper (1988) (see also Gan *et al.* 2007) suggested several other ways for standardization of attributes, e.g.
2.10and
2.11However, it will follow from the statistical analysis below that (2.11) and (2.10) do not give equal contributions to appropriate similarity measures.

Aksoy & Haralick (2001) reviewed five normalization methods for numerical data, namely linear scaling to unit range, linear scaling to unit variance, transformation to a uniform [0,1] random variable, rank normalization and normalization by fitting distributions. All these approaches intended to normalize each feature component to the [0,1] range. It was noted that providing all attributes are normally distributed, the probability of the attribute value normalized by (2.8) is in the [−1,1] range is equal to 68 per cent. If one applies an additional shift and rescaling as 2.12then this guarantees 99 per cent of the values to be in the [0,1] range Aksoy & Haralick (2001). However, any shifting of the whole attribute column does not affect the distance metrics (2.1)–(2.2) and, therefore, it has no practical importance for clustering of datasets.

The idea to use a weighted Euclidean distance that may take into account the dispersion of samples within a cluster (e.g. Chen 1973) was generalized for the Minkowski distance by Pham *et al.* (2006), who were not aware that a very similar idea was formulated earlier for a more general case by Hastie *et al.* (2001). However, one has to realize that their estimators were biased. Further, the question concerning the consistency of the proposed estimators was not discussed. Hastie *et al.* (2001) considered as example an only the same case as in Chen (1973), namely the weighted Euclidean distance. On the other hand, there are metrics for which the above approach is not valid. Indeed, if one considers the Tchebycheff metric (2.4) for numerical attributes then the approach suggested by Hastie *et al.* (2001) is not directly applicable. How can one normalize this metric? Finally, if one studies a mixed metric that is a sum of two different metrics (one metric is used for numerical data, whereas another metric is used for categorical data), then their approach is also not applicable.

## 3. Normalization of the feature vectors

This study presents normalized attributes whose average contributions to the measures, regardless of type of attributes, are equal to each other from a statistical point of view.

### (a) Normalization for numerical datasets

#### (i) Normalization of the Minkowski metric

To obtain a new normalized metric in the general case of the Minkowski metric (2.3), one should calculate the mean contribution of each *j*th attribute to the metric *E*|*X*_{1j}−*X*_{2j}|^{p} (here *E* means the expectation of a variable) and divide the attribute in all records by this mean (if the mean is equal to zero then this attribute should be removed from the feature vector). Hence, the normalized Minkowski metric can be introduced in the following way
3.1where *α*_{j}=1/*E*|*X*_{1j}−*X*_{2j}|^{p}, *X*_{1j} and *X*_{2j} are independent random variables whose values are distributed in accordance with the distribution of the *j*th attribute.

In the general case, the distribution of the *j*th attribute is not known in advance, therefore, to estimate the expectation |*X*_{1j}−*X*_{2j}|^{p} one can use the sample mean
3.2

The estimation (3.2) is a biased estimator of *E*|*X*_{1j}−*X*_{2j}|^{p}, hence for small datasets it is better to use the following estimation
3.3that is an unbiased estimator. It can be proved (Suarez-Alvarez 2010) that (3.2) and (3.3) are consistent estimators.

#### (ii) Normalization of the Euclidean metric

For *p*=2, the above-mentioned Minkowski metric (2.3) is the Euclidean metric. Normalization in this case is well known (Pham *et al.* 2006). However, for the sake of completeness, this case is considered here. Because *X*_{1j} and *X*_{2j} are independent random variables having the same distribution, for *p*=2 one obtains , where *σ*_{j} is the standard deviation of the *j*th attribute. Thus, the normalized Euclidean metric has the following form
3.4For estimating in (3.4), it is possible to use the following unbiased estimator of the sample variance
where is the sample mean for the *j*th attribute.

From (3.4) one obtains the known form of normalization of features
where *μ*_{j} is the mean of the *j*th attribute.

#### (iii) Minkowski metrics with normally distributed numerical features

Let us consider the case when one knows in advance that the values of the *j*th attribute are normally distributed. In this case, there is a quite attractive property of *ρ*_{p}. Let be a normal distribution with mean *μ*_{j} and variance
One can calculate now the expectation of the contribution of the *j*th attribute to the Minkowski metric
where

If variables *t*_{1} and *t*_{2} are introduced
then, one has
where
and *Γ* is the Euler gamma function.

Thus, it follows from (3.1) that a new metric *ρ** for a dataset with Minkowski metric and numerical features having a normal distribution with mean *μ*_{j} and variance *σ*_{j} is
3.5where . If *p*=2, then (3.5) is reduced to (3.4).

#### (iv) Normalization of the Tchebycheff metric

The above approach can also be applied to normalize the Tchebycheff metric (2.4) for numerical attributes. Let *X* be a random variable and *p*≥1 then put ∥*X*∥_{p}=(*E*|*X*|^{p})^{1/p} and denote the essential supremum of *X*, i.e.
It will be assumed further that the essential supremum of *X* is finite for all *X* under consideration.

In the case under consideration, **X**_{1j} and **X**_{2j} have the same distribution, and if the distribution function is unknown, then it is estimated using sampling distribution. In the latter case,
3.6

For the normal distribution, the value of equals to and the above-described normalization procedure cannot be used. It can be proved (Suarez-Alvarez 2010) that the normalized Tchebycheff metric is given by 3.7where .

### (b) Datasets with mixed attributes

For datasets with categorical attributes, it is possible to introduce different metrics (Ralambondrainy 1995; Gibert & Cortés 1997; Huang 1998). One of the most cited variants of metrics is studied here (Huang 1998; Gan *et al.* 2007), namely the distance between two categorical feature vectors **y**_{1}=(*y*_{11},…,*y*_{1l}) and **y**_{2}=(*y*_{21},…,*y*_{2l}) is defined as
3.8where
Evidently, the square of the metric (3.8) is
3.9

#### (i) The Euclidean mixed metric

Combining *ρ*_{E} and *ρ*_{cat} for mixed data, one obtains that the square distance between two mixed feature vectors (**x**_{1},**y**_{1}) and (**x**_{2},**y**_{2}) is
3.10where *ρ*^{2}_{E}(**x**_{1},**x**_{2}) is defined by (2.1) and is defined by (3.9).

The same approach that has been applied to numerical features, will be applied here to categorical ones, namely the contribution of each attribute to the distance measure will be divided by the contribution mean. Hence, the normalized mixed metric is defined similarly to (3.1)
3.11where *α*_{j}=1/*E*(*X*_{1j}−*X*_{2j})^{2}, *β*_{j}=1/*Eω*^{2}(*Y* _{1j},*Y* _{2j}) and *Y* _{1j}, *Y* _{2j} are independent random variables whose values are distributed in accordance with the distribution of the th attribute. If the attribute can take *q*_{j} values {*y*_{j1},*y*_{j2},…,*y*_{jqj}} and the probabilities {*p*_{j1},*p*_{j2},…,*p*_{jqj}} of these values are known then
or
Thus, it follows from the above equality and (3.4) that and
3.12

#### (ii) The Minkowski mixed *p*-metric

Let us extend the above approach to the general case of the Minkowski metric. It follows from the Minkowski inequality that the following function is a metric (Suarez-Alvarez 2010)
3.13It will be called the Minkowski mixed *p*-metric.

The normalization of the *p*-metric (3.13) is fulfilled in the same way as the normalization of the Euclidean mixed metric
3.14where *α*_{j}=1/*E*|*X*_{1j}−*X*_{2j}|^{p} and *β*_{j}=1/*Eω*^{p}(*Y* _{1j},*Y* _{2j}). Because *Eω*^{p}(*Y* _{1j},*Y* _{2j})=*Eω*(*Y* _{1j},*Y* _{2j}), *β*_{j} are calculated in the same way as in (3.12).

If the distribution of the attributes is unknown, then to calculate *α*_{j} one can use the estimation (3.3), and to estimate *Eω*(*Y* _{1j},*Y* _{2j}) one can use the sampling mean
3.15

The estimation (3.15) is a biased estimator of *Eω*^{p}(*Y* _{1j},*Y* _{2j}), hence for small databases it is better to use the following estimation
3.16which is an unbiased estimator. It can be proved (Suarez-Alvarez 2010) that (3.15) and (3.16) are consistent estimators.

### (c) A general algorithm for normalization of mixed metrics

Let a metric *ρ* be a sum of two metrics *ρ*_{1} and *ρ*_{2}
First, one needs to normalize metrics *ρ*_{1} and *ρ*_{2}, i.e. one needs to find and . Then, the metric *ρ* is normalized by
Here, and , *m* and *l* are the number of attributes in vectors **X**_{1} and **Y**_{1}, respectively. The normalization of the sum of an arbitrary number of metrics can be fulfilled in the same way.

As an example, let us consider the case of a mixed metric that is the sum of the Tchebycheff metric applied to the numerical attributes and the metric *ρ*_{2}=*ρ*_{cat}(**y**_{1},**y**_{2}) applied to the categorical attributes.

One can normalize this mixed metric using (3.7) and (3.11), namely
where *β*_{j}=1/*Eω*(*Y* _{1j},*Y* _{2j}).

Thus, in accordance with the described algorithm, the normalized mixed metric *ρ** is
where and *β*=*l*/*E*[*ρ*^{*}_{cat}(**Y**_{1},**Y**_{2})].

## 4. Applications to datasets

The above-mentioned methods will be applied to several datasets from the University of California, Irvine repository (Asuncion & Newman 2007). All records in those datasets have class labels and, hence, ‘true clustering’ can be checked.

### (a) Accuracy of the clustering algorithm

The traditional statistical approaches to data clustering validation are based on the use of Rand index or its modifications that give measures of classification agreement between two partitions of the same set of objects. The classic definition is the following (Rand 1971).

Let us consider a set *S* of *N* elements, and two partitions **C**={*C*_{1},…,*C*_{k}} and **D**={*D*_{1},…,*D*_{k}} of the dataset. To calculate the Rand index, one needs first to calculate the following numbers: *a* is the number of pairs of elements in *S* that are in the same set in **C** and in the same set in **D**; *b* is the number of pairs of elements in *S* that are in different sets in **C** and in different sets in **D**; *c* is the number of pairs of elements in *S* that are in the same set in **C** and in different sets in **D**; and *d* is the number of pairs of elements in *S* that are in different sets in **C** and in the same set in **D**. Then, the Rand index, (*R*), is calculated as
The above formula is used by many researchers. It is accepted that the larger the Rand index, the higher the accuracy of the clustering. However, for datasets whose inherent structures are known in advance, other methods for checking the accuracy of the proposed clustering algorithm with normalization may be used.

Let us consider a dataset having a categorical attribute *A* that may have *k* different states {*a*_{1},*a*_{2},…,*a*_{k}} that may be associated with labels of the clusters. The inherent structure of the dataset is associated with the states (labels) of this attribute. The clustering algorithm will map the records to a discrete set of labels (classes).

It is proposed to perform the normalization procedure of the dataset as described above and then to apply the clustering algorithm. After clustering, each record will belong to a cluster with a corresponding number *i*. Let us assign a state *a*_{φ(i)} of the attribute *A*={*a*_{1},*a*_{2},…,*a*_{k}} to the *i*th cluster. Evidently, different clusters should have different states of the attribute *A*. Let us denote by *n*_{i,j} the number of records with the attribute *A*=*a*_{j} that belongs to the *i*th cluster.

For a given *φ*, one can estimate the accuracy (*Acc*(*φ*)) of the clustering as
where *n*_{i,φ(i)} is the number of records of the *i*th cluster whose state of the attribute *A* is the same as the assigned *a*_{φ(i)} (Suarez-Alvarez 2010).

The clustering accuracy is defined as the maximum of *Acc*(*φ*) for all possible *φ*
Evidently, the closer is Acc to 1 the less is the difference between the partitioning of the data after clustering and the partitioning of the data associated with the attribute *A*. If *Acc*=1, then both partitioning into classes are the same.

Thus, one need to solve the assignment problem with an efficiency matrix *n*_{i,j}, (*i*,*j*=1,…,*k*) in order to find the clustering accuracy Acc.

### (b) Objective function based on the Euclidean distance

Here, a particular case of objective function (2.5) has been used in application to particular datasets, where *ρ* is defined by (3.10).

First, the clustering procedure has been applied to the dataset under consideration without normalization of the data. Then, the clustering procedure with normalization of all attributes has been applied to the dataset. For each run, the initial cluster prototypes have been generated randomly. Both procedures with and without normalization have been applied 100 times to the dataset.

The clustering accuracy and the Rand index values have to be taken, not as the best values of the accuracy or the Rand index, but in accordance with the obtained best objective function value because this is the criterion of the unsupervised objective function-based clustering.

#### (i) Soybean disease dataset

This set has 47 records (*N*=47) with 35 attributes. Each record is attributed to one of four diseases. This is a standard categorical dataset that was studied a number of times to test clustering algorithms (Michalski & Stepp 1983; Huang 1998).

Table 1 presents the cost function, the clustering accuracy and the number of attempts (*N*_{*}) whose outcome had the corresponding accuracy. Because the dataset is quite small, the ‘true clustering’ (*Acc*=1) has been obtained quite often in both cases. *Acc*=1 and *R*=1 have been obtained in 22 per cent after clustering without normalization and in 58 per cent after clustering with normalization. The average accuracy in both cases has been 0.8636 and 0.9077, respectively, for the former and the latter cases.

#### (ii) Wine dataset

This set has 178 records (*N*=178) with 14 attributes. The class attribute takes three categorical values, whereas the rest of the attributes are numerical. Although the dataset is quite small, the ‘true clustering’ (*Acc*=1 or *R*=1) has not been obtained (table 2). The values of the clustering accuracy corresponding to the obtained best objective function values are 0.702 (without normalization) and 0.966 (with normalization). After application of the normalization procedure to the dataset, the Rand index has increased from 0.7187 to 0.9543.

#### (iii) Heart diseases dataset

This set has 270 records (*N*=270) with 14 attributes. There are seven categorical attributes, six numerical and the class attribute. The ‘true clustering’ (*Acc*=1) has not been obtained (table 3). The values of the clustering accuracy corresponding to the obtained best objective function values is 0.593 (without normalization) and 0.804 (with normalization). After application of the normalization procedure to the dataset the Rand index has increased from 0.5154 to 0.6833.

#### (iv) Credit approval dataset

This set has 690 records (*N*=690) with 16 attributes: six attributes are numerical, whereas the rest of the attributes are categorical. The results of all 100 runs of the procedure without normalization have been the same and therefore equal to the average accuracy 0.55 (table 4). After application of the normalization procedure to the dataset, the clustering accuracy corresponding to the obtained best objective function value is 0.80, and the Rand index has increased from 0.5048 to 0.6806.

### (c) Clustering with objective functions based on the Minkowski distance

Here, the cost function *J* defined by (2.6) with the Minkowski mixed *p*-metric (3.13) has been used. The algorithm is based on ideas that were suggested by Miyamoto & Agusta (1998) and Hathaway *et al.* (2000) as a generalization of the fuzzy clustering strategies using *L*_{p} norm distances. The algorithm was described by Suarez-Alvarez (2010). The novelty of the present approach is that the algorithm has been developed for hard clustering using the Minkowski mixed *p*-metric. This algorithm has been used instead of the *k*-prototypes algorithm for the cases when *p*≠2. The algorithm has been applied 100 times to each dataset without and with normalization of attributes for various values of the Minkowski power *p*.

#### (i) Adult dataset

This set, also known as Census Income dataset, has 48 842 records and 30 162 records without missing values (*N*=30 162) with 14 attributes and one class attribute. Each record has six numerical attributes, whereas the rest of the attributes are categorical including a class attribute. Table 5 presents the Acc values corresponding to the best value of the objective function. The Acc values vary considerably with variation of *p*. In clustering without normalization, the best value has been obtained for *p*=5 and it is *Acc*=0.744. After normalization, the accuracy is better and its best value has been obtained for *p*=3÷5 and it is *Acc*=0.756.

#### (ii) Shuttle dataset

The full Shuttle dataset, also known as Statlog (Shuttle) dataset, has *N*=14 500 records with nine numerical attributes and one class attribute. As one can see from table 6, the Acc values vary considerably with variation of *p*. In clustering without normalization, the best value has been obtained for *p*=2.5 and it is *Acc*=0.829. After the runs of the procedure with normalization, the best accuracy increased to *Acc*=0.852 for *p*=3. One can see that it is more advantageous to apply the general Minkowski metrics to the Shuttle dataset than a particular case *p*=2 (the Euclidean metric).

Thus, the new normalized metrics has been used for clustering mixed data. It has been shown that when clustering has been performed using normalized metrics, the accuracy has increased in all considered examples. These examples have demonstrated the advantages of the introduced normalized metrics. Other examples of clustering of various databases with and without normalization of attributes were given by Suarez-Alvarez (2010).

Amorim & Mirkin (2012) have published an alternative algorithm for clustering of datasets, the so-called intelligent Minkowski metric weighted *k*-means (iMWK-means) algorithm that is a development of the weighted *k*-means (WK-means) introduced by Huang *et al.* (2005) and the ‘intelligent’ version of *k*-means (Mirkin 2005) algorithms. Similar to the WK-means algorithm that prescribes the specific weights to all features that minimize the objective function, the iMWK-means algorithm minimizes the cost function where metric is taken between rescaled cluster points and prototypes. Because the described statistical normalization procedure can be treated as a kind of weighting, one could expect that the iMWK-means algorithm would always overperform the algorithm that uses just the statistical normalization. In fact, the iMWK-means algorithm shows better results in application to several benchmark datasets, e.g. it gives *Acc*=0.8407 for *p*=2.7 in application to the Heart disease dataset (Asuncion & Newman 2007), whereas for the same dataset the best result of the algorithm with statistical normalization is *Acc*=0.8185 for *p*=1 and *p*=3.5. Both algorithms gave the same result *Acc*=0.8037 for *p*=2. However, for the Pima Indian Diabetes dataset (Asuncion & Newman 2007), the algorithm with statistical normalization shows slightly better result *Acc*=0.7083 for *p*=3 than *Acc*=0.6940 for *p*=4.9 shown by the iMWK-means algorithm. For the above-mentioned Wine dataset, the iMWK-means algorithm shows *Acc*=0.949 for *p*=1.2, *Acc*=0.921 for *p*=2 and *Acc*=0.938 for *p*=3, whereas the algorithm with statistical normalization shows *Acc*=0.955 for *p*=1.5, *Acc*=0.9663 for *p*=2 and *Acc*=0.927 for *p*=3 (table 7). The detailed comparison of performances of these algorithms and detailed analysis of the results are out of the scope of this study. It looks quite natural to apply the weighting procedure to metrics that have already been normalized by the above-described procedure. If one knows *a priori* that some attributes have larger contributions to similarity measures than the rest of the attributes then this can be taken into account by appropriate weighting.

## 5. Conclusion

In this study, a mixed database is treated as a random sample of an object under consideration. In the overwhelming majority of the earlier approaches to normalization, scaling was used for the Euclidean metric and/or the normal distribution of the variables in order to assure the values to be in the [0,1] range. However, it has been shown that, in general, this does not provide equal contributions of the features to the metrics.

A unified statistical approach has been applied to normalization of all attributes of the feature vectors of mixed datasets that assures that the means of the different normalized attributes are equal to each other and therefore, these variables give equal contributions to the similarity measures. The presented approach extends the ideas considered earlier (Hastie *et al.* 2001; Pham *et al.* 2006). The normalization is achieved by scaling the numerical attributes, whereas the categorical attributes are normalized by appropriate choice of their weights according to described statistical procedure. Normalized Minkowski metrics and metrics for mixed datasets have been introduced in an explicit way.

Following the general principle of equal importance of features, it has been shown that the classic min–max normalization (2.7) of numerical attributes is the proper one from a statistical point of view in the case of the Tchebycheff metric, whereas the *z*-score normalization should be used in the case of the Euclidean metric.

The main idea of normalization that has been developed in this study is valid not only for the above mentioned specific metrics, but also in application to any other measure. Hence, a general procedure for normalization of mixed metrics has been introduced.

The introduced normalized metrics have been applied to various datasets. Results on benchmark datasets are presented together with a comparison with other approaches. Clustering has been performed with non-normalized and introduced normalized mixed metrics using either the *k*-prototypes (for *p*=2) or another algorithm (for *p*≠2). It has been observed that the accuracy has increased in all considered examples when clustering has been performed using normalized metrics. It has also been shown on examples that sometimes it is more advantageous to use objective functions based on the general Minkowski metrics than on the Euclidean metric.

## Acknowledgements

We thank Prof. Feodor M. Borodich (Cardiff University) for his valuable comments on the study.

- Received December 1, 2011.
- Accepted March 20, 2012.

- This journal is © 2012 The Royal Society