Extensions to the k-means algorithm for clustering large data sets with categorical values

Abstract

Data Mining and Knowledge Discovery 2, 283–304 (1998). © 1998 Kluwer Academic Publishers. Manufactured in The Netherlands.

Zhexue Huang, ACSYS CRC, CSIRO Mathematical and Information Sciences, GPO Box 664, Canberra, ACT 2601, Australia. huang@mip.com.au

The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to …
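The three ingredients the abstract names for k-modes (simple matching dissimilarity, modes in place of means, and a frequency-based mode update) can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the paper's exact procedure: it reassigns all objects and then recomputes modes in batch, and the function names, the `max_iter` parameter, and the caller-supplied initial modes are hypothetical choices. The last function sketches a combined measure for mixed data as squared Euclidean distance on numeric attributes plus a weight `gamma` times the matching count on categorical attributes. The abstract does not spell out this form, so treat it as an assumption.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Simple matching: number of attributes on which two objects differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_of(cluster):
    """Frequency-based mode: most frequent category in each attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*cluster))


def k_modes(data, initial_modes, max_iter=100):
    """Batch-style k-modes sketch (assumption: modes are updated once per pass)."""
    modes = list(initial_modes)
    labels = None
    for _ in range(max_iter):
        # Assign every object to its nearest mode under simple matching.
        new_labels = [
            min(range(len(modes)), key=lambda j: matching_dissimilarity(obj, modes[j]))
            for obj in data
        ]
        if new_labels == labels:
            break  # no object changed cluster, so the partition is stable
        labels = new_labels
        # Recompute each mode from the members currently assigned to it.
        for j in range(len(modes)):
            members = [obj for obj, label in zip(data, labels) if label == j]
            if members:  # keep the old mode if a cluster becomes empty
                modes[j] = mode_of(members)
    return labels, modes


def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma):
    """Assumed combined measure: squared Euclidean part plus gamma * mismatches."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    return numeric_part + gamma * matching_dissimilarity(x_cat, proto_cat)
```

In this sketch `gamma` controls how strongly categorical mismatches count relative to numeric distance; with `gamma = 0` the measure reduces to the squared Euclidean distance that k-means uses.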

Document Metadata

Issuer: Springer Science and Business Media LLC
Document Type: Research / Academic Paper
Publication Year: 1998
Retrieved: 5 May 2026
Source: doi.org
Record ID: XF16NAJDSW
Validation: Inferred by XFID

Topics

Machine Learning

Cited by (1)

Other RESEARCH documents in the registry that cite this work.

Pricing of Green Bonds: Drivers and Dynamics of the Greenium (2022)

How to Cite This Record

Use the XFID in citations to create a stable, permanent reference that resolves to this registry entry regardless of the source URL.

Academic / report citation

Springer Science and Business Media LLC (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. XFID: XF-16NAJDS-W. Retrieved from https://xframework.id/XF16NAJDSW

Identifier only

XF-16NAJDS-W