K-means clustering algorithm in R

Cluster analysis, also known as unsupervised learning, is the process of placing a set of objects into groups. These groups are called clusters, and they are formed so that objects in the same group are more similar to each other than to objects in other groups. The objective of clustering is to discover the intrinsic grouping in a set of unlabeled data. Clustering algorithms can be classified according to their cluster model; the basic models are connectivity, centroid, distribution, density, subspace and group models. Well-known algorithms for these cluster models include hierarchical clustering (connectivity based), k-means clustering (centroid based), the Expectation-Maximization algorithm (distribution based) and DBSCAN (density based).
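As a quick illustration, the sketch below tries some of these cluster models with functions that ship with R (hclust and kmeans from the stats package). The DBSCAN and mixture-model lines assume the third-party dbscan and mclust packages, and the iris data set and k = 3 are only example choices, not part of the original post.

data(iris)
x <- as.matrix(iris[, 1:4])            # numeric features only

hc <- hclust(dist(x))                  # connectivity based: hierarchical clustering
hc_labels <- cutree(hc, k = 3)         # cut the dendrogram into 3 clusters

km <- kmeans(x, centers = 3)           # centroid based: k-means
km_labels <- km$cluster

# install.packages("dbscan")           # density based: DBSCAN (third-party package)
# db <- dbscan::dbscan(x, eps = 0.5, minPts = 5)

# library(mclust)                      # distribution based: EM / Gaussian mixtures (third-party package)
# mc <- Mclust(x, G = 3)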
The k-means algorithm is one of the most popular basic clustering algorithms. The main idea is to define k centroids, one for each cluster, and to assign each point of the data set to its nearest centroid. The k centroids are then recalculated as the means of the clusters produced in the previous step, and these two steps are repeated in a loop. An implementation of the k-means algorithm in R is given below. Note that this implementation follows the example in the book Programming Collective Intelligence.
The kcluster function returns a list of length k, where element i contains the row indices of the observations assigned to cluster i.


kcluster <- function(data_matrix, k) {
  numOfRow <- nrow(data_matrix)
  numOfCol <- ncol(data_matrix)

  # minVal and maxVal are vectors of the column-wise minimum and maximum values
  minVal <- as.vector(apply(data_matrix, 2, min))
  maxVal <- as.vector(apply(data_matrix, 2, max))
  # difference is the column-wise range (max - min)
  difference <- as.vector(maxVal - minVal)

  # Create k randomly placed centroids inside the range of the data
  ClusterList <- list()
  for (j in 1:k) {
    # scale each column range by a uniform draw on (0,1) and shift by the column minimum
    ClusterList[[j]] <- minVal + difference * runif(numOfCol, 0, 1)
  }

  for (iter in 1:100) {
    # print(paste('Iteration', iter))
    bestMatches <- vector("list", k)

    # Find which centroid is the closest to each row (Euclidean distance)
    for (j in 1:numOfRow) {
      CurrentRow <- as.numeric(data_matrix[j, ])
      bestMatch <- 1
      for (cl in 1:k) {
        dist_cl   <- sqrt(sum((ClusterList[[cl]] - CurrentRow) ^ 2))
        dist_best <- sqrt(sum((ClusterList[[bestMatch]] - CurrentRow) ^ 2))
        if (dist_cl < dist_best)
          bestMatch <- cl
      }
      # record the row index under its closest centroid
      bestMatches[[bestMatch]] <- c(bestMatches[[bestMatch]], j)
    }

    # Move each centroid to the average of its members
    for (cl in 1:k) {
      members <- bestMatches[[cl]]
      if (length(members) > 0) {
        avgs <- rep(0, numOfCol)
        for (index in members) {
          avgs <- avgs + as.numeric(data_matrix[index, ])
        }
        ClusterList[[cl]] <- avgs / length(members)
      }
    }
  }
  return(bestMatches)
}
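A minimal usage sketch; the iris data set, the choice k = 3 and the set.seed call are only illustrative and not part of the original post.

data(iris)
x <- as.matrix(iris[, 1:4])       # numeric columns of iris

set.seed(42)                      # centroids are initialized at random positions
clusters <- kcluster(x, k = 3)

# clusters[[i]] holds the row indices assigned to cluster i
sapply(clusters, length)          # number of observations per cluster
head(clusters[[1]])               # first few members of the first cluster

Because the initial centroids are drawn uniformly within the range of the data, different runs can give different partitions (and occasionally an empty cluster), so the result is best compared across several random starts or against the built-in kmeans function.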

Reference: Segaran, T., Programming Collective Intelligence, O’Reilly Media, 2007.