
In this example – as I’m wanting to create three clusters, then I will need three starting points. Figure 4įor k-means clustering you typically pick some random cases (starting points or seeds) to get the analysis started. Please note that you can use this Excel approach to identify as many clusters as you like – just follow the same concept as explained below. Yes, there are four clusters evident in the diagram above, but that only looks at two of the variables. Step Three – Calculate the distance from each data point to the center of a clusterįor this walk-through example, let’s assume that we want to identify three segments/clusters only.
#How to use xlminer in excel for k clustering how to
As I have suggested, a good approach when there are only two variables to consider – but is this case we have three variables (and you could have more), so this visual approach will only work for basic data sets – so now let’s look at how to do the Excel calculation for k-means clustering. With this next graph, I have visibly identified probable cluster and circled them. In this case, you could identify three or four relatively distinct clusters – as shown in this next chart.

And, at times, you can cluster the data via visual means.Īs you can see in this scatter graph, each individual case (what I’m calling a consumer for this example) has been mapped, along with the average (mean) for all cases (the red circle).ĭepending upon how you view the data/graph – there appears to be a number of clusters. In this cluster analysis example we are using three variables – but if you have just two variables to cluster, then a scatter chart is an excellent way to start. Step Two – If just two variables, use a scatter graph on Excel Figure 2 You can see from this example set that three start positions have been highlighted – we will discuss those in Step Three below.

It’s just easier for me to classify that person in the “over $250,000” income bracket and scale income 1-9 – but that’s up to you depending upon the data you are working with. Say, for example, I am using income data (a demographic measure) – most of the data might be around $40,000 to $100,000, but I have one person with an income of $5m.

The reason for this is to “contain” any outliers. NOTE: I prefer to use scaled data – but it is not mandatory.
