#1
Hi Everybody,
SPSS offers three clustering procedures: TwoStep cluster analysis, K-means cluster analysis, and hierarchical clustering. When the sample is large, which method is appropriate when:
1. The variables are ordinal (e.g., a 5-point Likert scale)?
2. The variables are binary (e.g., Yes/No data)?
3. The variables are continuous?
Also, how do I decide on the standardization method when the variables are of mixed type, e.g., a combination of binary and ordinal data? In other words, can somebody comment on the data considerations in using cluster analysis?
Thanks for the help.
Karanth.
Last edited by karanth; 04-24-2006 at 07:24 AM.
#2
For starters, if you've got a large sample size (1,500+), then you would probably want to start with K-means, unless you've got quite a lot of processing power at your disposal.
Your other option, depending on the SPSS version you've got, is TwoStep clustering, which accepts any variable type and even suggests the best cluster solution. I tried it once, but the solution it gave (2 clusters) was not that brilliant. That leaves you with K-means; just be sure you standardize your variables (Z-scores) before including them in the procedure.
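For anyone following along outside SPSS, here is a minimal sketch in plain Python of what the Z-score step does to each variable before clustering. The function name `zscore_columns` and the toy data are made up for illustration, and the use of the sample standard deviation (n - 1) is an assumption; this is not SPSS's own code.

```python
from statistics import mean, stdev

def zscore_columns(rows):
    """Standardize each variable (column) to mean 0 and sample SD 1.

    Assumes every column has some variation; a constant column would
    have SD 0 and raise ZeroDivisionError.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]  # sample SD (n - 1), an assumption
    return [
        [(value - m) / s for value, m, s in zip(row, means, sds)]
        for row in rows
    ]

# Hypothetical toy data: income in dollars and a 5-point Likert rating.
# Unstandardized, income would dominate any Euclidean distance.
data = [
    [30000, 1],
    [45000, 5],
    [60000, 3],
    [75000, 2],
    [90000, 4],
]

standardized = zscore_columns(data)
for row in standardized:
    print([round(v, 2) for v in row])
```

After this step both variables are on the same footing, so neither one drives the distance calculation just because of its units.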
#3
Carlos gave a good abbreviated description, but I strongly suggest you simply go to SPSS' Help>Topics and enter "cluster". This should give you all you need and more. (I am running V14 but don't think much has changed since, perhaps, V12.)
BTW, Carlos, I am not sure that K-means "requires" using the Z-scores, only that the data are interval or ratio?
__________________
WMB Statistical Services
SPSS Beta Site
mailto:info.statman@earthlink.net
http://home.earthlink.net/~info.statman
Last edited by Statman; 04-24-2006 at 01:59 PM.
#4
Hello,
Do I need to standardize the variables if all of them are of the same data type? For example, if all of my data are on a 5-point Likert scale, do I have to standardize the variables? Isn't it only when the data types are mixed that we have to standardize?
Also, I see that K-means is applied to interval and ratio scale data. Is it applicable to ordinal data? Or to binary data?
Thanks,
Karanth
#5
One thing that always gets me with cluster analysis using Likert scale inputs is those two dang clusters that always seem to form: respondents who tend to use the top end of the scale most of the time, and respondents who tend to use the bottom end of the scale most of the time. This little trick has been successful for me.
Rather than normalizing on the average score for the survey question, I first like to normalize on the average score for the respondent. This assumes that each individual respondent has their own internal reference point rather than the one defined by the scale. Then all I'm clustering on is how the respondent considers each measure relative to their internal reference point. I know I'm bad, but sometimes you just have to trick the respondents into being useful despite their best efforts to thwart us.
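To make the idea concrete, here is a rough sketch in plain Python of centering each respondent on their own average. The function name `center_by_respondent` and the two toy respondents are hypothetical, and this shows only the mean-centering part, not any further scaling.

```python
from statistics import mean

def center_by_respondent(responses):
    """Subtract each respondent's own mean rating from each of their answers."""
    centered = []
    for answers in responses:
        own_mean = mean(answers)  # the respondent's internal reference point
        centered.append([a - own_mean for a in answers])
    return centered

# Hypothetical 5-point Likert answers: one respondent who lives at the top
# of the scale and one at the bottom, with the same relative pattern.
responses = [
    [5, 4, 5, 3],
    [3, 2, 3, 1],
]

centered = center_by_respondent(responses)
for row in centered:
    print(row)
```

On the raw scores these two respondents look far apart; once centered they have the same profile, so they would no longer be split into a "high scorer" and a "low scorer" cluster.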
#6
So right, Phillip, with respect to the respondent, and an interesting normalization.
BTW Phillip, are the scales now "scale," still ordinal, or interval? [Refer back to the thread on measurement scales] S
#7
Before the transformation, the survey measures are ordinal Likert values. After the transformation, each measure is the number of standard deviations the response lies from that respondent's mean across all of their Likert ratings, and so is interval data.

I'm thinking at this point some of the readers may be lost without an example, so here's what we're talking about. Lots of surveys have question series like this:

How strongly do you agree or disagree with the following statements? 7 = strongly agree, 1 = strongly disagree, 4 = neither/neutral
___ This forum is fabulous
___ I want people to know I use this forum
___ Scott Spain is a stud
___ Statman's posts are informative
___ Philip's posts are confusing
___ This forum is better than other forums like it
etc.

Often, with this type of series you will get respondent data that looks like:

case 1: 6,7,6,7,7,5
case 2: 7,7,5,5,6,7
case 3: 4,5,4,4,5,5
case 4: 2,3,2,2,2,3

Clearly the variation across the measures is more a function of how the respondents use the scale than a function of actual variation across the measure.

When I encounter this pattern in data, I control for the scale-use tendencies of each respondent by calculating the mean and standard deviation of all of that respondent's answers on the same question type (the "scale-use mean" and "scale-use standard deviation"). I then normalize each measure by dividing the difference between the response and the scale-use mean by the scale-use standard deviation. So the questions where the respondent deviates from their scale-use mean the most have the greatest values (positive or negative). These normalized values usually give me a much more robust and meaningful cluster solution.
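Here is a minimal sketch of that calculation in plain Python, run on the four cases above. The function name `standardize_by_respondent` is made up, the choice of the sample standard deviation (n - 1) is an assumption since the post doesn't say which one is used, and setting a respondent with no variation to all zeros is also just one possible convention.

```python
from statistics import mean, stdev

def standardize_by_respondent(responses):
    """Convert each answer to (answer - scale-use mean) / scale-use SD,
    where both statistics are computed within a single respondent."""
    result = []
    for answers in responses:
        use_mean = mean(answers)
        use_sd = stdev(answers)  # sample SD (n - 1), an assumption
        if use_sd == 0:
            # Respondent gave the same answer everywhere: no relative
            # information, so use zeros (an assumed convention).
            result.append([0.0] * len(answers))
        else:
            result.append([(a - use_mean) / use_sd for a in answers])
    return result

# The four cases from the example, one row per respondent.
cases = [
    [6, 7, 6, 7, 7, 5],
    [7, 7, 5, 5, 6, 7],
    [4, 5, 4, 4, 5, 5],
    [2, 3, 2, 2, 2, 3],
]

normalized = standardize_by_respondent(cases)
for number, row in enumerate(normalized, start=1):
    print(f"case {number}:", [round(v, 2) for v in row])
```

Each row now has mean 0 and standard deviation 1, so case 1 and case 4 are compared on which statements they rate above or below their own typical answer rather than on how high or low they score overall.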
#8
Phillip,
Could you please clarify or explain a bit more what you mean by "scale use mean" and "scale use standard deviation"? Thanks.