
Difference between Cluster Analysis and Factor Analysis



mphan
08-15-2010, 09:28 PM
Hi everyone, I'm a total newbie at market research, but my project at work requires me to understand some of the methods. Please excuse me for asking questions that might seem obvious.

Can someone explain (or provide a link to an article that explains) the difference between factor analysis and cluster analysis? I'm trying to segment the sample of respondents into characteristic groups based on the responses they gave to a number of questions. The data are mainly ordinal (Likert scale... can factor analysis work on this?) and nominal. Which method should I choose and what criteria should I consider?

peaceman
08-16-2010, 04:24 AM
Factor analysis is a data reduction technique, and cluster analysis is generally used to classify a group of respondents. Too much data is at times difficult to interpret, so factor analysis (FA) is used to reduce it. For example, one may have 20 statements related to consumer attitudes, but it is difficult to make sense of 20 variables. FA can be used to reduce these 20 variables to a small number of factors that explain most of the variance in the data.

In case one wants to segment the group of respondents, cluster analysis can be used - the input would be the factors produced by the FA.

FA works on Likert-scale data, which are strictly ordinal but are commonly treated as interval. I'm not a big fan of using factor or cluster analysis on nominal data. Run factor analysis (if you have too many variables) and then cluster to derive segments.

Ian Straus
08-16-2010, 02:53 PM
.....Can someone explain (or provide a link to an article) that explain the difference between factor analysis and cluster analysis. .....?

In factor analysis, you are taking two or more variables (ordinal or scale variables) and generating a smaller number of variables that capture as much as possible of their variance (presumably also capturing their meaning).
That's why it's called a data reduction technique: You may be reducing four or seven to one or two.

As an example of why that is desirable and how factoring can be used:
I have a survey in which customers are asked four questions about cleanliness. They all affect satisfaction. But when I run multiple regression for a key driver analysis, only one will come into the model because the four are highly correlated.
The high correlation should be no surprise because they all reflect the same cleaning process: they just reflect different ways dirt and trash occur. If I force all four into the regression then one makes sense and the other three end up with counter-intuitive signs and low significance.
Yet they are not redundant in the real world.

So, what to do? I factored them, generating one number, "cleanliness factor". When I use that instead of the separate variables, my adjusted R-squared improves (the model is more explanatory). And it has a clear real-world meaning.
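A minimal sketch of that idea in Python with NumPy, on simulated data rather than the survey described above. It uses the first principal component of the correlation matrix as a stand-in for a one-factor solution, which is a simplification and not necessarily what a statistics package does. The item count, the noise level, and names such as `cleanliness_score` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Assumption: one underlying "cleanliness" level drives all four answers.
latent = rng.normal(size=n)
items = np.column_stack([latent + 0.5 * rng.normal(size=n) for _ in range(4)])

# Standardize each item, then eigen-decompose the correlation matrix.
z = (items - items.mean(axis=0)) / items.std(axis=0)
corr = np.corrcoef(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)   # eigenvalues in ascending order

# The last eigenvector belongs to the largest eigenvalue.
weights = eigvecs[:, -1]
if weights.sum() < 0:                     # sign is arbitrary; orient it upward
    weights = -weights

cleanliness_score = z @ weights           # one number per respondent
variance_share = eigvals[-1] / eigvals.sum()

print("item correlations:\n", corr.round(2))
print("share of variance in the first component:", round(variance_share, 2))
```

The single `cleanliness_score` column could then replace the four correlated items as a predictor in the regression.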

Note that factoring only works to combine variables that are correlated.
If you try to factor your whole survey then the software will probably generate more than one factor if the survey deals with more than one topic. I'd have to show you some factoring output to illustrate what I mean.
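Since no real output is at hand, here is a rough sketch on simulated data of how more than one factor shows up. It counts eigenvalues of the correlation matrix that exceed 1, a common rule of thumb rather than a formal test. The two topics, the three items per topic, and the noise level are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Assumption: two unrelated topics, each measured by three correlated items.
topic_a = rng.normal(size=n)
topic_b = rng.normal(size=n)
items = np.column_stack(
    [topic_a + 0.5 * rng.normal(size=n) for _ in range(3)]
    + [topic_b + 0.5 * rng.normal(size=n) for _ in range(3)]
)

corr = np.corrcoef(items, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)[::-1]          # largest first

n_factors = int((eigvals > 1.0).sum())            # rule of thumb, not a test
explained = eigvals[:n_factors].sum() / eigvals.sum()

print("eigenvalues:", eigvals.round(2))
print("factors retained:", n_factors)
print("share of variance explained:", round(explained, 2))
```

The share of variance printed at the end is the figure to check before deciding whether the retained factors are worth using.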


OK, on to your second subject, cluster analysis:
All the various methods of clustering attempt to group your cases in a multi-dimensional space, so that the cases in each group are "closer to one another" than they are to the members of the other groups.
Each variable in your data file is a dimension.
This is easy to illustrate in two or three dimensions; in fact, you could do it by eye with a pencil and a ruler. But with many dimensions you have to begin to trust the computer.

You may want to factor to reduce those dimensions and then cluster as a second step, using the factor scores as input to the clustering program.
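The two-step approach can be sketched as follows, again on simulated data. Principal component scores stand in for factor scores, and a plain k-means (Lloyd's algorithm) stands in for whatever clustering program you use. The segment structure, the six items, and the helper name `kmeans` are assumptions for illustration.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every case to every center: shape (n_cases, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

rng = np.random.default_rng(2)
n_per_segment = 100

# Assumption: two segments with opposite attitudes toward price and quality.
true_segment = np.repeat([0, 1], n_per_segment)
shift = np.where(true_segment == 0, 1.5, -1.5)
price = shift + 0.5 * rng.normal(size=len(shift))
quality = -shift + 0.5 * rng.normal(size=len(shift))
items = np.column_stack(
    [price + 0.5 * rng.normal(size=len(shift)) for _ in range(3)]
    + [quality + 0.5 * rng.normal(size=len(shift)) for _ in range(3)]
)

# Step 1: reduce the six items to two component scores.
z = (items - items.mean(axis=0)) / items.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
scores = z @ eigvecs[:, -2:]          # columns for the two largest eigenvalues

# Step 2: cluster the scores instead of the raw items.
labels = kmeans(scores, k=2)

# Cluster numbers are arbitrary, so compare against both labelings.
agreement = max((labels == true_segment).mean(), (labels != true_segment).mean())
print("agreement with the simulated segments:", agreement)
```

With real survey data there is no `true_segment` to compare against, so the clusters have to be judged on whether they make business sense.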

Clustering is rightly called "an art", meaning that experience helps and the result can change with something as mundane as the order of the cases.
There is not really a significance number for clustering, although there are statistics that reflect the extent to which your clusters differ or overlap.
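One statistic of that kind is the average silhouette width. Whether it is the one meant here is an assumption. The sketch below computes it by hand with NumPy for two simulated data sets, one with clusters far apart and one with clusters nearly on top of each other.

```python
import numpy as np

def mean_silhouette(points, labels):
    """Average silhouette width: near 1 = well separated, near 0 = overlapping."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    widths = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False                       # leave out the case itself
        if not same.any():                    # a one-case cluster scores 0
            widths.append(0.0)
            continue
        a = dists[i, same].mean()             # mean distance within own cluster
        b = min(dists[i, labels == other].mean()   # nearest other cluster
                for other in np.unique(labels) if other != labels[i])
        widths.append((b - a) / max(a, b))
    return float(np.mean(widths))

rng = np.random.default_rng(3)
labels = np.repeat([0, 1], 50)
is_first = labels[:, None] == 0

# Hypothetical data: the second cluster is shifted by 5 units, then by only 1.
far_apart = rng.normal(scale=0.5, size=(100, 2)) + np.where(is_first, 0.0, 5.0)
overlapping = rng.normal(scale=1.0, size=(100, 2)) + np.where(is_first, 0.0, 1.0)

print("far apart:  ", round(mean_silhouette(far_apart, labels), 2))
print("overlapping:", round(mean_silhouette(overlapping, labels), 2))
```

A higher value says the clusters are more distinct, but it is a descriptive measure and not a significance test.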

Because each variable is a dimension, if you mix nominal variables into the data then either they are binary and tend to take over the model or they are multi-valued and so not suitable for the clustering method. That's why peaceman is against clustering nominal data. So am I, because the results are pretty useless. If you have a lot of nominal data you may just have to try to reach the same practical end by hand, which requires a knowledge of your particular business.
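A small sketch of why multi-valued nominal variables sit badly with distance-based clustering. The `region` variable and its codes are made up for illustration.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length lists of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical nominal variable with arbitrary integer codes.
region_code = {"north": 1, "south": 2, "west": 3}

# Integer codes imply an ordering that does not exist:
# north-west looks twice as far apart as north-south.
d_north_south = euclidean([region_code["north"]], [region_code["south"]])
d_north_west = euclidean([region_code["north"]], [region_code["west"]])

# One-hot (dummy) coding removes the false ordering, but every pair of
# different categories then sits at the same distance, sqrt(2).
one_hot = {"north": [1, 0, 0], "south": [0, 1, 0], "west": [0, 0, 1]}
d_hot_ns = euclidean(one_hot["north"], one_hot["south"])
d_hot_nw = euclidean(one_hot["north"], one_hot["west"])

print(d_north_south, d_north_west)   # 1.0 2.0
print(d_hot_ns, d_hot_nw)
```

Either way the distances carry no real information about how similar two respondents are, which is one way to read the objection above.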

mphan
08-17-2010, 03:29 AM
OK, so it seems that Likert-scale data can be considered interval data (I had read somewhere that Likert data are ordinal).

I'm trying to replicate the segmentation process used in this article (1st page): http://www.mvsolution.com/wp-content/uploads/The-Power-of-Segmentation.pdf

Originally, I thought that their segmentation process ended once the factor analysis was complete. It seems that after they finished the factor analysis, they loaded the factors into a cluster analysis. This seems to be common industry practice.

Am I right in this interpretation?

Thank you for all your help.

Ian Straus
08-17-2010, 08:56 AM
That sure looks like what they did. And the article lays it out with reference to the SPSS program. Table 3 looks very much like the output from the factoring step, just prettied up a little for publication. Of course they used the option to save the factor scores as variables, so they could use them in the cluster analysis.

peaceman
08-17-2010, 10:40 AM
If you are dealing with 4-6 variables you need not use factor analysis - cluster can be used directly. If the number of variables is high, it is better to use factor and then cluster.

mphan
08-18-2010, 08:53 PM
So I have a question that asks how much a client values each of a series of 9 qualities in our product/service before buying. The second question asks whether they are satisfied with those same qualities in our particular product/service.

So overall I have 18 variables, all on a Likert scale. Should I perform factor analysis only on the variables that deal with the "importance" of a quality, should I also include the variables that deal with satisfaction, or should I do factor analysis on the "satisfaction" variables only? Do I even need factor analysis if I eventually settle on only 9 variables?

Ian Straus
08-19-2010, 09:05 AM
mphan

I wish I knew what the nine were and to what extent they are correlated with each other. I'm inclined to think that you should segment separately on the importance question. I'd say do a factor analysis and look at the results before you use it.
How much of the variance do the factors explain?
Do the factors make sense and help you tell a story?

Consider what your next step is. I don't know what it is from what you have written so far.
Are you going to find clusters and then evaluate their satisfaction with the most valued features of the product?
Or are you going to attempt gap analysis?
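For reference, gap analysis in its simplest form compares mean importance with mean satisfaction for each quality. A minimal sketch, with made-up quality names and ratings:

```python
# Hypothetical mean ratings on a 1-5 Likert scale for three of the qualities.
importance = {"reliability": 4.6, "price": 4.1, "support": 3.2}
satisfaction = {"reliability": 3.9, "price": 4.0, "support": 3.8}

# Gap = importance minus satisfaction; a large positive gap marks a quality
# that matters to clients but is under-delivered.
gaps = {q: round(importance[q] - satisfaction[q], 2) for q in importance}

# List the qualities from largest gap to smallest.
for quality, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{quality:12s} gap = {gap:+.2f}")
```

The same calculation could be run within each cluster to see which segment is least well served on the qualities it cares about most.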

Note that it would not surprise me if some people answered the importance and satisfaction items together as a bargaining tool, if the questions are adjacent to one another.

peaceman
08-20-2010, 04:33 AM
It depends on what you want to do. Do you want to segment respondents based on the importance variables? You can use both importance and satisfaction, or just one of them, to derive segments, depending on your objective(s). However, I agree with Ian that you should segment on the importance variables, but again this is up to you, as you know better what you want.

Regarding FA, do you reckon the 9 variables are distinct and different from each other? As Ian said, you can always run FA and check whether you can build a story and reduce the number of variables.