
Comments on: Personality or performance?: The case against personality testing in management

Robert Spillane wrote:

Szasz's argument can be supported empirically by the many Australian work organisations whose managers secure psychological profiles on their subordinates despite overwhelming evidence that psychological (especially personality) tests have consistently and strikingly failed to predict work performance (Spillane, 1994).

I was particularly interested in this evidence. Psychologist Jordan Peterson (whose videos I generally like) has claimed the research shows that personality tests do correlate with various life outcomes. For example, he said agreeableness correlates to doing well at university (teachers like people who agree with them and grading is biased). I'd like to know if he's wrong about the correlation research (which I know is very different than understanding what's actually going on).

Peterson specifically says the big five personality traits (openness, conscientiousness, extraversion, agreeableness and neuroticism), plus IQ, are the important ones. He says psychologists don't check if their constructs are accounted for by the big five plus IQ because then they'd find out they haven't invented anything, they've just found a proxy for something that's already been discovered.

Peterson says they discovered these traits by asking people a wide variety of questions and finding the answers to some groups of questions are correlated. That is, if you give some conscientious answers you're likely to give other conscientious answers too. The point is that different questions are related, and the questions about personality ended up statistically falling into five groups.

Note that psychologists cannot be trusted to make true statistical claims. For example, the big five wikipedia page says:

Genetically informative research, including twin studies, suggest that heritability and environmental factors both influence all five factors to the same degree.[71] Among four recent twin studies, the mean percentage for heritability was calculated for each personality and it was concluded that heritability influenced the five factors broadly. The self-report measures were as follows: openness to experience was estimated to have a 57% genetic influence, extraversion 54%, conscientiousness 49%, neuroticism 48%, and agreeableness 42%.[72]

I know this is bullshit because I've researched heritability and twin studies before. (Yet More on the Heritability and Malleability of IQ is very good.) They define "heritability" to refer to a mathematical correlation which doesn't imply anything is passed down genetically from your parents. They do this to be misunderstood, on purpose, by people who think they're talking about the standard English meaning of "heritability". And their twin studies don't address gene-culture interactions, and they know that and dishonestly ignore it. They also look at the variation in traits, rather than the cause of the traits themselves (e.g. they would study why you're a little bit happier than some other people, then announce they found a gene controlling happiness.)

As an example of a gene-culture interaction, a gene for being taller could correlate to basketball success. That doesn't actually mean that basketball success is genetically passed down. Becoming a good basketball player depends on cultural factors like whether basketball is popular in your society or even exists. Nevertheless they will correlate some gene to basketball success and announce they've discovered basketball skill is 60% hereditary. And they will imply this is determined by your genes and outside your control, and that it couldn't be changed by writing a popular blog post with new ideas. But that's completely false and the "heritability" they study says nothing about what interventions would be successful in changing the results. (In other words, when they say a trait is 60% genetically determined, that actually allows for the possibility that an essay would change 100% of the trait. The more educated psychologists know that and make misleading statements anyway because they believe these kinds of caveats don't really matter and the bulk of their conclusions are about right.)
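Here's a toy sketch of that last point. It assumes the simplest statistical definition of heritability (the share of trait variance that lines up with the genetic part) and uses made-up numbers; the `heritability` function, the `shift` value and the population are all hypothetical, not taken from any study. It only shows that a ratio of variances can't tell you what an intervention applied to everyone would do.

```python
from statistics import mean, pvariance

# Toy population (made-up numbers): each person's trait is the sum of a
# "genetic" part and an "environment" part. The two parts are chosen to be
# uncorrelated so the variances simply add.
genetic = [-1.0, -1.0, 1.0, 1.0]
environment = [-0.8, 0.8, -0.8, 0.8]


def heritability(genetic_part, trait):
    """Share of trait variance that lines up with the genetic part."""
    return pvariance(genetic_part) / pvariance(trait)


before = [g + e for g, e in zip(genetic, environment)]

# Hypothetical intervention (say, a persuasive essay) that raises everyone's
# trait by the same amount. It changes the average, not the variation.
shift = 10.0
after = [t + shift for t in before]

print("heritability before:", round(heritability(genetic, before), 2))
print("heritability after: ", round(heritability(genetic, after), 2))
print("average trait before and after:", mean(before), mean(after))
```

The variance ratio comes out the same before and after, even though everyone's trait value changed a lot. Real cases are messier than this, but the toy case is enough to show the number is about variation within a population, not about what can be changed.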

So I read Spillane's paper, Personality or performance?: The case against personality testing in management [1]:

The failure of psychologists to produce laws of behaviour or discoveries of importance has stimulated the study of behaviour called reductionism.

Reductionism is refuted in The Fabric Of Reality, ch. 1.

To explain introverted behaviour by reference to an 'introvert trait' in a person betrays an insensitivity to logic since the 'explanation' is viciously circular.

Introvert is a loose description or label. It's a shortcut which condenses and summarizes many facts. E.g. I observed Joe reading a book instead of socializing three times, and I remember one word instead of three events.

I don't think "introverted" is very useful (it's too vague). But shortcut labels in general are OK, e.g. "Popperian", "Aristotelian", or "Kantian". These are more specific than "introverted" and I find them useful despite some ambiguity.

An explanation says why, how, or because. But calling someone introverted doesn't say why they're introverted. An explanation would say, "Joe is introverted because ..." It would then give a reason, e.g. because Joe found that many people are mean to him because he likes books. After you understand the reason for behavior, you can make better predictions. E.g. you won't be surprised if Joe is more outgoing at a book club meeting.

insurmountable problems for those who explain, say, the behaviour of individuals who withdrew their labour by reference to the traits 'aggression' or 'apathy'

"He didn't do much yesterday because he's apathetic" isn't an explanation. It's just a restatement with a synonym. Apathetic means not doing much. But why doesn't he do much?

This reminds me of: people often say they do stuff because they enjoy or like it. They find it fun or entertaining. And they act like that is an explanation and settles the matter. But why do they like it? What's fun about it? Often they are bad at introspection, uninterested in self-understanding, and don't know.

Maslow’s hypotheses have been vigorously tested and the results, far from supporting his theory, have invalidated it. This would not have surprised Maslow himself who was bothered by the way his conjectures were so readily accepted as true and paraded as the latest example of erudite knowledge in management [emphasis added]

Sad story.

The results of personality tests, to which I now turn, are communications, not traits or needs, and they are particularly responsive to the demands of the social situation in which individuals are expected to perform. After decades of personality testing we can now say with confidence that the search for consistent personality and motivational traits has been strikingly unsuccessful (Mischel 1968). While self-descriptions on trait measures are reasonably consistent over short periods of time these measures change across social settings (Anastasi 1982). In other words, people answer questions about hypothetical situations in a reasonably consistent fashion, but when it comes to behaving in the world—the way the situation is perceived—the rewards and penalties obtained and the power one is able to exert influence the consistency of behaviour. It is not surprising, therefore, that efforts to predict performance from personality and motivational inferences have been consistently and spectacularly unsuccessful (Blinkhorn 1990; Fletcher, Blinkhorn & Johnson 1991; Guion & Gottier 1965).

The relevant part! It'd be a lot of work to check those cites though. Let's see what details Spillane provides.

For more than 30 years researchers have stated unequivocally that they cannot advocate the use of personality tests as a basis for making employment decisions about people (Guion & Gottier 1965; Guion 1991). Where significant predictable findings are reported they are barely above chance occurrence and explain only a small proportion (less than 10%) of the variance in behaviour which ‘is incredibly small for any source which is considered to be the basis of behavioural variations’ (Hunt 1965, p 10). [emphasis added]

This use of the word "explain" is standard in these fields and really bad. They use "explain" to talk about correlations, contrary to standard English. In regular English, explanations tell you why, how or because. The implication when they say "explain" is it's telling you why – that is, it's telling you about causes. But correlations aren't causes, so this use of language is dishonest.

The rest looks good.
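To make the statistical usage concrete, here's a minimal sketch of what "variance explained" computes for a straight-line fit. The data points are made up and the variable names are mine. Notice that nothing in the calculation refers to causes; it's arithmetic on distances between the data points, the fitted line and the mean.

```python
# Made-up data points; any (x, y) pairs would do.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Ordinary least-squares line of best fit: y_hat = intercept + slope * x
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar
y_hats = [intercept + slope * x for x in xs]

# "Explained": how far the fitted line sits from the mean of y.
explained = sum((y_hat - y_bar) ** 2 for y_hat in y_hats)
# "Unexplained": how far the actual points sit from the fitted line.
unexplained = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))
# Total variation of y around its mean.
total = sum((y - y_bar) ** 2 for y in ys)

r_squared = explained / total
print(f"total={total:.2f} explained={explained:.2f} unexplained={unexplained:.2f}")
print(f"r squared = {r_squared:.2f}")
```

For a least-squares line the "explained" and "unexplained" parts always add up to the total, so any nonzero correlation gets to count some portion as "explained".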

In the face of low validity coefficients

This reminds me of Jordan Peterson who said psychologists used to underestimate their findings because the correlation coefficients they found were low. But then someone figured out they could compare coefficients to other psychology research and call the top 25% of coefficients high no matter how low the actual numbers are! He thought this was a good idea. It reminds me of how poverty is now commonly defined to refer to relative poverty (being poorer than other people, no matter how wealthy you are).

On comparing three respected and widely used personality tests, two researchers found ‘little evidence that even the best personality tests predict job performance, and a good deal of evidence of poorly understood statistical methods being pressed into service to buttress shaky claims’ (Blinkhorn & Johnson 1990, p 672).

Doh.

Poor validity is matched by poor internal consistency and test-retest reliability. In Cattell’s (1970) 16 personality factors, for example, only two out of 15 Alpha coefficients of internal reliability reach a statistically acceptable level, so testers cannot know what exactly the test has measured. This finding is not surprising given the vagueness of trait definitions and the fact that factor analysis ‘is a useful mathematical procedure for simplifying data but it does not automatically reveal basic traits. For example, the personality factors identified from ratings may partly reflect the rater’s conceptual categories’ (Mischel 1971).

Of course personality trait categorizations reflect the conceptual categories of the people who invented them. They chose what they thought was a question about personality to ask about in the first place.

It's like IQ tests, which all have a heavy cultural bias. So they don't accurately measure intelligence. But that doesn't necessarily make them worthless. Despite the bias, the results may still correlate to some types of success within the culture the tests are biased towards. In other words, an equally smart person who isn't as familiar with our culture will get a lower IQ score. But he may also, on average, go on to have less success (at getting high university grades or getting a high income) since he doesn't fit in as well.

IQ tests also deal with outliers badly. Some people are too smart for the test and see ambiguities in the questions and have trouble guessing what the questioners meant. Here's an example from testing the child of a friend of mine. They were asked what a cow and a pig have in common. And they tried answers like "mammal" or "four legs" or "both are found on farms". Nope, wrong! The right answer was they were both "animals". The child was too smart for the test and was marked wrong. The child was only told the right answer after the test was over, so they got a bunch of similar questions wrong too... Similarly, I recall reading that Richard Feynman scored like 125 on an IQ test, which is ridiculously low for him. He's the kind of person you'd expect to easily break 175 if the tests were meaningful that far from 100, which they aren't.

The technical deficiencies of most personality tests have been known for many years. Yet they are conveniently ignored by those with vested interests in their continued use. For example, the Edwards Personal Preference Scale is technically deficient in form and score interpretation and rests on poorly designed validation studies (Anastasi 1982). The limitations of the Myers-Briggs Temperament Indicator are well known: ‘The original Jungian concepts are distorted, even contradicted; there is no bi-modal distribution of preference scores; studies using the MBTI have not always confirmed either the theory or the measure’ (Furnham 1992, p 60).

Cool. I may look those papers up. I'd really like one for the big five, though!

Testers rely on the validity of self-reports and assume that subjects have sufficient self-insight to report their feelings and behaviour accurately. However, evidence has shown that respondents frequently lack appropriate levels of self-awareness or are protected from exposing themselves by an army of defence mechanisms (Stone 1991).

Of course. So personality tests don't measure your real personality any more than IQ tests measure your real intelligence. But it could still be the case that people who claim to be agreeable on personality tests do better at university, on average (though without knowing why, you can't understand what changes to our society would ruin the effect). One of the reasons I was interested by Peterson's comments on personality tests is he said basically the correlations exist and therefore there's something going on there even if we don't know what it is. He's also admitted that some of the big five personality traits aren't really understood; they are just names tacked on to the correlation, which is the real discovery.

Correlations are worthless without any explanation. But they do have some explanatory context to put these correlations in. We already knew that some of people's communications reveal information about their preferences and skills. And it's not just what people openly say that matters, sometimes subtle clues are revealing. In that context, it could theoretically be possible to correlate some communications to some outcomes. It's like reading between the lines but then statistically checking if you're right very often or not.

Then there is the problem of faking which is so widespread that it is amazing that test scores obtained under conditions of duress or vested interest are taken seriously. The use of so called objective self-report tests requires the assumption that the subject’s score is free from artifacts that superficially raise or lower scores. Yet many researchers list studies which show that personality tests are especially subject to faking (Anastasi 1982; Goldstein & Hersen 1990; Hogan 1991). So serious is this problem that one of the world’s best known personality psychologists, H J Eysenck (1976), will not endorse the use of his personality test where there is a vested interest in obtaining a particular result. Australian researchers have expressed similar reservations about the use of Cattell’s 16 personality factors in selection situations (Stone 1991; Spillane 1985). Yet the testing continues in the absence of countervailing evidence.

Right. I only had in mind voluntary, confidential tests for personal use. If the test can affect getting a job offer, a raise, or college admissions, then of course people will lie. (People are really bad at introspection and personality tests could be a starting point for them to think about themselves. Yes a biased starting point, but still potentially useful for people who have no idea how to do better. I took some online personality tests in the past and found them interesting to think about. That's in the context of my belief that personality is changeable anyway. I never interpreted the tests as doing anything like authoritatively pronouncing what my life will be like, nor did I expect them to be unbiased or highly accurate.)

The claim that lie scales built into the tests weed out fakers is an insult to the intelligence of those who are subjected to them. Whyte (1956) explained 38 years ago how to fake these tests by summarising the strategies employed by bright people to make fools of the testers.

That sounds interesting. At least the test faking strategies. I bet if I look it up, the "lie scales" will be boringly naive.

Then there is the question of cross-cultural applicability, fairness and discrimination. Most personality tests are derived from an Anglo-American environment and are therefore culturally biased. Such tests have been found to be sexually and racially discriminating (Anastasi 1982; Furnham 1992).

Of course they are. That doesn't make them worthless though. If your company is sexist and racist, then the white male who gets a higher test score may actually do better at your company... (Or have they updated the tests yet to promote "diversity" by biasing them in favor of brown females?)

Also, as far as hiring goes, I believe companies should use work sample tests. Typical interviews are extremely biased to find people who are socially-culturally similar to the interviewer, rather than people who would do a good job. They're also biased toward outgoing people who are relaxed, rather than nervous, during interviews. Current hiring practices are so bad that many people are hired for programming positions who can't write working code. The trivial FizzBuzz work sample test actually improves hiring because the other hiring criteria being used, like interviews, are worthless.
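For anyone who hasn't seen it, FizzBuzz asks for a program that prints the numbers 1 to 100, but prints "Fizz" for multiples of 3, "Buzz" for multiples of 5, and "FizzBuzz" for multiples of both. Here's one possible answer in Python, just to show how small the task is. The function name and layout are my own choices; an interviewer would accept many variations.

```python
def fizzbuzz(n: int) -> str:
    """Return the FizzBuzz word for n, or n itself as a string."""
    if n % 15 == 0:  # multiple of both 3 and 5
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


if __name__ == "__main__":
    for i in range(1, 101):
        print(fizzbuzz(i))
```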

Test scores can be interpreted in many ways. The most logical interpretation is that they reflect strategies adopted by the subject for the testing game. To argue that these strategies will necessarily equate to strategies adopted in the world of business is dishonest or naive.

Right. If you take a test because of personal curiosity, then you can try to answer honestly and see if the results say anything you find interesting. If personality tests were used for college admissions, then they'd be a test just like the SAT where you can read books telling you how to give answers that will get you admitted. It'd be funny if people wanted to retake a personality test to try again to get a better score, as they do now with the SAT.

Personality tests assess generalised attitudes and gloss over the rich subtleties of human behaviour.

Of course! Isn't that what they're supposed to do? They are trying to summarize a person – which is very complex – with e.g. scores on 5 continuums. Summary information like that necessarily loses a lot of detail. Does anyone deny it!?

Nowadays it is commonplace to hear apologists for personality testing admit that the tests don’t predict performance, but should be used nonetheless to ensure an appropriate fit of individual with organisational culture.

Seeking cultural fit at a company is one of the main excuses for not basing hiring primarily on work sample tests.

If companies cared more about work performance, they would come up with objective measures allowing no managerial discretion and then hand out bonus pay accordingly. (Some do this, many don't.)

... foist their crude ideas about human nature on to people who frequently don’t have the opportunity to assess their claims or to refuse to participate in the testing game.

A friend of mine got bored while taking an IQ test and skipped the rest of the questions.

I got bored while taking a physics test at school, so I left most of it blank. The teacher didn't want to try to explain to anyone why a smart student who knew the material got a bad grade. Why rock the boat? So he just ignored the test result and asked me to take it again later. Grade falsification like this is common, and the amount of grade falsification depends on the teacher's opinion of you. A friend of mine went through school making friends with his teachers and then turning in many of his assignments weeks late and getting A's anyway.

No doubt one of the reasons for the continuing belief in personality traits and the instruments used to 'measure' them is the result of an outmoded inductivist view of science which emphasises confirming instances.

Yes. And induction is closely related to correlation. Induction involves picking out some pattern (correlation) from a data set and then extrapolating without thinking about explanations of the causal mechanisms. We know the sun will rise tomorrow because we know what it's made out of and what forces (gravity and the Earth's spin) are involved, not because of the correlation between 24 hours passing and the sun rising again.

But induction doesn't work because, among other reasons, there are always infinitely many patterns (and also explanations) which fit any finite data set. So too are there infinitely many patterns to be found in personality test data, and infinitely many explanations compatible with the test results. It's only by critical thinking about explanations that we can understand what's going on. Data can't guide us (contrary to the common claim that correlations hint at causations), we have to guide ourselves using data as a tool.
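Here's a small sketch of the "infinitely many patterns" point, using made-up data. The `make_rule` helper is a hypothetical illustration: for every value of `k` it builds a different rule, and every one of those rules matches all three observations while disagreeing about the next one.

```python
# Three made-up observations that look like the pattern "y = 2x".
data = [(1, 2), (2, 4), (3, 6)]


def make_rule(k):
    """Build a rule that matches every observation above for any k.

    The extra term is zero at x = 1, 2 and 3, so it only shows up
    at points we have not observed yet.
    """
    return lambda x: 2 * x + k * (x - 1) * (x - 2) * (x - 3)


rules = {k: make_rule(k) for k in (0, 1, -5, 100)}

for k, rule in rules.items():
    fits = all(rule(x) == y for x, y in data)
    print(f"k={k}: fits the data={fits}, predicts y={rule(4)} at x=4")
```

There is a rule for every possible `k`, so the data alone can't pick one. Choosing between them takes an explanation of why one rule should hold.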

Final Comments

Even without any tests, people often use their personality as an excuse. They say they can't do some task well (e.g. go to a party and do business networking) because they are "introverted". Or, rather than simply say they don't like networking (and perhaps even giving some reasons), they say it makes them nervous or anxious because of its incompatibility with their personality type.

Many people would prefer to be victims of their personality, to have an excuse for their failures, rather than strive to better themselves.


Footnotes

1: Spillane, R. (1994) 'Personality or Performance? The Case Against Personality Testing in Management.' In A.R. Nankervis & R.L. Compton (eds) Strategic Human Resource Management, Melbourne: Nelson, Ch 14.


Update

Spillane comments on the Big Five in his book Psychomanagement: An Australian Affair:

In connection with any quantity which varies, such as job performance, the variation does not arise solely because of differences among personalities. So the correlation coefficient is used as an indicator of the proportion of the variation in job performance, which is related to personality scores. The correlation thus provides an answer to the question: how much does the variation in personality contribute to the variation in job performance? This question may be answered in terms of variance. The square of the correlation coefficient indicates the proportion of variation in job performance, which is accounted for by differences in personality scores. If the correlation between job performance and personality scores is +.9 then the proportion of variance accounted for is 81% (and 19% is unaccounted for) and personality would be a very strong predictor of job performance. For 50% of the variance of job performance to be accounted for by personality, a correlation coefficient of just over +.7 is required. Since important employment decisions are based on the assumption that personality scores predict job performance, one would expect and hope that the correlation coefficients are greater than +.7 otherwise decision makers will make a large number of rejecting and accepting errors.

What have the meta-analyses found? Four meta-analytic studies of the relationship between job performance and personality scores yielded the following average correlation coefficients: conscientiousness .21; neuroticism .15; extraversion .13; agreeableness .13; openness to experience .12. These results are worrying enough since the much-quoted result of .21 for conscientiousness means that the proportion of variance unaccounted for is 95.6%. Responsible decisions about hiring, promotion or training cannot be made on the basis of these figures.

However, the actual situation is far worse since it makes an important difference to the results when personality scores are correlated with ‘hard’ or ‘soft’ performance criteria. Soft criteria include subjective ratings whereas hard criteria include productivity data, salary, turnover/tenure, change of status. Since personality scores are better predictors of subjective performance ratings than objective performance measures, it is reasonable to conclude that raters rely on personality when evaluating job performance, thereby raising the question whether the relationship between personality and performance is the result of the bias of the rater rather than actual performance. In the much-quoted study by Barrick and Mount, the correlation coefficient dropped from .26 for soft criteria to .14 for hard criteria. The average correlation between the Big Five and job performance (hard criteria) was .07. [2]

The footnote is:

M.R. Barrick & M.K. Mount, ‘The Big Five Personality Dimensions and Job Performance: A Meta-Analysis’, Personnel Psychology, 1991, 44, pp. 1-26.

That article is freely available online. I read some and Spillane seems to be factually correct. It looks like Jordan Peterson is badly wrong.
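As a quick check on the arithmetic in the quote, here's a short sketch that squares the quoted correlation coefficients. The function name is mine; the coefficients are the ones Spillane gives above.

```python
import math


def variance_accounted_for(r: float) -> float:
    """Proportion of variance accounted for by a correlation of r."""
    return r ** 2


# Figures quoted by Spillane above.
for label, r in [
    ("strong predictor example", 0.9),
    ("conscientiousness (meta-analyses)", 0.21),
    ("Barrick & Mount, soft criteria", 0.26),
    ("Barrick & Mount, hard criteria", 0.14),
    ("Big Five average, hard criteria", 0.07),
]:
    share = variance_accounted_for(r)
    print(f"{label}: r={r:.2f} -> {share:.1%} accounted for, {1 - share:.1%} not")

# Correlation needed to account for half the variance.
print(f"r needed for 50%: {math.sqrt(0.5):.3f}")
```

The results match the quote: .9 gives 81%, .21 leaves 95.6% unaccounted for, and you need a correlation just over .7 to reach 50%.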


Elliot Temple on July 7, 2017

Comments (3)

Fine Reader was a big help so I didn't have to type in the quotes. And it has better OCR accuracy than Adobe.

https://www.abbyy.com/en-us/finereader/

curi at 3:53 PM on July 7, 2017 | #8786

i thought this post was interesting. i realized I didn't really understand the math of correlations and R-squared and stuff, and wanted to know more so i could understand some parts of the post better.

i tried watching a few videos on youtube. most of them looked bad, confusing and more focused on homework and calculation steps than explanation. you can often just tell by the title or within 30 seconds. i found this lack of good explanations annoying and it's something i've encountered when looking for *explanations* of math stuff before.

this video was okay though:

https://www.youtube.com/watch?v=aq8VU5KLmkY

the guy uses "explain" in the way elliot criticizes here:

>This use of the word "explain" is standard in these fields and really bad. They use "explain" to talk about correlations, contrary to standard English. In regular English, explanations tell you why, how or because. The implication when they say "explain" is it's telling you why – that is, it's telling you about causes. But correlations aren't causes, so this use of language is dishonest.

but video guy also talks sensibly about the mathematical concepts, so it was good to see how the mathematical concepts connected with this use of language

BTW i'm not very good at math so i might have misunderstood him. i watched the vid a couple of times and paused a bunch to try and follow it.

anyways as far as i could tell, he said that if you have some data points, and you figure out what the mean of the y-values (he calls this y-bar) and the line of best fit (he calls this y-hat) are, then you call the variation of each data point from y-bar to y-hat the amount of the variation that is "explained" by changes in x, and the variation of each data point from y-hat to the actual data point is "unexplained."

i thought this was interesting. cuz it seems like, under this usage of language, the only way you couldn't say there was some portion of some variable that was explained by some other variable is if the relationship between them was zero!

Anonymous at 6:35 AM on July 9, 2017 | #8792

It was a 125 IQ score for Feynman. (i originally said 140 in the post but i'll edit the number now)

https://www.psychologytoday.com/blog/finding-the-next-einstein/201112/polymath-physicist-richard-feynmans-low-iq-and-finding-another

The page gives the excuse-explanation that the IQ test focused on verbal skills. Wouldn't that make it inaccurate, which is the point?

curi at 9:22 AM on July 9, 2017 | #8794
