  Factorial validity: This type of validity is mainly an issue when constructing questionnaires. If I wanted to assess my students’ ‘motivation to succeed’ on my statistics course I might measure attentiveness in seminars, the amount of notes taken in seminars, and the number of questions asked during seminars – all of which might be components of the underlying trait ‘motivation to succeed’. Psychological constructs often have these sub-components (e.g. fear of spiders might be broken down into fear of quick movements, fear of being bitten, fear of things with more than two legs, etc.). When we ask lots of different questions about a particular construct (such as fear of spiders) we can use a statistical technique called factor analysis (see Field, 2000, Chapter 11) to find out which questions relate to each other. Put another way, we can find out what sub-components (or factors) exist. Factorial validity simply means that when you do a factor analysis, the sub-components that emerge should make intuitive sense. As such, factorial validity is assessed through factor analysis – if your factors are made up of items that seem to go together meaningfully then you can infer factorial validity (see the sketch below).
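  To make this concrete, here is a minimal sketch of a factor analysis in Python. Everything in it is hypothetical: the data are simulated so that six imaginary questionnaire items cluster into two sub-components, and scikit-learn’s FactorAnalysis is just one of several tools that could be used.

```python
# A minimal sketch of checking factorial validity with factor analysis.
# The questionnaire items and data are simulated, not from a real study.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(seed=42)

# Simulate 200 respondents answering 6 items; items 0-2 tap one
# sub-component (say, fear of movement) and items 3-5 another
# (say, fear of being bitten).
movement = rng.normal(size=(200, 1))
bites = rng.normal(size=(200, 1))
items = np.hstack([
    movement + rng.normal(scale=0.5, size=(200, 3)),  # items 0-2
    bites + rng.normal(scale=0.5, size=(200, 3)),     # items 3-5
])

fa = FactorAnalysis(n_components=2)
fa.fit(items)

# Each row of the loadings is a factor; large absolute values show
# which items 'go together' on that factor.
print(np.round(fa.components_, 2))
```

  If the two recovered factors group items 0–2 together and items 3–5 together, the groupings make intuitive sense and we would infer factorial validity.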

  Box 2.2: Self-report measures

  A self-report measure is any measure that requires people to report how they feel about something, and so it relies on their subjective experience. Typically, as social scientists we ask people to respond to set questions and supply them with a set of response options. Here we look at some of the different types of rating scale that can be used:

  Yes/No Scale

  This type of scale involves asking questions to which participants can respond only with yes or no (Example from the Spider Phobia Questionnaire, Watts & Sharrock, 1984):

  Do you often think about parts of spiders, for example fangs?

  There are several disadvantages to this kind of scale. First, it forces people to give one answer or the other even though they might feel that they are neither a yes nor a no. Imagine you were measuring intrusive thoughts and you had an item ‘I think about killing children’. Chances are everyone would respond no to that statement (even if they did have those thoughts) because it is a very undesirable thing to admit. Therefore, all this item is doing is subtracting the same value from everybody’s score – it tells you nothing meaningful, it is just noise in the data. One solution is to add a ‘don’t know’ option, but this can encourage people to opt for the neutral response all of the time because they cannot make up their minds. It is sometimes useful to have questions with a neutral point to help you identify which things people really have no feelings about. Without this midpoint you are simply making people go one way or the other, which is comparable to balancing a coin on its edge and seeing which way up it lands when it falls. Basically, when forced, 50% will choose one option while 50% will choose the opposite – this is just noise in your data.

  Likert Scales

  There are different types of Likert scale, but in its classic form it consists of a statement to which you can express varying degrees of agreement. Typically they have three, five or seven ordered categories (although you can have any number):

  Example 1: 3-point Likert scale (the Penn State Worry Questionnaire by Meyer, Miller, Metzger, & Borkovec, 1990):

  I know I shouldn’t worry about things, but I just can’t help it

  Example 2: 5-point Likert scale (the Disgust Sensitivity Questionnaire by Haidt, McCauley, & Rozin, 1994):

  You see a bowel movement left unflushed in a public toilet

  Example 3: 7-point Likert scale (the Social Phobia and Anxiety Inventory by Turner, Beidel, & Dancu, 1996):

  I feel anxious before entering a social situation

  The advantages of Likert scales are that they give individuals more scope to express how they feel about something and that they are easily understood (which can be good if you’re testing children). However, with a limited number of response choices it can be easy for people to remember the responses they gave. This is a disadvantage if you’re using these scales to measure changes in responses over time because if participants think the experimenter is expecting a change then they may deliberately change their responses to conform to these expectations (or deliberately change them just to annoy the experimenter!).

  Visual-Analog Rating Scales (VAS scales)

  Visual-analog scales are a bit like Likert scales except that rather than having ordered categories, the scale is simply a line with numerical benchmarks, along which a participant can mark their response with a cross (if your beginning and end values are 0 and 100 it is useful to have a 10 cm line so that each millimetre represents one point on the scale).

  The advantage of VAS scales is that participants don’t know exactly what value they’ve given (you measure their score by calculating the distance from the start of the scale to where they place an X on the line) and so they can’t remember their responses.
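  As an illustration, here is a minimal sketch of scoring a VAS response under the assumptions above (a 10 cm line anchored at 0 and 100, the cross’s position measured in millimetres from the start of the line); the function name and the example measurement are made up.

```python
# Minimal sketch of VAS scoring. Assumes a 10 cm (100 mm) line whose
# endpoints represent 0 and 100, so each millimetre is worth one point.
def vas_score(distance_mm: float, line_length_mm: float = 100.0) -> float:
    """Convert the measured distance from the start of the line to the
    participant's cross into a score on the 0-100 scale."""
    return 100.0 * distance_mm / line_length_mm

# A cross placed 63 mm along the line scores 63 points.
print(vas_score(63.0))
```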

  Thought 3: is my measure reliable?

  Validity is a necessary but not sufficient condition for a good questionnaire or self-report measure. A second consideration is reliability: the ability of the measure to produce the same results under the same conditions. In fact, to be valid a questionnaire must first be reliable. Clearly the easiest way to assess reliability is to test the same group of people twice: if the questionnaire is reliable you’d expect each person’s scores to be the same at both points in time, so scores on the questionnaire should correlate perfectly (or very nearly!). In reality, however, if we did test the same people twice we’d expect some practice effects and confounding effects (people might remember their responses from last time). Also, this method is not very useful for questionnaires purporting to measure something that we would expect to change (such as depressed mood or anxiety). These problems can be overcome using the alternate-form method, in which two comparable questionnaires are devised and compared. Needless to say, this is a rather time-consuming way to ensure reliability, and fortunately there are statistical methods that make life much easier.
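  To see the test-retest idea in numbers, here is a minimal sketch in Python; the eight participants and their scores are invented.

```python
# Test-retest reliability: correlate the same people's questionnaire
# scores from two testing sessions. Hypothetical data for 8 participants.
import numpy as np

time1 = np.array([12, 18, 25, 30, 22, 15, 28, 20])
time2 = np.array([13, 17, 26, 29, 21, 16, 27, 22])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is the test-retest correlation.
r = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest reliability r = {r:.2f}")  # near 1 suggests reliability
```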

  The simplest statistical technique is the split-half method. This method randomly splits the questionnaire items into two groups. A score for each participant is then calculated based on each half of the scale. If a scale is very reliable we’d expect a person’s score to be the same on one half of the scale as the other, and so the two halves should correlate perfectly. The correlation between the two halves is the statistic computed in the split-half method – large correlations being a sign of reliability. The problem with this method is that there are several ways in which a set of data can be split into two and so the results might stem from the way in which the data were split. To overcome this problem, Cronbach suggested splitting the data in half in every conceivable way and computing the correlation coefficient for each split. The average of these values is known as Cronbach’s alpha, which is the most common measure of scale reliability. As a rough guide, a value of 0.8 is seen as an acceptable value for Cronbach’s alpha; values substantially lower indicate an unreliable scale. Although the details of this technique are beyond the scope of this book, you can find more details of how to carry out this analysis on my website.
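  To make these ideas concrete, here is a sketch of both statistics on simulated questionnaire data. One simplification to flag: rather than literally averaging every possible split-half correlation, the code computes alpha with the standard item-variance formula (the form statistical packages actually use); the number of items, the number of participants and the noise level are all invented.

```python
# Split-half reliability and Cronbach's alpha on simulated data:
# 100 hypothetical participants answering 8 items that all tap one trait.
import numpy as np

rng = np.random.default_rng(seed=1)
trait = rng.normal(size=(100, 1))                     # the 'true' construct
items = trait + rng.normal(scale=0.8, size=(100, 8))  # items = trait + noise

# Split-half: correlate totals from one random half of the items
# with totals from the other half.
order = rng.permutation(items.shape[1])
half_a = items[:, order[:4]].sum(axis=1)
half_b = items[:, order[4:]].sum(axis=1)
print("split-half r:", np.corrcoef(half_a, half_b)[0, 1])

def cronbach_alpha(x):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = x.shape[1]
    return (k / (k - 1)) * (1 - x.var(axis=0, ddof=1).sum()
                            / x.sum(axis=1).var(ddof=1))

# With this simulated data alpha should come out around 0.9,
# comfortably above the 0.8 rule of thumb mentioned above.
print("Cronbach's alpha:", cronbach_alpha(items))
```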

  Thought 4: measurement error

  In Box 1.1 we came across the concept of measurement error. Put simply, measurement error is the difference between the scores we get on our measurement scale and the true level of the construct we’re measuring. For example, imagine our actual weight was 73 kg (and let’s assume that we know this as an absolute truth) and one day we get on our bathroom scales and they say 79 kg. There is a difference of 6 kg between our actual weight and the weight given by our measurement tool (the scales): a measurement error of 6 kg. Although properly calibrated bathroom scales should not produce a measurement error (despite what we might want to believe when our scales have just told us we’re really heavy!), self-report measures like the ones used in many social science experiments invariably will, because they are an indirect way of tapping the construct we’re trying to measure. Can you think of why indirect measures might produce measurement errors?

  When we use self-report measures we rely on participants accurately reporting their feelings. This won’t always be the case because other factors will affect how people respond to questions. For example, if we ask someone ‘do you let your dog lick your face after it has licked its backside?’ some dog owners might be unwilling to answer yes to this question even though it is probably true that many dog owners do encourage this activity (yuk!). We can sometimes improve things by using direct measures (e.g. skin conductance is directly related to anxiety) rather than indirect measures (self-report measures of anxiety may be influenced by other factors such as social pressures to appear non-anxious). However, be warned that even physiological measures can be influenced by things other than what you think you’re measuring.

  Some Examples of Dependent Variables

  Let’s look at the examples from Box 2.1 and see how the dependent variables could be measured in those situations:

  Children learn more from interactive CD-ROM teaching tools than from books: Clearly the outcome here is learning (or knowledge about a topic). We can’t use physiological measures of knowledge and so we’re probably going to rely on some kind of written test. We would have to devise several questions (perhaps with multiple-choice answers to standardize how the test is scored) that test various aspects of the topic being taught. We’re interested in acquired knowledge so we might want to administer a test before and after learning (so we can assess whether knowledge has improved) and, therefore, we’d have to consider some of the issues in the previous sections (is the measure reliable and valid, and will people remember their answers over time?).

  Frustration creates aggression: In this case we need to measure aggression. This could be done in a behavioural way; for example, in a classic developmental psychology study by Bandura, Ross & Ross (1963) children were shown films of people carrying out aggressive acts and then placed in a room with a Bobo doll and their aggressive behaviour observed. The behavioural measure was how many times the children struck the doll and the nature of the strikes. Aggression could also be measured using self-report such as a VAS or Likert scale asking questions relating to aggression. The behavioural measure is more direct and so probably has less measurement error.

  Men and women use different criteria to select a mate: At face value this dependent variable can only really be self-report because we are interested in people’s subjective criteria. However, we could set up a situation in which we get lots of single men and women to mingle (in a club, for example) and then at the end of the evening we could measure various aspects of the many couples that will have paired off (income, job, attractiveness or physical characteristics, introversion, sense of humour). This experiment would have several dependent variables and is known as multivariate (literally, ‘many variables’). Some of these variables, such as income and job, are direct measures (they can be corroborated) whereas others, such as measures of introversion or other personality characteristics, are self-report. For each characteristic or dependent variable we could compare men and women.

  Depressives have difficulty recalling specific memories: In this experiment we again rely on self-report because we are interested in the types of memories that depressives and non-depressives generate (perhaps to a standard prompt). However, the research question requires not just the memories themselves but some measure of their specificity, and rating scales again seem the obvious choice. We would probably want two or more independent judges to assess the generality or specificity of each memory along some kind of scale (a Likert scale or VAS, probably). The eventual dependent variable could either be the number of memories a person generates that fall into the specific category, or we could average the specificity ratings of all of the memories for each participant and then compare these averaged scores for depressed and non-depressed people.

  In the example of whether fear information changes children’s fear beliefs (see previous sections), we need some way to assess the children’s fear beliefs about each animal. We’re looking at beliefs and so we need to rely on self-report (if we were looking at fear rather than fear beliefs then we could perhaps construct a behavioural measure such as approach or avoidance towards the animal in question). We need to think about what kind of scale to use to measure fear beliefs, and an important factor here is that we want to test children (who may not understand a complex scale). In the end I decided to use a five-point Likert scale for a series of questions relating to the three animals (the questions were the same for each animal). Figure 2.2 shows the questions that were eventually picked.

  The questions in Figure 2.2 are all scored so that a fearful response gets a high score (so, for example, if a child completely agrees with ‘would you be scared if you saw a quokka?’ then they are given a score of 4; but if they also agree with ‘would you feel happy to hold a quoll?’ then this question is reverse scored as 0, because they have indicated a non-fearful response). By averaging the scores for the seven questions for each animal, we can derive three fear belief scores for each child: one for beliefs about the quoll, one for the cuscus and one for the quokka. Therefore, each child produces three scores. The sketch below shows this scoring scheme in miniature.
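  Here is a minimal sketch of that scoring scheme in Python. The seven responses and the position of the reverse-scored item are invented for one hypothetical child and one animal; the real questions are the ones in Figure 2.2.

```python
# Minimal sketch of scoring seven 0-4 Likert items for one animal,
# reverse-scoring the non-fearful items before averaging.
# The responses and the reverse-item position are hypothetical.
import numpy as np

responses = np.array([4, 3, 4, 2, 3, 4, 4])  # one child, one animal
reverse_items = [6]  # e.g. 'would you feel happy to hold a quoll?'

scored = responses.copy()
scored[reverse_items] = 4 - scored[reverse_items]  # agreement (4) becomes 0

fear_belief = scored.mean()  # one fear belief score for this animal
print(fear_belief)
```

  Repeating this for each of the three animals gives the three scores per child described above.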

  Figure 2.2 Fear belief questions

  2.3 Summary: Is That It?

  * * *

  This chapter has described some of the initial stages in experimental design. This isn’t the whole story, though; it’s just food for thought. We’ve had a look at how to refine a research question by exploring databases of research material and how we can narrow these sources down to a specific question. We can’t answer everything with one experiment, so we generally have to constrain ourselves to a very specific question (and as we saw in Chapter 1 this question must be a scientific statement: something that is testable). Once we’ve defined a question we have to think about what variables to manipulate (the independent variables) and what we want to measure (the dependent variables). In considering how to measure our dependent variables we need to consider what type of measure to use (physiological, behavioural or self-report) and how best to construct these measures. If we use self-report measures then it is important that we try to ensure that the measure produces the same results in identical circumstances (reliability) and that it measures what we want it to measure (validity). To solidify these ideas we’ve worked through an actual research study to see what decisions were made at each stage of the design. Within research there are a limited number of designs that we can apply, and often very different experiments will conform to certain experimental structures. Chapter 3 moves on from what we’ve learnt here to talk about the different design frameworks that researchers use.

  2.4 Practical Tasks

  * * *

  First, how reliable and valid do you think my fear belief questionnaire is? Now, using Examples 1–5 in Chapter 1 (practical tasks), go through each experimental design and write down the following:

  The research question.

  The independent variable or variables.

  The dependent variable or variables.

  How was the dependent variable measured (behaviourally, physiologically or using self-report)?

  Was the dependent variable measured in a valid and reliable way or can you think of other ways in which it could have been measured?

  Answers:

  There are two independent variables: age of child (with four levels: 4–5, 6–7, 8–9 or 10–11 years of age) and gender of child (with two levels: male or female). Each child is allocated to only one condition (i.e. their particular combination of age and gender). The dependent variable is age-estimation performance, measured as the number of times the child correctly identified the older face within each of the 50 pairs presented.

  There is one independent variable: type of training regime. This has three levels: visualization, specific practice, and general practice. The dependent variable is gymnastic performance, as rated by a panel of judges.

  There are two independent variables: time of day at which driving occurred (with two levels: daytime or night-time) and type of radio usage (with three levels: constant, none at all, or use only when the driver felt tired). Each driver participated in all six combinations of these two variables. The dependent variable is a measure of driver fatigue: the number of micro-sleeps during the journey.

  There is one independent variable: whether or not food was consumed before the ship left harbour. This has two levels: food consumption and no food consumption. Each participant either had breakfast or didn’t. The dependent variable is the amount of sick produced, measured in millilitres.

  There are three independent variables: gender (two levels: male and female), personality type (two levels: extrovert and introvert) and alcohol consumption (two levels: none or high). The dependent variable is rated level of embarrassment.

  2.5 Further Reading

  Banyard, P. & Grayson, A. (2000). Introducing psychological research (2nd edition). Basingstoke, UK: Palgrave. This book goes through several published research studies and discusses issues arising from each. It’s a great way to look at how real-world research is carried out.

  Wright, D. B. (1998). People, materials and situations. In J. A. Nunn (Ed.), Laboratory psychology (pp. 97–116). Hove: Lawrence Erlbaum. An accessible look at the ways in which we choose materials and situations to answer research questions. There are some nice examples of how different questions can be answered using different methods.