How to Design and Report Experiments



  Figure 3.1 The ‘one group post-test’ design

  Figure 3.2 The ‘one group pre-test/post-test’ design

  The interrupted time-series type of quasi-experimental design (Figure 3.3) is often used in applied research to evaluate the impact of changes in legislation or the effects of some treatment. With the interrupted time-series method, we make a series of measurements at different times, some of which take place before the intervention in question, and some of which take place afterwards. We then compare the pre-intervention measurements to those that were taken post-intervention. Suppose we wanted to know if seat-belts had had an effect on fatality rates for drivers. One way to look at this would be to look at the fatality rates for several years before and several years after compulsory seat-belt legislation was introduced, to see if there was a statistically significant difference.

  The problem with this method is that we don’t have full control over the manipulations of the independent variable. It might be that other factors changed at the same time as the seat-belt legislation was introduced, and that these have also acted to reduce accident rates. For example, at the same time that the legislation was introduced, there might have been an advertising campaign that was intended to inform people about the new legislation, but as a side-effect made drivers more aware of the risks of driving. There might be other changes, such as more police cars patrolling around, which coincided with the introduction of the seat belt law.

  Figure 3.3 Interrupted time-series designs

  It’s also possible that there have been changes in how accident statistics are collected and reported, so that what you are seeing is not really a change in accident rates per se, but a change in the numbers recorded. (The next time some old age pensioner goes on and on about how little crime and violence there was back in the good old days, try explaining to them how reporting biases can give a potentially misleading impression. Take, for example, crimes of violence against women: it might seem that there are more now than 50 years ago, but that might be partly because 50 years ago women were more reluctant to report them to the police.) In short, although this design is an improvement over the previous ones, it remains open to time and measurement threats to validity.

  In the static group comparison design (Figure 3.4) we have two groups – a control group to whom nothing is done by the experimenter, and an experimental group who receive some treatment. The difference between this and a proper experiment is that the participants are not assigned to the two conditions randomly. We just passively observe the effects of the experimental group’s treatment. We have already encountered this design, in the discussion of the study on motorcycle headlight use. As mentioned there, because the participants are not allocated randomly, it can always be argued that any differences between the two conditions are due to factors other than our experimental manipulation. The strength of our conclusions then depends on the extent to which we can identify and eliminate these alternative explanations.

  Experimental Research: Between-Group and Within-Subjects Experimental Designs

  Between-groups (or ‘independent measures’) designs use separate groups of participants for each of the different conditions in the experiment. Each participant is tested once only. In within-subjects (or ‘repeated measures’) designs, each participant is exposed to all of the conditions of the experiment. These are the two extremes: you can have hybrid (‘mixed’) designs which involve a combination of between-groups and within-subjects variables.

  Figure 3.4 The ‘static group comparison’ design

  Each design has its strengths and weaknesses, and which one is most appropriate to use really depends on what’s being researched. Something both types of design have in common is that, if properly carried out, they enable fairly unambiguous identification of cause and effect (see Chapter 2). They achieve this by making sure that the only systematic effect on participants’ behaviour is the experimenter’s manipulations of the independent variable. An important factor in achieving this is the appropriate use of randomization.

  The Importance of Randomization in Experimental Design

  In a study with a between-groups design, it is essential that we allocate participants randomly to our experimental conditions. In a within-subjects design, it is similarly essential that participants don’t all experience our experimental conditions in the same order (something we achieve by presenting the conditions in a random order or by counterbalancing the order). Why is randomization so important?

  In an experiment, we want to isolate the effects of our manipulation of the independent variable. Recall that a score consists of a ‘true score’ (a measure of the thing we’re really interested in) and ‘error’ (from the influence on our participants of all sorts of other, extraneous factors). To distinguish between the true score and the error, we rely on the fact that variation in the true score should be related to our manipulation of the independent variable. For example, suppose we were interested in finding out whether playing Mozart to babies affects their intelligence in later life (Box 3.1).

  If playing Mozart affects intelligence to any significant extent, it should affect all babies who are exposed to it in much the same way. Due to factors that are outside of our direct control, the magnitude of the effect will probably vary from baby to baby, but by and large we would expect most of our babies to end up a bit smarter than those in a control group who did not hear any Mozart. We are expecting a systematic variation in performance, because we have behaved one way towards all of the babies in the Mozart-listening group, and we have behaved in a different way to all of the babies in the no-Mozart group. If anything else produces systematic variation in performance, it becomes hard to interpret our results, because if we find an ‘effect’ of Mozart, we won’t be able to tell whether this is due to the music, other factors which systematically affected every baby in that group, or some interaction between these factors and listening to Mozart.

  For this reason, all other possible influences on the babies’ performance must remain as unsystematic as possible. They can be unsystematic in the sense of affecting behaviour randomly. By allocating babies to the different groups randomly, we (hopefully) distribute potential influences on their behaviour randomly across the groups. Unsystematic differences in things such as intelligence, motivation, anxiety, irritability, receptivity to auditory stimulation, etc., should, on average, cancel out: for every highly anxious baby in the Mozart group, there’s likely to be one in the control group too.
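  The logic of random allocation can be simulated in a few lines. This is an illustrative sketch only: the anxiety scores and group labels below are invented, and the point is simply that shuffling before splitting leaves an extraneous trait roughly balanced across the groups.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical anxiety scores for 100 babies (an extraneous factor
# we cannot control directly).
anxiety = [rng.gauss(50, 15) for _ in range(100)]

# Random allocation: shuffle the babies, then split into two groups.
indices = list(range(100))
rng.shuffle(indices)
mozart_group = [anxiety[i] for i in indices[:50]]
control_group = [anxiety[i] for i in indices[50:]]

# On average the two groups end up with similar anxiety levels, so
# this trait cannot produce a systematic between-group difference.
print(statistics.mean(mozart_group), statistics.mean(control_group))
```

  The two group means will not be identical, but any difference between them is unsystematic, which is exactly what randomization is meant to guarantee.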

  An alternative way to prevent extraneous factors from having an effect is to keep them as constant as possible – either by matching participants carefully across groups (so that if one group contains a baby with a high IQ, so too does the other), or better still, by using a within-subjects design – so that, in effect, each participant is perfectly matched with their counterpart in the other group because they are one and the same person! As we shall see later, this gives rise to its own problems: although within-subjects designs largely eliminate the effects of differences stemming from the participants themselves, we have to be careful not to produce systematic effects on their behaviour as a consequence of poor experimental design. If the experimental conditions are administered in the same order for each participant, this might affect their behaviour in a systematic way, by making participants more practised or more fatigued in one condition than another. So, we have to randomize or counterbalance the order of presentation of conditions, so that the order does not have systematic effects on participants’ behaviour that might be confused with our manipulations of the independent variable.
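  To make the order manipulation concrete, here is a minimal sketch (the three condition labels are hypothetical) of the two options just described: full counterbalancing, where every possible order is used equally often across participants, and independently randomizing the order for each participant.

```python
import itertools
import random

conditions = ["A", "B", "C"]  # hypothetical within-subjects conditions

# Full counterbalancing: all 6 possible orders of 3 conditions,
# assigned in rotation as participants arrive.
orders = list(itertools.permutations(conditions))

def counterbalanced_order(participant_index):
    return orders[participant_index % len(orders)]

# Alternative: give each participant an independently randomized order.
def randomized_order(seed=None):
    rng = random.Random(seed)
    shuffled = conditions[:]
    rng.shuffle(shuffled)
    return tuple(shuffled)

for i in range(6):
    print(i, counterbalanced_order(i))
```

  Note that the number of orders grows factorially with the number of conditions, so full counterbalancing quickly becomes unwieldy for complex repeated-measures designs – which is when randomizing the order is the more practical choice.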

  Box 3.1: How randomization eliminates all systematic effects on behaviour other than the effects of our independent variable

  (a) Random allocation to groups (Mozart versus no Mozart) helps to ensure that any systematic differences between babies not under my control (e.g. motivation (M), irritability (I), worrying (W) and happiness (H)) are spread unsystematically across groups. The only systematic difference between groups is whether or not they heard Mozart.

  (b) Non-random allocation may produce other systematic differences between the groups as well as those due to my experimental manipulations. I can’t tell whether any observed differences are due to listening to Mozart or due (at least partly) to the other systematic differences between the conditions (in this case, happiness and worry).

  If you fail to randomize participants to conditions in a between-groups design, or fail to randomize the order of conditions in a within-subjects design, no amount of clever statistical analysis can remedy the situation: you end up with uninterpretable results, and you’ve wasted your time and that of your participants.

  An interesting illustration of the importance of randomization comes from the Lanarkshire Milk Experiment of 1930. This was designed to examine the nutritional benefits of providing milk to school children. It was a huge study, involving 20,000 children and costing a whopping £7,500 (a fortune at the time). A control group of 10,000 children received no milk, and an experimental group of 10,000 received 3/4 pint of milk every day: half of the latter drank raw milk (yuk!), and the rest got pasteurized milk. Unfortunately, selection of children to be ‘feeders’ or ‘controls’ was fatally flawed. It was supposed to be random, but teachers were allowed to adjust the compositions of the feeder and control groups to obtain ‘a more level selection’, if they felt that too many well-fed or malnourished children had been allocated to one group or another. The result of this seems to be that the teachers were unconsciously biased in selection: faced with a choice, they tended to put poor children in the milk-drinking groups, and more well-nourished and affluent children into the control group. The end result was that the control group ended up markedly superior in weight and height to the milk drinkers! From this study, it proved impossible to draw any sound conclusions about the relative benefits of raw and pasteurized milk – or indeed about the effects of milk per se. An entire study had been rendered largely worthless as a consequence of a seemingly minor procedural flaw – a lack of randomization. Student (1931),1 in his review of this study’s methodological flaws, said:

  [The conclusion that milk is beneficial to schoolchildren] . . . is shifted from the sure ground of scientific inference to the less satisfactory foundation of mere authority and guesswork by the fact that the ‘controls’ and the ‘feeders’ were not randomly selected (Student, 1931: 403).

  How do you achieve random allocation in practice? Ideally, you should use something like a table of random numbers (or these days, the random number generator of a computer or calculator). As each participant arrives, follow a rule such as: if the next random number is even, put the participant in the control condition; if it is odd, put them in the experimental condition. In practice, it has to be said that many experimenters don’t do this. But at the very least, try to avoid running participants in ways which are likely to produce systematic differences between conditions – such as assigning all the participants who turn up in the morning to one condition, and all the participants who come in the afternoon to another condition.
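  The even/odd rule just described can be sketched in a few lines of code; the participant labels here are hypothetical, and a computer’s pseudo-random generator stands in for a printed table of random numbers.

```python
import random

def allocate(participant_ids, seed=None):
    """Allocate each arriving participant to a condition:
    even random number -> control, odd -> experimental."""
    rng = random.Random(seed)
    allocation = {}
    for pid in participant_ids:
        number = rng.randint(0, 99)  # stand-in for a random-number table
        allocation[pid] = "control" if number % 2 == 0 else "experimental"
    return allocation

print(allocate(["P1", "P2", "P3", "P4"], seed=1))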

  Between-Groups Designs

  A dvantages of between-groups designs

  Between-groups designs have several advantages, compared to repeated-measures designs.

  Simplicity: One advantage of a between-groups design is its simplicity: all you have to do is to make sure that you allocate participants randomly to the different conditions. You don’t have to worry about procedures like counterbalancing (see below), which can get a bit wearisome if you have a complex repeated-measures design.

  Less chance of practice and fatigue effects: You also don’t have to worry about practice and fatigue effects (see below). There is no possibility that performance in one condition can affect performance in another, as each participant participates in only one of the conditions.

  Useful when it is impossible for an individual to participate in all experimental conditions: A between-groups design is the only type of design that can be used if participation in one condition makes it impossible for a participant to take part in another. For example, if you are looking at sex differences in performance, you are stuck with a between-groups design, because participants are either male or female and can’t switch from one to the other for the sake of your experiment. The same is true if you are interested in how performance on some measure changes with age, unless you are prepared to test the same people at different ages; that might be possible with children, but it would be impractical if you wanted to compare children and pensioners. Sometimes, participating in one condition may alter the participant irreversibly so that they cannot meaningfully participate in another condition of the same experiment. For example, suppose you were interested in testing the effectiveness of two different methods of teaching people Sanskrit. Once a participant had learnt Sanskrit by one method, they wouldn’t be able to unlearn it in order to use the other method to learn it again – the first experimental treatment (i.e. the first method of learning Sanskrit that they encountered) would have changed them forever. Another example would be if you showed participants a set of words and then gave them a surprise memory test afterwards. Each participant could only be surprised once (unless you used people with extremely bad memories!), and so you would have to use a between-groups design in this instance.

  Disadvantages of between-groups designs

  Between-groups designs do have several disadvantages associated with them; these are large enough for me to suggest that, wherever possible, you should use a within-subjects design unless it’s absolutely impossible (for reasons that will be discussed shortly).

  Expense in terms of time, effort and participant numbers: Between-groups experiments require lots of participants, and participant recruitment is time-consuming and laborious. As the number of conditions in the experiment increases, so too does the number of participants. As most experiments use between 10 and 20 participants per group, you can rapidly arrive at a situation where an experiment is going to take a long time to run. This is bad enough for researchers, but it’s even worse for students: you are not likely to get sufficient marks to reward your efforts in collecting data, and you would probably be better off spending your time on writing the report and analysing the data. Since it’s hard to recruit participants, once you’ve trapped them in a darkened room, you might as well test them several times rather than just once. That way, you won’t need to run as many participants (or you can run as many as before, but get a lot more data from them).

  Insensitivity to experimental manipulations: All other things being equal, a between-groups design is less sensitive than a within-participants design. In other words, a between-groups design will be less likely to detect any effect of your experimental manipulations. Consider a simple two condition experiment in which we are interested in the effects on memory performance of a bang on the head. Participants in group A get a hefty whack on the head with a mallet, whereas participants in group B are left alone. The latter act as a control group, against which to evaluate the effects of head-bashing. Participants in both groups get the same memory test, and we look to see if whacking affects memory performance. In an ideal world, the only difference between participants in the two groups would be that we had whacked those in group A, and not those in group B: therefore any differences between the groups would be due entirely to our experimental manipulation. In practice, things are complicated by within-groups variation: within each group, there would be all kinds of differences between participants that might act to add ‘noise’ (or ‘variance’, in statistical jargon) to our data. Some participants in each group would have harder skulls than others; some of the participants in each group would have had very good memories, and others would have had very poor memories. Maybe sometimes we would get a good swing on the mallet, and other times we wouldn’t hit the participant quite so hard (although mallet-swinging could be automated, to ensure that it was consistent for all individuals in group A). As a result of all these factors, memo
ry scores within each group are likely to show variation between the members of that group.

  If we have allocated our participants to the two conditions randomly, it is unlikely that there will be any systematic differences between the groups on any variable other than the one that we are manipulating experimentally. (This is the main reason why you must always allocate participants to conditions randomly when using a between-groups design. If there are any systematic differences between your experimental conditions other than the ones produced by you as the experimenter, then you are doomed – whatever results you have are uninterpretable, as any differences between the groups could be due either to your experimental manipulation, the uncontrolled systematically-varying factors, or a mixture of both: see page 71). However, all this non-systematic variation between groups can make it hard to detect the systematic variation between the groups that is attributable to our experimental manipulation – especially if the effects of our manipulation are relatively small in size. If memory scores are all over the place to begin with, it is going to be that much harder for the effects of our mallet manipulations to reveal themselves.

  Some examples of between-groups designs

  The post-test only/control group design (Figure 3.5) is probably the most straightforward type of ‘true’ experiment that you can perform. It has a two-group independent measures design: you take several participants, divide them randomly into two groups, and give one group (the ‘experimental’ group’) some treatment that you don’t give to the other group (the ‘control’ group). The performance of the two groups is then measured: if it differs, then you can be reasonably confident that the difference is attributable to your experimental manipulation.