Very often, researchers (including me) use multiple-choice tests to collect data to determine whether an intervention has worked. Does the Dance Your Way to Math curriculum really result in higher test scores? Does Lollipop Spelling reduce the number of spelling errors? And on and on.
I remember being told that statistics meant to generalize to the population, like internal consistency reliability or test-retest reliability, should be computed using only the pre-test scores (in the case of internal consistency) or only the control group (in the case of test-retest correlations and post-test internal consistency reliability). The reason, we are told, is that “something has been done” to the intervention group, which means they are no longer representative of the population. While I agree with that reasoning in the case of test-retest correlation, I am not so convinced in the case of internal consistency.
Let’s talk about floor and ceiling effects for a minute.
A floor effect is when most of your subjects score near the bottom. There is very little variance because the floor of your test is too high; in layperson terms, your questions are too hard for the group you are testing. This is even more of a problem with multiple-choice tests. With other formats, a subject who doesn’t know the answer is unlikely to guess that it is, say, (a+b)(a-b), and so gets it wrong. With a multiple-choice test with four choices, they will randomly get it correct 25% of the time. If a bunch of questions are too hard, you have a bunch of people getting each one right just by chance. Combine low variance with a lot of random error and your internal consistency reliability is going to be in the toilet. So, let’s say you have exactly that on your pre-test. Then you test again after some time, and your control group, having had no training in the meantime, scores equally low: the problems are still too hard, and you still have random guessing and low variance.
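As a quick sanity check, here is a small numpy simulation (all numbers hypothetical: 500 test-takers, 20 four-option items, everyone guessing at random) showing what pure guessing does to Cronbach’s alpha:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a subjects x items matrix of 0/1 scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of the total score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
# Floor effect: nobody knows the material, so every answer is a 1-in-4 guess
guessing = (rng.random((500, 20)) < 0.25).astype(int)
print(round(cronbach_alpha(guessing), 3))  # close to zero
```

Because the items carry no signal, the item responses are independent of one another, and alpha lands near zero, exactly the “in the toilet” reliability described above.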
A ceiling effect is the opposite: all of your subjects score near the top. There is very little variance because the ceiling of your test is too low; in layperson terms, your questions are too easy for the group you are testing. Here you don’t have the problem of random guessing, but you do have low variance. Think back to Statistics 101: restriction of range attenuates correlations. Again in layperson terms, if you correlate the height and weight of NBA players, for example, you find almost no relationship between height and weight, because they are ALL very tall and ALL very heavy. If you make the questions on your pre-test easier, that may give you better internal consistency reliability at pre-test, but since a good percentage of your subjects knew the answers at the beginning, by the end of your training nearly all of them may, and then you run into a ceiling effect.
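The attenuation is easy to see in a simulation. Assuming a made-up population where height and weight correlate around 0.7, restricting the sample to the tallest 2% (the “NBA players”) shrinks the observed correlation sharply:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical population: standardized height and weight, correlated at 0.7
cov = [[1.0, 0.7], [0.7, 1.0]]
height, weight = rng.multivariate_normal([0, 0], cov, size=100_000).T

r_all = np.corrcoef(height, weight)[0, 1]

# Restrict the range: keep only the tallest 2% of the population
tall = height > np.quantile(height, 0.98)
r_tall = np.corrcoef(height[tall], weight[tall])[0, 1]

print(round(r_all, 2), round(r_tall, 2))  # restricted correlation is much smaller
```

The same mechanism is what drags down internal consistency when a ceiling effect compresses everyone’s scores toward the top of the scale.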
My suggestion is to compute internal consistency reliability at the beginning of your study for the whole group, and at post-test for the control and intervention groups separately. You may find that, having successfully avoided both floor and ceiling effects, you get good internal consistency reliability for the intervention group at post-test.
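A sketch of that workflow, using simulated data from a simple logistic item model (all the specifics here, 200 subjects, 15 items, a 1.5-point training boost, are made up for illustration):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a subjects x items matrix of 0/1 scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    return k / (k - 1) * (1 - scores.var(axis=0, ddof=1).sum()
                          / scores.sum(axis=1).var(ddof=1))

rng = np.random.default_rng(2)
n_subj, n_items = 200, 15
ability = rng.normal(size=n_subj)        # latent skill, one value per subject
treated = rng.random(n_subj) < 0.5       # True = intervention group

def sim_scores(theta, difficulty=1.0):
    """Simulate right/wrong answers: P(correct) rises with skill minus difficulty."""
    p = 1 / (1 + np.exp(-(theta[:, None] - difficulty)))
    return (rng.random((len(theta), n_items)) < p).astype(int)

pre = sim_scores(ability)                                  # everyone, before training
post = sim_scores(ability + np.where(treated, 1.5, 0.0))   # training raises skill

print("alpha, pre-test, everyone:     ", round(cronbach_alpha(pre), 2))
print("alpha, post-test, control:     ", round(cronbach_alpha(post[~treated]), 2))
print("alpha, post-test, intervention:", round(cronbach_alpha(post[treated]), 2))
```

Reporting all three numbers lets a reader see whether reliability held up where the scores actually moved, rather than only where nothing happened.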