The effects of User Interface design on testing

Jason Dyer posted an excellent critique of some of the exam items and interface choices made by the American Institutes for Research in its development of new Common Core exams.  The one he reviewed is called “SAGE”.

Designing a user interface is difficult work, especially when your UI will serve thousands upon thousands of people.  But for education testing, the most important consideration is that every effort be made to minimize the effect of the UI on the measurement of the student’s test performance/score.

I believe the SAGE test has huge flaws that will affect the validity of the test results.  If students, teachers, schools, districts, and even the Common Core standards themselves are to be judged on these results, I believe a lot more work needs to go into the UI design.

Here is an example flaw, as described by Dyer:

The percents in the problem imply the answer will also be delivered as x%, but there is absolutely no way to type a percent symbol in the line (just typing % with the keyboard is unrecognized). So something like 51% would need to be typed as .51. Fractions are also unrecognized.

(Also the problem is not precisely worded, if I were nitpicky– and I am– then I would restate it to read, “If Ms. Jones chooses a student at random from her class, what is the probability…“)

We may intuit that the authors+designers of this question and its interface have either (a) not given thought to the difference between .51 and 51% or (b) did give it thought, but decided that the input box’s restrictions would be enough to guide students toward the decimal response.  Note that in either case, the authors+designers do not consider the format change from the question to the answer to be significant.  What would Strunk and White say?

To a person with reasonable mathematical maturity, it is clear that .51 and 51% are interchangeable in this context, even though they are not stylistically equivalent.  But these tests are supposed to measure a population whose mathematical maturity is in question.

The other, more subtle, element of this type of UI design is that the input box guides the user toward certain actions.  In a way, it is part of the question.  A student may not type in 51/100.  A student attempting to type “%” is silently blocked from doing so.  These unspoken rules of communication and interaction deserve attention.  Do they affect student/user behavior?  Is the effect “regressive,” in the sense that a low-performing student is more likely to have trouble deciphering the interface (and therefore more likely to do poorly on this test item)?
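To make the equivalence concrete, here is a minimal sketch of how a scoring routine could normalize “.51”, “51%”, and “51/100” to one canonical value, so that no format would need to be silently blocked.  This is my own illustration, not anything from the SAGE system; the function name and behavior are assumptions.

```python
from fractions import Fraction

def normalize_answer(raw: str) -> Fraction:
    """Reduce a response like '.51', '51%', or '51/100' to one canonical value.

    Hypothetical helper for illustration only; not part of any actual test engine.
    """
    text = raw.strip()
    if text.endswith("%"):
        # "51%" -> 51/100
        return Fraction(text[:-1].strip()) / 100
    # Fraction() already accepts both decimal strings (".51") and "51/100"
    return Fraction(text)

# All three formats collapse to the same value, so none needs to be rejected.
assert normalize_answer(".51") == normalize_answer("51%") == normalize_answer("51/100")
```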

The other example is rather indicative of the low level of attention the UI has received:

[Screenshot of the test item: find two numbers with a product of 323 and a difference of 2, answered in a free-response box with the instruction “Enter each number on a separate line.”]

“Enter each number on a separate line” is kind of the opposite of the issue from the percents problem above.  In this case, the authors+designers consider their interface to need direct instructions on how to respond.  On top of being incongruous in style, it is just plain wordy.

The designers of these tests seem eager to move away from the multiple-choice style of test questioning.  Multiple choice has the big drawback that the answer is given, hidden only by the distractors, which as a set may give clues as to which response is not like the others.  It’s something that must be considered carefully when creating a multiple-choice test.  So, exploring alternative response systems enabled by the technology is reasonable.

Perhaps the authors did not want to label the input boxes?  “First number: _____  Second number: _____”  I can certainly see the reason to avoid “first” and “second”, along with “number 1” or “one number” and “another”.

Perhaps the authors did not want to use variables?  “Two numbers, x and y, have a product of 323 and a difference of 2.  x = ___  y = ___”  OK, maybe variables would be beyond the scope of the test, or would be unfamiliar to the students at that stage.

But there are other options.  The answer input boxes could be embedded inline within a complete sentence: “The two numbers are ___ and ___.”

Finally, Dyer points out that the question implies one solution, {17, 19}, but there is another: {-17, -19}.  That makes me wonder how the test is to be graded.  Wolfram Alpha has demonstrated that software is capable of interpreting a variety of mathematical inputs.  Can we grade these tests by computationally checking the answers rather than comparing them to an answer key?  If a student inputs -17 and -19, why not compute the product and difference and see if they match the stem?  Can a computer algebra system or dynamic geometry software check student inputs by calculation?
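As a rough illustration of that idea (my own sketch, not anything the test vendor has described), grading the two-numbers item could amount to checking the stem’s conditions directly against whatever pair the student enters.  The function below is hypothetical.

```python
def satisfies_stem(a: int, b: int) -> bool:
    """Check a pair of numbers directly against the stem:
    a product of 323 and a difference of 2 (in either order)."""
    return a * b == 323 and abs(a - b) == 2

# Both the intended answer and the negative pair pass; no answer key is needed.
print(satisfies_stem(17, 19))    # True
print(satisfies_stem(-17, -19))  # True
print(satisfies_stem(17, 20))    # False
```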

I believe it is very difficult to design these items well, and I do think it is worthwhile to explore a variety of input methods.  However, I believe more effort needs to go into standardizing those input methods.  I believe the best interface for these tests is one the user is minimally aware of, i.e., the user should not have to figure out how to take the test in addition to taking the test.