Comfort-o-meter: how to measure the subjective

I want to write today about measuring subjective qualities. I’m going to talk about “comfort”, but it applies to lots of other things: “ease of use”, “satisfaction”, “goodness”, whatever you can think of that can’t be measured on an existing scale (e.g. °C, meters, number of errors).

I’m working on a project that involves some ergonomics; more specifically, it requires or would benefit from the label “comfortable”. Like we always do, we designed a test, collected participants, drafted consent forms, prepared the facilities… and then… the unexpected. To my embarrassment, we had to repeat our whole biomechanics experiment because we had gathered our results in a manner that didn’t afford any meaningful analysis. This is the brave account of what went wrong and how we solved it, which I send into the world hoping that at least one less designer will stumble against this cheeky stone ;)

Like I said, our goal was to determine if a particular physical interaction we were designing was “comfortable”. So, what did we do? I’m not going to explain exactly what the experiment consisted of, but the idea was to have people try it and then use some validated questionnaires to tell us if they had felt physical discomfort during or after the tasks we proposed. The questionnaire had a scale with a few ordinal values (uncomfortable, moderately uncomfortable, slightly uncomfortable, comfortable). What can go wrong? Well, in the first place, we were surprised to see that our participants had gone through considerably more discomfort than expected during our experiment. We were confused because we had tried the experiment while designing it and none of us had had *any* discomfort whatsoever. Still, it could be that overall the system was comfortable enough, and we could still do some kind of analysis: we had our ordinal variables… This is where our lucky misfortune saved us from dataitis (dataitis is what you get when you forget that information is data+meaning, and I can’t emphasize that “+meaning” enough). The crazy discomfort outcome made us uneasy. Clearly something had happened there, maybe the Hawthorne effect but also maybe something else. We started questioning our method and then someone said:

And what if it actually is uncomfortable? What if all [interactions of this kind] are uncomfortable? What if, for [this kind of interaction], comfortable just means less uncomfortable than average?

This made sense: maybe when we tried the experiment we hadn’t found it uncomfortable because we knew the context the interaction belonged to and our users didn’t. But this also opened a whole new set of questions:

  • What is comfortable? How should we measure it?
  • If all interactions of this kind are uncomfortable, and we measure with our ordinal categorical scale and aggregate the results, we’re going to find that our interaction is indeed uncomfortable to some degree, but does that mean it’s not comfortable enough?
  • And even worse, if all interactions of this kind (including ours) are comfortable, and we determine that our interaction is indeed comfortable through our experiment, will it still be comfortable out there in the market if someone does it better?

We were lucky: if the results had been positive, we would never have reflected on this. The lesson is that if you’re going to measure a subjective quality, you have to do it in comparison to something else. In other words: if there’s no scale, you have to create your own. Users’ pronouncements on subjective qualities can measure the improvement of a product over time because there’s a past to which new results can be compared, and they can only measure how good a product is if there are other products to compare with.

So here is what we did: we repeated the same test with two additional alternatives for the kind of interaction that we wanted to test. We did a within-user randomized test (with 12 participants) and asked users to rank all three interactions on the comfort scale. But there’s another tricky bit yet to come… how do we analyze the data? In these cases, one can be tempted to do the following things (all examples of things I’ve seen done, and even published!):

  • Convert the ordinal variables to numerical scores and use a Wilcoxon signed-rank test. This would be wrong because… the fact that you express an ordinal scale in a way that looks more like an interval scale does NOT turn your data into interval variables!! The signed-rank test ranks the sizes of the paired differences, so it needs those differences to mean something. A scale that goes from uncomfortable to comfortable is not, and will never be, an interval scale because the *difference* between a value and the one immediately following is undefined. The only thing we know is that for each individual, “slightly uncomfortable” means more comfortable than “moderately uncomfortable” and that is it; we don’t know, and there is no way to know, how much more.
  • A t-test would not only be wrong on the same grounds as the Wilcoxon signed-rank, but also because you can’t assume the distribution to be normal. Using dependent t-tests in cases like this is something I’ve seen done and published many times :-(
  • The Mann-Whitney U test. Mann-Whitney is at least a non-parametric, rank-based test, so it works for ordinal data. However, Mann-Whitney requires *mutual independence within and between samples*, which is not the case here. As the results were gathered in a within-user test, the way participants used the scale depended on their appreciation of the range of comfort provided by the three interactions, and the rank they gave each interaction was definitely affected by its comparison to the other two. So Mann-Whitney is not an option.
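To make the point about numerical scores concrete, here is a minimal sketch. Everything in it is made up for illustration: the two interactions A and B, their answers, and both numeric codings. The codings both respect the order of the labels, yet they disagree about which interaction has the higher mean, which is why any analysis built on those means says more about the coding than about comfort.

```python
from statistics import mean

# The four labels of the questionnaire, from worst to best.
LABELS = [
    "uncomfortable",
    "moderately uncomfortable",
    "slightly uncomfortable",
    "comfortable",
]

# Two numeric codings that both preserve the order of the labels.
# Neither is "the" right one: the scale itself says nothing about spacing.
even_coding = dict(zip(LABELS, [1, 2, 3, 4]))
stretched_coding = dict(zip(LABELS, [1, 2, 3, 10]))

# Made-up answers for two hypothetical interactions, A and B.
answers_a = ["comfortable", "uncomfortable", "uncomfortable", "comfortable"]
answers_b = ["slightly uncomfortable"] * 4


def mean_score(answers, coding):
    """Average the numeric codes assigned to a list of ordinal answers."""
    return mean(coding[a] for a in answers)


for name, coding in [("even", even_coding), ("stretched", stretched_coding)]:
    a = mean_score(answers_a, coding)
    b = mean_score(answers_b, coding)
    winner = "A" if a > b else "B"
    print(f"{name:9s} coding: mean A = {a}, mean B = {b}, higher mean: {winner}")
```

With the evenly spaced coding B comes out ahead; stretch the top label and A wins, without a single participant changing their answer.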

Maybe there are many right choices (and maybe you can think of more possible wrong choices), but this is what we did: a Friedman test. The Friedman test is a non-parametric test used to compare observations repeated on the same subjects. It is probably the least-known piece of math by Nobel Prize-winning economist Milton Friedman. You just have to have a look at the newspapers to see why people cared more about his demonstration of the complexity of stabilization policy… but for User Experience the Friedman test is key. It is a test that can answer a simple but powerful question:

N users rate k different products. Are any products ranked consistently higher or lower than the others?
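For the curious, this is roughly what the computation looks like. It is only a sketch under stated assumptions: the rankings below are invented (they are not our data), A, B and C are hypothetical interactions, and the formula is the textbook version for complete rankings without ties. In practice a statistics library does this for you, and handles ties too (SciPy, for instance, has `scipy.stats.friedmanchisquare`, which takes one sequence of measurements per condition).

```python
import math

# Made-up data: each row is one participant's ranking of three hypothetical
# interactions (A, B, C), where 1 = most comfortable and 3 = least comfortable.
rankings = [
    (1, 2, 3),
    (1, 3, 2),
    (1, 2, 3),
    (2, 1, 3),
    (1, 2, 3),
    (1, 3, 2),
    (1, 2, 3),
    (2, 1, 3),
    (1, 2, 3),
    (1, 2, 3),
    (3, 1, 2),
    (1, 2, 3),
]


def friedman_statistic(rows):
    """Friedman's Q for complete rankings without ties.

    Returns the statistic and the rank sum of each condition.
    """
    n = len(rows)       # number of participants
    k = len(rows[0])    # number of conditions being ranked
    rank_sums = [sum(row[j] for row in rows) for j in range(k)]
    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return q, rank_sums


q, rank_sums = friedman_statistic(rankings)

# Q is approximately chi-square distributed with k - 1 degrees of freedom.
# For k = 3 (2 degrees of freedom) the upper tail has the closed form exp(-Q/2).
p_value = math.exp(-q / 2)

print("rank sums (A, B, C):", rank_sums)
print(f"Q = {q:.2f}, p = {p_value:.4f}")
```

For this invented data set the rank sums are 16, 23 and 33, which gives Q ≈ 12.17 and p ≈ 0.002, so we would conclude that at least one interaction is ranked consistently differently from the others. Note that the test does not say *which* one; for that you still need to look at the rank sums and run follow-up pairwise comparisons.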
