As an enthusiast of human vision and owner of two textbooks on perception, this white/gold or blue/black controversy is fascinating! If you don’t mind, I’ll gasbag about colour perception and adaptation for a bit.
We don’t fully understand human colour perception, but the basics aren’t too hard. Things start at the photoreceptor level, with your cones and rods. The best analogy for them that I’ve heard is a room full of mousetraps. Toss a ping-pong ball at a photoreceptor, and a complicated cascade of ever-amplifying chemical reactions goes off. Or, in layman’s terms, SNAP SNAPSNAPSNAP SNAPSNAP SNAP.
Problem: those mousetraps need to be reset. Your photoreceptors solve that by pumping proteins into the cell that gradually reverse the cascades. For analogy’s sake, we’ll picture Tom Cruise dropping down from the ceiling to randomly flip back a mousetrap.
What happens if a photon/ping-pong ball hits a trap Cruise hasn’t reset yet? Nothing, as you’d expect. No SNAP SNAP, no signal that a photon has hit. This leads to a very simple way to handle varying light levels: if a lot of photons are streaming in, many will hit dud traps and be ignored, which in non-analogy form is known as bleaching. If few ping-pong balls are hitting your photoreceptors, Tom Cruise will have a chance to catch up and thus increase the sensitivity of the room. This negative feedback loop will adjust your photoreceptors’ sensitivity to match the scene they’re viewing, so long as your eye frequently zips around the scene (that’s called saccading) and your photoreceptors respond slowly to changes (they do, hence why you also evolved fast-responding irises).
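That negative feedback loop is simple enough to sketch in a few lines of code. Here’s a toy model I’ve made up for illustration (the function name, trap counts, and rates are all arbitrary): traps “bleach” when photons hit them, Cruise re-arms a fixed number per tick, and the fraction of armed traps at equilibrium is the room’s sensitivity.

```python
# Toy model of photoreceptor adaptation: a pool of "mousetraps" that
# bleach when hit by photons, and get reset at a fixed rate per tick.
def simulate(photon_rate, steps=2000, total_traps=1000, reset_rate=50):
    armed = total_traps  # traps ready to fire
    for _ in range(steps):
        # Photons only register if they hit an armed trap; the rest
        # land on dud traps and are ignored ("bleaching").
        hits = min(photon_rate * armed // total_traps, armed)
        armed -= hits
        # Tom Cruise resets a fixed number of traps each tick.
        armed = min(armed + reset_rate, total_traps)
    # Sensitivity = fraction of traps still armed at equilibrium.
    return armed / total_traps

dim = simulate(photon_rate=60)
bright = simulate(photon_rate=600)
print(dim, bright)  # bright light leaves fewer traps armed: lower sensitivity
```

Run it and the bright scene settles at a much lower sensitivity than the dim one, exactly the self-adjusting behaviour described above.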
Obviously, Tom Cruise is only an actor. He might be unable to keep up with the onslaught of photons, resulting in a visual “burn-out” which persists until your pumps catch up. This isn’t damaging per se, unless the light floods your retina with ionizing or semi-ionizing radiation, or enough energy to literally fry it.
So, what does this have to do with colour adaptation? Your long-wavelength photoreceptors adjust their sensitivity to match the average amount of red light streaming in. The medium- and short-wavelength receptors do the same, but while there’s some local communication between receptors, there’s little to prevent each type of receptor from adjusting to a different level. So an excessively red scene will result in your long-wavelength cones becoming less sensitive compared to the other two types; as the level of blue increases, say, the short-wavelength cones drop their sensitivity.
So the average colour of your entire field of vision winds up looking approximately gray, no matter what the real-world colour balance is. Ta daa, cheap and easy colour correction! This gets complicated as light levels drop and your rods kick in, though, and while the typical person has three cone types, a few people have anywhere from zero to four.
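This “average to gray” trick has a direct analogue in image processing, usually called the gray-world assumption: scale each channel so its mean matches the overall mean. A minimal sketch, with made-up pixel values and a function name of my own invention:

```python
# Gray-world colour correction: assume the scene averages to gray,
# then scale each channel so the channel averages all match.
def gray_world(pixels):
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    gray = sum(means) / 3            # target average for every channel
    gains = [gray / m for m in means]  # per-channel sensitivity adjustment
    return [tuple(min(255, round(v * g)) for v, g in zip(p, gains))
            for p in pixels]

# An excessively red scene: red gets damped, blue gets boosted.
reddish = [(200, 100, 50), (180, 90, 60), (220, 110, 40)]
balanced = gray_world(reddish)
print(balanced[0])  # → (117, 117, 117): the average pixel is now gray
```

The per-channel gains here play the same role as each cone type independently dialing its sensitivity up or down.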
Sadly, that’s not the end of the complications. Well OK, technically it is, as the back of your eye is an extension of your brain, but that’s kind of the point: the signal from your eye is compressed, collected, and transmitted to the lowest levels of your brain’s perceptual system. Stretch out your arm to its full length and hold up your pinky. Stare at the base of your fingernail; the circle centred on that point, with a diameter ranging from the closest knuckle to the tip of your finger, contains about 90% of your eye’s cones. Half of your optic nerve is devoted to that area! By extension, though, your vision outside of that tiny section does not entirely come from your eye. It’s actually a mix of low-resolution sensation combined with memories and motion field data, with a dollop of partial or even total hallucination.
Your brain can only pull this off via a massive mess of feedback and feedforward loops. We have algorithms that accurately duplicate human object detection, to the point that they even duplicate our failure cases, but they come with a catch: that eerie similarity only holds for human subjects whose feedback from the higher brain functions into the lower levels of the visual system has been suppressed. Few people realize there’s just as much communication going down the visual system as there is up it.
So the human side of colour correction is frigging complicated, and depends both on high-level context and low-level biology. But that dress isn’t sitting directly in front of your eyes, it’s being represented by a photo. We also have to understand how digital cameras see colour to get the complete picture here.
If the human photoreceptor is a room full of mousetraps, digital camera photoreceptors are a bucket. Light slash ping-pong balls plunk in and quietly sit there. Every once in a while, Tom Cruise comes around and counts the number of balls. He then pulls out a cork in the bottom of the bucket, waits a fixed length of time, then pops it back in.
It’s rare that the words “Quantum Mechanics” and “relatively simple” appear in the same sentence, but here we are. Don’t get me wrong, we could add complications to make this analogy more accurate (Cruise sheds ping-pong balls, corks get stuck, buckets don’t fully drain, balls bounce across buckets), but the basic level is more than enough for our purposes. Notice there’s very little self-adaptation here: the mousetraps gradually get less sensitive, whereas the buckets stay consistently sensitive until they’ve filled completely. You can impose something like sensitivity by sending Tom Cruise out more often, but it’ll always be post-hoc, and with millions of buckets to manage you can’t get fancy about it. Nor do buckets communicate with one another, unlike your photoreceptors, and digital cameras don’t saccade (outside of self-cleaning).
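In code, the bucket model is just a counter with a hard ceiling, and “sensitivity” is a gain applied after the fact, which can’t recover anything once a bucket overflows. A toy sketch, with an arbitrary bucket capacity and function name of my choosing:

```python
FULL_WELL = 1000  # bucket capacity in ping-pong balls (photons)

def expose(photon_counts, gain=1.0):
    readings = []
    for photons in photon_counts:
        # The bucket fills linearly, then clips: overflow is simply lost.
        well = min(photons, FULL_WELL)
        # Gain (think ISO) is applied post-hoc; it scales the count but
        # cannot restore detail the bucket already threw away.
        readings.append(min(round(well * gain), FULL_WELL))
    return readings

scene = [100, 500, 1500, 3000]  # the last two overflow the bucket
print(expose(scene))            # → [100, 500, 1000, 1000]
print(expose(scene, gain=2.0))  # clipped buckets stay indistinguishable
```

Note how 1500 and 3000 photons read identically: that’s the same clipping that turned the brightest parts of the dress photo into featureless white.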
So colour adaptation has to come from somewhere other than the photo sensor. That must be your camera’s CPU, but while your biology easily handles many trillions of calculations per second via massive parallelism, that poor computer chip is comparatively serial and limited to working in lockstep with a clock capped at a few billion ticks per second. Shortcuts are necessary.
Fortunately, shortcuts are in abundance. An outdoor scene typically has a ground, sky, and objects to scatter sunlight around; indoor scenes are usually dominated by indirect light from lamps. All this bouncing and diffusing reduces the contrast, enough so that light levels confine themselves to a narrow band around the mean. That means you don’t need to capture the full dynamic range that your eye can see, you can cheat and just capture a small window of that range. On top of that, our sun produces a very predictable spectrum of light, as do common incandescent and fluorescent lights. You can match that up with your eye’s response in those conditions and easily translate from buckets to mousetraps. So long as those conditions exist, and are combined with strong, clear standards on every step from capture to display (which also exist), you can be pretty confident that what you saw will match what’s displayed.
Here’s an example of this. I took an image I captured in RAW format and changed the assumed lighting conditions around. Notice how the colour changes dramatically depending on the balance used, and even settings that you’d think would give accurate results badly mangle things. The shadow angle, combined with the fact I was facing North-ish, gives away that this was taken near sunset, so the reddish cast of the automatic and camera white balances is a great match for the reddish cast of the sun… but isn’t rock normally a gray-ish colour, and the snow and clouds white? Since I properly exposed the image, I can manually balance the colour off those objects and derive more accurate colours for everything in the scene, Sun be damned.
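Balancing off a known-neutral object works the same way in code: pick a patch that should be gray in real life (the rock, the snow), and scale each channel so that patch comes out neutral. A sketch with made-up values and an invented function name:

```python
def white_balance(pixels, neutral_patch):
    # neutral_patch: the captured RGB of something known to be gray/white.
    target = sum(neutral_patch) / 3
    gains = [target / c for c in neutral_patch]
    return [tuple(min(255, round(v * g)) for v, g in zip(p, gains))
            for p in pixels]

# Sunset light gives the snow a reddish cast; use it as the reference.
snow_as_captured = (240, 200, 160)
corrected = white_balance([snow_as_captured, (180, 120, 90)], snow_as_captured)
print(corrected[0])  # → (200, 200, 200): the snow patch is now neutral
```

The same gains then correct every other pixel in the scene, which is exactly the manual balancing trick described above.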
But what in that dress photo is white? The white portions are there because the digital camera’s buckets overflowed with photons, so they no longer reflect real-world values. The original highlights of the image are lost, and thanks to the physics of light those tend to be the most accurate reflection of the colour spectrum used to light the scene. There isn’t anything else in the scene that we can identify as white, so we’re forced to rely instead on the camera’s colour balance settings.
Modern digital cameras embed a lot of information about their settings in an EXIF block hidden within the image. We could have used that to extract the colour balance the camera thought the scene had. Unfortunately, the EXIF block can also contain when and where the photo was taken, the model and serial number of the camera, and even the number of photos that camera has taken. It’s routinely stripped from images because of that, and alas is gone from this image.
So we have no clues from the image on colour balance, leaving our visual system to figure out the colour adaptation on its own and work backwards. As it evolved to deal with natural light scenes and not monitors, though, it’s at a severe disadvantage here. As XKCD points out, a simple change of the surrounding illumination can wildly shift our colour adaptation. Those viewing the photo of the dress in a darkened room, lit primarily by dim light from the bluish sky, would fall into the white/gold camp. In a brightly lit room, our visual system would tend toward blue and black.
All of the above is why, when I was first asked what colour the dress was, my answer was “I don’t know, there isn’t enough information in that image.”
Thankfully, we don’t just have the image.
[HJH 2015/03/01: Minor grammar edits. And how could I talk about ping-pong balls and mousetraps without linking to a video?!]