Saturday, March 2, 2013

Visual Relatedness is in the Eye of the Beholder: Remember Paris

[Image: pagoda]

How do we know if a tag is related to the visual content of an image? In this blogpost, I am going to argue that in order to answer that question, it is first necessary to decide who "we" is. In other words, it is necessary to first define the person or group of people who is judging visual relatedness, and only then ask whether the tag is related to the visual content of the image.

I'll start out by remarking that an alternative way of approaching the issue is to get rid of the human judge altogether. For example, this paper:

Aixin Sun and Sourav S. Bhowmick. 2009. Image tag clarity: in search of visual-representative tags for social images. In Proceedings of the first SIGMM workshop on Social media (WSM '09). ACM, New York, NY, USA, 19-26.

provides us with a clearly-defined notion of the "visual representativeness" of tags. A tag is considered to be visually representative if it describes the visual content of a photo: "sunset" and "beach" are visually representative, while "Asia" and "2008" may not be. Concretely, a tag is visually representative if it is associated with images whose visual representations diverge from that of the overall collection. The model in the paper uses a visual clarity score: the Kullback-Leibler divergence between a language model built from the images carrying the tag and a language model of the whole collection, both over visual-bag-of-words representations.
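To make the idea concrete, here is a minimal sketch of such a clarity score in Python. It is not the implementation from the paper: I assume the visual words have already been extracted for each image, and the function name and the Dirichlet-style smoothing parameter mu are my own choices.

```python
import math
from collections import Counter

def clarity_score(tag_images, all_images, mu=100.0):
    """KL divergence between the visual-word language model of the images
    carrying a tag and the language model of the whole collection.
    Each image is a list of visual-word ids (bag-of-visual-words)."""
    # Collection language model: relative frequency of each visual word.
    coll_counts = Counter(w for img in all_images for w in img)
    coll_total = sum(coll_counts.values())
    p_coll = {w: c / coll_total for w, c in coll_counts.items()}

    # Tag language model, smoothed with the collection model
    # (Dirichlet-style smoothing with parameter mu -- an assumption here).
    tag_counts = Counter(w for img in tag_images for w in img)
    tag_total = sum(tag_counts.values())

    kl = 0.0
    for w, p_c in p_coll.items():
        p_tag = (tag_counts.get(w, 0) + mu * p_c) / (tag_total + mu)
        kl += p_tag * math.log(p_tag / p_c)
    return kl
```

On a toy collection, a tag whose images concentrate on a few visual words (think sunsets) gets a higher score than a tag whose images look like a random sample of the collection.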

Why don't we like this alternative? Well, this definition of visual representativeness does not reflect visual representativeness as perceived by humans. It is not clear that we are really helping ourselves build multimedia systems that serve human users if we simplify the problem by getting rid of the human judge.

The issue is the following: Humans have no problem confirming that an image depicting a pagoda at sunset and an image depicting a busy intersection with digital billboards both depict "Asia".  There is something about the visual content of these two images that is representative of "Asia", and it seems to be a simple leap from there to conclude that the tag "Asia" is related to the visual content of these images.

But there was a time in my life when I didn't know what a pagoda was. It was more recently than one might think (although certainly before the workshop at which the paper above was presented, held at ACM Multimedia 2009 in Beijing), which prompts me to think further.

A solution might be the following: we could stipulate that in my pre-pagoda-awareness years, I should have been excluded from the set of people who get to judge whether photos are related to Asia. But then we would have to worry about my familiarity with digital billboards, and then about the next Asia indicator, and so on, until I and everyone I know have been excluded from the set of people who get to judge the visual relatedness of photos to tags. In short, this solution does not lead to a clearer definition of how we can know that a tag relates to the visual content of an image.

Why do things get so complicated? The problem, I argue, is that we ask the question of a pair: "For this image and this tag (i,t): is the visual content of the image related to this tag?" This question does not lead to a well-defined answer.

The answer is, however, well defined if we ask the question of a triple: "For this image, this tag and this person or group of people (i,t,P): is the visual content of the image related to this tag in the judgement of this person or group of people?" In other words, we need to look for the relationship between tags and the visually depicted content of images in the eye of the beholder.

We can then perform a little computational experiment: put the person or group of people P in a room, expose them to the visual content of image i, and ask the yes/no question "Is tag t related to image i?"
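A minimal sketch of the bookkeeping such an experiment might need follows. The image ids, annotator ids, and stored answers are all hypothetical; the point is only that the unit of annotation is the triple (i,t,P) rather than the pair (i,t).

```python
# Hypothetical judgements: one yes/no answer per (image, tag, person) triple.
judgements = {}

def record_judgement(image_id, tag, person_id, answer):
    """Store one person's answer to "Is tag t related to image i?"."""
    judgements[(image_id, tag, person_id)] = answer

record_judgement("img_001", "paris", "annotator_A", True)   # e.g. took the photo
record_judgement("img_001", "paris", "annotator_B", False)  # e.g. has never seen a pagoda

# The same (image, tag) pair receives different answers once P is made explicit.
answers_for_pair = {p: a for (i, t, p), a in judgements.items()
                    if (i, t) == ("img_001", "paris")}
print(answers_for_pair)  # {'annotator_A': True, 'annotator_B': False}
```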

The answer of P is going to depend on the method that P uses in order to reason from the visual content of i to the relatedness of tag t. Here's a list of different Ps who are able to identify Paris for different reasons.

(i, "paris" P1): I took the picture and when I see it, I remember it.
(i, "paris" P2): I was there when the picture was taken and when I see it, I remember this moment.
(i, "paris" P3): Someone told me about a picture that was taken in Paris and there is something that I see in this picture that tells me that this must be it.
(i, "paris" P4): I know of another picture that looks just like this one and it was labeled Paris.
(i, "paris" P5): I've seen other pictures like this an recognize it (the specific buildings that appear).
(i, "paris" P6): I've been there and recognize characteristics of the place (the type of architecture).
(i, "paris" P7): I am a multimedia forensic expert and have established a chain of logic that identifies the place as Paris.

Perhaps even more are possible. What is clear is the following: it would be nice if we had ended up with just two Ps, expert annotators and non-expert annotators. However, it looks like what we have are judgements based on quite a few differences in personal history, previous exposure, world knowledge, and expertise.

If we want to develop truly useful algorithms that validate the match between visual content and tags, we have a lot more work to do in order to cover all the (i,t,P) triples.

The key is to get a chance to question enough Ps. Multimedia research needs the Crowd.