[music] Hi, I'm Joy and I research how computers detect, recognize, and classify people's faces. In my TED featured talk, I spoke about my experience with the Coded Gaze, my term for algorithmic bias. The system I was using worked well on my lighter-skinned friend's face, but when it came to detecting my face, it didn't do so well, until I put on a white mask.
After my talk was posted, I tested my speaker profile image across different facial analysis demos. Two of the demos didn't detect my face. The other two, well they misgendered me.
The demos didn't even distinguish between gender identity and biological sex. They just provided two labels: male and female. Now I wanted to see if these results were just because of my unique facial features, or if this was something that was more of a pattern across other faces too.
So I began a project that became my MIT thesis: Gender Shades. I wanted to see how well different gender classification systems worked across different people's faces, and if the results changed based on somebody's gender or their skin type.
I created a dataset of over a thousand images of parliament members from countries ranked among the top ten in the world for their representation of women in power. To get at a range of skin types, I chose three African countries and three European countries, so I could see how the system performed on lighter skin and darker skin. Then I chose three companies to evaluate: IBM, Microsoft, and Face++, which has access to one of the largest datasets of Chinese faces.
So now with the dataset and the companies, I decided to run a test. The companies appeared to have relatively high accuracy overall. Microsoft performed best, achieving 94% accuracy on the whole dataset.
All companies performed better on males than females, and all companies also performed better on lighter subjects than on darker subjects. When we analyzed the results by four subgroups, we saw that all companies performed worse on darker females. IBM and Microsoft performed best on lighter males and Face++ performed best on darker males compared to the others.
IBM had the largest gap in accuracy, with a difference of 34 percentage points in error rates between lighter males and darker females. I was surprised to see multiple commercial products failing on one in three women of color. In fact, as we tested women with darker and darker skin, the chances of being correctly gendered came close to a coin toss.
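The kind of subgroup audit described above can be sketched as a simple loop over labeled predictions. This is a minimal illustration, not the actual Gender Shades evaluation code; the records and numbers below are invented placeholders, and the real study used over a thousand images and the companies' live APIs.

```python
# Minimal sketch of an intersectional accuracy audit, in the spirit of
# Gender Shades. All records below are illustrative, not real study data.
from collections import defaultdict

# Each record: (predicted_gender, true_gender, skin_type) for one face image.
results = [
    ("female", "female", "darker"),
    ("male",   "female", "darker"),   # a misgendered darker-skinned female
    ("male",   "male",   "darker"),
    ("female", "female", "lighter"),
    ("male",   "male",   "lighter"),
    ("male",   "male",   "lighter"),
]

def subgroup_accuracy(records):
    """Accuracy broken out by each (gender, skin type) subgroup."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for predicted, actual, skin in records:
        key = (actual, skin)
        totals[key] += 1
        hits[key] += predicted == actual
    return {key: hits[key] / totals[key] for key in totals}

acc = subgroup_accuracy(results)
# Gap between the best- and worst-served subgroups, analogous to the
# 34-point error-rate gap reported for IBM.
gap = max(acc.values()) - min(acc.values())
```

Breaking results out by intersecting subgroups, rather than reporting a single aggregate number, is exactly what exposes disparities that a "94% overall" figure hides.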
While more research is needed to explore the specific reasons for the accuracy differences, one general issue is the lack of diversity in training images and benchmark datasets. Failure to separate accuracy results across traits like gender and skin type also makes it harder to identify differences. Companies should do better with commercially sold products, especially since the machine learning techniques that have made gender classification possible are applied to other domains of computer vision like facial recognition, and other areas of artificial intelligence like predictive analytics.
Predictive systems can help determine who is hired, who is granted a loan, or what information a particular individual sees. These data-centric technologies are vulnerable to bias and abuse. As a result, we must demand more transparency and accountability.
We have entered the age of automation overconfident yet underprepared. If we fail to make ethical and inclusive artificial intelligence, we risk losing gains made in civil rights and gender equity under the guise of machine neutrality. The Coded Gaze reflected in the Gender Shades project must be faced.