Meta Platforms
When content creators don't provide alt text for their images, Meta's AI-powered Automatic Alt Text (AAT) generates machine descriptions so that blind and low-vision Facebook and Instagram users get at least some image context through their screen readers, a stopgap covering the billions of photos uploaded without labels.
ENABLE Model location
Stopgap
What it is
Meta Platforms deploys Automatic Alt Text (AAT) technology across Facebook and Instagram. AAT uses object recognition technology based on a neural network with billions of parameters, trained on millions of examples, to generate text-based descriptions of images when the person who uploaded or posted the image does not include their own alt text. Screen reader users who focus on an image will hear either the description entered by the poster or "a list of objects, concepts or locations the image may contain as recognized by Facebook's AAT technology."1
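The fallback behavior described above, announce the poster's own alt text when it exists and otherwise announce the machine-generated concept list, can be sketched roughly as follows. The `ImagePost` structure, `generate_concepts` stand-in, and `screen_reader_text` function are hypothetical names for illustration, not Meta's actual internals.

```python
# Minimal sketch of the alt-text fallback described above.
# All names here (ImagePost, generate_concepts, ...) are illustrative,
# not Meta's actual internals.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ImagePost:
    poster_alt_text: Optional[str]  # alt text written by the person posting, if any
    image_bytes: bytes              # raw image data passed to the recognition model


def generate_concepts(image_bytes: bytes) -> List[str]:
    """Stand-in for the object-recognition model; returns recognized concepts."""
    # A real system would run a trained vision model here.
    return ["3 people", "smiling", "outdoors"]


def screen_reader_text(post: ImagePost) -> str:
    # Prefer the human-written description when the poster provided one.
    if post.poster_alt_text:
        return post.poster_alt_text
    # Otherwise fall back to the machine-generated concept list.
    return "Image may contain: " + ", ".join(generate_concepts(post.image_bytes))


print(screen_reader_text(ImagePost(poster_alt_text=None, image_bytes=b"...")))
# Image may contain: 3 people, smiling, outdoors
```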
AAT was first launched on Facebook in April 2016 for iOS screen readers in English, developed by a team including research scientist Shaomei Wu and engineering manager Hermes Pique under the direction of Jeffrey Wieland, then Director of Accessibility. The system was trained to recognize over 100 concepts -- people, objects, scenes, and activities -- producing descriptions like "Image may contain three people, smiling, outdoors."2 It was significantly upgraded in 2021 (AAT v2), expanding the range of recognizable objects and improving description quality.
On mobile (Android, iPhone, iPad), users can also request a "detailed image description" that includes: the poster's own alt text (if any), Facebook's generated description, any text visible in the image (read top-left to bottom-right), position information for objects, size information used to determine the image's focus, and elements sorted by category (people, plants, objects, etc.).1
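As a rough illustration of how those listed components might be combined into one announcement, here is a hedged sketch. The parameter names, ordering, and phrasing are assumptions made for illustration and do not reflect Meta's implementation; only the list of components comes from the documentation cited above.

```python
# Hypothetical sketch of assembling the "detailed image description" from the
# components listed above. Names, ordering, and phrasing are assumptions.
from typing import Dict, List, Optional


def build_detailed_description(
    poster_alt_text: Optional[str],      # the poster's own alt text, if any
    generated_description: str,          # the AAT-generated description
    text_in_image: List[str],            # text found in the image, top-left to bottom-right
    positions: Dict[str, str],           # object -> position, e.g. "cake" -> "center"
    focus: Optional[str],                # largest element, treated as the image's focus
    by_category: Dict[str, List[str]],   # category -> elements, e.g. "People" -> [...]
) -> str:
    parts: List[str] = []
    if poster_alt_text:
        parts.append(f"Poster's description: {poster_alt_text}")
    parts.append(f"Generated description: {generated_description}")
    if text_in_image:
        parts.append("Text in image: " + " ".join(text_in_image))
    if focus:
        parts.append(f"Main focus: {focus}")
    for obj, position in positions.items():
        parts.append(f"{obj} appears in the {position} of the image")
    for category, items in by_category.items():
        parts.append(f"{category}: {', '.join(items)}")
    return ". ".join(parts) + "."
```

The ordering in this sketch puts the poster's own alt text first, reflecting the document's point that a human-written description should take priority over machine output.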
Meta also offers a Disability Answer Desk for accessibility feedback and supports users through the Be My Eyes app for live visual assistance.1
Why it matters
Facebook and Instagram host billions of photos. The vast majority are uploaded without alt text, making them invisible to screen reader users. AAT is a stopgap: it provides something when the content creator provides nothing. Meta itself acknowledges this: "The automatic alt text may not always be complete."1
This is a textbook stopgap because the ideal solution -- content creators writing their own accurate, contextual alt text -- is the responsibility of the user who posts the image, not the platform. But because most users don't write alt text, Meta intercepts at the platform level to generate machine-produced descriptions. The result is better than silence but worse than a human-written description that captures the emotional, social, or narrative context of the image. A birthday photo might be described as "may contain: 3 people, cake, indoor" rather than "Grandma blowing out candles at her 80th birthday party."
Real-world example
Facebook's detailed image description feature illustrates both the capability and the limitation of AAT. When a screen reader user focuses on a photo and opens the action menu, they can select "Generate detailed image description" to receive structured information: object identification, spatial positioning, text extraction, and size-based focus detection.1 This multi-layered approach gives blind users significantly more information than the original 2016 AAT, which only listed recognized objects.
The team behind AAT described their motivation in terms of inclusion: "We want to build technology that helps the blind community experience Facebook the same way others enjoy it."2 Yet the feature remains opt-in (users must actively request the detailed description) and machine-generated (it cannot capture social context, humor, irony, or relationships between people). The detailed description for a meme might list "text, person, animal" without explaining the joke. This gap is why AAT is classified as a stopgap rather than a solution: it mitigates the harm of missing alt text without addressing the root cause.
What care sounds like
- "Our platform generates automatic descriptions for every image uploaded without alt text."
- "We prompt content creators to add their own alt text before posting -- and make it easy to do."
- "We continuously improve our computer vision models to recognize more objects, scenes, and contexts."
What neglect sounds like
- "Alt text is the user's responsibility -- if they don't add it, the image is just invisible."
- "We don't have the resources to build an auto-description system."
- "Screen reader users are a small percentage of our user base."
What compensation sounds like
- "I ask sighted friends to describe photos in my feed because the auto-generated text is too vague."
- "The alt text said 'may contain: 2 people, outdoor' -- I still don't know who's in the photo or what's happening."
- "I use Be My Eyes to have a volunteer describe images that the AI can't."