563.10.3 CAPTCHA
Presented by: Sari Louis
SPAM Group: Marc Gagnon, Sari Louis, Steve White
2
Agenda
• Definition
• Background
• Applications
3
Definition
• CAPTCHA stands for Completely Automated
Public Turing test to tell Computers and Humans Apart
• A.K.A. Reverse Turing Test, Human Interaction Proof
4
Background
• First used by AltaVista in 1997
– Reduced spam "add-URL" submissions by over 95%
• CMU/Yahoo!
– Automated the creation and grading of challenges
• PARC
– Relies on document image degradation to prevent successful OCR
5
Background
• CAPTCHAs are based on open AI problems
• Breaking CAPTCHAs helps advance AI by solving these open problems
• Improving CAPTCHAs helps tell computers and humans apart
6
Background - Papers
• Pessimal Print: A Reverse Turing Test
Allison L. Coates, Henry S. Baird, Richard J. Fateman
• Telling Humans and Computers Apart Automatically
Luis von Ahn, Manuel Blum, and John Langford
• CAPTCHA: Using Hard AI Problems for Security
Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford
• Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)
7
Applications
• Free email services
• Online polls
• Preventing dictionary attacks
8
Types of CAPTCHAs
• Text based
– Gimpy, ez-gimpy
– Gimpy-r, Google CAPTCHA
– Simard’s HIP (MSN)
• Graphic based
– Bongo
– Pix
9
Text Based CAPTCHAs
• Gimpy, ez-gimpy
– Pick a word or words from a small dictionary
– Distort them and add noise and background
• Gimpy-r, Google’s CAPTCHA
– Pick random letters
– Distort them, add noise and background
• Simard’s HIP
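
The Gimpy-r style steps above (pick random letters, then distort and add noise) can be sketched roughly as follows. This is a minimal illustration, not the code of any of the systems named on this slide: the function names, the 0/1 bitmap representation, and the pixel-flipping noise model are all assumptions, and rendering the letters to an image is left out.

```python
import random
import string

def make_challenge(length=6, rng=None):
    """Pick random uppercase letters for the challenge text."""
    rng = rng or random.Random()
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(length))

def add_noise(bitmap, flip_prob=0.05, rng=None):
    """Flip each pixel of a 0/1 bitmap with probability flip_prob.

    A stand-in for real distortion (warping, background clutter);
    assumes the text has already been rendered to a bitmap.
    """
    rng = rng or random.Random()
    return [[1 - px if rng.random() < flip_prob else px for px in row]
            for row in bitmap]

def grade(expected, answer):
    """Grade a response, ignoring case and surrounding whitespace."""
    return expected.strip().upper() == answer.strip().upper()
```

A server would keep the output of `make_challenge` on its side, show only the noisy image, and call `grade` on the submitted text.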
10
11
Graphic Based CAPTCHAs
• Bongo
– Display two series of blocks
– User must find the characteristic that sets the two series apart
– User is asked to determine which series each of four single blocks belongs to
12
Graphic Based CAPTCHAs
• PIX
– Create a large database of labeled images
– Pick a concrete object
– Pick four images of the object from the images database
– Distort the images
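
The selection steps above can be sketched as follows. This is a hedged illustration only: the in-memory `db` mapping (label to image identifiers) and the function name are hypothetical, and the distortion step is omitted because it depends on an imaging library.

```python
import random

def make_pix_challenge(db, n_images=4, rng=None):
    """Pick a concrete object and n_images images labeled with it.

    db is assumed to map each object label to a list of image identifiers.
    Returns (label, images); the label is the expected answer.
    """
    rng = rng or random.Random()
    # Only objects with enough labeled images can be used.
    candidates = [label for label, imgs in db.items() if len(imgs) >= n_images]
    label = rng.choice(candidates)
    images = rng.sample(db[label], n_images)
    return label, images
```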
13
Graphic Based CAPTCHAs
14
Audio Based CAPTCHAs
• Pick a word or a sequence of numbers at random
• Render them into an audio clip using text-to-speech (TTS) software
• Distort the audio clip
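
The three steps above can be sketched with the Python standard library. Note the loud assumption: no TTS engine is used here, so `render_placeholder` emits one sine tone per digit purely as a stand-in for speech (tones alone would be trivial for software to decode). The names, sample rate, and uniform noise model are likewise illustrative.

```python
import io
import math
import random
import struct
import wave

SAMPLE_RATE = 8000  # samples per second (assumed)

def pick_digits(n=5, rng=None):
    """Pick a sequence of n digits at random."""
    rng = rng or random.Random()
    return [rng.randrange(10) for _ in range(n)]

def render_placeholder(digits, tone_seconds=0.2):
    """Stand-in for TTS: one sine tone per digit, pitch depends on the digit."""
    n = int(SAMPLE_RATE * tone_seconds)
    samples = []
    for d in digits:
        freq = 300 + 50 * d
        samples.extend(math.sin(2 * math.pi * freq * i / SAMPLE_RATE)
                       for i in range(n))
    return samples

def distort(samples, noise_level=0.2, rng=None):
    """Add uniform random noise and clip the result to [-1, 1]."""
    rng = rng or random.Random()
    return [max(-1.0, min(1.0, s + rng.uniform(-noise_level, noise_level)))
            for s in samples]

def to_wav_bytes(samples):
    """Encode float samples in [-1, 1] as 16-bit mono WAV data."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(b"".join(struct.pack("<h", int(s * 32767))
                               for s in samples))
    return buf.getvalue()
```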
15
Breaking CAPTCHAs
• Most text based CAPTCHAs have been broken by software
– OCR
– Segmentation
• Other CAPTCHAs were broken by
16
Proposed Approach
• Very similar to PIX
• Pick a concrete object
• Get 6 images at random from images.google.com that match the object
• Distort the images
• Build a list of 100 words: 90 from a full dictionary, 10 from the objects dictionary
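
The word-list step can be sketched as below. The slide does not say how the correct answer is placed, so this sketch assumes it is one of the 10 object words; the function name and parameters are hypothetical.

```python
import random

def build_word_list(full_dictionary, object_dictionary, answer,
                    n_full=90, n_objects=10, rng=None):
    """Build the list of candidate words shown to the user.

    Assumption: the correct answer counts as one of the n_objects words
    drawn from the objects dictionary.
    """
    rng = rng or random.Random()
    objects = set(object_dictionary)
    # Decoy objects: any object word other than the answer.
    other_objects = sorted(objects - {answer})
    object_words = rng.sample(other_objects, n_objects - 1) + [answer]
    # General decoys: words from the full dictionary that are not objects.
    general = sorted(set(full_dictionary) - objects)
    words = rng.sample(general, n_full) + object_words
    rng.shuffle(words)
    return words
```

Excluding object words from the general pool keeps the 90/10 split exact and avoids duplicate entries.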
17
Proposed Approach - Technical
• Make an HTTP call to images.google.com and search for the object
• Screen-scrape the first 2-3 result pages to get the list of images
• Pick 6 images at random
• Randomly distort both the images and their URLs before displaying them
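
A rough sketch of the scraping and selection steps, assuming the result-page HTML has already been fetched (for example with `urllib.request`) and that images appear as plain `<img src="...">` tags, which may not match the real page markup. Masking each URL behind a random token is one possible reading of "distorting" the URLs, not a method stated in the slides; image distortion itself is omitted.

```python
import random
import secrets
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

def extract_image_urls(html_text):
    """Screen-scrape image URLs from one result page."""
    parser = ImgSrcCollector()
    parser.feed(html_text)
    return parser.sources

def pick_and_mask(urls, n=6, rng=None):
    """Pick n URLs at random and hide each behind an opaque random token.

    Returns {token: original_url}; the CAPTCHA server would serve the
    (distorted) image under the token so the source URL is never exposed.
    """
    rng = rng or random.Random()
    chosen = rng.sample(urls, n)
    return {secrets.token_urlsafe(8): url for url in chosen}
```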
18
Proposed Approach - Benefits
• The database already exists and is public
• The database is constantly being updated and maintained
• Adding “concrete objects” to the dictionary is virtually instantaneous
• Distortion prevents caching hacks
19
Proposed Approach - Drawbacks
• Not accessible to people with disabilities (as is the case with most CAPTCHAs)
• Relies on Google’s infrastructure