Monday 9 July 2012

Killing spam and digitising content

That’s what I call a win-win.

From BBC World Service 20/06/12:

Duolingo aims to translate the entire web with the help of people starting to learn a new language. It's a project born out of guilt from the man behind one of the most annoying features of web surfing - those online security checks involving random words.

Duolingo hopes to convince millions of people to work for free and thus translate all web content in a matter of years.

As a 22-year-old graduate student in 2000, von Ahn invented the Captcha - those distorted images of words and numbers used to sign in to ticketing and social media websites, among others, which users have to decipher to prove they are human.

Erm, and to leave comments on blogs! It’s a necessary evil we all have to endure to stop comment bots leaving faux comments with hyperlinks in their name or in the body of the comment, in the hope of adding Google juice to their own sites. Sadly it doesn’t stop human commenters doing the same thing (in virtual sweatshops in, ahem, developing countries). Comment spammers are one of the true evils of the internet as far as I’m concerned, and they make me want to vomit.

The software is used by more than 350,000 websites to prevent computer programs from attacking them with spam. In 2007, von Ahn realised that 200 million Captchas were being typed by people all over the world every day.

"At first I felt really good about that because I thought, 'Look at the impact that I've had'," he says. "But then I started feeling bad."

Typing each Captcha takes about 10 seconds, he estimates. Multiply that by 200 million, and humanity as a whole is wasting about 500,000 hours on these security codes every day.
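For anyone who wants to check the sums, here is a quick sketch of that arithmetic. The two inputs are the estimates quoted above; the variable names are just mine. It comes out a little above the rounded 500,000 figure.

```python
# Estimates quoted in the BBC piece (not measured values).
SECONDS_PER_CAPTCHA = 10
CAPTCHAS_PER_DAY = 200_000_000

# Total time spent typing Captchas in one day, converted to hours.
total_seconds = SECONDS_PER_CAPTCHA * CAPTCHAS_PER_DAY
hours_per_day = total_seconds / 3600

print(f"{hours_per_day:,.0f} hours per day")  # prints "555,556 hours per day"
```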

Still, that seems a small price to pay if it helps keep spammers at bay.

He decided to put these hours to good use and devised ReCaptcha, a system that uses each human-typed response as both a security check and a means to digitise books one word at a time.

ReCaptchas use two words - one generated by the computer, the other taken from the pages of an old book, newspaper or journal that the system is digitising.

Each page has to be scanned individually, then run through a programme that transcribes every word. Computers have trouble reading text when pages are more than 50 years old, where paper is torn or yellowed or the typeface faded.

A human can do this easily - but can't always be relied on to get it right. When a user gets the first word right, the system logs their second response.

It then collates the most popular responses from a number of people.
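The idea described above can be sketched roughly like this. To be clear, this is only my own toy illustration of "check the first word, log the second, take the most popular answer", not ReCaptcha's actual code; the function names, the sample words and the simple lower-case matching are all assumptions.

```python
from collections import Counter


def record_response(known_word, typed_known, typed_unknown, votes):
    """Return True if the user got the known word right.

    Only then is their guess for the unknown (scanned) word logged.
    Hypothetical helper: exact, case-insensitive matching is an assumption.
    """
    passed = typed_known.strip().lower() == known_word.lower()
    if passed:
        votes[typed_unknown.strip().lower()] += 1
    return passed


def most_popular(votes):
    """Return the most frequently logged guess, or None if nothing was logged."""
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Four made-up users see the same pair of words.
votes = Counter()
record_response("morning", "morning", "upon", votes)
record_response("morning", "morning", "upon", votes)
record_response("morning", "morning", "upom", votes)   # typo in the unknown word
record_response("morning", "mourning", "open", votes)  # fails the check, not logged

best = most_popular(votes)
print(best)  # prints "upon"
```

The user who gets the known word wrong is ignored entirely, which is what keeps bots (and sloppy typists) from polluting the transcription.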

Clever. Very clever.

All of that doesn't detract from the fact that for most people, these security codes are nothing more than a frustrating waste of time.

Frustrating – yes, but not a waste of time.
