用Python写网络爬虫(第2版) (Writing Web Crawlers with Python, 2nd Edition)

by Posts & Telecom Press, Katharine Jarmul
February 2024
Intermediate to advanced
212 pages
3h 1m
Chinese
Packt Publishing

Chapter 7 Solving CAPTCHAs

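CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. As the full name suggests, a CAPTCHA tests whether the user is a real human. A typical CAPTCHA consists of distorted text that is hard for a computer program to parse but that a human can (hopefully) still read.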

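Many websites use CAPTCHAs to defend against bots that interact with their site. For example, many banking websites require a CAPTCHA at every login, which is quite painful. This chapter covers how to automate CAPTCHA solving, first with Optical Character Recognition (OCR) and then with a CAPTCHA-solving API.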

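In this chapter, we will cover the following topics.

  • Solving CAPTCHAs;
  • Using CAPTCHA-solving services;
  • Machine learning and CAPTCHAs;
  • Reporting errors.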

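When working with forms in Chapter 6, we logged in to the website with a manually created account and skipped the account-creation step, because the registration form requires a CAPTCHA, as shown in Figure 7.1.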

Figure 7.1

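Note that a different CAPTCHA image is displayed each time the form is loaded. To find out which parameters the form requires, we can reuse the parse_form() function written in the previous chapter.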

>>> import requests
>>> REGISTER_URL = 'http://example.python-scraping.com/user/register'
>>> session = requests.Session()
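
The parse_form() function mentioned above is not reproduced in this excerpt. As a rough sketch only, assuming it collects the name/value pairs of the input fields inside a form, something like the following would serve the same purpose. This version uses the standard library's html.parser rather than whatever the previous chapter used, and the sample HTML and its field names are made up for illustration; they are not the real registration form.

```python
from html.parser import HTMLParser


class _FormInputParser(HTMLParser):
    """Collect name/value pairs from <input> tags nested inside a <form>."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0   # > 0 while we are inside a <form> element
        self.data = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'form':
            self.form_depth += 1
        elif tag == 'input' and self.form_depth > 0:
            attrs = dict(attrs)
            name = attrs.get('name')
            if name:  # inputs without a name are not submitted
                self.data[name] = attrs.get('value')

    def handle_endtag(self, tag):
        if tag == 'form' and self.form_depth > 0:
            self.form_depth -= 1


def parse_form(html):
    """Return a dict mapping each named form input to its current value.

    Hypothetical stand-in for the previous chapter's parse_form().
    """
    parser = _FormInputParser()
    parser.feed(html)
    parser.close()
    return parser.data


# Made-up form markup, for illustration only.
SAMPLE_HTML = '''
<form action="#" method="post">
  <input name="first_name" type="text" value="" />
  <input name="recaptcha_response_field" type="text" />
  <input name="_formkey" type="hidden" value="abc123" />
  <input type="submit" value="Register" />
</form>
<input name="outside" value="ignored" />
'''

if __name__ == '__main__':
    print(parse_form(SAMPLE_HTML))
```

With a live session, the same function could be applied to the text of the response returned by session.get(REGISTER_URL). Fields that have no value attribute come back as None, which makes it easy to spot the ones, such as a CAPTCHA answer, that still need to be filled in before the form is submitted.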
Publisher Resources

ISBN: 9781835888506