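Before we start on web crawlers, let's briefly review HTTP (Hypertext Transfer Protocol), because the content we see on a web page is usually the result of the browser rendering HTML, and HTTP is the protocol that carries that HTML data. Like many other application-layer protocols, HTTP is built on top of TCP (Transmission Control Protocol) and uses the reliable transport service that TCP provides to exchange data in Web applications. According to Wikipedia, HTTP was originally designed to provide a way to publish and receive HTML pages; in other words, the protocol is the carrier for the data transferred between browsers and Web servers. For details about the protocol and its current state of development, you can read Ruan Yifeng's 《HTTP 协议入门》 (Introduction to the HTTP Protocol), his 《互联网协议入门》 (Introduction to Internet Protocols) series, and 《图解HTTPS协议》 (The HTTPS Protocol Illustrated). The figures below show the HTTP request and response messages (protocol data) for a visit to the Baidu home page, which I captured with the open-source protocol analyzer Ethereal (the predecessor of the packet-capture tool Wireshark) while working at the Sichuan Key Laboratory of Network Communication Technology. Because Ethereal captures the data that passes through the network adapter, you can clearly see the protocol data at every layer, from the physical link layer up to the application layer.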
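HTTP request (request line + request headers + blank line + [message body]):
HTTP response (status line + response headers + blank line + message body):
Note: I hope these two screenshots, which look like yellowed old photographs, give you a rough idea of what kind of protocol HTTP is.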
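The message layout described above (start line, headers, blank line, optional body) can be sketched in a few lines of Python. This is only a minimal illustration, not a full HTTP parser: the function name `parse_http_message` and the two hand-written messages are assumptions made for this example and are not taken from the captured packets.

```python
def parse_http_message(raw: bytes):
    """Split a raw HTTP/1.x message into its start line, headers and body."""
    # The first blank line (CRLF CRLF) separates the head from the message body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Each header line has the form "Name: value"; names are case-insensitive.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


# Hypothetical request and response, written by hand for illustration only.
request = (
    b"GET / HTTP/1.1\r\n"
    b"Host: www.baidu.com\r\n"
    b"Accept: text/html\r\n"
    b"\r\n"
)
response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<html></html>"
)

# Request line: method, request target, protocol version.
req_line, req_headers, req_body = parse_http_message(request)
method, target, version = req_line.split(" ")

# Status line: protocol version, status code, reason phrase.
status_line, resp_headers, resp_body = parse_http_message(response)
resp_version, status_code, reason = status_line.split(" ", 2)

print(method, target, version, req_headers)
print(resp_version, status_code, reason, resp_headers, resp_body)
```

A GET request normally has no message body, which is why the body is shown in square brackets for the request but not for the response.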
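Related tools
1. Chrome Developer Tools: the developer tools built into the Google Chrome browser.
2. POSTMAN: a powerful tool for debugging web pages and sending RESTful requests.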
3.
4. HTTPie: a command-line HTTP client.
- $ http --headers http://www.scu.edu.cn
- HTTP/1.1 200 OK
- Accept-Ranges: bytes
- Cache-Control: private, max-age=600
- Connection: Keep-Alive
- Content-Encoding: gzip
- Content-Language: zh-CN
- Content-Length: 14403
- Content-Type: text/html
- Date: Sun, 27 May 2018 15:38:25 GMT
- ETag: "e6ec-56d3032d70a32-gzip"
- Expires: Sun, 27 May 2018 15:48:25 GMT
- Keep-Alive: timeout=5, max=100
- Last-Modified: Sun, 27 May 2018 13:44:22 GMT
- Server: VWebServer
- Vary: User-Agent,Accept-Encoding
- X-Frame-Options: SAMEORIGIN
5. BuiltWith: a tool for identifying the technologies a website is built with.
- >>> import builtwith
- >>> builtwith.parse('http://www.bootcss.com/')
- {'web-servers': ['Nginx'], 'font-scripts': ['Font Awesome'], 'javascript-frameworks': ['Lo-dash', 'Underscore.js', 'Vue.js', 'Zepto', 'jQuery'], 'web-frameworks': ['Twitter Bootstrap']}
- >>>
- >>> import ssl
- >>> ssl._create_default_https_context = ssl._create_unverified_context
- >>> builtwith.parse('https://www.jianshu.com/')
- {'web-servers': ['Tengine'], 'web-frameworks': ['Twitter Bootstrap', 'Ruby on Rails'], 'programming-languages': ['Ruby']}
6. python-whois: a tool for looking up the owner of a website.
- >>> import whois
- >>> whois.whois('baidu.com')
- {'domain_name': ['BAIDU.COM', 'baidu.com'], 'registrar': 'MarkMonitor, Inc.', 'whois_server': 'whois.markmonitor.com', 'referral_url': None, 'updated_date': [datetime.datetime(2017, 7, 28, 2, 36, 28), datetime.datetime(2017, 7, 27, 19, 36, 28)], 'creation_date': [datetime.datetime(1999, 10, 11, 11, 5, 17), datetime.datetime(1999, 10, 11, 4, 5, 17)], 'expiration_date': [datetime.datetime(2026, 10, 11, 11, 5, 17), datetime.datetime(2026, 10, 11, 0, 0)], 'name_servers': ['DNS.BAIDU.COM', 'NS2.BAIDU.COM', 'NS3.BAIDU.COM', 'NS4.BAIDU.COM', 'NS7.BAIDU.COM', 'dns.baidu.com', 'ns4.baidu.com', 'ns3.baidu.com', 'ns7.baidu.com', 'ns2.baidu.com'], 'status': ['clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited', 'clientTransferProhibited https://icann.org/epp#clientTransferProhibited', 'clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited', 'serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited', 'serverTransferProhibited https://icann.org/epp#serverTransferProhibited', 'serverUpdateProhibited https://icann.org/epp#serverUpdateProhibited', 'clientUpdateProhibited (https://www.icann.org/epp#clientUpdateProhibited)', 'clientTransferProhibited (https://www.icann.org/epp#clientTransferProhibited)', 'clientDeleteProhibited (https://www.icann.org/epp#clientDeleteProhibited)', 'serverUpdateProhibited (https://www.icann.org/epp#serverUpdateProhibited)', 'serverTransferProhibited (https://www.icann.org/epp#serverTransferProhibited)', 'serverDeleteProhibited (https://www.icann.org/epp#serverDeleteProhibited)'], 'emails': ['abusecomplaints@markmonitor.com', 'whoisrelay@markmonitor.com'], 'dnssec': 'unsigned', 'name': None, 'org': 'Beijing Baidu Netcom Science Technology Co., Ltd.', 'address': None, 'city': None, 'state': 'Beijing', 'zipcode': None, 'country': 'CN'}
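The response headers printed by HTTPie above can also be read programmatically. As a small sketch using only the Python standard library, the snippet below copies three header values from that output and checks that `Cache-Control: max-age=600` agrees with the gap between `Date` and `Expires`. The helper names `max_age_seconds` and `expires_delta_seconds` are made up for this example.

```python
import re
from email.utils import parsedate_to_datetime

# Header values copied from the HTTPie output shown earlier.
headers = {
    "Cache-Control": "private, max-age=600",
    "Date": "Sun, 27 May 2018 15:38:25 GMT",
    "Expires": "Sun, 27 May 2018 15:48:25 GMT",
}


def max_age_seconds(cache_control):
    """Return the max-age directive in seconds, or None if it is absent."""
    match = re.search(r"max-age=(\d+)", cache_control)
    return int(match.group(1)) if match else None


def expires_delta_seconds(hdrs):
    """Return the number of seconds between the Date and Expires headers."""
    date = parsedate_to_datetime(hdrs["Date"])
    expires = parsedate_to_datetime(hdrs["Expires"])
    return int((expires - date).total_seconds())


print(max_age_seconds(headers["Cache-Control"]))
print(expires_delta_seconds(headers))
```

Both values come out to ten minutes for this response, so the two caching headers describe the same freshness lifetime.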
(Editor: 河北网)