Lately I've gotten pretty interested in web crawlers. There's so much information online these days that just collecting and organizing it already saves a lot of time and effort, so I started looking into how to write one.
I started with the language I know best, PHP. After writing a few simple crawlers, I ran into pages that require a simulated login before you can reach the inner pages to crawl, and that's where PHP's weaknesses really show: most of the workarounds I found felt unintuitive and took a ton of code. So I switched to the hottest crawler language right now, Python.
First, here's an example of fetching a page in PHP with curl:
```php
<?php
function get_web_page( $url ) {
    $user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';
    $options = array(
        CURLOPT_CUSTOMREQUEST  => "GET",        // set request type: POST or GET
        CURLOPT_POST           => false,        // set to GET
        CURLOPT_USERAGENT      => $user_agent,  // set user agent
        CURLOPT_COOKIEFILE     => "cookie.txt", // set cookie file
        CURLOPT_COOKIEJAR      => "cookie.txt", // set cookie jar
        CURLOPT_RETURNTRANSFER => true,         // return web page
        CURLOPT_HEADER         => false,        // don't return headers
        CURLOPT_FOLLOWLOCATION => true,         // follow redirects
        CURLOPT_ENCODING       => "",           // handle all encodings
        CURLOPT_AUTOREFERER    => true,         // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,          // timeout on connect
        CURLOPT_TIMEOUT        => 120,          // timeout on response
        CURLOPT_MAXREDIRS      => 10,           // stop after 10 redirects
    );
    $ch = curl_init( $url );
    curl_setopt_array( $ch, $options );
    $content = curl_exec( $ch );
    $err     = curl_errno( $ch );
    $errmsg  = curl_error( $ch );
    $header  = curl_getinfo( $ch );
    curl_close( $ch );
    $header['errno']   = $err;
    $header['errmsg']  = $errmsg;
    $header['content'] = $content;
    return $header;
}
?>
```
And the Python version using requests (a `pip install requests` away):
```python
import requests

s = requests.get(url)                                # GET
s = requests.post(url, data={'a': 'xx', 'b': 'yy'})  # POST -- you can just pass the parameters along; super intuitive
s = requests.put(url)                                # PUT
s = requests.delete(url)                             # DELETE

url    = s.url
header = s.headers
cookie = s.cookies
html   = s.text
```
One look and you can tell which language is more intuitive for writing a crawler. On top of that, Python handles strings faster and more simply than PHP, so I jumped straight into Python crawling. Bye-bye, PHP -- you can stick to templating!
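To give a feel for that string handling, here's a minimal sketch (the HTML snippet is hypothetical) of slicing a page title out of fetched HTML with nothing but built-in string methods:

```python
# Hypothetical HTML, standing in for what s.text would return
html = '<html><head><title>Example Page</title></head><body>hi</body></html>'

# find() and slicing are enough for quick-and-dirty extraction
start = html.find('<title>') + len('<title>')
end = html.find('</title>')
title = html[start:end]
print(title)  # Example Page
```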
The biggest difference, though, shows up when you simulate a login, grab the user's cookies, and then crawl the data. Look how much work PHP makes you do:
```php
<?php
$cookie_jar = 'c:/cookie.txt';

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.plurk.com/m/login');
curl_setopt($ch, CURLOPT_POST, 1);
$request = 'username=davidou123&password=0000';
curl_setopt($ch, CURLOPT_POSTFIELDS, $request);
// save the returned cookies into the $cookie_jar file
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_jar);
// return the response instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// don't include the response headers
curl_setopt($ch, CURLOPT_HEADER, false);
// do include the response body
curl_setopt($ch, CURLOPT_NOBODY, false);
curl_exec($ch);
curl_close($ch);

// get data after login
$ch2 = curl_init();
curl_setopt($ch2, CURLOPT_URL, 'the URL to crawl');
curl_setopt($ch2, CURLOPT_HEADER, false);
curl_setopt($ch2, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch2, CURLOPT_COOKIEFILE, $cookie_jar);
$orders = curl_exec($ch2);
echo strip_tags($orders);
curl_close($ch2);
?>
```
Having to save the cookies into a file yourself, and then open a second request that loads that file back in just to fetch the data, is a real hassle.
Python, on the other hand, keeps it simple:
```python
import requests

username = '$user'
password = '$password'
postData = {
    'userid': username,
    'password': password,
    'action': 'login',
}
header = {'User-Agent': 'Mozila xxxx xxx bot aaa'}

with requests.Session() as s:
    # once this POST completes you're logged in, and the session carries the
    # cookies for everything below -- super intuitive. The parameters you pass
    # are urlencoded for you automatically. So nice.
    s.post(loginurl, data=postData, headers=header, timeout=5)
    # verify=False turns off SSL certificate checks, since they're a hassle
    response = s.get(dataurl, timeout=5, verify=False)
    html = response.text  # the page HTML -- do whatever you want with it
```
Simple, convenient, and quick to pick up, right?
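Incidentally, if you ever do want PHP-style on-disk cookie persistence in Python, the standard library's http.cookiejar reads and writes the same Netscape-format cookie.txt that CURLOPT_COOKIEJAR uses. A minimal sketch (the cookie values here are made up):

```python
import http.cookiejar

# a jar bound to a cookie.txt file, like the PHP example's $cookie_jar
jar = http.cookiejar.MozillaCookieJar('cookie.txt')

# pretend a login response set this session cookie
jar.set_cookie(http.cookiejar.Cookie(
    version=0, name='sessionid', value='abc123',
    port=None, port_specified=False,
    domain='example.com', domain_specified=True, domain_initial_dot=False,
    path='/', path_specified=True, secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={}, rfc2109=False))

jar.save(ignore_discard=True)   # write cookie.txt to disk

# a later run can load the same file back, just like CURLOPT_COOKIEFILE
jar2 = http.cookiejar.MozillaCookieJar('cookie.txt')
jar2.load(ignore_discard=True)
names = [c.name for c in jar2]
```

With requests, though, the Session keeps cookies in memory for you, so most crawlers never need this.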
Even better, requests has first-class OAuth support via its companion requests-oauthlib package!
```python
import requests
from requests_oauthlib import OAuth1

url = 'https://api.twitter.com/1.1/account/verify_credentials.json'
auth = OAuth1('YOUR_APP_KEY', 'YOUR_APP_SECRET',
              'USER_OAUTH_TOKEN', 'USER_OAUTH_TOKEN_SECRET')
requests.get(url, auth=auth)
```
Done. How great is that?
One more tip: go all-in on Python 3, which has much better UTF-8 support. For the requests library itself, see the docs: requests documentation
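A quick illustration of why Python 3 helps here: bytes and text are distinct types, so UTF-8 page content decodes explicitly and cleanly (the byte string below is a made-up stand-in for a server response):

```python
# UTF-8 bytes for 網路爬蟲, standing in for a raw response body
raw = b'\xe7\xb6\xb2\xe8\xb7\xaf\xe7\x88\xac\xe8\x9f\xb2'

# roughly what response.text hands you: a proper Unicode string
text = raw.decode('utf-8')
print(text)  # 網路爬蟲
```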