My first-crawler-in-python
Viller Hsiao
Fetching Financial Report Data with Python
• Practice Python
• Practice writing Python well
• Understand web architecture
• Calculate stock value
Steps
• Fetch the web page
• Parse the content
• Compute the data
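The three steps can be sketched end-to-end with the standard library alone. A minimal sketch, with the HTML inlined instead of fetched over the network so the parsing and computing steps are visible (the table content mimics the sample row shown later in the slides; the structure is an assumption, not the real site's markup):

```python
from html.parser import HTMLParser

# Stand-in for the fetched page (step 1 is a network call in practice)
HTML = ('<table><tr><td>104/05</td><td>70,154,763</td></tr>'
        '<tr><td>104/04</td><td>65,000,000</td></tr></table>')

class CellCollector(HTMLParser):
    """Step 2: collect the text of every <td> cell."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        self.in_td = (tag == 'td')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells.append(data.strip())

parser = CellCollector()
parser.feed(HTML)

# Step 3: turn "70,154,763"-style strings into numbers and aggregate
values = [int(c.replace(',', '')) for c in parser.cells
          if c.replace(',', '').isdigit()]
total = sum(values)
```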
Data source: table type and stock id
Inspect element
Developer tools
• Practice the Google Python style guide
A middle-aged Py's fantastic voyage (a pun on "Life of Pi")
http://static.ettoday.net/images/206/206484.jpg
Python Modules
• Parse DOM
  • urllib + SGMLParser
  • requests + BeautifulSoup4
• Excel
  • xlutils
urllib

# Python 2 urllib
url = 'http://jdata.yuanta.com.tw/z/zc/zch/zch_2330.djhtm'
webcode = urllib.urlopen(url)
if webcode.code == 200:
    self.webpage = webcode.read()
webcode.close()
SGMLParser

class AccountTable(SGMLParser):
    def feed(self, data): ...
    def start_tr(self, attr): ...
    def end_tr(self): ...
    def handle_data(self, data): ...
Oops

def start_table(self, attrs):
    if len(attrs) > 0:
        for at in attrs:
            if at[0] == 'id' and at[1] == 'oMainTable':
                self.isTargetTbl = True
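SGMLParser only exists in Python 2; in Python 3 the same attribute check moves to `html.parser.HTMLParser`, whose `handle_starttag` receives the attributes as the same list of (name, value) pairs. A minimal sketch of the equivalent table-matching logic:

```python
from html.parser import HTMLParser

class AccountTable(HTMLParser):
    """Flags when the target table (id="oMainTable") is seen."""
    def __init__(self):
        super().__init__()
        self.is_target_tbl = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, like SGMLParser's
        if tag == 'table':
            for name, value in attrs:
                if name == 'id' and value == 'oMainTable':
                    self.is_target_tbl = True

p = AccountTable()
p.feed('<table id="oMainTable"><tr><td>x</td></tr></table>')
```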
Chinese text transcoding

line.encode('big5').decode('utf8')
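The chained encode/decode above is a Python 2 idiom (a `unicode` value is implicitly decoded before `.encode()`). In Python 3, bytes fetched from a Big5-encoded page need only a single decode; a sketch, simulating the fetched bytes with an encode round-trip:

```python
# Raw bytes from a Big5-encoded page arrive as `bytes` in Python 3;
# the encode here only simulates what the network would return.
raw = '財報資訊'.encode('big5')
text = raw.decode('big5')  # one decode recovers the text
```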
v2.0
• Coding style refinement
  • Google Python style guide
  • Python idioms
• g0v projects
requests

import requests

def parse_url(url):
    r = requests.get(url)
    if r.status_code == requests.codes.ok:
        parse_html(r.text)
BeautifulSoup

from bs4 import BeautifulSoup

def parse_html(html_text):
    soup = BeautifulSoup(html_text)
    # `class` is a reserved word, so bs4 uses the class_ keyword
    rows = soup.find('table', class_='t01')
    rows = rows.find_all('tr')
    data = []
    for row in rows:
        cols = row.find_all('td')
        cols = [e.text.encode('utf-8').strip() for e in cols]
        data.append(cols)
<td class="t3n0">104/05</td><td class="t3n1">70,154,763</td>
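The sample cells above still need normalizing in the compute step. A hedged sketch, assuming the date column uses the ROC (Minguo) calendar common on Taiwanese financial sites (year 104 = 104 + 1911 = 2015) and that values are comma-grouped integers:

```python
def parse_cell_pair(date_str, value_str):
    """Normalize one row like ('104/05', '70,154,763').

    Assumes a Minguo-calendar date: Gregorian year = ROC year + 1911.
    """
    roc_year, month = date_str.split('/')
    year = int(roc_year) + 1911
    value = int(value_str.replace(',', ''))
    return (year, int(month), value)
```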
Future Plan
• concurrent / gevent
• fake browser header
• free proxy
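The first two future-plan items can be sketched with the standard library: `concurrent.futures` for concurrent fetches and a spoofed User-Agent on each request (gevent would replace the thread pool, and a proxy pool would wrap `urlopen`; the header value and URL list are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

HEADERS = {'User-Agent': 'Mozilla/5.0'}  # fake browser header

def fetch(url):
    """Fetch one URL with a browser-like header."""
    req = Request(url, headers=HEADERS)
    with urlopen(req, timeout=10) as resp:
        return resp.read()

def fetch_all(urls, workers=4):
    # Concurrent fetches; gevent greenlets would be an alternative here
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```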