My first-crawler-in-python

Viller Hsiao

Transcript of My first-crawler-in-python

Page 1: My first-crawler-in-python

Viller Hsiao

Page 2: My first-crawler-in-python

Fetching financial report data with Python

• Practice Python

• Practice writing Python well

Page 3: My first-crawler-in-python

Fetching financial report data with Python

• Practice Python

• Practice writing Python well

• Understand web architecture

• Compute stock values

Page 4: My first-crawler-in-python

Steps

• Fetch the web page

• Parse the content

• Compute on the data
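The three steps above can be sketched as one small pipeline. `fetch_page` and `parse_content` here are hypothetical stand-ins for illustration only; the talk's real versions use urllib/requests for fetching and SGMLParser/BeautifulSoup for parsing:

```python
import re

def fetch_page(url):
    # stand-in for a real HTTP fetch; returns a canned table row
    return '<tr><td>104/05</td><td>70,154,763</td></tr>'

def parse_content(html):
    # stand-in for real DOM parsing; just pulls out <td> cell text
    return re.findall(r'<td>([^<]*)</td>', html)

def crawl(url):
    html = fetch_page(url)       # step 1: fetch the web page
    cells = parse_content(html)  # step 2: parse the content
    return cells                 # step 3: compute on the data (omitted here)
```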

Page 5: My first-crawler-in-python

Data source: table type and stock ID
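A hypothetical helper for composing the data-source URL from the table type and stock ID; the `{table}/{table}_{stock_id}` pattern is inferred from the example URL shown on the urllib slide, not stated in the talk:

```python
# Inferred URL pattern: e.g. table 'zch', stock id '2330'
URL_TEMPLATE = 'http://jdata.yuanta.com.tw/z/zc/{table}/{table}_{stock_id}.djhtm'

def build_url(table, stock_id):
    # compose the page URL for one report table of one stock
    return URL_TEMPLATE.format(table=table, stock_id=stock_id)
```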

Page 6: My first-crawler-in-python

Inspect element

Page 7: My first-crawler-in-python

Developer tools

• Practice the Google Python Style Guide

Page 8: My first-crawler-in-python

The middle-aged Py's fantastic voyage (a pun on the Chinese title of Life of Pi)

http://static.ettoday.net/images/206/206484.jpg

Page 9: My first-crawler-in-python

Python Modules

• Parse DOM

• urllib + SGMLParser

• requests + BeautifulSoup4

• Excel

• xlutils

Page 10: My first-crawler-in-python

urllib

import urllib  # Python 2 standard library

url = 'http://jdata.yuanta.com.tw/z/zc/zch/zch_2330.djhtm'

webcode = urllib.urlopen(url)
if webcode.code == 200:
    self.webpage = webcode.read()  # inside a class method on the slide
webcode.close()

Page 11: My first-crawler-in-python

SGMLParser

class AccountTable(SGMLParser):

    def feed(self, data):
        ...

    def start_tr(self, attr):
        ...

    def end_tr(self):
        ...

    def handle_data(self, data):
        ...
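SGMLParser (sgmllib) was removed in Python 3; a minimal sketch of the same row-collecting idea on top of the standard library's html.parser, with hypothetical state and attribute names:

```python
from html.parser import HTMLParser

class AccountTable(HTMLParser):
    """Collects the text inside each <tr>, mirroring the SGMLParser skeleton."""

    def __init__(self):
        super().__init__()
        self.in_row = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.in_row = True
            self.rows.append([])

    def handle_endtag(self, tag):
        if tag == 'tr':
            self.in_row = False

    def handle_data(self, data):
        # collect non-blank text while inside a table row
        if self.in_row and data.strip():
            self.rows[-1].append(data.strip())
```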

Page 12: My first-crawler-in-python

Oops

def start_table(self, attrs):
    if len(attrs) > 0:
        for at in attrs:
            if at[0] == 'id' and at[1] == 'oMainTable':
                self.isTargetTbl = True
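The nested attribute loop above can be flattened with `dict()`, since `attrs` arrives as a list of (name, value) pairs; a small standalone sketch of the same check:

```python
def is_target_table(attrs):
    # attrs is a list of (name, value) pairs, e.g. [('id', 'oMainTable')]
    return dict(attrs).get('id') == 'oMainTable'
```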

Page 13: My first-crawler-in-python

Chinese character transcoding

line.encode('big5').decode('utf8')
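In Python 3 terms, transcoding a Big5 page means decoding the raw response bytes with the big5 codec into str, then re-encoding as needed; a minimal sketch (the Big5 encoding of '中' is 0xA4 0xA4):

```python
big5_bytes = b'\xa4\xa4'           # Big5 bytes for '中'
text = big5_bytes.decode('big5')   # bytes -> str
utf8_bytes = text.encode('utf-8')  # str -> UTF-8 bytes
```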

Page 14: My first-crawler-in-python

v2.0

• Coding style refinement

• Google Python Style Guide

• Python idioms

Page 15: My first-crawler-in-python

g0v projects

Page 16: My first-crawler-in-python

requests

import requests

def parse_url(url):
    r = requests.get(url)
    if r.status_code == requests.codes.ok:
        parse_html(r.text)

Page 17: My first-crawler-in-python

BeautifulSoup

from bs4 import BeautifulSoup

def parse_html(html_text):
    soup = BeautifulSoup(html_text)
    rows = soup.find('table', class_='t01')  # 'class' is a reserved word; bs4 uses class_
    rows = rows.find_all('tr')
    data = []
    for row in rows:
        cols = row.find_all('td')
        cols = [e.text.encode('utf-8').strip() for e in cols]
        data.append(cols)

<td class="t3n0">104/05</td><td class="t3n1">70,154,763</td>
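Cells scraped this way are strings like '70,154,763', so the compute-on-the-data step still needs a numeric conversion; a tiny hypothetical helper:

```python
def to_number(cell):
    # strip thousands separators, e.g. '70,154,763' -> 70154763
    return int(cell.replace(',', ''))
```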

Page 18: My first-crawler-in-python

Future Plan

• concurrent / gevent

• fake browser header

• free proxy
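The concurrency and fake-header ideas can be sketched with the standard library's concurrent.futures; `fetch` here is a placeholder that just echoes its URL, and a real version would instead do an HTTP GET with HEADERS (and a proxies dict for the free-proxy idea):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical browser-like header; a real crawler would pass it as
# requests.get(url, headers=HEADERS, proxies={...})
HEADERS = {'User-Agent': 'Mozilla/5.0 (compatible; my-first-crawler)'}

def fetch(url):
    # placeholder: echoes the URL instead of performing a request
    return url

urls = ['http://example.com/%d' % i for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() runs fetch concurrently but preserves input order
    results = list(pool.map(fetch, urls))
```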