My first-crawler-in-python

Viller Hsiao

Transcript of My first-crawler-in-python

Page 1: My first-crawler-in-python

Viller Hsiao

Page 2: My first-crawler-in-python

Fetching financial report data with Python

• Practice Python

• Practice writing Python well

Page 3: My first-crawler-in-python

Fetching financial report data with Python

• Practice Python

• Practice writing Python well

• Understand web architecture

• Compute stock values

Page 4: My first-crawler-in-python

Steps

• Fetch the web page

• Parse the content

• Compute on the data
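The three steps above can be sketched as one small pipeline. `fetch_page` and `parse_content` here are hypothetical stand-ins for illustration only; the talk's real versions use urllib/requests for fetching and SGMLParser/BeautifulSoup for parsing:

```python
import re

def fetch_page(url):
    # stand-in for a real HTTP fetch; returns a canned table row
    return '<tr><td>104/05</td><td>70,154,763</td></tr>'

def parse_content(html):
    # stand-in for real DOM parsing; just pulls out <td> cell text
    return re.findall(r'<td>([^<]*)</td>', html)

def crawl(url):
    html = fetch_page(url)       # step 1: fetch the web page
    cells = parse_content(html)  # step 2: parse the content
    return cells                 # step 3: compute on the data (omitted here)
```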

Page 5: My first-crawler-in-python

Data source: table type and stock ID
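A hypothetical helper for composing the data-source URL from the table type and stock ID; the `{table}/{table}_{stock_id}` pattern is inferred from the example URL shown on the urllib slide, not stated in the talk:

```python
# Inferred URL pattern: e.g. table 'zch', stock id '2330'
URL_TEMPLATE = 'http://jdata.yuanta.com.tw/z/zc/{table}/{table}_{stock_id}.djhtm'

def build_url(table, stock_id):
    # compose the page URL for one report table of one stock
    return URL_TEMPLATE.format(table=table, stock_id=stock_id)
```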

Page 6: My first-crawler-in-python

Inspect element

Page 7: My first-crawler-in-python

Developer tools

• Practice the Google Python Style Guide

Page 8: My first-crawler-in-python

The middle-aged Py's fantastic voyage (a pun on the Chinese title of Life of Pi)

http://static.ettoday.net/images/206/206484.jpg

Page 9: My first-crawler-in-python

Python Modules

• Parse DOM

• urllib + SGMLParser

• requests + BeautifulSoup4

• Excel

• xlutils

Page 10: My first-crawler-in-python

urllib

import urllib  # Python 2 standard library

url = 'http://jdata.yuanta.com.tw/z/zc/zch/zch_2330.djhtm'

webcode = urllib.urlopen(url)
if webcode.code == 200:
    self.webpage = webcode.read()  # inside a class method on the slide
webcode.close()

Page 11: My first-crawler-in-python

SGMLParser

class AccountTable(SGMLParser):

    def feed(self, data):
        ...

    def start_tr(self, attr):
        ...

    def end_tr(self):
        ...

    def handle_data(self, data):
        ...
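SGMLParser (sgmllib) was removed in Python 3; a minimal sketch of the same row-collecting idea on top of the standard library's html.parser, with hypothetical state and attribute names:

```python
from html.parser import HTMLParser

class AccountTable(HTMLParser):
    """Collects the text inside each <tr>, mirroring the SGMLParser skeleton."""

    def __init__(self):
        super().__init__()
        self.in_row = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.in_row = True
            self.rows.append([])

    def handle_endtag(self, tag):
        if tag == 'tr':
            self.in_row = False

    def handle_data(self, data):
        # collect non-blank text while inside a table row
        if self.in_row and data.strip():
            self.rows[-1].append(data.strip())
```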

Page 12: My first-crawler-in-python

Oops

def start_table(self, attrs):
    if len(attrs) > 0:
        for at in attrs:
            if at[0] == 'id' and at[1] == 'oMainTable':
                self.isTargetTbl = True
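The nested attribute loop above can be flattened with `dict()`, since `attrs` arrives as a list of (name, value) pairs; a small standalone sketch of the same check:

```python
def is_target_table(attrs):
    # attrs is a list of (name, value) pairs, e.g. [('id', 'oMainTable')]
    return dict(attrs).get('id') == 'oMainTable'
```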

Page 13: My first-crawler-in-python

Chinese character transcoding

line.encode('big5').decode('utf8')
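In Python 3 terms, transcoding a Big5 page means decoding the raw response bytes with the big5 codec into str, then re-encoding as needed; a minimal sketch (the Big5 encoding of '中' is 0xA4 0xA4):

```python
big5_bytes = b'\xa4\xa4'           # Big5 bytes for '中'
text = big5_bytes.decode('big5')   # bytes -> str
utf8_bytes = text.encode('utf-8')  # str -> UTF-8 bytes
```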

Page 14: My first-crawler-in-python

v2.0

• Coding style refinement

• Google Python Style Guide

• Python idioms

Page 15: My first-crawler-in-python

g0v projects

Page 16: My first-crawler-in-python

requests

import requests

def parse_url(url):
    r = requests.get(url)
    if r.status_code == requests.codes.ok:
        parse_html(r.text)

Page 17: My first-crawler-in-python

BeautifulSoup

from bs4 import BeautifulSoup

def parse_html(html_text):
    soup = BeautifulSoup(html_text)
    rows = soup.find('table', class_='t01')  # 'class' is a reserved word; bs4 uses class_
    rows = rows.find_all('tr')
    data = []
    for row in rows:
        cols = row.find_all('td')
        cols = [e.text.encode('utf-8').strip() for e in cols]
        data.append(cols)

<td class="t3n0">104/05</td><td class="t3n1">70,154,763</td>
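Cells scraped this way are strings like '70,154,763', so the compute-on-the-data step still needs a numeric conversion; a tiny hypothetical helper:

```python
def to_number(cell):
    # strip thousands separators, e.g. '70,154,763' -> 70154763
    return int(cell.replace(',', ''))
```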

Page 18: My first-crawler-in-python

Future Plan

• concurrent / gevent

• fake browser header

• free proxy
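The concurrency and fake-header ideas can be sketched with the standard library's concurrent.futures; `fetch` here is a placeholder that just echoes its URL, and a real version would instead do an HTTP GET with HEADERS (and a proxies dict for the free-proxy idea):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical browser-like header; a real crawler would pass it as
# requests.get(url, headers=HEADERS, proxies={...})
HEADERS = {'User-Agent': 'Mozilla/5.0 (compatible; my-first-crawler)'}

def fetch(url):
    # placeholder: echoes the URL instead of performing a request
    return url

urls = ['http://example.com/%d' % i for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() runs fetch concurrently but preserves input order
    results = list(pool.map(fetch, urls))
```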