Computers Don't Just Sort Peanuts for You, They Also Pick Your News (電腦不只會幫你選土豆,還會幫你選新聞)
Uploaded by: andy-dai
Category: Data & Analytics
![Page 2: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/2.jpg)
About Me
• Andy ([email protected])
• Staff volunteer for Taipei.py, PyCon TW, and PyCon APAC
• Backend Developer @ Dorm7 Software
• Mostly writes Python and Django
![Page 3: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/3.jpg)
What will we cover today?
![Page 4: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/4.jpg)
We will use Python packages to scrape news, plus some simple Machine Learning, to build a system in which the computer filters news for you
![Page 5: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/5.jpg)
The main point, of course, is Machine Learning
![Page 6: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/6.jpg)
The main point, of course, is Machine Learning
Python
![Page 7: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/7.jpg)
Let's start by asking a question
![Page 8: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/8.jpg)
If I had an hour to solve a problem I'd spend 55 minutes
thinking about the problem and 5 minutes thinking about solutions.
— Albert Einstein
![Page 9: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/9.jpg)
Question: I want to know what kind of news gets more Facebook Likes
![Page 10: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/10.jpg)
Let's get started!
![Page 11: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/11.jpg)
![Page 12: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/12.jpg)
GARBAGE IN GARBAGE OUT
![Page 13: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/13.jpg)
Let's go scrape some news!
![Page 14: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/14.jpg)
Open a browser
![Page 15: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/15.jpg)
![Page 16: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/16.jpg)
![Page 17: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/17.jpg)
![Page 18: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/18.jpg)
![Page 19: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/19.jpg)
Python Time
![Page 20: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/20.jpg)
Open a browser
• requests - simulates a browser sending requests
• selenium - drives a real browser
![Page 21: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/21.jpg)
Requests: HTTP for Humans

    import requests

    def get_content(url):
        # Fetch the page and return the raw response body as bytes.
        response = requests.get(url)
        return response.content

![Page 22: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/22.jpg)
Selenium - Web Browser Automation

    from selenium import webdriver

    # Launch a real Firefox window and load a page in it.
    browser = webdriver.Firefox()
    browser.get('http://www.google.com')

![Page 23: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/23.jpg)
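Beautiful Soup: Navigating, searching your html

    from bs4 import BeautifulSoup

    # html_content is the HTML fetched earlier; the parser is named explicitly.
    soup = BeautifulSoup(html_content, 'html.parser')
    soup.title                            # the <title> element
    soup.find_all('a')                    # every link on the page
    soup.find('div', {'id': 'summary'})   # the first <div id="summary">
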
![Page 24: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/24.jpg)
![Page 25: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/25.jpg)
![Page 26: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/26.jpg)
![Page 27: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/27.jpg)
readability - Pulls out main body

    from readability.readability import Document

    # content is the raw HTML of an article page.
    doc = Document(content)
    print(doc.summary())  # HTML of the extracted main body

![Page 28: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/28.jpg)
Or there are some fancier services
![Page 30: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/30.jpg)
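PyMongo

    from pymongo import MongoClient

    client = MongoClient()               # connects to localhost:27017 by default
    db = client['news_database']
    news = db.news                       # the "news" collection
    news.insert_one(data)                # data: a dict describing one article
    news.find_one({'url': a['href']})    # a: a link tag; look the article up by URL
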
![Page 31: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/31.jpg)
Data scraped from ETtoday
• Title
• Date
• Body text
• Number of images
• URL
• Category
• Number of Facebook likes
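The fields listed above map naturally onto one record per article. A minimal sketch, assuming a hypothetical `NewsArticle` class; the field names and the sample values are illustrative and do not come from the talk:

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class NewsArticle:
    """One scraped article (hypothetical schema, not the speaker's)."""
    title: str
    published: date
    body: str
    image_count: int
    url: str
    category: str
    facebook_likes: int


# Made-up sample record for illustration only.
article = NewsArticle(
    title="Example headline",
    published=date(2024, 1, 1),
    body="Example body text",
    image_count=2,
    url="http://example.com/news/1",
    category="Pets & Animals",
    facebook_likes=1200,
)
record = asdict(article)  # plain dict view of the record
print(record["title"], record["facebook_likes"])
```

Keeping every article in one fixed shape makes the later counting and feature-building steps simpler, whatever storage sits underneath.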
![Page 32: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/32.jpg)
So it's time to start Machine Learning, right?
![Page 33: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/33.jpg)
So it's time to start Machine Learning, right?
![Page 34: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/34.jpg)
Clean up the data
![Page 35: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/35.jpg)
Some numbers
• ETtoday news from 4/20 to 7/15
• 28,615 articles in total
• 6,441 articles with more than 1,000 Facebook Likes (22.5%)
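The 22.5% figure follows directly from the two counts on this slide. A quick check, using only those two numbers:

```python
total_articles = 28615      # articles collected, from the slide
popular_articles = 6441     # articles with more than 1000 Likes, from the slide

share = popular_articles / total_articles
print(f"{share:.1%}")       # prints 22.5%
```

It also means that always guessing "will not pass 1000 Likes" is already right about 77.5% of the time, which is the baseline any classifier built on this data has to beat.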
![Page 36: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/36.jpg)
First, let's test your intuition
![Page 37: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/37.jpg)
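Politics, Finance, International, Mainland China, Society, Local, Novelty,
Lifestyle, Pets & Animals, Entertainment, Sports, Consumer, 3C, Health, Relationships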
![Page 38: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/38.jpg)
Least popular categories
![Page 39: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/39.jpg)
Least popular categories
• Finance (3.11%)
![Page 40: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/40.jpg)
Least popular categories
• Finance (3.11%)
• Consumer (2.57%)
![Page 41: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/41.jpg)
Least popular categories
• Finance (3.11%)
• Consumer (2.57%)
• Health (1.06%)
![Page 42: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/42.jpg)
Taiwanese people have no money, can't afford to spend, and don't care about their health
![Page 43: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/43.jpg)
Most popular categories
![Page 44: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/44.jpg)
Most popular categories
• Lifestyle (38.08%)
![Page 45: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/45.jpg)
Most popular categories
• Lifestyle (38.08%)
• Novelty (47.74%)
![Page 46: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/46.jpg)
Most popular categories
• Lifestyle (38.08%)
• Novelty (47.74%)
• Pets & Animals (89.24%)
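The slides do not say how these percentages were computed. One plausible reading is the share of articles in each category that passed 1000 Likes; the sketch below computes that from a list of (category, likes) pairs. The sample rows and the `popular_share_by_category` helper are made up for illustration:

```python
from collections import defaultdict

LIKE_THRESHOLD = 1000  # same cut-off the talk uses for a popular article


def popular_share_by_category(rows, threshold=LIKE_THRESHOLD):
    """Return {category: fraction of its articles with likes > threshold}."""
    totals = defaultdict(int)
    popular = defaultdict(int)
    for category, likes in rows:
        totals[category] += 1
        if likes > threshold:
            popular[category] += 1
    return {c: popular[c] / totals[c] for c in totals}


# Made-up sample rows, not the ETtoday data.
rows = [
    ("Pets & Animals", 5200),
    ("Pets & Animals", 1800),
    ("Finance", 40),
    ("Finance", 1500),
    ("Finance", 12),
    ("Health", 3),
]
shares = popular_share_by_category(rows)
for category, share in sorted(shares.items(), key=lambda kv: kv[1]):
    print(f"{category}: {share:.2%}")
```

A per-category table like this needs no model at all, which is the point the talk returns to in its conclusion.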
![Page 47: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/47.jpg)
![Page 48: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/48.jpg)
So it's time to start Machine Learning, right?
![Page 49: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/49.jpg)
Machine Learning step one: turn your data into something the computer can understand
![Page 50: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/50.jpg)
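屏東科技大學裴家騏老師團隊在苗栗的研究顯示,近幾年來,無論地方政府或私人的開發,都使得石虎的棲地不斷地減少和破碎化。
(Research in Miaoli by Professor 裴家騏's team at National Pingtung University of Science and Technology shows that in recent years development, whether by local government or private parties, has kept shrinking and fragmenting the leopard cat's habitat.)
The computer can't understand this...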
![Page 51: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/51.jpg)
[1, 0, 1, 0, 1…, 0, 1]
The computer can understand this...
![Page 52: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/52.jpg)
What's missing in between?
![Page 53: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/53.jpg)
屏東/ 科技/ 大學/ 裴家騏/ 老師/ 團隊/ 在/ 苗栗/ 的/ 研究/ 顯示/ ,/ 近幾年/ 來/ ,/ 無論/ 地方/ 政府/ 或/ 私人/ 的/ 開發/ ,/ 都/ 使得/ 石虎/ 的/ 棲地/ 不斷/ 地/ 減少/ 和/ 破碎/ 化/ 。/
Word segmentation
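One common way to get from segmented words to a 0/1 vector like the one shown earlier is a binary bag-of-words: fix a vocabulary, then mark which of its words occur in the text. The talk does not say which encoding was used, so treat this as a sketch; the vocabulary here is a hypothetical hand-picked one:

```python
def binary_bag_of_words(tokens, vocabulary):
    """Return a list with 1 where the vocabulary word occurs in tokens, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


# First tokens of the segmented example sentence.
tokens = ["屏東", "科技", "大學", "裴家騏", "老師", "團隊",
          "在", "苗栗", "的", "研究", "顯示"]

# Hypothetical vocabulary; in practice it would be built from all articles.
vocabulary = ["石虎", "苗栗", "股票", "研究", "颱風"]

vector = binary_bag_of_words(tokens, vocabulary)
print(vector)  # [0, 1, 0, 1, 0]
```

Every article is encoded against the same vocabulary, so all vectors have the same length and position i always refers to the same word.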
![Page 54: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/54.jpg)

    import jieba

    # jieba.cut returns a generator that yields the words one by one.
    segs = jieba.cut(
        u"屏東科技大學裴家騏老師團隊在苗栗的研究顯示"
    )
    print('/'.join(segs))

jieba - word segmentation
屏東/ 科技/ 大學/ 裴家騏/ 老師/ 團隊/ 在/ 苗栗/ 的/ 研究/ 顯示/
![Page 55: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/55.jpg)

    import jieba.analyse

    content = """..."""
    # Return the 10 highest-weighted keywords found in the text.
    tags = jieba.analyse.extract_tags(
        content, topK=10
    )

jieba - keyword extraction
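jieba documents `extract_tags` as a TF-IDF based extractor: words that are frequent in this text but rare elsewhere score highest. The sketch below shows that idea on already-segmented toy documents. It is not jieba's implementation (jieba ships its own IDF table); `top_keywords` and the sample documents are hypothetical:

```python
import math
from collections import Counter


def top_keywords(doc_tokens, corpus, top_k=3):
    """Rank the words of doc_tokens by TF-IDF against corpus (a list of token lists)."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        # Number of documents in the corpus that contain the word.
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        scores[word] = (count / len(doc_tokens)) * idf
    # Highest score first; ties broken by the word itself for a stable order.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


# Made-up, already segmented documents.
corpus = [
    ["石虎", "的", "棲地", "減少", "石虎"],
    ["股市", "的", "指數", "上漲"],
    ["颱風", "的", "路徑", "預測"],
]
print(top_keywords(corpus[0], corpus, top_k=1))  # ['石虎']
```

The particle 的 appears in every document, so its IDF is the lowest possible and it drops to the bottom of the ranking even though it is a frequent word.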
![Page 56: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/56.jpg)

    import jieba

    # Load extra words so jieba keeps each of them together as one token.
    jieba.load_userdict(
        "userdict.txt"
    )

jieba - adding a custom dictionary
![Page 57: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/57.jpg)
"Jieba" Chinese text segmentation: built to be the best Python Chinese word segmentation module
![Page 58: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/58.jpg)
So it's time to start Machine Learning, right?
![Page 59: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/59.jpg)
![Page 60: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/60.jpg)
What can scikit-learn do?
![Page 61: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/61.jpg)
Classification
Will this article get more than 1,000 Likes?
![Page 62: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/62.jpg)
Regression
How many Likes will this article get?
![Page 63: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/63.jpg)
Clustering
Given a pile of news articles, can you find which ones are similar to each other?
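Clustering starts from some measure of how alike two articles are. As an illustration only, the sketch below scores similarity as the Jaccard overlap between the word sets of two segmented articles; the talk does not say which measure or algorithm it had in mind, and the sample token lists are made up:

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Shared distinct words divided by all distinct words (0.0 to 1.0)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Made-up segmented articles.
pets_1 = ["石虎", "棲地", "苗栗", "研究"]
pets_2 = ["石虎", "棲地", "保育"]
finance = ["股市", "指數", "上漲"]

print(jaccard_similarity(pets_1, pets_2))   # 2 shared / 5 distinct = 0.4
print(jaccard_similarity(pets_1, finance))  # nothing shared = 0.0
```

A clustering algorithm would then group articles whose pairwise scores are high; this sketch stops at the similarity step.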
![Page 64: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/64.jpg)
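Machine Learning is Simple

    from sklearn import svm

    # get_training_set() stands for your own code that returns
    # the feature vectors X and their labels y.
    X, y = get_training_set()
    clf = svm.SVC()
    clf.fit(X, y)           # train the classifier
    clf.predict(unknown)    # unknown: feature vectors to classify
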
![Page 65: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/65.jpg)
![Page 66: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/66.jpg)
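Final result
• Skipping the details: it all comes down to calling APIs and tuning parameters
• Classification with an SVM
• Features used: title, body text, category, number of images, day of the week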
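The list above names the raw inputs but not how they were encoded. One possible encoding, sketched under that caveat, flattens an article into a dict of named numeric features that can later be turned into a vector. `article_features`, the feature names, and the sample article are all hypothetical:

```python
from datetime import date


def article_features(title_tokens, body_tokens, category, image_count, published):
    """Flatten one article into {feature name: numeric value}."""
    features = {}
    for word in set(title_tokens):
        features["title=" + word] = 1        # word appears in the title
    for word in set(body_tokens):
        features["body=" + word] = 1         # word appears in the body
    features["category=" + category] = 1     # one-hot category
    features["image_count"] = image_count
    # weekday(): Monday is 0 ... Sunday is 6; stored one-hot as well.
    features["weekday=%d" % published.weekday()] = 1
    return features


# Made-up sample article for illustration.
features = article_features(
    title_tokens=["石虎", "棲地"],
    body_tokens=["石虎", "苗栗", "研究"],
    category="寵物動物",
    image_count=3,
    published=date(2024, 1, 1),  # a Monday
)
print(sorted(features))
```

A list of dicts in this shape can be passed to scikit-learn's `DictVectorizer`, which produces the matrix `X` that `fit` expects.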
![Page 67: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/67.jpg)
Packages not covered today that may be useful
• Scrapy - web scraping
• NLTK - natural language processing
• Pandas - Python Data Analysis Library
• Orange - Open source data visualization and analysis
• Matplotlib - plotting
![Page 68: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/68.jpg)
Conclusion
![Page 69: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/69.jpg)
Conclusion
• There is no magic method that will handle data science for you! Don't be afraid to get your hands dirty
![Page 70: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/70.jpg)
Conclusion
• There is no magic method that will handle data science for you! Don't be afraid to get your hands dirty
• Want to play with data? Use Python!
![Page 71: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/71.jpg)
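Conclusion
• There is no magic method that will handle data science for you! Don't be afraid to get your hands dirty
• Want to play with data? Use Python!
• You don't necessarily need Machine Learning; well-organized data is valuable in itself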
![Page 72: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/72.jpg)
Announcements
![Page 73: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/73.jpg)
Announcements
• Taipei.py
• http://www.meetup.com/Taipei-py/
![Page 74: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/74.jpg)
Announcements
• Taipei.py
• http://www.meetup.com/Taipei-py/
• Django Girls Taiwan (in preparation)
• http://bit.ly/djangogirls
![Page 75: 電腦不只會幫你選土豆,還會幫你選新聞](https://reader033.fdocument.pub/reader033/viewer/2022060109/5554ae14b4c90502618b53f3/html5/thumbnails/75.jpg)
Q&A