PyCon APAC 2016 Regular Expression[A-Z]+

Regular Expression[A-Z]+ PYCON APAC 2016 Minji Yang (양민지) MATA COMPANY [email protected]

Regular Expression[A-Z]+

PYCON APAC 2016 Minji Yang (양민지)

MATA COMPANY

[email protected]

About the speaker

Minji Yang / Swordsman developer

Current: MATA COMPANY Software Engineer

DEVSISTERS, The Beatpacking Company

NEXON Python teaching assistant, Django Girls coach

Before we begin

This talk uses Python 3.

This talk alone will not give you a complete understanding of regular expressions.

Contents

Why Regex?

Three simple examples

The re module

Exercises and performance tips

Other useful things

Why regex?

An expression used to describe a set of strings that follow a particular rule.

Convenient for searching and replacing within strings.

100312467 “Why So Lonely” “wondergirls” 3014725 20160306 2016-03-20T12:00:35+09:00

-> “Why So Lonely” “wondergirls” 2016-03-20T12:00
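
A minimal sketch of how one regular expression could perform the transformation above. The field layout (an ID, two quoted fields, two numeric fields, an ISO 8601 timestamp) is an assumption inferred from the single sample line, straight quotes stand in for the typographic ones, and the names `LINE_RE` and `summarize` are hypothetical.

```python
import re

# Assumed layout, inferred from the one sample line:
# id "title" "artist" number date timestamp
LINE_RE = re.compile(
    r'^\d+ ("[^"]*") ("[^"]*") \d+ \d{8} (\d{4}-\d{2}-\d{2}T\d{2}:\d{2})'
)

def summarize(line):
    """Return title, artist, and minute-precision timestamp, or None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return ' '.join(m.groups())

line = ('100312467 "Why So Lonely" "wondergirls" '
        '3014725 20160306 2016-03-20T12:00:35+09:00')
print(summarize(line))  # "Why So Lonely" "wondergirls" 2016-03-20T12:00
```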

/^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/

WHAAAAT?
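
The pattern above appears to validate a dotted IPv4 address. One way to make it readable is to name the repeated part. The sketch below assumes Python, where the surrounding `/` delimiters are dropped; the names `OCTET` and `IPV4_RE` are hypothetical.

```python
import re

# One octet: 250-255, or 200-249, or 0-199 (one to three digits).
OCTET = r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'

# Three "octet plus dot" groups followed by a final octet,
# anchored at both ends of the string.
IPV4_RE = re.compile(r'^(?:' + OCTET + r'\.){3}' + OCTET + r'$')

for candidate in ['192.168.0.1', '255.255.255.255', '256.1.1.1', '1.2.3']:
    print(candidate, bool(IPV4_RE.match(candidate)))
```

Read this way, the long expression is only "octet, dot" three times and then one more octet.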

How to learn regex?

At first regex looks complicated and unreadable, so it feels hard.

But regular expressions are not as hard as they look.

Three simple examples

Example 1

Matching a mobile phone number

010-3333-7777

\d{3}-\d{4}-\d{4}

Example 2

Extracting the host name from a website address

http://www.google.com/?q=pycon

http:\/\/([^/]*)\/\?q=pycon

Example 3

Pulling the ID out of an email address

[email protected]

^([^@]+)@.+$

The re module

re module

In Python, regular expressions are handled with the re module.

import re
re.search(pattern, string)

re module

>>> re.search('abcd', 'abcdef')
<_sre.SRE_Match object; span=(0, 4), match='abcd'>

>>> print(re.search('zxc', 'abcdef'))
None
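
Because `re.search` returns `None` when nothing matches, the result is usually checked before it is used. A minimal sketch; the helper name `find_text` is hypothetical.

```python
import re

def find_text(pattern, text):
    """Return the matched text, or None when the pattern is not found."""
    m = re.search(pattern, text)
    if m is None:
        # Calling m.group() here would raise AttributeError.
        return None
    return m.group()

print(find_text('abcd', 'abcdef'))  # abcd
print(find_text('zxc', 'abcdef'))   # None
```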

The three examples, revisited

re.sub()

import re

phone = '010-1234-5678'
re.sub(r'(\d{3}-\d{4}-)(\d{4})', r'\1****', phone)

>>> '010-1234-****'

re.match()

import re

link = 'http://www.google.com/?q=pycon'
match = re.match(r'(http:\/\/)([^/]*)(.*)', link)
match.group(2)

>>> 'www.google.com'

re.search()

import re

email = '[email protected]'
match = re.search('^[^@]*', email)

match.group()

>>> 'minji'

match vs search

import re

sample = '2016pycon'
re.match('[a-z]+', sample)
>>> None

re.search('[a-z]+', sample)
>>> <_sre.SRE_Match object; span=(4, 9), match='pycon'>

re module

re.search(pattern, string, flags=0)

= finds the first place in string where pattern matches

re.match(pattern, string, flags=0)

= checks whether pattern matches at the beginning of string

re.findall(pattern, string, flags=0)

= finds every match of pattern in the whole string and returns them as a list
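
The three functions above can be compared side by side on one input. A short sketch; the sample string is made up for illustration.

```python
import re

text = '2016pycon2017apac'

at_start = re.match(r'[a-z]+', text)        # None: the string starts with digits
first = re.search(r'[a-z]+', text)          # first match anywhere in the string
every = re.findall(r'[a-z]+', text)         # all matches, as a list of strings
pairs = re.findall(r'(\d+)([a-z]+)', text)  # with groups: a list of tuples

print(at_start)       # None
print(first.group())  # pycon
print(every)          # ['pycon', 'apac']
print(pairs)          # [('2016', 'pycon'), ('2017', 'apac')]
```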

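Character classes

. matches any character except a newline

\d matches any digit [0-9]

\D matches any non-digit character [^0-9]

\w matches any letter, digit, or underscore [a-zA-Z0-9_] (in Python 3 str patterns it also matches Unicode word characters unless re.ASCII is set)

\W matches any character that is not a letter, digit, or underscore [^a-zA-Z0-9_]

\s matches any whitespace character

\S matches any non-whitespace character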
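
The classes above can be tried out with `re.findall`. The sketch below uses made-up sample strings and also shows the effect of `re.ASCII` on `\w` in Python 3.

```python
import re

sample = 'PyCon_APAC 2016\t#42'

digits = re.findall(r'\d', sample)        # single digits
non_digits = re.findall(r'\D+', sample)   # runs of non-digits
words = re.findall(r'\w+', sample)        # letters, digits, underscore
non_word = re.findall(r'\W', sample)      # everything else
whitespace = re.findall(r'\s', sample)    # space and tab
no_newline = re.findall(r'.', 'a\nb')     # '.' skips the newline

# In Python 3, \w on str patterns is Unicode-aware unless re.ASCII is given.
unicode_words = re.findall(r'\w+', '양민지 2016')
ascii_words = re.findall(r'\w+', '양민지 2016', re.ASCII)

print(digits)         # ['2', '0', '1', '6', '4', '2']
print(words)          # ['PyCon_APAC', '2016', '42']
print(non_word)       # [' ', '\t', '#']
print(unicode_words)  # ['양민지', '2016']
print(ascii_words)    # ['2016']
```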

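Anchors and Repetition

^abc$ ^ matches the start of the string, $ matches the end of the string

* zero or more repetitions

+ one or more repetitions

? zero or one occurrence

{x} exactly x repetitions (e.g. {3})

{x,y} from x to y repetitions

[abc] any one character from the set

[^abc] any one character other than a, b, or c

[a-d] any one character in the range a to d (a, b, c, or d)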
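
Each item in the list above can be checked with a one-line experiment. The inputs below are made up for illustration.

```python
import re

anchored_ok = bool(re.search(r'^abc$', 'abc'))    # whole string is 'abc'
anchored_bad = bool(re.search(r'^abc$', 'xabc'))  # extra leading character

star = re.findall(r'ab*', 'a ab abbb')       # 'a' followed by zero or more 'b'
plus = re.findall(r'ab+', 'a ab abbb')       # 'a' followed by one or more 'b'
optional = re.findall(r'ab?', 'a ab abbb')   # 'a' followed by at most one 'b'

exact = re.findall(r'\d{3}', '12 345 6789')     # exactly three digits
ranged = re.findall(r'\d{2,3}', '12 345 6789')  # two or three digits

in_range = re.findall(r'[a-d]+', 'abcdefg')  # runs of a, b, c, or d
negated = re.findall(r'[^abc]', 'abcde')     # anything except a, b, c

print(star)      # ['a', 'ab', 'abbb']
print(plus)      # ['ab', 'abbb']
print(optional)  # ['a', 'ab', 'ab']
print(exact)     # ['345', '678']
print(ranged)    # ['12', '345', '678']
print(in_range)  # ['abcd']
print(negated)   # ['d', 'e']
```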

Let's try an exercise

<html op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?8h9C3zM9d2ErvunVTkjK">

<link rel="shortcut icon" href="favicon.ico">

<link rel="alternate" type="application/rss+xml" title="RSS" href="rss">

<title>Hacker News</title></head><body><center><table id="hnmain" border="0" cellpadding="0" cellspacing="0" width="85%" bgcolor="#f6f6ef">

<tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" width="100%" style="padding:2px"><tr><td style="width:18px;padding-right:4px"><a href="http://www.ycombinator.com"><img src="y18.gif" width="18" height="18" style="border:1px white solid;"></a></td>

<td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>

link: https://bugzilla.mozilla.org/show_bug.cgi?id=1173199#c31 title: “Our primary goal is to un-fork the Tor Browser”

link: http://siliconangle.com/blog/2016/08/05/watson-correctly-diagnoses-woman-after-doctors-were-stumped/ title: IBM Watson correctly diagnoses a form of leukemia

link: http://gping.io title: Show HN: Gping.io – Like TinyURL for your car

link: http://bit-player.org/2016/the-39th-root-of-92 title: The 39th Root of 92

link: http://www.sciencealert.com/we-just-got-even-weirder-results-about-the-alien-megastructure-star title: Tabby's star is dimming at an incredible rate

The output we want

Coding it without regex
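
The code from these slides is not part of the transcript, so the following is only a sketch of the comparison the exercise seems to aim at. The story-row markup in `html` is hypothetical and simplified (the real page may differ), and both function names are made up.

```python
import re

# Hypothetical, simplified story rows; not the real page markup.
html = (
    '<td class="title"><a href="http://bit-player.org/2016/the-39th-root-of-92">'
    'The 39th Root of 92</a></td>\n'
    '<td class="title"><a href="http://siliconangle.com/blog/2016/08/05/'
    'watson-correctly-diagnoses-woman-after-doctors-were-stumped/">'
    'IBM Watson correctly diagnoses a form of leukemia</a></td>\n'
)

def extract_without_regex(page):
    """Walk the string with str.find and slicing."""
    results = []
    pos = 0
    marker = '<a href="'
    while True:
        start = page.find(marker, pos)
        if start == -1:
            break
        link_start = start + len(marker)
        link_end = page.find('"', link_start)
        title_start = page.find('>', link_end) + 1
        title_end = page.find('</a>', title_start)
        results.append((page[link_start:link_end], page[title_start:title_end]))
        pos = title_end
    return results

def extract_with_regex(page):
    """The same extraction as one pattern with two groups."""
    return re.findall(r'<a href="([^"]*)">([^<]*)</a>', page)

for link, title in extract_with_regex(html):
    print('link:', link, 'title:', title)
```

Under these assumptions both functions return the same list of (link, title) pairs; the regex version simply says it in one line.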

re.DOTALL ??

data = '<title>\nPYCON APAC 2016\n\nRegular Expressions\n\n</title>\n'

re.search('<title>(.*)</title>', data).group(1)
AttributeError: 'NoneType' object has no attribute 'group'

re.search('<title>(.*)</title>', data, re.DOTALL).group(1)
'\nPYCON APAC 2016\n\nRegular Expressions\n\n'

re.compile
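
The transcript keeps only the title of this slide. As a hedged illustration, `re.compile` builds a pattern object once so that it can be reused and given a name; the pattern, the name `TITLE_RE`, and the sample pages below are assumptions made for the sketch.

```python
import re

# Compile once, reuse for every page.
TITLE_RE = re.compile(r'<title>(.*)</title>', re.DOTALL)

pages = [
    '<title>\nPYCON APAC 2016\n</title>',
    '<title>Hacker News</title>',
    '<p>no title here</p>',
]

titles = []
for page in pages:
    m = TITLE_RE.search(page)
    if m:  # skip pages without a match
        titles.append(m.group(1).strip())

print(titles)  # ['PYCON APAC 2016', 'Hacker News']
```

The re module also caches recently used patterns internally, so the benefit is often as much about readability as about speed.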

Other useful things

Vim: Find and Replace

:%s/old/new/g

http://vimregex.com/

1033303 -> 1233303, 1033213 -> 1233213

:%s/103\(\d\{4}\)/123\1/g
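
For comparison, the same substitution can be sketched with Python's `re.sub`. Python needs no backslashes before the group parentheses and braces; the variable names are illustrative.

```python
import re

numbers = ['1033303', '1033213']

# 103 followed by four digits becomes 123 followed by the same four digits.
replaced = [re.sub(r'103(\d{4})', r'123\1', n) for n in numbers]

print(replaced)  # ['1233303', '1233213']
```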

str.find vs re.match vs in

http://stackoverflow.com/questions/4901523/whats-a-faster-operation-re-match-search-or-str-find

str.find: 0.441393852234

re.match: 2.12302494049

in: 0.251421928406
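
A comparison of this kind can be reproduced with `timeit`. The sketch below is not the benchmark from the linked question: the haystack, the needle, and the repeat count are assumptions, and `re.search` is used instead of `re.match` because `re.match` only looks at the start of the string.

```python
import re
import timeit

text = 'abcdefghijklmnopqrstuvwxyz' * 100 + 'pycon'
needle = 'pycon'

def with_find():
    return text.find(needle) != -1

def with_in():
    return needle in text

def with_regex():
    return re.search(needle, text) is not None

for fn in (with_find, with_in, with_regex):
    seconds = timeit.timeit(fn, number=10000)
    print('%-12s %.4f s' % (fn.__name__, seconds))
```

Absolute numbers depend on the machine and the Python version, so only the relative order is worth reading.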

WHAAAAT?

Performance

Regular expressions do not perform well.

But they are convenient to code with.

For performance-critical code, regex may not be the answer.

print("Thank You")