A Python Book: Beginning Python, Advanced Python, and Python ...
8/20/2019 1+Python+Class+Powerpoint+Outline
As an example, here is an implementation of the classic quicksort algorithm in
Python:
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

print quicksort([3,6,8,10,1,2,1])
# Prints "[1, 1, 2, 3, 6, 8, 10]"
Numbers: Integers and floats work as you would expect from other languages:
x = 3
print type(x) # Prints "<type 'int'>"
print x # Prints "3"
print x + 1 # Addition; prints "4"
print x - 1 # Subtraction; prints "2"
print x * 2 # Multiplication; prints "6"
print x ** 2 # Exponentiation; prints "9"
x += 1
print x # Prints "4"
x *= 2
print x # Prints "8"
y = 2.5
print type(y) # Prints "<type 'float'>"
print y, y + 1, y * 2, y ** 2 # Prints "2.5 3.5 5.0 6.25"
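A note on versions: these slides use Python 2, where print is a statement. In Python 3, print is a function, so the same arithmetic would be written like this (a minimal sketch of the difference):

```python
x = 3
print(type(x))  # In Python 3 this shows <class 'int'>
print(x ** 2)   # 9
x += 1
print(x)        # 4
y = 2.5
print(y, y + 1, y * 2, y ** 2)  # 2.5 3.5 5.0 6.25
```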
Booleans:
t = True
f = False
print type(t) # Prints "<type 'bool'>"
print t and f # Logical AND; prints "False"
print t or f # Logical OR; prints "True"
print not t # Logical NOT; prints "False"
print t != f # Logical XOR; prints "True"
Strings:
hello = 'hello' # String literals can use single quotes
world = "world" # or double quotes; it does not matter.
print hello # Prints "hello"
print len(hello) # String length; prints "5"
hw = hello + ' ' + world # String concatenation
print hw # prints "hello world"
hw12 = '%s %s %d' % (hello, world, 12) # sprintf style string formatting
print hw12 # prints "hello world 12"
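As an aside, the same string can also be built with the str.format method (available since Python 2.6), an alternative the slides do not cover:

```python
hello = 'hello'
world = 'world'
# str.format fills {} placeholders in order
hw12 = '{} {} {}'.format(hello, world, 12)
print(hw12)  # hello world 12
```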
String objects have a bunch of useful methods; for example:
s = "hello"
print s.capitalize() # Capitalize a string; prints "Hello"
print s.upper() # Convert a string to uppercase; prints "HELLO"
print s.rjust(7) # Right-justify a string, padding with spaces; prints " hello"
print s.center(7) # Center a string, padding with spaces; prints " hello "
print s.replace('l', '(ell)') # Replace all instances of one substring with another;
                              # prints "he(ell)(ell)o"
print ' world '.strip() # Strip leading and trailing whitespace; prints "world"
xs = [3, 1, 2] # Create a list
print xs, xs[2] # Prints "[3, 1, 2] 2"
print xs[-1] # Negative indices count from the end of the list; prints "2"
xs[2] = 'foo' # Lists can contain elements of different types
print xs # Prints "[3, 1, 'foo']"
xs.append('bar') # Add a new element to the end of the list
print xs # Prints "[3, 1, 'foo', 'bar']"
x = xs.pop() # Remove and return the last element of the list
print x, xs # Prints "bar [3, 1, 'foo']"
nums = range(5) # range is a built-in function that creates a list of integers
print nums # Prints "[0, 1, 2, 3, 4]"
print nums[2:4] # Get a slice from index 2 to 4 (exclusive); prints "[2, 3]"
print nums[2:] # Get a slice from index 2 to the end; prints "[2, 3, 4]"
print nums[:2] # Get a slice from the start to index 2 (exclusive); prints "[0, 1]"
print nums[:] # Get a slice of the whole list; prints "[0, 1, 2, 3, 4]"
print nums[:-1] # Slice indices can be negative; prints "[0, 1, 2, 3]"
nums[2:4] = [8, 9] # Assign a new sublist to a slice
print nums # Prints "[0, 1, 8, 9, 4]"
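Slices can also take a step as a third value; a negative step walks the list backwards (a small sketch, shown in Python 3 syntax):

```python
nums = [0, 1, 2, 3, 4]
print(nums[::2])   # Every second element: [0, 2, 4]
print(nums[::-1])  # A reversed copy of the list: [4, 3, 2, 1, 0]
```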
We will see slicing again in the context of numpy arrays.
animals = ['cat', 'dog', 'monkey']
for animal in animals:
    print animal
# Prints "cat", "dog", "monkey", each on its own line.
If you want access to the index of each element within the body of a loop, use the built-in
enumerate function:
animals = ['cat', 'dog', 'monkey']
for idx, animal in enumerate(animals):
    print '#%d: %s' % (idx + 1, animal)
# Prints "#1: cat", "#2: dog", "#3: monkey", each on its own line
As a simple example, consider the following code that computes square numbers:
nums = [0, 1, 2, 3, 4]
squares = []
for x in nums:
    squares.append(x ** 2)
print squares # Prints [0, 1, 4, 9, 16]
You can make this code simpler using a list comprehension:
nums = [0, 1, 2, 3, 4]
squares = [x ** 2 for x in nums]
print squares # Prints [0, 1, 4, 9, 16]
List comprehensions can also contain conditions:
nums = [0, 1, 2, 3, 4]
even_squares = [x ** 2 for x in nums if x % 2 == 0]
print even_squares # Prints "[0, 4, 16]"
You can use it like this:
d = {'cat': 'cute', 'dog': 'furry'} # Create a new dictionary with some data
print d['cat'] # Get an entry from a dictionary; prints "cute"
print 'cat' in d # Check if a dictionary has a given key; prints "True"
d['fish'] = 'wet' # Set an entry in a dictionary
print d['fish'] # Prints "wet"
# print d['monkey'] # KeyError: 'monkey' not a key of d
print d.get('monkey', 'N/A') # Get an element with a default; prints "N/A"
print d.get('fish', 'N/A') # Get an element with a default; prints "wet"
del d['fish'] # Remove an element from a dictionary
print d.get('fish', 'N/A') # "fish" is no longer a key; prints "N/A"
Loops: It is easy to iterate over the keys in a dictionary:
d = {'person': 2, 'cat': 4, 'spider': 8}
for animal in d:
    legs = d[animal]
    print 'A %s has %d legs' % (animal, legs)
# Prints "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs"
If you want access to keys and their corresponding values, use the iteritems method:
d = {'person': 2, 'cat': 4, 'spider': 8}
for animal, legs in d.iteritems():
    print 'A %s has %d legs' % (animal, legs)
# Prints "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs"
Dictionary comprehensions: These are similar to list comprehensions, but allow you to easily
construct dictionaries. For example:
nums = [0, 1, 2, 3, 4]
even_num_to_square = {x: x ** 2 for x in nums if x % 2 == 0}
print even_num_to_square # Prints "{0: 0, 2: 4, 4: 16}"
As a simple example, consider the following:
animals = {'cat', 'dog'}
print 'cat' in animals # Check if an element is in a set; prints "True"
print 'fish' in animals # prints "False"
animals.add('fish') # Add an element to a set
print 'fish' in animals # Prints "True"
print len(animals) # Number of elements in a set; prints "3"
animals.add('cat') # Adding an element that is already in the set does nothing
print len(animals) # Prints "3"
animals.remove('cat') # Remove an element from a set
print len(animals) # Prints "2"
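Sets also support the usual mathematical operations such as union, intersection, and difference (a sketch, in Python 3 syntax):

```python
a = {'cat', 'dog', 'fish'}
b = {'dog', 'bird'}
print(a | b)  # Union: all elements in either set
print(a & b)  # Intersection: elements in both sets; {'dog'}
print(a - b)  # Difference: elements of a not in b
```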
As usual, everything you want to know about sets can be found in the documentation.
Loops: Iterating over a set has the same syntax as iterating over a list; however since sets are
unordered, you cannot make assumptions about the order in which you visit the elements of the
set:
animals = {'cat', 'dog', 'fish'}
for idx, animal in enumerate(animals):
    print '#%d: %s' % (idx + 1, animal)
# Prints "#1: fish", "#2: dog", "#3: cat"
Set comprehensions: Like lists and dictionaries, we can easily construct sets using set
comprehensions:
from math import sqrt
nums = {int(sqrt(x)) for x in range(30)}
print nums # Prints "set([0, 1, 2, 3, 4, 5])"
Here is a trivial example:
d = {(x, x + 1): x for x in range(10)} # Create a dictionary with tuple keys
t = (5, 6) # Create a tuple
print type(t) # Prints "<type 'tuple'>"
print d[t] # Prints "5"
print d[(1, 2)] # Prints "1"
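Unlike lists, tuples are immutable, and they can be unpacked into separate variables (a sketch, in Python 3 syntax):

```python
t = (5, 6)
x, y = t      # Unpack the tuple into two variables
print(x, y)   # 5 6
try:
    t[0] = 1  # Tuples cannot be modified in place
except TypeError:
    print('tuples are immutable')
```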
For example:
def sign(x):
    if x > 0:
        return 'positive'
    elif x < 0:
        return 'negative'
    else:
        return 'zero'

for x in [-1, 0, 1]:
    print sign(x)
# Prints "negative", "zero", "positive"
We will often define functions to take optional keyword arguments, like this:
def hello(name, loud=False):
    if loud:
        print 'HELLO, %s!' % name.upper()
    else:
        print 'Hello, %s!' % name

hello('Bob') # Prints "Hello, Bob!"
hello('Fred', loud=True) # Prints "HELLO, FRED!"
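Once positional arguments are supplied, keyword arguments can be passed in any order. A small sketch with a hypothetical greet function taking two defaults (Python 3 syntax; this function is illustrative, not from the slides):

```python
def greet(name, greeting='Hello', punctuation='!'):
    # Both keyword arguments have defaults; callers may override either one
    return '%s, %s%s' % (greeting, name, punctuation)

print(greet('Bob'))                    # Hello, Bob!
print(greet('Fred', punctuation='?'))  # Hello, Fred?
print(greet('Ann', greeting='Hi'))     # Hi, Ann!
```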
class Greeter:
    # Constructor
    def __init__(self, name):
        self.name = name # Create an instance variable

    # Instance method
    def greet(self, loud=False):
        if loud:
            print 'HELLO, %s!' % self.name.upper()
        else:
            print 'Hello, %s' % self.name

g = Greeter('Fred') # Construct an instance of the Greeter class
g.greet() # Call an instance method; prints "Hello, Fred"
g.greet(loud=True) # Call an instance method; prints "HELLO, FRED!"
We can initialize numpy arrays from nested Python lists, and access elements using square
brackets:
import numpy as np
a = np.array([1, 2, 3]) # Create a rank 1 array
print type(a) # Prints "<type 'numpy.ndarray'>"
print a.shape # Prints "(3,)"
print a[0], a[1], a[2] # Prints "1 2 3"
a[0] = 5 # Change an element of the array
print a # Prints "[5 2 3]"
b = np.array([[1,2,3],[4,5,6]]) # Create a rank 2 array
print b.shape # Prints "(2, 3)"
print b[0, 0], b[0, 1], b[1, 0] # Prints "1 2 4"
Numpy also provides many functions to create arrays:
import numpy as np
a = np.zeros((2,2)) # Create an array of all zeros
print a # Prints "[[ 0. 0.]
# [ 0. 0.]]"
b = np.ones((1,2)) # Create an array of all ones
print b # Prints "[[ 1. 1.]]"
c = np.full((2,2), 7) # Create a constant array
print c # Prints "[[ 7. 7.]
# [ 7. 7.]]"
d = np.eye(2) # Create a 2x2 identity matrix
print d # Prints "[[ 1. 0.]
# [ 0. 1.]]"
e = np.random.random((2,2)) # Create an array filled with random values
print e # Might print "[[ 0.91940167 0.08143941]
# [ 0.68744134 0.87236687]]"
Since arrays may be multidimensional, you must specify a slice for each dimension of the array:
import numpy as np
# Create the following rank 2 array with shape (3, 4)
# [[ 1 2 3 4]
# [ 5 6 7 8]
# [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
# [6 7]]
b = a[:2, 1:3]
# A slice of an array is a view into the same data, so modifying it
# will modify the original array.
print a[0, 1] # Prints "2"
b[0, 0] = 77 # b[0, 0] is the same piece of data as a[0, 1]
print a[0, 1] # Prints "77"
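If you need a subarray that does not alias the original data, call copy() on the slice; modifying the copy then leaves the original untouched (a sketch, in Python 3 syntax):

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
b = a[:2, 1:3].copy()  # An independent copy, not a view into a
b[0, 0] = 77           # Modify the copy only
print(a[0, 1])         # Still 2; the original array is unchanged
```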
You can also mix integer indexing with slice indexing. However, doing so will yield an array of
lower rank than the original array. Note that this is quite different from the way that MATLAB
handles array slicing:
import numpy as np
# Create the following rank 2 array with shape (3, 4)
# [[ 1 2 3 4]
# [ 5 6 7 8]
# [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
# Two ways of accessing the data in the middle row of the array.
# Mixing integer indexing with slices yields an array of lower rank,
# while using only slices yields an array of the same rank as the
# original array:
row_r1 = a[1, :] # Rank 1 view of the second row of a
row_r2 = a[1:2, :] # Rank 2 view of the second row of a
print row_r1, row_r1.shape # Prints "[5 6 7 8] (4,)"
print row_r2, row_r2.shape # Prints "[[5 6 7 8]] (1, 4)"
# We can make the same distinction when accessing columns of an array:
col_r1 = a[:, 1]
col_r2 = a[:, 1:2]
print col_r1, col_r1.shape # Prints "[ 2 6 10] (3,)"
print col_r2, col_r2.shape # Prints "[[ 2]
# [ 6]
# [10]] (3, 1)"
Here is an example:
import numpy as np
a = np.array([[1,2], [3, 4], [5, 6]])
# An example of integer array indexing.
# The returned array will have shape (3,):
print a[[0, 1, 2], [0, 1, 0]] # Prints "[1 4 5]"
# The above example of integer array indexing is equivalent to this:
print np.array([a[0, 0], a[1, 1], a[2, 0]]) # Prints "[1 4 5]"
# When using integer array indexing, you can reuse the same
# element from the source array:
print a[[0, 0], [1, 1]] # Prints "[2 2]"
# Equivalent to the previous integer array indexing example
print np.array([a[0, 1], a[0, 1]]) # Prints "[2 2]"
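One handy use of integer array indexing is selecting or mutating one element from each row of a matrix (a sketch, in Python 3 syntax; the arrays here are illustrative):

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
b = np.array([0, 2, 0, 1])   # One column index per row
print(a[np.arange(4), b])    # Selects one element per row: [ 1  6  7 11]
a[np.arange(4), b] += 10     # Mutates exactly those elements in place
print(a)
```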
Here is an example:
import numpy as np
a = np.array([[1,2], [3, 4], [5, 6]])
bool_idx = (a > 2) # Find the elements of a that are bigger than 2;
# this returns a numpy array of Booleans of the same
# shape as a, where each slot of bool_idx tells
# whether that element of a is > 2.
print bool_idx # Prints "[[False False]
# [ True True]
# [ True True]]"
# We use boolean array indexing to construct a rank 1 array
# consisting of the elements of a corresponding to the True values
# of bool_idx
print a[bool_idx] # Prints "[3 4 5 6]"
# We can do all of the above in a single concise statement:
print a[a > 2] # Prints "[3 4 5 6]"
Here is an example:
import numpy as np
x = np.array([1, 2]) # Let numpy choose the datatype
print x.dtype # Prints "int64"
x = np.array([1.0, 2.0]) # Let numpy choose the datatype
print x.dtype # Prints "float64"
x = np.array([1, 2], dtype=np.int64) # Force a particular datatype
print x.dtype # Prints "int64"
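An existing array can also be converted to another datatype with astype (a sketch, in Python 3 syntax):

```python
import numpy as np

x = np.array([1.7, 2.2])
y = x.astype(np.int64)  # Cast float64 -> int64 (truncates toward zero)
print(y)        # [1 2]
print(y.dtype)  # int64
```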
import numpy as np
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)
# Elementwise sum; both produce the array
# [[ 6.0 8.0]
# [10.0 12.0]]
print x + y
print np.add(x, y)
# Elementwise difference; both produce the array
# [[-4.0 -4.0]
# [-4.0 -4.0]]
print x - y
print np.subtract(x, y)
# Elementwise product; both produce the array
# [[ 5.0 12.0]
# [21.0 32.0]]
print x * y
print np.multiply(x, y)
# Elementwise division; both produce the array
# [[ 0.2 0.33333333]
# [ 0.42857143 0.5 ]]
print x / y
print np.divide(x, y)
# Elementwise square root; produces the array
# [[ 1. 1.41421356]
# [ 1.73205081 2. ]]
print np.sqrt(x)
import numpy as np
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])
v = np.array([9,10])
w = np.array([11, 12])
# Inner product of vectors; both produce 219
print v.dot(w)
print np.dot(v, w)
# Matrix / vector product; both produce the rank 1 array [29 67]
print x.dot(v)
print np.dot(x, v)
# Matrix / matrix product; both produce the rank 2 array
# [[19 22]
# [43 50]]
print x.dot(y)
print np.dot(x, y)
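On Python 3.5 or newer, the same products can be written with the @ matrix-multiplication operator:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
v = np.array([9, 10])
w = np.array([11, 12])
print(v @ w)  # Inner product of vectors: 219
print(x @ v)  # Matrix / vector product: [29 67]
print(x @ y)  # Matrix / matrix product: [[19 22], [43 50]]
```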
import numpy as np
x = np.array([[1,2],[3,4]])
print np.sum(x) # Compute sum of all elements; prints "10"
print np.sum(x, axis=0) # Compute sum of each column; prints "[4 6]"
print np.sum(x, axis=1) # Compute sum of each row; prints "[3 7]"
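Other reductions such as mean, max, and min follow the same axis convention as np.sum (a sketch, in Python 3 syntax):

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
print(np.mean(x))         # Mean of all elements: 2.5
print(np.max(x, axis=0))  # Maximum of each column: [3 4]
print(np.min(x, axis=1))  # Minimum of each row: [1 3]
```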
import numpy as np
x = np.array([[1,2], [3,4]])
print x # Prints "[[1 2]
# [3 4]]"
print x.T # Prints "[[1 3]
# [2 4]]"
# Note that taking the transpose of a rank 1 array does nothing:
v = np.array([1,2,3])
print v # Prints "[1 2 3]"
print v.T # Prints "[1 2 3]"
import numpy as np
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = np.empty_like(x) # Create an empty matrix with the same shape as x
# Add the vector v to each row of the matrix x with an explicit loop
for i in range(4):
    y[i, :] = x[i, :] + v
# Now y is the following
# [[ 2 2 4]
# [ 5 5 7]
# [ 8 8 10]
# [11 11 13]]
print y
This works; however when the matrix x is very large, computing an explicit loop in Python could
be slow. Note that adding the vector v to each row of the matrix x is equivalent to forming a
matrix vv by stacking multiple copies of v vertically, then performing elementwise summation of x
and vv. We could implement this approach like this:
import numpy as np
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
vv = np.tile(v, (4, 1)) # Stack 4 copies of v on top of each other
print vv # Prints "[[1 0 1]
# [1 0 1]
# [1 0 1]
# [1 0 1]]"
y = x + vv # Add x and vv elementwise
print y # Prints "[[ 2 2 4]
# [ 5 5 7]
# [ 8 8 10]
# [11 11 13]]"
Numpy broadcasting allows us to perform this computation without actually creating multiple copies of v. Consider this version, using broadcasting:
import numpy as np
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = x + v # Add v to each row of x using broadcasting
print y # Prints "[[ 2 2 4]
# [ 5 5 7]
# [ 8 8 10]
# [11 11 13]]"
The line y = x + v works even though x has shape (4, 3) and v has shape (3,) due to
broadcasting; this line works as if v actually had shape (4, 3), where each row was a
copy of v, and the sum was performed elementwise.
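The general broadcasting rule: numpy compares shapes from the trailing dimension backwards, and two dimensions are compatible when they are equal or one of them is 1. A short sketch of shapes that do and do not broadcast (Python 3 syntax):

```python
import numpy as np

x = np.ones((4, 3))
v = np.ones(3)        # Shape (3,) broadcasts against (4, 3)
w = np.ones(4)        # Shape (4,) does NOT broadcast against (4, 3)
print((x + v).shape)  # (4, 3)
try:
    x + w
except ValueError:
    print('shapes (4, 3) and (4,) are incompatible')
# Reshaping w to (4, 1) makes the trailing dimensions compatible:
print((x + w.reshape(4, 1)).shape)  # (4, 3)
```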
There are currently more than 60 universal functions defined in numpy on one or more types,
covering a wide variety of operations. Some of these ufuncs are called automatically on arrays
when the relevant infix notation is used (e.g., add(a, b) is called internally when a + b is written
and a or b is an ndarray). Nevertheless, you may still want to use the ufunc call in order to use the
optional output argument(s) to place the output(s) in an object (or objects) of your choice.
Recall that each ufunc operates element-by-element. Therefore, each ufunc will be described as if
acting on a set of scalar inputs to return a set of scalar outputs.
Math operations
add(x1, x2[, out]) Add arguments element-wise.
subtract(x1, x2[, out]) Subtract arguments, element-wise.
multiply(x1, x2[, out]) Multiply arguments element-wise.
divide(x1, x2[, out]) Divide arguments element-wise.
logaddexp(x1, x2[, out]) Logarithm of the sum of exponentiations of the inputs.
logaddexp2(x1, x2[, out]) Logarithm of the sum of exponentiations of the inputs in base-2.
true_divide(x1, x2[, out]) Returns a true division of the inputs, element-wise.
floor_divide(x1, x2[, out]) Return the largest integer smaller or equal to the division of the
inputs.
negative(x[, out]) Numerical negative, element-wise.
power(x1, x2[, out]) First array elements raised to powers from second array, element-
wise.
remainder(x1, x2[, out]) Return element-wise remainder of division.
mod(x1, x2[, out]) Return element-wise remainder of division.
fmod(x1, x2[, out]) Return the element-wise remainder of division.
absolute(x[, out]) Calculate the absolute value element-wise.
rint(x[, out]) Round elements of the array to the nearest integer.
sign(x[, out]) Returns an element-wise indication of the sign of a number.
conj(x[, out]) Return the complex conjugate, element-wise.
exp(x[, out]) Calculate the exponential of all elements in the input array.
exp2(x[, out]) Calculate 2**p for all p in the input array.
log(x[, out]) Natural logarithm, element-wise.
log2(x[, out]) Base-2 logarithm of x.
log10(x[, out]) Return the base 10 logarithm of the input array, element-wise.
expm1(x[, out]) Calculate exp(x) - 1 for all elements in the array.
log1p(x[, out]) Return the natural logarithm of one plus the input array,
element-wise.
sqrt(x[, out]) Return the positive square-root of an array, element-wise.
square(x[, out]) Return the element-wise square of the input.
reciprocal(x[, out]) Return the reciprocal of the argument, element-wise.
ones_like(a[, dtype, order, subok]) Return an array of ones with the same shape and type as a given array.
Tip
The optional output arguments can be used to help you save memory for large
calculations. If your arrays are large, complicated expressions can take longer than
absolutely necessary due to the creation and (later) destruction of temporary
calculation spaces. For example, the expression G = a * b + c is equivalent to t1 = a * b;
G = t1 + c; del t1. It will be more quickly executed as G = a * b; add(G, c, G), which
is the same as G = a * b; G += c.
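The Tip above can be sketched concretely with the optional out argument, which writes the result into an existing array instead of allocating a new temporary (Python 3 syntax; the arrays are illustrative):

```python
import numpy as np

a = np.arange(4.0)       # [0. 1. 2. 3.]
b = np.arange(4.0)
c = np.ones(4)
g = a * b                # Allocates g = [0. 1. 4. 9.]
np.add(g, c, out=g)      # Adds c into g in place; no extra temporary array
print(g)                 # [ 1.  2.  5. 10.]
```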
Trigonometric functions
All trigonometric functions use radians when an angle is called for. The ratio of degrees to radians
is 180°/π.
sin(x[, out]) Trigonometric sine, element-wise.
cos(x[, out]) Cosine element-wise.
tan(x[, out]) Compute tangent element-wise.
arcsin(x[, out]) Inverse sine, element-wise.
arccos(x[, out]) Trigonometric inverse cosine, element-wise.
arctan(x[, out]) Trigonometric inverse tangent, element-wise.
arctan2(x1, x2[, out]) Element-wise arc tangent of x1/x2 choosing the quadrant correctly.
hypot(x1, x2[, out]) Given the “legs” of a right triangle, return its hypotenuse.
sinh(x[, out]) Hyperbolic sine, element-wise.
cosh(x[, out]) Hyperbolic cosine, element-wise.
tanh(x[, out]) Compute hyperbolic tangent element-wise.
arcsinh(x[, out]) Inverse hyperbolic sine element-wise.
arccosh(x[, out]) Inverse hyperbolic cosine, element-wise.
arctanh(x[, out]) Inverse hyperbolic tangent element-wise.
deg2rad(x[, out]) Convert angles from degrees to radians.
rad2deg(x[, out]) Convert angles from radians to degrees.
Bit-twiddling functions
These functions all require integer arguments and they manipulate the bit-pattern of those
arguments.
bitwise_and(x1, x2[, out]) Compute the bit-wise AND of two arrays element-wise.
bitwise_or(x1, x2[, out]) Compute the bit-wise OR of two arrays element-wise.
bitwise_xor(x1, x2[, out]) Compute the bit-wise XOR of two arrays element-wise.
invert(x[, out]) Compute bit-wise inversion, or bit-wise NOT, element-wise.
left_shift(x1, x2[, out]) Shift the bits of an integer to the left.
right_shift(x1, x2[, out]) Shift the bits of an integer to the right.
Comparison functions
greater(x1, x2[, out]) Return the truth value of (x1 > x2) element-wise.
greater_equal(x1, x2[, out]) Return the truth value of (x1 >= x2) element-wise.
less(x1, x2[, out]) Return the truth value of (x1 < x2) element-wise.
less_equal(x1, x2[, out]) Return the truth value of (x1 <= x2) element-wise.
not_equal(x1, x2[, out]) Return (x1 != x2) element-wise.
equal(x1, x2[, out]) Return (x1 == x2) element-wise.
logical_and(x1, x2[, out]) Compute the truth value of x1 AND x2 element-wise.
logical_or(x1, x2[, out]) Compute the truth value of x1 OR x2 element-wise.
logical_xor(x1, x2[, out]) Compute the truth value of x1 XOR x2, element-wise.
logical_not(x[, out]) Compute the truth value of NOT x element-wise.
Floating functions
Recall that all of these functions work element-by-element over an array, returning an array
output. The description details only a single operation.
isreal(x) Returns a bool array, where True if input element is real.
iscomplex(x) Returns a bool array, where True if input element is complex.
isfinite(x[, out]) Test element-wise for finiteness (not infinity or not Not a Number).
isinf(x[, out]) Test element-wise for positive or negative infinity.
isnan(x[, out]) Test element-wise for NaN and return result as a boolean array.
signbit(x[, out]) Returns element-wise True where signbit is set (less than zero).
copysign(x1, x2[, out]) Change the sign of x1 to that of x2, element-wise.
nextafter(x1, x2[, out]) Return the next floating-point value after x1 towards x2, element-
wise.
modf(x[, out1, out2]) Return the fractional and integral parts of an array, element-wise.
ldexp(x1, x2[, out]) Returns x1 * 2**x2, element-wise.
frexp(x[, out1, out2]) Decompose the elements of x into mantissa and twos exponent.
fmod(x1, x2[, out]) Return the element-wise remainder of division.
floor(x[, out]) Return the floor of the input, element-wise.
ceil(x[, out]) Return the ceiling of the input, element-wise.
trunc(x[, out]) Return the truncated value of the input, element-wise.
Here are some applications of broadcasting:
import numpy as np
# Compute outer product of vectors
v = np.array([1,2,3]) # v has shape (3,)
w = np.array([4,5]) # w has shape (2,)
# To compute an outer product, we first reshape v to be a column
# vector of shape (3, 1); we can then broadcast it against w to yield
# an output of shape (3, 2), which is the outer product of v and w:
# [[ 4 5]
# [ 8 10]
# [12 15]]
print np.reshape(v, (3, 1)) * w
# Add a vector to each row of a matrix
x = np.array([[1,2,3], [4,5,6]])
# x has shape (2, 3) and v has shape (3,) so they broadcast to (2, 3),
# giving the following matrix:
# [[2 4 6]
# [5 7 9]]
print x + v
# Add a vector to each column of a matrix
# x has shape (2, 3) and w has shape (2,).
# If we transpose x then it has shape (3, 2) and can be broadcast
# against w to yield a result of shape (3, 2); transposing this result
# yields the final result of shape (2, 3) which is the matrix x with
# the vector w added to each column. Gives the following matrix:
# [[ 5 6 7]
# [ 9 10 11]]
print (x.T + w).T
# Another solution is to reshape w to be a column vector of shape (2, 1);
# we can then broadcast it directly against x to produce the same
# output.
print x + np.reshape(w, (2, 1))
# Multiply a matrix by a constant:
# x has shape (2, 3). Numpy treats scalars as arrays of shape ();
# these can be broadcast together to shape (2, 3), producing the
# following array:
# [[ 2 4 6]
# [ 8 10 12]]
print x * 2
Here is a simple example:
import numpy as np
import matplotlib.pyplot as plt
# Compute the x and y coordinates for points on a sine curve
x = np.arange(0, 3 * np.pi, 0.1)
y = np.sin(x)
# Plot the points using matplotlib
plt.plot(x, y)
plt.show() # You must call plt.show() to make graphics appear.
With just a little bit of extra work we can easily plot multiple lines at once, and add a title, legend,
and axis labels:
import numpy as np
import matplotlib.pyplot as plt
# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)
# Plot the points using matplotlib
plt.plot(x, y_sin)
plt.plot(x, y_cos)
plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.title('Sine and Cosine')
plt.legend(['Sine', 'Cosine'])
plt.show()
Here is an example:
import numpy as np
import matplotlib.pyplot as plt
# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)
# Set up a subplot grid that has height 2 and width 1,
# and set the first such subplot as active.
plt.subplot(2, 1, 1)
# Make the first plot
plt.plot(x, y_sin)
plt.title('Sine')
# Set the second subplot as active, and make the second plot.
plt.subplot(2, 1, 2)
plt.plot(x, y_cos)
plt.title('Cosine')
# Show the figure.
plt.show()
Here is an example:
import numpy as np
from scipy.misc import imread, imresize
import matplotlib.pyplot as plt
img = imread('assets/cat.jpg')
img_tinted = img * [1, 0.95, 0.9]
# Show the original image
plt.subplot(1, 2, 1)
plt.imshow(img)
# Show the tinted image
plt.subplot(1, 2, 2)
# A slight gotcha with imshow is that it might give strange results
# if presented with data that is not uint8. To work around this, we
# explicitly cast the image to uint8 before displaying it.
plt.imshow(np.uint8(img_tinted))
plt.show()
PANDAS
Lesson 1
Create Data - We begin by creating our own data set for analysis. This prevents the end user reading this tutorial from having to download any files to replicate the results below. We will export this data set to a text file so that you can get some experience pulling data from a text file.
Get Data - We will learn how to read in the text file. The data consist of baby names and the number of baby names born in the year 1880.
Prepare Data - Here we will simply take a look at the data and make sure it is clean. By clean I mean we will take a look inside the contents of the text file and look for any anomalies. These can include missing data, inconsistencies in the data, or any other data that seems out of place. If any are found we will then have to make decisions on what to do with these records.
Analyze Data - We will simply find the most popular name in a specific year.
Present Data - Through tabular data and a graph, clearly show the end user what is the most popular name in a specific year.
The pandas library is used for all the data analysis excluding a small piece of the data presentation section. The matplotlib library will only be needed for the data presentation section. Importing the libraries is the first step we will take in the lesson.
# Import all libraries needed for the tutorial
# General syntax to import specific functions in a library:
##from (library) import (specific library function)
from pandas import DataFrame, read_csv
# General syntax to import a library but no functions:
##import (library) as (give the library a nickname/alias)
import matplotlib.pyplot as plt
import pandas as pd #this is how I usually import pandas
import sys #only needed to determine Python version number
# Enable inline plotting
%matplotlib inline
print 'Python version ' + sys.version
print 'Pandas version ' + pd.__version__
Python version 2.7.5 |Anaconda 2.1.0 (64-bit)| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]
Pandas version 0.15.2
Create Data
The data set will consist of 5 baby names and the number of births recorded for that year (1880).
# The initial set of baby names and birth rates
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]
To merge these two lists together we will use the zip function.
zip?
BabyDataSet = zip(names,births)
BabyDataSet
[('Bob', 968), ('Jessica', 155), ('Mary', 77), ('John', 578), ('Mel', 973)]
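Note that this output assumes Python 2, where zip returns a list. In Python 3, zip returns a lazy iterator, so you would wrap it in list() to get the same result:

```python
names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]
BabyDataSet = list(zip(names, births))  # Materialize the iterator into a list
print(BabyDataSet)
```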
We are basically done creating the data set. We will now use the pandas library to export this data set into a csv file.
df will be a DataFrame object. You can think of this object as holding the contents of the BabyDataSet in a format similar to a sql table or an excel spreadsheet. Let's take a look below at the contents inside df.
df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])
df
     Names  Births
0      Bob     968
1  Jessica     155
2     Mary      77
3     John     578
4      Mel     973

Export the dataframe to a csv file. We can name the file births1880.csv. The function to_csv will be used to export the file. The file will be saved in the same location as the notebook unless specified otherwise.
In [7]:
df.to_csv?
The only parameters we will use are index and header. Setting these parameters to False will prevent the index and header names from being exported. Change the values of these parameters to get a better understanding of their use.
In [8]:
df.to_csv('births1880.csv',index=False,header=False)
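To picture what to_csv produces with index=False and header=False, here is a stdlib-only sketch of the resulting file contents (illustrative; pandas does the equivalent internally):

```python
import csv
import io

BabyDataSet = [('Bob', 968), ('Jessica', 155), ('Mary', 77),
               ('John', 578), ('Mel', 973)]

buf = io.StringIO()
writer = csv.writer(buf)
# index=False, header=False: only the data rows are written
for name, count in BabyDataSet:
    writer.writerow([name, count])

print(buf.getvalue())
```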
Get Data
To pull in the csv file, we will use the pandas function read_csv. Let us take a look at this function and what inputs it takes.
In [9]:
read_csv?
Even though this function has many parameters, we will simply pass it the location of the text file.
Location = C:\Users\ENTER_USER_NAME.xy\startups\births1880.csv
Note: Depending on where you save your notebooks, you may need to modify the location above.
In [10]:
Location = r'C:\Users\david\notebooks\pandas\births1880.csv'
df = pd.read_csv(Location)
Notice the r before the string. Since backslashes are special characters in string literals, prefixing the string with r makes it a raw string, so the backslashes are not treated as escape sequences.
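The difference the r prefix makes can be checked directly — a sketch using \n and \b, two escapes that commonly corrupt Windows paths:

```python
plain = 'C:\new_folder\births1880.csv'  # \n becomes a newline, \b a backspace
raw = r'C:\new_folder\births1880.csv'   # raw string: backslashes kept literally

print('\n' in plain)  # the plain path was silently corrupted
print('\n' in raw)
```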
In [11]:
df
Out[11]:
       Bob  968
0  Jessica  155
1     Mary   77
2     John  578
3      Mel  973

This brings us to our first problem of the exercise. The read_csv function treated the first record in the csv file as the header names. This is obviously not correct since the text file did not provide us with header names.
To correct this we will pass the header parameter to the read_csv function and set it to None (None means null in Python).
In [12]:
df = pd.read_csv(Location, header=None)
df
Out[12]:
         0    1
0      Bob  968
1  Jessica  155
2     Mary   77
3     John  578
4      Mel  973
If we wanted to give the columns specific names, we would have to pass another parameter called names. We can also omit the header parameter.
In [13]:
df = pd.read_csv(Location, names=['Names','Births'])
df
Out[13]:
     Names  Births
0      Bob     968
1  Jessica     155
2     Mary      77
3     John     578
4      Mel     973
You can think of the numbers [0,1,2,3,4] as the row numbers in an Excel file. In pandas these are part of the index of the dataframe. You can think of the index as the primary key of a SQL table, with the exception that an index is allowed to have duplicates.

[Names, Births] can be thought of as column headers, similar to the ones found in an Excel spreadsheet or SQL database.
Delete the csv file now that we are done using it.
In [14]:
import os
os.remove(Location)
Prepare Data
The data we have consists of baby names and the number of births in the year 1880. We already know that we have 5 records and none of the records are missing (non-null values).
The Names column at this point is of no concern, since it most likely is just composed of alphanumeric strings (baby names). There is a chance of bad data in this column, but we will not worry about that at this point of the analysis. The Births column should just contain integers representing the number of babies born in a specific year with a specific name. We can check whether all the data in this column is of data type integer. It would not make sense for this column to have a data type of float. I would not worry about any possible outliers at this point of the analysis.

Realize that aside from the check we did on the "Names" column, briefly looking at the data inside the dataframe should be as far as we need to go at this stage of the game. As we continue in the data analysis life cycle we will have plenty of opportunities to find any issues with the data set.
In [15]:
# Check data type of the columns
df.dtypes
Out[15]:
Names object
Births int64
dtype: object
In [16]:
# Check data type of Births column
df.Births.dtype
Out[16]:
dtype('int64')
As you can see, the Births column is of type int64, thus no floats (decimal numbers) or alphanumeric characters will be present in this column.
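A rough pure-Python analogue of that dtype check: an int64 column corresponds to every value in a plain list being an int (illustrative only, not what pandas does internally):

```python
births = [968, 155, 77, 578, 973]

# dtype int64 on the column <=> every value is an integer
all_ints = all(isinstance(b, int) for b in births)
print(all_ints)
```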
Analyze Data
To find the most popular name, i.e. the baby name with the highest birth count, we can do one of the following:

- Sort the dataframe and select the top row
- Use the max() function to find the maximum value
In [17]:
# Method 1:
Sorted = df.sort(['Births'], ascending=False)
Sorted.head(1)
Out[17]:
   Names  Births
4    Mel     973

In [18]:
# Method 2:
df['Births'].max()
Out[18]:
973
Present Data
Here we can plot the Births column and label the graph to show the end user the highest point on the graph. In conjunction with the table, the end user has a clear picture that Mel is the most popular baby name in the data set.

plot() is a convenient method with which pandas lets you painlessly plot the data in your dataframe. We learned how to find the maximum value of the Births column in the previous section. Finding the actual baby name that goes with the value 973 looks a bit tricky, so let's go over it.
Explain the pieces:

- df['Names'] - This is the entire list of baby names, the entire Names column
- df['Births'] - This is the entire list of Births in the year 1880, the entire Births column
- df['Births'].max() - This is the maximum value found in the Births column

[df['Births'] == df['Births'].max()] IS EQUAL TO [Find all of the records in the Births column where it is equal to 973]
df['Names'][df['Births'] == df['Births'].max()] IS EQUAL TO [Select all of the records in the Names column WHERE the Births column is equal to 973]

An alternative way could have been to use the Sorted dataframe: Sorted['Names'].head(1).value
The str() function simply converts an object into a string.
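The select-where logic above can be sketched without pandas, against the original list of (name, births) pairs; an illustrative analogue, not what pandas does internally:

```python
BabyDataSet = [('Bob', 968), ('Jessica', 155), ('Mary', 77),
               ('John', 578), ('Mel', 973)]

# df['Births'].max() analogue
max_births = max(b for _, b in BabyDataSet)

# df['Names'][df['Births'] == df['Births'].max()] analogue:
# keep every name whose birth count equals the maximum
top_names = [n for n, b in BabyDataSet if b == max_births]

print(max_births, top_names)
```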
In [19]:
# Create graph
df['Births'].plot()
# Maximum value in the data set
MaxValue = df['Births'].max()
# Name associated with the maximum value
MaxName = df['Names'][df['Births'] == df['Births'].max()].values
# Text to display on graph
Text = str(MaxValue) + " - " + MaxName
# Add text to graph
plt.annotate(Text, xy=(1, MaxValue), xytext=(8, 0),
             xycoords=('axes fraction', 'data'), textcoords='offset points')
print "The most popular name"
df[df['Births'] == df['Births'].max()]
#Sorted.head(1) can also be used
The most popular name
Out[19]:
   Names  Births
4    Mel     973
Lesson 2

In [1]:
# The usual preamble
import pandas as pd
import matplotlib.pyplot as plt

# Make the graphs a bit prettier, and bigger
pd.set_option('display.mpl_style', 'default')
pd.set_option('display.line_width', 5000)
pd.set_option('display.max_columns', 60)
# figsize() only exists under %pylab; the plain-matplotlib equivalent is:
plt.rcParams['figure.figsize'] = (15, 5)
We're going to use a new dataset here, to demonstrate how to deal with larger datasets. This is a subset of the 311 service requests from NYC Open Data.
In [2]:
complaints = pd.read_csv('../data/311-service-requests.csv')
2.1 What's even in it? (the summary)
When you look at a large dataframe, instead of showing you the contents of the dataframe, it'll show you a summary. This includes all the columns, and how many non-null values there are in each column.
In [3]:
complaints
Out[3]:
Int64Index: 111069 entries, 0 to 111068
Data columns (total 52 columns):
Unique Key 111069 non-null values
Created Date 111069 non-null values
Closed Date 60270 non-null values
Agency 111069 non-null values
Agency Name 111069 non-null values
Complaint Type 111069 non-null values
Descriptor 111068 non-null values
Location Type 79048 non-null values
Incident Zip 98813 non-null values
Incident Address 84441 non-null values
Street Name 84438 non-null values
https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9
Cross Street 1 84728 non-null values
Cross Street 2 84005 non-null values
Intersection Street 1 19364 non-null values
Intersection Street 2 19366 non-null values
Address Type 102247 non-null values
City 98860 non-null values
Landmark 95 non-null values
Facility Type 110938 non-null values
Status 111069 non-null values
Due Date 39239 non-null values
Resolution Action Updated Date 96507 non-null values
Community Board 111069 non-null values
Borough 111069 non-null values
X Coordinate (State Plane) 98143 non-null values
Y Coordinate (State Plane) 98143 non-null values
Park Facility Name 111069 non-null values
Park Borough 111069 non-null values
School Name 111069 non-null values
School Number 111052 non-null values
School Region 110524 non-null values
School Code 110524 non-null values
School Phone Number 111069 non-null values
School Address 111069 non-null values
School City 111069 non-null values
School State 111069 non-null values
School Zip 111069 non-null values
School Not Found 38984 non-null values
School or Citywide Complaint 0 non-null values
Vehicle Type 99 non-null values
Taxi Company Borough 117 non-null values
Taxi Pick Up Location 1059 non-null values
Bridge Highway Name 185 non-null values
Bridge Highway Direction 185 non-null values
Road Ramp 184 non-null values
Bridge Highway Segment 223 non-null values
Garage Lot Name 49 non-null values
Ferry Direction 37 non-null values
Ferry Terminal Name 336 non-null values
Latitude 98143 non-null values
Longitude 98143 non-null values
Location 98143 non-null values
dtypes: float64(5), int64(1), object(46)
2.2 Selecting columns and rows
To select a column, we index with the name of the column, like this:
In [4]:
complaints['Complaint Type']
Out[4]:
0 Noise - Street/Sidewalk
1 Illegal Parking
2 Noise - Commercial
3 Noise - Vehicle
4 Rodent
5 Noise - Commercial
6 Blocked Driveway
7 Noise - Commercial
8 Noise - Commercial
9 Noise - Commercial
10 Noise - House of Worship
11 Noise - Commercial
12 Illegal Parking
13 Noise - Vehicle
14 Rodent
...
111054 Noise - Street/Sidewalk
111055 Noise - Commercial
111056 Street Sign - Missing
111057 Noise
111058 Noise - Commercial
111059 Noise - Street/Sidewalk
111060 Noise
111061 Noise - Commercial
111062 Water System
111063 Water System
111064 Maintenance or Facility
111065 Illegal Parking
111066 Noise - Street/Sidewalk
111067 Noise - Commercial
111068 Blocked Driveway
Name: Complaint Type, Length: 111069, dtype: object
To get the first 5 rows of a dataframe, we can use a slice: df[:5].
This is a great way to get a sense for what kind of information is in the dataframe -- take a minute tolook at the contents and get a feel for this dataset.
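The [:5] notation is ordinary Python slicing; plain lists behave the same way, which is worth keeping in mind when reading pandas code:

```python
complaint_types = ['Noise - Street/Sidewalk', 'Illegal Parking',
                   'Noise - Commercial', 'Noise - Vehicle', 'Rodent',
                   'Noise - Commercial', 'Blocked Driveway']

first_five = complaint_types[:5]  # elements 0 through 4
print(first_five)
```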
In [5]:
complaints[:5]
Out[5]:
[Output: the first 5 rows shown across all 52 columns - Unique Key, Created Date, Closed Date, Agency, Agency Name, Complaint Type, Descriptor, Location Type, Incident Zip, Incident Address, Street Name, cross streets, City, Status, Community Board, Borough, State Plane coordinates, school fields, taxi and bridge fields, Latitude, Longitude, Location. The table is far too wide to reproduce cleanly here; the five rows are the complaints Noise - Street/Sidewalk, Illegal Parking, Noise - Commercial, Noise - Vehicle, and Rodent.]
We can combine these to get the first 5 rows of a column:
In [6]:
complaints['Complaint Type'][:5]
Out[6]:
0 Noise - Street/Sidewalk
1 Illegal Parking
2 Noise - Commercial
3 Noise - Vehicle
4 Rodent
Name: Complaint Type, dtype: object
and it doesn't matter which direction we do it in:
In [7]:
complaints[:5]['Complaint Type']
Out[7]:
0 Noise - Street/Sidewalk
1 Illegal Parking
2 Noise - Commercial
3 Noise - Vehicle
4 Rodent
Name: Complaint Type, dtype: object
2.3 Selecting multiple columns

What if we just want to know the complaint type and the borough, but not the rest of the information? Pandas makes it really easy to select a subset of the columns: just index with the list of columns you want.
In [8]:
complaints[['Complaint Type', 'Borough']]
Out[8]:
Int64Index: 111069 entries, 0 to 111068
Data columns (total 2 columns):
Complaint Type 111069 non-null values
Borough 111069 non-null values
dtypes: object(2)
That showed us a summary, and then we can look at the first 10 rows:
In [9]:
complaints[['Complaint Type', 'Borough']][:10]
Out[9]:
  Complaint Type           Borough
0 Noise - Street/Sidewalk  QUEENS
1 Illegal Parking          QUEENS
2 Noise - Commercial       MANHATTAN
3 Noise - Vehicle          MANHATTAN
4 Rodent                   MANHATTAN
5 Noise - Commercial       QUEENS
6 Blocked Driveway         QUEENS
7 Noise - Commercial       QUEENS
8 Noise - Commercial       MANHATTAN
9 Noise - Commercial       BROOKLYN
2.4 What's the most common complaint type?
This is a really easy question to answer! There's a .value_counts() method that we can use:
In [10]:
complaints['Complaint Type'].value_counts()
Out[10]:
HEATING 14200
GENERAL CONSTRUCTION 7471
Street Light Condition 7117
DOF Literature Request 5797
PLUMBING 5373
PAINT - PLASTER 5149
Blocked Driveway 4590
NONCONST 3998
Street Condition 3473
Illegal Parking 3343
Noise 3321
Traffic Signal Condition 3145
Dirty Conditions 2653
Water System 2636
Noise - Commercial 2578
...
Opinion for the Mayor 2
Window Guard 2
DFTA Literature Request 2
Legal Services Provider Complaint 2
Open Flame Permit 1
Snow 1
Municipal Parking Facility 1
X-Ray Machine/Equipment 1
Stalled Sites 1
DHS Income Savings Requirement 1
Tunnel Condition 1
Highway Sign - Damaged 1
Ferry Permit 1
Trans Fat 1
DWD 1
Length: 165, dtype: int64
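For intuition, value_counts() behaves much like the stdlib collections.Counter — a sketch of the counting step on a tiny made-up sample, not the pandas implementation:

```python
from collections import Counter

sample = ['Noise - Commercial', 'Rodent', 'Noise - Commercial',
          'Illegal Parking', 'Noise - Commercial']

counts = Counter(sample)
# most_common() sorts by frequency, descending, like value_counts()
print(counts.most_common())
```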
If we just wanted the top 10 most common complaints, we can do this:
In [11]:
complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10]
Out[11]:
HEATING 14200
GENERAL CONSTRUCTION 7471
Street Light Condition 7117
DOF Literature Request 5797
PLUMBING 5373
PAINT - PLASTER 5149
Blocked Driveway 4590
NONCONST 3998
Street Condition 3473
Illegal Parking 3343
dtype: int64
But it gets better! We can plot them!
In [12]:
complaint_counts[:10].plot(kind='bar')
Out[12]:
Lesson 3
Get Data - Our data set will consist of an Excel file containing customer counts per date. We will learn how to read in the Excel file for processing.
Prepare Data - The data is an irregular time series having duplicate dates. We will be challenged in compressing the data and coming up with next year's forecasted customer count.
Analyze Data - We use graphs to visualize trends and spot outliers. Some built-in computational tools will be used to calculate next year's forecasted customer count.
Present Data - The results will be plotted.
NOTE: Make sure you have looked through all previous lessons, as the knowledge learned in previous lessons will be needed for this exercise.
In [1]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy.random as np  # note: np aliases numpy.random here, not numpy
import sys
%matplotlib inline
In [2]:
print 'Python version ' + sys.version
print 'Pandas version: ' + pd.__version__
Python version 2.7.5 |Anaconda 2.1.0 (64-bit)| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]
Pandas version: 0.15.2
We will be creating our own test data for analysis.
In [3]:
# set seed
np.seed(111)
# Function to generate test data
def CreateDataSet(Number=1):
Output = []
for i in range(Number):
# Create a weekly (mondays) date range
rng = pd.date_range(start='1/1/2009', end='12/31/2012', freq='W-MON')
# Create random data
data = np.randint(low=25,high=1000,size=len(rng))
# Status pool
status = [1,2,3]
# Make a random list of statuses
random_status = [status[np.randint(low=0, high=len(status))] for i in range(len(rng))]
# State pool
states = ['GA','FL','fl','NY','NJ','TX']
# Make a random list of states
random_states = [states[np.randint(low=0, high=len(states))] for i in range(len(rng))]
Output.extend(zip(random_states, random_status, data, rng))
return Output
Now that we have a function to generate our test data, let's create some data and stick it into a dataframe.
In [4]:
dataset = CreateDataSet(4)
df = pd.DataFrame(data=dataset, columns=['State','Status','CustomerCount','StatusDate'])
df.info()
Int64Index: 836 entries, 0 to 835
Data columns (total 4 columns):
State 836 non-null object
Status 836 non-null int64
CustomerCount 836 non-null int64
StatusDate 836 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 32.7+ KB
In [5]:
df.head()
Out[5]:
  State  Status  CustomerCount  StatusDate
0    GA       1            877  2009-01-05
1    FL       1            901  2009-01-12
2    fl       3            749  2009-01-19
3    FL       3            111  2009-01-26
4    GA       1            300  2009-02-02

We are now going to save this dataframe into an Excel file, to then bring it back to a dataframe. We simply do this to show you how to read and write to Excel files.

We do not write the index values of the dataframe to the Excel file, since they are not meant to be part of our initial test data set.
In [6]:
# Save results to excel
df.to_excel('Lesson3.xlsx', index=False)
print 'Done'
Done
Grab Data from Excel
We will be using the read_excel function to read in data from an Excel file. The function allows you to read in specific sheets by name or location.
In [7]:
pd.read_excel?
Note: The location of the Excel file will be in the same folder as the notebook, unless specified otherwise.
In [8]:
# Location of file
Location = r'C:\Users\david\notebooks\pandas\Lesson3.xlsx'
# Parse a specific sheet
df = pd.read_excel(Location, 0, index_col='StatusDate')
df.dtypes
Out[8]:
State             object
Status             int64
CustomerCount      int64
dtype: object
In [9]:
df.index
Out[9]:
[2009-01-05, ..., 2012-12-31]
Length: 836, Freq: None, Timezone: None
In [10]:
df.head()
Out[10]:
           State  Status  CustomerCount
StatusDate
2009-01-05 GA 1 877
2009-01-12 FL 1 901
2009-01-19 fl 3 749
2009-01-26 FL 3 111
2009-02-02 GA 1 300
Prepare Data
This section attempts to clean up the data for analysis.

1. Make sure the state column is all in upper case
2. Only select records where the account status is equal to "1"
3. Merge (NJ and NY) to NY in the state column
4. Remove any outliers (any odd results in the data set)
Let's take a quick look at how some of the State values are upper case and some are lower case.
In [11]:
df['State'].unique()
Out[11]:
array([u'GA', u'FL', u'fl', u'TX', u'NY', u'NJ'], dtype=object)
To convert all the State values to upper case we will use the upper() method and the dataframe's apply method. The lambda function simply applies the upper function to each value in the State column.
In [12]:
# Clean State Column, convert to upper case
df['State'] = df.State.apply(lambda x: x.upper())
In [13]:
df['State'].unique()
Out[13]:
array([u'GA', u'FL', u'TX', u'NY', u'NJ'], dtype=object)
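Outside pandas, the same per-element transformation is an ordinary list comprehension; an illustrative analogue of apply(lambda x: x.upper()):

```python
states = ['GA', 'FL', 'fl', 'NY', 'NJ', 'TX']

# df.State.apply(lambda x: x.upper()) analogue
upper_states = [s.upper() for s in states]

# sorted(set(...)) mirrors unique(), up to ordering
print(sorted(set(upper_states)))
```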
In [14]:
# Only grab where Status == 1
mask = df['Status'] == 1
df = df[mask]
To turn the NJ states to NY we simply:

- df.State == 'NJ' - Find all records in the State column where they are equal to NJ.
- df['State'][df.State == 'NJ'] = 'NY' - For all records in the State column where they are equal to NJ, replace them with NY.
In [15]:
# Convert NJ to NY
mask = df.State == 'NJ'
df['State'][mask] = 'NY'
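The masked assignment has a simple plain-Python analogue (illustrative only; pandas performs the replacement on the column in place):

```python
states = ['GA', 'FL', 'TX', 'NY', 'NJ']

# df['State'][df.State == 'NJ'] = 'NY' analogue
states = ['NY' if s == 'NJ' else s for s in states]

print(states)
```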
Now we can see we have a much cleaner data set to work with.
In [16]:
df['State'].unique()
Out[16]:
array([u'GA', u'FL', u'NY', u'TX'], dtype=object)
At this point we may want to graph the data to check for any outliers or inconsistencies in the data. We will be using the plot() method of the dataframe.

As you can see from the graph below, it is not very conclusive and is probably a sign that we need to perform some more data preparation.
In [17]:
df['CustomerCount'].plot(figsize=(15,5));
If we take a look at the data, we begin to realize that there are multiple values for the same State, StatusDate, and Status combination. It is possible that this means the data you are working with is dirty/bad/inaccurate, but we will assume otherwise. We can assume this data set is a subset of a bigger data set, and if we simply add the values in the CustomerCount column per State, StatusDate, and Status, we will get the Total Customer Count per day.
In [18]:
sortdf = df[df['State']=='NY'].sort(axis=0)
sortdf.head(10)
Out[18]:
            State  Status  CustomerCount
StatusDate
2009-01-19     NY       1            522
2009-02-23     NY       1            710
2009-03-09     NY       1            992
2009-03-16     NY       1            355
2009-03-23     NY       1            728
2009-03-30     NY       1            863
2009-04-13     NY       1            520
2009-04-20     NY       1            820
2009-04-20     NY       1            937
2009-04-27     NY       1            447
Our task is now to create a new dataframe that compresses the data so we have daily customer counts per State and StatusDate. We can ignore the Status column since all the values in this column are equal to 1. To accomplish this we will use the dataframe's groupby and sum() functions.

Note that we had to use reset_index. If we did not, we would not have been able to group by both the State and the StatusDate, since the groupby function expects only columns as inputs. The reset_index function will bring the index StatusDate back to a column in the dataframe.
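The group-and-sum step can be sketched with a plain dictionary keyed on (State, StatusDate) — a stdlib analogue of groupby(...).sum(), shown on a few made-up rows:

```python
rows = [('NY', '2009-04-20', 820),
        ('NY', '2009-04-20', 937),
        ('FL', '2009-01-12', 901)]

daily = {}
for state, date, count in rows:
    key = (state, date)
    # accumulate customer counts per (State, StatusDate) pair
    daily[key] = daily.get(key, 0) + count

print(daily)
```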
In [19]:
# Group by State and StatusDate
Daily = df.reset_index().groupby(['State','StatusDate']).sum()
Daily.head()
Out[19]:
                   Status  CustomerCount
State StatusDate
FL    2009-01-12        1            901
      2009-02-02        1            653
      2009-03-23        1            752
      2009-04-06        2           1086
      2009-06-08        1            649

The State and StatusDate columns are automatically placed in the index of the Daily dataframe. You can think of the index as the primary key of a database table, but without the constraint of having unique values. Columns in the index, as you will see, allow us to easily select, plot, and perform calculations on the data.
Below we delete the Status column since it is all equal to one and no longer necessary.
In [20]:
del Daily['Status']
Daily.head()
Out[20]:
                  CustomerCount
State StatusDate
FL    2009-01-12            901
      2009-02-02            653
      2009-03-23            752
      2009-04-06           1086
      2009-06-08            649
In [21]:
# What is the index of the dataframe
Daily.index
Out[21]:
MultiIndex(levels=[[u'FL', u'GA', u'NY', u'TX'], [2009-01-05 00:00:00, 2009-0
1-12 00:00:00, 2009-01-19 00:00:00, 2009-02-02 00:00:00, 2009-02-23 00:00:00,
2009-03-09 00:00:00, 2009-03-16 00:00:00, 2009-03-23 00:00:00, 2009-03-30 00:
00:00, 2009-04-06 00:00:00, 2009-04-13 00:00:00, 2009-04-20 00:00:00, 2009-04
-27 00:00:00, 2009-05-04 00:00:00, 2009-05-11 00:00:00, 2009-05-18 00:00:00,
2009-05-25 00:00:00, 2009-06-08 00:00:00, 2009-06-22 00:00:00, 2009-07-06 00:
00:00, 2009-07-13 00:00:00, 2009-07-20 00:00:00, 2009-07-27 00:00:00, 2009-08
-10 00:00:00, 2009-08-17 00:00:00, 2009-08-24 00:00:00, 2009-08-31 00:00:00,
2009-09-07 00:00:00, 2009-09-14 00:00:00, 2009-09-21 00:00:00, 2009-09-28 00:
00:00, 2009-10-05 00:00:00, 2009-10-12 00:00:00, 2009-10-19 00:00:00, 2009-10
-26 00:00:00, 2009-11-02 00:00:00, 2009-11-23 00:00:00, 2009-11-30 00:00:00,
2009-12-07 00:00:00, 2009-12-14 00:00:00, 2010-01-04 00:00:00, 2010-01-11 00:
00:00, 2010-01-18 00:00:00, 2010-01-25 00:00:00, 2010-02-08 00:00:00, 2010-02
-15 00:00:00, 2010-02-22 00:00:00, 2010-03-01 00:00:00, 2010-03-08 00:00:00,
2010-03-15 00:00:00, 2010-04-05 00:00:00, 2010-04-12 00:00:00, 2010-04-26 00:00:00, 2010-05-03 00:00:00, 2010-05-10 00:00:00, 2010-05-17 00:00:00, 2010-05
-24 00:00:00, 2010-05-31 00:00:00, 2010-06-14 00:00:00, 2010-06-28 00:00:00,
2010-07-05 00:00:00, 2010-07-19 00:00:00, 2010-07-26 00:00:00, 2010-08-02 00:
00:00, 2010-08-09 00:00:00, 2010-08-16 00:00:00, 2010-08-30 00:00:00, 2010-09
-06 00:00:00, 2010-09-13 00:00:00, 2010-09-20 00:00:00, 2010-09-27 00:00:00,
2010-10-04 00:00:00, 2010-10-11 00:00:00, 2010-10-18 00:00:00, 2010-10-25 00:
00:00, 2010-11-01 00:00:00, 2010-11-08 00:00:00, 2010-11-15 00:00:00, 2010-11
-29 00:00:00, 2010-12-20 00:00:00, 2011-01-03 00:00:00, 2011-01-10 00:00:00,
2011-01-17 00:00:00, 2011-02-07 00:00:00, 2011-02-14 00:00:00, 2011-02-21 00:
00:00, 2011-02-28 00:00:00, 2011-03-07 00:00:00, 2011-03-14 00:00:00, 2011-03
-21 00:00:00, 2011-03-28 00:00:00, 2011-04-04 00:00:00, 2011-04-18 00:00:00,
2011-04-25 00:00:00, 2011-05-02 00:00:00, 2011-05-09 00:00:00, 2011-05-16 00:
00:00, 2011-05-23 00:00:00, 2011-05-30 00:00:00, 2011-06-06 00:00:00, ...]],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, ...], [1, 3, 7, 9, 17, 19, 20, 21, 23, 25, 27, 28, 29, 30, 31, 35, 3
8, 40, 41, 44, 45, 46, 47, 48, 49, 52, 54, 56, 57, 59, 60, 62, 66, 68, 69, 70
, 71, 72, 75, 76, 77, 78, 79, 85, 88, 89, 92, 96, 97, 99, 100, 101, 103, 104,
105, 108, 109, 110, 112, 114, 115, 117, 118, 119, 125, 126, 127, 128, 129, 131, 133, 134, 135, 136, 137, 140, 146, 150, 151, 152, 153, 157, 0, 3, 7, 22, 2
3, 24, 27, 28, 34, 37, 42, 47, 50, 55, 58, 66, 67, 69, ...]],
names=[u'State', u'StatusDate'])
In [22]:
# Select the State index
Daily.index.levels[0]
Out[22]:
Index([u'FL', u'GA', u'NY', u'TX'], dtype='object')

In [23]:
# Select the StatusDate index
Daily.index.levels[1]
Out[23]:
[2009-01-05, ..., 2012-12-10]
Length: 161, Freq: None, Timezone: None
Let's now plot the data per State.

As you can see, by breaking the graph up by the State column, we have a much clearer picture of what the data looks like. Can you spot any outliers?
In [24]:
Daily.loc['FL'].plot()
Daily.loc['GA'].plot()
Daily.loc['NY'].plot()
Daily.loc['TX'].plot();
We can also just plot the data from a specific date onward, like 2012. We can now clearly see that the data for these states is all over the place. Since the data consists of weekly customer counts, this much variability seems suspect. For this tutorial we will assume bad data and proceed.
In [25]:
Daily.loc['FL']['2012':].plot()
Daily.loc['GA']['2012':].plot()
Daily.loc['NY']['2012':].plot()
Daily.loc['TX']['2012':].plot();
We will assume that the customer count should remain relatively steady from month to month. Any data outside a specific range in a given month will be removed from the data set. The final result should have smooth graphs with no spikes.

StateYearMonth - Here we group by State, Year of StatusDate, and Month of StatusDate.
Daily['Outlier'] - A boolean (True or False) value letting us know if the value in the CustomerCount column is outside the acceptable range.

We will be using the attribute transform instead of apply. The reason is that transform keeps the shape (# of rows and columns) of the dataframe the same, while apply does not. Looking at the previous graphs, we can see they do not resemble a Gaussian distribution, which means we cannot use summary statistics like the mean and standard deviation. We use percentiles instead. Note that we run the risk of eliminating good data.
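To see the shape-preserving behavior of transform versus apply, here is a minimal standalone sketch (written in Python 3 syntax with a current pandas, unlike the Python 2 notebook above; the group labels and values are made up for illustration):

```python
import pandas as pd

# A tiny frame with two hypothetical groups
df = pd.DataFrame({'grp': ['a', 'a', 'b', 'b'],
                   'val': [1, 2, 10, 20]})

g = df.groupby('grp')['val']

# transform broadcasts each group's result back to every row,
# so the output has the same length as the input frame
same_shape = g.transform(lambda x: x.quantile(q=.75))

# apply collapses each group to a single value instead
collapsed = g.apply(lambda x: x.quantile(q=.75))

print(len(df), len(same_shape), len(collapsed))  # 4 4 2
```

Because transform preserves the row count, its result can be assigned straight back as new columns (Lower, Upper) on the original frame, which is exactly what the next cell relies on.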
In [26]:
-
8/20/2019 1+Python+Class+Powerpoint+Outline
101/136
28
# Calculate Outliers
StateYearMonth = Daily.groupby([Daily.index.get_level_values(0),
                                Daily.index.get_level_values(1).year,
                                Daily.index.get_level_values(1).month])
# Note: a textbook IQR fence would use 1.5*(q.75 - q.25); the parenthesization
# below reproduces the original notebook as written.
Daily['Lower'] = StateYearMonth['CustomerCount'].transform(lambda x: x.quantile(q=.25) - (1.5*x.quantile(q=.75)-x.quantile(q=.25)))
Daily['Upper'] = StateYearMonth['CustomerCount'].transform(lambda x: x.quantile(q=.75) + (1.5*x.quantile(q=.75)-x.quantile(q=.25)))
Daily['Outlier'] = (Daily['CustomerCount'] < Daily['Lower']) | (Daily['CustomerCount'] > Daily['Upper'])

# Remove Outliers
Daily = Daily[Daily['Outlier'] == False]
The dataframe named Daily will hold customer counts that have been aggregated per day. The original data (df) had multiple records per day. We are left with a data set that is indexed by both the State and the StatusDate. The Outlier column should be equal to False, signifying that the record is not an outlier.
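The removal step above is ordinary boolean masking on a (State, StatusDate) MultiIndex. Here is a minimal sketch with hypothetical values showing the same pattern end to end (Python 3 syntax, current pandas):

```python
import pandas as pd

# Hypothetical two-level index like Daily's (State, StatusDate)
idx = pd.MultiIndex.from_tuples(
    [('FL', '2009-01-12'), ('FL', '2009-02-02'), ('GA', '2009-01-12')],
    names=['State', 'StatusDate'])
daily = pd.DataFrame({'CustomerCount': [901, 9999, 653],
                      'Outlier': [False, True, False]}, index=idx)

# Keep only the rows whose Outlier flag is False
clean = daily[daily['Outlier'] == False]
print(len(clean))  # 2

# The State level can still be used for selection afterwards
print(clean.loc['FL']['CustomerCount'].tolist())  # [901]
```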
In [27]:
Daily.head()
Out[27]:
                  CustomerCount  Lower  Upper   Outlier
State StatusDate
FL    2009-01-12  901            450.5  1351.5  False
      2009-02-02  653            326.5  979.5   False
      2009-03-23  752            376.0  1128.0  False
      2009-04-06  1086           543.0  1629.0  False
      2009-06-08  649            324.5  973.5   False
We create a separate dataframe named ALL which groups the Daily dataframe by StatusDate. We are essentially getting rid of the State column. The Max column represents the maximum customer count per month and is used to smooth out the graph.
In [28]:
# Combine all markets
# Get the max customer count by Date
ALL = pd.DataFrame(Daily['CustomerCount'].groupby(Daily.index.get_level_values(1)).sum())
ALL.columns = ['CustomerCount'] # rename column
# Group by Year and Month
YearMonth = ALL.groupby([lambda x: x.year, lambda x: x.month])
# What is the max customer count per Year and Month
ALL['Max'] = YearMonth['CustomerCount'].transform(lambda x: x.max())
ALL.head()
Out[28]:
            CustomerCount  Max
StatusDate
2009-01-05  877            901
2009-01-12  901            901
2009-01-19  522            901
2009-02-02  953            953
2009-02-23  710            953

As you can see from the ALL dataframe above, in the month of January 2009 the maximum customer count was 901. If we had used apply, we would have got a dataframe with (Year and Month) as the index and just a Max column with the value of 901.
There was also interest in gauging whether the current customer counts were reaching certain goals the company had established. The task here is to visually show if the current customer counts are meeting the goals listed below. We will call the goals BHAG (Big Hairy Annual Goal).
12/31/2011 - 1,000 customers
12/31/2012 - 2,000 customers
12/31/2013 - 3,000 customers
We will be using the date_range function to create our dates.
Definition: date_range(start=None, end=None, periods=None, freq='D', tz=None, normalize=False, name=None, closed=None)
Docstring: Return a fixed frequency datetime index, with day (calendar) as the default frequency
By choosing the frequency to be 'A', or annual, we will be able to get the three target dates from above.
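As a quick standalone sketch of that call (Python 3 syntax; note that very recent pandas releases renamed the year-end alias from 'A' to 'YE', so the sketch tries both):

```python
import pandas as pd

# Year-end dates between the two bounds, one per year
try:
    idx = pd.date_range(start='12/31/2011', end='12/31/2013', freq='A')
except ValueError:
    # Newer pandas spells the year-end frequency alias 'YE'
    idx = pd.date_range(start='12/31/2011', end='12/31/2013', freq='YE')

print(len(idx))   # 3
print(idx[0])     # the first target date, 2011-12-31
```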
In [29]:
date_range?
Object `date_range` not found.
In [30]:
# Create the BHAG dataframe
data = [1000,2000,3000]
idx = pd.date_range(start='12/31/2011', end='12/31/2013', freq='A')
BHAG = pd.DataFrame(data, index=idx, columns=['BHAG'])
BHAG
Out[30]:
BHAG
2011-12-31  1000
2012-12-31  2000
2013-12-31  3000
Combining dataframes, as we learned in a previous lesson, is made simple using the concat function. Remember, when we choose axis=0 we are appending row-wise.
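A minimal sketch of that row-wise concatenation with hypothetical values (Python 3 syntax; note the notebook's combined.sort(axis=0) call is the old pandas 0.15 API, spelled sort_index() in current pandas). Columns present in only one frame are filled with NaN, exactly as BHAG and CustomerCount are in the output below:

```python
import pandas as pd

# Hypothetical weekly counts and a single goal row
a = pd.DataFrame({'CustomerCount': [100, 200]},
                 index=pd.to_datetime(['2012-01-02', '2012-01-09']))
b = pd.DataFrame({'BHAG': [2000]},
                 index=pd.to_datetime(['2012-12-31']))

# axis=0 appends row-wise; the union of the columns is kept
combined = pd.concat([a, b], axis=0).sort_index()
print(combined.shape)                    # (3, 2)
print(combined['BHAG'].isnull().sum())   # 2 rows have no goal value
```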
In [31]:
# Combine the BHAG and the ALL data set
combined = pd.concat([ALL,BHAG], axis=0)
combined = combined.sort(axis=0)
combined.tail()
Out[31]:
BHAG CustomerCount Max
2012-11-19  NaN   136   1115
2012-11-26  NaN   1115  1115
2012-12-10  NaN   1269  1269
2012-12-31  2000  NaN   NaN
2013-12-31  3000  NaN   NaN
In [32]:
fig, axes = plt.subplots(figsize=(12, 7))
combined['BHAG'].fillna(method='pad').plot(color='green', label='BHAG')
combined['Max'].plot(color='blue', label='All Markets')
plt.legend(loc='best');
There was also a need to forecast next year's customer count, and we can do this in a couple of simple steps. We will first group the combined dataframe by Year and take the maximum customer count for that year. This will give us one row per Year.
In [33]:
# Group by Year and then get the max value per year
Year = combined.groupby(lambda x: x.year).max()
Year
Out[33]:
      BHAG  CustomerCount  Max
2009  NaN   2452           2452
2010  NaN   2065           2065
2011  1000  2711           2711
2012  2000  2061           2061
2013  3000  NaN            NaN
In [34]:
# Add a column representing the percent change per year
Year['YR_PCT_Change'] = Year['Max'].pct_change(periods=1)
Year
Out[34]:
      BHAG  CustomerCount  Max   YR_PCT_Change
2009  NaN   2452           2452  NaN
2010  NaN   2065           2065  -0.157830
2011  1000  2711           2711  0.312833
2012  2000  2061           2061  -0.239764
2013  3000  NaN            NaN   NaN

To get next year's end-of-year customer count we will assume our current growth rate remains constant. We then increase this year's customer count by that amount, and that will be our forecast for next year.
In [35]:
(1 + Year.ix[2012,'YR_PCT_Change']) * Year.ix[2012,'Max']
Out[35]:
1566.8465510881595
Present Data

Create individual graphs per State.
In [36]:
# First Graph
ALL['Max'].plot(figsize=(10, 5))
plt.title('ALL Markets');
# Last four Graphs
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(20, 10))
fig.subplots_adjust(hspace=1.0) ## Create space between plots
Daily.loc['FL']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[0,0])
Daily.loc['GA']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[0,1])
Daily.loc['TX']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[1,0])
Daily.loc['NY']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[1,1])
# Add titles
axes[0,0].set_title('Florida')
axes[0,1].set_title('Georgia')
axes[1,0].set_title('Texas')
axes[1,1].set_title('North East');
Lesson 4

In this lesson we're going to go back to the basics. We will be working with a small data set so that you can easily understand what I am trying to explain. We will be adding columns, deleting columns, and slicing the data many different ways. Enjoy!
In [1]:
# Import libraries
import pandas as pd
import sys
In [2]:
print 'Python version ' + sys.version
print 'Pandas version: ' + pd.__version__
Python version 2.7.5 |Anaconda 2.1.0 (64-bit)| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]
Pandas version: 0.15.2
In [3]:
# Our small data set
d = [0,1,2,3,4,5,6,7,8,9]
# Create dataframe
df = pd.DataFrame(d)
df
Out[3]:
   0
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
In [4]:
# Lets change the name of the column
df.columns = ['Rev']
df
Out[4]:
   Rev
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
In [5]:
# Lets add a column
df['NewCol'] = 5
df
Out[5]:
   Rev  NewCol
0  0    5
1  1    5
2  2    5
3  3    5
4  4    5
5  5    5
6  6    5
7  7    5
8  8    5
9  9    5
In [6]:
# Lets modify our new column
df['NewCol'] = df['NewCol'] + 1
df
Out[6]:
   Rev  NewCol
0  0    6
1  1    6
2  2    6
3  3    6
4  4    6
5  5    6
6  6    6
7  7    6
8  8    6
9  9    6
In [7]:
# We can delete columns
del df['NewCol']
df
Out[7]:
   Rev
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
In [8]:
# Lets add a couple of columns
df['test'] = 3
df['col'] = df['Rev']
df
Out[8]:
   Rev  test  col
0  0    3     0
1  1    3     1
2  2    3     2
3  3    3     3
4  4    3     4
5  5    3     5
6  6    3     6
7  7    3     7
8  8    3     8
9  9    3     9
In [9]:
# If we wanted, we could change the name of the i