1+Python+Class+Powerpoint+Outline

  • 8/20/2019 1+Python+Class+Powerpoint+Outline

    As an example, here is an implementation of the classic quicksort algorithm in

    Python:

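    def quicksort(arr):
        if len(arr) <= 1:
            return arr
        pivot = arr[len(arr) // 2]
        left = [x for x in arr if x < pivot]
        middle = [x for x in arr if x == pivot]
        right = [x for x in arr if x > pivot]
        return quicksort(left) + middle + quicksort(right)

    print quicksort([3,6,8,10,1,2,1])
    # Prints "[1, 1, 2, 3, 6, 8, 10]"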

    Numbers: Integers and floats work as you would expect from other languages:

    x = 3

    print type(x) # Prints "<type 'int'>"

    print x # Prints "3"

    print x + 1 # Addition; prints "4"

    print x - 1 # Subtraction; prints "2"

    print x * 2 # Multiplication; prints "6"

    print x ** 2 # Exponentiation; prints "9"

    x += 1

    print x # Prints "4"

    x *= 2

    print x # Prints "8"

    y = 2.5

    print type(y) # Prints "<type 'float'>"

    print y, y + 1, y * 2, y ** 2 # Prints "2.5 3.5 5.0 6.25"

    Booleans:

    t = True

    f = False

    print type(t) # Prints "<type 'bool'>"

    print t and f # Logical AND; prints "False"

    print t or f # Logical OR; prints "True"

    print not t # Logical NOT; prints "False"

    print t != f # Logical XOR; prints "True"

    Strings:

    hello = 'hello' # String literals can use single quotes

    world = "world" # or double quotes; it does not matter.

    print hello # Prints "hello"

    print len(hello) # String length; prints "5"

    hw = hello + ' ' + world # String concatenation

    print hw # prints "hello world"

    hw12 = '%s %s %d' % (hello, world, 12) # sprintf style string formatting

    print hw12 # prints "hello world 12"
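    The same string can also be built with the str.format method, a standard alternative to the % operator. The following is a minimal sketch; it is written with print(...) calls so that it also runs on Python 3, which is an assumption on top of the Python 2 print statements used in these slides. The names hw12_format and pi_text are illustrative only.

```python
hello = 'hello'
world = 'world'

# str.format fills each {} placeholder with the matching argument, in order.
hw12_format = '{} {} {}'.format(hello, world, 12)
print(hw12_format)  # prints "hello world 12"

# A format specifier follows a colon, e.g. two digits after the decimal point.
pi_text = '{:.2f}'.format(3.14159)
print(pi_text)  # prints "3.14"
```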

    String objects have a bunch of useful methods; for example:

    s = "hello"

    print s.capitalize() # Capitalize a string; prints "Hello"

    print s.upper() # Convert a string to uppercase; prints "HELLO"

    print s.rjust(7) # Right-justify a string, padding with spaces; prints "  hello"

    print s.center(7) # Center a string, padding with spaces; prints " hello "

    print s.replace('l', '(ell)') # Replace all instances of one substring with another;

    # prints "he(ell)(ell)o"

    print ' world '.strip() # Strip leading and trailing whitespace; prints "world"

    xs = [3, 1, 2] # Create a list

    print xs, xs[2] # Prints "[3, 1, 2] 2"

    print xs[-1] # Negative indices count from the end of the list; prints "2"

    xs[2] = 'foo' # Lists can contain elements of different types

    print xs # Prints "[3, 1, 'foo']"

    xs.append('bar') # Add a new element to the end of the list

    print xs # Prints "[3, 1, 'foo', 'bar']"

    x = xs.pop() # Remove and return the last element of the list

    print x, xs # Prints "bar [3, 1, 'foo']"

    nums = range(5) # range is a built-in function that creates a list of integers

    print nums # Prints "[0, 1, 2, 3, 4]"

    print nums[2:4] # Get a slice from index 2 to 4 (exclusive); prints "[2, 3]"

    print nums[2:] # Get a slice from index 2 to the end; prints "[2, 3, 4]"

    print nums[:2] # Get a slice from the start to index 2 (exclusive); prints "[0, 1]"

    print nums[:] # Get a slice of the whole list; prints "[0, 1, 2, 3, 4]"

    print nums[:-1] # Slice indices can be negative; prints "[0, 1, 2, 3]"

    nums[2:4] = [8, 9] # Assign a new sublist to a slice

    print nums # Prints "[0, 1, 8, 9, 4]"

    We will see slicing again in the context of numpy arrays.
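    Slices also accept an optional third value, the step. The sketch below illustrates this with a hypothetical ten-element list; list(range(10)) and the print(...) calls are used so that it runs on Python 3 as well as Python 2.

```python
nums = list(range(10))      # [0, 1, ..., 9]

evens = nums[::2]           # every second element, starting at index 0
print(evens)                # prints "[0, 2, 4, 6, 8]"

reversed_nums = nums[::-1]  # a negative step walks the list backwards
print(reversed_nums)        # prints "[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]"

middle = nums[2:8:3]        # start at 2, stop before 8, step 3
print(middle)               # prints "[2, 5]"
```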

    animals = ['cat', 'dog', 'monkey']

    for animal in animals:
        print animal

    # Prints "cat", "dog", "monkey", each on its own line.

    If you want access to the index of each element within the body of a loop, use the built-in

    enumerate function:

    animals = ['cat', 'dog', 'monkey']

    for idx, animal in enumerate(animals):
        print '#%d: %s' % (idx + 1, animal)

    # Prints "#1: cat", "#2: dog", "#3: monkey", each on its own line

    As a simple example, consider the following code that computes square numbers:

    nums = [0, 1, 2, 3, 4]

    squares = []

    for x in nums:
        squares.append(x ** 2)

    print squares # Prints [0, 1, 4, 9, 16]

    You can make this code simpler using a list comprehension:

    nums = [0, 1, 2, 3, 4]

    squares = [x ** 2 for x in nums]

    print squares # Prints [0, 1, 4, 9, 16]

    List comprehensions can also contain conditions:

    nums = [0, 1, 2, 3, 4]

    even_squares = [x ** 2 for x in nums if x % 2 == 0]

    print even_squares # Prints "[0, 4, 16]"

    You can use it like this:

    d = {'cat': 'cute', 'dog': 'furry'} # Create a new dictionary with some data

    print d['cat'] # Get an entry from a dictionary; prints "cute"

    print 'cat' in d # Check if a dictionary has a given key; prints "True"

    d['fish'] = 'wet' # Set an entry in a dictionary

    print d['fish'] # Prints "wet"

    # print d['monkey'] # KeyError: 'monkey' not a key of d

    print d.get('monkey', 'N/A') # Get an element with a default; prints "N/A"

    print d.get('fish', 'N/A') # Get an element with a default; prints "wet"

    del d['fish'] # Remove an element from a dictionary

    print d.get('fish', 'N/A') # "fish" is no longer a key; prints "N/A"
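    One common use of get with a default is counting occurrences without checking for the key first. This is a small sketch with a made-up word list (words and counts are illustrative names); it uses print(...) so it runs on Python 3 too.

```python
words = ['cat', 'dog', 'cat', 'fish', 'cat']

counts = {}
for word in words:
    # get returns 0 the first time a word is seen, so no KeyError is raised.
    counts[word] = counts.get(word, 0) + 1

print(counts['cat'])  # prints "3"
print(counts['dog'])  # prints "1"
```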

    Loops: It is easy to iterate over the keys in a dictionary:

    d = {'person': 2, 'cat': 4, 'spider': 8}

    for animal in d:
        legs = d[animal]
        print 'A %s has %d legs' % (animal, legs)

    # Prints "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs"

    If you want access to keys and their corresponding values, use the iteritems method:

    d = {'person': 2, 'cat': 4, 'spider': 8}

    for animal, legs in d.iteritems():
        print 'A %s has %d legs' % (animal, legs)

    # Prints "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs"
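    Note that iteritems exists only in Python 2. The items method is available in both Python 2 and Python 3, so a version-neutral form of the loop might look like the sketch below. The sorted(...) call is an addition here, used only to make the output order predictable.

```python
d = {'person': 2, 'cat': 4, 'spider': 8}

# items() yields (key, value) pairs; sorted() orders them by key.
lines = []
for animal, legs in sorted(d.items()):
    lines.append('A %s has %d legs' % (animal, legs))

for line in lines:
    print(line)
# Prints "A cat has 4 legs", "A person has 2 legs", "A spider has 8 legs"
```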

    Dictionary comprehensions: These are similar to list comprehensions, but allow you to easily

    construct dictionaries. For example:

    nums = [0, 1, 2, 3, 4]

    even_num_to_square = {x: x ** 2 for x in nums if x % 2 == 0}

    print even_num_to_square # Prints "{0: 0, 2: 4, 4: 16}"

    As a simple example, consider the following:

    animals = {'cat', 'dog'}

    print 'cat' in animals # Check if an element is in a set; prints "True"

    print 'fish' in animals # prints "False"

    animals.add('fish') # Add an element to a set

    print 'fish' in animals # Prints "True"

    print len(animals) # Number of elements in a set; prints "3"

    animals.add('cat') # Adding an element that is already in the set does nothing

    print len(animals) # Prints "3"

    animals.remove('cat') # Remove an element from a set

    print len(animals) # Prints "2"

    As usual, everything you want to know about sets can be found in the documentation.

    Loops: Iterating over a set has the same syntax as iterating over a list; however since sets are

    unordered, you cannot make assumptions about the order in which you visit the elements of the

    set:

    animals = {'cat', 'dog', 'fish'}

    for idx, animal in enumerate(animals):
        print '#%d: %s' % (idx + 1, animal)

    # Prints "#1: fish", "#2: dog", "#3: cat"
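    When a predictable order is needed, one option is to iterate over sorted(animals) instead of the set itself. A minimal sketch, using print(...) so it runs on Python 3 as well; the name ordered is illustrative:

```python
animals = {'cat', 'dog', 'fish'}

# sorted() returns a list, so the visiting order is now well defined.
ordered = sorted(animals)
for idx, animal in enumerate(ordered):
    print('#%d: %s' % (idx + 1, animal))
# Prints "#1: cat", "#2: dog", "#3: fish", each on its own line
```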

    Set comprehensions: Like lists and dictionaries, we can easily construct sets using set

    comprehensions:

    from math import sqrt

    nums = {int(sqrt(x)) for x in range(30)}

    print nums # Prints "set([0, 1, 2, 3, 4, 5])"

    Here is a trivial example:

    d = {(x, x + 1): x for x in range(10)} # Create a dictionary with tuple keys

    t = (5, 6) # Create a tuple

    print type(t) # Prints "<type 'tuple'>"

    print d[t] # Prints "5"

    print d[(1, 2)] # Prints "1"
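    The example above works because tuples are hashable. A list in the same position is rejected, as the following sketch shows; list_key_ok is a hypothetical flag used only for illustration.

```python
d = {(1, 2): 'tuple key'}   # tuples are hashable, so this works

try:
    d[[1, 2]] = 'list key'  # lists are mutable and therefore unhashable
    list_key_ok = True
except TypeError:
    list_key_ok = False

print(list_key_ok)  # prints "False"
```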

    For example:

    def sign(x):
        if x > 0:
            return 'positive'
        elif x < 0:
            return 'negative'
        else:
            return 'zero'

    for x in [-1, 0, 1]:
        print sign(x)
    # Prints "negative", "zero", "positive"

    We will often define functions to take optional keyword arguments, like this:

    def hello(name, loud=False):
        if loud:
            print 'HELLO, %s!' % name.upper()
        else:
            print 'Hello, %s' % name

    hello('Bob') # Prints "Hello, Bob"
    hello('Fred', loud=True) # Prints "HELLO, FRED!"

    class Greeter:

        # Constructor
        def __init__(self, name):
            self.name = name # Create an instance variable

        # Instance method
        def greet(self, loud=False):
            if loud:
                print 'HELLO, %s!' % self.name.upper()
            else:
                print 'Hello, %s' % self.name

    g = Greeter('Fred') # Construct an instance of the Greeter class
    g.greet() # Call an instance method; prints "Hello, Fred"
    g.greet(loud=True) # Call an instance method; prints "HELLO, FRED!"

    We can initialize numpy arrays from nested Python lists, and access elements using square

    brackets:

    import numpy as np

    a = np.array([1, 2, 3]) # Create a rank 1 array

    print type(a) # Prints "<type 'numpy.ndarray'>"

    print a.shape # Prints "(3,)"

    print a[0], a[1], a[2] # Prints "1 2 3"

    a[0] = 5 # Change an element of the array

    print a # Prints "[5 2 3]"

    b = np.array([[1,2,3],[4,5,6]]) # Create a rank 2 array

    print b.shape # Prints "(2, 3)"

    print b[0, 0], b[0, 1], b[1, 0] # Prints "1 2 4"

    Numpy also provides many functions to create arrays:

    import numpy as np

    a = np.zeros((2,2)) # Create an array of all zeros

    print a # Prints "[[ 0. 0.]

    # [ 0. 0.]]"

    b = np.ones((1,2)) # Create an array of all ones

    print b # Prints "[[ 1. 1.]]"

    c = np.full((2,2), 7) # Create a constant array

    print c # Prints "[[7 7]

    # [7 7]]"

    d = np.eye(2) # Create a 2x2 identity matrix

    print d # Prints "[[ 1. 0.]

    # [ 0. 1.]]"

    e = np.random.random((2,2)) # Create an array filled with random values

    print e # Might print "[[ 0.91940167 0.08143941]

    # [ 0.68744134 0.87236687]]"

    Since arrays may be multidimensional, you must specify a slice for each dimension of the array:

    import numpy as np

    # Create the following rank 2 array with shape (3, 4)

    # [[ 1 2 3 4]

    # [ 5 6 7 8]

    # [ 9 10 11 12]]

    a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

    # Use slicing to pull out the subarray consisting of the first 2 rows

    # and columns 1 and 2; b is the following array of shape (2, 2):

    # [[2 3]

    # [6 7]]

    b = a[:2, 1:3]

    # A slice of an array is a view into the same data, so modifying it

    # will modify the original array.

    print a[0, 1] # Prints "2"

    b[0, 0] = 77 # b[0, 0] is the same piece of data as a[0, 1]

    print a[0, 1] # Prints "77"
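    Because a slice is a view, an independent sub-array has to be requested explicitly. One way is the copy method, sketched here with print(...) calls so the snippet also runs on Python 3:

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

b = a[:2, 1:3].copy()  # copy() allocates new memory for the sub-array
b[0, 0] = 77           # changes only the copy

print(a[0, 1])  # prints "2"; the original array is untouched
print(b[0, 0])  # prints "77"
```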

    You can also mix integer indexing with slice indexing. However, doing so will yield an array of

    lower rank than the original array. Note that this is quite different from the way that MATLAB

    handles array slicing:

    import numpy as np

    # Create the following rank 2 array with shape (3, 4)

    # [[ 1 2 3 4]

    # [ 5 6 7 8]

    # [ 9 10 11 12]]

    a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

    # Two ways of accessing the data in the middle row of the array.

    # Mixing integer indexing with slices yields an array of lower rank,

    # while using only slices yields an array of the same rank as the

    # original array:

    row_r1 = a[1, :] # Rank 1 view of the second row of a

    row_r2 = a[1:2, :] # Rank 2 view of the second row of a

    print row_r1, row_r1.shape # Prints "[5 6 7 8] (4,)"

    print row_r2, row_r2.shape # Prints "[[5 6 7 8]] (1, 4)"

    # We can make the same distinction when accessing columns of an array:

    col_r1 = a[:, 1]

    col_r2 = a[:, 1:2]

    print col_r1, col_r1.shape # Prints "[ 2 6 10] (3,)"

    print col_r2, col_r2.shape # Prints "[[ 2]

    # [ 6]

    # [10]] (3, 1)"

    Here is an example:

    import numpy as np

    a = np.array([[1,2], [3, 4], [5, 6]])

    # An example of integer array indexing.

    # The returned array will have shape (3,):

    print a[[0, 1, 2], [0, 1, 0]] # Prints "[1 4 5]"

    # The above example of integer array indexing is equivalent to this:

    print np.array([a[0, 0], a[1, 1], a[2, 0]]) # Prints "[1 4 5]"

    # When using integer array indexing, you can reuse the same

    # element from the source array:

    print a[[0, 0], [1, 1]] # Prints "[2 2]"

    # Equivalent to the previous integer array indexing example

    print np.array([a[0, 1], a[0, 1]]) # Prints "[2 2]"
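    A frequent use of integer array indexing is picking (or changing) one element from each row of a matrix. The sketch below assumes a hypothetical 4x3 matrix and a column-index array named cols; it uses print(...) so it also runs on Python 3.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# One column index per row of a.
cols = np.array([0, 2, 0, 1])

# np.arange(4) supplies the row indices 0..3, so row i is paired with cols[i].
picked = a[np.arange(4), cols]
print(picked)  # prints "[ 1  6  7 11]"

# The same index arrays can be used to modify those elements in place.
a[np.arange(4), cols] += 10
print(a)
```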

    Here is an example:

    import numpy as np

    a = np.array([[1,2], [3, 4], [5, 6]])

    bool_idx = (a > 2) # Find the elements of a that are bigger than 2;

    # this returns a numpy array of Booleans of the same

    # shape as a, where each slot of bool_idx tells

    # whether that element of a is > 2.

    print bool_idx # Prints "[[False False]

    # [ True True]

    # [ True True]]"

    # We use boolean array indexing to construct a rank 1 array

    # consisting of the elements of a corresponding to the True values

    # of bool_idx

    print a[bool_idx] # Prints "[3 4 5 6]"

    # We can do all of the above in a single concise statement:

    print a[a > 2] # Prints "[3 4 5 6]"
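    A boolean mask can also appear on the left-hand side of an assignment, and summing a mask counts the matching elements. A small sketch under the same setup (clipped and count are illustrative names; print(...) is used for Python 3 compatibility):

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

clipped = a.copy()          # work on a copy so that a stays unchanged
clipped[clipped > 2] = 0    # assign to every element where the mask is True
print(clipped)

count = np.sum(a > 2)       # True counts as 1, so this counts the matches
print(count)                # prints "4"
```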

    Here is an example:

    import numpy as np

    x = np.array([1, 2]) # Let numpy choose the datatype

    print x.dtype # Prints "int64"

    x = np.array([1.0, 2.0]) # Let numpy choose the datatype

    print x.dtype # Prints "float64"

    x = np.array([1, 2], dtype=np.int64) # Force a particular datatype

    print x.dtype # Prints "int64"
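    An existing array can be converted to another datatype with astype, which returns a new array. The following sketch (using print(...) so it runs on Python 3 as well) shows that a float-to-integer conversion drops the fractional part:

```python
import numpy as np

x = np.array([1.7, 2.2, -0.5])

ints = x.astype(np.int64)   # conversion truncates toward zero
print(ints)                 # prints "[1 2 0]"
print(ints.dtype)           # prints "int64"
print(x.dtype)              # prints "float64"; x itself is unchanged
```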

    import numpy as np

    x = np.array([[1,2],[3,4]], dtype=np.float64)

    y = np.array([[5,6],[7,8]], dtype=np.float64)

    # Elementwise sum; both produce the array

    # [[ 6.0 8.0]

    # [10.0 12.0]]

    print x + y

    print np.add(x, y)

    # Elementwise difference; both produce the array

    # [[-4.0 -4.0]

    # [-4.0 -4.0]]

    print x - y

    print np.subtract(x, y)

    # Elementwise product; both produce the array

    # [[ 5.0 12.0]

    # [21.0 32.0]]

    print x * y

    print np.multiply(x, y)

    # Elementwise division; both produce the array

    # [[ 0.2 0.33333333]

    # [ 0.42857143 0.5 ]]

    print x / y

    print np.divide(x, y)

    # Elementwise square root; produces the array

    # [[ 1. 1.41421356]

    # [ 1.73205081 2. ]]

    print np.sqrt(x)

    import numpy as np

    x = np.array([[1,2],[3,4]])

    y = np.array([[5,6],[7,8]])

    v = np.array([9,10])

    w = np.array([11, 12])

    # Inner product of vectors; both produce 219

    print v.dot(w)

    print np.dot(v, w)

    # Matrix / vector product; both produce the rank 1 array [29 67]

    print x.dot(v)

    print np.dot(x, v)

    # Matrix / matrix product; both produce the rank 2 array

    # [[19 22]

    # [43 50]]

    print x.dot(y)

    print np.dot(x, y)
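    On Python 3.5 and later, the @ operator offers a third spelling of the same products. This sketch therefore assumes Python 3 (it is a syntax error under the Python 2 used elsewhere in these slides):

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
v = np.array([9, 10])
w = np.array([11, 12])

inner = v @ w       # same value as v.dot(w)
mat_vec = x @ v     # same values as x.dot(v)
mat_mat = x @ y     # same values as x.dot(y)

print(inner)        # prints "219"
print(mat_vec)      # prints "[29 67]"
print(mat_mat)
```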

    import numpy as np

    x = np.array([[1,2],[3,4]])

    print np.sum(x) # Compute sum of all elements; prints "10"

    print np.sum(x, axis=0) # Compute sum of each column; prints "[4 6]"

    print np.sum(x, axis=1) # Compute sum of each row; prints "[3 7]"

    import numpy as np

    x = np.array([[1,2], [3,4]])

    print x # Prints "[[1 2]

    # [3 4]]"

    print x.T # Prints "[[1 3]

    # [2 4]]"

    # Note that taking the transpose of a rank 1 array does nothing:

    v = np.array([1,2,3])

    print v # Prints "[1 2 3]"

    print v.T # Prints "[1 2 3]"
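    If an explicit row or column vector is needed, the rank 1 array first has to be given a second axis. Two common ways are reshape and np.newaxis, sketched below with print(...) calls for Python 3 compatibility; col and row are illustrative names.

```python
import numpy as np

v = np.array([1, 2, 3])   # shape (3,)

col = v.reshape(3, 1)     # explicit column vector, shape (3, 1)
row = v[np.newaxis, :]    # explicit row vector, shape (1, 3)

print(col.shape)      # prints "(3, 1)"
print(row.shape)      # prints "(1, 3)"
print(row.T.shape)    # prints "(3, 1)"; transposing a rank 2 array swaps its axes
```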

    import numpy as np

    # We will add the vector v to each row of the matrix x,

    # storing the result in the matrix y

    x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])

    v = np.array([1, 0, 1])

    y = np.empty_like(x) # Create an empty matrix with the same shape as x

    # Add the vector v to each row of the matrix x with an explicit loop

    for i in range(4):
        y[i, :] = x[i, :] + v

    # Now y is the following

    # [[ 2 2 4]

    # [ 5 5 7]

    # [ 8 8 10]

    # [11 11 13]]

    print y

    This works; however when the matrix x is very large, computing an explicit loop in Python could

    be slow. Note that adding the vector v to each row of the matrix x is equivalent to forming a

    matrix vv by stacking multiple copies of v vertically, then performing elementwise summation of x

    and vv. We could implement this approach like this:

    import numpy as np

    # We will add the vector v to each row of the matrix x,

    # storing the result in the matrix y

    x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])

    v = np.array([1, 0, 1])

    vv = np.tile(v, (4, 1)) # Stack 4 copies of v on top of each other

    print vv # Prints "[[1 0 1]

    # [1 0 1]

    # [1 0 1]

    # [1 0 1]]"

    y = x + vv # Add x and vv elementwise

    print y # Prints "[[ 2 2 4]

    # [ 5 5 7]

    # [ 8 8 10]

    # [11 11 13]]"

    Numpy broadcasting allows us to perform this computation without actually creating multiple

    copies of v. Consider this version, using broadcasting:

    import numpy as np

    # We will add the vector v to each row of the matrix x,

    # storing the result in the matrix y

    x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])

    v = np.array([1, 0, 1])

    y = x + v # Add v to each row of x using broadcasting

    print y # Prints "[[ 2 2 4]

    # [ 5 5 7]

    # [ 8 8 10]

    # [11 11 13]]"

    The line y = x + v works even though x has shape (4, 3) and v has shape (3,) due to

    broadcasting; this line works as if v actually had shape (4, 3), where each row was a

    copy of v, and the sum was performed elementwise.
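    The "as if" shape can be inspected directly with np.broadcast_to, which returns a read-only view of v with the broadcast shape rather than real copies. A brief sketch (print(...) is used so it runs on Python 3 too; vv and same are illustrative names):

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
v = np.array([1, 0, 1])

# broadcast_to returns a read-only view; no extra copies of v are stored.
vv = np.broadcast_to(v, x.shape)
print(vv.shape)   # prints "(4, 3)"

# Broadcasting gives the same result as explicitly tiling v.
same = np.array_equal(x + v, x + np.tile(v, (4, 1)))
print(same)       # prints "True"
```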

    There are currently more than 60 universal functions defined in numpy on one or more types,

    covering a wide variety of operations. Some of these ufuncs are called automatically on arrays

    when the relevant infix notation is used (e.g., add(a, b) is called internally when a + b is written

    and a or b is an ndarray). Nevertheless, you may still want to use the ufunc call in order to use the

    optional output argument(s) to place the output(s) in an object (or objects) of your choice.

    Recall that each ufunc operates element-by-element. Therefore, each ufunc will be described as if

    acting on a set of scalar inputs to return a set of scalar outputs.

    Math operations

    add(x1, x2[, out]) Add arguments element-wise.

    subtract(x1, x2[, out]) Subtract arguments, element-wise.

    multiply(x1, x2[, out]) Multiply arguments element-wise.

    divide(x1, x2[, out]) Divide arguments element-wise.

    logaddexp(x1, x2[, out]) Logarithm of the sum of exponentiations of the inputs.

    logaddexp2(x1, x2[, out]) Logarithm of the sum of exponentiations of the inputs in base-2.

    true_divide(x1, x2[, out]) Returns a true division of the inputs, element-wise.

    floor_divide(x1, x2[, out]) Return the largest integer smaller or equal to the division of the inputs.

    negative(x[, out]) Numerical negative, element-wise.

    power(x1, x2[, out]) First array elements raised to powers from second array, element-wise.

    remainder(x1, x2[, out]) Return element-wise remainder of division.

    mod(x1, x2[, out]) Return element-wise remainder of division.

    fmod(x1, x2[, out]) Return the element-wise remainder of division.

    absolute(x[, out]) Calculate the absolute value element-wise.

    rint(x[, out]) Round elements of the array to the nearest integer.

    sign(x[, out]) Returns an element-wise indication of the sign of a number.

    conj(x[, out]) Return the complex conjugate, element-wise.

    exp(x[, out]) Calculate the exponential of all elements in the input array.

    exp2(x[, out]) Calculate 2**p for all p in the input array.

    log(x[, out]) Natural logarithm, element-wise.

    log2(x[, out]) Base-2 logarithm of x.

    log10(x[, out]) Return the base 10 logarithm of the input array, element-wise.

    expm1(x[, out]) Calculate exp(x) - 1 for all elements in the array.

    log1p(x[, out]) Return the natural logarithm of one plus the input array, element-wise.

    sqrt(x[, out]) Return the positive square-root of an array, element-wise.

    square(x[, out]) Return the element-wise square of the input.

    reciprocal(x[, out]) Return the reciprocal of the argument, element-wise.

    ones_like(a[, dtype, order, subok]) Return an array of ones with the same shape and type as a given array.

    Tip

    The optional output arguments can be used to help you save memory for large

    calculations. If your arrays are large, complicated expressions can take longer than

    absolutely necessary due to the creation and (later) destruction of temporary

    calculation spaces. For example, the expression G = A * B + C is equivalent to

    T1 = A * B; G = T1 + C; del T1. It will be more quickly executed as G = A * B; add(G, C, G),

    which is the same as G = A * B; G += C.
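    The tip can be tried out on small arrays. The sketch below uses made-up values for A, B and C and the keyword form out=G of the output argument; it only checks that both routes give the same values and does not measure speed. print(...) is used so it runs on Python 3 as well.

```python
import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([4.0, 5.0, 6.0])
C = np.array([0.5, 0.5, 0.5])

G_simple = A * B + C    # creates a temporary array for A * B

G = A * B               # one allocation for the product
np.add(G, C, out=G)     # adds C into G in place, with no further temporary

print(G)
print(np.array_equal(G, G_simple))  # prints "True"
```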

    Trigonometric functions

    All trigonometric functions use radians when an angle is called for. The ratio of degrees to radians

    is 180°/π.

    sin(x[, out]) Trigonometric sine, element-wise.

    cos(x[, out]) Cosine element-wise.

    tan(x[, out]) Compute tangent element-wise.

    arcsin(x[, out]) Inverse sine, element-wise.

    arccos(x[, out]) Trigonometric inverse cosine, element-wise.

    arctan(x[, out]) Trigonometric inverse tangent, element-wise.

    arctan2(x1, x2[, out]) Element-wise arc tangent of x1/x2 choosing the quadrant correctly.

    hypot(x1, x2[, out]) Given the “legs” of a right triangle, return its hypotenuse.

    sinh(x[, out]) Hyperbolic sine, element-wise.

    cosh(x[, out]) Hyperbolic cosine, element-wise.

    tanh(x[, out]) Compute hyperbolic tangent element-wise.

    arcsinh(x[, out]) Inverse hyperbolic sine element-wise.

    arccosh(x[, out]) Inverse hyperbolic cosine, element-wise.

    arctanh(x[, out]) Inverse hyperbolic tangent element-wise.

    deg2rad(x[, out]) Convert angles from degrees to radians.

    rad2deg(x[, out]) Convert angles from radians to degrees.
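    Since the trigonometric functions listed above expect radians, angles given in degrees need converting first. A short sketch using deg2rad with illustrative angles; np.allclose is used for the comparison because sin(π) is only approximately zero in floating point.

```python
import numpy as np

degrees = np.array([0.0, 90.0, 180.0])
radians = np.deg2rad(degrees)   # multiplies each angle by pi / 180

print(np.allclose(radians, [0.0, np.pi / 2, np.pi]))   # prints "True"
print(np.allclose(np.sin(radians), [0.0, 1.0, 0.0]))   # prints "True"
print(np.allclose(np.rad2deg(radians), degrees))       # prints "True"
```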

    Bit-twiddling functions

    These functions all require integer arguments and they manipulate the bit-pattern of those

    arguments.

    bitwise_and(x1, x2[, out]) Compute the bit-wise AND of two arrays element-wise.

    bitwise_or(x1, x2[, out]) Compute the bit-wise OR of two arrays element-wise.

    bitwise_xor(x1, x2[, out]) Compute the bit-wise XOR of two arrays element-wise.

    invert(x[, out]) Compute bit-wise inversion, or bit-wise NOT, element-wise.

    left_shift(x1, x2[, out]) Shift the bits of an integer to the left.

    right_shift(x1, x2[, out]) Shift the bits of an integer to the right.
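    As an illustration of the bit-twiddling functions, the following sketch applies them to two small integer arrays chosen so that the bit patterns are easy to follow. print(...) is used so it runs on Python 3 as well.

```python
import numpy as np

x1 = np.array([12, 10])   # binary 1100 and 1010
x2 = np.array([10, 3])    # binary 1010 and 0011

and_bits = np.bitwise_and(x1, x2)
or_bits = np.bitwise_or(x1, x2)
xor_bits = np.bitwise_xor(x1, x2)

print(and_bits)                 # prints "[8 2]"
print(or_bits)                  # prints "[14 11]"
print(xor_bits)                 # prints "[6 9]"
print(np.left_shift(x1, 1))     # prints "[24 20]"; same as multiplying by 2
print(np.right_shift(x1, 2))    # prints "[3 2]"; same as integer division by 4
```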

    Comparison functions

    greater(x1, x2[, out]) Return the truth value of (x1 > x2) element-wise.

    greater_equal(x1, x2[, out]) Return the truth value of (x1 >= x2) element-wise.

    less(x1, x2[, out]) Return the truth value of (x1 < x2) element-wise.

    less_equal(x1, x2[, out]) Return the truth value of (x1 <= x2) element-wise.

    not_equal(x1, x2[, out]) Return (x1 != x2) element-wise.

    equal(x1, x2[, out]) Return (x1 == x2) element-wise.

    logical_and(x1, x2[, out]) Compute the truth value of x1 AND x2 element-wise.

    logical_or(x1, x2[, out]) Compute the truth value of x1 OR x2 element-wise.

    logical_xor(x1, x2[, out]) Compute the truth value of x1 XOR x2, element-wise.

    logical_not(x[, out]) Compute the truth value of NOT x element-wise.
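    The comparison functions return boolean arrays, which the logical functions can then combine. A minimal sketch with illustrative inputs (print(...) is used so it runs on Python 3 too):

```python
import numpy as np

x1 = np.array([1, 5, 3])
x2 = np.array([2, 5, 1])

gt = np.greater(x1, x2)      # same as x1 > x2
le = np.less_equal(x1, x2)   # same as x1 <= x2
print(gt)                    # prints "[False False  True]"
print(le)                    # prints "[ True  True False]"

# logical_and combines two boolean arrays element by element.
both = np.logical_and(x1 >= 2, x2 >= 2)
print(both)                  # prints "[False  True False]"
```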

    Floating functions

    Recall that all of these functions work element-by-element over an array, returning an array

    output. The description details only a single operation.

    isreal(x) Returns a bool array, where True if input element is real.

    iscomplex(x) Returns a bool array, where True if input element is complex.

    isfinite(x[, out]) Test element-wise for finiteness (not infinity and not Not a Number).

    isinf(x[, out]) Test element-wise for positive or negative infinity.

    isnan(x[, out]) Test element-wise for NaN and return result as a boolean array.

    signbit(x[, out]) Returns element-wise True where signbit is set (less than zero).

    copysign(x1, x2[, out]) Change the sign of x1 to that of x2, element-wise.

    nextafter(x1, x2[, out]) Return the next floating-point value after x1 towards x2, element-wise.

    modf(x[, out1, out2]) Return the fractional and integral parts of an array, element-wise.

    ldexp(x1, x2[, out]) Returns x1 * 2**x2, element-wise.

    frexp(x[, out1, out2]) Decompose the elements of x into mantissa and twos exponent.

    fmod(x1, x2[, out]) Return the element-wise remainder of division.

    floor(x[, out]) Return the floor of the input, element-wise.

    ceil(x[, out]) Return the ceiling of the input, element-wise.

    trunc(x[, out]) Return the truncated value of the input, element-wise.
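    A few of the floating functions are sketched below on illustrative inputs. Note how floor, ceil and trunc differ for the negative value. print(...) is used so the snippet runs on Python 3 as well.

```python
import numpy as np

x = np.array([1.5, -2.25, np.inf, np.nan])

finite = np.isfinite(x)
nans = np.isnan(x)
print(finite)   # prints "[ True  True False False]"
print(nans)     # prints "[False False False  True]"

y = np.array([1.5, -2.25])
frac, whole = np.modf(y)   # fractional and integral parts, each keeping the sign
print(frac)
print(whole)

floors = np.floor(y)   # rounds toward negative infinity: 1.0 and -3.0
ceils = np.ceil(y)     # rounds toward positive infinity: 2.0 and -2.0
truncs = np.trunc(y)   # rounds toward zero: 1.0 and -2.0
print(floors)
print(ceils)
print(truncs)
```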

    Here are some applications of broadcasting:

    import numpy as np

    # Compute outer product of vectors

    v = np.array([1,2,3]) # v has shape (3,)

    w = np.array([4,5]) # w has shape (2,)

    # To compute an outer product, we first reshape v to be a column

    # vector of shape (3, 1); we can then broadcast it against w to yield

    # an output of shape (3, 2), which is the outer product of v and w:

    # [[ 4 5]

    # [ 8 10]

    # [12 15]]

    print np.reshape(v, (3, 1)) * w

    # Add a vector to each row of a matrix

    x = np.array([[1,2,3], [4,5,6]])

    # x has shape (2, 3) and v has shape (3,) so they broadcast to (2, 3),

    # giving the following matrix:

    # [[2 4 6]

    # [5 7 9]]

    print x + v

    # Add a vector to each column of a matrix

    # x has shape (2, 3) and w has shape (2,).

    # If we transpose x then it has shape (3, 2) and can be broadcast

    # against w to yield a result of shape (3, 2); transposing this result

    # yields the final result of shape (2, 3) which is the matrix x with

    # the vector w added to each column. Gives the following matrix:

    # [[ 5 6 7]

    # [ 9 10 11]]

    print (x.T + w).T

    # Another solution is to reshape w to be a column vector of shape (2, 1);

    # we can then broadcast it directly against x to produce the same

    # output.

    print x + np.reshape(w, (2, 1))

    # Multiply a matrix by a constant:

    # x has shape (2, 3). Numpy treats scalars as arrays of shape ();

    # these can be broadcast together to shape (2, 3), producing the

    # following array:

    # [[ 2 4 6]

    # [ 8 10 12]]

    print x * 2
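    Another everyday application is centering the columns of a data matrix by subtracting the per-column mean. The sketch below is an illustrative addition with made-up data; col_mean and centered are hypothetical names, and print(...) is used so it runs on Python 3 as well.

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])   # shape (2, 3)

col_mean = x.mean(axis=0)   # shape (3,): one mean per column
centered = x - col_mean     # (2, 3) minus (3,) broadcasts across the rows

print(col_mean)                 # prints "[2.5 3.5 4.5]"
print(centered.mean(axis=0))    # prints "[0. 0. 0.]"
```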

    Here is a simple example:

    import numpy as np

    import matplotlib.pyplot as plt

    # Compute the x and y coordinates for points on a sine curve

    x = np.arange(0, 3 * np.pi, 0.1)

    y = np.sin(x)

    # Plot the points using matplotlib

    plt.plot(x, y)

    plt.show() # You must call plt.show() to make graphics appear.

    With just a little bit of extra work we can easily plot multiple lines at once, and add a title, legend,

    and axis labels:

    import numpy as np

    import matplotlib.pyplot as plt

    # Compute the x and y coordinates for points on sine and cosine curves

    x = np.arange(0, 3 * np.pi, 0.1)

    y_sin = np.sin(x)

    y_cos = np.cos(x)

    # Plot the points using matplotlib

    plt.plot(x, y_sin)

    plt.plot(x, y_cos)

    plt.xlabel('x axis label')

    plt.ylabel('y axis label')

    plt.title('Sine and Cosine')

    plt.legend(['Sine', 'Cosine'])

    plt.show()

    Here is an example:

    import numpy as np

    import matplotlib.pyplot as plt

    # Compute the x and y coordinates for points on sine and cosine curves

    x = np.arange(0, 3 * np.pi, 0.1)

    y_sin = np.sin(x)

    y_cos = np.cos(x)

    # Set up a subplot grid that has height 2 and width 1,

    # and set the first such subplot as active.

    plt.subplot(2, 1, 1)

    # Make the first plot

    plt.plot(x, y_sin)

    plt.title('Sine')

    # Set the second subplot as active, and make the second plot.

    plt.subplot(2, 1, 2)

    plt.plot(x, y_cos)

    plt.title('Cosine')

    # Show the figure.

    plt.show()

    Here is an example:

    import numpy as np

    from scipy.misc import imread, imresize

    import matplotlib.pyplot as plt

    img = imread('assets/cat.jpg')

    img_tinted = img * [1, 0.95, 0.9]

    # Show the original image

    plt.subplot(1, 2, 1)

    plt.imshow(img)

    # Show the tinted image

    plt.subplot(1, 2, 2)

    # A slight gotcha with imshow is that it might give strange results


    # if presented with data that is not uint8. To work around this, we

    # explicitly cast the image to uint8 before displaying it.

    plt.imshow(np.uint8(img_tinted))

    plt.show()
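
    To see why the cast to uint8 is needed, here is a small sketch that uses a made-up 1x2 image instead of the cat photo (the pixel values are only illustrative). Multiplying a uint8 array by a list of floats broadcasts over the colour channels and produces a float64 array, which is the kind of data the comment above warns about.

```python
import numpy as np

# A made-up 1x2 RGB image standing in for the photo (values are illustrative).
img = np.array([[[200, 100, 50], [10, 20, 30]]], dtype=np.uint8)

# Broadcasting scales the R, G and B channels by 1, 0.95 and 0.9.
img_tinted = img * [1, 0.95, 0.9]

print(img.dtype)         # uint8
print(img_tinted.dtype)  # float64
print(img_tinted.shape)  # (1, 2, 3)

# Casting back gives imshow the uint8 data it handles most predictably.
img_uint8 = np.uint8(img_tinted)
print(img_uint8.dtype)   # uint8
```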


    PANDAS 


    Lesson 1

    Create Data - We begin by creating our own data set for analysis. This prevents the end user reading this tutorial from having to download any files to replicate the results below. We will export this data set to a text file so that you can get some experience pulling data from a text file.

    Get Data - We will learn how to read in the text file. The data consist of baby names and the number of babies given each name in the year 1880.

    Prepare Data - Here we will simply take a look at the data and make sure it is clean. By clean I mean we will take a look inside the contents of the text file and look for any anomalies. These can include missing data, inconsistencies in the data, or any other data that seems out of place. If any are found we will then have to make decisions on what to do with these records.

    Analyze Data - We will simply find the most popular name in a specific year.

    Present Data - Through tabular data and a graph, clearly show the end user what is the most popular name in a specific year.

    The pandas library is used for all the data analysis, excluding a small piece of the data presentation section. The matplotlib library will only be needed for the data presentation section. Importing the libraries is the first step we will take in the lesson.

    # Import all libraries needed for the tutorial

    # General syntax to import specific functions in a library:

    ##from (library) import (specific library function)

    from pandas import DataFrame, read_csv

    # General syntax to import a library but no functions:

    ##import (library) as (give the library a nickname/alias)

    import matplotlib.pyplot as plt

    import pandas as pd #this is how I usually import pandas

    import sys #only needed to determine Python version number

    # Enable inline plotting

    %matplotlib inline

    print 'Python version ' + sys.version


    print 'Pandas version ' + pd.__version__

    Python version 2.7.5 |Anaconda 2.1.0 (64-bit)| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]

    Pandas version 0.15.2

    Create Data

    The data set will consist of 5 baby names and the number of births recorded for that year (1880).

    # The initial set of baby names and birth counts

    names = ['Bob','Jessica','Mary','John','Mel']

    births = [968, 155, 77, 578, 973]

    To merge these two lists together we will use the zip function.

    zip?

    BabyDataSet = zip(names,births)

    BabyDataSet

    [('Bob', 968), ('Jessica', 155), ('Mary', 77), ('John', 578), ('Mel', 973)]
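
    The list shown above is what Python 2 prints. In Python 3, zip returns a lazy iterator rather than a list, so a sketch that behaves the same way under both versions wraps the call in list():

```python
names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]

# zip pairs the i-th name with the i-th birth count.
# list() turns the result into a concrete list under Python 2 and Python 3.
BabyDataSet = list(zip(names, births))

print(BabyDataSet)
# [('Bob', 968), ('Jessica', 155), ('Mary', 77), ('John', 578), ('Mel', 973)]
```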

    We are basically done creating the data set. We will now use the pandas library to export this data set into a csv file.

    df will be a DataFrame object. You can think of this object as holding the contents of BabyDataSet in a format similar to a sql table or an Excel spreadsheet. Let's take a look below at the contents of df.

    df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])

    df

         Names  Births
    0      Bob     968
    1  Jessica     155
    2     Mary      77
    3     John     578
    4      Mel     973

    Export the dataframe to a csv file. We can name the file births1880.csv. The function to_csv will be used to export the file. The file will be saved in the same location as the notebook unless specified otherwise.

    In [7]:

    df.to_csv?

    The only parameters we will use are index and header. Setting these parameters to False will prevent the index and header names from being exported. Change the values of these parameters to get a better understanding of their use.

    In [8]:

    df.to_csv('births1880.csv',index=False,header=False)
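
    To see what index and header control, the sketch below exports the same small dataframe twice (Python 3 syntax). It relies on to_csv returning the CSV text as a string when no file path is given, so no file is created; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Bob', 'Jessica'], 'Births': [968, 155]},
                  columns=['Names', 'Births'])

# Defaults (index=True, header=True): row labels and column names are written.
with_labels = df.to_csv()

# index=False, header=False: only the data values are written.
values_only = df.to_csv(index=False, header=False)

print(with_labels.splitlines())  # [',Names,Births', '0,Bob,968', '1,Jessica,155']
print(values_only.splitlines())  # ['Bob,968', 'Jessica,155']
```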

    Get Data

    To pull in the csv file, we will use the pandas function read_csv. Let us take a look at this function and what inputs it takes.

    In [9]:

    read_csv?

    Even though this function has many parameters, we will simply pass it the location of the text file.

    Location = C:\Users\ENTER_USER_NAME.xy\startups\births1880.csv

    Note:  Depending on where you save your notebooks, you may need to modify the location above.

    In [10]:

    Location = r'C:\Users\david\notebooks\pandas\births1880.csv'

    df = pd.read_csv(Location)

    Notice the r before the string. Since backslashes are escape characters in ordinary string literals, prefixing the string with r makes it a raw string, in which every backslash is taken literally.
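
    A short sketch of the difference, using a made-up path chosen so that it contains the escape sequences \n and \t (the path itself is hypothetical):

```python
# In a normal string literal, backslash starts an escape sequence:
# '\n' becomes one newline character and '\t' becomes one tab character.
normal = 'C:\notebooks\table.csv'

# The r prefix creates a raw string: every backslash is kept literally.
raw = r'C:\notebooks\table.csv'

# Doubling each backslash is the equivalent non-raw spelling.
escaped = 'C:\\notebooks\\table.csv'

print(len(normal))     # 20
print(len(raw))        # 22
print(raw == escaped)  # True
```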

    In [11]:

    df

    Out[11]:

           Bob  968
    0  Jessica  155
    1     Mary   77
    2     John  578
    3      Mel  973

    This brings us to our first problem of the exercise. The read_csv function treated the first record in the csv file as the header names. This is obviously not correct since the text file did not provide us with header names.

    To correct this we will pass the header parameter to the read_csv function and set it to None (Python's null value).

    In [12]:

    df = pd.read_csv(Location, header=None)

    df

    Out[12]:

             0    1
    0      Bob  968
    1  Jessica  155
    2     Mary   77
    3     John  578
    4      Mel  973

    If we wanted to give the columns specific names, we would have to pass another parameter called names. We can also omit the header parameter.

    In [13]:

    df = pd.read_csv(Location, names=['Names','Births'])

    df

    Out[13]:

         Names  Births
    0      Bob     968
    1  Jessica     155
    2     Mary      77
    3     John     578
    4      Mel     973

    You can think of the numbers [0,1,2,3,4] as the row numbers in an Excel file. In pandas these are part of the index of the dataframe. You can think of the index as the primary key of a sql table, with the exception that an index is allowed to have duplicates.

    [Names, Births] can be thought of as column headers similar to the ones found in an Excel spreadsheet or sql database.
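
    The three ways of calling read_csv used in this section can be compared side by side. This is a sketch in Python 3 syntax that reads from an in-memory string through io.StringIO instead of a file on disk, so the csv_text variable is an assumption standing in for births1880.csv.

```python
import io
import pandas as pd

# Headerless text standing in for the first three rows of births1880.csv.
csv_text = "Bob,968\nJessica,155\nMary,77\n"

# Default: the first record is consumed as the header row.
df_default = pd.read_csv(io.StringIO(csv_text))

# header=None: no header row is assumed; columns are numbered 0, 1.
df_no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# names=[...]: no header row is assumed and the given names are used.
df_named = pd.read_csv(io.StringIO(csv_text), names=['Names', 'Births'])

print(list(df_default.columns))        # ['Bob', '968']
print(list(df_no_header.columns))      # [0, 1]
print(list(df_named.columns))          # ['Names', 'Births']
print(len(df_default), len(df_named))  # 2 3
```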

    Delete the csv file now that we are done using it.

    In [14]:

    import os

    os.remove(Location)


    Prepare Data

    The data we have consists of baby names and the number of births in the year 1880. We already know that we have 5 records and none of the records are missing (non-null values).

    The Names column at this point is of no concern since it most likely is just composed of alphanumeric strings (baby names). There is a chance of bad data in this column but we will not worry about that at this point of the analysis. The Births column should just contain integers representing the number of babies born in a specific year with a specific name. We can check whether all the data is of the data type integer. It would not make sense to have this column have a data type of float. I would not worry about any possible outliers at this point of the analysis.

    Realize that aside from the check we did on the "Names" column, briefly looking at the data inside the dataframe should be as far as we need to go at this stage of the game. As we continue in the data analysis life cycle we will have plenty of opportunities to find any issues with the data set.

    In [15]:

    # Check data type of the columns

    df.dtypes

    Out[15]:

    Names object

    Births int64

    dtype: object

    In [16]:

    # Check data type of Births column

    df.Births.dtype

    Out[16]:

    dtype('int64')

    As you can see, the Births column is of type int64, thus no floats (decimal numbers) or alphanumeric characters will be present in this column.
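
    The dtype check works as a quick sanity test because pandas picks a single dtype for the whole column. A sketch with three made-up columns (Python 3 syntax; the values are illustrative only) shows how one stray value changes the reported type:

```python
import pandas as pd

clean = pd.Series([968, 155, 77])            # integers only
with_fraction = pd.Series([968, 155.5, 77])  # one decimal number
with_text = pd.Series([968, 'n/a', 77])      # one string

print(clean.dtype)          # int64
print(with_fraction.dtype)  # float64
print(with_text.dtype)      # object
```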

    Analyze Data

    To find the most popular name, or the baby name with the highest birth rate, we can do one of the following:

    Sort the dataframe and select the top row

    Use the max() method to find the maximum value

    In [17]:

    # Method 1:

    Sorted = df.sort(['Births'], ascending=False)


    Sorted.head(1)

    Out[17]:

      Names  Births
    4   Mel     973

    In [18]:

    # Method 2:

    df['Births'].max()

    Out[18]:

    973
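
    Both methods can be tried on a dataframe built directly in memory. Note that this sketch (Python 3 syntax) uses sort_values, which replaced DataFrame.sort in later pandas releases, and adds idxmax, which is not part of the original lesson, to show where the maximum sits.

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Bob', 'Jessica', 'Mary', 'John', 'Mel'],
                   'Births': [968, 155, 77, 578, 973]},
                  columns=['Names', 'Births'])

# Method 1 with the newer API: sort descending, keep the top row.
top_row = df.sort_values('Births', ascending=False).head(1)

# Method 2: the maximum value on its own.
max_births = df['Births'].max()

# idxmax gives the index label of the row holding the maximum.
max_label = df['Births'].idxmax()

print(top_row['Names'].iloc[0])  # Mel
print(max_births)                # 973
print(max_label)                 # 4
```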

    Present Data

    Here we can plot the Births column and label the graph to show the end user the highest point on the graph. In conjunction with the table, the end user has a clear picture that Mel is the most popular baby name in the data set.

    plot() is a convenient method with which pandas lets you painlessly plot the data in your dataframe. We learned how to find the maximum value of the Births column in the previous section. Finding the actual baby name that goes with the 973 value looks a bit tricky, so let's go over it.

    Explain the pieces:

    df['Names'] - This is the entire list of baby names, the entire Names column

    df['Births'] - This is the entire list of Births in the year 1880, the entire Births column

    df['Births'].max() - This is the maximum value found in the Births column

    [df['Births'] == df['Births'].max()] IS EQUAL TO [Find all of the records in the Births column where it is equal to 973]

    df['Names'][df['Births'] == df['Births'].max()] IS EQUAL TO Select all of the records in the Names column WHERE [The Births column is equal to 973]

    An alternative way could have been to use the Sorted dataframe:

    Sorted['Names'].head(1).values

    The str() function simply converts an object into a string.
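
    The pieces explained above can be run one step at a time. This is a sketch in Python 3 syntax on the same five records; the intermediate names mask, max_names and text are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Bob', 'Jessica', 'Mary', 'John', 'Mel'],
                   'Births': [968, 155, 77, 578, 973]},
                  columns=['Names', 'Births'])

# Step 1: compare every value with the maximum; the result is a boolean Series.
mask = df['Births'] == df['Births'].max()
print(mask.tolist())  # [False, False, False, False, True]

# Step 2: index the Names column with the mask to keep only the True rows.
max_names = df['Names'][mask]
print(max_names.values)  # ['Mel']

# .values is a NumPy array, so element 0 is the plain string.
text = str(df['Births'].max()) + ' - ' + max_names.values[0]
print(text)  # 973 - Mel
```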

    In [19]:

    # Create graph

    df['Births'].plot()

    # Maximum value in the data set

    MaxValue = df['Births'].max()


    # Name associated with the maximum value

    MaxName = df['Names'][df['Births'] == df['Births'].max()].values

    # Text to display on graph

    Text = str(MaxValue) + " - " + MaxName

    # Add text to graph

    plt.annotate(Text, xy=(1, MaxValue), xytext=(8, 0),

                 xycoords=('axes fraction', 'data'), textcoords='offset points')

    print "The most popular name"

    df[df['Births'] == df['Births'].max()]

    #Sorted.head(1) can also be used

    The most popular name

    Out[19]:

    Names Births

    4Mel 973


    Lesson 2

    In [1]:

    # The usual preamble

    import pandas as pd

    # Make the graphs a bit prettier, and bigger

    pd.set_option('display.mpl_style', 'default')

    pd.set_option('display.line_width', 5000)

    pd.set_option('display.max_columns', 60)

    figsize(15, 5)

    We're going to use a new dataset here, to demonstrate how to deal with larger datasets. This is a subset of the 311 service requests from NYC Open Data.

    In [2]:

    complaints = pd.read_csv('../data/311-service-requests.csv')

    2.1 What's even in it? (the summary)

    When you look at a large dataframe, instead of showing you the contents of the dataframe, it'll show you a summary. This includes all the columns, and how many non-null values there are in each column.

    In [3]:

    complaints

    Out[3]:

    Int64Index: 111069 entries, 0 to 111068
    Data columns (total 52 columns):
    Unique Key                        111069 non-null values
    Created Date                      111069 non-null values
    Closed Date                       60270 non-null values
    Agency                            111069 non-null values
    Agency Name                       111069 non-null values
    Complaint Type                    111069 non-null values
    Descriptor                        111068 non-null values
    Location Type                     79048 non-null values
    Incident Zip                      98813 non-null values
    Incident Address                  84441 non-null values
    Street Name                       84438 non-null values
    Cross Street 1                    84728 non-null values
    Cross Street 2                    84005 non-null values
    Intersection Street 1             19364 non-null values
    Intersection Street 2             19366 non-null values
    Address Type                      102247 non-null values
    City                              98860 non-null values
    Landmark                          95 non-null values
    Facility Type                     110938 non-null values
    Status                            111069 non-null values
    Due Date                          39239 non-null values
    Resolution Action Updated Date    96507 non-null values
    Community Board                   111069 non-null values
    Borough                           111069 non-null values
    X Coordinate (State Plane)        98143 non-null values
    Y Coordinate (State Plane)        98143 non-null values
    Park Facility Name                111069 non-null values
    Park Borough                      111069 non-null values
    School Name                       111069 non-null values
    School Number                     111052 non-null values
    School Region                     110524 non-null values
    School Code                       110524 non-null values
    School Phone Number               111069 non-null values
    School Address                    111069 non-null values
    School City                       111069 non-null values
    School State                      111069 non-null values
    School Zip                        111069 non-null values
    School Not Found                  38984 non-null values
    School or Citywide Complaint      0 non-null values
    Vehicle Type                      99 non-null values
    Taxi Company Borough              117 non-null values
    Taxi Pick Up Location             1059 non-null values
    Bridge Highway Name               185 non-null values
    Bridge Highway Direction          185 non-null values
    Road Ramp                         184 non-null values
    Bridge Highway Segment            223 non-null values
    Garage Lot Name                   49 non-null values
    Ferry Direction                   37 non-null values
    Ferry Terminal Name               336 non-null values
    Latitude                          98143 non-null values
    Longitude                         98143 non-null values
    Location                          98143 non-null values
    dtypes: float64(5), int64(1), object(46)

    NYC Open Data: https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9

    2.2 Selecting columns and rows

    To select a column, we index with the name of the column, like this:

    In [4]:

    complaints['Complaint Type']

    Out[4]:

    0 Noise - Street/Sidewalk

    1 Illegal Parking


    2 Noise - Commercial

    3 Noise - Vehicle

    4 Rodent

    5 Noise - Commercial

    6 Blocked Driveway

    7 Noise - Commercial

    8 Noise - Commercial

    9 Noise - Commercial

    10 Noise - House of Worship

    11 Noise - Commercial

    12 Illegal Parking

    13 Noise - Vehicle

    14 Rodent

    ...

    111054 Noise - Street/Sidewalk

    111055 Noise - Commercial

    111056 Street Sign - Missing

    111057 Noise

    111058 Noise - Commercial

    111059 Noise - Street/Sidewalk

    111060 Noise

    111061 Noise - Commercial

    111062 Water System

    111063 Water System

    111064 Maintenance or Facility

    111065 Illegal Parking

    111066 Noise - Street/Sidewalk

    111067 Noise - Commercial

    111068 Blocked Driveway

    Name: Complaint Type, Length: 111069, dtype: object

    To get the first 5 rows of a dataframe, we can use a slice: df[:5].

    This is a great way to get a sense for what kind of information is in the dataframe -- take a minute to look at the contents and get a feel for this dataset.

    In [5]:

    complaints[:5]

    Out[5]:

    [Table: the first five rows of complaints, with all 52 columns listed in the summary above. A selection of columns is shown here.]

       Unique Key            Created Date           Complaint Type    Borough
    0    26589651  10/31/2013 02:08:41 AM  Noise - Street/Sidewalk     QUEENS
    1    26593698  10/31/2013 02:01:04 AM          Illegal Parking     QUEENS
    2    26594139  10/31/2013 02:00:24 AM       Noise - Commercial  MANHATTAN
    3    26595721  10/31/2013 01:56:23 AM          Noise - Vehicle  MANHATTAN
    4    26590930  10/31/2013 01:53:44 AM                   Rodent  MANHATTAN

    We can combine these to get the first 5 rows of a column:

    In [6]:

    complaints['Complaint Type'][:5]

    Out[6]:

    0 Noise - Street/Sidewalk

    1 Illegal Parking

    2 Noise - Commercial

    3 Noise - Vehicle

    4 Rodent

    Name: Complaint Type, dtype: object

    and it doesn't matter which direction we do it in:

    In [7]:


    complaints[:5]['Complaint Type']

    Out[7]:

    0 Noise - Street/Sidewalk

    1 Illegal Parking

    2 Noise - Commercial

    3 Noise - Vehicle

    4 Rodent

    Name: Complaint Type, dtype: object
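
    The claim that the order does not matter can be checked on a small stand-in dataframe. This is a sketch in Python 3 syntax; the seven rows are copied from the outputs shown in this lesson rather than loaded from the real file.

```python
import pandas as pd

# A small stand-in for the real complaints dataframe.
complaints = pd.DataFrame({
    'Complaint Type': ['Noise - Street/Sidewalk', 'Illegal Parking',
                       'Noise - Commercial', 'Noise - Vehicle', 'Rodent',
                       'Noise - Commercial', 'Blocked Driveway'],
    'Borough': ['QUEENS', 'QUEENS', 'MANHATTAN', 'MANHATTAN', 'MANHATTAN',
                'QUEENS', 'QUEENS'],
})

# Select the column first, then slice the first 5 rows ...
col_then_rows = complaints['Complaint Type'][:5]

# ... or slice the rows first, then select the column.
rows_then_col = complaints[:5]['Complaint Type']

print(col_then_rows.equals(rows_then_col))  # True
print(len(col_then_rows))                   # 5
```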

    2.3 Selecting multiple columns

    What if we just want to know the complaint type and the borough, but not the rest of the information? Pandas makes it really easy to select a subset of the columns: just index with a list of the columns you want.

    In [8]:

    complaints[['Complaint Type', 'Borough']]

    Out[8]:

    Int64Index: 111069 entries, 0 to 111068
    Data columns (total 2 columns):
    Complaint Type    111069 non-null values
    Borough           111069 non-null values
    dtypes: object(2)

    That showed us a summary, and then we can look at the first 10 rows:

    In [9]:

    complaints[['Complaint Type', 'Borough']][:10]

    Out[9]:

                Complaint Type    Borough
    0  Noise - Street/Sidewalk     QUEENS
    1          Illegal Parking     QUEENS
    2       Noise - Commercial  MANHATTAN
    3          Noise - Vehicle  MANHATTAN
    4                   Rodent  MANHATTAN
    5       Noise - Commercial     QUEENS
    6         Blocked Driveway     QUEENS
    7       Noise - Commercial     QUEENS
    8       Noise - Commercial  MANHATTAN
    9       Noise - Commercial   BROOKLYN
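
    One detail worth seeing in code is that indexing with a single label and indexing with a list of labels return different kinds of object. This is a sketch in Python 3 syntax on a three-row stand-in for the real data; the extra Agency column is included only so that there is something to leave out.

```python
import pandas as pd

complaints = pd.DataFrame({
    'Complaint Type': ['Noise - Street/Sidewalk', 'Illegal Parking',
                       'Noise - Commercial'],
    'Borough': ['QUEENS', 'QUEENS', 'MANHATTAN'],
    'Agency': ['NYPD', 'NYPD', 'NYPD'],
})

# A single label returns a Series (one column).
one_column = complaints['Borough']

# A list of labels returns a DataFrame, even when the list has one entry.
two_columns = complaints[['Complaint Type', 'Borough']]
one_column_frame = complaints[['Borough']]

print(type(one_column).__name__)        # Series
print(type(two_columns).__name__)       # DataFrame
print(type(one_column_frame).__name__)  # DataFrame
print(list(two_columns.columns))        # ['Complaint Type', 'Borough']
```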


    2.4 What's the most common complaint type?

    This is a really easy question to answer! There's a .value_counts() method that we can use:

    In [10]:

    complaints['Complaint Type'].value_counts()

    Out[10]:

    HEATING 14200

    GENERAL CONSTRUCTION 7471

    Street Light Condition 7117

    DOF Literature Request 5797

    PLUMBING 5373

    PAINT - PLASTER 5149

    Blocked Driveway 4590

    NONCONST 3998

    Street Condition 3473

    Illegal Parking 3343

    Noise 3321

    Traffic Signal Condition 3145

    Dirty Conditions 2653

    Water System 2636

    Noise - Commercial 2578

    ...

    Opinion for the Mayor 2

    Window Guard 2

    DFTA Literature Request 2

    Legal Services Provider Complaint 2

    Open Flame Permit 1

    Snow 1

    Municipal Parking Facility 1

    X-Ray Machine/Equipment 1

    Stalled Sites 1

    DHS Income Savings Requirement 1

    Tunnel Condition 1

    Highway Sign - Damaged 1

    Ferry Permit 1

    Trans Fat 1

    DWD 1

    Length: 165, dtype: int64
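
    To make the behaviour of value_counts concrete, here is a sketch in Python 3 syntax on six made-up complaint records (not taken from the real file). The result is a Series indexed by the distinct values and sorted from most to least frequent, which is why slicing it from the front gives the most common entries.

```python
import pandas as pd

# Six made-up records, for illustration only.
complaint_types = pd.Series(['Noise - Commercial', 'Illegal Parking',
                             'Noise - Commercial', 'Rodent',
                             'Noise - Commercial', 'Illegal Parking'])

counts = complaint_types.value_counts()

print(counts.index[0])   # Noise - Commercial
print(counts.iloc[0])    # 3

# The first two entries are the two most common complaint types.
top_two = counts[:2]
print(list(top_two.index))  # ['Noise - Commercial', 'Illegal Parking']
print(top_two.sum())        # 5
```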

    If we just wanted the top 10 most common complaints, we can do this:


    In [11]:

    complaint_counts = complaints['Complaint Type'].value_counts()

    complaint_counts[:10]

    Out[11]:

    HEATING 14200

    GENERAL CONSTRUCTION 7471

    Street Light Condition 7117

    DOF Literature Request 5797

    PLUMBING 5373

    PAINT - PLASTER 5149

    Blocked Driveway 4590

    NONCONST 3998

    Street Condition 3473

    Illegal Parking 3343

    dtype: int64

    But it gets better! We can plot them!

    In [12]:

    complaint_counts[:10].plot(kind='bar')

    Out[12]:

    [Figure: bar chart of the ten most common complaint types]


    Lesson 3

    Get Data - Our data set will consist of an Excel file containing customer counts per date. We will learn how to read in the Excel file for processing.

    Prepare Data - The data is an irregular time series having duplicate dates. We will be challenged in compressing the data and coming up with next year's forecasted customer count.

    Analyze Data - We use graphs to visualize trends and spot outliers. Some built-in computational tools will be used to calculate next year's forecasted customer count.

    Present Data - The results will be plotted.

    NOTE: Make sure you have looked through all previous lessons, as the knowledge learned in previous lessons will be needed for this exercise.

    In [1]:

    # Import libraries

    import pandas as pd

    import matplotlib.pyplot as plt

    import numpy.random as np

    import sys

    %matplotlib inline

    In [2]:

    print 'Python version ' + sys.version

    print 'Pandas version: ' + pd.__version__

    Python version 2.7.5 |Anaconda 2.1.0 (64-bit)| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]

    Pandas version: 0.15.2

    We will be creating our own test data for analysis.

    In [3]:

    # set seed

    np.seed(111)

    # Function to generate test data

    # Note: np is the alias given to numpy.random in the import cell above
    def CreateDataSet(Number=1):

        Output = []

        for i in range(Number):

            # Create a weekly (mondays) date range
            rng = pd.date_range(start='1/1/2009', end='12/31/2012', freq='W-MON')

            # Create random data
            data = np.randint(low=25,high=1000,size=len(rng))

            # Status pool
            status = [1,2,3]

            # Make a random list of statuses
            random_status = [status[np.randint(low=0,high=len(status))] for i in range(len(rng))]

            # State pool
            states = ['GA','FL','fl','NY','NJ','TX']

            # Make a random list of states
            random_states = [states[np.randint(low=0,high=len(states))] for i in range(len(rng))]

            Output.extend(zip(random_states, random_status, data, rng))

        return Output

    Now that we have a function to generate our test data, let's create some data and put it into a dataframe.

    In [4]:

    dataset = CreateDataSet(4)


    df = pd.DataFrame(data=dataset, columns=['State','Status','CustomerCount','StatusDate'])

    df.info()

    Int64Index: 836 entries, 0 to 835
    Data columns (total 4 columns):
    State            836 non-null object
    Status           836 non-null int64
    CustomerCount    836 non-null int64
    StatusDate       836 non-null datetime64[ns]
    dtypes: datetime64[ns](1), int64(2), object(1)
    memory usage: 32.7+ KB

    In [5]:

    df.head()

    Out[5]:

       State  Status  CustomerCount  StatusDate
    0  GA     1       877            2009-01-05
    1  FL     1       901            2009-01-12
    2  fl     3       749            2009-01-19
    3  FL     3       111            2009-01-26
    4  GA     1       300            2009-02-02

    We are now going to save this dataframe into an Excel file, to then bring it back to a dataframe. We simply do this to show you how to read and write to Excel files.

    We do not write the index values of the dataframe to the Excel file, since they are not meant to be part of our initial test data set.

    In [6]:

    # Save results to excel

    df.to_excel('Lesson3.xlsx', index=False)

    print 'Done'

    Done

    Grab Data from Excel

    We will be using the read_excel function to read in data from an Excel file. The function allows you to read in specific tabs by name or location.

    In [7]:

    pd.read_excel?


    Note: The Excel file is assumed to be in the same folder as the notebook, unless specified otherwise.

    In [8]:

    # Location of file

    Location = r'C:\Users\david\notebooks\pandas\Lesson3.xlsx'

    # Parse a specific sheet

    df = pd.read_excel(Location, 0, index_col='StatusDate')

    df.dtypes

    Out[8]:

    State object

    Status int64

    CustomerCount int64

    dtype: object

    In [9]:

    df.index

    Out[9]:

    [2009-01-05, ..., 2012-12-31]

    Length: 836, Freq: None, Timezone: None

    In [10]:

    df.head()

    Out[10]:

                State  Status  CustomerCount
    StatusDate
    2009-01-05  GA     1       877
    2009-01-12  FL     1       901
    2009-01-19  fl     3       749
    2009-01-26  FL     3       111
    2009-02-02  GA     1       300

    Prepare Data

    This section attempts to clean up the data for analysis.

    1. Make sure the state column is all in upper case
    2. Only select records where the account status is equal to "1"
    3. Merge (NJ and NY) to NY in the state column
    4. Remove any outliers (any odd results in the data set)

    Let's take a quick look at how some of the State values are upper case and some are lower case.

    In [11]:

    df['State'].unique()

    Out[11]:

    array([u'GA', u'FL', u'fl', u'TX', u'NY', u'NJ'], dtype=object)

    To convert all the State values to upper case we will use the string upper() method together with the apply method of the State column. The lambda function simply applies upper() to each value in the State column.

    In [12]:

    # Clean State Column, convert to upper case

    df['State'] = df.State.apply(lambda x: x.upper())

    In [13]:

    df['State'].unique()

    Out[13]:

    array([u'GA', u'FL', u'TX', u'NY', u'NJ'], dtype=object)
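The same clean-up can also be written without a lambda by using the vectorized string accessor of a pandas Series. This is only a minimal sketch on a made-up Series (the name states and its values are assumptions that mirror the State values shown above), not the lesson's own code:

```python
import pandas as pd

# Hypothetical sample mirroring the mixed-case State values in this lesson
states = pd.Series(['GA', 'FL', 'fl', 'TX', 'NY', 'NJ'], name='State')

# Vectorized string method; gives the same values as apply(lambda x: x.upper())
upper_states = states.str.upper()

print(upper_states.unique())
```

Both forms produce the same values; the accessor form simply avoids writing the lambda.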

    In [14]:

    # Only grab where Status == 1

    mask = df['Status'] == 1

    df = df[mask]

    To turn the NJ states to NY we simply...

    [df.State == 'NJ'] - Find all records in the State column where they are equal to NJ.

    df.State[df.State == 'NJ'] = 'NY' - For all records in the State column where they are equal to NJ, replace them with NY.

    In [15]:

    # Convert NJ to NY

    mask = df.State == 'NJ'

    df['State'][mask] = 'NY'
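Assigning through two indexing steps, as in df['State'][mask], can trigger pandas' SettingWithCopyWarning, and depending on the pandas version the assignment may not reach the original dataframe. A sketch of the same replacement done in a single .loc call, on a small made-up dataframe (the name small and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical miniature of the lesson's dataframe
small = pd.DataFrame({'State': ['NY', 'NJ', 'FL', 'NJ'],
                      'CustomerCount': [522, 710, 901, 355]})

# Boolean mask marking the rows to change
mask = small['State'] == 'NJ'

# One indexing step: rows selected by the mask, column selected by label
small.loc[mask, 'State'] = 'NY'

print(small['State'].tolist())
```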

    Now we can see we have a much cleaner data set to work with.

    In [16]:


    df['State'].unique()

    Out[16]:

    array([u'GA', u'FL', u'NY', u'TX'], dtype=object)

    At this point we may want to graph the data to check for any outliers or inconsistencies. We will be using the plot() method of the dataframe.

    As you can see from the graph below, it is not very conclusive, which is probably a sign that we need to perform some more data preparation.

    In [17]:

    df['CustomerCount'].plot(figsize=(15,5));

    If we take a look at the data, we begin to realize that there are multiple values for the same State, StatusDate, and Status combination. It is possible that this means the data you are working with is dirty/bad/inaccurate, but we will assume otherwise. We can assume this data set is a subset of a bigger data set, and if we simply add the values in the CustomerCount column per State, StatusDate, and Status, we will get the Total Customer Count per day.

    In [18]:

    sortdf = df[df['State']=='NY'].sort_index(axis=0)

    sortdf.head(10)

    Out[18]:

    State Status CustomerCount

    StatusDate

    2009-01-19 NY 1 522

    2009-02-23 NY 1 710

    2009-03-09 NY 1 992

    2009-03-16 NY 1 355

    2009-03-23 NY 1 728

    2009-03-30 NY 1 863

    2009-04-13 NY 1 520

    2009-04-20 NY 1 820

    2009-04-20 NY 1 937

    2009-04-27 NY 1 447

    Our task is now to create a new dataframe that compresses the data so we have daily customer counts per State and StatusDate. We can ignore the Status column since all the values in this column are 1. To accomplish this we will use the dataframe's groupby and sum() functions.

    Note that we had to use reset_index. If we did not, we would not have been able to group by both the State and the StatusDate, since groupby, when given a list of names as it is here, looks those names up among the columns. The reset_index function will bring the index StatusDate back to a column in the dataframe.

    In [19]:

    # Group by State and StatusDate


    Daily = df.reset_index().groupby(['State','StatusDate']).sum()

    Daily.head()

    Out[19]:

                       Status  CustomerCount
    State  StatusDate
    FL     2009-01-12  1       901
           2009-02-02  1       653
           2009-03-23  1       752
           2009-04-06  2       1086
           2009-06-08  1       649

    The State and StatusDate columns are automatically placed in the index of the Daily dataframe. You can think of the index as the primary key of a database table, but without the constraint of having unique values. As you will see, columns in the index allow us to easily select, plot, and perform calculations on the data.
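To make the effect of grouping on two keys concrete, here is a small sketch on made-up records (the names records, daily, and ny are assumptions; the two NY values for 2009-04-20 are taken from the sorted NY listing earlier in this lesson). Rows that share the same State and StatusDate collapse into one summed row, and the two keys end up as a two-level index:

```python
import pandas as pd

# Hypothetical records: two rows share the same State and StatusDate
records = pd.DataFrame({
    'State': ['NY', 'NY', 'NY', 'FL'],
    'StatusDate': pd.to_datetime(['2009-04-20', '2009-04-20',
                                  '2009-04-27', '2009-01-12']),
    'CustomerCount': [820, 937, 447, 901],
})

# Duplicates for (NY, 2009-04-20) collapse into one summed row
daily = records.groupby(['State', 'StatusDate']).sum()

# The group keys become a two-level index; select one state with .loc
ny = daily.loc['NY']

print(daily)
print(ny)
```

Selecting by the first index level, as in daily.loc['NY'], returns a dataframe indexed only by StatusDate, which is what the per-state plots later in the lesson rely on.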

    Below we delete the Status column. Every remaining record had a Status of 1 before grouping, so the summed column is no longer necessary.

    In [20]:

    del Daily['Status']

    Daily.head()

    Out[20]:

                       CustomerCount
    State  StatusDate
    FL     2009-01-12  901
           2009-02-02  653
           2009-03-23  752
           2009-04-06  1086
           2009-06-08  649

    In [21]:

    # What is the index of the dataframe

    Daily.index

    Out[21]:

    MultiIndex(levels=[[u'FL', u'GA', u'NY', u'TX'], [2009-01-05 00:00:00, 2009-0

    1-12 00:00:00, 2009-01-19 00:00:00, 2009-02-02 00:00:00, 2009-02-23 00:00:00,

    2009-03-09 00:00:00, 2009-03-16 00:00:00, 2009-03-23 00:00:00, 2009-03-30 00:

    00:00, 2009-04-06 00:00:00, 2009-04-13 00:00:00, 2009-04-20 00:00:00, 2009-04

    -27 00:00:00, 2009-05-04 00:00:00, 2009-05-11 00:00:00, 2009-05-18 00:00:00,

    2009-05-25 00:00:00, 2009-06-08 00:00:00, 2009-06-22 00:00:00, 2009-07-06 00:

    00:00, 2009-07-13 00:00:00, 2009-07-20 00:00:00, 2009-07-27 00:00:00, 2009-08

    -10 00:00:00, 2009-08-17 00:00:00, 2009-08-24 00:00:00, 2009-08-31 00:00:00,

    2009-09-07 00:00:00, 2009-09-14 00:00:00, 2009-09-21 00:00:00, 2009-09-28 00:


    00:00, 2009-10-05 00:00:00, 2009-10-12 00:00:00, 2009-10-19 00:00:00, 2009-10

    -26 00:00:00, 2009-11-02 00:00:00, 2009-11-23 00:00:00, 2009-11-30 00:00:00,

    2009-12-07 00:00:00, 2009-12-14 00:00:00, 2010-01-04 00:00:00, 2010-01-11 00:

    00:00, 2010-01-18 00:00:00, 2010-01-25 00:00:00, 2010-02-08 00:00:00, 2010-02

    -15 00:00:00, 2010-02-22 00:00:00, 2010-03-01 00:00:00, 2010-03-08 00:00:00,

    2010-03-15 00:00:00, 2010-04-05 00:00:00, 2010-04-12 00:00:00, 2010-04-26 00:

    00:00, 2010-05-03 00:00:00, 2010-05-10 00:00:00, 2010-05-17 00:00:00, 2010-05

    -24 00:00:00, 2010-05-31 00:00:00, 2010-06-14 00:00:00, 2010-06-28 00:00:00,

    2010-07-05 00:00:00, 2010-07-19 00:00:00, 2010-07-26 00:00:00, 2010-08-02 00:

    00:00, 2010-08-09 00:00:00, 2010-08-16 00:00:00, 2010-08-30 00:00:00, 2010-09

    -06 00:00:00, 2010-09-13 00:00:00, 2010-09-20 00:00:00, 2010-09-27 00:00:00,

    2010-10-04 00:00:00, 2010-10-11 00:00:00, 2010-10-18 00:00:00, 2010-10-25 00:

    00:00, 2010-11-01 00:00:00, 2010-11-08 00:00:00, 2010-11-15 00:00:00, 2010-11

    -29 00:00:00, 2010-12-20 00:00:00, 2011-01-03 00:00:00, 2011-01-10 00:00:00,

    2011-01-17 00:00:00, 2011-02-07 00:00:00, 2011-02-14 00:00:00, 2011-02-21 00:

    00:00, 2011-02-28 00:00:00, 2011-03-07 00:00:00, 2011-03-14 00:00:00, 2011-03

    -21 00:00:00, 2011-03-28 00:00:00, 2011-04-04 00:00:00, 2011-04-18 00:00:00,

    2011-04-25 00:00:00, 2011-05-02 00:00:00, 2011-05-09 00:00:00, 2011-05-16 00:

    00:00, 2011-05-23 00:00:00, 2011-05-30 00:00:00, 2011-06-06 00:00:00, ...]],

    labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

    1, 1, 1, ...], [1, 3, 7, 9, 17, 19, 20, 21, 23, 25, 27, 28, 29, 30, 31, 35, 3

    8, 40, 41, 44, 45, 46, 47, 48, 49, 52, 54, 56, 57, 59, 60, 62, 66, 68, 69, 70

    , 71, 72, 75, 76, 77, 78, 79, 85, 88, 89, 92, 96, 97, 99, 100, 101, 103, 104,

    105, 108, 109, 110, 112, 114, 115, 117, 118, 119, 125, 126, 127, 128, 129, 13

    1, 133, 134, 135, 136, 137, 140, 146, 150, 151, 152, 153, 157, 0, 3, 7, 22, 2

    3, 24, 27, 28, 34, 37, 42, 47, 50, 55, 58, 66, 67, 69, ...]],

    names=[u'State', u'StatusDate'])

    In [22]:

    # Select the State index

    Daily.index.levels[0]

    Out[22]:

    Index([u'FL', u'GA', u'NY', u'TX'], dtype='object')

    In [23]:

    # Select the StatusDate index

    Daily.index.levels[1]

    Out[23]:


    [2009-01-05, ..., 2012-12-10]

    Length: 161, Freq: None, Timezone: None

    Let's now plot the data per State.

    As you can see, by breaking the graph up by the State column we get a much clearer picture of what the data looks like. Can you spot any outliers?

    In [24]:

    Daily.loc['FL'].plot()

    Daily.loc['GA'].plot()

    Daily.loc['NY'].plot()

    Daily.loc['TX'].plot();

    We can also plot the data from a specific date onward, such as 2012. We can now clearly see that the data for these states is all over the place. Since the data consists of weekly customer counts, the variability of the data seems suspect. For this tutorial we will assume bad data and proceed.

    In [25]:

    Daily.loc['FL']['2012':].plot()

    Daily.loc['GA']['2012':].plot()

    Daily.loc['NY']['2012':].plot()

    Daily.loc['TX']['2012':].plot();

    We will assume that per month the customer count should remain relatively steady. Any data outside a specific range in that month will be removed from the data set. The final result should have smooth graphs with no spikes.

    StateYearMonth - Here we group by State, Year of StatusDate, and Month of StatusDate.

    Daily['Outlier'] - A boolean (True or False) value letting us know if the value in the CustomerCount column is outside the acceptable range.

    We will be using the transform method instead of apply. The reason is that transform will keep the shape (number of rows and columns) of the dataframe the same and apply will not. Looking at the previous graphs, we can see that they do not resemble a Gaussian distribution, which means we cannot rely on summary statistics like the mean and standard deviation. We use percentiles instead. Note that we run the risk of eliminating good data.
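The shape difference described above can be seen on a tiny made-up table. This is a sketch only: the names counts, grouped, and per_month are assumptions, the month labels are simplified to strings, and a plain max() aggregation stands in for the reducing use of apply that the text mentions. The numbers are the January and February 2009 counts that appear later in this lesson.

```python
import pandas as pd

# Hypothetical customer counts for two months
counts = pd.DataFrame({
    'Month': ['Jan', 'Jan', 'Jan', 'Feb', 'Feb'],
    'CustomerCount': [877, 901, 522, 953, 710],
})

grouped = counts.groupby('Month')['CustomerCount']

# transform: one value per original row, so it can be stored as a new column
counts['Max'] = grouped.transform('max')

# aggregation: one value per group, so the result is shorter than the input
per_month = grouped.max()

print(counts)
print(per_month)
```

Because transform returns a result aligned with the original rows, it can be assigned straight back into the dataframe, which is what the outlier calculation needs.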

    In [26]:


    # Calculate Outliers

    StateYearMonth = Daily.groupby([Daily.index.get_level_values(0), Daily.index.get_level_values(1).year, Daily.index.get_level_values(1).month])

    # Note: as parenthesized, the margin is 1.5*Q3 - Q1 rather than 1.5*(Q3 - Q1);
    # the Lower and Upper values in the output that follows were produced by this expression.

    Daily['Lower'] = StateYearMonth['CustomerCount'].transform( lambda x: x.quantile(q=.25) - (1.5*x.quantile(q=.75)-x.quantile(q=.25)) )

    Daily['Upper'] = StateYearMonth['CustomerCount'].transform( lambda x: x.quantile(q=.75) + (1.5*x.quantile(q=.75)-x.quantile(q=.25)) )

    Daily['Outlier'] = (Daily['CustomerCount'] < Daily['Lower']) | (Daily['CustomerCount'] > Daily['Upper'])

    # Remove Outliers

    Daily = Daily[Daily['Outlier'] == False]
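For comparison, the usual interquartile-range rule places the fences at Q1 - 1.5*(Q3 - Q1) and Q3 + 1.5*(Q3 - Q1), which is not quite what the expression in the cell above computes. A minimal sketch of that standard rule, on made-up numbers rather than the lesson's data (the name values and its contents are assumptions):

```python
import pandas as pd

# Hypothetical counts with one obvious spike
values = pd.Series([10, 12, 11, 13, 12, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1  # interquartile range, computed before scaling

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# True where a value falls outside the fences
is_outlier = (values < lower) | (values > upper)

print(lower, upper)
print(values[is_outlier].tolist())
```

Computing the interquartile range as its own variable avoids the operator-precedence question entirely.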

    The dataframe named Daily will hold customer counts that have been aggregated per day. The original data (df) has multiple records per day. We are left with a data set that is indexed by both the State and the StatusDate. The Outlier column should be equal to False, signifying that the record is not an outlier.

    In [27]:

    Daily.head()

    Out[27]:

                       CustomerCount  Lower  Upper   Outlier
    State  StatusDate
    FL     2009-01-12  901            450.5  1351.5  False
           2009-02-02  653            326.5  979.5   False
           2009-03-23  752            376.0  1128.0  False
           2009-04-06  1086           543.0  1629.0  False
           2009-06-08  649            324.5  973.5   False

    We create a separate dataframe named ALL which groups the Daily dataframe by StatusDate. We are essentially getting rid of the State column. The Max column represents the maximum customer count per month. The Max column is used to smooth out the graph.

    In [28]:

    # Combine all markets

    # Get the total customer count by date

    ALL = pd.DataFrame(Daily['CustomerCount'].groupby(Daily.index.get_level_values(1)).sum())

    ALL.columns = ['CustomerCount'] # rename column


    # Group by Year and Month

    YearMonth = ALL.groupby([lambda x: x.year, lambda x: x.month])

    # What is the max customer count per Year and Month

    ALL['Max'] = YearMonth['CustomerCount'].transform(lambda x: x.max())

    ALL.head()

    Out[28]:

                CustomerCount  Max
    StatusDate
    2009-01-05  877            901
    2009-01-12  901            901
    2009-01-19  522            901
    2009-02-02  953            953
    2009-02-23  710            953

    As you can see from the ALL dataframe above, in the month of January 2009, the maximum customer count was 901. If we had used apply, we would have gotten a dataframe with (Year and Month) as the index and just the Max column with the value of 901.

    There is also interest in gauging whether the current customer counts are reaching certain goals the company has established. The task here is to visually show whether the current customer counts are meeting the goals listed below. We will call the goals BHAG (Big Hairy Annual Goal).

    12/31/2011 - 1,000 customers
    12/31/2012 - 2,000 customers
    12/31/2013 - 3,000 customers

    We will be using the date_range function to create our dates.

    Definition: date_range(start=None, end=None, periods=None, freq='D', tz=None, normalize=False, name=None, closed=None)

    Docstring: Return a fixed frequency datetime index, with day (calendar) as the default frequency

    By choosing the frequency to be A (annual), we will be able to get the three target dates from above.

    In [29]:

    pd.date_range?

    In [30]:

    # Create the BHAG dataframe


    data = [1000,2000,3000]

    idx = pd.date_range(start='12/31/2011', end='12/31/2013', freq='A')

    BHAG = pd.DataFrame(data, index=idx, columns=['BHAG'])

    BHAG

    Out[30]:

                BHAG
    2011-12-31  1000
    2012-12-31  2000
    2013-12-31  3000
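The string alias used for annual frequency is not the same in every pandas release, so a version-neutral way to build the same three target dates is to pass a year-end offset object to date_range. This is a sketch under that assumption; the names idx and targets are chosen here for illustration:

```python
import pandas as pd

# Three consecutive year-end dates, starting at 12/31/2011
idx = pd.date_range(start='2011-12-31', periods=3, freq=pd.offsets.YearEnd())

# Goal values from the list of targets in this lesson
targets = pd.DataFrame({'BHAG': [1000, 2000, 3000]}, index=idx)

print(targets)
```

Using periods=3 instead of an end date states directly how many goal dates are wanted.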

    Combining dataframes, as we learned in a previous lesson, is made simple using the concat function. Remember that when we choose axis = 0 we are appending row-wise.

    In [31]:

    # Combine the BHAG and the ALL data set

    combined = pd.concat([ALL,BHAG], axis=0)

    combined = combined.sort_index(axis=0)

    combined.tail()

    Out[31]:

                BHAG  CustomerCount  Max
    2012-11-19  NaN   136            1115
    2012-11-26  NaN   1115           1115
    2012-12-10  NaN   1269           1269
    2012-12-31  2000  NaN            NaN
    2013-12-31  3000  NaN            NaN
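The NaN values in the output above come from stacking two dataframes that do not share columns. A small sketch of that behaviour on made-up frames (the names actual, goals, and stacked are assumptions; the values echo the tail of the table above):

```python
import pandas as pd

# Hypothetical frames with different columns and different dates
actual = pd.DataFrame({'CustomerCount': [1115, 1269]},
                      index=pd.to_datetime(['2012-11-26', '2012-12-10']))
goals = pd.DataFrame({'BHAG': [2000, 3000]},
                     index=pd.to_datetime(['2012-12-31', '2013-12-31']))

# axis=0 stacks rows; a column missing from one frame is filled with NaN
stacked = pd.concat([actual, goals], axis=0).sort_index()

print(stacked)
```

Each row keeps only the values its source frame had, which is why the plot of the BHAG column needs its gaps filled before it draws a continuous line.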

    In [32]:

    fig, axes = plt.subplots(figsize=(12, 7))

    combined['BHAG'].fillna(method='pad').plot(color='green', label='BHAG')

    combined['Max'].plot(color='blue', label='All Markets')

    plt.legend(loc='best');

    There was also a need to forecast next year's customer count, and we can do this in a couple of simple steps. We will first group the combined dataframe by Year and take the maximum customer count for that year. This will give us one row per Year.

    In [33]:


    # Group by Year and then get the max value per year

    Year = combined.groupby(lambda x: x.year).max()

    Year

    Out[33]:

          BHAG  CustomerCount  Max
    2009  NaN   2452           2452
    2010  NaN   2065           2065
    2011  1000  2711           2711
    2012  2000  2061           2061
    2013  3000  NaN            NaN

    In [34]:

    # Add a column representing the percent change per year

    Year['YR_PCT_Change'] = Year['Max'].pct_change(periods=1)

    Year

    Out[34]:

          BHAG  CustomerCount  Max   YR_PCT_Change
    2009  NaN   2452           2452  NaN
    2010  NaN   2065           2065  -0.157830
    2011  1000  2711           2711  0.312833
    2012  2000  2061           2061  -0.239764
    2013  3000  NaN            NaN   NaN

    To get next year's end customer count we will assume our current growth rate remains constant. We will then increase this year's customer count by that rate, and that will be our forecast for next year.

    In [35]:

    (1 + Year.loc[2012,'YR_PCT_Change']) * Year.loc[2012,'Max']

    Out[35]:

    1566.8465510881595
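The arithmetic behind this forecast can be reproduced on its own. The sketch below rebuilds only the Max column of the Year table as a Series (the names yearly_max, growth, and forecast are assumptions) and applies the most recent year-over-year change to the most recent value:

```python
import pandas as pd

# Yearly maximum customer counts taken from the Year table above
yearly_max = pd.Series([2452, 2065, 2711, 2061],
                       index=[2009, 2010, 2011, 2012])

# Percent change relative to the previous year
growth = yearly_max.pct_change(periods=1)

# Apply the most recent growth rate to the most recent value
forecast = (1 + growth.loc[2012]) * yearly_max.loc[2012]

print(growth)
print(forecast)
```

Since the 2012 change is negative, the forecast falls below the 2012 maximum; a constant-rate assumption simply extends whatever the last year did.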

    Present Data

    Create individual Graphs per State.

    In [36]:

    # First Graph

    ALL['Max'].plot(figsize=(10, 5));plt.title('ALL Markets')


    # Last four Graphs

    fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(20, 10))

    fig.subplots_adjust(hspace=1.0) ## Create space between plots

    Daily.loc['FL']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[0,0])

    Daily.loc['GA']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[0,1])

    Daily.loc['TX']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[1,0])

    Daily.loc['NY']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[1,1])

    # Add titles

    axes[0,0].set_title('Florida')

    axes[0,1].set_title('Georgia')

    axes[1,0].set_title('Texas')

    axes[1,1].set_title('North East');


    Lesson 4

    In this lesson we're going to go back to the basics. We will be working with a small data set so that you can easily understand what I am trying to explain. We will be adding columns, deleting columns, and slicing the data many different ways. Enjoy!

    In [1]:

    # Import libraries

    import pandas as pd

    import sys

    In [2]:

    print 'Python version ' + sys.version

    print 'Pandas version: ' + pd.__version__

    Python version 2.7.5 |Anaconda 2.1.0 (64-bit)| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]

    Pandas version: 0.15.2

    In [3]:

    # Our small data set

    d = [0,1,2,3,4,5,6,7,8,9]

    # Create dataframe

    df = pd.DataFrame(d)

    df

    Out[3]:

       0
    0  0
    1  1
    2  2
    3  3
    4  4
    5  5
    6  6
    7  7
    8  8
    9  9

    In [4]:

    # Lets change the name of the column


    df.columns = ['Rev']

    df

    Out[4]:

       Rev
    0  0
    1  1
    2  2
    3  3
    4  4
    5  5
    6  6
    7  7
    8  8
    9  9

    In [5]:

    # Lets add a column

    df['NewCol'] = 5

    df

    Out[5]:

       Rev  NewCol
    0  0    5
    1  1    5
    2  2    5
    3  3    5
    4  4    5
    5  5    5
    6  6    5
    7  7    5
    8  8    5
    9  9    5

    In [6]:

    # Lets modify our new column

    df['NewCol'] = df['NewCol'] + 1

    df

    Out[6]:

       Rev  NewCol
    0  0    6
    1  1    6
    2  2    6
    3  3    6
    4  4    6
    5  5    6
    6  6    6
    7  7    6
    8  8    6
    9  9    6

    In [7]:

    # We can delete columns

    del df['NewCol']

    df

    Out[7]:

       Rev
    0  0
    1  1
    2  2
    3  3
    4  4
    5  5
    6  6
    7  7
    8  8
    9  9

    In [8]:

    # Lets add a couple of columns

    df['test'] = 3

    df['col'] = df['Rev']

    df

    Out[8]:

       Rev  test  col
    0  0    3     0
    1  1    3     1
    2  2    3     2
    3  3    3     3
    4  4    3     4
    5  5    3     5
    6  6    3     6
    7  7    3     7
    8  8    3     8
    9  9    3     9

    In [9]:


    # If we wanted, we could change the name of the i