A Python Book: Beginning Python, Advanced Python, and Python ...
8/20/2019 1+Python+Class+Powerpoint+Outline
As an example, here is an implementation of the classic quicksort algorithm in
Python:
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

print quicksort([3,6,8,10,1,2,1])
# Prints "[1, 1, 2, 3, 6, 8, 10]"
Numbers: Integers and floats work as you would expect from other languages:
x = 3
print type(x) # Prints "<type 'int'>"
print x # Prints "3"
print x + 1 # Addition; prints "4"
print x - 1 # Subtraction; prints "2"
print x * 2 # Multiplication; prints "6"
print x ** 2 # Exponentiation; prints "9"
x += 1
print x # Prints "4"
x *= 2
print x # Prints "8"
y = 2.5
print type(y) # Prints "<type 'float'>"
print y, y + 1, y * 2, y ** 2 # Prints "2.5 3.5 5.0 6.25"
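A note on versions: these slides use Python 2, where print is a statement. In Python 3, print is a function, so the same arithmetic would be written like this (a minimal sketch of the difference):

```python
x = 3
print(type(x))  # In Python 3 this shows <class 'int'>
print(x ** 2)   # 9
x += 1
print(x)        # 4
y = 2.5
print(y, y + 1, y * 2, y ** 2)  # 2.5 3.5 5.0 6.25
```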
Booleans:
t = True
f = False
print type(t) # Prints "<type 'bool'>"
print t and f # Logical AND; prints "False"
print t or f # Logical OR; prints "True"
print not t # Logical NOT; prints "False"
print t != f # Logical XOR; prints "True"
Strings:
hello = 'hello' # String literals can use single quotes
world = "world" # or double quotes; it does not matter.
print hello # Prints "hello"
print len(hello) # String length; prints "5"
hw = hello + ' ' + world # String concatenation
print hw # prints "hello world"
hw12 = '%s %s %d' % (hello, world, 12) # sprintf style string formatting
print hw12 # prints "hello world 12"
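As an aside, the same string can also be built with the str.format method (available since Python 2.6), an alternative the slides do not cover:

```python
hello = 'hello'
world = 'world'
# str.format fills {} placeholders in order
hw12 = '{} {} {}'.format(hello, world, 12)
print(hw12)  # hello world 12
```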
String objects have a bunch of useful methods; for example:
s = "hello"
print s.capitalize() # Capitalize a string; prints "Hello"
print s.upper() # Convert a string to uppercase; prints "HELLO"
print s.rjust(7) # Right-justify a string, padding with spaces; prints " hello"
print s.center(7) # Center a string, padding with spaces; prints " hello "
print s.replace('l', '(ell)') # Replace all instances of one substring with another;
                              # prints "he(ell)(ell)o"
print ' world '.strip() # Strip leading and trailing whitespace; prints "world"
xs = [3, 1, 2] # Create a list
print xs, xs[2] # Prints "[3, 1, 2] 2"
print xs[-1] # Negative indices count from the end of the list; prints "2"
xs[2] = 'foo' # Lists can contain elements of different types
print xs # Prints "[3, 1, 'foo']"
xs.append('bar') # Add a new element to the end of the list
print xs # Prints "[3, 1, 'foo', 'bar']"
x = xs.pop() # Remove and return the last element of the list
print x, xs # Prints "bar [3, 1, 'foo']"
nums = range(5) # range is a built-in function that creates a list of integers
print nums # Prints "[0, 1, 2, 3, 4]"
print nums[2:4] # Get a slice from index 2 to 4 (exclusive); prints "[2, 3]"
print nums[2:] # Get a slice from index 2 to the end; prints "[2, 3, 4]"
print nums[:2] # Get a slice from the start to index 2 (exclusive); prints "[0, 1]"
print nums[:] # Get a slice of the whole list; prints "[0, 1, 2, 3, 4]"
print nums[:-1] # Slice indices can be negative; prints "[0, 1, 2, 3]"
nums[2:4] = [8, 9] # Assign a new sublist to a slice
print nums # Prints "[0, 1, 8, 9, 4]"
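Slices can also take a step as a third value; a negative step walks the list backwards (a small sketch, shown in Python 3 syntax):

```python
nums = [0, 1, 2, 3, 4]
print(nums[::2])   # Every second element: [0, 2, 4]
print(nums[::-1])  # A reversed copy of the list: [4, 3, 2, 1, 0]
```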
We will see slicing again in the context of numpy arrays.
animals = ['cat', 'dog', 'monkey']
for animal in animals:
    print animal
# Prints "cat", "dog", "monkey", each on its own line.
If you want access to the index of each element within the body of a loop, use the built-in
enumerate function:
animals = ['cat', 'dog', 'monkey']
for idx, animal in enumerate(animals):
    print '#%d: %s' % (idx + 1, animal)
# Prints "#1: cat", "#2: dog", "#3: monkey", each on its own line
As a simple example, consider the following code that computes square numbers:
nums = [0, 1, 2, 3, 4]
squares = []
for x in nums:
    squares.append(x ** 2)
print squares # Prints [0, 1, 4, 9, 16]
You can make this code simpler using a list comprehension:
nums = [0, 1, 2, 3, 4]
squares = [x ** 2 for x in nums]
print squares # Prints [0, 1, 4, 9, 16]
List comprehensions can also contain conditions:
nums = [0, 1, 2, 3, 4]
even_squares = [x ** 2 for x in nums if x % 2 == 0]
print even_squares # Prints "[0, 4, 16]"
You can use it like this:
d = {'cat': 'cute', 'dog': 'furry'} # Create a new dictionary with some data
print d['cat'] # Get an entry from a dictionary; prints "cute"
print 'cat' in d # Check if a dictionary has a given key; prints "True"
d['fish'] = 'wet' # Set an entry in a dictionary
print d['fish'] # Prints "wet"
# print d['monkey'] # KeyError: 'monkey' not a key of d
print d.get('monkey', 'N/A') # Get an element with a default; prints "N/A"
print d.get('fish', 'N/A') # Get an element with a default; prints "wet"
del d['fish'] # Remove an element from a dictionary
print d.get('fish', 'N/A') # "fish" is no longer a key; prints "N/A"
Loops: It is easy to iterate over the keys in a dictionary:
d = {'person': 2, 'cat': 4, 'spider': 8}
for animal in d:
    legs = d[animal]
    print 'A %s has %d legs' % (animal, legs)
# Prints "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs"
If you want access to keys and their corresponding values, use the iteritems method:
d = {'person': 2, 'cat': 4, 'spider': 8}
for animal, legs in d.iteritems():
    print 'A %s has %d legs' % (animal, legs)
# Prints "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs"
Dictionary comprehensions: These are similar to list comprehensions, but allow you to easily
construct dictionaries. For example:
nums = [0, 1, 2, 3, 4]
even_num_to_square = {x: x ** 2 for x in nums if x % 2 == 0}
print even_num_to_square # Prints "{0: 0, 2: 4, 4: 16}"
As a simple example, consider the following:
animals = {'cat', 'dog'}
print 'cat' in animals # Check if an element is in a set; prints "True"
print 'fish' in animals # prints "False"
animals.add('fish') # Add an element to a set
print 'fish' in animals # Prints "True"
print len(animals) # Number of elements in a set; prints "3"
animals.add('cat') # Adding an element that is already in the set does nothing
print len(animals) # Prints "3"
animals.remove('cat') # Remove an element from a set
print len(animals) # Prints "2"
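Sets also support the usual mathematical operations such as union, intersection, and difference (a sketch, in Python 3 syntax):

```python
a = {'cat', 'dog', 'fish'}
b = {'dog', 'bird'}
print(a | b)  # Union: all elements in either set
print(a & b)  # Intersection: elements in both sets; {'dog'}
print(a - b)  # Difference: elements of a not in b
```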
As usual, everything you want to know about sets can be found in the documentation.
Loops: Iterating over a set has the same syntax as iterating over a list; however since sets are
unordered, you cannot make assumptions about the order in which you visit the elements of the
set:
animals = {'cat', 'dog', 'fish'}
for idx, animal in enumerate(animals):
    print '#%d: %s' % (idx + 1, animal)
# Prints "#1: fish", "#2: dog", "#3: cat"
Set comprehensions: Like lists and dictionaries, we can easily construct sets using set
comprehensions:
from math import sqrt
nums = {int(sqrt(x)) for x in range(30)}
print nums # Prints "set([0, 1, 2, 3, 4, 5])"
Here is a trivial example:
d = {(x, x + 1): x for x in range(10)} # Create a dictionary with tuple keys
t = (5, 6) # Create a tuple
print type(t) # Prints "<type 'tuple'>"
print d[t] # Prints "5"
print d[(1, 2)] # Prints "1"
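Unlike lists, tuples are immutable, and they can be unpacked into separate variables (a sketch, in Python 3 syntax):

```python
t = (5, 6)
x, y = t      # Unpack the tuple into two variables
print(x, y)   # 5 6
try:
    t[0] = 1  # Tuples cannot be modified in place
except TypeError:
    print('tuples are immutable')
```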
For example:
def sign(x):
    if x > 0:
        return 'positive'
    elif x < 0:
        return 'negative'
    else:
        return 'zero'

for x in [-1, 0, 1]:
    print sign(x)
# Prints "negative", "zero", "positive"
We will often define functions to take optional keyword arguments, like this:
def hello(name, loud=False):
    if loud:
        print 'HELLO, %s!' % name.upper()
    else:
        print 'Hello, %s!' % name

hello('Bob') # Prints "Hello, Bob!"
hello('Fred', loud=True) # Prints "HELLO, FRED!"
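Once positional arguments are supplied, keyword arguments can be passed in any order. A small sketch with a hypothetical greet function taking two defaults (Python 3 syntax; this function is illustrative, not from the slides):

```python
def greet(name, greeting='Hello', punctuation='!'):
    # Both keyword arguments have defaults; callers may override either one
    return '%s, %s%s' % (greeting, name, punctuation)

print(greet('Bob'))                    # Hello, Bob!
print(greet('Fred', punctuation='?'))  # Hello, Fred?
print(greet('Ann', greeting='Hi'))     # Hi, Ann!
```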
class Greeter:
    # Constructor
    def __init__(self, name):
        self.name = name # Create an instance variable

    # Instance method
    def greet(self, loud=False):
        if loud:
            print 'HELLO, %s!' % self.name.upper()
        else:
            print 'Hello, %s' % self.name

g = Greeter('Fred') # Construct an instance of the Greeter class
g.greet() # Call an instance method; prints "Hello, Fred"
g.greet(loud=True) # Call an instance method; prints "HELLO, FRED!"
We can initialize numpy arrays from nested Python lists, and access elements using square
brackets:
import numpy as np
a = np.array([1, 2, 3]) # Create a rank 1 array
print type(a) # Prints "<type 'numpy.ndarray'>"
print a.shape # Prints "(3,)"
print a[0], a[1], a[2] # Prints "1 2 3"
a[0] = 5 # Change an element of the array
print a # Prints "[5 2 3]"
b = np.array([[1,2,3],[4,5,6]]) # Create a rank 2 array
print b.shape # Prints "(2, 3)"
print b[0, 0], b[0, 1], b[1, 0] # Prints "1 2 4"
Numpy also provides many functions to create arrays:
import numpy as np
a = np.zeros((2,2)) # Create an array of all zeros
print a # Prints "[[ 0. 0.]
# [ 0. 0.]]"
b = np.ones((1,2)) # Create an array of all ones
print b # Prints "[[ 1. 1.]]"
c = np.full((2,2), 7) # Create a constant array
print c # Prints "[[ 7. 7.]
# [ 7. 7.]]"
d = np.eye(2) # Create a 2x2 identity matrix
print d # Prints "[[ 1. 0.]
# [ 0. 1.]]"
e = np.random.random((2,2)) # Create an array filled with random values
print e # Might print "[[ 0.91940167 0.08143941]
# [ 0.68744134 0.87236687]]"
Since arrays may be multidimensional, you must specify a slice for each dimension of the array:
import numpy as np
# Create the following rank 2 array with shape (3, 4)
# [[ 1 2 3 4]
# [ 5 6 7 8]
# [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
# [6 7]]
b = a[:2, 1:3]
# A slice of an array is a view into the same data, so modifying it
# will modify the original array.
print a[0, 1] # Prints "2"
b[0, 0] = 77 # b[0, 0] is the same piece of data as a[0, 1]
print a[0, 1] # Prints "77"
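If you need a subarray that does not alias the original data, call copy() on the slice; modifying the copy then leaves the original untouched (a sketch, in Python 3 syntax):

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
b = a[:2, 1:3].copy()  # An independent copy, not a view into a
b[0, 0] = 77           # Modify the copy only
print(a[0, 1])         # Still 2; the original array is unchanged
```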
You can also mix integer indexing with slice indexing. However, doing so will yield an array of
lower rank than the original array. Note that this is quite different from the way that MATLAB
handles array slicing:
import numpy as np
# Create the following rank 2 array with shape (3, 4)
# [[ 1 2 3 4]
# [ 5 6 7 8]
# [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
# Two ways of accessing the data in the middle row of the array.
# Mixing integer indexing with slices yields an array of lower rank,
# while using only slices yields an array of the same rank as the
# original array:
row_r1 = a[1, :] # Rank 1 view of the second row of a
row_r2 = a[1:2, :] # Rank 2 view of the second row of a
print row_r1, row_r1.shape # Prints "[5 6 7 8] (4,)"
print row_r2, row_r2.shape # Prints "[[5 6 7 8]] (1, 4)"
# We can make the same distinction when accessing columns of an array:
col_r1 = a[:, 1]
col_r2 = a[:, 1:2]
print col_r1, col_r1.shape # Prints "[ 2 6 10] (3,)"
print col_r2, col_r2.shape # Prints "[[ 2]
# [ 6]
# [10]] (3, 1)"
Here is an example:
import numpy as np
a = np.array([[1,2], [3, 4], [5, 6]])
# An example of integer array indexing.
# The returned array will have shape (3,):
print a[[0, 1, 2], [0, 1, 0]] # Prints "[1 4 5]"
# The above example of integer array indexing is equivalent to this:
print np.array([a[0, 0], a[1, 1], a[2, 0]]) # Prints "[1 4 5]"
# When using integer array indexing, you can reuse the same
# element from the source array:
print a[[0, 0], [1, 1]] # Prints "[2 2]"
# Equivalent to the previous integer array indexing example
print np.array([a[0, 1], a[0, 1]]) # Prints "[2 2]"
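One handy use of integer array indexing is selecting or mutating one element from each row of a matrix (a sketch, in Python 3 syntax; the arrays here are illustrative):

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
b = np.array([0, 2, 0, 1])   # One column index per row
print(a[np.arange(4), b])    # Selects one element per row: [ 1  6  7 11]
a[np.arange(4), b] += 10     # Mutates exactly those elements in place
print(a)
```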
Here is an example:
import numpy as np
a = np.array([[1,2], [3, 4], [5, 6]])
bool_idx = (a > 2) # Find the elements of a that are bigger than 2;
# this returns a numpy array of Booleans of the same
# shape as a, where each slot of bool_idx tells
# whether that element of a is > 2.
print bool_idx # Prints "[[False False]
# [ True True]
# [ True True]]"
# We use boolean array indexing to construct a rank 1 array
# consisting of the elements of a corresponding to the True values
# of bool_idx
print a[bool_idx] # Prints "[3 4 5 6]"
# We can do all of the above in a single concise statement:
print a[a > 2] # Prints "[3 4 5 6]"
Here is an example:
import numpy as np
x = np.array([1, 2]) # Let numpy choose the datatype
print x.dtype # Prints "int64"
x = np.array([1.0, 2.0]) # Let numpy choose the datatype
print x.dtype # Prints "float64"
x = np.array([1, 2], dtype=np.int64) # Force a particular datatype
print x.dtype # Prints "int64"
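An existing array can also be converted to another datatype with astype (a sketch, in Python 3 syntax):

```python
import numpy as np

x = np.array([1.7, 2.2])
y = x.astype(np.int64)  # Cast float64 -> int64 (truncates toward zero)
print(y)        # [1 2]
print(y.dtype)  # int64
```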
import numpy as np
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)
# Elementwise sum; both produce the array
# [[ 6.0 8.0]
# [10.0 12.0]]
print x + y
print np.add(x, y)
# Elementwise difference; both produce the array
# [[-4.0 -4.0]
# [-4.0 -4.0]]
print x - y
print np.subtract(x, y)
# Elementwise product; both produce the array
# [[ 5.0 12.0]
# [21.0 32.0]]
print x * y
print np.multiply(x, y)
# Elementwise division; both produce the array
# [[ 0.2 0.33333333]
# [ 0.42857143 0.5 ]]
print x / y
print np.divide(x, y)
# Elementwise square root; produces the array
# [[ 1. 1.41421356]
# [ 1.73205081 2. ]]
print np.sqrt(x)
import numpy as np
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])
v = np.array([9,10])
w = np.array([11, 12])
# Inner product of vectors; both produce 219
print v.dot(w)
print np.dot(v, w)
# Matrix / vector product; both produce the rank 1 array [29 67]
print x.dot(v)
print np.dot(x, v)
# Matrix / matrix product; both produce the rank 2 array
# [[19 22]
# [43 50]]
print x.dot(y)
print np.dot(x, y)
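On Python 3.5 or newer, the same products can be written with the @ matrix-multiplication operator:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
v = np.array([9, 10])
w = np.array([11, 12])
print(v @ w)  # Inner product of vectors: 219
print(x @ v)  # Matrix / vector product: [29 67]
print(x @ y)  # Matrix / matrix product: [[19 22], [43 50]]
```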
import numpy as np
x = np.array([[1,2],[3,4]])
print np.sum(x) # Compute sum of all elements; prints "10"
print np.sum(x, axis=0) # Compute sum of each column; prints "[4 6]"
print np.sum(x, axis=1) # Compute sum of each row; prints "[3 7]"
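Other reductions such as mean, max, and min follow the same axis convention as np.sum (a sketch, in Python 3 syntax):

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
print(np.mean(x))         # Mean of all elements: 2.5
print(np.max(x, axis=0))  # Maximum of each column: [3 4]
print(np.min(x, axis=1))  # Minimum of each row: [1 3]
```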
import numpy as np
x = np.array([[1,2], [3,4]])
print x # Prints "[[1 2]
# [3 4]]"
print x.T # Prints "[[1 3]
# [2 4]]"
# Note that taking the transpose of a rank 1 array does nothing:
v = np.array([1,2,3])
print v # Prints "[1 2 3]"
print v.T # Prints "[1 2 3]"
import numpy as np
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = np.empty_like(x) # Create an empty matrix with the same shape as x
# Add the vector v to each row of the matrix x with an explicit loop
for i in range(4):
    y[i, :] = x[i, :] + v
# Now y is the following
# [[ 2 2 4]
# [ 5 5 7]
# [ 8 8 10]
# [11 11 13]]
print y
This works; however when the matrix x is very large, computing an explicit loop in Python could
be slow. Note that adding the vector v to each row of the matrix x is equivalent to forming a
matrix vv by stacking multiple copies of v vertically, then performing elementwise summation of x
and vv. We could implement this approach like this:
import numpy as np
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
vv = np.tile(v, (4, 1)) # Stack 4 copies of v on top of each other
print vv # Prints "[[1 0 1]
# [1 0 1]
# [1 0 1]
# [1 0 1]]"
y = x + vv # Add x and vv elementwise
print y # Prints "[[ 2 2 4]
# [ 5 5 7]
# [ 8 8 10]
# [11 11 13]]"
Numpy broadcasting allows us to perform this computation without actually creating multiple copies of v. Consider this version, using broadcasting:
import numpy as np
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = x + v # Add v to each row of x using broadcasting
print y # Prints "[[ 2 2 4]
# [ 5 5 7]
# [ 8 8 10]
# [11 11 13]]"
The line y = x + v works even though x has shape (4, 3) and v has shape (3,) due to
broadcasting; this line works as if v actually had shape (4, 3), where each row was a
copy of v, and the sum was performed elementwise.
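The general broadcasting rule: numpy compares shapes from the trailing dimension backwards, and two dimensions are compatible when they are equal or one of them is 1. A short sketch of shapes that do and do not broadcast (Python 3 syntax):

```python
import numpy as np

x = np.ones((4, 3))
v = np.ones(3)        # Shape (3,) broadcasts against (4, 3)
w = np.ones(4)        # Shape (4,) does NOT broadcast against (4, 3)
print((x + v).shape)  # (4, 3)
try:
    x + w
except ValueError:
    print('shapes (4, 3) and (4,) are incompatible')
# Reshaping w to (4, 1) makes the trailing dimensions compatible:
print((x + w.reshape(4, 1)).shape)  # (4, 3)
```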
There are currently more than 60 universal functions defined in numpy on one or more types,
covering a wide variety of operations. Some of these ufuncs are called automatically on arrays
when the relevant infix notation is used (e.g., add(a, b) is called internally when a + b is written
and a or b is an ndarray). Nevertheless, you may still want to use the ufunc call in order to use the
optional output argument(s) to place the output(s) in an object (or objects) of your choice.
Recall that each ufunc operates element-by-element. Therefore, each ufunc will be described as if
acting on a set of scalar inputs to return a set of scalar outputs.
Math operations
add(x1, x2[, out]) Add arguments element-wise.
subtract(x1, x2[, out]) Subtract arguments, element-wise.
multiply(x1, x2[, out]) Multiply arguments element-wise.
divide(x1, x2[, out]) Divide arguments element-wise.
logaddexp(x1, x2[, out]) Logarithm of the sum of exponentiations of the inputs.
logaddexp2(x1, x2[, out]) Logarithm of the sum of exponentiations of the inputs in base-2.
true_divide(x1, x2[, out]) Returns a true division of the inputs, element-wise.
floor_divide(x1, x2[, out]) Return the largest integer smaller or equal to the division of the
inputs.
negative(x[, out]) Numerical negative, element-wise.
power(x1, x2[, out]) First array elements raised to powers from second array, element-
wise.
remainder(x1, x2[, out]) Return element-wise remainder of division.
mod(x1, x2[, out]) Return element-wise remainder of division.
fmod(x1, x2[, out]) Return the element-wise remainder of division.
absolute(x[, out]) Calculate the absolute value element-wise.
rint(x[, out]) Round elements of the array to the nearest integer.
sign(x[, out]) Returns an element-wise indication of the sign of a number.
conj(x[, out]) Return the complex conjugate, element-wise.
exp(x[, out]) Calculate the exponential of all elements in the input array.
exp2(x[, out]) Calculate 2**p for all p in the input array.
log(x[, out]) Natural logarithm, element-wise.
log2(x[, out]) Base-2 logarithm of x.
log10(x[, out]) Return the base 10 logarithm of the input array, element-wise.
expm1(x[, out]) Calculate exp(x) - 1 for all elements in the array.
log1p(x[, out]) Return the natural logarithm of one plus the input array,
element-wise.
sqrt(x[, out]) Return the positive square-root of an array, element-wise.
square(x[, out]) Return the element-wise square of the input.
reciprocal(x[, out]) Return the reciprocal of the argument, element-wise.
ones_like(a[, dtype, order, subok]) Return an array of ones with the same shape and type as a given array.
Tip
The optional output arguments can be used to help you save memory for large
calculations. If your arrays are large, complicated expressions can take longer than
absolutely necessary due to the creation and (later) destruction of temporary
calculation spaces. For example, the expression G = a * b + c is equivalent to t1 = a * b;
G = t1 + c; del t1. It will be more quickly executed as G = a * b; add(G, c, G), which
is the same as G = a * b; G += c.
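The Tip above can be sketched concretely with the optional out argument, which writes the result into an existing array instead of allocating a new temporary (Python 3 syntax; the arrays are illustrative):

```python
import numpy as np

a = np.arange(4.0)       # [0. 1. 2. 3.]
b = np.arange(4.0)
c = np.ones(4)
g = a * b                # Allocates g = [0. 1. 4. 9.]
np.add(g, c, out=g)      # Adds c into g in place; no extra temporary array
print(g)                 # [ 1.  2.  5. 10.]
```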
Trigonometric functions
All trigonometric functions use radians when an angle is called for. The ratio of degrees to radians
is 180°/π.
sin(x[, out]) Trigonometric sine, element-wise.
cos(x[, out]) Cosine element-wise.
tan(x[, out]) Compute tangent element-wise.
arcsin(x[, out]) Inverse sine, element-wise.
arccos(x[, out]) Trigonometric inverse cosine, element-wise.
arctan(x[, out]) Trigonometric inverse tangent, element-wise.
arctan2(x1, x2[, out]) Element-wise arc tangent of x1/x2 choosing the quadrant correctly.
hypot(x1, x2[, out]) Given the “legs” of a right triangle, return its hypotenuse.
sinh(x[, out]) Hyperbolic sine, element-wise.
cosh(x[, out]) Hyperbolic cosine, element-wise.
tanh(x[, out]) Compute hyperbolic tangent element-wise.
arcsinh(x[, out]) Inverse hyperbolic sine element-wise.
arccosh(x[, out]) Inverse hyperbolic cosine, element-wise.
arctanh(x[, out]) Inverse hyperbolic tangent element-wise.
deg2rad(x[, out]) Convert angles from degrees to radians.
rad2deg(x[, out]) Convert angles from radians to degrees.
Bit-twiddling functions
These functions all require integer arguments and they manipulate the bit-pattern of those
arguments.
bitwise_and(x1, x2[, out]) Compute the bit-wise AND of two arrays element-wise.
bitwise_or(x1, x2[, out]) Compute the bit-wise OR of two arrays element-wise.
bitwise_xor(x1, x2[, out]) Compute the bit-wise XOR of two arrays element-wise.
invert(x[, out]) Compute bit-wise inversion, or bit-wise NOT, element-wise.
left_shift(x1, x2[, out]) Shift the bits of an integer to the left.
right_shift(x1, x2[, out]) Shift the bits of an integer to the right.
Comparison functions
greater(x1, x2[, out]) Return the truth value of (x1 > x2) element-wise.
greater_equal(x1, x2[, out]) Return the truth value of (x1 >= x2) element-wise.
less(x1, x2[, out]) Return the truth value of (x1 < x2) element-wise.
less_equal(x1, x2[, out]) Return the truth value of (x1 <= x2) element-wise.
not_equal(x1, x2[, out]) Return (x1 != x2) element-wise.
equal(x1, x2[, out]) Return (x1 == x2) element-wise.
logical_and(x1, x2[, out]) Compute the truth value of x1 AND x2 element-wise.
logical_or(x1, x2[, out]) Compute the truth value of x1 OR x2 element-wise.
logical_xor(x1, x2[, out]) Compute the truth value of x1 XOR x2, element-wise.
logical_not(x[, out]) Compute the truth value of NOT x element-wise.
Floating functions
Recall that all of these functions work element-by-element over an array, returning an array
output. The description details only a single operation.
isreal(x) Returns a bool array, where True if input element is real.
iscomplex(x) Returns a bool array, where True if input element is complex.
isfinite(x[, out]) Test element-wise for finiteness (not infinity or not Not a Number).
isinf(x[, out]) Test element-wise for positive or negative infinity.
isnan(x[, out]) Test element-wise for NaN and return result as a boolean array.
signbit(x[, out]) Returns element-wise True where signbit is set (less than zero).
copysign(x1, x2[, out]) Change the sign of x1 to that of x2, element-wise.
nextafter(x1, x2[, out]) Return the next floating-point value after x1 towards x2, element-
wise.
modf(x[, out1, out2]) Return the fractional and integral parts of an array, element-wise.
ldexp(x1, x2[, out]) Returns x1 * 2**x2, element-wise.
frexp(x[, out1, out2]) Decompose the elements of x into mantissa and twos exponent.
fmod(x1, x2[, out]) Return the element-wise remainder of division.
floor(x[, out]) Return the floor of the input, element-wise.
ceil(x[, out]) Return the ceiling of the input, element-wise.
trunc(x[, out]) Return the truncated value of the input, element-wise.
Here are some applications of broadcasting:
import numpy as np
# Compute outer product of vectors
v = np.array([1,2,3]) # v has shape (3,)
w = np.array([4,5]) # w has shape (2,)
# To compute an outer product, we first reshape v to be a column
# vector of shape (3, 1); we can then broadcast it against w to yield
# an output of shape (3, 2), which is the outer product of v and w:
# [[ 4 5]
# [ 8 10]
# [12 15]]
print np.reshape(v, (3, 1)) * w
# Add a vector to each row of a matrix
x = np.array([[1,2,3], [4,5,6]])
# x has shape (2, 3) and v has shape (3,) so they broadcast to (2, 3),
# giving the following matrix:
# [[2 4 6]
# [5 7 9]]
print x + v
# Add a vector to each column of a matrix
# x has shape (2, 3) and w has shape (2,).
# If we transpose x then it has shape (3, 2) and can be broadcast
# against w to yield a result of shape (3, 2); transposing this result
# yields the final result of shape (2, 3) which is the matrix x with
# the vector w added to each column. Gives the following matrix:
# [[ 5 6 7]
# [ 9 10 11]]
print (x.T + w).T
# Another solution is to reshape w to be a column vector of shape (2, 1);
# we can then broadcast it directly against x to produce the same
# output.
print x + np.reshape(w, (2, 1))
# Multiply a matrix by a constant:
# x has shape (2, 3). Numpy treats scalars as arrays of shape ();
# these can be broadcast together to shape (2, 3), producing the
# following array:
# [[ 2 4 6]
# [ 8 10 12]]
print x * 2
Here is a simple example:
import numpy as np
import matplotlib.pyplot as plt
# Compute the x and y coordinates for points on a sine curve
x = np.arange(0, 3 * np.pi, 0.1)
y = np.sin(x)
# Plot the points using matplotlib
plt.plot(x, y)
plt.show() # You must call plt.show() to make graphics appear.
With just a little bit of extra work we can easily plot multiple lines at once, and add a title, legend,
and axis labels:
import numpy as np
import matplotlib.pyplot as plt
# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)
# Plot the points using matplotlib
plt.plot(x, y_sin)
plt.plot(x, y_cos)
plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.title('Sine and Cosine')
plt.legend(['Sine', 'Cosine'])
plt.show()
Here is an example:
import numpy as np
import matplotlib.pyplot as plt
# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)
# Set up a subplot grid that has height 2 and width 1,
# and set the first such subplot as active.
plt.subplot(2, 1, 1)
# Make the first plot
plt.plot(x, y_sin)
plt.title('Sine')
# Set the second subplot as active, and make the second plot.
plt.subplot(2, 1, 2)
plt.plot(x, y_cos)
plt.title('Cosine')
# Show the figure.
plt.show()
Here is an example:
import numpy as np
from scipy.misc import imread, imresize
import matplotlib.pyplot as plt
img = imread('assets/cat.jpg')
img_tinted = img * [1, 0.95, 0.9]
# Show the original image
plt.subplot(1, 2, 1)
plt.imshow(img)
# Show the tinted image
plt.subplot(1, 2, 2)
# A slight gotcha with imshow is that it might give strange results
# if presented with data that is not uint8. To work around this, we
# explicitly cast the image to uint8 before displaying it.
plt.imshow(np.uint8(img_tinted))
plt.show()
PANDAS
Lesson 1
Create Data - We begin by creating our own data set for analysis. This prevents the end user reading this tutorial from having to download any files to replicate the results below. We will export this data set to a text file so that you can get some experience pulling data from a text file.
Get Data - We will learn how to read in the text file. The data consist of baby names and the number of baby names born in the year 1880.
Prepare Data - Here we will simply take a look at the data and make sure it is clean. By clean I mean we will take a look inside the contents of the text file and look for any anomalies. These can include missing data, inconsistencies in the data, or any other data that seems out of place. If any are found we will then have to make decisions on what to do with these records.
Analyze Data - We will simply find the most popular name in a specific year.
Present Data - Through tabular data and a graph, clearly show the end user what is the most popular name in a specific year.
The pandas library is used for all the data analysis excluding a small piece of the data presentation section. The matplotlib library will only be needed for the data presentation section. Importing the libraries is the first step we will take in the lesson.
# Import all libraries needed for the tutorial
# General syntax to import specific functions in a library:
##from (library) import (specific library function)
from pandas import DataFrame, read_csv
# General syntax to import a library but no functions:
##import (library) as (give the library a nickname/alias)
import matplotlib.pyplot as plt
import pandas as pd #this is how I usually import pandas
import sys #only needed to determine Python version number
# Enable inline plotting
%matplotlib inline
print 'Python version ' + sys.version
print 'Pandas version ' + pd.__version__
Python version 2.7.5 |Anaconda 2.1.0 (64-bit)| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]
Pandas version 0.15.2
Create Data
The data set will consist of 5 baby names and the number of births recorded for that year (1880).
# The initial set of baby names and birth rates
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]
To merge these two lists together we will use the zip function.
zip?
BabyDataSet = zip(names,births)
BabyDataSet
[('Bob', 968), ('Jessica', 155), ('Mary', 77), ('John', 578), ('Mel', 973)]
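Note that this output assumes Python 2, where zip returns a list. In Python 3, zip returns a lazy iterator, so you would wrap it in list() to get the same result:

```python
names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]
BabyDataSet = list(zip(names, births))  # Materialize the iterator into a list
print(BabyDataSet)
```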
We are basically done creating the data set. We will now use the pandas library to export this data set into a csv file.
df will be a DataFrame object. You can think of this object as holding the contents of the BabyDataSet in a format similar to a sql table or an excel spreadsheet. Let's take a look below at the contents inside df.
df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])
df
     Names  Births
0      Bob     968
1  Jessica     155
2     Mary      77
3     John     578
4      Mel     973

Export the dataframe to a csv file. We can name the file births1880.csv. The function to_csv will be used to export the file. The file will be saved in the same location as the notebook unless specified otherwise.
In [7]:
df.to_csv?
The only parameters we will use are index and header. Setting these parameters to False will prevent the index and header names from being exported. Change the values of these parameters to get a better understanding of their use.
In [8]:
df.to_csv('births1880.csv',index=False,header=False)
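To picture what to_csv produces with index=False and header=False, here is a stdlib-only sketch of the resulting file contents (illustrative; pandas does the equivalent internally):

```python
import csv
import io

BabyDataSet = [('Bob', 968), ('Jessica', 155), ('Mary', 77),
               ('John', 578), ('Mel', 973)]

buf = io.StringIO()
writer = csv.writer(buf)
# index=False, header=False: only the data rows are written
for name, count in BabyDataSet:
    writer.writerow([name, count])

print(buf.getvalue())
```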
Get Data
To pull in the csv file, we will use the pandas function read_csv. Let us take a look at this function and what inputs it takes.
In [9]:
read_csv?
Even though this function has many parameters, we will simply pass it the location of the text file.
Location = C:\Users\ENTER_USER_NAME.xy\startups\births1880.csv
Note: Depending on where you save your notebooks, you may need to modify the location above.
In [10]:
Location = r'C:\Users\david\notebooks\pandas\births1880.csv'
df = pd.read_csv(Location)
Notice the r before the string. Since backslashes are special characters in string literals, prefixing the string with r makes it a raw string, so the backslashes are not treated as escape sequences.
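The difference the r prefix makes can be checked directly — a sketch using \n and \b, two escapes that commonly corrupt Windows paths:

```python
plain = 'C:\new_folder\births1880.csv'  # \n becomes a newline, \b a backspace
raw = r'C:\new_folder\births1880.csv'   # raw string: backslashes kept literally

print('\n' in plain)  # the plain path was silently corrupted
print('\n' in raw)
```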
In [11]:
df
Out[11]:
       Bob  968
0  Jessica  155
1     Mary   77
2     John  578
3      Mel  973

This brings us to our first problem of the exercise. The read_csv function treated the first record in the csv file as the header names. This is obviously not correct since the text file did not provide us with header names.
To correct this we will pass the header parameter to the read_csv function and set it to None (None means null in Python).
In [12]:
df = pd.read_csv(Location, header=None)
df
Out[12]:
         0    1
0      Bob  968
1  Jessica  155
2     Mary   77
3     John  578
4      Mel  973
If we wanted to give the columns specific names, we would have to pass another parameter called names. We can also omit the header parameter.
In [13]:
df = pd.read_csv(Location, names=['Names','Births'])
df
Out[13]:
     Names  Births
0      Bob     968
1  Jessica     155
2     Mary      77
3     John     578
4      Mel     973
You can think of the numbers [0,1,2,3,4] as the row numbers in an Excel file. In pandas these are part of the index of the dataframe. You can think of the index as the primary key of a SQL table, with the exception that an index is allowed to have duplicates.

[Names, Births] can be thought of as column headers, similar to the ones found in an Excel spreadsheet or SQL database.
Delete the csv file now that we are done using it.
In [14]:
import os
os.remove(Location)
Prepare Data
The data we have consists of baby names and the number of births in the year 1880. We already know that we have 5 records and none of the records are missing (non-null values).
The Names column at this point is of no concern, since it most likely is just composed of alphanumeric strings (baby names). There is a chance of bad data in this column, but we will not worry about that at this point of the analysis. The Births column should just contain integers representing the number of babies born in a specific year with a specific name. We can check whether all the data in this column is of data type integer. It would not make sense for this column to have a data type of float. I would not worry about any possible outliers at this point of the analysis.

Realize that aside from the check we did on the "Names" column, briefly looking at the data inside the dataframe should be as far as we need to go at this stage of the game. As we continue in the data analysis life cycle we will have plenty of opportunities to find any issues with the data set.
In [15]:
# Check data type of the columns
df.dtypes
Out[15]:
Names object
Births int64
dtype: object
In [16]:
# Check data type of Births column
df.Births.dtype
Out[16]:
dtype('int64')
As you can see, the Births column is of type int64, thus no floats (decimal numbers) or alphanumeric characters will be present in this column.
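A rough pure-Python analogue of that dtype check: an int64 column corresponds to every value in a plain list being an int (illustrative only, not what pandas does internally):

```python
births = [968, 155, 77, 578, 973]

# dtype int64 on the column <=> every value is an integer
all_ints = all(isinstance(b, int) for b in births)
print(all_ints)
```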
Analyze Data
To find the most popular name, i.e. the baby name with the highest birth count, we can do one of the following:

- Sort the dataframe and select the top row
- Use the max() function to find the maximum value
In [17]:
# Method 1:
Sorted = df.sort(['Births'], ascending=False)
Sorted.head(1)
Out[17]:
   Names  Births
4    Mel     973

In [18]:
# Method 2:
df['Births'].max()
Out[18]:
973
Present Data
Here we can plot the Births column and label the graph to show the end user the highest point on the graph. In conjunction with the table, the end user has a clear picture that Mel is the most popular baby name in the data set.

plot() is a convenient method with which pandas lets you painlessly plot the data in your dataframe. We learned how to find the maximum value of the Births column in the previous section. Finding the actual baby name that goes with the value 973 looks a bit tricky, so let's go over it.
Explain the pieces:

- df['Names'] - This is the entire list of baby names, the entire Names column
- df['Births'] - This is the entire list of Births in the year 1880, the entire Births column
- df['Births'].max() - This is the maximum value found in the Births column

[df['Births'] == df['Births'].max()] IS EQUAL TO [Find all of the records in the Births column where it is equal to 973]
df['Names'][df['Births'] == df['Births'].max()] IS EQUAL TO [Select all of the records in the Names column WHERE the Births column is equal to 973]

An alternative way could have been to use the Sorted dataframe: Sorted['Names'].head(1).value
The str() function simply converts an object into a string.
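The select-where logic above can be sketched without pandas, against the original list of (name, births) pairs; an illustrative analogue, not what pandas does internally:

```python
BabyDataSet = [('Bob', 968), ('Jessica', 155), ('Mary', 77),
               ('John', 578), ('Mel', 973)]

# df['Births'].max() analogue
max_births = max(b for _, b in BabyDataSet)

# df['Names'][df['Births'] == df['Births'].max()] analogue:
# keep every name whose birth count equals the maximum
top_names = [n for n, b in BabyDataSet if b == max_births]

print(max_births, top_names)
```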
In [19]:
# Create graph
df['Births'].plot()
# Maximum value in the data set
MaxValue = df['Births'].max()
# Name associated with the maximum value
MaxName = df['Names'][df['Births'] == df['Births'].max()].values
# Text to display on graph
Text = str(MaxValue) + " - " + MaxName
# Add text to graph
plt.annotate(Text, xy=(1, MaxValue), xytext=(8, 0),
             xycoords=('axes fraction', 'data'), textcoords='offset points')
print "The most popular name"
df[df['Births'] == df['Births'].max()]
#Sorted.head(1) can also be used
The most popular name
Out[19]:
   Names  Births
4    Mel     973
Lesson 2

In [1]:
# The usual preamble
import pandas as pd
import matplotlib.pyplot as plt

# Make the graphs a bit prettier, and bigger
pd.set_option('display.mpl_style', 'default')
pd.set_option('display.line_width', 5000)
pd.set_option('display.max_columns', 60)
# figsize() only exists under %pylab; the plain-matplotlib equivalent is:
plt.rcParams['figure.figsize'] = (15, 5)
We're going to use a new dataset here, to demonstrate how to deal with larger datasets. This is a subset of the 311 service requests from NYC Open Data.
In [2]:
complaints = pd.read_csv('../data/311-service-requests.csv')
2.1 What's even in it? (the summary)
When you look at a large dataframe, instead of showing you the contents of the dataframe, it'll show you a summary. This includes all the columns, and how many non-null values there are in each column.
In [3]:
complaints
Out[3]:
Int64Index: 111069 entries, 0 to 111068
Data columns (total 52 columns):
Unique Key 111069 non-null values
Created Date 111069 non-null values
Closed Date 60270 non-null values
Agency 111069 non-null values
Agency Name 111069 non-null values
Complaint Type 111069 non-null values
Descriptor 111068 non-null values
Location Type 79048 non-null values
Incident Zip 98813 non-null values
Incident Address 84441 non-null values
Street Name 84438 non-null values
https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9
Cross Street 1 84728 non-null values
Cross Street 2 84005 non-null values
Intersection Street 1 19364 non-null values
Intersection Street 2 19366 non-null values
Address Type 102247 non-null values
City 98860 non-null values
Landmark 95 non-null values
Facility Type 110938 non-null values
Status 111069 non-null values
Due Date 39239 non-null values
Resolution Action Updated Date 96507 non-null values
Community Board 111069 non-null values
Borough 111069 non-null values
X Coordinate (State Plane) 98143 non-null values
Y Coordinate (State Plane) 98143 non-null values
Park Facility Name 111069 non-null values
Park Borough 111069 non-null values
School Name 111069 non-null values
School Number 111052 non-null values
School Region 110524 non-null values
School Code 110524 non-null values
School Phone Number 111069 non-null values
School Address 111069 non-null values
School City 111069 non-null values
School State 111069 non-null values
School Zip 111069 non-null values
School Not Found 38984 non-null values
School or Citywide Complaint 0 non-null values
Vehicle Type 99 non-null values
Taxi Company Borough 117 non-null values
Taxi Pick Up Location 1059 non-null values
Bridge Highway Name 185 non-null values
Bridge Highway Direction 185 non-null values
Road Ramp 184 non-null values
Bridge Highway Segment 223 non-null values
Garage Lot Name 49 non-null values
Ferry Direction 37 non-null values
Ferry Terminal Name 336 non-null values
Latitude 98143 non-null values
Longitude 98143 non-null values
Location 98143 non-null values
dtypes: float64(5), int64(1), object(46)
2.2 Selecting columns and rows
To select a column, we index with the name of the column, like this:
In [4]:
complaints['Complaint Type']
Out[4]:
0 Noise - Street/Sidewalk
1 Illegal Parking
2 Noise - Commercial
3 Noise - Vehicle
4 Rodent
5 Noise - Commercial
6 Blocked Driveway
7 Noise - Commercial
8 Noise - Commercial
9 Noise - Commercial
10 Noise - House of Worship
11 Noise - Commercial
12 Illegal Parking
13 Noise - Vehicle
14 Rodent
...
111054 Noise - Street/Sidewalk
111055 Noise - Commercial
111056 Street Sign - Missing
111057 Noise
111058 Noise - Commercial
111059 Noise - Street/Sidewalk
111060 Noise
111061 Noise - Commercial
111062 Water System
111063 Water System
111064 Maintenance or Facility
111065 Illegal Parking
111066 Noise - Street/Sidewalk
111067 Noise - Commercial
111068 Blocked Driveway
Name: Complaint Type, Length: 111069, dtype: object
To get the first 5 rows of a dataframe, we can use a slice: df[:5].
This is a great way to get a sense for what kind of information is in the dataframe -- take a minute tolook at the contents and get a feel for this dataset.
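The [:5] notation is ordinary Python slicing; plain lists behave the same way, which is worth keeping in mind when reading pandas code:

```python
complaint_types = ['Noise - Street/Sidewalk', 'Illegal Parking',
                   'Noise - Commercial', 'Noise - Vehicle', 'Rodent',
                   'Noise - Commercial', 'Blocked Driveway']

first_five = complaint_types[:5]  # elements 0 through 4
print(first_five)
```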
In [5]:
complaints[:5]
Out[5]:
[Output: the first 5 rows shown across all 52 columns - Unique Key, Created Date, Closed Date, Agency, Agency Name, Complaint Type, Descriptor, Location Type, Incident Zip, Incident Address, Street Name, cross streets, City, Status, Community Board, Borough, State Plane coordinates, school fields, taxi and bridge fields, Latitude, Longitude, Location. The table is far too wide to reproduce cleanly here; the five rows are the complaints Noise - Street/Sidewalk, Illegal Parking, Noise - Commercial, Noise - Vehicle, and Rodent.]
We can combine these to get the first 5 rows of a column:
In [6]:
complaints['Complaint Type'][:5]
Out[6]:
0 Noise - Street/Sidewalk
1 Illegal Parking
2 Noise - Commercial
3 Noise - Vehicle
4 Rodent
Name: Complaint Type, dtype: object
and it doesn't matter which direction we do it in:
In [7]:
complaints[:5]['Complaint Type']
Out[7]:
0 Noise - Street/Sidewalk
1 Illegal Parking
2 Noise - Commercial
3 Noise - Vehicle
4 Rodent
Name: Complaint Type, dtype: object
2.3 Selecting multiple columns

What if we just want to know the complaint type and the borough, but not the rest of the information? Pandas makes it really easy to select a subset of the columns: just index with the list of columns you want.
In [8]:
complaints[['Complaint Type', 'Borough']]
Out[8]:
Int64Index: 111069 entries, 0 to 111068
Data columns (total 2 columns):
Complaint Type 111069 non-null values
Borough 111069 non-null values
dtypes: object(2)
That showed us a summary, and then we can look at the first 10 rows:
In [9]:
complaints[['Complaint Type', 'Borough']][:10]
Out[9]:
  Complaint Type           Borough
0 Noise - Street/Sidewalk  QUEENS
1 Illegal Parking          QUEENS
2 Noise - Commercial       MANHATTAN
3 Noise - Vehicle          MANHATTAN
4 Rodent                   MANHATTAN
5 Noise - Commercial       QUEENS
6 Blocked Driveway         QUEENS
7 Noise - Commercial       QUEENS
8 Noise - Commercial       MANHATTAN
9 Noise - Commercial       BROOKLYN
2.4 What's the most common complaint type?
This is a really easy question to answer! There's a .value_counts() method that we can use:
In [10]:
complaints['Complaint Type'].value_counts()
Out[10]:
HEATING 14200
GENERAL CONSTRUCTION 7471
Street Light Condition 7117
DOF Literature Request 5797
PLUMBING 5373
PAINT - PLASTER 5149
Blocked Driveway 4590
NONCONST 3998
Street Condition 3473
Illegal Parking 3343
Noise 3321
Traffic Signal Condition 3145
Dirty Conditions 2653
Water System 2636
Noise - Commercial 2578
...
Opinion for the Mayor 2
Window Guard 2
DFTA Literature Request 2
Legal Services Provider Complaint 2
Open Flame Permit 1
Snow 1
Municipal Parking Facility 1
X-Ray Machine/Equipment 1
Stalled Sites 1
DHS Income Savings Requirement 1
Tunnel Condition 1
Highway Sign - Damaged 1
Ferry Permit 1
Trans Fat 1
DWD 1
Length: 165, dtype: int64
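For intuition, value_counts() behaves much like the stdlib collections.Counter — a sketch of the counting step on a tiny made-up sample, not the pandas implementation:

```python
from collections import Counter

sample = ['Noise - Commercial', 'Rodent', 'Noise - Commercial',
          'Illegal Parking', 'Noise - Commercial']

counts = Counter(sample)
# most_common() sorts by frequency, descending, like value_counts()
print(counts.most_common())
```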
If we just wanted the top 10 most common complaints, we can do this:
In [11]:
complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10]
Out[11]:
HEATING 14200
GENERAL CONSTRUCTION 7471
Street Light Condition 7117
DOF Literature Request 5797
PLUMBING 5373
PAINT - PLASTER 5149
Blocked Driveway 4590
NONCONST 3998
Street Condition 3473
Illegal Parking 3343
dtype: int64
But it gets better! We can plot them!
In [12]:
complaint_counts[:10].plot(kind='bar')
Out[12]:
Lesson 3
Get Data - Our data set will consist of an Excel file containing customer counts per date. We will learn how to read in the Excel file for processing.
Prepare Data - The data is an irregular time series having duplicate dates. We will be challenged in compressing the data and coming up with next year's forecasted customer count.
Analyze Data - We use graphs to visualize trends and spot outliers. Some built-in computational tools will be used to calculate next year's forecasted customer count.
Present Data - The results will be plotted.
NOTE: Make sure you have looked through all previous lessons, as the knowledge learned in previous lessons will be needed for this exercise.
In [1]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy.random as np  # note: np aliases numpy.random here, not numpy
import sys
%matplotlib inline
In [2]:
print 'Python version ' + sys.version
print 'Pandas version: ' + pd.__version__
Python version 2.7.5 |Anaconda 2.1.0 (64-bit)| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]
Pandas version: 0.15.2
We will be creating our own test data for analysis.
In [3]:
# set seed
np.seed(111)
# Function to generate test data
def CreateDataSet(Number=1):
Output = []
for i in range(Number):
# Create a weekly (mondays) date range
rng = pd.date_range(start='1/1/2009', end='12/31/2012', freq='W-MON')
# Create random data
data = np.randint(low=25,high=1000,size=len(rng))
# Status pool
status = [1,2,3]
# Make a random list of statuses
random_status = [status[np.randint(low=0, high=len(status))] for i in range(len(rng))]
# State pool
states = ['GA','FL','fl','NY','NJ','TX']
# Make a random list of states
random_states = [states[np.randint(low=0, high=len(states))] for i in range(len(rng))]
Output.extend(zip(random_states, random_status, data, rng))
return Output
Now that we have a function to generate our test data, let's create some data and stick it into a dataframe.
In [4]:
dataset = CreateDataSet(4)
df = pd.DataFrame(data=dataset, columns=['State','Status','CustomerCount','StatusDate'])
df.info()
Int64Index: 836 entries, 0 to 835
Data columns (total 4 columns):
State 836 non-null object
Status 836 non-null int64
CustomerCount 836 non-null int64
StatusDate 836 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 32.7+ KB
In [5]:
df.head()
Out[5]:
  State  Status  CustomerCount  StatusDate
0    GA       1            877  2009-01-05
1    FL       1            901  2009-01-12
2    fl       3            749  2009-01-19
3    FL       3            111  2009-01-26
4    GA       1            300  2009-02-02

We are now going to save this dataframe into an Excel file, to then bring it back to a dataframe. We simply do this to show you how to read and write to Excel files.

We do not write the index values of the dataframe to the Excel file, since they are not meant to be part of our initial test data set.
In [6]:
# Save results to excel
df.to_excel('Lesson3.xlsx', index=False)
print 'Done'
Done
Grab Data from Excel
We will be using the read_excel function to read in data from an Excel file. The function allows you to read in specific sheets by name or location.
In [7]:
pd.read_excel?
Note: The location of the Excel file will be in the same folder as the notebook, unless specified otherwise.
In [8]:
# Location of file
Location = r'C:\Users\david\notebooks\pandas\Lesson3.xlsx'
# Parse a specific sheet
df = pd.read_excel(Location, 0, index_col='StatusDate')
df.dtypes
Out[8]:
State             object
Status             int64
CustomerCount      int64
dtype: object
In [9]:
df.index
Out[9]:
[2009-01-05, ..., 2012-12-31]
Length: 836, Freq: None, Timezone: None
In [10]:
df.head()
Out[10]:
           State  Status  CustomerCount
StatusDate
2009-01-05 GA 1 877
2009-01-12 FL 1 901
2009-01-19 fl 3 749
2009-01-26 FL 3 111
2009-02-02 GA 1 300
Prepare Data
This section attempts to clean up the data for analysis.

1. Make sure the state column is all in upper case
2. Only select records where the account status is equal to "1"
3. Merge (NJ and NY) to NY in the state column
4. Remove any outliers (any odd results in the data set)
Let's take a quick look at how some of the State values are upper case and some are lower case.
In [11]:
df['State'].unique()
Out[11]:
array([u'GA', u'FL', u'fl', u'TX', u'NY', u'NJ'], dtype=object)
To convert all the State values to upper case we will use the upper() method and the dataframe's apply method. The lambda function simply applies the upper function to each value in the State column.
In [12]:
# Clean State Column, convert to upper case
df['State'] = df.State.apply(lambda x: x.upper())
In [13]:
df['State'].unique()
Out[13]:
array([u'GA', u'FL', u'TX', u'NY', u'NJ'], dtype=object)
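Outside pandas, the same per-element transformation is an ordinary list comprehension; an illustrative analogue of apply(lambda x: x.upper()):

```python
states = ['GA', 'FL', 'fl', 'NY', 'NJ', 'TX']

# df.State.apply(lambda x: x.upper()) analogue
upper_states = [s.upper() for s in states]

# sorted(set(...)) mirrors unique(), up to ordering
print(sorted(set(upper_states)))
```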
In [14]:
# Only grab where Status == 1
mask = df['Status'] == 1
df = df[mask]
To turn the NJ states to NY we simply:

- df.State == 'NJ' - Find all records in the State column where they are equal to NJ.
- df['State'][df.State == 'NJ'] = 'NY' - For all records in the State column where they are equal to NJ, replace them with NY.
In [15]:
# Convert NJ to NY
mask = df.State == 'NJ'
df['State'][mask] = 'NY'
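The masked assignment has a simple plain-Python analogue (illustrative only; pandas performs the replacement on the column in place):

```python
states = ['GA', 'FL', 'TX', 'NY', 'NJ']

# df['State'][df.State == 'NJ'] = 'NY' analogue
states = ['NY' if s == 'NJ' else s for s in states]

print(states)
```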
Now we can see we have a much cleaner data set to work with.
In [16]:
df['State'].unique()
Out[16]:
array([u'GA', u'FL', u'NY', u'TX'], dtype=object)
At this point we may want to graph the data to check for any outliers or inconsistencies in the data. We will be using the plot() method of the dataframe.

As you can see from the graph below, it is not very conclusive and is probably a sign that we need to perform some more data preparation.
In [17]:
df['CustomerCount'].plot(figsize=(15,5));
If we take a look at the data, we begin to realize that there are multiple values for the same State, StatusDate, and Status combination. It is possible that this means the data you are working with is dirty/bad/inaccurate, but we will assume otherwise. We can assume this data set is a subset of a bigger data set, and if we simply add the values in the CustomerCount column per State, StatusDate, and Status, we will get the Total Customer Count per day.
In [18]:
sortdf = df[df['State']=='NY'].sort(axis=0)
sortdf.head(10)
Out[18]:
            State  Status  CustomerCount
StatusDate
2009-01-19     NY       1            522
2009-02-23     NY       1            710
2009-03-09     NY       1            992
2009-03-16     NY       1            355
2009-03-23     NY       1            728
2009-03-30     NY       1            863
2009-04-13     NY       1            520
2009-04-20     NY       1            820
2009-04-20     NY       1            937
2009-04-27     NY       1            447
Our task is now to create a new dataframe that compresses the data so we have daily customer counts per State and StatusDate. We can ignore the Status column since all the values in this column are equal to 1. To accomplish this we will use the dataframe's groupby and sum() functions.

Note that we had to use reset_index. If we did not, we would not have been able to group by both the State and the StatusDate, since the groupby function expects only columns as inputs. The reset_index function will bring the index StatusDate back to a column in the dataframe.
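The group-and-sum step can be sketched with a plain dictionary keyed on (State, StatusDate) — a stdlib analogue of groupby(...).sum(), shown on a few made-up rows:

```python
rows = [('NY', '2009-04-20', 820),
        ('NY', '2009-04-20', 937),
        ('FL', '2009-01-12', 901)]

daily = {}
for state, date, count in rows:
    key = (state, date)
    # accumulate customer counts per (State, StatusDate) pair
    daily[key] = daily.get(key, 0) + count

print(daily)
```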
In [19]:
# Group by State and StatusDate
Daily = df.reset_index().groupby(['State','StatusDate']).sum()
Daily.head()
Out[19]:
                   Status  CustomerCount
State StatusDate
FL    2009-01-12        1            901
      2009-02-02        1            653
      2009-03-23        1            752
      2009-04-06        2           1086
      2009-06-08        1            649

The State and StatusDate columns are automatically placed in the index of the Daily dataframe. You can think of the index as the primary key of a database table, but without the constraint of having unique values. Columns in the index, as you will see, allow us to easily select, plot, and perform calculations on the data.
Below we delete the Status column since it is all equal to one and no longer necessary.
In [20]:
del Daily['Status']
Daily.head()
Out[20]:
                  CustomerCount
State StatusDate
FL    2009-01-12            901
      2009-02-02            653
      2009-03-23            752
      2009-04-06           1086
      2009-06-08            649
In [21]:
# What is the index of the dataframe
Daily.index
Out[21]:
MultiIndex(levels=[[u'FL', u'GA', u'NY', u'TX'], [2009-01-05 00:00:00, 2009-0
1-12 00:00:00, 2009-01-19 00:00:00, 2009-02-02 00:00:00, 2009-02-23 00:00:00,
2009-03-09 00:00:00, 2009-03-16 00:00:00, 2009-03-23 00:00:00, 2009-03-30 00:
00:00, 2009-04-06 00:00:00, 2009-04-13 00:00:00, 2009-04-20 00:00:00, 2009-04
-27 00:00:00, 2009-05-04 00:00:00, 2009-05-11 00:00:00, 2009-05-18 00:00:00,
2009-05-25 00:00:00, 2009-06-08 00:00:00, 2009-06-22 00:00:00, 2009-07-06 00:
00:00, 2009-07-13 00:00:00, 2009-07-20 00:00:00, 2009-07-27 00:00:00, 2009-08
-10 00:00:00, 2009-08-17 00:00:00, 2009-08-24 00:00:00, 2009-08-31 00:00:00,
2009-09-07 00:00:00, 2009-09-14 00:00:00, 2009-09-21 00:00:00, 2009-09-28 00:
00:00, 2009-10-05 00:00:00, 2009-10-12 00:00:00, 2009-10-19 00:00:00, 2009-10
-26 00:00:00, 2009-11-02 00:00:00, 2009-11-23 00:00:00, 2009-11-30 00:00:00,
2009-12-07 00:00:00, 2009-12-14 00:00:00, 2010-01-04 00:00:00, 2010-01-11 00:
00:00, 2010-01-18 00:00:00, 2010-01-25 00:00:00, 2010-02-08 00:00:00, 2010-02
-15 00:00:00, 2010-02-22 00:00:00, 2010-03-01 00:00:00, 2010-03-08 00:00:00,
2010-03-15 00:00:00, 2010-04-05 00:00:00, 2010-04-12 00:00:00, 2010-04-26 00:00:00, 2010-05-03 00:00:00, 2010-05-10 00:00:00, 2010-05-17 00:00:00, 2010-05
-24 00:00:00, 2010-05-31 00:00:00, 2010-06-14 00:00:00, 2010-06-28 00:00:00,
2010-07-05 00:00:00, 2010-07-19 00:00:00, 2010-07-26 00:00:00, 2010-08-02 00:
00:00, 2010-08-09 00:00:00, 2010-08-16 00:00:00, 2010-08-30 00:00:00, 2010-09
-06 00:00:00, 2010-09-13 00:00:00, 2010-09-20 00:00:00, 2010-09-27 00:00:00,
2010-10-04 00:00:00, 2010-10-11 00:00:00, 2010-10-18 00:00:00, 2010-10-25 00:
00:00, 2010-11-01 00:00:00, 2010-11-08 00:00:00, 2010-11-15 00:00:00, 2010-11
-29 00:00:00, 2010-12-20 00:00:00, 2011-01-03 00:00:00, 2011-01-10 00:00:00,
2011-01-17 00:00:00, 2011-02-07 00:00:00, 2011-02-14 00:00:00, 2011-02-21 00:
00:00, 2011-02-28 00:00:00, 2011-03-07 00:00:00, 2011-03-14 00:00:00, 2011-03
-21 00:00:00, 2011-03-28 00:00:00, 2011-04-04 00:00:00, 2011-04-18 00:00:00,
2011-04-25 00:00:00, 2011-05-02 00:00:00, 2011-05-09 00:00:00, 2011-05-16 00:
00:00, 2011-05-23 00:00:00, 2011-05-30 00:00:00, 2011-06-06 00:00:00, ...]],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, ...], [1, 3, 7, 9, 17, 19, 20, 21, 23, 25, 27, 28, 29, 30, 31, 35, 3
8, 40, 41, 44, 45, 46, 47, 48, 49, 52, 54, 56, 57, 59, 60, 62, 66, 68, 69, 70
, 71, 72, 75, 76, 77, 78, 79, 85, 88, 89, 92, 96, 97, 99, 100, 101, 103, 104,
105, 108, 109, 110, 112, 114, 115, 117, 118, 119, 125, 126, 127, 128, 129, 131, 133, 134, 135, 136, 137, 140, 146, 150, 151, 152, 153, 157, 0, 3, 7, 22, 2
3, 24, 27, 28, 34, 37, 42, 47, 50, 55, 58, 66, 67, 69, ...]],
names=[u'State', u'StatusDate'])
In [22]:
# Select the State index
Daily.index.levels[0]
Out[22]:
Index([u'FL', u'GA', u'NY', u'TX'], dtype='object')

In [23]:
# Select the StatusDate index
Daily.index.levels[1]
Out[23]:
[2009-01-05, ..., 2012-12-10]
Length: 161, Freq: None, Timezone: None
Let's now plot the data per State.

As you can see, by breaking the graph up by the State column, we have a much clearer picture of what the data looks like. Can you spot any outliers?
In [24]:
Daily.loc['FL'].plot()
Daily.loc['GA'].plot()
Daily.loc['NY'].plot()
Daily.loc['TX'].plot();
We can also just plot the data from a specific date onward, like 2012. We can now clearly see that the data for these states is all over the place. Since the data consists of weekly customer counts, this much variability seems suspect. For this tutorial we will assume bad data and proceed.
In [25]:
Daily.loc['FL']['2012':].plot()
Daily.loc['GA']['2012':].plot()
Daily.loc['NY']['2012':].plot()
Daily.loc['TX']['2012':].plot();
We will assume that the customer count should remain relatively steady from month to month. Any data outside a specific range in a given month will be removed from the data set. The final result should have smooth graphs with no spikes.

StateYearMonth - Here we group by State, Year of StatusDate, and Month of StatusDate.
Daily['Outlier'] - A boolean (True or False) value letting us know if the value in the CustomerCount column is outside the acceptable range.

We will be using the attribute transform instead of apply. The reason is that transform keeps the shape (# of rows and columns) of the dataframe the same, while apply does not. Looking at the previous graphs, we can see they do not resemble a Gaussian distribution, which means we cannot use summary statistics like the mean and standard deviation. We use percentiles instead. Note that we run the risk of eliminating good data.
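To see the shape-preserving behavior of transform versus apply, here is a minimal standalone sketch (written in Python 3 syntax with a current pandas, unlike the Python 2 notebook above; the group labels and values are made up for illustration):

```python
import pandas as pd

# A tiny frame with two hypothetical groups
df = pd.DataFrame({'grp': ['a', 'a', 'b', 'b'],
                   'val': [1, 2, 10, 20]})

g = df.groupby('grp')['val']

# transform broadcasts each group's result back to every row,
# so the output has the same length as the input frame
same_shape = g.transform(lambda x: x.quantile(q=.75))

# apply collapses each group to a single value instead
collapsed = g.apply(lambda x: x.quantile(q=.75))

print(len(df), len(same_shape), len(collapsed))  # 4 4 2
```

Because transform preserves the row count, its result can be assigned straight back as new columns (Lower, Upper) on the original frame, which is exactly what the next cell relies on.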
In [26]:
-
8/20/2019 1+Python+Class+Powerpoint+Outline
101/136
28
# Calculate Outliers
StateYearMonth = Daily.groupby([Daily.index.get_level_values(0),
                                Daily.index.get_level_values(1).year,
                                Daily.index.get_level_values(1).month])
# Note: a textbook IQR fence would use 1.5*(q.75 - q.25); the parenthesization
# below reproduces the original notebook as written.
Daily['Lower'] = StateYearMonth['CustomerCount'].transform(lambda x: x.quantile(q=.25) - (1.5*x.quantile(q=.75)-x.quantile(q=.25)))
Daily['Upper'] = StateYearMonth['CustomerCount'].transform(lambda x: x.quantile(q=.75) + (1.5*x.quantile(q=.75)-x.quantile(q=.25)))
Daily['Outlier'] = (Daily['CustomerCount'] < Daily['Lower']) | (Daily['CustomerCount'] > Daily['Upper'])

# Remove Outliers
Daily = Daily[Daily['Outlier'] == False]
The dataframe named Daily will hold customer counts that have been aggregated per day. The original data (df) had multiple records per day. We are left with a data set that is indexed by both the State and the StatusDate. The Outlier column should be equal to False, signifying that the record is not an outlier.
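The removal step above is ordinary boolean masking on a (State, StatusDate) MultiIndex. Here is a minimal sketch with hypothetical values showing the same pattern end to end (Python 3 syntax, current pandas):

```python
import pandas as pd

# Hypothetical two-level index like Daily's (State, StatusDate)
idx = pd.MultiIndex.from_tuples(
    [('FL', '2009-01-12'), ('FL', '2009-02-02'), ('GA', '2009-01-12')],
    names=['State', 'StatusDate'])
daily = pd.DataFrame({'CustomerCount': [901, 9999, 653],
                      'Outlier': [False, True, False]}, index=idx)

# Keep only the rows whose Outlier flag is False
clean = daily[daily['Outlier'] == False]
print(len(clean))  # 2

# The State level can still be used for selection afterwards
print(clean.loc['FL']['CustomerCount'].tolist())  # [901]
```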
In [27]:
Daily.head()
Out[27]:
                  CustomerCount  Lower  Upper   Outlier
State StatusDate
FL    2009-01-12  901            450.5  1351.5  False
      2009-02-02  653            326.5  979.5   False
      2009-03-23  752            376.0  1128.0  False
      2009-04-06  1086           543.0  1629.0  False
      2009-06-08  649            324.5  973.5   False
We create a separate dataframe named ALL which groups the Daily dataframe by StatusDate. We are essentially getting rid of the State column. The Max column represents the maximum customer count per month and is used to smooth out the graph.
In [28]:
# Combine all markets
# Get the max customer count by Date
ALL = pd.DataFrame(Daily['CustomerCount'].groupby(Daily.index.get_level_values(1)).sum())
ALL.columns = ['CustomerCount'] # rename column
# Group by Year and Month
YearMonth = ALL.groupby([lambda x: x.year, lambda x: x.month])
# What is the max customer count per Year and Month
ALL['Max'] = YearMonth['CustomerCount'].transform(lambda x: x.max())
ALL.head()
Out[28]:
            CustomerCount  Max
StatusDate
2009-01-05  877            901
2009-01-12  901            901
2009-01-19  522            901
2009-02-02  953            953
2009-02-23  710            953

As you can see from the ALL dataframe above, in the month of January 2009 the maximum customer count was 901. If we had used apply, we would have got a dataframe with (Year and Month) as the index and just a Max column with the value of 901.
There was also interest in gauging whether the current customer counts were reaching certain goals the company had established. The task here is to visually show if the current customer counts are meeting the goals listed below. We will call the goals BHAG (Big Hairy Annual Goal).
12/31/2011 - 1,000 customers
12/31/2012 - 2,000 customers
12/31/2013 - 3,000 customers
We will be using the date_range function to create our dates.
Definition: date_range(start=None, end=None, periods=None, freq='D', tz=None, normalize=False, name=None, closed=None)
Docstring: Return a fixed frequency datetime index, with day (calendar) as the default frequency
By choosing the frequency to be 'A', or annual, we will be able to get the three target dates from above.
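As a quick standalone sketch of that call (Python 3 syntax; note that very recent pandas releases renamed the year-end alias from 'A' to 'YE', so the sketch tries both):

```python
import pandas as pd

# Year-end dates between the two bounds, one per year
try:
    idx = pd.date_range(start='12/31/2011', end='12/31/2013', freq='A')
except ValueError:
    # Newer pandas spells the year-end frequency alias 'YE'
    idx = pd.date_range(start='12/31/2011', end='12/31/2013', freq='YE')

print(len(idx))   # 3
print(idx[0])     # the first target date, 2011-12-31
```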
In [29]:
date_range?
Object `date_range` not found.
In [30]:
# Create the BHAG dataframe
data = [1000,2000,3000]
idx = pd.date_range(start='12/31/2011', end='12/31/2013', freq='A')
BHAG = pd.DataFrame(data, index=idx, columns=['BHAG'])
BHAG
Out[30]:
BHAG
2011-12-31  1000
2012-12-31  2000
2013-12-31  3000
Combining dataframes, as we learned in a previous lesson, is made simple using the concat function. Remember, when we choose axis=0 we are appending row-wise.
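A minimal sketch of that row-wise concatenation with hypothetical values (Python 3 syntax; note the notebook's combined.sort(axis=0) call is the old pandas 0.15 API, spelled sort_index() in current pandas). Columns present in only one frame are filled with NaN, exactly as BHAG and CustomerCount are in the output below:

```python
import pandas as pd

# Hypothetical weekly counts and a single goal row
a = pd.DataFrame({'CustomerCount': [100, 200]},
                 index=pd.to_datetime(['2012-01-02', '2012-01-09']))
b = pd.DataFrame({'BHAG': [2000]},
                 index=pd.to_datetime(['2012-12-31']))

# axis=0 appends row-wise; the union of the columns is kept
combined = pd.concat([a, b], axis=0).sort_index()
print(combined.shape)                    # (3, 2)
print(combined['BHAG'].isnull().sum())   # 2 rows have no goal value
```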
In [31]:
# Combine the BHAG and the ALL data set
combined = pd.concat([ALL,BHAG], axis=0)
combined = combined.sort(axis=0)
combined.tail()
Out[31]:
BHAG CustomerCount Max
2012-11-19  NaN   136   1115
2012-11-26  NaN   1115  1115
2012-12-10  NaN   1269  1269
2012-12-31  2000  NaN   NaN
2013-12-31  3000  NaN   NaN
In [32]:
fig, axes = plt.subplots(figsize=(12, 7))
combined['BHAG'].fillna(method='pad').plot(color='green', label='BHAG')
combined['Max'].plot(color='blue', label='All Markets')
plt.legend(loc='best');
There was also a need to forecast next year's customer count, and we can do this in a couple of simple steps. We will first group the combined dataframe by Year and take the maximum customer count for that year. This will give us one row per Year.
In [33]:
# Group by Year and then get the max value per year
Year = combined.groupby(lambda x: x.year).max()
Year
Out[33]:
      BHAG  CustomerCount  Max
2009  NaN   2452           2452
2010  NaN   2065           2065
2011  1000  2711           2711
2012  2000  2061           2061
2013  3000  NaN            NaN
In [34]:
# Add a column representing the percent change per year
Year['YR_PCT_Change'] = Year['Max'].pct_change(periods=1)
Year
Out[34]:
      BHAG  CustomerCount  Max   YR_PCT_Change
2009  NaN   2452           2452  NaN
2010  NaN   2065           2065  -0.157830
2011  1000  2711           2711  0.312833
2012  2000  2061           2061  -0.239764
2013  3000  NaN            NaN   NaN

To get next year's end-of-year customer count we will assume our current growth rate remains constant. We then increase this year's customer count by that amount, and that will be our forecast for next year.
In [35]:
(1 + Year.ix[2012,'YR_PCT_Change']) * Year.ix[2012,'Max']
Out[35]:
1566.8465510881595
Present Data

Create individual graphs per State.
In [36]:
# First Graph
ALL['Max'].plot(figsize=(10, 5))
plt.title('ALL Markets');
# Last four Graphs
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(20, 10))
fig.subplots_adjust(hspace=1.0) ## Create space between plots
Daily.loc['FL']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[0,0])
Daily.loc['GA']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[0,1])
Daily.loc['TX']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[1,0])
Daily.loc['NY']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[1,1])
# Add titles
axes[0,0].set_title('Florida')
axes[0,1].set_title('Georgia')
axes[1,0].set_title('Texas')
axes[1,1].set_title('North East');
Lesson 4

In this lesson we're going to go back to the basics. We will be working with a small data set so that you can easily understand what I am trying to explain. We will be adding columns, deleting columns, and slicing the data many different ways. Enjoy!
In [1]:
# Import libraries
import pandas as pd
import sys
In [2]:
print 'Python version ' + sys.version
print 'Pandas version: ' + pd.__version__
Python version 2.7.5 |Anaconda 2.1.0 (64-bit)| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]
Pandas version: 0.15.2
In [3]:
# Our small data set
d = [0,1,2,3,4,5,6,7,8,9]
# Create dataframe
df = pd.DataFrame(d)
df
Out[3]:
   0
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
In [4]:
# Lets change the name of the column
df.columns = ['Rev']
df
Out[4]:
   Rev
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
In [5]:
# Lets add a column
df['NewCol'] = 5
df
Out[5]:
   Rev  NewCol
0  0    5
1  1    5
2  2    5
3  3    5
4  4    5
5  5    5
6  6    5
7  7    5
8  8    5
9  9    5
In [6]:
# Lets modify our new column
df['NewCol'] = df['NewCol'] + 1
df
Out[6]:
   Rev  NewCol
0  0    6
1  1    6
2  2    6
3  3    6
4  4    6
5  5    6
6  6    6
7  7    6
8  8    6
9  9    6
In [7]:
# We can delete columns
del df['NewCol']
df
Out[7]:
   Rev
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
In [8]:
# Lets add a couple of columns
df['test'] = 3
df['col'] = df['Rev']
df
Out[8]:
   Rev  test  col
0  0    3     0
1  1    3     1
2  2    3     2
3  3    3     3
4  4    3     4
5  5    3     5
6  6    3     6
7  7    3     7
8  8    3     8
9  9    3     9
In [9]:
# If we wanted, we could change the name of the i