P09: Intro to NumPy

  • NumPy is the main scientific computing package for Python - it allows you to easily work with large arrays of data and supports functionality for many common operations (including linear algebra)

  • NumPy is all about doing computations on large data sets all at once - you can do many things without looping, which is much more efficient

# import numpy and other stuff for this tutorial
import numpy as np

# import a specific name from NumPy (the constant pi) because we'll use it a lot
from numpy import pi

# functionality for plotting
import matplotlib.pyplot as plt

Initialize array and a few basic operations

  • np.arange works much like the built-in range function, but returns an array

  • the interval includes start but excludes stop, overall interval [start…stop-1]

# set up an array and figure out shape...  
my_array = np.arange(10)      
print(my_array)

# note that it's 1D (a vector...)
my_array.shape
[0 1 2 3 4 5 6 7 8 9]
(10,)
# can specify start, stop and step
seq_array = np.arange(0,30,5)     # start, stop (stop at < X), step size
print(seq_array)
# note that 30 is not in there...
[ 0  5 10 15 20 25]
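
For comparison, np.linspace (used in the slicing section later on) takes the number of points instead of a step size and, by default, includes the stop value. A minimal sketch; the variable names by_step and by_count are just for illustration:

```python
import numpy as np

# arange: start, stop (excluded), step size
by_step = np.arange(0, 30, 5)

# linspace: start, stop (included by default), number of points
by_count = np.linspace(0, 30, 7)

print(by_step)     # last value is 25
print(by_count)    # last value is 30.0
```

Use arange when you care about the step size and linspace when you care about how many points you get.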

Reshape array - in this case a 1D vector to a 2D matrix

my_array = np.arange(36)
my_array = my_array.reshape(6,6)    # 3,12,  9,4
print(my_array.shape)   
print(my_array)
# why is (6,6) and (12,3) ok but (5,5) not ok? 
(6, 6)
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]
 [24 25 26 27 28 29]
 [30 31 32 33 34 35]]
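
One way to answer the question in the comment above: the new dims have to multiply out to the number of elements (36 here). The sketch below also uses -1, which asks NumPy to work out one dimension from the others; reshape_error is just an illustrative name:

```python
import numpy as np

my_array = np.arange(36)

# any shape whose dims multiply to 36 works
print(my_array.reshape(12, 3).shape)

# -1 asks NumPy to infer that dimension from the others (36 / 4 = 9)
print(my_array.reshape(4, -1).shape)

# 5 * 5 = 25, not 36, so this raises a ValueError
try:
    my_array.reshape(5, 5)
except ValueError as err:
    reshape_error = err
    print('reshape failed:', err)
```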

Reshape array - more than 2D

  • 1D, 2D, ND arrays

  • Notice how the dims stack on top of each other!

my_array = np.arange(100)
my_array = my_array.reshape(5,5,4)   # 2,5,10
my_array.shape   
print(my_array)

# NOTICE how the dims stack on top of each other! there are 5, 5x4 matrices
[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]
  [12 13 14 15]
  [16 17 18 19]]

 [[20 21 22 23]
  [24 25 26 27]
  [28 29 30 31]
  [32 33 34 35]
  [36 37 38 39]]

 [[40 41 42 43]
  [44 45 46 47]
  [48 49 50 51]
  [52 53 54 55]
  [56 57 58 59]]

 [[60 61 62 63]
  [64 65 66 67]
  [68 69 70 71]
  [72 73 74 75]
  [76 77 78 79]]

 [[80 81 82 83]
  [84 85 86 87]
  [88 89 90 91]
  [92 93 94 95]
  [96 97 98 99]]]
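
To see how the stacking maps onto indexing, here is a small sketch on the same 5x5x4 array. The names second_matrix and value are made up for this example:

```python
import numpy as np

my_array = np.arange(100).reshape(5, 5, 4)

# the first index picks one of the 5 stacked 5x4 matrices
second_matrix = my_array[1]
print(second_matrix.shape)

# indices go (matrix, row, column)
# matrix 1 starts at 20, row 2 adds 2*4, column 3 adds 3
value = my_array[1, 2, 3]
print(value)
```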

Allocate arrays of zeros, ones or rand to reserve the memory before filling up later

  • Handy when you know what size you need, but you’re not ready to fill it up yet…saves you from dynamically resizing the matrix during analysis, which is VERY,VERY slow (e.g. the ‘append’ method)

# note the () around the dims because here we're specifying as a tuple...
# default type is float64...can also pass in a list
arr = np.zeros( (3,4) )   
print(arr)
arr.dtype
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
dtype('float64')
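
A rough sketch of the two approaches described above: preallocating and filling in place versus growing an array with np.append, which builds a brand new array on every call. The array size and the names squares and grown are arbitrary choices for illustration:

```python
import numpy as np

n = 1000

# preallocate once, then fill each slot in place
squares = np.zeros(n)
for i in range(n):
    squares[i] = i ** 2

# growing with np.append copies the whole array on every call
grown = np.array([])
for i in range(n):
    grown = np.append(grown, i ** 2)

# same values either way, but the second loop does far more work
print(np.array_equal(squares, grown))
```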

Init an array of ones

  • Can use this method to init an array of any value…see next cell below

# ones
# note the 3D output below...4, 4x4 squares of floating point 1s...
arr = np.ones( (4,4,4) )
print(arr)
[[[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]

 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]

 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]

 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]]

What if you want to initialize an array of 10s?

arr = np.ones( (4,4,4) ) * 10
print(arr)
[[[10. 10. 10. 10.]
  [10. 10. 10. 10.]
  [10. 10. 10. 10.]
  [10. 10. 10. 10.]]

 [[10. 10. 10. 10.]
  [10. 10. 10. 10.]
  [10. 10. 10. 10.]
  [10. 10. 10. 10.]]

 [[10. 10. 10. 10.]
  [10. 10. 10. 10.]
  [10. 10. 10. 10.]
  [10. 10. 10. 10.]]

 [[10. 10. 10. 10.]
  [10. 10. 10. 10.]
  [10. 10. 10. 10.]
  [10. 10. 10. 10.]]]

Random numbers - generate all at once as opposed to looping like we did earlier in the class

arr = np.random.random( (5,4) )
print(arr)
[[0.44641626 0.58859782 0.77603544 0.28115959]
 [0.78071366 0.2796471  0.63270426 0.7201253 ]
 [0.2318062  0.00347654 0.89150387 0.56721201]
 [0.93510149 0.95112364 0.97054665 0.60036315]
 [0.98083677 0.356807   0.93856945 0.16744974]]
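
If you need the same random numbers every time you run the notebook, one option is a seeded generator. This sketch assumes NumPy 1.17 or newer, where np.random.default_rng is available; the seed value 42 is arbitrary:

```python
import numpy as np

# a generator seeded with a fixed value gives repeatable output
rng = np.random.default_rng(42)
arr = rng.random((5, 4))

# a second generator with the same seed produces the same numbers
rng_again = np.random.default_rng(42)
arr_again = rng_again.random((5, 4))

print(np.array_equal(arr, arr_again))
```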

Empty

  • Because you’re not initializing to a specific value (like zeros), can be marginally faster when allocating a large array

  • However, this is a bit dangerous because exact values in an ‘empty’ array are based on current state of memory and can vary…

  • Need to make sure that you are overwriting ALL of the values and that you remember that the values are NOT 0!!! (or 1)

# and empty...not really 'empty' but initialized with variable output determined 
# by current state of memory
arr = np.empty( (2,2) )
print(arr)
[[-2.00000000e+000  3.11108212e+231]
 [ 1.73059687e-077  2.82466759e-309]]

Fill up an array at init with any value, including NaNs! (very handy for error checking!)

# an alternate way to initialize an array with arbitrary values
# note that 'full' will guess best data type given init value
arr = np.full( (2,2), np.nan)
print(arr)
[[nan nan]
 [nan nan]]
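
A sketch of the error-checking idea: start with all NaNs, fill in results as you go, then use np.isnan to find anything that was never filled. The array name results and the values are invented for the example:

```python
import numpy as np

# start with NaN everywhere
results = np.full((2, 3), np.nan)

# fill in only some of the entries
results[0, :] = [1.0, 2.0, 3.0]
results[1, 0] = 4.0

# NaNs that are still there mark entries that were never filled
missing = np.isnan(results)
print(missing.sum(), 'entries were never filled')

# 'full' picks the dtype from the fill value: an int fill gives an int array
print(np.full((2, 2), 7).dtype)
```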

Data type, size, other attributes of numpy arrays…

print('Dims of data:', my_array.ndim)         # number of dims
print('Name of data type:', my_array.dtype)   # name of data type (float, int32, int64 etc)
print('Size of each element (bytes):', my_array.itemsize)          # size of each element in bytes
print('Total number of elements in array:', my_array.size)         # total number of elements in array
Dims of data: 3
Name of data type: int64
Size of each element (bytes): 8
Total number of elements in array: 100
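
These attributes combine in a simple way: total memory for the data is size * itemsize, which NumPy also reports as nbytes. A small sketch, with the dtype set explicitly so the numbers are the same on every platform:

```python
import numpy as np

my_array = np.arange(100, dtype=np.int64).reshape(5, 5, 4)

# 100 elements * 8 bytes each
print(my_array.size * my_array.itemsize)
print(my_array.nbytes)

# the values 0..99 fit in int8, so this copy needs 1 byte per element
small = my_array.astype(np.int8)
print(small.nbytes)
```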

Infer data types upon array creation

  • Use np.array to initialize an array and fill it with numbers

  • Can use lists or tuples (or any array-like input of numerical values)

  • Can specify data type upon array creation…complex, float32, float64, int32, uint32 (unsigned int32), etc

# will infer data type based on input values...here we have 1 float so the whole thing is float
float_array = np.array([1.2,2,3])  
float_array.dtype             # the inferred data type
dtype('float64')

Can also specify type upon array creation

  • What happens if you initialize with floating point numbers but you declare an int data type?

  • i.e. type casting upon array creation, as we discussed with pandas

  • doesn’t round, it truncates!

int_array = np.array([1.1,7.5], dtype = 'int32')   
int_array

# truncation of floats...be careful
array([1, 7], dtype=int32)
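
If you want rounding rather than truncation, one option is to round first and then cast. This sketch uses np.rint, which rounds to the nearest integer (exact halves go to the nearest even integer); the variable names are illustrative:

```python
import numpy as np

values = np.array([1.1, 7.5, -1.7])

# casting alone drops the fractional part (truncates toward zero)
truncated = values.astype('int32')

# round to the nearest integer first, then cast
rounded = np.rint(values).astype('int32')

print(truncated)
print(rounded)
```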

NumPy Part II: Simple elementwise arithmetic operations like + and - work on corresponding elements of arrays.

  • MASSIVE speed up over looping!

# set up two sets of data...
N = 1000
x = np.arange(0,N)
y = np.sin(x)

First add each element of x with the corresponding element of y using the old method…

sum_lst = []
%timeit for i in range(N): sum_lst.append(x[i] + y[i])
1.87 ms ± 9.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Now do it the numpy way…it goes much much faster!

  • often goes from milliseconds to microseconds

%timeit sum_lst = x + y
1.33 µs ± 2.73 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
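
%timeit only works in IPython/Jupyter. Outside a notebook, a similar comparison can be sketched with the standard timeit module. The function names loop_sum and vector_sum are made up here, and the actual timings will depend on your machine:

```python
import timeit
import numpy as np

N = 1000
x = np.arange(0, N)
y = np.sin(x)

def loop_sum():
    # the old way: add one pair of elements at a time
    return [x[i] + y[i] for i in range(N)]

def vector_sum():
    # the NumPy way: add the whole arrays at once
    return x + y

# total time in seconds for 100 calls of each version
loop_time = timeit.timeit(loop_sum, number=100)
vector_time = timeit.timeit(vector_sum, number=100)
print('loop:  ', loop_time)
print('vector:', vector_time)

# both versions give the same answer
same = np.allclose(loop_sum(), vector_sum())
print(same)
```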

Unary operations implemented as methods of the ndarray class

# note the method chain...
x = np.arange(10).reshape(2,5)   # 2 x 5 matrix

print(x.sum())                   # sum of all elements
print(x.sum(axis=0))             # sum of each column (across 1st dim)
print(x.sum(axis=1))             # sum of each row (across 2nd dim)
print(x.sum(0))                  # axis can also be passed positionally, without the axis= keyword
print(x.mean())
45
[ 5  7  9 11 13]
[10 35]
[ 5  7  9 11 13]
4.5
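
The same axis argument works for other reductions too. A short sketch on the same 2 x 5 matrix; keepdims=True is an optional extra that keeps the reduced axis as a dim of length 1:

```python
import numpy as np

x = np.arange(10).reshape(2, 5)   # rows are 0..4 and 5..9

col_max = x.max(axis=0)           # max of each column
row_min = x.min(axis=1)           # min of each row
row_mean = x.mean(axis=1)         # mean of each row

# keepdims=True gives a 2 x 1 result instead of a flat vector of length 2
row_sum_2d = x.sum(axis=1, keepdims=True)

print(col_max, row_min, row_mean)
print(row_sum_2d.shape)
```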

Slicing…

# create a 1d array
x = np.linspace(0,9,10)
print(x)
x[1]                     # just the second entry, remember 0 based indexing

# specific start and stop points (exclusive)
x[0:2]                   # the first and second entries in the array, so N>=0 and N<2 (note the < upper bound - not inclusive)

# assign the 2nd - 4th element to 100 (index 1,2,3)
x[1:4] = 100               
print(x[1:4])
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
[100. 100. 100.]

Step through an ndarray - similar to a list

# start, stop, step interval
print(x[0:8:2])

# reverse x
print(x[::-1])
[  0. 100.   4.   6.]
[  9.   8.   7.   6.   5.   4. 100. 100. 100.   0.]

Multidimensional array indexing, slicing, etc.

# generate a matrix of uniformly distributed random numbers scaled to 0-10, then rounded to whole numbers
x = np.round(np.random.random((10,5))*10)  
print(x)

x[0,0]     # first row, first column
x[2,3]     # third row, 4th column

x[:, 3]    # all entries in the 4th column 
x[3, :]    # all entries in the 4th row
x[0:2, 4]  # first two entries of the 5th column
x[6, 2:4]  # 7th row, 3rd and 4th entries. 

# if not all dims specified then missing values are considered complete slices
# these three ways of writing all do the same thing...
x[6]       
x[6,]
x[6,:]

# tricks...
print('last row: ', x[-1,:])     # last row
print('last column: ', x[:,-1])  # last column
print('last entry: ', x[-1,-1])  # last value
[[ 7.  8. 10.  8.  8.]
 [ 3.  6.  2.  1.  2.]
 [ 3.  9.  6.  5.  4.]
 [ 3.  8.  1.  3.  2.]
 [ 5.  3.  2.  9.  1.]
 [ 1.  3.  6.  1.  9.]
 [ 6.  4. 10.  9.  9.]
 [ 3.  0.  7.  5.  4.]
 [ 6.  9.  5.  4.  5.]
 [ 3.  3.  9.  0.  4.]]
last row:  [3. 3. 9. 0. 4.]
last column:  [8. 2. 4. 2. 1. 9. 9. 4. 5. 4.]
last entry:  4.0

Pull out subsets of rows and columns

# generate a matrix of random numbers over 0-1
x = np.random.rand(4,3) 
print(x)

# first two rows - note that you don't have to specify the 2nd dim - and note that 
# '2' here means rows 0 and 1 (not 0 through 2!)
y = x[:2] 
print('\n', y)

# can also take the last two rows in the same manner...in this case the 3rd and 4th rows (indices 2 and 3)
y = x[2:] 
print('\n', y)

# first two rows, 1st column
y = x[:2,0] 
print('\n', y)

# 3rd row to the end, 2nd column to the end (indices 2 onward and 1 onward)
y = x[2:,1:]
print('\n', y)
[[0.96839314 0.44528471 0.03082751]
 [0.05369719 0.11398778 0.44991492]
 [0.52486147 0.28391756 0.58813709]
 [0.89474918 0.84582619 0.4803415 ]]

 [[0.96839314 0.44528471 0.03082751]
 [0.05369719 0.11398778 0.44991492]]

 [[0.52486147 0.28391756 0.58813709]
 [0.89474918 0.84582619 0.4803415 ]]

 [0.96839314 0.05369719]

 [[0.28391756 0.58813709]
 [0.84582619 0.4803415 ]]

Important - slicing an array and re-assigning the output creates a view of the data, not a copy!

  • Recall that a ‘view’ is when two variables are both referencing the same data in memory

  • Because both variables are referencing the same data, changing one variable will also change the other…

Init an array to demonstrate…in this case a 3x2 array

x = np.array([ [2,4], [6,7], [5,4] ])
print('Initial values in x:\n', x)
Initial values in x:
 [[2 4]
 [6 7]
 [5 4]]

Then reassign all values in the 3rd row of x to a new variable z

  • z will be a ‘view’ of the data in the 3rd row of x

z = x[2,]
print('Shape of z:', z.shape, 'Values in z:', z)
Shape of z: (2,) Values in z: [5 4]

Now change all values in z to 100 (or whatever you want)

  • use the syntax z[:], which indicates “all values in z”

  • if you change data in z it will also change the corresponding elements in x because z references the same data (or chunk of memory)

z[:]=100
print('New values in z:', z)
New values in z: [100 100]

Notice that x has now changed even though you never directly changed it!

print('x also changed!!!\n', x)
x also changed!!!
 [[  2   4]
 [  6   7]
 [100 100]]

If you want two independent variables that do not reference the same data, use the copy method

# re-initialize x
x = np.array([ [2,4], [6,7], [5,4] ])

# make a copy
z = x[2,].copy()

# now you can modify z
z[:] = 100

# and it won't change x
print(x)
[[2 4]
 [6 7]
 [5 4]]
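
If you're not sure whether a variable is a view or a copy, you can ask NumPy directly. A minimal sketch using np.shares_memory and the base attribute; the names view_of_row and copy_of_row are just for illustration:

```python
import numpy as np

x = np.array([[2, 4], [6, 7], [5, 4]])

view_of_row = x[2, :]          # slicing gives a view
copy_of_row = x[2, :].copy()   # explicit copy

# a view shares memory with x, a copy does not
print(np.shares_memory(x, view_of_row))
print(np.shares_memory(x, copy_of_row))

# a view keeps a reference to the array it came from
print(view_of_row.base is x)
```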

Logical indexing.

  • Just like with Pandas, in NumPy we combine elementwise comparisons with ‘&’ and ‘|’ instead of ‘and’ and ‘or’

# using logical indexing to grab out subsets of data...
x = np.arange(0,10)
y = x[(x>3) & (x<7)]
print(y)
[4 5 6]
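
A few more variations on the same idea, as a sketch: '|' for or, '~' for not, and what happens if you use the plain 'and' keyword by mistake. The parentheses around each comparison are needed because & and | bind more tightly than < and >:

```python
import numpy as np

x = np.arange(0, 10)

# the comparison itself is just an array of True/False values
mask = (x > 3) & (x < 7)
print(mask)

# '|' is elementwise or
outside = x[(x < 3) | (x > 7)]
print(outside)

# '~' is elementwise not
not_big = x[~(x > 3)]
print(not_big)

# plain 'and' tries to turn a whole array into one True/False and fails
try:
    x[(x > 3) and (x < 7)]
except ValueError as err:
    and_error = err
    print('and failed:', err)
```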

Fancy indexing…using arrays to index arrays - used all the time in data analysis…

  • Fancy indexing always makes a COPY of the data (unlike normal slicing which creates a view)!!!

# define an array to play around with...
x = np.random.rand(3,4)

# define a tuple to use as an index into the first array
# note: a tuple is read as one index per dim, so x[y] is the same as x[2,3] (a single value)
y = (2,3)

# index  
print(x)
print('\n x indexed at tuple y: ', x[y])
[[0.86192083 0.53454066 0.23038754 0.68472688]
 [0.97537347 0.56734332 0.54103559 0.8607483 ]
 [0.13353761 0.20434622 0.3753845  0.02520431]]

 x indexed at tuple y:  0.025204310902670723

Can use fancy indexing to extract elements in a particular order

print(x)

# this will extract the 3rd row, then the 2nd row, then the first row
x[[2,1,0]]

# and this will extract all rows from the 2nd, 3rd and then 1st column. 
x[:,[1,2,0]]
[[0.86192083 0.53454066 0.23038754 0.68472688]
 [0.97537347 0.56734332 0.54103559 0.8607483 ]
 [0.13353761 0.20434622 0.3753845  0.02520431]]
array([[0.53454066, 0.23038754, 0.86192083],
       [0.56734332, 0.54103559, 0.97537347],
       [0.20434622, 0.3753845 , 0.13353761]])

Or can pass in multiple index arrays…they get paired up elementwise, so this will return a 1D array holding the elements at [1,1] and [2,2] in this case

print(x)
x[[1,2],[1,2]]
[[0.86192083 0.53454066 0.23038754 0.68472688]
 [0.97537347 0.56734332 0.54103559 0.8607483 ]
 [0.13353761 0.20434622 0.3753845  0.02520431]]
array([0.56734332, 0.3753845 ])
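
To check the copy behavior for yourself, here is a small sketch on an array with known values (so the results are easy to follow). The names picked and pairs are made up for this example:

```python
import numpy as np

x = np.arange(12).reshape(3, 4)   # rows are 0..3, 4..7, 8..11

# fancy indexing with a list of rows returns a copy
picked = x[[2, 0]]
picked[:] = -1

# x is untouched because picked does not share its memory
print(x)
print(np.shares_memory(x, picked))

# paired index arrays: elements [1,1] and [2,2]
pairs = x[[1, 2], [1, 2]]
print(pairs)
```

Compare this with the slicing example earlier, where changing the view also changed x.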