McKinney: Python for Data Analysis; University of Michigan Data Science Specialization - Intro to Data Science
Create from list, dictionary etc
Index created with ints if not provided with index parameter.
Query with attributes:
Can iterate with:
for item in series:
But it's quicker to use vectorised functions, e.g. np.sum(series). You can create your own vectorised functions. Broadcasting also works, e.g. s += 2 adds 2 to each item. Prefer broadcast / vectorised operations over iterating through a Series.
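A minimal sketch of the vectorised / broadcast style (made-up Series):
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4])
total = np.sum(s)   # vectorised sum - same result as looping and adding
s += 2              # broadcast: adds 2 to every element
print(total, list(s))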
.loc can also add values.
Append returns a new series, original unchanged.
Index can be duplicated!
Column labels, row index. Both can be non-unique. Quickly select based on row and column.
s.iloc[3] for query by position
s.loc['country'] for query by label
Can call with s[3] or s['country'] and it works out whether you're querying by position or by label.
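For example (a made-up Series keyed by country):
import pandas as pd

s = pd.Series([1.4, 2.1, 0.7], index=['UK', 'France', 'Germany'])
s.iloc[2]          # 0.7 - by position
s.loc['France']    # 2.1 - by label
s['France']        # also 2.1 - resolved as a label lookup
# Careful: if the index itself is made of integers, s[2] is treated as a label, not a position.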
df.loc[row, col] for a cell.
Chaining (df[...][...]) returns a copy of the dataframe. Better to select in one step: df.loc[:, ['Name', 'Cost']]
df.drop(label) doesn't change the df; it returns a copy without the row.
Can say inplace=True to modify the df rather than return a copy. axis=1 means drop a column.
del df['Name']
Views are generally faster, but changes made through a view can affect the underlying data. Use the copy() method to avoid this.
.columns is the list of columns. Set the header row on import. Can also use df.rename.
Boolean masks: an array of True / False combined with the original data; True means the item is included in the final result. Heavily used in pandas.
Create by applying comparison operators to the df, e.g. df['Gold'] > 0 produces a Boolean Series the same shape as the column.
Where function applies mask at the same time, only_gold = df.where(df['Gold'] > 0)
Includes values not meeting the criteria as NaN, but most functions ignore these rows. Use dropna() to get rid of them.
Boolean mask can be used as value for indexing. E.g. only_gold = df[df['Gold'] > 0]
Can chain Boolean masks e.g. len(df[(df['Gold'] > 0) | (df['Gold.1'] > 0)])
Or df[(df['Gold.1'] > 0) & (df['Gold'] == 0)]
Extremely important, and often an issue for new users: each Boolean mask needs to be wrapped in parentheses because of operator precedence.
Index can be set by the set_index function. reset_index gets back to integer indexes.
Index can be multiple columns. Call set_index with a list of columns; it finds the distinct combinations and sets the index.
df = df.set_index(['state name', 'county name']) - this creates a 2-level index. Need to query with args in the order of the index, level 0 first, e.g. df.loc['Michigan', 'Washtenaw County']
Can have multi level cols also.
df['sumlevel'].unique() shows the unique values in a column.
cols_to_keep = ['col 1', 'col 2']
df = df[cols_to_keep]
df.info()
When loading from files, use na_values to indicate which loaded values are actually missing. na_filter=False turns off NA detection entirely (can speed up loading large files with no NAs).
fillna fills missing data forwards ('ffill') or backwards ('bfill') relative to the rows. Can e.g. reset the index to a timestamp and use df.sort_index() to sort on the index, then fill missing data forwards. Can also fill with a Series of the same size.
Maths operations ignore missing values.
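A small sketch of forward fill after sorting by a (hypothetical) timestamp index:
import numpy as np
import pandas as pd

df = pd.DataFrame({'reading': [1.0, np.nan, np.nan, 4.0]},
                  index=pd.to_datetime(['2021-01-02', '2021-01-01',
                                        '2021-01-03', '2021-01-04']))
df = df.sort_index()
df['reading'] = df['reading'].ffill()   # forward fill; fillna(method='ffill') in older pandas
df['reading'].sum()                     # maths ignores any remaining NaN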
For integer-based position selection: data.iloc[<row selection>, <column selection>]. Column selection is optional. Don't really use it much; name-based selection with .loc is usually better.
data.iloc[-1] gets the last row
data.iloc[:, 0] gets the first column
data.iloc[0:5] gets the first 5 rows
data.iloc[:, 0:2] gets all rows, first 2 columns
data.iloc[[0, 3, 6, 24], [0, 5, 6]] for more complex selections
Returns a Series when one row is selected, a DataFrame when multiple are selected. To get around this, pass a list of values, e.g. data.iloc[[100]] returns one row as a DataFrame.
Used for selecting by label / index, or selecting with Boolean mask.
df.loc[<row selection>, <column selection>]. Column selection is optional.
Selections with data.loc are based on the index of the dataframe. Set the index with df.set_index("last_name")
Access with df.loc['Jeremy'], or df.loc[['Jeremy', 'Lucy']]
Again pass single row requests with a list using [[]] to make the result a DF not a series.
data.loc[:, 'col1':'col3'] gets the columns between col1 and col3 (label slices include both ends)
Most common method in practice.
Pass an array or series of true / false values to select rows where true.
Select matching rows on a condition: df.loc[df['first_name'] == 'Jeremy']
Select some columns from rows matching a condition: df.loc[df['first_name'] == 'Jeremy', ['col1', 'col2']]
With .iloc, 0:3 runs from row 0 to the row before 3 - the end limit is not included (unlike .loc label slices, which include it).
df.iloc[row selection, column selection] for position based selection.
data = [1, 2, 3]
from numpy import array
data = array(data)
data = [[11, 22],
        [33, 44],
        [55, 66]]
data = array(data)
data[4]      # zero-based, 5th element (1-d array)
data[-1]     # gets the last item
data[-2]     # gets the last but one
data[4, 1]   # gets 5th row, 2nd col (2-d array)
data[0, ]    # gets 1st row, all cols
data[start:end]
data[-2:]    # gets data starting 2 before the end
Lots of scikit-learn models need their data as 2-d arrays. It's common to reshape a 1-d array to 2-d; the data is unchanged - it's just a 2-d array with one column.
reshape(rows, 1) # rows is data.shape[0]
Often you have 2-d data (a sequence in each row), but the model expects 3-d, including an extra timestep or feature dimension.
reshape(rows, columns, 1) # no change to the data, but different dimensions
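A quick sketch of both reshapes (made-up arrays):
import numpy as np

data = np.array([1, 2, 3, 4, 5])
col = data.reshape((data.shape[0], 1))                   # (5,) -> (5, 1): one column
seqs = np.array([[1, 2, 3], [4, 5, 6]])                  # 2 rows, 3 columns
cube = seqs.reshape((seqs.shape[0], seqs.shape[1], 1))   # (2, 3) -> (2, 3, 1)
print(col.shape, cube.shape)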
pd.DataFrame(data=.., columns=.., index=..)
df['Gold'].idxmax() # index label of the row with the maximum value
max = df1['abs diff gold'].max()
val_mask = df1['abs diff gold'] == max
df1[val_mask]
dfc67 = dfc66.iloc[0:3]
data.loc[data['first_name'] == 'Antonio', 'city':'email']
data.loc[data['first_name'].isin(['France', 'Tyisha', 'Eric'])]
data.loc[data['email'].str.endswith("gmail.com") & (data['first_name'] == 'Antonio')]
data.loc[(data['id'] > 100) & (data['id'] <= 200), ['postal', 'web']]
data.loc[data['company_name'].apply(lambda x: len(x.split(' ')) == 4)]
(Select rows where the company name has 4 words in it. Can build the mask outside of the main .loc for clarity: idx = data['company_name'].apply(lambda x: len(x.split(' ')) == 4), then data.loc[idx])
dfg32 = dfg3.assign(q3 = (dfg3['Gold'] - dfg3['Gold.1']) / dfg3['Gold.2'])
del df['column_name']
max = dfg32['q3'].max()
mask = dfg32['q3'] == max
return (dfg32[mask].index)[0]
Points = pd.Series(df['Gold.2'] * 3 + df['Silver.2'] * 2 + df['Bronze.2'] * 1)
s.values # the underlying numpy array
columns_to_keep = ['STNAME', 'CTYNAME']
dfc52 = dfc5[columns_to_keep]
df_state = dfc52.groupby(['STNAME']).count()
df['target'].unique() # array of distinct values
df['hID'].nunique() # number of distinct values
df['hID'].count() # number of non-null values
df['hID'].size # number of rows, including nulls
df['target'].value_counts() # count of rows per distinct value
dfc62.sort_values(by = ['STNAME', 'CENSUS2010POP'], ascending = [True, False], inplace = True)
df.sort_values(by=['count'], ascending=False)
grouped = df2.groupby('Markets')
grouped.describe(percentiles = []).sort_values(by=[('PP Complete Ad Cost', 'count')])
df['Date'] = ['December 1', 'January 1', 'mid-May']
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')
pd.merge(staff_df, student_df, how='inner', left_on=['First Name','Last Name'], right_on=['First Name','Last Name'])
When both frames have a non-key column with the same name, merge keeps both, suffixed _x for the left frame and _y for the right.
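E.g. (hypothetical frames that both carry a Location column):
import pandas as pd

staff_df = pd.DataFrame({'Name': ['Kelly', 'Sally'], 'Location': ['Office 1', 'Office 2']})
student_df = pd.DataFrame({'Name': ['Sally', 'James'], 'Location': ['Home A', 'Home B']})
merged = pd.merge(staff_df, student_df, how='left', on='Name')
print(merged.columns)  # ['Name', 'Location_x', 'Location_y']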
Method chaining is good pandas style: methods on objects return a reference to an object, so calls can be chained:
(df.where(df['SUMLEV']==50)
.dropna()
.set_index(['STNAME','CTYNAME'])
.rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))
# control printouts
pandas.set_option('display.max_rows', None)
#pandas.set_option('display.max_rows', 5)
You can also reset an option back to its default value like this:
pd.reset_option('display.max_rows')
or set directly:
pd.options.display.max_rows = 100
Chained indexing (e.g. df.loc[key1][key2]) is a bad code smell - pandas may return a copy, which is the usual cause of SettingWithCopyWarning.
map: pass a function and an iterable; the function is run on each item. (Python's built-in map returns an iterator in Python 3; pandas Series.map returns a Series and can also take a dict for lookups.)
#map to replace values with sparse dictionary lookup
GDP['Country Name'] = GDP['Country Name'].map(r2).fillna(GDP['Country Name'])
t6['Continent'] = t6.index.to_series().map(ContinentDict)
applymap: pass a function to operate on each cell of the dataframe; returns a dataframe. Not used so often since it hits every cell.
apply: pass a function and the dataframe axis to operate on. Good for running a function over entire rows or columns.
e.g. apply(lambda x: 1 if x else 0)
The applied function can return a Series, e.g. pd.Series({'min': np.min(data), 'max': np.max(data)})
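For instance, a row-wise min/max (sketch with made-up column names):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Gold': [10, 3], 'Silver': [5, 8], 'Bronze': [2, 1]})
df.apply(lambda row: pd.Series({'min': np.min(row), 'max': np.max(row)}), axis=1)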
lambda x: x + 3. Here, x is the bound variable (argument) and the body is x + 3. It's an expression, so it can be named but can't include statements. Single line. Can take multiple parameters, comma separated.
add_three = lambda x: x + 3
add_three(2)  # 5
groupby takes column name(s), splits the data into groups, and returns a DataFrameGroupBy object. Iterating over it yields tuples: the first item is the group key, the second is the dataframe reduced to that group (sketch below). Could also give a function to group the data.
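A minimal sketch of iterating over the groups (made-up frame):
import pandas as pd

df = pd.DataFrame({'STNAME': ['Michigan', 'Michigan', 'Ohio'],
                   'CENSUS2010POP': [100, 200, 300]})
for group, frame in df.groupby('STNAME'):
    print(group, len(frame))   # group key, number of rows in that group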
Split data, apply a function, then combine - use groupby with .agg. Pass a dictionary of the columns of interest and a function to apply. The dictionary can either identify the columns to apply a function to, or name the output columns when multiple functions are run; the difference depends on the keys you pass in and how they're named.
Example:
conts = t6.groupby('Continent')
con = conts['population'].agg({
    'size': np.count_nonzero,
    'sum': np.sum,
    'mean': np.mean,
    'std': np.std})
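Note: in newer pandas (1.0+) this dict-of-output-names form on a single column raises a SpecificationError; named aggregation is the replacement (a sketch with the same intent):
con = conts['population'].agg(
    size=np.count_nonzero,
    sum='sum',
    mean='mean',
    std='std')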
Column dtypes often come out as object.
t2.dtypes prints the types of all columns.
to cast to different type with a dictionary: df.astype({'col1': 'int32'}).dtypes
Whenever I get a hold of such complex problems, I use apply/lambda. Let me first show you how I will do this.
def custom_rating(genre, rating):
    if 'Thriller' in genre:
        return min(10, rating + 1)
    elif 'Comedy' in genre:
        return max(0, rating - 1)
    else:
        return rating

df['CustomRating'] = df.apply(
    lambda x: custom_rating(x['Genre'], x['Rating']), axis=1)
The general structure is:
You define a function that takes the column values you want to use in your logic. Here the only two columns we end up using are Genre and Rating.
You use apply with a lambda along the rows with axis=1. The general syntax is:
df.apply(lambda x: func(x['col1'],x['col2']),axis=1)
You should be able to create pretty much any logic using apply/lambda since you just have to worry about the custom function.
Example:
# create a function and call it on each row with apply
def checkrenew(x):
    # med is assumed to be defined earlier (e.g. the median of '% Renewable')
    if x > med:
        return 1
    else:
        return 0
t6['HighRenew'] = t6['% Renewable'].apply(checkrenew)
Good to simulate an experiment and look at the results to reach conclusions without doing lots of maths. E.g. if the chance of a tornado is 1% on any given day, what's the chance of two in a row? Simulate with numpy:
chance_of_tornado = 0.01
tornado_events = np.random.binomial(1, chance_of_tornado, 1000000)
two_days_in_a_row = 0
for j in range(1, len(tornado_events) - 1):
    if tornado_events[j] == 1 and tornado_events[j-1] == 1:
        two_days_in_a_row += 1
print('{} tornadoes back to back in {} years'.format(two_days_in_a_row, 1000000/365))
Hypothesis testing:
Also called A/B testing: experimentation to measure the impact of a change. Set a hypothesis, including a null hypothesis, then look for evidence against the null hypothesis. Choose a significance level (alpha) for the amount of chance that the effect is just noise you're happy to accept. Typical values are 0.01, 0.05, 0.1. The choice depends on the cost of a wrong result and the amount of noise.
The SciPy library has lots of significance tests. Use a t-test (e.g. scipy.stats.ttest_ind) to test this hypothesis. Tests expect a particular shape of data.
The p-value comes from the test's distribution and is compared against alpha: if p < alpha, reject the null hypothesis.
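A minimal sketch with made-up samples:
import numpy as np
from scipy import stats

a = np.random.normal(0.0, 1.0, 100)
b = np.random.normal(0.1, 1.0, 100)
result = stats.ttest_ind(a, b)
print(result.pvalue < 0.05)   # True would mean: reject the null hypothesis at alpha = 0.05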
p-hacking / dredging: a data science flaw - run lots of tests and, if you try enough, some will pass alpha by chance. But you've done lots of tests!
Remedies: Bonferroni correction (scale alpha down by 1 / number of tests), hold-out sets (experiment on one set, test against the other), or pre-registration of the investigation.
Series: a one-dimensional array - a sequence of values with an associated array of data labels, called the index.
Like a fixed length ordered dictionary.
The order is maintained.
NaN / NA / null are all sentinel values for missing data; comparisons against them evaluate as False.
DataFrame: a rectangular table of data.
Ordered collection of columns. Columns can have different data types.
Has a row and column index.
Like a dict of series, all sharing the same index.
Can make it effectively multi-dimensional with hierarchical indexing.
Index created as 0..n-1 if not assigned on creation.
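E.g. building a frame from a dict of Series sharing an index (made-up data):
import pandas as pd

gold = pd.Series({'UK': 27, 'France': 10})
silver = pd.Series({'UK': 23, 'France': 18})
df = pd.DataFrame({'Gold': gold, 'Silver': silver})  # columns share the same index
print(df.index)   # Index(['UK', 'France'], dtype='object')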
NeilShamby68
GitHub: https://github.com/microsoft/reactors
To get to a local env: go to GitHub and select Fork, which creates a fork of the repository in my GitHub account. Then get the code locally: use VS Code to clone a repository (it asks where it is located). Then still need to pull; it connects to GitHub and clones locally.
Good pandas exercise: https://github.com/microsoft/Reactors/blob/main/Data_Science_1/Full_Day/3-Pandas.ipynb. Forked to get a copy into my repository.
https://notebooks.azure.com https://notebooks.azure.com/jeremy-bates64
Open an Azure project, upload a file from a GitHub URL. Can run Jupyter from there.
VS Code is good: lots of IntelliSense, built-in Git, debugging, extensions. Installed Azure Tools for VS Code.
The Microsoft Azure portal offers a machine learning service to run Jupyter: Azure Machine Learning Studio.
numpy arrays have an implicit (positional) index; a pandas Series has an explicit index. Series keys are ordered.
resources: https://pandas.pydata.org/docs/getting_started/10min.html