For Part 1, we will be using data from this paper. The data is a collection of reviews from Pitchfork, a site that provides expert reviews of music album. The authors of this paper have also combined the data with a set of features from Spotify’s API that provide insight into the music itself, e.g. the "acousticness" of the song. We will tackle a regression problem here, trying to predict the score of a review from several of the other columns in the dataset.
In the first subsection of Part 1, We’re going to look at how running linear regression with various subsets of our features impacts our ability to predict score.
In Part 1.1, Here we are going to train a separate linear regression model for a number of different feature subsets. Specifically:
feature_sets
below is a list of lists; each sublist is a different subset of features to build a model with. part1_train.csv
. part1_test.csv
. The evaluation metric we will use is root mean squared error. Our output file part_1.1_results.csv
will have the following columns:
feature_set
- a column describing the features of the model used. For feature sets with multiple features, combine them using an underscore (you can do this with the code "_".join(feature_set)
)training_rmse
- a column that gives the RMSE of a linear regression model trained on this feature set on the training datatest_rmse
- a column that gives the RMSE of a linear regression model trained on this feature set on the test datafeature_sets = [['artist'],
['reviewauthor'],
['releaseyear'],
['recordlabel'],
['genre'],
['danceability'],
['energy'],
['key'],
['loudness'],
['speechiness'],
['acousticness'],
['instrumentalness'],
['liveness'],
['valence'],
['tempo'],
['danceability','energy','key','loudness','speechiness','acousticness',
'instrumentalness','liveness','valence','tempo'],
['artist', 'reviewauthor', 'releaseyear', 'recordlabel', 'genre'],
['artist', 'reviewauthor', 'releaseyear', 'recordlabel', 'genre', 'danceability',
'energy', 'key', 'loudness', 'speechiness', 'acousticness', 'instrumentalness',
'liveness', 'valence', 'tempo']]
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import math
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')
labelencoder = LabelEncoder()
# Convenience things for you, note that releaseyear is continuous but is not a Spotify API variable
CONTINUOUS_FEATURES = ['releaseyear', 'danceability', 'energy', 'key', 'loudness',
'speechiness', 'acousticness', 'instrumentalness', 'liveness',
'valence', 'tempo']
CATEGORICAL_FEATURES = ['artist', 'reviewauthor', 'recordlabel', 'genre']
# Read in the data
training_data = pd.read_csv("part1_train.csv")
test_data = pd.read_csv("part1_test.csv")
new_training_data = training_data
new_test_data = test_data
for feature in CATEGORICAL_FEATURES:
new_training_data[feature] = labelencoder.fit_transform(new_training_data[feature])
new_test_data[feature] = labelencoder.fit_transform(new_test_data[feature])
result_data = []
for feature_set in feature_sets:
X_train = new_training_data[feature_set]
y_train = new_training_data['score']
X_test = new_test_data[feature_set]
y_test = new_test_data['score']
clf1 = LinearRegression(fit_intercept=True)
clf1.fit(X_train,y_train)
test_pred = clf1.predict(X_test)
train_pred = clf1.predict(X_train)
result_data.append(["_".join(feature_set), math.sqrt(mean_squared_error(y_train,train_pred)), math.sqrt(mean_squared_error(y_test,test_pred))])
result = pd.DataFrame(result_data)
result.columns = ["feature_set","training_rmse","test_rmse"]
result
feature_set | training_rmse | test_rmse | |
---|---|---|---|
0 | artist | 1.243393 | 1.243591 |
1 | reviewauthor | 1.243345 | 1.243498 |
2 | releaseyear | 1.235994 | 1.232989 |
3 | recordlabel | 1.242969 | 1.244257 |
4 | genre | 1.242326 | 1.242353 |
5 | danceability | 1.240362 | 1.241097 |
6 | energy | 1.240900 | 1.239929 |
7 | key | 1.243497 | 1.243823 |
8 | loudness | 1.236968 | 1.237578 |
9 | speechiness | 1.243530 | 1.243566 |
10 | acousticness | 1.240370 | 1.240188 |
11 | instrumentalness | 1.238989 | 1.238840 |
12 | liveness | 1.243534 | 1.243558 |
13 | valence | 1.242268 | 1.240970 |
14 | tempo | 1.243195 | 1.242322 |
15 | danceability_energy_key_loudness_speechiness_a... | 1.233091 | 1.235554 |
16 | artist_reviewauthor_releaseyear_recordlabel_genre | 1.233341 | 1.231714 |
17 | artist_reviewauthor_releaseyear_recordlabel_ge... | 1.224337 | 1.224779 |
result.to_csv("part_1.1_results.csv",sep = '\t')
In Part 1.2, We will be training an L1-regularized linear regression model, with an expanded feature set. Specifically:
feature_sets
(i.e. your feature set, to begin this section, is feature_sets[-1]
.drop=if_binary
and sparse=False
in the function arguments. StandardScaler
.part1_train.csv
. You should use the LassoCV class in sklearn
, it will do the cross-validation necessary to select the appropriate value for the regularizer for you! Use 10-fold cross-validation to perform model selection (set the LassoCV
parmaeter cv
to 10), and set the random_state
to 1. Do not change any of the other parameters to LassoCV
(i.e. leave them at their defaults).alpha
value (the regularizer term, according to sklearn
. In class, we refer to this as $\lambda$!) in terms of average mean squared error according to the cross-validation.part1_train.csv
). We will use this to report the root mean squared error on the test set.# Write your code for Part 1.2 here
from sklearn.linear_model import LassoCV, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline,make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import f1_score, accuracy_score, recall_score, confusion_matrix,classification_report, precision_score, roc_auc_score
# Do the CV to find alpha
features = feature_sets[-1]
X_train = training_data[features]
y_train = training_data['score']
X_test = test_data[features]
y_test = test_data['score']
basic_pipeline = make_pipeline(
ColumnTransformer([('numerical', StandardScaler(), CONTINUOUS_FEATURES),
("categorical", OneHotEncoder(drop ="if_binary",sparse=False),CATEGORICAL_FEATURES)]),
)
clf1 = LassoCV(cv = 10, random_state = 1)
pipe1 = Pipeline([("pipe",basic_pipeline),
('clf1', clf1)])
pipe1.fit(X_train,y_train)
y_pred_test = pipe1.predict(X_test)
y_pred_train = pipe1.predict(X_train)
rmse_train = math.sqrt(mean_squared_error(y_train,y_pred_train))
rmse_test = math.sqrt(mean_squared_error(y_test,y_pred_test))
evaluation_matrix_base_training = {
"training" : [rmse_train],
"testing": [rmse_test]
}
eva_base_train = pd.DataFrame(data = evaluation_matrix_base_training, index = ['RMSE'])
display(eva_base_train)
# Retrain the model
l1_penalties = np.logspace(-3, 3, num = 15) # TODO: make the list of lambda values
lasso_output = []
lasso_data_train = training_data[features + ["score"]].dropna()
lasso_data_test = test_data[features+["score"]].dropna()
for l1_penalty in l1_penalties:
regression_pipeline = make_pipeline(ColumnTransformer([
("categorical", OneHotEncoder(drop ="if_binary",sparse=False),CATEGORICAL_FEATURES),
("scale",StandardScaler(),CONTINUOUS_FEATURES)
]),
Lasso(alpha=l1_penalty, random_state=1)
)
regression_pipeline.fit(lasso_data_train,lasso_data_train.score)
train_rmse = np.sqrt(mean_squared_error(lasso_data_train.score,
regression_pipeline.predict(lasso_data_train)))
test_rmse = np.sqrt(mean_squared_error(lasso_data_test.score,
regression_pipeline.predict(lasso_data_test)))
# We maintain a list of dictionaries containing our results
lasso_output.append({'l1_penalty': l1_penalty,'model': regression_pipeline,'train_rmse': train_rmse,'test_rmse': test_rmse})
lasso_output = pd.DataFrame(lasso_output)
display(lasso_output)
training | testing | |
---|---|---|
RMSE | 1.122921 | 1.343646 |
l1_penalty | model | train_rmse | test_rmse | |
---|---|---|---|---|
0 | 0.001000 | (ColumnTransformer(transformers=[('categorical... | 1.164865 | 1.249920 |
1 | 0.002683 | (ColumnTransformer(transformers=[('categorical... | 1.193347 | 1.227617 |
2 | 0.007197 | (ColumnTransformer(transformers=[('categorical... | 1.212013 | 1.224412 |
3 | 0.019307 | (ColumnTransformer(transformers=[('categorical... | 1.226351 | 1.224815 |
4 | 0.051795 | (ColumnTransformer(transformers=[('categorical... | 1.230498 | 1.229373 |
5 | 0.138950 | (ColumnTransformer(transformers=[('categorical... | 1.243534 | 1.243558 |
6 | 0.372759 | (ColumnTransformer(transformers=[('categorical... | 1.243534 | 1.243558 |
7 | 1.000000 | (ColumnTransformer(transformers=[('categorical... | 1.243534 | 1.243558 |
8 | 2.682696 | (ColumnTransformer(transformers=[('categorical... | 1.243534 | 1.243558 |
9 | 7.196857 | (ColumnTransformer(transformers=[('categorical... | 1.243534 | 1.243558 |
10 | 19.306977 | (ColumnTransformer(transformers=[('categorical... | 1.243534 | 1.243558 |
11 | 51.794747 | (ColumnTransformer(transformers=[('categorical... | 1.243534 | 1.243558 |
12 | 138.949549 | (ColumnTransformer(transformers=[('categorical... | 1.243534 | 1.243558 |
13 | 372.759372 | (ColumnTransformer(transformers=[('categorical... | 1.243534 | 1.243558 |
14 | 1000.000000 | (ColumnTransformer(transformers=[('categorical... | 1.243534 | 1.243558 |
(lasso_output["train_rmse"].sum())/15
1.2308278386434948
from sklearn import set_config
from sklearn.utils import estimator_html_repr
from IPython.core.display import display, HTML
set_config(display='diagram')
mod = lasso_output.model.iloc[2]
mod
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('categorical', OneHotEncoder(drop='if_binary', sparse=False), ['artist', 'reviewauthor', 'recordlabel', 'genre']), ('scale', StandardScaler(), ['releaseyear', 'danceability', 'energy', 'key', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo'])])), ('lasso', Lasso(alpha=0.0071968567300115215, random_state=1))])Please rerun this cell to show the HTML repr or trust the notebook.
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('categorical', OneHotEncoder(drop='if_binary', sparse=False), ['artist', 'reviewauthor', 'recordlabel', 'genre']), ('scale', StandardScaler(), ['releaseyear', 'danceability', 'energy', 'key', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo'])])), ('lasso', Lasso(alpha=0.0071968567300115215, random_state=1))])
ColumnTransformer(transformers=[('categorical', OneHotEncoder(drop='if_binary', sparse=False), ['artist', 'reviewauthor', 'recordlabel', 'genre']), ('scale', StandardScaler(), ['releaseyear', 'danceability', 'energy', 'key', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo'])])
['artist', 'reviewauthor', 'recordlabel', 'genre']
OneHotEncoder(drop='if_binary', sparse=False)
['releaseyear', 'danceability', 'energy', 'key', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']
StandardScaler()
Lasso(alpha=0.0071968567300115215, random_state=1)
lasso_results = pd.DataFrame({"features" : mod[-2].get_feature_names_out(),
"coefs" : mod[-1].coef_
})
lasso_results.sort_values("coefs", ascending = False)
features | coefs | |
---|---|---|
666 | categorical__recordlabel_316 | 0.196678 |
670 | categorical__genre_1 | 0.121404 |
287 | categorical__reviewauthor_181 | 0.086074 |
242 | categorical__reviewauthor_136 | 0.085880 |
687 | scale__instrumentalness | 0.062686 |
... | ... | ... |
680 | scale__releaseyear | -0.129545 |
63 | categorical__artist_63 | -0.131319 |
301 | categorical__reviewauthor_195 | -0.171756 |
108 | categorical__reviewauthor_2 | -0.189233 |
187 | categorical__reviewauthor_81 | -0.303939 |
691 rows × 2 columns
non_zeros_count = sum(lasso_results["coefs"][lasso_results["coefs"] != 0].value_counts())
total_count = sum(lasso_results["coefs"].value_counts())
print(f'Total number of non zero coefficients are',non_zeros_count)
print(f'Percentage of total number of non zero coefficients are',round((non_zeros_count/total_count),4)*100)
Total number of non zero coefficients are 21 Percentage of total number of non zero coefficients are 3.04
lasso_results[lasso_results["features"].str.contains('reviewauthor', na=False)].sort_values("coefs", ascending=False)
features | coefs | |
---|---|---|
287 | categorical__reviewauthor_181 | 0.086074 |
242 | categorical__reviewauthor_136 | 0.085880 |
273 | categorical__reviewauthor_167 | 0.000000 |
261 | categorical__reviewauthor_155 | 0.000000 |
262 | categorical__reviewauthor_156 | 0.000000 |
... | ... | ... |
349 | categorical__reviewauthor_243 | 0.000000 |
236 | categorical__reviewauthor_130 | -0.121298 |
301 | categorical__reviewauthor_195 | -0.171756 |
108 | categorical__reviewauthor_2 | -0.189233 |
187 | categorical__reviewauthor_81 | -0.303939 |
244 rows × 2 columns
lasso_results[lasso_results["features"].str.contains('genre', na=False)].sort_values("coefs", ascending=False)
features | coefs | |
---|---|---|
670 | categorical__genre_1 | 0.121404 |
671 | categorical__genre_2 | 0.000000 |
672 | categorical__genre_3 | 0.000000 |
673 | categorical__genre_4 | 0.000000 |
674 | categorical__genre_5 | -0.000000 |
675 | categorical__genre_6 | 0.000000 |
676 | categorical__genre_7 | 0.000000 |
678 | categorical__genre_9 | 0.000000 |
677 | categorical__genre_8 | -0.037467 |
669 | categorical__genre_0 | -0.096707 |
679 | categorical__genre_10 | -0.118597 |
lasso_results[lasso_results["features"].str.contains('artist', na=False)].sort_values("coefs", ascending=False)
features | coefs | |
---|---|---|
0 | categorical__artist_0 | 0.000000 |
1 | categorical__artist_1 | 0.000000 |
78 | categorical__artist_78 | 0.000000 |
77 | categorical__artist_77 | -0.000000 |
76 | categorical__artist_76 | 0.000000 |
... | ... | ... |
31 | categorical__artist_31 | 0.000000 |
30 | categorical__artist_30 | -0.000000 |
29 | categorical__artist_29 | -0.000000 |
105 | categorical__artist_105 | 0.000000 |
63 | categorical__artist_63 | -0.131319 |
106 rows × 2 columns
We will finally use cross validation for both algorithm and model selection, with a hold-out test set for a final evaluation. We will use 5-fold cross validation to identify the best parameters and hyperparameters for a set of models. We will then take our final models and use a final hold-out test set (the same one as above) to estimate the generalization error of the models.
Specifically, We will be training and evaluating the following models, one for each of the specified hyper parameters sets:
Decision Tree regression
- All combinations of a max_depth
of 5, 10, or 20, and a criterion
of "squared error"
or "absolute error"
. Use the DecisionTreeRegressor.np.logspace(-5, 5, 11)
. Use the Ridge class from sklearn to train a Ridge Regression model. The parameters you need to pass when constructing the Ridge model are alpha
, which lets you specify what you want the L2 penalty to be, and random_state=0
to avoid randomness.n_neighbors
of 1, 5, 10, and 15. Use the KNeighborsRegressor class.Our output file part_1.4_results.csv
should have the following columns:
model_name
- The name of the model, one of DTR
(Decision Tree Regression), Ridge
, or KNN
.hyperparameter_setting
- a column describing the hyperparameters of the model. For models with multiple hyperparameters, combine them using an underscore (you can do this with the code "_".join(hyperparameters)
).mean_training_rmse
- a column that gives the mean RMSE on the k-fold training data. You should take the average of the model’s errors on the different folds, using root mean squared error again as your evaluation metric.sd_training_rmse
- a column that gives the standard deviation RMSE on the k-fold training data.test_rmse
- a column that gives the RMSE of a linear regression model trained on this feature set on the test datafrom sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline,make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import f1_score, accuracy_score, recall_score, confusion_matrix,classification_report, precision_score, roc_auc_score
fold0_data = pd.read_csv("/Users/aparna_kamal/Downloads/assignment2/1.2_fold0.csv").dropna()
fold1_data = pd.read_csv("/Users/aparna_kamal/Downloads/assignment2/1.2_fold1.csv").dropna()
fold2_data = pd.read_csv("/Users/aparna_kamal/Downloads/assignment2/1.2_fold2.csv").dropna()
fold3_data = pd.read_csv("/Users/aparna_kamal/Downloads/assignment2/1.2_fold3.csv").dropna()
fold4_data = pd.read_csv("/Users/aparna_kamal/Downloads/assignment2/1.2_fold4.csv").dropna()
fold1234_train = pd.concat([fold1_data,fold2_data,fold3_data,fold4_data])
fold0234_train = pd.concat([fold0_data,fold2_data,fold3_data,fold4_data])
fold0134_train = pd.concat([fold0_data,fold1_data,fold3_data,fold4_data])
fold0123_train = pd.concat([fold1_data,fold2_data,fold3_data,fold0_data])
fold0124_train = pd.concat([fold1_data,fold2_data,fold0_data,fold0_data])
training_entire_set =[]
test_entire_set = []
training_entire_set.append(fold1234_train)
test_entire_set.append(fold0_data)
training_entire_set.append(fold0234_train)
test_entire_set.append(fold1_data)
training_entire_set.append(fold0134_train)
test_entire_set.append(fold2_data)
training_entire_set.append(fold0124_train)
test_entire_set.append(fold3_data)
training_entire_set.append(fold0123_train)
test_entire_set.append(fold4_data)
basic_pipeline = make_pipeline(
ColumnTransformer([('numerical', StandardScaler(), CONTINUOUS_FEATURES),
("categorical", OneHotEncoder(drop ="if_binary",sparse=False),CATEGORICAL_FEATURES)]),
)
# clf1 = DecisionTreeRegressor(random_state=1)
# clf2 = Ridge(random_state = 0)
# clf3 = KNeighborsRegressor()
DT_CRITERIA = ["squared_error","absolute_error"]
DT_MAX_DEPTH = [5,10,20]
KNN_NEIGHBOR = [1,2,5,10]
RIDGE_PARAM = np.logspace(-5, 5, num = 11)
result = []
for i in range(0,len(DT_MAX_DEPTH)):
for j in range(0,len(DT_CRITERIA)):
rmse_train=0
rmse_test = 0
rmse_sd = []
for k in range(0,5):
X_train = training_entire_set[k][features]
y_train = training_entire_set[k]['score']
X_test = test_entire_set[k][features]
y_test = test_entire_set[k]['score']
clf1 = DecisionTreeRegressor(random_state = 1, criterion = DT_CRITERIA[j], max_depth=DT_MAX_DEPTH[i])
pipe_DT = Pipeline([("pipe",basic_pipeline),
('clf1', clf1)])
pipe_DT.fit(X_train,y_train)
y_pred_train = pipe_DT.predict(X_train)
y_pred_test = pipe_DT.predict(X_test)
rmse_train = math.sqrt(mean_squared_error(y_train,y_pred_train))
rmse_sd.append(rmse_train)
rmse_test = rmse_test+math.sqrt(mean_squared_error(y_test,y_pred_test))
rmse_train_mean = np.mean(rmse_sd)
rmse_sd_train = np.std(rmse_sd)
rmse_test = rmse_test/5
result.append(["DTR",str(DT_MAX_DEPTH[i])+"_"+DT_CRITERIA[j],rmse_train_mean,rmse_sd_train,rmse_test])
### KNN #####
for j in range(0,len(KNN_NEIGHBOR)):
rmse_train=0
rmse_test = 0
for i in range(0,5):
X_train = training_entire_set[i][features]
y_train = training_entire_set[i]['score']
X_test = test_entire_set[i][features]
y_test = test_entire_set[i]['score']
clf3 = KNeighborsRegressor(n_neighbors=KNN_NEIGHBOR[j])
pipe_KNN = Pipeline([("pipe",basic_pipeline),
('clf3', clf3)])
pipe_KNN.fit(X_train,y_train)
y_pred_train = pipe_KNN.predict(X_train)
y_pred_test = pipe_KNN.predict(X_test)
rmse_train = math.sqrt(mean_squared_error(y_train,y_pred_train))
rmse_sd.append(rmse_train)
rmse_test = rmse_test+math.sqrt(mean_squared_error(y_test,y_pred_test))
rmse_train_mean = np.mean(rmse_sd)
rmse_sd_train = np.std(rmse_sd)
rmse_test = rmse_test/5
result.append(["KNN",str(KNN_NEIGHBOR[j]),rmse_train_mean,rmse_sd_train,rmse_test])
##### RIDGE #####
RIDGE={}
for j in range(0,len(RIDGE_PARAM)):
rmse_train=0
rmse_test = 0
for i in range(0,5):
X_train = training_entire_set[i][features]
y_train = training_entire_set[i]['score']
X_test = test_entire_set[i][features]
y_test = test_entire_set[i]['score']
clf2 = Ridge(alpha = RIDGE_PARAM[j])
pipe_RR = Pipeline([("pipe",basic_pipeline),
('clf2', clf2)])
pipe_RR.fit(X_train,y_train)
y_pred_train = pipe_RR.predict(X_train)
y_pred_test = pipe_RR.predict(X_test)
rmse_train = math.sqrt(mean_squared_error(y_train,y_pred_train))
rmse_sd.append(rmse_train)
rmse_test = rmse_test+math.sqrt(mean_squared_error(y_test,y_pred_test))
rmse_train_mean = np.mean(rmse_sd)
rmse_sd_train = np.std(rmse_sd)
rmse_test = rmse_test/5
result.append(['Ridge',str(RIDGE_PARAM[j]),rmse_train_mean,rmse_sd_train,rmse_test])
output = pd.DataFrame(result)
output.columns = ["Model_name","hyperparameter_setting","mean_training_rmse","sd_training_rmse","test_rmse"]
output
Model_name | hyperparameter_setting | mean_training_rmse | sd_training_rmse | test_rmse | |
---|---|---|---|---|---|
0 | DTR | 5_squared_error | 1.185240 | 0.006366 | 1.209052 |
1 | DTR | 5_absolute_error | 1.207012 | 0.007595 | 1.229086 |
2 | DTR | 10_squared_error | 1.107777 | 0.013955 | 1.250543 |
3 | DTR | 10_absolute_error | 1.148851 | 0.011303 | 1.264720 |
4 | DTR | 20_squared_error | 0.947981 | 0.019125 | 1.348417 |
5 | DTR | 20_absolute_error | 0.994747 | 0.032185 | 1.340303 |
6 | KNN | 1 | 0.504337 | 0.490956 | 1.600402 |
7 | KNN | 2 | 0.586845 | 0.420960 | 1.416636 |
8 | KNN | 5 | 0.693795 | 0.409158 | 1.274647 |
9 | KNN | 10 | 0.774806 | 0.400271 | 1.226887 |
10 | Ridge | 1e-05 | 0.830618 | 0.386141 | 1.180388 |
11 | Ridge | 0.0001 | 0.870485 | 0.370612 | 1.180387 |
12 | Ridge | 0.001 | 0.900384 | 0.355603 | 1.180381 |
13 | Ridge | 0.01 | 0.923639 | 0.341673 | 1.180317 |
14 | Ridge | 0.1 | 0.942245 | 0.328924 | 1.179707 |
15 | Ridge | 1.0 | 0.957514 | 0.317325 | 1.175053 |
16 | Ridge | 10.0 | 0.971251 | 0.307224 | 1.165559 |
17 | Ridge | 100.0 | 0.986336 | 0.299770 | 1.182385 |
18 | Ridge | 1000.0 | 1.002310 | 0.294558 | 1.212416 |
19 | Ridge | 10000.0 | 1.017486 | 0.290187 | 1.228917 |
20 | Ridge | 100000.0 | 1.031621 | 0.286265 | 1.241454 |
output.to_csv("part_1.4_results.csv",sep = '\t')
Here, we're going to perform optimization of one of the classification models - logistic regression. As a reminder...
The loss function of logistic regression (also known as the logistic-loss or log-loss) is given by: \begin{equation} J({\bf w}) = \frac{1}{n}\sum_{i=1}^n \log{(1 + \exp{(-y_i{\bf w}^\top{\bf x}_i}))} \label{eqn:logloss} \end{equation}
The gradient for this loss function, as derived in class, is: \begin{equation} \nabla J({\bf w}) = -\frac{1}{n}\sum_{i=1}^n \frac{y_i}{1 + \exp{(y_i{\bf w}^\top{\bf x}_i)}}{\bf x}_i \label{eqn:loglossgradient} \end{equation}
The Hessian for the loss function is given by: \begin{equation} {\bf H}({\bf w}) = \frac{1}{n} \sum_{i=1}^n \frac{\exp{(y_i{\bf w}^\top{\bf x}_i)}}{(1 + \exp{(y_i{\bf w}^\top{\bf x}_i)})^2}{\bf x}_i{\bf x}_i^\top \label{eqn:loglosshessian} \end{equation}
In Part 2.1 we will implement logistic regression with gradient descent.
logistic_objective
- compute the logistic loss for the given data set (see equation above)logistic_gradient
- compute the gradient vector of logistic loss for the given data set (see equation above)run_gradient_descent
- run the gradient descent algorithm, given these two functions.def logistic_gradient(w, X, y):
# compute the gradient of the log-loss error (vector) with respect
# to w (vector) for the given data X and y
#
# Inputs:
# w = d x 1
# X = N x d
# y = N x 1
# Output:
# error = d length gradient vector (not a d x 1 matrix)
# IMPLEMENT THIS METHOD - REMOVE THE NEXT LINE
if len(w.shape) == 1:
w = w[:,np.newaxis]
gradient = 0
l = 0
for i in range(0,len(X)):
yi=y[i]
xi=X[i]
wt=np.transpose(w)
l = l + (y[i]/(1+np.exp(np.dot(np.dot(-yi,wt),xi))))*(X[i])
return l/(len(X))
def logistic_objective(w, X, y):
# compute log-loss error (scalar) with respect
# to w (vector) for the given data X and y
# Inputs:
# w = d x 1
# X = N x d
# y = N x 1
# Output:
# error = scalar
# IMPLEMENT THIS METHOD - REMOVE THE NEXT LINE
if len(w.shape) == 1:
w = w[:,np.newaxis]
s=0
for i in range(1,len(X)):
yi=y[i]
xi=X[i]
wt=np.transpose(w)
s = s + np.log(1+np.exp(np.dot(np.dot(-yi,wt),xi)))
error = s/len(X)
return error
def run_gradient_descent(X,y):
old_w = np.array([-1]*X.shape[1])
# change this value! This is an unreasonable step size
step_size = 0.0000005
new_w =old_w - 1
while ((new_w-old_w)**2).sum() > .0000000001:
#IMPLEMENT THIS!
old_w=new_w
dw = logistic_gradient(new_w, X, y)
new_w=old_w-(step_size*dw)
return new_w
from scipy.stats import uniform, bernoulli
import functools
draw_binary = functools.partial(np.random.binomial,n=1)
## Simulated data to test your method
DATA_SIZE = 10000
x1 = bernoulli(.5).rvs(DATA_SIZE)
x2 = np.floor(uniform(18,60).rvs(DATA_SIZE))
true_w = [-9, 3.5, 0.2]
xb = true_w[0] + true_w[1]*x1 + true_w[2]*x2
p = 1/(1 + np.exp(-xb))
y = np.array([1 if draw_binary(p=v) else -1 for v in p])
from sklearn.linear_model import LogisticRegression
# notice that logistic regression as implemented in sklearn does not get the exact results either!
# so you shouldn't worry if you're a bit off
X = np.hstack([np.ones((len(xb),1)), x1[:,np.newaxis], x2[:,np.newaxis]])
model = LogisticRegression(solver='liblinear', random_state=0,fit_intercept=False)
model.fit(X,y).coef_
array([[-8.48327258, 3.31959859, 0.18859071]])
# this is how we will test your results
gd_result = run_gradient_descent(X,y)
# is your result relatively close to the truth?
np.abs(true_w-gd_result).sum()
14.699995258300001
In Part 2.2, we are going to use the Newton-Raphson method to optimize the same logistic regression model. To do so, we will need to 1) implement the logistic_hessian
function to compute the Hessian matrix of logistic loss for the given data set, and 2) use scipy
's optimize
function to perform the optimization, rather than writing a function by hand to do so.
def logistic_hessian(w, X, y):
# compute the Hessian of the log-loss error (matrix) with respect
# to w (vector) for the given data X and y
#
# Inputs:
# w = d x 1
# X = N x d
# y = N x 1
# Output:
# Hessian = d x d matrix
print(w.shape)
if len(w.shape) == 1:
w = w[:,np.newaxis]
print(w.shape)
# IMPLEMENT THIS METHOD - REMOVE THE NEXT LINE
hessian = np.zeros((X.shape[1],X.shape[1]))
for i in range(0,len(X)):
wt=np.transpose(w)
num = np.exp(np.dot(np.dot(y[i],wt),X[i]))
den = (1+np.exp(np.dot(np.dot(y[i],wt),X[i])))
den = np.dot(den,den)
hessian = hessian + np.dot(np.dot((num/den),(X[i])),np.transpose(X[i]))
hessian = hessian/len(X)
return hessian
from scipy.optimize import minimize
def run_newton_raphson(X,y):
args = (X,y[:,np.newaxis])
opts = {'maxiter' : 50} # Preferred value.
w_init = np.zeros((X.shape[1]))
print(w_init.shape)
# note: this is almost what you need, you just need to figure out what arguments are necessary here!
soln = minimize(fun=logistic_objective,
x0=w_init,
jac=logistic_gradient,
hess=logistic_hessian,
args=args,
method='Newton-CG',
options=opts)
w = np.transpose(np.array(soln.x))
w = w[:,np.newaxis]
return w
run_newton_raphson(X,y)
(3,) (3,) (3, 1)
array([[0.], [0.], [0.]])