Sunday, May 13, 2018

Introduction to Kaggle Machine Learning - Titanic Competition

Have you ever wondered how those fancy Machine Learning engineers complete their analysis and model training?

How do they pick a good model and deploy all those predictions?

Well, you are in the right place.

The big picture:

A simple pipeline begins with data massaging (cleaning and transforming the raw data).

Then you pick an algorithm and train it on the data.

Finally, you deploy your model and use its predictions to support your business decisions.



So long story short, let's dive in:

1. First of all, the Titanic dataset has a bunch of missing data,

so we fill in the NaN cells with sensible defaults.

If it is a numerical column like Age, fill with the median (e.g. Age.median()).
If it is a categorical column, you can fill with the most common value (e.g. 'female' for Sex), or even a dedicated UNKNOWN label.

import pandas as pd

def fill_NAN(data):
    # Fill missing values column by column (options: 0, median, max, mean).
    data_copy = data.copy(deep=True)
    data_copy.loc[:, 'Age'] = data_copy.Age.fillna(data_copy.Age.median())
    data_copy.loc[:, 'Fare'] = data_copy.Fare.fillna(data_copy.Fare.median())
    data_copy.loc[:, 'Pclass'] = data_copy.Pclass.fillna(data_copy.Pclass.median())
    # Categorical columns: fill with a reasonable default value.
    data_copy.loc[:, 'Sex'] = data_copy.Sex.fillna('female')
    data_copy.loc[:, 'Embarked'] = data_copy.Embarked.fillna('S')
    return data_copy

train = pd.read_csv('train.csv')  # the standard Kaggle Titanic training file
data_no_nan = fill_NAN(train)
data_no_nan

2. Then, since we are using KNN today,

we need numeric features, because KNN computes distances between samples and cannot work with string categories. So we map female to 0 and male to 1, and encode Embarked the same way below.

def transfer_sex(data):
    # Encode Sex numerically: female -> 0, male -> 1.
    data_copy = data.copy(deep=True)
    data_copy.loc[data_copy.Sex == 'female', 'Sex'] = 0
    data_copy.loc[data_copy.Sex == 'male', 'Sex'] = 1
    return data_copy
 
data_after_sex = transfer_sex(data_no_nan)
data_after_sex


def transfer_embarked(data):
    # Encode Embarked numerically: S -> 0, Q -> 1, C -> 2.
    data_copy = data.copy(deep=True)
    data_copy.loc[data_copy.Embarked == 'S', 'Embarked'] = 0
    data_copy.loc[data_copy.Embarked == 'Q', 'Embarked'] = 1
    data_copy.loc[data_copy.Embarked == 'C', 'Embarked'] = 2
    return data_copy
 
data_after_embarked = transfer_embarked(data_after_sex)
data_after_embarked

3. Remove the Name column since it contributes almost nothing to the prediction, remove Ticket, and also remove Cabin since Pclass is a better (and far less sparse) signal.
# Remove Ticket, Cabin, Name
data_after_dropped = data_after_embarked.drop('Ticket', axis=1)
print(data_after_dropped.shape)
data_after_dropped = data_after_dropped.drop('Cabin', axis=1)
print(data_after_dropped.shape)
data_after_dropped = data_after_dropped.drop('Name', axis=1)
print(data_after_dropped.shape)
data_after_dropped
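
Before running a classifier we need to split the massaged data into features and labels, and hold part of it out for validation. The post uses train_X, train_y, test_X and test_y without showing how they were built, so here is a minimal sketch of one way to produce them with scikit-learn's train_test_split (the 80/20 split and the variable names are my assumptions, not from the original):

from sklearn.model_selection import train_test_split

# Survived is the label; everything else except PassengerId is a feature.
y = data_after_dropped['Survived']
X = data_after_dropped.drop(columns=['Survived', 'PassengerId'])

# Hold out 20% of the labelled data as a local validation set.
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=0)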

4. Now you can simply run KNN:
from sklearn.neighbors import KNeighborsClassifier
def KNN(train_X, train_y, test_X):
    k = 3
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(train_X, train_y)
    pred_y = knn.predict(test_X)
    return pred_y

pred_y = KNN(train_X, train_y, test_X)

print(train.shape)
print(train_X.shape,
      train_y.shape,
      test_X.shape,
      test_y.shape,
      pred_y.shape)

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
print(accuracy_score(test_y, pred_y))
print(confusion_matrix(test_y, pred_y))
print(classification_report(test_y, pred_y))

5. Or instead, I recommend using 5-fold cross-validation to pick k:
from sklearn.model_selection import cross_val_score
for k in range(1, 6):
    knn = KNeighborsClassifier(n_neighbors = k)
    print(k, cross_val_score(knn, train_X, train_y, cv= 5))

1 [ 0.65921788  0.68715084  0.7247191   0.69662921  0.66666667]
2 [ 0.66480447  0.69273743  0.71910112  0.69662921  0.68361582]
3 [ 0.65921788  0.7150838   0.7247191   0.74157303  0.72316384]
4 [ 0.67039106  0.70391061  0.69101124  0.70786517  0.70056497]
5 [ 0.65921788  0.67597765  0.69662921  0.7247191   0.72316384]


from which we can see that k = 3 is the best parameter.
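
Eyeballing the five fold scores works, but averaging them makes the comparison less error-prone. A small optional sketch along the same lines (not in the original post):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Average the 5 fold scores for each k and keep the best one.
scores = {}
for k in range(1, 6):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, train_X, train_y, cv=5).mean()
best_k = max(scores, key=scores.get)
print(scores, 'best k =', best_k)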

6. Finally, adopt the best parameter from your previous experiments, then generate the result and submit.
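
In practice that means retraining on all of the labelled data with k = 3, predicting on the (identically massaged) Kaggle test set, and writing the submission file. A minimal sketch, assuming the test set has gone through the same fill_NAN / transfer_sex / transfer_embarked / drop steps and is stored in test_X_final, with its ids kept in test_passenger_ids (both names are hypothetical, not from the original):

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Retrain on the full training data with the best k found above.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# test_X_final: hypothetical name for the preprocessed Kaggle test set.
pred = knn.predict(test_X_final)

submission = pd.DataFrame({'PassengerId': test_passenger_ids,
                           'Survived': pred})
submission.to_csv('submission.csv', index=False)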

Ok, so here are some tricks to speed you up and help you avoid some pitfalls:
1. Set the id column aside, then combine the training and testing data together, so you do not need to apply the same data massaging twice (see the sketch below).

2. Pick the id back up at the end and put it in the first column of the final result.
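
A minimal sketch of that combine-then-split-back trick, assuming the raw train.csv and test.csv from Kaggle with their standard columns:

import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Set the ids and the label aside, then massage everything in one pass.
test_ids = test['PassengerId']
y = train['Survived']
combined = pd.concat([train.drop(columns='Survived'), test],
                     ignore_index=True)
# ... apply fill_NAN / encoding / feature engineering to `combined` ...

# Split back: the first len(train) rows are the training set again.
train_X = combined.iloc[:len(train)]
test_X = combined.iloc[len(train):]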


To sum up:

A. Preprocess:
1. Read the data
2. Visualization: heatmap of correlation ('Pearson Correlation of Features')
3. Combine training and testing data for a single massaging pass
4. Feature selection - drop irrelevant columns
5. Fill NaN values
6. One-hot encoding (see the sketch below)
7. Feature engineering
8. Split back into training and testing sets
9. (optional) Chop the data further as you see fit

optional: matplotlib
optional: feature engineering
optional: dropna - X_train = X_train.dropna(axis=1, how='all')
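
The walkthrough above used simple label encoding for Sex and Embarked; one-hot encoding is the usual alternative and is a one-liner in pandas. A small sketch (not from the original post):

import pandas as pd

# One-hot encode the categorical columns instead of mapping them to integers.
encoded = pd.get_dummies(data_no_nan, columns=['Sex', 'Embarked'])
print(encoded.columns)
# e.g. ..., Sex_female, Sex_male, Embarked_C, Embarked_Q, Embarked_S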

B:
1. Cross validation, grid search (see the sketch below)
2. Generating our base first-level models
3. Second-level predictions from the first-level output (stacking)
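
A minimal sketch of the cross-validation plus grid-search step with scikit-learn's GridSearchCV, reusing the KNN example from above (the parameter grid is only an illustration):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# 5-fold cross-validated search over a small hyperparameter grid.
param_grid = {'n_neighbors': list(range(1, 11)),
              'weights': ['uniform', 'distance']}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(train_X, train_y)
print(search.best_params_, search.best_score_)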

C:
1. Final prediction and submission