Big Data - Machine Learning, Algorithms and Data Structures: 2018

Sunday, May 13, 2018

Introduction to Kaggle Machine Learning - Titanic Competition

Have you wonder how the fancy Machine Learning engineers complete their analysis and model training?

How they pick a good model and deploy all those predictions?

Well you are in the right place.

The big picture:

a simple pipeline begins with data massage.

then you will pick an algorithm and train on the data.

deploy your model online and support your business decisions.

So long story short, let's dive in:

1. First of all the Titanic dataset have a bunch of missing data.

so we filled it with NAN or Age.median.

If that a numerical column like the age, fill with median.
If that's a categorical column you can fill in female, or even UNKNOWN.

def fill_NAN(data):
data_copy = data.copy(deep=True)
#0, median, max, mean
data_copy.loc[:, 'Age'] = data_copy.Age.fillna(data_copy.Age.median())
data_copy.loc[:, 'Fare'] = data_copy.Fare.fillna(data_copy.Fare.median())
data_copy.loc[:, 'Pclass'] = data_copy.Pclass.fillna(data_copy.Pclass.median())
data_copy.loc[:, 'Sex'] = data_copy.Sex.fillna('female')
data_copy.loc[:, 'Embarked'] = data_copy.Embarked.fillna('S')
return data_copy

data_no_nan = fill_NAN(train)
data_no_nan

2. Then since we are using KNN today,

it would be great to change the male to 0, the female to 1. For the sake of the algorithm requirement.

def transfer_sex(data):
data_copy = data.copy(deep=True)
data_copy.loc[data_copy.Sex == 'female', 'Sex'] = 0
data_copy.loc[data_copy.Sex == 'male', 'Sex'] = 1
return data_copy

data_after_sex = transfer_sex(data_no_nan)
data_after_sex

def transfer_embarked(data):
data_copy = data.copy(deep=True)
data_copy.loc[data_copy.Embarked == 'S', 'Embarked'] = 0
data_copy.loc[data_copy.Embarked == 'Q', 'Embarked'] = 1
data_copy.loc[data_copy.Embarked == 'C', 'Embarked'] = 2
return data_copy

data_after_embarked = transfer_embarked(data_after_sex)
data_after_embarked

3. remove the names since they contribute nothing to the probability. And also remove the cabin since the PClass is a better metric.
# Remove Ticket, Cabin
# type(data_after_embarked)
# np_array = data_after_embarked.values
# type(np_array)
data_after_dropped = data_after_embarked.drop('Ticket', 1)
print(data_after_dropped.shape)
data_after_dropped = data_after_dropped.drop('Cabin', 1)
print(data_after_dropped.shape)
data_after_dropped = data_after_dropped.drop('Name', 1)
print(data_after_dropped.shape)
data_after_dropped

4. Now you can simple run KNN
from sklearn.neighbors import KNeighborsClassifier
def KNN(train_X, train_y, test_X):
k = 3
knn = KNeighborsClassifier(n_neighbors = k)
knn.fit(train_X, train_y)
pred_y = knn.predict(test_X)
return pred_y

pred_y = KNN(train_X, train_y, test_X)

print(train.shape)
# train_X = train[:, :]
# train_y = train[:, :]
print(train_X.shape,
train_y.shape,
test_X.shape,
test_y.shape,
pred_y.shape)

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
print(accuracy_score(test_y, pred_y))
print(confusion_matrix(test_y, pred_y))
print(classification_report(test_y, pred_y))

5. Or instead I recommend using a 5-fold cross validation:
from sklearn.model_selection import cross_val_score
for k in range(1, 6):
knn = KNeighborsClassifier(n_neighbors = k)
print(k, cross_val_score(knn, train_X, train_y, cv= 5))

1 [ 0.65921788  0.68715084  0.7247191   0.69662921  0.66666667]
2 [ 0.66480447  0.69273743  0.71910112  0.69662921  0.68361582]
3 [ 0.65921788  0.7150838   0.7247191   0.74157303  0.72316384]
4 [ 0.67039106  0.70391061  0.69101124  0.70786517  0.70056497]
5 [ 0.65921788  0.67597765  0.69662921  0.7247191   0.72316384]

form which we can see that the k = 3 is the best parameter.

6. Finally adopt the best parameter from your previous experiments, then generate result and submit.

Ok so here are some tricks to speed you up/ avoid some pitfalls:
1. leave id alone, then combine the training and testing data together. So you don't need to apply the same data massage twice.

2. pick the id up at last and put it in the first column of final result.

To sum up,
A. PreProcess:
1. read data
2. Visualization: Heatmap of Correlation
0. Combine for a single pass massaging
1. Feature Selection - drop irrelevant
2. fill NaN
3. white hot encoding
Feature Engineering
Visualization - 'Pearson Correlation of Features'
4. Split back into training and testing
optional 5. chop as you see fit

optional: matplotlib
optional: feature engineering
optional:dropna - X_train=X_train.dropna(axis=1, how='all')

B:
1. cross validation, grid search
1. Generating our Base First-Level Models
2. Second-Level Predictions from the First-level Output

C:
1. final prediction and submit

Friday, April 13, 2018

LintCode 970. Big Business - Weekly13

Big Business, lol

O(n) loop and execute a trade if and only if you can afford it, and also you can gain profit from it.

For any profitable but not yet affordable Business, put them in a minimum Heap.

Finally, check the heap for anything that you can afford and execute anything that's possible.

At last, you can no longer possibly afford anything else, thus return the latest K right away.

LintCode 972. Deliver The Message - Weekly13

http://www.lintcode.com/en/problem/deliver-the-message/

Given the information of a company's personnel. The time spent by the ith person passing the message is t[i] and the list of subordinates is list[i]. When someone receives a message, he will immediately pass it on to all his subordinates. Person numbered 0 is the CEO. Now that the CEO has posted a message, find how much time it takes for everyone in the company to receive the message?

Notice

The number of employees is n，n <= 1000.
Everyone can have multiple subordinates but only one superior.
Time t[i] <= 10000。
-1 represent no subordinates.

Use a queue to mark this level, and do a BFS search.
Keep updating the shortest time for each employee.
Stop the search whenever everybody's covered.

Then O(n) loop for each employees to get the maximum time and return.

Tuesday, March 20, 2018

Machine Learning notes

Supervised Machine Learning (providing labels)

Regression v.s. Classification:
one for the continuous data, the other for the discrete data.

Friday, March 16, 2018

LintCode 897. Island City - Weekly9

http://www.lintcode.com/en/problem/island-city/

You could use UnionFind, I just don't bother doing it.

As the islands are not dynamic UnionFind is unnecessary.

Just use plain old DFS, trigger it with a simple scan for cities that's not visited, ~~then paint all visited cities with 2(or maybe not, it doesn't matter now).~~

Then count the number of triggers.

Boom!

LintCode 898. Leftmost One - Weekly9

http://www.lintcode.com/en/problem/leftmost-one/

An interesting questions, i find it sufficient applying binary search.

LintCode 895. Friend Request - Weekly9

http://www.lintcode.com/en/problem/friend-request/

It's a stupid question. Don't bother optimize with sorting or anything.

Just brute force is more than sufficient.

Nothing more to say...sigh...

LintCode 896. PrimeProduct - Weekly9

http://www.lintcode.com/en/problem/prime-product/

it's a tricky question if you are trying to if-else for 2 to 9 for-loops.

but we can simply use the DFS search.

Then eliminate duplication, and apply ordering.

Saturday, March 3, 2018

LeetCode 794. Valid Tic-Tac-Toe State - Weekly Contest 74

https://leetcode.com/contest/weekly-contest-74/problems/valid-tic-tac-toe-state/

It's a simulation, just be patient and careful.

think of all the possibilities. Classified them. use | stand for the column strike, use - for the row strike.

use the \ and / for the diagonals.

Wednesday, February 28, 2018

LintCode 622. Frog Jump

http://www.lintcode.com/en/problem/frog-jump/

Suitable for the bottom-up DP approach:
the first jump must be 1 unit.
the other jumps can be multiple units.

That mean a if-else clause is needed inside the loop.
Then the multiple units call for a collection, and we don't want duplication, so a HashSet would be ideal.

There you go:

Saturday, February 24, 2018

LeetCode 791. Custom Sort String - Weekly Contest 73

https://leetcode.com/contest/weekly-contest-73/problems/custom-sort-string/

First I tried to use lambda for Arrays.sort(T[]), but it doesn't work that way.

So I come up with 3 arrays of the length of 27 which covers all the chars.

the
count: char -> occurrence
map: char -> i index
seq: index i -> char

this way, you can easily assemble the answer.

From their sorted sequence, with given char and the respective occurrence. all within your reach in O(1).

That's crazily fast! So let's do it!

LeetCode 789. Escape The Ghosts - Weekly Contest 73

https://leetcode.com/contest/weekly-contest-73/problems/escape-the-ghosts/

Come to think of it, I realize no matter which location they are locating, the distance to the target is what really matters.

So naturally we come up with a helper function to getDistance,

then, naturally we will compare the distances and see if any ghosts are dangerously close.

That's it.

LeetCode 788. Rotated Digits - Weekly Contest 73

https://leetcode.com/contest/weekly-contest-73/problems/rotated-digits/

Quite Straight Forward,
in case of 25, 69 they change to something beautiful.
in case of 0, 1, 8, they just don't mean anything,
in case of the 3, 4, 7 they are pretty much deal breaker.

the conditions are messy so I speed them up with an array.

Monday, February 19, 2018

LintCode 89. k Sum

http://www.lintcode.com/en/problem/k-sum/

/**
* dynamic programming
* DP Elements:
* 1. state:
* dp ijt means the combinations, of picking j numbers, from the first i numbers. that sum to the t.
*
* 2. Init
* dp 0j mean no number to pick, target 0 would meet, otherwise haha
* dp i0 mean pick 0 numbers, same as above.
*
* 3. Func
* dp ijt = no pick -> dp i - 1, j, t
* pick -> dp i - 1, j - 1, t - A[i]
*
* 4. Answer
* dp len, k
*/

Sunday, February 18, 2018

LintCode 154. Regular Expression Matching

http://www.lintcode.com/en/problem/regular-expression-matching/

A classic Google Problem.

A classic matching dp.

the Dynamic Programming Elements:

1. state
dp i, j means first i of A matching with first j of B.

2. Init
dp i, 0 -> i == 0;
dp 0, j -> isEmpty(j)

3. Func
dp i, j =
if(B.chatAt(j) == '*') {
check if the Ai matches, then |= helper(i - 1, j)
and always we can do without Bj, so |= helper(i, j - 2)
} else {
Bj match Ai? then |= helper(i - 1, j - 1);
else false;
}

4. Answer
dp m, n

PS: dp size m + 1 by n + 1.

Tuesday, February 13, 2018

LintCode 476. Stone Game

http://www.lintcode.com/en/problem/stone-game/

Game Theory problems are classified as DP problems.

And this is a classic one. Because there are two players gaming and so many possible moves,

that comes with a huge searching space with a lot of overlaps.

So overlaps must be eliminated by Memoization.

And let's see the 4 Elements of the DP:

1. initialization: (or termination)
dp[i][i] = 0;

2. transfer formula:
dp[i][j] = sum[i][j] + (for all the possible k) min(dp(i, k) + dp(k, j));

3. Result:
dp(0, length - 1);

Saturday, February 10, 2018

LeetCode 780. Reaching Points - Weekly Contest 71

https://leetcode.com/contest/weekly-contest-71/problems/reaching-points/

A move consists of taking a point (x, y) and transforming it to either (x, x+y) or (x+y, y).

Given a starting point (sx, sy) and a target point (tx, ty), return True if and only if a sequence of moves exists to transform the point (sx, sy) to (tx, ty). Otherwise, return False.

LeetCode 781. Rabbits in Forest - Weekly Contest 71

https://leetcode.com/contest/weekly-contest-71/problems/rabbits-in-forest/

In a forest, each rabbit has some color. Some subset of rabbits (possibly all of them) tell you how many other rabbits have the same color as them. Those answers are placed in an array.

Return the minimum number of rabbits that could be in the forest.

LeetCode 783. Minimum Distance Between BST Nodes - Weekly Contest 71

https://leetcode.com/contest/weekly-contest-71/problems/minimum-distance-between-bst-nodes/
Given a Binary Search Tree (BST) with the root node root, return the minimum difference between the values of any two different nodes in the tree.

Wednesday, February 7, 2018

LintCode 401. Kth Smallest Number in Sorted Matrix

http://www.lintcode.com/en/problem/kth-smallest-number-in-sorted-matrix/

Find the kth smallest number in at row and column sorted matrix.
Example
Given k = 4 and a matrix:
[
[1 ,5 ,7],
[3 ,7 ,8],
[4 ,8 ,9],
]
return 5

Saturday, February 3, 2018

LeetCode 779. K-th Symbol in Grammar - Weekly Contest 70

https://leetcode.com/contest/weekly-contest-70/problems/k-th-symbol-in-grammar/

Saturday, January 27, 2018

LeetCode 771. Jewels and Stones - Weekly Contest 69

https://leetcode.com/contest/weekly-contest-69/problems/jewels-and-stones/

LeetCode 775. Global and Local Inversions - Weekly Contest 69

https://leetcode.com/contest/weekly-contest-69/problems/global-and-local-inversions/

Tuesday, January 23, 2018

LeetCode 159. Longest Substring with At Most Two Distinct Characters - Google - HashMap, Two Pointers, String

The main idea is to maintain a sliding window with 2 unique characters.

The key is to store the last occurrence of each character as the value in the hashmap.

https://leetcode.com/problems/longest-substring-with-at-most-two-distinct-characters/description/

Saturday, January 20, 2018

LeetCode 768. Max Chunks to Make Sorted (ver. 2) - Weekly Contest 68

https://leetcode.com/contest/weekly-contest-68/problems/max-chunks-to-make-sorted-ver-2/

LeetCode 769. Max Chunks To Make Sorted (ver. 1) - Weekly Contest 68

https://leetcode.com/contest/weekly-contest-68/problems/max-chunks-to-make-sorted-ver-1/

LeetCode 766. Toeplitz Matrix - Weekly Contest 68

https://leetcode.com/contest/weekly-contest-68/problems/toeplitz-matrix/

Friday, January 19, 2018

LCS - Longest Common Subsequence - Dynamic Programming

https://practice.geeksforgeeks.org/problems/longest-common-subsequence/0/?ref=self

LIS - Longest Increasing Subsequence - Dynamic Programming

https://practice.geeksforgeeks.org/problems/longest-increasing-subsequence/0

Thursday, January 18, 2018

HR Hash Tables: Ice Cream Parlor - HackerRank Cracking the Coding Interview Challenges

https://www.hackerrank.com/challenges/ctci-ice-cream-parlor/problem

Wednesday, January 17, 2018

LC 375. Guess Number Higher or Lower II - LeetCode Google DynamicProgramming MinMax

https://leetcode.com/problems/guess-number-higher-or-lower-ii/description/

LC 374. Guess Number Higher or Lower - LeetCode Google BinarySearch

https://leetcode.com/problems/guess-number-higher-or-lower/description/

LC 278. First Bad Version - LeetCode Facebook BinarySearch

https://leetcode.com/problems/first-bad-version/description/

LC 683. K Empty Slots - LeetCode Hard Google TreeSet

https://leetcode.com/problems/k-empty-slots/description/

LC 35. Search Insert Position - LeetCode

This question requires a straightforward scan.
We can't help but notice there's only two situations:
0. empty array, so go ahead and insert into 0;
1. if you see nums[i] equal to the target, congratulations and return the i! Also, if you see nums[i] larger, go ahead and return the i because it will be inserted into i.
2. If you can't find such a fit in the Rule No.1. Go ahead and append it to the very end of the Array. And that would be nums.length;

So my clever reader now you can go ahead and give it a try yourself.
:)

Tuesday, January 16, 2018

LC 102. Binary Tree Level Order Traversal - LeetCode

Given a binary tree, return the level order traversal of its nodes' values. (ie, from left to right, level by level).

For example:
Given binary tree [3,9,20,null,null,15,7],
3
/ \
9 20
/ \
15 7
return its level order traversal as:
[
[3],
[9,20],
[15,7]
]

Monday, January 15, 2018

LC 765. Couples Holding Hands - Weekly Contest 67 - LeetCode

N couples sit in 2N seats arranged in a row and want to hold hands. We want to know the minimum number of swaps so that every couple is sitting side by side. A swap consists of choosing any two people, then they stand up and switch seats.

The people and seats are represented by an integer from 0 to 2N-1, the couples are numbered in order, the first couple being (0, 1), the second couple being (2, 3), and so on with the last couple being (2N-2, 2N-1).

The couples' initial seating is given by row[i] being the value of the person who is initially sitting in the i-th seat.

Example 1:

Input: row = [0, 2, 1, 3]
Output: 1
Explanation: We only need to swap the second (row[1]) and third (row[2]) person.
Example 2:

Input: row = [3, 2, 0, 1]
Output: 0
Explanation: All couples are already seated side by side.
Note:

len(row) is even and in the range of [4, 60].
row is guaranteed to be a permutation of 0...len(row)-1.

LC 764. Largest Plus Sign - Weekly Contest 67 - LeetCode

In a 2D grid from (0, 0) to (N-1, N-1), every cell contains a 1, except those cells in the given list mines which are 0. What is the largest axis-aligned plus sign of 1s contained in the grid? Return the order of the plus sign. If there is none, return 0.

An "axis-aligned plus sign of 1s of order k" has some center grid[x][y] = 1 along with 4 arms of length k-1 going up, down, left, and right, and made of 1s. This is demonstrated in the diagrams below. Note that there could be 0s or 1s beyond the arms of the plus sign, only the relevant area of the plus sign is checked for 1s.

Examples of Axis-Aligned Plus Signs of Order k:

Order 1:
000
010
000

Order 2:
00000
00100
01110
00100
00000

Order 3:
0000000
0001000
0001000
0111110
0001000
0001000
0000000
Example 1:

Input: N = 5, mines = [[4, 2]]
Output: 2
Explanation:
11111
11111
11111
11111
11011
In the above grid, the largest plus sign can only be order 2. One of them is marked in bold.
Example 2:

Input: N = 2, mines = []
Output: 1
Explanation:
There is no plus sign of order 2, but there is of order 1.
Example 3:

Input: N = 1, mines = [[0, 0]]
Output: 0
Explanation:
There is no plus sign, so return 0.
Note:

N will be an integer in the range [1, 500].
mines will have length at most 5000.
mines[i] will be length 2 and consist of integers in the range [0, N-1].
(Additionally, programs submitted in C, C++, or C# will be judged with a slightly smaller time limit.)

首先是把十字架的需求分解成四个方向。
这样理解就有思路了。
接着一看可以上动归。
四个方向分别开个dp数组，动态规划，全是O(N^N)的时空复杂度。
最后扫一遍得出结果，最边缘的地方特殊情况放最后处理。

最终时空复杂度O(N^N)

LC 762. Prime Number of Set Bits in Binary Representation - Weekly Contest 67 - LeetCode

Given two integers L and R, find the count of numbers in the range [L, R] (inclusive) having a prime number of set bits in their binary representation.

(Recall that the number of set bits an integer has is the number of 1s present when written in binary. For example, 21 written in binary is 10101 which has 3 set bits. Also, 1 is not a prime.)

Example 1:

Input: L = 6, R = 10
Output: 4
Explanation:
6 -> 110 (2 set bits, 2 is prime)
7 -> 111 (3 set bits, 3 is prime)
9 -> 1001 (2 set bits , 2 is prime)
10->1010 (2 set bits , 2 is prime)
Example 2:

Input: L = 10, R = 15
Output: 5
Explanation:
10 -> 1010 (2 set bits, 2 is prime)
11 -> 1011 (3 set bits, 3 is prime)
12 -> 1100 (2 set bits, 2 is prime)
13 -> 1101 (3 set bits, 3 is prime)
14 -> 1110 (3 set bits, 3 is prime)
15 -> 1111 (4 set bits, 4 is not prime)
Note:

L, R will be integers L <= R in the range [1, 10^6].
R - L will be at most 10000.

LC 763. Partition Labels - Weekly Contest 67 - LeetCode

A string S of lowercase letters is given. We want to partition this string into as many parts as possible so that each letter appears in at most one part, and return a list of integers representing the size of these parts.

Example 1:
Input: S = "ababcbacadefegdehijhklij"
Output: [9,7,8]
Explanation:
The partition is "ababcbaca", "defegde", "hijhklij".
This is a partition so that each letter appears in at most one part.
A partition like "ababcbacadefegde", "hijhklij" is incorrect, because it splits S into less parts.
Note:

S will have length in range [1, 500].
S will consist of lowercase letters ('a' to 'z') only.

这题要求最多的partition。
那么我们从左到右遍历一遍，用贪心算法。
能分马上分。除非这个字母在后边还出现而不可分离。
那么聪明的读者可能想到了，我怎么知道这个/些字符在后边还有没有，有的话在多后？
我这里采用的是先扫一遍用HashMap把每个char的起始坐标给缓存起来，
第二遍直接查缓存，进而得出结果。

时间复杂度就是O(n)