CSE 569 – HW4 – Fall 2018

Aditya Vikram Sharma (1215126588)

November 26, 2018

1 K-Means

The K-Means algorithm is used to cluster the points of a given dataset into multiple clusters based on the value of ‘K’. In this assignment, two datasets have been given, containing 1600 and 1500 points respectively, and simulations have been performed for both. Initially, to find the optimal value of ‘K’, the elbow rule is used: a graph of ‘K’ values against the Sum of Squared Errors divided by the total number of points is plotted for a fixed random initialization, and the optimal ‘K’ value is the point on the graph where the curve bends like an elbow. Next, fixing this ‘K’, the simulation is run for various values of ‘r’, the random initialization, and a graph of ‘r’ vs. (Sum of Squared Errors)/(Total number of points) is plotted to identify the best random initialization. The clustering obtained with that initialization is then depicted.
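As an illustration of the elbow procedure (a sketch on synthetic data, not the submitted code), the SSE/Total Points curve can be computed with a minimal NumPy k-means; `kmeans_sse` is a hypothetical helper introduced here:

```python
import numpy as np

def kmeans_sse(X, k, seed=0, iters=50):
    """Run a minimal k-means and return SSE divided by the number of points."""
    rng = np.random.RandomState(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return ((X - centroids[labels]) ** 2).sum() / len(X)

# Elbow rule: compute SSE/N for each candidate K and look for the bend.
rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])
curve = {k: kmeans_sse(X, k) for k in range(1, 7)}
```

Plotting `curve` against K shows a sharp drop up to the true number of clusters and only marginal gains after it, which is the "elbow".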

The algorithm used for K-Means in this assignment is shown here:

Algorithm 1: K-Means
1   k ← GetInput
2   for i ← 0 to k do
3       Centroid[i] ← RandomInitialization
4   end
5   newCentroid ← Centroid
6   for j ← 0 to MaxIter do
7       for x ← 0 to TotPoints do
8           for y ← 0 to k do
9               dist[y] ← EucDist(Centroid[y], Points[x])
10          end
11          ind ← Index(min(dist))
12          Cluster[ind] ← Points[x]
13      end
14      newCentroid ← avg(Cluster[ind])
15      if newCentroid != Centroid then
16          Centroid ← newCentroid
17          continue
18      else
19          break
20      end
21  end

For both datasets, the simulation is performed and the corresponding plots are shown in the following subsections.
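Lines 7–14 of Algorithm 1 (the assignment and update pass) can be sketched in vectorized NumPy form instead of the triple loop; this is an illustrative sketch, not the submitted implementation:

```python
import numpy as np

def kmeans_step(X, centroids):
    """One assignment + update pass of the k-means loop."""
    # dist[x, y] = EucDist(Centroid[y], Points[x]) via broadcasting.
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    ind = dist.argmin(axis=1)  # ind = Index(min(dist)) for each point
    # newCentroid = avg(Cluster[ind]); keep the old centroid if a cluster is empty.
    new = np.array([X[ind == j].mean(axis=0) if np.any(ind == j) else centroids[j]
                    for j in range(len(centroids))])
    converged = np.allclose(new, centroids)
    return new, ind, converged
```

Calling `kmeans_step` repeatedly until `converged` is True reproduces the outer loop of Algorithm 1.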


1.1 Dataset 1

This dataset contains 1600 points in total. A depiction of the dataset is shown in the figure below.

Figure 1: Dataset 1

The optimal ‘K’ value for this dataset was found with the elbow rule by plotting various values of ‘K’ against ‘Sum of Squared Errors / Total Points’ for a fixed random initialization. The optimal value of ‘K’ turned out to be 2, as can be seen in the figure below.

Figure 2: Elbow Rule

Now that K = 2 is fixed, the algorithm is run ‘r’ times with different random initializations, where ‘r’ is set to 5, to find the random initialization with the lowest Sum of Squared Errors / Total Points value; the corresponding graph is shown in figure 3.


Figure 3: Random Initialization vs SSE/Total Points

From the above plot, the SSE/Total Points values for the random initializations are: 1.762324974077466, 1.762324974077466, 1.7622818440138992, 1.762324974077466, 1.7622793729899464, and it is clear that random initialization 5 has the lowest SSE/Total Points value.
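The selection of the best run can be checked mechanically from the reported values (a trivial sketch):

```python
sse_per_point = [1.762324974077466, 1.762324974077466, 1.7622818440138992,
                 1.762324974077466, 1.7622793729899464]
# Random initializations are numbered 1..5; pick the one with the lowest SSE/N.
best_r = min(range(1, len(sse_per_point) + 1), key=lambda r: sse_per_point[r - 1])
# best_r == 5
```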

Using this random initialization, the points are clustered with K = 2 and the resulting plot is shown below.

Figure 4: Clustering of Dataset 1 for K = 2

With this random initialization and K value, the algorithm converged at iteration 4.


1.2 Dataset 2

This dataset contains 1500 points in total. A depiction of the dataset is shown in the figure below.

Figure 5: Dataset 2

The optimal ‘K’ value for this dataset was found with the elbow rule by plotting various values of ‘K’ against ‘Sum of Squared Errors / Total Points’ for a fixed random initialization. The optimal value of ‘K’ turned out to be 3, as can be seen in the figure below.

Figure 6: Elbow Rule

Now that K = 3 is fixed, the algorithm is run ‘r’ times with different random initializations, where ‘r’ is set to 5, to find the random initialization with the lowest Sum of Squared Errors / Total Points value; the corresponding graph is shown in figure 7.


Figure 7: Random Initialization vs SSE/Total Points

From the above plot, the SSE/Total Points values for the random initializations are: 1.0261745127196509, 1.026653218697321, 1.1787796700494195, 1.178782241593976, 1.0261849230725586, and it is clear that random initialization 1 has the lowest SSE/Total Points value.

Using this random initialization, the points are clustered with K = 3 and the resulting plot is shown below.

Figure 8: Clustering of Dataset 2 for K = 3

With this random initialization and K value, the algorithm converged at iteration 8.

The Python code for the K-Means algorithm implemented is shown on the next page.


# -*- coding: utf-8 -*-
"""
Created on Wed Nov 21 11:09:15 2018

@author: 1796a
"""
import matplotlib.pyplot as plt
import matplotlib.pyplot as pltt
import numpy as np
from collections import defaultdict
from copy import deepcopy

file = open(r"C:\Users\1796a\Documents\Sem 1\FSL\Assignment Codes\Assignment 4\Dataset_2.txt")
X = np.loadtxt(file)


def euc_dist(X, Y):  # Function to calculate the Euclidean distance
    return np.linalg.norm(X - Y)


def sse(X, Y):  # Function to calculate the squared error
    return (X - Y) ** 2


def kmeansClust(k=3, i=1):
    """
    C is a dictionary which stores the centroids.
    M randomly permutes the input, to take the initial centroids.
    d is a list of Euclidean distances between each centroid and a point.
    Cluster is a dictionary which appends each point to the corresponding index.
    PrevCent stores the previous centroids, to check whether they have
    changed after each iteration.
    """
    C = defaultdict(list)
    size = len(X)
    np.random.seed(i)
    M = np.take(X, np.random.permutation(X.shape[0]), axis=0)
    plt.scatter(X[:, 0], X[:, 1], s=25)
    plt.savefig(r"C:\Users\1796a\Documents\Sem 1\FSL\Assignment Codes\Assignment 4\initialPoints2.png")
    for j in range(0, k):
        C[j] = M[j]
    max_iter = 0
    PrevCent = deepcopy(C)
    for _ in range(50):  # Run the loop a maximum of 50 times
        max_iter = max_iter + 1
        Cluster = defaultdict(list)
        for x in range(0, len(X)):
            d = list()
            for j in range(0, k):
                d.append(euc_dist(C[j], X[x]))
            ind = d.index(min(d))
            Cluster[ind].append(X[x])
        for cl in range(0, k):
            C[cl][0] = 0
            C[cl][1] = 0
            for l in range(0, len(Cluster[cl])):
                C[cl][0] = C[cl][0] + Cluster[cl][l][0] / len(Cluster[cl])
                C[cl][1] = C[cl][1] + Cluster[cl][l][1] / len(Cluster[cl])
        # Converged only when no centroid has moved since the last iteration
        if any(list(PrevCent[p]) != list(C[c]) for p, c in zip(PrevCent, C)):
            PrevCent = deepcopy(C)
        else:
            print("Converged at Iteration ", max_iter)
            break
    Colors = ['b', 'y', 'g', 'c', 'm', 'k', 'w', 'r']
    d = 0
    for val in range(0, k):
        for p in range(0, len(Cluster[val])):
            pltt.scatter(Cluster[val][p][0], Cluster[val][p][1], s=25, c=Colors[val])
            # Calculate SSE here
            d = d + sse(Cluster[val][p][0], C[val][0]) + sse(Cluster[val][p][1], C[val][1])
        pltt.scatter(C[val][0], C[val][1], marker='*', c='black', s=200, alpha=0.5)
    pltt.savefig(r"C:\Users\1796a\Documents\Sem 1\FSL\Assignment Codes\Assignment 4\Cluster_3_r1_Graph.png")
    pltt.show()
    return d / size


def main():
    Val = {}
    k = int(input("Enter number of Clusters: "))
    r = int(input("Enter max r value: "))
    for i in range(1, r + 1):
        print("R = ", i)
        Val[i] = kmeansClust(k=k, i=i)
    print(Val)
    lists = sorted(Val.items())
    x, y = zip(*lists)
    plt.plot(x, y)
    plt.xlabel("Random Initialization")
    plt.ylabel("Sum of Squared Errors")
    plt.gcf().subplots_adjust(left=0.15)
    plt.savefig(r"C:\Users\1796a\Documents\Sem 1\FSL\Assignment Codes\Assignment 4\SSE_K_Graph2.png")
    plt.show()


if __name__ == "__main__":
    main()


2 Gaussian Mixture Model

A Gaussian Mixture Model is a representation of a weighted sum of Gaussian densities. In this task, there are two datasets, containing 1600 and 1500 points respectively, on which the EM algorithm for the GMM is performed. The EM algorithm has two stages, the E-step and the M-step: at the E-step, the expectation step, the component probabilities are evaluated using the current parameters, and at the maximization step the log-likelihood function is maximized and the parameters are re-estimated. This process is repeated until the algorithm converges, which is when the estimated parameters equal the estimate from the previous iteration. The algorithm implemented is shown below.

Algorithm 2: Gaussian Mixture Model – EM Algorithm

1   μ ← Mean
2   Σ ← Covariance
3   if initialization = kmeans then
4       μ ← Centroid(Kmeans)
5   end
6   iniLog ← LogLikelihood
7   while Algo != Converge do
8       for i ← 0 to len(df) do
9           C_i ← PMF(Data)
10          C_i ← C_i / Sum(C_i)
11          Val.append(C_i)
12      end
13      for j ← 0 to len(Val) do
14          C_j ← Maximize(C)
15          NewVal.append(C_j)
16          μn ← Sum(μ_j)
17      end
18      diff ← euclidean(μ, μn)
19      μ ← μn
20  end

The task involves two stages: one where the mean vectors and covariance are initialized randomly, and a second where the K-Means centroid values are used as the mean vectors together with the corresponding covariance. This process is done for both datasets and the plots are shown in the respective subsections.
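The E-step and M-step described above can be sketched as follows. This is an illustrative NumPy/SciPy sketch with full covariance matrices, not the submitted code (which uses diagonal covariances and hard assignments):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, means, covs, weights):
    """E-step: responsibility of each Gaussian component for each point."""
    K = len(weights)
    # Weighted density of every component at every point, shape (N, K).
    dens = np.column_stack([weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
                            for k in range(K)])
    return dens / dens.sum(axis=1, keepdims=True)  # normalize per point

def m_step(X, resp):
    """M-step: re-estimate weights, means and covariances from responsibilities."""
    Nk = resp.sum(axis=0)
    weights = Nk / len(X)
    means = (resp.T @ X) / Nk[:, None]
    covs = []
    for k in range(resp.shape[1]):
        d = X - means[k]
        covs.append((resp[:, k, None] * d).T @ d / Nk[k])
    return weights, means, covs
```

Alternating `e_step` and `m_step` until the log-likelihood (or the parameters) stops changing implements the loop of Algorithm 2.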

2.1 Dataset 1

This dataset contains 1600 points in total. A depiction of the dataset is shown in the figure below.

Figure 9: Dataset 1

2.1.1 Random Initialization

Here, the mean vectors and covariance are randomly initialized for K = 2 and used to run the E-M algorithm. The plots for the clustering and the log-likelihood can be seen below.

Figure 10: GMM Clustering with Random Initialization

Figure 11: Log Likelihood Graph for Random Initialization

As we can see from the figures, the algorithm converges well, and the log-likelihood also converges at a good rate with respect to the iterations, as seen in figure 11. The plot and log-likelihood graph for K-Means initialization are shown on the next page.


2.1.2 K-means Initialization

In this phase, the mean vectors and covariance are initialized using the K-Means clustering with K = 2 performed in the previous task; the centroids of the K clusters are used during initialization.

Figure 12: GMM Clustering with K-Means Initialization

Figure 13: Log Likelihood Graph for K-Means Initialization

As is visible, both initializations work well for the EM algorithm. However, it is theoretically better to use the K-Means initialization, since it already provides the centroids and the EM algorithm can converge more quickly.
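The K-Means-based initialization can be sketched as follows; `labels` and `centroids` are placeholder names for the output of the earlier K-Means run:

```python
import numpy as np

def init_from_kmeans(X, labels, centroids):
    """Build initial GMM parameters from a finished K-Means run:
    means = centroids; weights and covariances from each cluster's points."""
    k = len(centroids)
    weights = np.array([(labels == j).mean() for j in range(k)])
    means = np.asarray(centroids, dtype=float)
    covs = [np.cov(X[labels == j].T) for j in range(k)]
    return weights, means, covs
```

Because the means already sit near the cluster centers, EM started from these parameters typically needs fewer iterations than a random start.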


2.2 Dataset 2

This dataset contains 1500 points in total. A depiction of the dataset is shown in the figure below.

Figure 14: Dataset 2

2.2.1 Random Initialization

Here, the mean vectors and covariance are randomly initialized for K = 3 and used to run the E-M algorithm. The plots for the clustering and the log-likelihood can be seen below.

Figure 15: GMM Clustering with Random Initialization

As we can see from the figures, the algorithm converges well, and the log-likelihood also converges at a good rate with respect to the iterations, as seen in figure 16. The plot and log-likelihood graph for K-Means initialization are shown on the next page.


Figure 16: Log Likelihood Graph for Random Initialization

2.2.2 K-means Initialization

In this phase, the mean vectors and covariance are initialized using the K-Means clustering with K = 3 performed in the previous task; the centroids of the K clusters are used during initialization.

Figure 17: GMM Clustering with K-Means Initialization


Figure 18: Log Likelihood Graph for K-Means Initialization

As is visible, both initializations work well for the EM algorithm. However, it is theoretically better to use the K-Means initialization, since it already provides the centroids and the EM algorithm can converge more quickly. The code, implemented in Python 3.6, is shown on the next page.


# -*- coding: utf-8 -*-
"""
Created on Mon Nov 26 23:40:23 2018

@author: 1796a
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
from sys import maxsize

np.random.seed(42)
file = open(r"C:\Users\1796a\Documents\Sem 1\FSL\Assignment Codes\Assignment 4\Dataset_1.txt")
X = np.loadtxt(file)
xs = X[:, 0]
ys = X[:, 1]
labels = [1] * 800 + [2] * 800  # 1600 samples
data = {'x': xs, 'y': ys, 'label': labels}
df = pd.DataFrame(data=data)
fig = plt.figure()
plt.scatter(data['x'], data['y'], 24, c=data['label'])


def prob(val, myu, s, lam):
    # Weighted density of one component (diagonal covariance)
    p = lam
    for i in range(len(val)):
        p *= norm.pdf(val[i], myu[i], s[i][i])
    return p


def expectation(dF, param):
    # E-step: assign each point to the more probable component
    for i in range(dF.shape[0]):
        x = dF['x'][i]
        y = dF['y'][i]
        p_cluster1 = prob([x, y], list(param['myu1']), list(param['s1']), param['lambda'][0])
        p_cluster2 = prob([x, y], list(param['myu2']), list(param['s2']), param['lambda'][1])
        if p_cluster1 > p_cluster2:
            dF.loc[i, 'label'] = 1
        else:
            dF.loc[i, 'label'] = 2
    return dF


def maximization(dF, param):
    # M-step: re-estimate the mixing weights, means and covariances
    pts_cluster1 = dF[dF['label'] == 1]
    pts_cluster2 = dF[dF['label'] == 2]
    per_cluster1 = len(pts_cluster1) / float(len(dF))
    per_cluster2 = 1 - per_cluster1
    param['lambda'] = [per_cluster1, per_cluster2]
    param['myu1'] = [pts_cluster1['x'].mean(), pts_cluster1['y'].mean()]
    param['myu2'] = [pts_cluster2['x'].mean(), pts_cluster2['y'].mean()]
    param['s1'] = [[pts_cluster1['x'].std(), 0], [0, pts_cluster1['y'].std()]]
    param['s2'] = [[pts_cluster2['x'].std(), 0], [0, pts_cluster2['y'].std()]]
    return param


def dist(old_params, new_params):
    # Euclidean distance between the old and new mean vectors
    dist = 0
    for param in ['myu1', 'myu2']:
        for i in range(len(old_params)):
            dist += (old_params[param][i] - new_params[param][i]) ** 2
    return dist ** 0.5


value = maxsize
epsilon = 0.01
iters = 0
df_copy = df.copy()
df_copy['label'] = list(map(lambda x: x + 1, np.random.choice(2, len(df))))
params = pd.DataFrame(guess)  # 'guess' holds the initial parameter estimates
while value > epsilon:
    iters += 1
    newVal = expectation(df_copy.copy(), params)
    newParams = maximization(newVal, params.copy())
    value = dist(params, newParams)
    print("iteration {}, value {}".format(iters, value))
    df_copy = newVal
    params = newParams
fig = plt.figure()
plt.scatter(df_copy['x'], df_copy['y'], 24, c=df_copy['label'])


3 Binomial Mixture Model (Extra Credit)

A Binomial Mixture Model for estimating the parameters of two unknown coins can be approached using the Expectation Maximization algorithm. In this dataset, we have 100000 coin tosses by 2 coins, ‘C1’ and ‘C2’. There are 1000 sets in the dataset with 100 tosses in each set, and each set corresponds to a single coin, but we do not have the label of the coin; hence we can use Expectation Maximization to estimate the missing parameters, the probabilities of heads for both coins C1 and C2.

The notation used:
θ1 = probability of coin C1 landing heads
θ2 = probability of coin C2 landing heads

For the simulation, we initially set θ1 = 0.6 and θ2 = 0.5, as in [1]. Instead of calculating which coin generated the tosses, it is better to calculate the probability of each possible completion of the missing values (which is the label in this case).

Let us consider the first set (first row) in the dataset. There are 61 heads (1’s) and 39 tails (0’s).

P(61 H | C1) = C(100, 61) θ1^61 (1 − θ1)^39 = 0.0798   (1)

P(61 H | C2) = C(100, 61) θ2^61 (1 − θ2)^39 = 0.00711   (2)

Normalizing these, we get

W1 = (1) / ((1) + (2)) = 0.9181   (3)

W2 = (2) / ((1) + (2)) = 0.0819   (4)
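Equations (1)–(4) can be reproduced directly with `scipy.stats.binom`, using the initial values θ1 = 0.6 and θ2 = 0.5:

```python
from scipy.stats import binom

theta1, theta2 = 0.6, 0.5        # initial estimates, as above
p1 = binom.pmf(61, 100, theta1)  # P(61 H | C1), eq. (1), ~ 0.0798
p2 = binom.pmf(61, 100, theta2)  # P(61 H | C2), eq. (2), ~ 0.00711
w1 = p1 / (p1 + p2)              # eq. (3), ~ 0.9181
w2 = p2 / (p1 + p2)              # eq. (4), ~ 0.0819
```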

Since we know that each set of data (row) is produced by a single coin, we have (X_1, X_2, X_3, ..., X_1000) as the observed data, where each X_i is a vector of 100 coin tosses. We can denote the hidden coin labels as (Z_1, Z_2, Z_3, ..., Z_1000).

We know that

L(θ | X) = P(X | θ) = Σ_Z P(X, Z | θ)   (5)

If we denote by ‘h’ the number of heads in the i’th set,

P(Z_i | X_i, θ) = P(X_i, Z_i | θ) / P(X_i | θ)   (6)

where

P(X_i | θ) = C(100, h) θ1^h (1 − θ1)^(100−h) + C(100, h) θ2^h (1 − θ2)^(100−h)   (7)

and

Z_i ∈ {C1, C2}   (8)

Therefore, since we know that E(H) = n·p (n = 100 here),

E(No. of heads from C1 | X_i, θ1) = 100 · P(Z_i = C1 | X_i, θ1)   (9)

Therefore, at the expectation step (E-step) of the E-M algorithm, we calculate the binomial probability mass function for both coins C1 and C2 and append the resulting per-row probabilities of both coins for every set of 100 values (every row).


After the expectation step, we need to perform maximization: at the maximization step, we maximize the log-likelihood function.

The θ1 value at each iteration is calculated by normalizing the number of heads attributed to coin C1 over all 1000 sets by the sum of the heads and tails attributed to coin C1; the θ2 value for coin C2 is calculated similarly. This process continues, with the θ values updated at each iteration, and halts when the difference between the current θ values and the previous θ values is less than a threshold. The threshold value set in this algorithm is 1e-6.
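The per-iteration update just described can be sketched as follows; `heads` is a hypothetical length-1000 array holding the number of heads in each set:

```python
import numpy as np
from scipy.stats import binom

def em_step(heads, n, theta1, theta2):
    """One EM iteration for the two-coin binomial mixture.
    heads: number of heads in each set; n: tosses per set (100)."""
    # E-step: posterior weight of each coin for every set.
    p1 = binom.pmf(heads, n, theta1)
    p2 = binom.pmf(heads, n, theta2)
    w1 = p1 / (p1 + p2)
    w2 = 1.0 - w1
    # M-step: normalize expected heads by expected total tosses per coin.
    new_theta1 = (w1 * heads).sum() / (w1 * n).sum()
    new_theta2 = (w2 * heads).sum() / (w2 * n).sum()
    return new_theta1, new_theta2

def run_em(heads, n=100, theta=(0.6, 0.5), tol=1e-6, max_iter=10000):
    """Iterate until the theta values move less than the threshold."""
    t1, t2 = theta
    for _ in range(max_iter):
        n1, n2 = em_step(heads, n, t1, t2)
        if np.hypot(n1 - t1, n2 - t2) < tol:  # Euclidean distance of the update
            return n1, n2
        t1, t2 = n1, n2
    return t1, t2
```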

The algorithm implemented for this problem is given below.

Algorithm 3: Binomial Mixture Model
1   df ← Load Dataset
2   θ1 ← 0.6
3   θ2 ← 0.5
4   θ ← (θ1, θ2)
5   diff ← euclidean(θ, θn)
6   lim ← 1e-6
7   maxIter ← 10000
8   while diff > lim do
9       for i ← 0 to len(df) do
10          C1h ← BinomialPMF(C1)
11          C2h ← BinomialPMF(C2)
12          C1h ← C1h / (C1h + C2h)
13          C2h ← C2h / (C1h + C2h)
14          Val.append(C1h, C2h)
15      end
16      for j ← 0 to len(Val) do
17          C1h ← C1 · No_H
18          C1t ← C1 · No_T
19          C2h ← C2 · No_H
20          C2t ← C2 · No_T
21          NewVal.append(C1h, C2h)
22      end
23      θn1 ← Sum(C1h) / (Sum(C1h) + Sum(C1t))
24      θn2 ← Sum(C2h) / (Sum(C2h) + Sum(C2t))
25      θn ← (θn1, θn2)
26      diff ← euclidean(θ, θn)
27      θ ← θn
28  end

The simulation of the algorithm was performed using Python 3.6. After performing the simulations, the results obtained were θ1 = 0.5488 and θ2 = 0.4527, and it has been observed that even upon changing the initial parameters, i.e. the initial θ values, the final θ values converge to the values specified above. Screenshots of the simulation are shown in the figures on the next page.


The update of the parameters over the iterations is shown in Figure 19.

Figure 19: Parameter updates over the iterations

The final θ values for both coins, that is, the probabilities of heads for coins C1 and C2, are shown in figure 20.

Figure 20: Final θ values for both coins

A simulation was performed over the dataset using the EM algorithm in Python 3.6, with initial θ values set to 0.6 and 0.5 for the two coins respectively; the final θ values obtained were 0.5488 and 0.4527 respectively.

References

[1] https://www.nature.com/articles/nbt1406