
Independent Component Analysis (ICA)


Independent Component Analysis is an Unsupervised Learning Algorithm.
To understand it we will use the standard example of the Cocktail Party Problem.

Cocktail Party Problem


In this problem, we have multiple speakers, say d of them, talking at a party, and we have d microphones recording their voices. Because the microphones are at different distances from the speakers, each one records a different combination of the speakers’ voices.

ICA can be used to separate the d individual speakers’ speech signals from the mixtures recorded by the microphones.

Mathematical Statement

Let s be the source vector consisting of the d speakers’ signals and x be the data we receive via the microphones. Then:

x = As

Here A is an unknown square matrix called the mixing matrix.

Our dataset consists of observations of x at different instants of time, and our goal is to recover the sources s that generated the data. We also assume that the mixing matrix A does not change with time.

To give a simpler equation, say d = 2 (we had 2 speakers at the party) and A is a 2x2 matrix. Then the signals recorded by the two microphones are:
x1 = A11·s1 + A12·s2
x2 = A21·s1 + A22·s2
As you can see, each microphone records a weighted mixture of the source signals.
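
To make the d = 2 case concrete, here is a minimal NumPy sketch that mixes two synthetic source signals with an arbitrary mixing matrix (both the signals and the matrix entries are made-up values for illustration):

```python
import numpy as np

# Two synthetic, independent source signals (hypothetical example data)
t = np.linspace(0, 1, 1000)
s1 = np.sin(2 * np.pi * 5 * t)              # speaker 1: a 5 Hz sine wave
s2 = np.sign(np.sin(2 * np.pi * 3 * t))     # speaker 2: a 3 Hz square wave
S = np.vstack([s1, s2])                     # shape (2, 1000): rows are the sources

# An arbitrary (assumed) 2x2 mixing matrix A
A = np.array([[1.0, 0.5],
              [0.3, 0.8]])

# Each microphone records a weighted combination of the sources: x = A s
X = A @ S                                   # shape (2, 1000): rows are the microphone recordings
print(X[:, :3])                             # the first few mixed samples
```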

ICA aims to find the matrix W, the inverse of A, called the unmixing matrix, which recovers the sources given the data x at various time instants:

s(i) = W·x(i)

where s(i) is the source vector at time i and x(i) is the observed data at time i.
If wj denotes the jth row of W, then the jth source at time i is recovered by multiplying that row with the full observation x(i):

sj(i) = wj·x(i)
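
If A were known, recovering the sources would just be a matrix inversion. The minimal sketch below (with made-up numbers) illustrates s(i) = W·x(i); the whole point of ICA, of course, is that A is not known in practice:

```python
import numpy as np

# Hypothetical sources (rows) at three time instants (columns) and a known mixing matrix
S = np.array([[0.9, -0.2, 0.4],
              [0.1,  0.8, -0.6]])
A = np.array([[1.0, 0.5],
              [0.3, 0.8]])
X = A @ S                                   # the observed data x(i) sit in the columns

# With A known, W = A^(-1) recovers the sources exactly: s(i) = W x(i)
W = np.linalg.inv(A)
print(np.allclose(W @ X, S))                # True

# The jth row of W recovers source j from a single observation x(i)
i = 1
print(W[0, :] @ X[:, i], S[0, i])           # both equal s1 at time instant i
```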

Assumptions of ICA

  1. Statistical independence:
    The source signal components (s1,s2,s3…, sd) are statistically independent of each other.
  2. Linearity:
    The observed data is assumed to be a linear combination of independent components of s
  3. Non-Gaussianity:
    The source signals si must not come from a Gaussian distribution.
    Reason:
    The major problem with Gaussians is that they are rotationally invariant, so we would not be able to find the right mixing matrix, which means we cannot recover the source vector.
    Rotationally Invariant:
    This means that the properties of a Gaussian distribution are not changed when the coordinate system is rotated.
    Proof (optional):
    Any orthogonal matrix with determinant 1 is a rotation matrix. If R is a rotation matrix, then R·R(t) = R(t)·R = I, where R(t) denotes the transpose of R.
    Let X be a random vector drawn from a multivariate Gaussian with mean vector μ and covariance matrix Σ, and let Y = RX be the result of applying the rotation R to X.
    Y also belongs to a multivariate Gaussian, with mean vector Rμ and covariance matrix RΣR(t).
    Since we have already assumed statistical independence, the covariance matrix is proportional to the identity, Σ = σ²I, and hence RΣR(t) = σ²RR(t) = σ²I.
    This shows that even after rotation the Gaussian has the same covariance as X, so the distribution (for zero-mean sources) is unchanged.
    Now, what is the problem with a Gaussian distribution when performing ICA?
    Say s came from a Gaussian with d = 2, so s ∼ N(0, I) where I is the 2x2 identity matrix.
    Now suppose x = As. Then the distribution of x is x ∼ N(0, AA(t)), since
    E[x·x(t)] = E[A·s·s(t)·A(t)] = A·E[s·s(t)]·A(t) = A·I·A(t) = AA(t).
    Now let R be a rotation matrix and let B = AR. Had we used B as our mixing matrix instead, x would still follow the same normal distribution x ∼ N(0, AA(t)), because
    E[x·x(t)] = E[B·s·s(t)·B(t)] = A·R·R(t)·A(t) = AA(t).
    Thus there is no way to tell whether A or B was the mixing matrix.

    As long as the sources are not Gaussian, ICA gives pretty good results (a quick numerical check of this rotation argument is shown right after the assumptions list).

  4. Mixing matrix:
    The mixing matrix A is assumed to be invertible and invariant with respect to time.
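
Here is the quick numerical check of the rotation argument from the non-Gaussianity discussion above: if s ∼ N(0, I), the observed covariance AA(t) does not change when A is replaced by B = AR for a rotation R, so the two mixing matrices produce identically distributed data. This is a minimal NumPy sketch with an arbitrary A and a randomly generated rotation:

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[1.0, 0.5],
              [0.3, 0.8]])                 # an arbitrary invertible mixing matrix

# Build a random rotation matrix R (orthogonal, determinant +1) via QR decomposition
R, _ = np.linalg.qr(rng.normal(size=(2, 2)))
if np.linalg.det(R) < 0:
    R[:, 0] *= -1                          # flip one column so det(R) = +1

B = A @ R                                  # an alternative mixing matrix

# For s ~ N(0, I): Cov(x) = A A^T if x = A s, and B B^T if x = B s
print(np.allclose(A @ A.T, B @ B.T))       # True: the observations cannot distinguish A from B
```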

ICA Ambiguities

1. Permutation Problem:
We cannot determine the speaker IDs; that is, we will not know whether speaker 1 or speaker 2 spoke a particular sentence.
Mathematically: given only the x(i)’s, there is no way to distinguish W from PW, where P is a permutation matrix (a matrix with exactly one 1 in each row and column; it simply shuffles the rows).

2. Scaling Problem:
We cannot determine the absolute intensity of the source signals.
Multiplying s1 by 5 does not necessarily mean that speaker 1 was shouting at 5 times the volume.
This is because we can scale A by α and s by 1/α and x remains the same, since x = (αA)·(s/α) = As.
However, for the cocktail party problem this is not a concern: scaling the source signal si by α only affects the volume of that speaker’s speech.
Even the sign does not matter, since si and -si sound identical when played on a speaker {the ear cannot distinguish a phase inversion of a pure tone}.
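
The scaling ambiguity is easy to check numerically: scaling A up by α and s down by 1/α leaves the observation x untouched. A minimal sketch with made-up values:

```python
import numpy as np

A = np.array([[1.0, 0.5],
              [0.3, 0.8]])                 # mixing matrix
s = np.array([0.7, -1.2])                  # one (hypothetical) source sample
alpha = 5.0

x_original = A @ s
x_rescaled = (alpha * A) @ (s / alpha)     # scale A by alpha and s by 1/alpha

print(np.allclose(x_original, x_rescaled)) # True: x cannot reveal the original scale of s
```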

Linear Transformations on Density functions

Before we move on to calculating our unmixing matrix W, we shall understand the effect of applying a linear transformation on a probability density function.
Let us see an example. Say s ∼ U(0, 1) {s follows a uniform distribution}
and x = As. What then is the probability density of x?
If W = A^(-1) {the unmixing matrix}, it is tempting to compute s = Wx and then evaluate the density of that s, giving px(x) = ps(Wx).
But this approach is wrong. Say A = 2; then x = 2s, so x ∼ U(0, 2), and the density of values between 0 and 2 is 1/2 rather than 1 {the area under the pdf curve must equal 1}.
Hence the complete formula is px(x) = ps(Wx)·|W|, where |W| is the determinant of the unmixing matrix.
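
A quick Monte Carlo check of this formula for the A = 2 example (a minimal sketch; the empirical density of x = 2s should sit near ps(Wx)·|W| = 1 · 0.5 everywhere on [0, 2]):

```python
import numpy as np

rng = np.random.default_rng(0)

s = rng.uniform(0, 1, size=100_000)        # s ~ U(0, 1)
A = 2.0
x = A * s                                  # x ~ U(0, 2)
W = 1.0 / A                                # the "unmixing" scalar, |W| = 0.5

# The empirical density of x should be close to ps(W x) * |W| = 1 * 0.5 on [0, 2]
hist, _ = np.histogram(x, bins=20, range=(0, 2), density=True)
print(hist.round(2))                       # values near 0.5 in every bin
```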

ICA Algorithm

Now coming to the easiest part of ICA, which is the algorithm itself
(ironic - _ -). It is what we have been doing all our lives: maximum likelihood estimation and gradient descent (ascent, in this case).

Suppose the density of each source sj is given by ps(sj). Since we have assumed the sources to be independent, the joint density of s is:

p(s) = ps(s1)·ps(s2)·…·ps(sd)

Using what we learned in the previous section about linear transformations of probability densities, and since s = Wx, the density of x is:

p(x) = ps(w1·x)·ps(w2·x)·…·ps(wd·x)·|W|

where wj denotes the jth row of W.
Now the only thing left is to specify a density function for s.
Recall that to define the probability density function for a variable, we first define a cumulative distribution function and then take its derivative to get the pdf.
A cdf is a monotonic function that increases from 0 to 1.
Note that you can choose any valid cdf except that of Gaussian distribution.
To continue with the calculations, I shall take the sigmoid function as my cdf {just an example}.
Thus the pdf is the derivative of the sigmoid. If g(s) = 1/(1 + e^(-s)) is the sigmoid function, then:
ps(s) = g′(s) = g(s)·(1 − g(s))

The square matrix W is the parameter of our model. Given a training set {x(i); i = 1, 2, …, n}, the log-likelihood is given by:

ℓ(W) = Σ over i = 1…n of [ Σ over j = 1…d of log g′(wj·x(i)) + log |W| ]

We would then apply gradient ascent to maximize the log-likelihood in terms of W. Also note that the derivative of the determinant of W with respect to W is given by:

∇W |W| = |W|·(W^(-1))(t)
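
As a quick sanity check of this identity, the analytic gradient of the determinant can be compared against finite differences (a minimal sketch with an arbitrary W):

```python
import numpy as np

W = np.array([[1.0, 0.4],
              [0.2, 1.5]])

# Analytic gradient of |W| with respect to W: |W| * (W^{-1})^T
analytic = np.linalg.det(W) * np.linalg.inv(W).T

# Finite-difference approximation, entry by entry
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(2):
    for j in range(2):
        W_plus = W.copy()
        W_plus[i, j] += eps
        numeric[i, j] = (np.linalg.det(W_plus) - np.linalg.det(W)) / eps

print(np.allclose(analytic, numeric, atol=1e-4))   # True
```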

Hence the update rule for a single training example {stochastic gradient ascent} is given as:

W := W + α·[ (1 − 2g(W·x(i)))·x(i)(t) + (W(t))^(-1) ]

where α is the learning rate and 1 − 2g(W·x(i)) denotes the vector whose jth entry is 1 − 2g(wj·x(i)).
[The pdf is the derivative of the sigmoid, g′(x) = g(x)·(1 − g(x)); differentiating its logarithm gives 1 − 2g(x), which is where that term comes from.]

After the algorithm converges, we can recover the sources via s(i) = W·x(i).
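
Below is a minimal NumPy sketch of the whole procedure as described above: a sigmoid cdf for the sources, stochastic gradient ascent on the log-likelihood, and then recovery via s(i) = W·x(i). The synthetic sources, the learning rate, and the number of epochs are illustrative choices, not tuned values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# --- Synthetic data: two non-Gaussian sources mixed by an "unknown" A ---
n = 2000
t = np.linspace(0, 8, n)
S = np.vstack([np.sin(2 * np.pi * t),               # source 1: sine wave
               np.sign(np.sin(3 * np.pi * t))])     # source 2: square wave
A = np.array([[1.0, 0.5],
              [0.3, 0.8]])                          # mixing matrix (unknown to the algorithm)
X = (A @ S).T                                       # n training examples x(i), shape (n, 2)
X -= X.mean(axis=0)                                 # zero-mean preprocessing (see Remarks)

# --- Stochastic gradient ascent on the log-likelihood ---
d = X.shape[1]
W = np.eye(d)                                       # initialise the unmixing matrix
alpha = 0.01                                        # learning rate (illustrative value)

for epoch in range(30):
    for i in rng.permutation(n):                    # shuffle the examples each epoch
        x = X[i]                                    # one training example, shape (d,)
        # Gradient for one example: (1 - 2 g(W x)) x(t) + (W(t))^(-1)
        grad = np.outer(1.0 - 2.0 * sigmoid(W @ x), x) + np.linalg.inv(W.T)
        W += alpha * grad

# --- Recover the sources (only up to permutation and scaling) ---
S_hat = X @ W.T                                     # row i holds s(i) = W x(i)
print(W)
```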

Applications of ICA:
1. Medical uses:
- EEG (electroencephalography): removing artifacts to analyze brain signals
- MEG (magnetoencephalography): isolating brain signals

2. Image processing
3. Genomics and Bioinformatics

Remarks:

  1. If we have prior knowledge of the sources’ densities, then it is a good idea to use that density for ps instead of the sigmoid-based one.
    But in the absence of such knowledge, the sigmoid function does a pretty good job.
  2. Also, we assume that either the data x(i) has been preprocessed to have zero mean, or it can naturally be expected to have zero mean (such as acoustic signals).
    This is important because our assumption that ps = g’(s) implies E[s] = 0 (the derivative of the logistic function is a symmetric function, and hence gives a density corresponding to a random variable with zero mean), which implies E[x] = E[As] = 0.
  3. We assume that all the x(i) in our training set are independent and thus we could write likelihood as a product over different examples.
    This is an incorrect assumption as audio and time series data are dependent but this will not hurt the performance as long as we have enough data.
    If we have correlated data in the training set we can shuffle them randomly to accelerate Gradient ascent.


Thank you for reading. I hope you got a good understanding of ICA.
Sudhanva Savyasachi S
