![[Pasted image 20251105181742.png]]
+![[Pasted image 20251105220430.png]]
+- what functions are described by decision trees
+- how do we make predicitions using a given tree
+- how do we learn a tree from data
+- what are the advantages & disadvantages of this method?
+
+![[Pasted image 20251105220543.png]]
+
+![[Pasted image 20251105220626.png]]
+(too lazy to make mermaid diagram here)
+
+![[Pasted image 20251105220649.png]]
+
+- split the feature space using one feature at a time, recursively
+- the splitting
+ - forms a tree structure
+ - partitions the space in "rectangles"
+- in each rectangle / leaf, we assign a value
+
+
+![[Pasted image 20251105220734.png]]
+
+Tree Learning is made of 2 parts
+1. partitioning the feature space / learning the structure of the tree
+2. Assigning values to each part (first this)
+
+## Learning: value Assignment
+Given 0-1 loss and the given partioning, what is the optimal value $c_l$ to assign to each leaf $l$.
+$$h(\mathbf{x})=\sum_{l\in\text{Leaves}}c_l\mathbf{I}(\mathbf{x}\in l)$$
+what is the $\mathbf{I}$ function here? TODO
+
+$$\min_{c_l}\sum^N_{i=1}\mathbf{I}(h(\mathbf{x_i})\neq y_i)$$
+(answer: the majority label of all the objects assigned to the leaf is the optimal choice)
+
+![[Pasted image 20251105221215.png]]
+
+## learning: splitting
+- if all objects in a node belong to one class: we are done, it's a leaf
+- If not: find a node-variable combination that increases the quality of the tree the most if we split the node further
+- how do you measure quality?
+ - reduction in error before / after the split?
+ - not sensitive enough?
+
+## measure of improvement
+S = set of objects in the node
+F = feature that we are considering
+$|S_V|$ = number of objects in the node with value $V$
+
+$$I(S)-\sum_{V\in\text{Values}(F)}\frac{|S_V|}{|S|}I(S_V)$$
+If I(.) is the misclassification error in a set, this measures how much the 0-1 loss will decrease if we split the node into multiple nodes using feature $F$
+other ways to measure impurity of a node I(.)?
+
+Splitting Criteria: Choosing I(S)
+Missclassification:
+$$1-\max_c p_c$$
+
+Entropy:
+$$-\sum_c p_c\log(p_c)$$
+
+Gini Index:
+$$\sum_c p_c(1-p_c)$$
+
+![[Pasted image 20251105221710.png]]
+![[Pasted image 20251105221727.png]]
+TODO (waht the fuck do these graphs mean)
+
+![[Pasted image 20251105221833.png]]
+
+## information gain
+
+$$I(S)-\sum_{V\in\text{Values}(F)}\frac{|S_V|}{|S|}I(S_V)$$ with $$I(S)=-\sum_c p_c\log(p_c)$$ (which is entropy)
+
+![[Pasted image 20251105222032.png]]
+
+## splitting criteria: gain ratio
+- the information gain is biased towards features with many discrete values (that checks out, when you have just a bunch of categories essentially, you would also be really eager to split on those a lot)
+- we may overfit on such features (for instance, someone's social security number)
+- one idea by Quinlan is to correct for this using the entropy of the feature, the intrinsic value (IV)
+- actually some (many) implementations only use binary splits
+$$IG(S)=I(S)-\sum_{V\in\text{Values}(F)}\frac{|S_V|}{|S|}I(S_V)$$
+$$IV(S)=-\sum_{V\in\text{Values}(F)}\frac{|S_V|}{|S|}\log\frac{|S_V|}{|S|}$$
+$$\text{GainRatio}(S)=\frac{IG(S)}{IV(S)}$$
+
+Splitting: continuous variables
+
+extra challenge: we need to find a threshold for the split
+consider all thresholds and pick the optimal one
+how many thresholds do we need to consider?
+
+Learning: Stopping
+- if we try to fit pure trees (ones where all the nodes are really just one single thing) - we quickly start overfitting, which makes sense because we also try to fit the outliars really hard
+
+- criteria:
+ - min node size
+ - max tree depth
+ - min information gain
+
+- alternative / complementary strat:
+ - grow and prune
+
+
+## learning: pruning
+$$C_\alpha(\text{Tree})=\sum_{l\in\text{Leaves}}|S_l|I(S_l)+\alpha|\text{Leaves}|$$
+Leaves: Leaves in the tree
+for each alpha, optimise the tree by removing subtrees
+>select optimal alpha using a validation set!
+
+![[Pasted image 20251105222820.png]]
+
+## generalisations:
+
+- Multi-Class:
+ - take the sum over all classes in the entropy term (should be easy enough)
+- Regression
+ - Squared loss as the loss function
+ - Leaf value: average outcome
+ - split criterion: min the variance
+ - Covered in ProbStat course
+
+Summary:
+- non-linear classifier
+- choices in learning:
+ - splitting criterion (information gain, gain ratio, entropy / gini index, etc.)
+ - stopping (min gain, node size, max depth)
+ - pruning (remove parts based on performance on validation set)
+
+## pros and cons
+
+pros:
+- "interpretable"
+- automatic feature selection
+- easy to incorporate discrete features and missing values
+- fast
+
+cons:
+- unstable
+- cannot model linear relationships efficiently
+- "greedy"
+
+
+## classifier combining:
+- multiple classifiers that each make a prediction: perhaps they make different kinds of mistakes
+- how do we combine these predictions?
+- does it help to combine them?
+
+![[Pasted image 20251105224356.png]]
+
+### condorcet's Jury Theorem
+
+*Question*
+ When does it make sense to combine decisions?
+
+*Theorem*
+ if p > 0.5, adding voters to the jury increases probability of correct majority vote (assuming voting is independent)
+
+## Combining Strategies
+- Fixed Rules
+ - Hard rules e.g. majority voting
+ - soft rules: e.g. mean, product $$
+\begin{align}
+\max_y\sum_{j=1}^C\frac{1}{C}h_j(y|x)\\
+\max_y\prod_{j=1}^C h_j(y|x) \\
+\min_{p(y|\mathbf{x})}\sum_jD(h_j(y|\mathbf{x}),p(y|\mathbf{x}))
+\end{align}
+$$
+
+- learned rules
+ - learn a classifier to output a decision based on the outputs o a set of base classifiers
+
+
+### what to combine
+- where do multiple classifiers come from?
+ - different classifiers on the same dataset?
+ - the same classifier on different datasets?
+- Diversity may be helpful
+ - let's consider an exmaple where we construct "diverse" classifiers using different versions of the same dataset
+
+## random forests
+
+- single tree might easily overfit and might be unstable
+- combine multiple trees
+- should be sufficiently "different"
+ 1. randomly s elect features
+ 2. randomly select samples (with replacement)
+
+random forest: construct a large number of trees using randomly selected objects and features and come their decisions
+
+## bagging (bootstrap aggregation)
+![[Pasted image 20251105225000.png]]
+
+## random subspaces
+![[Pasted image 20251105225024.png]]
+
+## random forest
+![[Pasted image 20251105225036.png]]
+
+## parameters
+Trees:
+- depth
+- number of leaves
+- min number of objects per node
+- information criterion
+- pruning
+
+Forest:
+- number of trees
+- size of subspaces considered
+
+Pros:
+- flexible / low bias
+- works for many different types of data
+- embarassingly parallell (one hell of an adjective here by the slides)
+- produces out-of-bag (OOB) estimates
+- (scale) invariant
+- (relatively) few hyperparameters
+
+Cons:
+- harder to interpret than single trees
+- computationally expensive
+
+
+- decision trees: a greedy space partitioning approach to non-linear classification that leads to an "intuitive" classifier, but typically has high variance
+- classifier combining can lead to better predictions by combining multiple diverse classifiers
+- random forests are a combination of decisions trees constructed u sing random subsampling of objects and features
+
+## multi-layer perceptrons
+(non-linear discriminative classifiers)
+
+## today: the common learning algorithm used in ML today
+- multi-layer perceptrons / artifical neural networks / feed-forward "deep" neural networks
+- connected topics:
+ - logistic regression
+ - gradient descent
+ - classifier combining
+
+![[Pasted image 20251105225616.png]]
+
+![[Pasted image 20251105225642.png]]
+we sure love this
+
+## perceptron math
+$$\sum^N_{i=1}\max(-y_i(\mathbf{w^\top}\mathbf{x}_i+w_0),0)$$
+Loss: perceptron loss
+function class: linear
+optimiser: (stochastic) gradient descent
+
+## perceptron training
+$$\frac{\partial J(\mathbf{w})}{\partial\mathbf{w}}=\sum_{y_i(\mathbf{w^T\mathbf{x}_i})<0}-y_i\mathbf{x}_i$$
+$$\mathbf{w}^{t+1}=\mathbf{w}^t+\alpha_t\sum_{y_i(\mathbf{w^{t^T}\mathbf{x}_i})<0}y_i\mathbf{x}_i$$
+guaranteed to converge if:
+1. problem is linearly separable
+2. we choose the right, decreasing, learning rate
+
+## logistic regression
+![[Pasted image 20251105230143.png]]
+
+
+![[Pasted image 20251105230213.png]]
+
+## multi-layer perceptrons
+![[Pasted image 20251105230237.png]]
+![[Pasted image 20251105230254.png]]
+![[Pasted image 20251105230314.png]]
+
+![[Pasted image 20251105230333.png]]
+
+
+![[Pasted image 20251105230342.png]]
+
+- consufing naming: MLP as general term
+- other activation functions:
+ - sigmoid (logistic (we had this one earlier))
+ - Tanh
+ - Modern alternatives: ReLU and its variants
+
+![[Pasted image 20251105230519.png]]
+
+
+![[Pasted image 20251105230529.png]]
+
+
+## prediction: forward pass
+Given a model, we can do prediction / inference in a *forward pass*
+![[Pasted image 20251105230600.png]]
+
+## objective function (still J)
+![[Pasted image 20251105230636.png]]
+(is the objective function in this case, with one hidden layer)
+![[Pasted image 20251105230652.png]]
+
+## why learning is hard
+- if the model output is wrong, how shuoold we update the model? which part of the error should we attribute to the different weights?
+![[Pasted image 20251105232759.png]]
+
+## learning the MLP weights
+- remember the general setup: we are again going to use gradient descent so we need the gradient
+- intuitively, what the gradient will do is propagate the error at teh end of the network, back across the graph to tell us how to change the weights
+- this clever way of efficiently calculating the gradient is called "back propagation"
+
+$$\mathbf{w}^{t+1}=\mathbf{w}^t-\alpha_t\nabla_{\mathbf{w}}J(\mathbf{w}^t)$$
+the inverted delta is a "Nabla"
+$$\nabla_\mathbf{w}J(\mathbf{w})=
+\begin{bmatrix}
+\frac{\partial J(\mathbf{w})}{\partial w_1} \\
+... \\
+\frac{\partial J(\mathbf{w})}{\partial w_n}
+\end{bmatrix}$$
+## backpropagation
+![[Pasted image 20251105233229.png]]
+![[Pasted image 20251105233246.png]]
+
+it is important you understand the intuition here: why this is called back-propagation and how it relates to the gradient
+
+
+![[Pasted image 20251105233330.png]]
+(TODO: yo alright you lost me here ngl, im gonna have to pick this one apart)
+
+![[Pasted image 20251105233501.png]]
+![[Pasted image 20251105233510.png]]
+
+## learning the MLP weights
+![[Pasted image 20251105233554.png]]
+![[Pasted image 20251105233604.png]]
+
+![[Pasted image 20251105233620.png]]
+
+there are many variations:
+- activation functions and how to combine them
+- how many hidden layers?
+- how to connect different neurons, share parameters etc.
+- intialisation, optimisation procedure, regularisation, weight decay ...
+- these combinations lead to the models that are used today:
+ - Convolutional neural networks
+ - residual networks
+ - transformers
+
+## challenges
+- non-convex risk function: may get stuck in local optima, converge slowly etc.
+- many architectural choices: which is optimal?
+- flexible model: risk of overfitting
+- there is an *art* to getting networks to work well
+- difficult to interpret the model (black box)
+- with many parameters: computationally demanding (but can be partially addressed with hardware)
+
+![[Pasted image 20251105233825.png]]
+
+![[Pasted image 20251105233857.png]]
+
+![[Pasted image 20251105233915.png]]
+
+![[Pasted image 20251105233924.png]]
+
+![[Pasted image 20251105233943.png]]
+
+Pros and cons:
+
+Pros:
+- flexible model class
+- good empirical performance on many (structured) problems
+- can be easily adapted to different learning settings
+
+Cons:
+- computationally really expensive
+- lots of hyper-parameters
+- does not converge to a unique optimum
+- optimizing them can be an art
+- hard to interpret
+
+recap:
+- perceptrons are simple linear binary classifiers
+- multi-layer perceptions are connected architectures of perceptron-like nodes, inspired by a simplfiied model of the brain
+- they are trained using (stochastic) gradient descent, efficiently calculating the graident using back propagation
+
+## unsupervised learning
+until now we have been using supervised methods
+(cinema)
+- each training example described by a feature vector and a label
+
+unlabelled data: what now?
+- unsupervised learning: no labels / targets present
+
+What can you do?
+- Clustering
+ - discover structures in unlabelled data
+- Dimensionality Reduction
+ - does not use information about the labels
+
+## clustering
+- explain what clustering is an its applications
+- explain k-,means algorithm
+- explain hierarchical clustering, single and complete link
+- rpos and cons of k-means and hierarchical clustering
+- implement k-means
+
+
+![[Pasted image 20251105234401.png]]
+
+clusteriing is about finding natural groups in the data where
+- items within the group are close together
+- items between groups are far apart
+
+subjective: for one a pattern is interesting, for another boring
+
+![[Pasted image 20251105234436.png]]
+
+![[Pasted image 20251105234455.png]]
+
+## clustering applications:
+- market research: find groups of similar customers
+- social networks: find communities with similar interests / characteristics
+- recommender systems: find groups of users with similar ratings
+
+## what do we need for clustering?
+1. proximity measure either
+ 1. similarity measure $s(x_i, x_k)$: large if x_i and x_k are similar, or
+ 2. dissimilarity (distance) measure, small if they are similar
+2. Criterion function to evaluate a clustering
+ 1. ![[Pasted image 20251105234627.png]]
+3. algorithm to compute clustering
+ 1. e.g. by optimizing the criterion function
+
+## distance measure
+- typically we need to define a distance between objects first:
+- euclid: $$d(\mathbf{x},\mathbf{y})=\sqrt{\sum^l_{i=1}(x_i-y_i)^2}$$ Manhattan:
+$$d(\mathbf{x},\mathbf{y})=\sum^l_{i=1}|x_i-y_i|$$
+even more:
+- Cosine similarity: $$s_{\text{cos}}(\mathbf{x},\mathbf{y})=\frac{\mathbf{x}^T\mathbf{y}}{||\mathbf{x}||||\mathbf{y}||}$$
+- Pearson's correlation coefficient:
+![[Pasted image 20251105234949.png]]
+
+and more for all kinds of features like discrete, mixed and categorical
+
+## cluster eval. (hard)]
+- no agreed upon performance measure, many proposed
+- intra-cluster cohesion (compactness)
+ - cohesion measures how near the data points in a cluster are to the cluster's mean
+ - sum of sqaured errors (SSE) is a commonly used measure
+- inter-cluster separation (isolation):
+ - separation means that different cluster means should be far away from one another
+- in most applications, experts judgments are still the key
+
+![[Pasted image 20251105235122.png]]
+
+## Hard vs. Soft
+Hard Assignments: each point assigned to 1 cluster
+- K-means
+- Hierarchical clustering
+
+- Soft Assignemnts: each point assigned cluster membership
+ - fuzzy c-means
+ - probabilistic mixture models
+(soft is not covered for us)
+
+
+## k-means clustering
+K-means algorithm
+- let the set of $n$ data points be ${x_1, x_2,...x_n}$ (they are all vectors)
+- x_i is a feature vector okay
+- P is number of dimensions
+
+- the k-means algorithm partitions the given data into k-clusters:
+ - each cluster has a cluster centre (cluser mean), called a centroid
+ - K is specified by the user
+
+Given k, the k-means algorithm works as follows:
+1. choose random k random data points (seeds) to be initial centroids, cluster centers
+2. Assign each data point to the closest centroid
+3. re-compute the centroids using the current cluster memberships
+4. If a convergence criterion is not met, repeat steps 2 and 3
+
+K-means how it works:
+![[Pasted image 20251106000826.png]]
+![[Pasted image 20251106000833.png]]
+![[Pasted image 20251106000843.png]]
+![[Pasted image 20251106000902.png]]
+![[Pasted image 20251106000909.png]]
+
+- when do we know when to stop?
+- what is it trying to optimize?
+- how do we choose the number of centers?
+- are we sure it will terminate?
+- are we sure it will find an optimal clustering?
+
+## K-means convergence (stopping) criterion:
+- no (or min) reassignments of data points to different clusters
+- no (or min) change of c entroids or
+- min decrease in the sum of squared errors (SSE)
+
+Sum of Squared Errors:
+![[Pasted image 20251106001045.png]]
+![[Pasted image 20251106001107.png]]
+DISTORTION IS THE COST FUNCTION
+
+## random initalisation
+$2\leq k < m$
+- random pick K training examples
+- set $\mu_1, \mu_2,...,\mu_k$ equal to these examples
+lcoal optima:
+![[Pasted image 20251106001219.png]]
+![[Pasted image 20251106001228.png]]
+
+k-means summary:
+cons:
+- finds only convex clusters ("round shapes")
+- sensitive to initialisation
+- can get stuck in local minima
+
+pros:
+- very simple
+- fast
+
+## hierarchical clustering
+- selecting k is a problem of granularity
+ - how course or fine-grained is the structure in the data?
+ - no cluster algorithm able to pick k
+- instead of picking k find a hierarchy of structure
+ - coarse (is what they meant earlier as well): top level contains all points
+ - fine grained: bottom level one cluster per data point
+
+![[Pasted image 20251106001536.png]]
+
+## hierarchical clustering approaches
+- agglomerative (bottom-up)
+ - each point starts as cluster
+ - group two closest clusters
+ - stop at some point
+
+- divisive (top-down)
+ - all points start in one cluster
+ - split cluster in some "Sensible way"
+ - stop at some point
+(man i love these descriptions this is truly cinema)
+
+Divisive: hierachical k-means
+- apply k-means recurisvely:
+ - run k-mena on the original data for k=2
+ - for each of the resulting clusters run k-means with k=2
+
+agglomerative clustering:
+- starting from individual observations, produce sequence of clusterings of increasing size
+- at each level, twoc lusters chosen by criterion are merged
+
+1. determine distances between all clusters
+2. merge clusters that are closest
+3. IF no. of clusters > 1 then GOTO 1
+
+- which clusters to start with?
+- what is the distance between clusters?
+- final number of clusters?
+
+## different merging rules
+- single linkage: two nearest objects in the clusters $$g(R,S)=\min_{ij}\{d(x_i,x_j):x_i\in R,x_j\in S\}$$
+- complete linkage: two most remote objects in the clusters: $$g(R,S)=\max_{ij}\{d(x_i,x_j):x_i\in R,x_j\in S\}$$
+- Average linkage: cluster centres: $$g(R,S)=\frac{1}{|R||S|}\sum_{ij}\{d(x_i,x_j):x_i\in R, x_j\in S\}$$
+hierarchical clustering: how it works?
+
+Input:
+ - dataset X:[n x p] or directly:
+ - dissimiliarity matrix, D:[n x n]
+ - linkage type
+
+output: dendrogram
+
+
+![[Pasted image 20251106002203.png]]
+
+
+
+![[Pasted image 20251106002214.png]]
+![[Pasted image 20251106002222.png]]
+![[Pasted image 20251106002229.png]]
+![[Pasted image 20251106002243.png]]
+![[Pasted image 20251106002252.png]]
+![[Pasted image 20251106002259.png]]
+![[Pasted image 20251106002306.png]]
+![[Pasted image 20251106002314.png]]
+![[Pasted image 20251106002321.png]]
+![[Pasted image 20251106002344.png]]
+
+(ok so this is about finding parirs and not first creating a bunch of pairs, merging them and then all that stuff - no you just start merging the most similar stuff, alrigth that checks out I was getting a bit confused about that in terms of order)
+
+![[Pasted image 20251106002418.png]]
+
+
+![[Pasted image 20251106002426.png]]
+
+![[Pasted image 20251106002435.png]]
+
+![[Pasted image 20251106002443.png]]
+
+pros:
+- dendrogram gives overview of all possible clusterings
+- linkage type allows to find clusters of varying shapes
+- different dissimilarity measures can be used
+
+cons:
+- computationally intensive
+- clustering limited to "hiearchical nestings"
+
+https://bost.ocks.org/mike/miserables/
+
+http://benjiec.github.io/scatter-matrix/demo/demo.html#
+
+https://lvdmaaten.github.io/tsne/
+
+## clustering summary:
+- when we don't have (training) labels: clustering
+- definition of clusters is vague and evaluation is hard
+- for clustering we need to:
+ - define distance measure
+ - define criterion function to eval a clustering
+ - select clustering algo
+- discussing algos:
+ - hierachical clustering
+ - k-means clusttering
+
+## dimensionality reduction
+- typically, data sets are high-dimensional: each instance is described by many features
+- why do we want to reduce data dimensionality?
+ - make storage or processing of data easier
+ - (visual) discovery of hiddene structure in the data
+ - remove redundant and noisy features
+ - intrinsic dimensionality might be smaller
+- what does it mean to reduce dimensioanlity?
+- how can we use Principal Component Analysis to reduce dimensionality?
+
+![[Pasted image 20251106002901.png]]
+
+### redundant features:
+- get a population, predict some property
+ - instances represents as {urefu, height} pairs
+ - what is the dimensiaonlity of this data? (2?)
+ - height = urefu in Swahili
+
+so yea, cinema about that
+![[Pasted image 20251106003018.png]]
+
+
+- data points from different geographic areas over time:
+ - X1: # of skidding accidents
+ - X2: # of burst water pipes
+ - X3: # of snow plow expenditures
+ - X4: # of forest fires
+ - X5: # of patients with heat stroke
+
+... ok, but could we maybe just use temperature here?
+
+![[Pasted image 20251106003829.png]]
+
+![[Pasted image 20251106003852.png]]
+
+![[Pasted image 20251106003900.png]]
+
+typically: data sets are *high-dimensional*: each instance is described by many features
+
+## what does it mean to reduce dimensionality?
+
+## reducing dimensionality: methods
+- feature selection
+ - pick a subset of the original dimensions
+ - use domain knowledge
+ - use statistics-based selection methods
+- feature extraction
+ - construct a new set of dimensions $E_i=f(X_1,...,X_d)$![[Pasted image 20251106004052.png]]
+ - (linear) combinations of original
+ - whilst *preserving the structure* in the original data
+
+## feature extraction
+- many important dimensionality reductions techniques are linear techniques
+- these project the data onto a linear subspace of lower dimensionality (e.g. Principal Components Analysis)
+
+![[Pasted image 20251106004149.png]]
+
+
+![[Pasted image 20251106004202.png]]
+(thank you)
+
+
+## using principal components analysis (PCA)
+- pca maps the data ontoa linear subspace such that the variance of the projected data is maximized
+
+- defines a set of principal components
+ - 1st: direction of the greatest variability in the data
+ - 2nd: perpendicular to 1st, greatest variability of what's left
+ - ... and so on until d (original dimensionality)
+- first m components become m new dimensions
+ - change coordinates of every data point to these dimensions
+
+
+![[Pasted image 20251106004404.png]]
+
+![[Pasted image 20251106004416.png]]
+![[Pasted image 20251106004428.png]]
+
+![[Pasted image 20251106004446.png]]
+
+## PCA optimisation problem
+- PCA maps the data onto a linear subspace such that the variance of the projected data is maximized
+- SO PCA performs maximisation $$\max_{||w||^2=1}\text{var}(w^TX)$$
+- Constriant: $w^Tw=1\rightarrow||w||^2=1$
+ - requires w (direction of projection) to be unit vector (length is 1); solves scaling problem and guarantees unique solution
+
+## zero-mean data
+- recall the definition of variance ![[Pasted image 20251106004843.png]]
+- for PCA shift your data to zero-mean
+ - for each feature, calcualte the mean, subtract the mean from each data point in that feature
+- helps with:
+ - means does not affect the caclulation of variance
+ - simplifies covariance matrix: $M=\frac{1}{n}XX^T$
+
+![[Pasted image 20251106005003.png]]
+
+- our objective is to max. variance $$\max_{||w||^2=1}\text{var}(w^TX)$$
+- assuming zero-mean data: $$\text{var}(w^TX)=\frac{1}{n}(w^TX)(w^TX)^T=\frac{1}{n}w^TXX^Tw=w^TMw$$
+## lagrange multiplier
+- introduce a language multiplier $\lambda$ to incorporate the constraint into our optimization $$L(w,\lambda)=w^TMw-\lambda(w^Tw-1)$$
+![[Pasted image 20251106005249.png]]
+- it penalizes any deviation from the constraint
+ - if the constraint is violated (i.e. $w^Tw\neq 1$)it will either increase or decrease the value of the Lagrangian
+
+PCA opt. problem
+- enforce constraint using lagrange multiplier
+- ![[Pasted image 20251106005353.png]]
+- set gradient with respect to $\mathbf{w}$ to zero: $$
+\begin{align}
+\frac{\partial w^TMw-\lambda(w^Tw-1)}{\partial w}=0\\
+2Mw-2\lambda w=0\\
+Mw=\lambda w
+\end{align}$$
+![[Pasted image 20251106005526.png]]
+
+principal components are given by the eigenvectors of the covariance matrix
+
+first principal component is given by the eigenvector with the correspoding highest eigenvalue etc.
+
+![[Pasted image 20251106005628.png]]
+
+![[Pasted image 20251106005655.png]]
+
+![[Pasted image 20251106005719.png]]
+(honetly I dont get this because I dont get eigenpairs, eigenvectors and eigenvalues entirely, so I will have to look at this) TODO
+
+![[Pasted image 20251106005742.png]]
+
+![[Pasted image 20251106005803.png]]
+(though this overview at least makes some sense to me so that is good)
+
+(there is an example on the slides as well, check that out perhaps)
+
+![[Pasted image 20251106005851.png]]
+
+![[Pasted image 20251106005919.png]]
+![[Pasted image 20251106005956.png]]
+
+![[Pasted image 20251106010007.png]]
+
+![[Pasted image 20251106010018.png]]
+
+PCA has some practical issues:
+- covariance extremely sensitive to large values
+ - multiple some dimensions by 1000
+ - dominantes covariances
+ - become a principal compoennt
+ - normalize each dimension to zero mean and unit variance (formula was somewhere above)
+- PCA assumes underlying subspace is linear
+ - 1d: straightline, 2d: plane
+
+## PCA and classification
+- pca is unsupervised
+ - max. the overall variance of the data along a small set of directions
+ - does knot know anything about class labels
+ - can pick direction that makes it hard to separate classes
+- discriminative approach
+ - look for a dimension that makes it easy to separate classes
+
+pros:
+- reflects our intuitions about the data
+- dramatic reduction in size of data
+ - faster processing (as long as reduction is fast), smaller storage
+
+Cons:
+- too expensive for many applications
+- understand asumptions behind the methods (linearity, etc.)
+
+
+![[Pasted image 20251106010304.png]]
+
+recap:
+- dimensionality reductions builds a condensed data representation
+- this removes redundant or noisy features and identifies correlations
+- PCA projects data onto the principal eigenvectors fo teh covariance matrix: max. variance of the projection
+
+http://setosa.io/ev/principal-component-analysis/
+
+http://peterbloem.nl/blog/pca
+
+(there is something about Convolutional neural networks but it seems to be optional - I think I went to this last time actually but hey)
\ No newline at end of file