"type": "split",
"children": [
{
- "id": "69ea515b2aa83af1",
- "type": "split",
+ "id": "0d762e903c6b0576",
+ "type": "tabs",
"children": [
{
- "id": "0101728309e6c9e1",
- "type": "tabs",
- "children": [
- {
- "id": "e2e550886a75d1d2",
- "type": "leaf",
- "state": {
- "type": "markdown",
- "state": {
- "file": "University/Machine Learning/Full Notes.md",
- "mode": "source",
- "source": false,
- "backlinks": true,
- "backlinkOpts": {
- "collapseAll": false,
- "extraContext": false,
- "sortOrder": "alphabetical",
- "showSearch": false,
- "searchQuery": "",
- "backlinkCollapsed": false,
- "unlinkedCollapsed": true
- }
- },
- "icon": "lucide-file",
- "title": "Full Notes"
- }
- }
- ]
- },
- {
- "id": "0d762e903c6b0576",
- "type": "tabs",
- "children": [
- {
- "id": "214929be76b06d19",
- "type": "leaf",
- "state": {
- "type": "markdown",
- "state": {
- "file": "University/Machine Learning/Full Notes.md",
- "mode": "source",
- "source": false,
- "backlinks": true,
- "backlinkOpts": {
- "collapseAll": false,
- "extraContext": false,
- "sortOrder": "alphabetical",
- "showSearch": false,
- "searchQuery": "",
- "backlinkCollapsed": false,
- "unlinkedCollapsed": true
- }
- },
- "icon": "lucide-file",
- "title": "Full Notes"
+ "id": "214929be76b06d19",
+ "type": "leaf",
+ "state": {
+ "type": "markdown",
+ "state": {
+ "file": "University/Machine Learning/Full Notes.md",
+ "mode": "source",
+ "source": false,
+ "backlinks": true,
+ "backlinkOpts": {
+ "collapseAll": false,
+ "extraContext": false,
+ "sortOrder": "alphabetical",
+ "showSearch": false,
+ "searchQuery": "",
+ "backlinkCollapsed": false,
+ "unlinkedCollapsed": true
}
- }
- ]
+ },
+ "icon": "lucide-file",
+ "title": "Full Notes"
+ }
}
- ],
- "direction": "horizontal"
+ ]
}
],
"direction": "vertical"
"pdf-plus:PDF++: Toggle auto-paste": false
}
},
- "active": "e2e550886a75d1d2",
+ "active": "214929be76b06d19",
"lastOpenFiles": [
- "Pasted image 20251103163149.png",
- "Pasted image 20251103162442.png",
- "Pasted image 20251103161635.png",
- "Pasted image 20251103161604.png",
- "Pasted image 20251103161333.png",
- "Pasted image 20251103161144.png",
- "Pasted image 20251103161028.png",
- "Pasted image 20251103160756.png",
- "Pasted image 20251102180335.png",
- "Pasted image 20251102180326.png",
- "Pasted image 20251102175852.png",
- "Untitled 1.md",
+ "Pasted image 20251104172129.png",
+ "Pasted image 20251104172116.png",
+ "Pasted image 20251104172012.png",
+ "Pasted image 20251104172000.png",
+ "Pasted image 20251104171952.png",
+ "Pasted image 20251104171659.png",
+ "Pasted image 20251104171550.png",
+ "Pasted image 20251104171524.png",
+ "Pasted image 20251104171324.png",
+ "Pasted image 20251104171316.png",
"University/Machine Learning/Full Notes.md",
+ "Pasted image 20251104170706.png",
+ "Untitled 1.md",
"some_ideas.md",
"University/Machine Learning",
"Physics/Just some questions.md",
(nvm it doesn't seem to be reading the next slide, maybe figure this out TODO)
we just estimate $p(x_i|y)$ per feature and multiply them
-$p(x|y)=p(x_1,x_2,x_3,x_4,...,x_d|y)=\prod$
+$$\begin{align}
+p(x|y)&=p(x_1,x_2,x_3,x_4,\dots,x_d|y)=\prod_{i=1}^d p(x_i|y)\\
+&=p(x_1|y)\,p(x_2|y)\cdots p(x_d|y)\end{align}$$
+(there is no curse of dimensionality, since each factor is only a 1-D density)
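A minimal sketch of the product rule above in Python. The choice of a 1-D Gaussian for each $p(x_i|y)$ is an assumption made only for illustration (the notes leave the per-feature model open), and the function names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_feature_gaussians(samples):
    """Estimate one (mean, variance) pair per feature from samples of a single class."""
    n = len(samples)
    d = len(samples[0])
    params = []
    for i in range(d):
        column = [s[i] for s in samples]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        params.append((mean, max(var, 1e-9)))  # floor avoids division by zero
    return params

def class_conditional(x, params):
    """p(x|y) as the product of the per-feature densities p(x_i|y)."""
    p = 1.0
    for x_i, (mean, var) in zip(x, params):
        p *= gaussian_pdf(x_i, mean, var)
    return p

# Two made-up training samples of one class, two features each.
params = fit_feature_gaussians([[0.0, 1.0], [2.0, 3.0]])
density = class_conditional([1.0, 2.0], params)
```

Each feature is fitted on its own, so the number of parameters grows linearly with $d$. For large $d$ one would normally sum log-densities instead of multiplying, to avoid numerical underflow.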
+
+### Parametric vs Non-Parametric
+But that means you still have to choose a model for $p(x_i|y)$.
+![[Pasted image 20251104161342.png]]
+![[Pasted image 20251104161351.png]]
+
+EXAMPLE:
+![[Pasted image 20251104161450.png]]
+![[Pasted image 20251104161501.png]]
+TODO: what is the $\exp$ function?
+
+![[Pasted image 20251104161540.png]]
+
+### Zero Frequency Problem
+![[Pasted image 20251104161712.png]]
+(there is also an example that has to do with email spam and this, and it seems to be not working well on Naive Bayes im ngl)
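A small sketch of the zero-frequency problem and one common remedy, add-alpha (Laplace) smoothing. Whether the slide uses exactly this fix is an assumption, and the words and counts below are made up for illustration.

```python
from collections import Counter

def smoothed_probabilities(values, vocabulary, alpha=1.0):
    """Estimate p(x_i = v | y) for every v in vocabulary with add-alpha smoothing."""
    counts = Counter(values)
    total = len(values) + alpha * len(vocabulary)
    return {v: (counts[v] + alpha) / total for v in vocabulary}

words_in_spam = ["free", "free", "hello"]   # made-up training words for one class
vocabulary = ["free", "hello", "meeting"]

# alpha = 0 is plain counting: "meeting" was never seen, so it gets probability 0
# and would zero out the whole naive Bayes product.
unsmoothed = smoothed_probabilities(words_in_spam, vocabulary, alpha=0.0)

# alpha = 1 pretends every value was seen once more than it actually was.
smoothed = smoothed_probabilities(words_in_spam, vocabulary, alpha=1.0)
```

With smoothing, an unseen value lowers the product but no longer forces it to zero.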
+
+Pros of Naive Bayes:
+- can handle high dimensional feature spaces
+- fast training time
+- can handle continuous and discrete data
+
+Cons:
+- can't deal with correlated features
+
+EXAMPLE
+![[Pasted image 20251104161913.png]]
+
+things you should be able to do:
+- explain the difference between parametric and non-parametric density estimation
+- explain Parzen, k-nearest neighbour and naive Bayes density estimation and classification in detail
+- explain the advantages and disadvantages of those methods
+- implement a knn classifier in Python
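For the last item, a minimal knn classifier sketch in plain Python. It assumes Euclidean distance and a simple majority vote. The function name and the toy data are hypothetical, not taken from the course.

```python
import math
from collections import Counter

def knn_classify(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])          # nearest first
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]

label_near_origin = knn_classify(train_x, train_y, (0.5, 0.5), k=3)
label_far_away = knn_classify(train_x, train_y, (5.5, 5.5), k=3)
```

There is no training step beyond storing the data. All the work happens at query time, which is why knn gets slow on large training sets. With two classes, an odd `k` avoids tied votes.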
+
+# Evaluation
+![[Pasted image 20251104162104.png]]
+![[Pasted image 20251104162129.png]]
+
+### answering the question of what classifier to use
+- hard if we can't visualise the data
+- we need some kind of criteria
+ - Typical answer: classification / performance error
+- test it on independent data
+- for simplicity we assume now that classification error is good enough (though other factors may be in play)
+
+![[Pasted image 20251104162353.png]]
+
+The error estimate is the average of Bernoulli random variables:
+$$\hat{\epsilon}=\frac{1}{N}\sum_{i=1}^N Z_i$$ where $Z_i$ is 0 if $x_i$ was classified correctly
+and 1 if $x_i$ was classified incorrectly
+
+Variance:
+$$\sigma^2_{\hat{\epsilon}}=Var(\hat{\epsilon}|\text{test set size } N)=\frac{\epsilon(1-\epsilon)}{N}$$
+you can also compute the standard deviation for different sample sizes and error:
+![[Pasted image 20251104162733.png]]
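The variance formula above can be evaluated directly. This sketch prints the standard deviation for a few test set sizes and true errors. The values of $N$ and $\epsilon$ are picked arbitrarily and need not match the slide.

```python
import math

def error_std(epsilon, n):
    """Standard deviation of the estimated error for true error epsilon and test size n."""
    return math.sqrt(epsilon * (1.0 - epsilon) / n)

# One row per test set size, one column per true error.
for n in (10, 100, 1000):
    print(n, [round(error_std(eps, n), 4) for eps in (0.01, 0.1, 0.5)])
```

For example, with $\epsilon=0.1$ and $N=100$ the standard deviation is $0.03$, so the measured error can easily land anywhere between roughly 7% and 13%. Halving the standard deviation needs four times as many test objects.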
+
+## training vs. test set size
+- Large training set -> good classifiers
+- large test set -> reliable, unbiased error estimate
+- In practice often just a single design set is given
+
+![[Pasted image 20251104162845.png]]
+
+## this is what is called bootstrapping
+![[Pasted image 20251104162934.png]]
+TODO: okay honestly I don't entirely get this ngl
+## k-fold cross validation
+![[Pasted image 20251104163022.png]]
+TODO: i don't understand this for the same reason
+do you also retrain the classifier?
+I guess so, same with the one above: you do this many many times, retraining each time, and then average the error estimates
+so I guess that checks out.
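A sketch of k-fold cross validation that makes the retraining explicit. The nearest-mean classifier is only a stand-in chosen to keep the example short (any classifier with a fit and a predict step would do), and all function names are hypothetical.

```python
import random

def nearest_mean_fit(xs, ys):
    """Train a nearest-mean classifier: store the mean feature vector per class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def nearest_mean_predict(model, x):
    """Assign x to the class whose mean is closest in squared Euclidean distance."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, model[y])))

def k_fold_error(xs, ys, k=5, seed=0):
    """Average test error over k folds; the classifier is retrained for every fold."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k disjoint index sets
    fold_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        model = nearest_mean_fit([xs[i] for i in train], [ys[i] for i in train])
        wrong = sum(nearest_mean_predict(model, xs[i]) != ys[i] for i in fold)
        fold_errors.append(wrong / len(fold))
    return sum(fold_errors) / k

# Toy 1-D-like data: two clusters far apart on the first feature.
xs = [(0.1 * i, 0.0) for i in range(10)] + [(10.0 + 0.1 * i, 0.0) for i in range(10)]
ys = [0] * 10 + [1] * 10
cv_error = k_fold_error(xs, ys, k=5)
loo_error = k_fold_error(xs, ys, k=len(xs))   # k = N gives leave-one-out
```

Every object is used for testing exactly once and for training in the other `k - 1` rounds. The reported number is the average over folds, not the best fold.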
+
+## leave-one-out procedure
+![[Pasted image 20251104163117.png]]
+i assume the same goes here as for the other ones
+
+## hyper-parameters
+- ML methods often have 'hyperparameters'
+- Parzen density estimator: width "h"
+- knn: number of neighbours "k"
+- decision trees: pruning method, stopping criterion
+- neural networks: architecture, learning rate
+
+- Don't optimise these numbers by looking at the test set!
+
+## double cross validation
+
+![[Pasted image 20251104163416.png]]
+we going crazy now, and you can apparently use this to optimise the hyperparameters
+
+![[Pasted image 20251104163518.png]]
+
+
+## apparent classification error
+![[Pasted image 20251104165636.png]]
+
+## learning curves
+- curves that plot (estimated) classification errors against the number of samples in the training set
+- usually plot error both on training and on test set
+- gives insight into:
+ - amount of overtraining
+ - usefulness of additional data
+ - allows comparison between classifiers
+ - stability of training
+
+There is no single best classifier
+![[Pasted image 20251104165837.png]]
+![[Pasted image 20251104165911.png]]
+![[Pasted image 20251104165918.png]]
+
+- larger training sets yield better classifiers (wow really)
+- independent test sets needed for unbiased error estimates
+- larger test sets yield more accurate error estimates
+- LOO cross validation "optimal" but may be infeasible
+- 10-fold cross validation is often used
+- more complex classifiers need larger training sets
+ - as well as larger feature sets
+- small training sets need simpler classifiers or smaller feature sets
+
+## squared error:
+imagine you have the following error:
+$$E[||g(x)-y||^2]$$
+you can derive something more general
+
+## bias-variance dilemma
+- when we are given some data we may get lucky, or unlucky:
+ - sometimes we get very atypical data
+- to say something general we need to average over different (training) sets
+
+the classifier is now also a function of the training set:
+$$D = \{(y_i,x_i); i=1,...,N\}$$
+$$g(x;D)$$
+![[Pasted image 20251104170328.png]]
+![[Pasted image 20251104170353.png]]
+![[Pasted image 20251104170414.png]]
+
+variance: how much does classifier g vary over different training sets
+bias: how much does the average classifier g differ from the true output
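Written out for a single fixed input $x$ with true output $y$, the two terms add up to the expected squared error over training sets $D$. This is the standard decomposition, stated here for a scalar output and assuming $y$ itself is noise-free; the slides may use a slightly different notation.

$$E_D\left[(g(x;D)-y)^2\right]=\underbrace{\left(E_D[g(x;D)]-y\right)^2}_{\text{bias}^2}+\underbrace{E_D\left[\left(g(x;D)-E_D[g(x;D)]\right)^2\right]}_{\text{variance}}$$

The cross term vanishes because $E_D\left[g(x;D)-E_D[g(x;D)]\right]=0$.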
+
+![[Pasted image 20251104170504.png]]
+
+this was originally derived for neural networks and squared error
+general phenomenon though: we encounter it often in pattern recognition
+
+a simpler classifier is more stable (and needs less data)
+a more complex classifier only works when you have sufficient training data
+
+## feature curve
+![[Pasted image 20251104170656.png]]
+![[Pasted image 20251104170706.png]]
+
+there is a fundamental tradeoff between the errors (performances) on the two classes
+
+Standard Classification Error: $$\epsilon=\epsilon_1p(y_1)+\epsilon_2p(y_2)$$
+Weighted Classification Error: $$\epsilon=\lambda_{12}\epsilon_1p(y_1)+\lambda_{21}\epsilon_2p(y_2)$$
+F1-Score (harmonic Mean): $$F_1=2\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}$$
+## types of error and performance measures
+
+Error: Probability of Erroneous Classifications
+Performance / Accuracy: 1 - error
+Sensitivity of a target class [e.g. diseased patients]: performance for objects from that target class
+Specificity: performance for all objects outside target class
+Precision of a target class: fraction of correct objects among all objects assigned to that class
+Recall: fraction of correctly classified objects; identical to sensitivity when related to a particular class
+True positive rate: identical to sensitivity
+False Positive Rate: error for all objects outside target
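The measures above can all be computed from the four counts of a two-class confusion matrix. A small sketch follows. The argument names `tp`, `fn`, `fp`, `tn` (true/false positives/negatives for the target class) are introduced here for illustration, and the counts are made up.

```python
def binary_metrics(tp, fn, fp, tn):
    """Performance measures for the target (positive) class from confusion counts."""
    sensitivity = tp / (tp + fn)            # recall, true positive rate
    specificity = tn / (tn + fp)            # performance outside the target class
    precision = tp / (tp + fp)              # correct among all assigned to target
    false_positive_rate = fp / (fp + tn)    # equals 1 - specificity
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    error = (fn + fp) / (tp + fn + fp + tn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "false_positive_rate": false_positive_rate,
        "f1": f1,
        "error": error,
        "accuracy": 1.0 - error,
    }

# Made-up counts: 50 target objects (40 found), 50 others (20 wrongly accepted).
metrics = binary_metrics(tp=40, fn=10, fp=20, tn=30)
```

Here sensitivity is 0.8 but precision is only 40/60, which shows why a single overall error of 0.3 hides how the mistakes are spread over the classes.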
+
+## confusion matrices
+Provides counts of class-dependent errors: how many objects have been classified as A that should have been B?
+- give a more detailed view than overall error
+- can be used to estimate overall cost for classifier
+
+
+![[Pasted image 20251104171316.png]]
+![[Pasted image 20251104171324.png]]
+
+## ROC Analysis (receiver operating characteristic)
+![[Pasted image 20251104171524.png]]
+![[Pasted image 20251104171550.png]]
+TODO: what?
+
+### area under ROC curve: AUC
+![[Pasted image 20251104171659.png]]
+
+### how to interpret ROC and AUC:
+- each point on the ROC curve represents a specific classification threshold (ok that is cool but what is that TODO)
+- A classifier that randomly guesses produces a curve along the diagonal line (from bottom-left to top-right) - ok that checks out
+- A classifier that perfectly separates will reach the top left corner (true positive rate =1 and false positive rate = 0): AUC = 1.0
+- so the closer the ROC curve is to the top-left corner the better the classifier is at distinguishing between the two classes
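A sketch of how the points on an ROC curve come about. The threshold is the cut-off on the classifier's output score (for example a posterior probability) at or above which an object is assigned to the target class. Sweeping it from high to low trades false positives for true positives. The scores and labels below are made up and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Sweep the decision threshold over the scores and collect (FPR, TPR) pairs.

    labels are 1 for the target class and 0 otherwise; an object is assigned to
    the target class when its score is at or above the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]   # threshold above every score: nothing is accepted
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up classifier outputs; one target object (0.6) scores below a non-target (0.7).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)

# Perfectly separated scores reach the top-left corner.
perfect_area = auc(roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```

For the made-up scores the area is 8/9, which matches the fraction of (target, non-target) pairs in which the target object gets the higher score.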
+
+is the threshold like how many things we give it access to or something? TODO (seems to be, something like that)
+
+![[Pasted image 20251104171952.png]]
+![[Pasted image 20251104172012.png]]
+![[Pasted image 20251104172116.png]]
+
+ok this checks out more and more
+
+![[Pasted image 20251104172129.png]]
+
+conclusions:
+- there is no best classifier
+- there are alternative principles to find a good classifier
+ - maximising the likelihood
+ - minimising the classification error
+ - minimising the mean squared error
+
+- there is a fundamental tradeoff between the bias and the variance of a classifier (depending on how flexible / complex the classifier is)
+- finding the correct regulariser is a 'black art' of ML