Machine Learning is a subset of AI where systems learn from data to make decisions without being explicitly programmed.
Traditional Programming: Rules + Data → Output
Machine Learning: Data + Output → Rules (Model)
Types of ML:
1. Supervised: learn from labeled data (regression, classification)
2. Unsupervised: find patterns in unlabeled data (clustering, dimensionality reduction)
3. Reinforcement: an agent learns by trial and error from rewards
Complete Workflow:
1. Collect data
2. Preprocess (handle missing values, scale features, encode categories)
3. Split into train/test sets
4. Train the model
5. Evaluate with metrics
6. Tune hyperparameters and deploy
import pandas as pd
from sklearn.impute import SimpleImputer
# Remove rows with missing data (dropna returns a copy, so reassign)
df = df.dropna(subset=['salary'])
# Fill with mean (numerical)
df['age'] = df['age'].fillna(df['age'].mean())
# Imputer
imputer = SimpleImputer(strategy='mean') # or 'median', 'most_frequent'
X = imputer.fit_transform(X)
Why? Algorithms like KNN, SVM, and Neural Networks are sensitive to feature scale.
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# Standardization: (x - mean) / std → mean=0, std=1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
# Normalization: (x - min) / (max - min) → [0, 1]
scaler = MinMaxScaler()
# Robust (outlier-resistant): (x - median) / IQR
scaler = RobustScaler()
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Label Encoding (ordinal: low < medium < high)
le = LabelEncoder()
df['grade'] = le.fit_transform(df['grade']) # A=0, B=1, C=2
# One-Hot Encoding (nominal: no order)
pd.get_dummies(df, columns=['city'])
# Delhi → [1,0,0], Mumbai → [0,1,0], Pune → [0,0,1]
Predict continuous values. Best-fit line minimizing MSE.
y = w₀ + w₁x₁ + w₂x₂ + ... + wₙxₙ
Cost Function: MSE = (1/n)Σ(y_pred - y_actual)²
Gradient Descent Update:
w = w - α * ∂(MSE)/∂w (α = learning rate)
Metrics: MAE, MSE, RMSE, R² (coefficient of determination)
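The gradient-descent update above can be sketched in plain NumPy. The toy data here (y = 2x + 1) is hypothetical, chosen so convergence is easy to verify:

```python
import numpy as np

# Hypothetical toy data: y = 2x + 1 exactly
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * X + 1.0

w, b = 0.0, 0.0   # start weights at zero
alpha = 0.05      # learning rate α
n = len(X)

for _ in range(2000):
    y_pred = w * X + b
    # Gradients of MSE = (1/n)Σ(y_pred - y)²
    dw = (2 / n) * np.sum((y_pred - y) * X)
    db = (2 / n) * np.sum(y_pred - y)
    w -= alpha * dw   # w = w - α * ∂(MSE)/∂w
    b -= alpha * db
```

After enough iterations, w and b approach the true slope 2 and intercept 1. (sklearn's `LinearRegression` solves the same problem in closed form.)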
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Predict binary classes (0 or 1). Uses sigmoid function.
σ(z) = 1 / (1 + e^(-z)) Output: probability [0, 1]
Cost: Binary Cross-Entropy (Log Loss)
Decision boundary: P(y=1) > 0.5 → class 1
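A minimal sklearn sketch of logistic regression; the 1-D data (class 1 when x ≥ 5) is a made-up example for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: class 1 when x >= 5
X = np.arange(10).reshape(-1, 1)
y = (X.ravel() >= 5).astype(int)

model = LogisticRegression()
model.fit(X, y)

# predict_proba gives the sigmoid output P(y=1) in [0, 1];
# predict applies the 0.5 decision boundary
proba = model.predict_proba([[9]])[0, 1]
pred = model.predict([[9]])[0]
```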
Tree-like model of decisions. Splits data on feature that maximizes information gain.
Information Gain = Entropy(parent) - Σ(weighted Entropy(child))
Entropy(S) = -Σ p_i * log₂(p_i)
Gini Impurity = 1 - Σ p_i² (CART algorithm uses this)
Hyperparameters: max_depth (limits tree depth), min_samples_split, min_samples_leaf, criterion ('gini' or 'entropy')
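A quick sketch fitting a decision tree; the built-in iris dataset is assumed here purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion='gini' is the CART default; max_depth caps tree depth
# to limit overfitting
tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
tree.fit(X, y)
train_acc = tree.score(X, y)
```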
Ensemble of Decision Trees using Bagging (Bootstrap Aggregating).
1. Sample n subsets (with replacement) from training data
2. Train one Decision Tree on each subset
(with random subset of features at each split)
3. Predict by majority vote (classification) or average (regression)
Why better? Reduces variance (overfitting) by averaging uncorrelated trees.
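The bagging recipe above maps directly onto sklearn's `RandomForestClassifier`; the breast-cancer dataset and train/test split below are assumed for the sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators = number of bootstrapped trees;
# max_features='sqrt' = random feature subset at each split
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                            random_state=42)
rf.fit(X_train, y_train)
test_acc = rf.score(X_test, y_test)   # majority vote over trees
```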
Find hyperplane that maximally separates classes.
Margin = 2 / ||w|| (maximize margin)
Support Vectors = points closest to hyperplane
Kernel Trick (for non-linear data):
- Linear Kernel: K(x,y) = x·y
- RBF/Gaussian: K(x,y) = exp(-γ||x-y||²)
- Polynomial: K(x,y) = (x·y + c)^d
C (regularization): Low C = soft margin (more misclassification allowed)
High C = hard margin (tries to classify all points correctly, risking overfitting)
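A sketch of an SVM with the RBF kernel; the two-moons toy dataset is a hypothetical non-linear example where a linear kernel would fail:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Hypothetical non-linearly-separable data
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# C trades off margin width vs misclassification;
# gamma corresponds to the γ in the RBF kernel formula
clf = SVC(kernel='rbf', C=1.0, gamma='scale')
clf.fit(X, y)
n_support = len(clf.support_vectors_)  # points defining the margin
acc = clf.score(X, y)
```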
Classify based on k nearest data points (majority vote).
Distance: Euclidean = √Σ(xᵢ - yᵢ)²
Manhattan = Σ|xᵢ - yᵢ|
Small k → complex boundary (overfitting)
Large k → smooth boundary (underfitting)
Optimal k → use cross-validation (typically odd, √n)
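Choosing k by cross-validation can be sketched like this, trying a few odd values on the iris dataset (assumed for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try odd k values and keep the one with the best CV accuracy
best_k, best_score = None, 0.0
for k in [1, 3, 5, 7, 9]:
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                            X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score
```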
Algorithm:
1. Initialize k centroids randomly
2. Assign each point to nearest centroid
3. Recompute centroids as mean of cluster
4. Repeat until convergence
Choosing k: Elbow Method (plot inertia vs k, find elbow)
Limitation: Sensitive to outliers, assumes spherical clusters
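The elbow method above can be sketched by running K-Means for several k and recording inertia; `make_blobs` generates hypothetical spherical clusters where the true k is 3:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data with 3 spherical clusters
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Inertia = within-cluster sum of squared distances;
# plot these against k and look for the "elbow" (here at k=3)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)]
```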
Dimensionality Reduction — transform data to fewer dimensions while retaining variance.
1. Standardize data
2. Compute covariance matrix
3. Find eigenvalues + eigenvectors
4. Select top k eigenvectors (principal components)
5. Project data onto new space
Explained variance ratio: how much variance each PC captures
Choose k where cumulative variance > 95%
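The five PCA steps above collapse into a few sklearn calls; the iris dataset is assumed for the sketch:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # step 1: standardize

pca = PCA(n_components=2)                   # keep top 2 components
X_pca = pca.fit_transform(X_std)            # steps 2-5 internally

# Variance each PC captures, and the cumulative total
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)
```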
Input Layer → Hidden Layers → Output Layer
Each neuron: z = Σ(wᵢxᵢ) + b, output = activation(z)
| Function | Formula | Use Case |
|---|---|---|
| Sigmoid | 1/(1+e^-z) | Binary output |
| Tanh | (e^z - e^-z)/(e^z + e^-z) | Hidden layers |
| ReLU | max(0, z) | Hidden layers (most common) |
| Leaky ReLU | max(0.01z, z) | Fixes dead neurons |
| Softmax | e^zᵢ/Σe^zⱼ | Multi-class output |
Forward Pass: Compute prediction (y_hat)
Loss: L = -(y*log(y_hat) + (1-y)*log(1-y_hat))
Backward Pass: Compute gradients ∂L/∂w using chain rule
Update: w = w - α * ∂L/∂w
L1 Regularization: Loss + λΣ|w| → Sparse weights, feature selection
L2 Regularization: Loss + λΣw² → Small weights, prevents large weights
Dropout: Randomly zero out neurons during training (p=0.5 typical)
Batch Normalization: Normalize layer inputs for stable training
Early Stopping: Monitor validation loss, stop when it starts increasing
Confusion Matrix:

                     Predicted Positive   Predicted Negative
Actual Positive             TP                   FN
Actual Negative             FP                   TN
Accuracy = (TP + TN) / Total
Precision = TP / (TP + FP) (out of predicted positive, how many actually positive)
Recall = TP / (TP + FN) (out of actual positive, how many detected)
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
When to use what?
- Accuracy: classes are balanced
- Precision: false positives are costly (e.g. spam filtering)
- Recall: false negatives are costly (e.g. disease screening)
- F1: imbalanced classes, need a balance of precision and recall
ROC-AUC: Area Under ROC Curve. AUC=1 perfect, AUC=0.5 random.
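The formulas above can be checked with `sklearn.metrics` on small made-up label vectors (TP=3, FN=1, FP=1, TN=3):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

# sklearn's confusion_matrix is laid out [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

acc = accuracy_score(y_true, y_pred)     # (TP+TN)/Total = 6/8 = 0.75
prec = precision_score(y_true, y_pred)   # TP/(TP+FP) = 3/4 = 0.75
rec = recall_score(y_true, y_pred)       # TP/(TP+FN) = 3/4 = 0.75
f1 = f1_score(y_true, y_pred)            # harmonic mean = 0.75
```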
MAE = (1/n)Σ|y - ŷ| (Mean Absolute Error)
MSE = (1/n)Σ(y - ŷ)² (Mean Squared Error)
RMSE = √MSE (Root MSE, same unit as y)
R² = 1 - SS_res/SS_tot (1 = perfect, 0 = mean only, can be negative)
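The regression metrics can likewise be verified on a tiny hypothetical example (errors of 0.5, 0, 0.5, 0):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

mae = mean_absolute_error(y_true, y_pred)   # (0.5+0+0.5+0)/4 = 0.25
mse = mean_squared_error(y_true, y_pred)    # (0.25+0+0.25+0)/4 = 0.125
rmse = np.sqrt(mse)                         # same unit as y
r2 = r2_score(y_true, y_pred)               # 1 - SS_res/SS_tot = 0.975
```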
Complete ML notes: supervised/unsupervised learning, regression, classification, SVM, decision trees, neural networks, evaluation metrics, and interview questions for B.Tech CS Sem 6.
72 pages · 3.5 MB · Updated 2026-03-11
Supervised learning uses labeled data (input + correct output); the model learns to predict the output. Unsupervised learning finds patterns in unlabeled data (clustering, dimensionality reduction).
A model that performs very well on training data but poorly on test data is overfitting. To prevent it: regularization (L1/L2), dropout, cross-validation, more training data, or a simpler model.
High bias = underfitting (model too simple). High variance = overfitting (model too complex). Aim for a balance between the two. Bagging reduces variance; boosting reduces bias.
A Random Forest is an ensemble of many Decision Trees; it reduces overfitting through bagging plus random feature selection. A single Decision Tree overfits easily.
Gradient Descent is an optimization algorithm that updates model parameters (weights) to minimize the loss function, taking small steps in the direction of the negative gradient of the loss.