Data Science pipeline:
Data Collection → Cleaning → EDA → Feature Engineering
→ Model Training → Evaluation → Deployment
Setup:
pip install numpy pandas matplotlib seaborn scikit-learn jupyter
jupyter notebook # or jupyter lab
import numpy as np
# Array creation
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1,2,3],[4,5,6],[7,8,9]])
zeros = np.zeros((3, 4)) # 3×4 zeros
ones = np.ones((2, 3))
identity = np.eye(4) # 4×4 identity matrix
range_arr = np.arange(0, 10, 2) # [0,2,4,6,8]
linspace = np.linspace(0, 1, 5) # [0, .25, .5, .75, 1]
random_arr = np.random.rand(3, 3) # uniform [0,1)
normal_arr = np.random.randn(100) # standard normal
# Shape operations
print(matrix.shape) # (3, 3)
print(matrix.dtype) # int64
flat = matrix.flatten() # 1D
reshaped = arr.reshape(5, 1) # column vector
# Indexing & Slicing
print(matrix[1, 2]) # 6 (row 1, col 2)
print(matrix[:, 1]) # [2,5,8] (all rows, col 1)
print(matrix[0:2, 1:3]) # [[2,3],[5,6]] (submatrix)
bool_idx = arr[arr > 3] # [4, 5] (boolean indexing)
# Math operations (vectorized — no loops needed!)
a = np.array([1,2,3,4])
b = np.array([10,20,30,40])
print(a + b) # [11,22,33,44]
print(a * b) # element-wise: [10,40,90,160]
print(np.dot(a, b)) # dot product: 300
print(np.sqrt(a)) # [1, 1.41, 1.73, 2]
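Vectorized math also works between arrays of different shapes via broadcasting: NumPy virtually stretches the smaller shape to match the larger one. A minimal sketch with made-up values:

```python
import numpy as np

# Broadcasting: a (3,1) column combines with a (3,) row
# to produce a full (3,3) grid — no explicit loops.
col = np.array([[1], [2], [3]])   # shape (3, 1)
row = np.array([10, 20, 30])      # shape (3,)
grid = col + row                  # shape (3, 3)
print(grid)
# [[11 21 31]
#  [12 22 32]
#  [13 23 33]]
```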
# Statistics
marks = np.array([85, 90, 78, 92, 88, 76, 95, 82])
print(np.mean(marks)) # 85.75
print(np.std(marks)) # 6.30 (population std)
print(np.percentile(marks, [25, 50, 75])) # quartiles
print(np.argmax(marks)) # index of max = 6
# Matrix operations
A = np.array([[1,2],[3,4]])
B = np.array([[5,6],[7,8]])
print(np.dot(A, B)) # matrix multiply
print(np.linalg.det(A)) # determinant
print(np.linalg.inv(A)) # inverse
eigenvalues, eigenvectors = np.linalg.eig(A)
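For solving a linear system Ax = b, `np.linalg.solve` is preferable to computing the inverse explicitly — it is numerically more stable. A short sketch reusing the matrix A above:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
b = np.array([5, 11])
# Solves A @ x = b directly, without forming inv(A).
x = np.linalg.solve(A, b)
print(x)                       # [1. 2.]  (1*1 + 2*2 = 5, 3*1 + 4*2 = 11)
print(np.allclose(A @ x, b))   # True
```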
import pandas as pd
# Create DataFrame
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [22, 25, 23, 24, 22],
'Score': [88, 92, 78, 95, 85],
'Subject': ['CS', 'IT', 'CS', 'ECE', 'IT']
})
# Basic info
print(df.shape) # (5, 4)
print(df.dtypes) # column types
print(df.describe()) # statistics for numeric cols
print(df.info()) # non-null counts, dtypes
# Selection
print(df['Name']) # Series (one column)
print(df[['Name', 'Score']]) # DataFrame (multiple)
print(df.iloc[0]) # first row by position
print(df.loc[df['Score'] > 85]) # conditional filter
print(df.query("Subject == 'CS' and Score >= 80"))
# Missing values
df_messy = df.copy()
df_messy.loc[1, 'Score'] = np.nan
print(df_messy.isnull().sum()) # count nulls per col
df_filled = df_messy.fillna({'Score': df_messy['Score'].mean()}) # impute Score with its mean
df_dropped = df_messy.dropna() # remove null rows
# Group operations
subject_stats = df.groupby('Subject').agg({
'Score': ['mean', 'max', 'count'],
'Age': 'mean'
})
print(subject_stats)
# Sort
df_sorted = df.sort_values('Score', ascending=False)
# Apply custom function
df['Grade'] = df['Score'].apply(lambda x: 'A' if x>=90 else 'B' if x>=80 else 'C')
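The chained-lambda grading above can also be done with `pd.cut`, which bins a numeric column in one vectorized call. A sketch using the same Score values:

```python
import pandas as pd

df = pd.DataFrame({'Score': [88, 92, 78, 95, 85]})
# Left-inclusive bins: [0,80) -> C, [80,90) -> B, [90,100) -> A,
# matching the lambda rule (>=90 A, >=80 B, else C).
df['Grade'] = pd.cut(df['Score'],
                     bins=[0, 80, 90, 100],
                     labels=['C', 'B', 'A'],
                     right=False)
print(df)
```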
# Merge two DataFrames
courses = pd.DataFrame({'Subject':['CS','IT','ECE'], 'Dept':['Science','Tech','Engg']})
merged = df.merge(courses, on='Subject', how='left')
# Pivot table
pivot = df.pivot_table(values='Score', index='Subject', aggfunc=['mean','count'])
# Read/Write
df.to_csv('students.csv', index=False)
df_read = pd.read_csv('students.csv')
df.to_excel('students.xlsx', index=False)
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
# Figure with subplots
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
# 1. Histogram — distribution
axes[0,0].hist(df['Score'], bins=5, color='steelblue', edgecolor='white')
axes[0,0].set_title('Score Distribution')
axes[0,0].set_xlabel('Score'); axes[0,0].set_ylabel('Count')
# 2. Box plot — outliers and quartiles
axes[0,1].boxplot([df[df.Subject==s]['Score'] for s in df.Subject.unique()],
labels=df.Subject.unique())
axes[0,1].set_title('Score by Subject')
# 3. Scatter plot — correlation
axes[0,2].scatter(df['Age'], df['Score'], c='coral', s=100, alpha=0.7)
axes[0,2].set_title('Age vs Score')
# 4. Bar chart — comparison
subj_mean = df.groupby('Subject')['Score'].mean()
axes[1,0].bar(subj_mean.index, subj_mean.values, color=['#3b82f6','#10b981','#f59e0b'])
axes[1,0].set_title('Average Score by Subject')
# 5. Seaborn heatmap — correlation matrix
data_numeric = df[['Age', 'Score']].assign(
CS=(df.Subject=='CS').astype(int))
sns.heatmap(data_numeric.corr(), annot=True, cmap='coolwarm', ax=axes[1,1])
axes[1,1].set_title('Correlation Matrix')
# 6. Seaborn pair plot for full EDA
# sns.pairplot(df, hue='Subject') # in separate cell
plt.tight_layout()
plt.savefig('eda_plots.png', dpi=150, bbox_inches='tight')
plt.show()
from sklearn.preprocessing import (StandardScaler, MinMaxScaler,
LabelEncoder, OneHotEncoder)
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
# Load data
df = pd.read_csv('student_performance.csv')
# 1. Handle missing values
imputer = SimpleImputer(strategy='median') # or 'mean', 'most_frequent'
df[['age', 'study_hours']] = imputer.fit_transform(df[['age', 'study_hours']])
# 2. Encode categorical variables
# Label encoding (ordinal: fail < pass < distinction)
le = LabelEncoder()
df['grade_code'] = le.fit_transform(df['grade'])
# One-hot encoding (nominal: subject, gender)
df = pd.get_dummies(df, columns=['subject', 'gender'], drop_first=True)
# 3. Feature scaling
X = df.drop('final_score', axis=1)
y = (df['final_score'] >= 60).astype(int)  # pass/fail label — the classifier below needs a categorical target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit on train only!
X_test_scaled = scaler.transform(X_test) # transform only
# 4. Outlier detection
from scipy import stats
z_scores = np.abs(stats.zscore(df.select_dtypes(include=np.number)))
df_clean = df[(z_scores < 3).all(axis=1)] # remove rows with |z|>3
print(f"Removed {len(df)-len(df_clean)} outlier rows")
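Besides z-scores, the IQR rule is a common alternative that is robust to skewed data: it flags points outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A minimal sketch with a toy series (the value 200 is a planted outlier):

```python
import pandas as pd

s = pd.Series([85, 90, 78, 92, 88, 76, 95, 82, 200])  # 200 is an outlier
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
# Keep only points within the IQR fences.
mask = s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(s[mask].tolist())  # the 200 row is dropped
```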
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, GridSearchCV
import joblib
# Build pipeline
pipeline = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(random_state=42))
])
# Train
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
# Evaluate
print(classification_report(y_test, y_pred))
# Cross-validation
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1_weighted')
print(f"CV F1: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# Hyperparameter tuning
param_grid = {
'classifier__n_estimators': [50, 100, 200],
'classifier__max_depth': [None, 5, 10],
'classifier__min_samples_leaf': [1, 2, 5]
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_weighted', n_jobs=-1)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
# Feature importance
rf_model = grid_search.best_estimator_.named_steps['classifier']
importance_df = pd.DataFrame({
'feature': X.columns,
'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
print(importance_df.head(10))
# Save model
joblib.dump(grid_search.best_estimator_, 'student_model.pkl')
# Load and predict
model = joblib.load('student_model.pkl')
new_student = pd.DataFrame({'age':[20], 'study_hours':[5], ...})
prediction = model.predict(new_student)
Q: What is data leakage in ML? A: Information from the test data leaks into training, producing falsely high accuracy. Classic mistake: fitting the scaler on the whole dataset and THEN splitting. Correct: split FIRST, then fit the scaler only on the training set. Using future information about the target variable is also leakage.
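The split-first rule can be shown in a few lines; a sketch on random toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.random((100, 3))
y = rng.integers(0, 2, 100)

# WRONG (leakage): StandardScaler().fit(X) before splitting —
# the scaler's mean/std would include test rows.

# RIGHT: split first, fit the scaler on the training fold only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)   # statistics from train only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # reuse train statistics
print(X_train_s.shape, X_test_s.shape)   # (80, 3) (20, 3)
```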
Q: Why are imbalanced datasets a problem, and how do you handle them? A: The majority class dominates training, so the model predicts the minority class (fraud, disease) poorly. Solutions: SMOTE (synthetic minority oversampling), undersampling the majority class, class_weight='balanced', threshold tuning, and evaluating with F1/AUC metrics (not accuracy).
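The `class_weight='balanced'` option reweights the loss by inverse class frequency, so minority-class errors cost more. A sketch on synthetic 90/10 data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced toy data: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1],
                           random_state=42)
# 'balanced' gives each class weight n_samples / (n_classes * count),
# penalizing mistakes on the rare class more heavily.
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X, y)
print(clf.score(X, y))
```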
Q: What is the difference between correlation and causation? A: Correlation: two variables rise or fall together (a statistical relationship). Causation: one variable actually causes the other. "Ice cream sales and drownings are correlated" — summer causes both; ice cream does not cause drowning. ML exploits correlation; causation is established through controlled experiments.
Complete Python Data Science notes for BCA Sem 6 — NumPy, Pandas, Matplotlib, Seaborn, Scikit-learn, Data preprocessing, EDA, ML models with Jupyter notebook examples.
46 pages · 2.3 MB · Updated 2026-03-11
Rich ecosystem: NumPy, Pandas, Matplotlib, Scikit-learn, and TensorFlow are all available in Python. Easy syntax and rapid prototyping. Jupyter notebooks are perfect for interactive analysis. Industry standard — over 80% of data scientists use Python.
NumPy: numerical computing, N-dimensional arrays, fast mathematical operations. Pandas: tabular data (DataFrame), data cleaning, missing values, groupby, merge — spreadsheet-like. In data science the two are used together.
Detect with df.isnull().sum(). Strategies: dropna() (delete rows/columns), fillna(mean/median/mode) (impute), forward/backward fill, interpolate(). The right approach depends on the data type and the percentage of values missing.
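The strategies above, side by side on a tiny toy series:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])
print(s.isnull().sum())              # 2 missing values
print(s.fillna(s.mean()).tolist())   # impute with mean (30.0)
print(s.ffill().tolist())            # forward fill: NaN copies previous value
print(s.interpolate().tolist())      # linear: [10, 20, 30, 40, 50]
```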
Distance-based algorithms (KNN, SVM, KMeans) and gradient-descent models (neural nets) are sensitive to feature scale. Use standardization (mean=0, std=1) or Min-Max scaling (0-1). Tree-based models (Random Forest) do not need scaling.
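The two scalers compared on a toy column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
std = StandardScaler().fit_transform(X)   # rescaled to mean 0, std 1
mm = MinMaxScaler().fit_transform(X)      # squashed into [0, 1]
print(std.mean(), std.std())  # ~0.0, 1.0
print(mm.min(), mm.max())     # 0.0 1.0
```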
Exploratory Data Analysis — understanding the data before modeling. Check shape, dtypes, missing values, distributions, correlations, and outliers. Visualization: histograms, box plots, scatter plots, heatmaps. Insights → better feature engineering.
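That checklist condenses to a minimal first-pass script; a sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'Age': [22, 25, 23, 24, 22],
                   'Score': [88, 92, 78, 95, 85]})
# First pass: shape, types, nulls, summary stats, correlations.
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())
print(df.describe())
print(df.corr(numeric_only=True))
```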