Data Science pipeline:
Data Collection → Cleaning → EDA → Feature Engineering
→ Model Training → Evaluation → Deployment
Setup:
pip install numpy pandas matplotlib seaborn scikit-learn jupyter
jupyter notebook # or jupyter lab
import numpy as np
# Array creation
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1,2,3],[4,5,6],[7,8,9]])
zeros = np.zeros((3, 4)) # 3×4 zeros
ones = np.ones((2, 3))
identity = np.eye(4) # 4×4 identity matrix
range_arr = np.arange(0, 10, 2) # [0,2,4,6,8]
linspace = np.linspace(0, 1, 5) # [0, .25, .5, .75, 1]
random_arr = np.random.rand(3, 3) # uniform [0,1)
normal_arr = np.random.randn(100) # standard normal
# Shape operations
print(matrix.shape) # (3, 3)
print(matrix.dtype) # int64
flat = matrix.flatten() # 1D
reshaped = arr.reshape(5, 1) # column vector
# Indexing & Slicing
print(matrix[1, 2]) # 6 (row 1, col 2)
print(matrix[:, 1]) # [2,5,8] (all rows, col 1)
print(matrix[0:2, 1:3]) # [[2,3],[5,6]] (submatrix)
bool_idx = arr[arr > 3] # [4, 5] (boolean indexing)
# Math operations (vectorized — no loops needed!)
a = np.array([1,2,3,4])
b = np.array([10,20,30,40])
print(a + b) # [11,22,33,44]
print(a * b) # element-wise: [10,40,90,160]
print(np.dot(a, b)) # dot product: 300
print(np.sqrt(a)) # [1, 1.41, 1.73, 2]
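Vectorized math also works between arrays of different shapes via broadcasting: NumPy virtually stretches the smaller shape to match the larger one. A minimal sketch with made-up values:

```python
import numpy as np

# Broadcasting: a (3,1) column combines with a (3,) row
# to produce a full (3,3) grid — no explicit loops.
col = np.array([[1], [2], [3]])   # shape (3, 1)
row = np.array([10, 20, 30])      # shape (3,)
grid = col + row                  # shape (3, 3)
print(grid)
# [[11 21 31]
#  [12 22 32]
#  [13 23 33]]
```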
# Statistics
marks = np.array([85, 90, 78, 92, 88, 76, 95, 82])
print(np.mean(marks)) # 85.75
print(np.std(marks)) # 6.30 (population std)
print(np.percentile(marks, [25, 50, 75])) # quartiles
print(np.argmax(marks)) # index of max = 6
# Matrix operations
A = np.array([[1,2],[3,4]])
B = np.array([[5,6],[7,8]])
print(np.dot(A, B)) # matrix multiply
print(np.linalg.det(A)) # determinant
print(np.linalg.inv(A)) # inverse
eigenvalues, eigenvectors = np.linalg.eig(A)
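For solving a linear system Ax = b, `np.linalg.solve` is preferable to computing the inverse explicitly — it is numerically more stable. A short sketch reusing the matrix A above:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
b = np.array([5, 11])
# Solves A @ x = b directly, without forming inv(A).
x = np.linalg.solve(A, b)
print(x)                       # [1. 2.]  (1*1 + 2*2 = 5, 3*1 + 4*2 = 11)
print(np.allclose(A @ x, b))   # True
```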
import pandas as pd
# Create DataFrame
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [22, 25, 23, 24, 22],
'Score': [88, 92, 78, 95, 85],
'Subject': ['CS', 'IT', 'CS', 'ECE', 'IT']
})
# Basic info
print(df.shape) # (5, 4)
print(df.dtypes) # column types
print(df.describe()) # statistics for numeric cols
print(df.info()) # non-null counts, dtypes
# Selection
print(df['Name']) # Series (one column)
print(df[['Name', 'Score']]) # DataFrame (multiple)
print(df.iloc[0]) # first row by position
print(df.loc[df['Score'] > 85]) # conditional filter
print(df.query("Subject == 'CS' and Score >= 80"))
# Missing values
df_messy = df.copy()
df_messy.loc[1, 'Score'] = np.nan
print(df_messy.isnull().sum()) # count nulls per col
df_filled = df_messy.fillna({'Score': df_messy['Score'].mean()}) # impute Score with its mean
df_dropped = df_messy.dropna() # remove null rows
# Group operations
subject_stats = df.groupby('Subject').agg({
'Score': ['mean', 'max', 'count'],
'Age': 'mean'
})
print(subject_stats)
# Sort
df_sorted = df.sort_values('Score', ascending=False)
# Apply custom function
df['Grade'] = df['Score'].apply(lambda x: 'A' if x>=90 else 'B' if x>=80 else 'C')
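The chained-lambda grading above can also be done with `pd.cut`, which bins a numeric column in one vectorized call. A sketch using the same Score values:

```python
import pandas as pd

df = pd.DataFrame({'Score': [88, 92, 78, 95, 85]})
# Left-inclusive bins: [0,80) -> C, [80,90) -> B, [90,100) -> A,
# matching the lambda rule (>=90 A, >=80 B, else C).
df['Grade'] = pd.cut(df['Score'],
                     bins=[0, 80, 90, 100],
                     labels=['C', 'B', 'A'],
                     right=False)
print(df)
```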
# Merge two DataFrames
courses = pd.DataFrame({'Subject':['CS','IT','ECE'], 'Dept':['Science','Tech','Engg']})
merged = df.merge(courses, on='Subject', how='left')
# Pivot table
pivot = df.pivot_table(values='Score', index='Subject', aggfunc=['mean','count'])
# Read/Write
df.to_csv('students.csv', index=False)
df_read = pd.read_csv('students.csv')
df.to_excel('students.xlsx', index=False)
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
# Figure with subplots
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
# 1. Histogram — distribution
axes[0,0].hist(df['Score'], bins=5, color='steelblue', edgecolor='white')
axes[0,0].set_title('Score Distribution')
axes[0,0].set_xlabel('Score'); axes[0,0].set_ylabel('Count')
# 2. Box plot — outliers and quartiles
axes[0,1].boxplot([df[df.Subject==s]['Score'] for s in df.Subject.unique()],
labels=df.Subject.unique())
axes[0,1].set_title('Score by Subject')
# 3. Scatter plot — correlation
axes[0,2].scatter(df['Age'], df['Score'], c='coral', s=100, alpha=0.7)
axes[0,2].set_title('Age vs Score')
# 4. Bar chart — comparison
subj_mean = df.groupby('Subject')['Score'].mean()
axes[1,0].bar(subj_mean.index, subj_mean.values, color=['#3b82f6','#10b981','#f59e0b'])
axes[1,0].set_title('Average Score by Subject')
# 5. Seaborn heatmap — correlation matrix
data_numeric = df[['Age', 'Score']].assign(
CS=(df.Subject=='CS').astype(int))
sns.heatmap(data_numeric.corr(), annot=True, cmap='coolwarm', ax=axes[1,1])
axes[1,1].set_title('Correlation Matrix')
# 6. Seaborn pair plot for full EDA
# sns.pairplot(df, hue='Subject') # in separate cell
plt.tight_layout()
plt.savefig('eda_plots.png', dpi=150, bbox_inches='tight')
plt.show()
from sklearn.preprocessing import (StandardScaler, MinMaxScaler,
LabelEncoder, OneHotEncoder)
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
# Load data
df = pd.read_csv('student_performance.csv')
# 1. Handle missing values
imputer = SimpleImputer(strategy='median') # or 'mean', 'most_frequent'
df[['age', 'study_hours']] = imputer.fit_transform(df[['age', 'study_hours']])
# 2. Encode categorical variables
# Label encoding (ordinal: fail < pass < distinction)
le = LabelEncoder()
df['grade_code'] = le.fit_transform(df['grade'])
# One-hot encoding (nominal: subject, gender)
df = pd.get_dummies(df, columns=['subject', 'gender'], drop_first=True)
# 3. Feature scaling
X = df.drop('final_score', axis=1)
y = (df['final_score'] >= 60).astype(int)  # pass/fail label — the classifier below needs a categorical target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit on train only!
X_test_scaled = scaler.transform(X_test) # transform only
# 4. Outlier detection
from scipy import stats
z_scores = np.abs(stats.zscore(df.select_dtypes(include=np.number)))
df_clean = df[(z_scores < 3).all(axis=1)] # remove rows with |z|>3
print(f"Removed {len(df)-len(df_clean)} outlier rows")
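Besides z-scores, the IQR rule is a common alternative that is robust to skewed data: it flags points outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A minimal sketch with a toy series (the value 200 is a planted outlier):

```python
import pandas as pd

s = pd.Series([85, 90, 78, 92, 88, 76, 95, 82, 200])  # 200 is an outlier
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
# Keep only points within the IQR fences.
mask = s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(s[mask].tolist())  # the 200 row is dropped
```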
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, GridSearchCV
import joblib
# Build pipeline
pipeline = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(random_state=42))
])
# Train
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
# Evaluate
print(classification_report(y_test, y_pred))
# Cross-validation
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1_weighted')
print(f"CV F1: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# Hyperparameter tuning
param_grid = {
'classifier__n_estimators': [50, 100, 200],
'classifier__max_depth': [None, 5, 10],
'classifier__min_samples_leaf': [1, 2, 5]
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_weighted', n_jobs=-1)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
# Feature importance
rf_model = grid_search.best_estimator_.named_steps['classifier']
importance_df = pd.DataFrame({
'feature': X.columns,
'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
print(importance_df.head(10))
# Save model
joblib.dump(grid_search.best_estimator_, 'student_model.pkl')
# Load and predict
model = joblib.load('student_model.pkl')
new_student = pd.DataFrame({'age':[20], 'study_hours':[5], ...})
prediction = model.predict(new_student)
Q: What is data leakage in ML? A: Information from the test data leaks into training, producing falsely high accuracy. Classic mistake: fitting the scaler on the whole dataset and THEN splitting. Correct: split FIRST, then fit the scaler only on the training set. Using future information about the target variable is also leakage.
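The split-first rule can be shown in a few lines; a sketch on random toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.random((100, 3))
y = rng.integers(0, 2, 100)

# WRONG (leakage): StandardScaler().fit(X) before splitting —
# the scaler's mean/std would include test rows.

# RIGHT: split first, fit the scaler on the training fold only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)   # statistics from train only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # reuse train statistics
print(X_train_s.shape, X_test_s.shape)   # (80, 3) (20, 3)
```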
Q: Why are imbalanced datasets a problem, and how do you handle them? A: The majority class dominates training, so the model predicts the minority class (fraud, disease) poorly. Solutions: SMOTE (synthetic minority oversampling), undersampling the majority class, class_weight='balanced', threshold tuning, and evaluating with F1/AUC metrics (not accuracy).
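The `class_weight='balanced'` option reweights the loss by inverse class frequency, so minority-class errors cost more. A sketch on synthetic 90/10 data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced toy data: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1],
                           random_state=42)
# 'balanced' gives each class weight n_samples / (n_classes * count),
# penalizing mistakes on the rare class more heavily.
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X, y)
print(clf.score(X, y))
```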
Q: What is the difference between correlation and causation? A: Correlation: two variables rise or fall together (a statistical relationship). Causation: one variable actually causes the other. "Ice cream sales and drownings are correlated" — summer causes both; ice cream does not cause drowning. ML exploits correlation; causation is established through controlled experiments.
Complete Python Data Science notes for BCA Sem 6 — NumPy, Pandas, Matplotlib, Seaborn, Scikit-learn, Data preprocessing, EDA, ML models with Jupyter notebook examples.
46 pages · 2.3 MB · Updated 2026-03-11
Rich ecosystem: NumPy, Pandas, Matplotlib, Scikit-learn, and TensorFlow are all available in Python. Easy syntax and rapid prototyping. Jupyter notebooks are perfect for interactive analysis. Industry standard — over 80% of data scientists use Python.
NumPy: numerical computing, N-dimensional arrays, fast mathematical operations. Pandas: tabular data (DataFrame), data cleaning, missing values, groupby, merge — spreadsheet-like. In data science the two are used together.
Detect with df.isnull().sum(). Strategies: dropna() (delete rows/columns), fillna(mean/median/mode) (impute), forward/backward fill, interpolate(). The right approach depends on the data type and the percentage of values missing.
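The strategies above, side by side on a tiny toy series:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])
print(s.isnull().sum())              # 2 missing values
print(s.fillna(s.mean()).tolist())   # impute with mean (30.0)
print(s.ffill().tolist())            # forward fill: NaN copies previous value
print(s.interpolate().tolist())      # linear: [10, 20, 30, 40, 50]
```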
Distance-based algorithms (KNN, SVM, KMeans) and gradient-descent models (neural nets) are sensitive to feature scale. Use standardization (mean=0, std=1) or Min-Max scaling (0-1). Tree-based models (Random Forest) do not need scaling.
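The two scalers compared on a toy column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
std = StandardScaler().fit_transform(X)   # rescaled to mean 0, std 1
mm = MinMaxScaler().fit_transform(X)      # squashed into [0, 1]
print(std.mean(), std.std())  # ~0.0, 1.0
print(mm.min(), mm.max())     # 0.0 1.0
```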
Exploratory Data Analysis — understanding the data before modeling. Check shape, dtypes, missing values, distributions, correlations, and outliers. Visualization: histograms, box plots, scatter plots, heatmaps. Insights → better feature engineering.
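That checklist condenses to a minimal first-pass script; a sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'Age': [22, 25, 23, 24, 22],
                   'Score': [88, 92, 78, 95, 85]})
# First pass: shape, types, nulls, summary stats, correlations.
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())
print(df.describe())
print(df.corr(numeric_only=True))
```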