Why Python Took Over Data Science
If you asked a room full of data scientists what language they use daily, roughly seven out of ten would say Python. That wasn't always the case — R had a stronghold in academia for years, and SAS dominated enterprise analytics well into the 2010s. But Python won for one simple reason: it does everything.
Need to scrape a website, clean a messy CSV, train a machine learning model, and deploy it as an API? Python handles all four. R can't say that. SAS definitely can't. And that versatility is exactly why hiring managers now list Python as a must-have skill in over 80% of data science job postings.
Here's what makes Python particularly good for data work:
- Pandas and NumPy turn Python into a spreadsheet on steroids — fast, scriptable, and capable of handling millions of rows without breaking a sweat
- Scikit-learn gives you production-grade machine learning in a few lines of code
- Matplotlib and Seaborn handle visualisation, from quick-and-dirty histograms to publication-quality charts
- The community is massive. Whatever problem you hit, someone on Stack Overflow has already solved it
One thing worth mentioning: Python isn't the fastest language out there. For raw number-crunching, C++ or Julia will smoke it. But speed of development matters more than speed of execution for 95% of data science work, and that's where Python shines.
Setting Up Your Environment (Without the Headaches)
I've watched too many beginners lose an entire Saturday fighting with Python installations. Let's avoid that.
You have two solid options, and I'd recommend picking one based on your comfort level:
Option A: Anaconda — the "just works" approach
Anaconda bundles Python with 250+ data science packages. Download it from anaconda.com, run the installer, and you're done. It also ships with Jupyter Notebook, which is where you'll spend most of your time. The downside? It's a 500MB+ download and installs stuff you'll probably never use.
Option B: pip + venv — the lean approach
If you prefer a cleaner setup (or you're on a machine with limited storage), grab Python from python.org and set up a virtual environment:
python -m venv datasci
source datasci/bin/activate # macOS/Linux
datasci\Scripts\activate # Windows
pip install pandas numpy matplotlib jupyter
Either way, test your setup by opening a Jupyter notebook and running:
import pandas as pd
import numpy as np
print("You're good to go!")
If that prints without errors, you're ready. If it doesn't — and this is important — don't panic. Copy the error message into Google verbatim. Someone else has hit it before you.
Variables, Types, and Why Python Doesn't Make You Declare Them
Coming from languages like Java or C#, Python feels almost suspiciously easy. No type declarations. No semicolons. No curly braces. Just... code.
The core data types you'll use constantly:
- Integers and floats — age = 32 and salary = 85000.50. Python figures out which is which.
- Strings — name = "Priya". Single or double quotes, doesn't matter. Triple quotes for multiline.
- Booleans — is_employed = True. Note the capital T — this trips people up coming from JavaScript.
- Lists — scores = [88, 92, 76, 95]. Ordered and mutable. Your go-to collection type.
- Dictionaries — person = {"name": "Priya", "role": "analyst"}. Key-value pairs. Think of them as JSON objects.
A quick gotcha that catches beginners: Python is dynamically typed, which means a variable can hold an integer one moment and a string the next. That flexibility is great for quick exploration but can bite you in larger projects. Good habit to build early: name your variables descriptively so the type is obvious from context. user_count is clearly a number. x is not clearly anything.
Arithmetic works exactly how you'd expect: +, -, *, /. Two extras worth knowing: ** for exponents (2**10 gives 1024) and // for integer division (7//2 gives 3, not 3.5).
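To see both points in one throwaway snippet (the variable names here are made up for illustration):

count = 42            # an integer
count = "forty-two"   # now it's a string; Python won't complain, so be careful
print(2 ** 10)        # 1024 (exponentiation)
print(7 / 2)          # 3.5 (regular division always returns a float)
print(7 // 2)         # 3 (integer division drops the remainder)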
Loops, Conditionals, and the Art of List Comprehensions
Once you can store data, you need to do things with it. That's where control flow comes in.
If/elif/else works like most languages, except Python uses indentation instead of brackets to define blocks:
score = 82
if score >= 90:
    print("Excellent")
elif score >= 70:
    print("Good")
else:
    print("Needs improvement")
Get the indentation wrong and Python will yell at you with an IndentationError. Most editors handle this automatically, but it's worth knowing why.
For loops in Python are cleaner than in most languages because you iterate directly over items, not indices:
cities = ["Mumbai", "Berlin", "Toronto"]
for city in cities:
    print(f"Processing data for {city}")
When you do need the index, use enumerate(): for i, city in enumerate(cities).
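Here's what that looks like with the cities list from above; enumerate hands you the position and the item together:

for i, city in enumerate(cities):
    print(f"{i + 1}. {city}")
# 1. Mumbai
# 2. Berlin
# 3. Toronto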
Now, here's the thing that separates Python beginners from people who actually write Pythonic code: list comprehensions. They're one-line loops that create new lists:
# Old way
squared = []
for x in range(10):
    squared.append(x ** 2)

# Pythonic way
squared = [x ** 2 for x in range(10)]

# With a filter
even_squared = [x ** 2 for x in range(10) if x % 2 == 0]
Once this syntax clicks, you'll use it everywhere. It's not just shorter; it's usually a bit faster than the equivalent for loop too, because Python avoids the repeated .append() lookup on every iteration.
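If you want to verify the speed claim on your own machine, a rough comparison with the standard-library timeit module looks like this (exact numbers will vary):

import timeit

loop_version = """
squared = []
for x in range(1000):
    squared.append(x ** 2)
"""
comp_version = "[x ** 2 for x in range(1000)]"

# Run each snippet 10,000 times and print the total seconds taken
print(timeit.timeit(loop_version, number=10_000))
print(timeit.timeit(comp_version, number=10_000))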
Writing Functions That Don't Make Your Future Self Cry
Functions are how you stop copying and pasting code blocks. They're also how you make your analysis reproducible — and reproducibility is a non-negotiable in data science.
def calculate_growth_rate(current, previous):
    """Calculate percentage growth between two periods."""
    if previous == 0:
        return float('inf')
    return ((current - previous) / previous) * 100
q1_growth = calculate_growth_rate(150000, 120000)
print(f"Q1 growth: {q1_growth:.1f}%") # Q1 growth: 25.0%
A few things to notice: the docstring (that triple-quoted comment) isn't just decoration. Tools like Jupyter show it when you press Shift+Tab on a function name. Your future self — and your teammates — will appreciate those three seconds of documentation.
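Outside Jupyter, the same docstring is available through Python's built-in help:

help(calculate_growth_rate)           # Prints the signature and the docstring
print(calculate_growth_rate.__doc__)  # Or grab the raw docstring directly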
Default parameters make functions flexible without making them complicated:
def clean_column(text, to_lower=True, strip_whitespace=True):
    if to_lower:
        text = text.lower()
    if strip_whitespace:
        text = text.strip()
    return text
You'll also bump into lambda functions constantly when working with Pandas:
# Convert prices from INR to USD across an entire column
df['price_usd'] = df['price_inr'].apply(lambda x: round(x / 83.5, 2))
Lambda functions are just anonymous one-liners. Don't overthink them — if the logic gets complex, use a regular function instead.
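For instance, if the conversion above grew to include extra rounding or validation rules, a named function passed to .apply() is easier to read and test. This sketch reuses the column names from the earlier example:

def inr_to_usd(amount, rate=83.5):
    """Convert an INR amount to USD, rounded to 2 decimal places."""
    return round(amount / rate, 2)

df['price_usd'] = df['price_inr'].apply(inr_to_usd)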
NumPy: The Engine Under the Hood
Every time you use Pandas, Scikit-learn, or TensorFlow, NumPy is doing the heavy lifting behind the scenes. Understanding it directly makes you faster and gives you more control.
The core idea: NumPy arrays are homogeneous, fixed-size, and stored in contiguous memory. That's what makes them fast — a NumPy operation on a million numbers can be 50–100x faster than the equivalent Python loop.
import numpy as np
# Creating arrays
revenue = np.array([45000, 52000, 48000, 61000, 55000])
zeros = np.zeros((3, 4)) # 3x4 matrix of zeros
random_data = np.random.randn(1000) # 1000 normally distributed values
# Operations happen element-wise — no loops needed
revenue_in_thousands = revenue / 1000
adjusted = revenue * 1.08 # 8% increase across all months
# Aggregations
print(f"Average: {revenue.mean():.0f}")
print(f"Total: {revenue.sum()}")
print(f"Best month: {revenue.max()}")
The real power shows up with boolean indexing — filtering data using conditions:
high_months = revenue[revenue > 50000]
# Returns: array([52000, 61000, 55000])
This pattern — applying conditions directly to arrays — is everywhere in data science. It's how you filter datasets, select features, and mask outliers. Get comfortable with it early because Pandas borrows the exact same syntax.
One practical tip: when you're generating sample data for testing or prototyping, np.random is your best friend. np.random.seed(42) makes your random numbers reproducible — use it in every notebook so your results don't change between runs.
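As a rough sketch, here's how that looks when prototyping; the sales figures below are fake sample data, purely for illustration:

np.random.seed(42)  # Same "random" numbers on every run of the notebook

# Fake daily sales figures: roughly centred on 50,000 with some spread
daily_sales = np.random.normal(loc=50000, scale=8000, size=365)

# Boolean indexing again: pull out the unusually strong days
strong_days = daily_sales[daily_sales > 60000]
print(f"{len(strong_days)} days above 60k")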
Pandas: Where Data Science Actually Happens
If there's one library you absolutely must know for data science in Python, it's Pandas. Full stop. You'll spend more time in Pandas than any other tool — loading data, cleaning it, reshaping it, exploring it, and preparing it for models.
Loading and inspecting data:
import pandas as pd
df = pd.read_csv('sales_q4_2025.csv')
# First things to run on any new dataset:
df.shape # (rows, columns)
df.head() # First 5 rows
df.dtypes # Column data types
df.isnull().sum() # Missing values per column
df.describe() # Stats for numeric columns
I run those five commands on literally every new dataset. It takes thirty seconds and saves hours of confusion later.
Selecting and filtering:
# Single column
df['revenue']
# Multiple columns
df[['product', 'revenue', 'region']]
# Rows matching a condition
df[df['revenue'] > 100000]
# Combining conditions (note the parentheses — they matter)
df[(df['region'] == 'APAC') & (df['revenue'] > 50000)]
The operations you'll use daily:
# Grouping (the Pandas equivalent of SQL GROUP BY)
df.groupby('region')['revenue'].sum()
# Sorting
df.sort_values('revenue', ascending=False)
# Creating new columns
df['profit_margin'] = (df['revenue'] - df['cost']) / df['revenue'] * 100
# Handling missing data
df['city'] = df['city'].fillna('Unknown')
df.dropna(subset=['revenue']) # Drop rows where revenue is missing
The learning curve with Pandas is real — the API is huge and there are usually three ways to do anything. But stick with it. After two weeks of daily use, the common patterns become muscle memory.
Telling Stories with Matplotlib (and When to Use Seaborn Instead)
Data that sits in a table doesn't convince anyone. Charts do. And while there are fancier tools out there (Plotly, Bokeh, Altair), Matplotlib is the foundation all of them are built on. Learn it first.
import matplotlib.pyplot as plt
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May']
revenue = [45, 52, 48, 61, 55]
plt.figure(figsize=(10, 5))
plt.bar(months, revenue, color='#FF5722')
plt.title('Monthly Revenue (Q1-Q2 2025)', fontsize=14, fontweight='bold')
plt.ylabel('Revenue ($K)')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
That gives you a clean, labelled bar chart in about ten lines. Not bad.
When to reach for Seaborn: Statistical visualisations. If you need distribution plots, correlation heatmaps, or anything that involves grouping by categories, Seaborn does in one line what takes ten in Matplotlib:
import seaborn as sns
# Correlation heatmap — incredibly useful during EDA
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='RdBu_r', center=0)
# Distribution comparison
sns.boxplot(data=df, x='department', y='salary')
# Scatterplot with regression line
sns.regplot(data=df, x='experience_years', y='salary', scatter_kws={'alpha': 0.4})
A few non-obvious tips from experience:
- Always set figsize — the default is too small for most data
- Use plt.tight_layout() to prevent labels from getting clipped
- Pick colour-blind-friendly palettes. About 8% of men have some form of colour vision deficiency. Seaborn's default palettes are already decent for this.
- For presentations, export as SVG (plt.savefig('chart.svg')) instead of PNG — it scales perfectly at any size (see the snippet just below)
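A minimal example of that export step, reusing the bar chart from above (the filenames are placeholders):

# Call savefig() before plt.show(); once the figure is displayed, it may be cleared
plt.savefig('chart.svg', bbox_inches='tight')            # Vector output, crisp at any zoom level
plt.savefig('chart.png', dpi=300, bbox_inches='tight')   # High-resolution fallback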
Once you're comfortable with Matplotlib and Seaborn, explore Plotly for interactive charts that work in web dashboards. But master the basics first — you'll be surprised how far bar charts, line charts, and histograms take you.