Why Python Took Over Data Science
If you asked a room full of data scientists what language they use daily, roughly seven out of ten would say Python. That wasn't always the case — R had a stronghold in academia for years, and SAS dominated enterprise analytics well into the 2010s. But Python won for one simple reason: it does everything.
Need to scrape a website, clean a messy CSV, train a machine learning model, and deploy it as an API? Python handles all four. R can't say that. SAS definitely can't. And that versatility is exactly why hiring managers now list Python as a must-have skill in over 80% of data science job postings.
Here's what makes Python particularly good for data work:
- Pandas and NumPy turn Python into a spreadsheet on steroids — fast, scriptable, and capable of handling millions of rows without breaking a sweat
- Scikit-learn gives you production-grade machine learning in a few lines of code
- Matplotlib and Seaborn handle visualisation, from quick-and-dirty histograms to publication-quality charts
- The community is massive. Whatever problem you hit, someone on Stack Overflow has already solved it
One thing worth mentioning: Python isn't the fastest language out there. For raw number-crunching, C++ or Julia will smoke it. But speed of development matters more than speed of execution for 95% of data science work, and that's where Python shines.
Setting Up Your Environment (Without the Headaches)
I've watched too many beginners lose an entire Saturday fighting with Python installations. Let's avoid that.
You have two solid options, and I'd recommend picking one based on your comfort level:
Option A: Anaconda — the "just works" approach
Anaconda bundles Python with 250+ data science packages. Download it from anaconda.com, run the installer, and you're done. It also ships with Jupyter Notebook, which is where you'll spend most of your time. The downside? It's a 500MB+ download and installs stuff you'll probably never use.
Option B: pip + venv — the lean approach
If you prefer a cleaner setup (or you're on a machine with limited storage), grab Python from python.org and set up a virtual environment:
python -m venv datasci
source datasci/bin/activate # macOS/Linux
datasci\Scripts\activate # Windows
pip install pandas numpy matplotlib jupyter
Either way, test your setup by opening a Jupyter notebook and running:
import pandas as pd
import numpy as np
print("You're good to go!")
If that prints without errors, you're ready. If it doesn't — and this is important — don't panic. Copy the error message into Google verbatim. Someone else has hit it before you.
Variables, Types, and Why Python Doesn't Make You Declare Them
Coming from languages like Java or C#, Python feels almost suspiciously easy. No type declarations. No semicolons. No curly braces. Just... code.
The core data types you'll use constantly:
- Integers and floats — age = 32 and salary = 85000.50. Python figures out which is which.
- Strings — name = "Priya". Single or double quotes, doesn't matter. Triple quotes for multiline.
- Booleans — is_employed = True. Note the capital T — this trips people up coming from JavaScript.
- Lists — scores = [88, 92, 76, 95]. Ordered and mutable. Your go-to collection type.
- Dictionaries — person = {"name": "Priya", "role": "analyst"}. Key-value pairs. Think of them as JSON objects.
A quick gotcha that catches beginners: Python is dynamically typed, which means a variable can hold an integer one moment and a string the next. That flexibility is great for quick exploration but can bite you in larger projects. Good habit to build early: name your variables descriptively so the type is obvious from context. user_count is clearly a number. x is not clearly anything.
Arithmetic works exactly how you'd expect: +, -, *, /. Two extras worth knowing: ** for exponents (2**10 gives 1024) and // for integer division (7//2 gives 3, not 3.5).
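To see both points in one throwaway snippet (the variable names here are made up for illustration):

count = 42            # an integer
count = "forty-two"   # now it's a string; Python won't complain, so be careful
print(2 ** 10)        # 1024 (exponentiation)
print(7 / 2)          # 3.5 (regular division always returns a float)
print(7 // 2)         # 3 (integer division drops the remainder)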
Loops, Conditionals, and the Art of List Comprehensions
Once you can store data, you need to do things with it. That's where control flow comes in.
If/elif/else works like most languages, except Python uses indentation instead of brackets to define blocks:
score = 82
if score >= 90:
    print("Excellent")
elif score >= 70:
    print("Good")
else:
    print("Needs improvement")
Get the indentation wrong and Python will yell at you with an IndentationError. Most editors handle this automatically, but it's worth knowing why.
For loops in Python are cleaner than in most languages because you iterate directly over items, not indices:
cities = ["Mumbai", "Berlin", "Toronto"]
for city in cities:
    print(f"Processing data for {city}")
When you do need the index, use enumerate(): for i, city in enumerate(cities).
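Here's what that looks like with the cities list from above; enumerate hands you the position and the item together:

for i, city in enumerate(cities):
    print(f"{i + 1}. {city}")
# 1. Mumbai
# 2. Berlin
# 3. Toronto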
Now, here's the thing that separates Python beginners from people who actually write Pythonic code: list comprehensions. They're one-line loops that create new lists:
# Old way
squared = []
for x in range(10):
    squared.append(x ** 2)

# Pythonic way
squared = [x ** 2 for x in range(10)]

# With a filter
even_squared = [x ** 2 for x in range(10) if x % 2 == 0]
Once this syntax clicks, you'll use it everywhere. It's not just shorter; it's usually a bit faster than the equivalent for loop too, because Python avoids the repeated .append() lookup on every iteration.
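If you want to verify the speed claim on your own machine, a rough comparison with the standard-library timeit module looks like this (exact numbers will vary):

import timeit

loop_version = """
squared = []
for x in range(1000):
    squared.append(x ** 2)
"""
comp_version = "[x ** 2 for x in range(1000)]"

# Run each snippet 10,000 times and print the total seconds taken
print(timeit.timeit(loop_version, number=10_000))
print(timeit.timeit(comp_version, number=10_000))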
Writing Functions That Don't Make Your Future Self Cry
Functions are how you stop copying and pasting code blocks. They're also how you make your analysis reproducible — and reproducibility is a non-negotiable in data science.
def calculate_growth_rate(current, previous):
    """Calculate percentage growth between two periods."""
    if previous == 0:
        return float('inf')
    return ((current - previous) / previous) * 100
q1_growth = calculate_growth_rate(150000, 120000)
print(f"Q1 growth: {q1_growth:.1f}%") # Q1 growth: 25.0%
A few things to notice: the docstring (that triple-quoted comment) isn't just decoration. Tools like Jupyter show it when you press Shift+Tab on a function name. Your future self — and your teammates — will appreciate those three seconds of documentation.
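Outside Jupyter, the same docstring is available through Python's built-in help:

help(calculate_growth_rate)           # Prints the signature and the docstring
print(calculate_growth_rate.__doc__)  # Or grab the raw docstring directly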
Default parameters make functions flexible without making them complicated:
def clean_column(text, to_lower=True, strip_whitespace=True):
    if to_lower:
        text = text.lower()
    if strip_whitespace:
        text = text.strip()
    return text
You'll also bump into lambda functions constantly when working with Pandas:
# Convert prices from INR to USD across an entire column
df['price_usd'] = df['price_inr'].apply(lambda x: round(x / 83.5, 2))
Lambda functions are just anonymous one-liners. Don't overthink them — if the logic gets complex, use a regular function instead.
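For instance, if the conversion above grew to include extra rounding or validation rules, a named function passed to .apply() is easier to read and test. This sketch reuses the column names from the earlier example:

def inr_to_usd(amount, rate=83.5):
    """Convert an INR amount to USD, rounded to 2 decimal places."""
    return round(amount / rate, 2)

df['price_usd'] = df['price_inr'].apply(inr_to_usd)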
NumPy: The Engine Under the Hood
Every time you use Pandas, Scikit-learn, or TensorFlow, NumPy is doing the heavy lifting behind the scenes. Understanding it directly makes you faster and gives you more control.
The core idea: NumPy arrays are homogeneous, fixed-size, and stored in contiguous memory. That's what makes them fast — a NumPy operation on a million numbers can be 50–100x faster than the equivalent Python loop.
import numpy as np
# Creating arrays
revenue = np.array([45000, 52000, 48000, 61000, 55000])
zeros = np.zeros((3, 4)) # 3x4 matrix of zeros
random_data = np.random.randn(1000) # 1000 normally distributed values
# Operations happen element-wise — no loops needed
revenue_in_thousands = revenue / 1000
adjusted = revenue * 1.08 # 8% increase across all months
# Aggregations
print(f"Average: {revenue.mean():.0f}")
print(f"Total: {revenue.sum()}")
print(f"Best month: {revenue.max()}")
The real power shows up with boolean indexing — filtering data using conditions:
high_months = revenue[revenue > 50000]
# Returns: array([52000, 61000, 55000])
This pattern — applying conditions directly to arrays — is everywhere in data science. It's how you filter datasets, select features, and mask outliers. Get comfortable with it early because Pandas borrows the exact same syntax.
One practical tip: when you're generating sample data for testing or prototyping, np.random is your best friend. np.random.seed(42) makes your random numbers reproducible — use it in every notebook so your results don't change between runs.
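As a rough sketch, here's how that looks when prototyping; the sales figures below are fake sample data, purely for illustration:

np.random.seed(42)  # Same "random" numbers on every run of the notebook

# Fake daily sales figures: roughly centred on 50,000 with some spread
daily_sales = np.random.normal(loc=50000, scale=8000, size=365)

# Boolean indexing again: pull out the unusually strong days
strong_days = daily_sales[daily_sales > 60000]
print(f"{len(strong_days)} days above 60k")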
Pandas: Where Data Science Actually Happens
If there's one library you absolutely must know for data science in Python, it's Pandas. Full stop. You'll spend more time in Pandas than any other tool — loading data, cleaning it, reshaping it, exploring it, and preparing it for models.
Loading and inspecting data:
import pandas as pd
df = pd.read_csv('sales_q4_2025.csv')
# First things to run on any new dataset:
df.shape # (rows, columns)
df.head() # First 5 rows
df.dtypes # Column data types
df.isnull().sum() # Missing values per column
df.describe() # Stats for numeric columns
I run those five commands on literally every new dataset. It takes thirty seconds and saves hours of confusion later.
Selecting and filtering:
# Single column
df['revenue']
# Multiple columns
df[['product', 'revenue', 'region']]
# Rows matching a condition
df[df['revenue'] > 100000]
# Combining conditions (note the parentheses — they matter)
df[(df['region'] == 'APAC') & (df['revenue'] > 50000)]
The operations you'll use daily:
# Grouping (the Pandas equivalent of SQL GROUP BY)
df.groupby('region')['revenue'].sum()
# Sorting
df.sort_values('revenue', ascending=False)
# Creating new columns
df['profit_margin'] = (df['revenue'] - df['cost']) / df['revenue'] * 100
# Handling missing data
df['city'] = df['city'].fillna('Unknown')
df.dropna(subset=['revenue']) # Drop rows where revenue is missing
The learning curve with Pandas is real — the API is huge and there are usually three ways to do anything. But stick with it. After two weeks of daily use, the common patterns become muscle memory.
Telling Stories with Matplotlib (and When to Use Seaborn Instead)
Data that sits in a table doesn't convince anyone. Charts do. And while there are fancier tools out there (Plotly, Bokeh, Altair), Matplotlib is the foundation all of them are built on. Learn it first.
import matplotlib.pyplot as plt
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May']
revenue = [45, 52, 48, 61, 55]
plt.figure(figsize=(10, 5))
plt.bar(months, revenue, color='#FF5722')
plt.title('Monthly Revenue (Q1-Q2 2025)', fontsize=14, fontweight='bold')
plt.ylabel('Revenue ($K)')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
That gives you a clean, labelled bar chart in about ten lines. Not bad.
When to reach for Seaborn: Statistical visualisations. If you need distribution plots, correlation heatmaps, or anything that involves grouping by categories, Seaborn does in one line what takes ten in Matplotlib:
import seaborn as sns
# Correlation heatmap — incredibly useful during EDA
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='RdBu_r', center=0)
# Distribution comparison
sns.boxplot(data=df, x='department', y='salary')
# Scatterplot with regression line
sns.regplot(data=df, x='experience_years', y='salary', scatter_kws={'alpha': 0.4})
A few non-obvious tips from experience:
- Always set figsize — the default is too small for most data
- Use plt.tight_layout() to prevent labels from getting clipped
- Pick colour-blind-friendly palettes. About 8% of men have some form of colour vision deficiency. Seaborn's default palettes are already decent for this.
- For presentations, export as SVG (plt.savefig('chart.svg')) instead of PNG — it scales perfectly at any size (see the snippet just below)
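A minimal example of that export step, reusing the bar chart from above (the filenames are placeholders):

# Call savefig() before plt.show(); once the figure is displayed, it may be cleared
plt.savefig('chart.svg', bbox_inches='tight')            # Vector output, crisp at any zoom level
plt.savefig('chart.png', dpi=300, bbox_inches='tight')   # High-resolution fallback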
Once you're comfortable with Matplotlib and Seaborn, explore Plotly for interactive charts that work in web dashboards. But master the basics first — you'll be surprised how far bar charts, line charts, and histograms take you.