+++
In the previous chapters, we focused on discrete random variables, which take on a finite or countably infinite number of distinct values. Think about the number of heads in 10 coin flips (0, 1, ..., 10), the number of emails received in an hour (0, 1, 2, ...), or the outcome of a die roll ({1, 2, 3, 4, 5, 6}).
Now, we turn our attention to continuous random variables. These variables can take on any value within a given range or interval. Examples include:
- The exact height of a randomly selected adult (e.g., 1.7532... meters).
- The time until a radioactive particle decays (e.g., 3.1415... seconds).
- The temperature of a room (e.g., 21.5... degrees Celsius).
- The exact concentration of a chemical in a solution.
The key difference lies in the nature of the sample space. For discrete variables, we can list the possible outcomes. For continuous variables, there are uncountably infinite possible values between any two given values.
Example: Measuring the exact height of a person results in a continuous variable, as height can theoretically take any value within a range (e.g., 1.5m, 1.51m, 1.511m, ...). In contrast, counting the number of people taller than 1.8m in a sample of 100 yields a discrete variable (0, 1, 2, ..., 100).
This fundamental difference means we can no longer talk about the probability of the variable taking on a specific value, like $P(X = 1.75)$; for a continuous random variable, the probability of any single exact value is zero. Instead, we work with probabilities over intervals.
To describe the probability distribution of continuous random variables, we introduce two key functions: the Probability Density Function (PDF) and the Cumulative Distribution Function (CDF).
+++
For a continuous random variable $X$, the Probability Density Function (PDF), denoted $f_X(x)$, describes the relative likelihood of $X$ taking values near $x$. Probabilities are obtained by integrating the PDF over intervals, not by evaluating it at single points.
Key properties of a PDF:

- Non-negativity: $f_X(x) \ge 0$ for all $x$.
- Total area equals 1: $\int_{-\infty}^{\infty} f_X(x)\, dx = 1$. (The total probability over the entire range of possible values is 1.)
- Probability as Area: The probability that $X$ falls within an interval $[a, b]$ is given by the integral of the PDF over that interval:

$$P(a \le X \le b) = \int_a^b f_X(x)\, dx$$
Important Note: For any continuous random variable $X$ and any single value $x$, $P(X = x) = 0$. Consequently, $P(a \le X \le b) = P(a < X < b)$: including or excluding the endpoints does not change the probability.
Example: Imagine the distribution of IQ scores in a large population, often modeled by a Normal distribution (a "bell curve"). The PDF would be highest around the average IQ (e.g., 100) and taper off towards very high or very low scores. The value of the PDF at $x = 100$ is a density, not the probability of an IQ of exactly 100 (which is zero); only areas under the curve are probabilities.
In Python, libraries like `scipy.stats` provide objects representing various continuous distributions, allowing us to evaluate their PDFs using the `.pdf()` method.
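As a minimal sketch (the Normal(100, 15) IQ model here is just an illustrative assumption), we can evaluate a PDF with `.pdf()` and numerically verify its defining properties:

```python
import numpy as np
from scipy import stats, integrate

# Illustrative model (assumption): IQ ~ Normal(mean=100, std dev=15)
iq = stats.norm(loc=100, scale=15)

# Density at the mean -- a density, not a probability
density_at_mean = iq.pdf(100)

# Non-negativity on a grid of points
assert np.all(iq.pdf(np.linspace(40, 160, 101)) >= 0)

# Total area under the PDF is 1 (checked numerically)
total, _ = integrate.quad(iq.pdf, -np.inf, np.inf)
print(density_at_mean, total)
```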
+++
The Cumulative Distribution Function (CDF), denoted as $F_X(x)$, gives the probability that $X$ takes a value less than or equal to $x$: $F_X(x) = P(X \le x)$.
For a continuous random variable with PDF $f_X(x)$, the CDF is obtained by integrating the density up to $x$: $F_X(x) = \int_{-\infty}^{x} f_X(t)\, dt$.
Key properties of a CDF:

- $F_X(x)$ is a non-decreasing function of $x$.
- $\lim_{x \to -\infty} F_X(x) = 0$. (The probability of being less than or equal to a very small number is 0.)
- $\lim_{x \to \infty} F_X(x) = 1$. (The probability of being less than or equal to a very large number is 1.)
- The probability of $X$ falling in an interval $(a, b]$ can be calculated using the CDF:

$$P(a < X \le b) = F_X(b) - F_X(a)$$

- The PDF can be obtained by differentiating the CDF (where the derivative exists): $f_X(x) = \frac{d}{dx} F_X(x)$.
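The derivative relationship in the last property can be checked numerically. This sketch (assuming an illustrative Normal distribution) compares a central finite difference of the CDF against the PDF:

```python
from scipy import stats

dist = stats.norm(loc=100, scale=15)  # illustrative choice of distribution

x = 110.0
h = 1e-5  # small step for a central finite difference
cdf_slope = (dist.cdf(x + h) - dist.cdf(x - h)) / (2 * h)

# The slope of the CDF at x matches the PDF at x
print(cdf_slope, dist.pdf(x))
```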
Example: For the IQ distribution, the CDF evaluated at the mean, $F_X(100)$, equals 0.5: half the population has an IQ at or below 100, since the Normal distribution is symmetric about its mean.
`scipy.stats` objects also provide a `.cdf()` method to evaluate the CDF.
+++
Similar to discrete random variables, we can define measures of central tendency and spread for continuous variables using integration instead of summation.
Expected Value (Mean):
The expected value or mean of a continuous random variable $X$ with PDF $f_X(x)$ is defined as

$$E[X] = \mu_X = \int_{-\infty}^{\infty} x\, f_X(x)\, dx$$
Think of it as the center of mass of the distribution.
Variance and Standard Deviation: The variance measures the spread or dispersion of the distribution around the mean. It's the expected value of the squared deviation from the mean:

$$\text{Var}(X) = \sigma_X^2 = E[(X - \mu_X)^2] = \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x)\, dx$$
An alternative computational formula is often more convenient:

$$\text{Var}(X) = E[X^2] - (E[X])^2$$

where $E[X^2] = \int_{-\infty}^{\infty} x^2 f_X(x)\, dx$ is the second moment.
The Standard Deviation, $\sigma_X = \sqrt{\text{Var}(X)}$, is the square root of the variance and has the same units as $X$ itself.
Example: For the IQ distribution (often modeled as Normal with mean 100 and standard deviation 15), $E[X] = 100$, $\text{Var}(X) = 15^2 = 225$, and $\sigma_X = 15$.
`scipy.stats` distributions often have methods like `.mean()`, `.var()`, and `.std()` to compute these directly, or `.expect()` for more general expectations.
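For the IQ example above (Normal with mean 100, standard deviation 15), these methods can be sketched as follows; the `.expect()` call recomputes the variance by integrating $g(x) f(x)$ with $g(x) = (x - 100)^2$:

```python
from scipy import stats

iq = stats.norm(loc=100, scale=15)

print(iq.mean())  # E[X] = 100
print(iq.var())   # Var(X) = 15^2 = 225
print(iq.std())   # sigma = 15

# .expect() numerically integrates g(x) * f(x); here it recovers the variance
var_via_expect = iq.expect(lambda x: (x - 100) ** 2)
print(var_via_expect)
```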
+++
Percentiles and quantiles help us understand specific points in the distribution relative to the cumulative probability.
The $p$-th percentile of a distribution is the value $x_p$ such that $F_X(x_p) = p/100$; that is, $p$ percent of the probability mass lies at or below $x_p$.
The quantile function, often denoted $Q(q) = F_X^{-1}(q)$, is the inverse of the CDF: given a probability $q \in (0, 1)$, it returns the value $x$ for which $F_X(x) = q$.
- The 0.5 quantile (or 50th percentile) is the median.
- The 0.25 quantile (25th percentile) is the first quartile (Q1).
- The 0.75 quantile (75th percentile) is the third quartile (Q3).
Example: Finding the IQ score that marks the 95th percentile means finding the value $x$ such that $F_X(x) = 0.95$; for a Normal(100, 15) distribution this is approximately 124.7.
`scipy.stats` provides the `.ppf()` (percent point function) method, which computes the quantile function, i.e., the inverse of the CDF.
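Continuing the IQ example, `.ppf()` inverts the CDF. A short sketch:

```python
from scipy import stats

iq = stats.norm(loc=100, scale=15)

q95 = iq.ppf(0.95)     # 95th percentile, roughly 124.7
median = iq.ppf(0.50)  # equals the mean for a symmetric distribution

# ppf is the inverse of cdf: cdf(ppf(q)) returns q
print(q95, median, iq.cdf(q95))
```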
+++
Often, we are interested in a new random variable $Y = g(X)$ defined as a function of a random variable $X$ whose distribution we already know. How do we find the distribution of $Y$?
Method 1: Using the CDF

1. Find the CDF of $Y$: $F_Y(y) = P(Y \le y) = P(g(X) \le y)$.
2. Express the event $\{g(X) \le y\}$ in terms of $X$. This might involve solving the inequality for $X$. Be careful if $g(x)$ is not monotonically increasing (i.e., if it goes up and down).
3. Calculate $P(\text{event in terms of } X)$ using the CDF of $X$, $F_X(x)$.
4. Differentiate the resulting $F_Y(y)$ with respect to $y$ to find the PDF of $Y$: $f_Y(y) = \frac{d}{dy} F_Y(y)$.
Method 2: Change of Variables Formula (for monotonic functions)

If $g$ is strictly monotonic and differentiable, with inverse $x = g^{-1}(y)$, then

$$f_Y(y) = f_X\!\left(g^{-1}(y)\right) \left| \frac{d}{dy} g^{-1}(y) \right|$$

where the absolute value of the derivative rescales the density to account for how $g$ stretches or compresses the axis.
Example: Suppose the temperature $C$ in Celsius is uniformly distributed between 15°C and 25°C, and let $F = \frac{9}{5} C + 32$ be the temperature in Fahrenheit.
We need to find the range for F. When $C = 15$, $F = \frac{9}{5}(15) + 32 = 59$; when $C = 25$, $F = \frac{9}{5}(25) + 32 = 77$. Since the transformation is linear (hence monotonic), it maps the uniform density to another uniform density.
Thus, F is uniformly distributed between 59°F and 77°F.
Calculating the expected value of a function $g(X)$ does not require deriving $f_Y$ first: by the Law of the Unconscious Statistician, $E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx$.
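Both the temperature example and the LOTUS formula can be checked numerically. This sketch assumes, as above, that the Celsius temperature is Uniform(15, 25):

```python
import numpy as np
from scipy import stats, integrate

# C ~ Uniform(15, 25); scipy's uniform is parameterized by loc (lower bound) and scale (width)
celsius = stats.uniform(loc=15, scale=10)

def g(c):
    """Celsius to Fahrenheit."""
    return 9 / 5 * c + 32

# E[g(C)] via LOTUS: integrate g(c) * f_C(c) over the support
e_f, _ = integrate.quad(lambda c: g(c) * celsius.pdf(c), 15, 25)
print(e_f)  # E[F] = 9/5 * 20 + 32 = 68

# Simulated samples of F stay within the transformed range [59, 77]
samples = g(celsius.rvs(size=10_000, random_state=0))
print(samples.min(), samples.max())
```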
+++
Let's put these concepts into practice using Python libraries. We'll simulate heights assuming a Normal distribution and compare the histogram to the theoretical PDF, as well as perform calculations using `scipy.stats` and `scipy.integrate`.
```{code-cell} ipython3
# Essential imports
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy import integrate

# Configure plots for better visualization
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Make plots sharper
plt.rcParams['figure.figsize'] = (10, 6)
```
`scipy.stats` provides convenient objects for many standard continuous distributions (we'll explore these in Chapter 9). Let's use the Normal distribution as an example again. Suppose adult male heights (in cm) are Normally distributed with a mean ($\mu$) of 177 cm and a standard deviation ($\sigma$) of 7 cm.
```{code-cell} ipython3
# Define the normal distribution for heights
mu_height = 177
sigma_height = 7

# Create a frozen distribution object
height_dist = stats.norm(loc=mu_height, scale=sigma_height)  # loc=mean, scale=std dev

# Define a range of height values for plotting
# Go out +/- 4 standard deviations to cover most of the probability mass
x_height = np.linspace(height_dist.ppf(0.0001), height_dist.ppf(0.9999), 500)

# Calculate PDF and CDF values for the range
pdf_values = height_dist.pdf(x_height)
cdf_values = height_dist.cdf(x_height)
```
```{code-cell} ipython3
# Plot PDF
plt.figure()
plt.plot(x_height, pdf_values, label=f'Normal PDF (mu={mu_height}, sigma={sigma_height})')
plt.fill_between(x_height, pdf_values, alpha=0.2)
plt.title('Probability Density Function (PDF) of Adult Male Height')
plt.xlabel('Height (cm)')
plt.ylabel('Density')
plt.legend()
plt.grid(True)
plt.show()
```
```{code-cell} ipython3
# Plot CDF
plt.figure()
plt.plot(x_height, cdf_values, label=f'Normal CDF (mu={mu_height}, sigma={sigma_height})')
plt.title('Cumulative Distribution Function (CDF) of Adult Male Height')
plt.xlabel('Height (cm)')
plt.ylabel('Cumulative Probability P(Height <= x)')

# Add lines for key percentiles
plt.axhline(0.5, color='grey', linestyle='--', label='Median (50th percentile)')
plt.axvline(height_dist.ppf(0.5), color='grey', linestyle='--')
plt.legend()
plt.grid(True)
plt.show()
```
We want to calculate the probability of a height falling within a specific range, e.g., P(170 cm < Height ≤ 185 cm).
Method 1: Using the CDF
```{code-cell} ipython3
# Calculate probability using CDF method
prob_cdf = height_dist.cdf(185) - height_dist.cdf(170)
print(f"P(170 < Height <= 185) using CDF: {prob_cdf:.4f}")

# Visualize this area on the PDF
plt.figure()
plt.plot(x_height, pdf_values, label='Normal PDF')

# Create filter for the range of interest
range_filter = (x_height > 170) & (x_height <= 185)
plt.fill_between(x_height[range_filter], pdf_values[range_filter], alpha=0.4, label='P(170 < Height <= 185)')
plt.title('Area under PDF corresponding to P(170 < Height <= 185)')
plt.xlabel('Height (cm)')
plt.ylabel('Density')
plt.legend()
plt.grid(True)
plt.show()
```
Method 2: Using Numerical Integration of the PDF
We need to calculate $\int_{170}^{185} f_X(x)\, dx$. We can do this numerically with `scipy.integrate.quad` (which performs quadrature, a numerical integration method).
```{code-cell} ipython3
# The first argument to quad is the function to integrate (our PDF)
# The next two arguments are the lower and upper limits of integration
prob_integral, integration_error = integrate.quad(height_dist.pdf, 170, 185)
print(f"P(170 < Height <= 185) using PDF integration: {prob_integral:.4f}")
print(f"Estimated integration error: {integration_error:.2e}")  # Should be very small
```
As expected, both methods give essentially the same result (within numerical precision). Using the CDF is generally more direct and computationally efficient when available. Integration is necessary if you only have the PDF function defined (e.g., a custom, non-standard distribution).
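To illustrate the custom-PDF case, this sketch defines a hypothetical triangular density on $[0, 2]$ (peaking at 1) by hand and computes probabilities with `quad` alone, no `scipy.stats` object involved:

```python
from scipy import integrate

def custom_pdf(x):
    """Hypothetical triangular density on [0, 2] with peak at x = 1."""
    if 0 <= x <= 1:
        return x
    if 1 < x <= 2:
        return 2 - x
    return 0.0

# Sanity check: a valid PDF integrates to 1 (points=[1] flags the kink for the integrator)
total, _ = integrate.quad(custom_pdf, 0, 2, points=[1])

# P(0.5 < X <= 1.5) as the area under the density over the interval
prob, _ = integrate.quad(custom_pdf, 0.5, 1.5, points=[1])
print(total, prob)
```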
+++
`scipy.stats` objects have built-in methods for these common statistics.
```{code-cell} ipython3
# Expected Value (Mean)
e_height = height_dist.mean()
print(f"Expected Height E[X]: {e_height:.2f} cm")

# Variance
var_height = height_dist.var()
print(f"Variance Var(X): {var_height:.2f} cm^2")

# Standard Deviation
std_height = height_dist.std()
print(f"Standard Deviation std(X): {std_height:.2f} cm")  # Should match sigma_height

# Calculate Percentiles (Quantiles) using PPF (Percent Point Function)
percentile_95 = height_dist.ppf(0.95)  # 95th percentile
median = height_dist.ppf(0.50)         # 50th percentile (median)
q1 = height_dist.ppf(0.25)             # 25th percentile (Q1)
q3 = height_dist.ppf(0.75)             # 75th percentile (Q3)

print(f"25th Percentile (Q1) of Height: {q1:.2f} cm")
print(f"Median Height (Q2): {median:.2f} cm")  # Equals the mean for a Normal dist.
print(f"75th Percentile (Q3) of Height: {q3:.2f} cm")
print(f"95th Percentile of Height: {percentile_95:.2f} cm")
```
Calculating Moments using Numerical Integration
We can also calculate moments like $E[X]$ and $E[X^2]$ directly from their integral definitions using `integrate.quad`. We integrate over a wide practical range instead of $(-\infty, \infty)$; for the Normal distribution, essentially all of the probability mass lies within a few standard deviations of the mean.
```{code-cell} ipython3
# Define functions representing x*f(x) and x^2*f(x)
def integrand_mean(x):
    """Function x * pdf(x) for calculating E[X]"""
    return x * height_dist.pdf(x)

def integrand_mean_square(x):
    """Function x^2 * pdf(x) for calculating E[X^2]"""
    return x**2 * height_dist.pdf(x)

# Integrate over a wide range (practically equivalent to -inf to +inf for Normal)
lower_bound = height_dist.ppf(1e-8)       # Very small percentile
upper_bound = height_dist.ppf(1 - 1e-8)   # Very large percentile

e_height_integral, _ = integrate.quad(integrand_mean, lower_bound, upper_bound)
e_height_sq_integral, _ = integrate.quad(integrand_mean_square, lower_bound, upper_bound)

# Calculate variance from integrated moments
var_height_integral = e_height_sq_integral - (e_height_integral)**2

print(f"E[X] via integration: {e_height_integral:.2f}")
print(f"E[X^2] via integration: {e_height_sq_integral:.2f}")
print(f"Var(X) via integration (E[X^2] - E[X]^2): {var_height_integral:.2f}")
```
The results from numerical integration closely match the built-in methods, confirming our understanding of the definitions.
+++
We can simulate drawing random samples from the distribution using the `.rvs()` method (random variates) and compare the results (like the histogram and sample statistics) to the theoretical distribution. This is a fundamental technique in Monte Carlo methods (Chapter 18).
```{code-cell} ipython3
# Generate 10,000 random height samples
num_samples = 10000

# Set random_state for reproducibility - ensures we get the same 'random' samples each time
simulated_heights = height_dist.rvs(size=num_samples, random_state=42)

# Calculate empirical (sample) mean and std dev
empirical_mean = np.mean(simulated_heights)
empirical_std = np.std(simulated_heights)  # ddof=0 (population formula); use ddof=1 for the unbiased sample estimate

print(f"Theoretical Mean: {mu_height:.2f}, Empirical Mean (from {num_samples} samples): {empirical_mean:.2f}")
print(f"Theoretical Std Dev: {sigma_height:.2f}, Empirical Std Dev (from {num_samples} samples): {empirical_std:.2f}")

# Plot histogram of simulated data vs theoretical PDF
plt.figure()
# Use density=True to normalize the histogram so its area integrates to 1, making it comparable to the PDF
plt.hist(simulated_heights, bins=50, density=True, alpha=0.7, label='Simulated Heights Histogram')

# Overlay the theoretical PDF for comparison
plt.plot(x_height, pdf_values, 'r-', lw=2, label='Theoretical Normal PDF')
plt.title(f'Simulated Heights (N={num_samples}) vs Theoretical PDF')
plt.xlabel('Height (cm)')
plt.ylabel('Density')
plt.legend()
plt.grid(True)
plt.show()
```
As the plot shows, the histogram of the randomly generated samples closely follows the shape of the theoretical Probability Density Function. Furthermore, the calculated sample mean and standard deviation are very close to the theoretical parameters ($\mu = 177$, $\sigma = 7$).
According to the Law of Large Numbers (Chapter 14), as the number of samples (`num_samples`) increases, the empirical histogram will better approximate the true PDF, and the sample statistics (like mean and standard deviation) will converge to the true parameters of the distribution.
This hands-on section demonstrated how to work with the core concepts of continuous random variables – PDFs, CDFs, expectations, percentiles – using scipy.stats. We also saw how numerical integration can be used for calculations and how simulation allows us to generate data that follows a specific distribution, enabling comparison between empirical results and theoretical models.
In the next chapter, we will delve into specific, commonly encountered continuous distributions like the Uniform, Exponential, and the Normal distribution in more detail, exploring their properties and applications.