Optimizing Supply Chain Networks with NetworkX

Optimizing Supply Chain Networks with NetworkX

Here’s a advanced $NetworkX$ example that focuses on a different real-world problem: Network Flow Optimization.

In this example, we’ll use $NetworkX$ to model a supply chain network and optimize the flow of goods from suppliers to consumers.

Problem Description:

You have multiple suppliers and consumers connected by a network of routes with limited capacity.

The goal is to find the maximum flow of goods from suppliers to consumers while respecting the capacity constraints of each route.

Python Code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
import networkx as nx
import matplotlib.pyplot as plt

# Create a directed graph for the supply chain network
G = nx.DiGraph()

# Add nodes for suppliers, consumers, and intermediate nodes (distribution centers)
G.add_nodes_from(["Supplier_1", "Supplier_2", "DC_1", "DC_2", "Consumer_1", "Consumer_2"])

# Add edges with capacities (maximum flow allowed on each route)
edges = [
("Supplier_1", "DC_1", 15), ("Supplier_1", "DC_2", 10),
("Supplier_2", "DC_1", 10), ("Supplier_2", "DC_2", 25),
("DC_1", "Consumer_1", 10), ("DC_1", "Consumer_2", 10),
("DC_2", "Consumer_1", 20), ("DC_2", "Consumer_2", 15)
]
G.add_weighted_edges_from(edges, weight="capacity")

# Define the source nodes (suppliers) and sink nodes (consumers)
sources = ["Supplier_1", "Supplier_2"]
sinks = ["Consumer_1", "Consumer_2"]

# Add a super source and a super sink to combine multiple sources and sinks
G.add_node("Super_Source")
G.add_node("Super_Sink")

# Connect the super source to the real sources with infinite capacity
for source in sources:
G.add_edge("Super_Source", source, capacity=float('inf'))

# Connect the real sinks to the super sink with infinite capacity
for sink in sinks:
G.add_edge(sink, "Super_Sink", capacity=float('inf'))

# Calculate the maximum flow from super source to super sink
flow_value, flow_dict = nx.maximum_flow(G, "Super_Source", "Super_Sink")

# Print the results
print(f"Maximum flow: {flow_value}")
print("Flow distribution:")
for u, flow in flow_dict.items():
for v, f in flow.items():
if f > 0 and u != "Super_Source" and v != "Super_Sink":
print(f"{u} -> {v}: {f}")

# Visualization of the network
pos = nx.spring_layout(G) # Layout for visualization
plt.figure(figsize=(10, 8))

# Draw the nodes with their labels
nx.draw_networkx_nodes(G, pos, node_size=700, node_color="lightblue")
nx.draw_networkx_labels(G, pos, font_size=12, font_color="black")

# Draw the edges with their capacities
nx.draw_networkx_edges(G, pos, width=2, edge_color="gray")
edge_labels = {(u, v): f"{d['capacity']}" for u, v, d in G.edges(data=True) if "capacity" in d}
nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_size=10)

# Highlight the flow values in red
for u, flow in flow_dict.items():
for v, f in flow.items():
if f > 0 and u != "Super_Source" and v != "Super_Sink":
nx.draw_networkx_edges(G, pos, edgelist=[(u, v)], width=3, edge_color="red", alpha=0.6)

plt.title("Supply Chain Network Flow Optimization")
plt.axis("off")
plt.show()

Explanation:

  1. Graph Construction:

    • Nodes represent suppliers, distribution centers (DCs), and consumers.
    • Edges represent transportation routes between these entities, with capacities indicating the maximum amount of goods that can be transported along each route.
  2. Super Source and Super Sink:

    • To handle multiple sources and sinks, a super source and super sink are added to the graph. These are connected to the actual suppliers and consumers with infinite capacity edges.
  3. Maximum Flow Calculation:

    • The nx.maximum_flow function calculates the maximum flow from the super source to the super sink, optimizing the distribution of goods while respecting capacity constraints.
  4. Visualization:

    • The network is visualized with nodes and edges, and the actual flow of goods (as calculated by the algorithm) is highlighted in red.

Output:

Use Case:

This model can be used to optimize supply chain operations, such as distributing products from factories to warehouses and then to retail outlets, ensuring that capacity constraints are respected.

This example showcases $NetworkX$’s ability to handle complex optimization problems with network flow analysis, a critical task in operations research and logistics.

Advanced Network Analysis and Visualization with NetworkX in Python

NetworkX in Python

Here’s a example using $NetworkX$.

In this example, we’ll work with a directed graph (digraph) to model a transportation network, calculate advanced metrics, and visualize the network with different node and edge attributes.

Step 1: Install Necessary Libraries

If you don’t have $NetworkX$ and $Matplotlib$ installed, you can install them using pip:

1
pip install networkx matplotlib

Step 2: Create and Analyze the Transportation Network Graph

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np

# Step 2.1: Create a directed graph
G = nx.DiGraph()

# Add nodes (cities)
G.add_nodes_from(["A", "B", "C", "D", "E", "F", "G", "H"])

# Add edges (routes) with weights (distances) and capacities (road capacity)
edges = [
("A", "B", {"distance": 5, "capacity": 15}),
("A", "C", {"distance": 10, "capacity": 10}),
("B", "D", {"distance": 8, "capacity": 20}),
("C", "D", {"distance": 2, "capacity": 25}),
("C", "E", {"distance": 7, "capacity": 15}),
("D", "F", {"distance": 6, "capacity": 10}),
("E", "F", {"distance": 3, "capacity": 5}),
("F", "G", {"distance": 4, "capacity": 30}),
("E", "H", {"distance": 12, "capacity": 20}),
("H", "G", {"distance": 9, "capacity": 10}),
]

G.add_edges_from(edges)

# Step 2.2: Calculate advanced network metrics

# Shortest path based on distance
shortest_path_distance = nx.shortest_path(G, source="A", target="G", weight="distance")

# Maximum flow between two nodes based on capacity
max_flow_value, max_flow_dict = nx.maximum_flow(G, "A", "G", capacity="capacity")

# PageRank (used to rank the importance of each node)
pagerank = nx.pagerank(G)

# Eigenvector Centrality (measures the influence of a node in the network)
eigenvector_centrality = nx.eigenvector_centrality_numpy(G)

# Step 2.3: Print out the metrics
print("Shortest Path (Distance) from A to G:", shortest_path_distance)
print("Maximum Flow from A to G:", max_flow_value)
print("PageRank:", pagerank)
print("Eigenvector Centrality:", eigenvector_centrality)

# Step 2.4: Visualize the graph with attributes

# Node positions
pos = nx.spring_layout(G, seed=42)

# Node sizes based on their Eigenvector Centrality
node_sizes = [5000 * eigenvector_centrality[node] for node in G.nodes()]

# Edge widths based on capacity
edge_widths = [G[u][v]["capacity"] / 10 for u, v in G.edges()]

# Edge colors based on distance
edge_colors = [G[u][v]["distance"] for u, v in G.edges()]

plt.figure(figsize=(10, 8))

# Draw nodes
nx.draw_networkx_nodes(G, pos, node_size=node_sizes, node_color='skyblue')

# Draw edges with varying widths and colors
nx.draw_networkx_edges(G, pos, width=edge_widths, edge_color=edge_colors, edge_cmap=plt.cm.Blues)

# Draw labels
nx.draw_networkx_labels(G, pos, font_size=12, font_color='black')

# Draw edge labels with distances
edge_labels = {(u, v): f"{G[u][v]['distance']} km" for u, v in G.edges()}
nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_size=10)

plt.title("Transportation Network Graph")
plt.colorbar(plt.cm.ScalarMappable(cmap=plt.cm.Blues), label="Distance (km)")
plt.show()

Explanation:

  • Graph Creation:

    • The directed graph G represents a transportation network, with nodes as cities and edges as routes between them. Each edge has attributes distance (in kilometers) and capacity (road capacity).
  • Network Metrics:

    • Shortest Path (Distance): The shortest path from city “A” to city “G” is calculated based on the distance attribute.
    • Maximum Flow: The maximum flow value is calculated from “A” to “G” based on the road capacity.
    • PageRank: A metric used to rank the importance of each city in the network.
    • Eigenvector Centrality: Measures the influence of a city based on its connections.
  • Visualization:

    • Node Sizes: Nodes are scaled by their Eigenvector Centrality.
    • Edge Widths: Edges are scaled by their capacity.
    • Edge Colors: Edges are colored by their distance, with a color bar representing the scale.
    • Edge Labels: Distances are labeled on each edge.

Result:

Running this code will display a complex transportation network graph where node sizes represent their influence, edge widths represent road capacities, and edge colors represent distances.

You’ll also get the shortest path based on distance, the maximum flow, PageRank, and Eigenvector Centrality metrics printed out.

This graph provides a detailed and visually rich representation of the transportation network, useful for analyzing and optimizing routes.

3D graph in Plotly

3D graph in Plotly

Here’s an example of how to create a 3D graph using Python’s $Plotly$ library.

In this example, we’ll create a 3D surface plot that visualizes a mathematical function.

Step 1: Install Plotly

If you don’t have $Plotly$ installed, you can install it using pip:

1
pip install plotly

Step 2: Create the 3D Graph

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
import numpy as np
import plotly.graph_objs as go
import plotly.io as pio

# Generate data
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
x, y = np.meshgrid(x, y)
z = np.sin(np.sqrt(x**2 + y**2))

# Create a 3D surface plot
surface = go.Surface(z=z, x=x, y=y, colorscale='Viridis')

# Define the layout of the plot
layout = go.Layout(
title='3D Surface Plot',
scene=dict(
xaxis=dict(title='X Axis'),
yaxis=dict(title='Y Axis'),
zaxis=dict(title='Z Axis'),
camera=dict(
eye=dict(x=1.5, y=1.5, z=1.5)
)
),
margin=dict(l=0, r=0, b=0, t=40)
)

# Create the figure and display it
fig = go.Figure(data=[surface], layout=layout)
pio.show(fig)

Explanation of the Code:

  • Data Generation:

    • x and y are generated using numpy‘s linspace function to create a grid of values.
    • z is the function we want to plot, in this case, $z = sin(sqrt(x^2 + y^2))$.
  • Surface Plot:

    • We use plotly.graph_objs.Surface to create the 3D surface plot.
      The colorscale='Viridis' argument applies a color gradient to the surface.
  • Layout:

    • The layout variable customizes the appearance of the plot, including axis titles and the camera angle.
    • The camera settings control the perspective of the 3D plot.
  • Display:

    • pio.show(fig) is used to render and display the graph in a new browser window or in your notebook interface.

Result:

Running this code will produce a beautiful, interactive 3D surface plot.

You can rotate, zoom, and pan the graph to explore it from different angles.

The Viridis colorscale provides a visually appealing gradient to highlight the surface’s features.

Output:

SciPy in Python

SciPy in Python

Here’s a complex example using $SciPy$ to solve a constrained nonlinear optimization problem.

We’ll solve the problem of finding the minimum of a nonlinear objective function subject to nonlinear constraints, which is a common problem in mathematical optimization.

Problem: Constrained Nonlinear Optimization

Objective Function

We want to minimize the following nonlinear function:

$$
f(x, y) = x^2 + y^2 + x \cdot y + \sin(x) + \cos(y)
$$

Constraints

Subject to the following nonlinear constraints:

  1. $(g_1(x, y) = x^2 + y - 1 \geq 0)$
  2. $(g_2(x, y) = x + y^2 - 1 \geq 0)$

Step 1: Import Required Libraries

1
2
3
4
5
6
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint

# Define the objective function
def objective_function(x):
return x[0]**2 + x[1]**2 + x[0]*x[1] + np.sin(x[0]) + np.cos(x[1])

Step 2: Define the Constraints

1
2
3
4
5
6
7
8
9
10
11
# Define the first nonlinear constraint
def constraint1(x):
return x[0]**2 + x[1] - 1

# Define the second nonlinear constraint
def constraint2(x):
return x[0] + x[1]**2 - 1

# Nonlinear constraints should be of the form g(x) >= 0
nlc1 = NonlinearConstraint(constraint1, 0, np.inf)
nlc2 = NonlinearConstraint(constraint2, 0, np.inf)

Step 3: Define Initial Guess and Bounds

1
2
3
4
5
# Initial guess
x0 = np.array([0.5, 0.5])

# Bounds for variables x and y
bounds = [(-2, 2), (-2, 2)]

Step 4: Solve the Optimization Problem

1
2
3
4
5
6
7
8
9
# Use 'trust-constr' method for constrained optimization
result = minimize(objective_function, x0, method='trust-constr', bounds=bounds, constraints=[nlc1, nlc2])

# Print the result
print("Optimization Result:")
print("x:", result.x)
print("Objective Value:", result.fun)
print("Success:", result.success)
print("Message:", result.message)

Explanation of the Code

  1. Objective Function:

    • The objective function f(x, y) is defined as a $Python$ function that takes a vector x as input and returns the scalar value of the function.
  2. Constraints:

    • The constraints g_1(x, y) and g_2(x, y) are also defined as $Python$ functions.
    • NonlinearConstraint is used to represent each constraint, specifying the lower and upper bounds for the constraint (in this case, 0 to np.inf).
  3. Initial Guess and Bounds:

    • The initial guess x0 is an array of initial values for x and y.
    • The bounds specify the lower and upper limits for each variable.
  4. Optimization:

    • The minimize function is called with the trust-constr method, which is suitable for handling constrained nonlinear optimization problems.
    • The solution is stored in the result object, which contains information about the optimal values of x and y, the objective function value at the minimum, and the success status of the optimization.

Step 5: Analyze the Result

The output will provide the optimal values of $(x)$ and $(y)$ that minimize the objective function while satisfying the nonlinear constraints.

It also includes information on whether the optimization was successful.

Output

1
2
3
4
5
Optimization Result:
x: [-0.8463456 1.35880306]
Objective Value: 0.8741750487714591
Success: True
Message: `gtol` termination condition is satisfied.

Conclusion

This example demonstrates how to solve a complicated constrained nonlinear optimization problem using $SciPy$’s minimize function with the trust-constr method.

The problem involves an objective function with multiple variables and nonlinear constraints, making it a sophisticated and challenging optimization problem.

Portfolio Optimization (CVXPY)

CVXPY

Let’s dive into a complex optimization problem using $CVXPY$.

This example involves solving a portfolio optimization problem.

The goal is to allocate investments across multiple assets to maximize expected returns while minimizing risk, with constraints on the investment.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
import cvxpy as cp
import numpy as np
import matplotlib.pyplot as plt

# Seed for reproducibility
np.random.seed(42)

# Number of assets
n = 10

# Generate random expected returns for each asset
mu = np.random.rand(n)

# Generate a random covariance matrix for asset returns
Sigma = np.random.rand(n, n)
Sigma = Sigma.T @ Sigma # To ensure it is positive semi-definite

# Define the variables (allocation of investments in each asset)
w = cp.Variable(n)

# Define the constraints
constraints = [
cp.sum(w) == 1, # The weights must sum to 1 (100% investment)
w >= 0, # No short-selling (non-negative weights)
]

# Define the objective function (maximize returns, minimize risk)
gamma = cp.Parameter(nonneg=True) # Risk aversion parameter
objective = cp.Maximize(mu.T @ w - gamma * cp.quad_form(w, Sigma))

# Define the problem
problem = cp.Problem(objective, constraints)

# Solve the problem for different values of gamma (risk aversion)
gammas = np.logspace(-3, 3, 100)
returns = []
risks = []

for g in gammas:
gamma.value = g
problem.solve()
returns.append(mu.T @ w.value)
risks.append(np.sqrt(w.value.T @ Sigma @ w.value))

# Plot the efficient frontier
plt.figure(figsize=(10, 6))
plt.plot(risks, returns, marker='o')
plt.xlabel("Risk (Standard Deviation)")
plt.ylabel("Expected Return")
plt.title("Efficient Frontier in Portfolio Optimization")
plt.grid(True)
plt.show()

Explanation of the Code:

  1. Problem Context:

    • The task is to allocate capital across n assets to maximize returns while managing risk. This is a typical problem in finance known as portfolio optimization.
  2. Expected Returns (mu):

    • mu represents the expected returns of each asset. In this example, these are randomly generated.
  3. Covariance Matrix (Sigma):

    • Sigma represents the covariance matrix of asset returns, showing how the returns on different assets move together. It’s crucial for understanding the risk of the portfolio.
  4. Investment Weights (w):

    • w is the decision variable, representing the fraction of total capital allocated to each asset.
  5. Objective Function:

    • The objective is to maximize the portfolio’s expected return (mu.T @ w) while penalizing risk (gamma * cp.quad_form(w, Sigma)). The risk aversion parameter gamma controls the trade-off between return and risk.
  6. Constraints:

    • The constraints ensure that the sum of all investments equals 1 (cp.sum(w) == 1), meaning all capital is invested, and no short-selling is allowed (w >= 0).
  7. Efficient Frontier:

    • The code solves the optimization problem for different levels of risk aversion (gamma). The results are used to plot the “efficient frontier,” a curve that shows the best possible expected return for each level of risk.
  8. Plotting:

    • The efficient frontier is plotted, showing the relationship between risk and return for the optimized portfolios.

Important Notes:

  • Efficient Frontier: The efficient frontier is a key concept in portfolio theory. It represents the set of portfolios that offer the highest expected return for a given level of risk.
  • Quadratic Programming: The problem involves quadratic programming due to the risk term (cp.quad_form(w, Sigma)), which is a quadratic function of the weights.
  • Risk Aversion: By varying gamma, the level of risk aversion, you can explore how different trade-offs between risk and return affect the optimal portfolio.

This example demonstrates a more advanced use of $CVXPY$ in a practical financial context, incorporating quadratic programming and exploring the trade-offs in a multi-objective optimization problem.

Output:

The graph you’ve provided is an Efficient Frontier plot in the context of portfolio optimization.

Here’s a breakdown of what this graph represents:

1. Axes:

  • X-axis (Risk - Standard Deviation): This axis represents the risk associated with the portfolio, measured by the standard deviation of the portfolio’s returns. Higher standard deviation indicates higher risk.
  • Y-axis (Expected Return): This axis shows the expected return of the portfolio. Higher values indicate greater expected returns.

2. Efficient Frontier:

  • The curve itself represents the Efficient Frontier. This is a fundamental concept in portfolio theory. Each point on the curve corresponds to a portfolio that offers the highest expected return for a given level of risk.
  • Moving along the curve from left to right, the risk increases (standard deviation increases), and the expected return also increases. However, the rate of increase in return diminishes as risk increases, which reflects the diminishing returns of taking on additional risk.

3. Interpretation:

  • Leftmost part of the curve: This section represents portfolios with the lowest risk. These portfolios also tend to have lower returns, but they are considered the “safest” in terms of volatility.
  • Rightmost part of the curve: This section shows portfolios with higher risk and higher expected returns. However, these portfolios are more volatile and less predictable.
  • Middle of the curve: This section typically represents the best trade-offs between risk and return. These portfolios are often preferred by investors seeking an optimal balance between risk and return.

4. Why the Curve?:

  • The shape of the curve is due to the fact that increasing risk does not linearly increase expected returns. Initially, small increases in risk can lead to significant increases in return, but after a certain point, additional risk leads to only marginal increases in return.

5. Use in Investment Decisions:

  • Investors use the efficient frontier to choose a portfolio that matches their risk tolerance. For example:
    • A risk-averse investor might choose a portfolio on the lower left of the curve, accepting lower returns for lower risk.
    • A risk-seeking investor might choose a portfolio on the upper right, accepting higher risk for higher potential returns.

6. The Underlying Optimization:

  • The curve is derived from solving multiple optimization problems where the objective is to maximize expected return for a given level of risk or minimize risk for a given level of expected return. These problems are typically solved using mathematical optimization techniques like quadratic programming.

In summary, this graph visualizes the trade-offs between risk and return for a set of optimal portfolios.

Investors can use it to select a portfolio that best aligns with their financial goals and risk tolerance.

Seaborn in Python(3)

Seaborn in Python(3)

Here’s a more complicated visualization example using $Seaborn$, which includes a combination of different plot types, customized aesthetics, and advanced features like hue, size, and style mappings.

This graph will visualize the relationship between multiple variables in a dataset, offering insights into complex interactions.

Complex Seaborn Visualization: Multi-Variable Plot with Customizations

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Load the 'penguins' dataset from Seaborn
penguins = sns.load_dataset("penguins")

# Set the aesthetic style of the plots
sns.set_style("whitegrid")

# Create a scatter plot with different markers for species and a regression line
plt.figure(figsize=(12, 8))
sns.scatterplot(
data=penguins,
x="bill_length_mm",
y="bill_depth_mm",
hue="species",
style="island",
size="flipper_length_mm",
sizes=(20, 200),
palette="deep",
markers=["o", "s", "D"],
edgecolor="black"
)

# Add a regression line for each species
sns.regplot(
data=penguins,
x="bill_length_mm",
y="bill_depth_mm",
scatter=False,
color="gray",
line_kws={'lw': 1, 'linestyle': '--'}
)

# Customize the legend
plt.legend(
title='Species, Island, & Flipper Length',
title_fontsize='13',
fontsize='10',
loc='upper left',
bbox_to_anchor=(1, 1),
borderaxespad=0
)

# Add a title and labels
plt.title("Penguins: Bill Length vs. Bill Depth with Flipper Length and Island Information", fontsize=16)
plt.xlabel("Bill Length (mm)", fontsize=14)
plt.ylabel("Bill Depth (mm)", fontsize=14)

# Show the plot
plt.tight_layout()
plt.show()

Explanation of the Plot

  • Scatter Plot:
    The scatter plot displays the relationship between bill length and bill depth for penguins, with points representing individual penguins.
  • Hue:
    Different species of penguins are distinguished by different colors.
  • Style:
    Penguins from different islands are shown with different marker styles (circle, square, diamond).
  • Size:
    The size of the markers corresponds to the flipper length, with larger markers representing longer flippers.
  • Regression Line:
    A dashed regression line is overlaid to show the trend between bill length and depth for the entire dataset.
  • Legend:
    The legend provides information on species, island, and flipper length, helping to interpret the plot.

This visualization combines multiple layers of information into a single, interpretable plot, making it ideal for exploring complex datasets where multiple variables interact.

Output

Seaborn in Python(2)

Seaborn in Python

Here’s a more complex example using $Seaborn$ that involves multiple types of plots combined into a single figure.

This example demonstrates a $FacetGrid$ with customizations, including different plots for subsets of data and a combined heatmap with a categorical scatter plot.

1. FacetGrid with Multiple Plots

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import gridspec

# Load the Titanic dataset
titanic = sns.load_dataset("titanic")

# Create a FacetGrid showing survival rate across different classes and genders
g = sns.FacetGrid(titanic, row="sex", col="class", margin_titles=True, height=4)
g.map(sns.histplot, "age", bins=20, kde=True)

# Add a title
plt.subplots_adjust(top=0.9)
g.fig.suptitle('Survival Rate by Age, Sex, and Class')

# Show the plot
plt.show()

Output:

2. Combined Heatmap and Categorical Scatter Plot

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Load the flights dataset
flights = sns.load_dataset("flights")

# Pivot the dataset for a heatmap
flights_pivot = flights.pivot(index="month", columns="year", values="passengers")

# Set up the figure with a specific layout
fig = plt.figure(figsize=(14, 8))
gs = gridspec.GridSpec(2, 2, width_ratios=[3, 1], height_ratios=[1, 3])

# Heatmap on the left
ax0 = plt.subplot(gs[:, 0])
sns.heatmap(flights_pivot, annot=True, fmt="d", cmap="YlGnBu", ax=ax0)

# Categorical scatter plot (strip plot) on the right
ax1 = plt.subplot(gs[1, 1])
sns.stripplot(x="year", y="passengers", data=flights, jitter=True, ax=ax1)

# Add a title and labels
ax0.set_title('Flights Heatmap (Year vs Month)')
ax1.set_title('Yearly Passenger Distribution')
ax1.set_ylabel('')

# Show the plot
plt.tight_layout()
plt.show()

Output:

3. Violin Plot with a Swarm Plot Overlay

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
import seaborn as sns
import matplotlib.pyplot as plt

# Load the iris dataset
iris = sns.load_dataset("iris")

# Create a violin plot
sns.violinplot(x="species", y="petal_length", data=iris, inner=None, palette="Set2")

# Overlay with a swarm plot
sns.swarmplot(x="species", y="petal_length", data=iris, color="k", alpha=0.7)

# Add a title
plt.title("Violin and Swarm Plot of Iris Petal Length by Species")

# Show the plot
plt.show()

Output:

4. PairGrid with Custom Plots

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
import seaborn as sns
import matplotlib.pyplot as plt

# Load the tips dataset
tips = sns.load_dataset("tips")

# Create a PairGrid with different plots on the diagonal, upper, and lower triangles
g = sns.PairGrid(tips, diag_sharey=False)
g.map_upper(sns.scatterplot)
g.map_lower(sns.kdeplot, cmap="Blues_d")
g.map_diag(sns.histplot, kde_kws={"color": "k"})

# Add a title
g.fig.suptitle('PairGrid with Custom Plots', y=1.02)

# Show the plot
plt.show()

Output:


These examples demonstrate more advanced uses of $Seaborn$, including combining different plot types, customizing layouts, and visualizing complex datasets.

Seaborn in Python

Seaborn in Python

Here’s a sample code using $Seaborn$, a $Python$ visualization library based on $Matplotlib$, that creates a variety of plots.

This example demonstrates how to create a simple scatter plot with regression lines, a pair plot, and a heatmap.

1. Scatter Plot with Regression Line

1
2
3
4
5
6
7
8
9
10
11
import seaborn as sns
import matplotlib.pyplot as plt

# Load the example dataset for tips
tips = sns.load_dataset("tips")

# Create a scatter plot with a regression line
sns.lmplot(x="total_bill", y="tip", data=tips)

# Show the plot
plt.show()

Output:

2. Pair Plot

1
2
3
4
5
6
7
8
9
10
11
import seaborn as sns
import matplotlib.pyplot as plt

# Load the iris dataset
iris = sns.load_dataset("iris")

# Create a pair plot
sns.pairplot(iris, hue="species")

# Show the plot
plt.show()

Output:

3. Heatmap

1
2
3
4
5
6
7
8
9
10
11
12
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Create a random matrix
data = np.random.rand(10, 12)

# Create a heatmap
sns.heatmap(data, annot=True, cmap="coolwarm")

# Show the plot
plt.show()

Output:

4. Box Plot

1
2
3
4
5
6
7
8
9
10
11
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Titanic dataset
titanic = sns.load_dataset("titanic")

# Create a box plot
sns.boxplot(x="class", y="age", data=titanic, palette="Set3")

# Show the plot
plt.show()

Output:


These examples showcase the basics of using $Seaborn$ for data visualization.

The library provides a variety of other plot types and customization options to explore!

XGBoost in Python

XGBoost in Python

Here’s a basic example of how to use $XGBoost$ in $Python$ for a classification task.

This example uses the popular $Iris$ $dataset$.

Install XGBoost

If you don’t have $XGBoost$ installed, you can install it using pip:

1
pip install xgboost

Sample Code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert the dataset into DMatrix, which is XGBoost's internal data structure
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set parameters for XGBoost
params = {
'objective': 'multi:softmax', # Specify the loss function
'num_class': 3, # Number of classes in the dataset
'max_depth': 3, # Maximum depth of a tree
'eta': 0.3, # Step size shrinkage
'eval_metric': 'mlogloss' # Evaluation metric
}

# Train the model
num_rounds = 50 # Number of boosting rounds
bst = xgb.train(params, dtrain, num_rounds)

# Make predictions on the test set
y_pred = bst.predict(dtest)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

Explanation

  1. Loading the Dataset: We use the $Iris$ $dataset$, which is a common dataset for classification tasks.
    It contains three classes of flowers.

  2. Splitting the Data: The dataset is split into training and testing sets using train_test_split.

  3. DMatrix: $XGBoost$ uses its own data structure called DMatrix for training.
    It is more efficient and optimized for $XGBoost$ operations.

  4. Setting Parameters:

    • objective: Defines the learning task and the corresponding objective function.
      In this case, multi:softmax is used for multiclass classification.
    • num_class: Specifies the number of classes.
    • max_depth: The maximum depth of the trees.
    • eta: The learning rate.
  5. Training: The model is trained using the train function with the specified parameters and number of boosting rounds.

  6. Prediction: The trained model makes predictions on the test set, and the accuracy is calculated using accuracy_score.

This is a basic example, but $XGBoost$ offers a wide range of parameters and options that can be fine-tuned for different types of data and tasks.

Output

1
Accuracy: 100.00%

statsmodels in Python

statsmodels in Python

$statsmodels$ is a powerful $Python$ library used for statistical modeling and analysis.

It provides classes and functions for many statistical models, such as linear regression, generalized linear models, time series analysis, and more.

Here are some sample codes using $statsmodels$:

1. Ordinary Least Squares (OLS) Regression

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
import statsmodels.api as sm
import numpy as np
import pandas as pd

# Generate some example data
np.random.seed(0)
X = np.random.rand(100)
y = 2 * X + 1 + np.random.randn(100) * 0.1

# Add a constant term (intercept) to the independent variable
X = sm.add_constant(X)

# Fit an OLS model
model = sm.OLS(y, X)
results = model.fit()

# Print out the summary of the regression
print(results.summary())

Explanation:

  • X is the independent variable, and y is the dependent variable.
  • sm.add_constant(X) adds a constant (intercept) to the model.
  • model = sm.OLS(y, X) creates an OLS model.
  • results = model.fit() fits the model to the data.
  • results.summary() provides a detailed summary of the regression results.

Output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
                            OLS Regression Results                            
==============================================================================
Dep. Variable: y R-squared: 0.971
Model: OLS Adj. R-squared: 0.971
Method: Least Squares F-statistic: 3262.
Date: Mon, 12 Aug 2024 Prob (F-statistic): 4.88e-77
Time: 02:46:51 Log-Likelihood: 88.744
No. Observations: 100 AIC: -173.5
Df Residuals: 98 BIC: -168.3
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.0222 0.019 52.884 0.000 0.984 1.061
x1 1.9937 0.035 57.117 0.000 1.924 2.063
==============================================================================
Omnibus: 11.746 Durbin-Watson: 2.083
Prob(Omnibus): 0.003 Jarque-Bera (JB): 4.097
Skew: 0.138 Prob(JB): 0.129
Kurtosis: 2.047 Cond. No. 4.30
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

2. Logistic Regression

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
import statsmodels.api as sm
import numpy as np

# Generate some example data
np.random.seed(0)
n_samples = 100
X = np.random.rand(n_samples, 1)
X = sm.add_constant(X) # Add intercept

# Binary outcome (0 or 1)
y = (X[:, 1] + np.random.randn(n_samples) * 0.1 > 0.5).astype(int)

# Fit a logistic regression model
model = sm.Logit(y, X)
results = model.fit()

# Print out the summary of the logistic regression
print(results.summary())

Explanation:

  • y is a binary outcome ($0$ or $1$).
  • sm.Logit(y, X) creates a logistic regression model.
  • results = model.fit() fits the logistic regression model to the data.
  • results.summary() provides a summary of the logistic regression results.

Output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Optimization terminated successfully.
Current function value: 0.195832
Iterations 9
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 100
Model: Logit Df Residuals: 98
Method: MLE Df Model: 1
Date: Mon, 12 Aug 2024 Pseudo R-squ.: 0.7175
Time: 02:47:53 Log-Likelihood: -19.583
converged: True LL-Null: -69.315
Covariance Type: nonrobust LLR p-value: 1.999e-23
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -8.1375 1.924 -4.229 0.000 -11.909 -4.366
x1 16.9601 3.829 4.430 0.000 9.456 24.464
==============================================================================

3. Time Series Analysis using ARIMA

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import statsmodels.api as sm
import numpy as np

# Generate some example time series data
np.random.seed(0)
n_samples = 100
time_series_data = np.cumsum(np.random.randn(n_samples))

# Fit an ARIMA model (ARIMA(p, d, q))
model = sm.tsa.ARIMA(time_series_data, order=(1, 1, 1))
results = model.fit()

# Print out the summary of the ARIMA model
print(results.summary())

# Plot the forecast
forecast = results.get_forecast(steps=10)
forecast_index = np.arange(len(time_series_data), len(time_series_data) + 10)
forecast_mean = forecast.predicted_mean

import matplotlib.pyplot as plt
plt.plot(time_series_data, label='Original Time Series')
plt.plot(forecast_index, forecast_mean, label='Forecast')
plt.legend()
plt.show()

Explanation:

  • time_series_data is the time series data to model.
  • sm.tsa.ARIMA(time_series_data, order=(1, 1, 1)) creates an ARIMA model with specific orders for AR, differencing, and MA.
  • results.get_forecast(steps=10) forecasts the next $10$ steps in the time series.

Output:

4. ANOVA (Analysis of Variance)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
import statsmodels.api as sm
import pandas as pd
from statsmodels.formula.api import ols

# Create example data
np.random.seed(0)
df = pd.DataFrame({
'Group': np.repeat(['A', 'B', 'C'], 20),
'Value': np.concatenate([np.random.randn(20) + 1, np.random.randn(20), np.random.randn(20) - 1])
})

# Fit an OLS model with categorical variable
model = ols('Value ~ Group', data=df).fit()

# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

Explanation:

  • df contains a categorical variable Group and a continuous variable Value.
  • ols('Value ~ Group', data=df) fits an OLS model treating Group as a categorical variable.
  • sm.stats.anova_lm(model, typ=2) performs ANOVA on the fitted model.

Output:

1
2
3
             sum_sq    df          F        PR(>F)
Group 87.890846 2.0 43.127056 3.920858e-12
Residual 58.081616 57.0 NaN NaN

5. Autoregressive Model (AR)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg
import numpy as np
import matplotlib.pyplot as plt

# Generate some example time series data
np.random.seed(0)
n_samples = 100
time_series_data = np.cumsum(np.random.randn(n_samples))

# Fit an Autoregressive (AR) model using AutoReg
model = AutoReg(time_series_data, lags=1)
results = model.fit()

# Print out the summary of the AR model
print(results.summary())

# Plot the predictions
predictions = results.predict(start=90, end=109)

plt.plot(time_series_data, label='Original Time Series')
plt.plot(range(90, 110), predictions, label='Predicted')
plt.legend()
plt.show()

Explanation:

  • AutoReg(time_series_data, lags=1) creates an autoregressive model with a specified number of lags.
  • The lags parameter specifies how many previous time points to use for predicting the next value.
  • results.predict(start=90, end=109) generates predictions for the specified range.

Output:


These examples demonstrate the versatility of $statsmodels$ in performing various statistical analyses and generating useful insights from data.