GC content calculation across DNA sequences

I’ll create a bioinformatics example that analyzes DNA sequences and visualizes the results.

We’ll create a program that:

  1. Analyzes GC content across a DNA sequence using a sliding window
  2. Finds potential CpG islands (regions with high GC content)
  3. Visualizes the results
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
import numpy as np
import matplotlib.pyplot as plt

def calculate_gc_content(sequence, window_size=100, step_size=1):
"""
Calculate GC content using a sliding window approach

Args:
sequence (str): DNA sequence
window_size (int): Size of the sliding window
step_size (int): Step size for sliding the window

Returns:
tuple: Positions and corresponding GC contents
"""
positions = []
gc_contents = []

for i in range(0, len(sequence) - window_size + 1, step_size):
window = sequence[i:i + window_size]
gc_count = window.count('G') + window.count('C')
gc_content = (gc_count / window_size) * 100
positions.append(i)
gc_contents.append(gc_content)

return positions, gc_contents

def find_cpg_islands(positions, gc_contents, threshold=60):
"""
Find potential CpG islands based on GC content threshold

Args:
positions (list): List of positions
gc_contents (list): List of GC contents
threshold (float): GC content threshold for CpG islands

Returns:
list: Start positions of potential CpG islands
"""
islands = []
for i, (pos, gc) in enumerate(zip(positions, gc_contents)):
if gc > threshold:
if not islands or pos - islands[-1] > 100: # New island
islands.append(pos)
return islands

# Example DNA sequence (simulated)
np.random.seed(42)
bases = ['A', 'T', 'G', 'C']
sequence = ''.join(np.random.choice(bases, size=1000))

# Add a simulated CpG island
cpg_island = 'GC' * 50 # Create a GC-rich region
sequence = sequence[:400] + cpg_island + sequence[500:]

# Calculate GC content
positions, gc_contents = calculate_gc_content(sequence, window_size=100, step_size=10)

# Find CpG islands
cpg_islands = find_cpg_islands(positions, gc_contents)

# Plotting
plt.figure(figsize=(12, 6))
plt.plot(positions, gc_contents, 'b-', label='GC Content')
plt.axhline(y=60, color='r', linestyle='--', label='CpG Island Threshold')

# Mark CpG islands
for island in cpg_islands:
plt.axvline(x=island, color='g', alpha=0.3)

plt.xlabel('Position in Sequence')
plt.ylabel('GC Content (%)')
plt.title('DNA Sequence Analysis: GC Content and CpG Islands')
plt.legend()
plt.grid(True)
plt.show()

# Print statistics
print(f"Average GC content: {np.mean(gc_contents):.2f}%")
print(f"Number of potential CpG islands: {len(cpg_islands)}")
for i, pos in enumerate(cpg_islands, 1):
print(f"CpG island {i} starts at position {pos}")

Let me explain the code and its components:

  1. Key Functions:

    • calculate_gc_content: Calculates GC content using a sliding window approach
    • find_cpg_islands: Identifies regions with high GC content (potential CpG islands)
  2. Analysis Process:

    • We generate a random DNA sequence of $1000$ bases
    • Insert a simulated CpG island (GC-rich region) in the middle
    • Calculate GC content using a $100$bp window, sliding $10$bp at a time
    • Identify regions where GC content exceeds $60$% as potential CpG islands
  3. Visualization:

    • Blue line shows GC content across the sequence
    • Red dashed line indicates the CpG island threshold ($60$%)
    • Green vertical lines mark the start of potential CpG islands
    • The plot includes proper labels and a legend
  4. Statistics Output:

    Average GC content: 56.00%
    Number of potential CpG islands: 2
    CpG island 1 starts at position 300
    CpG island 2 starts at position 410
    
    • Calculates and displays average GC content
    • Shows the number and positions of potential CpG islands

This example demonstrates several important bioinformatics concepts:

  • Sequence composition analysis
  • Sliding window analysis
  • Feature detection (CpG islands)
  • Data visualization