I’ll create a bioinformatics example that analyzes DNA sequences and visualizes the results.
We’ll create a program that:
- Analyzes GC content across a DNA sequence using a sliding window
- Finds potential CpG islands (regions with high GC content)
- Visualizes the results
1 | import numpy as np |
Let me explain the code and its components:
Key Functions:
calculate_gc_content
: Calculates GC content using a sliding window approachfind_cpg_islands
: Identifies regions with high GC content (potential CpG islands)
Analysis Process:
- We generate a random DNA sequence of $1000$ bases
- Insert a simulated CpG island (GC-rich region) in the middle
- Calculate GC content using a $100$bp window, sliding $10$bp at a time
- Identify regions where GC content exceeds $60$% as potential CpG islands
Visualization:
- Blue line shows GC content across the sequence
- Red dashed line indicates the CpG island threshold ($60$%)
- Green vertical lines mark the start of potential CpG islands
- The plot includes proper labels and a legend
Statistics Output:
Average GC content: 56.00% Number of potential CpG islands: 2 CpG island 1 starts at position 300 CpG island 2 starts at position 410
- Calculates and displays average GC content
- Shows the number and positions of potential CpG islands
This example demonstrates several important bioinformatics concepts:
- Sequence composition analysis
- Sliding window analysis
- Feature detection (CpG islands)
- Data visualization