Polars in Python

Polars is a DataFrame library for Python, known for its speed and efficiency, especially with large datasets.

The sample below demonstrates some key features of Polars: reading data, manipulating it, and computing aggregations.

We'll use a small sample dataset to illustrate these features.

Sample Data

Let’s assume we have a CSV file named data.csv with the following content:

id,name,age,salary,department
1,Alice,30,70000,HR
2,Bob,40,80000,Engineering
3,Charlie,25,65000,Marketing
4,Diana,35,72000,HR
5,Edward,50,90000,Engineering
6,Frank,45,85000,Marketing
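
To follow along without creating the file by hand, the sample can be written to disk first. This is a minimal sketch using only the standard library; the `CSV_TEXT` and `csv_path` names are illustrative and not part of the original example.

```python
from pathlib import Path

# Same rows as the sample above; the file name data.csv matches the example.
CSV_TEXT = """\
id,name,age,salary,department
1,Alice,30,70000,HR
2,Bob,40,80000,Engineering
3,Charlie,25,65000,Marketing
4,Diana,35,72000,HR
5,Edward,50,90000,Engineering
6,Frank,45,85000,Marketing
"""

# Write the file into the current working directory.
csv_path = Path("data.csv")
csv_path.write_text(CSV_TEXT, encoding="utf-8")

# Read it back to confirm the header plus six data rows were written.
lines = csv_path.read_text(encoding="utf-8").splitlines()
print(f"Wrote {len(lines) - 1} data rows to {csv_path}")
```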

Code Example

import polars as pl

# Read CSV file into a Polars DataFrame
df = pl.read_csv('data.csv')

# Display the DataFrame
print("Original DataFrame:")
print(df)

# Select specific columns
selected_df = df.select(['name', 'age', 'salary'])
print("\nSelected Columns (name, age, salary):")
print(selected_df)

# Filter rows where age > 30
filtered_df = df.filter(pl.col('age') > 30)
print("\nFiltered DataFrame (age > 30):")
print(filtered_df)

# Add a new column 'bonus' which is 10% of the salary
df = df.with_columns((pl.col('salary') * 0.1).alias('bonus'))
print("\nDataFrame with Bonus Column:")
print(df)

# Group by 'department' and calculate the average salary and total bonus.
# maintain_order=True keeps departments in order of first appearance.
grouped_df = df.group_by('department', maintain_order=True).agg([
    pl.col('salary').mean().alias('avg_salary'),
    pl.col('bonus').sum().alias('total_bonus'),
])
print("\nGrouped by Department with Aggregations:")
print(grouped_df)

# Sort the DataFrame by 'salary' in descending order
sorted_df = df.sort('salary', descending=True)
print("\nSorted DataFrame by Salary (Descending):")
print(sorted_df)

Explanation

  1. Reading CSV:

    • pl.read_csv('data.csv') reads the CSV file into a Polars DataFrame.
  2. Displaying the DataFrame:

    • print(df) shows the original DataFrame.
  3. Selecting Specific Columns:

    • df.select(['name', 'age', 'salary']) selects only the name, age, and salary columns.
  4. Filtering Rows:

    • df.filter(pl.col('age') > 30) keeps only the rows where age is greater than 30.
  5. Adding a New Column:

    • df.with_columns((pl.col('salary') * 0.1).alias('bonus')) adds a new column bonus equal to 10% of the salary. (with_columns replaces the older with_column, which has been removed from recent Polars releases.)
  6. Grouping and Aggregating:

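    • df.group_by('department', maintain_order=True).agg([pl.col('salary').mean().alias('avg_salary'), pl.col('bonus').sum().alias('total_bonus')]) groups the DataFrame by department and calculates the average salary and total bonus for each department. (group_by was named groupby in older Polars releases.) maintain_order=True keeps the groups in order of first appearance; without it, the row order of the result is not guaranteed.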
  7. Sorting:

    • df.sort('salary', descending=True) sorts the DataFrame by salary in descending order. (Older Polars releases spelled this argument reverse=True.)

Output

The output will look something like this (the exact border characters vary between Polars versions):

Original DataFrame:
shape: (6, 5)
┌─────┬─────────┬─────┬────────┬─────────────┐
│ id  │ name    │ age │ salary │ department  │
│ --- │ ---     │ --- │ ---    │ ---         │
│ i64 │ str     │ i64 │ i64    │ str         │
├─────┼─────────┼─────┼────────┼─────────────┤
│ 1   │ Alice   │ 30  │ 70000  │ HR          │
│ 2   │ Bob     │ 40  │ 80000  │ Engineering │
│ 3   │ Charlie │ 25  │ 65000  │ Marketing   │
│ 4   │ Diana   │ 35  │ 72000  │ HR          │
│ 5   │ Edward  │ 50  │ 90000  │ Engineering │
│ 6   │ Frank   │ 45  │ 85000  │ Marketing   │
└─────┴─────────┴─────┴────────┴─────────────┘

Selected Columns (name, age, salary):
shape: (6, 3)
┌─────────┬─────┬────────┐
│ name    │ age │ salary │
│ ---     │ --- │ ---    │
│ str     │ i64 │ i64    │
├─────────┼─────┼────────┤
│ Alice   │ 30  │ 70000  │
│ Bob     │ 40  │ 80000  │
│ Charlie │ 25  │ 65000  │
│ Diana   │ 35  │ 72000  │
│ Edward  │ 50  │ 90000  │
│ Frank   │ 45  │ 85000  │
└─────────┴─────┴────────┘

Filtered DataFrame (age > 30):
shape: (4, 5)
┌─────┬─────────┬─────┬────────┬─────────────┐
│ id  │ name    │ age │ salary │ department  │
│ --- │ ---     │ --- │ ---    │ ---         │
│ i64 │ str     │ i64 │ i64    │ str         │
├─────┼─────────┼─────┼────────┼─────────────┤
│ 2   │ Bob     │ 40  │ 80000  │ Engineering │
│ 4   │ Diana   │ 35  │ 72000  │ HR          │
│ 5   │ Edward  │ 50  │ 90000  │ Engineering │
│ 6   │ Frank   │ 45  │ 85000  │ Marketing   │
└─────┴─────────┴─────┴────────┴─────────────┘

DataFrame with Bonus Column:
shape: (6, 6)
┌─────┬─────────┬─────┬────────┬─────────────┬────────┐
│ id  │ name    │ age │ salary │ department  │ bonus  │
│ --- │ ---     │ --- │ ---    │ ---         │ ---    │
│ i64 │ str     │ i64 │ i64    │ str         │ f64    │
├─────┼─────────┼─────┼────────┼─────────────┼────────┤
│ 1   │ Alice   │ 30  │ 70000  │ HR          │ 7000.0 │
│ 2   │ Bob     │ 40  │ 80000  │ Engineering │ 8000.0 │
│ 3   │ Charlie │ 25  │ 65000  │ Marketing   │ 6500.0 │
│ 4   │ Diana   │ 35  │ 72000  │ HR          │ 7200.0 │
│ 5   │ Edward  │ 50  │ 90000  │ Engineering │ 9000.0 │
│ 6   │ Frank   │ 45  │ 85000  │ Marketing   │ 8500.0 │
└─────┴─────────┴─────┴────────┴─────────────┴────────┘

Grouped by Department with Aggregations:
shape: (3, 3)
┌─────────────┬────────────┬─────────────┐
│ department  │ avg_salary │ total_bonus │
│ ---         │ ---        │ ---         │
│ str         │ f64        │ f64         │
├─────────────┼────────────┼─────────────┤
│ HR          │ 71000.0    │ 14200.0     │
│ Engineering │ 85000.0    │ 17000.0     │
│ Marketing   │ 75000.0    │ 15000.0     │
└─────────────┴────────────┴─────────────┘

Sorted DataFrame by Salary (Descending):
shape: (6, 6)
┌─────┬─────────┬─────┬────────┬─────────────┬────────┐
│ id  │ name    │ age │ salary │ department  │ bonus  │
│ --- │ ---     │ --- │ ---    │ ---         │ ---    │
│ i64 │ str     │ i64 │ i64    │ str         │ f64    │
├─────┼─────────┼─────┼────────┼─────────────┼────────┤
│ 5   │ Edward  │ 50  │ 90000  │ Engineering │ 9000.0 │
│ 6   │ Frank   │ 45  │ 85000  │ Marketing   │ 8500.0 │
│ 2   │ Bob     │ 40  │ 80000  │ Engineering │ 8000.0 │
│ 4   │ Diana   │ 35  │ 72000  │ HR          │ 7200.0 │
│ 1   │ Alice   │ 30  │ 70000  │ HR          │ 7000.0 │
│ 3   │ Charlie │ 25  │ 65000  │ Marketing   │ 6500.0 │
└─────┴─────────┴─────┴────────┴─────────────┴────────┘
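
As a cross-check on the grouped figures above, the same averages and bonus totals can be recomputed with the standard library alone. This is only a sketch for verifying the arithmetic, not a description of how Polars computes them; the `by_department` and `summary` names are illustrative.

```python
from collections import defaultdict

# (department, salary) pairs copied from data.csv.
rows = [
    ("HR", 70000), ("Engineering", 80000), ("Marketing", 65000),
    ("HR", 72000), ("Engineering", 90000), ("Marketing", 85000),
]

# Collect the salaries of each department.
by_department = defaultdict(list)
for department, salary in rows:
    by_department[department].append(salary)

# Average salary and total bonus (10% of each salary) per department.
summary = {}
for department, salaries in by_department.items():
    avg_salary = sum(salaries) / len(salaries)
    total_bonus = sum(salary * 0.1 for salary in salaries)
    summary[department] = (avg_salary, total_bonus)
    print(f"{department}: avg_salary={avg_salary:.1f}, total_bonus={total_bonus:.1f}")
```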

This example covers some fundamental operations with the Polars library, making it a useful starting point for data manipulation and analysis tasks.