Data science often begins with Python scripts, which are excellent for transforming messy data into meaningful insights but not always fast. This is especially true when running the same script across many different inputs: a classic embarrassingly parallel problem, where tasks can run independently without needing to communicate with each other. Enter GNU Parallel: a powerful tool that turns your single-threaded Python tasks into parallel operations, harnessing the full power of your CPU.
GNU Parallel is a shell tool for executing jobs in parallel on one or more computers. It is particularly easy to use for embarrassingly parallel tasks that would otherwise run sequentially: it parses multiple inputs and runs scripts or any CLI command against them in parallel, letting us use all available CPU cores. Because it sits on top of your existing Python scripts, you can get things done faster without rewriting any code.
Installing GNU Parallel is simple:
# Ubuntu / Debian
sudo apt-get install parallel
# macOS
brew install parallel
The basic syntax follows this pattern:
parallel [options] command ::: arguments
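To see the pattern in action, here is a minimal example that runs echo as three separate jobs (the -k flag keeps the output in input order):

$ parallel -k echo "Processing {}" ::: a b c
Processing a
Processing b
Processing c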
Important options:
- -j N specifies the number of jobs to run in parallel (the default is the number of CPU cores).
- --dry-run shows what would be executed without actually running the commands, handy during development and testing.
- --bar displays a progress bar.

Key replacement strings:
- {} represents the input argument.
- {.} represents the input with the file extension removed.
- {/} represents the input with the directory path removed.
- ::: separates the command from the input arguments.
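Several of these compose naturally. As a quick illustration (the CSV file names here are just placeholders), a dry run with four job slots prints the commands that would run, without executing them:

$ parallel --dry-run -j 4 cp {} backup/{/} ::: data/a.csv data/b.csv
cp data/a.csv backup/a.csv
cp data/b.csv backup/b.csv

Note how {} expands to the full input path while {/} keeps only the file name.

Let's create a simple Python script for a task data scientists encounter regularly: batch image resizing for computer vision projects.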
Here's a complete Python script that resizes images to a specified width and height, saving them to a new directory.
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "Pillow",
# ]
# ///
import sys
from pathlib import Path

from PIL import Image


def resize_image(input_path, output_path, target_width=224, target_height=224):
    """Resize an image to the target size with a center crop."""
    try:
        with Image.open(input_path) as img:
            # Convert to RGB if necessary (handles PNG with alpha, etc.)
            if img.mode != 'RGB':
                img = img.convert('RGB')
            # Scale so the image covers the target size, preserving aspect ratio
            scale = max(target_width / img.width, target_height / img.height)
            img = img.resize(
                (round(img.width * scale), round(img.height * scale)),
                Image.Resampling.LANCZOS,
            )
            # Center-crop to the exact target dimensions
            width, height = img.size
            left = (width - target_width) // 2
            top = (height - target_height) // 2
            img = img.crop((left, top, left + target_width, top + target_height))
            Path(output_path).parent.mkdir(parents=True, exist_ok=True)
            img.save(output_path, 'JPEG', quality=95)
        print(f"Success processing: {input_path} -> {output_path}")
        return True
    except Exception as e:
        print(f"Error processing {input_path}: {e}")
        return False


if __name__ == "__main__":
    if len(sys.argv) not in (3, 4, 5):
        print("Usage: uv run resize.py <input_path> <output_path> [width] [height]")
        print("Example: uv run resize.py input.jpg processed/output.jpg 300 300")
        sys.exit(1)

    input_path = sys.argv[1]
    output_path = sys.argv[2]
    # Optional width and height parameters (default 300x300)
    width = int(sys.argv[3]) if len(sys.argv) > 3 else 300
    height = int(sys.argv[4]) if len(sys.argv) > 4 else 300

    # Exit non-zero on failure so tools like GNU Parallel can detect failed jobs
    sys.exit(0 if resize_image(input_path, output_path, width, height) else 1)
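Assuming the script is saved as resize.py, a single invocation looks like this (the file names are illustrative):

uv run resize.py images/example.jpg processed/example.jpg 300 300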
# For example, say we have a folder of images to process:
$ for img in images/*.jpg; do
    echo "$img"
done
images/18152ecff8b937eb7eb5f88e64dcd1f134e77f0d.jpg
images/39bd7298aab2ac17c2adf313306aa1ea8bea1021.jpg
images/696466495ed4373790d8e2c24a6ccedd70f32c84.jpg
images/add5b3f12898bcbcc45cfb0ef5371c1b3d4df9e8.jpg
images/c2bbc96ece4ae546b503a57587d1035b82154815.jpg
images/c327aad0d8bc44b65dc2ea38a70a780dec36b350.jpg
images/cb61950871d5ec60c6c82a3e9fca9c43c2365f3d.jpg
images/ea163022f89ea6c10995c1c8ac0aae74db0541c6.jpg
images/eb268d7d7ae234c8b39fef70ad8c4d9c9cb06f29.jpg
images/ebda49b2cb5555f0896cb68e8f9832106a4991ce.jpg
Traditional serial processing:
for img in images/*.jpg; do
    # Resize each image to 300x300
    uv run resize.py "$img" "processed/$(basename -- "$img")" 300 300
done
GNU Parallel processing:
parallel uv run resize.py {} "processed/{/}" 300 300 ::: images/*.jpg
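The options from earlier apply here too; for example, to cap the run at four concurrent jobs and watch progress (four is an arbitrary choice for illustration):

parallel --bar -j 4 uv run resize.py {} "processed/{/}" 300 300 ::: images/*.jpg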
This scales beautifully with our compute capacity, allowing us to process as many images simultaneously as we have cores. Better still, GNU Parallel does not stop processing the remaining images when one of them fails.
# Example run when one of the images is corrupt
parallel uv run resize.py {} "processed/{/}" 300 300 ::: images/*.jpg
...
Success processing: images/c2bbc96ece4ae546b503a57587d1035b82154815.jpg -> processed/c2bbc96ece4ae546b503a57587d1035b82154815.jpg
Error processing images/corrupt.jpg: cannot identify image file 'images/corrupt.jpg'
Success processing: images/ea163022f89ea6c10995c1c8ac0aae74db0541c6.jpg -> processed/ea163022f89ea6c10995c1c8ac0aae74db0541c6.jpg
...
This pattern makes batch processing robust: failed images are logged, and the rest continue to be processed.
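For a machine-readable record of failures, GNU Parallel provides --joblog, which writes one line per job including its exit status, and --retry-failed, which reruns only the jobs that failed according to that log. Since our script exits non-zero on failure, a minimal sketch looks like this:

# Record per-job exit codes and runtimes in resize.log
parallel --joblog resize.log uv run resize.py {} "processed/{/}" 300 300 ::: images/*.jpg
# After replacing the corrupt files, rerun only the failed jobs
parallel --retry-failed --joblog resize.log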
GNU Parallel offers real advantages for data scientists: full use of every CPU core, no rewrites of existing Python scripts, and batches that keep running when individual jobs fail, all from a single command, parallel.