Data science often begins with Python scripts, which are excellent for transforming messy data into meaningful insights but not always fast. This is especially true when running the same script across many different inputs: a classic embarrassingly parallel problem, where tasks can run independently without needing to communicate with each other. Enter GNU Parallel: a powerful tool that turns your single-threaded Python tasks into parallel operations, harnessing the full power of your CPU.
GNU Parallel is a shell tool for executing jobs in parallel on one or more computers. It is particularly easy to use for embarrassingly parallel tasks that would otherwise run sequentially: it parses multiple inputs and runs scripts or any CLI command against them in parallel, letting us use all available CPU cores. Because it sits on top of your existing Python scripts, you can get things done faster without rewriting any code.
Installing GNU Parallel is simple:
# Ubuntu / Debian
sudo apt-get install parallel
# macOS
brew install parallel
The basic syntax follows this pattern:
parallel [options] command ::: arguments
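To see the pattern in action, here is a minimal example that runs echo as three separate jobs (the -k flag keeps the output in input order):

$ parallel -k echo "Processing {}" ::: a b c
Processing a
Processing b
Processing c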
Important options:
- -j N specifies the number of jobs to run in parallel (the default is the number of CPU cores).
- --dry-run shows what would be executed without actually running the commands, handy during development and testing.
- --bar displays a progress bar.

Key replacement strings:
- {} represents the input argument.
- {.} represents the input with the file extension removed.
- {/} represents the input with the directory path removed.
- ::: separates the command from the input arguments.
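Several of these compose naturally. As a quick illustration (the CSV file names here are just placeholders), a dry run with four job slots prints the commands that would run, without executing them:

$ parallel --dry-run -j 4 cp {} backup/{/} ::: data/a.csv data/b.csv
cp data/a.csv backup/a.csv
cp data/b.csv backup/b.csv

Note how {} expands to the full input path while {/} keeps only the file name.

Let's create a simple Python script for a task data scientists encounter regularly: batch image resizing for computer vision projects.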
Here's a complete Python script that resizes images to a specified width and height, saving them to a new directory.
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "Pillow",
# ]
# ///
import sys
from pathlib import Path

from PIL import Image


def resize_image(input_path, output_path, target_width=224, target_height=224):
    """Resize an image to the target size with a center crop."""
    try:
        with Image.open(input_path) as img:
            # Convert to RGB if necessary (handles PNG with alpha, etc.)
            if img.mode != 'RGB':
                img = img.convert('RGB')
            # Scale so the image covers the target size, preserving aspect ratio
            scale = max(target_width / img.width, target_height / img.height)
            img = img.resize(
                (round(img.width * scale), round(img.height * scale)),
                Image.Resampling.LANCZOS,
            )
            # Center-crop to the exact target dimensions
            width, height = img.size
            left = (width - target_width) // 2
            top = (height - target_height) // 2
            img = img.crop((left, top, left + target_width, top + target_height))
            Path(output_path).parent.mkdir(parents=True, exist_ok=True)
            img.save(output_path, 'JPEG', quality=95)
        print(f"Success processing: {input_path} -> {output_path}")
        return True
    except Exception as e:
        print(f"Error processing {input_path}: {e}")
        return False


if __name__ == "__main__":
    if len(sys.argv) not in (3, 4, 5):
        print("Usage: uv run resize.py <input_path> <output_path> [width] [height]")
        print("Example: uv run resize.py input.jpg processed/output.jpg 300 300")
        sys.exit(1)

    input_path = sys.argv[1]
    output_path = sys.argv[2]
    # Optional width and height parameters (default 300x300)
    width = int(sys.argv[3]) if len(sys.argv) > 3 else 300
    height = int(sys.argv[4]) if len(sys.argv) > 4 else 300

    # Exit non-zero on failure so tools like GNU Parallel can detect failed jobs
    sys.exit(0 if resize_image(input_path, output_path, width, height) else 1)
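Assuming the script is saved as resize.py, a single invocation looks like this (the file names are illustrative):

uv run resize.py images/example.jpg processed/example.jpg 300 300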
# For example, say we have a folder of images to process:
$ for img in images/*.jpg; do
    echo "$img"
done
images/18152ecff8b937eb7eb5f88e64dcd1f134e77f0d.jpg
images/39bd7298aab2ac17c2adf313306aa1ea8bea1021.jpg
images/696466495ed4373790d8e2c24a6ccedd70f32c84.jpg
images/add5b3f12898bcbcc45cfb0ef5371c1b3d4df9e8.jpg
images/c2bbc96ece4ae546b503a57587d1035b82154815.jpg
images/c327aad0d8bc44b65dc2ea38a70a780dec36b350.jpg
images/cb61950871d5ec60c6c82a3e9fca9c43c2365f3d.jpg
images/ea163022f89ea6c10995c1c8ac0aae74db0541c6.jpg
images/eb268d7d7ae234c8b39fef70ad8c4d9c9cb06f29.jpg
images/ebda49b2cb5555f0896cb68e8f9832106a4991ce.jpg
Traditional serial processing:
for img in images/*.jpg; do
    # Resize each image to 300x300
    uv run resize.py "$img" "processed/$(basename -- "$img")" 300 300
done
GNU Parallel processing:
parallel uv run resize.py {} "processed/{/}" 300 300 ::: images/*.jpg
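The options from earlier apply here too; for example, to cap the run at four concurrent jobs and watch progress (four is an arbitrary choice for illustration):

parallel --bar -j 4 uv run resize.py {} "processed/{/}" 300 300 ::: images/*.jpg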
This scales beautifully with our compute capacity, allowing us to process as many images simultaneously as we have cores. Better still, GNU Parallel does not stop processing the remaining images when one of them fails.
# Example run when one of the images is corrupt
parallel uv run resize.py {} "processed/{/}" 300 300 ::: images/*.jpg
...
Success processing: images/c2bbc96ece4ae546b503a57587d1035b82154815.jpg -> processed/c2bbc96ece4ae546b503a57587d1035b82154815.jpg
Error processing images/corrupt.jpg: cannot identify image file 'images/corrupt.jpg'
Success processing: images/ea163022f89ea6c10995c1c8ac0aae74db0541c6.jpg -> processed/ea163022f89ea6c10995c1c8ac0aae74db0541c6.jpg
...
This pattern makes batch processing robust: failed images are logged, and the rest continue to be processed.
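For a machine-readable record of failures, GNU Parallel provides --joblog, which writes one line per job including its exit status, and --retry-failed, which reruns only the jobs that failed according to that log. Since our script exits non-zero on failure, a minimal sketch looks like this:

# Record per-job exit codes and runtimes in resize.log
parallel --joblog resize.log uv run resize.py {} "processed/{/}" 300 300 ::: images/*.jpg
# After replacing the corrupt files, rerun only the failed jobs
parallel --retry-failed --joblog resize.log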
GNU Parallel offers real advantages for data scientists: full use of every CPU core, no rewrites of existing Python scripts, and batches that keep running when individual jobs fail, all from a single command, parallel.