PHP vs. Python vs. Perl -- Regular Expression Showdown

The Goal

I was in a discussion yesterday with one of my co-workers about the speed of SpamAssassin. We were talking about how slow it is and how to speed it up. He mentioned some optimizations in other languages, which got me wondering exactly what the speed differences would be in a test of PHP, Python and Perl. This writeup details the results of my tests of 5 different scripts on 2 different machines running 2 different distros of Linux.

The Hardware

I've run these tests on my “work” workstation, which is a Sun Ultra 20 running Gentoo Linux. From here on out, I will refer to this machine as the “Sun box”. The details are as follows.

CPU:    single, single-core AMD Opteron 2.6GHz
Memory: 2GB
OS:     Gentoo Linux

The second machine was my personal laptop, a Gateway MX6931 running Ubuntu Linux. From here on out, I will refer to this machine as the “Laptop”. The details are as follows.

CPU:    single, dual-core Intel Core2 1.66GHz
Memory: 2GB
OS:     Ubuntu 7.10

The Interpreters

Here is the output of the version command from each of our 3 interpreters on each of the machines. The first set is from the Sun box.

PHP:

$ php -v
PHP 5.2.5-pl1-gentoo (cli) (built: Jan  4 2008 12:35:35)
Copyright (c) 1997-2007 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies

Python:

$ python -V
Python 2.4.4

Perl:

$ perl -V
Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=linux, osvers=2.6.16-gentoo-r1, archname=i686-linux
    uname='linux kagome 2.6.16-gentoo-r1 #2 smp mon jun 5 19:01:24 cdt 2006 i686 amd athlon(tm) 64 x2 dual core processor 4200+ gnulinux '
    config_args='-des -Darchname=i686-linux -Dcccdlflags=-fPIC -Dccdlflags=-rdynamic -Dcc=i686-pc-linux-gnu-gcc -Dprefix=/usr -Dvendorprefix=/usr -Dsiteprefix=/usr -Dlocincpth=  -Doptimize=-O2 -march=i686 -pipe -Duselargefiles -Dd_semctl_semun -Dscriptdir=/usr/bin -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dinstallman1dir=/usr/share/man/man1 -Dinstallman3dir=/usr/share/man/man3 -Dman1ext=1 -Dman3ext=3pm -Dinc_version_list=5.8.0 5.8.0/i686-linux 5.8.2 5.8.2/i686-linux 5.8.4 5.8.4/i686-linux 5.8.5 5.8.5/i686-linux 5.8.6 5.8.6/i686-linux 5.8.7 5.8.7/i686-linux  -Dcf_by=Gentoo -Ud_csh -Dusenm -Di_ndbm -Di_gdbm -Di_db'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='i686-pc-linux-gnu-gcc', ccflags ='-fno-strict-aliasing -pipe -Wdeclaration-after-statement -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='-O2 -march=i686 -pipe',
    cppflags='-fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/include/gdbm'
    ccversion='', gccversion='4.1.1 (Gentoo 4.1.1)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='i686-pc-linux-gnu-gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.4.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.4'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'


Characteristics of this binary (from libperl):
  Compile-time options: PERL_MALLOC_WRAP USE_LARGE_FILES USE_PERLIO
  Built under linux
  Compiled at Jun 30 2006 17:21:05
  @INC:
    /etc/perl
    /usr/lib/perl5/vendor_perl/5.8.8/i686-linux
    /usr/lib/perl5/vendor_perl/5.8.8
    /usr/lib/perl5/vendor_perl
    /usr/lib/perl5/site_perl/5.8.8/i686-linux
    /usr/lib/perl5/site_perl/5.8.8
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/5.8.8/i686-linux
    /usr/lib/perl5/5.8.8
    /usr/local/lib/site_perl
    .

This is the same info from the laptop.

PHP:

$ php -v
PHP 5.2.3-1ubuntu6.3 (cli) (built: Jan 10 2008 09:38:37)
Copyright (c) 1997-2007 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies

Python:

$ python -V
Python 2.5.1

Perl:

$ perl -V
Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=linux, osvers=2.6.15.7, archname=i486-linux-gnu-thread-multi
    uname='linux terranova 2.6.15.7 #1 smp thu jul 12 14:27:56 utc 2007 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.8 -Darchlib=/usr/lib/perl/5.8 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.8 -Dsitearch=/usr/local/lib/perl/5.8.8 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Uusesfio -Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.8 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.6.1.so, so=so, useshrplib=true, libperl=libperl.so.5.8.8
    gnulibc_version='2.6.1'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'


Characteristics of this binary (from libperl):
  Compile-time options: MULTIPLICITY PERL_IMPLICIT_CONTEXT
                        PERL_MALLOC_WRAP THREADS_HAVE_PIDS USE_ITHREADS
                        USE_LARGE_FILES USE_PERLIO USE_REENTRANT_API
  Built under linux
  Compiled at Dec  4 2007 08:56:39
  @INC:
    /etc/perl
    /usr/local/lib/perl/5.8.8
    /usr/local/share/perl/5.8.8
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.8
    /usr/share/perl/5.8
    /usr/local/lib/site_perl
    .

The Scripts

I wrote a total of 5 scripts for this experiment: one for PHP, two for Python and two for Perl. The reason there are 2 Python scripts is that Python's regular expression library has an option to compile regular expressions prior to use. Therefore, I have one Python script which uses pre-compiled regular expressions and one that does not. The reason for the 2 Perl scripts is that the first one I wrote uses the same programmatic mechanism as the other languages: looping through an array of regular expression strings and using a string variable (m/$r/) in the regex matching. At the behest of one of my co-workers, I wrote a second version where all the regexes are hard coded in the match expression (m/regex code.*$/). The difference, as you will see, is quite dramatic.

Though these are different languages, I kept the execution of the scripts almost completely the same between them, with the exception of one of the Perl scripts. The basis of the scripts is that they all use a set of 5 different regular expressions to try to match against lines in an email logfile. More on the logfile later. If there is a match, a simple integer counter is incremented and the script moves on to the next line. Very basic, but also very real world. Parsing log files is definitely one of the major uses for interpreted languages, which is what this is about. The only thing I'm not doing is aggregating any kind of data from what I'm parsing, as I want this to be purely about the speed of the regular expression matching. Now, the source of the 5 scripts.

PHP

#!/usr/bin/php
<?php
$regexStrs = array(
    '#pop3d-ssl.\s+LOGIN.*?user=([^,]+).*?ip=.*?[^\d]((?:\d{1,3}\.){3}\d{1,3})#' ,
    '#postfix/smtpd.*NOQUEUE.*Client host \[([^\]]+)\].*zen.spamhaus.org#' ,
    '#postfix.*connect from ([^\[]*)\[([^\[]+)\]#' ,
    '#postfix.*lost connection#' ,
    '#postfix/virtual.*?: ([^:]+): to=<([^>]+)>.*delays=([^,]+),.*status=sent#' ,
);
 
$logfile = 'maillog';
$counter = 0;
 
$fh = fopen($logfile , 'r');
while (false !== ($line = fgets($fh))) {
    foreach ($regexStrs as $r) {
        if (preg_match($r , $line , $m)) {
            $counter++;
            break;
        }
    }
}
fclose($fh);
 
printf("Number of matches: %d\n" , $counter);
?>

Perl (interpolated string loop)

#!/usr/bin/perl
 
my @regexStrs = (
    'pop3d-ssl.\s+LOGIN.*?user=([^,]+).*?ip=.*?[^\d]((?:\d{1,3}\.){3}\d{1,3})' ,
    'postfix/smtpd.*NOQUEUE.*Client host \[([^\]]+)\].*zen.spamhaus.org' ,
    'postfix.*connect from ([^\[]*)\[([^\[]+)\]' ,
    'postfix.*lost connection' ,
    'postfix/virtual.*?: ([^:]+): to=<([^>]+)>.*delays=([^,]+),.*status=sent' ,
);
 
my $logfile = 'maillog';
my $counter = 0;
 
open(FH , $logfile);
while (my $line = <FH>) {
    foreach my $r (@regexStrs) {
        if ($line =~ m#$r#) {
            $counter++;
            last;
        }
    }
}
close(FH);
 
printf("Number of matches: %d\n" , $counter);

Perl (hard coded regexs)

#!/usr/bin/perl
 
my $logfile = 'maillog';
my $counter = 0;
 
open(FH , $logfile);
while (my $line = <FH>) {
    if ($line =~ m#pop3d-ssl.\s+LOGIN.*?user=([^,]+).*?ip=.*?[^\d]((?:\d{1,3}\.){3}\d{1,3})#) {
        $counter++;
    }
    elsif ($line =~ m#postfix/smtpd.*NOQUEUE.*Client host \[([^\]]+)\].*zen.spamhaus.org#) {
        $counter++;
    }
    elsif ($line =~ m#postfix.*connect from ([^\[]*)\[([^\[]+)\]#) {
        $counter++;
    }
    elsif ($line =~ m#postfix.*lost connection#) {
        $counter++;
    }
    elsif ($line =~ m#postfix/virtual.*?: ([^:]+): to=<([^>]+)>.*delays=([^,]+),.*status=sent#) {
        $counter++;
    }
}
close(FH);
 
printf("Number of matches: %d\n" , $counter);

Python (no pre-compiled R.E.)

#!/usr/bin/python
 
import re
 
regexStrs = (
    r'pop3d-ssl.\s+LOGIN.*?user=([^,]+).*?ip=.*?[^\d]((?:\d{1,3}\.){3}\d{1,3})' ,
    r'postfix/smtpd.*NOQUEUE.*Client host \[([^\]]+)\].*zen.spamhaus.org' ,
    r'postfix.*connect from ([^\[]*)\[([^\[]+)\]' ,
    r'postfix.*lost connection' ,
    r'postfix/virtual.*?: ([^:]+): to=<([^>]+)>.*delays=([^,]+),.*status=sent' ,
)
 
logfile = 'maillog'
counter = 0
 
fh = open(logfile)
while True:
    line = fh.readline()
    if not line: break
    for r in regexStrs:
        m = re.search(r , line)
        if m:
            counter += 1
            break
fh.close()
 
print 'Number of matches: %d' % counter

Python (pre-compiled R.E.)

#!/usr/bin/python
 
import re
 
regexStrs = (
    r'pop3d-ssl.\s+LOGIN.*?user=([^,]+).*?ip=.*?[^\d]((?:\d{1,3}\.){3}\d{1,3})' ,
    r'postfix/smtpd.*NOQUEUE.*Client host \[([^\]]+)\].*zen.spamhaus.org' ,
    r'postfix.*connect from ([^\[]*)\[([^\[]+)\]' ,
    r'postfix.*lost connection' ,
    r'postfix/virtual.*?: ([^:]+): to=<([^>]+)>.*delays=([^,]+),.*status=sent' ,
)
 
res = []
logfile = 'maillog'
counter = 0
 
for rs in regexStrs:
    res.append(re.compile(rs))
 
fh = open(logfile)
while True:
    line = fh.readline()
    if not line: break
    for r in res:
        m = r.search(line)
        if m:
            counter += 1
            break
fh.close()
 
print 'Number of matches: %d' % counter
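
To look at the compiled versus non-compiled difference in isolation, here is a minimal sketch in modern Python 3 (the scripts above are Python 2). It times one of the patterns from the scripts against a single made-up log line, which is an assumption for illustration and not taken from the real maillog. CPython's re module keeps an internal cache of compiled patterns, so re.search() with a pattern string does not recompile on every call, but it does pay for a cache lookup each time, which may account for a good part of the gap.

```python
import re
import timeit

# One of the patterns used in the scripts above.
PATTERN = r'postfix.*connect from ([^\[]*)\[([^\[]+)\]'

# Synthetic log line, invented for this sketch (not from the real maillog).
LINE = 'Jan  1 00:00:00 mail postfix/smtpd[1234]: connect from example.org[192.0.2.1]\n'

compiled = re.compile(PATTERN)


def search_by_string():
    """Module-level search: looks the pattern up in re's cache on every call."""
    return re.search(PATTERN, LINE)


def search_compiled():
    """Search with a pattern object that was compiled once, up front."""
    return compiled.search(LINE)


if __name__ == '__main__':
    n = 100000
    print('string  :', timeit.timeit(search_by_string, number=n))
    print('compiled:', timeit.timeit(search_compiled, number=n))
```

Both functions return the same match, so any timing difference comes from how the pattern is looked up rather than from the matching itself. The absolute numbers will depend on the machine and the Python version.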

The Testbed

The testbed was very simple. The testing script is just a simple shell script that runs the Unix time command on each of our 5 test scripts 5 consecutive times. The “maillog” file that was used was simply a compilation of a number of days of maillogs from my personal mail server. Nothing was altered in this file, and the same file was used in all tests, as you can see from the scripts above. The size of that file is 72886261 bytes (about 73MB). For reasons that should be obvious, I'm not going to post my personal maillogs here.

The source of the testbed shell script:

#!/bin/sh
 
scripts="re_compile.py re_nocompile.py re_nocompile.php re_nocompile.pl re_noninterpstr.pl"
 
for s in $scripts ; do
    for i in $(seq 5) ; do
        echo -n "Run $i for $s:  "
        time ./$s
    done
    echo
done
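
For anyone who would rather collect the timings programmatically than read them off the terminal, the same loop can be sketched in Python 3. This is only an illustration under stated assumptions, not what was used for the results below: the function name time_command is made up, the user and sys figures come from os.times() child-process counters (meaningful on Unix only), and the demo command is a placeholder.

```python
import os
import subprocess
import sys
import time


def time_command(argv, runs=5):
    """Run argv `runs` times; return a list of (real, user, sys) seconds."""
    results = []
    for _ in range(runs):
        before = os.times()
        start = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        real = time.perf_counter() - start
        after = os.times()
        # The children_* fields accumulate CPU time of waited-for child processes.
        user = after.children_user - before.children_user
        system = after.children_system - before.children_system
        results.append((real, user, system))
    return results


if __name__ == '__main__':
    # Placeholder command; swap in e.g. ['./re_compile.py'] to time a real script.
    for real, user, system in time_command([sys.executable, '-c', 'pass'], runs=3):
        print('real %.2fs  user %.2fs  sys %.2fs' % (real, user, system))
```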

And now, on to the results of the tests!

The Results

The tables below show the Unix time output (real, user and sys) of these tests on each machine.

The Sun Box

        PHP                     Python (non-compiled)   Python (compiled)       Perl (interpolated string loop)  Perl (hard coded regexes)
        Real    User    Sys     Real    User    Sys     Real    User    Sys     Real       User       Sys        Real    User    Sys
Test 1  9.45s   9.10s   0.07s   13.42s  12.66s  0.11s   7.97s   7.20s   0.11s   31.89s     29.43s     0.17s      1.59s   1.53s   0.05s
Test 2  9.90s   9.06s   0.06s   13.28s  12.45s  0.15s   7.86s   7.25s   0.01s   31.78s     29.97s     0.18s      1.67s   1.52s   0.07s
Test 3  9.58s   9.07s   0.06s   13.56s  12.59s  0.10s   7.45s   7.09s   0.13s   31.29s     29.61s     0.14s      2.32s   1.46s   0.04s
Test 4  9.52s   9.08s   0.10s   13.58s  12.63s  0.09s   7.40s   7.18s   0.04s   33.19s     30.27s     0.15s      1.76s   1.47s   0.04s
Test 5  9.94s   8.87s   0.12s   13.00s  12.44s  0.12s   7.43s   7.19s   0.08s   33.22s     30.22s     0.16s      1.82s   1.42s   0.10s

The Laptop

        PHP                     Python (non-compiled)   Python (compiled)       Perl (interpolated string loop)  Perl (hard coded regexes)
        Real    User    Sys     Real    User    Sys     Real    User    Sys     Real       User       Sys        Real    User    Sys
Test 1  14.25s  14.05s  0.10s   12.10s  11.98s  0.06s   6.12s   6.03s   0.07s   42.42s     42.11s     0.11s      1.63s   1.54s   0.08s
Test 2  14.00s  13.62s  0.08s   12.27s  11.91s  0.04s   6.17s   6.08s   0.06s   43.02s     42.72s     0.09s      1.71s   1.64s   0.05s
Test 3  14.24s  14.01s  0.14s   12.43s  12.29s  0.06s   6.14s   5.98s   0.08s   43.15s     42.67s     0.15s      1.71s   1.65s   0.06s
Test 4  13.94s  13.62s  0.08s   12.21s  11.93s  0.11s   6.30s   6.22s   0.05s   43.25s     43.00s     0.07s      1.63s   1.58s   0.05s
Test 5  14.30s  14.06s  0.14s   12.24s  12.08s  0.09s   6.32s   6.19s   0.08s   31.89s     42.60s     0.20s      1.61s   1.55s   0.05s

Just For Fun...

…one of my co-workers whipped up this C code which uses libpcre just to see how it would perform versus the interpreted languages. I'm not including it in the main results because this is a test of the speed of 3 interpreted languages, but I thought I would drop the results in here just for fun.

The Code

#include <stdio.h>
#include <string.h>
#include <err.h>
#include <pcre.h>
 
char *pat[] = {
    "pop3d-ssl.\\s+LOGIN.*?user=([^,]+).*?ip=.*?[^\\d]((?:\\d{1,3}\\.){3}\\d{1,3})" ,
    "postfix/smtpd.*NOQUEUE.*Client host \\[([^\\]]+)\\].*zen.spamhaus.org" ,
    "postfix.*connect from ([^\\[]*)\\[([^\\[]+)\\]" ,
    "postfix.*lost connection" ,
    "postfix/virtual.*?: ([^:]+): to=<([^>]+)>.*delays=([^,]+),.*status=sent" ,
    NULL
};
 
const char *logfile = "maillog";
int counter = 0;
 
pcre *re[5];
 
int main(void) {
        const char *err_txt;
        int i, err_offset, match, ovec[30];
        FILE *f;
        char *s, buf[1024];
 
        for (i = 0; pat[i]; i++) {
                re[i] = pcre_compile(pat[i], 0, &err_txt, &err_offset, 0);
                if (!re[i]) {
                        errx(1, "PCRE compile error at %d of %s: %s",
                                err_offset, pat[i], err_txt);
                }
        }
 
        if (!(f = fopen(logfile, "r"))) return 1;
 
        while ((s = fgets(buf, sizeof(buf), f))) {
                for (i = 0; pat[i]; i++) {
                        match = pcre_exec(re[i], 0, s, strlen(s),
                                        0, 0, ovec, 30);
                        if (match > 0) {
                                counter++;
                                break;
                        }
                }
        }
 
        fclose(f);
 
        printf("Number of matches: %d\n", counter);
 
        return 0;
}

The compile command:

$ cc -Wall -o pcretest pcretest.c -I/usr/local/include -L/usr/local/lib -lpcre

The Results

The Sun Box

        Real    User    Sys
Test 1  6.70s   6.44s   0.07s
Test 2  8.03s   7.83s   0.03s
Test 3  9.04s   6.84s   0.05s
Test 4  9.03s   6.53s   0.09s
Test 5  7.26s   6.63s   0.08s

The Laptop

        Real    User    Sys
Test 1  13.14s  12.92s  0.06s
Test 2  13.08s  12.88s  0.06s
Test 3  13.09s  12.94s  0.02s
Test 4  13.21s  13.00s  0.07s
Test 5  13.07s  12.88s  0.04s

Conclusion

Well, it appears that the non pre-compiled Python regexes are about on par with PHP. My Sun box running Gentoo was probably a bit faster at PHP because I'm running a more stripped down version of the php binary compiled specifically for my machine, rather than the generic i386 binary on the Ubuntu laptop.

The Python numbers are fairly consistent: the compiled versions are about twice as fast as the non-compiled versions on both machines.

I think that the most amazing thing here is the difference between the 2 Perl tests. If you use a scalar string variable as the regular expression, it's dog slow. However, if you hard code that string in the expression, it's lightning fast. I was not expecting this kind of a discrepancy at all, but I'm glad that I tested both approaches.
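
To put rough numbers on the two gaps discussed above, here is a small Python 3 sketch that averages the "Real" times from the Sun box table and divides them. The figures are copied from that table; the variable and function names are just for illustration.

```python
# Real times in seconds, copied from the Sun box results table.
sun_real = {
    'Python (non-compiled)': [13.42, 13.28, 13.56, 13.58, 13.00],
    'Python (compiled)': [7.97, 7.86, 7.45, 7.40, 7.43],
    'Perl (interpolated string loop)': [31.89, 31.78, 31.29, 33.19, 33.22],
    'Perl (hard coded regexes)': [1.59, 1.67, 2.32, 1.76, 1.82],
}


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


means = {name: mean(times) for name, times in sun_real.items()}

# How many times slower the first variant is than the second.
python_ratio = means['Python (non-compiled)'] / means['Python (compiled)']
perl_ratio = (means['Perl (interpolated string loop)']
              / means['Perl (hard coded regexes)'])

print('Python non-compiled / compiled: %.1fx' % python_ratio)
print('Perl interpolated / hard coded: %.1fx' % perl_ratio)
```

On these numbers the non-compiled Python script takes roughly 1.8 times as long as the compiled one, while the interpolated Perl loop takes roughly 17.6 times as long as the hard coded version.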

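Though I didn't include it in the official results, I thought it was kind of interesting that, on the Sun box, the compiled C program performed about the same as the Python program with pre-compiled regular expressions. On the laptop, the C program took about 13 seconds against roughly 6 seconds for compiled Python.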

I think the conclusion that I have to draw from this experiment is that Perl is your best choice, as is often the case, for a simple static regular expression based parser. On the other hand, if you wanted a more dynamic approach to the regular expressions that you are using (like loading them in from a file, command-line, etc.), compiled Python is definitely your best answer, but PHP is also a good candidate. It's pretty obvious that Perl is not the language to use in that particular case.
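
As an illustration of the "dynamic" case, here is a minimal Python 3 sketch that loads the regular expressions from a file, compiles each one once, and then counts matching lines the same way the test scripts do. The file format (one regex per line, blank lines ignored), the function names and the command-line usage are all assumptions made for this example.

```python
import re
import sys


def load_patterns(path):
    """Compile one regex per non-blank line of `path` (hypothetical file format)."""
    with open(path) as fh:
        return [re.compile(line.rstrip('\r\n')) for line in fh if line.strip()]


def count_matches(lines, patterns):
    """Count the lines matched by at least one pattern."""
    counter = 0
    for line in lines:
        # any() stops at the first pattern that matches, like `break` in the scripts.
        if any(p.search(line) for p in patterns):
            counter += 1
    return counter


if __name__ == '__main__' and len(sys.argv) == 3:
    # Hypothetical usage: python3 count.py patterns.txt maillog
    patterns = load_patterns(sys.argv[1])
    with open(sys.argv[2]) as log:
        print('Number of matches: %d' % count_matches(log, patterns))
```

Compiling once at load time keeps the per-line work down to the matching itself, which is the approach the pre-compiled Python script takes.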

Please, feel free to post to the discussion here in answer to this writeup.