PHP vs. Python vs. Perl -- Regular Expression Showdown

The Goal

I was in a discussion yesterday with one of my co-workers about the speed of SpamAssassin. We were talking about how slow it is and how it might be sped up. He mentioned some optimizations in other languages, which got me wondering exactly what the speed differences would be in a test of PHP, Python and Perl. This writeup details the results of my tests of 5 different scripts on 2 different machines running 2 different distros of Linux.

The Hardware

I've run these tests on my “work” workstation, which is a Sun Ultra 20 running Gentoo Linux. From here on out, I will refer to this machine as the “Sun box”. The details are as follows.

CPU:    single, single core AMD Opteron 2.6GHz
Memory: 2GB
OS:     Gentoo Linux

The second machine was my personal laptop, a Gateway MX6931 running Ubuntu Linux. From here on out, I will refer to this machine as the “Laptop”. The details are as follows.

CPU:    single, dual core Intel Core2 1.66GHz
Memory: 2GB
OS:     Ubuntu 6.10

The Interpreters

Here is the output of the version command from each of our 3 interpreters on each of the machines. The first set is from the Sun box.

PHP:

$ php -v
PHP 5.2.5-pl1-gentoo (cli) (built: Jan  4 2008 12:35:35)
Copyright (c) 1997-2007 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies

Python:

$ python -V
Python 2.4.4

Perl:

$ perl -V
Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=linux, osvers=2.6.16-gentoo-r1, archname=i686-linux
    uname='linux kagome 2.6.16-gentoo-r1 #2 smp mon jun 5 19:01:24 cdt 2006 i686 amd athlon(tm) 64 x2 dual core processor 4200+ gnulinux '
    config_args='-des -Darchname=i686-linux -Dcccdlflags=-fPIC -Dccdlflags=-rdynamic -Dcc=i686-pc-linux-gnu-gcc -Dprefix=/usr -Dvendorprefix=/usr -Dsiteprefix=/usr -Dlocincpth=  -Doptimize=-O2 -march=i686 -pipe -Duselargefiles -Dd_semctl_semun -Dscriptdir=/usr/bin -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dinstallman1dir=/usr/share/man/man1 -Dinstallman3dir=/usr/share/man/man3 -Dman1ext=1 -Dman3ext=3pm -Dinc_version_list=5.8.0 5.8.0/i686-linux 5.8.2 5.8.2/i686-linux 5.8.4 5.8.4/i686-linux 5.8.5 5.8.5/i686-linux 5.8.6 5.8.6/i686-linux 5.8.7 5.8.7/i686-linux  -Dcf_by=Gentoo -Ud_csh -Dusenm -Di_ndbm -Di_gdbm -Di_db'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='i686-pc-linux-gnu-gcc', ccflags ='-fno-strict-aliasing -pipe -Wdeclaration-after-statement -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='-O2 -march=i686 -pipe',
    cppflags='-fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/include/gdbm'
    ccversion='', gccversion='4.1.1 (Gentoo 4.1.1)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='i686-pc-linux-gnu-gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.4.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.4'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'


Characteristics of this binary (from libperl):
  Compile-time options: PERL_MALLOC_WRAP USE_LARGE_FILES USE_PERLIO
  Built under linux
  Compiled at Jun 30 2006 17:21:05
  @INC:
    /etc/perl
    /usr/lib/perl5/vendor_perl/5.8.8/i686-linux
    /usr/lib/perl5/vendor_perl/5.8.8
    /usr/lib/perl5/vendor_perl
    /usr/lib/perl5/site_perl/5.8.8/i686-linux
    /usr/lib/perl5/site_perl/5.8.8
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/5.8.8/i686-linux
    /usr/lib/perl5/5.8.8
    /usr/local/lib/site_perl
    .

This is the same info from the laptop.

PHP:

$ php -v
PHP 5.2.3-1ubuntu6.3 (cli) (built: Jan 10 2008 09:38:37)
Copyright (c) 1997-2007 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies

Python:

$ python -V
Python 2.5.1

Perl:

$ perl -V
Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=linux, osvers=2.6.15.7, archname=i486-linux-gnu-thread-multi
    uname='linux terranova 2.6.15.7 #1 smp thu jul 12 14:27:56 utc 2007 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.8 -Darchlib=/usr/lib/perl/5.8 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.8 -Dsitearch=/usr/local/lib/perl/5.8.8 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Uusesfio -Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.8 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.6.1.so, so=so, useshrplib=true, libperl=libperl.so.5.8.8
    gnulibc_version='2.6.1'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'


Characteristics of this binary (from libperl):
  Compile-time options: MULTIPLICITY PERL_IMPLICIT_CONTEXT
                        PERL_MALLOC_WRAP THREADS_HAVE_PIDS USE_ITHREADS
                        USE_LARGE_FILES USE_PERLIO USE_REENTRANT_API
  Built under linux
  Compiled at Dec  4 2007 08:56:39
  @INC:
    /etc/perl
    /usr/local/lib/perl/5.8.8
    /usr/local/share/perl/5.8.8
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.8
    /usr/share/perl/5.8
    /usr/local/lib/site_perl
    .

The Scripts

I wrote a total of 5 scripts for this experiment. There is one for each interpreter, plus one additional script each for Python and Perl. The reason there are 2 Python scripts is that Python's regular expression library offers the option to compile regular expressions prior to use. Therefore, I have one Python script which uses pre-compiled regular expressions and one that does not. The reason for the 2 Perl scripts is that the first one I wrote uses the same programmatic mechanism as the other languages: it loops through an array of regular expression strings and interpolates a string variable into the match (m#$r#). At the behest of one of my co-workers, I wrote a second version where all the regexes are hard coded in the match expression (m/regex code.*$/). The difference, as you will see, is quite dramatic.
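To make the compiled versus non-compiled distinction concrete, here is a minimal sketch. It is written in Python 3 syntax, whereas the benchmark scripts in this writeup target Python 2. The sample log lines are made up for illustration and are not taken from the real maillog, and the helper names count_uncompiled and count_compiled are hypothetical. Only two of the five patterns are used to keep it short.

```python
import re
import timeit

# Two of the five patterns used by the benchmark scripts.
PATTERNS = (
    r'postfix.*connect from ([^\[]*)\[([^\[]+)\]',
    r'postfix.*lost connection',
)

# Made-up sample lines (assumption: NOT taken from the real maillog).
SAMPLE_LINES = [
    'Jan  1 00:00:01 mx postfix/smtpd[123]: connect from example.org[192.0.2.1]',
    'Jan  1 00:00:02 mx postfix/smtpd[123]: lost connection after RCPT',
    'Jan  1 00:00:03 mx cron[456]: unrelated line',
] * 1000


def count_uncompiled(lines):
    """Pass the pattern string to re.search() on every call."""
    counter = 0
    for line in lines:
        for pattern in PATTERNS:
            if re.search(pattern, line):
                counter += 1
                break
    return counter


# Compile each pattern once, up front.
COMPILED = [re.compile(p) for p in PATTERNS]


def count_compiled(lines):
    """Reuse the pattern objects compiled above."""
    counter = 0
    for line in lines:
        for regex in COMPILED:
            if regex.search(line):
                counter += 1
                break
    return counter


if __name__ == '__main__':
    for fn in (count_uncompiled, count_compiled):
        seconds = timeit.timeit(lambda: fn(SAMPLE_LINES), number=5)
        print('%s: %d matches, %.3fs for 5 passes'
              % (fn.__name__, fn(SAMPLE_LINES), seconds))
```

Both functions count the same lines; only the way the pattern reaches the regex engine differs. In CPython the re module caches recently compiled patterns, so the uncompiled variant mostly pays for a cache lookup on every call rather than a full recompilation. Timings from this sketch will not be comparable to the numbers in this writeup.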

Though these are different languages, I kept the execution of the scripts almost completely the same between them, with the exception of one of the Perl scripts. The basis of the scripts is that they all use a set of 5 different regular expressions in an array to try to match against lines in an email logfile. More on the logfile later. If there is a match, a simple integer counter is incremented and the script moves on to the next line. Very basic, but also very real world. Parsing log files is definitely one of the major uses for interpreted languages, which is what this is about. The only thing I'm not doing is aggregating any kind of data from what I'm parsing, as I want this to be purely about the speed of the regular expression matching. Now, the source of the 5 scripts.

PHP

#!/usr/bin/php
<?php
$regexStrs = array(
    '#pop3d-ssl.\s+LOGIN.*?user=([^,]+).*?ip=.*?[^\d]((?:\d{1,3}\.){3}\d{1,3})#' ,
    '#postfix/smtpd.*NOQUEUE.*Client host \[([^\]]+)\].*zen.spamhaus.org#' ,
    '#postfix.*connect from ([^\[]*)\[([^\[]+)\]#' ,
    '#postfix.*lost connection#' ,
    '#postfix/virtual.*?: ([^:]+): to=<([^>]+)>.*delays=([^,]+),.*status=sent#' ,
);
 
$logfile = 'maillog';
$counter = 0;
 
$fh = fopen($logfile , 'r');
while (false !== ($line = fgets($fh))) {
    foreach ($regexStrs as $r) {
        // Count the line on the first matching pattern, then move on
        if (preg_match($r , $line , $m)) {
            $counter++;
            break;
        }
    }
}
fclose($fh);
 
printf("Number of matches: %d\n" , $counter);
?>

Perl (interpolated string loop)

#!/usr/bin/perl
 
my @regexStrs = (
    'pop3d-ssl.\s+LOGIN.*?user=([^,]+).*?ip=.*?[^\d]((?:\d{1,3}\.){3}\d{1,3})' ,
    'postfix/smtpd.*NOQUEUE.*Client host \[([^\]]+)\].*zen.spamhaus.org' ,
    'postfix.*connect from ([^\[]*)\[([^\[]+)\]' ,
    'postfix.*lost connection' ,
    'postfix/virtual.*?: ([^:]+): to=<([^>]+)>.*delays=([^,]+),.*status=sent' ,
);
 
my $logfile = 'maillog';
my $counter = 0;
 
open(FH , $logfile);
while (my $line = <FH>) {
    foreach my $r (@regexStrs) {
        if ($line =~ m#$r#) {
            $counter++;
            last;
        }
    }
}
close(FH);
 
printf("Number of matches: %d\n" , $counter);

Perl (hard coded regexes)

#!/usr/bin/perl
 
my $logfile = 'maillog';
my $counter = 0;
 
open(FH , $logfile);
while (my $line = <FH>) {
    if ($line =~ m#pop3d-ssl.\s+LOGIN.*?user=([^,]+).*?ip=.*?[^\d]((?:\d{1,3}\.){3}\d{1,3})#) {
        $counter++;
    }
    elsif ($line =~ m#postfix/smtpd.*NOQUEUE.*Client host \[([^\]]+)\].*zen.spamhaus.org#) {
        $counter++;
    }
    elsif ($line =~ m#postfix.*connect from ([^\[]*)\[([^\[]+)\]#) {
        $counter++;
    }
    elsif ($line =~ m#postfix.*lost connection#) {
        $counter++;
    }
    elsif ($line =~ m#postfix/virtual.*?: ([^:]+): to=<([^>]+)>.*delays=([^,]+),.*status=sent#) {
        $counter++;
    }
}
close(FH);
 
printf("Number of matches: %d\n" , $counter);

Python (no pre-compiled R.E.)

#!/usr/bin/python
 
import re
 
regexStrs = (
    r'pop3d-ssl.\s+LOGIN.*?user=([^,]+).*?ip=.*?[^\d]((?:\d{1,3}\.){3}\d{1,3})' ,
    r'postfix/smtpd.*NOQUEUE.*Client host \[([^\]]+)\].*zen.spamhaus.org' ,
    r'postfix.*connect from ([^\[]*)\[([^\[]+)\]' ,
    r'postfix.*lost connection' ,
    r'postfix/virtual.*?: ([^:]+): to=<([^>]+)>.*delays=([^,]+),.*status=sent' ,
)
 
logfile = 'maillog'
counter = 0
 
fh = open(logfile)
while True:
    line = fh.readline()
    if not line: break
    for r in regexStrs:
        m = re.search(r , line)
        if m:
            counter += 1
            break
fh.close()
 
print 'Number of matches: %d' % counter

Python (pre-compiled R.E.)

#!/usr/bin/python
 
import re
 
regexStrs = (
    r'pop3d-ssl.\s+LOGIN.*?user=([^,]+).*?ip=.*?[^\d]((?:\d{1,3}\.){3}\d{1,3})' ,
    r'postfix/smtpd.*NOQUEUE.*Client host \[([^\]]+)\].*zen.spamhaus.org' ,
    r'postfix.*connect from ([^\[]*)\[([^\[]+)\]' ,
    r'postfix.*lost connection' ,
    r'postfix/virtual.*?: ([^:]+): to=<([^>]+)>.*delays=([^,]+),.*status=sent' ,
)
 
res = []
logfile = 'maillog'
counter = 0
 
for rs in regexStrs:
    res.append(re.compile(rs))
 
fh = open(logfile)
while True:
    line = fh.readline()
    if not line: break
    for r in res:
        m = r.search(line)
        if m:
            counter += 1
            break
fh.close()
 
print 'Number of matches: %d' % counter

The Testbed

The testbed was very simple. The testing script is just a shell script that runs the Unix time command on each of our 5 test scripts 5 consecutive times. The “maillog” file is a compilation of a number of days of maillogs from my personal mail server. Nothing was altered in this file, and the same file was used in all tests, as you can see from the scripts above. The size of that file is 72886261 bytes (roughly 70 MB). For reasons that should be obvious, I'm not going to post my personal maillogs here.

The source of the testbed shell script:

#!/bin/sh
 
scripts="re_compile.py re_nocompile.py re_nocompile.php re_nocompile.pl re_noninterpstr.pl"
 
for s in $scripts ; do
    for i in $(seq 5) ; do
        echo -n "Run $i for $s:  "
        time ./$s
    done
    echo
done
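For readers who would rather collect the timings programmatically, the same loop can be sketched in Python 3. This is only an illustration and not the harness used for the results in this writeup: the function name time_command is hypothetical, the command in the usage section is a placeholder, and unlike the time command it records wall-clock ("real") time only, not user or sys time.

```python
import subprocess
import sys
import time


def time_command(argv, runs=5):
    """Run argv `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the script's own output so only the timing is reported.
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == '__main__':
    # Placeholder command (assumption); substitute e.g. ['./re_compile.py'].
    command = [sys.executable, '-c', 'pass']
    for i, seconds in enumerate(time_command(command), start=1):
        print('Run %d: real %.2fs' % (i, seconds))
```

Keeping the timings in a list makes it easy to compute a mean or a minimum afterwards instead of reading them off the terminal.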

And now, on to the results of the tests!

The Results

The tables below show the output of the Unix time command for each run on each machine.

The Sun Box

Each cell gives real / user / sys time in seconds.

        PHP                 Python (non-compiled)  Python (compiled)   Perl (interpolated string loop)  Perl (hard coded regexes)
Test 1  9.45 / 9.10 / 0.07  13.42 / 12.66 / 0.11   7.97 / 7.20 / 0.11  31.89 / 29.43 / 0.17             1.59 / 1.53 / 0.05
Test 2  9.90 / 9.06 / 0.06  13.28 / 12.45 / 0.15   7.86 / 7.25 / 0.01  31.78 / 29.97 / 0.18             1.67 / 1.52 / 0.07
Test 3  9.58 / 9.07 / 0.06  13.56 / 12.59 / 0.10   7.45 / 7.09 / 0.13  31.29 / 29.61 / 0.14             2.32 / 1.46 / 0.04
Test 4  9.52 / 9.08 / 0.10  13.58 / 12.63 / 0.09   7.40 / 7.18 / 0.04  33.19 / 30.27 / 0.15             1.76 / 1.47 / 0.04
Test 5  9.94 / 8.87 / 0.12  13.00 / 12.44 / 0.12   7.43 / 7.19 / 0.08  33.22 / 30.22 / 0.16             1.82 / 1.42 / 0.10
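As a quick way to compare the columns, the sketch below averages the wall-clock ("real") times from the Sun box table above and expresses each mean relative to the fastest script. It is written in Python 3 syntax; the numbers are copied from the table, while the names SUN_BOX_REAL, MEANS and FASTEST are made up for this illustration.

```python
# Wall-clock ("real") seconds copied from the Sun box table above.
SUN_BOX_REAL = {
    'PHP': [9.45, 9.90, 9.58, 9.52, 9.94],
    'Python (non-compiled)': [13.42, 13.28, 13.56, 13.58, 13.00],
    'Python (compiled)': [7.97, 7.86, 7.45, 7.40, 7.43],
    'Perl (interpolated string loop)': [31.89, 31.78, 31.29, 33.19, 33.22],
    'Perl (hard coded regexes)': [1.59, 1.67, 2.32, 1.76, 1.82],
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Mean real time per script, and the name of the script with the lowest mean.
MEANS = {name: mean(times) for name, times in SUN_BOX_REAL.items()}
FASTEST = min(MEANS, key=MEANS.get)

if __name__ == '__main__':
    # Print the scripts from fastest to slowest.
    for name, avg in sorted(MEANS.items(), key=lambda item: item[1]):
        print('%-32s %6.2fs  (%.1fx the fastest)'
              % (name, avg, avg / MEANS[FASTEST]))
```

Five runs per script is a small sample, so the means are only a rough guide, but the gap between the two Perl variants is far larger than the run-to-run variation.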

The Laptop

Each cell gives real / user / sys time in seconds.

        PHP                   Python (non-compiled)  Python (compiled)   Perl
Test 1  14.25 / 14.05 / 0.10  12.10 / 11.98 / 0.06   6.12 / 6.03 / 0.07  42.42 / 42.11 / 0.11
Test 2  14.00 / 13.62 / 0.08  12.27 / 11.91 / 0.04   6.17 / 6.08 / 0.06  43.02 / 42.72 / 0.09
Test 3  14.24 / 14.01 / 0.14  12.43 / 12.29 / 0.06   6.14 / 5.98 / 0.08  43.15 / 42.67 / 0.15
Test 4  13.94 / 13.62 / 0.08  12.21 / 11.93 / 0.11   6.30 / 6.22 / 0.05  43.25 / 43.00 / 0.07
Test 5  14.30 / 14.06 / 0.14  12.24 / 12.08 / 0.09   6.32 / 6.19 / 0.08  31.89 / 42.60 / 0.20

Conclusion

Well, it appears that the non-pre-compiled Python regexes are about on par with PHP. My Sun box running Gentoo was probably a bit faster with PHP because I'm running a more stripped-down version of the php binary compiled specifically for that machine, rather than the generic i386 binary on the Ubuntu laptop. The Python numbers are fairly consistent

programming/general/phpvspythonvsperl.1201896893.txt.gz · Last modified: 2008/02/01 20:14 by crustymonkey