====== PHP vs. Python vs. Perl -- Regular Expression Showdown ======

===== The Goal =====

I was in a discussion yesterday with one of my co-workers about the speed of [[wp>Spamassassin]]. We were talking about how slow it is and good ways in which to speed it up. He mentioned some optimizations in other languages, which got me wondering about exactly what the speed differences would be in a test of [[wp>PHP]], [[wp>Python]] and [[wp>Perl]]. This writeup details the results of my tests of 5 different scripts on 2 different machines running 2 different distros of [[wp>Linux]].

===== The Hardware =====

I've run these tests on my "work" workstation, which is a **Sun Ultra 20** running **Gentoo Linux**. From here on out, I will refer to this machine as the "Sun box". The details are as follows.

^ CPU | single, single core AMD Opteron 2.6GHz |
^ Memory | 2GB |
^ OS | Gentoo Linux |

The second machine was my personal laptop, a **Gateway MX6931** running **Ubuntu Linux**. From here on out, I will refer to this machine as the "Laptop". The details are as follows.

^ CPU | single, dual core Intel Core2 1.66GHz |
^ Memory | 2GB |
^ OS | Ubuntu 7.10 |

==== The Interpreters ====

Here is the output of the version command from each of our 3 interpreters on each of the machines. The first set is from the Sun box.

**PHP:**

  $ php -v
  PHP 5.2.5-pl1-gentoo (cli) (built: Jan 4 2008 12:35:35)
  Copyright (c) 1997-2007 The PHP Group
  Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies

**Python:**

  $ python -V
  Python 2.4.4

**Perl:**

  $ perl -V
  Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
    Platform:
      osname=linux, osvers=2.6.16-gentoo-r1, archname=i686-linux
      uname='linux kagome 2.6.16-gentoo-r1 #2 smp mon jun 5 19:01:24 cdt 2006 i686 amd athlon(tm) 64 x2 dual core processor 4200+ gnulinux '
      config_args='-des -Darchname=i686-linux -Dcccdlflags=-fPIC -Dccdlflags=-rdynamic -Dcc=i686-pc-linux-gnu-gcc -Dprefix=/usr -Dvendorprefix=/usr -Dsiteprefix=/usr -Dlocincpth= -Doptimize=-O2 -march=i686 -pipe -Duselargefiles -Dd_semctl_semun -Dscriptdir=/usr/bin -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dinstallman1dir=/usr/share/man/man1 -Dinstallman3dir=/usr/share/man/man3 -Dman1ext=1 -Dman3ext=3pm -Dinc_version_list=5.8.0 5.8.0/i686-linux 5.8.2 5.8.2/i686-linux 5.8.4 5.8.4/i686-linux 5.8.5 5.8.5/i686-linux 5.8.6 5.8.6/i686-linux 5.8.7 5.8.7/i686-linux -Dcf_by=Gentoo -Ud_csh -Dusenm -Di_ndbm -Di_gdbm -Di_db'
      hint=recommended, useposix=true, d_sigaction=define
      usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
      useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
      use64bitint=undef use64bitall=undef uselongdouble=undef
      usemymalloc=n, bincompat5005=undef
    Compiler:
      cc='i686-pc-linux-gnu-gcc', ccflags ='-fno-strict-aliasing -pipe -Wdeclaration-after-statement -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
      optimize='-O2 -march=i686 -pipe',
      cppflags='-fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/include/gdbm'
      ccversion='', gccversion='4.1.1 (Gentoo 4.1.1)', gccosandvers=''
      intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
      d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
      ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
      alignbytes=4, prototype=define
    Linker and Libraries:
      ld='i686-pc-linux-gnu-gcc', ldflags =' -L/usr/local/lib'
      libpth=/usr/local/lib /lib /usr/lib
      libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
      perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
      libc=/lib/libc-2.4.so, so=so, useshrplib=false, libperl=libperl.a
      gnulibc_version='2.4'
    Dynamic Linking:
      dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
      cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'
  Characteristics of this binary (from libperl):
    Compile-time options: PERL_MALLOC_WRAP USE_LARGE_FILES USE_PERLIO
    Built under linux
    Compiled at Jun 30 2006 17:21:05
    @INC:
      /etc/perl
      /usr/lib/perl5/vendor_perl/5.8.8/i686-linux
      /usr/lib/perl5/vendor_perl/5.8.8
      /usr/lib/perl5/vendor_perl
      /usr/lib/perl5/site_perl/5.8.8/i686-linux
      /usr/lib/perl5/site_perl/5.8.8
      /usr/lib/perl5/site_perl
      /usr/lib/perl5/5.8.8/i686-linux
      /usr/lib/perl5/5.8.8
      /usr/local/lib/site_perl
      .

This is the same info from the laptop.

**PHP:**

  $ php -v
  PHP 5.2.3-1ubuntu6.3 (cli) (built: Jan 10 2008 09:38:37)
  Copyright (c) 1997-2007 The PHP Group
  Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies

**Python:**

  $ python -V
  Python 2.5.1

**Perl:**

  $ perl -V
  Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
    Platform:
      osname=linux, osvers=2.6.15.7, archname=i486-linux-gnu-thread-multi
      uname='linux terranova 2.6.15.7 #1 smp thu jul 12 14:27:56 utc 2007 i686 gnulinux '
      config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.8 -Darchlib=/usr/lib/perl/5.8 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.8 -Dsitearch=/usr/local/lib/perl/5.8.8 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Uusesfio -Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.8 -Dd_dosuid -des'
      hint=recommended, useposix=true, d_sigaction=define
      usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
      useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
      use64bitint=undef use64bitall=undef uselongdouble=undef
      usemymalloc=n, bincompat5005=undef
    Compiler:
      cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
      optimize='-O2',
      cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
      ccversion='', gccversion='4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)', gccosandvers=''
      intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
      d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
      ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
      alignbytes=4, prototype=define
    Linker and Libraries:
      ld='cc', ldflags =' -L/usr/local/lib'
      libpth=/usr/local/lib /lib /usr/lib
      libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
      perllibs=-ldl -lm -lpthread -lc -lcrypt
      libc=/lib/libc-2.6.1.so, so=so, useshrplib=true, libperl=libperl.so.5.8.8
      gnulibc_version='2.6.1'
    Dynamic Linking:
      dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
      cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'
  Characteristics of this binary (from libperl):
    Compile-time options: MULTIPLICITY PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP THREADS_HAVE_PIDS USE_ITHREADS USE_LARGE_FILES USE_PERLIO USE_REENTRANT_API
    Built under linux
    Compiled at Dec 4 2007 08:56:39
    @INC:
      /etc/perl
      /usr/local/lib/perl/5.8.8
      /usr/local/share/perl/5.8.8
      /usr/lib/perl5
      /usr/share/perl5
      /usr/lib/perl/5.8
      /usr/share/perl/5.8
      /usr/local/lib/site_perl
      .

===== The Scripts =====

I wrote a total of 5 scripts for this experiment.
There is one for PHP and two each for Python and Perl. The reason there are 2 Python scripts is that Python's regular expression library offers the option to compile regular expressions prior to use. Therefore, I have one Python script which uses pre-compiled regular expressions and one that does not. The reason for the 2 Perl scripts is that the first one I wrote uses the same programmatic mechanism as the others: it loops through an array of regular expression strings and interpolates a string variable (''m/$r/'') in the regex match. At the behest of one of my co-workers, I wrote a second version where all the regexes are hard coded in the match expressions (''m/regex code.*$/''). The difference, as you will see, is quite dramatic.

Though these are different languages, I kept the execution of the scripts almost completely the same between them, with the exception of the hard coded Perl script. The basis of the scripts is that they all try a set of 5 different regular expressions against each line of an email logfile. More on the logfile later. If there is a match, a simple integer counter is incremented and the script moves on to the next line. Very basic, but also very real world. Parsing log files is definitely one of the major uses for interpreted languages, which is what this is about. The only thing I'm not doing is aggregating any kind of data from what I'm parsing, as I want this to be purely about the speed of the regular expression matching.

Now, the source of the 5 scripts.

==== PHP ====

  #!/usr/bin/php
  <?php
  $regexStrs = array(
      '#pop3d-ssl.\s+LOGIN.*?user=([^,]+).*?ip=.*?[^\d]((?:\d{1,3}\.){3}\d{1,3})#' ,
      '#postfix/smtpd.*NOQUEUE.*Client host \[([^\]]+)\].*zen.spamhaus.org#' ,
      '#postfix.*connect from ([^\[]*)\[([^\[]+)\]#' ,
      '#postfix.*lost connection#' ,
      '#postfix/virtual.*?: ([^:]+): to=<([^>]+)>.*delays=([^,]+),.*status=sent#' ,
  );
  $logfile = 'maillog';
  $counter = 0;
  $fh = fopen($logfile , 'r');
  while (false !== ($line = fgets($fh))) {
      foreach ($regexStrs as $r) {
          if (preg_match($r , $line , $m)) {
              $counter++;
              break;
          }
      }
  }
  printf("Number of matches: %d\n" , $counter);
  ?>

==== Perl (interpolated string loop) ====

  #!/usr/bin/perl
  my @regexStrs = (
      'pop3d-ssl.\s+LOGIN.*?user=([^,]+).*?ip=.*?[^\d]((?:\d{1,3}\.){3}\d{1,3})' ,
      'postfix/smtpd.*NOQUEUE.*Client host \[([^\]]+)\].*zen.spamhaus.org' ,
      'postfix.*connect from ([^\[]*)\[([^\[]+)\]' ,
      'postfix.*lost connection' ,
      'postfix/virtual.*?: ([^:]+): to=<([^>]+)>.*delays=([^,]+),.*status=sent' ,
  );
  my $logfile = 'maillog';
  my $counter = 0;
  open(FH , $logfile);
  while (my $line = <FH>) {
      foreach my $r (@regexStrs) {
          if ($line =~ m#$r#) {
              $counter++;
              last;
          }
      }
  }
  close(FH);
  printf("Number of matches: %d\n" , $counter);

==== Perl (hard coded regexes) ====

  #!/usr/bin/perl
  my $logfile = 'maillog';
  my $counter = 0;
  open(FH , $logfile);
  while (my $line = <FH>) {
      if ($line =~ m#pop3d-ssl.\s+LOGIN.*?user=([^,]+).*?ip=.*?[^\d]((?:\d{1,3}\.){3}\d{1,3})#) {
          $counter++;
      } elsif ($line =~ m#postfix/smtpd.*NOQUEUE.*Client host \[([^\]]+)\].*zen.spamhaus.org#) {
          $counter++;
      } elsif ($line =~ m#postfix.*connect from ([^\[]*)\[([^\[]+)\]#) {
          $counter++;
      } elsif ($line =~ m#postfix.*lost connection#) {
          $counter++;
      } elsif ($line =~ m#postfix/virtual.*?: ([^:]+): to=<([^>]+)>.*delays=([^,]+),.*status=sent#) {
          $counter++;
      }
  }
  close(FH);
  printf("Number of matches: %d\n" , $counter);

==== Python (no pre-compiled R.E.) ====

  #!/usr/bin/python
  import re
  regexStrs = (
      r'pop3d-ssl.\s+LOGIN.*?user=([^,]+).*?ip=.*?[^\d]((?:\d{1,3}\.){3}\d{1,3})' ,
      r'postfix/smtpd.*NOQUEUE.*Client host \[([^\]]+)\].*zen.spamhaus.org' ,
      r'postfix.*connect from ([^\[]*)\[([^\[]+)\]' ,
      r'postfix.*lost connection' ,
      r'postfix/virtual.*?: ([^:]+): to=<([^>]+)>.*delays=([^,]+),.*status=sent' ,
  )
  logfile = 'maillog'
  counter = 0
  fh = open(logfile)
  while True:
      line = fh.readline()
      if not line:
          break
      for r in regexStrs:
          m = re.search(r , line)
          if m:
              counter += 1
              break
  fh.close()
  print 'Number of matches: %d' % counter

==== Python (pre-compiled R.E.) ====

  #!/usr/bin/python
  import re
  regexStrs = (
      r'pop3d-ssl.\s+LOGIN.*?user=([^,]+).*?ip=.*?[^\d]((?:\d{1,3}\.){3}\d{1,3})' ,
      r'postfix/smtpd.*NOQUEUE.*Client host \[([^\]]+)\].*zen.spamhaus.org' ,
      r'postfix.*connect from ([^\[]*)\[([^\[]+)\]' ,
      r'postfix.*lost connection' ,
      r'postfix/virtual.*?: ([^:]+): to=<([^>]+)>.*delays=([^,]+),.*status=sent' ,
  )
  res = []
  logfile = 'maillog'
  counter = 0
  for rs in regexStrs:
      res.append(re.compile(rs))
  fh = open(logfile)
  while True:
      line = fh.readline()
      if not line:
          break
      for r in res:
          m = r.search(line)
          if m:
              counter += 1
              break
  fh.close()
  print 'Number of matches: %d' % counter

===== The Testbed =====

The testbed was very simple. The testing script is just a shell script that ran the Unix ''time'' command on each of our 5 test scripts 5 consecutive times each. The "maillog" file that was used was simply a compilation of a number of days of maillogs from my personal mail server. Nothing was altered in this file, and the same file was used in all tests, as you can see from the scripts above. The size of that file is **72886261** bytes (roughly 70 MB). For reasons that should be obvious, I'm not going to post my personal maillogs here.
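
Because the real log is not posted, anyone wanting to try the scripts needs a stand-in. What follows is only a minimal Python 3 sketch of one way to sanity-check the five patterns before timing anything. The entries in ''SAMPLE_LINES'' are invented for illustration (hypothetical hosts, addresses and queue IDs, not taken from the actual maillog), shaped so that each pattern is hit once and one line matches nothing. The helper name ''count_matches'' is also an assumption; it mirrors the first-match-wins loop of the pre-compiled Python script above.

```python
import re

# The five patterns, copied from the scripts above.
PATTERNS = (
    r'pop3d-ssl.\s+LOGIN.*?user=([^,]+).*?ip=.*?[^\d]((?:\d{1,3}\.){3}\d{1,3})',
    r'postfix/smtpd.*NOQUEUE.*Client host \[([^\]]+)\].*zen.spamhaus.org',
    r'postfix.*connect from ([^\[]*)\[([^\[]+)\]',
    r'postfix.*lost connection',
    r'postfix/virtual.*?: ([^:]+): to=<([^>]+)>.*delays=([^,]+),.*status=sent',
)
COMPILED = [re.compile(p) for p in PATTERNS]

# Hypothetical log lines, made up for this sketch only.
SAMPLE_LINES = [
    'Jan 10 12:00:01 mail pop3d-ssl: LOGIN, user=alice@example.com, '
    'ip=[::ffff:192.0.2.10]',
    'Jan 10 12:00:02 mail postfix/smtpd[1234]: NOQUEUE: reject: RCPT from '
    'unknown[198.51.100.7]: 554 5.7.1 Service unavailable; '
    'Client host [198.51.100.7] blocked using zen.spamhaus.org',
    'Jan 10 12:00:03 mail postfix/smtpd[1235]: connect from '
    'mx.example.net[203.0.113.5]',
    'Jan 10 12:00:04 mail postfix/smtpd[1236]: lost connection after RCPT '
    'from unknown[198.51.100.9]',
    'Jan 10 12:00:05 mail postfix/virtual[1237]: 4F2A1B0C: '
    'to=<bob@example.com>, relay=virtual, delay=0.2, delays=0.1/0/0/0.1, '
    'dsn=2.0.0, status=sent (delivered to maildir)',
    'Jan 10 12:00:06 mail dovecot: imap-login: Login: user=<carol>',
]


def count_matches(lines, regexes=COMPILED):
    """Count lines matched by at least one regex (first match wins)."""
    counter = 0
    for line in lines:
        for regex in regexes:
            if regex.search(line):
                counter += 1
                break  # move on to the next line, as the test scripts do
    return counter


if __name__ == '__main__':
    print('Number of matches: %d' % count_matches(SAMPLE_LINES))
```

Repeating such lines many times would give a file of comparable size, though timings on synthetic data will not necessarily resemble the ones reported here, since the mix of matching and non-matching lines in the real log is unknown.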

The source of the testbed shell script:

  #!/bin/sh
  scripts="re_compile.py re_nocompile.py re_nocompile.php re_nocompile.pl re_noninterpstr.pl"
  for s in $scripts ; do
      for i in $(seq 5) ; do
          echo -n "Run $i for $s: "
          time ./$s
      done
      echo
  done

And now, on to the results of the tests!

===== The Results =====

The tables below show the Unix ''time'' output of these tests on each machine. The fastest time in each column is in bold.

==== The Sun Box ====

| ^ PHP ^^^ Python (non-compiled) ^^^ Python (compiled) ^^^ Perl (interpolated string loop) ^^^ Perl (hard coded regexes) ^^^
| | **Real** | **User** | **Sys** | **Real** | **User** | **Sys** | **Real** | **User** | **Sys** | **Real** | **User** | **Sys** | **Real** | **User** | **Sys** |
^ Test 1 | **9.45s** | 9.10s | 0.07s | 13.42s | 12.66s | 0.11s | 7.97s | 7.20s | 0.11s | 31.89s | **29.43s** | 0.17s | **1.59s** | 1.53s | 0.05s |
^ Test 2 | 9.90s | 9.06s | **0.06s** | 13.28s | 12.45s | 0.15s | 7.86s | 7.25s | **0.01s** | 31.78s | 29.97s | 0.18s | 1.67s | 1.52s | 0.07s |
^ Test 3 | 9.58s | 9.07s | **0.06s** | 13.56s | 12.59s | 0.10s | 7.45s | **7.09s** | 0.13s | **31.29s** | 29.61s | **0.14s** | 2.32s | 1.46s | **0.04s** |
^ Test 4 | 9.52s | 9.08s | 0.10s | 13.58s | 12.63s | **0.09s** | **7.40s** | 7.18s | 0.04s | 33.19s | 30.27s | 0.15s | 1.76s | 1.47s | **0.04s** |
^ Test 5 | 9.94s | **8.87s** | 0.12s | **13.00s** | **12.44s** | 0.12s | 7.43s | 7.19s | 0.08s | 33.22s | 30.22s | 0.16s | 1.82s | **1.42s** | 0.10s |

==== The Laptop ====

| ^ PHP ^^^ Python (non-compiled) ^^^ Python (compiled) ^^^ Perl (interpolated string loop) ^^^ Perl (hard coded regexes) ^^^
| | **Real** | **User** | **Sys** | **Real** | **User** | **Sys** | **Real** | **User** | **Sys** | **Real** | **User** | **Sys** | **Real** | **User** | **Sys** |
^ Test 1 | 14.25s | 14.05s | 0.10s | **12.10s** | 11.98s | 0.06s | **6.12s** | 6.03s | 0.07s | **42.42s** | **42.11s** | 0.11s | 1.63s | **1.54s** | 0.08s |
^ Test 2 | 14.00s | **13.62s** | **0.08s** | 12.27s | **11.91s** | **0.04s** | 6.17s | 6.08s | 0.06s | 43.02s | 42.72s | 0.09s | 1.71s | 1.64s | **0.05s** |
^ Test 3 | 14.24s | 14.01s | 0.14s | 12.43s | 12.29s | 0.06s | 6.14s | **5.98s** | 0.08s | 43.15s | 42.67s | 0.15s | 1.71s | 1.65s | 0.06s |
^ Test 4 | **13.94s** | **13.62s** | **0.08s** | 12.21s | 11.93s | 0.11s | 6.30s | 6.22s | **0.05s** | 43.25s | 43.00s | **0.07s** | 1.63s | 1.58s | **0.05s** |
^ Test 5 | 14.30s | 14.06s | 0.14s | 12.24s | 12.08s | 0.09s | 6.32s | 6.19s | 0.08s | 31.89s | 42.60s | 0.20s | **1.61s** | 1.55s | **0.05s** |

===== Just For Fun... =====

...one of my co-workers whipped up this C code which uses ''libpcre'' just to see how it would perform versus the interpreted languages. I'm not including it in the main results because this is a test of the speed of 3 interpreted languages, but I thought I would drop the results in here just for fun.

==== The Code ====

  #include <err.h>
  #include <pcre.h>
  #include <stdio.h>
  #include <string.h>
  char *pat[] = {
      "pop3d-ssl.\\s+LOGIN.*?user=([^,]+).*?ip=.*?[^\\d]((?:\\d{1,3}\\.){3}\\d{1,3})" ,
      "postfix/smtpd.*NOQUEUE.*Client host \\[([^\\]]+)\\].*zen.spamhaus.org" ,
      "postfix.*connect from ([^\\[]*)\\[([^\\[]+)\\]" ,
      "postfix.*lost connection" ,
      "postfix/virtual.*?: ([^:]+): to=<([^>]+)>.*delays=([^,]+),.*status=sent" ,
      NULL
  };
  const char *logfile = "maillog";
  int counter = 0;
  pcre *re[5];
  int main(void)
  {
      const char *err_txt;
      int i, err_offset, match, ovec[30];
      FILE *f;
      char *s, buf[1024];
      /* Compile all patterns once, up front. */
      for (i = 0; pat[i]; i++) {
          re[i] = pcre_compile(pat[i], 0, &err_txt, &err_offset, 0);
          if (!re[i]) {
              errx(1, "PCRE compile error at %d of %s: %s", err_offset, pat[i], err_txt);
          }
      }
      if (!(f = fopen(logfile, "r")))
          return 1;
      /* Count each line that matches at least one pattern. */
      while ((s = fgets(buf, sizeof(buf), f))) {
          for (i = 0; pat[i]; i++) {
              match = pcre_exec(re[i], 0, s, strlen(s), 0, 0, ovec, 30);
              if (match > 0) {
                  counter++;
                  break;
              }
          }
      }
      fclose(f);
      printf("Number of matches: %d\n", counter);
      return 0;
  }

The compile command:

  $ cc -Wall -o pcretest pcretest.c -I/usr/local/include -L/usr/local/lib -lpcre

==== The Results ====

=== The Sun Box ===

| ^ Real ^ User ^ Sys ^
^ Test 1 | **6.70s** | **6.44s** | 0.07s |
^ Test 2 | 8.03s | 7.83s | **0.03s** |
^ Test 3 | 9.04s | 6.84s | 0.05s |
^ Test 4 | 9.03s | 6.53s | 0.09s |
^ Test 5 | 7.26s | 6.63s | 0.08s |

=== The Laptop ===

| ^ Real ^ User ^ Sys ^
^ Test 1 | 13.14s | 12.92s | 0.06s |
^ Test 2 | 13.08s | **12.88s** | 0.06s |
^ Test 3 | 13.09s | 12.94s | **0.02s** |
^ Test 4 | 13.21s | 13.00s | 0.07s |
^ Test 5 | **13.07s** | **12.88s** | 0.04s |

===== Conclusion =====

Well, it appears that the non pre-compiled Python regexes are about on par with PHP. My Sun box running Gentoo was probably a bit faster with PHP because I'm running a more stripped down version of the php binary compiled specifically for my machine, rather than the generic i386 binary on the Ubuntu laptop. The Python numbers are fairly consistent, with the compiled version being about twice as fast as the non-compiled version.

I think that the most amazing thing here is the difference between the 2 Perl tests. If you use a scalar string variable as the regular expression, it's dog slow. However, if you hard code that string in the expression, it's lightning fast. I was not expecting this kind of a discrepancy at all, but I'm glad that I tested both approaches.

Though I didn't include it in the //official// results, I thought it was kind of interesting that, on the Sun box, the compiled C program performed about the same as the Python program with pre-compiled regular expressions. On the laptop it landed closer to PHP and the non-compiled Python script.

I think the conclusion that I have to draw from this experiment is that Perl is your best choice, as is often the case, for a simple static regular expression based parser. On the other hand, if you want a more dynamic approach to the regular expressions that you are using (like loading them in from a file, the command line, etc.), compiled Python is definitely your best answer, but PHP is also a good candidate.
Judging by these numbers, Perl with interpolated pattern strings is not the approach to use in that particular case.

Please feel free to post to the discussion here in answer to this writeup.
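
As a footnote to the conclusion, the ratios quoted there can be checked against the Sun box table. The sketch below is Python 3 and was not part of the original test scripts; the dictionary simply copies the "Real" column for each script from that table, and the names ''SUN_BOX_REAL'', ''MEANS'' and ''ratio'' are made up for this example.

```python
# "Real" times in seconds, copied from the Sun box results table.
SUN_BOX_REAL = {
    'PHP': [9.45, 9.90, 9.58, 9.52, 9.94],
    'Python (non-compiled)': [13.42, 13.28, 13.56, 13.58, 13.00],
    'Python (compiled)': [7.97, 7.86, 7.45, 7.40, 7.43],
    'Perl (interpolated string loop)': [31.89, 31.78, 31.29, 33.19, 33.22],
    'Perl (hard coded regexes)': [1.59, 1.67, 2.32, 1.76, 1.82],
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Mean wall-clock time per script over the five runs.
MEANS = {name: mean(times) for name, times in SUN_BOX_REAL.items()}


def ratio(slower, faster):
    """How many times longer `slower` took than `faster`, on average."""
    return MEANS[slower] / MEANS[faster]


if __name__ == '__main__':
    # Print the scripts from fastest to slowest.
    for name, seconds in sorted(MEANS.items(), key=lambda item: item[1]):
        print('%-32s %6.2fs' % (name, seconds))
    print('Python non-compiled / compiled: %.2f'
          % ratio('Python (non-compiled)', 'Python (compiled)'))
    print('Perl interpolated / hard coded: %.1f'
          % ratio('Perl (interpolated string loop)',
                  'Perl (hard coded regexes)'))
```

On these averages the compiled Python script takes a bit over half the time of the non-compiled one (a factor of roughly 1.75), while the hard coded Perl script is roughly 17.6 times as fast as the interpolated loop.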