I often use sed when developing, for example to clean some input files or extract useful information from logs.
I used to use the default implementation provided with Mac OS X (10.7.4), but this week I hit a massive performance issue: one of my scripts simply took forever to run.
I can't give you the exact sed version because neither "sed -v" nor "sed --version" is implemented on Mac OS X...
My goal with this script was to read log lines from a web application and rebuild the request that generated each line. There are probably better ways than a sed script, but it was the way I wanted to do it.
Here is an example of a log line:
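(The values in the line below are invented, but the layout is the one the expression further down expects.)

192.0.2.10 - - [17/Jul/2012:14:23:01 +0200] "GET /products/list?page=3 HTTP/1.1" 200 8452 "http://www.example.com/home" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4)" [visCook=a1b2c3d4] [sesCook=e5f6a7b8] [clientIp=203.0.113.42]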
Here is the sed command:
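(The expression is the same one quoted again in the comments below; the input file name here is a placeholder, and $RBL_URL is presumably the base URL the requests are replayed against.)

sed "s/.*GET \(.\{5,\}\) HTTP\/1\.1\" 200 [0-9]* \"\([^\"]\{5,\}\)\" \"\([^\"]\{5,\}\)\".*\[visCook=\([^]]\{5,\}\)\].*\[sesCook=\([^]]\{5,\}\)\].*\[clientIp=\([^]]\{5,\}\)\].*/curl --user-agent \"\3\" --cookie \"visCook=\4; sesCook=\5\" --header \"X-Forwarded-For: \6\" --referer \"\2\" \"$RBL_URL\1\"/" access.log > curl_commands.txt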
This expression was painful to write, believe me, and I was very happy after successfully testing it on several variants of log lines. But then I launched my script on a full log file (1.2M+ lines), and after 5 or 10 minutes it still wasn't finished, so I ran it on a 100-line sample under time.
Holy crap, 16s for 100 lines: that's about 53 hours for 1.2M lines!
At this point I started hacking at my regex and googling the problem. After many unsuccessful attempts, I read somewhere that the Mac OS X implementation of sed is old and that GNU sed can be used instead, so I installed gsed via MacPorts ("sudo port install gsed").
I tried exactly the same sed expression on my 100-line sample.
It was far better (130x faster)! But if you do the math, it would still have taken approximately 25 minutes to run on the 1.2M-line file.
That could have been OK, but in my case the script has to be run several times. To speed up the process I can take advantage of my multi-core CPU: the MacBook provided by my company has an i7 processor with 4 physical cores (8 virtual).
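(If you want to check the counts on your own machine, the Mac OS X sysctl keys hw.physicalcpu and hw.logicalcpu report them:)

sysctl hw.physicalcpu hw.logicalcpu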
To achieve that I used another GNU tool: GNU Parallel. There is also a MacPorts port for it ("sudo port install parallel"). GNU Parallel can run a command in parallel on multiple input files, so we can split the input file into chunks and then run the gsed command on them with parallel.
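(A rough sketch of that first version; the chunk size, the file names and the rebuild_curl.sed file holding the expression are all placeholders:)

split -l 150000 access.log chunk_                                      # cut the 1.2M-line log into ~8 pieces
parallel -k gsed -f rebuild_curl.sed ::: chunk_* > curl_commands.txt   # one gsed job per core, output kept in input order
rm chunk_*                                                             # remove the temporary pieces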
There is no need for splitting into temp files - let GNU Parallel do that for you:
cat log |
parallel --pipe -q0 gsed "s/.*GET \(.\{5,\}\) HTTP\/1\.1\" 200 [0-9]* \"\([^\"]\{5,\}\)\" \"\([^\"]\{5,\}\)\".*\[visCook=\([^]]\{5,\}\)\].*\[sesCook=\([^]]\{5,\}\)\].*\[clientIp=\([^]]\{5,\}\)\].*/curl --user-agent \"\3\" --cookie \"visCook=\4; sesCook=\5\" --header \"X-Forwarded-For: \6\" --referer \"\2\" \"$RBL_URL\1\"/" >> curl_commands.txt
Thanks for your suggestion. It's cleaner and it improves the execution time (1m40)!
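For the record: as I understand the GNU Parallel documentation, --pipe chops standard input into blocks and feeds each block to its own gsed process (one job per core by default), so there is nothing left to split or clean up by hand.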