I used to use the default sed implementation provided with Mac OS X (10.7.4). But this week I faced a massive performance issue: one of my scripts simply took forever to run.
I can't give you the exact sed version because neither "sed -v" nor "sed --version" is implemented on Mac OS X...
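Even without a version number, the fact that "sed --version" fails can be used as a rough check of which flavor is installed. The following is only a heuristic sketch: it assumes that GNU sed accepts --version and that the stock Mac OS X sed rejects it, and the variable name "flavor" is mine.

```shell
#!/bin/sh
# Heuristic: GNU sed understands --version, BSD sed exits with an error.
if sed --version >/dev/null 2>&1; then
    flavor="GNU"
    # GNU sed prints its version on the first line.
    sed --version | head -n 1
else
    flavor="non-GNU (probably BSD)"
fi
echo "sed flavor: $flavor"
```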
My goal with this script was to read log lines from a web application and rebuild the request that generated each line. There are probably better ways than a sed script, but in my case it was the way I wanted to do it.
Here is an example log line:
68.41.74.201 api.citygridmedia.com - [11/Jul/2012:16:46:57 -0700] "GET /rbl/tracker/imp?listing_id=5235941&action_target=listing_profile&publisher=seo_google&initial_publisher=seo_google&reference_id=1&placement=rbl_cs_reinvent&src=cs_reinvent HTTP/1.1" 200 43 "http://detroit.citysearch.com/profile/5235941/grosse_pointe_park_mi/mama_rosa_s_pizzeria.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2" "[conType=external_listing_connection] [capProvId=1] [liId=5235941] [bucket=alpha] [impId=000f00000ab9f3be1494614bf797712c56d3ca20ce] [isCust=0] [ver=alpha] [initPubCode=seo_google] [actTarName=listing_profile] [placeDesc=rbl_cs_reinvent] [isNotBillable=1] [visCook=8584b1a5e824ba682ad01dc8e3f49beb3f1b68d8] [ua=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2] [sesCook=8b2d5a19e93aeb04ab36d7c1a61efd7f3536dd7e] [curPubCode=seo_google] [webReqId=1342050417843925] [clientIp=68.41.74.201] [clickId=000f00000a2f853119395842e9ab08f80ed7d74a2d] [tierCheck=0] [hasBudget=0]" 20
Here is the sed expression I ran on a 100-line sample, followed by its timing:
sed "s/.*GET \(.\{5,\}\) HTTP\/1\.1\" 200 [0-9]* \"\([^\"]\{5,\}\)\" \"\([^\"]\{5,\}\)\".*\[visCook=\([^]]\{5,\}\)\].*\[sesCook=\([^]]\{5,\}\)\].*\[clientIp=\([^]]\{5,\}\)\].*/curl --user-agent \"\3\" --cookie \"visCook=\4; sesCook=\5\" --header \"X-Forwarded-For: \6\" --referer \"\2\" \"$RBL_URL\1\"/" logs.txt
real 0m16.672s
user 0m16.665s
sys 0m0.007s
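To make the six capture groups easier to follow, here is a reduced sketch that applies the same substitution to a single shortened log line. The log line and the RBL_URL value are made up for illustration. Note that I assume an RBL_URL without any "/" in it, because a slash in that variable would clash with the "/" delimiter of the s command.

```shell
#!/bin/bash
# Hypothetical target host; it must not contain "/" because the
# variable is expanded inside an s/.../.../ expression.
RBL_URL="localhost:8080"

# Made-up log line with the same layout as the real one, but shortened.
line='1.2.3.4 host - [11/Jul/2012:16:46:57 -0700] "GET /rbl/tracker/imp?listing_id=1 HTTP/1.1" 200 43 "http://example.com/page.html" "Mozilla/5.0 (Test)" "[visCook=aaaaaaaa] [sesCook=bbbbbbbb] [clientIp=1.2.3.4]" 20'

# Groups: \1 request path, \2 referer, \3 user agent,
#         \4 visCook, \5 sesCook, \6 clientIp
result=$(printf '%s\n' "$line" | sed "s/.*GET \(.\{5,\}\) HTTP\/1\.1\" 200 [0-9]* \"\([^\"]\{5,\}\)\" \"\([^\"]\{5,\}\)\".*\[visCook=\([^]]\{5,\}\)\].*\[sesCook=\([^]]\{5,\}\)\].*\[clientIp=\([^]]\{5,\}\)\].*/curl --user-agent \"\3\" --cookie \"visCook=\4; sesCook=\5\" --header \"X-Forwarded-For: \6\" --referer \"\2\" \"$RBL_URL\1\"/")

echo "$result"
```

For this sample line the output should be:
curl --user-agent "Mozilla/5.0 (Test)" --cookie "visCook=aaaaaaaa; sesCook=bbbbbbbb" --header "X-Forwarded-For: 1.2.3.4" --referer "http://example.com/page.html" "localhost:8080/rbl/tracker/imp?listing_id=1"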
At this point I started hacking on my regex and googling the problem. After many unsuccessful attempts, I read somewhere that the Mac OS X implementation of sed is old and that GNU sed can be used as an alternative. So I installed gsed via MacPorts ("sudo port install gsed").
I tried exactly the same sed expression on my 100-line sample:
gsed "s/.*GET \(.\{5,\}\) HTTP\/1\.1\" 200 [0-9]* \"\([^\"]\{5,\}\)\" \"\([^\"]\{5,\}\)\".*\[visCook=\([^]]\{5,\}\)\].*\[sesCook=\([^]]\{5,\}\)\].*\[clientIp=\([^]]\{5,\}\)\].*/curl --user-agent \"\3\" --cookie \"visCook=\4; sesCook=\5\" --header \"X-Forwarded-For: \6\" --referer \"\2\" \"$RBL_URL\1\"/" logs.txt
real 0m0.127s
user 0m0.036s
sys 0m0.009s
It could be OK, but in my case the script has to be run several times. To speed up the process I can take advantage of my multi-core CPU. The MacBook provided by my company has an i7 processor with 4 physical cores (8 virtual).
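The number of virtual cores does not have to be typed in by hand. As a small sketch, assuming that at least one of nproc, sysctl or getconf is available on the machine, the count can be queried like this (the variable name CORE is simply the one I use for the core count):

```shell
#!/bin/bash
# Try the common ways of asking for the number of logical cores.
if command -v nproc >/dev/null 2>&1; then
    CORE=$(nproc)                        # GNU coreutils (Linux)
elif CORE=$(sysctl -n hw.ncpu 2>/dev/null); then
    :                                    # Mac OS X / BSD
else
    CORE=$(getconf _NPROCESSORS_ONLN)    # fallback
fi
echo "logical cores: $CORE"
```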
To take advantage of those cores I used another GNU tool: GNU Parallel. There is also a MacPorts port available for it ("sudo port install parallel"). GNU Parallel can run a command in parallel on multiple input files, so we can split the input file and then run the gsed command with parallel.
#!/bin/bash
# Maximum number of chunks (one per virtual core)
CORE=8
# Split the input file ($1) into at most CORE chunks for parallelization
line_count=$(wc -l < "$1")
split_size=$(echo "(${line_count}+${CORE})/${CORE}" | bc)
split -l "$split_size" "$1" part-
# Run gsed on the chunks in parallel
parallel -q0 gsed "s/.*GET \(.\{5,\}\) HTTP\/1\.1\" 200 [0-9]* \"\([^\"]\{5,\}\)\" \"\([^\"]\{5,\}\)\".*\[visCook=\([^]]\{5,\}\)\].*\[sesCook=\([^]]\{5,\}\)\].*\[clientIp=\([^]]\{5,\}\)\].*/curl --user-agent \"\3\" --cookie \"visCook=\4; sesCook=\5\" --header \"X-Forwarded-For: \6\" --referer \"\2\" \"$RBL_URL\1\"/" ::: part-* >> curl_commands.txt
# Clean up temp files
rm part-*
real 2m5.979s
user 11m38.038s
sys 0m12.888s
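To check how the split arithmetic in the script behaves, here is a worked example on a made-up 100-line file. It is only a sketch: the temporary directory and the sample file are hypothetical, and I use shell arithmetic instead of bc, which gives the same integer division.

```shell
#!/bin/bash
CORE=8

# Build a throwaway 100-line sample in a temporary directory.
workdir=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 100; i++) print "line " i }' > "$workdir/sample.txt"

# Same formula as in the script: (100 + 8) / 8 = 13 lines per chunk.
line_count=$(wc -l < "$workdir/sample.txt")
split_size=$(( (line_count + CORE) / CORE ))

split -l "$split_size" "$workdir/sample.txt" "$workdir/part-"

# Count the chunks; the arithmetic expansion drops the padding BSD wc adds.
part_count=$(ls "$workdir"/part-* | wc -l)
part_count=$((part_count))

echo "split_size=$split_size parts=$part_count"
rm -r "$workdir"
```

With 100 lines and CORE=8 this prints "split_size=13 parts=8": seven chunks of 13 lines and a last one of 9, so there is never more than one chunk per virtual core.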
There is no need for splitting into temp files - let GNU Parallel do that for you:
cat log |
parallel --pipe -q0 gsed "s/.*GET \(.\{5,\}\) HTTP\/1\.1\" 200 [0-9]* \"\([^\"]\{5,\}\)\" \"\([^\"]\{5,\}\)\".*\[visCook=\([^]]\{5,\}\)\].*\[sesCook=\([^]]\{5,\}\)\].*\[clientIp=\([^]]\{5,\}\)\].*/curl --user-agent \"\3\" --cookie \"visCook=\4; sesCook=\5\" --header \"X-Forwarded-For: \6\" --referer \"\2\" \"$RBL_URL\1\"/" >> curl_commands.txt
Thanks for your suggestion.
It's cleaner and it improves the execution time (1m40)!