7 CONTENT="Modular DocBook HTML Stylesheet Version 1.64
10 TITLE="Privoxy User Manual"
11 HREF="index.html"><LINK
14 HREF="seealso.html"><LINK
17 HREF="../p_doc.css"></HEAD
36 >Privoxy User Manual</TH
76 >9.1. Regular Expressions</A
84 >"regular expressions"</SPAN
86 in various config files. Assuming support for <SPAN
90 Compatible Regular Expressions) is compiled in, which is the default. Such
91 configuration directives do not require regular expressions, but they can be
92 used to increase flexibility by matching a pattern with wild-cards against
95 > If you are reading this, you probably don't understand what <SPAN
99 > are, or what they can do. So this will be a very brief
100 introduction only. A full explanation would require a book ;-)</P
104 >"Regular expressions"</SPAN
105 > is a way of matching one character
106 expression against another to see if it matches or not. One of the
110 > is a literal string of readable characters
111 (letters, numbers, etc.), and the other is a complex string of literal
112 characters combined with wild-cards, and other special characters, called
113 meta-characters. The <SPAN
115 >"meta-characters"</SPAN
116 > have special meanings and
117 are used to build the complex pattern to be matched against. Perl Compatible
118 Regular Expressions is an enhanced form of the regular expression language
119 with backward compatibility.</P
121 > To make a simple analogy, we do something similar when we use wild-card
122 characters when listing files with the <B
129 > matches all filenames. The <SPAN
133 character here is the asterisk which matches any and all characters. We can be
134 more specific and use <TT
137 > to match just individual
140 >"dir file?.text"</SPAN
148 >, etc. We are pattern
149 matching, using a similar technique to <SPAN
151 >"regular expressions"</SPAN
154 > Regular expressions do essentially the same thing, but are much, much more
155 powerful. There are many more <SPAN
157 >"special characters"</SPAN
159 building complex patterns however. Let's look at a few of the common ones,
160 and then some examples:</P
171 > - Matches any single character, e.g. <SPAN
204 > - The preceding character or expression is matched ZERO or ONE
222 > - The preceding character or expression is matched ONE or MORE
240 > - The preceding character or expression is matched ZERO or MORE
261 > character denotes that
262 the following character should be taken literally. This is used where one of the
263 special characters (e.g. <SPAN
266 >) needs to be taken literally and
267 not as a special meta-character.
284 > - Characters enclosed in brackets will be matched if
285 any of the enclosed characters are encountered.
302 > - parentheses are used to group a sub-expression,
303 or multiple sub-expressions.
323 > character works like an
327 > conditional statement. A match is successful if the
328 sub-expression on either side of <SPAN
347 >s/string1/string2/g</I
348 > - This is used to rewrite strings of text.
352 > is replaced by <SPAN
364 > These are just some of the ones you are likely to use when matching URLs with
368 >, and are a long way from a definitive
369 list. This is enough to get us started with a few simple examples which may
370 be more illuminating:</P
379 that uses the common combination of <SPAN
386 denote any character, zero or more times. In other words, any string at all.
387 So we start with a literal forward slash, then our regular expression pattern
391 >) another literal forward slash, the string
395 >, another forward slash, and lastly another
400 a directory path here. This will match any file with the path that has a
401 directory named <SPAN
408 any characters, and this could conceivably be more forward slashes, so it
409 might expand into a much longer looking path. For example, this could match:
412 >"/eye/hate/spammers/banners/annoy_me_please.gif"</SPAN
416 >"/banners/annoying.html"</SPAN
417 >, or almost an infinite number of other
418 possible combinations, just so it has <SPAN
424 > And now something a little more complex:</P
430 >/.*/adv((er)?ts?|ertis(ing|ements?))?/</TT
433 We have several literal forward slashes again (<SPAN
437 building another expression that is a file path statement. We have another
441 >, so we are matching against any conceivable sub-path, just so
442 it matches our expression. The only true literal that <I
446 > our pattern is <SPAN
450 the forward slashes. What comes after the <SPAN
454 interesting part. </P
459 > means the preceding expression (either a
460 literal character or anything grouped with <SPAN
464 can exist or not, since this means either zero or one match. So
467 >"((er)?ts?|ertis(ing|ements?))"</SPAN
468 > is optional, as are the
469 individual sub-expressions: <SPAN
475 >"(ing|ements?)"</SPAN
486 >. We have two of those. For instance,
489 >"(ing|ements?)"</SPAN
490 >, can expand to match either <SPAN
500 >. What is being done here is an
501 attempt at matching as many variations of <SPAN
503 >"advertisement"</SPAN
505 similar, as possible. So this would expand to match just <SPAN
521 >"advertisement"</SPAN
525 >"advertisements"</SPAN
526 >. You get the idea. But it would not match
529 >"advertizements"</SPAN
533 >). We could fix that by
534 changing our regular expression to:
537 >"/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/"</SPAN
538 >, which would then match
545 >/.*/advert[0-9]+\.(gif|jpe?g)</TT
548 another path statement with forward slashes. Anything in the square brackets
552 > can be matched. This is using <SPAN
556 shorthand expression to mean any digit zero through nine. It is the same as
560 >. So any digit matches. The <SPAN
564 means one or more of the preceding expression must be included. The preceding
565 expression here is what is in the square brackets -- in this case, any digit
566 zero through nine. Then, at the end, we have a grouping: <SPAN
570 This includes a <SPAN
573 >, so this needs to match the expression on
574 either side of that bar character also. A simple <SPAN
577 > on one side, and the other
578 side will in turn match either <SPAN
588 > means the letter <SPAN
592 can be matched once or not at all. So we are building an expression here to
593 match a GIF or JPEG image file. It must include the literal
597 >, then one or more digits, and a <SPAN
601 (which is now a literal, and not a special character, since it is escaped
605 >), and lastly either <SPAN
615 >. Some possible matches would
618 >"//advert1.jpg"</SPAN
622 >"/nasty/ads/advert1234.gif"</SPAN
626 >"/banners/from/hell/advert99.jpg"</SPAN
627 >. It would not match
631 > (no leading slash), or
634 >"/adverts232.jpg"</SPAN
635 > (the expression does not include an
641 >"/advert1.jsp"</SPAN
646 in the expression anywhere).</P
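The two path patterns discussed above can be tried out away from Privoxy. The sketch below uses Python's re module instead of PCRE itself, purely as a convenience; the constructs used in these patterns behave the same way in both. The helper name matches and the sample paths that do not appear in the text (for instance advert1.gif and /ads/advertizements/) are illustrative only. How Privoxy itself anchors a path pattern is not modelled here; re.match simply anchors at the start of the string.

```python
import re

# Patterns copied from the examples above.
ADV = re.compile(r"/.*/adv((er)?ts?|ertis(ing|ements?))?/")
ADV_SZ = re.compile(r"/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/")
ADVERT_IMG = re.compile(r"/.*/advert[0-9]+\.(gif|jpe?g)")


def matches(pattern, path):
    """Return True if the pattern matches starting at the beginning of path."""
    return pattern.match(path) is not None


if __name__ == "__main__":
    # Illustrative sample paths; some are taken from the text, some are made up.
    image_paths = [
        "//advert1.jpg",
        "/nasty/ads/advert1234.gif",
        "/banners/from/hell/advert99.jpg",
        "advert1.gif",       # no leading slash
        "/adverts232.jpg",   # "s" is not allowed after "advert"
        "/advert1.jsp",      # ".jsp" is neither gif nor jpg/jpeg
    ]
    for path in image_paths:
        print(path, matches(ADVERT_IMG, path))

    for path in ["/ads/advertisements/", "/ads/advertizements/"]:
        print(path, matches(ADV, path), matches(ADV_SZ, path))
```

Running the script prints one line per path, so it is easy to add further candidates and see which side of the pattern they fall on.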
652 >s/microsoft(?!.com)/MicroSuck/i</TT
655 a substitution. <SPAN
658 > will replace any occurrence of
665 > at the end of the expression
666 means ignore case. The <SPAN
670 the match should fail if <SPAN
677 >. In other words, this acts like a <SPAN
681 modifier. In case this is a hyperlink, we don't want to break it ;-).</P
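The effect of the negative lookahead can be seen with a short experiment. This is a sketch in Python's re module, not Privoxy's own substitution engine, and the function name rewrite is made up for the illustration. The count=1 argument is an assumption that mirrors a Perl-style substitution written without the g modifier; remove it to replace every occurrence.

```python
import re

# Python spelling of s/microsoft(?!.com)/MicroSuck/i
PATTERN = re.compile(r"microsoft(?!.com)", re.IGNORECASE)


def rewrite(text):
    """Replace the first 'microsoft' that is not followed by '<any char>com'."""
    # count=1 mirrors a substitution without the g modifier (an assumption).
    return PATTERN.sub("MicroSuck", text, count=1)


if __name__ == "__main__":
    print(rewrite("Microsoft Word"))
    print(rewrite("http://www.microsoft.com/"))
```

Note that the dot inside the lookahead is not escaped, so it stands for any character, not only a literal dot. Writing \.com instead would make the exception apply to ".com" alone.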
683 > We are barely scratching the surface of regular expressions here so that you
684 can understand the default <SPAN
688 configuration files, and maybe use this knowledge to customize your own
689 installation. There is much, much more that can be done with regular
690 expressions. Now that you know enough to get started, you can learn more on
693 > More reading on Perl Compatible Regular Expressions:
695 HREF="http://www.perldoc.com/perl5.6/pod/perlre.html"
697 >http://www.perldoc.com/perl5.6/pod/perlre.html</A
709 >'s Internal Pages</A
715 > proxies each requested
716 web page, it is easy for <SPAN
720 trap certain special URLs. In this way, we can talk directly to
725 configured, see how our rules are being applied, change these
726 rules and other configuration options, and even turn
730 > filtering off, all with
731 a web browser. </P
733 > The URLs listed below are the special ones that allow direct access
741 > must be running to access these. If
742 not, you will get a friendly error message. Internet access is not
761 HREF="http://config.privoxy.org/"
763 >http://config.privoxy.org/</A
768 > Alternatively, this may be reached at <A
773 variation may not work as reliably as the above in some configurations.
779 Show information about the current configuration:
789 HREF="http://config.privoxy.org/show-status"
791 >http://config.privoxy.org/show-status</A
799 Show the source code version numbers:
809 HREF="http://config.privoxy.org/show-version"
811 >http://config.privoxy.org/show-version</A
819 Show the client's request headers:
829 HREF="http://config.privoxy.org/show-request"
831 >http://config.privoxy.org/show-request</A
839 Show which actions apply to a URL and why:
849 HREF="http://config.privoxy.org/show-url-info"
851 >http://config.privoxy.org/show-url-info</A
859 Toggle Privoxy on or off. In this case, <SPAN
863 to run, but only as a pass-through proxy, with no actions taking place:
873 HREF="http://config.privoxy.org/toggle"
875 >http://config.privoxy.org/toggle</A
880 > Short cuts. Turn off, then on:
890 HREF="http://config.privoxy.org/toggle?set=disable"
892 >http://config.privoxy.org/toggle?set=disable</A
904 HREF="http://config.privoxy.org/toggle?set=enable"
906 >http://config.privoxy.org/toggle?set=enable</A
914 Edit the actions list file:
924 HREF="http://config.privoxy.org/edit-actions"
926 >http://config.privoxy.org/edit-actions</A
934 > These may be bookmarked for quick reference. </P
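For scripting, the URLs listed above can also be assembled programmatically. The following is a minimal sketch; the helper name internal_url is hypothetical, and only the page names and the set parameter shown in the list above are taken from this manual. The resulting URLs are only answered when the request actually passes through a running Privoxy.

```python
from urllib.parse import urlencode

# Base address of the internal pages, as listed above.
CONFIG_BASE = "http://config.privoxy.org/"


def internal_url(page="", **params):
    """Build the URL of an internal page, e.g. page='toggle', set='disable'."""
    url = CONFIG_BASE + page
    if params:
        url += "?" + urlencode(params)
    return url


if __name__ == "__main__":
    print(internal_url("show-status"))
    print(internal_url("toggle", set="disable"))
    print(internal_url("toggle", set="enable"))
```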
941 >9.2.1. Bookmarklets</A
944 > Here are some bookmarklets to allow you to easily access a
948 > version of this page. They are designed for MS Internet
949 Explorer, but should work equally well in Netscape, Mozilla, and other
950 browsers which support JavaScript. They are designed to run directly from
951 your bookmarks - not by clicking the links below (although that will work for
954 > To save them, right-click the link and choose <SPAN
956 >"Add to Favorites"</SPAN
960 >"Add Bookmark"</SPAN
961 > (Netscape). You will get a warning that
964 >"may not be safe"</SPAN
965 > - just click OK. Then you can run the
966 Bookmarklet directly from your favourites/bookmarks. For even faster access,
967 you can put them on the <SPAN
970 > bar (IE) or the <SPAN
974 > (Netscape), and run them with a single click. </P
982 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y&set=enabled','ijbstatus','width=250,height=100,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
991 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y&set=disabled','ijbstatus','width=250,height=100,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
1000 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y&set=toggle','ijbstatus','width=250,height=100,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
1003 > (Toggles between enabled and disabled)
1009 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y','ijbstatus','width=250,height=2,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
1011 >View Privoxy Status</A
1018 > Credit: The site which gave me the general idea for these bookmarklets is
1020 HREF="http://www.bookmarklets.com"
1022 >www.bookmarklets.com</A
1024 have more information about bookmarklets. </P
1033 >9.3. Anatomy of an Action</A
1046 > to any given URL can be complex, and it is not always
1047 easy to understand what is happening. Sometimes we need to be able to
1055 doing. Especially, if something <SPAN
1059 is causing us a problem inadvertently. It can be a little daunting to look at
1060 the actions and filters files themselves, since they tend to be filled with
1063 >"regular expressions"</SPAN
1064 > whose consequences are not always
1070 HREF="http://config.privoxy.org/show-url-info"
1072 >http://config.privoxy.org/show-url-info</A
1074 page that can show us very specifically how <SPAN
1078 are being applied to any given URL. This is a big help for troubleshooting.
1081 > First, enter one URL (or partial URL) at the prompt, and then
1086 how the current configuration will handle it. This will not
1087 help with filtering effects from the <TT
1091 also will not tell you about any other URLs that may be embedded within the
1092 URL you are testing. For instance, images such as ads are expressed as URLs
1093 within the raw page source of HTML pages. So you will only get info for the
1094 actual URL that is pasted into the prompt area -- not any sub-URLs. If you
1095 want to know about embedded URLs like ads, you will have to dig those out of
1096 the HTML source. Use your browser's <SPAN
1098 >"View Page Source"</SPAN
1100 for this. Or right click on the ad, and grab the URL.</P
1102 > Let's look at an example, <A
1103 HREF="http://google.com"
1107 one section at a time:</P
1117 > System default actions:
1119 { -add-header -block -deanimate-gifs -downgrade -fast-redirects -filter
1120 -hide-forwarded -hide-from -hide-referer -hide-user-agent -image
1121 -image-blocker -limit-connect -no-compression -no-cookies-keep
1122 -no-cookies-read -no-cookies-set -no-popups -vanilla-wafer -wafer }
1130 > This is the top section, and only tells us of the compiled-in defaults. This
1131 is basically what <SPAN
1138 > defined, i.e. it does nothing. Every action
1139 is disabled. This is not particularly informative for our purposes here. OK,
1150 > Matches for http://google.com:
1152 { -add-header -block +deanimate-gifs -downgrade +fast-redirects
1153 +filter{html-annoyances} +filter{js-annoyances} +filter{no-popups}
1154 +filter{webbugs} +filter{nimda} +filter{banners-by-size} +filter{hal}
1155 +filter{fun} +hide-forwarded +hide-from{block} +hide-referer{forge}
1156 -hide-user-agent -image +image-blocker{blank} +no-compression
1157 +no-cookies-keep -no-cookies-read -no-cookies-set +no-popups
1158 -vanilla-wafer -wafer }
1161 { -no-cookies-keep -no-cookies-read -no-cookies-set }
1173 > This is much more informative, and tells us how we have defined our
1177 >, and which ones match for our example,
1181 >. The first grouping shows our default
1182 settings, which would apply to all URLs. If you look at your <SPAN
1186 file, this would be the section just below the <SPAN
1190 near the top. This applies to all URLs as signified by the single forward
1197 > These are the default actions we have enabled. But we can define additional
1198 actions that would be exceptions to these general rules, and then list
1199 specific URLs that these exceptions would apply to. Last match wins.
1200 Just below this then are two explicit matches for <SPAN
1202 >".google.com"</SPAN
1204 The first is negating our various cookie blocking actions (i.e. we will allow
1205 cookies here). The second is allowing <SPAN
1207 >"fast-redirects"</SPAN
1209 that there is a leading dot here -- <SPAN
1211 >".google.com"</SPAN
1213 match any hosts and sub-domains in the google.com domain as well, such as
1216 >"www.google.com"</SPAN
1217 >. So, apparently, we have these actions defined
1218 somewhere in the lower part of our actions file, and
1222 > is referenced in these sections. </P
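The "last match wins" rule can be pictured as a series of dictionary updates. The sketch below is only a model of the idea and not Privoxy's implementation: parse_section and final_actions are hypothetical helpers, the action lists are shortened versions of the ones shown above, and the content of the second .google.com section is inferred from the description rather than copied from the output. Actions such as filter, which may appear several times with different parameters, are left out because this simple model would keep only the last one.

```python
def parse_section(section):
    """Map '+name{param}' to its parameter (or True) and '-name' to False."""
    actions = {}
    for token in section.split():
        sign, body = token[0], token[1:]
        name, _, param = body.partition("{")
        if sign == "+":
            actions[name] = param.rstrip("}") or True
        elif sign == "-":
            actions[name] = False
        else:
            raise ValueError("action must start with + or -: " + token)
    return actions


def final_actions(sections):
    """Apply sections in file order; a later section overrides an earlier one."""
    result = {}
    for section in sections:
        result.update(parse_section(section))
    return result


SECTIONS = [
    # Shortened default section, which applies to all URLs.
    "+fast-redirects +hide-referer{forge} +no-cookies-keep "
    "-no-cookies-read -no-cookies-set",
    # First .google.com section: allow cookies.
    "-no-cookies-keep -no-cookies-read -no-cookies-set",
    # Second .google.com section (assumed content, see the note above).
    "-fast-redirects",
]

if __name__ == "__main__":
    for name, value in sorted(final_actions(SECTIONS).items()):
        print(name, value)
```

Reordering the entries in SECTIONS changes the outcome, which is the practical meaning of the rule: an exception has to come after the general section it overrides.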
1224 > And now we pull it all together in the bottom section and summarize how
1228 > is applying all its <SPAN
1245 > Final results:
1247 -add-header -block -deanimate-gifs -downgrade -fast-redirects
1248 +filter{html-annoyances} +filter{js-annoyances} +filter{no-popups}
1249 +filter{webbugs} +filter{nimda} +filter{banners-by-size} +filter{hal}
1250 +filter{fun} +hide-forwarded +hide-from{block} +hide-referer{forge}
1251 -hide-user-agent -image +image-blocker{blank} -limit-connect +no-compression
1252 -no-cookies-keep -no-cookies-read -no-cookies-set +no-popups -vanilla-wafer
1261 > Now another example, <SPAN
1263 >"ad.doubleclick.net"</SPAN
1274 > { +block +image }
1289 > We'll just show the interesting part here, the explicit matches. It is
1290 matched three different times, each time as a <SPAN
1292 >"+block +image"</SPAN
1294 which is the expanded form of one of our aliases that had been defined as:
1297 >"+imageblock"</SPAN
1301 > are defined in the
1302 first section of the actions file and typically used to combine more
1303 than one action.)</P
1305 > Any one of these would have done the trick and blocked this as an unwanted
1306 image. This is redundant, since the last case effectively
1307 would also cover the first. No point in taking chances with these guys
1308 though ;-) Note that if you want an ad or obnoxious
1309 URL to be invisible, it should be defined as <SPAN
1311 >"ad.doubleclick.net"</SPAN
1313 is done here -- as both a <SPAN
1323 >. The custom alias <SPAN
1325 >"+imageblock"</SPAN
1329 > One last example. Let's try <SPAN
1331 >"http://www.rhapsodyk.net/adsl/HOWTO/"</SPAN
1333 This one is giving us problems. We are getting a blank page. Hmmm...</P
1343 > Matches for http://www.rhapsodyk.net/adsl/HOWTO/:
1345 { -add-header -block +deanimate-gifs -downgrade +fast-redirects
1346 +filter{html-annoyances} +filter{js-annoyances} +filter{no-popups}
1347 +filter{webbugs} +filter{nimda} +filter{banners-by-size} +filter{hal}
1348 +filter{fun} +hide-forwarded +hide-from{block} +hide-referer{forge}
1349 -hide-user-agent -image +image-blocker{blank} +no-compression
1350 +no-cookies-keep -no-cookies-read -no-cookies-set +no-popups
1351 -vanilla-wafer -wafer }
1370 we did not want this at all! Now we see why we get the blank page. We could
1371 now add a new action below this that explicitly does <I
1375 block (-block) pages with <SPAN
1378 >. There are various ways to
1379 handle such exceptions. Example:</P
1398 > Now the page displays ;-) Be sure to flush your browser's caches when
1399 making such changes. Or, try using <TT
1404 > But now what about a situation where we get no explicit matches like
1424 > That actually was very telling and pointed us quickly to where the problem
1425 was. If you don't get this kind of match, then it means one of the default
1426 rules in the first section is causing the problem. This would require some
1427 guesswork, and maybe a little trial and error to isolate the offending rule.
1428 One likely cause would be one of the <SPAN
1432 adding the URL for the site to one of the aliases that turn off <SPAN
1447 .worldpay.com # for quietpc.com
1467 >"{ -filter -no-cookies -no-cookies-keep }"</SPAN
1469 your own exception to negate filtering: </P
1491 > is an alias that disables most actions. This can be
1492 used as a last resort for problem sites. Remember to flush caches! If this
1493 still does not work, you will have to go through the remaining actions one by
1494 one to find which one(s) is causing the problem.</P