7 CONTENT="Modular DocBook HTML Stylesheet Version 1.64
10 TITLE="Privoxy User Manual"
11 HREF="index.html"><LINK
14 HREF="seealso.html"><LINK
17 HREF="../p_doc.css"></HEAD
36 >Privoxy User Manual</TH
76 >9.1. Regular Expressions</A
84 >"regular expressions"</SPAN
86 in various config files, assuming support for <SPAN
90 Compatible Regular Expressions) is compiled in, which is the default. Such
91 configuration directives do not require regular expressions, but they can be
92 used to increase flexibility by matching a pattern with wild-cards against
95 > If you are reading this, you probably don't understand what <SPAN
99 > are, or what they can do. So this will be a very brief
100 introduction only. A full explanation would require a book ;-)</P
104 >"Regular expressions"</SPAN
105 > are a way of matching one character
106 expression against another to see whether it matches. One of the
110 > is a literal string of readable characters
111 (letters, numbers, etc.), and the other is a complex string of literal
112 characters combined with wild-cards, and other special characters, called
113 meta-characters. The <SPAN
115 >"meta-characters"</SPAN
116 > have special meanings and
117 are used to build the complex pattern to be matched against. Perl Compatible
118 Regular Expressions is an enhanced form of the regular expression language
119 with backward compatibility.</P
121 > To make a simple analogy, we do something similar whenever we use wild-card
122 characters to list files with the <B
129 > matches all filenames. The <SPAN
133 character here is the asterisk which matches any and all characters. We can be
134 more specific and use <TT
137 > to match just individual
140 >"dir file?.text"</SPAN
148 >, etc. We are pattern
149 matching, using a similar technique to <SPAN
151 >"regular expressions"</SPAN
154 > Regular expressions do essentially the same thing, but are much, much more
155 powerful. There are many more <SPAN
157 >"special characters"</SPAN
159 building complex patterns however. Let's look at a few of the common ones,
160 and then some examples:</P
171 > - Matches any single character, e.g. <SPAN
204 > - The preceding character or expression is matched ZERO or ONE
222 > - The preceding character or expression is matched ONE or MORE
240 > - The preceding character or expression is matched ZERO or MORE
261 > character denotes that
262 the following character should be taken literally. This is used where one of the
263 special characters (e.g. <SPAN
266 >) needs to be taken literally and
267 not as a special meta-character.
284 > - Characters enclosed in brackets will be matched if
285 any of the enclosed characters are encountered.
302 > - parentheses are used to group a sub-expression,
303 or multiple sub-expressions.
323 > character works like an
327 > conditional statement. A match is successful if the
328 sub-expression on either side of <SPAN
347 >s/string1/string2/g</I
348 > - This is used to rewrite strings of text.
352 > is replaced by <SPAN
364 > These are just some of the ones you are likely to use when matching URLs with
368 >, and are a long way from a definitive
369 list. This is enough to get us started with a few simple examples which may
370 be more illuminating:</P
379 that uses the common combination of <SPAN
386 denote any character, zero or more times. In other words, any string at all.
387 So we start with a literal forward slash, then our regular expression pattern
391 >) another literal forward slash, the string
395 >, another forward slash, and lastly another
400 a directory path here. This will match any file with the path that has a
401 directory named <SPAN
408 any characters, and this could conceivably be more forward slashes, so it
409 might expand into a much longer looking path. For example, this could match:
412 >"/eye/hate/spammers/banners/annoy_me_please.gif"</SPAN
416 >"/banners/annoying.html"</SPAN
417 >, or an almost infinite number of other
418 possible combinations, as long as it has <SPAN
424 > And now something a little more complex:</P
430 >/.*/adv((er)?ts?|ertis(ing|ements?))?/</TT
433 We have several literal forward slashes again (<SPAN
437 building another expression that is a file path statement. We have another
441 >, so we are matching against any conceivable sub-path, as long as
442 it matches our expression. The only true literal that <I
446 > our pattern is <SPAN
450 the forward slashes. What comes after the <SPAN
454 interesting part. </P
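<P>To see the pattern at work before taking it apart, here is a minimal sketch that tries it against a handful of paths. It uses Python's <TT>re</TT> module as a stand-in for PCRE. The <TT>/some/site/</TT> prefix, the word list, and the helper name <TT>path_matches</TT> are made up for illustration, and anchoring the match at the start of the path is an assumption of this sketch rather than a description of Privoxy's matcher.</P>

```python
import re

# Path pattern quoted in the text above. Python's re module stands in for
# PCRE here; the constructs used in this pattern behave the same in both.
ADV = re.compile(r"/.*/adv((er)?ts?|ertis(ing|ements?))?/")


def path_matches(path):
    """Return True if the pattern matches from the start of the path.

    Anchoring at the start only is an assumption made for this sketch.
    """
    return ADV.match(path) is not None


if __name__ == "__main__":
    # Hypothetical directory names, wrapped in a made-up "/some/site/" prefix.
    words = ["adv", "advert", "adverts", "advertising",
             "advertisement", "advertisements", "advertizements"]
    for word in words:
        path = "/some/site/" + word + "/"
        print(path, "matches" if path_matches(path) else "no match")
```

<P>Every spelling except the one with a "z" should be reported as a match, because the pattern only spells out the "ertis" variant.</P>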
459 > means the preceding expression (either a
460 literal character or anything grouped with <SPAN
464 can exist or not, since this means either zero or one match. So
467 >"((er)?ts?|ertis(ing|ements?))"</SPAN
468 > is optional, as are the
469 individual sub-expressions: <SPAN
475 >"(ing|ements?)"</SPAN
486 >. We have two of those. For instance,
489 >"(ing|ements?)"</SPAN
490 >, can expand to match either <SPAN
500 >. What is being done here is an
501 attempt at matching as many variations of <SPAN
503 >"advertisement"</SPAN
505 similar, as possible. So this would expand to match just <SPAN
521 >"advertisement"</SPAN
525 >"advertisements"</SPAN
526 >. You get the idea. But it would not match
529 >"advertizements"</SPAN
533 >). We could fix that by
534 changing our regular expression to:
537 >"/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/"</SPAN
538 >, which would then match
545 >/.*/advert[0-9]+\.(gif|jpe?g)</TT
548 another path statement with forward slashes. Anything in the square brackets
552 > can be matched. This is using <SPAN
556 shorthand expression to mean any digit zero through nine. It is the same as
560 >. So any digit matches. The <SPAN
564 means one or more of the preceding expression must be included. The preceding
565 expression here is what is in the square brackets -- in this case, any digit
566 zero through nine. Then, at the end, we have a grouping: <SPAN
570 This includes a <SPAN
573 >, so this needs to match the expression on
574 either side of that bar character also. A simple <SPAN
577 > on one side, and the other
578 side will in turn match either <SPAN
588 > means the letter <SPAN
592 can be matched once or not at all. So we are building an expression here to
593 match a GIF or JPEG image file. It must include the literal
597 >, then one or more digits, and a <SPAN
601 (which is now a literal, and not a special character, since it is escaped
605 >), and lastly either <SPAN
615 >. Some possible matches would
618 >"//advert1.jpg"</SPAN
622 >"/nasty/ads/advert1234.gif"</SPAN
626 >"/banners/from/hell/advert99.jpg"</SPAN
627 >. It would not match
631 > (no leading slash), or
634 >"/adverts232.jpg"</SPAN
635 > (the expression does not include an
641 >"/advert1.jsp"</SPAN
646 in the expression anywhere).</P
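<P>The matches and non-matches listed above can be checked with a short script. This is only a sketch: Python's <TT>re</TT> module stands in for PCRE, the helper name <TT>is_advert_image</TT> is invented for the example, and matching from the start of the path is an assumption.</P>

```python
import re

# Pattern from the text: a path, "advert", one or more digits, a literal
# dot, then "gif", "jpg" or "jpeg".
ADVERT_IMAGE = re.compile(r"/.*/advert[0-9]+\.(gif|jpe?g)")


def is_advert_image(path):
    """Return True if the path matches the pattern from its first character.

    Start-anchored matching is an assumption of this sketch.
    """
    return ADVERT_IMAGE.match(path) is not None


if __name__ == "__main__":
    # Paths taken from the surrounding text.
    paths = [
        "//advert1.jpg",
        "/nasty/ads/advert1234.gif",
        "/banners/from/hell/advert99.jpg",
        "/adverts232.jpg",   # an "s" sits where a digit is required
        "/advert1.jsp",      # "jsp" is not one of the allowed endings
    ]
    for path in paths:
        print(path, "matches" if is_advert_image(path) else "no match")
```

<P>Note that <TT>[0-9]</TT> accepts zero as well as one through nine, so a name such as <TT>advert0.gif</TT> would also qualify.</P>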
652 >s/microsoft(?!.com)/MicroSuck/i</TT
655 a substitution. <SPAN
658 > will replace any occurrence of
665 > at the end of the expression
666 means ignore case. The <SPAN
670 the match should fail if <SPAN
677 >. In other words, this acts like a <SPAN
681 modifier. In case this is a hyperlink, we don't want to break it ;-).</P
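<P>A rough Python equivalent of this substitution is sketched below. The function name <TT>rewrite</TT> is invented for the example, and <TT>re.sub</TT> stands in for the Perl-style <TT>s/.../.../</TT> syntax. In Perl semantics a substitution without a trailing <TT>g</TT> replaces only the first match, which <TT>count=1</TT> mirrors here.</P>

```python
import re


def rewrite(text):
    """Apply the substitution from the text to one string.

    flags=re.IGNORECASE plays the role of the trailing "i" modifier, and
    (?!.com) is the negative lookahead: the match is rejected when
    "microsoft" is followed by any one character and then "com".
    count=1 mirrors the missing "g" modifier (first match only).
    """
    return re.sub(r"microsoft(?!.com)", "MicroSuck", text,
                  count=1, flags=re.IGNORECASE)


if __name__ == "__main__":
    print(rewrite("I use Microsoft Windows"))
    print(rewrite("http://www.microsoft.com/"))
```

<P>The first string is rewritten and the second is left alone, so a hyperlink to the site keeps working. Because the dot in the lookahead is not escaped, it stands for any character, not only a literal period.</P>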
683 > We are barely scratching the surface of regular expressions here so that you
684 can understand the default <SPAN
688 configuration files, and maybe use this knowledge to customize your own
689 installation. There is much, much more that can be done with regular
690 expressions. Now that you know enough to get started, you can learn more on
693 > More reading on Perl Compatible Regular expressions:
695 HREF="http://www.perldoc.com/perl5.6/pod/perlre.html"
697 >http://www.perldoc.com/perl5.6/pod/perlre.html</A
709 >'s Internal Pages</A
715 > proxies each requested
716 web page, it is easy for <SPAN
720 trap certain special URLs. In this way, we can talk directly to
725 configured, see how our rules are being applied, change these
726 rules and other configuration options, and even turn
730 > filtering off, all with
731 a web browser. </P
733 > The URLs listed below are the special ones that allow direct access
741 > must be running to access these. If
742 not, you will get a friendly error message. Internet access is not
761 HREF="http://config.privoxy.org/"
763 >http://config.privoxy.org/</A
768 > Alternately, this may be reached at <A
773 variation may not work as reliably as the above in some configurations.
779 Show information about the current configuration:
789 HREF="http://config.privoxy.org/show-status"
791 >http://config.privoxy.org/show-status</A
799 Show the source code version numbers:
809 HREF="http://config.privoxy.org/show-version"
811 >http://config.privoxy.org/show-version</A
819 Show the client's request headers:
829 HREF="http://config.privoxy.org/show-request"
831 >http://config.privoxy.org/show-request</A
839 Show which actions apply to a URL and why:
849 HREF="http://config.privoxy.org/show-url-info"
851 >http://config.privoxy.org/show-url-info</A
859 Toggle Privoxy on or off. In this case, <SPAN
863 to run, but only as a pass-through proxy, with no actions taking place:
873 HREF="http://config.privoxy.org/toggle"
875 >http://config.privoxy.org/toggle</A
880 > Short cuts. Turn off, then on:
890 HREF="http://config.privoxy.org/toggle?set=disable"
892 >http://config.privoxy.org/toggle?set=disable</A
904 HREF="http://config.privoxy.org/toggle?set=enable"
906 >http://config.privoxy.org/toggle?set=enable</A
914 Edit the actions list file:
924 HREF="http://config.privoxy.org/edit-actions"
926 >http://config.privoxy.org/edit-actions</A
934 > These may be bookmarked for quick reference. </P
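<P>The same pages can also be requested from a script instead of a browser. The sketch below builds the URLs listed above and, optionally, fetches one through the proxy. The helper names are invented for the example, and the proxy address <TT>127.0.0.1:8118</TT> is an assumption that has to be adjusted to wherever your Privoxy is actually listening.</P>

```python
import urllib.parse
import urllib.request

CONFIG_BASE = "http://config.privoxy.org/"


def internal_url(page="", **params):
    """Build the URL of an internal page, e.g. internal_url("toggle", set="enable")."""
    query = urllib.parse.urlencode(params)
    return CONFIG_BASE + page + ("?" + query if query else "")


def fetch_via_privoxy(url, proxy="http://127.0.0.1:8118"):
    """Fetch a page through a running Privoxy and return its HTML.

    The default proxy address is an assumption; the request only reaches
    the internal pages if it really passes through Privoxy.
    """
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy}))
    with opener.open(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only prints the URLs; call fetch_via_privoxy() once Privoxy is running.
    print(internal_url("show-status"))
    print(internal_url("toggle", set="disable"))
```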
941 >9.2.1. Bookmarklets</A
944 > Below are some <SPAN
946 >"bookmarklets"</SPAN
947 > to allow you to easily access a
951 > version of some of <SPAN
955 special pages. They are designed for MS Internet Explorer, but should work
956 equally well in Netscape, Mozilla, and other browsers which support
957 JavaScript. They are designed to run directly from your bookmarks - not by
958 clicking the links below (although that should work for testing).</P
960 > To save them, right-click the link and choose <SPAN
962 >"Add to Favorites"</SPAN
966 >"Add Bookmark"</SPAN
967 > (Netscape). You will get a warning that
970 >"may not be safe"</SPAN
971 > - just click OK. Then you can run the
972 Bookmarklet directly from your favourites/bookmarks. For even faster access,
973 you can put them on the <SPAN
976 > bar (IE) or the <SPAN
980 > (Netscape), and run them with a single click. </P
988 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y&set=enabled','ijbstatus','width=250,height=100,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
997 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y&set=disabled','ijbstatus','width=250,height=100,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
1006 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y&set=toggle','ijbstatus','width=250,height=100,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
1009 > (Toggles between enabled and disabled)
1015 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y','ijbstatus','width=250,height=2,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
1017 >View Privoxy Status</A
1024 HREF="javascript:w=Math.floor(screen.width/2);h=Math.floor(screen.height*0.9);void(window.open('http://www.privoxy.org/actions','Feedback','screenx='+w+',width='+w+',height='+h+',scrollbars=yes,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
1026 >Actions file feedback system</A
1033 > Credit: The site which gave me the general idea for these bookmarklets is
1035 HREF="http://www.bookmarklets.com"
1037 >www.bookmarklets.com</A
1039 have more information about bookmarklets. </P
1048 >9.3. Anatomy of an Action</A
1061 > to any given URL can be complex, and it is not always
1062 easy to understand what is happening. And sometimes we need to be able to
1070 doing, especially if something <SPAN
1074 is causing us a problem inadvertently. It can be a little daunting to look at
1075 the actions and filters files themselves, since they tend to be filled with
1078 >"regular expressions"</SPAN
1079 > whose consequences are not always
1085 HREF="http://config.privoxy.org/show-url-info"
1087 >http://config.privoxy.org/show-url-info</A
1089 page that can show us very specifically how <SPAN
1093 are being applied to any given URL. This is a big help for troubleshooting.
1096 > First, enter one URL (or partial URL) at the prompt, and then
1101 how the current configuration will handle it. This will not
1102 help with filtering effects from the <TT
1106 also will not tell you about any other URLs that may be embedded within the
1107 URL you are testing. For instance, images such as ads are expressed as URLs
1108 within the raw page source of HTML pages. So you will only get info for the
1109 actual URL that is pasted into the prompt area -- not any sub-URLs. If you
1110 want to know about embedded URLs like ads, you will have to dig those out of
1111 the HTML source. Use your browser's <SPAN
1113 >"View Page Source"</SPAN
1115 for this. Or right-click on the ad, and grab the URL.</P
1117 > Let's look at an example, <A
1118 HREF="http://google.com"
1122 one section at a time:</P
1132 > System default actions:
1134 { -add-header -block -deanimate-gifs -downgrade -fast-redirects -filter
1135 -hide-forwarded -hide-from -hide-referer -hide-user-agent -image
1136 -image-blocker -limit-connect -no-compression -no-cookies-keep
1137 -no-cookies-read -no-cookies-set -no-popups -vanilla-wafer -wafer }
1145 > This is the top section, and it only tells us about the compiled-in defaults. This
1146 is basically what <SPAN
1153 > defined, i.e. it does nothing. Every action
1154 is disabled. This is not particularly informative for our purposes here. OK,
1165 > Matches for http://google.com:
1167 { -add-header -block +deanimate-gifs -downgrade +fast-redirects
1168 +filter{html-annoyances} +filter{js-annoyances} +filter{no-popups}
1169 +filter{webbugs} +filter{nimda} +filter{banners-by-size} +filter{hal}
1170 +filter{fun} +hide-forwarded +hide-from{block} +hide-referer{forge}
1171 -hide-user-agent -image +image-blocker{blank} +no-compression
1172 +no-cookies-keep -no-cookies-read -no-cookies-set +no-popups
1173 -vanilla-wafer -wafer }
1176 { -no-cookies-keep -no-cookies-read -no-cookies-set }
1188 > This is much more informative, and tells us how we have defined our
1192 >, and which ones match for our example,
1196 >. The first grouping shows our default
1197 settings, which would apply to all URLs. If you look at your <SPAN
1201 file, this would be the section just below the <SPAN
1205 near the top. This applies to all URLs as signified by the single forward
1212 > These are the default actions we have enabled. But we can define additional
1213 actions that would be exceptions to these general rules, and then list
1214 specific URLs that these exceptions would apply to. Last match wins.
1215 Just below this then are two explicit matches for <SPAN
1217 >".google.com"</SPAN
1219 The first is negating our various cookie blocking actions (i.e. we will allow
1220 cookies here). The second is allowing <SPAN
1222 >"fast-redirects"</SPAN
1224 that there is a leading dot here -- <SPAN
1226 >".google.com"</SPAN
1228 match any hosts and sub-domains in the google.com domain as well, such as
1231 >"www.google.com"</SPAN
1232 >. So, apparently, we have these actions defined
1233 somewhere in the lower part of our actions file, and
1237 > is referenced in these sections. </P
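<P>The "last match wins" rule can be pictured as applying each matching section in order, with later settings overwriting earlier ones. The sketch below is a simplified model, not Privoxy's implementation: the function name is invented, only a few actions from the listing above are included, the second section is inferred from the final results, and multi-valued actions such as <TT>filter</TT> are left out.</P>

```python
import re

# One action token: a sign, a name, and an optional {parameter}.
ACTION = re.compile(r"([+-])([A-Za-z][\w-]*)(?:\{([^}]*)\})?")


def apply_sections(sections):
    """Apply action sections in order; a later setting replaces an earlier one.

    Returns {name: (enabled, parameter)}. This is a simplified model that
    keeps one value per action name (so it does not handle "filter").
    """
    state = {}
    for section in sections:
        for sign, name, param in ACTION.findall(section):
            state[name] = (sign == "+", param or None)
    return state


if __name__ == "__main__":
    sections = [
        # Excerpt of the default section that applies to all URLs.
        "-block +fast-redirects +hide-from{block} "
        "+no-cookies-keep -no-cookies-read -no-cookies-set",
        # First explicit match for .google.com (cookies allowed).
        "-no-cookies-keep -no-cookies-read -no-cookies-set",
        # Second explicit match, inferred from the final results.
        "-fast-redirects",
    ]
    for name, (enabled, param) in sorted(apply_sections(sections).items()):
        suffix = "{" + param + "}" if param else ""
        print(("+" if enabled else "-") + name + suffix)
```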
1239 > And now we pull it all together in the bottom section and summarize how
1243 > is applying all its <SPAN
1260 > Final results:
1262 -add-header -block -deanimate-gifs -downgrade -fast-redirects
1263 +filter{html-annoyances} +filter{js-annoyances} +filter{no-popups}
1264 +filter{webbugs} +filter{nimda} +filter{banners-by-size} +filter{hal}
1265 +filter{fun} +hide-forwarded +hide-from{block} +hide-referer{forge}
1266 -hide-user-agent -image +image-blocker{blank} -limit-connect +no-compression
1267 -no-cookies-keep -no-cookies-read -no-cookies-set +no-popups -vanilla-wafer
1276 > Now another example, <SPAN
1278 >"ad.doubleclick.net"</SPAN
1289 > { +block +image }
1304 > We'll just show the interesting part here, the explicit matches. It is
1305 matched three different times, each as a <SPAN
1307 >"+block +image"</SPAN
1309 which is the expanded form of one of our aliases that had been defined as:
1312 >"+imageblock"</SPAN
1316 > are defined in the
1317 first section of the actions file and typically used to combine more
1318 than one action.)</P
1320 > Any one of these would have done the trick and blocked this as an unwanted
1321 image. This is redundant, since the last case effectively
1322 would also cover the first. No point in taking chances with these guys
1323 though ;-) Note that if you want an ad or obnoxious
1324 URL to be invisible, it should be defined as <SPAN
1326 >"ad.doubleclick.net"</SPAN
1328 is done here -- as both a <SPAN
1338 >. The custom alias <SPAN
1340 >"+imageblock"</SPAN
1344 > One last example. Let's try <SPAN
1346 >"http://www.rhapsodyk.net/adsl/HOWTO/"</SPAN
1348 This one is giving us problems. We are getting a blank page. Hmmm...</P
1358 > Matches for http://www.rhapsodyk.net/adsl/HOWTO/:
1360 { -add-header -block +deanimate-gifs -downgrade +fast-redirects
1361 +filter{html-annoyances} +filter{js-annoyances} +filter{no-popups}
1362 +filter{webbugs} +filter{nimda} +filter{banners-by-size} +filter{hal}
1363 +filter{fun} +hide-forwarded +hide-from{block} +hide-referer{forge}
1364 -hide-user-agent -image +image-blocker{blank} +no-compression
1365 +no-cookies-keep -no-cookies-read -no-cookies-set +no-popups
1366 -vanilla-wafer -wafer }
1385 we did not want this at all! Now we see why we get the blank page. We could
1386 now add a new action below this that explicitly does <I
1390 block (-block) pages with <SPAN
1393 >. There are various ways to
1394 handle such exceptions. Example:</P
1413 > Now the page displays ;-) Be sure to flush your browser's caches when
1414 making such changes. Or, try using <TT
1419 > But now what about a situation where we get no explicit matches like
1439 > That actually was very telling and pointed us quickly to where the problem
1440 was. If you don't get this kind of match, then it means one of the default
1441 rules in the first section is causing the problem. This would require some
1442 guesswork, and maybe a little trial and error to isolate the offending rule.
1443 One likely cause would be one of the <SPAN
1447 adding the URL for the site to one of the aliases that turn off <SPAN
1462 .worldpay.com # for quietpc.com
1482 >"{ -filter -no-cookies -no-cookies-keep }"</SPAN
1484 your own exception to negate filtering: </P
1506 > is an alias that disables most actions. This can be
1507 used as a last resort for problem sites. Remember to flush caches! If this
1508 still does not work, you will have to go through the remaining actions one by
1509 one to find which one(s) is causing the problem.</P