7 CONTENT="Modular DocBook HTML Stylesheet Version 1.64
10 TITLE="Privoxy User Manual"
11 HREF="index.html"><LINK
14 HREF="seealso.html"><LINK
17 HREF="../p_doc.css"></HEAD
36 >Privoxy User Manual</TH
76 >9.1. Regular Expressions</A
84 >"regular expressions"</SPAN
86 in various config files. Assuming support for <SPAN
90 Compatible Regular Expressions) is compiled in, which is the default. Such
91 configuration directives do not require regular expressions, but they can be
92 used to increase flexibility by matching a pattern with wild-cards against
95 > If you are reading this, you probably don't understand what <SPAN
99 > are, or what they can do. So this will be a very brief
100 introduction only. A full explanation would require a book ;-)</P
104 >"Regular expressions"</SPAN
105 > is a way of matching one character
106 expression against another to see if it matches or not. One of the
110 > is a literal string of readable characters
111 (letters, numbers, etc.), and the other is a complex string of literal
112 characters combined with wild-cards, and other special characters, called
113 meta-characters. The <SPAN
115 >"meta-characters"</SPAN
116 > have special meanings and
117 are used to build the complex pattern to be matched against. Perl Compatible
118 Regular Expressions is an enhanced form of the regular expression language
119 with backward compatibility.</P
121 > To make a simple analogy, we do something similar when we use wild-card
122 characters when listing files with the <B
129 > matches all filenames. The <SPAN
133 character here is the asterisk which matches any and all characters. We can be
134 more specific and use <TT
137 > to match just individual
140 >"dir file?.text"</SPAN
148 >, etc. We are pattern
149 matching, using a similar technique to <SPAN
151 >"regular expressions"</SPAN
154 > Regular expressions do essentially the same thing, but are much, much more
155 powerful. There are many more <SPAN
157 >"special characters"</SPAN
159 building complex patterns however. Let's look at a few of the common ones,
160 and then some examples:</P
172 > - Matches any single character, e.g. <SPAN
207 > - The preceding character or expression is matched ZERO or ONE
227 > - The preceding character or expression is matched ONE or MORE
247 > - The preceding character or expression is matched ZERO or MORE
270 > character denotes that
271 the following character should be taken literally. This is used where one of the
272 special characters (e.g. <SPAN
275 >) needs to be taken literally and
276 not as a special meta-character. Example: <SPAN
278 >"example\.com"</SPAN
280 sure the period is recognized only as a period (and not expanded to its
281 metacharacter meaning of any single character).
300 > - Characters enclosed in brackets will be matched if
301 any of the enclosed characters are encountered. For instance, <SPAN
305 matches any numeric digit (zero through nine). As an example, we can combine
309 > to match any digit one or more times: <SPAN
331 > - parentheses are used to group a sub-expression,
332 or multiple sub-expressions.
354 > character works like an
358 > conditional statement. A match is successful if the
359 sub-expression on either side of <SPAN
362 > matches. As an example:
365 >"/(this|that) example/"</SPAN
366 > uses grouping and the bar character
367 and would match either <SPAN
369 >"this example"</SPAN
392 >s/string1/string2/g</I
393 > - This is used to rewrite strings of text.
397 > is replaced by <SPAN
401 example. There must of course be a match on <SPAN
413 > These are just some of the ones you are likely to use when matching URLs with
417 >, and is a long way from a definitive
418 list. This is enough to get us started with a few simple examples which may
419 be more illuminating:</P
428 that uses the common combination of <SPAN
435 denote any character, zero or more times. In other words, any string at all.
436 So we start with a literal forward slash, then our regular expression pattern
440 >) another literal forward slash, the string
444 >, another forward slash, and lastly another
449 a directory path here. This will match any file whose path has a
450 directory named <SPAN
457 any characters, and this could conceivably be more forward slashes, so it
458 might expand into a much longer looking path. For example, this could match:
461 >"/eye/hate/spammers/banners/annoy_me_please.gif"</SPAN
465 >"/banners/annoying.html"</SPAN
466 >, or almost an infinite number of other
467 possible combinations, just so it has <SPAN
473 > And now something a little more complex:</P
479 >/.*/adv((er)?ts?|ertis(ing|ements?))?/</TT
482 We have several literal forward slashes again (<SPAN
486 building another expression that is a file path statement. We have another
490 >, so we are matching against any conceivable sub-path, just so
491 it matches our expression. The only true literal that <I
495 > our pattern is <SPAN
499 the forward slashes. What comes after the <SPAN
503 interesting part. </P
508 > means the preceding expression (either a
509 literal character or anything grouped with <SPAN
513 can exist or not, since this means either zero or one match. So
516 >"((er)?ts?|ertis(ing|ements?))"</SPAN
517 > is optional, as are the
518 individual sub-expressions: <SPAN
524 >"(ing|ements?)"</SPAN
535 >. We have two of those. For instance,
538 >"(ing|ements?)"</SPAN
539 >, can expand to match either <SPAN
549 >. What is being done here is an
550 attempt at matching as many variations of <SPAN
552 >"advertisement"</SPAN
554 similar, as possible. So this would expand to match just <SPAN
570 >"advertisement"</SPAN
574 >"advertisements"</SPAN
575 >. You get the idea. But it would not match
578 >"advertizements"</SPAN
582 >). We could fix that by
583 changing our regular expression to:
586 >"/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/"</SPAN
587 >, which would then match
594 >/.*/advert[0-9]+\.(gif|jpe?g)</TT
597 another path statement with forward slashes. Anything in the square brackets
601 > can be matched. This is using <SPAN
605 shorthand expression to mean any digit zero through nine. It is the same as
609 >. So any digit matches. The <SPAN
613 means one or more of the preceding expression must be included. The preceding
614 expression here is what is in the square brackets -- in this case, any digit
615 zero through nine. Then, at the end, we have a grouping: <SPAN
619 This includes a <SPAN
622 >, so this needs to match the expression on
623 either side of that bar character also. A simple <SPAN
626 > on one side, and the other
627 side will in turn match either <SPAN
637 > means the letter <SPAN
641 can be matched once or not at all. So we are building an expression here to
642 match a GIF or JPEG image file. It must include the literal
646 >, then one or more digits, and a <SPAN
650 (which is now a literal, and not a special character, since it is escaped
654 >), and lastly either <SPAN
664 >. Some possible matches would
667 >"//advert1.jpg"</SPAN
671 >"/nasty/ads/advert1234.gif"</SPAN
675 >"/banners/from/hell/advert99.jpg"</SPAN
676 >. It would not match
680 > (no leading slash), or
683 >"/adverts232.jpg"</SPAN
684 > (the expression does not include an
690 >"/advert1.jsp"</SPAN
695 in the expression anywhere).</P
701 >s/microsoft(?!.com)/MicroSuck/i</TT
704 a substitution. <SPAN
707 > will replace any occurrence of
714 > at the end of the expression
715 means ignore case. The <SPAN
719 the match should fail if <SPAN
726 >. In other words, this acts like a <SPAN
730 modifier. In case this is a hyperlink, we don't want to break it ;-).</P
732 > We are barely scratching the surface of regular expressions here so that you
733 can understand the default <SPAN
737 configuration files, and maybe use this knowledge to customize your own
738 installation. There is much, much more that can be done with regular
739 expressions. Now that you know enough to get started, you can learn more on
742 > More reading on Perl Compatible Regular expressions:
744 HREF="http://www.perldoc.com/perl5.6/pod/perlre.html"
746 >http://www.perldoc.com/perl5.6/pod/perlre.html</A
758 >'s Internal Pages</A
764 > proxies each requested
765 web page, it is easy for <SPAN
769 trap certain special URLs. In this way, we can talk directly to
774 configured, see how our rules are being applied, change these
775 rules and other configuration options, and even turn
779 > filtering off, all with
780 a web browser. </P
782 > The URLs listed below are the special ones that allow direct access
790 > must be running to access these. If
791 not, you will get a friendly error message. Internet access is not
810 HREF="http://config.privoxy.org/"
812 >http://config.privoxy.org/</A
817 > Alternatively, this may be reached at <A
822 variation may not work as reliably as the above in some configurations.
828 Show information about the current configuration:
838 HREF="http://config.privoxy.org/show-status"
840 >http://config.privoxy.org/show-status</A
848 Show the source code version numbers:
858 HREF="http://config.privoxy.org/show-version"
860 >http://config.privoxy.org/show-version</A
868 Show the client's request headers:
878 HREF="http://config.privoxy.org/show-request"
880 >http://config.privoxy.org/show-request</A
888 Show which actions apply to a URL and why:
898 HREF="http://config.privoxy.org/show-url-info"
900 >http://config.privoxy.org/show-url-info</A
908 Toggle Privoxy on or off. In this case, <SPAN
912 to run, but only as a pass-through proxy, with no actions taking place:
922 HREF="http://config.privoxy.org/toggle"
924 >http://config.privoxy.org/toggle</A
929 > Shortcuts. Turn off, then on:
939 HREF="http://config.privoxy.org/toggle?set=disable"
941 >http://config.privoxy.org/toggle?set=disable</A
953 HREF="http://config.privoxy.org/toggle?set=enable"
955 >http://config.privoxy.org/toggle?set=enable</A
963 Edit the actions list file:
973 HREF="http://config.privoxy.org/edit-actions"
975 >http://config.privoxy.org/edit-actions</A
983 > These may be bookmarked for quick reference. See next. </P
990 >9.2.1. Bookmarklets</A
993 > Below are some <SPAN
995 >"bookmarklets"</SPAN
996 > to allow you to easily access a
1000 > version of some of <SPAN
1004 special pages. They are designed for MS Internet Explorer, but should work
1005 equally well in Netscape, Mozilla, and other browsers which support
1006 JavaScript. They are designed to run directly from your bookmarks - not by
1007 clicking the links below (although that should work for testing).</P
1009 > To save them, right-click the link and choose <SPAN
1011 >"Add to Favorites"</SPAN
1015 >"Add Bookmark"</SPAN
1016 > (Netscape). You will get a warning that
1019 >"may not be safe"</SPAN
1020 > - just click OK. Then you can run the
1021 Bookmarklet directly from your favorites/bookmarks. For even faster access,
1022 you can put them on the <SPAN
1025 > bar (IE) or the <SPAN
1029 > (Netscape), and run them with a single click. </P
1037 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y&set=enabled','ijbstatus','width=250,height=100,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
1046 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y&set=disabled','ijbstatus','width=250,height=100,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
1055 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y&set=toggle','ijbstatus','width=250,height=100,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
1058 > (Toggles between enabled and disabled)
1064 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y','ijbstatus','width=250,height=2,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
1066 >View Privoxy Status</A
1073 HREF="javascript:w=Math.floor(screen.width/2);h=Math.floor(screen.height*0.9);void(window.open('http://www.privoxy.org/actions','Feedback','screenx='+w+',width='+w+',height='+h+',scrollbars=yes,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
1075 >Actions file feedback system</A
1082 > Credit: The site which gave me the general idea for these bookmarklets is
1084 HREF="http://www.bookmarklets.com"
1086 >www.bookmarklets.com</A
1088 have more information about bookmarklets. </P
1097 >9.3. Chain of Events</A
1100 > Let's take a quick look at the basic sequence of events when a web page is
1101 requested by your browser and <SPAN
1111 > First, the web browser requests a page, and this request is intercepted by
1123 > traps any request for internal CGI
1124 pages (e.g. http://p.p/) and relays these back to the browser.
1129 > If the URL matches a <SPAN
1132 > pattern, then it is blocked
1133 and the banner displayed.
1138 > Untrusted URLs are blocked. If URLs are being added to the
1142 > file, then that is done.
1149 >"+fast-redirects"</SPAN
1150 > is processed, stripping unwanted parts
1151 of the requested web page URL.
1156 > At this point, <SPAN
1159 > relays the request to the
1160 web server, and requests the page (assuming nothing up to this point has
1161 prevented us from getting this far).
1166 > The first few hundred bytes are read from the web server and
1169 >"+kill-popups"</SPAN
1170 > is processed, if enabled.
1178 > applies, the rest of the page is read into
1179 memory and then the filters are processed. Filters are applied in the order they
1180 are specified in the <TT
1184 page, which is now filtered, is then sent by
1193 > As the browser receives the filtered page content, it will read and request any
1194 embedded URLs on the page, e.g. an ad image. As the browser requests these
1195 secondary URLs from whatever server they may be on,
1199 > handles these the same as above, and the process
1200 is repeated for each such URL. Note that a fancy web page may have many, many
1201 such URLs for graphics, frames, etc.
1213 >9.4. Anatomy of an Action</A
1226 > to any given URL can be complex, and it is not always
1227 easy to understand what is happening. And sometimes we need to be able to
1235 doing, especially if something <SPAN
1239 is causing us a problem inadvertently. It can be a little daunting to look at
1240 the actions and filters files themselves, since they tend to be filled with
1243 >"regular expressions"</SPAN
1244 > whose consequences are not always
1247 > One quick test to see if <SPAN
1250 > is causing a problem
1251 or not, is to disable it temporarily. This should be the first troubleshooting
1253 HREF="appendix.html#BOOKMARKLETS"
1254 >the Bookmarklets</A
1255 > section on a quick
1256 and easy way to do this (be sure to flush caches afterwards!).</P
1263 HREF="http://config.privoxy.org/show-url-info"
1265 >http://config.privoxy.org/show-url-info</A
1267 page that can show us very specifically how <SPAN
1271 are being applied to any given URL. This is a big help for troubleshooting.</P
1273 > First, enter one URL (or partial URL) at the prompt, and then
1278 how the current configuration will handle it. This will not
1279 help with filtering effects (i.e. the <SPAN
1286 > file since this is handled very differently
1287 and not so easy to trap! It also will not tell you about any other URLs that
1288 may be embedded within the URL you are testing (i.e. a web page). For
1289 instance, images such as ads are expressed as URLs within the raw page source
1290 of HTML pages. So you will only get info for the actual URL that is pasted
1291 into the prompt area -- not any sub-URLs. If you want to know about embedded
1292 URLs like ads, you will have to dig those out of the HTML source. Use your
1295 >"View Page Source"</SPAN
1296 > option for this. Or right click on
1297 the ad, and grab the URL.</P
1299 > Let's look at an example, <A
1300 HREF="http://google.com"
1304 one section at a time:</P
1314 > System default actions:
1316 { -add-header -block -deanimate-gifs -downgrade -fast-redirects -filter
1317 -hide-forwarded -hide-from -hide-referer -hide-user-agent -image
1318 -image-blocker -limit-connect -no-compression -no-cookies-keep
1319 -no-cookies-read -no-cookies-set -no-popups -vanilla-wafer -wafer }
1327 > This is the top section, and only tells us of the compiled-in defaults. This
1328 is basically what <SPAN
1335 > defined, i.e. it does nothing. Every action
1336 is disabled. This is not particularly informative for our purposes here. OK,
1347 > Matches for http://google.com:
1349 { -add-header -block +deanimate-gifs -downgrade +fast-redirects
1350 +filter{html-annoyances} +filter{js-annoyances} +filter{no-popups}
1351 +filter{webbugs} +filter{nimda} +filter{banners-by-size} +filter{hal}
1352 +filter{fun} +hide-forwarded +hide-from{block} +hide-referer{forge}
1353 -hide-user-agent -image +image-blocker{blank} +no-compression
1354 +no-cookies-keep -no-cookies-read -no-cookies-set +no-popups
1355 -vanilla-wafer -wafer }
1358 { -no-cookies-keep -no-cookies-read -no-cookies-set }
1370 > This is much more informative, and tells us how we have defined our
1374 >, and which ones match for our example,
1378 >. The first grouping shows our default
1379 settings, which would apply to all URLs. If you look at your <SPAN
1383 file, this would be the section just below the <SPAN
1387 near the top. This applies to all URLs as signified by the single forward
1394 > These are the default actions we have enabled. But we can define additional
1395 actions that would be exceptions to these general rules, and then list
1396 specific URLs that these exceptions would apply to. Last match wins.
1397 Just below this then are two explicit matches for <SPAN
1399 >".google.com"</SPAN
1401 The first is negating our various cookie blocking actions (i.e. we will allow
1402 cookies here). The second is allowing <SPAN
1404 >"fast-redirects"</SPAN
1406 that there is a leading dot here -- <SPAN
1408 >".google.com"</SPAN
1410 match any hosts and sub-domains, in the google.com domain also, such as
1413 >"www.google.com"</SPAN
1414 >. So, apparently, we have these actions defined
1415 somewhere in the lower part of our actions file, and
1419 > is referenced in these sections. </P
1421 > And now we pull it all together in the bottom section and summarize how
1425 > is applying all its <SPAN
1442 > Final results:
1444 -add-header -block -deanimate-gifs -downgrade -fast-redirects
1445 +filter{html-annoyances} +filter{js-annoyances} +filter{no-popups}
1446 +filter{webbugs} +filter{nimda} +filter{banners-by-size} +filter{hal}
1447 +filter{fun} +hide-forwarded +hide-from{block} +hide-referer{forge}
1448 -hide-user-agent -image +image-blocker{blank} -limit-connect +no-compression
1449 -no-cookies-keep -no-cookies-read -no-cookies-set +no-popups -vanilla-wafer
1458 > Now another example, <SPAN
1460 >"ad.doubleclick.net"</SPAN
1471 > { +block +image }
1486 > We'll just show the interesting part here, the explicit matches. It is
1487 matched three different times. Each as an <SPAN
1489 >"+block +image"</SPAN
1491 which is the expanded form of one of our aliases that had been defined as:
1494 >"+imageblock"</SPAN
1498 > are defined in the
1499 first section of the actions file and typically used to combine more
1500 than one action.)</P
1502 > Any one of these would have done the trick and blocked this as an unwanted
1503 image. This is redundant, since the last case effectively
1504 would also cover the first. No point in taking chances with these guys
1505 though ;-) Note that if you want an ad or obnoxious
1506 URL to be invisible, it should be defined as <SPAN
1508 >"ad.doubleclick.net"</SPAN
1510 is done here -- as both a <SPAN
1520 >. The custom alias <SPAN
1522 >"+imageblock"</SPAN
1526 > One last example. Let's try <SPAN
1528 >"http://www.rhapsodyk.net/adsl/HOWTO/"</SPAN
1530 This one is giving us problems. We are getting a blank page. Hmmm...</P
1540 > Matches for http://www.rhapsodyk.net/adsl/HOWTO/:
1542 { -add-header -block +deanimate-gifs -downgrade +fast-redirects
1543 +filter{html-annoyances} +filter{js-annoyances} +filter{no-popups}
1544 +filter{webbugs} +filter{nimda} +filter{banners-by-size} +filter{hal}
1545 +filter{fun} +hide-forwarded +hide-from{block} +hide-referer{forge}
1546 -hide-user-agent -image +image-blocker{blank} +no-compression
1547 +no-cookies-keep -no-cookies-read -no-cookies-set +no-popups
1548 -vanilla-wafer -wafer }
1567 we did not want this at all! Now we see why we get the blank page. We could
1568 now add a new action below this that explicitly does <I
1572 block (-block) pages with <SPAN
1575 >. There are various ways to
1576 handle such exceptions. Example:</P
1595 > Now the page displays ;-) Be sure to flush your browser's caches when
1596 making such changes. Or, try using <TT
1601 > But now what about a situation where we get no explicit matches like
1621 > That actually was very telling and pointed us quickly to where the problem
1622 was. If you don't get this kind of match, then it means one of the default
1623 rules in the first section is causing the problem. This would require some
1624 guesswork, and maybe a little trial and error to isolate the offending rule.
1625 One likely cause would be one of the <SPAN
1629 adding the URL for the site to one of the aliases that turn off <SPAN
1644 .worldpay.com # for quietpc.com
1664 >"{ -filter -no-cookies -no-cookies-keep }"</SPAN
1666 your own exception to negate filtering: </P
1688 > is an alias that disables most actions. This can be
1689 used as a last resort for problem sites. Remember to flush caches! If this
1690 still does not work, you will have to go through the remaining actions one by
1691 one to find which one(s) is causing the problem.</P