7 CONTENT="Modular DocBook HTML Stylesheet Version 1.60"><LINK
9 TITLE="Privoxy User Manual"
10 HREF="index.html"><LINK
13 HREF="seealso.html"><LINK
16 HREF="../p_doc.css"></HEAD
35 >Privoxy User Manual</TH
75 >14.1. Regular Expressions</A
81 > uses Perl-style <SPAN
86 HREF="actions-file.html"
90 HREF="filter-file.html"
94 HREF="http://www.pcre.org/"
99 HREF="http://www.oesterhelt.org/pcrs/"
104 > If you are reading this, you probably don't understand what <SPAN
108 > are, or what they can do. So this will be a very brief
109 introduction only. A full explanation would require a <A
110 HREF="http://www.oreilly.com/catalog/regex/"
115 > Regular expressions provide a language to describe patterns that can be
116 run against strings of characters (letters, numbers, etc.), to see if they
117 match the string or not. The patterns are themselves (sometimes complex)
118 strings of literal characters, combined with wild-cards, and other special
119 characters, called meta-characters. The <SPAN
121 >"meta-characters"</SPAN
123 special meanings and are used to build complex patterns to be matched against.
124 Perl Compatible Regular Expressions are an especially convenient
128 > of the regular expression language.</P
130 > To make a simple analogy, we do something similar when we use wild-card
131 characters when listing files with the <B
138 > matches all filenames. The <SPAN
142 character here is the asterisk which matches any and all characters. We can be
143 more specific and use <TT
146 > to match just individual
149 >"dir file?.text"</SPAN
157 >, etc. We are pattern
158 matching, using a similar technique to <SPAN
160 >"regular expressions"</SPAN
163 > Regular expressions do essentially the same thing, but are much, much more
164 powerful. There are many more <SPAN
166 >"special characters"</SPAN
168 building complex patterns however. Let's look at a few of the common ones,
169 and then some examples:</P
181 > - Matches any single character, e.g. <SPAN
216 > - The preceding character or expression is matched ZERO or ONE
236 > - The preceding character or expression is matched ONE or MORE
256 > - The preceding character or expression is matched ZERO or MORE
279 > character denotes that
280 the following character should be taken literally. This is used where one of the
281 special characters (e.g. <SPAN
284 >) needs to be taken literally and
285 not as a special meta-character. Example: <SPAN
287 >"example\.com"</SPAN
289 sure the period is recognized only as a period (and not expanded to its
290 meta-character meaning of any single character).
309 > - Characters enclosed in brackets will be matched if
310 any of the enclosed characters are encountered. For instance, <SPAN
314 matches any numeric digit (zero through nine). As an example, we can combine
318 > to match any digit one or more times: <SPAN
340 > - Parentheses are used to group a sub-expression,
341 or multiple sub-expressions.
363 > character works like an
367 > conditional statement. A match is successful if the
368 sub-expression on either side of <SPAN
371 > matches. As an example:
374 >"/(this|that) example/"</SPAN
375 > uses grouping and the bar character
376 and would match either <SPAN
378 >"this example"</SPAN
392 > These are just some of the ones you are likely to use when matching URLs with
396 >, and are a long way from a definitive
397 list. This is enough to get us started with a few simple examples which may
398 be more illuminating:</P
407 that uses the common combination of <SPAN
414 denote any character, zero or more times. In other words, any string at all.
415 So we start with a literal forward slash, then our regular expression pattern
419 >) another literal forward slash, the string
423 >, another forward slash, and lastly another
428 a directory path here. This will match any file whose path includes a
429 directory named <SPAN
436 any characters, and this could conceivably be more forward slashes, so it
437 might expand into a much longer looking path. For example, this could match:
440 >"/eye/hate/spammers/banners/annoy_me_please.gif"</SPAN
444 >"/banners/annoying.html"</SPAN
445 >, or an almost infinite number of other
446 possible combinations, as long as it has <SPAN
452 > And now something a little more complex:</P
458 >/.*/adv((er)?ts?|ertis(ing|ements?))?/</TT
461 We have several literal forward slashes again (<SPAN
465 building another expression that is a file path statement. We have another
469 >, so we are matching against any conceivable sub-path, as long as
470 it matches our expression. The only true literal that <I
474 > our pattern is <SPAN
478 the forward slashes. What comes after the <SPAN
482 interesting part. </P
487 > means the preceding expression (either a
488 literal character or anything grouped with <SPAN
492 can exist or not, since this means either zero or one match. So
495 >"((er)?ts?|ertis(ing|ements?))"</SPAN
496 > is optional, as are the
497 individual sub-expressions: <SPAN
503 >"(ing|ements?)"</SPAN
514 >. We have two of those. For instance,
517 >"(ing|ements?)"</SPAN
518 >, can expand to match either <SPAN
528 >. What is being done here is an
529 attempt at matching as many variations of <SPAN
531 >"advertisement"</SPAN
533 similar, as possible. So this would expand to match just <SPAN
549 >"advertisement"</SPAN
553 >"advertisements"</SPAN
554 >. You get the idea. But it would not match
557 >"advertizements"</SPAN
561 >). We could fix that by
562 changing our regular expression to:
565 >"/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/"</SPAN
566 >, which would then match
573 >/.*/advert[0-9]+\.(gif|jpe?g)</TT
576 another path statement with forward slashes. Anything in the square brackets
580 > can be matched. This is using <SPAN
584 shorthand expression to mean any digit zero through nine. It is the same as
588 >. So any digit matches. The <SPAN
592 means one or more of the preceding expression must be included. The preceding
593 expression here is what is in the square brackets -- in this case, any digit
594 zero through nine. Then, at the end, we have a grouping: <SPAN
598 This includes a <SPAN
601 >, so this needs to match the expression on
602 either side of that bar character also. A simple <SPAN
605 > on one side, and the other
606 side will in turn match either <SPAN
616 > means the letter <SPAN
620 can be matched once or not at all. So we are building an expression here to
621 match a GIF or JPEG image file. It must include the literal
625 >, then one or more digits, and a <SPAN
629 (which is now a literal, and not a special character, since it is escaped
633 >), and lastly either <SPAN
643 >. Some possible matches would
646 >"//advert1.jpg"</SPAN
650 >"/nasty/ads/advert1234.gif"</SPAN
654 >"/banners/from/hell/advert99.jpg"</SPAN
655 >. It would not match
659 > (no leading slash), or
662 >"/adverts232.jpg"</SPAN
663 > (the expression does not include an
669 >"/advert1.jsp"</SPAN
674 in the expression anywhere).</P
676 > We are barely scratching the surface of regular expressions here so that you
677 can understand the default <SPAN
681 configuration files, and maybe use this knowledge to customize your own
682 installation. There is much, much more that can be done with regular
683 expressions. Now that you know enough to get started, you can learn more on
686 > More reading on Perl Compatible Regular Expressions:
688 HREF="http://www.perldoc.com/perl5.6/pod/perlre.html"
690 >http://www.perldoc.com/perl5.6/pod/perlre.html</A
693 > For information on regular expression based substitutions and their applications
694 in filters, please see the <A
695 HREF="filter-file.html"
696 >filter file tutorial</A
709 >'s Internal Pages</A
715 > proxies each requested
716 web page, it is easy for <SPAN
720 trap certain special URLs. In this way, we can talk directly to
725 configured, see how our rules are being applied, change these
726 rules and other configuration options, and even turn
730 > filtering off, all with
731 a web browser. </P
733 > The URLs listed below are the special ones that allow direct access
741 > must be running to access these. If
742 not, you will get a friendly error message. Internet access is not
761 HREF="http://config.privoxy.org/"
763 >http://config.privoxy.org/</A
768 > There is a shortcut: <A
773 doesn't provide a fallback to a real page, in case the request is not
783 Show information about the current configuration, including viewing and
784 editing of actions files:
794 HREF="http://config.privoxy.org/show-status"
796 >http://config.privoxy.org/show-status</A
804 Show the source code version numbers:
814 HREF="http://config.privoxy.org/show-version"
816 >http://config.privoxy.org/show-version</A
824 Show the browser's request headers:
834 HREF="http://config.privoxy.org/show-request"
836 >http://config.privoxy.org/show-request</A
844 Show which actions apply to a URL and why:
854 HREF="http://config.privoxy.org/show-url-info"
856 >http://config.privoxy.org/show-url-info</A
864 Toggle Privoxy on or off. In this case, <SPAN
868 to run, but only as a pass-through proxy, with no actions taking place:
878 HREF="http://config.privoxy.org/toggle"
880 >http://config.privoxy.org/toggle</A
885 > Shortcuts. Turn off, then on:
895 HREF="http://config.privoxy.org/toggle?set=disable"
897 >http://config.privoxy.org/toggle?set=disable</A
909 HREF="http://config.privoxy.org/toggle?set=enable"
911 >http://config.privoxy.org/toggle?set=enable</A
919 > These may be bookmarked for quick reference. See next. </P
926 >14.2.1. Bookmarklets</A
929 > Below are some <SPAN
931 >"bookmarklets"</SPAN
932 > to allow you to easily access a
936 > version of some of <SPAN
940 special pages. They are designed for MS Internet Explorer, but should work
941 equally well in Netscape, Mozilla, and other browsers which support
942 JavaScript. They are designed to run directly from your bookmarks - not by
943 clicking the links below (although that should work for testing).</P
945 > To save them, right-click the link and choose <SPAN
947 >"Add to Favorites"</SPAN
951 >"Add Bookmark"</SPAN
952 > (Netscape). You will get a warning that
955 >"may not be safe"</SPAN
956 > - just click OK. Then you can run the
957 Bookmarklet directly from your favorites/bookmarks. For even faster access,
958 you can put them on the <SPAN
961 > bar (IE) or the <SPAN
965 > (Netscape), and run them with a single click. </P
973 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y&set=enabled','ijbstatus','width=250,height=100,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
982 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y&set=disabled','ijbstatus','width=250,height=100,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
984 >Privoxy - Disable</A
991 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y&set=toggle','ijbstatus','width=250,height=100,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
993 >Privoxy - Toggle Privoxy</A
994 > (Toggles between enabled and disabled)
1000 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y','ijbstatus','width=250,height=2,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
1002 >Privoxy - View Status</A
1009 HREF="javascript:w=Math.floor(screen.width/2);h=Math.floor(screen.height*0.9);void(window.open('http://www.privoxy.org/actions/index.php?url='+escape(location.href),'Feedback','screenx='+w+',width='+w+',height='+h+',scrollbars=yes,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
1011 >Privoxy - Submit Actions File Feedback</A
1018 HREF="javascript:void(window.open('http://config.privoxy.org/show-url-info?url='+escape(location.href),'Why').focus());"
1027 > Credit: The site which gave us the general idea for these bookmarklets is
1029 HREF="http://www.bookmarklets.com"
1031 >www.bookmarklets.com</A
1033 have more information about bookmarklets. </P
1042 >14.3. Chain of Events</A
1045 > Let's take a quick look at the basic sequence of events when a web page is
1046 requested by your browser and <SPAN
1056 > First, your web browser requests a web page. The browser knows to send
1057 the request to <SPAN
1060 >, which will, in turn,
1061 relay the request to the remote web server after passing the following
1070 > traps any request for its own internal CGI
1071 pages (e.g. http://p.p/) and sends the CGI page back to the browser.
1079 > checks to see if the URL
1081 HREF="actions-file.html#BLOCK"
1087 so, the URL is then blocked, and the remote web server will not be contacted.
1089 HREF="actions-file.html#HANDLE-AS-IMAGE"
1092 >"+handle-as-image"</SPAN
1095 is then checked and if it does not match, an
1099 > page is sent back. Otherwise, if it does match,
1100 an image is returned. The type of image depends on the setting of <A
1101 HREF="actions-file.html#SET-IMAGE-BLOCKER"
1104 >"+set-image-blocker"</SPAN
1107 (blank, checkerboard pattern, or an HTTP redirect to an image elsewhere).
1112 > Untrusted URLs are blocked. If URLs are being added to the
1116 > file, then that is done.
1121 > If the URL pattern matches the <A
1122 HREF="actions-file.html#FAST-REDIRECTS"
1125 >"+fast-redirects"</SPAN
1128 it is then processed. Unwanted parts of the requested URL are stripped.
1133 > Now the rest of the client browser's request headers are processed. If any
1134 of these match any of the relevant actions (e.g. <A
1135 HREF="actions-file.html#HIDE-USER-AGENT"
1138 >"+hide-user-agent"</SPAN
1141 etc.), headers are suppressed or forged as determined by these actions and
1147 > Now the web server starts sending its response back (i.e. typically a web page and related
1153 > First, the server headers are read and processed to determine, among other
1154 things, the MIME type (document type) and encoding. The headers are then
1155 filtered as determined by the
1157 HREF="actions-file.html#CRUNCH-INCOMING-COOKIES"
1160 >"+crunch-incoming-cookies"</SPAN
1164 HREF="actions-file.html#SESSION-COOKIES-ONLY"
1167 >"+session-cookies-only"</SPAN
1171 HREF="actions-file.html#DOWNGRADE-HTTP-VERSION"
1174 >"+downgrade-http-version"</SPAN
1183 HREF="actions-file.html#KILL-POPUPS"
1186 >"+kill-popups"</SPAN
1189 action applies, and it is an HTML or JavaScript document, the popup-code in the
1190 response is filtered on-the-fly as it is received.
1196 HREF="actions-file.html#FILTER"
1203 HREF="actions-file.html#DEANIMATE-GIFS"
1206 >"+deanimate-gifs"</SPAN
1209 action applies (and the document type fits the action), the rest of the page is
1210 read into memory (up to a configurable limit). Then the filter rules (from
1214 >) are processed against the buffered
1215 content. Filters are applied in the order they are specified in the
1219 > file. Animated GIFs, if present, are
1220 reduced to either the first or last frame, depending on the action
1221 setting. The entire page, which is now filtered, is then sent by
1225 > back to your browser.
1229 HREF="actions-file.html#FILTER"
1236 HREF="actions-file.html#DEANIMATE-GIFS"
1239 >"+deanimate-gifs"</SPAN
1245 > passes the raw data through
1246 to the client browser as it becomes available.
1251 > As the browser receives the now (probably) filtered page content, it
1252 reads and then requests any URLs that may be embedded within the page
1253 source, e.g. ad images, stylesheets, JavaScript, other HTML documents (e.g.
1254 frames), sounds, etc. For each of these objects, the browser issues a new
1255 request. And each such request is in turn processed as above. Note that a
1256 complex web page may have many such embedded URLs.
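To make the order of these checks a little more concrete, here is a toy Python model of part of the sequence above. It is only a sketch, not Privoxy's actual code: the names Actions, request_outcome and body_steps are made up for illustration, and the trust, fast-redirects and header steps, as well as the check that the document type fits the action, are left out.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

# Host names of the internal pages mentioned in this manual.
INTERNAL_HOSTS = {"config.privoxy.org", "p.p"}
# Content types treated as "HTML or JavaScript" in this sketch (an assumption).
HTML_OR_JS = {"text/html", "text/javascript", "application/x-javascript"}


@dataclass
class Actions:
    """Hypothetical summary of which actions matched a URL."""
    block: bool = False
    handle_as_image: bool = False
    kill_popups: bool = False
    content_filter: bool = False   # stands in for any +filter{...}
    deanimate_gifs: bool = False


def request_outcome(url: str, actions: Actions) -> str:
    """Decide what happens to a request before a remote server is contacted."""
    # Requests for the internal CGI pages are trapped first.
    if urlsplit(url).hostname in INTERNAL_HOSTS:
        return "internal CGI page"
    # Blocked URLs never reach the remote server.
    if actions.block:
        return "image" if actions.handle_as_image else "blocked page"
    return "relay to server"


def body_steps(actions: Actions, content_type: str) -> list[str]:
    """List how the response body is handled once the server replies."""
    steps = []
    if actions.kill_popups and content_type in HTML_OR_JS:
        steps.append("filter popup code on the fly")
    if actions.content_filter or actions.deanimate_gifs:
        steps.append("buffer, filter, then send")
    else:
        steps.append("pass raw data through")
    return steps
```

For instance, request_outcome("http://ad.doubleclick.net/ad.gif", Actions(block=True, handle_as_image=True)) yields "image", while the same URL with only block set yields "blocked page".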
1268 >14.4. Anatomy of an Action</A
1276 HREF="actions-file.html#ACTIONS"
1279 HREF="actions-file.html#FILTER"
1282 to any given URL can be complex, and it is not always
1283 easy to understand what is happening. And sometimes we need to be able to
1291 doing. Especially, if something <SPAN
1295 is causing us a problem inadvertently. It can be a little daunting to look at
1296 the actions and filters files themselves, since they tend to be filled with
1298 HREF="appendix.html#REGEX"
1299 >regular expressions</A
1300 > whose consequences are not
1301 always so obvious. </P
1303 > One quick test to see if <SPAN
1306 > is causing a problem
1307 or not, is to disable it temporarily. This should be the first troubleshooting
1309 HREF="appendix.html#BOOKMARKLETS"
1310 >the Bookmarklets</A
1311 > section on a quick
1312 and easy way to do this (be sure to flush caches afterward!).</P
1319 HREF="http://config.privoxy.org/show-url-info"
1321 >http://config.privoxy.org/show-url-info</A
1323 page that can show us very specifically how <SPAN
1327 are being applied to any given URL. This is a big help for troubleshooting.</P
1329 > First, enter one URL (or partial URL) at the prompt, and then
1334 how the current configuration will handle it. This will not
1335 help with filtering effects (i.e. the <A
1336 HREF="actions-file.html#FILTER"
1345 > file since this is handled very
1346 differently and is not so easy to trap! It also will not tell you about any other
1347 URLs that may be embedded within the URL you are testing. For instance, images
1348 such as ads are expressed as URLs within the raw page source of HTML pages. So
1349 you will only get info for the actual URL that is pasted into the prompt area
1350 -- not any sub-URLs. If you want to know about embedded URLs like ads, you
1351 will have to dig those out of the HTML source. Use your browser's <SPAN
1355 > option for this. Or right click on the ad, and grab the
1358 > Let's try an example, <A
1359 HREF="http://google.com"
1363 and look at it one section at a time:</P
1373 > Matches for http://google.com:
1375 In file: default.action <SPAN
1385 -crunch-outgoing-cookies
1386 -crunch-incoming-cookies
1387 +deanimate-gifs{last}
1388 -downgrade-http-version
1392 -filter{shockwave-flash}
1393 -filter{crude-parental}
1394 +filter{html-annoyances}
1395 +filter{js-annoyances}
1396 +filter{content-cookies}
1398 +filter{refresh-tags}
1400 +filter{banners-by-size}
1401 +hide-forwarded-for-headers
1402 +hide-from-header{block}
1403 +hide-referer{forge}
1408 +prevent-compression
1411 +session-cookies-only
1412 +set-image-blocker{pattern} }
1415 { -session-cookies-only }
1421 In file: user.action <SPAN
1428 (no matches in this file) </PRE
1434 > This tells us how we have defined our
1436 HREF="actions-file.html#ACTIONS"
1442 which ones match for our example, <SPAN
1445 >. The first listing
1446 is any matches for the <TT
1448 >standard.action</TT
1453 >. Then next is <SPAN
1460 > file. The large, multi-line listing,
1461 is how the actions are set to match for all URLs, i.e. our default settings.
1462 If you look at your <SPAN
1465 > file, this would be the section
1466 just below the <SPAN
1469 > section near the top. This will apply to
1470 all URLs as signified by the single forward slash at the end of the listing
1476 > But we can define additional actions that would be exceptions to these general
1477 rules, and then list specific URLs (or patterns) that these exceptions would
1478 apply to. Last match wins. Just below this then are two explicit matches for
1481 >".google.com"</SPAN
1482 >. The first is negating our previous cookie setting,
1484 HREF="actions-file.html#SESSION-COOKIES-ONLY"
1487 >"+session-cookies-only"</SPAN
1490 (i.e. not persistent). So we will allow persistent cookies for Google. The
1496 HREF="actions-file.html#FAST-REDIRECTS"
1499 >"+fast-redirects"</SPAN
1502 action, allowing this to take place unmolested. Note that there is a leading
1505 >".google.com"</SPAN
1506 >. This will match any hosts and
1507 sub-domains in the google.com domain as well, such as
1510 >"www.google.com"</SPAN
1511 >. So, apparently, we have these two actions
1512 defined somewhere in the lower part of our <TT
1519 > is referenced somewhere in these latter
1525 > file, we again have no hits.</P
1527 > And finally we pull it all together in the bottom section and summarize how
1531 > is applying all its <SPAN
1548 > Final results:
1552 -crunch-outgoing-cookies
1553 -crunch-incoming-cookies
1554 +deanimate-gifs{last}
1555 -downgrade-http-version
1559 -filter{shockwave-flash}
1560 -filter{crude-parental}
1561 +filter{html-annoyances}
1562 +filter{js-annoyances}
1563 +filter{content-cookies}
1565 +filter{refresh-tags}
1567 +filter{banners-by-size}
1568 +hide-forwarded-for-headers
1569 +hide-from-header{block}
1570 +hide-referer{forge}
1575 +prevent-compression
1578 -session-cookies-only
1579 +set-image-blocker{pattern} </PRE
1585 > Notice that the only difference from the previous listing is in
1588 >"fast-redirects"</SPAN
1591 >"session-cookies-only"</SPAN
1594 > Now another example, <SPAN
1596 >"ad.doubleclick.net"</SPAN
1607 > { +block +handle-as-image }
1610 { +block +handle-as-image }
1613 { +block +handle-as-image }
1614 .doubleclick.net</PRE
1620 > We'll just show the interesting part here, the explicit matches. It is
1621 matched three different times. Each as an <SPAN
1623 >"+block +handle-as-image"</SPAN
1625 which is the expanded form of one of our aliases that had been defined as:
1628 >"+imageblock"</SPAN
1630 HREF="actions-file.html#ALIASES"
1636 the first section of the actions file and typically used to combine more
1637 than one action.)</P
1639 > Any one of these would have done the trick and blocked this as an unwanted
1640 image. This is unnecessarily redundant since the last case effectively
1641 would also cover the first. No point in taking chances with these guys
1642 though ;-) Note that if you want an ad or obnoxious
1643 URL to be invisible, it should be defined as <SPAN
1645 >"ad.doubleclick.net"</SPAN
1647 is done here -- as both a <A
1648 HREF="actions-file.html#BLOCK"
1659 HREF="actions-file.html#HANDLE-AS-IMAGE"
1662 >"+handle-as-image"</SPAN
1665 The custom alias <SPAN
1667 >"+imageblock"</SPAN
1668 > just simplifies the process and makes
1669 it more readable.</P
1671 > One last example. Let's try <SPAN
1673 >"http://www.rhapsodyk.net/adsl/HOWTO/"</SPAN
1675 This one is giving us problems. We are getting a blank page. Hmmm...</P
1685 > Matches for http://www.rhapsodyk.net/adsl/HOWTO/:
1687 In file: default.action <SPAN
1697 -crunch-incoming-cookies
1698 -crunch-outgoing-cookies
1700 -downgrade-http-version
1702 +filter{html-annoyances}
1703 +filter{js-annoyances}
1704 +filter{kill-popups}
1707 +filter{banners-by-size}
1710 +hide-forwarded-for-headers
1711 +hide-from-header{block}
1712 +hide-referer{forge}
1716 +prevent-compression
1719 +session-cookies-only
1720 +set-image-blocker{blank} }
1723 { +block +handle-as-image }
1737 we did not want this at all! Now we see why we get the blank page. We could
1738 now add a new action below this that explicitly does <I
1749 various ways to handle such exceptions. Example:</P
1766 > Now the page displays ;-) Be sure to flush your browser's caches when
1767 making such changes. Or, try using <TT
1772 > But now what about a situation where we get no explicit matches like
1783 > { +block +handle-as-image }
1790 > That actually was very telling and pointed us quickly to where the problem
1791 was. If you don't get this kind of match, then it means one of the default
1792 rules in the first section is causing the problem. This would require some
1793 guesswork, and maybe a little trial and error to isolate the offending rule.
1794 One likely cause would be one of the <SPAN
1798 adding the URL for the site to one of the aliases that turn off <SPAN
1813 .worldpay.com # for quietpc.com
1831 >"{ -filter -session-cookies-only }"</SPAN
1833 Or you could do your own exception to negate filtering: </P
1850 > This would probably be most appropriately put in <TT
1854 for local site exceptions.</P
1859 > is an alias that disables most actions. This can be
1860 used as a last resort for problem sites. Remember to flush caches! If this
1861 still does not work, you will have to go through the remaining actions one by
1862 one to find which one(s) is causing the problem.</P