7 CONTENT="Modular DocBook HTML Stylesheet Version 1.76b+
10 TITLE="Privoxy 3.0.4 User Manual"
11 HREF="index.html"><LINK
14 HREF="seealso.html"><LINK
17 HREF="../p_doc.css"></HEAD
28 SUMMARY="Header navigation table"
37 >Privoxy 3.0.4 User Manual</TH
79 >14.1. Regular Expressions</H2
84 > uses Perl-style <SPAN
89 HREF="actions-file.html"
93 HREF="filter-file.html"
97 HREF="http://www.pcre.org/"
106 > If you are reading this, you probably don't understand what <SPAN
110 > are, or what they can do. So this will be a very brief
111 introduction only. A full explanation would require a <A
112 HREF="http://www.oreilly.com/catalog/regex/"
117 > Regular expressions provide a language to describe patterns that can be
118 run against strings of characters (letters, numbers, etc.), to see if they
119 match the string or not. The patterns are themselves (sometimes complex)
120 strings of literal characters, combined with wild-cards, and other special
121 characters, called meta-characters. The <SPAN
123 >"meta-characters"</SPAN
125 special meanings and are used to build complex patterns to be matched against.
126 Perl Compatible Regular Expressions are an especially convenient
130 > of the regular expression language.</P
132 > To make a simple analogy, we do something similar when we use wild-card
133 characters when listing files with the <B
140 > matches all filenames. The <SPAN
144 character here is the asterisk which matches any and all characters. We can be
145 more specific and use <TT
148 > to match just individual
151 >"dir file?.text"</SPAN
159 >, etc. We are pattern
160 matching, using a similar technique to <SPAN
162 >"regular expressions"</SPAN
165 > Regular expressions do essentially the same thing, but are much, much more
166 powerful. There are many more <SPAN
168 >"special characters"</SPAN
170 building complex patterns however. Let's look at a few of the common ones,
171 and then some examples:</P
186 > - Matches any single character, e.g. <SPAN
224 > - The preceding character or expression is matched ZERO or ONE
247 > - The preceding character or expression is matched ONE or MORE
270 > - The preceding character or expression is matched ZERO or MORE
296 > character denotes that
297 the following character should be taken literally. This is used where one of the
298 special characters (e.g. <SPAN
301 >) needs to be taken literally and
302 not as a special meta-character. Example: <SPAN
304 >"example\.com"</SPAN
306 sure the period is recognized only as a period (and not expanded to its
307 meta-character meaning of any single character).
329 > - Characters enclosed in brackets will be matched if
330 any of the enclosed characters are encountered. For instance, <SPAN
334 matches any numeric digit (zero through nine). As an example, we can combine
338 > to match any digit one or more times: <SPAN
363 > - parentheses are used to group a sub-expression,
364 or multiple sub-expressions.
389 > character works like an
393 > conditional statement. A match is successful if the
394 sub-expression on either side of <SPAN
397 > matches. As an example:
400 >"/(this|that) example/"</SPAN
401 > uses grouping and the bar character
402 and would match either <SPAN
404 >"this example"</SPAN
418 > These are just some of the ones you are likely to use when matching URLs with
422 >, and is a long way from a definitive
423 list. This is enough to get us started with a few simple examples which may
424 be more illuminating:</P
436 that uses the common combination of <SPAN
443 denote any character, zero or more times. In other words, any string at all.
444 So we start with a literal forward slash, then our regular expression pattern
448 >) another literal forward slash, the string
452 >, another forward slash, and lastly another
457 a directory path here. This will match any file with the path that has a
458 directory named <SPAN
465 any characters, and this could conceivably be more forward slashes, so it
466 might expand into a much longer looking path. For example, this could match:
469 >"/eye/hate/spammers/banners/annoy_me_please.gif"</SPAN
473 >"/banners/annoying.html"</SPAN
474 >, or an almost infinite number of other
475 possible combinations, just so it has <SPAN
481 > And now something a little more complex:</P
489 >/.*/adv((er)?ts?|ertis(ing|ements?))?/</TT
493 We have several literal forward slashes again (<SPAN
497 building another expression that is a file path statement. We have another
501 >, so we are matching against any conceivable sub-path, just so
502 it matches our expression. The only true literal that <SPAN
509 > our pattern is <SPAN
513 the forward slashes. What comes after the <SPAN
517 interesting part. </P
522 > means the preceding expression (either a
523 literal character or anything grouped with <SPAN
527 can exist or not, since this means either zero or one match. So
530 >"((er)?ts?|ertis(ing|ements?))"</SPAN
531 > is optional, as are the
532 individual sub-expressions: <SPAN
538 >"(ing|ements?)"</SPAN
549 >. We have two of those. For instance,
552 >"(ing|ements?)"</SPAN
553 >, can expand to match either <SPAN
566 >. What is being done here is an
567 attempt at matching as many variations of <SPAN
569 >"advertisement"</SPAN
571 similar, as possible. So this would expand to match just <SPAN
587 >"advertisement"</SPAN
591 >"advertisements"</SPAN
592 >. You get the idea. But it would not match
595 >"advertizements"</SPAN
599 >). We could fix that by
600 changing our regular expression to:
603 >"/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/"</SPAN
604 >, which would then match
613 >/.*/advert[0-9]+\.(gif|jpe?g)</TT
617 another path statement with forward slashes. Anything in the square brackets
621 > can be matched. This is using <SPAN
625 shorthand expression to mean any digit zero through nine. It is the same as
629 >. So any digit matches. The <SPAN
633 means one or more of the preceding expression must be included. The preceding
634 expression here is what is in the square brackets -- in this case, any digit
635 zero through nine. Then, at the end, we have a grouping: <SPAN
639 This includes a <SPAN
642 >, so this needs to match the expression on
643 either side of that bar character also. A simple <SPAN
646 > on one side, and the other
647 side will in turn match either <SPAN
657 > means the letter <SPAN
661 can be matched once or not at all. So we are building an expression here to
662 match a GIF or JPEG image file. It must include the literal
666 >, then one or more digits, and a <SPAN
670 (which is now a literal, and not a special character, since it is escaped
674 >), and lastly either <SPAN
684 >. Some possible matches would
687 >"//advert1.jpg"</SPAN
691 >"/nasty/ads/advert1234.gif"</SPAN
695 >"/banners/from/hell/advert99.jpg"</SPAN
696 >. It would not match
700 > (no leading slash), or
703 >"/adverts232.jpg"</SPAN
704 > (the expression does not include an
710 >"/advert1.jsp"</SPAN
715 in the expression anywhere).</P
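>
<P>The last two examples can be checked mechanically. The sketch below is a minimal illustration, assuming Python's <TT CLASS="LITERAL">re</TT> module as a stand-in for the PCRE library and assuming that the whole path must match the pattern. The helper name <TT CLASS="LITERAL">path_matches</TT> and the sample path <TT CLASS="LITERAL">/ads/advertizements/</TT> are hypothetical and not part of Privoxy.</P>

```python
import re

# Patterns quoted from the examples above.
ADVERT_WORDS = r"/.*/adv((er)?ts?|ertis(ing|ements?))?/"
ADVERT_WORDS_SZ = r"/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/"
ADVERT_IMAGE = r"/.*/advert[0-9]+\.(gif|jpe?g)"


def path_matches(pattern: str, path: str) -> bool:
    """Return True if the whole path matches the pattern.

    Matching the whole path is an assumption of this sketch, not a
    statement about how Privoxy anchors its path patterns.
    """
    return re.fullmatch(pattern, path) is not None


if __name__ == "__main__":
    # Sample paths taken from the text for the image pattern.
    for path in (
        "//advert1.jpg",
        "/nasty/ads/advert1234.gif",
        "/banners/from/hell/advert99.jpg",
        "/adverts232.jpg",
        "/advert1.jsp",
    ):
        print(path, path_matches(ADVERT_IMAGE, path))

    # Hypothetical path showing the effect of the (s|z) variant.
    sample = "/ads/advertizements/"
    print(sample, path_matches(ADVERT_WORDS, sample))
    print(sample, path_matches(ADVERT_WORDS_SZ, sample))
```

<P>Privoxy's own handling of path patterns may differ from <TT CLASS="LITERAL">re.fullmatch</TT>, so treat this as a way to experiment with a pattern before putting it into an actions file.</P>
<P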
717 > We are barely scratching the surface of regular expressions here so that you
718 can understand the default <SPAN
722 configuration files, and maybe use this knowledge to customize your own
723 installation. There is much, much more that can be done with regular
724 expressions. Now that you know enough to get started, you can learn more on
727 > More reading on Perl Compatible Regular expressions:
729 HREF="http://www.perldoc.com/perl5.6/pod/perlre.html"
731 >http://www.perldoc.com/perl5.6/pod/perlre.html</A
734 > For information on regular expression based substitutions and their applications
735 in filters, please see the <A
736 HREF="filter-file.html"
737 >filter file tutorial</A
751 >'s Internal Pages</H2
756 > proxies each requested
757 web page, it is easy for <SPAN
761 trap certain special URLs. In this way, we can talk directly to
766 configured, see how our rules are being applied, change these
767 rules and other configuration options, and even turn
771 > filtering off, all with
772 a web browser. </P
774 > The URLs listed below are the special ones that allow direct access
782 > must be running to access these. If
783 not, you will get a friendly error message. Internet access is not
802 HREF="http://config.privoxy.org/"
804 >http://config.privoxy.org/</A
809 > There is a shortcut: <A
814 doesn't provide a fall-back to a real page, in case the request is not
824 Show information about the current configuration, including viewing and
825 editing of actions files:
835 HREF="http://config.privoxy.org/show-status"
837 >http://config.privoxy.org/show-status</A
845 Show the source code version numbers:
855 HREF="http://config.privoxy.org/show-version"
857 >http://config.privoxy.org/show-version</A
865 Show the browser's request headers:
875 HREF="http://config.privoxy.org/show-request"
877 >http://config.privoxy.org/show-request</A
885 Show which actions apply to a URL and why:
895 HREF="http://config.privoxy.org/show-url-info"
897 >http://config.privoxy.org/show-url-info</A
905 Toggle Privoxy on or off. In this case, <SPAN
909 to run, but only as a pass-through proxy, with no actions taking place:
919 HREF="http://config.privoxy.org/toggle"
921 >http://config.privoxy.org/toggle</A
926 > Short cuts. Turn off, then on:
936 HREF="http://config.privoxy.org/toggle?set=disable"
938 >http://config.privoxy.org/toggle?set=disable</A
950 HREF="http://config.privoxy.org/toggle?set=enable"
952 >http://config.privoxy.org/toggle?set=enable</A
960 > These may be bookmarked for quick reference. See next. </P
968 >14.2.1. Bookmarklets</H3
970 > Below are some <SPAN
972 >"bookmarklets"</SPAN
973 > to allow you to easily access a
977 > version of some of <SPAN
981 special pages. They are designed for MS Internet Explorer, but should work
982 equally well in Netscape, Mozilla, and other browsers which support
983 JavaScript. They are designed to run directly from your bookmarks - not by
984 clicking the links below (although that should work for testing).</P
986 > To save them, right-click the link and choose <SPAN
988 >"Add to Favorites"</SPAN
992 >"Add Bookmark"</SPAN
993 > (Netscape). You will get a warning that
996 >"may not be safe"</SPAN
997 > - just click OK. Then you can run the
998 Bookmarklet directly from your favorites/bookmarks. For even faster access,
999 you can put them on the <SPAN
1002 > bar (IE) or the <SPAN
1006 > (Netscape), and run them with a single click. </P
1014 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y&set=enabled','ijbstatus','width=250,height=100,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
1016 >Privoxy - Enable</A
1023 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y&set=disabled','ijbstatus','width=250,height=100,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
1025 >Privoxy - Disable</A
1032 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y&set=toggle','ijbstatus','width=250,height=100,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
1034 >Privoxy - Toggle Privoxy</A
1035 > (Toggles between enabled and disabled)
1041 HREF="javascript:void(window.open('http://config.privoxy.org/toggle?mini=y','ijbstatus','width=250,height=2,resizable=yes,scrollbars=no,toolbar=no,location=no,directories=no,status=no,menubar=no,copyhistory=no').focus());"
1043 >Privoxy - View Status</A
1050 HREF="javascript:void(window.open('http://config.privoxy.org/show-url-info?url='+escape(location.href),'Why').focus());"
1059 > Credit: The site which gave us the general idea for these bookmarklets is
1061 HREF="http://www.bookmarklets.com/"
1063 >www.bookmarklets.com</A
1065 have more information about bookmarklets. </P
1075 >14.3. Chain of Events</H2
1077 > Let's take a quick look at the basic sequence of events when a web page is
1078 requested by your browser and <SPAN
1088 > First, your web browser requests a web page. The browser knows to send
1089 the request to <SPAN
1092 >, which will in turn
1093 relay the request to the remote web server after passing the following
1102 > traps any request for its own internal CGI
1103 pages (e.g. http://p.p/) and sends the CGI page back to the browser.
1111 > checks to see if the URL
1113 HREF="actions-file.html#BLOCK"
1119 so, the URL is then blocked, and the remote web server will not be contacted.
1121 HREF="actions-file.html#HANDLE-AS-IMAGE"
1124 >"+handle-as-image"</SPAN
1127 is then checked and if it does not match, an
1131 > page is sent back. Otherwise, if it does match,
1132 an image is returned. The type of image depends on the setting of <A
1133 HREF="actions-file.html#SET-IMAGE-BLOCKER"
1136 >"+set-image-blocker"</SPAN
1139 (blank, checkerboard pattern, or an HTTP redirect to an image elsewhere).
1144 > Untrusted URLs are blocked. If URLs are being added to the
1148 > file, then that is done.
1153 > If the URL pattern matches the <A
1154 HREF="actions-file.html#FAST-REDIRECTS"
1157 >"+fast-redirects"</SPAN
1160 it is then processed. Unwanted parts of the requested URL are stripped.
1165 > Now the rest of the client browser's request headers are processed. If any
1166 of these match any of the relevant actions (e.g. <A
1167 HREF="actions-file.html#HIDE-USER-AGENT"
1170 >"+hide-user-agent"</SPAN
1173 etc.), headers are suppressed or forged as determined by these actions and
1179 > Now the web server starts sending its response back (i.e. typically a web page and related
1185 > First, the server headers are read and processed to determine, among other
1186 things, the MIME type (document type) and encoding. The headers are then
1187 filtered as determined by the
1189 HREF="actions-file.html#CRUNCH-INCOMING-COOKIES"
1192 >"+crunch-incoming-cookies"</SPAN
1196 HREF="actions-file.html#SESSION-COOKIES-ONLY"
1199 >"+session-cookies-only"</SPAN
1203 HREF="actions-file.html#DOWNGRADE-HTTP-VERSION"
1206 >"+downgrade-http-version"</SPAN
1215 HREF="actions-file.html#KILL-POPUPS"
1218 >"+kill-popups"</SPAN
1221 action applies, and it is an HTML or JavaScript document, the popup-code in the
1222 response is filtered on-the-fly as it is received.
1228 HREF="actions-file.html#FILTER"
1235 HREF="actions-file.html#DEANIMATE-GIFS"
1238 >"+deanimate-gifs"</SPAN
1241 action applies (and the document type fits the action), the rest of the page is
1242 read into memory (up to a configurable limit). Then the filter rules (from
1246 >) are processed against the buffered
1247 content. Filters are applied in the order they are specified in one of the
1248 filter files. Animated GIFs, if present, are
1249 reduced to either the first or last frame, depending on the action
1250 setting. The entire page, which is now filtered, is then sent by
1254 > back to your browser.
1258 HREF="actions-file.html#FILTER"
1265 HREF="actions-file.html#DEANIMATE-GIFS"
1268 >"+deanimate-gifs"</SPAN
1274 > passes the raw data through
1275 to the client browser as it becomes available.
1280 > As the browser receives the (probably filtered) page content, it
1281 reads and then requests any URLs that may be embedded within the page
1282 source, e.g. ad images, stylesheets, JavaScript, other HTML documents (e.g.
1283 frames), sounds, etc. For each of these objects, the browser issues a new
1284 request. And each such request is in turn processed as above. Note that a
1285 complex web page may have many such embedded URLs.
1298 >14.4. Anatomy of an Action</H2
1305 HREF="actions-file.html#ACTIONS"
1308 HREF="actions-file.html#FILTER"
1311 to any given URL can be complex, and it is not always
1312 easy to understand what is happening. And sometimes we need to be able to
1323 doing. Especially, if something <SPAN
1327 is causing us a problem inadvertently. It can be a little daunting to look at
1328 the actions and filters files themselves, since they tend to be filled with
1330 HREF="appendix.html#REGEX"
1331 >regular expressions</A
1332 > whose consequences are not
1333 always so obvious. </P
1335 > One quick test to see if <SPAN
1338 > is causing a problem
1339 or not, is to disable it temporarily. This should be the first troubleshooting
1341 HREF="appendix.html#BOOKMARKLETS"
1342 >the Bookmarklets</A
1343 > section on a quick
1344 and easy way to do this (be sure to flush caches afterward!). Looking at the
1345 logs is a good idea too.</P
1352 HREF="http://config.privoxy.org/show-url-info"
1354 >http://config.privoxy.org/show-url-info</A
1356 page that can show us very specifically how <SPAN
1360 are being applied to any given URL. This is a big help for troubleshooting.</P
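>
<P>The same page can also be reached with the URL to test already filled in. The sketch below is a minimal illustration, assuming the <TT CLASS="LITERAL">url</TT> query parameter that the "Why?" bookmarklet in this manual uses. The function name <TT CLASS="LITERAL">show_url_info_link</TT> and the choice to percent-encode every reserved character are assumptions of this example.</P>

```python
from urllib.parse import quote

# Address of the internal page, as listed in this manual.
SHOW_URL_INFO = "http://config.privoxy.org/show-url-info"


def show_url_info_link(target_url: str) -> str:
    """Build a show-url-info link for target_url.

    The 'url' parameter name is taken from the bookmarklet in this manual;
    encoding all reserved characters (safe='') is this sketch's own choice.
    """
    return f"{SHOW_URL_INFO}?url={quote(target_url, safe='')}"


if __name__ == "__main__":
    print(show_url_info_link("http://google.com"))
```

<P>The resulting link only works in a browser that is configured to use Privoxy, since the request has to pass through the proxy to be trapped.</P>
<P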
1362 > First, enter one URL (or partial URL) at the prompt, and then
1367 how the current configuration will handle it. This will not
1368 help with filtering effects (i.e. the <A
1369 HREF="actions-file.html#FILTER"
1375 one of the filter files since this is handled very
1376 differently and is not so easy to trap! It also will not tell you about any other
1377 URLs that may be embedded within the URL you are testing. For instance, images
1378 such as ads are expressed as URLs within the raw page source of HTML pages. So
1379 you will only get info for the actual URL that is pasted into the prompt area
1380 -- not any sub-URLs. If you want to know about embedded URLs like ads, you
1381 will have to dig those out of the HTML source. Use your browser's <SPAN
1385 > option for this. Or right click on the ad, and grab the
1388 > Let's try an example, <A
1389 HREF="http://google.com"
1393 and look at it one section at a time:</P
1403 > Matches for http://google.com:
1405 In file: default.action <SPAN
1415 -crunch-outgoing-cookies
1416 -crunch-incoming-cookies
1417 +deanimate-gifs{last}
1418 -downgrade-http-version
1422 -filter{shockwave-flash}
1423 -filter{crude-parental}
1424 +filter{html-annoyances}
1425 +filter{js-annoyances}
1426 +filter{content-cookies}
1428 +filter{refresh-tags}
1430 +filter{banners-by-size}
1431 +hide-forwarded-for-headers
1432 +hide-from-header{block}
1433 +hide-referer{forge}
1438 +prevent-compression
1441 +session-cookies-only
1442 +set-image-blocker{pattern} }
1445 { -session-cookies-only }
1451 In file: user.action <SPAN
1458 (no matches in this file) </PRE
1464 > This tells us how we have defined our
1466 HREF="actions-file.html#ACTIONS"
1472 which ones match for our example, <SPAN
1475 >. The first listing
1476 is any matches for the <TT
1478 >standard.action</TT
1483 >. Then next is <SPAN
1490 > file. The large, multi-line listing,
1491 is how the actions are set to match for all URLs, i.e. our default settings.
1492 If you look at your <SPAN
1495 > file, this would be the section
1496 just below the <SPAN
1499 > section near the top. This will apply to
1500 all URLs as signified by the single forward slash at the end of the listing
1506 > But we can define additional actions that would be exceptions to these general
1507 rules, and then list specific URLs (or patterns) that these exceptions would
1508 apply to. Last match wins. Just below this then are two explicit matches for
1511 >".google.com"</SPAN
1512 >. The first is negating our previous cookie setting,
1514 HREF="actions-file.html#SESSION-COOKIES-ONLY"
1517 >"+session-cookies-only"</SPAN
1520 (i.e. not persistent). So we will allow persistent cookies for google. The
1529 HREF="actions-file.html#FAST-REDIRECTS"
1532 >"+fast-redirects"</SPAN
1535 action, allowing this to take place unmolested. Note that there is a leading
1538 >".google.com"</SPAN
1539 >. This will match any hosts and
1540 sub-domains, in the google.com domain also, such as
1543 >"www.google.com"</SPAN
1544 >. So, apparently, we have these two actions
1545 defined somewhere in the lower part of our <TT
1552 > is referenced somewhere in these latter
1558 > file, we again have no hits.</P
1560 > And finally we pull it all together in the bottom section and summarize how
1564 > is applying all its <SPAN
1581 > Final results:
1585 -crunch-outgoing-cookies
1586 -crunch-incoming-cookies
1587 +deanimate-gifs{last}
1588 -downgrade-http-version
1592 -filter{shockwave-flash}
1593 -filter{crude-parental}
1594 +filter{html-annoyances}
1595 +filter{js-annoyances}
1596 +filter{content-cookies}
1598 +filter{refresh-tags}
1600 +filter{banners-by-size}
1601 +hide-forwarded-for-headers
1602 +hide-from-header{block}
1603 +hide-referer{forge}
1608 +prevent-compression
1611 -session-cookies-only
1612 +set-image-blocker{pattern} </PRE
1618 > Notice that the only difference from the previous listing is in
1621 >"fast-redirects"</SPAN
1624 >"session-cookies-only"</SPAN
1626 which are activated specifically for this site in our configuration.</P
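>
<P>The "last match wins" rule behind these final results can be modelled in a few lines. This is a simplified sketch and not Privoxy's implementation: the function name <TT CLASS="LITERAL">resolve_actions</TT> is hypothetical, only a few actions from the listing are used, and parameterised actions are keyed by their full text.</P>

```python
def resolve_actions(matching_sections):
    """Merge matching sections in order; later settings override earlier ones.

    Each section is a list of strings such as '+block' or
    '-session-cookies-only'. This is a simplified model for illustration.
    """
    state = {}
    for section in matching_sections:
        for action in section:
            sign, name = action[0], action[1:]
            if sign not in ("+", "-"):
                raise ValueError(f"action must start with + or -: {action!r}")
            state[name] = sign == "+"
    return state


if __name__ == "__main__":
    # A trimmed-down version of the google.com example above.
    defaults = ["+deanimate-gifs{last}", "+prevent-compression", "+session-cookies-only"]
    google_exception = ["-session-cookies-only"]
    for name, enabled in resolve_actions([defaults, google_exception]).items():
        print(("+" if enabled else "-") + name)
```

<P>Because the exception is processed after the defaults, its setting for <TT CLASS="LITERAL">session-cookies-only</TT> is the one that survives, while the untouched defaults carry through unchanged.</P>
<P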
1628 > Now another example, <SPAN
1630 >"ad.doubleclick.net"</SPAN
1641 > { +block +handle-as-image }
1644 { +block +handle-as-image }
1647 { +block +handle-as-image }
1648 .doubleclick.net</PRE
1654 > We'll just show the interesting part here, the explicit matches. It is
1655 matched three different times. Each as an <SPAN
1657 >"+block +handle-as-image"</SPAN
1659 which is the expanded form of one of our aliases that had been defined as:
1662 >"+imageblock"</SPAN
1664 HREF="actions-file.html#ALIASES"
1670 the first section of the actions file and typically used to combine more
1671 than one action.)</P
1673 > Any one of these would have done the trick and blocked this as an unwanted
1674 image. This is redundant, since the last case effectively
1675 would also cover the first. No point in taking chances with these guys
1676 though ;-) Note that if you want an ad or obnoxious
1677 URL to be invisible, it should be defined as <SPAN
1679 >"ad.doubleclick.net"</SPAN
1681 is done here -- as both a <A
1682 HREF="actions-file.html#BLOCK"
1696 HREF="actions-file.html#HANDLE-AS-IMAGE"
1699 >"+handle-as-image"</SPAN
1702 The custom alias <SPAN
1704 >"+imageblock"</SPAN
1705 > just simplifies the process and makes
1706 it more readable.</P
1708 > One last example. Let's try <SPAN
1710 >"http://www.rhapsodyk.net/adsl/HOWTO/"</SPAN
1712 This one is giving us problems. We are getting a blank page. Hmmm ...</P
1722 > Matches for http://www.rhapsodyk.net/adsl/HOWTO/:
1724 In file: default.action <SPAN
1734 -crunch-incoming-cookies
1735 -crunch-outgoing-cookies
1737 -downgrade-http-version
1739 +filter{html-annoyances}
1740 +filter{js-annoyances}
1741 +filter{kill-popups}
1744 +filter{banners-by-size}
1747 +hide-forwarded-for-headers
1748 +hide-from-header{block}
1749 +hide-referer{forge}
1753 +prevent-compression
1756 +session-cookies-only
1757 +set-image-blocker{blank} }
1760 { +block +handle-as-image }
1774 we did not want this at all! Now we see why we get the blank page. We could
1775 now add a new action below this that explicitly does <SPAN
1789 various ways to handle such exceptions. Example:</P
1806 > Now the page displays ;-) Be sure to flush your browser's caches when
1807 making such changes. Or, try using <TT
1812 > But now what about a situation where we get no explicit matches like
1823 > { +block +handle-as-image }
1830 > That actually was very telling and pointed us quickly to where the problem
1831 was. If you don't get this kind of match, then it means one of the default
1832 rules in the first section is causing the problem. This would require some
1833 guesswork, and maybe a little trial and error to isolate the offending rule.
1834 One likely cause would be one of the <SPAN
1838 tend to be harder to troubleshoot. Try adding the URL for the site to one of
1839 the aliases that turn off <SPAN
1854 .worldpay.com # for quietpc.com
1872 >"{ -filter -session-cookies-only }"</SPAN
1874 Or you could define your own exception to negate filtering: </P
1891 > This would turn off all filtering for that site. This would probably be most
1892 appropriately put in <TT
1898 > Images that are inexplicably being blocked may well be hitting the
1901 >"+filter{banners-by-size}"</SPAN
1902 > rule, which assumes
1903 that images of certain sizes are ad banners (works well most of the time
1904 since these tend to be standardized).</P
1909 > is an alias that disables most actions. This can be
1910 used as a last resort for problem sites. Remember to flush caches! If this
1911 still does not work, you will have to go through the remaining actions one by
1912 one to find which one(s) is causing the problem.</P
1920 SUMMARY="Footer navigation table"