1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
8 CONTENT="Modular DocBook HTML Stylesheet Version 1.79"><LINK
10 TITLE="Privoxy 3.0.7 User Manual"
11 HREF="index.html"><LINK
14 HREF="actions-file.html"><LINK
16 TITLE="Privoxy's Template Files"
17 HREF="templates.html"><LINK
21 <LINK REL="STYLESHEET" TYPE="text/css" HREF="p_doc.css">
33 SUMMARY="Header navigation table"
42 >Privoxy 3.0.7 User Manual</TH
50 HREF="actions-file.html"
82 > On-the-fly text substitutions need
83 to be defined in a <SPAN
87 can then be invoked as an <SPAN
95 > supports three different filter actions:
99 HREF="actions-file.html#FILTER"
103 rewrite the content that is sent to the client,
107 HREF="actions-file.html#CLIENT-HEADER-FILTER"
108 >client-header-filter</A
111 to rewrite headers that are sent by the client, and
115 HREF="actions-file.html#SERVER-HEADER-FILTER"
116 >server-header-filter</A
119 to rewrite headers that are sent by the server.</P
124 > also supports two tagger actions:
128 HREF="actions-file.html#CLIENT-HEADER-TAGGER"
129 >client-header-tagger</A
136 HREF="actions-file.html#SERVER-HEADER-TAGGER"
137 >server-header-tagger</A
140 Taggers and filters use the same syntax in the filter files; the difference
141 is that taggers don't modify the text they are filtering, but use a rewritten
142 version of the filtered text as a tag. The tags can then be used to change
143 which actions apply, through sections with <A
144 HREF="actions-file.html#TAG-PATTERN"
148 > Multiple filter files can be defined through the <TT
151 HREF="config.html#FILTERFILE"
154 > config directive. The filters
155 as supplied by the developers will be found in
159 >. It is recommended that any locally
160 defined or modified filters go in a separately defined file such as
167 > Common tasks for content filters are to eliminate common annoyances in
168 HTML and JavaScript, such as pop-up windows,
169 exit consoles, crippled windows without navigation tools, the
170 infamous &lt;BLINK&gt; tag etc., to suppress images with certain
171 width and height attributes (standard banner sizes or web-bugs),
172 or just to have fun.</P
174 > Content filtering works on any text-based document type, including
175 HTML, JavaScript, CSS etc. (all <TT
189 Substitutions are made at the source level, so if you want to <SPAN
193 > filters, you should first be familiar with HTML syntax,
194 and, of course, regular expressions.</P
197 HREF="actions-file.html"
200 filter file is organized in sections, which are called <SPAN
207 here. Each filter consists of a heading line that starts with one of the
220 >CLIENT-HEADER-FILTER:</TT
223 >SERVER-HEADER-FILTER:</TT
225 followed by the filter's <SPAN
231 >, and a short (one line)
238 > of what it does. Below that line
245 >, i.e. lines that define the actual
246 text substitutions. By convention, the name of a filter
247 should describe what the filter <SPAN
254 comment is used in the <A
255 HREF="http://config.privoxy.org/"
261 > Once a filter called <TT
267 in the filter file, it can be invoked by using an action of the form
271 HREF="actions-file.html#FILTER"
281 HREF="actions-file.html"
285 > Filter definitions start with a header line that contains the filter
286 type, the filter name and the filter description.
287 A content filter header line for a filter called <SPAN
301 >FILTER: foo Replace all "foo" with "bar"</PRE
307 > Below that line, and up to the next header line, come the jobs that
308 define what text replacements the filter executes. They are specified
309 in a syntax that imitates <A
310 HREF="http://www.perl.org/"
317 > operator. If you are familiar with Perl, you
318 will find this to be quite intuitive, and may want to look at the
319 PCRS documentation for the subtle differences to Perl behaviour. Most
320 notably, the non-standard option letter <TT
324 which turns the default to ungreedy matching.</P
328 HREF="http://en.wikipedia.org/wiki/Regular_expressions"
335 >, you might want to take a look at
337 HREF="appendix.html#REGEX"
338 >Appendix on regular expressions</A
341 HREF="http://perldoc.perl.org/perlre.html"
347 HREF="http://perldoc.perl.org/perlop.html"
353 > operator's syntax</A
355 HREF="http://perldoc.perl.org/perlre.html"
360 The examples below might also help to get you started.</P
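><P
> As a hedged illustration of the invocation step described above, here is a
 minimal actions-file sketch. It assumes that a content filter named "foo" has
 been defined in a filter file; the host pattern <TT
>.example.org</TT
> is a placeholder and not part of the supplied files:</P
><PRE
># Hypothetical actions-file section (not taken from the distribution).
# Applies the content filter "foo" to example.org and its subdomains.
{ +filter{foo} }
.example.org</PRE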
367 >9.1. Filter File Tutorial</A
370 > Now, let's complete our <SPAN
373 > content filter. We have already defined
374 the heading, but the jobs are still missing. Since all it does is to replace
381 >, there is only one (trivial) job
398 > But wait! Didn't the comment say that <SPAN
408 > should be replaced? Our current job will only take
409 care of the first <SPAN
412 > on each page. For global substitution,
413 we'll need to add the <TT
432 > Our complete filter now looks like this:</P
442 >FILTER: foo Replace all "foo" with "bar"
449 > Let's look at some real filters for more interesting examples. Here you see
450 a filter that protects against some common annoyances that arise from JavaScript
451 abuse. Let's look at its jobs one after the other:</P
461 >FILTER: js-annoyances Get rid of particularly annoying JavaScript abuse
463 # Get rid of JavaScript referrer tracking. Test page: http://www.randomoddness.com/untitled.htm
465 s|(&lt;script.*)document\.referrer(.*&lt;/script&gt;)|$1"Not Your Business!"$2|Usg</PRE
471 > Following the header line and a comment, you see the job. Note that it uses
475 > as the delimiter instead of <TT
479 the pattern contains a forward slash, which would otherwise have to be escaped
485 > Now, let's examine the pattern: it starts with the text <TT
489 enclosed in parentheses. Since the dot matches any character, and <TT
495 >"Match an arbitrary number of the element left of myself"</SPAN
507 it matches the whole page, from the start of the first &lt;script&gt; tag.</P
509 > That's more than we want, but the pattern continues: <TT
511 >document\.referrer</TT
513 matches only the exact string <SPAN
515 >"document.referrer"</SPAN
523 >, i.e. preceded by a backslash, to take away its
524 special meaning as a wildcard, and make it just a regular dot. So far, the meaning is:
525 Match from the start of the first &lt;script&gt; tag in the page, up to, and including,
528 >"document.referrer"</SPAN
536 in the page (and appear in that order).</P
538 > But there's still more pattern to go. The next element, again enclosed in parentheses,
541 >.*&lt;/script&gt;</TT
542 >. You already know what <TT
546 means, so the whole pattern translates to: Match from the start of the first &lt;script&gt;
547 tag in a page to the end of the last &lt;/script&gt; tag, provided that the text
550 >"document.referrer"</SPAN
551 > appears somewhere in between.</P
553 > This is still not the whole story, since we have ignored the options and the parentheses:
554 The portions of the page matched by sub-patterns that are enclosed in parentheses will be
555 remembered and be available through the variables <TT
559 the substitute. The <TT
562 > option switches to ungreedy matching, which means
566 > in the pattern will only <SPAN
570 text in between <SPAN
582 >"document.referrer"</SPAN
583 >, and that the second <TT
587 only span the text up to the <SPAN
595 >"&lt;/script&gt;"</SPAN
597 tag. Furthermore, the <TT
600 > option says that the match may span
601 multiple lines in the page, and the <TT
604 > option again means that the
605 substitution is global.</P
607 > So, to summarize, the pattern means: Match all scripts that contain the text
610 >"document.referrer"</SPAN
611 >. Remember the parts of the script from
612 (and including) the start tag up to (and excluding) the string
615 >"document.referrer"</SPAN
619 >, and the part following
620 that string, up to and including the closing tag, as <TT
625 > Now the pattern is deciphered, but wasn't this about substituting things? So
626 let's look at the substitute: <TT
628 >$1"Not Your Business!"$2</TT
630 easy to read: The text remembered as <TT
636 >"Not Your Business!"</TT
644 the quotation marks!), followed by the text remembered as <TT
648 This produces an exact copy of the original string, with the middle part
651 >"document.referrer"</SPAN
658 > The whole job now reads: Replace <SPAN
660 >"document.referrer"</SPAN
664 >"Not Your Business!"</TT
665 > wherever it appears inside a
666 &lt;script&gt; tag. Note that this job won't break JavaScript syntax,
667 since both the original and the replacement are syntactically valid
668 string objects. The script just won't have access to the referrer
669 information anymore.</P
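><P
> A small worked example may help here. The page fragment below is made up for
 illustration (it is not from the distribution); it sketches what the job
 discussed above would be expected to do to a single one-line script:</P
><PRE
># The job from the js-annoyances filter:
s|(&lt;script.*)document\.referrer(.*&lt;/script&gt;)|$1"Not Your Business!"$2|Usg

# Hypothetical input:
&lt;script&gt;var from = document.referrer;&lt;/script&gt;

# Expected result after filtering:
&lt;script&gt;var from = "Not Your Business!";&lt;/script&gt;</PRE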
671 > We'll show you two other jobs from the JavaScript taming department, but
672 this time only point out the constructs of special interest:</P
682 ># The status bar is for displaying link targets, not pointless blahblah
684 s/window\.status\s*=\s*(['"]).*?\1/dUmMy=1/ig</PRE
693 > stands for whitespace characters (space, tab, newline,
694 carriage return, form feed), so that <TT
700 or more whitespace"</SPAN
708 makes this matching of arbitrary text ungreedy. (Note that the <TT
712 option is not set). The <TT
715 > construct means: <SPAN
724 > a double quote"</SPAN
729 a back-reference to the first parenthesis just like <TT
733 with the difference that in the <SPAN
739 >, a backslash indicates
740 a back-reference, whereas in the <SPAN
746 >, it's the dollar.</P
748 > So what does this job do? It replaces assignments of single- or double-quoted
751 >"window.status"</SPAN
752 > object with a dummy assignment
753 (using a variable name that is hopefully odd enough not to conflict with
754 real variables in scripts). Thus, it catches many cases where e.g. pointless
755 descriptions are displayed in the status bar instead of the link target when
756 you move your mouse over links.</P
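><P
> To make the back-reference concrete, here is a sketch with a made-up link
 (the markup is an assumption, for illustration only). Because the opening
 quote captured by the parentheses is a single quote, the back-reference only
 matches the closing single quote:</P
><PRE
># The job:
s/window\.status\s*=\s*(['"]).*?\1/dUmMy=1/ig

# Hypothetical input:
&lt;a href="/news" onmouseover="window.status='Click me!'; return true"&gt;News&lt;/a&gt;

# Expected result after filtering:
&lt;a href="/news" onmouseover="dUmMy=1; return true"&gt;News&lt;/a&gt;</PRE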
766 ># Kill OnUnload popups. Yummy. Test: http://www.zdnet.com/zdsubs/yahoo/tree/yfs.html
768 s/(&lt;body [^&gt;]*)onunload(.*&gt;)/$1never$2/iU</PRE
776 HREF="http://www.w3.org/TR/2000/REC-DOM-Level-2-Events-20001113/events.html#Events-eventgroupings-htmlevents"
780 > in the HTML DOM was a <SPAN
787 When I close a browser window, I want it to close and die. Basta.
788 This job replaces the <SPAN
794 >"&lt;body&gt;"</SPAN
795 > tags with the dummy word <TT
802 > option makes the pattern matching
803 case-insensitive. Also note that ungreedy matching alone doesn't always guarantee
804 a minimal match: In the first parenthesis, we had to use <TT
811 > to prevent the match from exceeding the
812 &lt;body&gt; tag if it doesn't contain <SPAN
818 > The last example is from the fun department:</P
828 >FILTER: fun Fun text replacements
830 # Spice the daily news:
832 s/microsoft(?!\.com)/MicroSuck/ig</PRE
841 > part (a so-called negative lookahead)
842 in the job's pattern, which means: Don't match, if the string
846 > appears directly following <SPAN
850 in the page. This prevents links to microsoft.com from being trashed, while
851 still replacing the word everywhere else.</P
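><P
> As an illustration of the lookahead, consider the following made-up sentence
 (the input is hypothetical). Only the first occurrence is replaced, because
 the second one is directly followed by ".com":</P
><PRE
># The job:
s/microsoft(?!\.com)/MicroSuck/ig

# Hypothetical input:
Microsoft announced a product. See http://www.microsoft.com/ for details.

# Expected result after filtering:
MicroSuck announced a product. See http://www.microsoft.com/ for details.</PRE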
861 ># Buzzword Bingo (example for extended regex syntax)
863 s* industry[ -]leading \
865 | customer[ -]focused \
867 | award[ -]winning # Comments are OK, too! \
868 | high[ -]performance \
869 | solutions[ -]based \
873 *&lt;font color="red"&gt;&lt;b&gt;BINGO!&lt;/b&gt;&lt;/font&gt; \
883 > option in this job turns on extended syntax, and allows for
884 e.g. the liberal use of (non-interpreted!) whitespace for nicer formatting.</P
886 > You get the idea?</P
893 NAME="PREDEFINED-FILTERS"
894 >9.2. The Pre-defined Filters</A
897 >The distribution <TT
900 > file contains a selection of
901 pre-defined filters for your convenience:</P
917 > The purpose of this filter is to get rid of particularly annoying JavaScript abuse.
924 > replaces JavaScript references to the browser's referrer information
925 with the string "Not Your Business!". This complements the <TT
928 HREF="actions-file.html#HIDE-REFERRER"
931 > action on the content level.
936 > removes the bindings to the DOM's
938 HREF="http://www.w3.org/TR/2000/REC-DOM-Level-2-Events-20001113/events.html#Events-eventgroupings-htmlevents"
942 > which we feel has no right to exist and is responsible for most <SPAN
944 >"exit consoles"</SPAN
946 nasty windows that pop up when you close another one.
951 > removes code that causes new windows to be opened with undesired properties, such as being
952 full-screen, non-resizable, without location, status or menu bar etc.
959 > Use with caution. This is an aggressive filter, and can break sites that
960 rely heavily on JavaScript.
973 > This is a very radical measure. It removes virtually all JavaScript event bindings, which
974 means that scripts can no longer react to user actions such as mouse movements or clicks, window
975 resizing etc. Use with caution!
982 >strongly discourage</I
984 > using this filter as a default since it breaks
985 many legitimate scripts. It is meant for use only on extra-nasty sites (should you really
999 > This filter will undo many common instances of HTML based abuse.
1009 are neutralized (yeah baby!), and browser windows will be created as
1010 resizable (as of course they should be!), and will have location,
1011 scroll and menu bars -- even if specified otherwise.
1024 > Most cookies are set in the HTTP dialog, where they can be intercepted
1029 HREF="actions-file.html#CRUNCH-INCOMING-COOKIES"
1030 >crunch-incoming-cookies</A
1036 HREF="actions-file.html#CRUNCH-OUTGOING-COOKIES"
1037 >crunch-outgoing-cookies</A
1040 actions. But web sites increasingly make use of HTML meta tags and JavaScript
1041 to sneak cookies to the browser on the content level.
1044 > This filter disables most HTML and JavaScript code that reads or sets
1045 cookies. It cannot detect all clever uses of these types of code, so it
1046 should not be relied on as an absolute fix. Use it wherever you would also
1047 use the cookie crunch actions.
1060 > Disable any refresh tags if the interval is greater than nine seconds (so
1061 that redirections done via refresh tags are not destroyed). This is useful
1062 for dial-on-demand setups, or for those who find this HTML feature
1071 >unsolicited-popups</I
1076 > This filter attempts to prevent only <SPAN
1078 >"unsolicited"</SPAN
1080 windows from opening, yet still allow pop-up windows that the user
1081 has explicitly chosen to open. It was added in version 3.0.1,
1082 as an improvement over earlier such filters.
1085 > Technical note: The filter works by redefining the window.open JavaScript
1086 function to a dummy function, <TT
1088 >PrivoxyWindowOpen()</TT
1090 during the loading and rendering phase of each HTML page access, and
1091 restoring the function afterward.
1094 > This is recommended only for browsers that cannot perform this function
1095 reliably themselves. And be aware that some sites require such windows
1096 in order to function normally. Use with caution.
1109 > Attempt to prevent <SPAN
1115 > pop-up windows from opening.
1116 Note that this should be used with even more discretion than the above, since
1117 it is more likely to break some sites that require pop-ups for normal
1118 usage. Use with caution.
1131 > This is a helper filter that has no value if used alone. It makes the
1134 >banners-by-size</TT
1137 >banners-by-link</TT
1139 (see below) filters more effective and should be enabled together with them.
1152 > This filter removes image tags purely based on what size they are. Fortunately
1153 for us, many ads and banner images tend to conform to certain standardized
1154 sizes, which makes this filter quite effective for ad stripping purposes.
1157 > Occasionally this filter will cause false positives on images that are not ads,
1158 but just happen to be of one of the standard banner sizes.
1161 > Recommended only for those who require extreme ad blocking. The default
1162 block rules should catch 95+% of all ads <SPAN
1168 > this filter enabled.
1181 > This is an experimental filter that attempts to kill any banners if
1182 their URLs seem to point to known or suspected click trackers. It is currently
1183 not of much value and is not recommended for use by default.
1196 > Webbugs are small, invisible images (technically 1x1 GIF images) that
1197 are used to track users across websites, and collect information on them.
1198 As an HTML page is loaded by the browser, an embedded image tag causes the
1199 browser to contact a third-party site, disclosing the tracking information
1200 through the requested URL and/or cookies for that third-party domain, without
1201 the user ever becoming aware of the interaction with the third-party site.
1202 HTML-ized spam also uses a similar technique to verify email addresses.
1205 > This filter removes the HTML code that loads such <SPAN
1221 > A rather special-purpose filter that can be used to enlarge textareas (those
1222 multi-line text boxes in web forms) and turn off hard word wrap in them.
1223 It was written for the sourceforge.net tracker system where such boxes are
1224 a nuisance, but it can be handy on other sites, too.
1227 > It is not recommended to use this filter as a default.
1240 > Many consider windows that move or resize themselves to be abusive. This filter
1241 neutralizes the related JavaScript code. Note that some sites might not display
1242 or behave as intended when using this filter. Use with caution.
1250 >frameset-borders</I
1255 > Some web designers seem to assume that everyone in the world will view their
1256 web sites using the same browser brand and version, screen resolution etc,
1257 because only that assumption could explain why they'd use static frame sizes,
1258 yet prevent their frames from being resized by the user, should they be too
1259 small to show their whole content.
1262 > This filter removes the related HTML code. It should only be applied to sites
1276 > Many Microsoft products that generate HTML use non-standard extensions (read:
1277 violations) of the ISO 8859-1 aka Latin-1 character set. This can cause those
1278 HTML documents to display with errors on standards-compliant platforms.
1281 > This filter translates the MS-only characters into Latin-1 equivalents.
1282 It is not necessary when using MS products, and will cause corruption of
1283 all documents that use 8-bit character sets other than Latin-1. It's mostly
1284 worthwhile for Europeans on non-MS platforms, if weird garbage characters
1285 sometimes appear on some pages, or user agents that don't correct for this on
1300 > A filter for Shockwave haters. As the name suggests, this filter strips code
1301 that is used to embed Shockwave Flash objects out of web pages.
1311 >quicktime-kioskmode</I
1316 > Change HTML code that embeds QuickTime objects so that kioskmode, which
1317 prevents saving, is disabled.
1330 > Text replacements for subversive browsing fun. Make fun of your favorite
1331 monopolist or play buzzword bingo.
1344 > A demonstration-only filter that shows how <SPAN
1348 can be used to delete web content on a keyword basis.
1361 > An experimental collection of text replacements to disable malicious HTML and JavaScript
1362 code that exploits known security holes in Internet Explorer.
1365 > Presently, it only protects against Nimda and a cross-site scripting bug, and
1366 would need active maintenance to provide more substantial protection.
1379 > Some web sites have very specific problems, the cure for which doesn't apply
1380 anywhere else, or could even cause damage on other sites.
1383 > This is a collection of such site-specific cures which should only be applied
1384 to the sites they were intended for, which is what the supplied
1388 > file does. Users shouldn't need to change
1389 anything regarding this filter.
1402 > A CSS based block for Google text ads. Also removes a width limitation
1403 and the toolbar advertisement.
1416 > Another CSS based block, this time for Yahoo text ads. It also removes
1417 a width limitation.
1430 > Another CSS based block, this time for MSN text ads. It also removes
1431 tracking URLs, as well as a width limitation.
1444 > Cleans up some Blogspot blogs. Read the fine print before using this one!
1447 > This filter also intentionally removes some navigation stuff and sets the
1448 page width to 100%. As a result, some rounded <SPAN
1452 appear too early or not at all and, as fixing this would require a browser
1453 that understands background-size (CSS3), they are removed instead.
1466 > Server-header filter to change the Content-Type from xml to html.
1479 > Server-header filter to change the Content-Type from html to xml.
1492 > Removes the non-standard <TT
1496 anchor and area HTML tags.
1504 >hide-tor-exit-notation</I
1509 > Client-header filter to remove the <B
1512 > exit node notation
1513 found in Host and Referer headers.
1522 > are chained and <SPAN
1526 is configured to use socks4a, one can use <SPAN
1528 >"http://www.example.org.foobar.exit/"</SPAN
1530 to access the host <SPAN
1532 >"www.example.org"</SPAN
1543 > As the HTTP client isn't aware of this notation, it treats the
1546 >"www.example.org.foobar.exit"</SPAN
1547 > as host and uses it
1555 server's point of view the resulting headers are invalid and can cause problems.
1561 > header can trigger <SPAN
1563 >"hot-linking"</SPAN
1565 protections, an invalid <SPAN
1568 > header will make it impossible for
1569 the server to find the right vhost (several domains hosted on the same IP address).
1572 > This client-header filter removes the <SPAN
1575 > part in those headers
1576 to prevent the mentioned problems. Note that it only modifies
1577 the HTTP headers; it doesn't make it impossible for the server
1581 > exit node based on the IP address
1582 the request is coming from.
1594 SUMMARY="Footer navigation table"
1605 HREF="actions-file.html"
1623 HREF="templates.html"
1643 >Privoxy's Template Files</TD