X-Git-Url: http://www.privoxy.org/gitweb/misc.html?a=blobdiff_plain;f=doc%2Fwebserver%2Fuser-manual%2Ffilter-file.html;h=19642832daf9a1eb044f5130a60da6ea61749c7a;hb=72081f829de368392d04076728f8c991178c0080;hp=2ca2bb5b84677580d9a25f161ab560384ab9142e;hpb=701f0d2c06084708ab71fe06ded88d4b666dc826;p=privoxy.git

diff --git a/doc/webserver/user-manual/filter-file.html b/doc/webserver/user-manual/filter-file.html
index 2ca2bb5b..19642832 100644
--- a/doc/webserver/user-manual/filter-file.html
+++ b/doc/webserver/user-manual/filter-file.html
@@ -1,13 +1,13 @@

The Filter File
Privoxy 3.1.1 User Manual
Privoxy 3.0.3 User Manual

9. The Filter File

All text substitutions that can be invoked through the filter action must first be defined in the filter file, which is typically called default.filter and which can be selected through the filterfile config option.

Filtering works on any text-based document type, including HTML, JavaScript, CSS etc. (all text/* MIME types except text/plain). Substitutions are made at the source level, so if you want to "roll your own" filters, you should be familiar with HTML syntax.

Just like [...] the keyword FILTER:, followed by the filter's [...].

Once a filter called name has been defined in the filter file, it can be invoked by using an action of the form +filter{name} in any [...]

[...] Perl's s/// operator. If you are familiar with Perl, you will find this to be quite intuitive, and may want to look at the PCRS man page for the subtle differences to Perl behaviour. Most notably, the non-standard option letter U is supported, which turns the default to ungreedy matching.

[...] the s/// operator's syntax and [...]

9.1. Filter File Tutorial

Now, let's complete our "foo" [...] on each page. For global substitution, we'll need to add the g option: [...]
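The effect of the g option can be tried outside Privoxy. The following is only a rough Python analogy, not filter file syntax: the sample text and the replacement word "bar" are assumptions made for illustration, and Python's re.sub is global unless a count is given, which is the reverse of the default described above.

```python
import re

# Hypothetical sample page; "bar" is an assumed replacement word.
page = "foo one, foo two, foo three"

# Comparable to a job without the g option: only the first "foo" is replaced.
first_only = re.sub(r"foo", "bar", page, count=1)

# Comparable to a job with the g option: every "foo" is replaced.
everywhere = re.sub(r"foo", "bar", page)

print(first_only)
print(everywhere)
```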

Following the header line and a comment, you see the job. Note that it uses | as the delimiter instead of /, because the pattern contains a forward slash, which would otherwise have to be escaped by a backslash (\).

Now, let's examine the pattern: it starts with the text <script.* enclosed in parentheses. Since the dot matches any character, and * means: [...] text, i.e. it matches the whole page, from the start of the first <script> tag.

That's more than we want, but the pattern continues: document\.referrer matches only the exact string [...]

But there's still more pattern to go. The next element, again enclosed in parentheses, is .*</script>. You already know what .* means, so the whole pattern translates to: Match from the start of the first <script> tag in a page to the end of the last <script> tag, provided that the text [...]

This is still not the whole story, since we have ignored the options and the parentheses: The portions of the page matched by sub-patterns that are enclosed in parentheses will be remembered and be available through the variables $1, $2, ... in the substitute. The U option switches to ungreedy matching, which means that the first .* in the pattern will only "eat up" [...] "document.referrer", and that the second .* will only span the text up to the "</script>" tag. Furthermore, the s option says that the match may span multiple lines in the page, and the g option again means that the substitution is global.

[...] "document.referrer" as $1, and the part following that string, up to and including the closing tag, as $2.

Now the pattern is deciphered, but wasn't this about substituting things? So let's look at the substitute: $1"Not Your Business!"$2 is easy to read: The text remembered as $1, followed by "Not Your Business!" (including the quotation marks!), followed by the text remembered as $2. This produces an exact copy of the original string, with the middle part (the "document.referrer") replaced by "Not Your Business!".
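The pattern and substitute discussed above can be approximated in Python. This is a sketch under stated assumptions, not Privoxy's implementation: Python's re module has no U option, so ungreedy matching is written as .*?, re.DOTALL stands in for the s option, and the function name and sample page are made up for the example.

```python
import re

# (<script.*?) plays the role of $1, (.*?</script>) the role of $2.
# re.DOTALL lets "." match newlines, so the match may span several lines.
JOB = re.compile(r"(<script.*?)document\.referrer(.*?</script>)", re.DOTALL)


def hide_referrer(page: str) -> str:
    """Replace document.referrer inside script blocks (hypothetical helper)."""
    # \g<1> and \g<2> are Python's spelling of $1 and $2 in the substitute.
    return JOB.sub(r'\g<1>"Not Your Business!"\g<2>', page)


page = '<script>\nvar r = document.referrer;\n</script>\n<p>document.referrer</p>'
result = hide_referrer(page)
print(result)
```

In this sketch the occurrence inside the script block is replaced, while the one in the paragraph after it is left alone, because no <script> tag precedes it within the same match.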

The whole job now reads: Replace "document.referrer" by "Not Your Business!" wherever it appears inside a <script> tag. Note that this job won't break JavaScript syntax, since both the original and the replacement are syntactically valid [...]

s/window\.status\s*=\s*(['"]).*?\1/dUmMy=1/ig
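The window.status job shown above carries over to Python almost unchanged. The sketch below is an illustration only: the function name and the sample scripts are assumptions, re.IGNORECASE stands in for the i option, and re.sub replaces all matches by default, like the g option.

```python
import re

# (['"]) captures the opening quote; \1 then demands the same kind of quote
# to close the string, so a different quote inside the text is skipped over.
JOB = re.compile(r"""window\.status\s*=\s*(['"]).*?\1""", re.IGNORECASE)


def neutralize_status(script: str) -> str:
    """Replace status bar assignments with a harmless dummy assignment."""
    return JOB.sub("dUmMy=1", script)


single = neutralize_status("window.status = 'Click \"here\"!';")
upper = neutralize_status('WINDOW.STATUS="hi"; x = 1;')
print(single)
print(upper)
```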

\s stands for whitespace characters (space, tab, newline, carriage return, form feed), so that \s* means: "zero or more whitespace". The ? in .*? makes this matching of arbitrary text ungreedy. (Note that the U option is not set). The ['"] construct means: "a single or a double quote". Finally, \1 is a backreference to the first parenthesis just like $1 above, with the difference that in the [...]

[...] "<body>" tags with the dummy word never. Note that the i option makes the pattern matching case-insensitive. Also note that ungreedy matching alone doesn't always guarantee a minimal match: In the first parenthesis, we had to use [^>]* instead of .* to prevent the match from exceeding the <body> tag if it doesn't contain [...]
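Why ungreedy matching is not enough on its own can be seen in a small Python experiment. The attribute name onunload and the sample page are assumptions chosen for illustration; the patterns are simplified stand-ins and not the job from default.filter.

```python
import re

# Hypothetical page: the <body> tag itself has no onunload attribute,
# but a later tag does.
page = '<body bgcolor="white"><a onunload="x()">link</a>'

# Ungreedy .*? still runs past the end of the <body> tag to find a match.
loose = re.sub(r"(<body .*?)onunload", r"\g<1>never", page, flags=re.IGNORECASE)

# [^>]* cannot cross ">", so only attributes inside <body ...> can match.
strict = re.sub(r"(<body [^>]*)onunload", r"\g<1>never", page, flags=re.IGNORECASE)

print(loose)
print(strict)
```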

Note the (?!\.com) part (a so-called negative lookahead) in the job's pattern, which means: Don't match, if the string [...]

The x option in this job turns on extended syntax, and allows for e.g. the liberal use of (non-interpreted!) whitespace for nicer formatting.
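Both ideas, the negative lookahead and the extended syntax, have direct counterparts in Python's re module. The sketch below uses re.VERBOSE in place of the x option; the words "example" and "sample" are placeholders and are not taken from the job discussed above.

```python
import re

# With re.VERBOSE, whitespace and "#" comments in the pattern are ignored.
JOB = re.compile(
    r"""
    example      # the word to replace ...
    (?!\.com)    # ... unless it is directly followed by ".com"
    """,
    re.VERBOSE,
)

text = "Visit example.com or read about example products."
replaced = JOB.sub("sample", text)
print(replaced)
```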

You get the idea?

9.2. The Pre-defined Filters

The distribution default.filter file contains a selection of pre-defined filters for your convenience:

js-annoyances

The purpose of this filter is to get rid of particularly annoying JavaScript abuse. To that end, it

  • replaces JavaScript references to the browser's referrer information with the string "Not Your Business!". This complements the hide-referrer action on the content level.

  • removes the bindings to the DOM's unload event, which we feel has no right to exist and is responsible for most "exit consoles", i.e. nasty windows that pop up when you close another one.

  • removes code that causes new windows to be opened with undesired properties, such as being full-screen, non-resizable, without location, status or menu bar etc.

js-events

This is a very radical measure. It removes virtually all JavaScript event bindings, which means that scripts can no longer react to user actions such as mouse movements or clicks, window resizing etc.

We strongly discourage using this filter as a default since it breaks many legitimate scripts. It is meant for use only on extra-nasty sites (should you really need to go there).

html-annoyances

This filter will undo many common instances of HTML-based abuse.

The BLINK and MARQUEE tags are neutralized (yeah baby!), and browser windows will be created as resizable (as of course they should be!), and will have location, scroll and menu bars -- even if specified otherwise.

content-cookies

Most cookies are set in the HTTP dialogue, where they can be intercepted by the crunch-incoming-cookies and crunch-outgoing-cookies actions. But web sites increasingly make use of HTML meta tags and JavaScript to sneak cookies to the browser on the content level.

This filter disables HTML and JavaScript code that reads or sets cookies. Use it wherever you would also use the cookie crunch actions.

refresh-tags

Disable any refresh tags if the interval is greater than nine seconds (so that redirections done via refresh tags are not destroyed). This is useful for dial-on-demand setups, or for those who find this HTML feature annoying.
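The nine-second rule can be sketched in Python as follows. This is an illustration under assumptions, not the filter's actual job: the pattern, the function name and the choice to delete the tag outright (rather than disable it in some other way) are all made up for the example.

```python
import re

# Hypothetical pattern for <meta http-equiv="refresh" content="N; ...">;
# group 1 captures the interval N in seconds.
REFRESH = re.compile(
    r"""<meta\s+http-equiv\s*=\s*["']?refresh["']?\s+content\s*=\s*["']?(\d+)[^>]*>""",
    re.IGNORECASE,
)


def disable_slow_refresh(page: str, limit: int = 9) -> str:
    """Drop refresh tags whose interval exceeds `limit` seconds."""

    def replace(match: re.Match) -> str:
        interval = int(match.group(1))
        # Short intervals are usually redirections, so they are kept.
        return match.group(0) if interval <= limit else ""

    return REFRESH.sub(replace, page)


redirect = '<meta http-equiv="refresh" content="0; url=/new">'
reload_tag = '<META HTTP-EQUIV="Refresh" CONTENT="300">'
print(disable_slow_refresh(redirect))
print(repr(disable_slow_refresh(reload_tag)))
```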

unsolicited-popups

This filter attempts to prevent only "unsolicited" pop-up windows from opening, yet still allow pop-up windows that the user has explicitly chosen to open. It was added in version 3.0.1, as an improvement over earlier such filters.

Technical note: The filter works by redefining the window.open JavaScript function to a dummy function during the loading and rendering phase of each HTML page access, and restoring the function afterwards.

all-popups

Attempt to prevent all pop-up windows from opening. Use this filter with more caution than the one above, since it is more likely to break sites that require pop-ups for normal usage.

img-reorder

This is a helper filter that has no value if used alone. It makes the banners-by-size and banners-by-link (see below) filters more effective and should be enabled together with them.

banners-by-size

This filter removes image tags purely based on their size. Fortunately for us, many ads and banner images tend to conform to certain standardized sizes, which makes this filter quite effective for ad stripping purposes.

Occasionally this filter will cause false positives on images that are not ads, but just happen to be of one of the standard banner sizes.
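Size-based removal, including the false positives it can cause, can be illustrated with a short Python sketch. The list of sizes, the helper name and the sample tags are assumptions for the example; the sizes matched by the real filter are not listed here.

```python
import re

# Assumed examples of banner dimensions (width, height) in pixels.
BANNER_SIZES = {(468, 60), (120, 600), (728, 90)}

IMG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
WIDTH = re.compile(r"""\bwidth\s*=\s*["']?(\d+)""", re.IGNORECASE)
HEIGHT = re.compile(r"""\bheight\s*=\s*["']?(\d+)""", re.IGNORECASE)


def strip_banners(page: str) -> str:
    """Remove <img> tags whose declared size is in BANNER_SIZES."""

    def replace(match: re.Match) -> str:
        tag = match.group(0)
        width, height = WIDTH.search(tag), HEIGHT.search(tag)
        if width and height:
            size = (int(width.group(1)), int(height.group(1)))
            if size in BANNER_SIZES:
                return ""
        return tag

    return IMG.sub(replace, page)


ad = '<img src="ad.gif" width="468" height="60">'
photo = '<img src="photo.jpg" width="640" height="480">'
header = '<img src="site-header.png" height=60 width=468>'
print(strip_banners(ad + photo))
print(repr(strip_banners(header)))
```

The last call shows the false positive: a site's own header image that happens to be 468 by 60 pixels is removed as well.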

banners-by-link

This is an experimental filter that attempts to kill any banners if their URLs seem to point to known or suspected click trackers. It is currently not of much value and is not recommended for use by default.

webbugs

Webbugs are small, invisible images (technically 1x1 GIF images) that are used to track users across websites and collect information on them. As an HTML page is loaded by the browser, an embedded image tag causes the browser to contact a third-party site, disclosing the tracking information through the requested URL and/or cookies for that third-party domain, without the user ever becoming aware of the interaction with the third-party site. HTML-ized spam also uses a similar technique to verify email addresses.

This filter removes the HTML code that loads such "webbugs".
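A minimal Python sketch of the idea follows. It assumes that a webbug can be recognized by width and height attributes that are both 1; the pattern and the sample URLs are hypothetical and not the job used by the filter.

```python
import re

# Two lookaheads check, in any attribute order, that the tag declares
# width=1 and height=1; the final [^>]*> then consumes the whole tag.
WEBBUG = re.compile(
    r"""<img\b(?=[^>]*\bwidth\s*=\s*["']?1\b)(?=[^>]*\bheight\s*=\s*["']?1\b)[^>]*>""",
    re.IGNORECASE,
)


def strip_webbugs(page: str) -> str:
    """Remove <img> tags that are declared as 1x1 pixels."""
    return WEBBUG.sub("", page)


bug = '<img src="http://tracker.example/t.gif" height="1" width="1">'
logo = '<img src="logo.png" width="100" height="10">'
print(strip_webbugs("<p>Hello" + bug + logo + "</p>"))
```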

tiny-textforms

A rather special-purpose filter that can be used to enlarge textareas (those multi-line text boxes in web forms) and turn off hard word wrap in them. It was written for the sourceforge.net tracker system where such boxes are a nuisance, but it can be handy on other sites, too.

It is not recommended to use this filter as a default.

jumping-windows

Many consider windows that move or resize themselves to be abusive. This filter neutralizes the related JavaScript code. Note that some sites might not display or behave as intended when using this filter.

frameset-borders

Some web designers seem to assume that everyone in the world will view their web sites using the same browser brand and version, screen resolution etc., because only that assumption could explain why they'd use static frame sizes, yet prevent their frames from being resized by the user, should they be too small to show their whole content.

This filter removes the related HTML code. It should only be applied to sites which need it.

demoronizer

Many Microsoft products that generate HTML use non-standard extensions (read: violations) of the ISO 8859-1 aka Latin-1 character set. This causes those HTML documents to display with errors on standard-compliant platforms.

This filter translates the MS-only characters into Latin-1 equivalents. It is not necessary when using MS products, and will cause corruption of all documents that use 8-bit character sets other than Latin-1. It's mostly worthwhile for Europeans on non-MS platforms, if weird garbage characters sometimes appear on some pages.
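The translation can be sketched in Python on raw bytes. The table below is an assumed subset of the Windows-1252 code points involved, chosen for illustration; the filter's actual replacement table is not reproduced here.

```python
# Assumed subset: Windows-1252 bytes in the 0x80-0x9F range mapped to
# plain equivalents that exist in Latin-1.
CP1252_TO_LATIN1 = {
    0x85: b"...",  # horizontal ellipsis
    0x91: b"'",    # left single quotation mark
    0x92: b"'",    # right single quotation mark
    0x93: b'"',    # left double quotation mark
    0x94: b'"',    # right double quotation mark
    0x96: b"-",    # en dash
    0x97: b"--",   # em dash
}


def demoronize(data: bytes) -> bytes:
    """Replace the mapped bytes and pass every other byte through unchanged."""
    return b"".join(CP1252_TO_LATIN1.get(byte, bytes([byte])) for byte in data)


cleaned = demoronize(b"\x93Hi\x94 \x96 it\x92s")
print(cleaned)

# The same table damages text in other encodings: in UTF-8, the letter
# U+0113 is the byte pair C4 93, and its second byte gets rewritten.
damaged = demoronize("\u0113".encode("utf-8"))
print(damaged)
```

The second call illustrates the corruption warning above: bytes that are ordinary characters in another encoding are rewritten just the same.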

shockwave-flash

A filter for Shockwave haters. As the name suggests, this filter strips code out of web pages that is used to embed Shockwave Flash objects.

quicktime-kioskmode

Change HTML code that embeds QuickTime objects so that kioskmode, which prevents saving, is disabled.

fun

Text replacements for subversive browsing fun. Make fun of your favorite Monopolist or play buzzword bingo.

crude-parental

A demonstration-only filter that shows how Privoxy can be used to delete web content on a keyword basis.

ie-exploits

A collection of text replacements to disable malicious HTML and JavaScript code that exploits known security holes in Internet Explorer.

Presently, it only protects against Nimda and a cross-site scripting bug, and would need active maintenance to provide more substantial protection.

site-specifics

Some web sites have very specific problems, the cure for which doesn't apply anywhere else, or could even cause damage on other sites.

This is a collection of such site-specific cures which should only be applied to the sites they were intended for, which is what the supplied default.action file does. Users shouldn't need to change anything regarding this filter.