- Junkbuster can use "regular expressions" in various config files.
- Assuming support for "pcre" (Perl Compatible Regular Expressions) is
- compiled in, which is the default. Such configuration directives do
- not require regular expressions, but they can be used to increase
- flexibility by matching a pattern with wild-cards against URLs.
-
- If you are reading this, you probably don't understand what "regular
- expressions" are, or what they can do. So this will be a very brief
- introduction only. A full explanation would require a book ;-)
-
- "Regular expressions" is a way of matching one character expression
- against another to see if it matches or not. One of the "expressions"
- is a literal string of readable characters (letter, numbers, etc), and
- the other is a complex string of literal characters combined with
- wild-cards, and other special characters, called meta-characters. The
- "meta-characters" have special meanings and are used to build the
- complex pattern to be matched against. Perl Compatible Regular
- Expressions is an enhanced form of the regular expression language
- with backward compatibility.
-
- To make a simple analogy, we do something similar when we use
- wild-card characters when listing files with the dir command in DOS.
- *.* matches all filenames. The "special" character here is the
- asterisk which matches any and all characters. We can be more specific
- and use ? to match just individual characters. So "dir file?.text"
- would match "file1.txt", "file2.txt", etc. We are pattern matching,
- using a similar technique to "regular expressions"!
-
- Regular expressions do essentially the same thing, but are much, much
- more powerful. There are many more "special characters" and ways of
- building complex patterns however. Let's look at a few of the common
- ones, and then some examples:
-
- . - Matches any single character, e.g. "a", "A", "4", ":", or "@".
-
- ? - The preceding character or expression is matched ZERO or ONE
- times. Either/or.
-
- + - The preceding character or expression is matched ONE or MORE
- times.
-
- * - The preceding character or expression is matched ZERO or MORE
- times.
-
- \ - The "escape" character denotes that the following character should
- be taken literally. This is used where one of the special characters
- (e.g. ".") needs to be taken literally and not as a special
- meta-character.
-
- [] - Characters enclosed in brackets will be matched if any of the
- enclosed characters are encountered.
-
- () - parentheses are used to group a sub-expression, or multiple
- sub-expressions.
-
- | - The "bar" character works like an "or" conditional statement. A
- match is successful if the sub-expression on either side of "|"
- matches.
-
- s/string1/string2/g - This is used to rewrite strings of text.
- "string1" is replaced by "string2" in this example.
-
- These are just some of the ones you are likely to use when matching
- URLs with Junkbuster, and is a long way from a definitive list. This
- is enough to get us started with a few simple examples which may be
- more illuminating:
-
- /.*/banners/.* - A simple example that uses the common combination of
- "." and "*" to denote any character, zero or more times. In other
- words, any string at all. So we start with a literal forward slash,
- then our regular expression pattern (".*") another literal forward
- slash, the string "banners", another forward slash, and lastly another
- ".*". We are building a directory path here. This will match any file
- with the path that has a directory named "banners" in it. The ".*"
- matches any characters, and this could conceivably be more forward
- slashes, so it might expand into a much longer looking path. For
- example, this could match:
- "/eye/hate/spammers/banners/annoy_me_please.gif", or just
- "/banners/annoying.html", or almost an infinite number of other
- possible combinations, just so it has "banners" in the path somewhere.
-
- A now something a little more complex:
-
- /.*/adv((er)?ts?|ertis(ing|ements?))?/ - We have several literal
- forward slashes again ("/"), so we are building another expression
- that is a file path statement. We have another ".*", so we are
- matching against any conceivable sub-path, just so it matches our
- expression. The only true literal that must match our pattern is adv,
- together with the forward slashes. What comes after the "adv" string
- is the interesting part.
-
- Remember the "?" means the preceding expression (either a literal
- character or anything grouped with "(...)" in this case) can exist or
- not, since this means either zero or one match. So
- "((er)?ts?|ertis(ing|ements?))" is optional, as are the individual
- sub-expressions: "(er)", "(ing|ements?)", and the "s". The "|" means
- "or". We have two of those. For instance, "(ing|ements?)", can expand
- to match either "ing" OR "ements?". What is being done here, is an
- attempt at matching as many variations of "advertisement", and
- similar, as possible. So this would expand to match just "adv", or
- "advert", or "adverts", or "advertising", or "advertisement", or
- "advertisements". You get the idea. But it would not match
- "advertizements" (with a "z"). We could fix that by changing our
- regular expression to: "/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/",
- which would then match either spelling.
-
- /.*/advert[0-9]+\.(gif|jpe?g) - Again another path statement with
- forward slashes. Anything in the square brackets "[]" can be matched.
- This is using "0-9" as a shorthand expression to mean any digit one
- through nine. It is the same as saying "0123456789". So any digit
- matches. The "+" means one or more of the preceding expression must be
- included. The preceding expression here is what is in the square
- brackets -- in this case, any digit one through nine. Then, at the
- end, we have a grouping: "(gif|jpe?g)". This includes a "|", so this
- needs to match the expression on either side of that bar character
- also. A simple "gif" on one side, and the other side will in turn
- match either "jpeg" or "jpg", since the "?" means the letter "e" is
- optional and can be matched once or not at all. So we are building an
- expression here to match image GIF or JPEG type image file. It must
- include the literal string "advert", then one or more digits, and a
- "." (which is now a literal, and not a special character, since it is
- escaped with "\"), and lastly either "gif", or "jpeg", or "jpg". Some
- possible matches would include: "//advert1.jpg",
- "/nasty/ads/advert1234.gif", "/banners/from/hell/advert99.jpg". It
- would not match "advert1.gif" (no leading slash), or "/adverts232.jpg"
- (the expression does not include an "s"), or "/advert1.jsp" ("jsp" is
- not in the expression anywhere).
-
- s/microsoft(?!.com)/MicroSuck/i - This is a substitution. "MicroSuck"
- will replace any occurrence of "microsoft". The "i" at the end of the
- expression means ignore case. The "(?!.com)" means the match should
- fail if "microsoft" is followed by ".com". In other words, this acts
- like a "NOT" modifier. In case this is a hyperlink, we don't want to
- break it ;-).
-
- We are barely scratching the surface of regular expressions here so
- that you can understand the default Junkbuster configuration files,
- and maybe use this knowledge to customize your own installation. There
- is much, much more that can be done with regular expressions. Now that
- you know enough to get started, you can learn more on your own :/
-
- More reading on Perl Compatible Regular expressions:
- [53]http://www.perldoc.com/perl5.6/pod/perlre.html
-
-References
-
- 1. http://ijbswa.sourceforge.net/user-manual/
- 2. mailto:ijbswa-developers@lists.sourceforge.net
- 3. file://localhost/home/swa/sf/current/doc/source/tmp.html#INTRODUCTION
- 4. file://localhost/home/swa/sf/current/doc/source/tmp.html#AEN27
- 5. file://localhost/home/swa/sf/current/doc/source/tmp.html#INSTALLATION
- 6. file://localhost/home/swa/sf/current/doc/source/tmp.html#INSTALLATION-SOURCE
- 7. file://localhost/home/swa/sf/current/doc/source/tmp.html#INSTALLATION-RH
- 8. file://localhost/home/swa/sf/current/doc/source/tmp.html#INSTALLATION-SUSE
- 9. file://localhost/home/swa/sf/current/doc/source/tmp.html#INSTALLATION-OS2
- 10. file://localhost/home/swa/sf/current/doc/source/tmp.html#INSTALLATION-WIN
- 11. file://localhost/home/swa/sf/current/doc/source/tmp.html#INSTALLATION-OTHER
- 12. file://localhost/home/swa/sf/current/doc/source/tmp.html#CONFIGURATION
- 13. file://localhost/home/swa/sf/current/doc/source/tmp.html#AEN172
- 14. file://localhost/home/swa/sf/current/doc/source/tmp.html#ACTIONSFILE
- 15. file://localhost/home/swa/sf/current/doc/source/tmp.html#FILTERFILE
- 16. file://localhost/home/swa/sf/current/doc/source/tmp.html#AEN1130
- 17. file://localhost/home/swa/sf/current/doc/source/tmp.html#QUICKSTART
- 18. file://localhost/home/swa/sf/current/doc/source/tmp.html#CONTACT
- 19. file://localhost/home/swa/sf/current/doc/source/tmp.html#COPYRIGHT
- 20. file://localhost/home/swa/sf/current/doc/source/tmp.html#AEN1195
- 21. file://localhost/home/swa/sf/current/doc/source/tmp.html#AEN1201
- 22. file://localhost/home/swa/sf/current/doc/source/tmp.html#SEEALSO
- 23. file://localhost/home/swa/sf/current/doc/source/tmp.html#APPENDIX
- 24. file://localhost/home/swa/sf/current/doc/source/tmp.html#REGEX
- 25. http://i.j.b/
- 26. http://sourceforge.net/projects/ijbswa/
- 27. http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/ijbswa/current/
- 28. http://www.gnu.org/
- 29. http://i.j.b/
- 30. file://localhost/home/swa/sf/current/doc/source/tmp.html#ACTIONSFILE
- 31. http://i.j.b/
- 32. http://i.j.b/
- 33. http://i.j.b/
- 34. http://i.j.b/show-url-info
- 35. http://i.j.b/
- 36. http://www.perldoc.com/perl5.6/pod/perlre.html
- 37. file://localhost/home/swa/sf/current/doc/source/tmp.html#REGEX
- 38. http://i.j.b/
- 39. http://sourceforge.net/tracker/?atid=361118&group_id=11118&func=browse
- 40. http://sourceforge.net/mail/?group_id=11118
- 41. http://sourceforge.net/tracker/?group_id=11118&atid=111118
- 42. http://www.gnu.org/copyleft/gpl.html
- 43. http://www.junkbusters.com/ht/en/ijbfaq.html
- 44. http://www.waldherr.org/junkbuster/
- 45. http://sourceforge.net/projects/ijbswa/
- 46. http://sourceforge.net/projects/ijbswa
- 47. http://ijbswa.sourceforge.net/
- 48. http://i.j.b/
- 49. http://www.junkbusters.com/ht/en/cookies.html
- 50. http://www.waldherr.org/junkbuster/
- 51. http://privacy.net/analyze/
- 52. http://www.squid-cache.org/
- 53. http://www.perldoc.com/perl5.6/pod/perlre.html
+Junkbuster can use "regular expressions" in various config files, assuming
+support for "pcre" (Perl Compatible Regular Expressions) is compiled in, which
+is the default. Such configuration directives do not require regular
+expressions, but they can be used to increase flexibility by matching a pattern
+with wild-cards against URLs.
+
+If you are reading this, you probably don't understand what "regular
+expressions" are, or what they can do. So this will be a very brief
+introduction only. A full explanation would require a book ;-)
+
+"Regular expressions" is a way of matching one character expression against
+another to see if it matches or not. One of the "expressions" is a literal
+string of readable characters (letters, numbers, etc.), and the other is a
+complex string of literal characters combined with wild-cards and other
+special characters, called meta-characters. The "meta-characters" have special
+meanings and are used to build the complex pattern to be matched against. Perl
+Compatible Regular Expressions is an enhanced form of the regular expression
+language with backward compatibility.
+
+To make a simple analogy, we do something similar when we use wild-card
+characters when listing files with the dir command in DOS. *.* matches all
+filenames. The "special" character here is the asterisk, which matches any and
+all characters. We can be more specific and use ? to match just individual
+characters. So "dir file?.txt" would match "file1.txt", "file2.txt", etc. We
+are pattern matching, using a similar technique to "regular expressions"!
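+
+The wild-card analogy can be tried out without a DOS prompt. The sketch below
+uses Python's fnmatch module, which implements Unix shell-style wild-cards;
+these behave like the DOS ones for this simple case, but the two are not
+identical in general. The file names are made up for illustration.
+
```python
from fnmatch import fnmatch

# Hypothetical directory listing, used only for illustration.
names = ["file1.txt", "file2.txt", "file10.txt", "notes.txt"]

# "?" stands for exactly one character, so "file10.txt" is left out.
single = [n for n in names if fnmatch(n, "file?.txt")]

# "*" stands for any run of characters, including none.
any_txt = [n for n in names if fnmatch(n, "*.txt")]

print(single)
print(any_txt)
```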
+
+Regular expressions do essentially the same thing, but are much, much more
+powerful. There are many more "special characters" and ways of building complex
+patterns, however. Let's look at a few of the common ones, and then some
+examples:
+
+. - Matches any single character, e.g. "a", "A", "4", ":", or "@".
+
+? - The preceding character or expression is matched ZERO or ONE times.
+Either/or.
+
++ - The preceding character or expression is matched ONE or MORE times.
+
+* - The preceding character or expression is matched ZERO or MORE times.
+
+\ - The "escape" character denotes that the following character should be taken
+literally. This is used where one of the special characters (e.g. ".") needs to
+be taken literally and not as a special meta-character.
+
+[] - Characters enclosed in brackets will be matched if any of the enclosed
+characters are encountered.
+
+() - Parentheses are used to group a sub-expression, or multiple
+sub-expressions.
+
+| - The "bar" character works like an "or" conditional statement. A match is
+successful if the sub-expression on either side of "|" matches.
+
+s/string1/string2/g - This is used to rewrite strings of text. "string1" is
+replaced by "string2" in this example.
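+
+The meta-characters listed above can be tried out one at a time. The sketch
+below uses Python's re module, whose syntax agrees with Perl-style regular
+expressions for these basic constructs; it is not Junkbuster's own matcher,
+and the sample patterns and strings are invented for illustration.
+
```python
import re

# Each entry: (pattern, text, whether the whole text is expected to match).
# re.fullmatch requires the pattern to cover the entire string.
checks = [
    (r"a.c", "a@c", True),         # "." matches any single character
    (r"a.c", "ac", False),         # ... but it must match exactly one
    (r"colou?r", "color", True),   # "?" makes the "u" optional
    (r"colou?r", "colour", True),
    (r"go+gle", "ggle", False),    # "+" needs at least one "o"
    (r"go+gle", "gooogle", True),
    (r"go*gle", "ggle", True),     # "*" also accepts zero "o"s
    (r"a\.c", "a@c", False),       # "\." is a literal dot only
    (r"a\.c", "a.c", True),
    (r"[abc]x", "bx", True),       # any one of the bracketed characters
    (r"(ab)+", "ababab", True),    # "+" applies to the whole group
    (r"cat|dog", "dog", True),     # either side of "|" may match
]

results = [bool(re.fullmatch(p, t)) for p, t, _ in checks]
for (pattern, text, _), got in zip(checks, results):
    print(f"{pattern!r:12} vs {text!r:10} -> {got}")

# s/string1/string2/g corresponds to re.sub, which replaces every match.
rewritten = re.sub(r"string1", "string2", "string1 and string1")
print(rewritten)
```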
+
+These are just some of the ones you are likely to use when matching URLs with
+Junkbuster, and are a long way from a definitive list. This is enough to get us
+started with a few simple examples which may be more illuminating:
+
+/.*/banners/.* - A simple example that uses the common combination of "." and
+"*" to denote any character, zero or more times. In other words, any string at
+all. So we start with a literal forward slash, then our regular expression
+pattern (".*"), another literal forward slash, the string "banners", another
+forward slash, and lastly another ".*". We are building a directory path here.
+This will match any file whose path has a directory named "banners" in it. The
+".*" matches any characters, and this could conceivably be more forward
+slashes, so it might expand into a much longer looking path. For example, this
+could match: "/eye/hate/spammers/banners/annoy_me_please.gif", or just
+"/ads/banners/annoying.html", or almost an infinite number of other possible
+combinations, just so it has "banners" in the path somewhere. Note that both
+literal slashes in front of "banners" have to be present, so a bare
+"/banners/annoying.html" would not match.
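+
+A quick way to check which paths this pattern accepts is to run it through a
+regular expression engine. The sketch below uses Python's re module and
+assumes the pattern is anchored at the start of the path (which is what
+re.match does); Junkbuster's own matching may differ in detail. Note that a
+path with only a single slash in front of "banners" is rejected, because the
+pattern contains two literal slashes before that word.
+
```python
import re

# Assumption: the pattern is applied from the start of the URL path.
banner_pattern = re.compile(r"/.*/banners/.*")

paths = [
    "/eye/hate/spammers/banners/annoy_me_please.gif",
    "//banners/annoying.html",
    "/banners/annoying.html",   # only one "/" before "banners"
    "/images/logo.gif",         # no "banners" directory at all
]

banner_results = {p: banner_pattern.match(p) is not None for p in paths}
for path, matched in banner_results.items():
    print(f"{path} -> {matched}")
```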
+
+And now something a little more complex:
+
+/.*/adv((er)?ts?|ertis(ing|ements?))?/ - We have several literal forward
+slashes again ("/"), so we are building another expression that is a file path
+statement. We have another ".*", so we are matching against any conceivable
+sub-path, just so it matches our expression. The only true literal that must
+match our pattern is adv, together with the forward slashes. What comes after
+the "adv" string is the interesting part.
+
+Remember the "?" means the preceding expression (either a literal character or
+anything grouped with "(...)" in this case) can exist or not, since this means
+either zero or one match. So "((er)?ts?|ertis(ing|ements?))" is optional, as
+are the sub-expression "(er)" and each "s" that is followed by "?". The "|"
+means "or". We have two of those. For instance, "(ing|ements?)" can expand to
+match either "ing" OR "ements?". What is being done here is an attempt at
+matching as many variations of "advertisement", and similar, as possible. So
+this would expand to match just "adv", or "advert", or "adverts", or
+"advertising", or "advertisement", or "advertisements". You get the idea. But
+it would not match "advertizements" (with a "z"). We could fix that by changing
+our regular expression to: "/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/", which
+would then match either spelling.
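+
+The two variants of this pattern can be compared side by side. The sketch
+below again uses Python's re module as a stand-in for Junkbuster's matcher;
+the "/x/" prefix and the word "advice" are made up for the test.
+
```python
import re

original = re.compile(r"/.*/adv((er)?ts?|ertis(ing|ements?))?/")
widened = re.compile(r"/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/")

words = ["adv", "advert", "adverts", "advertising", "advertisement",
         "advertisements", "advertizements", "advice"]

def hits(pattern):
    # Wrap each word in a hypothetical path "/x/<word>/" and keep the matches.
    return [w for w in words if pattern.match(f"/x/{w}/")]

original_hits = hits(original)
widened_hits = hits(widened)

print("original:", original_hits)
print("widened: ", widened_hits)
```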
+
+/.*/advert[0-9]+\.(gif|jpe?g) - Again another path statement with forward
+slashes. Anything in the square brackets "[]" can be matched. This is using
+"0-9" as a shorthand expression to mean any digit zero through nine. It is the
+same as saying "0123456789". So any digit matches. The "+" means one or more of
+the preceding expression must be included. The preceding expression here is
+what is in the square brackets -- in this case, any digit zero through nine.
+Then, at the end, we have a grouping: "(gif|jpe?g)". This includes a "|", so
+this needs to match the expression on either side of that bar character. A
+simple "gif" on one side, and the other side will in turn match either "jpeg"
+or "jpg", since the "?" means the letter "e" is optional and can be matched
+once or not at all. So we are building an expression here to match GIF or
+JPEG image files. It must include the literal string "advert", then one or
+more digits, and a "." (which is now a literal, and not a special character,
+since it is escaped with "\"), and lastly either "gif", or "jpeg", or "jpg".
+Some possible matches would include: "//advert1.jpg",
+"/nasty/ads/advert1234.gif", "/banners/from/hell/advert99.jpg". It would not
+match "advert1.gif" (no leading slash), or "/adverts232.jpg" (the expression
+does not include an "s"), or "/advert1.jsp" ("jsp" is not in the expression
+anywhere).
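+
+The matches and non-matches listed above can be verified mechanically. As
+before, this is only a sketch using Python's re module, anchored at the start
+of the path with re.match, not Junkbuster's own code.
+
```python
import re

advert_pattern = re.compile(r"/.*/advert[0-9]+\.(gif|jpe?g)")

should_match = [
    "//advert1.jpg",
    "/nasty/ads/advert1234.gif",
    "/banners/from/hell/advert99.jpg",
]
should_not_match = [
    "advert1.gif",       # no leading slash
    "/adverts232.jpg",   # the "s" is not part of the pattern
    "/advert1.jsp",      # "jsp" is not one of the allowed extensions
]

matched = [p for p in should_match + should_not_match
           if advert_pattern.match(p)]
print(matched)
```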
+
+s/microsoft(?!.com)/MicroSuck/i - This is a substitution. "MicroSuck" will
+replace any occurrence of "microsoft". The "i" at the end of the expression
+means ignore case. The "(?!.com)" means the match should fail if "microsoft" is
+followed by ".com". In other words, this acts like a "NOT" modifier. In case
+this is a hyperlink, we don't want to break it ;-). Strictly speaking, the
+unescaped "." in "(?!.com)" stands for any single character, so "(?!\.com)"
+is the more precise way to write it.
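+
+The effect of the "(?!...)" construct can be seen in a small experiment. The
+sketch below translates the substitution into Python's re.sub, with
+re.IGNORECASE standing in for the trailing "i". The function name "rewrite"
+and the sample strings are invented for illustration, and re.sub replaces
+every match it finds.
+
```python
import re

def rewrite(text):
    # "(?!.com)" is a negative lookahead: the match is refused when
    # "microsoft" is directly followed by any character and then "com".
    return re.sub(r"microsoft(?!.com)", "MicroSuck", text,
                  flags=re.IGNORECASE)

plain = rewrite("Microsoft Windows")
shouting = rewrite("MICROSOFT rules")
link = rewrite("http://www.microsoft.com/")

print(plain)
print(shouting)
print(link)
```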
+
+We are barely scratching the surface of regular expressions here so that you
+can understand the default Junkbuster configuration files, and maybe use this
+knowledge to customize your own installation. There is much, much more that can
+be done with regular expressions. Now that you know enough to get started, you
+can learn more on your own :/
+
+More reading on Perl Compatible Regular expressions:
+http://www.perldoc.com/perl5.6/pod/perlre.html
+