5 Modifier Definitions
The purpose of these definitions is to generate, for each Perl modifier, a Racket symbol that is consistent with the corresponding Perl symbol. These symbols can then be manipulated at the higher levels of the program.
The perl-modifiers require the structures lexp and perl from Lexp-structures.rkt.
(require "Lexp-structures.rkt")
(provide (all-defined-out))
(provide
(all-from-out "Lexp-structures.rkt"))
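The definition of the perl structure is not shown in this section, so the following is only a minimal sketch of how such a structure and a symbol-generating helper might look. The field names (name, flag, doc) and the helper perl->symbol are assumptions made for illustration; the real definitions in Lexp-structures.rkt may differ.

```racket
#lang racket/base
;; Hypothetical stand-in for the `perl` structure that the real code takes
;; from Lexp-structures.rkt. The field names are assumptions.
(struct perl (name flag doc) #:transparent)

;; One modifier, written in the same three-argument shape used in this section.
(define multi-line
  (perl "treatStringAsMultipleLines" "m" "Treat string as multiple lines."))

;; perl->symbol : perl -> symbol
;; Hypothetical helper: derive a Racket symbol from the modifier's name.
(define (perl->symbol p)
  (string->symbol (perl-name p)))

(perl->symbol multi-line) ; 'treatStringAsMultipleLines
(perl-flag multi-line)    ; "m"
```

Under these assumptions, each modifier carries a descriptive name, the one-letter Perl flag, and its documentation string, so higher levels can work with the symbol and only fall back to the flag when emitting Perl syntax.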
5.1 Basic Modifiers
The Perl definitions are taken from perldoc (Perl regular expressions) as of 2013/12/15.
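For readers more familiar with Racket than Perl, some of the basic modifiers defined in this subsection (i, m, s, and g) have rough analogues in Racket's built-in regexps. The sketch below uses only racket/base and is not part of the modifier definitions; the variable names are illustrative. Note that the analogy is loose: Racket's "multi mode" changes both the anchors and '.', and outside multi mode '.' already matches a newline, so it does not map one-to-one onto Perl's separate /m and /s.

```racket
#lang racket/base
(define text "one\ntwo")

;; Default mode: ^ anchors only at the start of the whole string.
(define caret-default (regexp-match #rx"^two" text))
;; Multi mode, written (?m:...): ^ may also match just after a newline.
(define caret-multi (regexp-match #rx"(?m:^two)" text))

;; Default mode: . matches any character, including a newline.
(define dot-default (regexp-match #rx"one.two" text))
;; Multi mode: . no longer matches a newline.
(define dot-multi (regexp-match #rx"(?m:one.two)" text))

;; Case-insensitive matching, written (?i:...), similar in spirit to /i.
(define sensitive (regexp-match? #rx"perl" "PERL"))
(define insensitive (regexp-match? #rx"(?i:perl)" "PERL"))

;; Global matching changes how the regexp is used, not the regexp itself:
;; the same pattern is passed to a different function.
(define digits #rx"[0-9]+")
(define first-only (regexp-match digits "a1b22c333"))
(define all-matches (regexp-match* digits "a1b22c333"))
```

Here caret-default and dot-multi are #f, caret-multi is '("two"), dot-default is '("one\ntwo"), first-only is '("1"), and all-matches is '("1" "22" "333").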
(perl "treatStringAsMultipleLines"
"m"
"Treat string as multiple lines. That is, change '^' and '$' from matching the start or end of line only at the left and right ends of the string to matching them anywhere within the string.\n\n\nUsed together, as '/ms', they let the '.' match any character whatsoever, while still allowing '^' and '$' to match, respectively, just after and just before newlines within the string.")
(perl "treatStringAsSingleLine"
"s"
"Treat string as single line. That is, change '.' to match any character whatsoever, even a newline, which normally it would not match.\n\n\nUsed together, as /ms, they let the '.' match any character whatsoever, while still allowing '^' and '$' to match, respectively, just after and just before newlines within the string.")
(perl "caseInsensitive"
"i"
"Do case-insensitive pattern matching.\n\nIf locale matching rules are in effect, the case map is taken from the current locale for code points less than 255, and from Unicode rules for larger code points. However, matches that would cross the Unicode rules/non-Unicode rules boundary (ords 255/256) will not succeed. See perllocale.\n\nThere are a number of Unicode characters that match multiple characters under /i. For example, LATIN SMALL LIGATURE FI should match the sequence fi. Perl is not currently able to do this when the multiple characters are in the pattern and are split between groupings, or when one or more are quantified. Thus\n\n\n {LATIN SMALL LIGATURE FI} =~ /fi/i; # Matches\n {LATIN SMALL LIGATURE FI} =~ /[fi][fi]/i; # Doesn't match!\n {LATIN SMALL LIGATURE FI} =~ /fi*/i; # Doesn't match!\n # The below doesn't match, and it isn't clear what $1 and $2 would\n # be even if it did!!\n {LATIN SMALL LIGATURE FI} =~ /(f)(i)/i; # Doesn't match!\n\n\nPerl doesn't match multiple characters in a bracketed character class unless the character that maps to them is explicitly mentioned, and it doesn't match them at all if the character class is inverted, which otherwise could be highly confusing. See Bracketed Character Classes in perlrecharclass, and Negation in perlrecharclass.")
(perl "AllowWhiteSpaceAndComments"
"x"
"'x' tells the regular expression parser to ignore most whitespace that is neither backslashed nor within a character class. You can use this to break up your regular expression into (slightly) more readable parts. The # character is also treated as a metacharacter introducing a comment, just as in ordinary Perl code. This also means that if you want real whitespace or # characters in the pattern \\(outside a character class, where they are unaffected by x\\), then you'll either have to escape them \\(using backslashes or \\Q...\\E\\) or encode them using octal, hex, or \\N{} escapes. Taken together, these features go a long way towards making Perl's regular expressions more readable.\n\nNote that you have to be careful not to include the pattern delimiter in the comment--perl has no way of knowing you did not intend to close the pattern early. See the C-comment deletion code in perlop. Also note that anything inside a \\Q...\\E stays unaffected by /x. And note that /x doesn't affect space interpretation within a single multi-character construct.\n\nFor example in \\x\\{...\\}, regardless of the /x modifier, there can be no spaces. Same for a quantifier such as {3} or {5,}. Similarly, \\(?:...\\) can't have a space between the \\(, ?, and :. Within any delimiters for such a construct, allowed spaces are not affected by /x, and depend on the construct. For example, \\x{...} can't have spaces because hexadecimal numbers don't have spaces in them. But, Unicode properties can have spaces, so in \\p{...} there can be spaces that follow the Unicode rules, for which see Properties accessible through \\p{} and \\P{} in perluniprops.")
(perl "preserveStringMatched"
"p"
"Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} are available for use after matching.")
(perl "useGlobalMatching"
"g"
"Global matching affects the way a regex is used rather than the regex itself.")
(perl "keepCurrentPosition"
"c"
"Current position affects the way a regex is used rather than the regex itself.")
5.2 Character set modifiers
/d, /u, /a, and /l, available starting in 5.14, are called the character set modifiers; they affect the character set semantics used for the regular expression.
The /d, /u, and /l modifiers are not likely to be of much use to you, and so you need not worry about them very much. They exist for Perl’s internal use, so that complex regular expression data structures can be automatically serialized and later exactly reconstituted, including all their nuances. But, since Perl can’t keep a secret, and there may be rare instances where they are useful, they are documented here.
(perl "asciiSafeMatching"
"a"
"ASCII-safe matching allows code that works mostly on ASCII data to not have to concern itself with Unicode.")
(perl "defaultCharacterSetMatching"
"d"
"/d is the Perl Programming Language's old, problematic, pre-5.14 Default character set behavior. Its only use is to force that old behavior.")
(perl "uniCodeCharcacterSetMatching"
"u"
"Sets the character set to Unicode.")
(perl "localeCharacterSetMatching"
"l"
"Briefly, /l sets the character set to that of whatever Locale is in effect at the time of the execution of the pattern match.")