mode files and regex

Syntax highlighting in jEdit is driven by mode files — XML files which define highlighting rules for a particular syntax (e.g. Java, HTML, C, Perl). jEdit comes with 134 mode files installed. Adding new modes is as simple as creating a new mode file.

See Writing Edit Modes in the User's Guide for a how-to.

mode file limitations
- repeating regex in mode file
- matching blank lines
mode files in jEdit 4.3
Perl mode file for 4.3

mode file limitations

The mode file constructs as of jEdit 4.2 are superb for well-structured languages like Java, XML, C, etc. For more free-form languages like Perl and Xilize, there are limitations, some of which have workarounds.

Types of limitations:

mode file elements that could be less restrictive — in some cases without loss of performance
limitations of regular expressions themselves — these we must live with

repeating regex in mode file

The SPAN_REGEXP element is an example:

The SPAN_REGEXP rule is similar to the SPAN rule except the start sequence is taken to be a regular expression. In addition to the attributes supported by the SPAN tag, the HASH_CHAR attribute must be specified. It must be set to the first character that the regular expression matches. This rules out using regular expressions which can match more than one character at the start position. The regular expression match cannot span more than one line, either.

Thus, if your regex can begin with more than one distinct character, you must use several SPAN_REGEXP rules, one for each distinct start character. This is not so burdensome since you can write the regex once and copy/paste/modify. So there is a workaround, but it can make mode file maintenance difficult when the regex in question is lengthy.

Suggestion: add the ability to define "properties" (as in ant build scripts) to make it easier to use the same regex snippet repeatedly within a mode file.
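Until something like that exists, the repeated rules can be generated outside the mode file. The following is only a sketch of that idea: the class and method names are made up, and the rule layout is copied from the SEQ_REGEXP rules in the perl.xml excerpt further down this page (the same approach should carry over to SPAN_REGEXP, but that is an assumption). The regex tail is written once and one rule is emitted per start character.

```java
class SeqRegexpGenerator {

    /** Escapes characters that are special in XML text and attribute values. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /**
     * Emits one SEQ_REGEXP rule per start character. Every rule reuses the
     * same regex tail, so the tail is written (and maintained) only once.
     * Assumption: start characters are letters or digits, which need no
     * escaping inside a regex.
     */
    static String rules(String startChars, String regexTail, String type) {
        StringBuilder xml = new StringBuilder();
        for (char c : startChars.toCharArray()) {
            String start = escapeXml(String.valueOf(c));
            xml.append("<SEQ_REGEXP TYPE=\"").append(type).append('"')
               .append(" HASH_CHAR=\"").append(start).append('"')
               .append(" AT_WORD_START=\"TRUE\">")
               .append(start).append(escapeXml(regexTail))
               .append("</SEQ_REGEXP>\n");
        }
        return xml.toString();
    }

    public static void main(String[] args) {
        // Hypothetical example: the same brace-delimited tail for m{...} and q{...}.
        System.out.print(rules("mq", "\\{.*?\\}", "MARKUP"));
    }
}
```

The generated text still has to be pasted into the mode file by hand, so this eases maintenance of a long regex but does not replace a real "properties" feature.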

matching blank lines

It is not possible to match blank lines. This is unnecessary in well-formed languages but critical for Xilize and possibly others in the more-free-form class. Xilize has the notion of signed blocks of text, where a block is a group of lines preceded and followed by one or more blank lines (or start/end-of-file). Each block may start with a unique signature which determines what language constructs are recognized within that block. Current mode handling is sufficient for applying sets of rules to subportions of a file but cannot recognize a block of text as it is defined here.

Suggestion: add an element to match on a block. Or create a new rule attribute AT_START_OF_BLOCK that would match at the end of contiguous blank lines, that is, after this: \n\p{Space}*\n\p{Space}*
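To make the proposed separator concrete, here is a small java.util.regex sketch that finds where blocks begin in a piece of text. It is not jEdit code: the class name, method name and sample text are invented for illustration, and it only shows what "after a run of blank lines, or at start-of-file" means in terms of offsets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BlockStarts {

    // The separator from the suggestion above: a run of blank lines.
    private static final Pattern SEPARATOR =
            Pattern.compile("\\n\\p{Space}*\\n\\p{Space}*");

    /** Returns the offsets at which a block of text begins. */
    static List<Integer> blockStarts(String text) {
        List<Integer> starts = new ArrayList<>();

        // Start-of-file also opens a block: skip any leading whitespace.
        int first = 0;
        while (first < text.length() && Character.isWhitespace(text.charAt(first))) {
            first++;
        }
        if (first == text.length()) {
            return starts; // empty or whitespace-only input has no blocks
        }
        starts.add(first);

        // Every later block begins where a blank-line separator ends.
        Matcher m = SEPARATOR.matcher(text);
        m.region(first, text.length());
        while (m.find()) {
            if (m.end() < text.length()) {
                starts.add(m.end());
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String sample = "first block\n\nsecond block\nstill second\n \n\nthird block\n";
        System.out.println(blockStarts(sample)); // prints [0, 13, 42]
    }
}
```

Note that a line containing only spaces counts as blank here, and that the trailing \p{Space}* also swallows any indentation on the first line of the next block.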

mode files in jEdit 4.3

Because 4.3 requires Java 1.4 or better, which includes java.util.regex, the old regular expression package, gnu.regexp, is being removed. This affects mode files. gnu.regexp is more like Perl's regex engine than Java's. In particular:

gnu.regexp has these pre-defined character classes

[[:alnum:]] matches any alphanumeric character
[[:alpha:]] matches any alphabetical character
[[:blank:]] matches a space or horizontal tab
[[:cntrl:]] matches a control character
[[:digit:]] matches a decimal digit
[[:graph:]] matches a non-space, non-control character
[[:lower:]] matches a lowercase letter
[[:print:]] same as [[:graph:]], but also space and tab
[[:punct:]] matches a punctuation character
[[:space:]] matches any whitespace character, including newlines
[[:upper:]] matches an uppercase letter
[[:xdigit:]] matches a valid hexadecimal digit

which map onto Java's

POSIX character classes (US-ASCII only)
\p{Lower}     A lower-case alphabetic character: [a-z]
\p{Upper}     An upper-case alphabetic character: [A-Z]
\p{ASCII}     All ASCII: [\x00-\x7F]
\p{Alpha}     An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}     A decimal digit: [0-9]
\p{Alnum}     An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}     Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}     A visible character: [\p{Alnum}\p{Punct}]
\p{Print}     A printable character: [\p{Graph}\x20]
\p{Blank}     A space or a tab: [ \t]
\p{Cntrl}     A control character: [\x00-\x1F\x7F]
\p{XDigit}    A hexadecimal digit: [0-9a-fA-F]
\p{Space}     A whitespace character: [ \t\n\x0B\f\r]
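The mapping between the two lists is mechanical, so converting an existing mode file regex could be scripted. The sketch below is one way to do it, assuming the class names appear in the [:name:] form shown above; the class and method names are hypothetical and this is not part of jEdit.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PosixClassTranslator {

    // Lower-case POSIX name -> java.util.regex name, e.g. "xdigit" -> "XDigit".
    private static final Map<String, String> JAVA_NAMES = new HashMap<>();
    static {
        String[] names = {"Alnum", "Alpha", "Blank", "Cntrl", "Digit", "Graph",
                          "Lower", "Print", "Punct", "Space", "Upper", "XDigit"};
        for (String name : names) {
            JAVA_NAMES.put(name.toLowerCase(Locale.ROOT), name);
        }
    }

    private static final Pattern POSIX_CLASS = Pattern.compile("\\[:([a-z]+):\\]");

    /** Rewrites every known [:name:] as \p{Name}; unknown names are left alone. */
    static String translate(String gnuRegex) {
        Matcher m = POSIX_CLASS.matcher(gnuRegex);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String javaName = JAVA_NAMES.get(m.group(1));
            String replacement = (javaName == null)
                    ? m.group()
                    : "\\p{" + javaName + "}";
            // quoteReplacement keeps the backslash literal in the output.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("[[:alpha:]_][[:alnum:]_]*"));
    }
}
```

The surrounding brackets are kept, so [[:alpha:]] becomes [\p{Alpha}], which java.util.regex accepts. Going by the two lists above, the classes are not identical in every case: [[:print:]] is described as including tab, while \p{Print} adds only \x20 to \p{Graph}, so translated patterns still deserve a manual check.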

More importantly, gnu.regexp is more forgiving than java.util.regex when it comes to regexes that cause infinite loops.

Perl mode file for 4.3

Syntax highlighting Perl code that uses regex is a challenge — and cannot be done to perfection with the current jEdit tools. At issue is just what use-cases to cover.

Perl regex syntax

The Perl syntax elements covered in the jEdit 4.2 perl.xml file: q//, qq//, qr//, qx//, tr///, y///, m//, s///.

The complete set of "Regexp Quote-like Operators" from the Perl manual:

m/PATTERN/cgimosx
/PATTERN/cgimosx
q/STRING/
'STRING'
qq/STRING/
"STRING"
qr/STRING/imosx
qx/STRING/
`STRING`
qw/STRING/
s/PATTERN/REPLACEMENT/egimosx
tr/SEARCHLIST/REPLACEMENTLIST/cds
y/SEARCHLIST/REPLACEMENTLIST/cds

Note the different option strings cgimosx, imosx, egimosx, and cds. The current perl.xml mode is in error with respect to the permissible options.
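The option strings can be written down as a small lookup table, which makes it easy to check a candidate rule against the list above. This is a sketch only: the class and method names are invented, the table contains just the letters quoted from the manual on this page, and it does not reject repeated letters.

```java
import java.util.HashMap;
import java.util.Map;

class PerlQuoteOptions {

    // Operator -> permitted option letters, as listed above.
    // Operators not in the table (q, qq, qx, qw) take no options.
    private static final Map<String, String> ALLOWED = new HashMap<>();
    static {
        ALLOWED.put("m", "cgimosx");   // also applies to a bare /PATTERN/
        ALLOWED.put("qr", "imosx");
        ALLOWED.put("s", "egimosx");
        ALLOWED.put("tr", "cds");
        ALLOWED.put("y", "cds");
    }

    /** Returns true if every letter in options is permitted for the operator. */
    static boolean isValid(String operator, String options) {
        String allowed = ALLOWED.getOrDefault(operator, "");
        for (char c : options.toCharArray()) {
            if (allowed.indexOf(c) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("m", "gix")); // true
        System.out.println(isValid("m", "e"));   // false: e belongs to s///
    }
}
```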

Perl's 'm' function

Delimiters used with 'm' (from the Perl manual):

If / is the delimiter then the initial m is optional. With the m you can use any pair of non-alphanumeric, non-whitespace characters as delimiters.

In Java regex that is: [^\p{Alnum}\p{Space}]
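A quick way to see which characters that class accepts is to try it in plain Java. The helper below is a hypothetical sketch (the names are mine), and it tests only the character class itself, not whether Perl really accepts each character as a delimiter.

```java
import java.util.regex.Pattern;

class MatchDelimiter {

    // Non-alphanumeric, non-whitespace, as in the character class above.
    private static final Pattern DELIMITER =
            Pattern.compile("[^\\p{Alnum}\\p{Space}]");

    /** Returns true if the character is accepted by the class. */
    static boolean isDelimiter(char c) {
        return DELIMITER.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        for (char c : new char[] {'/', '#', '{', 'a', '7', ' '}) {
            System.out.println("'" + c + "' -> " + isDelimiter(c));
        }
    }
}
```

Since \p{Alnum} and \p{Space} are US-ASCII only in java.util.regex, every non-ASCII character also passes this test, which may or may not be what a mode file wants.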

I cannot find any Perl documentation regarding m( ... ), m{ ... }, and m[ ... ] forms. Not saying it's not the case, just can't find it in the manuals.

perl.xml in jEdit 4.2 (current release)

In order to see the difference between 4.2 Perl mode and the new work for 4.3, I created this small test file. Download it to see how jEdit colorizes the Perl code.

download test.pl

# Using jEdit 4.2 Perl mode (gnu.regexp)
#
# a collection of syntax coloring test cases

#########################################################
# correct, positive tests
# valid syntax that is correctly recognized in 4.2

x =~ m/E/
x =~ m{E}
x =~ m#E#
x =~ m/\//
x =~ m#a\#b#
x =~ m#\#b#
x =~ m{E}
x =~ m#\##

#########################################################
# correct, negative tests
# invalid syntax that is recognized as such and colored (or not) correctly in 4.2

x =~ m/\/
x =~ m/a\/
x =~ m#a\#

# (I think the last one is invalid syntax, not certain though)

#########################################################
# negative tests that pass and should not
# invalid syntax that is wrongly recognized and colorized in 4.2
# (I think this is invalid syntax)

X =~ m//////

#########################################################
# positive tests that fail and should not
# valid cases that fail, creating syntax coloring errors in 4.2

# note the following is taken from the perl manual
# (http://www.perl.com/doc/manual/html/pod/perlre.html)
# jEdit currently fails to highlight this syntax correctly

$_ = 'a' x 8;
  m<
     (?{ $cnt = 0 })
     (
       a
       (?{
           local $cnt = $cnt + 1;
       })
     )*
     aaaa
     (?{ $res = $cnt })
   >x;

# same as above

x =~ m{ \(
          (
            [^()]+
          |
            \( [^()]* \)
          )+
       \)
     }x

updating perl.xml for jEdit 4.3

Using m// as an example, perl.xml (revision 1.28) does this:

    <SEQ_REGEXP TYPE="MARKUP"
            HASH_CHAR="m"
            AT_WORD_START="TRUE"
    >m\{(?:.*?[^\\])*?\}[sgiexom]*</SEQ_REGEXP>

    <SEQ_REGEXP TYPE="MARKUP"
            HASH_CHAR="m"
            AT_WORD_START="TRUE"
    >m(\p{Punct})(?:.*?[^\\])\1[sgiexom]*</SEQ_REGEXP>

According to the Perl manual, the valid regex options for 'm' are cgimosx, not sgiexom: change e to c.

I did not find a reference to m{...} in the manual.

Issues:

`m/////`: wrongly (I think) recognized as valid in both 4.2 and 4.3 perl.xml rev 1.23
any multiline regex: cannot be handled with a single SEQ_REGEXP element
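Both issues can be reproduced outside jEdit by running the second rule's regex directly under java.util.regex. The sketch below uses the pattern from the excerpt above with the option letters changed to cgimosx; the class and method names are hypothetical, and lookingAt() only approximates how jEdit applies a SEQ_REGEXP rule at a given position.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchRuleCheck {

    // Regex of the second SEQ_REGEXP rule, with options corrected to cgimosx.
    private static final Pattern M_RULE =
            Pattern.compile("m(\\p{Punct})(?:.*?[^\\\\])\\1[cgimosx]*");

    /** Returns how many leading characters the rule consumes, or -1 if none. */
    static int matchedLength(String code) {
        Matcher m = M_RULE.matcher(code);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        System.out.println(matchedLength("m/E/"));      // 4
        System.out.println(matchedLength("m#a\\#b#"));  // 7, escaped delimiter is skipped
        System.out.println(matchedLength("m/\\/"));     // -1, unterminated
        System.out.println(matchedLength("m//////"));   // 4, the leading m/// is accepted
        System.out.println(matchedLength("m#a\nb#"));   // -1, '.' does not cross the newline
    }
}
```

The fourth case shows why m////// gets colored: the regex treats the second slash as the pattern body and the third as the closing delimiter. The last case shows the multiline limitation at the regex level, before jEdit's line-by-line tokenizing even comes into play.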