Friday, June 08, 2007

GNU Source-highlight 2.7

I've just released version 2.7 of GNU Source-highlight.

Apart from some bug fixes (and the addition of reference generation for the DocBook output format), the main new features concern the language definition file syntax, which has been improved to handle regular expression back references and conditionals. This makes writing regular expressions for language definition files more comfortable.

Previously, regular expressions containing parentheses in a language definition file were automatically transformed so that the marking parentheses became non-marking parentheses (i.e., no subexpressions were created). This, however, made it impossible to specify back references or conditionals in a regular expression. While this was not a big limitation for standard uses, there are cases where these mechanisms help in writing small and compact regular expressions.

For this reason, an additional syntax for specifying a regular expression was introduced, using backticks ` `, to overcome the limitations of the other two syntaxes (i.e., ' ' and " "). With this syntax, marked subexpressions are not transformed, so you can use regular expression mechanisms that rely on marked subexpressions, such as back references and conditionals.

This syntax is also crucial for highlighting specific program parts of some programming languages, such as Perl regular expressions (e.g., in substitution expressions), which can be written in many forms. In particular, the separators for the part to be replaced and the part to replace it with can be any non-alphanumeric character, for instance:


s/foo/bar/g
s|foo|bar|g
s#foo#bar#g
s@foo@bar@g

Using this syntax and back references, we can easily define a single language element to deal with these expressions (without specifying all the cases for each possible non-alphanumeric character):

regexp = `s([^[:alnum:][:blank:]]).*\1.*\1[ixsmogce]*`
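To see how the back reference \1 ties the three separators together, here is a minimal sketch in Python (not Source-highlight itself). It assumes an ASCII approximation of the POSIX classes, since Python's re module does not support [[:alnum:]] or [[:blank:]]; the names PERL_SUBST and is_substitution are made up for this illustration.

```python
import re

# ASCII approximation of the definition above:
# [^[:alnum:][:blank:]] is written here as [^A-Za-z0-9 \t] (an assumption,
# because Python's re module has no POSIX character classes).
PERL_SUBST = re.compile(r"s([^A-Za-z0-9 \t]).*\1.*\1[ixsmogce]*")


def is_substitution(text):
    """Return True if the whole text looks like a Perl substitution."""
    return PERL_SUBST.fullmatch(text) is not None


if __name__ == "__main__":
    for expr in ("s/foo/bar/g", "s|foo|bar|g", "s#foo#bar#g", "s@foo@bar@g"):
        # Group 1 captures the separator; \1 requires the same character
        # to appear twice more.
        print(expr, is_substitution(expr))
```

Whatever character is captured as the first separator must be repeated for the other two, so a single pattern covers all four forms, while something like s/foo|bar/g (mixed separators) is rejected.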

Another new feature in the language definition file syntax is the possibility of assigning a different language element to each subexpression. Often, you need to specify two program elements in the same regular expression because they are tightly related, but you also need to highlight them differently.

For instance, you might want to highlight the name of a class (or interface) in a class (or interface) definition (e.g., in Java). To do so, you can rely on the preceding class (or interface) keyword, which is followed by an identifier.

A definition such as

keyword = '(\<(?:class|interface))([[:blank:]]+)([$[:alnum:]]+)'

will not produce a good final result, since the name of the class will be highlighted as a keyword, which is probably not what you want: the class name should be highlighted as a type. Up to version 2.6, the only way to do this was to use states or environments (State/Environment Definitions), but these tended to be quite difficult to write. Since version 2.7, you can specify a regular expression with marked subexpressions and bind each of them to a specific language element (the regular expression must be enclosed in backticks, see Ways of specifying regular expressions):

(elem1,...,elemn) = `(subexp1)(...)(subexpn)`

Now, with this syntax, we can accomplish our previous goal:

(keyword,normal,type) =
`(\<(?:class|interface))([[:blank:]]+)([$[:alnum:]]+)`
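The effect of binding one element to each marked subexpression can be mimicked in a few lines of Python. This is only a sketch of the idea, not how Source-highlight implements it: bind_elements and JAVA_CLASS are hypothetical names, \< is approximated by \b, and the POSIX classes are replaced by ASCII ranges.

```python
import re

# Approximation of the definition above for Python's re module:
# \< becomes \b, [[:blank:]] becomes [ \t], [$[:alnum:]] becomes [$A-Za-z0-9].
JAVA_CLASS = r"(\b(?:class|interface))([ \t]+)([$A-Za-z0-9]+)"


def bind_elements(elements, pattern, text):
    """Pair each marked subexpression of the first match with its element."""
    regex = re.compile(pattern)
    if regex.groups != len(elements):
        raise ValueError("need exactly one element per marked subexpression")
    match = regex.search(text)
    if match is None:
        return []
    return list(zip(elements, match.groups()))


if __name__ == "__main__":
    print(bind_elements(("keyword", "normal", "type"), JAVA_CLASS,
                        "public class Foo {"))
```

For the input "public class Foo {", the three subexpressions are paired as keyword "class", normal " " and type "Foo".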

This way, the class (or interface) keyword will be highlighted as a keyword, the separating blank characters will be formatted as normal, and the name of the class as a type. This mechanism permits expressing regular expressions for some situations in a much more compact and probably more readable way. For instance, consider highlighting ChangeLog parts (the optional * as a symbol, the optional file name and the element specified in parentheses as a file element, and the rest as normal) such as

* src/Makefile.am (source_highlight_SOURCES): correctly include
changelog_scanner.ll

* this is a comment without a file name

up to version 2.6, we used these two language definitions:

state symbol start '^(?:[[:blank:]]+)\*[[:blank:]]+' begin
  state file start '[^:]+\:' begin
    normal start '.'
  end
end

state normal start '^(?:[[:blank:]]+)' begin
  state file start '[^:]+\:' begin
    normal start '.'
  end
end

which can be hard to read, even for the person who wrote them. Now, we can write them more easily (see changelog.lang):

(normal,symbol,normal,file)=
`(^[[:blank:]]+)(\*)([[:blank:]]+)((?:[^:]+\:)?)`
(normal,file)= `(^[[:blank:]]+)((?:[^:]+\:)?)`
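To check what these two definitions do on the ChangeLog lines shown earlier, here is a rough Python sketch. It assumes that the definitions are tried in order, that empty subexpressions produce no output, and that unmatched text is formatted as normal; RULES and classify are hypothetical names, [[:blank:]] is approximated by [ \t], and the sample lines are assumed to start with a tab, as ChangeLog entries usually do.

```python
import re

# The two definitions above, approximated for Python's re module.
RULES = [
    (("normal", "symbol", "normal", "file"),
     re.compile(r"(^[ \t]+)(\*)([ \t]+)((?:[^:]+:)?)")),
    (("normal", "file"),
     re.compile(r"(^[ \t]+)((?:[^:]+:)?)")),
]


def classify(line):
    """Split a ChangeLog line into (element, text) pairs."""
    for elements, pattern in RULES:
        match = pattern.match(line)
        if match is None:
            continue
        # Skip subexpressions that matched the empty string
        # (e.g. the optional file part when there is no colon).
        spans = [(e, t) for e, t in zip(elements, match.groups()) if t]
        rest = line[match.end():]
        if rest:
            spans.append(("normal", rest))
        return spans
    return [("normal", line)]


if __name__ == "__main__":
    for line in ("\t* src/Makefile.am (source_highlight_SOURCES): correctly include",
                 "\tchangelog_scanner.ll",
                 "\t* this is a comment without a file name"):
        print(classify(line))
```

In this sketch the first line yields the * as a symbol and everything up to the colon as a file element, while the continuation line and the comment line fall back to normal because the optional file subexpression matches nothing.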

A big thank-you goes to Elias Pipping for reporting problems in highlighting some Perl parts (concerning regular expressions). This pushed me to extend the language definition file syntax in order to handle these cases more easily. :-)

1 comment:

Anonymous said...

Cool! seems to be lines of complicated codes... :D