Skip to content
Permalink
8bed31719c
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Go to file
 
 
Cannot retrieve contributors at this time
616 lines (449 sloc) 14.2 KB
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"../../../docbook-xml-4.5/docbookx.dtd">
<chapter id="chapter.regexp">
<title>Regular expressions<indexterm class="singular">
<primary>Regular expressions</primary>
<seealso>Segmentation</seealso>
<seealso>Searching</seealso>
</indexterm></title>
<para>The regular expressions (or regex for short) used in searches and
segmentation rules are those supported by Java. Should you need more
specific information, consult the <ulink
url="http://download.oracle.com/javase/1.6.0/docs/api/java/util/regex/Pattern.html">Java
Regex documentation</ulink>. See additional references and examples
below.</para>
<note>
<para>This chapter is intended for advanced users, who need to define
their own variants of segmentation rules or devise more complex and
powerful key search items.</para>
</note>
<table>
<title>Regex - Flags</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry align="left">The construct</entry>
<entry align="left">... matches the following</entry>
</row>
</thead>
<tbody>
<row>
<entry>(?i)</entry>
<entry>Enables case-insensitive matching (by default, the pattern is
case-sensitive).</entry>
</row>
</tbody>
</tgroup>
</table>
<table>
<title>Regex - Character</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry align="left">The construct</entry>
<entry align="left">... matches the following</entry>
</row>
</thead>
<tbody>
<row>
<entry>x</entry>
<entry>The character x, except the following...</entry>
</row>
<row>
<entry>\uhhhh</entry>
<entry>The character with hexadecimal value 0xhhhh</entry>
</row>
<row>
<entry>\t</entry>
<entry>The tab character ('\u0009')</entry>
</row>
<row>
<entry>\n</entry>
<entry>The newline (line feed) character ('\u000A')</entry>
</row>
<row>
<entry>\r</entry>
<entry>The carriage-return character ('\u000D')</entry>
</row>
<row>
<entry>\f</entry>
<entry>The form-feed character ('\u000C')</entry>
</row>
<row>
<entry>\a</entry>
<entry>The alert (bell) character ('\u0007')</entry>
</row>
<row>
<entry>\e</entry>
<entry>The escape character ('\u001B')</entry>
</row>
<row>
<entry>\cx</entry>
<entry>The control character corresponding to x</entry>
</row>
<row>
<entry>\0n</entry>
<entry>The character with octal value 0n (0 &lt;= n &lt;= 7)</entry>
</row>
<row>
<entry>\0nn</entry>
<entry>The character with octal value 0nn (0 &lt;= n &lt;=
7)</entry>
</row>
<row>
<entry>\0mnn</entry>
<entry>The character with octal value 0mnn (0 &lt;= m &lt;= 3, 0
&lt;= n &lt;= 7)</entry>
</row>
<row>
<entry>\xhh</entry>
<entry>The character with hexadecimal value 0xhh</entry>
</row>
</tbody>
</tgroup>
</table>
<table>
<title>Regex - Quotation</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry align="left">The construct</entry>
<entry align="left">...matches the following</entry>
</row>
</thead>
<tbody>
<row>
<entry>\</entry>
<entry>Nothing, but quotes the following character. This is required
if you would like to enter any of the meta characters
!$()*+.&lt;&gt;?[\]^{|} to match as themselves.</entry>
</row>
<row>
<entry>\\</entry>
<entry>For example, this is the backslash character</entry>
</row>
<row>
<entry>\Q</entry>
<entry>Nothing, but quotes all characters until \E</entry>
</row>
<row>
<entry>\E</entry>
<entry>Nothing, but ends quoting started by \Q</entry>
</row>
</tbody>
</tgroup>
</table>
<table>
<title>Regex - Classes for Unicode blocks and categories</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry align="left">The construct</entry>
<entry align="left">...matches the following</entry>
</row>
</thead>
<tbody>
<row>
<entry>\p{InGreek}</entry>
<entry>A character in the Greek block (simple <ulink
url="http://download.oracle.com/javase/1.6.0/docs/api/java/util/regex/Pattern.html#ubc">
block</ulink>)</entry>
</row>
<row>
<entry>\p{Lu}</entry>
<entry>An uppercase letter (simple <ulink
url="http://download.oracle.com/javase/1.6.0/docs/api/java/util/regex/Pattern.html#ubc">
category</ulink>)</entry>
</row>
<row>
<entry>\p{Sc}</entry>
<entry>A currency symbol</entry>
</row>
<row>
<entry>\P{InGreek}</entry>
<entry>Any character except one in the Greek block
(negation)</entry>
</row>
<row>
<entry>[\p{L}&amp;&amp;[^\p{Lu}]]</entry>
<entry>Any letter except an uppercase letter (subtraction)</entry>
</row>
</tbody>
</tgroup>
</table>
<table>
<title>Regex - Character classes</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry align="left">The construct</entry>
<entry align="left">...matches the following</entry>
</row>
</thead>
<tbody>
<row>
<entry>[abc]</entry>
<entry>a, b, or c (simple class)</entry>
</row>
<row>
<entry>[^abc]</entry>
<entry>Any character except a, b, or c (negation)</entry>
</row>
<row>
<entry>[a-zA-Z]</entry>
<entry>a through z or A through Z, inclusive (range)</entry>
</row>
</tbody>
</tgroup>
</table>
<table>
<title>Regex - Predefined character classes</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry align="left">The construct</entry>
<entry align="left">...matches the following</entry>
</row>
</thead>
<tbody>
<row>
<entry>.</entry>
<entry>Any character (except for line terminators)</entry>
</row>
<row>
<entry>\d</entry>
<entry>A digit: [0-9]</entry>
</row>
<row>
<entry>\D</entry>
<entry>A non-digit: [^0-9]</entry>
</row>
<row>
<entry>\s</entry>
<entry>A whitespace character: [ \t\n\x0B\f\r]</entry>
</row>
<row>
<entry>\S</entry>
<entry>A non-whitespace character: [^\s]</entry>
</row>
<row>
<entry>\w</entry>
<entry>A word character: [a-zA-Z_0-9]</entry>
</row>
<row>
<entry>\W</entry>
<entry>A non-word character: [^\w]</entry>
</row>
</tbody>
</tgroup>
</table>
<table>
<title>Regex - Boundary matchers</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry align="left">The construct</entry>
<entry align="left">...matches the following</entry>
</row>
</thead>
<tbody>
<row>
<entry>^</entry>
<entry>The beginning of a line</entry>
</row>
<row>
<entry>$</entry>
<entry>The end of a line</entry>
</row>
<row>
<entry>\b</entry>
<entry>A word boundary</entry>
</row>
<row>
<entry>\B</entry>
<entry>A non-word boundary</entry>
</row>
</tbody>
</tgroup>
</table>
<table>
<title>Regex - Greedy quantifiers</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry align="left">The construct</entry>
<entry align="left">...matches the following</entry>
</row>
</thead>
<tbody>
<row>
<entry>X<emphasis>?</emphasis></entry>
<entry>X, once or not at all</entry>
</row>
<row>
<entry>X<emphasis>*</emphasis></entry>
<entry>X, zero or more times</entry>
</row>
<row>
<entry>X<emphasis>+</emphasis></entry>
<entry>X, one or more times</entry>
</row>
</tbody>
</tgroup>
</table>
<note>
<para>greedy quantifiers will match as much as they can. For example,
<emphasis>a+?</emphasis> will match the aaa in
<emphasis>aaabbb</emphasis></para>
</note>
<table>
<title>Regex - Reluctant (non-greedy) quantifiers</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry align="left">The construct</entry>
<entry align="left">...matches the following</entry>
</row>
</thead>
<tbody>
<row>
<entry>X??</entry>
<entry>X, once or not at all</entry>
</row>
<row>
<entry>X*?</entry>
<entry>X, zero or more times</entry>
</row>
<row>
<entry>X+?</entry>
<entry>X, one or more times</entry>
</row>
</tbody>
</tgroup>
</table>
<note>
<para>non-greedy quantifiers will match as little as they can. For
example, <emphasis>a+?</emphasis> will match the first
<emphasis>a</emphasis> in <emphasis>aaabbb</emphasis></para>
</note>
<table>
<title>Regex - Logical operators</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry align="left">The construct</entry>
<entry align="left">...matches the following</entry>
</row>
</thead>
<tbody>
<row>
<entry>XY</entry>
<entry>X followed by Y</entry>
</row>
<row>
<entry>X|Y</entry>
<entry>Either X or Y</entry>
</row>
<row>
<entry>(XY)</entry>
<entry>XY as a single group</entry>
</row>
</tbody>
</tgroup>
</table>
<section id="regex.tools.and.examples.of.use">
<title><indexterm class="singular">
<primary>Regular expressions</primary>
<secondary>Tools</secondary>
</indexterm>Regex tools and examples of use<indexterm class="singular">
<primary>Regular expressions</primary>
<secondary>Examples of use</secondary>
</indexterm></title>
<para>A number of interactive tools are available to develop and test
regular expressions. They generally follow much the same pattern (for an
example from the Regular Expression Tester see below): the regular
expression (top entry) analyzes the search text (Text box in the middle) ,
yielding the hits, shown in the result Text box.</para>
<figure id="regerx.tester">
<title>Regex Tester</title>
<mediaobject>
<imageobject role="html">
<imagedata fileref="images/RegexTester.png"/>
</imageobject>
<imageobject role="fo">
<imagedata fileref="images/RegexTester.png" width="80%"/>
</imageobject>
</mediaobject>
</figure>
<para>See <ulink url="http://weitz.de/regex-coach/">The Regex
Coach</ulink> for Windows,Linux, FreeBSD versions of a stand-alone tool.
This is much the same as the above example.</para>
<para>A nice collection of useful regex cases can be found in
<application>OmegaT</application> itself (see Options &gt; Segmentation).
The following list includes expressions you may find useful when searching
through the translation memory:</para>
<table>
<title>Regex - Examples of regular expressions in translations</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry>Regular expression</entry>
<entry>Finds the following:</entry>
</row>
</thead>
<tbody>
<row>
<entry>(\b\w+\b)\s\1\b</entry>
<entry>double words</entry>
</row>
<row>
<entry>[\.,]\s*[\.,]+</entry>
<entry>comma or a period, followed by spaces and yet another comma
or period</entry>
</row>
<row>
<entry>\. \s+$</entry>
<entry>extra spaces after the period at the end of the
line</entry>
</row>
<row>
<entry>\s+a\s+[aeiou]</entry>
<entry>English: words, beginning with vowels, should generally be
preceded by "an", not "a"</entry>
</row>
<row>
<entry>\s+an\s+[^aeiou]</entry>
<entry>English: the same check as above, but concerning consonants
("a", not "an")</entry>
</row>
<row>
<entry>\s{2,}</entry>
<entry>more than one space</entry>
</row>
<row>
<entry>\.[A-Z]</entry>
<entry>Period, followed by an upper-case letter - possibly a space
is missing between the period and the start of a new
sentence?</entry>
</row>
<row>
<entry>\bis\b</entry>
<entry>search for “is”, not “this” or “isn't” etc.</entry>
</row>
</tbody>
</tgroup>
</table>
</section>
</chapter>