<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"../../../docbook-xml-4.5/docbookx.dtd">
<chapter id="chapter.regexp">
<title>Regular expressions<indexterm class="singular">
<primary>Regular expressions</primary>
<seealso>Segmentation</seealso>
<seealso>Searching</seealso>
</indexterm></title>
<para>The regular expressions (or regex for short) used in searches and
segmentation rules are those supported by Java. Should you need more
specific information, consult the <ulink
url="http://download.oracle.com/javase/1.6.0/docs/api/java/util/regex/Pattern.html">Java
Regex documentation</ulink>. See additional references and examples
below.</para>
<note>
<para>This chapter is intended for advanced users, who need to define
their own variants of segmentation rules or devise more complex and
powerful key search items.</para>
</note>
<table>
<title>Regex - Flags</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry align="left">The construct</entry>
<entry align="left">...matches the following</entry>
</row>
</thead>
<tbody>
<row>
<entry>(?i)</entry>
<entry>Enables case-insensitive matching (by default, the pattern is
case-sensitive).</entry>
</row>
</tbody>
</tgroup>
</table>
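<para>As a minimal sketch of the flag above, the following Java snippet
runs the same search with and without <literal>(?i)</literal>. The class
and method names (<literal>CaseFlagDemo</literal>,
<literal>found</literal>) are invented for this illustration and are not
part of OmegaT.</para>
<programlisting language="java"><![CDATA[
```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only; not part of OmegaT.
final class CaseFlagDemo {

    /** Returns true if the regular expression is found anywhere in the text. */
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Without the flag the pattern is case-sensitive.
        System.out.println(found("omegat", "Translated with OmegaT"));     // false
        // (?i) at the start of the pattern enables case-insensitive matching.
        System.out.println(found("(?i)omegat", "Translated with OmegaT")); // true
    }
}
```
]]></programlisting>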
<table>
<title>Regex - Character</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry align="left">The construct</entry>
<entry align="left">...matches the following</entry>
</row>
</thead>
<tbody>
<row>
<entry>x</entry>
<entry>The character x, except the following...</entry>
</row>
<row>
<entry>\uhhhh</entry>
<entry>The character with hexadecimal value 0xhhhh</entry>
</row>
<row>
<entry>\t</entry>
<entry>The tab character ('\u0009')</entry>
</row>
<row>
<entry>\n</entry>
<entry>The newline (line feed) character ('\u000A')</entry>
</row>
<row>
<entry>\r</entry>
<entry>The carriage-return character ('\u000D')</entry>
</row>
<row>
<entry>\f</entry>
<entry>The form-feed character ('\u000C')</entry>
</row>
<row>
<entry>\a</entry>
<entry>The alert (bell) character ('\u0007')</entry>
</row>
<row>
<entry>\e</entry>
<entry>The escape character ('\u001B')</entry>
</row>
<row>
<entry>\cx</entry>
<entry>The control character corresponding to x</entry>
</row>
<row>
<entry>\0n</entry>
<entry>The character with octal value 0n (0 &lt;= n &lt;= 7)</entry>
</row>
<row>
<entry>\0nn</entry>
<entry>The character with octal value 0nn (0 &lt;= n &lt;=
7)</entry>
</row>
<row>
<entry>\0mnn</entry>
<entry>The character with octal value 0mnn (0 &lt;= m &lt;= 3, 0
&lt;= n &lt;= 7)</entry>
</row>
<row>
<entry>\xhh</entry>
<entry>The character with hexadecimal value 0xhh</entry>
</row>
</tbody>
</tgroup>
</table>
<table>
<title>Regex - Quotation</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry align="left">The construct</entry>
<entry align="left">...matches the following</entry>
</row>
</thead>
<tbody>
<row>
<entry>\</entry>
<entry>Nothing, but quotes the following character. This is required
if you would like to enter any of the meta characters
!$()*+.&lt;&gt;?[\]^{|} to match as themselves.</entry>
</row>
<row>
<entry>\\</entry>
<entry>For example, this is the backslash character</entry>
</row>
<row>
<entry>\Q</entry>
<entry>Nothing, but quotes all characters until \E</entry>
</row>
<row>
<entry>\E</entry>
<entry>Nothing, but ends quoting started by \Q</entry>
</row>
</tbody>
</tgroup>
</table>
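<para>The effect of quoting can be checked with a short Java sketch. The
class and method names (<literal>QuotationDemo</literal>,
<literal>found</literal>) and the sample texts are assumptions made for
this illustration. Note that backslashes are doubled here only because the
expressions appear inside Java string literals.</para>
<programlisting language="java"><![CDATA[
```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only; not part of OmegaT.
final class QuotationDemo {

    /** Returns true if the regular expression is found anywhere in the text. */
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // An unquoted dot matches any character, so 1.5 also hits "125".
        System.out.println(found("1.5", "125"));    // true
        // A backslash quotes the dot, so only a literal period matches.
        System.out.println(found("1\\.5", "125"));  // false
        System.out.println(found("1\\.5", "1.5"));  // true
        // Everything between \Q and \E is taken literally, parentheses included.
        System.out.println(found("\\Q1.5 (approx.)\\E", "about 1.5 (approx.)")); // true
    }
}
```
]]></programlisting>
<para>In a search field you would type the expression with single
backslashes, for example <literal>1\.5</literal>.</para>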
<table>
<title>Regex - Classes for Unicode blocks and categories</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry align="left">The construct</entry>
<entry align="left">...matches the following</entry>
</row>
</thead>
<tbody>
<row>
<entry>\p{InGreek}</entry>
<entry>A character in the Greek block (simple <ulink
url="http://download.oracle.com/javase/1.6.0/docs/api/java/util/regex/Pattern.html#ubc">block</ulink>)</entry>
</row>
<row>
<entry>\p{Lu}</entry>
<entry>An uppercase letter (simple <ulink
url="http://download.oracle.com/javase/1.6.0/docs/api/java/util/regex/Pattern.html#ubc">category</ulink>)</entry>
</row>
<row>
<entry>\p{Sc}</entry>
<entry>A currency symbol</entry>
</row>
<row>
<entry>\P{InGreek}</entry>
<entry>Any character except one in the Greek block
(negation)</entry>
</row>
<row>
<entry>[\p{L}&amp;&amp;[^\p{Lu}]]</entry>
<entry>Any letter except an uppercase letter (subtraction)</entry>
</row>
</tbody>
</tgroup>
</table>
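<para>The following Java sketch counts how many characters of a sample
text fall into some of the classes listed above. The class and method names
(<literal>UnicodeClassDemo</literal>, <literal>count</literal>) and the
sample strings are assumptions made for this illustration; the non-ASCII
characters are written as Java Unicode escapes (Greek alpha, beta, gamma
and the euro sign).</para>
<programlisting language="java"><![CDATA[
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only; not part of OmegaT.
final class UnicodeClassDemo {

    /** Counts the non-overlapping matches of the regular expression in the text. */
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Uppercase letters in "OmegaT": O and T.
        System.out.println(count("\\p{Lu}", "OmegaT"));                     // 2
        // Letters that are not uppercase: m, e, g, a.
        System.out.println(count("[\\p{L}&&[^\\p{Lu}]]", "OmegaT"));        // 4
        // Three Greek letters followed by three Latin ones.
        System.out.println(count("\\p{InGreek}", "\u03b1\u03b2\u03b3 abc")); // 3
        // Currency symbols: the euro sign and the dollar sign.
        System.out.println(count("\\p{Sc}", "5 \u20ac or 6 $"));            // 2
    }
}
```
]]></programlisting>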
<table>
<title>Regex - Character classes</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry align="left">The construct</entry>
<entry align="left">...matches the following</entry>
</row>
</thead>
<tbody>
<row>
<entry>[abc]</entry>
<entry>a, b, or c (simple class)</entry>
</row>
<row>
<entry>[^abc]</entry>
<entry>Any character except a, b, or c (negation)</entry>
</row>
<row>
<entry>[a-zA-Z]</entry>
<entry>a through z or A through Z, inclusive (range)</entry>
</row>
</tbody>
</tgroup>
</table>
<table>
<title>Regex - Predefined character classes</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry align="left">The construct</entry>
<entry align="left">...matches the following</entry>
</row>
</thead>
<tbody>
<row>
<entry>.</entry>
<entry>Any character (except for line terminators)</entry>
</row>
<row>
<entry>\d</entry>
<entry>A digit: [0-9]</entry>
</row>
<row>
<entry>\D</entry>
<entry>A non-digit: [^0-9]</entry>
</row>
<row>
<entry>\s</entry>
<entry>A whitespace character: [ \t\n\x0B\f\r]</entry>
</row>
<row>
<entry>\S</entry>
<entry>A non-whitespace character: [^\s]</entry>
</row>
<row>
<entry>\w</entry>
<entry>A word character: [a-zA-Z_0-9]</entry>
</row>
<row>
<entry>\W</entry>
<entry>A non-word character: [^\w]</entry>
</row>
</tbody>
</tgroup>
</table>
<table>
<title>Regex - Boundary matchers</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry align="left">The construct</entry>
<entry align="left">...matches the following</entry>
</row>
</thead>
<tbody>
<row>
<entry>^</entry>
<entry>The beginning of a line</entry>
</row>
<row>
<entry>$</entry>
<entry>The end of a line</entry>
</row>
<row>
<entry>\b</entry>
<entry>A word boundary</entry>
</row>
<row>
<entry>\B</entry>
<entry>A non-word boundary</entry>
</row>
</tbody>
</tgroup>
</table>
<table>
<title>Regex - Greedy quantifiers</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry align="left">The construct</entry>
<entry align="left">...matches the following</entry>
</row>
</thead>
<tbody>
<row>
<entry>X<emphasis>?</emphasis></entry>
<entry>X, once or not at all</entry>
</row>
<row>
<entry>X<emphasis>*</emphasis></entry>
<entry>X, zero or more times</entry>
</row>
<row>
<entry>X<emphasis>+</emphasis></entry>
<entry>X, one or more times</entry>
</row>
</tbody>
</tgroup>
</table>
<note>
<para>Greedy quantifiers match as much as they can. For example,
<emphasis>a+</emphasis> matches the <emphasis>aaa</emphasis> in
<emphasis>aaabbb</emphasis>.</para>
</note>
<table>
<title>Regex - Reluctant (non-greedy) quantifiers</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry align="left">The construct</entry>
<entry align="left">...matches the following</entry>
</row>
</thead>
<tbody>
<row>
<entry>X??</entry>
<entry>X, once or not at all</entry>
</row>
<row>
<entry>X*?</entry>
<entry>X, zero or more times</entry>
</row>
<row>
<entry>X+?</entry>
<entry>X, one or more times</entry>
</row>
</tbody>
</tgroup>
</table>
<note>
<para>Non-greedy quantifiers match as little as they can. For
example, <emphasis>a+?</emphasis> matches only the first
<emphasis>a</emphasis> in <emphasis>aaabbb</emphasis>.</para>
</note>
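<para>The difference between greedy and reluctant quantifiers can be
checked with a short Java sketch. The class and method names
(<literal>QuantifierDemo</literal>, <literal>firstMatch</literal>) and the
tagged sample text are assumptions made for this illustration.</para>
<programlisting language="java"><![CDATA[
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only; not part of OmegaT.
final class QuantifierDemo {

    /** Returns the first match of the regular expression in the text, or null if there is none. */
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: takes as many a's as possible.
        System.out.println(firstMatch("a+", "aaabbb"));   // aaa
        // Reluctant: stops after the first a.
        System.out.println(firstMatch("a+?", "aaabbb"));  // a
        // Greedy: runs on to the last closing angle bracket.
        System.out.println(firstMatch("<.+>", "<b>bold</b> text"));   // <b>bold</b>
        // Reluctant: stops at the first closing angle bracket.
        System.out.println(firstMatch("<.+?>", "<b>bold</b> text"));  // <b>
    }
}
```
]]></programlisting>
<para>The reluctant form is usually the one wanted when matching a single
tag-like item rather than everything between the first and the last
one.</para>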
<table>
<title>Regex - Logical operators</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry align="left">The construct</entry>
<entry align="left">...matches the following</entry>
</row>
</thead>
<tbody>
<row>
<entry>XY</entry>
<entry>X followed by Y</entry>
</row>
<row>
<entry>X|Y</entry>
<entry>Either X or Y</entry>
</row>
<row>
<entry>(XY)</entry>
<entry>XY as a single group</entry>
</row>
</tbody>
</tgroup>
</table>
<section id="regex.tools.and.examples.of.use">
<title><indexterm class="singular">
<primary>Regular expressions</primary>
<secondary>Tools</secondary>
</indexterm>Regex tools and examples of use<indexterm class="singular">
<primary>Regular expressions</primary>
<secondary>Examples of use</secondary>
</indexterm></title>
<para>A number of interactive tools are available to develop and test
regular expressions. They generally follow much the same pattern (see the
example from the Regular Expression Tester below): the regular expression
(top entry) is applied to the search text (text box in the middle), and
the hits are shown in the result text box.</para>
<figure id="regerx.tester">
<title>Regex Tester</title>
<mediaobject>
<imageobject role="html">
<imagedata fileref="images/RegexTester.png"/>
</imageobject>
<imageobject role="fo">
<imagedata fileref="images/RegexTester.png" width="80%"/>
</imageobject>
</mediaobject>
</figure>
<para>See <ulink url="http://weitz.de/regex-coach/">The Regex
Coach</ulink> for Windows, Linux and FreeBSD versions of a stand-alone tool.
It works in much the same way as the example above.</para>
<para>A nice collection of useful regex cases can be found in
<application>OmegaT</application> itself (see Options &gt; Segmentation).
The following list includes expressions you may find useful when searching
through the translation memory:</para>
<table>
<title>Regex - Examples of regular expressions in translations</title>
<tgroup align="left" cols="2" rowsep="1">
<colspec align="left" colnum="1"/>
<thead>
<row>
<entry>Regular expression</entry>
<entry>Finds the following:</entry>
</row>
</thead>
<tbody>
<row>
<entry>(\b\w+\b)\s\1\b</entry>
<entry>double words</entry>
</row>
<row>
<entry>[\.,]\s*[\.,]+</entry>
<entry>comma or a period, followed by spaces and yet another comma
or period</entry>
</row>
<row>
<entry>\. \s+$</entry>
<entry>extra spaces after the period at the end of the
line</entry>
</row>
<row>
<entry>\s+a\s+[aeiou]</entry>
<entry>English: words, beginning with vowels, should generally be
preceded by "an", not "a"</entry>
</row>
<row>
<entry>\s+an\s+[^aeiou]</entry>
<entry>English: the same check as above, but concerning consonants
("a", not "an")</entry>
</row>
<row>
<entry>\s{2,}</entry>
<entry>more than one space</entry>
</row>
<row>
<entry>\.[A-Z]</entry>
<entry>Period, followed by an upper-case letter - possibly a space
is missing between the period and the start of a new
sentence?</entry>
</row>
<row>
<entry>\bis\b</entry>
<entry>“is” as a whole word, not as part of “this” or “island”. Note
that it still matches the “is” in “isn't”, because the apostrophe is not a
word character.</entry>
</row>
</tbody>
</tgroup>
</table>
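<para>Some of the expressions in the table above can be tried out with a
short Java sketch. The class and method names
(<literal>SearchExamplesDemo</literal>, <literal>hits</literal>) and the
sample sentences are assumptions made for this illustration; backslashes
are doubled only because the expressions appear inside Java string
literals.</para>
<programlisting language="java"><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only; not part of OmegaT.
final class SearchExamplesDemo {

    /** Returns every non-overlapping match of the regular expression in the text. */
    static List<String> hits(String regex, String text) {
        List<String> result = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Double words: the group captures a word, \1 requires the same word again.
        System.out.println(hits("(\\b\\w+\\b)\\s\\1\\b", "This is is a test"));  // [is is]
        // Period directly followed by an upper-case letter.
        System.out.println(hits("\\.[A-Z]", "End.Start here. Fine"));           // [.S]
        // "a" before a word that begins with a vowel.
        System.out.println(hits("\\s+a\\s+[aeiou]", "This is a example"));      // [ a e]
        // Whole-word "is": skips "this", but hits "isn't" and the final "is".
        System.out.println(hits("\\bis\\b", "this isn't what it is").size());   // 2
    }
}
```
]]></programlisting>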
</section>
</chapter>