Talk:Regular expression/Archive 1

This page is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

First paragraph

I have added an introductory paragraph before the history section, explaining by means of example the basic regex operations. This used to be buried in the formal language theory section, which makes it opaque to a lay audience. This led to some minor edits in the formal language section.

Removed Regular expressions used when editing Wikipedia

Removed Regular expressions used when editing Wikipedia to Wikipedia:Avoid self-references. --Anon.

Missing information about locale

The section about Unicode should be preceded with something about locale and collating. Those issues rise before introducing unicode charachters.

remove regexlib.com

regexlib.com is totally wrong here, they have a crappy design and no good content, why is it still listed?

History of the Henry Spencer regexp engine(s)

The reference to Henry Spencer and Tcl is a bit off. Tcl actually adopted Henry Spencers regexp engine twice, once for the pre-8.1 Tcl versions (which is the same Version Perl is based on), and then later for the Unicode supporting Tcl 8.1+ versions a new, unicode able regexp engine which was written by Henry Spencer under contract for Scriptics... (around 1996 or 1997) This new regexp engine is also used in PostgreSQL.

Missing The Fundamental Algorithm

Latest comment: 17 years ago1 comment1 person in discussion

l The regular expression article needs to have a link to or at least include the simple algorithm of converting a regular expression to a Deterministic Finite Automaton because regular expressions are inherently used for input string matching by a computer program.

Very few commonly used regex engines actually use either simulated DFA or full DFA engines. As long as this page is a merge of both the mathematical and jargon uses of the term "regular expression" i dont think the DFA construction algorithm should be included. If/when it is split into two then i think the strict term should include such discussion. Demerphq 12:45, 23 March 2007 (UTC)

That seems to make little sense. Seeing as that since they are merged, meant to handle both topics as you've pointed out, there's all the more reason to include the algorithm. Then, if they are split, the algorithm will only remain in the mathematical article. Including the algorithm is a necessary step to help this topic grow here, and excluding it is only hindering progress, excluding useful information, and attempting to stave off the inevitable.

Notation and style

Latest comment: 17 years ago1 comment1 person in discussion

The Manual of Style suggests that strings as strings should appear in italics. This convention doesn't make the current page harder to read, so I have edited some passages with that convention in mind. Some questions remain: how should regular expressions be formatted in this article? They currently appear mostly in double quotes (inch-marks), like this:

the regular expression "gr(e|a)y" matches grey and gray.

I think that works well, provided we do not add many expressions that include quotation marks. Alternatives would be to use the same convention as for strings:

the regular expression gr(e|a)y matches grey and gray

Alternatives include bold face, not least for readabilityl

the regular expression gr(e|a)y matches grey and gray

or a fixed width typeface, also for readability and proximity to program code

the regular expression gr(a|e)y matches grey and gray

A alternative solution to the double quotes would be to always present them in slashes, to make it look like perl

/gr(a|e)y/ matches grey and gray

Opinions are welcome.

On a related matter, I have replaced the union operator ∪ with the vertical bar even in the formal language theory section. I can count at least four different symbols for this in the literature (including the vertical bar), and suggest we use the vertical bar to keep the connection with programming tools and reduce the number of symbols in this article. Moreover, the choice of default typeface for the Wikipedia leads to a problem for this page, in that there is no distinction between the lowercase letter “l” and the vertical bar “|”. That makes some of the regexes really hard to read, and could be taken as a point in favour of displaying all regexes in the monospaced typeface. Here is an example of two lower-case Ls separated by a vertical bar: “l|l” in the default typeface, and “l|l” in the default monospace.

The article is quite hard to read because of the problem of “|”, “l” and “I” all looking somewhat similar. I think the change to a monospaced font for regexps would be a very good idea. I know when I read the first example, it took me a few seconds to realise that it wasn’t a lower case “L” that I was looking at… for a moment I thought it was using some new notation I hadn’t seen before. I suppose my only other comment is that the default choice of monospaced font isn’t itself that great, because of the similarity between “1” and “l”, but for an article on regular expressions, the “|” character is a much more relevant issue. — Ajhoughton 10:50, 2 March 2007 (UTC)

Complexity

Latest comment: 17 years ago1 comment1 person in discussion

Under Perl Regexp it states: ****

"This extra power comes at a cost. The worst-case complexity of matching a string against a Perl regular expression is exponential in the size of the input. That means the regular expression could take an extremely long time to process under just the right conditions of expression and input string, although it rarely happens in practice."

Isn't this also true for POSIX and all flavours using an NFA? --Weinzier 18:12, 12 Jul 2004 (UTC) Ludwig Weinzierl

NFAs can recognize strings in linear time (in the size of the input). I don't know POSIX well enough to comment on that. --Jan Hidders 20:00, 12 Jul 2004 (UTC)

LW is right. The exponential running time is not caused by nonregularity (instead it has to do with greediness and/or the ability to make backreferences). Indeed, the nonregularity does not "come at a cost", but is included pretty much because the engine might as well execute some Perl code while it's there (so you can implement a stack, or save substrings). So it would be more correct (but still misleading) to say it "comes for free". I have removed the whole paragraph. --User:Thore

I don't follow. The greediness and backreferences are what makes it non-regular, (by having them things become non-polynomial, and if we would not have them then it would stay polynomial) so if I understand you correctly you are in fact above confirming the very thing that you claim is not true. What am I missing? --Jan Hidders 20:41, 8 Dec 2004 (UTC)

Jan, let me try to explain this better. The thing that causes nonregularity are the backreferences. They are not specific to Perl. The thing that causes the nonlinear running time is the implementation often referred to as NFA, which is just a "backtracking" algorithm. (Rather than simulating a determinised automaton, which is sometimes called DFA, and always runs in linear time.) That's an inaccuracy in your first paragraph on this page, by the way. "NFA implementations" like Perl's are decidedly non-linear time. Friedl's book tries to explain this in Chapter 6. Let me see if I can write a few paragraphs about this on the main page, and then you can all shoot it down. --User:Thore

Its a mistake to equate NFA and backtracking. Most backtracking engines are not strictly speaking NFA since they implement leftmost-longest behaviour. This is an area where the commonly used terms, and the strict mathematical definitions contradict. To be properly NFA the engine must simulate the behaviour of a DFA matching the same pattern, so for instance 'aaaa'=~/a|aa|aaa|aaaa/ should match 'aaaa' and not 'a'. Of course this is extremely inefficient in a backtracking engine (as it means you must try every alternation), so most backtracing-nfa's dont do this, opting for the more efficient behaviour of matching 'a' instead. Demerphq 12:52, 23 March 2007 (UTC)

Error in example?

Latest comment: 18 years ago5 comments3 people in discussion

character classes, with negation: [a-z0-9^4-6] ('all lowercase letters plus the digits 0,1,2,3,7,8, and 9')

request verification: "[^" requests negation; a "^" not following the first "[" is taken literally; so this is all lowercase letters, all digits, and "^"

Correct. My 'vi' and 'grep' agree with you. I will remove it. --JanHidders

In the examples section of the Regular expressions in Formal Language Theory it states that

(a U b)* is equivalent to (ab*)*

However, the first set contains "b", the second does not.

Indeed, all non-empty strings in the second set start with "a" which is of course not the case in the first set. Fixed now. --Jan Hidders 12:55, 19 Sep 2003 (UTC)

Description and examples of alternation for regexes were uniformly incorrect. Alternation operates on single sub-expressions before and after the bar. If not a class, parenthesis group, or replication expression, only a single character is matched. The main page has been updated to conform. Turmoil 16:29, 14 March 2006 (UTC)

Not sure what planet you're from, but on Earth, /abc|def/ matches abc or def, not abcef and abdef. I reverted your changes --Randal L. Schwartz 20:57, 14 March 2006 (UTC)

This is a case in which mathematical usage and practical usage differ. Turmoil accurately describes mathematical usage. Alternation applies to atoms, so in the absence of a grouping construct the alternation operator applies only to single symbols. However, in practice, Randal L. Schwartz is right. Every matching engine that I can think of behaves as if a string of atoms were implicitly grouped. That is why you need parentheses around the entire alternation construct even if it isn't in the scope of a quantifier. /(abc|def)/ matches "abc" and "def", but /abc|def/ matches not only "abc" and "def" but also "abcdef", "abcef", and "abcdf" due to the ambiguity of the scope of the alternation operator. (If you don't believe me, try it. I just tested this in Perl, egrep, and Tcl to be sure my memory was correct.) Bill 21:24, 14 March 2006 (UTC)

Try that again with anchors. /abc|def/ matches "abcef" because it matches the abc part. Print out the value of $& to see the part that is matching. To get a good test, you have to say /^(abc|def)$/ which will demonstrate that the vertical bar is choosing either abc or def, and not just c or d. It's all about relative precedence... if you read the regex description, they describe it very precisely and predictably... the "implicit" sequence operator (two atoms next to each other) binds tighter than the vertical bar. --Randal L. Schwartz 14:36, 15 March 2006 (UTC)

I did check it with anchors but misled myself. /^abc|def$/ still matches "abcef" etc., which suggested that my explanation above was correct. The substitution test shows you are right. The reason that /^abc|def$/ matches "abcef" etc. is because it is equivalent to (/^abc)|(def$)/. (Personally, I've always thought that relying on detailed operator precedence was nuts, pace the Numerical Recipes folks, and prefer explicit grouping.) My point about mathematical usage versus application stands, though. The usual assumption in mathematical discussion of regular expressions is that you use explicit grouping and don't assume tight binding in sequences. (Incidentally, in a smallness of the world moment, I realize that I once knew your buddy Rich Cower.) Bill 20:50, 15 March 2006 (UTC)

Programmers' T-shirt pic

hmm... good one.. but Diet Coke is getting some free publicity! --Jay 13:26, 17 Jun 2004 (UTC)

I prefer the tee-shirt imaged at http://www.thinkgeek.com/tshirts/coder/57f0/ . I wonder if there is any way that we can use that image instead? Probably not, unless one of us orders one and snaps a picture while wearing it. --Bevo 14:46, 17 Jun 2004 (UTC)

Must admit, I hate Diet Coke, and this picture is not going to convert me — simply thought it looked very typical; and mimicking the FedEx logo is so obvious. (The Shakespeare quote is also cute.) --Palapala 16:05, 2004 Jun 17 (UTC)

I have no objection to the picturing of diet coke here, however it's hard to find an adequate picture for this article. --Ævar Arnfjörð Bjarmason 22:35, 2004 Jul 12 (UTC)

I think the pic should go altogether, or at least it shouldn't be at the very start. It doesn't add anything useful or essential to the article, but rather distracts attention. This is supposed to be about regular expressions, not geek culture or t-shirts or whatever. --Jonik 13:11, 22 Jan 2005 (UTC)

I agreee. --Edward 19:33, 2005 Jan 22 (UTC)

T shirt text

I'm sorry to say it so but that "skript kiddie" t-shirt regexp states for "Hack the planet" with some haxx0r variations: h@<</h4c</hax/hak, da/tha/de/ttha/tth@, planet/p1anet/pl4n3t/planett/pl4nettt. also any number of spaces between the words I do not understand, though, the /i part. --Tordek 19:34, 3 Sep 2004 (UTC)

Be more generic

Latest comment: 17 years ago1 comment1 person in discussion

Don't want to start any arguments, but where the page says "Ever since that time, regular expressions have been widely used in Unix and Unix-like utilities such as: expr, awk, Emacs, vim, lex, and Perl.", shouldn't it really mention vi, not vim. Agree? Celada 00:28, 2004 Nov 19 (UTC)

Agree! --Jan Hidders 16:14, 19 Nov 2004 (UTC)

OK, done! --Celada 03:31, 2004 Nov 20 (UTC)

I agree with this too. (81.11.186.58 23:30, 1 March 2007 (UTC))

Regex versus regular expression

Latest comment: 18 years ago3 comments2 people in discussion

Porge replaced a number of occurences of the term regex with regular expression. However, this may not be as straightforward as expanding an abbreviation with a longer form.

One of the big problems with this pages is that regular expression means different things to different people. Especially, not all regexes are formal language theory regular expressions. The previous version of this page tried to use the term regular expression only for those constructs that are regular in both senses of the word, using the term regex for irregular patterns. The article also contains a paragraph about this policy, and tried to follow that policy itself.

However, from Porge's edits that may not be a universally appreciated policy. Maybe we can discuss it here. There are certainly many, many ways to resolve the issue. (For example, by separating the pages.) However, I think this article has a lot of potential and actually tries to explain something, so I would be in favour of keeping both meanings under one hood, and retaining the subtle nomenclature. Other opinions are welcome.

(Larry Wall in one of his Perl 6 apocalypses suggests to just use pattern, and get rid of the connection to (Formal Language) regular expressions altogether.) --Arbor 07:25, 20 May 2005 (UTC)

I don't think this is a widely accepted distinction - I don't remember ever seeing it before. --Khendon 19:00, 20 May 2005 (UTC)

Sorry if I've removed a distinction, I was just trying to standardize the word across the article (in various places it had regexp, regex or regular expression). I'm not experienced enough in this area to comment on the formal versus implemented side of things, although I've never heard of the nomenclature being used to protray the difference. --porges 19:52, May 20, 2005 (UTC)

I never claimed the distinction was widely accepted. I introduced it when I tried to clean up the article; the idea is certainly up for debate. The solution had the advantage of not calling overmuch attention to itself (compared to, say, using phrases like non-regular regular expressions) but at the same time surviving a close examination (because nowhere was a regex called regular when it wasn't). No matter how we solve this, I would maintain the position that the present article should aim to minimise confusion, which includes being very precise about not using the term regular expression for patterns that don't conform to the term in all its meanings. (Alternatively, the language-theoretical use needs to be split off, which would make the article less interesting in my opinion.) The nomenclature is bad enough as it is (pointed out and commented on in both Schwartz's book and Wall's writings), and Wikipedia shouldn't add to the confusion. However, other suggestions are more than welcome. (I would prefer to just use pattern for all those regexes that aren't regular expressions.) --Arbor 20:09, 20 May 2005 (UTC)

I'd prefer "pattern" over "regex" for irregular regular expressions. The latter is too ambiguous. :) --porges 09:13, May 21, 2005 (UTC)

Javascript and escaping minus

Latest comment: 16 years ago2 comments2 people in discussion

Javascripts regex.test () do not match '-' (minus sign) for [-a-z] like, for example, Perl does. Javascript requires that '-' is escaped with [\-a-z]. (Tested in Firefox.)

The article says: "The '-' character should be literal only if it is the last or the first character within the brackets: [abc-] or [-abc].".

Suggestion: Note that this does not work in some implementations like Javascript and that '\-' is better (this works in all implementations?) and clearer.

The article also says: "Since the characters '(', ')', '[', ']', '.', '*', '?', '+', '^' and '$' are used as special symbols they have to be "escaped" somehow if they are meant literally. This is done by preceding them with '\'".

Suggestion: Mention that '-' should (not must) be escaped as well. It's a small detail, but it would have helped me a few moments ago. :) --Kricke 14:38, 17 june 2005 (CET).

I've just tested this in Firefox and IE, and escaping minus is not necessary. Shinobu 03:46, 9 October 2006 (UTC)

Shinobu is correct. The claim that "[-a-z]" requires an escape to match a hyphen is nonsense. It's simply not the case in Firefox, IE, Opera, Safari, etc., or according to ECMA-262v3, for that matter. --Monger 02:46, 7 August 2007 (UTC)

Alternation character and text appearance

On my computer (Mac running Safari) the alternation character as seen here is nearly indistinguishable from a lower-case L. (The font is Arial, I believe - if your theme matches mine, see for yourself: | l ) In some examples this makes comprehension difficult - for example, I read the first expression in the line

For example, "gr(a|e)y" is the same as "gray|grey"

as "gr(ALE)y", and since there's no direct reference to alternation in the explanation of the example, I had to rely on my knowledge of regexps to figure out that the character was an alternation operator.

One solution to this problem would be to use the TeX equivalent of the character. Another would be to put all regular expression examples in code form, as suggested above:

For example, gr(a|e)y is the same as gray|grey

I think the latter (code-style) is preferable, but there may be larger Wikipedia style issues at play. Does anyone have any comments or suggestions? Either way the problem should be resolved. -- — Adam Conover † 17:23, Jun 20, 2005 (UTC)

I tried to discuss the issue further up. You correctly point out a big problem with this page. --Arbor 18:31, 20 Jun 2005 (UTC)

Splitting This Topic

Latest comment: 17 years ago5 comments3 people in discussion

Currently this topic is confusing at best. The problem is with the material relating to various programming tools that make use of regular expressions. Regular expressions are primarily a mathematical concept. As such, the most pertinent sections are those labeled "In formal language theory" and "Implementations and running times."

Other sections contain data that is inappropriate for the topic regular expressions. For instance, the "Basic concepts" section discusses the operation "?" which does not exist in regular expressions. Similarly "Traditional Unix regular expressions" mentions operations dealing with placement in a line of text such as "^" and "$" which also have little to do with regular expressions.

I understand that this is good and useful data in the context of regular expressions in programming and in Unix. I do not mean to say "This is wrong, we must delete it!" What I suggest is that there be a topic "Regular Expressions in Programming" and that information be moved there, leaving this topic to focus on the mathematical concept of regular expressions.

Agreed. Personally, I see the current article as three different things, mixed together.

the theoretical CS perspective... eg. DFA's from the perspective of regular expressions.
the practical use as general text-matching tools in compilers, editors, grep, and programming languages.
an enumeration of all the possible libraries, blah blah, that implement regexps. In my mind, the third should be removed, according to m:When should I link externally.

In terms of enumerating some of the syntax features and differences of specific practical implementations... I don't think that should be done on this page. That's been done elsewhere [1], but if needed inside the wiki, it should be on a separate article (though it's terribly complicated IMHO, especially with things like less which can link with a variety of libraries, and with many libraries being slight derivatives of others). --Interiot 22:09, 21 September 2005 (UTC)

Like that edit for example. PCRE links with many applications (the PCRE homepage lists 13 notable examples). We don't need to list them all, especially not on this page. --Interiot 22:35, 21 September 2005 (UTC)

Object. The topic of this article good enough to make this a featured article at some time. It's a wonderful subject, and it's useful. What I don't see this article becoming is a look-up reference for Perl 5 compatible regular expressions or any other variant. That will never work—look at the pages upon pages in Mastering regular expressions needed to cover these many variants. Wikipedia is not a how-to, and it's certainly not a list of API's. But the central issues of Regular Expressions—how they are defined, how they came about, how they are used and in what contexts, how the algorithms work—that's a great topic for an encyclopedia, not really covered anywhere else in a coherent, readable fashion. We need to de-mystify the TCS section and remove some of the enumeration of syntax differences so that this article can have a cleaner focus on the central concepts. --Arbor 08:24, 22 September 2005 (UTC)

Later: Actually, we all seem to more or less agree, don't we? I also want to remove the proliferation of syntaxes, I just don't want to split the article—the practicalities are very important, as long as we don't get bogged down providing detailed syntax comparisons. (The Perl page has its own description of Perl's syntax, and that's exactly how it should be done.) But this page needs to explain ^ and $, and relate them to the formal definition. We need to explain that matching is defined differently in tools and CS (namely, that uvw tool-matches p iff v CS-matches p, or, in other words, w CS-matches p iff ^w$ tool-matches p). I am pretty fond of the explanation of ? and + in the TCS-section, explaining that the formal definition is purposefully parsimonious, but has the same expressive power. Such details are exactly missing from TCS text books and programmers' manuals. --Arbor 08:34, 22 September 2005 (UTC)

I am also in favour of splitting this article. This title should cover only true regular expressions as defined by computer science and mathematics, and contain a link to a seperate article on the software development use of the term, which does not strictly conform to the mathematical notion. By splitting the page we could show the true primitives of regular expressions without being distracted by non-primitives or non-regular constructs. And i disagree with the Arbor regarding Perl Compatible Regular Expressions, which have become more orless the standard syntax for most modern backtracking-nfa regular expression implementations. Indeed the PCRE standard of leftmost-longest matching sematics are more common than the strictly regular longest-token matching semantics, and afaik the former is not emulatable by the latter. Full disclosure: I am and have been working on the Perl regular expression engine for some time. Demerphq 14:37, 19 February 2007 (UTC)

Color-coded ASCII chart

Latest comment: 18 years ago3 comments2 people in discussion

I've added a pointer to an ASCII chart color-coded for the POSIX classes that I made some time ago as part of the documentation for a regular expression tool I wrote as I think that it makes things a lot clearer and also makes precise some of the not so clear terminology (e.g. the difference between "graph" and "print"). Three questions: (a) is this useful? (b) if so, would it be better to incorporate it directly into the article? (c) if so, is there any utility for converting HTML to wiki markup? And is the color-coding even possible? Bill 19:44, 2 January 2006 (UTC)

It's somewhat useful. Yes, it's possible, some amount of HTML and all of CSS are supported, see Image:Mississippi Delta Lobes.jpg for one example, or as a reference m:Help:Table#Color; scope of parameters or m:Help:HTML in wikitext.

But I'm actually starting to think that a comparison table like this would be more helpful, as ugly as tables are, since people keep changing the \( on traditional regular expressions to (, perhaps having more experience with modern regexp patterns and not realizing how much various regexp syntaxes differ. Because I think tables are generally uncyclopedic, I'd prefer to add as few as necessary, and would prefer to add the syntax one instead, but they're probably both useful. --Interiot 20:17, 2 January 2006 (UTC)

Maybe a way of avoiding cluttering up the regexp article would be to create a separate article on character classes (which would require disambiguation as I see that there is already an article with this title that has to do with role-playing games) or POSIX character classes. That would make some sense since character classes have some uses outside of regular expressions.

As for the syntax comparison, the chart you point at is pretty good, but the variation in regexp syntax is so great and there are so many programs that differ in subtle ways that I'm skeptical that a table can really do a good job of it.Bill 06:19, 3 January 2006 (UTC)

Complement operators and Kleene algebras

Latest comment: 18 years ago3 comments2 people in discussion

In the section titled "In formal language theory", it says the following:

"Sometimes the complement operator ~ is added; ~R denotes the set of all strings over Σ that are not in R. In that case the resulting operators form a Kleene algebra. "

A complement operator is not required in Kleene algebra. Concatenation, choice, and Kleene star operators suffice. I think the above sentence is misleading.

I also would like to question the comment:

"The complement operator is redundant: it can always be expressed by only using the other operators."

Is this so? I have never seen a definition like this. It is generally defined as something like ~A = Σ - A , where '-' represent set minus. However, set minus is not part of the regular expression syntax, so it means the complement operator is "primitive". Or is there something I am missing?

tim 13:54, 25 January 2006 (UTC)

Your first comment is correct. Fix it. As to complementation, the easiest way to see that goes via DFA: switch the accepting and rejecting states: hey presto! Instant DFA for the complement. Arbor 14:02, 25 January 2006 (UTC)

Thanks for you (very fast!) response. The change has been made. I see what you mean about the complement operator. I had misunderstood and thought that it was saying that the complement operator could be defined in terms of other operators, rather than expressions containing complement operators could be rewritten to not use them. tim 14:31, 25 January 2006 (UTC)

How to negate a whole line?

Latest comment: 18 years ago1 comment1 person in discussion

Hello, I want to negate a whole line starting from a specific caracter, eg.:

"The text to be negated starts here: negated text"

I want to negate all the line starting from the two dots ":".

Is there a way to do this?

Thanks, Ian Liu.

You cannot complement anything other than character sets, because this would make the efficient generation of NFAs from regexps infeasible. It's conceivably possible to complement the entire regexp by complementing the final DFA generated from the NFA (for systems that generate such a DFA), but this isn't usually very useful. Deco 00:29, 27 January 2006 (UTC)

"In formal language theory" or "Informal language theory"?

Latest comment: 17 years ago2 comments2 people in discussion

Greetings,

What a difference a space character can make...

The section title "In formal language theory" puzzles me. Does it mean:

 In that discipline we call formal language theory

or does it mean:

 Here's some informal language theory

If the former, please lose the initial "In" and just call it "Formal Language Theory". If the latter, please lose the space between "In" and "formal" to make the title "Informal language theory".

In the case of the former, I don't mind using a preposition as the first word, but using "in" in this case is a little too close to the word "informal".

Thanks, David

Hi David. Why not fix it yourself? WP:BOLD Welcome to Wikipedia! Arbor 08:15, 4 March 2006 (UTC)

Hmm, I don't think being bold is the problem here. Anyway, the edit history [2] shows it comes from a heading "Regular Expressions in Formal Language Theory", so I'll remove the "In". Thanks for picking that out, it's good to know where there's confusion. --H2g2bob 03:07, 24 June 2006 (UTC)

Even/Odd Regular Expression replacement.

Latest comment: 18 years ago5 comments2 people in discussion

The original RE text listed was :
"((a|ba(aa)*b)(b(aa)*b)*(a|ba(aa)*b)|b(aa)*b)*(a|ba(aa)*b)(b(aa)*b)* denotes the set of all strings which contain an even number of bs and an odd number of as. Note that this regular expression is of the form (X Y*X U Y)*X Y* with X = a|ba(aa)*b and Y = b(aa)*b. "
If you trace through this, you'll see that the first part of the RE matches aa(an even number of a's). This is therefore incorrect. This is my replacement:
"(aa|ab(bb)*ba)*(b|ab(bb)*a)[a(bb)*a|(b|a(bb)*ba)(aa|ab(bb)*ba)*(b|ab(bb)*a)]* denotes the set of all strings which contain an even number of as and an odd number of bs."
I had to removed text stating what form it was is, since that form was wrong and does not apply to the correct solution. Also, due to my own original solving of this problem, I have even a's and odd b's(instead of the original odd a's, even b's) but they are interchangable. --MacGyver07 03:15, 9 March 2006 (UTC)

The right square bracket on the end is clearly wrong. I had presumed that a version without that was correct. But as it stands, it's not even a legal regular expression. I've pulled the entire paragraph out until someone can figure out what the real expression is. --Randal L. Schwartz 14:39, 9 March 2006 (UTC)

Ahh, there's a left bracket earlier. Well, that's clearly wrong then, because it's a legal regular expression, but matching any of the characters between that left bracket and right bracket. So it's gonna match a's, b's, left parens, right parens, vertical bars, and stars. I really don't think that's what you want. Again, it should stay removed until it gets fixed. --Randal L. Schwartz 14:46, 9 March 2006 (UTC)

Look, you seriously need to stop reverting edits on things you clearly can't understand.
[a(bb)*a|(b|a(bb)*ba)(aa|ab(bb)*ba)*(b|ab(bb)*a)]*
The asterisk clearly shows that it can match nothing, and if you trace through you'll see it won't match just a, but only even a's(which is the whole point of the RE).

|'s are " or's " and *'s mean zero, one, or more... this is unaffected by the bracket as it does not change meaning, but put there for clarity.... I am really at the end of my patience for this, I fixed a CLEARLY wrong RE with the right one and you have reverted the correct changes twice now with completely incorrect reasons.... please learn how Regular Expressions work before saying they are wrong.--MacGyver07 23:31, 9 March 2006 (UTC)

Ok... upon relooking at the article, I understand where you got off base... in UNIX regular expressions, you are correct in that brackets denote character classes.... however the ones listed were in a section talking about FA's and RE's in a general sense, and in that case the RE is seen as a mathematical expression rather than a functional language.... in other words the brackets are the same as a parentheseis(this is why my original algorithm listed '|'s as +'s(whereas +'s are one or many in functional RE's), In any case, I changed the brackets to parenthesis to better fit the other RE, but I still say that brackets denoting character classes is specific to unix RE's(an implementation) and doesnt affect them in the "examples" section...AFAIK the expressions given in the example area were meant to be general RE's and not language specific which is why I used brackets... however I as I said they've been removed and no further editing or reverting of the RE should be needed... please trace through and post here before reverting if you cant understand this.--MacGyver07 23:42, 9 March 2006 (UTC)

This doesn't seem right

Latest comment: 17 years ago2 comments2 people in discussion

"Sometimes the complement operator ~ is added; ~R denotes the set of all strings over Σ that are not in R. The complement operator is redundant: it can always be expressed by only using the other operators."

How? --Ihope127 02:31, 24 June 2006 (UTC)

It's true, but completely non-obvious. First, you translate the regexp to an NFA. Then, translate the NFA to a DFA using the powerset construction. Then, complement the set of final states. Then, translate back to a regexp. The resulting regexp will recognize the complement language, but may be an exponentially larger string. Deco 17:38, 26 June 2006 (UTC)

search and replace

Latest comment: 17 years ago1 comment1 person in discussion

Many text editors have a "search and replace" that allows regexes in the "search" part.

This article emphasizes "searching", but only hints at the possibility of using regexes in the "replace" part.

For example, "\0" represents the entire thing matched, "\1", "\2", "\3", etc. represent various parts of the thing matched.

If the string I type in the "replace" part -- for example,

 <em>\0</em>

(intended to add literal "<em>" and "</em>" around what the regex found, whatever it may be) is not technically a regex, then what is it called? If this is not the appropriate encyclopedia article to discuss the "replace" part, which encyclopedia article is? --70.189.73.224 15:35, 17 September 2006 (UTC)

Google Analytics and RegEx

Latest comment: 17 years ago1 comment1 person in discussion

Google Analytis uses RegEx now and so the need to understand them is greater than before. But, all the examples here are tech-heavy Perl (or as one guy said on the Web Analytics forum, "To me, Perl is something that comes in an oyster." And that was someone who was writing his own RSS application...) Anyway, we need examples that are more relevant to that market. Rsteif 22:09, 23 September 2006 (UTC)

Mailing Lists

Latest comment: 17 years ago10 comments5 people in discussion

I had added 3 links to RegEx mailing lists and they were removed. Is there a reason for this?

Maybe because Wikipedia is not a list of links? There could be lots of other reasons. Arbor 17:35, 28 September 2006 (UTC)

While I was here, I removed more than half of the links in that section, arguably somewhat randomly. Editors should put back whatever they need is necessary for this article, but let's remember that this is not a repository of external references. Arbor 17:40, 28 September 2006 (UTC)

But looking at the links left shows that they are geared towards resources. Cheat sheets, learning sites, etc. A mailing list definitely falls under this definition and as there are almost none, giving people the ability to know where to go to ask questions is of great importance, at least in my thinking. I can understand not having a link to regexbuddy, but a learning resource is a learning resource.

If I'm wrong in this, please tell me the difference between a mailing list and a site with a tutorial.

A mailing list has no encyclopedic content, a site with a tutorial (hopefully) does. Martin 18:14, 28 September 2006 (UTC)

What do you mean by 'encyclopedic content'? The mailing lists have archives of the questions and answers if that's an issue.

If there is a well-organised mailing list archive somewhere that presents such answers in a coherent, comprehensive, and accessible manner, we could use it just as well as any of the other tutorials. But that's no longer a mailing list then. In any case, one such link, to one external site is quite enough. (Also, sign your contributions using four tildes—that will produce a signature like mine or Martin's. And why not get an account?) Arbor 19:00, 28 September 2006 (UTC)

So if the list owners did a 'best of' or compilation of the questions and answers on the list, it would be something that could be linked in without issue? Also, how do you determine which resource to link to? Jakarta has 2 different lists. House of Fusion has 1. There's even a yahoo groups list at http://tech.groups.yahoo.com/group/regex/ and a google group at http://groups.google.com/group/regex. Which 'one' gets listed? Mysticrpg 19:26, 28 September 2006 (UTC)

I would say none. That's the easiest way to solve this problem. As we said, this is an encyclopaedia, the idea is that important information is here. If a hypothetical book/site/list/whatever contained useful information, the idea is to incorporate it into this very article, not link to it. If the book/site/list/whatever does not contain useful information, it needn't be linked anyway. Arbor 20:12, 28 September 2006 (UTC)

Your argument means that there should be no external links at all as the data in the external link should be in the entry anyway. There's a problem with that logic. And where should people ask RegEx questions? Should there be a Q&A box at the bottom of the entry? Pointing people to where they can get more information is exactly what an encyclopaedia is for. Give the info and if the info is not present, point to where it can be found. Mysticrpg 20:28, 28 September 2006 (UTC)

DMOZ is designed to point people to the most useful websites for a specific topic. Wikipedia is not. We link to DMOZ, that should be good enough. --Interiot 20:52, 28 September 2006 (UTC)

OK, I added a single mailing list to the external links section so people know where to go to ask RegEx questions. Good enough? Mysticrpg 20:44, 28 September 2006 (UTC) OK, so the single link I added was taken down again. I quit. This is exactly why people hate wikipedia. Overediting and control with valid resources being rejected with a blanket statement of "not a link repository". So take down all the links then. Over controled BS. Enjoy your control. Mysticrpg 03:15, 29 September 2006 (UTC)

I think that the above overseers should've pointed to the rules instead of their private opinions: Wikipedia:External_links#Links_normally_to_be_avoided. It clearly states:

10. Links to social networking sites (such as MySpace), discussion forums or USENET.

Such a reference is much more descriptive (since rules are written much clearer than their interpretations) and a user (including me) feels far fairer after realizing he's violated a rule, not fallen a victim to arbitrarity. — Vano 01:08, 1 December 2006 (UTC)

External links that were removed en-masse

Latest comment: 17 years ago10 comments6 people in discussion

Someone in the past removed a bunch of link - here is the whole list from that awful day, so people can re-add the ones they find useful. -- Wriggleybum 08:27, 30 September 2006 (UTC)

That was me; I announced it here as well, see above. By the way, most WP editors should be able to use the "history" tab to find and bring back whatever they find necessary, so in general there is no need to "resurrect" material by copying it to the talk page. Arbor 09:15, 30 September 2006 (UTC)

A good way to point to an edit that you find problematic is to include a link to the diff, like this: [3]. That shows what I removed and when, including my edit summary. Having done that, I will remove the section of links you added to this talk page. Arbor 09:18, 30 September 2006 (UTC)

Much appreciated. I've been trying to keep it sparse. For anyone who thinks this is a bad idea, please review WP:NOT. -Harmil 19:06, 2 October 2006 (UTC)

Very Bad Idea. Don't have to think it. A link to site that was up for barely 24 hours generated a very large volume of traffic from this article. Strongly suggests this page is failing to provide answers for people coming here. WP NOT, is fine and good, but I believe you'll also find that WP is not about useless information for no purpose. Earnestly suggest you reconsider your pruning efforts or do a better job in the article. dmoz is also fine and good, but why are you making it hard for people? --Gavferr 04:14, 14 October 2006 (UTC)

More likely it was the multitude of editors who watchlist this article following the link for validation. — EncMstr 04:24, 14 October 2006 (UTC)

Gavferr, based on your comments, I take it that you are the maintainer of one of the sites that had been linked to from this article? If so, please do review WP:NOT, and if you decide that you have a justification for a particular external link, let us know. "a link ... generated a very large volume of traffic," doesn't actually address the concern. -Harmil 04:26, 14 October 2006 (UTC)

Personally I could care less if you link or don't. I derive zero benefit either way. Do keep in mind that those links keep growing. Ask yourselves why? People aren't finding what they want here, they find it somewhere else, and try and help by putting a link up. Then think of what fraction of people who would even *bother* doing the editing. QED. You can either learn from this or not. It's your call as the keepers of the page, not mine. The goal of WP is not to freeze articles in stasis but improve them. This one is frozen and has plenty of proof that it is failing. I care enough about WP succeeding to try and help by alerting you to the issue, but not enough to waste time arguing with pedants. Apologies if the truth offends. --Gavferr 09:04, 15 October 2006 (UTC)

I noticed someone added the Regex Coach to the external links; although I agree with 99% of the removed links, this one should stay; I've been tempted to add it myself. It is a stand-alone (not web-based) program, and it has literally saved me hours and hours in debugging complex REs. It has some nice "stepping" tools to let you watch what the parsing is doing. MeekMark 14:24, 5 October 2006 (UTC)

The problem with that sort of logic is that it's an excellent reason to keep a link on a Web directory service, which Wikipedia is not. Links in an article should further inform the reader, and allow them to perform further research on their own. If there's a link that you think regular expression users should know about, I suggest strongly that you submit it at DMOZ. -Harmil 06:11, 6 October 2006 (UTC)

Javascript?

Latest comment: 17 years ago2 comments2 people in discussion

Why is Javascript not mentioned in the article? I think to most users it is a much more familiar and accessible example of a language supporting regexen than the relatively speaking more obscure tcl. Shinobu 03:49, 9 October 2006 (UTC)

Oh, I agree. And I think that lots of people are trying to learn RegExen now that Google Analytics uses them. This is the same reason I think that the the RegEx Coach should be an external link (and let's face it, nobody goes to DMOZ except to get a directory listing to help their SEO.) We really need a whole section on RegEx for non-geeks. Rsteif 20:32, 12 October 2006 (UTC)

How to describe alternation?

Latest comment: 17 years ago6 comments4 people in discussion

Sorry I didn't ask before changing this sentence:

For example, "gray|grey" matches gray or grey, which can commonly be shortened to "gr(a|e)y".

to this sentence:

For example, "gray|grey", which is commonly shortened to the equivalent "gr(a|e)y", can match gray or grey.

I changed it because the original doesn't seem to me to say what it's supposed to say. That is, the original seems to say that it is "gray or grey" which can be shortened, not "gray|grey". While maybe my rewrite isn't best possible, surely some rewording is necessary to make it apparent what's being shortened? JadeNB 21:25, 10 October 2006 (UTC)

It looks good to me, although I prefer the more explicit:

For example, the regular expression "gray|grey", which is commonly shortened to the equivalent "gr(a|e)y", can match either "gray" or "grey".

Deco 23:26, 10 October 2006 (UTC)

Oops, of course you're right that I'm missing some quotation marks there. I'm not clear on the etiquette here -- if another user changed my changes, is it now reasonable for me to change them back after asking here, or am I supposed to address him or her specifically?

JadeNB 00:01, 11 October 2006 (UTC)

That person should be watching this page, and evidently prefers the original wording. You could give them a few hours to read this so that you don't end up in a revert war. However, I much prefer your wording. Deco 00:15, 11 October 2006 (UTC)

I too, like the new one better (with the quotes). Shinobu 09:37, 11 October 2006 (UTC)

This particular example could also (and quite possibly more often?) be expressed by the character class as "gr[ae]y". Perhaps anothe canonical example of the altnernation might be in order? Something that can "only" be expressed as an alternation? (Such as "jail|gaol" which would also be expressable as "(gao|jai)l".) QTJ 19:57, 11 October 2006 (UTC)

McCullough and Pitts

Latest comment: 17 years ago1 comment1 person in discussion

I have no idea what McCullough and Pitts have to do with regular expressions. I am removing the statement making that reference. If someone can dig up a reference that McCullough and Pitts had anything to do with the early history of regular expressions or finite-state languages, feel free to revert. Pexatus 01:06, 21 October 2006 (UTC)

"Regular Expressions Example (Visual Basic .NET)"

Latest comment: 17 years ago4 comments3 people in discussion

A new section entitled "Regular Expressions Example (Visual Basic .NET)" does not seem at all appropriate to me, as the example does not seem at all useful, and is more about visual basic than regular expressions. It also contains misleading statements, such as "Regular Expressions have been fully integrated into Visual Studio", I think it means integrated into the .NET framework, and "In particular, Visual Basic .NET", what bout c# etc. where regex implementation is pretty much identical. Martin 09:05, 23 October 2006 (UTC)

I agree with all that. Removed the section, and scattered the text into #Regular expression and the .NET framework and #Uses of regular expressions. Both could do with more work though. --h2g2bob 11:48, 23 October 2006 (UTC)

I've added an {{expert}} tag to the .net section. I have no knowledge of "lookaround", which looks to be a key concept in .net regexps. Plus, I have no particular desire to find out. --h2g2bob 12:15, 23 October 2006 (UTC)

"lookaround" is jargon for "positive/negative lookahead/behind assertion" --Demerphq 15:59, 23 December 2006 (UTC)

searchmonkey

Latest comment: 17 years ago1 comment1 person in discussion

Someone added searchmonkey to the "See also" section. This makes about as much sense as adding Ford F-100 to Fuel injection "See also" section. It certainly makes sense to disucss (in prose, with as much context as possible) any product which has made notable contributions to the real-world use of regular expressions, but the "See also" section isn't the place to do that. That section should be reserved for directly related or very similar topics that are not otherwise covered in the body of the article. -Harmil 21:25, 9 February 2007 (UTC)

Perhaps a section on clarifying the asterisk is in order?

Latest comment: 17 years ago4 comments4 people in discussion

Speaking as a reader of this article I was quickly baffled in the very first section by the following:

"The asterisk indicates there are 0, 1 or any number of the previous expression. For example, "go*gle" matches ggle, gogle, google, gooogle, etc."

I had to do a double take because I was under the impression that the asterisk was used as a wildcard in regular expressions. Some quick googling led me to discover that I was mistaken. Through my googling though I quickly came to the conclusion that this mistaken impression is probably very common. Perhaps a section talking about common misconceptions like this might be in order? —The preceding unsigned comment was added by 64.126.82.205 (talk) 08:11, 14 February 2007 (UTC).

The * as a wildcard is non-Unix, it was DOS (maybe CP/M that was before my time). — RevRagnarok ^{Talk Contrib} 11:48, 14 February 2007 (UTC)

I don't know about CP/M, but the DOS implementation was flawed. The asterisk could only appear at the end of a field. That is, FILE*.* worked as expected, however *FILE.C would be interpreted as *.C. It seems the asterisk was expanded into the single-character wild card ? to fill out the field. As these were 8.3 filenames, FILE*.* was transformed into FILE????.??? while *FILE.C became ????????.C for matching purposes. —EncMstr 16:20, 14 February 2007 (UTC)

* wildcarding is very old, but Unix, DOS, VMS and various other systems supported this kind of wildcarding. Wildcarding is a subset of the capability of regular expressions, but is not directly related. The clarification should probably be made if it's not already between the two. Modern Unix wildcarding is VERY sophisticated, but the fundamental difference is that wildcarding expresses missing segments, where regular expressions express sequences. Thus .* in a regular expression is functionally equivalent to * in a wildcard, but you don't have to describe what it is that comes between in a wildcard (and cannot). Alternations are sometimes possible with {a,b,c} in wildcards, but not everything supports this, and it's limited in its capabilities (e.g. you cannot specifiy the repetition of an alternation, as you can with (a|b|c)*). -Harmil 17:49, 14 February 2007 (UTC)

Tone of the Formal Language Theory section

Latest comment: 17 years ago1 comment1 person in discussion

This section reads like a textbook. Further, it seems like it could be condensed quite a bit to make it more relevant to regular expressions (formal language theory does have its own article, after all). -- Super Aardvark 01:09, 31 March 2007 (UTC)

Suggest removing section "Uses of regular expressions"

Latest comment: 17 years ago1 comment1 person in discussion

The section "Uses of regular expressions" is not very useful and feel out of place. Regexes has many uses other than syntax highlighting. Let's just link to Regular expression examples instead. --Apoc2400 06:10, 2 April 2007 (UTC)

Missing The Formal Definition

Latest comment: 16 years ago2 comments1 person in discussion

Can someone provide a formal definition for regular expression? --CBKAtTopsails 03:12, 8 June 2007 (UTC)

It is extremely confusing to read the Formal Language Theory section of this Article. In this section, RS is defined as { αβ | α in R and β in S }. It implies from this definition that R and S are sets. If so, what are the elements in R and S ? Are they symbols from the finite alphabet ? If so, R and S should be languages and we know that languages are not equal to expressions although there is a correspondence between them.

Also, in the example given, is R equal to {"ab", "c"} ? If so, R should be a language. If not, what is R in terms of a, b, and c ? --CBKAtTopsails 02:38, 15 June 2007 (UTC)

Rules on [...]

Latest comment: 16 years ago2 comments2 people in discussion

I added the following rules on [...]: []...] matches ] and the rule on placement of - to match a -. I also fixed the regexp for [:punct:], which had ] in the wrong place. - PGSONIC 17:09, 31 July 2007 (UTC)

The rules regarding "]" as the first character in a character class is flavor/engine specific. For example, Internet Explorer agrees with you, while Firefox does not. --Monger 01:18, 15 August 2007 (UTC)

Google and regex search

Latest comment: 16 years ago1 comment1 person in discussion

FYI, Google does offer regex search, at least on their codesearch. (google.com/codesearch) That might have to be changed 129.92.250.39 21:00, 6 August 2007 (UTC)

Me being a pedant

Latest comment: 16 years ago1 comment1 person in discussion

In the example "((great )*grand )?(fa|mo)ther", it should be "((great )*grand)?(fa|mo)ther" since in English (US/UK) grandmother and grandfather is one word, not two.

That is true. Be bold and fix it. Dcoetzee 22:47, 14 August 2007 (UTC)

"Multi line mode"

Latest comment: 16 years ago1 comment1 person in discussion

There are two references in the article to "multiline mode", but no definition or any other information. Being a beginner at regexp, I know there are parameters that can change the "mode", but not enough to explain it. Could this be added to the article? Thanks! -- SatyrTN (talk | contribs) 01:40, 16 August 2007 (UTC)

\b example in introduction?!

Latest comment: 16 years ago6 comments3 people in discussion

I'm concerned that the "example of regular expression syntax" in the introduction uses a feature, \b, which is a Perl extension not found in many common regular expression engines, and hardly representative of common regex syntax at all. I'd replace it with a *-based example if I could think of a good one... -- 81.86.133.45

First of all, "common regex syntax" these days is Perl-derivative. I'm not crazy about the \b-based example, but I think it is better than a *-based example because it is a simple to understand case of something useful which cannot be done with more familiar wildcards. In any case, the word boundary concept has been around with regular expressions since before Perl (via the < and > metacharacters found in a number of traditional Unix tools). --Monger 23:57, 12 November 2007 (UTC)

Suggestion: to make the \bex example clear as glass, change 'immediately after' to 'immediately before or after', to clarify what 'at a word boundary' means. And/or change the example text to something like 'Hex Texts for experts', and change the results accordingly. —Preceding unsigned comment added by 192.146.1.16 (talk) 05:11, 12 December 2007 (UTC)

I think zero-width assertions should be avoided in any introductory pattern, because they're less intuitive than positive width patterns. Far better to use something based on * and |. As for being something cannot be done with more familiar wildcards, why bother? More familiar wildcards are ultimately derived from regular expressions and are relevant in the context. Dcoetzee 05:49, 12 December 2007 (UTC)

Sez who? The claim that [all] zero-width assertions are less intuitive is obviously subjective. --Monger (talk) 07:52, 12 December 2007 (UTC)

My own opinion. Just try to answer the question "what is it matching?" It's complicated. It deserves to be postponed. Just look at a typical Perl manual or book - they usually reserve a later section to expound upon these. Dcoetzee 02:32, 13 December 2007 (UTC)

The question is of course quite simple to answer. It matches successfully at the position between a word character and a non-word character, according to the regex flavor's definition of these terms (e.g. a Unicode-based interpretation might be used, and a very small number of flavors do not base \b and \w on the same definitions). The reader does not need to understand all the subtleties of what matching an empty string means or the variations on precisely which characters are considered word characters by every regex flavor in order for the functionality to be reasonably straightforward and intuitive. Also note that the use of * and | (your preferred example tokens) can also involve the concept of elision, which is really the most complicated aspect of any zero-width assertion. --Monger (talk) 17:17, 27 December 2007 (UTC)

RegexBuddy under See Also

Latest comment: 16 years ago7 comments3 people in discussion

There is no question that the RegexBuddy article is related to regular expressions, and while the wikilink to that page might be better worked into some other part of this article, IMO it is not inappropriate to put it in under the See Also heading. I do not think that is much different than e.g. the word processor article linking to list of word processors from its See Also section. The reason I'm bringing this up is because recently another Wikipedia editor removed the RegexBuddy wikilink as "spam." If they think that's the case I would encourage them to nominate the article for deletion. Short of that, I am restoring the wikilink for now. --Monger (talk) 14:53, 12 December 2007 (UTC)

I believe it's inappropriate under the See Also heading because it gives it undue prominence - there are many such tools and at best it should receive a single mention among a list of them. Currently the article does not discuss tools and this is an area for expansion. Dcoetzee 02:30, 13 December 2007 (UTC)

I agree that it currently appears a bit too prominently, and that it would be better in a list of tools which use regexes in a significant way (there are already a few others mentioned in the article, e.g. ed, grep, and Vim). --Monger (talk) 07:52, 13 December 2007 (UTC)

I took RegexBuddy out of the See also section again, because it does indeed give prominence to one software package above all others. We should either list every single regex program here on this page, or do like other computer software articles and link to a separate comparison or List page. And if you'll follow that link to the left, you'll see that WP already has a comparison page for regex software. For readers interested in a good regex software package - like me - linking RegexBuddy on this page is almost like promoting it above other regex software programs. Instead, let's link to a comparison/list page, and let the reader decide which software program is the one they should use.

Also, contrary to it's title, the comparison page lists some regex applications, not just the underlying engines. Please read the article before reverting next time (and don't read the page's title). Kinda like judging a book by it's cover. — EagleOne\^Talk 02:11, 25 December 2007 (UTC)

Don't read the title? If the title doesn't describe what the article is about, perhaps either the title or the article's content need to change. I've contributed to that page, and read it and its talk page from top to bottom. I also happen to think it needs a ton of work, and that the list of applications which use regular expressions doesn't belong there. That sublist -- which is currently dominated by text editors -- should be split into another page, since if one were to try to fill it out with more categories of software it would quickly dominate the page. I don't care if you think linking to RegexBuddy, grep, vi, or any other regex-related software which has a dedicated Wikipedia article gives it undue prominence. I care about linking to it where it is most appropriate. As stated previously, I think an article titled "List of regular expression software" or similar would be helpful, and if one existed, yes I think the link in question would be more appropriate there. --Monger (talk) 15:51, 25 December 2007 (UTC)

Sorry. That should have read "don't just read the page title". Granted, the page title should clearly describe the contents, and the specific comparison page may need some work. However, in it's current form, it does clearly list several software tools that use regexes, and there there's no need to link each package directly on this page; just link to the comparison page and be done with it. I also believe in linking pages where it's most appropriate (I've been fighting a similar battle about portal links on a couple other articles for a while now); however, the "see also" section in this article isn't the most appropriate place for the RegexBuddy link. I can create a List of regular expression software page - I can do that right now - but I'll need your help, and others, to fill out the page, as I am a novice when it comes to regexes. — EagleOne\^Talk 22:40, 25 December 2007 (UTC)

Thanks for starting the article. I will try to contribute to it as time permits. For now, I've added a link to it under the see also section of this article. --Monger (talk) 05:30, 26 December 2007 (UTC)

The poor reader

Latest comment: 16 years ago13 comments2 people in discussion

"a regular expression is a string..."

If you don't already know what a string is, you follow that link, where you learn that "a string is an ordered sequence of symbols..".

Poor reader. I'll see if I can rephrase the intro to something more approachable - unless someone else does it first.Leotohill (talk) 04:46, 27 December 2007 (UTC)

Good luck. I'll be sure to revert if you dumb it down to something less descriptive/accurate/concise. ;-) --Monger (talk) 05:06, 27 December 2007 (UTC)

check it out. hope you like it. Leotohill (talk) 06:03, 27 December 2007 (UTC)

Nicely done. I made one change for technical accuracy and several other minor changes, but overall I agree that it is now more approachable. --Monger (talk) 13:31, 27 December 2007 (UTC)

I like your edits, except for "...and converts it into a program which...". I think that's an implementation detail that is not needed in the intro. It's also arguable whether it is required, depending on one's definition of "program". Would you consider moving this to a later section? Leotohill (talk) 15:40, 27 December 2007 (UTC)

I'd have no problems with that detail being mentioned later in the article. However, I think it is an often misunderstood technical detail that is of interest to readers. The main reason I added it though was because it corrected the previous, subtly-incorrect description of a "regular expression processor" (the patterns themselves merely provide a blueprint for the actual parser programs that are created). Especially for people who only use regexes within text editors or other such utilities, I can understand this being seen as an unnecessary implementation detail, but with many programming languages it is common to work directly with the generated regex programs, e.g. by assigning them to variables and reusing them to avoid recompilation, etc. --Monger (talk) 17:02, 27 December 2007 (UTC)

\\

I'm having trouble rewriting it in a way that addresses your concern about my original phrasing but defers this detail until later. I'd like to convince that your concern is not necessary. I wrote "...regular expression processor, a program that examines text and identifies parts of the text that match..." . I think that grep, or a regex search feature of an editor, or library routines such as .Net's Regex methods all qualify as "regular expression processors", and are commonly considered to be "the program" (or code) that does the matching, and it's really not incorrect to describe them as such. I'd really like to restore that phrasing. Will you reconsider, or propose something else that allows moving the other part to later? Leotohill (talk) 15:22, 28 December 2007 (UTC)

\\

You're proving my point about how common the misconception is (although as mentioned earlier it's not really a misconception in the application software world). So-called regex or matcher objects are very commonly compiled parser programs/functions which regex and string methods work with. E.g. if you use something like Dim RegexObj as Regex = New Regex("pattern") (VB.NET) or var regexObj = new RegExp("pattern") (ECMAScript), you're creating a parser wrapped with several methods/properties. In some other cases the same process occurs implicitly.

In any case, I do understand the point you're making, and I've merged your previous rewrite back into the intro, so that it now reads, "... Regular expressions are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification." Do you think this is enough of an improvement, or does it need further refinement/change? --Monger (talk) 19:24, 28 December 2007 (UTC)

\\

It's better. I still think the intro would suit a wider audience if it focused on the essential, and I don't see this distinction as essential, but it's ok. BTW, I'm glad you removed the "philosophy" categorization. Leotohill (talk) 21:09, 28 December 2007 (UTC)

Recent changes restored the opening sentence "regular expression (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) is a string that describes or matches a set of strings, according to certain syntax rules". This is not a good opening. "string" is an idiom of computing, completely unfamiliar to readers outside the field. As much as possible, an intro should not use special language. Also, in this case, the reader isn't helped much by following the link to "string", where it is described as "an ordered sequence of symbols". Finally, the parenthetical "(abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen)" is not necessary in the first sentence, and makes it less concise. The opening "In computing, regular expressions provide a concise and flexible means for identifying text of interest, such as particular characters, words, or patterns of characters." is concise and understandable by a reader who has no background in computing. Leotohill (talk) 13:34, 11 January 2008 (UTC)

I'll play devil's advocate here, since as you know I am not the one who made the change... I mostly just tried to make the update less redundant and bring back parts of what you added. However, one problem with the intro you have reverted to is that it does not describe what regular expressions are; merely how they're used. Personally, I cannot understand your aversion to the extremely-commonly-used computer programming term string. "Regular expression" is itself a completely foreign idiom to those outside the field, not to mention that this a fairly technical topic in the first place. If you have a problem with how the "string" article defines the term, you should take it up over there. IMO, it is far more important for this article to describe its subject thoroughly and accurately than to cater to the uninitiated. Finally, your removal of the common abbreviated forms of "regular expression" is problematic -- please put them back, where you feel is appropriate. --Monger (talk) 23:32, 11 January 2008 (UTC)

I agree that it would be better to say what a regexp is vs. what it is used for. I struggled with that earlier, without a result that I liked, but I have some new ideas now that I'll try to work in. Regarding the use of "string": what is string, but a data type for representing text? In the context of this intro, they have equivalent meaning, don't they? So the more familiar term (globally) should be used. Beyond that argument, I can only be repetitive: the way I've phrased the opening, more people can immediately understand it. I think there's great value in that, and no particular loss.

I added back the abbreviations, at another place.

Leotohill (talk) 00:10, 12 January 2008 (UTC)

BTW, are you interested in offering an opinion on the currently open regular expression examples AFD? I mention it here since it has seen few participants, and will probably be closed very soon. --Monger (talk) 23:55, 11 January 2008 (UTC)

Typo in Google find

Latest comment: 16 years ago2 comments2 people in discussion

This article shows up in a Google search result for regular-expressions as:

"Regular expression - Wikipedia, the free encyclopedia

"A Wikipedia articele on regular expressions with an informal discussion, a formal definition and examples."

Note the word "articele". Where does it come from? I can't find it on this page and don't know how to correct it.

Not overly important, but we want to be perfect, don't we? (:-)

Hgmichna (talk) 12:18, 4 January 2008 (UTC)

That's a interesting question. Since WP articles do not themselves provide a meta-summary like that, I don't know where that text would come from. I wonder if google is doing some add-on work. Whereever it came from, it may no longer exist. Google caches hits, and caches get out of date. Leotohill (talk) 17:11, 11 January 2008 (UTC)

Poor user II: Formal Language Theory

Latest comment: 16 years ago2 comments2 people in discussion

I've used regexes for years, I remember reading the O'Reilly book cover-to-cover, but I've never thought much about the formal side of them, and I have no formal language theory background. Here's what I'm noticing as a newbie to the topic:

It starts out defining the alphabet (Σ) and three constants (∅, ε and a).
It then (in "concatenation") immediately starts talking about α and β, and R and S, and braces and a vertical bar, without introducing their meaning. I gather that α and β are representative of any two strings, and that braces contain a set, but the "explanation" of "{ αβ | α in R and β in S }" makes no sense even though I already know what concatenation is.
The very next line ("alternation") uses the vertical bar in a different sense.

I'm sure I can figure out what it's trying to say once I follow all the links, but it sure is confusing at first blush. --JayLevitt (talk) 00:54, 12 January 2008 (UTC)

It is understandably confusing because the examples use the grammar of set_notation to describe the grammar of regular expressions. The vertical bar has different meaning in these two grammars, so you have to read this verrrry carefully. Let's examine this one:

(concatenation) RS denoting the set { αβ | α in R and β in S }.

Here, the vertical bar is part of the grammar of set notation, and it means "such that". So this is saying that RS is the set {αβ} (a string α followed by a string β) such that α is a member of the set R and β is a member of the set S.

In set notation grammar I think that it is acceptable to use ":" instead of "|" - that might be a good change here, but I'd like to hear comment from others before making that change.

Now the next line

(alternation) R|S denoting the set union of R and S.

Here, the "|" in R|S is referencing the vertical bar of the regex grammar. In fact, this line is defining the meaning of the regex vertical bar.

Hope this helps.

Leotohill (talk) 16:23, 12 January 2008 (UTC)

That opening sentence, again.

Latest comment: 16 years ago1 comment1 person in discussion

Once more, I'm going to restore the wording that uses "text" instead of "string". I haven't heard anyone making a strong defense of "string", and perhaps the recent change was inadvertently included with other changes.

At this time I will not remove the "abbreviations" part from the first sentence, but I still object to it. One editor invoked wp:mos as supporting placement at the first. In this case, I don't think it does. One part of mos says "The use of abbreviations should be avoided when they would be confusing to the reader, interrupt the flow, or appear informal or lazy." I think that the usage here qualifies as "interrupts the flow" and "appears informal". If it was just one abbreviation, it would be OK, but now five are listed in a parenthetical phrase that is 11 words long, which certainly interrupts the flow. They are also not true abbreviations - they are slang. Leotohill (talk) 18:56, 18 January 2008 (UTC)

redirects

Latest comment: 16 years ago2 comments2 people in discussion

BRE and ERE should probably have redirects to this page. Speaking for myself, I typed BRE in the wikipedia search box and was surprised to get a disambiguation page which didn't mention Basic Regular Expressions and typed in ERE and was even more surprised to just get most likely search results (which didn't include this page). No big deal, either way, though. —Preceding unsigned comment added by 68.214.235.119 (talk) 03:08, 12 February 2008 (UTC)

Added Basic Regular Expression to BRE disambiguation page, and created ERE redirect to this page. MeekMark (talk) 12:32, 29 February 2008 (UTC)

sexyregex.com

Latest comment: 16 years ago1 comment1 person in discussion

This link damages the topic by pointing to a poorly organized set of blog-style entries of trivial scripts lacking tutorial content. The site as a whole is as noted by another editor poorly responsive (perhaps from the way the scripts are added in a haphazard manner). The IP-editor(s) intend to use Wikipedia for promoting their site rather than improving the Wikipedia topic. Tedickey (talk) 12:02, 29 February 2008 (UTC)

POSIX Character Classes [\x00-\x1F\x7F] is not working.

Latest comment: 16 years ago1 comment1 person in discussion

POSIX Character Clasess [\x00-\x1F\x7F] is not working. I had to use [:cntrl:] in order to make it work properly. All others are working just find. Unfortunately I don't have the solution or the proper expression but I am just leaving a feedback for anyone else who can find out what is wrong with it.

This was the expression not working. Note that I am not using ending line "$". ^[\x00-\x1F\x7F]+

This is the corrected expression which is a workaround and not a full solution. ^[:cntrl:]+

Erick V... 03/31/08 13:36 EST. —Preceding unsigned comment added by Pele matosky (talk • contribs) 17:43, 31 March 2008 (UTC)

. and newline characters

Latest comment: 16 years ago1 comment1 person in discussion

Says Regular expression#POSIX Basic Regular Expressions:

". Matches any single character except newlines"

Says the spec:

"In the regular expression processing described in this specification, the newline character is regarded as an ordinary character and both a period and a non-matching list can match one."

"A period (.), when used outside a bracket expression, is a BRE that matches any character in the supported character set except NUL."

I changed the article accordingly and added a note: "many varieties exclude newlines". --TuukkaH (talk) 14:54, 12 April 2008 (UTC)

grep citation

Latest comment: 15 years ago2 comments2 people in discussion

Sorry, I should probably figure out how to edit the article myself but a peek at the "how to add a citation" seemed a little overwhelming (I have never edited a wikipedia article)... So below is a citation for the "citation needed" in the following article text:

("grep" is a word derived from the command for regular expression searching in the ed editor: g/re/p where re stands for regular expression[citation needed])

Perhaps some kind sole could add the following reference?

Article "Statistical Text Processing", by Kernighan, Lesk, Ossanna, in "The Bell System Technical Journal", Vol 57, No. 6, Part 2, July-August 1978. pg 2140:

"...the etymology of the name is from the editor command g/r.e./p where r.e. stands for regular expression." —Preceding unsigned comment added by 64.24.133.241 (talk) 02:02, 6 July 2008 (UTC)

I found this in an article, don't know if it will be useful in citation: "It stood for the editor command that it simulated, g/re/p (global regular expression print)." - http://www.columbia.edu/~rh120/ch106.x09 ... -Coronus —Preceding unsigned comment added by 71.193.31.251 (talk) 07:44, 15 September 2008 (UTC)

Why The Slashes?

Latest comment: 15 years ago5 comments4 people in discussion

\{m,n\} —Preceding unsigned comment added by Jamie Kitson (talk • contribs) 22:37, 6 August 2008 (UTC)

Other than the obvious (to distinguish those characters from the unescaped literal ones)? Tedickey (talk) 22:52, 6 August 2008 (UTC)

Some flavors of regular expressions prefer to use brackets as metacharachters and the escaped equivalents as literals. Others prefer the opposite. For example, the regular expressions used in the search command of Vim require a backslash escape in front of nearly everything. For example, the Vim regular expression /abc\|$123$\{2\}/ is equivalent to the Perl regular expression /abc|(123){2}/. In fact, Vim regexes are basically Perl regexes with different rules about what need backslashes. Karl Dickman ^talk 20:48, 23 September 2008 (UTC)

21-Oct-2008: I have removed those backslashes from POSIX syntax (as I had done 9 months ago in Jan. 2008), so the subexpression syntax is simply parentheses "( )" as in Perl and not "" and such. I had suspected that some tool must have been warping people to think backslashes were not "escape" characters to escape from metacharacter functionality. Thank you for revealing that Vim is the peculiar-child, spastic tool that warped people's minds. I will remember "VIM" as "Vulgar-Insane-Mindmush" as a mnemonic trick to recall it as mondo-spastic (not meant as an insult to Vim, just as a memory jog). Sometimes spastic tools reveal a broader capability that other tools lack. -Wikid77 (talk) 08:35, 22 October 2008 (UTC)

The backslashes are used (see the source for BRE's, which is the context in which you made your changes. Tedickey (talk) 11:35, 22 October 2008 (UTC)

Bias?

Latest comment: 15 years ago1 comment1 person in discussion

should "implements the various improvements introduced by Perl" be "implements the various extensions introduced by Perl" 81.138.55.194 (talk) 10:47, 11 August 2008 (UTC)

History

Latest comment: 15 years ago2 comments2 people in discussion

Should the history contain dates? 81.138.55.194 (talk) 10:47, 11 August 2008 (UTC)

Probably (along with reliable sources - which the history already lacks) Tedickey (talk) 13:33, 11 August 2008 (UTC)

'equal' regexs

Latest comment: 14 years ago3 comments3 people in discussion

can someone post a link to where i can read more about the algorithm that proves equality of regular expressions? —Preceding unsigned comment added by 84.75.161.109 (talk) 15:29, 13 August 2008 (UTC)

The procedure is fairly standard: Starting with two regular expressions r_1 and r_2, you first convert each to an equivalent nondeterministic finite automaton A_1 and A_2, respectively. Then apply the powerset construction to A_1 and A_2 each. The resulting deterministic finite automata can be tested for equivalence by minimizing them first and then testing whether the results are isomporphic, that means, are equal up to renaming of states. Details can be found in virtually any textbook on automata theory, see for example Chapter 3 in Introduction to Automata Theory, Languages, and Computation. Two notes: First, the above applies only to regular expressions in the "formal language theory sense", and not to (say) perl regexes. Second, in the worst case the subset construction takes exponential time and space (although you may well be lucky and the worst case does not occur in your examples). HTH. Hermel (talk) 10:03, 9 April 2009 (UTC)

The chapter you mention shows the algorithm for converting regexs into NFA, and the algorithms for converting NFA to DFA and minimization are of course well known. However, I fail to find in that book the description of how to test whether the resulting graphs are isomorphic. —Preceding unsigned comment added by 84.226.131.226 (talk) 18:59, 7 October 2009 (UTC)

Lookahead & lookbehind

Latest comment: 15 years ago1 comment1 person in discussion

How about lookahead & lookbehind feature? [4] --Octra Bond (talk) 10:58, 26 February 2009 (UTC)

Merge content from Regular expression examples

Latest comment: 15 years ago2 comments2 people in discussion

Discussion on the Regular expression examples article suggests merging that content into this article. Seems a good idea to me Chas49 (talk) 12:35, 31 March 2009 (UTC)

Given the quality concerns already noted on that page, it's not really a "good idea" Tedickey (talk) 18:00, 31 March 2009 (UTC)

NFA-style algorithms

Latest comment: 15 years ago1 comment1 person in discussion

NFA-style algorithms can extract subexpressions and do lazy quantification. They can do approximate matching, too, and theoretically lookahead/behind (though I haven't seen it implemented). LibTRE does all these things, has been out for a while, and probably should be cited. They can't do backtracking, though. —Preceding unsigned comment added by 171.66.53.79 (talk) 22:58, 1 April 2009 (UTC)

Theory of computation

Latest comment: 14 years ago2 comments2 people in discussion

Shouldn't this article be in the Theory of Computation category? angelofwisdom (talk) 23:44, 12 July 2009 (UTC)

Yes, indeed. Done. Hermel (talk) 21:23, 26 September 2009 (UTC)

Backslash-escaping hyphen in [..] in BRE and ERE

Latest comment: 14 years ago2 comments1 person in discussion

I don't think that this is supported. See re_comp and regcmp. It also demonstrably doesn't work with the regexec on my Linux system. --Dustin (talk) 23:32, 15 April 2010 (UTC)

Ah, better evidence is here. I'll just change this --Dustin (talk) 23:38, 15 April 2010 (UTC)

Human factors

Latest comment: 13 years ago2 comments2 people in discussion

There ought to be a short discussion of human factors involving the use of regular expressions. Although they are compact and highly expressive, when they are more than trivially complex they become extremely difficult to decipher even for “wizards”. They’re often derided as a type of write-only language, partly because they seem to be easier to write than to read. I’m unaware if there have been any studies done on this issue, but there at least are plenty of anecdotes out there that could be cited. The famous example is Zawinski’s saying: “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.”. I’d write the section but I’m no regexp expert. — 128.171.68.145 (talk) 20:22, 3 June 2010 (UTC)

Jamie's not a bona fide expert in this area either. Skipping the random quotes and focusing on research results would be a good way to go. Tedickey (talk) 20:41, 3 June 2010 (UTC)

Engines built-in

Latest comment: 13 years ago7 comments2 people in discussion

What does "regular expression engines built directly into their syntax" mean anyway? Zarboublian (talk) 07:52, 15 September 2010 (UTC)

Presumably it means that patterns don't have to be parameters to a function call, e.g., as awk does, and that the regular expressions are handled within the interpreter rather than in a separate program. TEDickey (talk) 09:14, 15 September 2010 (UTC)

But an "engine" is an implementation concept and can't be part of the syntax. Zarboublian (talk) 08:08, 16 September 2010 (UTC)

Your "implementation concept" is however common usage for whatever implements a regular expression TEDickey (talk) 20:30, 16 September 2010 (UTC)

That's what I said -- and syntax is a thing that is implemented, not a part of the implementation. So is there a reliable source that makes this comment in these terms? Zarboublian (talk) 11:30, 18 September 2010 (UTC)

lots of places to look at TEDickey (talk) 22:00, 18 September 2010 (UTC)

Then you won't find it hard to exhibit one. Zarboublian (talk) 07:55, 22 September 2010 (UTC)

More examples, less theory

Latest comment: 13 years ago3 comments3 people in discussion

I think this page should look more like a tutorial and less like an abstract treatise. Most of the viewers on this page will be newbies looking for simple examples. —Preceding unsigned comment added by 89.150.116.86 (talk) 13:08, 14 May 2010 (UTC)

Definitely not. Wikibooks is a better place for your idea. If you think something needs additional explanation by example, please ask for it. Reducing the the current quality, however, is not a good suggestion. --Zahnradzacken (talk) 16:11, 14 May 2010 (UTC)

OK with Zahnradzacken. And it is already there at Wikibooks: Regular Expressions/Examples. So anyone has an obligation to let Regular expression examples in the See also section of Regular expression point to that Wikibooks article and delete the Regular expression examples Wikipedia article here? --Pypes (talk) 14:47, 15 October 2010 (UTC)

Solutions to the 'car' examples in the introduction?

Latest comment: 13 years ago4 comments3 people in discussion

Wonderful, but could someone give the solutions to the 'car' examples in the introduction. I was extremely delighted to see these wonderful 'examples', and then similarly disappointed that the answers weren't there to be found!

The sequence of characters "car" appearing consecutively in any context, such as in "car", "cartoon", or "bicarbonate"
The sequence of characters "car" occurring in that order with other characters between them, such as in "Icelander" or "chandler"
The word "car" when it appears as an isolated word
The word "car" when preceded by the word "blue" or "red"
The word "car" when not preceded by the word "motor"
A dollar sign immediately followed by one or more digits, and then optionally a period and exactly two more digits (for example, "$10" or "$245.99").

-MinorFixes (talk) 03:56, 23 October 2010 (UTC)

Depending on exactly what you want to do with your regex the form may vary a bit, but here are solutions in perlish notation that should work with a simple match. Untested, but I think they should all work.

The sequence of characters "car" appearing consecutively in any context, such as in "car", "cartoon", or "bicarbonate"

  /car/    #Pretty basic

The sequence of characters "car" occurring in that order with other characters between them, such as in "Icelander" or "chandler"

  /c[^a]*a[^r]*r/     #Use negated character classes
  /c.*?a.*?r/         #Or lazy quantifiers
  /c[^a\s]*a[^r\s]*r/ #Confine matches to a single word

The word "car" when it appears as an isolated word

  /\bcar\b/  #\b is a word boundary

The word "car" when preceded by the word "blue" or "red"

  /(?:blue|red)\scar/     #Using alternation here, \s for whitespace, :? for non-capturing parens
                      #Note this matches all of "blue car", not just the car part
  /(?<=blue|red)\scar/  #Lookbehind, you could try this, but it raises an error in perl
                      #Works in some other regex engines though, see below

The word "car" when not preceded by the word "motor"

  /(?<!motor)car/       #Negative lookbehind
                      #Note in Perl (at least up to 5.10) only fixed length lookbehind is supported
                      #which is why at wasn't used in the last example
                      #"red" and "blue" have different lengths

A dollar sign immediately followed by one or more digits, and then optionally a period and exactly two more digits (for example, "$10" or "$245.99").

  /$\d+(?:\.\d\d)?/   #\d for digits, ? for a single optional character, + for one or more

There are other slightly different ways to write these, and the requirements given in the article are a bit vague, but hopefully these will serve as some examples. There are a lot of tweaks that could be used depending on exactly what the requirements are. Winston365 (talk) 04:42, 23 October 2010 (UTC)

Much much appreciated. --MinorFixes (talk) 19:44, 2 November 2010 (UTC)

FWIW, your last example is (potentially) incorrect on two points. First, you didn't backwhack your $. Second, the description says "exactly two more digits," but isn't clear about how or if that should be validated. For example, in the string "$3.333$4.44" the correct match would depend on the interpretation of the problem statement. Your example matches "$3.33" while adding a "\b" at the end of your match would force matching "$4.44". The statement of the problem needs to be clarified for these two to be disambiguated. Also, for your red/blue car example, lookbehind is a terrible thing to use there. It's far, far more efficient to say (red|blue)\s+car". You appear to be overly concerned about capturing, but capturing is very, very cheap when compared to the expense of look-behind assertions. -Miskaton (talk) 16:00, 20 December 2010 (UTC)

fix please

Latest comment: 13 years ago4 comments2 people in discussion

^(f(o([^fo]|$)|[^o]|$)|[^f])*$ - this is not correct match for all words not containing foo. fffffoo still matches; please, fix when you have time 198.240.130.75 (talk) 17:04, 2 December 2010 (UTC)

Yah, this regex is broken. In fact it matches as long as "foo" has an odd number of "f"s before it, and also fails to match anything with 'fof' in it. It is a pretty ugly regex, but can be simplified. The $'s in the middle bit just make sure that "fo" and "f" both match, but any string less than three letters matches, so we get the same effect by adding .?.? to the end and we can get rid of the ugly alternation. Rearrange a little bit and we get ^([^f]|f([^o]|o[^fo]))*.?.?$, which can be expanded as ^([^f]|f[^o]|fo[^fo])*.?.?$, which is at least approaching legibility. The problem with it matching "fof" is because of the extra "f" in the last negated character class, so ^([^f]|f[^o]|fo[^o])*.?.?$ fixes that problem.

The problem with matching 'ffoo', 'ffffoo', etc. still remains though, and as far as I know can't be fixed. POSIX regular expressions simply don't have the power to do this properly (I would love to be proven wrong however). In reality this isn't something you would use a regex for anyway. There are much simpler and faster ways to determine if a given string doesn't contain a substring, making this a pretty poor example to have in the article, even if it were possible to do correctly. I have removed it. Winston365 (talk) 23:08, 2 December 2010 (UTC)

Oh, I only just realized the "fof" bug only showed up today. I could have saved myself quite a bit of time by just looking at the page history. ;) Winston365 (talk) 23:37, 2 December 2010 (UTC)

As it turns out it is possible, but it isn't pretty. User:Tardis kindly showed me the way at the computing reference desk. So you can blame him/her if your eyes start burning. To quote him/her

It's possible but evil. It's easier to see without repeated characters ("fox"): ^([^f]|f(o?f)*([^fo]|o([^fx]|$)|$))*$, or ^([^f]|f(o?f)*([^fo]|o[^fx]))*(f(o?f)*o?)?$ if your matcher doesn't like the nested/repeated $s. Applications that are strictly matching rather than searching don't need the outer ^$, of course. I think it's safe to say that it's impractical.

Pretty nasty looking, but quite cool. And yes, totally impractical. Winston365 (talk) 01:28, 3 December 2010 (UTC)

Computer science vs. modern programming usages

Latest comment: 13 years ago3 comments3 people in discussion

Regular expressions in computer science and the modern programming term, "regular expression" or "regex" are very different things, and IMHO, should have two different articles which refer to each other. The regular expression (computer science) article should contain just the "formal theory" section's details. The regular expression (programming) page should be about the non-regular expressions used in POSIX and Perl-like systems over the course of the last 30+ years, including subgroup saving, back-referencing, forward and backward, zero-width assertions, and other behaviors which are not part of the classical computer science definition of a regular expression along with details on which languages and systems implement what. Finally, regular expression should probably be a disambiguation article.

This isn't a matter of preference. The classical computer science definition of a regular expression explicitly precludes all of the examples that I gave above, and is a fundamentally different construct than the modern suite of tools that are called regular expressions. There's nothing right or wrong with either of these uses, but it should be clear to the reader that the most common usage they're likely to encounter is not the text-book definition of the phrase. -Miskaton (talk) 15:46, 20 December 2010 (UTC)

I don't believe there is a fundamental difference. The examples you mentioned add to the expressiveness of classical regular expressions, they are not opposed to them. The formal language expressions are not different from practical expressions, they are merely more abstract/ more "meta"/ written differently. Rather than disambiguating, this article requires conciliation of both sides of that term. Dividing it up into two articles would make the similarities even less visible. --Zahnradzacken (talk) 17:26, 20 December 2010 (UTC)

Miskaton is right. The concept of regular expressions is very important in formal language theory (a syntax describing one of the basis classes of languages), but have little to do with regular expressions used in programming languages and text editors (other than some historical inspiration.) The claim

"[Regular languages] can be described by a formal regular expression. Note that the "regular expression" features provided with many programming languages are augmented with features that make them capable of recognizing languages which are not regular, and are therefore not strictly equivalent to formal regular expressions."

from Regular language is a gross understatement. At minimum a disambiguation header should be placed at the top of this article. The "Formal language theory" section should be unmerged from this article or moved to regular language, additional discussion regarding their differences should be placed there instead. The theorems and results curreclty stated in that section apply to regular expressions from formal language theory, not to regexes and this is very misleading. —Ruud 01:42, 5 January 2011 (UTC)