Extended Regular Expression Syntax in the regex.h C Library

Nov 27, 2024

Regular expressions (regex) are powerful tools for searching, manipulating, and validating strings. In C, the POSIX header regex.h declares the functions for working with them. The library supports two syntaxes: basic regular expressions (BRE), which are the default, and extended regular expressions (ERE), which you select by passing the REG_EXTENDED flag to regcomp(). ERE treats +, ?, |, (), and {} as operators without backslash escapes, which usually makes patterns easier to read than their BRE equivalents.

In this blog post, we'll explore the syntax, key features, and practical examples of how to use this functionality.

Overview of Regular Expression Syntax in `regex.h`

The regex.h C library, defined by POSIX, provides several functions for compiling and using regular expressions:

regcomp(): Compile a regular expression into a regex_t structure.
regexec(): Match a compiled regular expression against a string.
regfree(): Free memory allocated for the compiled regular expression.
regerror(): Convert an error code returned by regcomp() or regexec() into a human-readable message.
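
As a rough illustration of how regcomp() and regerror() fit together, the sketch below wraps them in a helper. The name compile_ere and its parameters are made up for this post and are not part of regex.h.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of regex.h): compile `pattern` as an ERE.
 * On failure, the library's message for the error code is copied into
 * `errbuf`. Returns 0 on success or the nonzero regcomp() error code. */
int compile_ere(regex_t *re, const char *pattern, char *errbuf, size_t errlen) {
    int rc = regcomp(re, pattern, REG_EXTENDED);
    if (rc != 0) {
        /* The exact wording of the message is implementation-defined. */
        regerror(rc, re, errbuf, errlen);
    } else if (errlen > 0) {
        errbuf[0] = '\0';  /* no error: leave an empty message */
    }
    return rc;
}
```

A caller would print errbuf when the return value is nonzero, and call regfree() on the regex_t only after a successful compile.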

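To use extended regular expression (ERE) syntax, pass REG_EXTENDED in the cflags argument of regcomp(). With a flags value of 0 the pattern is parsed as a basic regular expression (BRE), in which +, ?, and | are ordinary characters and grouping and intervals must be written as \( \) and \{ \}. ERE is usually easier to read and is the syntax most closely resembling what other modern regex engines accept.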
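
The effect of the flag is easy to check. The following is a minimal sketch, assuming a hypothetical helper named matches_with_flags, that compiles the same pattern under either syntax and reports whether a string matches.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `text` matches `pattern` compiled with
 * `cflags`, 0 if it does not, and -1 if the pattern fails to compile. */
int matches_with_flags(const char *pattern, int cflags, const char *text) {
    regex_t re;
    /* REG_NOSUB: we only need a yes/no answer, not match offsets. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0) {
        return -1;
    }
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With REG_EXTENDED, "cat|dog" finds "dog" inside "hotdog". With a flags value of 0 the same pattern is a BRE that looks for the literal seven characters "cat|dog", so "hotdog" does not match.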

Key Extended Regular Expression (ERE) Syntax

1. Metacharacters

The following metacharacters are available in the ERE syntax, and most behave as they do in other regular expression engines.

. (Dot): Matches any single character. This includes newline unless the pattern is compiled with REG_NEWLINE.

regcomp(&regex, "a.b", REG_EXTENDED); // Matches "a", any character, then "b"

^ (Caret): Anchors the match to the start of the string. With REG_NEWLINE, it also matches immediately after a newline.

regcomp(&regex, "^abc", REG_EXTENDED); // Matches "abc" at the start of the string

$ (Dollar Sign): Anchors the match to the end of the string. With REG_NEWLINE, it also matches immediately before a newline.

regcomp(&regex, "abc$", REG_EXTENDED); // Matches "abc" at the end of the string

[] (Square Brackets): Matches any one of the enclosed characters.

regcomp(&regex, "[aeiou]", REG_EXTENDED); // Matches any lowercase vowel

[^] (Negated Character Class): Matches any single character except the ones inside the brackets.

regcomp(&regex, "[^aeiou]", REG_EXTENDED); // Matches any single character that is not a lowercase vowel

| (Pipe): Acts as an OR operator, matching either the left or right expression. It is an operator only in ERE; in a BRE it is an ordinary character.

regcomp(&regex, "cat|dog", REG_EXTENDED); // Matches either "cat" or "dog"
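
How the dot and the anchors treat newlines depends on the REG_NEWLINE flag. The sketch below uses a hypothetical helper, ere_match, to compare the two settings on multi-line input.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile `pattern` as an ERE with any extra flags
 * (for example REG_NEWLINE) and test it against `text`.
 * Returns 1 on match, 0 on no match, -1 on compile error. */
int ere_match(const char *pattern, int extra_flags, const char *text) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | extra_flags) != 0) {
        return -1;
    }
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

For example, ere_match("a.b", 0, "a\nb") succeeds because the dot accepts the newline, while the same call with REG_NEWLINE fails. In the other direction, "^dog" does not match "cat\ndog" by default but does once REG_NEWLINE makes ^ match after a newline.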

2. Quantifiers

Quantifiers allow you to specify how many times an element in the regular expression should match.

* (Asterisk): Matches 0 or more repetitions of the preceding element.

regcomp(&regex, "ab*c", REG_EXTENDED); // Matches "ac", "abc", "abbc", etc.

+ (Plus Sign): Matches 1 or more repetitions of the preceding element.

regcomp(&regex, "ab+c", REG_EXTENDED); // Matches "abc", "abbc", etc., but not "ac"

? (Question Mark): Matches 0 or 1 occurrence of the preceding element (makes it optional).

regcomp(&regex, "ab?c", REG_EXTENDED); // Matches "ac" or "abc"

{n,m} (Braces with Range): Matches between n and m repetitions of the preceding element. {n} matches exactly n repetitions and {n,} matches n or more.

regcomp(&regex, "a{2,4}", REG_EXTENDED); // Matches "aa", "aaa", or "aaaa"
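
To see how much text a quantified pattern actually consumes, the offsets that regexec() reports can be inspected. This is a small sketch with a hypothetical helper, match_length, that returns the length of the first match.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns the length in bytes of the first match of the
 * ERE `pattern` in `text`, or -1 if there is no match or a compile error. */
long match_length(const char *pattern, const char *text) {
    regex_t re;
    regmatch_t m[1];  /* m[0] receives the offsets of the whole match */
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        return -1;
    }
    int rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0) {
        return -1;
    }
    return (long)(m[0].rm_eo - m[0].rm_so);
}
```

POSIX matching picks the leftmost match and, among those, the longest, so match_length("a{2,4}", "aaaaaa") is 4 rather than 2, and match_length("ab?c", "ac") is 2 because the optional "b" is simply skipped.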

3. Grouping and Capturing

() (Parentheses): Groups a subexpression so that quantifiers or alternation apply to it as a unit, and captures the text it matched.

regcomp(&regex, "(abc)+", REG_EXTENDED); // Matches one or more occurrences of "abc"

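\n (Backreference): Refers to the text matched by the nth capturing group (for example, \1 refers to the first group). POSIX specifies backreferences for BRE only; in ERE the standard leaves \1 undefined, although some implementations, such as glibc, accept it as an extension.

regcomp(&regex, "(ab)(cd)\\1", REG_EXTENDED); // Where supported, matches "abcdab"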
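
Captured text is retrieved through the regmatch_t array passed to regexec(): entry 0 describes the whole match and entry n describes the nth group. The helper below, capture_group, is a hypothetical name used only to sketch that mechanism.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: match `text` against the ERE `pattern` and copy the
 * text of capture group `group` into `out` (truncated to outlen - 1 bytes).
 * Returns 0 on success, or -1 on compile error, no match, or an unset group. */
int capture_group(const char *pattern, const char *text, size_t group,
                  char *out, size_t outlen) {
    enum { MAX_GROUPS = 10 };  /* arbitrary limit chosen for this sketch */
    regex_t re;
    regmatch_t m[MAX_GROUPS];
    if (group >= MAX_GROUPS || outlen == 0) {
        return -1;
    }
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        return -1;
    }
    int rc = regexec(&re, text, MAX_GROUPS, m, 0);
    regfree(&re);
    /* rm_so is -1 for groups that did not take part in the match. */
    if (rc != 0 || m[group].rm_so == -1) {
        return -1;
    }
    size_t len = (size_t)(m[group].rm_eo - m[group].rm_so);
    if (len >= outlen) {
        len = outlen - 1;
    }
    memcpy(out, text + m[group].rm_so, len);
    out[len] = '\0';
    return 0;
}
```

Given the pattern "([a-z]+)@([a-z]+)\\.com" and the text "mail bob@example.com now", group 1 yields "bob", group 2 yields "example", and group 0 yields the whole address.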

4. Escape Sequences

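A backslash before a metacharacter matches that character literally, so \. matches a period and \\ matches a backslash. The Perl-style shorthand classes \d, \D, \s, \S, \w, and \W are not part of POSIX ERE. Some implementations accept a few of them as extensions (glibc, for example, accepts \w, \W, \s, and \S but not \d), so portable patterns use bracket expressions with POSIX character classes instead.

[[:digit:]] (instead of \d): Matches a digit (the same as [0-9] in the C locale).

regcomp(&regex, "[[:digit:]]+", REG_EXTENDED); // Matches one or more digits

[^[:digit:]] (instead of \D): Matches any non-digit character.

regcomp(&regex, "[^[:digit:]]+", REG_EXTENDED); // Matches one or more non-digits

[[:space:]] (instead of \s): Matches any whitespace character (spaces, tabs, line breaks).

regcomp(&regex, "[[:space:]]", REG_EXTENDED); // Matches any whitespace

[^[:space:]] (instead of \S): Matches any non-whitespace character.

regcomp(&regex, "[^[:space:]]", REG_EXTENDED); // Matches any non-whitespace character

[[:alnum:]_] (instead of \w): Matches any word character (alphanumeric or underscore).

regcomp(&regex, "[[:alnum:]_]+", REG_EXTENDED); // Matches one or more word characters

[^[:alnum:]_] (instead of \W): Matches any non-word character.

regcomp(&regex, "[^[:alnum:]_]", REG_EXTENDED); // Matches any non-word character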
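
Because shorthand escapes such as \d are not defined by POSIX, a portable way to express the same classes is with bracket expressions like [[:digit:]], [[:space:]], and [[:alnum:]_]. The following sketch, built around a hypothetical helper named posix_class_match, shows them in use.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the ERE `pattern` matches somewhere in
 * `text`, 0 if it does not, and -1 if the pattern fails to compile. */
int posix_class_match(const char *pattern, const char *text) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
        return -1;
    }
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Anchoring with ^ and $ turns a class test into whole-string validation: "^[[:digit:]]+$" accepts "12345" but rejects "12a45", and "^[[:alnum:]_]+$" accepts "snake_case1" but rejects "two words".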

5. Assertions

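Lookahead assertions come from Perl-compatible engines such as PCRE and are not part of POSIX ERE, so regex.h does not provide them. The only assertions available in ERE are the anchors ^ and $. A pattern containing (?=...) or (?!...) will typically fail to compile with REG_EXTENDED, and without that flag the characters are matched literally. The usual workaround for each construct is described here.

(?=...) (Positive Lookahead): In engines that support it, matches only if the given expression can match ahead in the string, without consuming characters. With regex.h, match the following text as well and use a capture group to isolate the part you want.

regcomp(&regex, "(abc)123", REG_EXTENDED); // Matches "abc123"; group 1 is the "abc" that is followed by "123"

(?!...) (Negative Lookahead): In engines that support it, matches only if the given expression does not match ahead in the string. With regex.h, match the prefix and then check the text after the match in C code.

regcomp(&regex, "abc", REG_EXTENDED); // Matches "abc"; inspect the text after the match to reject a following "123"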
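
Since regex.h has no lookahead syntax, the same effect has to be approximated in C. The sketch below is one possible approach, not a feature of the library: the function names are hypothetical, the positive case uses a capture group, and the negative case checks the text after each match by hand.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Emulates abc(?=123): offset of the first "abc" that is directly followed
 * by "123", or -1 if there is none. */
long find_abc_before_123(const char *text) {
    regex_t re;
    regmatch_t m[2];  /* m[1] is the capture group around "abc" */
    long result = -1;
    if (regcomp(&re, "(abc)123", REG_EXTENDED) != 0) {
        return -1;
    }
    if (regexec(&re, text, 2, m, 0) == 0) {
        result = (long)m[1].rm_so;
    }
    regfree(&re);
    return result;
}

/* Emulates abc(?!123): offset of the first "abc" that is NOT directly
 * followed by "123", or -1 if there is none. */
long find_abc_not_before_123(const char *text) {
    regex_t re;
    regmatch_t m[1];
    long result = -1;
    const char *cursor = text;
    if (regcomp(&re, "abc", REG_EXTENDED) != 0) {
        return -1;
    }
    while (regexec(&re, cursor, 1, m, 0) == 0) {
        const char *after = cursor + m[0].rm_eo;
        if (strncmp(after, "123", 3) != 0) {
            result = (long)(cursor - text) + (long)m[0].rm_so;
            break;
        }
        cursor = after;  /* skip this occurrence and keep searching */
    }
    regfree(&re);
    return result;
}
```

The loop restarts the search from the middle of the string, which is safe here because the pattern has no anchors. A pattern that uses ^ would need the REG_NOTBOL flag on the later regexec() calls.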

Example: Using `regex.h` with Extended Regular Expressions

Here's an example that demonstrates how to use the extended regular expression syntax in regex.h:

#include <stdio.h>
#include <regex.h>

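int main(void) {
    regex_t regex;
    // POSIX ERE has no \d, so each digit is written as [0-9]
    const char *pattern = "^[0-9]{3}-[0-9]{2}-[0-9]{4}$";  // Matches a US Social Security Number (SSN)
    const char *ssn = "123-45-6789";

    // Compile the regular expression and report the reason if it fails
    int rc = regcomp(&regex, pattern, REG_EXTENDED);
    if (rc != 0) {
        char msg[128];
        regerror(rc, &regex, msg, sizeof msg);
        fprintf(stderr, "Could not compile regex: %s\n", msg);
        return 1;
    }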

    // Execute the regular expression
    if (regexec(&regex, ssn, 0, NULL, 0) == 0) {
        printf("Match found: %s\n", ssn);
    } else {
        printf("No match found\n");
    }

    // Free the memory used by the regex
    regfree(&regex);
    return 0;
}

Understanding the Code:

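The pattern describes a typical US Social Security Number (SSN), formatted as XXX-XX-XXXX: three digits, a hyphen, two digits, a hyphen, and four digits, anchored with ^ and $ so that the whole string must match. In POSIX ERE each digit is written as [0-9] (or [[:digit:]]), since \d is not part of the syntax. The pattern checks the format only, not whether the number was actually issued.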
regcomp() compiles the regex pattern, using the REG_EXTENDED flag to indicate we want to use extended regular expressions.
regexec() tests if the string ssn matches the compiled regular expression.
regfree() frees the memory allocated for the compiled regular expression.

Conclusion

The extended regular expression syntax available in the regex.h C library provides a solid set of features for string matching and validation. With its support for alternation, grouping, interval quantifiers, and POSIX character classes, it covers most everyday text-processing needs. Features from Perl-compatible engines, such as lookaheads and the \d shorthand, are not available, so patterns written for those engines usually need adjusting before they work with regcomp(). Whether you're validating input, searching through logs, or extracting fields from data, knowing what POSIX ERE does and does not support will save you debugging time in C applications.

Andrew’s Substack
