David Anderson

Software Practice Lead

MATCH_RECOGNIZE (Exercise)

Best practices for avoiding resource exhaustion

Any time a pattern includes optional or iterable quantifiers, such as *, +, ?, or {n,m}, the runtime may have to consider many different ways that the pattern might match the input stream. The worst case is when the pattern doesn't match at all, because the runtime must then account for every possible way the pattern could be mapped onto the stream before concluding that no match exists.

What the MATCH_RECOGNIZE runtime is doing is similar to what regular expression engines do. However, pattern matching over streams presents additional challenges: in particular, the implementation must be optimized so that it doesn't repeat an exhaustive search in response to each new row.

The Flink runtime solves these problems by storing information about the active potential matches in Flink state. These data structures can grow in proportion to the size of the input event stream. Since the input stream is unbounded, there's no a priori limit to how much state could be involved. (If you are interested in the details, you can read about them in the research paper that inspired the implementation of Flink's CEP library, which serves as the foundation for Flink SQL's support for MATCH_RECOGNIZE.)

The bottom line is that ill-considered patterns can result in combinatorial explosions of both state and CPU. To avoid problems, use patterns that are as specific as possible:

  • Make sure that each row can match only one pattern variable, or at most a small number of them, and avoid pattern variables that match every row (e.g., variables defined as TRUE). Whenever an incoming event can match multiple variables, the engine may eventually have to backtrack and try all possible combinations, leading to a combinatorial explosion.

  • Provide a definitive way for patterns to end

    • by using an upper limit for quantifiers (e.g., {,4} instead of *)
    • or by imposing a time limit (e.g., WITHIN INTERVAL '2' MINUTE), as sketched below
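
In terms of syntax, these two safeguards appear in the PATTERN clause like this (a sketch; A and B are placeholder pattern variables, not from a specific query):

    -- A can match at most 4 rows, and the whole match must complete within 2 minutes
    PATTERN (A{1,4} B) WITHIN INTERVAL '2' MINUTE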

Greedy vs. reluctant

Each quantifier can be either greedy (default behavior) or reluctant (sometimes referred to as lazy). Greedy quantifiers will match as many rows as possible, while reluctant quantifiers will match as few as possible.
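
To see the difference, consider a sketch (the variables and prices here are invented for illustration):

    PATTERN (A+ C)
    DEFINE
        A AS A.price < 100,
        C AS C.price < 200

Given input rows with prices 10, 20, and 150, the row with price 20 satisfies both A and C. The greedy A+ claims it for A, producing the match A = {10, 20}, C = 150. With the reluctant A+? instead, the pattern finishes as soon as it can, producing A = {10}, C = 20.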

Flink does not allow patterns that end with a greedy quantifier. They are forbidden because such patterns are generally unreasonable, like the one mentioned in the video, which can never be satisfied:

    PATTERN (UP{3,})
    DEFINE
        UP AS (COUNT(UP.price) = 1) OR 
              (UP.price > LAST(UP.price, 1))

However, this restriction against greedy quantifiers at the end of a pattern can seem unnecessarily limiting. For example, it's often useful to define a pattern that matches rows up until something specific happens.

Suppose you wanted to have a machine learning model learn from the sequence of purchases leading up to the purchase of a specific product:

SELECT *
FROM examples.marketplace.orders  
MATCH_RECOGNIZE (
    PARTITION BY customer_id
    ORDER BY `$rowtime`
    MEASURES
        ARRAY_AGG(O.product_id) AS customer_journey
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (O+)
    DEFINE
        O AS O.product_id <> '1000'
);

This query fails, and the resulting error message mentions possible approaches to resolving this:

ERROR

Greedy quantifiers are not allowed as the last element of a Pattern yet. Finish your pattern with either a simple variable or reluctant quantifier.

One way to fix this is to follow O+ with another pattern variable that has the opposite condition:

SELECT *
FROM examples.marketplace.orders  
MATCH_RECOGNIZE (
    PARTITION BY customer_id
    ORDER BY `$rowtime`
    MEASURES
        ARRAY_AGG(O.product_id) AS customer_journey
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (O+ P1000)
    DEFINE
        O AS O.product_id <> '1000',
        P1000 AS P1000.product_id = '1000'
);

Alternatively, this same pattern can be expressed by making the first part of the pattern reluctant: here O+ has been changed to O+?. Because a reluctant quantifier stops consuming rows as soon as the rest of the pattern can match, O can now be defined simply as TRUE. One could argue that this formulation more clearly expresses the intent of matching everything up until an order for product 1000 arrives, and in cases where the condition is very complex, this sort of rewrite can make the pattern more readable:

SELECT *
FROM examples.marketplace.orders  
MATCH_RECOGNIZE (
    PARTITION BY customer_id
    ORDER BY `$rowtime`
    MEASURES
        ARRAY_AGG(O.product_id) AS customer_journey
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (O+? P1000)
    DEFINE
        O AS TRUE,
        P1000 AS P1000.product_id = '1000'
);

Hands-on exercise

Both variants of the example above fail to follow the best practices outlined at the beginning: there is no limit on how many rows O can match, so these queries have the potential to accumulate a lot of state.

Find a way to fix this.

Hint

This is going to require altering the business logic somewhat, since making this query less expensive means modifying it so that it no longer retains the complete order history for every customer.

In most cases, this is a reasonable approach. Holding onto dozens of past purchases spread out over several years is unlikely to improve the recommendations that are produced, compared to a more modest retention policy.

Solution

It's straightforward to impose a limit on the number of orders

SELECT *
FROM examples.marketplace.orders  
MATCH_RECOGNIZE (
    PARTITION BY customer_id
    ORDER BY `$rowtime`
    MEASURES
        COUNT(O.product_id) AS order_count,
        ARRAY_AGG(O.product_id) AS customer_journey
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (O{1,4}? P1000)
    DEFINE
        O AS TRUE,
        P1000 AS P1000.product_id = '1000'
);

or the timeframe involved (by using WITHIN), or both.
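
For example, here is a sketch of the same solution with a time limit added (the one-hour interval is an arbitrary choice for illustration):

SELECT *
FROM examples.marketplace.orders
MATCH_RECOGNIZE (
    PARTITION BY customer_id
    ORDER BY `$rowtime`
    MEASURES
        COUNT(O.product_id) AS order_count,
        ARRAY_AGG(O.product_id) AS customer_journey
    AFTER MATCH SKIP PAST LAST ROW
    -- the match must complete within one hour, and O can match at most 4 rows
    PATTERN (O{1,4}? P1000) WITHIN INTERVAL '1' HOUR
    DEFINE
        O AS TRUE,
        P1000 AS P1000.product_id = '1000'
);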
