Here are some of the questions that you may have about Apache Flink and its surrounding ecosystem.
If you’ve got a question that isn’t answered here then please do ask the community.
Flink applications run in distributed compute clusters that can be scaled up to hundreds or thousands of compute nodes. Flink’s performance at scale can be attributed to the following design characteristics:
Flink guarantees that the state it manages is affected exactly once by each event. You can expect correct results, without concern for data loss or duplication.
To understand how this works, see Exactly-Once Processing in Apache Flink.
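The core idea can be illustrated with a small sketch (plain Python, not Flink's actual implementation): operator state is snapshotted together with the source offset, and after a failure both are restored and the source is replayed from that offset. Some events may be *processed* more than once, but each event affects the state exactly once. The `run` function, its checkpoint interval, and the `fail_after` parameter are all illustrative inventions.

```python
# Hypothetical sketch of checkpoint-and-replay (not Flink's real code):
# a counting operator whose state is checkpointed together with the
# source offset. On failure, state and offset are restored from the
# last checkpoint and replay resumes from there, so every event
# affects the state exactly once.

def run(events, fail_after=None):
    state = {"count": 0}            # operator state
    checkpoint = ({"count": 0}, 0)  # (state snapshot, source offset)
    offset = 0
    while offset < len(events):
        try:
            for i in range(offset, len(events)):
                if fail_after is not None and i == fail_after:
                    fail_after = None  # simulate a single crash
                    raise RuntimeError("simulated crash")
                state["count"] += 1
                if (i + 1) % 2 == 0:   # checkpoint every 2 events
                    checkpoint = (dict(state), i + 1)
            offset = len(events)       # reached the end of the stream
        except RuntimeError:
            # Recovery: roll state back and replay from the offset
            # recorded in the checkpoint.
            state, offset = dict(checkpoint[0]), checkpoint[1]
    return state["count"]
```

With or without a simulated crash, the final count equals the number of events, which is the exactly-once guarantee in miniature.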
Streaming data is processed as it becomes available, and often this means that streams are being processed out-of-order with respect to the timestamps in the events (which indicate when the events actually occurred). Time-related operations, such as windows, need to know how long to wait for out-of-order events before producing results, and how long to retain whatever state is required to handle these out-of-order events correctly.
Watermarks are timestamped stream records that Flink inserts into your data streams. Each watermark marks a position in the stream with the timestamp it carries. Time-based operations, like windows, rely on watermarks to know when to produce results, and when the state they're keeping can be safely garbage collected.
Watermark generation is typically based on an estimate of the maximum out-of-orderness expected for each data source. As a developer, you control the tradeoff between latency and completeness through this parameter: the longer Flink waits for out-of-order events, the more complete (but later) the results.
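The interplay described above can be sketched in a few lines of plain Python (this is a conceptual illustration, not Flink's API): a bounded-out-of-orderness watermark is the maximum timestamp seen so far minus the out-of-orderness bound, and a tumbling event-time window fires, and drops its state, once the watermark passes the window's end. The 3-second bound and 10-second window size are illustrative choices, not Flink defaults.

```python
# Conceptual sketch (not Flink's API): bounded-out-of-orderness
# watermarks driving tumbling event-time windows.

MAX_OUT_OF_ORDERNESS = 3  # seconds we are willing to wait for late events

def assign_windows(events, window_size=10):
    """events: list of (key, timestamp) records, possibly out of order."""
    max_ts_seen = float("-inf")
    windows = {}   # window_start -> count of events (the operator's state)
    results = []   # (window_start, count), in the order windows fire
    for _, ts in events:
        window_start = (ts // window_size) * window_size
        windows[window_start] = windows.get(window_start, 0) + 1
        # Watermark: an assertion that no event with a timestamp at or
        # below it is still expected.
        max_ts_seen = max(max_ts_seen, ts)
        watermark = max_ts_seen - MAX_OUT_OF_ORDERNESS
        # Fire (and garbage-collect) every window whose end has been
        # passed by the watermark.
        for start in sorted(w for w in windows if w + window_size <= watermark):
            results.append((start, windows.pop(start)))
    return results, windows  # fired windows, plus still-open state
```

For example, with events timestamped 1, 4, 12, 9, 25, the record at 9 arrives after the record at 12, but because the watermark lags the maximum timestamp by 3 seconds, the window [0, 10) has not yet fired and the late event is still counted correctly.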
For more on watermarks, see Event Time and Watermarks.