Flink从入门到实战四[DataStream API]-4-Source数据源

Flink应用要计算,必须要有执行环境,有了执行环境,那么就需要将要数据输入到计算应用中了。
Flink提供了丰富的数据源API,包括从集合获取、单个数据输入、读取文件、读取输入流、读取Socket,还包括自定义数据源addSource。

基于文件:
readTextFile(path) – 读取文本文件,例如遵守 TextInputFormat 规范的文件,逐行读取并将它们作为字符串返回。
readFile(fileInputFormat, path) – 按照指定的文件输入格式读取(一次)文件。

基于套接字:
socketTextStream – 从套接字读取。元素可以由分隔符分隔。

基于集合:
fromCollection(Collection) – 从 Java Java.util.Collection 创建数据流。集合中的所有元素必须属于同一类型。
fromCollection(Iterator, Class) – 从迭代器创建数据流。class 参数指定迭代器返回元素的数据类型。
fromElements(T …) – 从给定的对象序列中创建数据流。所有的对象必须属于同一类型。
fromParallelCollection(SplittableIterator, Class) – 从迭代器并行创建数据流。class 参数指定迭代器返回元素的数据类型。
generateSequence(from, to) – 基于给定间隔内的数字序列并行生成数据流。

自定义:
addSource – 关联一个新的 source function。例如,你可以使用 addSource(new FlinkKafkaConsumer<>(…)) 来从 Apache Kafka 获取数据。
StreamExecutionEnvironment.addSource(sourceFunction) 将一个 source 关联到你的程序。Flink 自带了许多预先实现的 source functions,不过你仍然可以通过实现 SourceFunction 接口编写自定义的非并行 source,也可以通过实现 ParallelSourceFunction 接口或者继承 RichParallelSourceFunction 类编写自定义的并行 sources。

数据输入API包括:

<OUT> DataStreamSource<OUT>	addSource(SourceFunction<OUT> function)
<OUT> DataStreamSource<OUT>	addSource(SourceFunction<OUT> function, String sourceName)
<OUT> DataStreamSource<OUT>	addSource(SourceFunction<OUT> function, String sourceName, TypeInformation<OUT> typeInfo)
<OUT> DataStreamSource<OUT>	addSource(SourceFunction<OUT> function, TypeInformation<OUT> typeInfo)

<OUT> DataStreamSource<OUT>	createInput(InputFormat<OUT,?> inputFormat)
<OUT> DataStreamSource<OUT>	createInput(InputFormat<OUT,?> inputFormat, TypeInformation<OUT> typeInfo)

<OUT> DataStreamSource<OUT>	fromCollection(Collection<OUT> data)
<OUT> DataStreamSource<OUT>	fromCollection(Collection<OUT> data, TypeInformation<OUT> typeInfo)
<OUT> DataStreamSource<OUT>	fromCollection(Iterator<OUT> data, Class<OUT> type)
<OUT> DataStreamSource<OUT>	fromCollection(Iterator<OUT> data, TypeInformation<OUT> typeInfo)
<OUT> DataStreamSource<OUT>	fromElements(Class<OUT> type, OUT... data)
<OUT> DataStreamSource<OUT>	fromElements(OUT... data)
<OUT> DataStreamSource<OUT>	fromParallelCollection(SplittableIterator<OUT> iterator, Class<OUT> type)
<OUT> DataStreamSource<OUT>	fromParallelCollection(SplittableIterator<OUT> iterator, TypeInformation<OUT> typeInfo)
DataStreamSource<Long>	fromSequence(long from, long to)
<OUT> DataStreamSource<OUT>	fromSource(Source<OUT,?,?> source, WatermarkStrategy<OUT> timestampsAndWatermarks, String sourceName)
<OUT> DataStreamSource<OUT>	fromSource(Source<OUT,?,?> source, WatermarkStrategy<OUT> timestampsAndWatermarks, String sourceName, TypeInformation<OUT> typeInfo)
DataStreamSource<Long>	generateSequence(long from, long to)

<OUT> DataStreamSource<OUT>	readFile(FileInputFormat<OUT> inputFormat, String filePath)
<OUT> DataStreamSource<OUT>	readFile(FileInputFormat<OUT> inputFormat, String filePath, FileProcessingMode watchType, long interval)
<OUT> DataStreamSource<OUT>	readFile(FileInputFormat<OUT> inputFormat, String filePath, FileProcessingMode watchType, long interval, FilePathFilter filter)
<OUT> DataStreamSource<OUT>	readFile(FileInputFormat<OUT> inputFormat, String filePath, FileProcessingMode watchType, long interval, TypeInformation<OUT> typeInformation)
DataStream<String>	readFileStream(String filePath, long intervalMillis, FileMonitoringFunction.WatchType watchType)
DataStreamSource<String>	readTextFile(String filePath)
DataStreamSource<String>	readTextFile(String filePath, String charsetName)

DataStreamSource<String>	socketTextStream(String hostname, int port)
DataStreamSource<String>	socketTextStream(String hostname, int port, char delimiter)
DataStreamSource<String>	socketTextStream(String hostname, int port, char delimiter, long maxRetry)	Deprecated. 
DataStreamSource<String>	socketTextStream(String hostname, int port, String delimiter)
DataStreamSource<String>	socketTextStream(String hostname, int port, String delimiter, long maxRetry)

Flink DataStream 数据源连接器

Flink 内置 Connector:
Apache Kafka (source/sink)
Apache Cassandra (sink)
Amazon Kinesis Streams (source/sink)
Elasticsearch (sink)
Hadoop FileSystem (sink)
RabbitMQ (source/sink)
Apache NiFi (source/sink)
Twitter Streaming API (source)
Google PubSub (source/sink)
JDBC (sink)

Apache Bahir 项目:
Apache ActiveMQ (source/sink)
Apache Flume (sink)
Redis (sink)
Akka (sink)
Netty (source)