Spark Streaming tutorial
JinungKim
2022. 1. 7. 15:35
Reference: Spark Streaming Programming Guide, Spark 3.2.0 Documentation (spark.apache.org)
A Quick Example
An example that splits the text arriving on a given port into words and counts them.
# network_wordcount.py
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
# Create a local StreamingContext with two worker threads and a batch interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)
# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)
# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))
# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()
ssc.start() # Start the computation
ssc.awaitTermination() # Wait for the computation to terminate
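One detail worth noting: the master is set to local[2] rather than local, because the socket receiver permanently occupies one thread. With a single local thread there would be nothing left to process the received batches, so the job would run but never print any counts.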
Netcat
Netcat is a utility program for opening network connections and reading and writing data over the TCP and UDP protocols. Start a listener on port 9999 (-l listens on the port, -k keeps the listener alive across connections):
nc -lk 9999
Then, in another terminal, run the script (the path below assumes you are in the root of the Spark distribution, which ships this example):
./bin/spark-submit examples/src/main/python/streaming/network_wordcount.py
Any text typed into the nc terminal is now split into words and counted once per one-second batch.
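For example, typing the following line into the nc terminal (arbitrary sample input):

hello world hello

should produce batch output in the Spark terminal roughly like this (the timestamp reflects the batch time and will differ on every run):

-------------------------------------------
Time: 2022-01-07 15:35:01
-------------------------------------------
('hello', 2)
('world', 1)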