[Spark MOOC note] Lec4. Spark Essentials

Tue, 16 Jun 2015 spark Series Part 1 of «spark MOOC»

PYTHON SPARK (PYSPARK)
RDDs
SPARK TRANSFORMATIONS
PYTHON LAMBDA FUNCTIONS
TRANSFORMATIONS
SPARK ACTIONS
SPARK PROGRAMMING MODEL
CACHING RDDS
SPARK PROGRAM LIFECYCLE
SPARK KEY-VALUE RDDS
PYSPARK CLOSURES
SPARK BROADCAST VARIABLES
SPARK ACCUMULATORS
Lab1

PYTHON SPARK (PYSPARK)

a spark prog has 2 programs:

dirver program: runs on driver machine
worker program: runs on local threads or cluster nodes

a spark prog first creates a SparkContext object:

tells how and where to access a cluster
shell will automatically create the sc varible
in iPython: use constructor to create a SparkContext obj
⇒ use this SparkContext obj to create RDDs

Master:
The master parameter (for a SparkContext) determines which type and size of cluster to use

RDDs

Resilient Distributed Dataset:

immutable once created
spark tracks linege information to compute lost data efficiently
operations on collections of elements in parallel

to create RDDs:

paralizing existing python collections
transforming existing RDDs
from files
can specify the number of partitions for an RDD

2 types of operations on RDD:

tranformation: lazy, executed only one action runs on it
action

Working with RDD:

create an RDD
apply transformations to that RDD (ex. map, filter)
apply actions on RDD (collect, count)

ex code:

data = [1,2,3,4]
rDD = sc.paralize(data, 4)
distFile = sc.textFile("readme.txt", 4) // elements are lines in the file

SPARK TRANSFORMATIONS

to create new dataset from existing one (lazy)

examples of transformations:

PYTHON LAMBDA FUNCTIONS

single expression

TRANSFORMATIONS

⇒ spark truns the function litral into a cloture, balck code runs in driver, green code in workers

SPARK ACTIONS

cause spark to execute recipe to transform source.

SPARK PROGRAMMING MODEL

CACHING RDDS

to avoid having to reload data: rdd.cache()⇒ read from memory instead of disk

SPARK PROGRAM LIFECYCLE

create/paralise ⇒ transform ⇒ [cache] ⇒ action

SPARK KEY-VALUE RDDS

each element of a pair RDD is a pair tuple

key-value transformations:

ex:

careful using groupByKey: create lots of data traffic and iterables at works

PYSPARK CLOSURES

one closure per worker is sent with every task
no communication between workers
changes to global vars will not effect driver / other workers

⇒ pbs:

inefficient to send large data to each job
one-way: driver → worker

pyspark shared vaiables: 2 types:

Broadcase variables:
- send large, read-only variables to all workers
Accumulators
- aggregate values from worker to drivers
- only driver can access its value
- for workers the accumulators are write-only

SPARK BROADCAST VARIABLES

ex. give every worker a large dataset

SPARK ACCUMULATORS

can only be "add" to by associative operation

careful to use accumulators in transformations:

Lab1

VB更新以后虚拟机打不开了, 解决办法在:
http://bbs.deepin.org/forum.php?mod=viewthread&tid=26001