[Spark MOOC note] Lec6. Structured Data

Thu, 18 Jun 2015 spark Series Part 3 of «spark MOOC»

RELATIONAL DATABASE
LARGE DATABASES
RELATIONAL DATABASE EXAMPLE AND DISCUSSION
STRUCTURED QUERY LANGUAGE (SQL)
JOINS IN SQL
EXPLICIT SQL JOINS
TYPES OF SQL JOINS
JOINS IN SPARK
Lab 2 - Web Server Log Analysis with Apache Spark

RELATIONAL DATABASE

review: key data management concepts:

data model
schema
relational data model

structured data: has a specific schema to start with

relational database: a set of relations. A relation has 2 parts:

schema: name of relation, name and type of columns

instance:

the actual data at a given time (cardinality := number of rows, degree := number of fields)

LARGE DATABASES

RELATIONAL DATABASE EXAMPLE AND DISCUSSION

cardinality=3 degree=5
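
A minimal sketch of these two measures, using Python's built-in sqlite3 module. The `students` table and its rows are hypothetical (the table from the lecture is not reproduced in these notes); they are chosen only so that the cardinality is 3 and the degree is 5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema: relation name plus the name and type of each column (hypothetical).
conn.execute(
    "CREATE TABLE students "
    "(sid INTEGER, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)

# Instance: the rows stored at this moment.
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ann", "ann@cs", 19, 3.4),
        (2, "Bob", "bob@ee", 18, 3.2),
        (3, "Cid", "cid@math", 18, 3.8),
    ],
)

cursor = conn.execute("SELECT * FROM students")
degree = len(cursor.description)      # number of fields
cardinality = len(cursor.fetchall())  # number of rows
print(cardinality, degree)
```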

advantages of Relational Databases:

well-defined structure
maintain indices for high performance
consistency maintained by transactions

disadvantages:

limited, rigid structure
most disk space taken by large indices
transactions are slow
poor support for sparse data (which is common)

STRUCTURED QUERY LANGUAGE (SQL)

supported by pyspark DataFrames

JOINS IN SQL

cross join: Cartesian product
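
A small sketch of a cross join with Python's built-in sqlite3 module. The tables `a` and `b` are made up for illustration; the point is only that every row of one table gets paired with every row of the other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INTEGER)")
conn.execute("CREATE TABLE b (y TEXT)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO b VALUES (?)", [("p",), ("q",), ("r",)])

# Each of the 2 rows of a is paired with each of the 3 rows of b: 6 rows.
cross = conn.execute(
    "SELECT x, y FROM a CROSS JOIN b ORDER BY x, y"
).fetchall()
print(cross)
```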

EXPLICIT SQL JOINS

the explicit version (JOIN ... ON) is preferred over the implicit one (join condition in the WHERE clause)
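
To compare the two styles, here is a sketch with sqlite3. The `students` and `enrolled` tables and their contents are hypothetical. Both queries should return the same rows; the explicit form just states the join condition in the ON clause instead of mixing it into WHERE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (sid INTEGER, name TEXT)")
conn.execute("CREATE TABLE enrolled (sid INTEGER, cid TEXT)")
conn.executemany("INSERT INTO students VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.executemany(
    "INSERT INTO enrolled VALUES (?, ?)",
    [(1, "History105"), (1, "DataScience194")],
)

# Implicit join: the join condition sits in the WHERE clause.
implicit = conn.execute(
    "SELECT s.name, e.cid FROM students s, enrolled e "
    "WHERE s.sid = e.sid ORDER BY e.cid"
).fetchall()

# Explicit join: the join condition is stated in the ON clause.
explicit = conn.execute(
    "SELECT s.name, e.cid FROM students s INNER JOIN enrolled e "
    "ON s.sid = e.sid ORDER BY e.cid"
).fetchall()

print(implicit == explicit, explicit)
```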

TYPES OF SQL JOINS

⇒ the join type controls how unmatched keys are handled

LEFT OUTER JOIN: keys appearing in the left table but not in the right table are included, with NULL as the right-side value
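
A sketch of the NULL filling with sqlite3, on hypothetical `students` and `enrolled` tables. Bob has no matching row in `enrolled`, so he is kept and his `cid` comes back as NULL (`None` in Python).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (sid INTEGER, name TEXT)")
conn.execute("CREATE TABLE enrolled (sid INTEGER, cid TEXT)")
conn.executemany("INSERT INTO students VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.execute("INSERT INTO enrolled VALUES (1, 'History105')")

# Left table = students: all of its rows survive the join.
rows = conn.execute(
    "SELECT s.name, e.cid FROM students s LEFT OUTER JOIN enrolled e "
    "ON s.sid = e.sid ORDER BY s.sid"
).fetchall()
print(rows)
```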

JOINS IN SPARK

for Spark DataFrames: inner / left outer / semi-join are supported
for pair RDDs: join() (inner join), leftOuterJoin(), fullOuterJoin() are supported

join ex:
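
The example from the lecture is not reproduced in these notes, so here is a plain-Python sketch (no Spark needed) of what `join()` on pair RDDs returns. The helper `inner_join` and the sample data are illustrative only; the corresponding Spark call is shown in a comment and assumes an existing SparkContext `sc`.

```python
def inner_join(x, y):
    """Mimic pair-RDD x.join(y) on plain lists of (key, value) pairs."""
    return [(kx, (vx, vy)) for kx, vx in x for ky, vy in y if kx == ky]

x = [("a", 1), ("b", 4)]
y = [("a", 2), ("a", 3)]

# Spark version (assumes a SparkContext `sc`):
#   sorted(sc.parallelize(x).join(sc.parallelize(y)).collect())
joined = sorted(inner_join(x, y))
print(joined)  # key "b" has no match in y, so it is dropped
```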

outerjoin ex:
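
Assuming this refers to `leftOuterJoin()`, a plain-Python sketch of its semantics: every key of the left side is kept, and a missing right-side value becomes `None`. The helper `left_outer_join` and the data are illustrative; the Spark call in the comment assumes a SparkContext `sc`.

```python
def left_outer_join(x, y):
    """Mimic pair-RDD x.leftOuterJoin(y) on plain lists of (key, value) pairs."""
    result = []
    for kx, vx in x:
        matches = [vy for ky, vy in y if ky == kx]
        # No match on the right: emit the left value paired with None.
        for vy in matches or [None]:
            result.append((kx, (vx, vy)))
    return result

x = [("a", 1), ("b", 4)]
y = [("a", 2)]

# Spark version (assumes a SparkContext `sc`):
#   sorted(sc.parallelize(x).leftOuterJoin(sc.parallelize(y)).collect())
left_joined = sorted(left_outer_join(x, y))
print(left_joined)
```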

fullouterjoin ex:
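
A plain-Python sketch of `fullOuterJoin()` semantics: keys from both sides are kept, and whichever side has no match is filled with `None`. The helper `full_outer_join` and the data are illustrative; the Spark call in the comment assumes a SparkContext `sc`.

```python
def full_outer_join(x, y):
    """Mimic pair-RDD x.fullOuterJoin(y) on plain lists of (key, value) pairs."""
    result = []
    for kx, vx in x:
        matches = [vy for ky, vy in y if ky == kx]
        # Left key without a match: right side becomes None.
        for vy in matches or [None]:
            result.append((kx, (vx, vy)))
    x_keys = {k for k, _ in x}
    for ky, vy in y:
        # Right key without a match: left side becomes None.
        if ky not in x_keys:
            result.append((ky, (None, vy)))
    return result

x = [("a", 1), ("b", 4)]
y = [("a", 2), ("c", 8)]

# Spark version (assumes a SparkContext `sc`):
#   sorted(sc.parallelize(x).fullOuterJoin(sc.parallelize(y)).collect())
full_joined = sorted(full_outer_join(x, y))
print(full_joined)
```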

Lab 2 - Web Server Log Analysis with Apache Spark

Apache Common Log Format (CLF):
127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1839

Row(
    host = match.group(1),
    client_identd = match.group(2),
    user_id = match.group(3),
    date_time = parse_apache_time(match.group(4)),
    method = match.group(5),
    endpoint = match.group(6),
    protocol = match.group(7),
    response_code = int(match.group(8)),
    content_size = size
)
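
To make the group numbers concrete, here is a sketch of a regex-based parser in plain Python. The pattern, the helper name `parse_line`, the handling of a `-` size as 0, and the dropping of the UTC offset are all assumptions for illustration, not necessarily what the lab uses. It returns a dict rather than a Spark `Row` so that it runs without Spark.

```python
import re
from datetime import datetime

# Assumed pattern: groups 1-8 match the fields used in the Row above,
# and group 9 holds the content size.
LOG_PATTERN = (
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] '
    r'"(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)'
)

def parse_line(line):
    """Return a dict of fields for one CLF line, or None if it does not match."""
    match = re.search(LOG_PATTERN, line)
    if match is None:
        return None
    size_field = match.group(9)
    # Assumption: a size of '-' (no body sent) is treated as 0.
    size = 0 if size_field == "-" else int(size_field)
    # Simplification: keep the local timestamp and drop the UTC offset.
    local_time = match.group(4).split(" ")[0]
    return {
        "host": match.group(1),
        "client_identd": match.group(2),
        "user_id": match.group(3),
        "date_time": datetime.strptime(local_time, "%d/%b/%Y:%H:%M:%S"),
        "method": match.group(5),
        "endpoint": match.group(6),
        "protocol": match.group(7),
        "response_code": int(match.group(8)),
        "content_size": size,
    }

line = ('127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] '
        '"GET /images/launch-logo.gif HTTP/1.0" 200 1839')
record = parse_line(line)
print(record)
```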

distinctByKey

I wanted to do a distinct by key on a pair RDD. I don't know whether something like distinctByKey exists, so I wrote it this way (not sure it is correct):

dayHostCount = dayGroupedHosts.map(lambda group: (group[0], len(set(group[1]))))
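
As far as I know, pair RDDs have no built-in distinctByKey. Below is a plain-Python sketch (no Spark needed) comparing the group-then-set approach with an alternative that removes duplicate pairs first. The sample data and the RDD name `pairs` in the comments are hypothetical, and `dayGroupedHosts` is assumed to be the result of a `groupByKey()`. On this data both approaches give the same per-day counts, which suggests the group-then-set version is fine.

```python
from collections import defaultdict

# Hypothetical (day, host) pairs, with one repeated pair.
day_host_pairs = [(1, "h1"), (1, "h1"), (1, "h2"), (2, "h1")]

# Approach 1: group by key, then count the set of values per key.
# Spark (assumed RDD `pairs`):
#   pairs.groupByKey().map(lambda group: (group[0], len(set(group[1]))))
grouped = defaultdict(list)
for day, host in day_host_pairs:
    grouped[day].append(host)
by_grouping = sorted((day, len(set(hosts))) for day, hosts in grouped.items())

# Approach 2: drop duplicate pairs first, then count one per remaining pair.
# Spark (assumed RDD `pairs`):
#   pairs.distinct().map(lambda p: (p[0], 1)).reduceByKey(lambda a, b: a + b)
counts = defaultdict(int)
for day, _host in set(day_host_pairs):
    counts[day] += 1
by_distinct = sorted(counts.items())

print(by_grouping, by_distinct)
```

The second approach avoids building the full list of hosts per day, which may matter when a single key has many values.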