[TOC]

Part 0: Preliminaries

Each line in the ratings dataset (ratings.dat.gz) is formatted as:
UserID::MovieID::Rating::Timestamp ⇒ tuples of (UserID, MovieID, Rating) in ratingsRDD
Each line in the movies (movies.dat) dataset is formatted as:
MovieID::Title::Genres ⇒ tuples of (MovieID, Title) in moviesRDD
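
A minimal sketch of the parsing step in plain Python, assuming the fields are separated by `::` exactly as described above. The function names `parse_rating` and `parse_movie` and the sample lines are illustrative assumptions, not taken from the notes; in PySpark such functions would typically be passed to `map` on an RDD of text lines.

```python
def parse_rating(line):
    """Parse 'UserID::MovieID::Rating::Timestamp' into (UserID, MovieID, Rating)."""
    user_id, movie_id, rating, _timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), float(rating)


def parse_movie(line):
    """Parse 'MovieID::Title::Genres' into (MovieID, Title)."""
    movie_id, title, _genres = line.strip().split("::")
    return int(movie_id), title


# Sample lines made up for illustration only
rating_tuple = parse_rating("1::1193::5::978300760")
movie_tuple = parse_movie("1::Toy Story (1995)::Animation|Children's|Comedy")
print(rating_tuple)
print(movie_tuple)
```

The timestamp and genres are parsed but discarded, which matches the tuple shapes listed above.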


487650 ratings and …

[TOC]

STATISTICS, BUSINESS QUESTIONS, AND LEARNING TECHNIQUES

2 different kinds of statistics:

  • descriptive statistics

ex. median: describes the data, but cannot generalize beyond it

  • inferential statistics

ex. t-testing: makes inferences beyond the data; these techniques are leveraged for machine learning and prediction
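
To make the contrast concrete, here is a small sketch using only the Python standard library. The two samples are made up, and the helper `welch_t` is a hypothetical name; it computes Welch's t-statistic (one common form of t-test, chosen here as an assumption) rather than a full test with a p-value.

```python
import math
import statistics

# Made-up samples, e.g. ratings from two groups of users
group_a = [4.0, 5.0, 3.0, 4.0, 5.0]
group_b = [2.0, 3.0, 3.0, 2.0, 4.0]

# Descriptive: summarizes the sample we have, nothing more
median_a = statistics.median(group_a)


def welch_t(x, y):
    """Welch's t-statistic for the difference between two sample means."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    var_x, var_y = statistics.variance(x), statistics.variance(y)  # sample variances
    return (mean_x - mean_y) / math.sqrt(var_x / len(x) + var_y / len(y))


# Inferential: a statistic used to reason about the populations behind the samples
t_stat = welch_t(group_a, group_b)
print(median_a, t_stat)
```

Turning the t-statistic into a conclusion still requires comparing it against a t-distribution with the appropriate degrees of freedom, which this sketch leaves out.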

supervised learning (clf, reg), unsupervised learning (clustering, dim-reduction) → UL often used in …

[TOC]

DATA CLEANING

ex. deal with missing data, entity resolution, unit mismatch, ...

deal with non-ideal samples ⇒ tradeoff between simplicity and accuracy.
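
As a rough illustration of two of the problems named above (missing data and unit mismatch), the sketch below cleans a few hypothetical records. The field names, the `clean` helper, and the policy of dropping records with missing values are all assumptions chosen for simplicity; dropping is the simple end of the simplicity/accuracy tradeoff.

```python
# Hypothetical records: heights reported in mixed units, one value missing
records = [
    {"name": "a", "height": 180.0, "unit": "cm"},
    {"name": "b", "height": 1.75, "unit": "m"},
    {"name": "c", "height": None, "unit": "cm"},
]


def clean(records):
    """Drop records with a missing height and convert the rest to centimeters."""
    cleaned = []
    for rec in records:
        if rec["height"] is None:
            continue  # simplest policy: discard incomplete records
        height_cm = rec["height"] * 100 if rec["unit"] == "m" else rec["height"]
        cleaned.append({"name": rec["name"], "height_cm": height_cm})
    return cleaned


cleaned = clean(records)
print(cleaned)
```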

DATA QUALITY PROBLEMS

data quality problems:

  • Conversions in complex pipelines can mess up data
  • Combining multiple datasets can result in errors
  • Data degrades in accuracy or loses value over …

[TOC]

RELATIONAL DATABASE

review: key data management concepts:

  • data model
  • schema
  • relational data model

structured data: have a specific schema to start with

relational database: a set of relations. A relation has 2 parts:

  • schema: name of relation, name and type of columns

  • instance:

any data at given time (cardinality …

[TOC]

KEY DATA MANAGEMENT CONCEPTS

data model: collection of concepts for describing data
schema: a description of a particular collection of data using a given data model

structure spectrum:
semi-structured data: apply schema after creating data.

FILES

files: named collection of bytes, in hierarchical namespace (but: In a Content-Addressable Storage …

[TOC]

PYTHON SPARK (PYSPARK)

a Spark program consists of 2 programs:

  • driver program: runs on driver machine
  • worker program: runs on local threads or cluster nodes

a Spark program first creates a SparkContext object:

  • tells how and where to access a cluster
  • the shell automatically creates the sc variable
  • in iPython …