
Big Data Analytics



Introduction

5 Vs of Big Data
  1. Volume
  2. Velocity
  3. Variety
  4. Veracity
  5. Value
Big Data Infrastructure (Slide P38)
  • Reliable distributed file system
  • Data kept in “chunks” spread across machines
  • Each chunk is replicated on different machines, so recovery is seamless if a machine or disk fails
  • Bring computation directly to the data
  • Chunk servers also serve as compute servers
Societal concerns:
privacy, algorithmic black boxes, and filter bubbles.

MapReduce

Programming model:
MapReduce is a programming model together with a parallel, data-aware, fault-tolerant implementation.
  • mappers: (Slide P11)
    The Map function takes an input element as its argument and produces zero or more key-value pairs. The types of keys and values are each arbitrary. Further, keys are not “keys” in the usual sense; they do not have to be unique. Rather, a Map task can produce several key-value pairs with the same key, even from the same element.
  • reducers: (Slide P11)
    The Reduce function’s argument is a pair consisting of a key and its list of associated values. The output of the Reduce function is a sequence of zero or more key-value pairs. These key-value pairs can be of a type different from those sent from Map tasks to Reduce tasks, but often they are the same type. We shall refer to the application of the Reduce function to a single key and its associated list of values as a reducer. A Reduce task receives one or more keys and their associated value lists. That is, a Reduce task executes one or more reducers. The outputs from all the Reduce tasks are merged into a single file.
  • combiners: When the Reduce function is associative and commutative, we can push some of what the reducers do to the Map tasks (see the sketch below).
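To make the three roles concrete, here is a minimal in-memory word-count sketch in plain Python (no Hadoop required); map_fn, reduce_fn, and the dictionary-based shuffle are illustrative stand-ins for what the framework does across machines:

from collections import defaultdict

def map_fn(document):
    # Map: emit a (word, 1) key-value pair for every word in the element
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce (one reducer): a key plus its list of associated values
    yield (word, sum(counts))

# Simulated shuffle: group all mapped values by key
documents = ["big data big ideas", "data beats ideas"]
groups = defaultdict(list)
for doc in documents:
    for key, value in map_fn(doc):
        groups[key].append(value)

output = [pair for key, values in groups.items() for pair in reduce_fn(key, values)]
print(sorted(output))  # [('beats', 1), ('big', 2), ('data', 2), ('ideas', 2)]

Because sum is associative and commutative, the same reduce_fn could also run as a combiner on each Map task’s local output.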
Execution framework
  • Strengths:
    1. MapReduce can process data that doesn’t fit in memory, through parallelization. Every map task will process an input split only, which fits in memory.
    2. MapReduce is a parallel programming environment, for clusters of commodity computers.
    3. MapReduce is a fault-tolerant framework
    4. MapReduce leverages a distributed file system (GPFS – HDFS) which allows tasks to be scheduled where the data is (data locality).
    5. MapReduce replicates tasks (so-called “backup tasks”) to deal with stragglers
    6. There are multiple open source implementations
  • Weaknesses:
    1. The MapReduce master is a single point of failure and possibly a performance bottleneck
    2. MapReduce can only process data represented as key-value pairs
    3. The fact that MapReduce is meant for large clusters makes it prone to failures and high communication costs
    4. MapReduce’s intermediate results are stored on disk
Update (Mar 21): no programming skills in MapReduce are expected.

Spark

Differences with MapReduce (comparison):
  • Speed / data processing: Spark speeds up batch processing via in-memory computation and processing optimization; MapReduce stores intermediate data on disk, which lets it handle very large datasets but makes it slower.
  • Real-time analysis: Spark supports it; MapReduce does not.
  • Ease of use: Spark is easier to use.
  • Fault tolerance: Spark uses RDDs and various data storage models for fault tolerance, minimizing network I/O; Hadoop achieves fault tolerance through replication.
  • Security: Spark’s security is still in its infancy; Hadoop’s is more mature.
  • Overall, Apache Spark is a much more advanced cluster computing engine than MapReduce.
Programming model: RDDs, DataFrames, transformations, actions. Update (Mar 21): that means that you are supposed to know how to write programs in Spark (a short example follows).
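A short PySpark sketch (assuming a local Spark installation) showing the transformation/action distinction; the word-count pipeline itself is just an illustrative example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["big data big ideas", "data beats ideas"])
counts = (lines
          .flatMap(lambda line: line.split())  # transformation (lazy)
          .map(lambda word: (word, 1))         # transformation (lazy)
          .reduceByKey(lambda a, b: a + b))    # transformation (lazy)
print(counts.collect())  # action: triggers the actual computation
spark.stop()

Nothing is computed until collect() runs; Spark keeps the intermediate RDDs in memory rather than writing them to disk as MapReduce does.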

Recommendation systems

Formal Model:
X: set of customers
S: set of items
u: X × S → R: utility function, where R is a totally ordered set of ratings (e.g., 1-5 stars)
  • Content-based:
    • Main idea: Recommend to customer x items similar to the previous items rated highly by x.
    • Item profiles: set of item features
    • User profiles: e.g., average rating and variation
    • Cosine distance: measured through the angle between profile vectors x and i, with cos(θ) = (x · i) / (‖x‖ ‖i‖); the larger the cosine, the more similar the profiles
  • Collaborative filtering
    • Pearson’s correlation: sim(x, y) = Σ_{s∈S_xy} (r_xs − r̄_x)(r_ys − r̄_y) / ( √(Σ_{s∈S_xy} (r_xs − r̄_x)²) · √(Σ_{s∈S_xy} (r_ys − r̄_y)²) ), where S_xy is the set of items rated by both x and y
    • Item-based: (Slide P27)
      1. For item i, find other similar items
      2. Estimate the rating for item i based on the ratings for the similar items
      3. Can use the same similarity metrics and prediction functions as the user-based model, i.e., r_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij, where N(i;x) is the set of items similar to i that x has rated and s_ij is the item-item similarity
    • User-based: (Slide P24)
      r_xi = Σ_{y∈N} s_xy · r_yi / Σ_{y∈N} s_xy, where N is the set of the k users most similar to x who have rated item i and s_xy = sim(x, y)
  • Latent factors: approximate the utility matrix as a product of low-rank item-factor and user-factor matrices (R ≈ Q · Pᵀ)

  • Pros and cons of each method
    • Content-based:
      • Pros
        • +No need for data on other users
        • +Able to recommend to users with unique tastes
        • +Able to recommend new & unpopular items
        • +Easy to explain
      • Cons
        • -Finding the appropriate features is hard
        • -New users: no ratings yet, so no profile to start from
        • -Overspecialization:
          1. Never recommends items outside the user’s content profile
          2. Users may have multiple interests
          3. Unable to exploit quality judgments of other users
    • Collaborative filtering (Slide P34)
      • Pros
        • +Works for any kind of item; no feature selection needed
      • Cons
        • -Cold start: needs enough users in the system to find a match
        • -Sparsity: the user/ratings matrix is sparse, so it is hard to find users who have rated the same items
        • -First rater: cannot recommend an item that has not been previously rated
        • -Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items
  • Biases (user and item)
    • Latent factors with bias (Slide P23): r̂_xi = μ + b_x + b_i + q_i · p_x, where μ is the overall mean rating and b_x, b_i are the user and item biases
  • Evaluation with RMSE: √( Σ_(x,i) (r̂_xi − r_xi)² / N ) over the N known ratings (a sketch follows)
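A minimal numpy sketch of latent factors with user/item biases, trained by SGD and evaluated with RMSE on the known entries; the toy matrix mirrors matrix M from the exercises below, and all names (P, Q, bu, bi) and hyperparameters are illustrative assumptions:

import numpy as np

M = np.array([[4, 5, 0, 5],
              [0, 3, 4, 3],
              [2, 0, 1, 3]], dtype=float)   # 0 = unknown rating
known = M > 0

k, lr, reg = 2, 0.01, 0.1                   # latent dims, learning rate, regularization
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(M.shape[0], k))   # user factors
Q = rng.normal(scale=0.1, size=(M.shape[1], k))   # item factors
mu = M[known].mean()                              # global bias
bu = np.zeros(M.shape[0])                         # user biases
bi = np.zeros(M.shape[1])                         # item biases

for _ in range(2000):                       # SGD over the known entries
    for x, i in zip(*np.where(known)):
        err = M[x, i] - (mu + bu[x] + bi[i] + P[x] @ Q[i])
        bu[x] += lr * (err - reg * bu[x])
        bi[i] += lr * (err - reg * bi[i])
        P[x], Q[i] = (P[x] + lr * (err * Q[i] - reg * P[x]),
                      Q[i] + lr * (err * P[x] - reg * Q[i]))

pred = mu + bu[:, None] + bi[None, :] + P @ Q.T
rmse = np.sqrt(((M - pred)[known] ** 2).mean())   # RMSE on known ratings
print(round(rmse, 3))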


Clustering

Hierarchical clustering (agglomerative)(Slide P13)
1. Each point is a cluster
2. Repeatedly combine the two “nearest” clusters into one (a one-dimensional sketch follows)
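A tiny sketch of agglomerative clustering, merging the clusters with the closest centroids; the function name and the one-dimensional sample points are illustrative assumptions:

def agglomerative(points, k):
    # Each point starts as its own cluster
    clusters = [[p] for p in points]
    centroid = lambda c: sum(c) / len(c)
    while len(clusters) > k:
        # Find the pair of clusters with the closest centroids
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: abs(centroid(clusters[ij[0]]) - centroid(clusters[ij[1]])))
        clusters[i] += clusters.pop(j)   # merge them into one
    return clusters

print(agglomerative([1, 4, 9, 16, 25, 36, 49, 64, 81], 3))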
K-means (Slide P22)
K-means++
The only difference from K-means is the initialization:
1. Select the first centroid c uniformly at random from X; C = {c}
2. while |C| < k:
    pick x in X with probability min_i ‖x − C_i‖² / Σ_{x'∈X} min_i ‖x' − C_i‖²
    C = C ∪ {x}
(A sketch of this seeding step follows.)
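A short numpy sketch of the k-means++ seeding step (kmeans_pp_init is an illustrative name); the rest of K-means is unchanged:

import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    # First centroid: chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        # D(x)^2: squared distance to the nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()                 # sample proportionally to D(x)^2
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

X = np.array([[0., 0.], [0., 1.], [10., 0.], [10., 1.], [5., 5.]])
print(kmeans_pp_init(X, 3))

Points far from the current centroids get a high selection probability, which spreads the initial centroids out and avoids K-means’ sensitivity to bad random starts.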
CURE (Slide P32)

Frequent itemsets (Slide)

Definitions:
  • frequent itemsets [P8]->Example[P9]: Given a support threshold s, the sets of items that appear in at least s baskets
  • association rules [P10]->Example[P12]: {i_1, …, i_k} → j means: if a basket contains all of i_1, …, i_k, then it is likely to also contain j
  • support [P8]->Example[P9]: For a given itemset I, the number of baskets that contain all items in I
  • confidence [P10]->Example[P12]: conf(I → j) = support(I ∪ {j}) / support(I), the probability of j given I
  • interest [P11]->Example[P12]: Interest(I → j) = conf(I → j) − Pr[j], where Pr[j] is the fraction of all baskets that contain j
Apriori algorithm: count single items in one pass, then count only candidate pairs of frequent items in a second pass, pruning any candidate with an infrequent subset (see the sketch below).
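A minimal two-pass sketch of the A-Priori idea restricted to pairs (apriori_pairs is an illustrative name); it is run on the midterm baskets from section 6.1 below and reproduces the answer given there:

from itertools import combinations
from collections import Counter

def apriori_pairs(baskets, s):
    # Pass 1: count single items, keep those with support >= s
    item_counts = Counter(item for b in baskets for item in b)
    frequent = {i for i, c in item_counts.items() if c >= s}
    # Pass 2: count only pairs whose two members are both frequent
    pair_counts = Counter()
    for b in baskets:
        for pair in combinations(sorted(set(b) & frequent), 2):
            pair_counts[pair] += 1
    return [p for p, c in pair_counts.items() if c >= s]

baskets = [{4, 5, 7, 8, 10}, {1, 4, 5, 6, 8, 10}, {5, 6, 9},
           {2, 4, 6, 7, 8}, {7}, {8}, {1, 2, 4, 7, 9, 10}]
print(apriori_pairs(baskets, 3))   # [(4, 7), (4, 8), (4, 10)]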
PCY algorithm
  • The algorithm:
    1. Pass 1:
      Count the individual items; with the remaining memory, hash each pair in each basket to a bucket and keep a count per bucket.
    2. Mid-pass: Replace the bucket counts by a bit vector: bit = 1 if the bucket count is at least the support threshold s, bit = 0 otherwise.
    3. Pass 2: Only count pairs {i, j} that hash to a frequent bucket, i.e.:
      1. Both i and j are frequent items.
      2. {i, j} hashes to a bucket whose bit is 1.
  • An Example:
Hash functions
Multi-stage
Multi-hash
Random sampling
SON algorithm

Stream processing

  • Sampling

    • Sampling a fixed proportion (as the stream grows, the sample also grows) (Slide P14)
      To get a sample of a/b fraction of the stream:
      1. Hash each tuple’s key uniformly into b buckets
      2. Pick the tuple if its hash value is at most a
      E.g.:
      How to generate a 30% sample?
      Ans:
      Hash into b = 10 buckets; take the tuple if its hash value is at most 3 (i.e., it falls in the first 3 of the 10 buckets)
    • Sampling a fixed-size sample (reservoir sampling) (Slide P17)
      # Reservoir sampling
        1. Store all of the first s elements of the stream in S
        2. Suppose we have seen n-1 elements, and now the n-th element arrives
          2.1 With probability s/n, keep the n-th element, else discard it
          2.2 If we keep it, it replaces one of the s elements in the sample S, picked uniformly at random
      After n elements, the sample contains each element seen so far with probability s/n (a runnable sketch follows)
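A runnable sketch of reservoir sampling (reservoir_sample is an illustrative name):

import random

def reservoir_sample(stream, s, rng=random.Random(0)):
    S = []
    for n, x in enumerate(stream, start=1):
        if n <= s:
            S.append(x)                 # store the first s elements
        elif rng.random() < s / n:      # keep the n-th element w.p. s/n
            S[rng.randrange(s)] = x     # ... replacing a uniformly random one
    return S

print(reservoir_sample(range(1000), 5))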
  • Counting 1s in sliding windows: DGIM (Slide P24)

    Problem: Given a stream of 0s and 1s, how many 1s are in the last N bits?
    DGIM:
    • Idea: Summarize blocks of the stream as buckets containing specific numbers of 1s (powers of 2)
    • A bucket stores:
      • a: The timestamp of its end [O(log N) bits]
      • b: The number of 1s between its beginning and end [O(log log N) bits]
    • Algorithm:
      a. When a new bit comes in, drop the oldest bucket if its end timestamp is more than N time units before the current time
      b. if current bit == 0:
            do nothing further
         else:
            b.1 create a new bucket of size 1, end time = current time;
            b.2 if there are now 3 buckets of size 1, combine the oldest 2 into one of size 2;
            b.3 if there are now 3 buckets of size 2, combine the oldest 2 into one of size 4;
            b.4 ...
      num 1s ≈ sum(sizes of all buckets except the oldest) + size(oldest)/2
    • An Example: (Slide P35)
    • Error bound:
      The uncertainty comes only from the oldest bucket, at most half of whose 1s may fall outside the window, so the estimate is within 50% of the true count (a compact sketch follows).
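A compact sketch of DGIM, assuming the standard invariant of at most two buckets per size (function names are illustrative); buckets are (end_timestamp, size) pairs kept newest-first:

def dgim_update(buckets, t, bit, N):
    # Expire the oldest bucket if it ended more than N steps ago
    if buckets and buckets[-1][0] <= t - N:
        buckets.pop()
    if bit:
        buckets.insert(0, (t, 1))       # new bucket of size 1
        i = 0
        while i + 2 < len(buckets) and buckets[i][1] == buckets[i + 2][1]:
            # three buckets of one size: merge the two oldest of that size
            merged = (buckets[i + 1][0], buckets[i + 1][1] * 2)
            buckets[i + 1:i + 3] = [merged]
            i += 1

def dgim_estimate(buckets):
    # All bucket sizes except the oldest, plus half of the oldest
    if not buckets:
        return 0
    return sum(size for _, size in buckets[:-1]) + buckets[-1][1] // 2

buckets = []
for t, bit in enumerate([1, 0, 1, 1, 1, 0, 1, 0, 1, 1]):
    dgim_update(buckets, t, bit, N=6)
print(dgim_estimate(buckets))   # estimate of the 1s in the last 6 bits (here: 4)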
  • Bloom filter: filtering a data stream

    • Notes:
      1. S is the set of m keys we want to filter.
      2. B is a bit array of n bits.
      3. h_1, …, h_k are k independent hash functions mapping keys to positions in B.
    • Algorithm:
      1. Set B to all 0s
      2. Hash each element s in S with every hash function h_i and set B[h_i(s)] = 1
    • Judgment: accept a new stream element x only if B[h_i(x)] = 1 for all i = 1, …, k
    • An Example: (Time: 5:31)
    • The fraction of 1s in B (Slide: 13): after hashing m keys k times into n bits, it is approximately 1 − e^(−km/n)
      • The false positive probability: (1 − e^(−km/n))^k
      • The optimal number of hash functions: k = (n/m) · ln 2
    • Summary:
      1. No false negatives (false positives are possible)
      2. Suitable for hardware implementation
      3. [1 big B] vs. [k small Bs]: the same false positive rate, but 1 big B is simpler (a sketch follows)
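A minimal Bloom filter sketch; deriving the k hash functions by salting SHA-256 is an implementation assumption, not part of the original notes:

import hashlib

class BloomFilter:
    def __init__(self, n_bits, k):
        self.n, self.k = n_bits, k
        self.bits = [0] * n_bits          # the bit array B

    def _positions(self, key):
        # k "independent" hash functions via a per-function salt
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # True can be a false positive; False is never a false negative
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter(n_bits=1024, k=3)
bf.add("alice@x.com")
bf.add("bob@y.org")
print(bf.might_contain("alice@x.com"), bf.might_contain("eve@z.net"))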
  • Counting distinct elements: Flajolet-Martin

    • Notes:
      • h: a hash function mapping each element to at least log2(N) bits
      • N: the number of possible elements
    • Algorithm: (Slide: 20)
      1. Hash each element x to a bit string h(x), with len(h(x)) > log2(N)
      2. For each stream element a, let r(a) be the number of trailing 0s of h(a)
      3. R = max over the stream of r(a)
      4. Estimate of the number of distinct elements = 2^R
    • A problem:
      R changes in unit steps, so a single extra trailing zero doubles the estimate; one hash function alone is very unstable
    • The solution: use many hash functions to get many sample estimates, combined as follows:
      1. Partition the samples into small groups
      2. Take the median of each group
      3. Take the average of the medians (a sketch follows)
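A sketch of Flajolet-Martin with the median-of-groups/average-of-medians combination described above (the names and the salted-hash trick are assumptions):

import hashlib
import statistics

def tail_zeros(v, width=32):
    # r(a): number of trailing zero bits of v (width if v == 0)
    return (v & -v).bit_length() - 1 if v else width

def fm_distinct(stream, num_hashes=30, group_size=5):
    estimates = []
    for i in range(num_hashes):
        R = 0
        for a in stream:
            digest = hashlib.sha256(f"{i}:{a}".encode()).digest()
            h = int.from_bytes(digest[:4], "big")   # 32-bit hash value
            R = max(R, tail_zeros(h))
        estimates.append(2 ** R)                    # one 2^R sample per hash
    groups = [estimates[j:j + group_size]
              for j in range(0, num_hashes, group_size)]
    medians = [statistics.median(g) for g in groups]
    return sum(medians) / len(medians)              # average of the medians

print(fm_distinct([3, 1, 4, 1, 5, 9, 2, 6, 5]))     # true answer: 7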

Finding Similar Items: Locality-sensitive hashing

The Goal
Find near-neighbors in high-dimensional space.
Jaccard similarity and distance
  • Jaccard similarity: sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
  • Jaccard distance: d(C1, C2) = 1 − sim(C1, C2)
  • An Example: sim({a,b,c,d}, {b,c,d,e,f}) = 3/6 = 1/2
The Algorithm
Step 1: Shingling: Convert documents to sets (of k-shingles).
Step 2: Min-hashing (building the signature matrix)
  1. Convert large sets to short signatures, while preserving similarity
  2. How to do it?
    Pick a random permutation of the rows of the characteristic matrix; the min-hash value of a column C is the index of the first row (in permuted order) in which C has a 1. Repeating with several permutations (implemented as row-hash functions) gives C’s signature column, and Pr[h(C1) = h(C2)] = sim(C1, C2).
Step 3: Locality-sensitive hashing
  • Goal: find documents with Jaccard similarity at least some threshold s.
  • Idea: Hash the columns of the signature matrix several times; similar columns will be hashed into the same bucket with high probability.
  • How to do it?
    Divide the signature matrix into b bands of r rows; hash each band of each column into buckets; columns that land in the same bucket for at least one band become candidate pairs.
  • Selecting the number of bands b and rows r for better performance:
    The probability that two columns with similarity s become candidates is 1 − (1 − s^r)^b; choose b and r so that the threshold of this S-curve, roughly (1/b)^(1/r), sits near the target similarity (a min-hashing sketch follows).
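A small sketch that builds a min-hash signature matrix with row-hash functions h(x) = (a·x + b) mod p standing in for permutations; run on the characteristic matrix of midterm Q5 below, it reproduces the signature matrix given there:

import numpy as np

def minhash_signatures(char_matrix, hash_params, p):
    # sig[i][c] = min over rows r with char_matrix[r][c] == 1 of h_i(r)
    n_rows, n_sets = char_matrix.shape
    sig = np.full((len(hash_params), n_sets), np.inf)
    for r in range(n_rows):
        hashes = [(a * r + b) % p for a, b in hash_params]
        for c in range(n_sets):
            if char_matrix[r, c]:
                for i, h in enumerate(hashes):
                    sig[i, c] = min(sig[i, c], h)
    return sig.astype(int)

# Characteristic matrix for S1..S4 (rows 0-4) from Q5 in section 8.1
C = np.array([[1, 0, 0, 1],
              [0, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 0, 1, 0]])
# h1(x) = x + 1 mod 5, h2(x) = 3x + 1 mod 5
print(minhash_signatures(C, [(1, 1), (3, 1)], p=5))   # [[1 3 0 1], [0 2 0 0]]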

Graph Analysis: PageRank

Recursive flow formulation
  • The idea: a page is important if important pages link to it; each page i splits its importance r_i evenly over its d_i out-links, giving the flow equation r_j = Σ_{i→j} r_i / d_i with Σ_j r_j = 1
  • An Example: write one flow equation per page and solve the system (figure omitted)
Matrix formulation

  • The flow equations can be written r = M · r, where M is a column-stochastic matrix: M_ji = 1/d_i if i → j, and 0 otherwise
  • Algorithm:
    Initialize r^(0) = [1/N, …, 1/N]^T
  • Iteration:
    r^(t+1) = M · r^(t), repeated until |r^(t+1) − r^(t)|_1 < ε
Resolution: power method
Iterating r ← M · r amounts to applying successive powers of M to r^(0); r converges to the principal eigenvector of M.
Random walk interpretation
  • Where to go at time slot t+1: a random surfer on page i follows one of its out-links uniformly at random; if p(t) is the page distribution at time t, then p(t+1) = M · p(t), and PageRank is the stationary distribution of this walk
Teleports to solve dead ends and spider traps
  • What are dead ends and spider traps?
    A dead end is a page with no out-links: importance “leaks out” of the system. A spider trap is a group of pages with no links out of the group: the random walk gets stuck and the group absorbs all importance.
  • How to solve? Teleports
    At each step, with probability β follow a random out-link and with probability 1 − β jump to a random page (for dead ends, always teleport), giving r_j = Σ_{i→j} β · r_i / d_i + (1 − β)/N, with β typically 0.8-0.9
  • An Example: (figure omitted; a power-iteration sketch follows)
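A power-iteration sketch with teleports (pagerank is an illustrative name); M must be column-stochastic, so this assumes no dead ends:

import numpy as np

def pagerank(M, beta=0.85, eps=1e-8):
    n = M.shape[0]
    r = np.full(n, 1.0 / n)                       # start from the uniform vector
    while True:
        r_new = beta * (M @ r) + (1 - beta) / n   # follow links or teleport
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new

# Toy web: y -> {y, a}, a -> {y, m}, m -> {a}; column j holds page j's out-links
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(pagerank(M))   # rank vector for (y, a, m), summing to 1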

Some Exercises

1. Introduction:


2. MapReduce:


3. Spark:

How do you compute the maximum element of an RDD?
Ans:
rdd.reduce(lambda a, b: max(a, b))  # equivalently: rdd.max()

4. Recommendation System:

  Q12 - Three computers A, B and C have the numerical features listed below
(these features have been normalized between 0 and 1). A certain user U has
rated the three computers as follows. A: 4 stars, B: 2 stars, C: 5 stars. Compute
a user profile for user U, with components for processor speed, disk size and
main memory.
 
    Feature         :  A    B    C
    Processor Speed :  1    0   0.6
    Disk Size       : 0.25  0    1
    Main-Memory Size:  1   0.5   0
Ans:
Processor Speed:
Disk Size:
Main-Memory Size:
Therefore,
 Q13 - For user U in the previous question, how would the following computers
be ranked by a content-based recommendation system using the cosine similarity
measure?
 
  Feature            D   E   F
  Processor Speed : 0.1  0   0
  Disk Size       : 0   0.1  0
  Main-Memory Size: 0    0  0.1
Ans:



Therefore:
  Q14 - In the utility matrix below (called M in the remainder of this exam),
columns represent items and rows represent users. What is the value of MA,c
predicted by user mean?
 
      a   b   c   d
  A   4   5       5
  B       3   4   3
  C   2       1   3
Ans: user mean = (4 + 5 + 5) / 3 = 14/3 ≈ 4.67 (the mean of user A’s known ratings)
Q15 - What is the RMSE (root-mean-square error) associated with the following
factorization of matrix M in the previous question?
 
[
 1 1 1 1
 1 1 1 1
 1 1 1 1
]
Ans: the errors on the nine known ratings are 3, 4, 4; 2, 3, 2; 1, 0, 2, so RMSE = √(63/9) = √7 ≈ 2.65 (root of the mean squared error over the nonblank entries)
Q16 - What is the value of mA,c predicted by the factorization in the previous
question?
 
[
 1 1 1 1
 1 1 1 1
 1 1 1 1
]
Ans: every entry of the reconstructed matrix is 1, so m_A,c = 1

5. Clustering:

  • 5.1 Midterm questions
    (Questions 9, 11)
    1. What is the centroid of and ?
      Ans:
    2. Which limitation of kmeans is addressed by CURE?
      Ans: CURE can produce clusters of arbitrary shape
  • 5.2 Suggested Exercise
    Perform a hierarchical clustering of the one-dimensional set of points , assuming clusters are represented by their centroid (average), and at each step the clusters with the closest centroids are merged.
  • 5.3 Extra Exercise
    Ans:


6. Frequent itemsets:

  • 6.1 Midterm questions
    (Questions 6-8)
    Q6 - Suppose there are 10 items, numbered 1 to 10. If the support threshold
    is 3, what is the list of frequent item pairs in the following baskets?
    basket 1: {4, 5, 7, 8, 10}
    ------
    basket 2: {1, 4, 5, 6, 8, 10}
    ------
    basket 3: {5, 6, 9}
    ------
    basket 4: {2, 4, 6, 7, 8}
    ------
    basket 5: {7}
    ------
    basket 6: {8}
    ------
    basket 7: {1, 2, 4, 7, 9, 10}
    Ans: {4, 7}, {4, 8}, {4, 10}
    Q7 - Using the set of baskets of the previous question, what is the confidence
    of the 6 → 10 association rule?
    Ans: support({6, 10}) / support({6}) = 1/3 (6 appears in baskets 2, 3, 4; only basket 2 also contains 10)
    Q8 - What is the interest of the association rule in the previous question?
    Ans: interest = confidence − fraction of baskets containing 10 = 1/3 − 3/7 = −2/21 ≈ −0.10

7. Stream processing:

  • 7.1 Midterm questions None!!!
  • 7.2 Suggested Exercise [Exercise 4.4.1 page 145]
Exercise 4.4.1 : Suppose our stream consists of the integers 3, 1, 4, 1, 5, 9, 2,
6, 5. Our hash functions will all be of the form h(x) = ax+b mod 32 for some
a and b. You should treat the result as a 5-bit binary integer. Determine the
tail length for each stream element and the resulting estimate of the number of
distinct elements if the hash function is:
(a) h(x) = 2x + 1 mod 32.
(b) h(x) = 3x + 7 mod 32.
(c) h(x) = 4x mod 32.
Ans:
(a) Every hash value 2x + 1 mod 32 is odd, so all tail lengths are 0; R = 0 and the estimate is 2^0 = 1.
(b) The hash values are 16, 10, 19, 10, 22, 2, 13, 25, 22; the longest tail is h(3) = 16 = 10000₂ with 4 trailing 0s, so the estimate is 2^4 = 16.
(c) The hash values are 12, 4, 16, 4, 20, 4, 8, 24, 20; the longest tail is h(4) = 16 = 10000₂ with 4 trailing 0s, so the estimate is 2^4 = 16.


8. LSH:

  • 8.1 Midterm questions (Questions 2-5)
      Q3 - What is the Jaccard similarity of the two following sets: {a,b,c,d} and
    {b,c,d,e,f}?
    Ans: |{b, c, d}| / |{a, b, c, d, e, f}| = 3/6 = 1/2
      Q4 - What is the cosine similarity of the two following vectors: (1,2,3) and
    (2,4,6)?
    Ans: (2, 4, 6) = 2 · (1, 2, 3), so the vectors point in the same direction and the cosine similarity is 1
    Q5 - Using hash functions h1 and h2 (that is, h1 and h2 are used to implement
    permutation functions), what is the min-hash signature matrix of the following
    characteristic matrix representing sets S1 to S4?
     
    Row S1 S2 S3 S4
      0  1  0  0  1
      1  0  0  1  0
      2  0  1  0  1
      3  1  0  1  1
      4  0  0  1  0
     
    h1(x) = x + 1 mod 5
    h2(x) = 3x + 1 mod 5
    Ans:
    Row S1 S2 S3 S4 h1 h2
      0  1  0  0  1  1  1
      1  0  0  1  0  2  4       
      2  0  1  0  1  3  2    
      3  1  0  1  1  4  0
      4  0  0  1  0  0  3
     
          S1 S2 S3 S4
      h1  1  3  0  1
      h2  0  2  0  0

9. PageRank:

  • 9.1 Midterm questions None!!!
  • 9.2 Suggested Exercise [Exercise 5.1.1 page 193]
