Cloud computing: text mining

10-15 UGC (user-generated content), text mining overview

  • 14-15 The model used is bag of words: each document is treated as a set of word counts, and word order is ignored

16-30 Text mining concepts

  • Steps & concepts:
    • 16 Tokenization
    • 17, 18 TF (term frequency)
    • 19 Stop words
    • 20, 21 IDF (inverse document frequency), TF-IDF
  • Use cases:
    • 23 Word cloud
    • 24 Spam filtering
    • 26 Co-occurrence analysis: find terms A and B that are related because they often appear together
    • 27 Topic modelling
      • 28 LDA (Latent Dirichlet Allocation) model
    • 30 Business case
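
The steps listed above (tokenization, stop-word removal, TF, IDF, TF-IDF) can be sketched in plain Python. This is a minimal sketch, not the formulas from the slides: it assumes TF is the relative frequency of a term in a document and IDF is log(N / df), which is one common variant. The corpus DOCS, the stop-word set STOPWORDS, and all function names are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
DOCS = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
STOPWORDS = {"the", "on", "and", "are"}


def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def term_frequency(tokens):
    """Relative frequency of each term within one document (bag of words)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}


def inverse_document_frequency(tokenized_docs):
    """Assumed variant: log(N / df), df = number of documents containing the term."""
    n_docs = len(tokenized_docs)
    df = Counter(term for tokens in tokenized_docs for term in set(tokens))
    return {term: math.log(n_docs / d) for term, d in df.items()}


def tf_idf(docs):
    """Return one {term: score} dictionary per document."""
    tokenized = [tokenize(d) for d in docs]
    idf = inverse_document_frequency(tokenized)
    return [
        {term: tf * idf[term] for term, tf in term_frequency(tokens).items()}
        for tokens in tokenized
    ]


SCORES = tf_idf(DOCS)
```

In this toy corpus "sat" occurs in two of the three documents while "cat" occurs in only one, so within the first document "cat" gets a higher TF-IDF score than "sat". That is the intended effect of IDF: terms that appear in many documents are weighted down.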

Spark

  • 31-37 basics
  • 38- Resilient Distributed Dataset RDD
    • 38 Definition
    • 39 RDD Operations in Spark:
      • Actions
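        • count(): returns the number of records
        • take(n): returns the first n records
        • collect(): returns all records to the driver program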
      • Transformations
        • map(function): creates a new RDD by applying the function to each record
        • filter(function): creates a new RDD that keeps only the records for which the function returns true
          • 41-42 example
        • Single RDD transformations:
          • flatMap(function): maps each element to zero or more elements and flattens the results
          • distinct(): filters out duplicates
          • sortBy(function): sorts the records by the key the function returns
          • 44 example
        • Multi RDD transformations:
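          • intersection(otherRDD): records present in both RDDs
          • union(otherRDD): records of both RDDs, duplicates kept
          • zip(otherRDD): pairs up the records of the two RDDs by position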
          • 45-48 example
  • Lazy execution: data in an RDD is not processed until an action is performed
    • RDD creation
      • sc.textFile(…) <- the path can refer to more than one file
      • myRdd = sc.parallelize(mydata) <- creates an RDD from an in-memory collection
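
Lazy execution can be illustrated without a cluster. The sketch below is a pure-Python toy, not Spark and not PySpark's implementation: ToyRDD is a hypothetical class whose method names mirror the transformations and actions listed above, and parallelize is written as a classmethod here only for convenience (in Spark it is called on the context, as in sc.parallelize). Unlike Spark, this toy distinct() keeps the original order. The traced_upper helper and the calls list are also assumptions, added to show when work actually happens.

```python
from itertools import chain, islice


class ToyRDD:
    """Hypothetical stand-in for an RDD; NOT Spark's implementation or API."""

    def __init__(self, source):
        # source is a zero-argument function returning a fresh iterator,
        # so nothing is computed until an action asks for data.
        self._source = source

    @classmethod
    def parallelize(cls, data):
        items = list(data)
        return cls(lambda: iter(items))

    # Transformations: each returns a new ToyRDD and does no work yet.
    def map(self, f):
        return ToyRDD(lambda: (f(x) for x in self._source()))

    def filter(self, f):
        return ToyRDD(lambda: (x for x in self._source() if f(x)))

    def flatMap(self, f):
        return ToyRDD(lambda: (y for x in self._source() for y in f(x)))

    def distinct(self):
        return ToyRDD(lambda: iter(dict.fromkeys(self._source())))

    def sortBy(self, key):
        return ToyRDD(lambda: iter(sorted(self._source(), key=key)))

    def union(self, other):
        return ToyRDD(lambda: chain(self._source(), other._source()))

    # Actions: these pull data through the whole chain of transformations.
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())

    def take(self, n):
        return list(islice(self._source(), n))


calls = []  # records every invocation of traced_upper


def traced_upper(word):
    calls.append(word)
    return word.upper()


lines = ToyRDD.parallelize(["a b", "b c", "c a"])
words = lines.flatMap(str.split).distinct().map(traced_upper)

calls_before_action = len(calls)  # still 0: only transformations so far
result = words.collect()          # the action triggers the computation
calls_after_action = len(calls)
```

Building words runs none of the functions; traced_upper is first called inside collect(). Each further action re-runs the chain from the source, which resembles how Spark recomputes an RDD that has not been cached.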
