Cloud computing: text mining

May 28, 2024 | 14:13

10-15 UGC, text mining overview

Steps & concepts:
- 16 Tokenization
- 17, 18 TF
- 19 stopwords
- 20, 21 IDF, TF-IDF
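
A minimal sketch of the steps above in plain Python, to make the pipeline concrete. The corpus, the stopword list, and all function names are made up for illustration. The weighting used here (relative term frequency times log(N / df)) is one common variant; the slides may define TF or IDF slightly differently (e.g. raw counts or smoothing).

```python
import math
from collections import Counter

# Hypothetical toy corpus and stopword list, for illustration only.
DOCS = [
    "the battery life of this phone is great",
    "the screen of this phone is too dim",
    "great screen and great battery",
]
STOPWORDS = {"the", "of", "this", "is", "and", "too"}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def term_frequency(tokens):
    """Relative frequency of each term within one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}


def inverse_document_frequency(tokenized_docs):
    """idf(t) = log(N / df(t)), with df(t) = number of documents containing t."""
    n_docs = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log(n_docs / d) for term, d in df.items()}


def tf_idf(tokenized_docs):
    """One {term: tf * idf} dictionary per document."""
    idf = inverse_document_frequency(tokenized_docs)
    return [
        {term: tf * idf[term] for term, tf in term_frequency(doc).items()}
        for doc in tokenized_docs
    ]


tokenized = [remove_stopwords(tokenize(d)) for d in DOCS]
scores = tf_idf(tokenized)

# Print the terms of the first document, highest TF-IDF first.
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
```

In this toy corpus, a term that occurs in only one document ("life") ends up with a higher score than a term shared by two documents ("battery"), which is the point of the IDF factor.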
Use cases:
- 23 Word cloud
- 24 Spam filtering
- 26 Co-occurrence analysis: find out that A and B are related
- 27 Topic modelling
  - 28 LDA model
- 30 Business case
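
A rough sketch of the co-occurrence idea from the list above: count how often two words appear in the same document, and treat frequently co-occurring pairs as related. The sample posts and the function name are hypothetical, and counting at document level is an assumption; the lecture may count within a sentence or a sliding window instead.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tokenized posts (stopwords already removed), for illustration only.
POSTS = [
    ["phone", "battery", "great"],
    ["phone", "screen", "dim"],
    ["battery", "phone", "charger"],
]


def cooccurrence_counts(tokenized_docs):
    """For each unordered word pair, count the documents that contain both words."""
    pairs = Counter()
    for tokens in tokenized_docs:
        # set() so a word repeated inside one document is counted once;
        # sorted() so each pair always appears in the same order.
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs


pair_counts = cooccurrence_counts(POSTS)
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```

Here "battery" and "phone" share two posts, more than any other pair, so they come out as the most strongly related words.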

31-37 basics
38- Resilient Distributed Dataset (RDD)
- 38 definition
- 39 RDD Operations in Spark:
  - Actions
    - count()
    - take(n)
    - collect()
    - …
  - Transformations
    - map(function): creates a new RDD by applying the function to each record
    - filter(function): creates a new RDD by including or excluding each record, depending on whether the function returns true for it
      - 41-42 example
    - Single RDD transformations:
      - flatMap(function): maps each element to zero or more elements
      - distinct(): filter out duplicates
      - sortBy()
      - 44 example
    - Multi RDD transformations:
      - intersection
      - union
      - zip
      - 45-48 example
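
The operations listed above could be combined roughly as follows in PySpark. This is a sketch, not the example from the slides: the sample data, the helper functions, and the name `rdd_operations_demo` are made up, and `sc` is assumed to be an already running SparkContext.

```python
def split_words(line):
    """Passed to flatMap: one line becomes several words."""
    return line.split()


def is_long(word):
    """Passed to filter: keep only words longer than three characters."""
    return len(word) > 3


def rdd_operations_demo(sc):
    """Run a few transformations and actions; sc is an existing SparkContext."""
    lines = sc.parallelize(["spark makes rdds", "rdds are lazy", "spark is fast"])

    # Single-RDD transformations: each call returns a new RDD.
    words = lines.flatMap(split_words)
    lengths = words.map(len)
    long_words = words.filter(is_long)
    unique_sorted = words.distinct().sortBy(lambda w: w)

    # Multi-RDD transformations.
    other = sc.parallelize(["spark", "hadoop"])
    in_both = unique_sorted.intersection(other)
    in_either = unique_sorted.union(other)
    # zip pairs elements by position; lengths is derived from words by map,
    # so both RDDs have the same partitioning and element counts.
    word_with_length = words.zip(lengths)

    # Actions: these trigger the computation and return plain Python values.
    return {
        "count": words.count(),
        "first_three": words.take(3),
        "long_words": long_words.collect(),
        "unique_sorted": unique_sorted.collect(),
        "in_both": in_both.collect(),
        "in_either": in_either.collect(),
        "word_with_length": word_with_length.collect(),
    }


# Usage, assuming pyspark is installed:
#   from pyspark import SparkContext
#   sc = SparkContext("local[*]", "rdd-notes-demo")
#   print(rdd_operations_demo(sc))
#   sc.stop()
```

Named helper functions are used for flatMap and filter so that they can be tried out on plain Python values without starting Spark.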
Lazy execution: the data in an RDD is not processed until an action is performed
- RDD creation
  - sc.textFile(… <- can refer to more than one file)
  - myRdd = sc.parallelize(mydata)
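
A small sketch of RDD creation and lazy execution in PySpark. The function names and the helper `shout` are hypothetical, and `sc` is again assumed to be an existing SparkContext. Passing several files as one comma-separated string is one way `textFile` accepts multiple inputs; a directory or a wildcard pattern should work as well.

```python
def shout(word):
    """Passed to map: upper-case one record."""
    return word.upper()


def join_paths(paths):
    """Build the comma-separated path string that textFile accepts."""
    return ",".join(paths)


def load_lines(sc, paths):
    """Create an RDD of lines from one or more text files."""
    return sc.textFile(join_paths(paths))


def lazy_execution_demo(sc, mydata):
    """Show where the work actually happens; sc is an existing SparkContext."""
    my_rdd = sc.parallelize(mydata)  # RDD from an in-memory collection
    shouted = my_rdd.map(shout)      # transformation: only recorded, not executed
    return shouted.collect()         # action: the map is executed now


# Usage, assuming pyspark is installed:
#   from pyspark import SparkContext
#   sc = SparkContext("local[*]", "lazy-demo")
#   print(lazy_execution_demo(sc, ["rdd", "spark"]))
#   sc.stop()
```

Because of lazy execution, a mistake inside `shout` would only surface at the `collect()` call, not at the line where `map` is written.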
