Spark Development Guide

Spark Applications are developed in Scala as a rule.

Scala Style Guides

For Scala coding style, refer to the Scala coding style guide.

Naming

Variables

  • Use camelCase for variable names by default.
    • The following cases are exceptions and use snake_case.
      • When a DataFrame/Dataset is held in a case class and saved as a table
        • The field names become the table's column names, so the case class fields use snake_case.
      • In the same vein, write Column.name in snake_case so that SQL code is easy to write.
  • Append a DF/DS/RDD postfix to the names of variables that hold a DataFrame/Dataset/RDD.
    val userDF: DataFrame = ...
    val bookDS: Dataset[Book] = ...
    val bestsellerRDD: RDD[Bestseller] = ...
  • Use the singular for a variable that represents a single object and the plural for one that represents a collection of objects, and keep semantic redundancy to a minimum.
    val defaultCategory = "comic"
    val categories = List("general", "romance", "fantasy", "comic", "bl")
    
    // Don't do this
    val categoryList = List("general", "romance", "fantasy", "comic", "bl")
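
The naming rules above can be sketched as follows. This is a minimal illustration, not part of the original guide: the `BestsellerRow` case class, its fields, the `PrintBestsellerColumns` object, and the local master setting are all hypothetical. The variable is camelCase with a DS postfix, while the case class fields are snake_case because they become column names.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type: snake_case fields, because they become column names.
case class BestsellerRow(book_id: String, sales_count: Long)

object PrintBestsellerColumns {
  // Builds a small Dataset; the variable name is camelCase with a DS postfix.
  def bestsellers(spark: SparkSession): Dataset[BestsellerRow] = {
    import spark.implicits._
    val bestsellerDS: Dataset[BestsellerRow] = Seq(
      BestsellerRow("b-001", 120L),
      BestsellerRow("b-002", 95L)
    ).toDS()
    bestsellerDS
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PrintBestsellerColumns")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()

    // The column names are taken from the case class field names.
    println(bestsellers(spark).columns.mkString(", "))

    spark.stop()
  }
}
```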

Application

  • Name a Spark Application with a verb phrase, e.g. BuildBestseller, BuildRecommend.

Chained Method Invocations

  • Chained method calls may be written in either of the following forms.
    • on a single line
      • Allowed when the line is at most 100 characters long and readability does not suffer
    • on multiple lines
      • The general case
      • At most one method that takes arguments may appear on each line
outputDF.write.format("parquet").mode(SaveMode.Overwrite).saveAsTable(output)

outputDF.write
  .format("parquet")
  .mode(SaveMode.Overwrite)
  .saveAsTable(output)

outputDF.write.format("parquet")
  .mode(SaveMode.Overwrite)
  .saveAsTable(output)

// Don't do these
outputDF.write.format("parquet").mode(SaveMode.Overwrite)
  .saveAsTable(output)

outputDF.write
  .format("parquet").mode(SaveMode.Overwrite)
  .saveAsTable(output)

Spark SQL

  • Use a single-line string when the query is short and simple
val teenagerNameDS = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19").as[Name]
  • Otherwise, use an indented multiline string; indent inside the query with 1 space
val teenagerNameDS = spark
  .sql(
    s"""
    |SELECT
    | name -- indent 1 space
    |FROM people
    |WHERE age BETWEEN 13 AND 19
    |""".stripMargin
  ).as[Name]
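
To see what text a multiline query actually produces, the sketch below builds one without running Spark. The `QueryText` object and the `minAge`/`maxAge` values are hypothetical and only illustrate why the `s` interpolator is useful here. `stripMargin` removes everything up to and including the leading `|` on each line, so only the 1-space indent inside the query remains.

```scala
object QueryText {
  // Hypothetical parameters, interpolated into the query by the s"..." prefix.
  val minAge = 13
  val maxAge = 19

  // stripMargin drops the leading whitespace and the '|' marker on every line.
  val teenagerQuery: String =
    s"""
    |SELECT
    | name
    |FROM people
    |WHERE age BETWEEN $minAge AND $maxAge
    |""".stripMargin

  def main(args: Array[String]): Unit =
    println(teenagerQuery)
}
```

The resulting string can be passed to `spark.sql(...)` as in the example above.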

RDD vs Spark SQL, Dataset, DataFrame API

  • Prefer the Spark SQL, Dataset & DataFrame APIs whenever possible.
    • Type Safety, High Level Abstraction
    • Performance optimization from Spark SQL's Catalyst Optimizer
  • The RDD API is allowed when fine-grained, low-level control is needed.
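
As a rough comparison of the two API levels, the sketch below counts books per category in both styles. The `Book` and `CategoryCount` case classes, their fields, and the `CountByCategory` object are hypothetical and are not taken from the original guide. The Dataset version goes through the Catalyst Optimizer and returns a typed result; the RDD version spells out the key/value steps by hand.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record types, for illustration only.
case class Book(title: String, category: String)
case class CategoryCount(category: String, book_count: Long)

object CountByCategory {
  // Preferred: Dataset/DataFrame API, planned by the Catalyst Optimizer.
  def withDataset(spark: SparkSession, bookDS: Dataset[Book]): Dataset[CategoryCount] = {
    import spark.implicits._
    bookDS
      .groupBy("category")
      .count()
      .withColumnRenamed("count", "book_count")
      .as[CategoryCount]
  }

  // Allowed when low-level control is needed: the same count on the RDD API.
  def withRDD(bookRDD: RDD[Book]): RDD[(String, Long)] =
    bookRDD
      .map(book => (book.category, 1L))
      .reduceByKey(_ + _)
}
```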

References