Clarinet

tags: #geo-distributed #article

source: Clarinet: WAN-Aware Optimization for Analytics [link]

Problem:

  • Existing query optimizers do not consider the network (heterogeneous and variable bandwidth) or wan-constraints-geo-distributed-systems when preparing a query execution plan.

Intuition:

  • Producing a WAN-aware execution plan requires working with the execution layer (of the analytics framework), which is responsible for [[../concepts/task-placement]] across sites and [[../concepts/tasks-scheduling]].
  • Factors affecting latency:
    • Choice of join order.
    • Task placement: a good placement moves less intermediate data across sites, so a plan's running time has to be estimated with its task placement taken into account.
    • Network transfer scheduling: a WAN link may be shared with other applications, so the schedule of network transfers has to be considered as well.
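
To make the task-placement factor concrete, here is a minimal sketch of how the site chosen for a downstream stage changes the time spent moving intermediate data. The site names, bandwidths, and data sizes are invented for illustration, and the cost model (transfers on different links run in parallel, so the slowest link dominates) is a simplifying assumption, not Clarinet's actual model.

```python
# Hypothetical WAN bandwidths in MB/s between sites (directional).
BANDWIDTH = {
    ("us", "eu"): 100.0, ("eu", "us"): 100.0,
    ("us", "asia"): 20.0, ("asia", "us"): 20.0,
    ("eu", "asia"): 50.0, ("asia", "eu"): 50.0,
}

# Hypothetical intermediate data (MB) produced at each site by the upstream stage.
INTERMEDIATE_MB = {"us": 4000.0, "eu": 1000.0, "asia": 500.0}


def transfer_time(dest):
    """Seconds needed to gather all intermediate data at site `dest`.

    Assumes transfers on different links run in parallel, so the slowest
    link determines the total. Data already at `dest` costs nothing.
    """
    times = [
        size / BANDWIDTH[(src, dest)]
        for src, size in INTERMEDIATE_MB.items()
        if src != dest
    ]
    return max(times, default=0.0)


def best_site():
    """Return (site, seconds) for the placement with the shortest transfer."""
    return min(
        ((site, transfer_time(site)) for site in INTERMEDIATE_MB),
        key=lambda pair: pair[1],
    )
```

With these made-up numbers, placing the stage where most of the data already lives avoids pushing 4000 MB over the slow link, which is why placement and bandwidth have to be considered together.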

Approach:

  • Clarinet addresses these problems as follows:
    • Generate multiple query plans (with a different join order per plan).
    • Assign a degree of parallelism to each stage to convert these logical plans into physical plans.
    • Feed these plans into Clarinet, which performs task placement and scheduling in a network-aware fashion.
    • Choose the plan with the smallest running time for execution.
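
The "generate candidates, estimate, pick the fastest" loop can be sketched as follows. This is an illustration under stated assumptions only: the tables, sites, bandwidths, and the `OUTPUT_FRACTION` selectivity are hypothetical, the estimator counts WAN transfer time for a left-deep join and nothing else, and parallelism and transfer scheduling are left out entirely. It is not Clarinet's cost model.

```python
from itertools import permutations

# Hypothetical inputs: table -> (site, size in MB).
TABLES = {"orders": ("us", 800.0), "items": ("eu", 200.0), "users": ("asia", 50.0)}

# Hypothetical symmetric WAN bandwidths in MB/s.
BANDWIDTH = {
    frozenset(("us", "eu")): 100.0,
    frozenset(("us", "asia")): 20.0,
    frozenset(("eu", "asia")): 50.0,
}

OUTPUT_FRACTION = 0.5  # assumed: a join emits half of its smaller input


def estimate_runtime(order):
    """Estimated WAN transfer seconds for a left-deep join in `order`."""
    site, size = TABLES[order[0]]
    total = 0.0
    for name in order[1:]:
        next_site, next_size = TABLES[name]
        if next_site != site:
            # Ship the smaller input to the site holding the larger one.
            moved = min(size, next_size)
            total += moved / BANDWIDTH[frozenset((site, next_site))]
            if next_size > size:
                site = next_site  # the intermediate result moved
        size = OUTPUT_FRACTION * min(size, next_size)
    return total


def choose_plan():
    """Enumerate every join order and return the one with the lowest estimate."""
    candidates = list(permutations(TABLES))
    return min(candidates, key=estimate_runtime)
```

In this toy setup, join orders that start with the small tables keep the large `orders` table from ever crossing the slow link, so they get the lowest estimate. A bandwidth-oblivious optimizer would have no reason to prefer them.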

Implementation:

  • WAN awareness for a single query
  • Multiple contending queries

Conclusion:

  • Clarinet reduces query completion times by 2× compared to state-of-the-art WAN-aware placement and scheduling.
  • see also potential-ideas