Can someone point me to a URL that explains how data flows from S3 to memory to HDFS to local disk in a job executed on AWS EMR? I understand the roles played by core and task nodes, but I'm not clear on how the data actually flows. For example, if I'm joining two tables in Hive whose data sits in S3, would the data first go to HDFS and then into memory, or vice versa? When would disk space on the task nodes be used? And how would data reach the task nodes: from the master node or from the core nodes?
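For concreteness, here is a simplified sketch of the kind of query I mean (the table names and S3 paths are made up):

    CREATE EXTERNAL TABLE orders (order_id BIGINT, customer_id BIGINT, amount DOUBLE)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/warehouse/orders/';

    CREATE EXTERNAL TABLE customers (customer_id BIGINT, name STRING)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/warehouse/customers/';

    -- Join of two S3-backed tables: where do the intermediate/shuffle
    -- results live while this runs -- HDFS, or local disk on core/task nodes?
    SELECT c.name, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.name;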
The reason I'm asking is that my jobs sometimes fail with the message "datanodes are bad", mostly because HDFS fills up, or nodes are marked unhealthy because their local disk space is full.
So I'm trying to figure out the role played by each component. When the cluster was on-prem I never ran into these issues, so now I need to configure my AWS cluster better.
Thanks