Can someone point me to a URL that explains how data flows from S3 to memory to HDFS to local disk in a job executed on AWS EMR? I understand the roles played by core and task nodes, but I'm not clear on how the data actually flows. For example, if I'm joining two tables in Hive whose data sits in S3, would the data first go to HDFS and then into memory, or vice versa? When would disk space on the task nodes be used? And how would data reach the task nodes: from the master node or from the core nodes?
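For concreteness, here is a simplified sketch of the kind of query I mean (the table names and S3 paths are made up):

    CREATE EXTERNAL TABLE orders (order_id BIGINT, customer_id BIGINT, amount DOUBLE)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/warehouse/orders/';

    CREATE EXTERNAL TABLE customers (customer_id BIGINT, name STRING)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/warehouse/customers/';

    -- Join of two S3-backed tables: where do the intermediate/shuffle
    -- results live while this runs -- HDFS, or local disk on core/task nodes?
    SELECT c.name, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.name;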
The reason I'm asking is that my jobs sometimes fail with the message "datanodes are bad", mostly because HDFS fills up, or nodes are marked unhealthy because their local disk space is full.
So I'm trying to figure out the role played by each component. When the cluster was on-prem I never ran into these issues, so now I need to configure my AWS cluster better.
Thanks