I am on intel dev cloud and using Intel OneAPI. This is my code till now:
# first block of jupyter notebook
import modin.pandas as pd
# second block of jupyter notebook
df = pd.read_csv('dataset/dataset.csv')
df.head()
# output of second block
UserWarning: Ray execution environment not yet initialized. Initializing...
To remove this warning, run the following python code before doing dataframe operations:
import ray
ray.init()
2023-09-01 12:00:16,471 INFO worker.py:1636 -- Started a local Ray instance.
The first block is running properly but, when I am reading my dataset, it is giving me this warning and server unavailable error.
If I use import pandas as pd, the code is running fine, but modin.pandas is not working. My dataset is ~ 1 GB csv file. Why is this happening???
How to Reproduce this?
- Step 1 - https://devcloud.intel.com/oneapi/
- Step 2 - click on Getting Started tab.
- Step 3 - Go down and Click on launch JupyterLab. (It is like Google Colab or Kaggle Notebooks)
- Step 4 - Create ipynb and use wget to download this data.
- Data - !wget https://s3-ap-southeast-1.amazonaws.com/he-public-data/datasetab75fb3.zip
System Information
- OS - Linux 90-Ubuntu 5.4.0-80-generic
- CPU - Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz
- RAM - 188 GB

Step 1: Check if you have installed modin properly. If you are unsure, try to reinstall the relevant modin dependencies etc.
Step 2: import modin.pandas as pd
Let's see what will happen.
Reference: The pandas library offers user-friendly data structures, including DataFrames, for data analysis. However, it may perform slowly with extensive datasets (e.g., 100 GB or 1 TB) since it wasn't optimized for such large volumes. Fortunately, the Modin library addresses this by allowing you to scale pandas workflows with just one code change.