R vs Python for Big Data
I was recently using R and Python in parallel on huge datasets (>50M rows) and noticed that, for my workload, R was about three times faster than the equivalent Pandas DataFrame code. My aim was to convert the R script into Python so it could be integrated with my other systems. Seeing the long run times of Pandas, I asked my LinkedIn network for tricks to speed up the Python code, since eventually everything had to be converted to Python. The post went viral, with 100K views and 200 comments. That is the power of LinkedIn: so many professionals willing to help! I decided to collect all the comments and group them so that other data scientists can benefit from them.
The first thing to realize is that a Pandas DataFrame is NOT designed for Big Data.
As the next step, find the bottlenecks in your code and try to optimize them: run the script through line_profiler to see which statements take up most of the time. Avoid plain Python loops over large numbers of data items. Such loops can be compiled with Numba by converting the relevant DataFrame columns to NumPy arrays and passing them to a Numba-compiled function (a sketch follows below); Pandas' apply also accepts Numba-compiled functions. Maximizing the use of NumPy for every operation, even min, max, sum, round, etc., is also a good idea. In any case, the profiling step will show what is slowing the script down, and you can then think about ways to speed up the code.
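Here is a minimal sketch of the Numba approach described above; the column names and the example calculation are placeholders, not from the original post.

```python
import numpy as np
import pandas as pd
from numba import njit

@njit
def weighted_diff(a, b, w):
    # Compiled loop: runs at native speed instead of in the Python interpreter.
    out = np.empty(a.shape[0])
    for i in range(a.shape[0]):
        out[i] = (a[i] - b[i]) * w[i]
    return out

df = pd.DataFrame({
    "open": np.random.rand(1_000_000),
    "close": np.random.rand(1_000_000),
    "weight": np.random.rand(1_000_000),
})

# Convert the columns to NumPy arrays before calling the compiled function.
df["score"] = weighted_diff(df["open"].to_numpy(),
                            df["close"].to_numpy(),
                            df["weight"].to_numpy())
```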
Here are the other options:
Use multiprocessing or a distributed, parallelized environment. Use one of Koalas, Dask, PySpark, Pathos, Modin, Numba or PyPy; or parallelize the task with Python's multiprocessing module by slicing the DataFrame and distributing the slices among all the cores in the system (see the sketch below). However, these approaches can be tricky to debug.
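A rough sketch of the "slice the DataFrame and spread it over the cores" idea, using only the standard library; process_chunk is a placeholder for your own per-chunk logic.

```python
import numpy as np
import pandas as pd
from multiprocessing import Pool, cpu_count

def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # Whatever per-row work you need; here a trivial example.
    chunk = chunk.copy()
    chunk["double"] = chunk["value"] * 2
    return chunk

if __name__ == "__main__":
    df = pd.DataFrame({"value": np.random.rand(10_000_000)})

    # One slice of row positions per core.
    slices = np.array_split(np.arange(len(df)), cpu_count())
    chunks = [df.iloc[idx] for idx in slices]

    with Pool(cpu_count()) as pool:
        df = pd.concat(pool.map(process_chunk, chunks), ignore_index=True)
```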
Prepare the data in SQL before using it as input to Python. If the data comes from a database, do the aggregation and manipulation in the database with SQL before reading it into any programming language (a minimal example follows).
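A minimal sketch of pushing the aggregation into the database first; the table and column names are invented for illustration, and sqlite3 stands in for whatever database you actually use.

```python
import pandas as pd
import sqlite3

conn = sqlite3.connect("sales.db")

# Instead of pulling 50M raw rows into Pandas and grouping there,
# let the database return the already-aggregated result.
query = """
    SELECT region, SUM(amount) AS total_amount, COUNT(*) AS n_orders
    FROM orders
    GROUP BY region
"""
summary = pd.read_sql(query, conn)   # small aggregated frame, cheap to work with
conn.close()
```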
Use vectorization and the NumPy library. Experts advise using vectorized Pandas/NumPy operations as much as possible, avoiding manual iteration over rows, and using categorical types for columns with a limited number of categories. For example, vectorize a function with NumPy (np.vectorize(function)) and then call vect_function(df['whatever_field']) instead of Pandas' .apply; this should improve the speed considerably (10x-100x, especially compared with "for" loops over a DataFrame). If a loop is unavoidable, use the apply function. It can be even better to use pure Python data structures such as dictionary, defaultdict, list and set and manage the data directly without Pandas; most of what is done in Pandas can also easily be done with dictionaries. A sketch of the np.vectorize pattern follows below.
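A small sketch of the np.vectorize pattern mentioned above; the column name and the function are made up for illustration.

```python
import numpy as np
import pandas as pd

def bucket(x):
    # Toy scalar function we want to apply to every row.
    return 0 if x < 0.5 else 1

df = pd.DataFrame({"whatever_field": np.random.rand(1_000_000)})

# Row-wise apply (slower):
# df["bucket"] = df["whatever_field"].apply(bucket)

# np.vectorize wrapper, usually faster than .apply for simple functions:
vect_bucket = np.vectorize(bucket)
df["bucket"] = vect_bucket(df["whatever_field"])
```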
Use Pandas with speed-optimization tricks. If you have a good enough NVIDIA GPU, try RAPIDS AI (cuDF). If you can replace explicit loops with vectorized operations, you might even want to try running things on the GPU via Cocos (pip install cocos). Many Pandas methods have an "inplace" argument that is False by default, so they return a modified copy of the DataFrame; setting inplace=True avoids creating that extra copy and can make the code run faster (a small example follows). If the operations in your code do not involve aggregation, or the output would not change if the DataFrame were split into chunks, you can divide the DataFrame into partitions, map a function over each partition, and concatenate the results afterwards. You can also push such transformations down to a modern cloud DWH like BigQuery or Snowflake.
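A tiny sketch of the "inplace" argument mentioned above, using fillna as an example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [1.0, np.nan, 3.0]})

# Default behaviour: returns a modified copy, the original df is untouched.
filled = df.fillna(0)

# With inplace=True the method mutates df directly and returns None,
# which avoids binding a second full-size DataFrame.
df.fillna(0, inplace=True)
```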
Use Microsoft's library for Python, revoscalepy. It ships with SQL Server Machine Learning Services, and with its help you can convert the data into a .xdf file and reduce its size.
Use the R code as it is from Python. The rpy2 package can be used to run R from within Python (a minimal sketch follows).
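A minimal sketch of calling R from Python with rpy2; the R snippet here is just a placeholder for an existing R script.

```python
import rpy2.robjects as robjects

# Run arbitrary R code...
robjects.r('x <- rnorm(1000)')
robjects.r('m <- mean(x)')

# ...and pull the result back into Python.
mean_value = robjects.r('m')[0]
print(mean_value)
```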
Many thanks to my LinkedIn network for their comments. I hope these conclusions drawn from data science professionals' comments will help you. Thanks for reading, and please feel free to subscribe to my blog to receive new posts in your mailbox.