Python versus R Language

From Open Risk Manual
Revision as of 22:11, 11 August 2019 by Wiki admin (talk | contribs) (Files, Databases and Data Manipulation)

Python versus R Language: A Rosetta Stone

Structure

The comparison data are provided in tabular format in several distinct tables. Each table documents a relevant language attribute, information for both languages and (where applicable) commentary.

History and Community

The objective of this section is to provide an overall comparison of the history of the two ecosystems, towards answering the question: who is really behind Python and R?

| Aspect | Python | R | Comment |
| --- | --- | --- | --- |
| First Release | 1991 | 1995 | Both ecosystems have a long development history, and both have received a lot of attention in the last few years |
| Initial Authors | Guido van Rossum | Ross Ihaka and Robert Gentleman | |
| Current Stable Version | 3.7 | 3.5 | |
| Current Governance | Python Software Foundation (Non Profit) | R Foundation (Non Profit) | |
| Open Source License | PSF License | GNU General Public License | |
| Size of Core Contributors | TBD | TBD | Core team sizes are not public |
| Developer Communities | PyLadies | R-Ladies | |
| Size of Developer Communities | Third most popular in number of repositories and number of contributors | Not in Top 10 (of developers) | NB: R programmers might not necessarily self-identify as developers (but as data scientists, statisticians etc.) |
| Important Organizations | NumFOCUS | Bioconductor | A large number of both commercial and non-profit organizations support both ecosystems explicitly and implicitly. This is a partial selection with a focus on the applications relevant for risk management |
| Important Conferences | PyCon | useR! | |
| Important Journals | | The R Journal | Journal of Open Source Software for both Python / R |
| IRC Channels | #python | | |
| Reddit | Python subreddit | R Stats subreddit | Data Science subreddit (discussing both Python and R issues) |
| Online Forums and Blogs | Too many | Too many | Both ecosystems have a very large number of blogs, forums etc. (of widely varying quality) |

Devices and Operating Systems

This section aims to answer the question: where (as in what kind of device and operating system) can I use Python or R? It is not a guide to installing Python or R on your system!

| Aspect | Python | R | Comment |
| --- | --- | --- | --- |
| Linux Desktop | Comes pre-installed | apt-get install r-base | Python is generally pre-installed as it is used by the Linux system itself |
| MacOS | 2.7 version is pre-installed | MacOS installer | |
| Raspbian | Pre-installed | apt-get install r-base | Linux is the operating system of choice for IoT devices, which means a basic Python installation is generally available |
| Windows | Windows installer | Windows installer | |
| Android / iOS | Via python-for-android | No | Neither Python nor R is readily available on mobile devices |
| Cloud Servers | As per Linux Desktop | As per Linux Desktop | Cloud servers typically run the Linux operating system and generally have a Python installation available |

Package Management

This section aims to answer the question: how can I extend Python or R functionality with existing libraries? The ease of finding and installing packages is a very important factor in the popularity of both languages, in marked contrast to languages like C++.

| Aspect | Python | R | Comment |
| --- | --- | --- | --- |
| Discovery of Packages | Online search, built-in PyCharm access to PyPI | RStudio built-in access to CRAN | Most mature Python packages are released on PyPI; R packages are released on CRAN |
| Number of Packages (July 2019) | 189,855 | 14,108 | |
| Online Repositories | PyPI, Linux distribution repositories | CRAN | GitHub and GitLab are used for releasing both Python and R packages online, coordinating development etc. |
| Package Installation | Done at OS level (PyPI, setup, conda, pip, easy_install, apt) | Built-in install.packages | Python installation methods are quite varied (and have evolved over time) and can be either system wide (e.g. a Linux distro package) or user specific |
| Dependency Management | pip, virtualenv | packrat | virtualenv enables using separate Python distributions and package collections |
| Loading Packages | import statement | library statement | |
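
As a minimal sketch of the Python column above, the snippet below checks whether a package can be found before loading it. The helper name `is_installed` is hypothetical and only for illustration. Installation itself normally happens outside the interpreter (e.g. with pip), whereas R's install.packages runs inside the session.

```python
import importlib
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# `json` ships with the standard library, so it should always be found.
if is_installed("json"):
    json = importlib.import_module("json")  # same effect as `import json`
    print(json.dumps({"loaded": True}))
```

The rough R counterpart of the final step is the library statement listed in the table.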

Package Documentation

This section aims to answer the question: How can I document a Python or R module? The ease and quality of documentation is an important factor, as it helps beginners learn new functionality and helps experienced users produce better quality work.

| Aspect | Python | R | Comment |
| --- | --- | --- | --- |
| Source level documentation | Docstrings | | |
| Formats | markdown, restructuredtext | markdown, latex | |
| Documentation generator | sphinx | roxygen2 | |
| Online documentation | readthedocs | CRAN, bookdown | |
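
To illustrate source level documentation on the Python side, here is a small sketch of a docstring written with the reStructuredText field lists that sphinx understands. The function `expected_loss` and its parameters are hypothetical and serve only as an example.

```python
def expected_loss(prob_default, loss_given_default, exposure):
    """Return the expected loss of a single exposure.

    :param prob_default: probability of default, between 0 and 1
    :param loss_given_default: loss given default, between 0 and 1
    :param exposure: exposure at default, in currency units
    :returns: the product of the three inputs
    """
    return prob_default * loss_given_default * exposure


# The docstring is stored on the function object; help() displays the same text.
print(expected_loss.__doc__.splitlines()[0])
```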

Language Characteristics

This section aims to answer the question: What does code in Python or R look like from a programming perspective? Many standard aspects of programming languages are available in both so are not included.

| Aspect | Python | R | Comment |
| --- | --- | --- | --- |
| Compiled / Interpreted | Interpreted | Interpreted | Code can be executed interactively |
| Main Implementation Language | C (CPython) | C and Fortran | This is the language used for the interpretation of a Python or R script |
| Other Implementation Languages | Java (Jython), RustPython etc | pqR, Renjin, FastR etc | Many alternative implementations of the underlying interpreter exist for both languages |
| Type System | Dynamic (Duck) Typing | Dynamic | Both Python and R have dynamic type systems, in contrast with languages such as C++, Java or Rust |
| Native Data Types | Numbers, Strings, Lists, Tuples, Dictionaries | Numeric, Int, Character, List, Vector, Logical (and the pairlist) | |
| Object Oriented | Yes | Yes | R has a variety of Object Oriented implementations with different designs and functionality; they are denoted S3, S4, R5 and R6 |
| Code Structure | Based on Indentation | Free Style | |
| Standard Libraries | Extensive | Built-in Functions | Python has an extensive standard library as it covers a larger domain |
| Building Extensions | Via bindings to other languages | Via bindings to other languages | See below under HPC for more specifics |
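
The following short sketch shows what several of the rows above look like in Python code: the native data types, indentation as block structure, and duck typing. All variable and function names are hypothetical.

```python
# Native data types listed in the table above.
number = 3.14                       # number (float)
text = "risk"                       # string
items = [1, 2, 3]                   # list (mutable)
pair = (0.05, 0.95)                 # tuple (immutable)
ratings = {"AAA": 1, "BBB": 4}      # dictionary


def total_length(*containers):
    """Sum the lengths of any objects that support len() (duck typing)."""
    total = 0
    for container in containers:    # the loop body is delimited by indentation
        total += len(container)
    return total


print(total_length(text, items, pair, ratings))
```

The function never checks the type of its arguments; any object that implements len() is accepted, which is what dynamic (duck) typing means in practice.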

Development Environment

This section aims to answer the question: How can I develop and test code / applications written in Python or R?

| Aspect | Python | R | Comment |
| --- | --- | --- | --- |
| Free / Open Source IDEs | spyder, netbeans, pycharm community, eclipse, visual studio code | RStudio, RTVS | There are many other IDEs or advanced editors that support programming languages via plugins. The degree of support varies though (from syntax highlighting to supporting complete workflows within the IDE/editor) |
| Commercial IDEs | pycharm pro, komodo | RStudio Commercial Support | Community versions are listed in the previous entry |
| Notebook Environment | Jupyter | Jupyter, R Markdown | |
| Debugger | pdb | various built-in functions (browser, traceback, debug) | |
| Testing | tox, pytest, unittest | RUnit, testthat, assertthat | testthat is for typical unit tests; assertthat declares the pre- and post-conditions that code should satisfy |
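
As an illustration of the Testing row, the sketch below uses the standard library unittest module. The function under test, `discount_factor`, is a hypothetical example and not part of any library.

```python
import unittest


def discount_factor(rate, years):
    """Discount factor under annual compounding (hypothetical example)."""
    return 1.0 / (1.0 + rate) ** years


class DiscountFactorTest(unittest.TestCase):
    def test_zero_rate_gives_one(self):
        self.assertEqual(discount_factor(0.0, 5), 1.0)

    def test_positive_rate_reduces_value(self):
        self.assertLess(discount_factor(0.05, 1), 1.0)


# Build and run the suite explicitly so the sketch also works in a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(DiscountFactorTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

With pytest, the same checks could be written as plain functions containing assert statements.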

Files, Databases and Data Manipulation

This section aims to answer the following questions: What direct connectors to disk files and databases are available for Python and R respectively? Once I have connected to a data source, how can I store and do preliminary work with imported data?

| Aspect | Python | R | Comment |
| --- | --- | --- | --- |
| General Data Wrangling | pandas | data.table, (dplyr, tidyr, stringr, part of the tidyverse) | The concept of a data frame has been a core aspect of R and pandas has emulated this in the Python universe |
| Advanced datetime handling | dateutil | lubridate | These provide extensions to built-in functionality |
| Local File Loading | Built-in, Pandas | Built-in | General file input from local directories |
| CSV Loading | Pandas | Built-in (read.csv), data.table, readr | |
| XLS Loading | xlrd, openpyxl | XLConnect, xlsx | |
| URL Loading | requests, PycURL | data.table, RCurl | |
| Relational Database Connectors | MySQLdb, psycopg2, sqlite3 | RODBCext, RMySQL, RPostgreSQL, RSQLite | |
| Graph Database Connectors | neo4j, pyarango | neo4r | |
| Object Relational Mapping | SQLAlchemy, Django ORM | | |
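
The sketch below walks through the two questions of this section using only built-in Python modules (csv and sqlite3): loading tabular data and doing some preliminary work on it. The sample data and the table name `exposures` are hypothetical. In practice pandas.read_csv is the more common route, but it is a third-party dependency.

```python
import csv
import io
import sqlite3

# In-memory stand-in for a CSV file on disk (hypothetical sample data).
csv_text = "obligor,rating,exposure\nA,BBB,100.0\nB,AA,250.0\nC,BBB,50.0\n"
rows = [
    {"obligor": r["obligor"], "rating": r["rating"], "exposure": float(r["exposure"])}
    for r in csv.DictReader(io.StringIO(csv_text))
]

# sqlite3 is part of the standard library; ":memory:" avoids touching the disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE exposures (obligor TEXT, rating TEXT, exposure REAL)")
conn.executemany(
    "INSERT INTO exposures VALUES (:obligor, :rating, :exposure)", rows
)

# Preliminary manipulation: total exposure per rating grade.
query = "SELECT rating, SUM(exposure) FROM exposures GROUP BY rating"
totals = dict(conn.execute(query).fetchall())
conn.close()
print(totals)
```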

General Purpose Mathematical Libraries

This section aims to answer the question: What basic building blocks are available for undertaking quantitative work in Python and R respectively?

| Aspect | Python | R | Comment |
| --- | --- | --- | --- |
| General Purpose vectors and n-dimensional arrays (as storage) | numpy | Built-in array | The R system comes with many basic functionalities available built-in |
| Numerical Linear Algebra (matrix operations) | numpy.linalg | Matrix, RcppArmadillo, RcppEigen | For specialized operations (large / sparse matrices) see below in HPC |
| Mathematical (Special) Functions such as Gamma, Beta, Bessel | scipy | Built-in functions | The R system comes with many basic functionalities available built-in |
| Random Number Generation | Built-in, numpy.random | Built-in functions | This is about generic random numbers. More specialized applications mentioned below |
| Symbolic Algebra | sympy | | |
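
As a small illustration of the "Built-in" entries on the Python side, the sketch below uses only the standard library math and random modules. This is a deliberate simplification: numpy and scipy, as listed in the table, cover far more ground (arrays, linear algebra, a much wider range of special functions).

```python
import math
import random

# The Gamma function is available in the standard library.
print(math.gamma(5))        # Gamma(5) = 4! = 24

# A seeded generator makes simulated results reproducible.
rng = random.Random(42)
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
sample_mean = sum(sample) / len(sample)
print(round(sample_mean, 3))
```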

Core Statistics Libraries

This section aims to answer the question: What libraries are available for undertaking standard statistical studies in Python or R? There is a huge number of packages / modules with significant duplication / overlap, especially for the R system, hence only the major / indicative ones are considered.

| Aspect | Python | R | Comment |
| --- | --- | --- | --- |
| Basic Statistical Analysis (descriptive statistics, moments) | scipy.stats, statsmodels | Base R (stats), car, caret | |
| Correlation | | | |
| ANOVA | scipy.stats, statsmodels | car, caret | |
| Regression Analysis | scikit-learn, statsmodels | glmnet | |
| Survival Analysis | lifelines | survival, OIsurv | |
| Cluster Analysis | | | |
| Curve Fitting | | | |

Econometrics Libraries

This section aims to answer the question: What libraries are available for undertaking econometric (timeseries) studies in Python or R?

| Aspect | Python | R | Comment |
| --- | --- | --- | --- |
| Basic Econometric Analysis (stationarity, trends, seasonality) | statsmodels.tsa | Built-in, ts | |
| ARMA Processes | statsmodels.tsa | auto, forecast | |
| Vector Auto Regressions (VAR) | statsmodels.tsa | vars | |
| Heteroskedastic (GARCH) processes | statsmodels, arch | timeseries, zoo, vars | |

Machine Learning Libraries

This section aims to answer the question: What libraries are available for machine learning projects in Python or R? The term machine learning is not very specific, so we use this category to group various advanced / specialized libraries (of use in quantitative risk management). NB: Machine learning algorithms are typically compute intensive and are thus implemented in systems languages, with bindings and APIs provided to the Python or R environments.

| Aspect | Python | R | Comment |
| --- | --- | --- | --- |
| Network Analysis | networkx | igraph, sna | |
| Random Forests | scikit-learn | randomForest, ranger | |
| Boosting | scikit-learn | XGBoost | |
| Probabilistic Graphical Models | pgmpy | bnlearn, gRain | |
| Neural Networks | tensorflow, pytorch, keras | h2o, MXNet, keras | RStudio offers an interface to tensorflow |

GeoSpatial Libraries

This section aims to answer the question: What libraries are available for working with geospatial data in Python or R?

| Aspect | Python | R | Comment |
| --- | --- | --- | --- |
| Data Structures | GeoPandas.GeoSeries, GeoPandas.GeoDataFrame | raster | |

Visualization

This section aims to answer the question: What functionality is available to produce data driven visualization in Python or R?

| Aspect | Python | R | Comment |
| --- | --- | --- | --- |
| Low level API | matplotlib | grid, gridExtra | |
| Graph packages | seaborn, plotly, bokeh | ggplot2 | |
| Declarative Visualization | Altair | | |
| XKCD style plots :-) | Available! | Available! | |

Web, Desktop and Mobile Deployment

This section aims to answer the question: What tools does each language ecosystem provide for the deployment of applications, whether this is via the web, desktop or mobile apps?

| Aspect | Python | R | Comment |
| --- | --- | --- | --- |
| Native Webservers | Tornado, Gunicorn, CherryPy, Twisted | OpenCPU, plumber | |
| Classic Web Frameworks | Flask, Pyramid, Django | R Shiny, rApache | Web frameworks typically used behind a production web server (Apache, Nginx etc.) |
| Web Formats | xml (builtin), json | XML, jsonlite | |
| Web Sockets | websockets | | |
| Client Side (Browser) | Brython, RustPython | | |
| Mobile Apps | Kivy, Beeware | | |
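
For the Web Formats row, the following sketch shows the built-in Python json and xml modules producing and parsing small payloads. The record contents are hypothetical sample data.

```python
import json
import xml.etree.ElementTree as ET

record = {"obligor": "A", "rating": "BBB", "exposure": 100.0}

# JSON: serialise to a string and parse it back.
json_text = json.dumps(record, sort_keys=True)
round_trip = json.loads(json_text)

# XML: build a small element tree and serialise it.
root = ET.Element("exposure", attrib={"obligor": record["obligor"]})
ET.SubElement(root, "rating").text = record["rating"]
xml_text = ET.tostring(root, encoding="unicode")

print(json_text)
print(xml_text)
```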

High Performance Computing

For our purposes high performance computing (HPC) is any use case that requires more than a single CPU and its own memory. This section aims to answer the question: what are my options if I have performance bottlenecks in terms of CPU, memory or disk?

| Aspect | Python | R | Comment |
| --- | --- | --- | --- |
| Bindings to C/C++ | Cython, pybind11 | Rcpp | Both languages are slow compared to lower level / compiled languages. A common approach to make full use of the existing CPU is to extend the language via bindings to a faster language |
| Bindings to other languages (Java, Rust) | py4j, pyO3 | renjin | |
| Multithreading | threading | foreach | |
| Multi-core | multiprocessing | parallel | |
| Spark interface | PySpark | SparkR, sparklyr | |
| GPU Computing | PyCUDA | R GPU | Also offered built-in in some packages (e.g. pytorch, tensorflow) |
| Distributed Data | dask | multidplyr | |
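
As a rough sketch of the Multithreading row on the Python side, the example below uses the standard library concurrent.futures thread pool. The function `slow_lookup` is a hypothetical stand-in for an I/O-bound call. In the standard CPython interpreter, threads mainly help when the work is waiting on I/O; for CPU-bound work the multiprocessing route listed under Multi-core is the usual choice.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def slow_lookup(key):
    """Stand-in for an I/O-bound call such as a database or web request."""
    time.sleep(0.05)
    return key * 2


keys = list(range(8))

# Threads overlap the waiting time; map() returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(slow_lookup, keys))

print(results)
```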

Using R and Python together

This section aims to answer the question: How can I use R from Python and, vice versa, how can I use Python from R?

Aspect Python R Comment
Native Integration rpy2 rPython Native means that the integration is done using language bindings within the respective interpreters (not explicitly using the operating system or a server)
Python/R Cross-Development and Integration r4intellij, rpy2 reticulate
Via Server API rserve/pyserve
Via Shell Script subprocess system2
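
The shell script route can be sketched from the Python side as follows. The helper `run_command` is hypothetical. Calling R this way assumes that R is installed and that Rscript is on the PATH, so the runnable part of the example invokes the Python interpreter itself instead.

```python
import subprocess
import sys


def run_command(command):
    """Run an external program and return its standard output as text."""
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode bytes to str
        check=True,               # raise CalledProcessError on non-zero exit
    )
    return completed.stdout.strip()


# With R installed and Rscript on the PATH (an assumption), one could call:
#     run_command(["Rscript", "-e", "cat(1 + 1)"])
# The Python interpreter is used here so that the sketch runs anywhere.
print(run_command([sys.executable, "-c", "print(1 + 1)"]))
```

In the other direction, the R function system2 listed in the table plays the role of subprocess.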

Motivation, Objectives, Disclaimers

A large component of Quantitative Risk Management relies on data processing and quantitative tools. In turn, information processing pipelines and numerical algorithms must be implemented in computer systems. Computing systems come in a large and ever growing variety. In recent years open source software has seen increased adoption for diverse applications (machine learning, data science, artificial intelligence). In particular, cloud computing environments are primarily based on open source projects at the systems level. This facilitates (but does not require) the use of open source computational tools such as Python, R or Julia.

The Python versus R Language article is a side by side comparison of a wide range of aspects of the Python and R language ecosystems. Several comparisons of the two languages aim to "pick a winner" or recommend the best framework. This is not the objective of this entry.

The comparison of the two languages aims to:

  • Be useful for people who are at least somewhat familiar with programming and want to use both
  • Cover the most common use cases that are relevant for the implementation of quantitative risk models
  • Be fact oriented and as accurate as possible, without drifting into opinions


The comparison does not aim to:

  • Be a detailed / comprehensive catalog of all libraries (which number in the thousands!)
  • Cover use cases that are very far removed from quantitative risk models
  • Be totally exhaustive (e.g. identify all the possible computer systems one can run a Python interpreter on, or all the possible ways one can perform linear regression)


The comparison attempted here is not entirely appropriate. Strictly speaking R is not a general programming language. R is a system for statistical computation and graphics. It consists of a sufficiently general language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files.

Yet despite this disclaimer a comparison is justified, because in a very large domain of applications and use cases the two frameworks can be used interchangeably (or nearly so).

The comparison does not provide an assessment of which language is "better", as this is a meaningless question. The proper way to use the comparison is to start from your objectives, knowledge level and use case; combining the relevant data points should then provide sufficient information to decide what would be the best fit.

The comparison between Python and R is also not meant to suggest that the optimal choice of tool is always between these two. It is entirely possible (and not unusual) that for a particular use case the optimal tool is based on another language (Julia, or systems languages like C++, Java and Rust).



Contributors to this article

» Wiki admin