Is R being replaced by Python at quant desks?


69

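I know the title sounds a bit extreme, but I wonder whether R is being phased out in favor of Python by many quant desks at sell-side banks and at hedge funds. My impression is that, with the improvements in Pandas, Numpy and other Python packages, Python's capabilities for meaningfully mining data and modeling time series are improving drastically. I have also seen quite impressive Python implementations that parallelize code and fan computations out to several servers/machines. I know some R packages can do that too, but I sense that the current momentum favors Python.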

I need to make an architecture decision for a subset of my modeling framework myself, and I would like some input on what the current sentiment among other quants is.

I also have to admit that my initial reservations about Python's performance are mostly outdated, because some of the packages make heavy use of C implementations under the hood.

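Can you please comment on what you are using? I am not asking for opinions on whether one is better or worse for the tasks below, but specifically why you use R or Python, and whether you would even place them in the same category for accomplishing, among others, the following tasks: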

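  • acquire, store, maintain, read, and clean time series
  • perform basic statistics on time series and run advanced statistical models such as multivariate regression analyses, ...
  • perform mathematical computations (Fourier transforms, PDE solvers, PCA, ...)
  • visualize data (static and dynamic)
  • price derivatives (application of pricing models such as interest rate models)
  • interconnectivity (with Excel, servers, UI, ...)
  • (Added Jan 2016): ability to design, implement, and train deep learning networks.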

Edit: I thought the following link might add more value, although it is slightly dated [2013] (for some obscure reason that discussion was also closed...): https://softwareengineering.stackexchange.com/questions/181342/r-vs-python-for-data-analysis

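You can also search the r-bloggers website for several posts that address the computational efficiency of R versus Python packages. As some of the answers point out, one aspect is data pruning, that is, the preparation and setup of input data. The other part of the equation is the computational efficiency when actually performing the statistical and mathematical computations.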

Update (Jan 2016)

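I wanted to provide an update to this question now that AI/deep learning networks are being pursued very actively at banks and hedge funds. I have spent a good amount of time delving into deep learning, running experiments and working with libraries such as Theano, Torch, and Caffe. What stood out from my own work and from conversations with others is that many of those libraries are used via Python, and that most researchers in this space do not use R for this particular field. This is still only a small part of the quantitative work performed in financial services, but I wanted to point it out because it directly touches on the question I asked. I added this aspect of quant research to reflect current trends.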

25

This is interesting because I see another trend: Matlab is being replaced by R, but I guess this is another story.

I use R for my academic (I am also teaching this stuff) as well as my consulting work (I am mainly working in the $\mathbb{P}$ area, with some excursions into $\mathbb{Q}$). I tried Python but it didn't work for me. I think the main reasons I will stick with R are:

  • especially in the area of statistics and analytics there is such a huge amount of high quality packages with sometimes even very recent methods which is unrivalled by any other language out there
  • for me R has the right mixture of low level capabilities of e.g. (re-)organizing data and high level commands (e.g. even k-means in the core package)
  • the speed is ok for me because I am not working in the area of HFT and there are many possibilities for speeding up code (vectorization, parallelization, good connectivity with C and so forth)
  • the community is really very much into the kind of stuff I am interested in, whereas with Python it is really everybody and his dog doing all kinds of stuff I am not interested in... I guess this is also about the mindset of how to approach some problems, I don't know.

I think in general one should focus: I wouldn't try to build a webpage or a game with R but when it comes to statistics and analytics I think Python is no real competitor and I would strongly recommend R as your future setup.

Edit
I also wrote a blog post with additional points about why R is better suited for data science than Python: http://blog.ephorie.de/why-r-for-data-science-and-not-python


23

I've used both R and Python with Pandas in professional quantitative finance work, on both large and small scale projects. I would strongly recommend Python with Pandas over R for most new projects in the field, especially in time series analysis.

While I don't dispute vonjd in that you will find more libraries in R with algorithms on the bleeding edge of statistical research, the libraries in Python are very robust and fleshed out in that area. Also, I find in my work and the work of my colleagues that we are grabbing libraries from electrical engineering, computer vision, big data and more. People in these fields mostly have libraries in Python, not R.

However, the main advantage of Python over R in this field is workflow. The workflow with R tended to be that you used Perl/Python for data cleaning, preparation and database work, because R was too slow and awkward for large, complicated datasets (though this is getting better). You then built the statistical model in R, taking advantage of its libraries. Afterwards, the R model was rewritten in C for speed, control, interfacing, parallelization and error handling in production.

Python can handle this full workflow from start to finish. All the inter-connectivity steps surrounding the main research projects are much more robust, and a lot of development time is saved when using the same language throughout. Also, with Pandas even the core research portion and the data handling are now easier and cleaner, in my opinion.

In general, if you are focusing only on advanced statistics/data-mining time series research, then R and Python with Pandas are interchangeable, at least for now. However, it sounds from your question like you are also worried about inter-connectivity and architecture, and for that Python is far superior.

Edit for 2018: It's amazing how much easier it is to get into data munging in Python these days compared to when I first wrote this. If you would like to check out Python/Pandas without any fuss, try Anaconda.


13

For data analysis, particularly for large data analysis projects, most of the top quant hedge funds and a lot of the banks are using Python (over R) for a couple of reasons, but many still have bits and pieces of R for specific packages or functions (I work at a bank and interface with quite a few quant hedge funds on data analysis):

  1. Earlier, Python 2 used to have a lot of backward compatibility issues, but Python 3 is more stable between versions. Even Pandas versions since 0.13 are very stable between versions. No one wants to use a language for which they will have to revisit and rewrite significant amounts of code sometime in the future.

  2. People needed the same code to run on both Linux and Windows. Installing and compiling packages in Python can be a super pain, whether on Linux or Windows. A lot of people did not want to start any new project in Python 2, since sometime in the future they would need to move to Python 3, so they stuck with R for quite a while. Also, for a while Python 3 was available only with the WinPython distro, and WinPython used to work only on Windows. Anaconda, which is the leading Python distro for Linux (& Mac), came out with Python 3 support sometime in 2014, which then caused a huge migration.

Advantages of Python (vs R):

(i) Raw speed is the biggest motive (allowing you to do way more statistical data analysis in the same time)

(ii) Pandas can read csv files very fast (one of the reasons why many folks moved from Matlab to R at some point)

(iii) Cython is more flexible than Rcpp (at least in my experience)

(iv) You can organize code files neatly into logical directories and classes within files (classes in R are an oversight), and the project looks much better

(v) As of 2015, PyCharm is a significantly better IDE than RStudio (although RStudio is better than Spyder). Tools matter

Disadvantages of Python (vs R):

(i) The big issue with Pandas used to be that it didn't have its own binary data format. R's RData format is a huge edge. PyData's HDF5-based storage is not easily compressible and gives a lot of errors every now and then, and for big data it was a hindrance. Pickle and other formats just didn't cut it. After years of Python-vs-R exploration, most ended up writing their own custom binary data format (to store Pandas data frames) or using significant modifications of PostgreSQL for big data storage.

Statistical packages are generally great with both languages.

I have projects in R that took 4 hours to run every day (overnight). Now, in Python, they take a total of 20 minutes (with much less use of Cython code than of Rcpp code in R). That's the speed difference for you.

To answer your question:

  • acquire, store, maintain, read, clean time series: Python is better

  • perform basic statistics on time series, advanced statistical models such as multivariate regression analyses, etc.: both Python and R

  • performing mathematical computations (Fourier transforms, PDE solver, PCA), visualization of data (static and dynamic): both Python and R

  • pricing derivatives (application of pricing models such as interest rate models): both Python and R

  • interconnectivity (with Excel, servers, UI): Python is better
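
To make the first task in the list concrete, here is a minimal sketch of reading and cleaning a small time series with pandas. The CSV content, the column names `timestamp` and `price`, and the helper `load_clean_series` are all made up for illustration; real feeds will need their own cleaning rules.

```python
import io

import pandas as pd

# Invented sample data: out-of-order rows, a duplicated date, a missing price.
RAW_CSV = """timestamp,price
2016-01-04,100.0
2016-01-05,
2016-01-05,
2016-01-07,102.5
2016-01-06,101.0
"""


def load_clean_series(csv_text: str) -> pd.Series:
    """Read a CSV into a date-indexed price series and apply basic cleaning."""
    df = pd.read_csv(
        io.StringIO(csv_text),
        parse_dates=["timestamp"],
        index_col="timestamp",
    )
    # Keep the first row for each duplicated timestamp, then sort by date.
    df = df[~df.index.duplicated(keep="first")].sort_index()
    # Fill gaps with the last known price (one possible policy, an assumption).
    return df["price"].ffill()


if __name__ == "__main__":
    print(load_clean_series(RAW_CSV))
```

Forward-filling is only one choice here; dropping the missing rows or interpolating may be more appropriate depending on the data.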


6

For the tasks listed, both Python and R perform very well. There are some packages in Python not in R and vice versa. My solution for this is to simply call R from Python. This allows for the best of both worlds.

It is also important to note I do not write any R code other than calling an R library from Python.

Note also that calling Python from R does not work equally well across all major OSes.


47

My deal is HFT so what I care about is

  1. read/load data from file or DB quickly in memory
  2. perform very efficient data-munging operations (group,transform)
  3. visualize easily the data

I think it is pretty clear that 3. goes to R: graphics, ggplot2 and others allow you to plot anything from scratch with little effort.

About 1. and 2., I am amazed, reading the previous posts, to see that people are advocating Python based on pandas and that no one cites data.table. data.table is a fantastic package that allows blazing fast grouping/transforming of tables with tens of millions of rows. From this benchmark you can see that data.table is several times faster than pandas and much more stable (pandas tends to crash on massive tables).

Example

R) library(data.table)
R) DT = data.table(x=rnorm(2e7),y=rnorm(2e7),z=sample(letters,2e7,replace=T))
R) tables()
     NAME       NROW NCOL  MB COLS  KEY
[1,] DT   20,000,000    3 458 x,y,z    
Total: 458MB
R) system.time(DT[,.(sum(x),mean(y)),.(z)])
   user  system elapsed 
  0.226   0.037   0.264 

R) setkey(DT,z)
R) system.time(DT[,.(sum(x),mean(y)),.(z)])
  user  system elapsed 
  0.118   0.022   0.140 
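
For readers who want to compare against pandas, the same grouped aggregation can be sketched as follows. This is an illustration of the equivalent operation only, not a benchmark: the helper names `make_frame` and `sum_x_mean_y_by_z` are made up, the row count is much smaller than above, and no timing claim is implied.

```python
import string

import numpy as np
import pandas as pd


def make_frame(n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Build a frame shaped like DT above: two normal columns and a letter key."""
    rng = np.random.default_rng(seed)
    letters = np.array(list(string.ascii_lowercase))
    return pd.DataFrame(
        {
            "x": rng.standard_normal(n_rows),
            "y": rng.standard_normal(n_rows),
            "z": rng.choice(letters, size=n_rows),
        }
    )


def sum_x_mean_y_by_z(df: pd.DataFrame) -> pd.DataFrame:
    """Counterpart of DT[, .(sum(x), mean(y)), .(z)]: one row per value of z."""
    return df.groupby("z").agg(sum_x=("x", "sum"), mean_y=("y", "mean"))


if __name__ == "__main__":
    df = make_frame(200_000)
    print(sum_x_mean_y_by_z(df).head())
```

To reproduce a timing comparison, the call could be wrapped with `time.perf_counter()` on a frame of the same size as the R example.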

Then there is speed. As I work in HFT, neither R nor Python can be used in production. But the Rcpp package allows you to write efficient C++ code and integrate it into R trivially (literally adding 2 lines). I doubt R is fading, given the number of new packages created every day and the momentum the language has...

EDIT 2018-07

A few years later I am amazed by how the R ecosystem has evolved. For in-memory computation you get unmatched tools, from fst for blazing fast binary read/write to fork or cluster parallelism in one-liners. C++ integration is incredibly easy with Rcpp. You get interactive graphics with classics like plotly, and crazy features like ggplotly (which just makes your ggplot2 interactive). Having tried Python with pandas, I honestly do not understand how there could even be a match. The syntax is clunky and performance is poor; I must be too used to R, I guess. Another thing that is really missing in Python is literate programming: nothing comes close to rmarkdown (the best I could find in Python was Jupyter, but that does not even come close). With all the fuss surrounding the R vs Python language war, I realize that the vast majority of people are simply uninformed: they do not know what data.table is, or that it has nothing to do with a data.frame, and they do not know that R fully supports tensorflow and keras... To conclude, I think both tools can do everything, and it seems that the Python language has very good PR...


3

The major advantage of Python (w/ pandas) over R is that Python supports OOP (object-oriented programming). It makes sense to organize a large code base using a hierarchy of classes. Python also supports the notion of polymorphism so that we can use well-known design patterns (e.g., Strategy, Observer, etc.) in our code.
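
As a rough illustration of the Strategy pattern mentioned above, here is a minimal sketch. The class names (`SignalStrategy`, `Momentum`, `MeanReversion`, `Backtester`) and the toy trading rules are invented for this example and are not taken from any library.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class SignalStrategy(ABC):
    """Common interface: map a window of prices to a position."""

    @abstractmethod
    def position(self, prices: Sequence[float]) -> int:
        """Return +1 (long), -1 (short) or 0 (flat)."""


class Momentum(SignalStrategy):
    """Go with the move over the window."""

    def position(self, prices: Sequence[float]) -> int:
        if len(prices) < 2:
            return 0
        change = prices[-1] - prices[0]
        return (change > 0) - (change < 0)  # sign of the change


class MeanReversion(SignalStrategy):
    """Bet against the deviation of the last price from the window mean."""

    def position(self, prices: Sequence[float]) -> int:
        if not prices:
            return 0
        deviation = prices[-1] - sum(prices) / len(prices)
        return (deviation < 0) - (deviation > 0)  # opposite sign of the deviation


class Backtester:
    """Context object: works with any SignalStrategy via polymorphism."""

    def __init__(self, strategy: SignalStrategy) -> None:
        self.strategy = strategy

    def run(self, prices: Sequence[float], window: int) -> List[int]:
        """Evaluate the strategy on every rolling window of the given length."""
        return [
            self.strategy.position(prices[i : i + window])
            for i in range(len(prices) - window + 1)
        ]


if __name__ == "__main__":
    prices = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(Backtester(Momentum()).run(prices, window=3))
    print(Backtester(MeanReversion()).run(prices, window=3))
```

The `Backtester` never needs to know which rule it is running, so new strategies can be added without touching it.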


28

Instead of wild guesses about R's/Python's future in the community, here are some facts:

The following query on StackExchange Data Explorer counts the number of questions that have <r> or <python> tags. If you scroll down on one of the three webpages provided below, you can see a graph with data on a monthly basis. You can easily run this query on databases for other sites as well (just go to "Switch sites" right below the query).

stats http://data.stackexchange.com/stats/query/350129/r-versus-python-tags#graph

stack http://data.stackexchange.com/stackoverflow/query/350129/r-versus-python-tags#graph

quant http://data.stackexchange.com/quant/query/350129/r-versus-python-tags#graph

The results:

  • In absolute terms, R has more hits for both stats.stackexchange.com and quant.stackexchange.com (the latter having very few data points). Python has more hits for stackoverflow.com.

  • In relative terms, the gap between R and python is closing for stackoverflow.com (ratio approx 1 to 3 at the moment). The ratio between R and python tags on stats.stackexchange.com is more or less stable since mid/end 2013 (roughly a factor 10 or a little above).

I really do think that the tag statistics in the stackexchange universe are a good indicator of the current interest in a particular programming language - probably even more so for its future popularity.

All-in-all, I am confident that the present data makes a strong case against Matt Wolf's hypothesis that "R might be obsolete in 3-4 years". ;)


Update: So now it's been 6 months since my initial answer. We still have to wait another 2.5-3.5 years to definitely see whether R has become obsolete. :) In the meantime, a quick addition due to Matt Wolf's comment. Here are variations of the above queries that give you the tag ratios (that's what I have been referring to in the second point of my answer). All ratios are python tags divided by R tags.
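
For clarity, the quotient these queries report can be sketched in a few lines. The function name `tag_ratio` and the monthly counts in the example are invented placeholders, not numbers from the Data Explorer.

```python
from typing import Dict


def tag_ratio(python_counts: Dict[str, int], r_counts: Dict[str, int]) -> Dict[str, float]:
    """Monthly Py/R ratio: python-tagged questions divided by r-tagged questions.

    Months in which R has no questions are skipped to avoid dividing by zero.
    """
    ratios = {}
    for month, r_n in r_counts.items():
        if r_n > 0:
            ratios[month] = python_counts.get(month, 0) / r_n
    return ratios


if __name__ == "__main__":
    # Placeholder counts for illustration only (not real query results).
    python_counts = {"month-1": 30, "month-2": 20, "month-3": 5}
    r_counts = {"month-1": 400, "month-2": 500, "month-3": 0}
    for month, ratio in tag_ratio(python_counts, r_counts).items():
        print(month, round(ratio, 3))
```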

stats

http://data.stackexchange.com/stats/query/421036/r-versus-python-tags-quotient-py-r#graph

I do not see a clear trend here. The Py/R ratio is around 0.07 (there was a spike to 0.095 in November though). Since mid 2013, the ratio varies between 0.04 and 0.11. So I would call it relatively stable.

SO

http://data.stackexchange.com/stackoverflow/query/421032/r-versus-python-tags-quotient-py-r#graph

There was indeed a short term trend in favor of Python since Jul 15 (Py/R ratio went from 3.1 to 3.5). So the statement that "R is closing the gap wrt the Py/R ratio" could be called obsolete at the moment.

quant

http://data.stackexchange.com/quant/query/421042/r-versus-python-tags-quotient-py-r#graph

Still very noisy. Python did seem to catch up a little bit the last few months. But hard to tell with that little data.


6

Also in the high frequency / medium frequency field here.

I received a "mixed" consensus regarding the use of R and its prevalence in the field (specifically HFT). Speaking with someone who works in the equity option industry at a relatively small proprietary firm in San Francisco, I was told, "R is a legacy language".

However, speaking with someone who formerly was leading a HFT team at Goldman Sachs, I was told it is still the best language for time series analysis, statistics and especially latency sensitive projects. For libraries, the following were mentioned:

  1. Quantmod (See Quantmod)
  2. Caret (See Caret)
  3. Zoo (See Zoo)
  4. XTS (See XTS)
  5. highfrequency (See highfrequency: tools for high frequency data analysis)
  6. The popular open source QuantLib library also has an R version, which can be found here.

And to reiterate other answers to this question: given how heavily dependent the HFT field is on speed, R cannot be integrated into production HFT systems. However, R's C++ package (Rcpp, mentioned in other answers) is a popular tool which makes the integration with an HFT system both practical and easy.

I would not say R is dying, but it also does not have a monopoly on data analysis in the field of quantitative finance in general. Python and MATLAB are of great use in this field as well (I seem to be in a minority in my use of MATLAB, but it is great).