SAS/R/Python: The Three Power Tools of Big Data Analysis
"The more I use Python, the more I love SAS. Moving from SAS to Python feels like remarrying: the longer it lasts, the more I miss my ex-wife!"
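The big data industry is booming, and there are now many professional data analysis tools. The three most widely used are SAS, R, and Python.
According to a 2018 user survey covering more than 1,000 organizations, usage was SAS 34%, R 33%, and Python 32%, essentially a three-way split. In banking, telecommunications, and government, however, SAS is used more heavily: it is the oldest of the three, and these industries have relied on it for so long that they are strongly dependent on it.
In terms of price, SAS is commercial software and expensive: a single license costs more than $5,000 per year. R and Python, by contrast, are open source. When it comes to reliability and the ability to handle big data, however, SAS is far ahead of the pack. SAS is like a regular army, while R and Python are more of a motley crew pieced together from many open-source packages. Stephanie Glen has made a very comprehensive comparison and summary of the three tools; see the table below.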
“Best Languages for Data Science and Statistics in One Picture” (Courtesy of Stephanie Glen).
Next, a brief look at the characteristics of the three tools.
Data Processing
R is software for statistics and mathematics. It has many scientific data structures built in, such as vectors, matrices, and arrays. Python is a general-purpose programming language, not only a data-processing tool, but it does have libraries such as pandas and NumPy for manipulating and analyzing big data.
SAS has the longest history of the three, dating back to the mid-1960s. It was designed specifically for data manipulation, processing, and statistical analysis. The name SAS stands for Statistical Analysis System.
When processing data, R and Python by default work in memory: the data set is loaded into RAM and processed there. The advantage is speed. However, if the data set is larger than your computer's memory, the job can grind to a halt or crash.
In contrast, SAS uses the hard drive as temporary storage when processing big data, so it is more reliable and stable, though it may be slower.
So every coin has two sides. Nothing in this world is perfect!
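The trade-off between in-memory and disk-backed processing can be illustrated with a small Python sketch. This is only an analogy built on the standard library, not a description of how SAS, R, or pandas work internally. The function names, the column name "amount", and the generated demo file are all hypothetical and exist only for illustration.

```python
import csv
import os
import tempfile


def mean_in_memory(path, column):
    """Load every row into a list first, then compute the mean."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))  # the whole file is held in memory at once
    values = [float(row[column]) for row in rows]
    return sum(values) / len(values)


def mean_streaming(path, column):
    """Read one row at a time, keeping only a running total and a count."""
    total = 0.0
    count = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):  # rows are pulled from disk lazily
            total += float(row[column])
            count += 1
    return total / count


def write_demo_file(n_rows=1000):
    """Create a small hypothetical CSV file so the sketch is runnable."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "amount"])
        for i in range(n_rows):
            writer.writerow([i, i * 0.5])
    return path


if __name__ == "__main__":
    demo_path = write_demo_file()
    try:
        print(mean_in_memory(demo_path, "amount"))
        print(mean_streaming(demo_path, "amount"))
    finally:
        os.remove(demo_path)
```

Both functions return the same answer. The first needs enough memory for the whole file, while the second needs memory for only one row at a time, at the cost of rereading the disk whenever another pass is required. In pandas, passing chunksize to read_csv gives a comparable way to work through a large file piece by piece.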
Learning Difficulty
SAS is the easiest one. It is an integrated package that is straightforward, easy to understand, and convenient to use. In terms of learning difficulty, based on my personal experience, I would score the three as follows: