Interlude: My Personal Toolkit
Every data scientist has their own set of preferred programming languages, libraries, and other tools. You will have to decide what works best for you. To give you a data point, though, here is how I work when I do data analysis:
- My main programming language for data science is Python. I know it, I love it, and I can do just about anything with it. I also use it for production coding whenever I am choosing the tools and there's no good reason not to.
- I use Pandas as my main data analysis library, and I supplement it with scikit-learn for machine learning.
- I usually use matplotlib for visualizations, but I'm looking to branch out. In particular, Bokeh is an extremely promising recent arrival on the visualization scene. It is designed particularly for making interactive graphs that you view in a web browser.
- A lot of people use an Integrated Development Environment (IDE) for Python, such as Spyder or PyCharm. Personally though, I'm a little old school: I open up Python from the command line, and I edit my scripts in a plain text editor such as Sublime or TextWrangler. I'm considering switching to a browser-based notebook though, such as Jupyter.
- I do most of my work on a Mac, but that's just because it's what my employers tend to use. I usually do hobby projects using Linux, and I'm hoping to do more work on Windows in the future because Microsoft has a famously great set of tools for developers.
- When I'm doing Big Data I use PySpark, which I'll talk ...