Hands-On Machine Learning for Algorithmic Trading

Explore effective trading strategies in real-world markets using NumPy, spaCy, pandas, scikit-learn, and Keras

Key Features

  • Implement machine learning algorithms to build, train, and validate algorithmic models
  • Create your own algorithmic design process to apply probabilistic machine learning approaches to trading decisions
  • Develop neural networks for algorithmic trading to perform time series forecasting and smart analytics

Book Description

The explosive growth of digital data has boosted the demand for expertise in trading strategies that use machine learning (ML). This book enables you to use a broad range of supervised and unsupervised algorithms to extract signals from a wide variety of data sources and create powerful investment strategies.

This book shows how to access market, fundamental, and alternative data via API or web scraping and offers a framework to evaluate alternative data. You'll practice the ML workflow from model design, loss metric definition, and parameter tuning to performance evaluation in a time series context. You will understand ML algorithms such as Bayesian and ensemble methods and manifold learning, and will know how to train and tune these models using pandas, statsmodels, sklearn, PyMC3, xgboost, lightgbm, and catboost. This book also teaches you how to extract features from text data using spaCy, classify news and assign sentiment scores, and use gensim to model topics and learn word embeddings from financial reports. You will also build and evaluate neural networks, including RNNs and CNNs, using Keras and PyTorch to exploit unstructured data for sophisticated strategies.

Finally, you will apply transfer learning to satellite images to predict economic activity and use reinforcement learning to build agents that learn to trade in the OpenAI Gym.
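
To give a feel for what "performance evaluation in a time series context" involves, here is a minimal sketch of walk-forward splitting, in which every training window ends before its test window begins so that no future data leaks into the fit. The helper name walk_forward_splits and its parameters are hypothetical and chosen for illustration only; they are not taken from the book or from any library.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_samples: int, n_splits: int, test_size: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs where training data always precedes test data.

    Hypothetical helper for illustration; not the book's implementation.
    """
    if n_splits < 1 or test_size < 1:
        raise ValueError("n_splits and test_size must be positive")
    # The test windows tile the end of the sample; everything earlier is training data.
    first_test_start = n_samples - n_splits * test_size
    if first_test_start < 1:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(0, test_start))  # expanding training window
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Ten chronologically ordered observations, three test windows of two observations each.
splits = list(walk_forward_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print(f"train: {train_idx[0]}-{train_idx[-1]}  test: {test_idx}")
```

An expanding training window is only one possible design; a fixed-length rolling window, or a gap between the training and test periods, may suit overlapping labels better.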

What you will learn

  • Implement machine learning techniques to solve investment and trading problems
  • Leverage market, fundamental, and alternative data to research alpha factors
  • Design and fine-tune supervised, unsupervised, and reinforcement learning models
  • Optimize portfolio risk and performance using pandas, NumPy, and scikit-learn
  • Integrate machine learning models into a live trading strategy on Quantopian
  • Evaluate strategies using reliable backtesting methodologies for time series
  • Design and evaluate deep neural networks using Keras, PyTorch, and TensorFlow
  • Work with reinforcement learning for trading strategies in the OpenAI Gym

Who this book is for

Hands-On Machine Learning for Algorithmic Trading is for data analysts, data scientists, and Python developers, as well as investment analysts and portfolio managers working within the finance and investment industry. If you want to perform efficient algorithmic trading by developing smart investment strategies using machine learning algorithms, this is the book for you. Some understanding of Python and machine learning techniques is mandatory.

Table of Contents

  1. Title Page
  2. Copyright and Credits
    1. Hands-On Machine Learning for Algorithmic Trading
  3. About Packt
    1. Why subscribe?
    2. Packt.com
  4. Contributors
    1. About the author
    2. About the reviewers
    3. Packt is searching for authors like you
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Download the color images
      3. Conventions used
    4. Get in touch
      1. Reviews
  6. Machine Learning for Trading
    1. How to read this book
      1. What to expect
      2. Who should read this book
      3. How the book is organized
        1. Part 1 – the framework – from data to strategy design
        2. Part 2 – ML fundamentals
        3. Part 3 – natural language processing
        4. Part 4 – deep and reinforcement learning
      4. What you need to succeed
        1. Data sources
        2. GitHub repository
        3. Python libraries
    2. The rise of ML in the investment industry
      1. From electronic to high-frequency trading
      2. Factor investing and smart beta funds
      3. Algorithmic pioneers outperform humans at scale
        1. ML-driven funds attract $1 trillion AUM
        2. The emergence of quantamental funds
        3. Investments in strategic capabilities
      4. ML and alternative data
        1. Crowdsourcing of trading algorithms
    3. Design and execution of a trading strategy
      1. Sourcing and managing data
      2. Alpha factor research and evaluation
      3. Portfolio optimization and risk management
      4. Strategy backtesting
    4. ML and algorithmic trading strategies
      1. Use cases of ML for trading
        1. Data mining for feature extraction
        2. Supervised learning for alpha factor creation and aggregation
        3. Asset allocation
        4. Testing trade ideas
        5. Reinforcement learning
    5. Summary
  7. Market and Fundamental Data
    1. How to work with market data
      1. Market microstructure
        1. Marketplaces
        2. Types of orders
      2. Working with order book data
        1. The FIX protocol
        2. Nasdaq TotalView-ITCH Order Book data
          1. Parsing binary ITCH messages
          2. Reconstructing trades and the order book
        3. Regularizing tick data
          1. Tick bars
          2. Time bars
          3. Volume bars
          4. Dollar bars
      3. API access to market data
        1. Remote data access using pandas
          1. Reading HTML tables
          2. pandas-datareader for market data
          3. The Investors Exchange
        2. Quantopian
        3. Zipline
        4. Quandl
        5. Other market-data providers
    2. How to work with fundamental data
      1. Financial statement data
        1. Automated processing – XBRL
        2. Building a fundamental data time series
          1. Extracting the financial statements and notes dataset
          2. Retrieving all quarterly Apple filings
          3. Building a price/earnings time series
      2. Other fundamental data sources
        1. pandas_datareader – macro and industry data
    3. Efficient data storage with pandas
    4. Summary
  8. Alternative Data for Finance
    1. The alternative data revolution
      1. Sources of alternative data
        1. Individuals
        2. Business processes
        3. Sensors
          1. Satellites
          2. Geolocation data
    2. Evaluating alternative datasets
      1. Evaluation criteria
        1. Quality of the signal content
          1. Asset classes
          2. Investment style
          3. Risk premiums
          4. Alpha content and quality
        2. Quality of the data
          1. Legal and reputational risks
          2. Exclusivity
          3. Time horizon
          4. Frequency
          5. Reliability
        3. Technical aspects
          1. Latency
          2. Format
    3. The market for alternative data
      1. Data providers and use cases
        1. Social sentiment data
          1. Dataminr
          2. StockTwits
          3. RavenPack
        2. Satellite data
        3. Geolocation data
        4. Email receipt data
    4. Working with alternative data
      1. Scraping OpenTable data
        1. Extracting data from HTML using requests and BeautifulSoup
        2. Introducing Selenium – using browser automation
        3. Building a dataset of restaurant bookings
        4. One step further – Scrapy and Splash
      2. Earnings call transcripts
        1. Parsing HTML using regular expressions
    5. Summary
  9. Alpha Factor Research
    1. Engineering alpha factors
      1. Important factor categories
        1. Momentum and sentiment factors
          1. Rationale
          2. Key metrics
        2. Value factors
          1. Rationale
          2. Key metrics
        3. Volatility and size factors
          1. Rationale
          2. Key metrics
        4. Quality factors
          1. Rationale
          2. Key metrics
      2. How to transform data into factors
        1. Useful pandas and NumPy methods
          1. Loading the data
          2. Resampling from daily to monthly frequency
          3. Computing momentum factors
          4. Using lagged returns and different holding periods
          5. Compute factor betas
        2. Built-in Quantopian factors
        3. TA-Lib
    2. Seeking signals – how to use zipline
      1. The architecture – event-driven trading simulation
      2. A single alpha factor from market data
      3. Combining factors from diverse data sources
    3. Separating signal and noise – how to use alphalens
      1. Creating forward returns and factor quantiles
      2. Predictive performance by factor quantiles
      3. The information coefficient
      4. Factor turnover
    4. Alpha factor resources
      1. Alternative algorithmic trading libraries
    5. Summary
  10. Strategy Evaluation
    1. How to build and test a portfolio with zipline
      1. Scheduled trading and portfolio rebalancing
    2. How to measure performance with pyfolio
      1. The Sharpe ratio
      2. The fundamental law of active management
      3. In and out-of-sample performance with pyfolio
        1. Getting pyfolio input from alphalens
        2. Getting pyfolio input from a zipline backtest
        3. Walk-forward testing out-of-sample returns
        4. Summary performance statistics
        5. Drawdown periods and factor exposure
        6. Modeling event risk
    3. How to avoid the pitfalls of backtesting
      1. Data challenges
        1. Look-ahead bias
        2. Survivorship bias
        3. Outlier control
        4. Unrepresentative period
      2. Implementation issues
        1. Mark-to-market performance
        2. Trading costs
        3. Timing of trades
      3. Data-snooping and backtest-overfitting
        1. The minimum backtest length and the deflated SR
        2. Optimal stopping for backtests
    4. How to manage portfolio risk and return
      1. Mean-variance optimization
        1. How it works
        2. The efficient frontier in Python
        3. Challenges and shortcomings
      2. Alternatives to mean-variance optimization
        1. The 1/n portfolio
        2. The minimum-variance portfolio
        3. Global portfolio optimization – the Black-Litterman approach
        4. How to size your bets – the Kelly rule
          1. The optimal size of a bet
          2. Optimal investment – single asset
          3. Optimal investment – multiple assets
      3. Risk parity
      4. Risk factor investment
      5. Hierarchical risk parity
    5. Summary
  11. The Machine Learning Process
    1. Learning from data
      1. Supervised learning
      2. Unsupervised learning
        1. Applications
        2. Cluster algorithms
        3. Dimensionality reduction
      3. Reinforcement learning
    2. The machine learning workflow
      1. Basic walkthrough – k-nearest neighbors
      2. Frame the problem – goals and metrics
        1. Prediction versus inference
          1. Causal inference
        2. Regression problems
        3. Classification problems
          1. Receiver operating characteristics and the area under the curve
          2. Precision-recall curves
      3. Collecting and preparing the data
      4. Explore, extract, and engineer features
        1. Using information theory to evaluate features
      5. Selecting an ML algorithm
      6. Design and tune the model
        1. The bias-variance trade-off
        2. Underfitting versus overfitting
        3. Managing the trade-off
        4. Learning curves
      7. How to use cross-validation for model selection
        1. How to implement cross-validation in Python
          1. Basic train-test split
        2. Cross-validation
          1. Using a hold-out test set
          2. KFold iterator
          3. Leave-one-out CV
          4. Leave-P-Out CV
          5. ShuffleSplit
      8. Parameter tuning with scikit-learn
        1. Validation curves with yellowbrick
        2. Learning curves
        3. Parameter tuning using GridSearchCV and pipeline
      9. Challenges with cross-validation in finance
        1. Time series cross-validation with sklearn
        2. Purging, embargoing, and combinatorial CV
    3. Summary
  12. Linear Models
    1. Linear regression for inference and prediction
    2. The multiple linear regression model
      1. How to formulate the model
      2. How to train the model
        1. Least squares
        2. Maximum likelihood estimation
        3. Gradient descent
      3. The Gauss-Markov theorem
      4. How to conduct statistical inference
      5. How to diagnose and remedy problems
        1. Goodness of fit
        2. Heteroskedasticity
        3. Serial correlation
        4. Multicollinearity
      6. How to run linear regression in practice
        1. OLS with statsmodels
        2. Stochastic gradient descent with sklearn
    3. How to build a linear factor model
      1. From the CAPM to the Fama-French five-factor model
      2. Obtaining the risk factors
      3. Fama-MacBeth regression
    4. Shrinkage methods: regularization for linear regression
      1. How to hedge against overfitting
      2. How ridge regression works
      3. How lasso regression works
    5. How to use linear regression to predict returns
      1. Prepare the data
        1. Universe creation and time horizon
        2. Target return computation
        3. Alpha factor selection and transformation
        4. Data cleaning – missing data
        5. Data exploration
        6. Dummy encoding of categorical variables
        7. Creating forward returns
      2. Linear OLS regression using statsmodels
        1. Diagnostic statistics
      3. Linear OLS regression using sklearn
        1. Custom time series cross-validation
        2. Select features and target
        3. Cross-validating the model
        4. Test results – information coefficient and RMSE
      4. Ridge regression using sklearn
        1. Tuning the regularization parameters using cross-validation
        2. Cross-validation results and ridge coefficient paths
        3. Top 10 coefficients
      5. Lasso regression using sklearn
        1. Cross-validated information coefficient and Lasso Path
    6. Linear classification
      1. The logistic regression model
        1. Objective function
        2. The logistic function
        3. Maximum likelihood estimation
      2. How to conduct inference with statsmodels
      3. How to use logistic regression for prediction
        1. How to predict price movements using sklearn
    7. Summary
  13. Time Series Models
    1. Analytical tools for diagnostics and feature extraction
      1. How to decompose time series patterns
      2. How to compute rolling window statistics
        1. Moving averages and exponential smoothing
      3. How to measure autocorrelation
      4. How to diagnose and achieve stationarity
        1. Time series transformations
        2. How to diagnose and address unit roots
        3. Unit root tests
      5. How to apply time series transformations
    2. Univariate time series models
      1. How to build autoregressive models
        1. How to identify the number of lags
        2. How to diagnose model fit
      2. How to build moving average models
        1. How to identify the number of lags
        2. The relationship between AR and MA models
      3. How to build ARIMA models and extensions
        1. How to identify the number of AR and MA terms
        2. Adding features – ARMAX
        3. Adding seasonal differencing – SARIMAX
      4. How to forecast macro fundamentals
      5. How to use time series models to forecast volatility
        1. The autoregressive conditional heteroskedasticity (ARCH) model
        2. Generalizing ARCH – the GARCH model
          1. Selecting the lag order
        3. How to build a volatility-forecasting model
    3. Multivariate time series models
      1. Systems of equations
      2. The vector autoregressive (VAR) model
      3. How to use the VAR model for macro fundamentals forecasts
      4. Cointegration – time series with a common trend
        1. Testing for cointegration
      5. How to use cointegration for a pairs-trading strategy
    4. Summary
  14. Bayesian Machine Learning
    1. How Bayesian machine learning works
      1. How to update assumptions from empirical evidence
      2. Exact inference: Maximum a Posteriori estimation
        1. How to select priors
        2. How to keep inference simple – conjugate priors
        3. How to dynamically estimate the probabilities of asset price moves
      3. Approximate inference: stochastic versus deterministic approaches
        1. Sampling-based stochastic inference
        2. Markov chain Monte Carlo sampling
          1. Gibbs sampling
          2. Metropolis-Hastings sampling
          3. Hamiltonian Monte Carlo – going NUTS
        3. Variational Inference
          1. Automatic Differentiation Variational Inference (ADVI)
    2. Probabilistic programming with PyMC3
      1. Bayesian machine learning with Theano
      2. The PyMC3 workflow
        1. Model definition – Bayesian logistic regression
          1. Visualization and plate notation
          2. The Generalized Linear Models module
          3. MAP inference
        2. Approximate inference – MCMC
          1. Credible intervals
        3. Approximate inference – variational Bayes
        4. Model diagnostics
          1. Convergence
          2. Posterior Predictive Checks
        5. Prediction
      3. Practical applications
        1. Bayesian Sharpe ratio and performance comparison
          1. Model definition
          2. Performance comparison
        2. Bayesian time series models
        3. Stochastic volatility models
    3. Summary
  15. Decision Trees and Random Forests
    1. Decision trees
      1. How trees learn and apply decision rules
      2. How to use decision trees in practice
        1. How to prepare the data
        2. How to code a custom cross-validation class
        3. How to build a regression tree
        4. How to build a classification tree
          1. How to optimize for node purity
          2. How to train a classification tree
        5. How to visualize a decision tree
        6. How to evaluate decision tree predictions
        7. Feature importance
      3. Overfitting and regularization
        1. How to regularize a decision tree
        2. Decision tree pruning
      4. How to tune the hyperparameters
        1. GridSearchCV for decision trees
        2. How to inspect the tree structure
        3. Learning curves
      5. Strengths and weaknesses of decision trees
    2. Random forests
      1. Ensemble models
      2. How bagging lowers model variance
        1. Bagged decision trees
      3. How to build a random forest
      4. How to train and tune a random forest
        1. Feature importance for random forests
        2. Out-of-bag testing
      5. Pros and cons of random forests
    3. Summary
  16. Gradient Boosting Machines
    1. Adaptive boosting
      1. The AdaBoost algorithm
      2. AdaBoost with sklearn
    2. Gradient boosting machines
      1. How to train and tune GBM models
        1. Ensemble size and early stopping
        2. Shrinkage and learning rate
        3. Subsampling and stochastic gradient boosting
      2. How to use gradient boosting with sklearn
        1. How to tune parameters with GridSearchCV
        2. Parameter impact on test scores
        3. How to test on the holdout set
    3. Fast scalable GBM implementations
      1. How algorithmic innovations drive performance
        1. Second-order loss function approximation
        2. Simplified split-finding algorithms
        3. Depth-wise versus leaf-wise growth
        4. GPU-based training
        5. DART – dropout for trees
        6. Treatment of categorical features
        7. Additional features and optimizations
      2. How to use XGBoost, LightGBM, and CatBoost
        1. How to create binary data formats
        2. How to tune hyperparameters
          1. Objectives and loss functions
          2. Learning parameters
          3. Regularization
          4. Randomized grid search
        3. How to evaluate the results
          1. Cross-validation results across models
    4. How to interpret GBM results
      1. Feature importance
      2. Partial dependence plots
      3. SHapley Additive exPlanations
        1. How to summarize SHAP values by feature
        2. How to use force plots to explain a prediction
        3. How to analyze feature interaction
    5. Summary
  17. Unsupervised Learning
    1. Dimensionality reduction
      1. Linear and non-linear algorithms
      2. The curse of dimensionality
      3. Linear dimensionality reduction
        1. Principal Component Analysis
          1. Visualizing PCA in 2D
          2. The assumptions made by PCA
          3. How the PCA algorithm works
          4. PCA based on the covariance matrix
          5. PCA using Singular Value Decomposition
          6. PCA with sklearn
        2. Independent Component Analysis
          1. ICA assumptions
          2. The ICA algorithm
          3. ICA with sklearn
        3. PCA for algorithmic trading
          1. Data-driven risk factors
          2. Eigen portfolios
      4. Manifold learning
        1. t-SNE
        2. UMAP
    2. Clustering
      1. k-Means clustering
        1. Evaluating cluster quality
        2. Hierarchical clustering
          1. Visualization – dendrograms
        3. Density-based clustering
          1. DBSCAN
          2. Hierarchical DBSCAN
        4. Gaussian mixture models
          1. The expectation-maximization algorithm
        5. Hierarchical risk parity
    3. Summary
  18. Working with Text Data
    1. How to extract features from text data
      1. Challenges of NLP
      2. The NLP workflow
        1. Parsing and tokenizing text data
        2. Linguistic annotation
        3. Semantic annotation
        4. Labeling
      3. Use cases
    2. From text to tokens – the NLP pipeline
      1. NLP pipeline with spaCy and textacy
        1. Parsing, tokenizing, and annotating a sentence
        2. Batch-processing documents
        3. Sentence boundary detection
        4. Named entity recognition
        5. N-grams
        6. spaCy's streaming API
        7. Multi-language NLP
      2. NLP with TextBlob
        1. Stemming
        2. Sentiment polarity and subjectivity
    3. From tokens to numbers – the document-term matrix
      1. The BoW model
        1. Measuring the similarity of documents
      2. Document-term matrix with sklearn
        1. Using CountVectorizer
          1. Visualizing vocabulary distribution
          2. Finding the most similar documents
        2. TfidfTransformer and TfidfVectorizer
          1. The effect of smoothing
          2. How to summarize news articles using TfidfVectorizer
        3. Text preprocessing – review
    4. Text classification and sentiment analysis
      1. The Naive Bayes classifier
        1. Bayes' theorem refresher
        2. The conditional independence assumption
      2. News article classification
        1. Training and evaluating multinomial Naive Bayes classifier
      3. Sentiment analysis
        1. Twitter data
          1. Multinomial Naive Bayes
          2. Comparison with TextBlob sentiment scores
        2. Business reviews – the Yelp dataset challenge
          1. Benchmark accuracy
          2. Multinomial Naive Bayes model
          3. One-versus-all logistic regression
          4. Combining text and numerical features
          5. Multinomial logistic regression
          6. Gradient-boosting machine
    5. Summary
  19. Topic Modeling
    1. Learning latent topics: goals and approaches
      1. From linear algebra to hierarchical probabilistic models
    2. Latent semantic indexing
      1. How to implement LSI using sklearn
      2. Pros and cons
    3. Probabilistic latent semantic analysis
      1. How to implement pLSA using sklearn
    4. Latent Dirichlet allocation
      1. How LDA works
        1. The Dirichlet distribution
        2. The generative model
        3. Reverse-engineering the process
      2. How to evaluate LDA topics
        1. Perplexity
        2. Topic coherence
      3. How to implement LDA using sklearn
      4. How to visualize LDA results using pyLDAvis
      5. How to implement LDA using gensim
      6. Topic modeling for earnings calls
        1. Data preprocessing
        2. Model training and evaluation
        3. Running experiments
      7. Topic modeling for Yelp business reviews
    5. Summary
  20. Word Embeddings
    1. How word embeddings encode semantics
      1. How neural language models learn usage in context
      2. The Word2vec model – learn embeddings at scale
        1. Model objective – simplifying the softmax
        2. Automatic phrase detection
      3. How to evaluate embeddings – vector arithmetic and analogies
      4. How to use pre-trained word vectors
        1. GloVe – global vectors for word representation
      5. How to train your own word vector embeddings
      6. The Skip-Gram architecture in Keras
        1. Noise-contrastive estimation
        2. The model components
        3. Visualizing embeddings using TensorBoard
    2. Word vectors from SEC filings using gensim
      1. Preprocessing
        1. Automatic phrase detection
      2. Model training
        1. Model evaluation
        2. Performance impact of parameter settings
    3. Sentiment analysis with Doc2vec
      1. Training Doc2vec on Yelp sentiment data
        1. Create input data
    4. Bonus – Word2vec for translation
    5. Summary
  21. Next Steps
    1. Key takeaways and lessons learned
      1. Data is the single most important ingredient
        1. Quality control
        2. Data integration
      2. Domain expertise helps unlock value in data
        1. Feature engineering and alpha factor research
      3. ML is a toolkit for solving problems with data
      4. Model diagnostics help speed up optimization
        1. Making do without a free lunch
        2. Managing the bias-variance trade-off
        3. Define targeted model objectives
        4. The optimization verification test
      5. Beware of backtest overfitting
      6. How to gain insights from black-box models
    2. ML for trading in practice
      1. Data management technologies
        1. Database systems
        2. Big Data technologies – Hadoop and Spark
      2. ML tools
      3. Online trading platforms
        1. Quantopian
        2. QuantConnect
        3. QuantRocket
    3. Conclusion
  22. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think

Product Information

  • Title: Hands-On Machine Learning for Algorithmic Trading
  • Author(s): Stefan Jansen
  • Release date: December 2018
  • Publisher(s): Packt Publishing
  • ISBN: 9781789346411