
Machine Learning: End-to-End guide for Java developers

ISBN: 9781788622219

by Richard M. Reese, Jennifer L. Reese, Boštjan Kaluža, Dr. Uday Kamath, Krishna Choppella

October 2017

Intermediate to advanced

1159 pages

26h 10m

English

Packt Publishing


Table of Contents
Machine Learning: End-to-End guide for Java developers
Credits
Preface
What this learning path covers
What you need for this learning path
Who this learning path is for
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Module 1

1. Getting Started with Data Science
Problems solved using data science
Understanding the data science problem-solving approach
Using Java to support data science
Acquiring data for an application
The importance and process of cleaning data
Visualizing data to enhance understanding
The use of statistical methods in data science
Machine learning applied to data science
Using neural networks in data science
Deep learning approaches
Performing text analysis
Visual and audio analysis
Improving application performance using parallel techniques
Assembling the pieces
Summary
2. Data Acquisition
Understanding the data formats used in data science applications
Overview of CSV data
Overview of spreadsheets
Overview of databases
Overview of PDF files
Overview of JSON
Overview of XML
Overview of streaming data
Overview of audio/video/images in Java
Data acquisition techniques
Using the HttpURLConnection class
Web crawlers in Java
Creating your own web crawler
Using the crawler4j web crawler
Web scraping in Java
Using API calls to access common social media sites
Using OAuth to authenticate users
Handling Twitter
Handling Wikipedia
Handling Flickr
Handling YouTube
Searching by keyword
Summary
3. Data Cleaning
Handling data formats
Handling CSV data
Handling spreadsheets
Handling Excel spreadsheets
Handling PDF files
Handling JSON
Using JSON streaming API
Using the JSON tree API
The nitty gritty of cleaning text
Using Java tokenizers to extract words
Java core tokenizers
Third-party tokenizers and libraries
Transforming data into a usable form
Simple text cleaning
Removing stop words
Finding words in text
Finding and replacing text
Data imputation
Subsetting data
Sorting text
Data validation
Validating data types
Validating dates
Validating e-mail addresses
Validating ZIP codes
Validating names
Cleaning images
Changing the contrast of an image
Smoothing an image
Brightening an image
Resizing an image
Converting images to different formats
Summary
4. Data Visualization
Understanding plots and graphs
Visual analysis goals
Creating index charts
Creating bar charts
Using country as the category
Using decade as the category
Creating stacked graphs
Creating pie charts
Creating scatter charts
Creating histograms
Creating donut charts
Creating bubble charts
Summary
5. Statistical Data Analysis Techniques
Working with mean, mode, and median
Calculating the mean
Using simple Java techniques to find mean
Using Java 8 techniques to find mean
Using Google Guava to find mean
Using Apache Commons to find mean
Calculating the median
Using simple Java techniques to find median
Using Apache Commons to find the median
Calculating the mode
Using ArrayLists to find multiple modes
Using a HashMap to find multiple modes
Using Apache Commons to find multiple modes
Standard deviation
Sample size determination
Hypothesis testing
Regression analysis
Using simple linear regression
Using multiple regression
Summary
6. Machine Learning
Supervised learning techniques
Decision trees
Decision tree types
Decision tree libraries
Using a decision tree with a book dataset
Testing the book decision tree
Support vector machines
Using an SVM for camping data
Testing individual instances
Bayesian networks
Using a Bayesian network
Unsupervised machine learning
Association rule learning
Using association rule learning to find buying relationships
Reinforcement learning
Summary
7. Neural Networks
Training a neural network
Getting started with neural network architectures
Understanding static neural networks
A basic Java example
Understanding dynamic neural networks
Multilayer perceptron networks
Building the model
Evaluating the model
Predicting other values
Saving and retrieving the model
Learning vector quantization
Self-Organizing Maps
Using a SOM
Displaying the SOM results
Additional network architectures and algorithms
The k-Nearest Neighbors algorithm
Instantaneously trained networks
Spiking neural networks
Cascading neural networks
Holographic associative memory
Backpropagation and neural networks
Summary
8. Deep Learning
Deeplearning4j architecture
Acquiring and manipulating data
Reading in a CSV file
Configuring and building a model
Using hyperparameters in ND4J
Instantiating the network model
Training a model
Testing a model
Deep learning and regression analysis
Preparing the data
Setting up the class
Reading and preparing the data
Building the model
Evaluating the model
Restricted Boltzmann Machines
Reconstruction in an RBM
Configuring an RBM
Deep autoencoders
Building an autoencoder in DL4J
Configuring the network
Building and training the network
Saving and retrieving a network
Specialized autoencoders
Convolutional networks
Building the model
Evaluating the model
Recurrent Neural Networks
Summary
9. Text Analysis
Implementing named entity recognition
Using OpenNLP to perform NER
Identifying location entities
Classifying text
Word2Vec and Doc2Vec
Classifying text by labels
Classifying text by similarity
Understanding tagging and POS
Using OpenNLP to identify POS
Understanding POS tags
Extracting relationships from sentences
Using OpenNLP to extract relationships
Sentiment analysis
Downloading and extracting the Word2Vec model
Building our model and classifying text
Summary
10. Visual and Audio Analysis
Text-to-speech
Using FreeTTS
Getting information about voices
Gathering voice information
Understanding speech recognition
Using CMUSphinx to convert speech to text
Obtaining more detail about the words
Extracting text from an image
Using Tess4j to extract text
Identifying faces
Using OpenCV to detect faces
Classifying visual data
Creating a Neuroph Studio project for classifying visual images
Training the model
Summary
11. Mathematical and Parallel Techniques for Data Analysis
Implementing basic matrix operations
Using GPUs with DeepLearning4j
Using map-reduce
Using Apache's Hadoop to perform map-reduce
Writing the map method
Writing the reduce method
Creating and executing a new Hadoop job
Various mathematical libraries
Using the jblas API
Using the Apache Commons math API
Using the ND4J API
Using OpenCL
Using Aparapi
Creating an Aparapi application
Using Aparapi for matrix multiplication
Using Java 8 streams
Understanding Java 8 lambda expressions and streams
Using Java 8 to perform matrix multiplication
Using Java 8 to perform map-reduce
Summary
12. Bringing It All Together
Defining the purpose and scope of our application
Understanding the application's architecture
Data acquisition using Twitter
Understanding the TweetHandler class
Extracting data for a sentiment analysis model
Building the sentiment model
Processing the JSON input
Cleaning data to improve our results
Removing stop words
Performing sentiment analysis
Analysing the results
Other optional enhancements
Summary
2. Module 2
1. Applied Machine Learning Quick Start
Machine learning and data science
What kind of problems can machine learning solve?
Applied machine learning workflow
Data and problem definition
Measurement scales
Data collection
Find or observe data
Generate data
Sampling traps
Data pre-processing
Data cleaning
Fill missing values
Remove outliers
Data transformation
Data reduction
Unsupervised learning
Find similar items
Euclidean distances
Non-Euclidean distances
The curse of dimensionality
Clustering
Supervised learning
Classification
Decision tree learning
Probabilistic classifiers
Kernel methods
Artificial neural networks
Ensemble learning
Evaluating classification
Precision and recall
ROC curves
Regression
Linear regression
Evaluating regression
Mean squared error
Mean absolute error
Correlation coefficient
Generalization and evaluation
Underfitting and overfitting
Train and test sets
Cross-validation
Leave-one-out validation
Stratification
Summary
2. Java Libraries and Platforms for Machine Learning
The need for Java
Machine learning libraries
Weka
Java machine learning
Apache Mahout
Apache Spark
Deeplearning4j
MALLET
Comparing libraries
Building a machine learning application
Traditional machine learning architecture
Dealing with big data
Big data application architecture
Summary
3. Basic Algorithms – Classification, Regression, and Clustering
Before you start
Classification
Data
Loading data
Feature selection
Learning algorithms
Classify new data
Evaluation and prediction error metrics
Confusion matrix
Choosing a classification algorithm
Regression
Loading the data
Analyzing attributes
Building and evaluating regression model
Linear regression
Regression trees
Tips to avoid common regression problems
Clustering
Clustering algorithms
Evaluation
Summary
4. Customer Relationship Prediction with Ensembles
Customer relationship database
Challenge
Dataset
Evaluation
Basic naive Bayes classifier baseline
Getting the data
Loading the data
Basic modeling
Evaluating models
Implementing naive Bayes baseline
Advanced modeling with ensembles
Before we start
Data pre-processing
Attribute selection
Model selection
Performance evaluation
Summary
5. Affinity Analysis
Market basket analysis
Affinity analysis
Association rule learning
Basic concepts
Database of transactions
Itemset and rule
Support
Confidence
Apriori algorithm
FP-growth algorithm
The supermarket dataset
Discover patterns
Apriori
FP-growth
Other applications in various areas
Medical diagnosis
Protein sequences
Census data
Customer relationship management
IT Operations Analytics
Summary
6. Recommendation Engine with Apache Mahout
Basic concepts
Key concepts
User-based and item-based analysis
Approaches to calculate similarity
Collaborative filtering
Content-based filtering
Hybrid approach
Exploitation versus exploration
Getting Apache Mahout
Configuring Mahout in Eclipse with the Maven plugin
Building a recommendation engine
Book ratings dataset
Loading the data
Loading data from file
Loading data from database
In-memory database
Collaborative filtering
User-based filtering
Item-based filtering
Adding custom rules to recommendations
Evaluation
Online learning engine
Content-based filtering
Summary
7. Fraud and Anomaly Detection
Suspicious and anomalous behavior detection
Unknown-unknowns
Suspicious pattern detection
Anomalous pattern detection
Analysis types
Pattern analysis
Transaction analysis
Plan recognition
Fraud detection of insurance claims
Dataset
Modeling suspicious patterns
Vanilla approach
Dataset rebalancing
Anomaly detection in website traffic
Dataset
Anomaly detection in time series data
Histogram-based anomaly detection
Loading the data
Creating histograms
Density based k-nearest neighbors
Summary
8. Image Recognition with Deeplearning4j
Introducing image recognition
Neural networks
Perceptron
Feedforward neural networks
Autoencoder
Restricted Boltzmann machine
Deep convolutional networks
Image classification
Deeplearning4j
Getting DL4J
MNIST dataset
Loading the data
Building models
Building a single-layer regression model
Building a deep belief network
Build a Multilayer Convolutional Network
Summary
9. Activity Recognition with Mobile Phone Sensors
Introducing activity recognition
Mobile phone sensors
Activity recognition pipeline
The plan
Collecting data from a mobile phone
Installing Android Studio
Loading the data collector
Feature extraction
Collecting training data
Building a classifier
Reducing spurious transitions
Plugging the classifier into a mobile app
Summary
10. Text Mining with Mallet – Topic Modeling and Spam Detection
Introducing text mining
Topic modeling
Text classification
Installing Mallet
Working with text data
Importing data
Importing from directory
Importing from file
Pre-processing text data
Topic modeling for BBC news
BBC dataset
Modeling
Evaluating a model
Reusing a model
Saving a model
Restoring a model
E-mail spam detection
E-mail spam dataset
Feature generation
Training and testing
Model performance
Summary
11. What is Next?
Machine learning in real life
Noisy data
Class unbalance
Feature selection is hard
Model chaining
Importance of evaluation
Getting models into production
Model maintenance
Standards and markup languages
CRISP-DM
SEMMA methodology
Predictive Model Markup Language
Machine learning in the cloud
Machine learning as a service
Web resources and competitions
Datasets
Online courses
Competitions
Websites and blogs
Venues and conferences
Summary
A. References
3. Module 3
1. Machine Learning Review
Machine learning – history and definition
What is not machine learning?
Machine learning – concepts and terminology
Machine learning – types and subtypes
Datasets used in machine learning
Machine learning applications
Practical issues in machine learning
Machine learning – roles and process
Roles
Process
Machine learning – tools and datasets
Datasets
Summary
2. Practical Approach to Real-World Supervised Learning
Formal description and notation
Data quality analysis
Descriptive data analysis
Basic label analysis
Basic feature analysis
Visualization analysis
Univariate feature analysis
Categorical features
Continuous features
Multivariate feature analysis
Data transformation and preprocessing
Feature construction
Handling missing values
Outliers
Discretization
Data sampling
Is sampling needed?
Undersampling and oversampling
Stratified sampling
Training, validation, and test set
Feature relevance analysis and dimensionality reduction
Feature search techniques
Feature evaluation techniques
Filter approach
Univariate feature selection
Information theoretic approach
Statistical approach
Multivariate feature selection
Minimal redundancy maximal relevance (mRMR)
Correlation-based feature selection (CFS)
Wrapper approach
Embedded approach
Model building
Linear models
Linear Regression
Algorithm input and output
How does it work?
Advantages and limitations
Naïve Bayes
Algorithm input and output
How does it work?
Advantages and limitations
Logistic Regression
Algorithm input and output
How does it work?
Advantages and limitations
Non-linear models
Decision Trees
Algorithm inputs and outputs
How does it work?
Advantages and limitations
K-Nearest Neighbors (KNN)
Algorithm inputs and outputs
How does it work?
Advantages and limitations
Support vector machines (SVM)
Algorithm inputs and outputs
How does it work?
Advantages and limitations
Ensemble learning and meta learners
Bootstrap aggregating or bagging
Algorithm inputs and outputs
How does it work?
Random Forest
Advantages and limitations
Boosting
Algorithm inputs and outputs
How does it work?
Advantages and limitations
Model assessment, evaluation, and comparisons
Model assessment
Model evaluation metrics
Confusion matrix and related metrics
ROC and PRC curves
Gain charts and lift curves
Model comparisons
Comparing two algorithms
McNemar's Test
Paired-t test
Wilcoxon signed-rank test
Comparing multiple algorithms
ANOVA test
Friedman's test
Case Study – Horse Colic Classification
Business problem
Machine learning mapping
Data analysis
Label analysis
Features analysis
Supervised learning experiments
Weka experiments
Sample end-to-end process in Java
Weka experimenter and model selection
RapidMiner experiments
Visualization analysis
Feature selection
Model process flow
Model evaluation metrics
Evaluation on Confusion Metrics
ROC Curves, Lift Curves, and Gain Charts
Results, observations, and analysis
Summary
References
3. Unsupervised Machine Learning Techniques
Issues in common with supervised learning
Issues specific to unsupervised learning
Feature analysis and dimensionality reduction
Notation
Linear methods
Principal component analysis (PCA)
Inputs and outputs
How does it work?
Advantages and limitations
Random projections (RP)
Inputs and outputs
How does it work?
Advantages and limitations
Multidimensional Scaling (MDS)
Inputs and outputs
How does it work?
Advantages and limitations
Nonlinear methods
Kernel Principal Component Analysis (KPCA)
Inputs and outputs
How does it work?
Advantages and limitations
Manifold learning
Inputs and outputs
How does it work?
Advantages and limitations
Clustering
Clustering algorithms
k-Means
Inputs and outputs
How does it work?
Advantages and limitations
DBSCAN
Inputs and outputs
How does it work?
Advantages and limitations
Mean shift
Inputs and outputs
How does it work?
Advantages and limitations
Expectation maximization (EM) or Gaussian mixture modeling (GMM)
Input and output
How does it work?
Advantages and limitations
Hierarchical clustering
Input and output
How does it work?
Advantages and limitations
Self-organizing maps (SOM)
Inputs and outputs
How does it work?
Advantages and limitations
Spectral clustering
Inputs and outputs
How does it work?
Advantages and limitations
Affinity propagation
Inputs and outputs
How does it work?
Advantages and limitations
Clustering validation and evaluation
Internal evaluation measures
Notation
R-Squared
Dunn's Indices
Davies-Bouldin index
Silhouette's index
External evaluation measures
Rand index
F-Measure
Normalized mutual information index
Outlier or anomaly detection
Outlier algorithms
Statistical-based
Inputs and outputs
How does it work?
Advantages and limitations
Distance-based methods
Inputs and outputs
How does it work?
Advantages and limitations
Density-based methods
Inputs and outputs
How does it work?
Advantages and limitations
Clustering-based methods
Inputs and outputs
How does it work?
Advantages and limitations
High-dimensional-based methods
Inputs and outputs
How does it work?
Advantages and limitations
One-class SVM
Inputs and outputs
How does it work?
Advantages and limitations
Outlier evaluation techniques
Supervised evaluation
Unsupervised evaluation
Real-world case study
Tools and software
Business problem
Machine learning mapping
Data collection
Data quality analysis
Data sampling and transformation
Feature analysis and dimensionality reduction
PCA
Random projections
ISOMAP
Observations on feature analysis and dimensionality reduction
Clustering models, results, and evaluation
Observations and clustering analysis
Outlier models, results, and evaluation
Observations and analysis
Summary
References
4. Semi-Supervised and Active Learning
Semi-supervised learning
Representation, notation, and assumptions
Semi-supervised learning techniques
Self-training SSL
Inputs and outputs
How does it work?
Advantages and limitations
Co-training SSL or multi-view SSL
Inputs and outputs
How does it work?
Advantages and limitations
Cluster and label SSL
Inputs and outputs
How does it work?
Advantages and limitations
Transductive graph label propagation
Inputs and outputs
How does it work?
Advantages and limitations
Transductive SVM (TSVM)
Inputs and outputs
How does it work?
Advantages and limitations
Case study in semi-supervised learning
Tools and software
Business problem
Machine learning mapping
Data collection
Data quality analysis
Data sampling and transformation
Datasets and analysis
Feature analysis results
Experiments and results
Analysis of semi-supervised learning
Active learning
Representation and notation
Active learning scenarios
Active learning approaches
Uncertainty sampling
How does it work?
Least confident sampling
Smallest margin sampling
Label entropy sampling
Advantages and limitations
Version space sampling
Query by disagreement (QBD)
How does it work?
Query by Committee (QBC)
How does it work?
Advantages and limitations
Data distribution sampling
How does it work?
Expected model change
Expected error reduction
Variance reduction
Density weighted methods
Advantages and limitations
Case study in active learning
Tools and software
Business problem
Machine learning mapping
Data Collection
Data sampling and transformation
Feature analysis and dimensionality reduction
Models, results, and evaluation
Pool-based scenarios
Stream-based scenarios
Analysis of active learning results
Summary
References
5. Real-Time Stream Machine Learning
Assumptions and mathematical notations
Basic stream processing and computational techniques
Stream computations
Sliding windows
Sampling
Concept drift and drift detection
Data management
Partial memory
Full memory
Detection methods
Monitoring model evolution
Widmer and Kubat
Drift Detection Method or DDM
Early Drift Detection Method or EDDM
Monitoring distribution changes
Welch's t test
Kolmogorov-Smirnov's test
CUSUM and Page-Hinckley test
Adaptation methods
Explicit adaptation
Implicit adaptation
Incremental supervised learning
Modeling techniques
Linear algorithms
Online linear models with loss functions
Inputs and outputs
How does it work?
Advantages and limitations
Online Naïve Bayes
Inputs and outputs
How does it work?
Advantages and limitations
Non-linear algorithms
Hoeffding trees or very fast decision trees (VFDT)
Inputs and outputs
How does it work?
Advantages and limitations
Ensemble algorithms
Weighted majority algorithm
Inputs and outputs
How does it work?
Advantages and limitations
Online Bagging algorithm
Inputs and outputs
How does it work?
Advantages and limitations
Online Boosting algorithm
Inputs and outputs
How does it work?
Advantages and limitations
Validation, evaluation, and comparisons in online setting
Model validation techniques
Prequential evaluation
Holdout evaluation
Controlled permutations
Evaluation criteria
Comparing algorithms and metrics
Incremental unsupervised learning using clustering
Modeling techniques
Partition based
Online k-Means
Inputs and outputs
How does it work?
Advantages and limitations
Hierarchical based and micro clustering
Inputs and outputs
How does it work?
Advantages and limitations
Inputs and outputs
How does it work?
Advantages and limitations
Density based
Inputs and outputs
How does it work?
Advantages and limitations
Grid based
Inputs and outputs
How does it work?
Advantages and limitations
Validation and evaluation techniques
Key issues in stream cluster evaluation
Evaluation measures
Cluster Mapping Measures (CMM)
V-Measure
Other external measures
Unsupervised learning using outlier detection
Partition-based clustering for outlier detection
Inputs and outputs
How does it work?
Advantages and limitations
Distance-based clustering for outlier detection
Inputs and outputs
How does it work?
Exact Storm
Abstract-C
Direct Update of Events (DUE)
Micro Clustering based Algorithm (MCOD)
Approx Storm
Advantages and limitations
Validation and evaluation techniques
Case study in stream learning
Tools and software
Business problem
Machine learning mapping
Data collection
Data sampling and transformation
Feature analysis and dimensionality reduction
Models, results, and evaluation
Supervised learning experiments
Concept drift experiments
Clustering experiments
Outlier detection experiments
Analysis of stream learning results
Summary
References
6. Probabilistic Graph Modeling
Probability revisited
Concepts in probability
Conditional probability
Chain rule and Bayes' theorem
Random variables, joint, and marginal distributions
Marginal independence and conditional independence
Factors
Factor types
Distribution queries
Probabilistic queries
MAP queries and marginal MAP queries
Graph concepts
Graph structure and properties
Subgraphs and cliques
Path, trail, and cycles
Bayesian networks
Representation
Definition
Reasoning patterns
Causal or predictive reasoning
Evidential or diagnostic reasoning
Intercausal reasoning
Combined reasoning
Independencies, flow of influence, D-Separation, I-Map
Flow of influence
D-Separation
I-Map
Inference
Elimination-based inference
Variable elimination algorithm
Input and output
How does it work?
Advantages and limitations
Clique tree or junction tree algorithm
Input and output
How does it work?
Advantages and limitations
Propagation-based techniques
Belief propagation
Factor graph
Messaging in factor graph
Input and output
How does it work?
Advantages and limitations
Sampling-based techniques
Forward sampling with rejection
Input and output
How does it work?
Advantages and limitations
Learning
Learning parameters
Maximum likelihood estimation for Bayesian networks
Bayesian parameter estimation for Bayesian network
Prior and posterior using the Dirichlet distribution
Learning structures
Measures to evaluate structures
Methods for learning structures
Constraint-based techniques
Inputs and outputs
How does it work?
Advantages and limitations
Search and score-based techniques
Inputs and outputs
How does it work?
Advantages and limitations
Markov networks and conditional random fields
Representation
Parameterization
Gibbs parameterization
Factor graphs
Log-linear models
Independencies
Global
Pairwise Markov
Markov blanket
Inference
Learning
Conditional random fields
Specialized networks
Tree augmented network
Input and output
How does it work?
Advantages and limitations
Markov chains
Hidden Markov models
Most probable path in HMM
Posterior decoding in HMM
Tools and usage
OpenMarkov
Weka Bayesian Network GUI
Case study
Business problem
Machine learning mapping
Data sampling and transformation
Feature analysis
Models, results, and evaluation
Analysis of results
Summary
References
7. Deep Learning
Multi-layer feed-forward neural network
Inputs, neurons, activation function, and mathematical notation
Multi-layered neural network
Structure and mathematical notations
Activation functions in NN
Sigmoid function
Hyperbolic tangent ("tanh") function
Training neural network
Empirical risk minimization
Parameter initialization
Loss function
Gradients
Gradient at the output layer
Gradient at the Hidden Layer
Parameter gradient
Feed forward and backpropagation
How does it work?
Regularization
L2 regularization
L1 regularization
Limitations of neural networks
Vanishing gradients, local optimum, and slow training
Deep learning
Building blocks for deep learning
Rectified linear activation function
Restricted Boltzmann Machines
Definition and mathematical notation
Conditional distribution
Free energy in RBM
Training the RBM
Sampling in RBM
Contrastive divergence
Inputs and outputs
How does it work?
Persistent contrastive divergence
Autoencoders
Definition and mathematical notations
Loss function
Limitations of Autoencoders
Denoising Autoencoder
Unsupervised pre-training and supervised fine-tuning
Deep feed-forward NN
Input and outputs
How does it work?
Deep Autoencoders
Deep Belief Networks
Inputs and outputs
How does it work?
Deep learning with dropouts
Definition and mathematical notation
Inputs and outputs
How does it work?
Learning Training and testing with dropouts
Sparse coding
Convolutional Neural Network
Local connectivity
Parameter sharing
Discrete convolution
Pooling or subsampling
Normalization using ReLU
CNN Layers
Recurrent Neural Networks
Structure of Recurrent Neural Networks
Learning and associated problems in RNNs
Long Short Term Memory
Gated Recurrent Units
Case study
Tools and software
Business problem
Machine learning mapping
Data sampling and transformation
Feature analysis
Models, results, and evaluation
Basic data handling
Multi-layer perceptron
Parameters used for MLP
Code for MLP
Convolutional Network
Parameters used for ConvNet
Code for CNN
Variational Autoencoder
Parameters used for the Variational Autoencoder
Code for Variational Autoencoder
DBN
Parameter search using Arbiter
Results and analysis
Summary
References
8. Text Mining and Natural Language Processing
NLP, subfields, and tasks
Text categorization
Part-of-speech tagging (POS tagging)
Text clustering
Information extraction and named entity recognition
Sentiment analysis and opinion mining
Coreference resolution
Word sense disambiguation
Machine translation
Semantic reasoning and inferencing
Text summarization
Automating question and answers
Issues with mining unstructured data
Text processing components and transformations
Document collection and standardization
Inputs and outputs
How does it work?
Tokenization
Inputs and outputs
How does it work?
Stop words removal
Inputs and outputs
How does it work?
Stemming or lemmatization
Inputs and outputs
How does it work?
Local/global dictionary or vocabulary?
Feature extraction/generation
Lexical features
Character-based features
Word-based features
Part-of-speech tagging features
Taxonomy features
Syntactic features
Semantic features
Feature representation and similarity
Vector space model
Binary
Term frequency (TF)
Inverse document frequency (IDF)
Term frequency-inverse document frequency (TF-IDF)
Similarity measures
Euclidean distance
Cosine distance
Pairwise-adaptive similarity
Extended Jaccard coefficient
Dice coefficient
Feature selection and dimensionality reduction
Feature selection
Information theoretic techniques
Statistical-based techniques
Frequency-based techniques
Dimensionality reduction
Topics in text mining
Text categorization/classification
Topic modeling
Probabilistic latent semantic analysis (PLSA)
Input and output
How does it work?
Advantages and limitations
Text clustering
Feature transformation, selection, and reduction
Clustering techniques
Generative probabilistic models
Input and output
How does it work?
Advantages and limitations
Distance-based text clustering
Non-negative matrix factorization (NMF)
Input and output
How does it work?
Advantages and limitations
Evaluation of text clustering
Named entity recognition
Hidden Markov models for NER
Input and output
How does it work?
Advantages and limitations
Maximum entropy Markov models for NER
Input and output
How does it work?
Advantages and limitations
Deep learning and NLP
Tools and usage
Mallet
KNIME
Topic modeling with mallet
Business problem
Machine Learning mapping
Data collection
Data sampling and transformation
Feature analysis and dimensionality reduction
Models, results, and evaluation
Analysis of text processing results
Summary
References
9. Big Data Machine Learning – The Final Frontier
What are the characteristics of Big Data?
Big Data Machine Learning
General Big Data framework
Big Data cluster deployment frameworks
Hortonworks Data Platform
Cloudera CDH
Amazon Elastic MapReduce
Microsoft Azure HDInsight
Data acquisition
Publish-subscribe frameworks
Source-sink frameworks
SQL frameworks
Message queueing frameworks
Custom frameworks
Data storage
HDFS
NoSQL
Key-value databases
Document databases
Columnar databases
Graph databases
Data processing and preparation
Hive and HQL
Spark SQL
Amazon Redshift
Real-time stream processing
Machine Learning
Visualization and analysis
Batch Big Data Machine Learning
H2O as Big Data Machine Learning platform
H2O architecture
Machine learning in H2O
Tools and usage
Case study
Business problem
Machine Learning mapping
Data collection
Data sampling and transformation
Experiments, results, and analysis
Feature relevance and analysis
Evaluation on test data
Analysis of results
Spark MLlib as Big Data Machine Learning platform
Spark architecture
Machine Learning in MLlib
Tools and usage
Experiments, results, and analysis
k-Means
k-Means with PCA
Bisecting k-Means (with PCA)
Gaussian Mixture Model
Random Forest
Analysis of results
Real-time Big Data Machine Learning
SAMOA as a real-time Big Data Machine Learning framework
SAMOA architecture
Machine Learning algorithms
Tools and usage
Experiments, results, and analysis
Analysis of results
The future of Machine Learning
Summary
References
A. Linear Algebra
Vector
Scalar product of vectors
Matrix
Transpose of a matrix
Matrix addition
Scalar multiplication
Matrix multiplication
Properties of matrix product
Linear transformation
Matrix inverse
Eigendecomposition
Positive definite matrix
Singular value decomposition (SVD)
B. Probability
Axioms of probability
Bayes' theorem
Density estimation
Mean
Variance
Standard deviation
Gaussian standard deviation
Covariance
Correlation coefficient
Binomial distribution
Poisson distribution
Gaussian distribution
Central limit theorem
Error propagation
D. Bibliography
Index

Content preview from Machine Learning: End-to-End guide for Java developers

Text processing components and transformations

In this section, we will discuss some common preprocessing and transformation steps that appear in most text mining processes. The general idea is to convert documents into structured datasets whose features or attributes most Machine Learning algorithms can use to perform different kinds of learning.

We will briefly describe some of the most widely used techniques in the next section. Different text mining applications might use different pieces or variations of the components shown in the following figure:

Figure 10: Text Processing components and the flow
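
As a rough illustration of turning a document into structured features, the following Java sketch lowercases a string, splits it into word tokens, drops stop words, and counts raw term frequencies. The class name, the tiny stop-word list, and the sample sentence are assumptions made for this example and do not come from the book; a real pipeline would typically also include steps such as standardization and stemming.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper class, used only for illustration.
final class TextFeatureSketch {

    // Deliberately tiny, made-up stop-word list; real lists are much longer.
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "of", "to", "in", "is");

    // Lowercase the document, split on runs of non-letter characters,
    // drop empty tokens and stop words, then count each remaining token.
    static Map<String, Long> termFrequencies(String document, Set<String> stopWords) {
        return Arrays.stream(document.toLowerCase(Locale.ROOT).split("[^\\p{L}]+"))
                .filter(token -> !token.isEmpty())
                .filter(token -> !stopWords.contains(token))
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        String document = "The cat sat on the mat, and the cat slept.";
        // Each key is a feature (a term); each value is its count in the document.
        System.out.println(termFrequencies(document, STOP_WORDS));
    }
}
```

Running `main` prints `{cat=2, mat=1, on=1, sat=1, slept=1}`. A sorted map is used here only so that the output order is stable; the counts themselves are the structured representation a learning algorithm would consume.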

Document collection and standardization ...
