Chapter 2

Scalability and Cost Evaluation of Incremental Data Processing Using Amazon’s Hadoop Service

Xing Wu, Yan Liu, and Ian Gorton

Abstract

Based on the MapReduce model and the Hadoop Distributed File System (HDFS), Hadoop enables scalable, fault-tolerant distributed processing of large data sets across clusters. Many data-intensive applications involve continuous, incremental updates to their data. Understanding how well a Hadoop platform scales, and at what cost, when handling small and independent updates to data sets sheds light on the design of scalable and cost-effective data-intensive applications. In this chapter, we introduce a motivating movie recommendation application implemented in the MapReduce model and deployed on Amazon Elastic ...
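As background for the MapReduce model the chapter builds on, the sketch below simulates Hadoop's map, shuffle/sort, and reduce phases locally for a hypothetical movie-rating aggregation job (computing each movie's average rating). The record format and function names are illustrative assumptions, not the chapter's actual implementation:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Hypothetical input record: "user_id\tmovie_id\trating"
    user, movie, rating = line.split("\t")
    yield movie, float(rating)

def reducer(movie, ratings):
    # Aggregate all ratings shuffled to this key into an average
    ratings = list(ratings)
    yield movie, sum(ratings) / len(ratings)

def run_job(lines):
    # Simulate Hadoop's pipeline: map -> shuffle/sort by key -> reduce
    pairs = sorted((kv for line in lines for kv in mapper(line)),
                   key=itemgetter(0))
    results = {}
    for movie, group in groupby(pairs, key=itemgetter(0)):
        for key, avg in reducer(movie, (r for _, r in group)):
            results[key] = avg
    return results

data = ["u1\tm1\t4", "u2\tm1\t5", "u1\tm2\t3"]
print(run_job(data))  # -> {'m1': 4.5, 'm2': 3.0}
```

On a real Hadoop cluster, `mapper` and `reducer` would run as distributed tasks over HDFS blocks; the local simulation only illustrates the data flow. Note that a full recomputation like this is exactly what makes small, incremental updates costly, since the entire data set is reprocessed each run.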
