R in a Nutshell, 2nd Edition

by Joseph Adler

October 2012

Beginner to intermediate

721 pages

21h 38m

English

O'Reilly Media, Inc.

Content preview from R in a Nutshell, 2nd Edition

Optimizing Your R Code

Once you have figured out where your program is spending its time, you can focus on improving those areas. This section describes some common causes of poor performance and shows how to resolve them.

Using Vector Operations

R is a functional language with built-in support for vector operations. Whenever possible, you should use vector operations in your code and not write iterative algorithms. This section explains why.
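As a small illustration of what a vector operation looks like, here is a minimal sketch. The variable names and values are invented for this example and do not come from the text. The arithmetic operators apply to every element of a vector at once, so no explicit loop is needed.

```r
# Hypothetical example values, chosen only for illustration
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)

# Arithmetic operators work element by element on whole vectors
sums <- x + y        # adds corresponding elements of x and y
doubled <- 2 * x     # multiplies every element of x by 2

print(sums)
print(doubled)
```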

Iterative algorithms and vector operations

Let’s consider a simple problem: calculating a vector with the square of every integer between 1 and n. Here is a naive implementation:

> naive.vector.of.squares <- function(n) {
+   v <- 1:n
+   for (i in 1:n)
+     v[i] <- v[i]^2
+   v
+ }
> naive.vector.of.squares(10)
 [1]   1   4   9  16  25  36  49  64  81 100

How does the performance of this function vary with n? Let’s do a quick experiment:

> # 10,000 values
> system.time(naive.vector.of.squares(10000))
   user  system elapsed
  0.037   0.000   0.037
> # 10,000,000 values
> system.time(naive.vector.of.squares(10000000))
   user  system elapsed
 30.211   0.233  30.178

As you can see, the time required to compute the vector varies roughly linearly with the size of the vector (n): increasing n by a factor of 1,000 increased the elapsed time by a factor of about 800 (from 0.037 to 30.178 seconds). This makes sense: R is looping through all n elements in the vector and changing each element one at a time. (Note that R doesn’t actually copy the vector v repeatedly inside the loop; see “Objects Are Copied in Assignment Statements” for more about how this works.)
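To check the scaling on your own machine, a sketch along the following lines times the same function at several sizes. The sizes and the names `sizes`, `elapsed`, and `timings` are assumptions made for this illustration. The absolute timings depend on your hardware and R version and will probably differ from the ones printed above.

```r
# Same function as defined in the text
naive.vector.of.squares <- function(n) {
  v <- 1:n
  for (i in 1:n)
    v[i] <- v[i]^2
  v
}

# Sizes chosen for illustration; each is 10 times the previous one
sizes <- c(10000, 100000, 1000000)

# Elapsed seconds for one run at each size
elapsed <- sapply(sizes, function(n)
  system.time(naive.vector.of.squares(n))[["elapsed"]])

# One row per size, so the growth is easy to read
timings <- data.frame(n = sizes, elapsed = elapsed)
print(timings)
```

If the running time is roughly linear in n, each row's elapsed time should be about ten times the previous row's. Very short runs are noisy, so the smallest size may not follow the pattern closely.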

It turns out that there is a much better way to implement ...