Concepts and Basics of the SciencesPo Package

Daniel Marcelino <dmarcelino@live.com>

August 04, 2016

1 Overview

R is rapidly increasing popularity in scientific research; it has several advantages over other programming languages: it is free, it is fairly powerful, there is an excellent user community, and it was designed with the statistical analysis of data in mind. The purpose of this vignette is to introduce some of the function in the SciencesPo package, and how these can be used in data analysis workflows and reporting results.

The SciencesPo package is meant to provide primarily functions and algorithms for analyzing political behavior data, including measures of political fragmentation and seat allocation methods. In addition, it also equip the environment with some functions to do descriptive statistics, cross-tabulation, tests, and to render pre set publication-quality plots that will require only a minimum amount of fiddling details. Although this package is available for the general public, it meets my personal needs and tastes. Yours may be different.

The development of this package began in July of 2011. I found myself spending a great deal of time manipulating several datasets with a basket of standalone functions, but a very little time actually analyzing data. So, I thought this package could help researchers with interests alike. Since then, I have diligently labored to maintain this R library updated. The SciencesPo is currently hosted by the The R Foundation, and distributed over the internet via CRAN. You can find the package source code on Github, and you are welcome to contribute code via pull requests, or bug reports.

This document is organized as follows. In the next section, I addressed the basic aspects of setting up the package. Following, I discuss some of the basic data manipulation functions. Next, I show some of the basic functions for creating tables and cross-tabulations. Then, I present basic functions for computing distributions. Next, I address the basic functions for computing measures of political behavior. I end by introducing several plotting functions, including themes and color palettes available. Throughout the paper, I try to keep the commentary to a minimum so the user can easily breeze through this without having to digest my witty banter.

2 Learning by doing

This section introduces the basics of SciencesPo for conducting customary analysis; the package version I’m using is 1.4.1.

SciencesPo loads packages as needed and assumes that they are installed. Thus, the recomended form of installing the package to ensure that all the needed packages are also installed is by using the following statement:

install.packages("SciencesPo", dependencies = c("Depends", "Suggests"))

2.1 Loading and unloading SciencesPo

library('SciencesPo') 

## Do things ... 

detach('package:SciencesPo', unload = TRUE)

If you load it via loadNamespace('SciencesPo'), you can call unloadNamespace('SciencesPo').

Whenever you load the package, it will setup its own environment, including plotting themes, summary table styles, etc. Thus, some objects may be printed or plotted differently as you would have seen before loading SciencesPo. Also, the package is also operator (%>%) friendly, so much of the functions will work as expected in a chain like.

2.2 Vignette and dataset lists

To see a list of existing vignettes, type:

library("SciencesPo")

vignette(package = "SciencesPo")
no vignettes found

To see the collection of data included in the package, type:

data(package = "SciencesPo")

2.3 To search any topic within the package

Here are some examples that demonstrates the results of help.search(), or you can also use ?? to search for for all commands that have some strings related to the text within quotes. You can also restrict the search to a specific library as:

help.search("D'Hondt", package = 'SciencesPo')

The next example searches inside all R related sites for that “string”, and open the results in a web browser.

RSiteSearch("D'Hondt") 

3 Exploratory data functions

In this section, a samll selection of functions for conducting exploratory data analysis (EDA) is presented.

3.1 Frequency tables and cross-tables

There is a specific document covering one-way, two-way, and multiway cross-tabulations with accompanying independent tests, so the following examples will only introduce this topic. To describe an entire data.frame, there is the Describe function, which will produce a summary description table with the following features: variable names, labels, factor levels, and frequencies or summary statistics. For small datasets, the output can be used as the “table one” descriptive statistics in a data analysis publication report.

data("titanic")
Describe(titanic) 

Summary table
Dataset: titanic
------------------------------------------------------
Variable   Levels.or.Stats   Freq            Valid    
---------- ----------------- --------------- ---------
1.         1. 1st            1: 325 (14.8%)  2201/2201
CLASS      2. 2nd            2: 285 (12.9%)  (100.0%) 
           3. 3rd            3: 706 (32.1%)           
           4. crew           4: 885 (40.2%)           

2.         1. adult          1: 2092 (95%)   2201/2201
AGE        2. child          2: 109 (5%)     (100.0%) 

3.         1. female         1: 470 (21.4%)  2201/2201
SEX        2. male           2: 1731 (78.6%) (100.0%) 

4.         1. no             1: 1490 (67.7%) 2201/2201
SURVIVED   2. yes            2: 711 (32.3%)  (100.0%) 
------------------------------------------------------

To tabulate one variable responses, simply:

with(titanic, Crosstable(SURVIVED))

A more performant descriptive output can be obtained with the Freq command, which resembles the SPSS output.

Freq(titanic, SURVIVED) 

    Frequency  Valid Percent  Cum Percent
no       1490           67.7         67.7
yes       711           32.3        100.0
--------------------------------------------------------
Total     2201

Warning: Statistics may not be meaningful for factors!

         Mean           Std dev
     1.323035         0.4677422
      Minimum           Maximum
            1                 2
  Valid cases     Missing cases
         2201                 0

To add a second variable for a cross-tabulation:

with(titanic, Crosstable(SEX, CLASS))

=========================================
                  CLASS                  
       ---------------------------       
SEX     1st    2nd    3rd    crew  Total 
-----------------------------------------
female    145    106    196     23    470
          31%    23%    42%   4.9%   100%
male      180    179    510    862   1731
          10%    10%    29%  49.8%   100%
-----------------------------------------
Total     325    285    706    885   2201
          15%    13%    32%  40.2%   100%
=========================================

To delete table entries that are less relevant, switch it to FALSE and to TRUE otherwise. For instance, to not show column proportions, switch column to FALSE.

Crosstable(titanic, SEX, CLASS, column = FALSE) 

To add a third variable:

Crosstable(titanic, SEX, CLASS, SURVIVED) 

The Crosstable() function can produce tables up to 3 variables with some of the most common statistical tests.

with(titanic, Crosstable(SEX, SURVIVED, fisher=TRUE) )

===========================
         SURVIVED          
       -------------       
SEX      no    yes   Total 
---------------------------
female    126    344    470
          27%    73%   100%
male     1364    367   1731
          79%    21%   100%
---------------------------
Total    1490    711   2201
          68%    32%   100%
===========================


    Fisher's Exact Test for Count Data

data:  x
p-value < 0.00000000000000022
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.07734453 0.12536590
sample estimates:
odds ratio 
0.09869816 
with(titanic, Crosstable(SEX, CLASS, SURVIVED, chisq = TRUE))

=================================
                 SURVIVED          
               -------------       
SEX    CLASS   no    yes   Total 
---------------------------------
female 1st        4    141    145
               2.8%    97%   100%
       2nd       13     93    106
              12.3%    88%   100%
       3rd      106     90    196
              54.1%    46%   100%
       crew       3     20     23
              13.0%    87%   100%
       --------------------------
       Total    126    344    470
              26.8%    73%   100%
---------------------------------
male   1st      118     62    180
                66%    34%   100%
       2nd      154     25    179
                86%    14%   100%
       3rd      422     88    510
                83%    17%   100%
       crew     670    192    862
                78%    22%   100%
       --------------------------
       Total   1364    367   1731
                79%    21%   100%
---------------------------------
Total  1st      122    203    325
                38%    62%   100%
       2nd      167    118    285
                59%    41%   100%
       3rd      528    178    706
                75%    25%   100%
       crew     673    212    885
                76%    24%   100%
       --------------------------
       Total   1490    711   2201
                68%    32%   100%
=================================

SEX : female
Number of cases in table: 470 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 130.69, df = 3, p-value = 0.0000000000000000000000000003837

SEX : male  
Number of cases in table: 1731 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 29.852, df = 3, p-value = 0.000001483

3.2 Sampling and Stratify Data

3.3 Computing Exploratory Statistics

To demonstrate the output of these functions, below are the US presidents and their main opponents’ heights (cm). There is claim that height matters in American presidential elections.

require("SciencesPo")

data(stature)

3.3.1 Computing Method of Moments

These functions perform three types of skewness and kurtosis assessments in conformity with the popular e1071 package (Meyer et al. 2015):

3.3.1.1 Skewness of Data

data(stature)

attach(stature)

# Type 1:
Skewness(winner.height, type = 1)
[1] -0.5589481
# Type 2 
Skewness(winner.height, type = 2)
[1] -0.5741697
# Type 3, the default
Skewness(winner.height)
[1] -0.5443037

3.3.1.2 Kurtosis of Data

data(stature)
attach(stature)

# Type 1:
Kurtosis(winner.height, type = 1)
[1] -0.4117404
# Type 1:
Kurtosis(winner.height, type = 2)
[1] -0.3371491
# Type 3, the default
Kurtosis(winner.height)
[1] -0.5017599

3.3.2 Most Repeating Observations (Mode)

x <- sample (10, replace = TRUE)

Mode(x)
[1]  7 10

3.3.3 Standard Error and Confidence Intervals

data(stature)
attach(stature)

CI(winner.height, level=.95) # confidence interval
     Lower  Est. Mean      Upper Std. Error 
  178.6336   180.5614   182.4892     0.9623 
CI(winner.height, level=.95)@mean # get only the mean 
[1] 180.5614
CI(opponent.height, level=.95, na.rm = TRUE) # confidence interval
     Lower  Est. Mean      Upper Std. Error 
  177.6733   179.5000   181.3267     0.9119 
SE(winner.height) # std. error
[1] 0.9623285

3.3.4 Computing Winsorized Means

data(stature)
attach(stature)


Winsorize(winner.height)
[1] 180.5614
# replacing 35 outlier elements we get same stature values: 
Winsorize(winner.height, k=35)
[1] 183
Winsorize(opponent.height, k=35)
[1] 183

3.3.5 Computing Average Absolute Deviation

data(stature)
attach(stature)

AAD(winner.height) 
[1] 5.895968

4 Data Manipulation

4.1 Labeling

data(religiosity)

Label(religiosity$Religiosity) <- "Religiosity index"

4.2 Renaming columns

4.3 Recoding strings and numeric values

Used to recode values of a vector into new values.

require(SciencesPo)

table(iris$Species)

    setosa versicolor  virginica 
        50         50         50 
iris$Species <- Recode(iris$Species, "versicolor", 2)

table(iris$Species)

   setosa         2 virginica 
       50        50        50 

4.4 Safechars

By default, R converts character columns to factors. Instead of re-reading the data using , the function identifies which columns are currently factors, and convert them all to characters parsing the levels as strings.

require(SciencesPo)
str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","2","virginica": 1 1 1 1 1 1 1 1 1 1 ...
iris_2 = Safechars(iris)

str(iris_2)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...

4.5 Destring

This function converts factor variables to numeric, much like the way Stata does.

require(SciencesPo)

# Simulate some data (12 respondents x 2 items)
df <- data.frame(replicate(2, sample(1:5, 12, replace=TRUE)))

df <- data.frame(lapply(df, factor, ordered=TRUE, 
                          levels=1:5, 
                          labels=c("Strongly disagree","Disagree", "Neutral","Agree","Strongly Agree")))


print(df)
                  X1                X2
1              Agree    Strongly Agree
2            Neutral          Disagree
3              Agree          Disagree
4            Neutral             Agree
5           Disagree    Strongly Agree
6              Agree             Agree
7              Agree          Disagree
8            Neutral    Strongly Agree
9  Strongly disagree             Agree
10             Agree Strongly disagree
11 Strongly disagree          Disagree
12           Neutral          Disagree

By Destring it, we should get a numeric result with the same (un)order:

Destring(df, "X2") 
                   X1 X2
 1:             Agree  5
 2:           Neutral  2
 3:             Agree  2
 4:           Neutral  4
 5:          Disagree  5
 6:             Agree  4
 7:             Agree  2
 8:           Neutral  5
 9: Strongly disagree  4
10:             Agree  1
11: Strongly disagree  2
12:           Neutral  2

4.6 Unity-based normalization

x <- sample(10)

# won't print normalized values by default 
(y = Normalize(x) )
           [,1]
 [1,] 0.0000000
 [2,] 0.7777778
 [3,] 0.3333333
 [4,] 0.1111111
 [5,] 1.0000000
 [6,] 0.6666667
 [7,] 0.8888889
 [8,] 0.4444444
 [9,] 0.2222222
[10,] 0.5555556
attr(,"normalized:min")
[1] 1
attr(,"normalized:max")
[1] 10
# equals to: 
(x-min(x))/(max(x)-min(x))
 [1] 0.0000000 0.7777778 0.3333333 0.1111111 1.0000000 0.6666667 0.8888889
 [8] 0.4444444 0.2222222 0.5555556
# Smithson and Verkuilen approach
(y = Normalize(x, method="SV"))
      [,1]
 [1,] 0.05
 [2,] 0.75
 [3,] 0.35
 [4,] 0.15
 [5,] 0.95
 [6,] 0.65
 [7,] 0.85
 [8,] 0.45
 [9,] 0.25
[10,] 0.55
attr(,"normalized:min")
[1] 1
attr(,"normalized:max")
[1] 10

4.7 Formatted

Simple but useful. The Formatted function rounds numbers to text and drops leading zeros in the process.

x <- as.double(c(0.1, 1, 10, 100, 1000, 10000))

Formatted(x) 
[1] "$ 0.10"     "$ 1.00"     "$ 10.00"    "$ 100.00"   "$ 1000.00" 
[6] "$ 10000.00"
Formatted(x, "BRL")
[1] "R$ 0,10"      "R$ 1,00"      "R$ 10,00"     "R$ 100,00"   
[5] "R$ 1.000,00"  "R$ 10.000,00"
p = c(0.25, 25, 50)

Formatted(p, "Perc", flag="+")
[1] "+0.25%" "+25%"   "+50%"  
Formatted(p, "Perc", decimal.mark=",")
[1] "0,25%" "25%"   "50%"  

4.8 Anonymize data containing identifiable information

The following is an example data frame.

  Z X  Y
1 L 3 no
2 B 1 no
3 N 2 no
4 F 5 no
5 T 4 no

Then, the anonymized result of the data frame above:

dt %>% Anonymize()
  V1        V2 V3
1  B 1.0000000  G
2  N 0.3333333  G
3  M 0.6666667  G
4  R 1.6666667  G
5  G 1.3333333  G

5 Computing Some Statistical Tests

6 Distributions

Here are some implementation of standard statistical distributions for pedagogic use. Most of these distributions are available within the base R, however, I remaned them with more intuitive names–at least in my view.

6.1 Normal Probability Density Function

normalpdf(x=1.2, mu=0, sigma=1)

6.2 Normal Cumulative Distribution

normalcdf(lower=-1.96, upper=1.96, mu=0, sigma=1)

6.3 Inverse Cumulative Standard Normal Distribution

invnormal(area=0.35, mu=0, sigma=1)

6.4 Dirichlet Distribution

The function will return a matrix with rows, each containing a single random number according to the supplied alpha vector or matrix.

alphas <- cbind(1:4, 1, 4:1);
rdirichlet(4, alphas );
          [,1]       [,2]      [,3]
[1,] 0.1276054 0.21672055 0.6556741
[2,] 0.2417751 0.11867567 0.6395492
[3,] 0.3332945 0.05054353 0.6161619
[4,] 0.6977100 0.06968962 0.2326004

Consider the following example of usage: A Brazilian face-to-face poll conducted by Datafolha on Oct 03-04 (2014) with 18,116 insterviews, asked voters for their vote preference among presidential candidates.

# draw a sample from the posterior
set.seed(1234);
n <- 18116;
poll <- c(40,24,22,5,5,4) / 100 * n; # The data
mcmc <- 10000;
sims <- rdirichlet(mcmc, alpha = poll + 1);

Let’s see a descriptive summary of the simulated data:

Describe(sims)

Summary table
Dataset: sims
-----------------------------------------------------------------------------
Variable   Levels.or.Stats                     Freq               Valid      
---------- ----------------------------------- ------------------ -----------
1.         mean (sd) = 0.4 (0)                 10000 distinct val 10000/10000
V1         min < med < max = 0.39 < 0.4 < 0.41                    (100.0%)   
           IQR (CV) = 0 (0.01)                                               

2.         mean (sd) = 0.24 (0)                10000 distinct val 10000/10000
V2         min < med < max = 0.23 < 0.24 <                        (100.0%)   
           0.25                                                              
           IQR (CV) = 0 (0.01)                                               

3.         mean (sd) = 0.22 (0)                10000 distinct val 10000/10000
V3         min < med < max = 0.21 < 0.22 <                        (100.0%)   
           0.23                                                              
           IQR (CV) = 0 (0.01)                                               

4.         mean (sd) = 0.05 (0)                10000 distinct val 10000/10000
V4         min < med < max = 0.04 < 0.05 <                        (100.0%)   
           0.06                                                              
           IQR (CV) = 0 (0.03)                                               

5.         mean (sd) = 0.05 (0)                10000 distinct val 10000/10000
V5         min < med < max = 0.04 < 0.05 <                        (100.0%)   
           0.06                                                              
           IQR (CV) = 0 (0.03)                                               

6.         mean (sd) = 0.04 (0)                10000 distinct val 10000/10000
V6         min < med < max = 0.04 < 0.04 <                        (100.0%)   
           0.05                                                              
           IQR (CV) = 0 (0.04)                                               
-----------------------------------------------------------------------------

After obtained 10,000 simulated values, we look at the margins as of Aecio over Marina in the very last days of the campaign:

# compute the margins: Aecio minus Marina
margins <- sims[,2] - sims[,3];

# What is the mean of the margins
# posterior mean estimate:
mean(margins); 
[1] 0.02003513
# posterior standard deviation:
sd(margins); 
[1] 0.005006119
# 90% credible interval:
quantile(margins, probs = c(0.025, 0.975)); 
      2.5%      97.5% 
0.01031474 0.02998522 
# posterior probability of a positive margin (Aécio over Marina):
mean(margins > 0); 
[1] 1

7 Political Behavior Measures

There is a specific vignette for Indices, covering in great detail many of the political behaviour indices.

7.1 Measures of Political Diversity

Let’s consider the 1980’s US presidential election (vote share) to compute the effective number of electoral parties (ENEP).

 Democratic  Republican Independent Libertarian    Citizens      Others 
      0.410       0.507       0.066       0.011       0.003       0.003 
PoliticalDiversity(US1980, index = "laakso/taagepera");
[1] 2.328
PoliticalDiversity(US1980, index = "golosov");
[1] 2.597

Considers the following data.frame with electoral results for the 1999 election in Helsinki, the seats were allocated using both Saint-Laguë and D’Hondt methods:

   votes seats.SL seats.dH
1  68885        5        5
2  18343        1        1
3  86448        6        7
4  21982        1        1
5  51587        4        4
6  27227        2        2
7   8482        1        0
8   7250        0        0
9    365        0        0
10  2734        0        0
11  1925        0        0
12   475        0        0
13  1693        0        0
14   693        0        0
15   308        0        0
16   980        0        0
17   560        0        0
18   590        0        0
19   185        0        0
# ENEP defaults (laakso/taagepera) method 

with(Helsinki, PoliticalDiversity(votes)); #ENEP Votes
[1] 5.453
with(Helsinki, PoliticalDiversity(seats.SL)); #ENP for Saint-Lague
[1] 4.762
with(Helsinki, PoliticalDiversity(seats.dH)); #ENP for D'Hondt
[1] 4.167

7.2 Proportionality Measures

Let’s use the 2012 Quebec provincial election:

   party   votes pvotes seats pseats
1     PQ 1393703  31.95    54   43.2
2    Lib 1360968  31.20    50   40.0
3    CAQ 1180235  27.05    19   15.2
4     QS  263111   6.03     2    1.6
5 Option   82539   1.89     0    0.0
6  Green   43394   0.99     0    0.0
7 Others   38738   0.89     0    0.0

7.2.1 Rae’s proportionality index

with(Quebec, Proportionality(pvotes, pseats, 
                     index = "Rae") )

Rae's Index :  0.057 

7.2.2 Loosemore-Hanby’s Proportionality Index

with(Quebec, Proportionality(pvotes, pseats, 
                     index = "Loosemore-Hanby") )

Loosemore-Hanby's Index :  0.201 

7.2.3 Gallagher’s proportionality index

with(Quebec, Proportionality(pvotes, pseats, 
                     index = "Gallagher") )

Gallagher's Index :  0.136 

7.3 Highest Averages Methods of Allocating Seats Proportionally

Now using data from 2014 Brazilian legislative elections, especifically from one district, let’s compare the results from D’Hondt, Saint-Lague, Hungtinton-Hill, and Imperiali methods.

  PCdoB     PDT     PEN    PMDB     PRB     PSB     PSC    PSTU   PTdoB 
 187906  326841  132531  981096 2043217   15061  103679  109830  213988 
    PTC     PTN 
  67145  278267 

The basic imputs for this class of functions are: 1) a list of parties, 2) a list of positive votes, and 3) a constant value for the number of seats to be returned. A numeric value (0~1) for the threshold is optional.

7.3.1 d’Hondt

HighestAverages(parties=names(Ceara), votes=Ceara,
                seats = 42, method = "dh") 
Method: d'Hondt 
Divisors: 1 2 3 4 ... 
ENP: 3.65 (After): 3.14 
Gallagher Index:  47.01 
 

-------------------------
 Party   Seats   % Seats 
------- ------- ---------
  PRB     21       50    

 PMDB     10      23.8   

  PDT      3       7.1   

  PTN      2       4.8   

 PTdoB     2       4.8   

 PCdoB     1       2.4   

  PEN      1       2.4   

  PSC      1       2.4   

 PSTU      1       2.4   

  PSB      0        0    

  PTC      0        0    
-------------------------

The d’Hondt method is only one way of allocating seats in party list systems. Other methods include the Saint-Lague, the modified Saint-Lague, the Danish version, Imperiali (do not to confuse with the Imperiali quota which is a Largest remainder method), and Hungtinton-Hill.

7.3.2 Saint-Lague

HighestAverages(parties=names(Ceara), votes=Ceara,
                seats = 42, method = "sl") 
Method: Sainte-Laguë 
Divisors: 1 3 5 7 ... 
ENP: 3.65 (After): 3.74 
Gallagher Index:  43.99 
 

-------------------------
 Party   Seats   % Seats 
------- ------- ---------
  PRB     19      45.2   

 PMDB      9      21.4   

  PDT      3       7.1   

  PTN      3       7.1   

 PCdoB     2       4.8   

 PTdoB     2       4.8   

  PEN      1       2.4   

  PSC      1       2.4   

 PSTU      1       2.4   

  PTC      1       2.4   

  PSB      0        0    
-------------------------

7.3.3 Using thresholds

Let’s assume the electoral system has a 5% vote threshold. Meaning that parties must get at least 5% of the total unspoiled votes cast in order to participate in the distribution of seats. Parties PCdoB, PTdoB, PEN, PSC, PSTU, PSB, and PTC would then be elimiated from competition at the outset. If the d’Hondt method of seat allocation were employed, then party PRB would get 4 extra seats than otherwise, and party PMDB 3 additional seats.

HighestAverages(parties=names(Ceara), votes=Ceara,
               seats = 42, method = "dh", threshold = 5/100) 
Method: d'Hondt 
Divisors: 1 2 3 4 ... 
ENP: 3.65 (After): 2.39 
Gallagher Index:  53.23 
 

-------------------------
 Party   Seats   % Seats 
------- ------- ---------
  PRB     24      57.1   

 PMDB     12      28.6   

  PDT      3       7.1   

  PTN      3       7.1   

 PCdoB     0        0    

  PEN      0        0    

  PSB      0        0    

  PSC      0        0    

 PSTU      0        0    

  PTC      0        0    

 PTdoB     0        0    
-------------------------

Other methods divide the votes by a mathematically derived quota, such as the Droop quota, the Hare quota (or Hamilton/Vinton), or the Imperiali quota, see next.

7.4 Largest Remainder Methods of Allocating Seats Proportionally

7.4.1 Hare quota

LargestRemainders(parties=names(Ceara), votes=Ceara, 
                seats = 42, method = "hare") 

7.4.2 Droop quota

LargestRemainders(parties=names(Ceara), votes=Ceara, 
                seats = 42, method = "droop") 

7.5 Suitable output for recycling in rmarkdown documents

The output produced by the highestAveragesof() and largestRemainders() functions is always a data.frame; therefore, it’s very straightforward to use with other aplications. For instance, I like the idea of using the output with knitr (2016) and ggplot2 (2009) packages to produce publishable-quality tables and graphs.

mytable = HighestAverages(parties=names(Ceara), votes=Ceara, 
                seats = 42, method = "dh") 
Method: d'Hondt 
Divisors: 1 2 3 4 ... 
ENP: 3.65 (After): 3.14 
Gallagher Index:  47.01 
 
library(knitr)

kable(mytable, align=c("l","c","c"))
Party Seats % Seats
PRB 21 50.0
PMDB 10 23.8
PDT 3 7.1
PTN 2 4.8
PTdoB 2 4.8
PCdoB 1 2.4
PEN 1 2.4
PSC 1 2.4
PSTU 1 2.4
PSB 0 0.0
PTC 0 0.0
mytable = HighestAverages(parties=names(Ceara), votes=Ceara, 
                seats = 42, method = "dh") 
Method: d'Hondt 
Divisors: 1 2 3 4 ... 
ENP: 3.65 (After): 3.14 
Gallagher Index:  47.01 
 
p <- ggplot(mytable, aes(x=reorder(Party, Seats), y=Seats)) + 
  geom_lollipop() + coord_flip() + labs(x=NULL, y="# Seats")
p + theme_538() + theme(panel.grid.major.y = element_blank())
2014 Legislative Election in Ceara (M=42)

2014 Legislative Election in Ceara (M=42)

7.6 Measures of inequality and concentration

7.6.1 Lorenz curve

$p
 [1] 0.0000000 0.2264877 0.2303473 0.3608878 0.4116205 0.6213280 0.7130032
 [8] 0.7581524 0.8769094 1.0000000

$L
 [1] 0.0000000 0.1955209 0.1990112 0.3231465 0.3731350 0.5883759 0.6829786
 [8] 0.7313230 0.8617789 1.0000000

$L.general
 [1]   0.0000 176.2074 179.3530 291.2262 336.2768 530.2563 615.5142
 [8] 659.0832 776.6527 901.2203

attr(,"class")
[1] "lorenz"

7.6.2 Gini Index

x <- c(778, 815, 857, 888, 925, 930, 965, 990, 1012)

wgt <- runif(n=length(x))

Gini(x, wgt)
[1] 0.03611

7.6.3 Atkinson’s Index

x <- c(778, 815, 857, 888, 925, 930, 965, 990, 1012)

Atkinson(x, epsilon = 0.5)
[1] 0.001719

8 Plot Themes & Colors

The SciencesPo brings in few themes and color palettes to enhance ggplot2 graphs, but there are specialized packages like ggthemes, ggthemr, cowplot, ggalt, GGally, and sjPlot that can make graph decoration a lot easier. There’s also an interesting discussion on plot design in the outstanding reference Cookbook for R that might be of your interest.

With SciencesPo, one can “preview” ggplot themes by simply invoking the Previewplot() function. This will draw a predefined plot that we can combine with themes, for instance.

Previewplot() + theme_grey()

8.1 Themes

In my point of view, the default ggplot2 design has its charm, but, very often I don’t like the gray background grid. I feel it distracts from the data. For example, see the following ggplot2 graph with default settings on.

# detach("package:SciencesPo", unload = TRUE)

gg <- ggplot(mtcars, aes(mpg, disp,color=factor(carb),size=hp)) 
gg <- gg + geom_point(alpha=0.7) + labs(title="Bubble Plot")
gg <- gg +scale_size_continuous(range = c(3,8)) 
gg 

library(SciencesPo)

gg <- ggplot(mtcars, aes(mpg, disp,color=factor(carb),size=hp)) 
gg <- gg + geom_point(alpha=0.7) + labs(title="Bubble Plot")
gg <- gg +scale_size_continuous(range = c(3,8)) 
gg <- gg + theme_blank()
gg

library(SciencesPo)

gg <- ggplot(mtcars, aes(mpg, disp,color=factor(carb),size=hp)) 
gg <- gg + geom_point(alpha=0.7) + labs(title="Bubble Plot")
gg <- gg +scale_size_continuous(range = c(3,8)) 
gg <- gg + theme_pub()
gg

library(SciencesPo)

gg <- ggplot(mtcars, aes(mpg, disp,color=factor(carb),size=hp)) 
gg <- gg + geom_point(alpha=0.7) + labs(title="Bubble Plot")
gg <- gg +scale_size_continuous(range = c(3,8)) 
gg <- gg + theme_538()
gg

library(SciencesPo)

gg <- ggplot(mtcars, aes(mpg, disp,color=factor(carb),size=hp)) 
gg <- gg + geom_point(alpha=0.7) + labs(title="Bubble Plot")
gg <- gg +scale_size_continuous(range = c(3,8)) 
gg <- gg + theme_darkside()
gg

8.2 Changing defaults

If you want to change the theme for an entire session you can use theme_set() as in theme_set(theme_gray()) to switch to default ggplot2 theme for all subsequent plots. Otherwise, you may also apply themes without changing the defaults as of plot + theme_pub().

To modify general aspects of the theme_pub() as fontsize, font family, color etc:

require(SciencesPo)

# "Verdana", "Tahoma", "Gill Sans" "serif" and "sans" are also high-readability fonts
theme_set(theme_pub(base_size=18, base_family = "serif")) 

Modify it with theme(). You might modify text sizes, colors etc; then add to the plot.

prefs <- theme(axis.text = element_text(size=18, 
                                        family = "Gill Sans", 
                                        colour="red"))

8.2.1 Quick plot customization

Alternatively, we can use a set of built in function to quick plot customization:

Previewplot() + 
  theme_538() + 
  align_title_right()

Previewplot() + 
  theme_538() + 
  no_y_gridlines()

Previewplot() + 
  theme_538() + 
  no_y_gridlines() +
  no_x_gridlines()

8.3 Generic plot annotations

The function Footnote() can be handy for adding extra text information, such as data source, equation, or credits as footnotes to a ggplot2 or gtable objects. The next chunck demonstrates its usage:

gg <- ggplot(Presidents, aes(x=height_ratio, y=vote_support)) 
gg <- gg + geom_smooth(method=lm, colour="red", fill="gold")
gg <- gg + geom_point(size = 5, alpha = .7) 
gg <- gg +  xlim(0.9,1.2) + ylim(.40, .70)
gg <- gg + labs(x="Winner/Opponent height ratio", y="Winner vote share", title="Does Height Matter in Presidential Elections?")
gg <- gg + theme_pub()

# Commence adding layers here
Render(gg) + Footnote(note="danielmarcelino.github.io", color="orange")

8.4 Generic plots (Data Visualization 101)

with(mtcars, Scatterplot(x = wt, y = mpg,
main = "Vehicle Weight-Gas Mileage Relationship",
xlab = "Vehicle Weight",
ylab = "Miles per Gallon",
font.family = "serif") )

8.5 Enhanced geoms

8.6 Color palettes

The following scales help passing manual values to scale colors for ggplot2 objects. In many ways, I think these scales are consistent with the themes whitin the SciencesPo package described earlier.

8.7 Discrete palletes

library("scales", quietly = TRUE)

show_col(pub_pal("pub12")(12))

show_col(pub_pal("gray9")(9), labels = FALSE)

show_col(pub_pal("manyeyes")(20))

show_col(pub_pal("tableau20")(20))

show_col(pub_pal("tableau10")(10))

 show_col(pub_pal("tableau10light")(10))

show_col(pub_pal("cyclic")(20))

party_pal("BRA", plot=TRUE)

8.8 Continuous palletes

8.8.1 Bivariate scales

show_col(pub_pal("trafficlight")(9))

show_col(pub_pal("bivariate1")(9))