<!-- 
.. title: Indices
.. slug: indices
.. date: 2016-05-22 09:03:48 UTC
.. tags: mathematics, presentations
.. category: 
.. link: 
.. description: 
.. type: text
-->

## [A ȷupyter notebook](http://pdes-net.org/scripts/indices.tar.xz)

Scientists move in mysterious ways, particularly when they try to measure their individual performance as scientists. As I've explained in a [previous post](http://pdes-net.org/cobra/posts/rsums-and-indices.html), the most popular and commonly accepted of these measures is the [h index](http://en.wikipedia.org/wiki/H-index) $\mathcal{H}$, which has been declared to be superfluous on both [empirical](http://michaelnielsen.org/blog/why-the-h-index-is-virtually-no-use/) and [mathematical](http://www.ams.org/journals/notices/201409/rnoti-p1040.pdf) grounds. Both of these references relate $\mathcal{H}$ to the square root of the total number of citations $\mathcal{N}$, the first one approximately

$\mathcal{H} \approx 0.5 \sqrt{\mathcal{N}}$

and the second one exactly:

$\mathcal{H}=\sqrt{6}\log{2}\sqrt{\mathcal{N}}/\pi \approx 0.54 \sqrt{\mathcal{N}} $.

Since I wanted to test [pandas](http://pandas.pydata.org/), [seaborn](https://stanford.edu/~mwaskom/software/seaborn/) and [statsmodels](http://statsmodels.sourceforge.net/) anyway, I gathered $\mathcal{H}$, $\mathcal{N}$, and the i10 index $\mathcal{I}$ from all my coauthors on Google Scholar. It turned out that not even a quarter of my coauthors have a Google Scholar account, but I figured that 71 data points would provide acceptable statistics.
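
For reference, both indices follow directly from a list of per-paper citation counts: $\mathcal{H}$ is the largest number $h$ such that $h$ papers have at least $h$ citations each, and $\mathcal{I}$ counts the papers with at least 10 citations. A minimal sketch of these definitions follows; the function names and the example counts are made up for illustration, since Google Scholar reports the finished values directly.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for cites in citations if cites >= 10)


# hypothetical citation counts of one author, one entry per paper
papers = [48, 33, 30, 12, 10, 9, 4, 4, 1, 0]
print(h_index(papers), i10_index(papers), sum(papers))
```

The last number printed is the total citation count $\mathcal{N}$ of this made-up author.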


```python
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
import seaborn as sns
```


```python
# read data into a Pandas DataFrame
data = pd.read_table("/home/cobra/ownCloud/MyStuff/projects/python/publishing/hindex.dat")
# check the data head
data.head()
```




<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Citations</th>
      <th>Hindex</th>
      <th>i10index</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>952</td>
      <td>12</td>
      <td>17</td>
    </tr>
    <tr>
      <th>1</th>
      <td>1913</td>
      <td>20</td>
      <td>35</td>
    </tr>
    <tr>
      <th>2</th>
      <td>5327</td>
      <td>34</td>
      <td>151</td>
    </tr>
    <tr>
      <th>3</th>
      <td>650</td>
      <td>13</td>
      <td>14</td>
    </tr>
    <tr>
      <th>4</th>
      <td>3855</td>
      <td>25</td>
      <td>68</td>
    </tr>
  </tbody>
</table>
</div>



Looks all right.
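
As a quick sanity check before any fitting, these five rows can be compared by hand with the second rule of thumb. A small sketch, using only the values printed in the table (the variable names are chosen for illustration):

```python
import math

# (Citations, Hindex) pairs copied from the first five rows shown above
head = [(952, 12), (1913, 20), (5327, 34), (650, 13), (3855, 25)]

# prefactor of the second estimate, sqrt(6) * log(2) / pi
prefactor = math.sqrt(6) * math.log(2) / math.pi

for citations, h_measured in head:
    h_predicted = prefactor * math.sqrt(citations)
    print(f"N = {citations:5d}   measured H = {h_measured:2d}   predicted H = {h_predicted:4.1f}")
```

For these five rows, the measured values all lie below the estimate.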


```python
logdata = np.log10(data)
```

The correlation of the data is much clearer when displayed logarithmically:


```python
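columns = ["Citations", "Hindex", "i10index"]  # avoid shadowing the built-in vars()
# 'height' replaces the 'size' argument of older seaborn releases
sns.pairplot(logdata, vars=columns, height=3, kind="reg");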
```


![png](../images/corrmatrix.png)


Now look at *that!* Two lines of code and seaborn visualizes all correlations in my data set. The diagonal elements of this 3x3 matrix plot show the distributions of $\mathcal{N}$, $\mathcal{H}$, and $\mathcal{I}$ (which seem to be close to normal distributions), and the off-diagonal elements visualize their correlations, emphasized by a linear regression (`kind="reg"`). And how correlated they are! There's indeed no need for a definition of 'indices' if the number of citations is all it boils down to.

Seaborn is great for visualization, as we have seen, but for quantitative statistical information, it's better to use statsmodels:


```python
hc = sm.ols(formula='Hindex ~ Citations', data=logdata)
fithc = hc.fit()

ic = sm.ols(formula='i10index ~ Citations', data=logdata)
fitic = ic.fit()

hi = sm.ols(formula='Hindex ~ i10index', data=logdata)
fithi = hi.fit()
```

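Let's compare the slope of our fit with the prediction above (strictly speaking, the slope of a log-log fit is the exponent of the power law, predicted to be 0.5, whereas 0.54 is the predicted prefactor):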


```python
fithc.params.Citations
```




    0.45708354021378172




```python
np.sqrt(6)*np.log(2)/np.pi
```




    0.54044463946673071



Solid state physicists have to work harder!
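
A note on reading these two numbers: taking the logarithm of a power law $\mathcal{H} = a\,\mathcal{N}^{b}$ gives $\log_{10}\mathcal{H} = \log_{10} a + b \log_{10}\mathcal{N}$, so the fitted slope estimates the exponent $b$ and $10^{\mathrm{Intercept}}$ estimates the prefactor $a$. The following sketch illustrates this on synthetic, noise-free data generated from the second estimate (not on the Google Scholar data). The least-squares fit is written out by hand so that the example runs without the data file; `fit_line` is a hypothetical helper, not part of statsmodels.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# synthetic data: H = a * sqrt(N) with a = sqrt(6) * log(2) / pi
a = math.sqrt(6) * math.log(2) / math.pi
citations = [100, 300, 1000, 3000, 10000]
hvalues = [a * math.sqrt(n) for n in citations]

intercept, slope = fit_line(
    [math.log10(n) for n in citations],
    [math.log10(h) for h in hvalues],
)
print(slope)            # estimate of the exponent b
print(10 ** intercept)  # estimate of the prefactor a
```

For the real fit, the corresponding quantities would be `fithc.params.Citations` and `10**fithc.params.Intercept`.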

One can also get more information, if desired:


```python
fithi.summary()
```

<STYLE TYPE="text/css">
<!--
TD{font-family: Monospace; font-size: 10pt;}
TH{font-family: Monospace; font-size: 10pt;}
--->
</STYLE>

<table style="text-align: left;" class="simpletable">
<caption>OLS Regression Results</caption>
<tr>
  <th>Dep. Variable:</th>         <td>Hindex</td>      <th>  R-squared:         </th> <td>   0.974</td>
</tr>
<tr>
  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th> <td>   0.974</td>
</tr>
<tr>
  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th> <td>   2592.</td>
</tr>
<tr>
  <th>Date:</th>             <td>Mon, 16 May 2016</td> <th>  Prob (F-statistic):</th> <td>1.84e-56</td>
</tr>
<tr>
  <th>Time:</th>                 <td>14:42:47</td>     <th>  Log-Likelihood:    </th> <td>  114.08</td>
</tr>
<tr>
  <th>No. Observations:</th>      <td>    71</td>      <th>  AIC:               </th> <td>  -224.2</td>
</tr>
<tr>
  <th>Df Residuals:</th>          <td>    69</td>      <th>  BIC:               </th> <td>  -219.6</td>
</tr>
<tr>
  <th>Df Model:</th>              <td>     1</td>      <th>                     </th>     <td> </td>   
</tr>
<tr>
  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>     <td> </td>   
</tr>
</table>
<table style="text-align: left;" class="simpletable">
<tr>
      <td></td>         <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th> <th>[95.0% Conf. Int.]</th> 
</tr>
<tr>
  <th>Intercept</th> <td>    0.4498</td> <td>    0.019</td> <td>   23.723</td> <td> 0.000</td> <td>    0.412     0.488</td>
</tr>
<tr>
  <th>i10index</th>  <td>    0.5454</td> <td>    0.011</td> <td>   50.910</td> <td> 0.000</td> <td>    0.524     0.567</td>
</tr>
</table>
<table style="text-align: left;" class="simpletable">
<tr>
  <th>Omnibus:</th>       <td> 3.438</td> <th>  Durbin-Watson:     </th> <td>   2.391</td>
</tr>
<tr>
  <th>Prob(Omnibus):</th> <td> 0.179</td> <th>  Jarque-Bera (JB):  </th> <td>   2.600</td>
</tr>
<tr>
  <th>Skew:</th>          <td>-0.408</td> <th>  Prob(JB):          </th> <td>   0.273</td>
</tr>
<tr>
  <th>Kurtosis:</th>      <td> 3.462</td> <th>  Cond. No.          </th> <td>    7.44</td>
</tr>
</table>
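
Since the fit was done on the $\log_{10}$ values, the two coefficients translate back into a power law $\mathcal{H} \approx 10^{0.4498}\,\mathcal{I}^{0.5454}$. A short sketch that evaluates it, using only the coefficients printed in the table and the values from the first five rows of the data (variable names are illustrative):

```python
# coefficients copied from the summary table (fit on log10 values)
intercept = 0.4498
slope = 0.5454

prefactor = 10 ** intercept  # back-transform the intercept
print(round(prefactor, 2))

# (i10index, Hindex) pairs from the first five rows of the data
head = [(17, 12), (35, 20), (151, 34), (14, 13), (68, 25)]
for i10, h_measured in head:
    h_fit = prefactor * i10 ** slope
    print(f"i10 = {i10:3d}   measured H = {h_measured:2d}   fitted H = {h_fit:4.1f}")
```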

And of course, we can display these fits independently of seaborn:


```python
xlist_cit = pd.DataFrame({'Citations': [logdata.Citations.min(), logdata.Citations.max()]})
xlist_i10 = pd.DataFrame({'i10index': [logdata.i10index.min(), logdata.i10index.max()]})
```


```python
# evaluate each fitted line at the two end points of its x range
preds_hcit = fithc.predict(xlist_cit)
preds_i10cit = fitic.predict(xlist_cit)
preds_hi10 = fithi.predict(xlist_i10)
```


```python
logdata.plot(kind='scatter', x='Citations', y='Hindex')
plt.plot(xlist_cit, preds_hcit, c='red', linewidth=2);

logdata.plot(kind='scatter', x='Citations', y='i10index')
plt.plot(xlist_cit, preds_i10cit, c='red', linewidth=2);

logdata.plot(kind='scatter', x='i10index', y='Hindex')
plt.plot(xlist_i10, preds_hi10, c='red', linewidth=2);
```


![png](../images/corr1.png)

![png](../images/corr2.png)

![png](../images/corr3.png)


