These notes experiment with the pandas DataFrame data structure.

Initializing a "DataFrame"

A "DataFrame" object is similar to a SQL table or a spreadsheet. Below are several ways to initialize one.

In [1]:

# standard pandas import
import pandas as pd
import numpy as np
from IPython.display import display

data = [10,20,30]
df = pd.DataFrame(data)
df

Out[1]:

	0
0	10
1	20
2	30

From a dictionary

In [2]:

data = {'col1' : [1., 2., 3., 4.], 
    'col2' : [4., 3., 2., 1.]}
df = pd.DataFrame(data)
df

Out[2]:

	col1	col2
0	1	4
1	2	3
2	3	2
3	4	1

Or from a dictionary of "Series"

In [3]:

data = {'col1' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']), 
     'col2' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
df

Out[3]:

	col1	col2
a	1	1
b	2	2
c	3	3
d	NaN	4
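
The NaN in row d comes from index alignment: pandas takes the union of the labels of both Series and leaves a missing value wherever one of them has no entry for a label. A minimal sketch of that behavior, reusing the same values as above (the names s1, s2, aligned and missing_labels are only for illustration):

```python
import pandas as pd

s1 = pd.Series([1., 2., 3.], index=['a', 'b', 'c'])
s2 = pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])

# The resulting index is the union of both Series' labels
aligned = pd.DataFrame({'col1': s1, 'col2': s2})

# Labels for which col1 had no value to contribute
missing_labels = aligned.index[aligned['col1'].isnull()].tolist()
print(missing_labels)
```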

From a JSON string

In [4]:

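from io import StringIO

json_str = '[ {"a": 10, "d": 1, "c": 2, "b": 3}, \
            {"a": 20, "d": 1, "c": 2, "b": 3}]'

# Recent pandas versions deprecate passing a literal JSON string; wrap it in StringIO
df = pd.read_json(StringIO(json_str))
df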

Out[4]:

	a	b	c	d
0	10	3	2	1
1	20	3	2	1

From a JSON file

In [5]:

df = pd.read_json("data/dumb_data.json")
df

Out[5]:

	a	b	c	d
0	10	3	2	1
1	20	3	2	1

Keeping only some index labels and columns

In [6]:

# data is still the dictionary of Series defined in In [3]
df = pd.DataFrame(data, index=['d', 'b', 'a'], columns=['col2'])
df

Out[6]:

	col2
d	4
b	2
a	1

From a CSV file

In [7]:

df = pd.read_csv("data/dumb_data.csv")
df

Out[7]:

	col1	col2
0	1	4
1	2	3
2	3	2

From a StringIO

In [8]:

from io import StringIO

# Columns are separated by explicit tab characters (\t)
tsv = StringIO(
    "Age\tHappiness\t# of pets\tReligion\n"
    "10\tNot happy\t0\tNot religious\n"
    "20\tVery happy\t2\tIslamic\n"
    "2\tPretty happy\t4\tHindu\n"
)

df = pd.read_csv(tsv, sep='\t')
df

Out[8]:

	Age	Happiness	# of pets	Religion
0	10	Not happy	0	Not religious
1	20	Very happy	2	Islamic
2	2	Pretty happy	4	Hindu

A "DataFrame" holds columns of type "Series"

In [9]:

type(df["Age"])

Out[9]:

pandas.core.series.Series

Data, index, and columns

In [10]:

df.values

Out[10]:

array([[10L, 'Not happy', 0L, 'Not religious'],
       [20L, 'Very happy', 2L, 'Islamic'],
       [2L, 'Pretty happy', 4L, 'Hindu']], dtype=object)

In [11]:

df.index

Out[11]:

Int64Index([0, 1, 2], dtype='int64')

In [12]:

df.columns

Out[12]:

Index([u'Age', u'Happiness', u'# of pets', u'Religion'], dtype='object')

Selecting Data

Selecting columns

In [13]:

df["Age"]

Out[13]:

0    10
1    20
2     2
Name: Age, dtype: int64

In [14]:

columnas = ["Age", "Happiness"]
df[columnas]

Out[14]:

	Age	Happiness
0	10	Not happy
1	20	Very happy
2	2	Pretty happy

Selecting rows

In [15]:

df["Age"][0]

Out[15]:

10

In [16]:

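filas = [0,18,19]
# reindex returns NaN for labels that do not exist; plain [] indexing raises KeyError in recent pandas
df["Age"].reindex(filas)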

Out[16]:

0     10
18   NaN
19   NaN
Name: Age, dtype: float64
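
Indexing a column with [] looks rows up by label, which coincides with the position here only because the DataFrame has the default 0, 1, 2 index. A small sketch, on a made-up frame with string labels (demo and the other names are hypothetical), of the explicit accessors: loc for labels and iloc for positions.

```python
import pandas as pd

# Hypothetical frame whose labels differ from the positions
demo = pd.DataFrame({"Age": [10, 20, 2]}, index=["x", "y", "z"])

by_label = demo.loc["y", "Age"]    # row labelled "y"
by_position = demo.iloc[1, 0]      # second row, first column
some_rows = demo.iloc[[0, 2]]      # first and third rows, by position

print(by_label, by_position)
print(some_rows)
```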

Selecting with conditions

In [17]:

df[(df['Age'] <= 30) & \
   (df['Age'] >= 20)]

Out[17]:

	Age	Happiness	# of pets	Religion
1	20	Very happy	2	Islamic
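
Each comparison produces a boolean Series, and & combines them element by element; the parentheses are needed because & binds more tightly than the comparison operators. A sketch on a reduced copy of the data (people, mask and same_mask are illustrative names), which also shows Series.between as an equivalent way to write the range:

```python
import pandas as pd

people = pd.DataFrame({"Age": [10, 20, 2],
                       "Happiness": ["Not happy", "Very happy", "Pretty happy"]})

# One boolean per row: True where 20 <= Age <= 30
mask = (people["Age"] >= 20) & (people["Age"] <= 30)

# between is inclusive on both ends by default
same_mask = people["Age"].between(20, 30)

print(people[mask])
print(mask.equals(same_mask))
```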

In [18]:

df

Out[18]:

	Age	Happiness	# of pets	Religion
0	10	Not happy	0	Not religious
1	20	Very happy	2	Islamic
2	2	Pretty happy	4	Hindu

Basic Operations

In [19]:

(df["# of pets"] / df["Age"]) + 100

Out[19]:

0    100.0
1    100.1
2    102.0
dtype: float64

Adding columns

In [20]:

df_tmp = df.copy()
df_tmp["tmp"] = [1, 2, 3]
df_tmp

Out[20]:

	Age	Happiness	# of pets	Religion	tmp
0	10	Not happy	0	Not religious	1
1	20	Very happy	2	Islamic	2
2	2	Pretty happy	4	Hindu	3

In [21]:

import math  # np.math is not available in recent NumPy releases

df_tmp["tmp_factorial"] = df_tmp["tmp"].apply(math.factorial)
df_tmp

Out[21]:

	Age	Happiness	# of pets	Religion	tmp	tmp_factorial
0	10	Not happy	0	Not religious	1	1
1	20	Very happy	2	Islamic	2	2
2	2	Pretty happy	4	Hindu	3	6

In [22]:

json_map = {"Not happy" : 0, "Pretty happy" : 1, "Very happy" : 2}
df_tmp["tmp_Happiness"] = df_tmp["Happiness"].map(json_map)
df_tmp

Out[22]:

	Age	Happiness	# of pets	Religion	tmp	tmp_factorial	tmp_Happiness
0	10	Not happy	0	Not religious	1	1	0
1	20	Very happy	2	Islamic	2	2	2
2	2	Pretty happy	4	Hindu	3	6	1
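
One thing to keep in mind with map and a dictionary: any value that is not a key of the dictionary becomes NaN, and the result is then stored as floats. A short sketch with an invented extra category ("Ecstatic" is not part of the original data):

```python
import pandas as pd

# "Ecstatic" is a made-up value that the mapping does not cover
happiness = pd.Series(["Not happy", "Very happy", "Ecstatic"])
codes = happiness.map({"Not happy": 0, "Pretty happy": 1, "Very happy": 2})

# Values that could not be translated
unmapped = happiness[codes.isnull()].tolist()

print(codes)
print(unmapped)
```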

Dropping columns

In [23]:

# axis is 0 for rows, 1 for columns
df_tmp.drop(labels="tmp_factorial", axis=1)

Out[23]:

	Age	Happiness	# of pets	Religion	tmp	tmp_Happiness
0	10	Not happy	0	Not religious	1	0
1	20	Very happy	2	Islamic	2	2
2	2	Pretty happy	4	Hindu	3	1

In [24]:

df_tmp.drop(labels=["tmp_factorial", "tmp"], axis=1)

Out[24]:

	Age	Happiness	# of pets	Religion	tmp_Happiness
0	10	Not happy	0	Not religious	0
1	20	Very happy	2	Islamic	2
2	2	Pretty happy	4	Hindu	1
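
Note that drop returns a new DataFrame and leaves the original untouched, which is why tmp_factorial is still available in In [24]. A minimal sketch on a small made-up frame; it uses the columns keyword, which I believe is equivalent to labels plus axis=1 in reasonably recent pandas versions:

```python
import pandas as pd

frame = pd.DataFrame({"Age": [10, 20], "tmp": [1, 2], "tmp_factorial": [1, 2]})

# drop returns a copy without the listed columns; frame keeps all three
trimmed = frame.drop(columns=["tmp", "tmp_factorial"])

print(list(frame.columns))
print(list(trimmed.columns))
```

To keep the result, assign it back to a name, as In [26] does for rows.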

Dropping rows

In [25]:

# suppose one row has missing data
df_tmp.loc[1, "Age"] = np.nan
df_tmp

Out[25]:

	Age	Happiness	# of pets	Religion	tmp	tmp_factorial	tmp_Happiness
0	10	Not happy	0	Not religious	1	1	0
1	NaN	Very happy	2	Islamic	2	2	2
2	2	Pretty happy	4	Hindu	3	6	1

Here we assume the data is missing and drop the row. There are other ways to handle this, covered further below.

In [26]:

df_tmp = df_tmp.drop(labels=[1], axis=0)
df_tmp

Out[26]:

	Age	Happiness	# of pets	Religion	tmp	tmp_factorial	tmp_Happiness
0	10	Not happy	0	Not religious	1	1	0
2	2	Pretty happy	4	Hindu	3	6	1

And to reset the index

In [27]:

df_tmp.reset_index()

Out[27]:

	index	Age	Happiness	# of pets	Religion	tmp	tmp_factorial	tmp_Happiness
0	0	10	Not happy	0	Not religious	1	1	0
1	2	2	Pretty happy	4	Hindu	3	6	1
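
By default reset_index keeps the old labels in a new column called index, as seen above. If the old labels are not needed, drop=True discards them. A small sketch on a made-up frame with a gap in its index:

```python
import pandas as pd

# Index 0 and 2, as if row 1 had been dropped
frame = pd.DataFrame({"Age": [10, 2]}, index=[0, 2])

# drop=True renumbers the rows without adding an "index" column
renumbered = frame.reset_index(drop=True)

print(renumbered)
```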

Applying functions

In [28]:

df

Out[28]:

	Age	Happiness	# of pets	Religion
0	10	Not happy	0	Not religious
1	20	Very happy	2	Islamic
2	2	Pretty happy	4	Hindu

Something useful is creating new columns from existing ones

In [29]:

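def f(row):
    # Multiplying by 1.0 forces float division
    return row["# of pets"] * 1.0 / row["Age"]

# axis=1 calls f once per row. Despite its name, the new column
# holds "# of pets" divided by "Age".
df["age/#pets"] = df.apply(f, axis=1)
df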

Out[29]:

	Age	Happiness	# of pets	Religion	age/#pets
0	10	Not happy	0	Not religious	0.0
1	20	Very happy	2	Islamic	0.1
2	2	Pretty happy	4	Hindu	2.0
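
For simple arithmetic like this, the same column can be computed without apply by operating on whole columns, which is usually faster on large frames. A sketch comparing both approaches on a reduced copy of the data (pets, row_wise and column_wise are illustrative names):

```python
import pandas as pd

pets = pd.DataFrame({"Age": [10, 20, 2], "# of pets": [0, 2, 4]})

# Row by row, as in the cell above
row_wise = pets.apply(lambda row: row["# of pets"] / row["Age"], axis=1)

# Whole columns at once
column_wise = pets["# of pets"] / pets["Age"]

print(column_wise)
print((row_wise - column_wise).abs().max())
```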

Other functions

Descriptive Statistics

In [30]:

df.describe(percentiles=[0.25, 0.5, 0.75])

Out[30]:

	Age	# of pets	age/#pets
count	3.000000	3	3.000000
mean	10.666667	2	0.700000
std	9.018500	2	1.126943
min	2.000000	0	0.000000
25%	6.000000	1	0.050000
50%	10.000000	2	0.100000
75%	15.000000	3	1.050000
max	20.000000	4	2.000000

First rows

In [31]:

df.head()

Out[31]:

	Age	Happiness	# of pets	Religion	age/#pets
0	10	Not happy	0	Not religious	0.0
1	20	Very happy	2	Islamic	0.1
2	2	Pretty happy	4	Hindu	2.0

Working with Missing Data

The following examples initialize data that contains missing values.

In [32]:

# "NaN" cells are read as missing values; columns are separated by tabs (\t)
tsv = StringIO(
    "Age\tHappiness\t# of pets\tReligion\n"
    "10\tNot happy\t0\tNaN\n"
    "NaN\tVery happy\t2\tIslamic\n"
    "2\tPretty happy\t4\tHindu\n"
)

df = pd.read_csv(tsv, sep='\t')
df

Out[32]:

	Age	Happiness	# of pets	Religion
0	10	Not happy	0	NaN
1	NaN	Very happy	2	Islamic
2	2	Pretty happy	4	Hindu

Checking for missing data

In [33]:

df.notnull()

Out[33]:

	Age	Happiness	# of pets	Religion
0	True	True	True	False
1	False	True	True	True
2	True	True	True	True

Columns with complete data and columns with missing data

In [34]:

df.notnull().all()

Out[34]:

Age          False
Happiness     True
# of pets     True
Religion     False
dtype: bool
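
Besides knowing which columns are incomplete, it is often handy to know how many values are missing in each. A sketch, on a hand-built copy of the data, that counts them by summing the boolean result of isnull (True counts as 1):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"Age": [10, np.nan, 2],
                      "Religion": [np.nan, "Islamic", "Hindu"],
                      "# of pets": [0, 2, 4]})

# Number of missing values per column
missing_per_column = frame.isnull().sum()
print(missing_per_column)
```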

Is the data complete?

In [35]:

df.notnull().all().all()

Out[35]:

False

So the data has missing values. One option is to fill them with the mean, which is computed column by column.

In [36]:

# numeric_only=True keeps the text columns out of the mean
df.fillna(df.mean(numeric_only=True))

Out[36]:

	Age	Happiness	# of pets	Religion
0	10	Not happy	0	NaN
1	6	Very happy	2	Islamic
2	2	Pretty happy	4	Hindu
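
fillna also accepts a dictionary that maps column names to fill values, so each column can get its own strategy. A sketch under the assumption that a placeholder string is acceptable for the text column ("Unknown" is a made-up choice, not something the data prescribes):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"Age": [10, np.nan, 2],
                      "Religion": [np.nan, "Islamic", "Hindu"]})

# Mean for the numeric column, a hypothetical placeholder for the text column
filled = frame.fillna({"Age": frame["Age"].mean(), "Religion": "Unknown"})
print(filled)
```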

The missing value in "Age" was filled. The one in "Religion" is probably best left as it is, since "NaN" marks a missing value there even though its literal meaning ("not a number") does not fit a text column.

And to drop the rows that have missing data in a given set of columns

In [37]:

df.dropna(subset=["Age"])

Out[37]:

	Age	Happiness	# of pets	Religion
0	10	Not happy	0	NaN
2	2	Pretty happy	4	Hindu

Or anywhere in the DataFrame

In [38]:

df.dropna()

Out[38]:

	Age	Happiness	# of pets	Religion
2	2	Pretty happy	4	Hindu
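
By default dropna removes a row as soon as any of its values is missing, which can discard a lot of data. The how parameter relaxes this. A sketch on a made-up frame in which every row has at least one gap:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"Age": [10, np.nan, np.nan],
                      "Religion": [np.nan, "Islamic", np.nan]})

any_missing = frame.dropna()            # default how="any": drops all three rows here
all_missing = frame.dropna(how="all")   # drops only the row that is entirely missing

print(len(any_missing))
print(all_missing)
```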

Using the DataFrame Type in Python Pandas