dataframe#

class dataframe(data=None, index=None, columns=None, dtype=None, copy=None, dtypes=None, nrows=None, **kwargs)[source]#

Bases: DataFrame

An extension of the pandas DataFrame with additional convenience methods for accessing rows and columns and performing other operations, such as adding rows.

Parameters:
  • data (dict/array/dataframe) – the data to use; passed to pd.DataFrame()

  • index (array) – the index to use; passed to pd.DataFrame()

  • columns (list) – column labels (if a dict is supplied, the value sets the dtype)

  • dtype (type) – a dtype for the whole dataframe; passed to pd.DataFrame()

  • dtypes (list/dict) – alternatively, list of data types to set each column to

  • nrows (int) – the number of rows to preallocate (default 0)

  • kwargs (dict) – if provided, treat these as data columns

Hint: Run the example below line by line to get a sense of how the dataframe changes.

Examples:

df = sc.dataframe(cols=['x','y'], data=[[1238,2],[384,5],[666,7]]) # Create data frame
df['x'] # Print out a column
df[0] # Print out a row
df['x',0] # Print out an element
df[0,:] = [123,6]; print(df) # Set values for a whole row
df['y'] = [8,5,0]; print(df) # Set values for a whole column
df['z'] = [14,14,14]; print(df) # Add new column
df.rmcol('z'); print(df) # Remove a column
df.addcol('z', [14,14,14]); print(df) # Alternate way to add new column
df.poprow(1); print(df) # Remove a row
df.append([555,2,14]); print(df) # Append a new row
df.insertrow(1,[556,2,14]); print(df) # Insert a new row
df.sort(); print(df) # Sort by the first column
df.sort('y'); print(df) # Sort by the second column
df.findrow(123) # Return the row starting with value 123
df.rmrow(); print(df) # Remove last row
df.rmrow(555); print(df) # Remove the row starting with element '555'

# Direct setting of data
df = sc.dataframe(a=[1,2,3], b=[4,5,6])

The dataframe can be used for both numeric and non-numeric data.

New in version 2.0.0: subclass pandas DataFrame
New in version 3.0.0: “dtypes” argument; handling of item setting
New in version 3.1.0: use pandas’ equality operator by default (unless an exception is raised); new “equal” method; “cat” can be an instance method now

Attributes

T

The transpose of the DataFrame.

at

Access a single value for a row/column label pair.

attrs

Dictionary of global attributes of this dataset.

axes

Return a list representing the axes of the DataFrame.

cols

Get columns as a list

columns

The column labels of the DataFrame.

dtypes

Return the dtypes in the DataFrame.

empty

Indicator whether Series/DataFrame is empty.

flags

Get the properties associated with this pandas object.

iat

Access a single value for a row/column pair by integer position.

iloc

Purely integer-location based indexing for selection by position.

index

The index (row labels) of the DataFrame.

loc

Access a group of rows and columns by label(s) or a boolean array.

ncols

Get the number of columns in the dataframe

ndim

Return an int representing the number of axes / array dimensions.

nrows

Get the number of rows in the dataframe

shape

Return a tuple representing the dimensionality of the DataFrame.

size

Return an int representing the number of elements in this object.

style

Returns a Styler object.

values

Return a Numpy representation of the DataFrame.

Methods

property cols#

Get columns as a list

set_dtypes(dtypes)[source]#

Set dtypes in-place (see df.astype() for the user-facing version)

New in version 3.0.0.

col_index(col=None, *args, die=True)[source]#

Get the index of the column named col.

Similar to df.columns.get_loc(col), and opposite of df.col_name.

Parameters:
  • col (str/list) – the column(s) to get the index of (return 0 if None)

  • args (list) – additional column(s) to get the index of

  • die (bool) – whether to raise an exception if the column could not be found (else, return None)

Examples:

df = sc.dataframe(dict(a=[1,2,3], b=[4,5,6], c=[7,8,9]))
df.col_index('b') # Returns 1
df.col_index(1) # Returns 1
df.col_index('a', 'c') # Returns [0, 2]

New in version 3.0.0: renamed from “_sanitizecols”; multiple arguments
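For comparison, the plain-pandas lookups that col_index() is described as resembling can be sketched as follows. This uses pandas only, not Sciris, and the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6], c=[7, 8, 9]))

# Name -> position: the pandas counterpart of df.col_index('b')
idx_b = df.columns.get_loc('b')

# Several names at once: the counterpart of df.col_index('a', 'c')
idx_ac = [df.columns.get_loc(name) for name in ['a', 'c']]

# Position -> name: the reverse lookup, i.e. df.columns[col]
name_1 = df.columns[1]

print(idx_b, idx_ac, name_1)
```

Unlike col_index(), get_loc() takes one label at a time and always raises KeyError for a missing label, so the die=False behavior would need an explicit try/except.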

col_name(col=None, *args, die=True)[source]#

Get the name of the column(s) with index col.

Similar to df.columns[col], and opposite of df.col_index.

Note: This method always looks for named columns first. If col is the name of a column, it will return col rather than columns[col]. See the example below for more information.

Parameters:
  • col (int/list) – the column(s) to get the name of (return 0 if None)

  • args (list) – additional column(s) to get the name of

  • die (bool) – whether to raise an exception if the column could not be found (else, return None)

Examples:

df = sc.dataframe(dict(a=[1,2,3], b=[4,5,6], c=[7,8,9]))
df.col_name(1) # Returns 'b'
df.col_name('b') # Returns 'b'
df.col_name(0, 2) # Returns ['a', 'c']

New in version 3.0.0.

get(key)[source]#

Alias to pandas __getitem__ method; rarely used

set(key, value=None)[source]#

Alias to pandas __setitem__ method; rarely used

flexget(cols=None, rows=None, asarray=False, cast=True, default=None)[source]#

More complicated way of getting data from a dataframe. While getting directly by key usually returns the array data directly, this usually returns another dataframe.

Parameters:
  • cols (str/list) – the column(s) to get

  • rows (int/list) – the row(s) to get

  • asarray (bool) – whether to return an array (otherwise, return a dataframe)

  • cast (bool) – attempt to cast to an all-numeric array

  • default (any) – the value to return if the column(s)/row(s) can’t be found

Example:

df = sc.dataframe(cols=['x','y','z'],data=[[1238,2,-1],[384,5,-2],[666,7,-3]]) # Create data frame
df.flexget(cols=['x','z'], rows=[0,2])

classmethod equal(*args, equal_nan=True)[source]#

Class method that returns True or False from a more robust equality check: the dataframes must have the same type, size, columns, and values. See df.equals() for the equivalent instance method.

Examples:

df1 = sc.dataframe(a=[1, 2, np.nan])
df2 = sc.dataframe(a=[1, 2, 4])

sc.dataframe.equal(df1, df1) # Returns True
sc.dataframe.equal(df1, df1, equal_nan=False) # Returns False
sc.dataframe.equal(df1, df2) # Returns False
sc.dataframe.equal(df1, df1, df2) # Also returns False

New in version 3.1.0.
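The role of equal_nan can be illustrated with plain pandas, where the two standard comparisons treat NaN differently. This is a sketch of the underlying pandas behavior only; it makes no claim about how sc.dataframe.equal() is implemented:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(dict(a=[1, 2, np.nan]))
df2 = pd.DataFrame(dict(a=[1, 2, 4]))

# DataFrame.equals() treats NaNs in the same position as equal
nan_equal = df1.equals(df1)

# Element-wise comparison follows IEEE rules, so NaN != NaN
nan_unequal = bool((df1 == df1).all().all())

# Different values are unequal either way
different = df1.equals(df2)

print(nan_equal, nan_unequal, different)
```

The first result corresponds to the equal_nan=True case in the examples above, and the second to equal_nan=False.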

equals(other, *args, equal_nan=True)[source]#

Try the default equals(), but fall back on the more robust sc.dataframe.equal() if that fails.

New in version 3.1.0.

disp(nrows=None, ncols=None, width=999, precision=4, options=None, **kwargs)[source]#

Flexible display of a dataframe, showing all rows/columns by default.

Parameters:
  • nrows (int) – maximum number of rows to show (default: all)

  • ncols (int) – maximum number of columns to show (default: all)

  • width (int) – maximum screen width (default: 999)

  • precision (int) – number of decimal places to show (default: 4)

  • options (dict) – an optional dictionary of additional options, passed to pd.option_context()

  • kwargs (dict) – also passed to pd.option_context(), with ‘display.’ prepended if needed

Examples:

df = sc.dataframe(data=np.random.rand(100,10))
df.disp()
df.disp(precision=1, ncols=5, colheader_justify='left')

New in version 2.0.1.
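Since the options are passed to pd.option_context(), a comparable display can be sketched in plain pandas. The option values below mirror the documented defaults; how disp() builds its option list internally is an assumption:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 10))

# Display options corresponding to nrows=None, ncols=None, width=999, precision=4
options = {
    'display.max_rows': None,     # None means no row limit
    'display.max_columns': None,  # None means no column limit
    'display.width': 999,
    'display.precision': 4,
}

# option_context() expects alternating name/value arguments
flat = [item for pair in options.items() for item in pair]
with pd.option_context(*flat):
    full_text = repr(df)  # Options apply only inside the with-block

print(full_text)
```

Because option_context() is a context manager, the global pandas display settings are restored as soon as the block exits.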

replacedata(newdata=None, newdf=None, reset_index=True, inplace=True)[source]#

Replace data in the dataframe with other data; usually not used directly by the user, but used as part of e.g. df.concat().

Parameters:
  • newdata (array) – replace the dataframe’s data with these data

  • newdf (dataframe) – substitute the current dataframe with this one

  • reset_index (bool) – update the index

  • inplace (bool) – whether to modify in-place

New in version 3.0.0: improved dtype handling

appendrow(row, reset_index=True, inplace=True)[source]#

Add row(s) to the end of the dataframe.

See also df.concat() and df.insertrow(). Similar to the pandas operation df.iloc[-1] = ..., but faster and provides additional type checking.

Parameters:
  • row (array) – the row(s) to append

  • reset_index (bool) – update the index

  • inplace (bool) – whether to modify in-place

Note: “appendrow” and “concat” are equivalent, except appendrow() defaults to modifying in-place and “concat” defaults to returning a new dataframe.

Warning: modifying dataframes in-place is quite inefficient. For highest performance, construct the data in large chunks and then add to the dataframe all at once, rather than adding row by row.

Example:

import sciris as sc
import numpy as np

df = sc.dataframe(dict(
    a = ['foo','bar'],
    b = [1,2],
    c = np.random.rand(2)
))
df.appendrow(['cat', 3, 0.3])           # Append a list
df.appendrow(dict(a='dog', b=4, c=0.7)) # Append a dict

New in version 3.0.0: renamed “value” to “row”; improved performance
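The advice in the warning above (build the data in chunks, then add it all at once) can be sketched with plain pandas. This illustrates the pattern rather than the Sciris implementation, and the names are illustrative:

```python
import pandas as pd

df = pd.DataFrame(dict(a=['foo', 'bar'], b=[1, 2], c=[0.1, 0.2]))

# Collect the new rows first...
new_rows = [
    dict(a='cat', b=3, c=0.3),
    dict(a='dog', b=4, c=0.7),
]

# ...then add them in a single concatenation instead of one call per row
chunk = pd.DataFrame(new_rows)
df = pd.concat([df, chunk], ignore_index=True)  # ignore_index renumbers the rows

print(df)
```

Each concatenation copies the existing data, so one call for many rows does far less work than many calls for one row each.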

append(row, reset_index=True, inplace=True)[source]#

Alias to appendrow().

Note: pd.DataFrame.append was deprecated in pandas version 1.4 and removed in version 2.0; see pandas-dev/pandas#35407 for details. Since this method is implemented using pd.concat(), it does not suffer from the performance problems that append did.

New in version 3.0.0.

insertrow(index=0, value=None, reset_index=True, inplace=True, **kwargs)[source]#

Insert row(s) at the specified location. See also df.concat() and df.appendrow().

Parameters:
  • index (int) – index at which to insert new row(s)

  • value (array) – the row(s) to insert

  • reset_index (bool) – update the index

  • inplace (bool) – whether to modify in-place

  • kwargs (dict) – passed to df.concat()

Warning: modifying dataframes in-place is quite inefficient. For highest performance, construct the data in large chunks and then add to the dataframe all at once, rather than adding row by row.

Example:

import sciris as sc
import numpy as np

df = sc.dataframe(dict(
    a = ['foo','cat'],
    b = [1,3],
    c = np.random.rand(2)
))
df.insertrow(1, ['bar', 2, 0.2])           # Insert a list
df.insertrow(0, dict(a='rat', b=0, c=0.7)) # Insert a dict

New in version 3.0.0: renamed “row” to “index”
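Row insertion at a position can be expressed in plain pandas by splitting the dataframe and concatenating the pieces around the new row. This is one possible approach, shown for illustration; it is not taken from the Sciris source:

```python
import pandas as pd

df = pd.DataFrame(dict(a=['foo', 'cat'], b=[1, 3], c=[0.1, 0.3]))
new_row = pd.DataFrame([dict(a='bar', b=2, c=0.2)])

position = 1  # Insert before the row currently at this position

# Split at the insertion point and rebuild with the new row in between
df = pd.concat(
    [df.iloc[:position], new_row, df.iloc[position:]],
    ignore_index=True,  # Comparable to reset_index=True
)

print(df)
```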

concat(data, *args, columns=None, reset_index=True, inplace=False, dfargs=None, **kwargs)[source]#

Concatenate additional data onto the current dataframe.

Similar to df.appendrow() and df.insertrow(); see also sc.dataframe.cat() for the equivalent class method.

Parameters:
  • data (dataframe/array) – the data to concatenate

  • *args (dataframe/array) – additional data to concatenate

  • columns (list) – if supplied, columns to go with the data

  • reset_index (bool) – update the index

  • inplace (bool) – whether to append in place

  • dfargs (dict) – arguments passed to construct each dataframe

  • **kwargs (dict) – passed to pd.concat()

Example:

arr1 = np.random.rand(6,3)
df2 = sc.dataframe(np.random.rand(4,3))
df3 = df2.concat(arr1)

New in version 2.0.2: “inplace” defaults to False
New in version 3.0.0: improved type handling

classmethod cat(data, *args, dfargs=None, **kwargs)[source]#

Convenience class method for concatenating multiple dataframes. See df.concat() for the equivalent instance method.

Parameters:
  • data (dataframe/array) – the dataframe/data to use as the basis of the new dataframe

  • args (list) – additional dataframes (or object that can be converted to dataframes) to concatenate

  • dfargs (dict) – arguments passed to construct each dataframe

  • kwargs (dict) – passed to df.concat()

Example:

arr1 = np.random.rand(6,3)
df2 = pd.DataFrame(np.random.rand(4,3))
df3 = sc.dataframe.cat(arr1, df2)

New in version 2.0.2.

merge(*args, reset_index=True, inplace=False, **kwargs)[source]#

Alias to pd.merge(), with the option to merge in place.

Parameters:
  • reset_index (bool) – update the index

  • inplace (bool) – whether to merge in place

  • **kwargs (dict) – passed to pd.merge()

New in version 3.0.0.

Example:

df = sc.dataframe(dict(x=[1,2,3], y=[4,5,6]))
df2 = sc.dataframe(dict(x=[1,2,3], z=[9,8,7]))
df.merge(df2, on='x', inplace=True)
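
The same merge written with pd.merge() directly, as a plain-pandas sketch of what the example produces (the variable name merged is illustrative):

```python
import pandas as pd

df = pd.DataFrame(dict(x=[1, 2, 3], y=[4, 5, 6]))
df2 = pd.DataFrame(dict(x=[1, 2, 3], z=[9, 8, 7]))

# Join on the shared key column; pd.merge() always returns a new dataframe
merged = pd.merge(df, df2, on='x')

print(merged)
```

pd.merge() has no in-place option of its own, which is what the inplace argument adds.
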
property ncols#

Get the number of columns in the dataframe

property nrows#

Get the number of rows in the dataframe

addcol(key=None, value=None, data=None, inplace=True, **kwargs)[source]#

Add new column(s) to the data frame

See also assign(), which is similar, but returns a new dataframe by default.

Parameters:
  • key (str) – the name of the column

  • value (array) – the values for the column

  • data (dict) – alternatively, specify a dictionary of columns to add

  • inplace (bool) – whether to modify in-place (otherwise, return a new dataframe)

  • kwargs (dict) – additional columns to add

NB: a single argument is interpreted as “data”

Example:

df = sc.dataframe(dict(x=[1,2,3], y=[4,5,6]))
new_cols = dict(z=[1,2,3], a=[9,8,7])
df.addcol(new_cols)

popcols(col=None, *args, die=True)[source]#

Remove a column or columns from the data frame.

Alias to pop(), except allowing multiple columns to be popped.

Parameters:
  • col (str/list) – the column(s) to be popped

  • args (list) – additional columns to pop

  • die (bool) – whether to raise an exception if a column is not found

Example:

df = sc.dataframe(cols=['a','b','c','d'], data=np.random.rand(3,4))
df.popcols('a','c')

findind(value=None, col=None, closest=False, die=True)[source]#

Find the row index for a given value and column.

See df.findrow() for the equivalent to return the row itself rather than the index of the row. See df.col_index() for the column equivalent.

Parameters:
  • value (any) – the value to look for (default: return last row index)

  • col (str) – the column to look in (default: first)

  • closest (bool) – if true, return the closest match if an exact match is not found

  • die (bool) – whether to raise an exception if the value is not found (otherwise, return None)

Example:

df = sc.dataframe(data=[[2016,0.3],[2017,0.5]], columns=['year','val'])
df.findind(2016) # returns 0
df.findind(0.5, 'val') # returns 1
df.findind(2013) # returns None, or exception if die is True
df.findind(2013, closest=True) # returns 0

New in version 3.0.0: renamed from “_rowindex”
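The exact-match and closest-match lookups can be sketched with NumPy. The helper find_index below is hypothetical and only approximates the documented behavior; the Sciris method may differ in detail (for example, in how ties or non-numeric columns are handled):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[2016, 0.3], [2017, 0.5]], columns=['year', 'val'])

def find_index(frame, value, col, closest=False):
    """Hypothetical helper: position of the first row where frame[col] == value."""
    values = frame[col].to_numpy()
    matches = np.flatnonzero(values == value)
    if len(matches):
        return int(matches[0])  # Exact match found
    if closest:
        # Fall back to the row with the smallest absolute difference
        return int(np.argmin(np.abs(values - value)))
    return None  # Comparable to die=False

print(find_index(df, 2016, 'year'))
print(find_index(df, 2013, 'year', closest=True))
```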

poprow(row=-1, returnval=True)[source]#

Remove a row from the data frame.

Alias to drop, except drop by position rather than label, and modify in-place. To pop multiple rows, see df.poprows().

Parameters:
  • row (int) – index of the row to pop

  • returnval (bool) – whether to return the row that was popped

To pop a column, see df.pop().

New in version 3.0.0: “key” argument renamed “row”

poprows(inds=-1, value=None, col=None, reset_index=True, inplace=True, **kwargs)[source]#

Remove multiple rows by index or value

To pop a single row, see df.poprow().

Parameters:
  • inds (list) – the rows to remove

  • value (list) – alternatively, search for these values to remove; see df.findinds for details

  • col (str) – if removing by value, use this column to find the values

  • reset_index (bool) – update the index

  • inplace (bool) – whether to modify in-place

  • kwargs (dict) – passed to df.findinds

Examples:

df = sc.dataframe(np.random.rand(10,3))
df.poprows([3,4,5])

df = sc.dataframe(dict(x=[0,1,2,3,4], y=[2,3,2,7,8]))
df.poprows(value=2, col='y')

enumrows(cols=None, type='objdict')[source]#

Efficiently enumerate the rows of the dataframe

Similar to df.iterrows(), but up to 30x faster since it uses tuples instead of pd.Series.

Parameters:
  • cols (list) – the list of columns to include in the enumeration (by default, all)

  • type (str/type) – the output type for each row: options are ‘objdict’ (default), tuple (fastest), list (very fast), dict (pretty fast)

Examples:

df = sc.dataframe(dict(x=[0,1,2,3,4], y=[2,3,2,7,8], z=[5,5,4,3,2]))
for i,row in df.enumrows(): print(i, row.x+row.y) # Typical use case
for i,row in df.enumrows(type=tuple): print(i, row[0]+row[1]) # Fastest
for i,row in df.enumrows(type=dict): print(i, row['x']+row['y']) # Still fast
for i,(x,y) in df.enumrows(cols=['x', 'y'], type=tuple): print(i, x+y) # Even faster
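
The tuples-versus-Series distinction also exists in plain pandas, between itertuples() and iterrows(). The sketch below shows that contrast only; whether enumrows() uses itertuples() internally is not stated here and should not be assumed:

```python
import pandas as pd

df = pd.DataFrame(dict(x=[0, 1, 2, 3, 4], y=[2, 3, 2, 7, 8], z=[5, 5, 4, 3, 2]))

# Plain tuples: no pd.Series is built per row, so iteration is cheap
sums_tuples = [
    x + y
    for x, y in df[['x', 'y']].itertuples(index=False, name=None)
]

# iterrows() builds a pd.Series for every row, which is much slower
sums_series = [row['x'] + row['y'] for _, row in df.iterrows()]

print(sums_tuples)
```

Selecting only the needed columns before iterating keeps the tuples short, which matches the cols argument above.
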
replacecol(col=None, old=None, new=None)[source]#

Replace all of one value in a column with a new value

to_odict(row=None)[source]#

Convert dataframe to a dict of columns, optionally specifying certain rows.

Parameters:
  • row (int/list) – the rows to include

findrow(value=None, col=None, default=None, closest=False, asdict=False, die=False)[source]#

Return a row by searching for a matching value.

See df.findind() for the equivalent to return the index of the row rather than the row itself, and df.findinds() to find multiple row indices.

Parameters:
  • value (any) – the value to look for

  • col (str) – the column to look for this value in

  • default (any) – the value to return if the value is not found (overrides die)

  • closest (bool) – whether or not to return the closest row (overrides default and die)

  • asdict (bool) – whether to return results as dict rather than list

  • die (bool) – whether to raise an exception if the value is not found

Examples:

df = sc.dataframe(cols=['year','val'],data=[[2016,0.3],[2017,0.5], [2018, 0.3]])
df.findrow(2016) # returns array([2016, 0.3], dtype=object)
df.findrow(2013) # returns None, or exception if die is True
df.findrow(2013, closest=True) # returns array([2016, 0.3], dtype=object)
df.findrow(2016, asdict=True) # returns {'year':2016, 'val':0.3}

findinds(value=None, col=None, **kwargs)[source]#

Return the indices of all rows matching the given value in a given column.

Parameters:
  • value (any) – the value to look for

  • col (str) – the column to look in

  • kwargs (dict) – passed to sc.findinds()

Example:

df = sc.dataframe(cols=['year','val'],data=[[2016,0.3],[2017,0.5], [2018, 0.3]])
df.findinds(0.3, 'val') # Returns array([0,2])

filterin(inds=None, value=None, col=None, verbose=False, reset_index=True, inplace=False)[source]#

Keep only rows matching a criterion; see also df.filterout()

filterout(inds=None, value=None, col=None, verbose=False, reset_index=True, inplace=False)[source]#

Remove rows matching a criterion (not in place by default); see also df.filterin()
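
Neither filter method has an example here, so the following plain-pandas sketch shows the two complementary selections using a boolean mask. It illustrates the idea only; the variable names are illustrative and the Sciris argument handling (inds, value, col) is not reproduced:

```python
import pandas as pd

df = pd.DataFrame(dict(x=[0, 1, 2, 3, 4], y=[2, 3, 2, 7, 8]))

# Criterion: rows where column 'y' equals 2
mask = df['y'] == 2

# Keep only matching rows (the filterin idea)
filtered_in = df[mask].reset_index(drop=True)

# Remove matching rows (the filterout idea)
filtered_out = df[~mask].reset_index(drop=True)

print(filtered_in)
print(filtered_out)
```

Both selections return new dataframes and leave df untouched, comparable to inplace=False.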

filtercols(cols=None, *args, keep=True, die=True, reset_index=True, inplace=False)[source]#

Filter columns, keeping only those specified. Note: by default, this is not performed in place.

Parameters:
  • cols (str/list) – the columns to keep (or remove if keep=False)

  • args (list) – additional columns

  • keep (bool) – whether to keep the named columns (else, remove them)

  • die (bool) – whether to raise an exception if a column is not found

  • reset_index (bool) – update the index

  • inplace (bool) – whether to modify in-place

Examples:

df = sc.dataframe(cols=['a','b','c','d'], data=np.random.rand(3,4))
df2 = df.filtercols('a','b') # Keeps columns 'a' and 'b'
df3 = df.filtercols('a','c', keep=False) # Keeps columns 'b' and 'd'
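
The two modes correspond to column selection and column dropping in plain pandas, sketched below with illustrative variable names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3, 4), columns=['a', 'b', 'c', 'd'])

# keep=True: select only the named columns
keep_ab = df[['a', 'b']]

# keep=False: drop the named columns and keep the rest
drop_ac = df.drop(columns=['a', 'c'])

print(list(keep_ab.columns), list(drop_ac.columns))
```

Both pandas operations raise KeyError for a missing column, comparable to die=True.
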
sortrows(by=None, reverse=False, returninds=False, reset_index=True, inplace=True, **kwargs)[source]#

Sort the dataframe rows in place by the specified column(s).

Similar to df.sort_values(), except defaults to sorting in place, and optionally returns the indices used for sorting (like np.argsort()).

Parameters:
  • by (str or int) – column to sort by (default, first column)

  • reverse (bool) – whether to reverse the sort order (i.e., ascending=False)

  • returninds (bool) – whether to return the indices used to sort instead of the dataframe

  • reset_index (bool) – update the index

  • inplace (bool) – whether to modify the dataframe in-place

  • kwargs (dict) – passed to df.sort_values()

New in version 3.0.0: “inplace” argument; “col” argument renamed “by”
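The relationship between sorted rows and the sort indices (the returninds option) can be sketched with np.argsort() and plain pandas. This is an illustration with made-up data, not the Sciris implementation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(x=[1238, 384, 666], y=[2, 7, 5]))

# Indices that would sort column 'x' in ascending order
order = np.argsort(df['x'].to_numpy(), kind='stable')

# Applying those indices gives the sorted dataframe
sorted_df = df.iloc[order].reset_index(drop=True)

# Descending order, comparable to reverse=True
reversed_df = df.sort_values('x', ascending=False).reset_index(drop=True)

print(order)
print(sorted_df)
```

Note that pandas sort_values() returns a new dataframe by default, whereas sortrows() defaults to modifying in place.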

sort(by=None, reverse=False, returninds=False, inplace=True, **kwargs)[source]#

Alias to sortrows().

New in version 3.0.0.

sortcols(sortorder=None, reverse=False, inplace=True)[source]#

Like sortrows(), but change column order (usually in place) instead.

Parameters:
  • sortorder (list) – the list of indices to resort the columns by (if none, then alphabetical)

  • reverse (bool) – whether to reverse the order

  • inplace (bool) – whether to modify the dataframe in-place

New in version 3.0.0: Ensure dtypes are preserved; “inplace” argument; “returninds” argument removed

to_pandas(**kwargs)[source]#

Convert to a plain pandas dataframe

classmethod read_csv(*args, **kwargs)[source]#

Alias to pd.read_csv(), returning a Sciris dataframe

classmethod read_excel(*args, **kwargs)[source]#

Alias to pd.read_excel(), returning a Sciris dataframe