dataframe#
- class dataframe(data=None, index=None, columns=None, dtype=None, copy=None, dtypes=None, nrows=None, **kwargs)[source]#
Bases:
DataFrame
An extension of the pandas DataFrame with additional convenience methods for accessing rows and columns and performing other operations, such as adding rows.
- Parameters:
data (dict/array/dataframe) – the data to use; passed to pd.DataFrame()
index (array) – the index to use; passed to pd.DataFrame()
columns (list) – column labels (if a dict is supplied, the value sets the dtype)
dtype (type) – a dtype for the whole dataframe; passed to pd.DataFrame()
dtypes (list/dict) – alternatively, a list of data types to set each column to
nrows (int) – the number of rows to preallocate (default 0)
kwargs (dict) – if provided, treat these as data columns
Hint: Run the example below line by line to get a sense of how the dataframe changes.
Examples:
    df = sc.dataframe(cols=['x','y'], data=[[1238,2],[384,5],[666,7]]) # Create data frame
    df['x'] # Print out a column
    df[0] # Print out a row
    df['x',0] # Print out an element
    df[0,:] = [123,6]; print(df) # Set values for a whole row
    df['y'] = [8,5,0]; print(df) # Set values for a whole column
    df['z'] = [14,14,14]; print(df) # Add new column
    df.rmcol('z'); print(df) # Remove a column
    df.addcol('z', [14,14,14]); print(df) # Alternate way to add new column
    df.poprow(1); print(df) # Remove a row
    df.append([555,2,14]); print(df) # Append a new row
    df.insertrow(1,[556,2,14]); print(df) # Insert a new row
    df.sort(); print(df) # Sort by the first column
    df.sort('y'); print(df) # Sort by the second column
    df.findrow(123) # Return the row starting with value 123
    df.rmrow(); print(df) # Remove last row
    df.rmrow(555); print(df) # Remove the row starting with element '555'

    # Direct setting of data
    df = sc.dataframe(a=[1,2,3], b=[4,5,6])
The dataframe can be used for both numeric and non-numeric data.
New in version 2.0.0: subclass pandas DataFrame
New in version 3.0.0: “dtypes” argument; handling of item setting
New in version 3.1.0: use pandas’ equality operator by default (unless an exception is raised); new “equal” method; “cat” can be an instance method now
Attributes
T
The transpose of the DataFrame.
at
Access a single value for a row/column label pair.
attrs
Dictionary of global attributes of this dataset.
axes
Return a list representing the axes of the DataFrame.
cols
Get columns as a list
columns
The column labels of the DataFrame.
dtypes
Return the dtypes in the DataFrame.
empty
Indicator whether Series/DataFrame is empty.
flags
Get the properties associated with this pandas object.
iat
Access a single value for a row/column pair by integer position.
iloc
Purely integer-location based indexing for selection by position.
index
The index (row labels) of the DataFrame.
loc
Access a group of rows and columns by label(s) or a boolean array.
ncols
Get the number of columns in the dataframe
ndim
Return an int representing the number of axes / array dimensions.
nrows
Get the number of rows in the dataframe
shape
Return a tuple representing the dimensionality of the DataFrame.
size
Return an int representing the number of elements in this object.
style
Returns a Styler object.
values
Return a Numpy representation of the DataFrame.
Methods
- property cols#
Get columns as a list
- set_dtypes(dtypes)[source]#
Set dtypes in-place (see df.astype() for the user-facing version).
New in version 3.0.0.
- col_index(col=None, *args, die=True)[source]#
Get the index of the column named col. Similar to df.columns.get_loc(col), and opposite of df.col_name.
- Parameters:
Examples:
    df = sc.dataframe(dict(a=[1,2,3], b=[4,5,6], c=[7,8,9]))
    df.col_index('b') # Returns 1
    df.col_index(1) # Returns 1
    df.col_index('a', 'c') # Returns [0, 2]
New in version 3.0.0: renamed from “_sanitizecols”; multiple arguments
- col_name(col=None, *args, die=True)[source]#
Get the name of the column(s) with index col. Similar to df.columns[col], and opposite of df.col_index.
Note: This method always looks for named columns first. If col is the name of a column, it will return col rather than columns[col]. See example below for more information.
- Parameters:
Examples:
    df = sc.dataframe(dict(a=[1,2,3], b=[4,5,6], c=[7,8,9]))
    df.col_name(1) # Returns 'b'
    df.col_name('b') # Returns 'b'
    df.col_name(0, 2) # Returns ['a', 'c']
New in version 3.0.0.
- flexget(cols=None, rows=None, asarray=False, cast=True, default=None)[source]#
More complicated way of getting data from a dataframe. While getting directly by key usually returns the array data directly, this usually returns another dataframe.
- Parameters:
Example:
    df = sc.dataframe(cols=['x','y','z'],data=[[1238,2,-1],[384,5,-2],[666,7,-3]]) # Create data frame
    df.flexget(cols=['x','z'], rows=[0,2])
- classmethod equal(*args, equal_nan=True)[source]#
Class method that returns True or False after a more robust equality check: same type, size, columns, and values. See df.equals() for the equivalent instance method.
Examples:
    import numpy as np
    import sciris as sc
    df1 = sc.dataframe(a=[1, 2, np.nan])
    df2 = sc.dataframe(a=[1, 2, 4])
    sc.dataframe.equal(df1, df1) # Returns True
    sc.dataframe.equal(df1, df1, equal_nan=False) # Returns False
    sc.dataframe.equal(df1, df2) # Returns False
    sc.dataframe.equal(df1, df1, df2) # Also returns False
New in version 3.1.0.
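The effect of equal_nan follows from the fact that NaN never compares equal to itself under elementwise ==. The sketch below illustrates this using only pandas and NumPy. It is not the Sciris implementation, and the helper name frames_equal is hypothetical.

```python
import numpy as np
import pandas as pd

def frames_equal(df1, df2, equal_nan=True):
    """Hypothetical helper: compare two DataFrames, optionally treating NaNs as equal."""
    if type(df1) is not type(df2) or df1.shape != df2.shape:
        return False
    if not df1.columns.equals(df2.columns):
        return False
    if equal_nan:
        # DataFrame.equals() treats NaNs in the same location as equal
        return bool(df1.equals(df2))
    # Elementwise ==, under which NaN is never equal to NaN
    return bool((df1 == df2).all().all())

df1 = pd.DataFrame(dict(a=[1, 2, np.nan]))
df2 = pd.DataFrame(dict(a=[1, 2, 4]))

print(frames_equal(df1, df1))
print(frames_equal(df1, df1, equal_nan=False))
print(frames_equal(df1, df2))
```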
- equals(other, *args, equal_nan=True)[source]#
Try the default equals(), but fall back on the more robust sc.dataframe.equal() if that fails.
New in version 3.1.0.
- disp(nrows=None, ncols=None, width=999, precision=4, options=None, **kwargs)[source]#
Flexible display of a dataframe, showing all rows/columns by default.
- Parameters:
nrows (int) – maximum number of rows to show (default: all)
ncols (int) – maximum number of columns to show (default: all)
width (int) – maximum screen width (default: 999)
precision (int) – number of decimal places to show (default: 4)
options (dict) – an optional dictionary of additional options, passed to pd.option_context()
kwargs (dict) – also passed to pd.option_context(), with ‘display.’ prepended if needed
Examples:
    import numpy as np
    import sciris as sc
    df = sc.dataframe(data=np.random.rand(100,10))
    df.disp()
    df.disp(precision=1, ncols=5, colheader_justify='left')
New in version 2.0.1.
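As a rough illustration of the mechanism described above, the following sketch shows how display options can be applied temporarily with pd.option_context() in plain pandas. The helper names with_display_prefix and show_all are hypothetical and are not part of Sciris. The defaults simply mirror the ones listed in the parameters.

```python
import numpy as np
import pandas as pd

def with_display_prefix(options):
    """Hypothetical helper: prepend 'display.' to option names that lack it."""
    return {k if k.startswith('display.') else f'display.{k}': v
            for k, v in options.items()}

def show_all(df, **kwargs):
    """Hypothetical helper: return the text of df with no row/column truncation."""
    options = {'max_rows': None, 'max_columns': None, 'width': 999, 'precision': 4}
    options.update(kwargs)
    # option_context() expects a flat sequence: name1, value1, name2, value2, ...
    flat = [item for pair in with_display_prefix(options).items() for item in pair]
    with pd.option_context(*flat):
        return repr(df)  # Options are restored when the block exits

df = pd.DataFrame(np.arange(1000).reshape(100, 10) / 7)
text = show_all(df, precision=1, colheader_justify='left')
print(text.splitlines()[0])  # Header row only, to keep the output short
```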
- replacedata(newdata=None, newdf=None, reset_index=True, inplace=True)[source]#
Replace data in the dataframe with other data; usually not used directly by the user, but used as part of e.g. df.concat().
- Parameters:
New in version 3.0.0: improved dtype handling
- appendrow(row, reset_index=True, inplace=True)[source]#
Add row(s) to the end of the dataframe.
See also df.concat() and df.insertrow(). Similar to the pandas operation df.iloc[-1] = ..., but faster and with additional type checking.
- Parameters:
Note: “appendrow” and “concat” are equivalent, except appendrow() defaults to modifying in-place and “concat” defaults to returning a new dataframe.
Warning: modifying dataframes in-place is quite inefficient. For highest performance, construct the data in large chunks and then add to the dataframe all at once, rather than adding row by row.
Example:
    import sciris as sc
    import numpy as np

    df = sc.dataframe(dict(
        a = ['foo','bar'],
        b = [1,2],
        c = np.random.rand(2)
    ))
    df.appendrow(['cat', 3, 0.3]) # Append a list
    df.appendrow(dict(a='dog', b=4, c=0.7)) # Append a dict
New in version 3.0.0: renamed “value” to “row”; improved performance
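The performance warning above applies to plain pandas as well. The sketch below contrasts growing a frame one row at a time with building it once from collected rows, using only pandas. The variable names are illustrative, and no timing claims are made here.

```python
import pandas as pd

rows = [['foo', 1, 0.1], ['bar', 2, 0.2], ['cat', 3, 0.3]]
columns = ['a', 'b', 'c']

# Slow pattern: every concat allocates a new frame and copies all existing rows
slow = pd.DataFrame([rows[0]], columns=columns)
for row in rows[1:]:
    slow = pd.concat([slow, pd.DataFrame([row], columns=columns)], ignore_index=True)

# Faster pattern: collect the rows in a plain list, then build the frame once
fast = pd.DataFrame(rows, columns=columns)

print(fast)
```

Both approaches produce the same data; the difference is only in how many intermediate copies are made along the way.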
- append(row, reset_index=True, inplace=True)[source]#
Alias to appendrow().
Note: pd.DataFrame.append was deprecated in pandas version 1.4 and removed in version 2.0; see pandas-dev/pandas#35407 for details. Since this method is implemented using pd.concat(), it does not suffer from the performance problems that append did.
New in version 3.0.0.
- insertrow(index=0, value=None, reset_index=True, inplace=True, **kwargs)[source]#
Insert row(s) at the specified location. See also df.concat() and df.appendrow().
- Parameters:
Warning: modifying dataframes in-place is quite inefficient. For highest performance, construct the data in large chunks and then add to the dataframe all at once, rather than adding row by row.
Example:
    import sciris as sc
    import numpy as np

    df = sc.dataframe(dict(
        a = ['foo','cat'],
        b = [1,3],
        c = np.random.rand(2)
    ))
    df.insertrow(1, ['bar', 2, 0.2]) # Insert a list
    df.insertrow(0, dict(a='rat', b=0, c=0.7)) # Insert a dict
New in version 3.0.0: renamed “row” to “index”
- concat(data, *args, columns=None, reset_index=True, inplace=False, dfargs=None, **kwargs)[source]#
Concatenate additional data onto the current dataframe.
Similar to df.appendrow() and df.insertrow(); see also sc.dataframe.cat() for the equivalent class method.
- Parameters:
data (dataframe/array) – the data to concatenate
*args (dataframe/array) – additional data to concatenate
columns (list) – if supplied, columns to go with the data
reset_index (bool) – update the index
inplace (bool) – whether to append in place
dfargs (dict) – arguments passed to construct each dataframe
**kwargs (dict) – passed to pd.concat()
Example:
    import numpy as np
    import sciris as sc
    arr1 = np.random.rand(6,3)
    df2 = sc.dataframe(np.random.rand(4,3))
    df3 = df2.concat(arr1)
New in version 2.0.2: “inplace” defaults to False
New in version 3.0.0: improved type handling
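For comparison, a roughly equivalent operation in plain pandas is sketched below. It assumes the array has the same number of columns as the dataframe, and it converts the array to a DataFrame by hand, which is a step the method above is described as handling itself.

```python
import numpy as np
import pandas as pd

arr1 = np.random.rand(6, 3)
df2 = pd.DataFrame(np.random.rand(4, 3))

# Give the array the same column labels so the rows line up
df1 = pd.DataFrame(arr1, columns=df2.columns)

# ignore_index=True renumbers the rows 0..n-1, similar in spirit to reset_index=True
df3 = pd.concat([df2, df1], ignore_index=True)

print(df3.shape)
```

As with inplace=False, the original df2 is left unchanged and a new frame is returned.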
- classmethod cat(data, *args, dfargs=None, **kwargs)[source]#
Convenience class method for concatenating multiple dataframes. See df.concat() for the equivalent instance method.
- Parameters:
data (dataframe/array) – the dataframe/data to use as the basis of the new dataframe
args (list) – additional dataframes (or objects that can be converted to dataframes) to concatenate
dfargs (dict) – arguments passed to construct each dataframe
kwargs (dict) – passed to df.concat()
Example:
    import numpy as np
    import pandas as pd
    import sciris as sc
    arr1 = np.random.rand(6,3)
    df2 = pd.DataFrame(np.random.rand(4,3))
    df3 = sc.dataframe.cat(arr1, df2)
New in version 2.0.2.
- merge(*args, reset_index=True, inplace=False, **kwargs)[source]#
Alias to pd.merge, except that it can optionally merge in place.
- Parameters:
reset_index (bool) – update the index
inplace (bool) – whether to merge in place
**kwargs (dict) – passed to pd.merge()
New in version 3.0.0.
Example:
    df = sc.dataframe(dict(x=[1,2,3], y=[4,5,6]))
    df2 = sc.dataframe(dict(x=[1,2,3], z=[9,8,7]))
    df.merge(df2, on='x', inplace=True)
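Since this method is described as an alias to pd.merge, the underlying pandas call for the same data is sketched below. It returns a new frame rather than modifying either input.

```python
import pandas as pd

df = pd.DataFrame(dict(x=[1, 2, 3], y=[4, 5, 6]))
df2 = pd.DataFrame(dict(x=[1, 2, 3], z=[9, 8, 7]))

# Inner join (the pandas default) on the shared key column 'x'
merged = pd.merge(df, df2, on='x')

print(merged)
```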
- property ncols#
Get the number of columns in the dataframe
- property nrows#
Get the number of rows in the dataframe
- addcol(key=None, value=None, data=None, inplace=True, **kwargs)[source]#
Add new column(s) to the data frame
See also assign(), which is similar, but returns a new dataframe by default.
- Parameters:
NB: a single argument is interpreted as “data”
Example:
    df = sc.dataframe(dict(x=[1,2,3], y=[4,5,6]))
    new_cols = dict(z=[1,2,3], a=[9,8,7])
    df.addcol(new_cols)
- popcols(col=None, *args, die=True)[source]#
Remove a column or columns from the data frame.
Alias to pop(), except allowing multiple columns to be popped.
- Parameters:
Example:
    import numpy as np
    import sciris as sc
    df = sc.dataframe(cols=['a','b','c','d'], data=np.random.rand(3,4))
    df.popcols('a','c')
- findind(value=None, col=None, closest=False, die=True)[source]#
Find the row index for a given value and column.
See df.findrow() for the equivalent to return the row itself rather than the index of the row. See df.col_index() for the column equivalent.
- Parameters:
Example:
    df = sc.dataframe(data=[[2016,0.3],[2017,0.5]], columns=['year','val'])
    df.findind(2016) # returns 0
    df.findind(0.5, 'val') # returns 1
    df.findind(2013) # returns None, or exception if die is True
    df.findind(2013, closest=True) # returns 0
New in version 3.0.0: renamed from “_rowindex”
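The exact and closest lookups above can be approximated in plain pandas and NumPy as follows. This is a minimal sketch under the assumption that the searched column is numeric. The helper name find_index is hypothetical, and the die behaviour is omitted.

```python
import numpy as np
import pandas as pd

def find_index(df, value, col=None, closest=False):
    """Hypothetical helper: position of the first row matching value in col
    (default: the first column), or of the nearest value if closest=True."""
    column = df.iloc[:, 0] if col is None else df[col]
    data = column.to_numpy()
    if closest:
        # Row with the smallest absolute difference from the target
        return int(np.argmin(np.abs(data - value)))
    matches = np.flatnonzero(data == value)
    return int(matches[0]) if len(matches) else None

df = pd.DataFrame([[2016, 0.3], [2017, 0.5]], columns=['year', 'val'])

print(find_index(df, 2016))
print(find_index(df, 0.5, 'val'))
print(find_index(df, 2013))
print(find_index(df, 2013, closest=True))
```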
- poprow(row=-1, returnval=True)[source]#
Remove a row from the data frame.
Alias to drop, except drop by position rather than label, and modify in-place. To pop multiple rows, see df.poprows().
- Parameters:
To pop a column, see df.pop().
New in version 3.0.0: “key” argument renamed “row”
- poprows(inds=-1, value=None, col=None, reset_index=True, inplace=True, **kwargs)[source]#
Remove multiple rows by index or value
To pop a single row, see df.poprow().
- Parameters:
inds (list) – the rows to remove
value (list) – alternatively, search for these values to remove; see df.findinds for details
col (str) – if removing by value, use this column to find the values
reset_index (bool) – update the index
inplace (bool) – whether to modify in-place
kwargs (dict) – passed to df.findinds
Examples:
    import numpy as np
    import sciris as sc
    df = sc.dataframe(np.random.rand(10,3))
    df.poprows([3,4,5])

    df = sc.dataframe(dict(x=[0,1,2,3,4], y=[2,3,2,7,8]))
    df.poprows(value=2, col='y')
- enumrows(cols=None, type='objdict')[source]#
Efficiently enumerate the rows of the dataframe
Similar to df.iterrows(), but up to 30x faster since it uses tuples instead of pd.Series.
- Parameters:
cols (list) – the list of columns to include in the enumeration (by default, all)
type (str/type) – the output type for each row: options are ‘objdict’ (default), tuple (fastest), list (very fast), dict (pretty fast)
Examples:
    df = sc.dataframe(dict(x=[0,1,2,3,4], y=[2,3,2,7,8], z=[5,5,4,3,2]))
    for i,row in df.enumrows(): print(i, row.x+row.y) # Typical use case
    for i,row in df.enumrows(type=tuple): print(i, row[0]+row[1]) # Fastest
    for i,row in df.enumrows(type=dict): print(i, row['x']+row['y']) # Still fast
    for i,(x,y) in df.enumrows(cols=['x', 'y'], type=tuple): print(i, x+y) # Even faster
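The tuple-versus-Series distinction mentioned above can also be seen in plain pandas. The sketch below compares iterrows(), which builds a Series per row, with itertuples(), which yields plain tuples. It illustrates the general idea only and is not the Sciris implementation.

```python
import pandas as pd

df = pd.DataFrame(dict(x=[0, 1, 2, 3, 4], y=[2, 3, 2, 7, 8], z=[5, 5, 4, 3, 2]))

# iterrows() constructs a pd.Series for every row, which is comparatively slow
sums_series = [row['x'] + row['y'] for _, row in df.iterrows()]

# itertuples(index=False, name=None) yields plain tuples and avoids that overhead
sums_tuples = [x + y for x, y in df[['x', 'y']].itertuples(index=False, name=None)]

# enumerate() supplies the row counter
for i, (x, y) in enumerate(df[['x', 'y']].itertuples(index=False, name=None)):
    print(i, x + y)
```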
- replacecol(col=None, old=None, new=None)[source]#
Replace all of one value in a column with a new value
- to_odict(row=None)[source]#
Convert dataframe to a dict of columns, optionally specifying certain rows.
- Parameters:
row (int/list) – the rows to include
- findrow(value=None, col=None, default=None, closest=False, asdict=False, die=False)[source]#
Return a row by searching for a matching value.
See df.findind() for the equivalent to return the index of the row rather than the row itself, and df.findinds() to find multiple row indices.
- Parameters:
value (any) – the value to look for
col (str) – the column to look for this value in
default (any) – the value to return if the value is not found (overrides die)
closest (bool) – whether or not to return the closest row (overrides default and die)
asdict (bool) – whether to return results as dict rather than list
die (bool) – whether to raise an exception if the value is not found
Examples:
    df = sc.dataframe(cols=['year','val'],data=[[2016,0.3],[2017,0.5], [2018, 0.3]])
    df.findrow(2016) # returns array([2016, 0.3], dtype=object)
    df.findrow(2013) # returns None, or exception if die is True
    df.findrow(2013, closest=True) # returns array([2016, 0.3], dtype=object)
    df.findrow(2016, asdict=True) # returns {'year':2016, 'val':0.3}
- findinds(value=None, col=None, **kwargs)[source]#
Return the indices of all rows matching the given key in a given column.
- Parameters:
value (any) – the value to look for
col (str) – the column to look in
kwargs (dict) – passed to sc.findinds()
Example:
    df = sc.dataframe(cols=['year','val'],data=[[2016,0.3],[2017,0.5], [2018, 0.3]])
    df.findinds(0.3, 'val') # Returns array([0,2])
- filterin(inds=None, value=None, col=None, verbose=False, reset_index=True, inplace=False)[source]#
Keep only rows matching a criterion; see also df.filterout()
- filterout(inds=None, value=None, col=None, verbose=False, reset_index=True, inplace=False)[source]#
Remove rows matching a criterion (not in place by default; set inplace=True to modify the dataframe); see also df.filterin()
- filtercols(cols=None, *args, keep=True, die=True, reset_index=True, inplace=False)[source]#
Filter columns keeping only those specified – note, by default, do not perform in place
- Parameters:
cols (str/list) – the columns to keep (or remove if keep=False)
args (list) – additional columns
keep (bool) – whether to keep the named columns (else, remove them)
die (bool) – whether to raise an exception if a column is not found
reset_index (bool) – update the index
inplace (bool) – whether to modify in-place
Examples:
    import numpy as np
    import sciris as sc
    df = sc.dataframe(cols=['a','b','c','d'], data=np.random.rand(3,4))
    df2 = df.filtercols('a','b') # Keeps columns 'a' and 'b'
    df3 = df.filtercols('a','c', keep=False) # Keeps columns 'b' and 'd'
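For readers more familiar with plain pandas, the two modes above correspond roughly to column selection and DataFrame.drop(). The sketch below uses only pandas and NumPy; neither call modifies the original frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3, 4), columns=['a', 'b', 'c', 'd'])

kept = df[['a', 'b']]                  # Like keep=True: retain only the named columns
dropped = df.drop(columns=['a', 'c'])  # Like keep=False: remove the named columns

print(list(kept.columns), list(dropped.columns))
```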
- sortrows(by=None, reverse=False, returninds=False, reset_index=True, inplace=True, **kwargs)[source]#
Sort the dataframe rows in place by the specified column(s).
Similar to df.sort_values(), except defaults to sorting in place, and optionally returns the indices used for sorting (like np.argsort()).
- Parameters:
by (str or int) – column to sort by (default, first column)
reverse (bool) – whether to reverse the sort order (i.e., ascending=False)
returninds (bool) – whether to return the indices used to sort instead of the dataframe
reset_index (bool) – update the index
inplace (bool) – whether to modify the dataframe in-place
kwargs (dict) – passed to df.sort_values()
New in version 3.0.0: “inplace” argument; “col” argument renamed “by”
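To make the relationship between sorting and the returned indices concrete, the following sketch reproduces both in plain pandas and NumPy. The data reuse the values from the class-level example, and the approach shown is an illustration rather than the method's actual implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(x=[1238, 384, 666], y=[2, 5, 7]))

# Indices that would sort column 'x' in ascending order (like returninds=True)
inds = np.argsort(df['x'].to_numpy(), kind='stable')

# Apply the ordering and renumber the rows (like reset_index=True)
sorted_df = df.iloc[inds].reset_index(drop=True)

# Descending order (like reverse=True) via sort_values
reverse_df = df.sort_values('x', ascending=False).reset_index(drop=True)

print(inds.tolist())
print(sorted_df)
```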
- sort(by=None, reverse=False, returninds=False, inplace=True, **kwargs)[source]#
Alias to sortrows().
New in version 3.0.0.
- sortcols(sortorder=None, reverse=False, inplace=True)[source]#
Like sortrows(), but change column order (usually in place) instead.
- Parameters:
New in version 3.0.0: Ensure dtypes are preserved; “inplace” argument; “returninds” argument removed