Welcome to group_15_data_wrangling
Package Summary
This package aims to simplify data wrangling for the Adult Census Income dataset found here. It makes it easier for someone who wants to work with the data to quickly get the dataset into a clean format, so they can start analysis right away.
Functions
- set_dtype() - sets the data type of each column in the dataset to reduce memory requirements
- cat_mode_impute() - imputes missing values
- clean_col_name() - replaces “.” with “-” in column names and makes some names more meaningful
- encode_income_binary() - encodes the target feature, income, as binary
See all function documentation here: Function documentation
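As a rough illustration of what each step amounts to, here is a plain-pandas sketch on a toy frame (the column names here are a hypothetical subset of the census data, and the package's actual implementations may differ):

```python
import pandas as pd

# Toy frame mimicking a small slice of the Adult Census Income data.
df = pd.DataFrame({
    "education.num": [9, 10, 13, 9],
    "workclass": ["Private", None, "Private", "State-gov"],
    "income": ["<=50K", ">50K", "<=50K", ">50K"],
})

# cat_mode_impute: fill missing categorical values with the column mode
df["workclass"] = df["workclass"].fillna(df["workclass"].mode()[0])

# clean_col_name: replace "." with "-" in column names
df.columns = [c.replace(".", "-") for c in df.columns]

# encode_income_binary: map the income strings to 0/1
df["income"] = (df["income"] == ">50K").astype("int8")

# set_dtype: cast low-cardinality string columns to 'category' to save memory
df["workclass"] = df["workclass"].astype("category")
```

Each package function takes a DataFrame and appears to return a cleaned copy (see the usage example below); the sketch above mutates in place only for brevity.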
How this package fits into python ecosystem
There is an existing package called pyjanitor that provides many general-purpose, powerful data cleaning routines. Our package is much more narrowly focused on cleaning the Adult Census Income dataset specifically. In general pyjanitor is the more broadly useful package, but for tidying this particular dataset our functions will likely do the job with less effort.
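To give a sense of the difference: a general-purpose cleaner standardizes column names for any DataFrame, while this package targets the census columns directly. The following is a rough plain-pandas illustration of the general-purpose approach (in the spirit of pyjanitor's name cleaning, but not its actual API; the helper name is hypothetical):

```python
import pandas as pd

def clean_names(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase column names and normalize separators for ANY DataFrame."""
    out = df.copy()
    out.columns = [
        c.strip().lower().replace(".", "_").replace(" ", "_")
        for c in out.columns
    ]
    return out

messy = pd.DataFrame({"Education.Num": [9], "Native Country": ["Canada"]})
tidy = clean_names(messy)
print(list(tidy.columns))  # ['education_num', 'native_country']
```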
Get started
You can install this package into your preferred Python environment using pip:
```bash
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple group_15_data_wrangling
```

The data can be downloaded from Kaggle, or from the UC Irvine Machine Learning Repository, which you can fetch using:

```bash
pip install ucimlrepo
```

To use group_15_data_wrangling in your code:
```python
# imports
from group_15_data_wrangling import cat_mode_impute, clean_col_name, encode_income_binary, set_dtype
from ucimlrepo import fetch_ucirepo

# fetch dataset
adult = fetch_ucirepo(id=2)

# census data as pandas dataframes
X = adult.data.features
y = adult.data.targets
census_df = X.join(y)

# use functions from this package
census_df_no_nan = cat_mode_impute(census_df)
census_df_clean_names = clean_col_name(census_df)
census_df_binary_income = encode_income_binary(census_df)
census_df_concise_dtypes = set_dtype(census_df)
```

Documentation
Contributing
Github repo
Get started contributing
```bash
git clone <repo>                     # clone group_15_data_wrangling repo
cd group_15_data_wrangling           # cd into the project directory
conda env create -f environment.yml  # set up the dev environment
conda activate group-15-env          # activate the env
pip install -e ".[tests,dev,docs]"   # install package and development dependencies
```

Run tests

```bash
pytest --cov=src --cov-branch --cov-report=term-missing
```

Build, preview and deploy documentation
```bash
quartodoc build
quarto preview
```

The documentation website updates automatically on push to main.
To force an update of the documentation website from the current branch, run (you probably shouldn't):

```bash
quarto publish gh-pages
```

Contributors
- Limor Winter
- Shihan Xu
- Zaki Aslam
- Michael Eirikson
Copyright
- Copyright © 2026 Michael Eirikson, Limor Winter, Shihan Xu, Zaki Aslam
- Free software distributed under the MIT License.