Hello,

I have a dataset with a categorical column that contains three categories. One category accounts for 98% of the rows, while the remaining 2% is split between the other two categories, with roughly 50 rows in each. It is worth mentioning that the output is the same for all of these roughly 50 rows, which suggests that these data points may be important.

However, the data is obviously imbalanced, and I am unable to perform any analysis. Should I drop the entire column, or perform a chi-square test on the data as-is?

1 Answer


Don't drop the column just because its categories are imbalanced. Instead, you can try oversampling the rare categories, or use a model that tends to handle imbalanced data well, such as XGBoost. That way you can still extract useful insights without losing the information carried by the rare categories.
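On the chi-square part of the question, a rough way to see whether the test is usable on the data as-is is to build the contingency table and look at the expected counts; a common rule of thumb is that every expected count should be at least 5. The sketch below uses only the standard library and entirely made-up data (5,000 rows, a binary output, and category names `A`, `B`, `C` are all assumptions, not the asker's dataset). In practice `pandas.crosstab` and `scipy.stats.chi2_contingency` would do the same work and also return a p-value.

```python
from collections import Counter


def contingency(categories, outcomes):
    """Count rows for every (category, outcome) pair."""
    counts = Counter(zip(categories, outcomes))
    rows = sorted(set(categories))
    cols = sorted(set(outcomes))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


def chi_square(table):
    """Return the Pearson chi-square statistic, degrees of freedom, and expected counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / grand total
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof, expected


# Hypothetical data: category "A" is 98% of rows with a mixed output,
# while the 50 rows each of "B" and "C" all share the same output (1).
categories = ["A"] * 4900 + ["B"] * 50 + ["C"] * 50
outcomes = [0] * 3430 + [1] * 1470 + [1] * 50 + [1] * 50

rows, cols, table = contingency(categories, outcomes)
stat, dof, expected = chi_square(table)
min_expected = min(min(row) for row in expected)

print("observed:", dict(zip(rows, table)))
print("smallest expected count:", round(min_expected, 1))
print("chi-square statistic:", round(stat, 2), "with", dof, "degrees of freedom")
```

With these invented numbers the smallest expected count is well above 5, so the rule of thumb is satisfied even though the column is heavily imbalanced. If an expected count did fall below 5 on the real data, the chi-square approximation would be less trustworthy and an exact test would be the safer choice.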
