Xgboost Handle Categorical Variable



The dummy() function creates one new variable for every level of the factor for which we are creating dummies. For example, if my variable has four categories,. Clustering tools have been around in Alteryx for a while. The features for our ML pipeline are defined by combining the categorical_class and numerics_out. Visiting: Categorical Features and Encoding in Decision Trees. , collie, shepherd, terrier) would be examples of categorical variables. I have to use it categorical feature. categorical variable. How many dummy varibles are needed? In a multiple regression there are times we want to include a categorical variable in our model. Even though XGBoost might have higher accuracy, LightGBM runs previously 10 times and currently 6 times faster than XGBoost. One of RFs nice features is their ability to calculate the importance of features for separating classes. Interview question for Analytics in Atlanta, GA. There are two ways in which we can handle categorical data:. If the proportion of cases in the reference category is small, the indicator variables will necessarily have high VIFs, even if the categorical variable is not associated with other variables in the regression model. Half of them have 3 to 4 categories but others have 14 to 28 categories. Train and Test Split. Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author, Tong He. conclude by showing how this method can be used to discover and handle nonlinear relationships between variables. That is, if your budget can cover running hours of training on expensive machines with TPUs or GPUs. It implements several methods for this purpose: “hammock plots, parallel sets plots, common angle plots, and common angle plots with a hammock-like adjustment for line widths” [ ggparallel manual ]. The model tuning in Random Forest is much easier than in case of XGBoost. We will use the terms variable and attribute interchangeably throughout this book. Mapping Categorical Data in pandas. A set of scikit-learn-style transformers for encoding categorical variables into numeric by means of different techniques. R will handle string variables with no problem. Note that the XGBoost cross-validation function is not supported in SPSS Modeler. Explore Popular Topics Like Government, Sports, Medicine, Fintech, Food, More. In each iteration, a new tree (or a forest) is built, which improves the accuracy of the current (ensemble) model. It handles categorical, binary and continuous variables and produces and output dataset that can be further customized. Like – The categorical variable with high cardinality/ continous variable are given preference over others (due to more number of splits) And correlation is not visible in case of RF feature importance. Convert Text in Table Variables to Categorical. It seems that categorical input variables are not allowed in XGBoost, but they seem to be a core feature in CatBoost. 02754v3 [cs. And indicate with a forward slash and the level option that is a categorical variable by including the word nominal following the equal sign. How do I handle multiple levels of categorical variables in path analysis? Hello, I am stuck on a path analysis problem and I don't know if what I'm attempting to do is reccomended or even possible. In the Factor procedure dialogs (Analyze->Dimension Reduction->;Factor), I do not see an option for defining the variables as categorical. For the record, I have created a solution using custom objective and error funcitons. Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author, Tong He. You should re-encode high cardinality categorical variables Nina Zumel and I have been doing a lot of writing on the (important) details of re-encoding high cardinality categorical variables for predictive modeling. The other option is to use ets or Arima models in the forecast package. Assuming that we are talking about using Xgboost for GBDT and not a linear model, This answer is simply not true. Just tell us which column holds the category you want to split on, and we'll handle the rest. get_dummies takes a column with multiple possible values and converts it into a series of columns each with only 0s and 1s. The accuracies are comparable. The methods to pool the statistical tests after MI will be elaborated below with the focus on testing whether a categorical variable as a whole significantly contributes to the model. This fix also depends on the availability of Spark 2. This algorithm currently only accepts categorical data. A set of scikit-learn-style transformers for encoding categorical variables into numeric by means of different techniques. Visiting: Categorical Features and Encoding in Decision Trees. Choosing the tree structure. Thus: response variables act as the centroids of the objects and the distances between response variable points indicate their χ 2 distances. CatBoost converts categorical values into numbers using various statistics on combinations of categorical features and combinations of categorical and numerical features. It seems that the plot_importance function biases against categorical features. Multiple-response (or pick any/c) categorical variables summarize responses to survey questions that ask “pick any” from a set of item responses. , collie, shepherd, terrier) would be examples of categorical variables. Let's start RStudio and begin…. We will generate married2, the exact same variable, using the recode statement instead so that you can see that the recode statement requires a lot less coding and only two steps. Now you will learn how to read a dataset in Spark and encode categorical variables in Apache Spark's Python API, Pyspark. Consequently, each categorical feature transforms into N sub-categorical features, where N is the number of possible outcomes for this categorical feature. This tutorial will explore how categorical variables can be handled in R. 0 for none. One Hot Encoding them would only lead to a sparse matrix with too many features correlated between them. ), they are actually categorical variables. It appends the variable name with the factor level name to generate names for the dummy variables. I have data containing few categorical columns with a huge amount of categories at each (more than 1000 different categories at each column). R provides many methods for creating frequency and contingency tables. We launched a fast benchmark with a xgboost model and got a 0. raw a cached memory dump of the xgboost model saved as Rs raw type. You should re-encode high cardinality categorical variables Nina Zumel and I have been doing a lot of writing on the (important) details of re-encoding high cardinality categorical variables for predictive modeling. This term emanates from digital circuit language, where it means an array of binary signals and only legal values are 0s and 1s. How to prepare categorical input variables using one hot encoding. One of RFs nice features is their ability to calculate the importance of features for separating classes. XLMiner converted the string variable into three categorical variables that resulted in an entirely numeric data set. One could include Group as a categorial variable. Both treat missing values as information (i. The collection of four items above is a description of an individual organism. Basically, XGBoost is an algorithm. In RF we have two main parameters: number of features to be selected at each node and number of decision trees. Decision Trees¶. categorical variable. The variable importance plot of the XGBoost model indicated that the level of HBsAg was of high importance followed by age and the level of hepatitis B virus (HBV) DNA. may be categorical – to have missing values. Name is a variable with text data but I don't think it's a useful variable to predict survival. But before that it's good to brush up on some basic knowledge about Spark. you generally won't it for variables taking more than 15 different values. LightGBM and xgboost with the tree_method set to hist will both compute the bins at the beginning of training and reuse the same bins throughout the entire training process. These methods are applicable to univariate time series. In a three category variable the minim is a generic negative response to the quesiton, while the minimum in a 7 category variable a severe negative response to the question. It is also important to note that xgboost is not the best algorithm out there when all the features are categorical or when the number of rows is less than the number of fields (columns). Allocate a #categorical length vector The vector will be sparse if there are lots of categories, the learning algorithm is preferred to handle sparse data. Next let's show how one can apply XGBoost to their machine learning models. Independent variable: Categorical. Interpret the regression coefficients of a linear regression model containing a qualitative (categorical) predictor variable. 2 Date 2019-08-01 Description Extreme Gradient Boosting, which is an efficient implementation. One of RFs nice features is their ability to calculate the importance of features for separating classes. , collie, shepherd, terrier) would be examples of categorical variables. For example, suppose we wanted to assess the relationship between household income and political affiliation (i. After saving the ‘Titanic. Of the 79 variables provided, 51 were categorical and 28 were continuous. LightGBM in some cases reaches it's top accuracy in under a minute and while only reading a fraction of the whole dataset. It'd be a poor choice in some cases with fewer values, though. Secondly, if the model has slopes and intercepts (e. Treating ordinal variables as nominal. Training XGBoost model. This chapter discussed how categorical variables with more than two levels could be used in a multiple regression prediction model. I am trying to create a scatterplot of two categorical values in Windows Excel 2013. An efficient maximum entropy approach for categorical variable prediction D. Posted on April 15, 2017 April 15, 2017 Author John Mount Categories Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials Tags categorical variables, encoding, hashing, one-hot, R, vtreat, xgboost Encoding categorical variables: one-hot and beyond. Merging some categories of a categorical variable in SPSS is not hard if you do it the right way. In Section 5 we will use a data frame with categorical variables to illustrate logistic regression. The other option is to use ets or Arima models in the forecast package. The gradient boosting package which we'll use is xgboost. XGBoost was able to handle the over 2,000 brands data while Multiple Linear Regression was not able to. The procedure is called dummy coding and involves creating a number of dichotomous categorical variables from a single categorical variable with more than two levels. Decision trees can handle both categorical and numerical variables at the same time as features, there is not any problem in doing that. this functions deals with categorical variables in the test set which are not present in the training set. Represents previously calculated feature importance as a bar graph. Compared to the packages Amelia and mice, it better handles cases where the number of variables is larger than the number of units, and cases where regularization is needed (i. For each numeric variable, a table giving, for each target class, mean and standard deviation of the (sub-)variable. XGBoost is an example of a boosting algorithm. Flexible Data Ingestion. Ordinal encoding uses a single column of integers to represent the classes. Consequently, each categorical feature transforms into N sub-categorical features, where N is the number of possible outcomes for this categorical feature. , binary, dichotomous) or those with a few ordered categories. We can encode the categorical variables into numerical vector using one-hot encoding. It is also known as a constant variable or simply as a "control". #For this example, we will use the default base learner library for `h2o. How the Handle Missing Data with Imputer in Python by admin on April 14, 2017 with No Comments Some of the problem that you will encounter while practicing data science is to the case where you have to deal with missing data. XGBoost is a powerful library for building ensemble machine learning models via the algorithm called gradient boosting. I have already explained this step here. Applying Ordinal Encoding to Categoricals. ensemble`, which includes the default H2O GLM, Random Forest, GBM and Deep Neural Net (all using default model parameter values). After saving the 'Titanic. Convert Text in Table Variables to Categorical. It is more economical to use C - 1. Examples are given for entry of small datasets, recoding of categorical variables, and quantile-based or similar binning of counted or measured variables. Handles Up to 50 Debts at Once Use FFBMPR12. Let's check the encoded values now. A categorical variable takes on a limited, and usually fixed, number of possible values (categories; levels in R). based on categorical variables. The Cochran-Mantel-Haenszel test is an extension of the Chi-square test where the bivariate association between two dichotomous categorical variables is assessed in different treatment "conditions. This example shows how to convert a variable in a table from a cell array of character vectors to a categorical array. Numeric VS categorical variables¶ Xgboost manages only numeric vectors. iloc[:,:-1],data. Note: You should convert your categorical features to int type before you construct Dataset for LGBM. As with our other predictive tools, connect your training data to the I anchor, and your new raw data stream (which does not contain the target variable) to the D anchor. However, this is not what we wanted, because the computer might think that data with value 2 is greater than value 1, or data with value 1 is greater than value 0. We can convert the string categorical values into an integer code using factorize method of the pandas library. You can help correct errors and omissions. Kagglers start to use LightGBM more than XGBoost. This is the proper representation of a categorical variable for xgboost or any other machine learning tool. Therefore, it is absolutely critical that we make sure to encode categorical variables correctly, before we feed data into a machine learning algorithm. Chollet mentions that XGBoost is the one shallow learning technique that a successful applied machine learner should be familiar with today, so I took his word for it and dove in to learn more. Classification, Regression, Clustering, Dimensionality reduction, Model selection, Preprocessing. Data Science Live Book 6 Download book If you liked the book and wanted to support the project, you can acquire forever and ever the portable Data Science Live Book in PDF, mobi, and Kindle version. Also note that XGBoost in SPSS Modeler performs one-hot encoding automatically for categorical variables. By comparison with the ARIMA, XGBoost, C-XGBoost, and A-XGBoost models using data from Jollychic cross-border e-commerce platform, the C-A-XGBoost is proved to outperform than other four models. Let's start RStudio and begin…. 2 Unifying Categorical and Numerical Variables Realistic datasets often contain a mix of numerical and categorical variables. In this example, we want to compare the Hispanic group to the three other ethnicity race groups. The GLMMOD procedure can create dummy variables for each categorical variable. It is also known as a constant variable or simply as a "control". Handling Categorical features automatically: We can use CatBoost without any explicit pre-processing to convert categories into numbers. The Cochran-Mantel-Haenszel test is an extension of the Chi-square test where the bivariate association between two dichotomous categorical variables is assessed in different treatment "conditions. Below I made a very simple tutorial for this in Python. Download the following notebooks and try the AutoML Toolkit today: Evaluating Risk for Loan Approvals using XGBoost (0. Often, the term “categorical” is used to refer to nominal variables that consist of unordered categories. To avoid loosing cases when independent variables are missing you can try creating categorical variables and add missing category for that variable. An Introduction to Deep Learning for Tabular Data Written: 29 Apr 2018 by Rachel Thomas. Pandas get_dummies is a nice tool for creating dummy variables (which is easier to use, in my opinion). Maths and Statistics Help Centre 1 Many statistical tests require the dependent (response) variable to be continuous so a different set of tests are needed when the dependent variable is categorical. Dealing with Categorical Features in Big Data with Spark. When XGBoost encounters a missing value at a node, it tries both the left and right hand split and learns the way leading to higher loss for each node. The reason for this is because we compute statistics on each feature (column). Categorical features not supported. 0 for none. Flexible Data Ingestion. You can also use get_dummies method of Pandas to one hot encode this categorical variable. And I cant use it numerical feature. If your data is in a different form, it must be prepared into the expected format. ANOVA and GLM have CLASS statements available to specify categorical variables. variables into the regression model using stepwise selection and a second block using forward selection. LightGBM in some cases reaches it's top accuracy in under a minute and while only reading a fraction of the whole dataset. XGBoost compatible. no numeric relationship). As it is a categorical variable, next step is to one hot encode this column. OrdinalEncoder (verbose=0, mapping=None, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value') [source] ¶ Encodes categorical features as ordinal, in one ordered feature. While XGBoost does not directly support categorical variables, the ability to efficiently store and process sparse input matrices allows us to process categorical variables through one-hot encoding. For each stratum, we define the odds ratio in terms of: 1 item and 2 rows, 2 items and 2 rows, and 2 items and 1 row. Both the continuous and categorical models perform equally well but the latter may be better suited to some applications with larger sample sizes. For logistic regression analysis, statistical testing of covariates after MI can be performed by different methods. , factorisation machines, boosted trees, has proved the potential of exploring the interactions among inter-field categories. use a machine learning model which handle categorical features, the greatest of solutions! in the categorical variable. My understanding is that XGBoost requires that categorical features go through one-hot encoding. Also, it has recently been dominating applied machine learning. To follow this tutorial, you will need the development version of Xgboost from. But before that it's good to brush up on some basic knowledge about Spark. We can encode the categorical variables into numerical vector using one-hot encoding. This algorithm currently only accepts categorical data. •Actually it is not necessary to handle categorical separately. Scikit-learn. As the algorithms have different sets of features (like an ability to handle missing measurements or categorical input variables), there is a little common ground between the classes. We will generate married2, the exact same variable, using the recode statement instead so that you can see that the recode statement requires a lot less coding and only two steps. conclude by showing how this method can be used to discover and handle nonlinear relationships between variables. Parameters for xgboost were found by cross-validation. xgboost only accepts numeric values thus one-hot encoding is required for categorical variables. We add our ethnicity race variable, ETHRACE, to the list of explanatory variables. LightGBM and xgboost with the tree_method set to hist will both compute the bins at the beginning of training and reuse the same bins throughout the entire training process. It becomes clear from the. If you have four levels, you’ll have three indicator variables, and so on. We will convert the numeric encoding of categorical variables to their intended meaning. We now alert the user whenever Spark doesn't handle missing values correctly (#4309, #4349). Unlike CatBoost or LGBM, XGBoost cannot handle categorical features by itself, it only accepts numerical values similar to Random Forest. Explore Popular Topics Like Government, Sports, Medicine, Fintech, Food, More. We’ve looked at the interaction effect between two categorical variables. handle_new_test_levels: this functions deals with categorical variables in the test in pbhogale/simplexgb: Simple wrapper for xgboost. Compared to the packages Amelia and mice, it better handles cases where the number of variables is larger than the number of units, and cases where regularization is needed (i. Simple Linear Regression - One Binary Categorical Independent Variable Does sex influence confidence in the police? We want to perform linear regression of the police confidence score against sex, which is a binary categorical variable with two possible values (which we can see are 1= Male and 2= Female if we check the Values cell in the sex. Factors in R are stored as vectors of integer values and can be labelled. All material on this site has been provided by the respective publishers and authors. It is also known as a constant variable or simply as a "control". But to tell Python that it is a categorical variable, we need to type a capital C and then put the name of the categorical variable in parentheses after the capital C. You can define your variables by either double-clicking on the "var" at the top of the column or click in the column you want to define the variable for and then choose "Data" from the menu and select "Define Variable". We would like to show you a description here but the site won't allow us. In this post I’ll take a look at how they each work, compare their features and discuss which use cases are best suited to each decision tree algorithm implementation. 1BestCsharp blog 6,623,001 views. One could include Group as a categorial variable. Problem Statement :. Information on 1309 of those on board will be used to demonstrate summarising categorical variables. Decision trees can handle both categorical and numerical variables at the same time as features, there is not any problem in doing that. Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. How do I handle multiple levels of categorical variables in path analysis? Hello, I am stuck on a path analysis problem and I don't know if what I'm attempting to do is reccomended or even possible. XGBoost has done remarkably well in machine learning competitions because it robustly handles a variety of data types, relationships, and distributions, and the large number of hyperparameters that can be tweaked and tuned for improved the fit. However, the algorithm requires the variables to be coded into its equivalent integer codes. I have data containing few categorical columns with a huge amount of categories at each (more than 1000 different categories at each column). Classification, Regression, Clustering, Dimensionality reduction, Model selection, Preprocessing. We can convert the string categorical values into an integer code using factorize method of the pandas library. Separate out the numerical variables first. evaluation_log evaluation history storead as a data. II Calendar No. Simple Linear Regression - One Binary Categorical Independent Variable Does sex influence confidence in the police? We want to perform linear regression of the police confidence score against sex, which is a binary categorical variable with two possible values (which we can see are 1= Male and 2= Female if we check the Values cell in the sex. Interview question for Analytics in Atlanta, GA. importance uses the ggplot backend. DMatrix {xgboost} for categorical input Dec 24, 2014 This comment has been minimized. For instance, if a variable called Colour can have only one of these three values, red, blue or green, then Colour is a categorical variable. NLREG determines the values of parameters for an equation, whose form you specify, that cause the equation to best fit a set of data values. , city or URL), were most of the levels appear in a relatively small number of instances. Another thing to avoid using line chart are showing too many variables on one chart. Categorical features must be encoded as non-negative integers (int) less than Int32. Aim of Course: This online course, "Categorical Data Analysis" will focus on a logistic regression approach for the analysis of contingency table data, where the cell entries represent counts that are cross-tabulated using categorical variables. Summarising categorical variables in R. You can also use neural networks. How do I handle multiple levels of categorical variables in path analysis? Hello, I am stuck on a path analysis problem and I don't know if what I'm attempting to do is reccomended or even possible. Secondly, if the model has slopes and intercepts (e. This term emanates from digital circuit language, where it means an array of binary signals and only legal values are 0s and 1s. tqchen changed the title Documentation of xgb. sparkxgb is a new sparklyr extension that can be used to train XGBoost models in Spark. DMatrix {xgboost} for categorical input Dec 24, 2014 This comment has been minimized. Nowadays, it steals the spotlight in gradient boosting machines. We can see that handling categorical variables using dummy variables works for SVM and kNN and they perform even better than KDC. Our data set is very noisy and contains a lot of missing values e. I tried putting the values into a pivot table, but cannot create an XY chart type with a chart that has been created from PivotTable data. To use xgboost package, keep these things in mind: Convert the categorical variables into numeric using one hot encoding; For classification, if the dependent variable belongs to class factor, convert it to numeric. These are non-numeric variables -- or if numeric, the values should not be interpreted as numeric values -- that typically consist of a limited number of unique values (the categories or the levels). How the Handle Missing Data with Imputer in Python by admin on April 14, 2017 with No Comments Some of the problem that you will encounter while practicing data science is to the case where you have to deal with missing data. When you create a version with the gcloud tool, you may provide the framework name in capital letters with underscores (for example, SCIKIT_LEARN ) or in lowercase letters. Dummy variables are also called indicator variables. Basically, XGBoost is an algorithm. – B_Miner Aug 15 '16 at 16:42. iloc to subset the data. I want to answer this question not just in terms of XGBoost but in terms of any problem dealing with categorical data. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. I recently came across a new [to me] approach, gradient boosting machines (specifically XGBoost), in the book Deep Learning with Python by François Chollet. Name of Cities in Country), How will we convert them into dummy variables. categorical variable. Compared to the packages Amelia and mice, it better handles cases where the number of variables is larger than the number of units, and cases where regularization is needed (i. Parameters for xgboost were found by cross-validation. This algorithm has two phases: Preprocessing: AI Platform processes your mix of categorical and numerical data into an all numerical dataset in order to prepare it for training with XGBoost. This macro creates a table of charateristics for the variables listed. iloc[:,:-1],data. It is a library for implementing optimised and. Dummy variables are also called indicator variables. Categoricals are a pandas data type corresponding to categorical variables in statistics. Mathematically, it can be represented as : XGBoost handles only numeric variables. The new variable gets the same type and length as the expression on the right side of the assignment statement. These variables don't need any kind of processing and thus we can start applying normalization and machine learning models to these variables. Convert Text in Table Variables to Categorical. Example of categorical value is {male, female} and {low, medium, high}. This example shows how to convert a variable in a table from a cell array of character vectors to a categorical array. Practical Approaches to Dealing with Nonnormal and Categorical Variables. However, I was still able to train a xgboost model without one-hot encoding when I used the parsnip interface. MaxValue (2147483647). We now alert the user whenever Spark doesn't handle missing values correctly (#4309, #4349). Target encoding categorical variables is a great way to represent categorical data in a numerical format that machine learning algorithms can handle, without jacking up the dimensionality of your training data. Although logistic regression has become the cornerstone of modelling categorical outcomes in medical statistics, separate regression analysis for each outcome of interest is hardly challenged as a pragmatic approach even in the situations when the outcomes are naturally related. This is the proper representation of a categorical variable for xgboost or any other machine learning tool. and Nicholas J. Assuming that we are talking about using Xgboost for GBDT and not a linear model, This answer is simply not true. iloc[:,:-1],data. While the latter two variables may also be considered in a numerical manner by using exact values for age and highest grade completed, it is often more informative to categorize such. Although, it was designed for speed and per. Choosing the tree structure. Numeric VS categorical variables¶ Xgboost manages only numeric vectors. Properly code a qualitative variable so that it can be incorporated into a multiple regression model. How to handle this case? ie how to handle categorical feature with so many different values in it? I am using Xgboost in python. How do I handle multiple levels of categorical variables in path analysis? Hello, I am stuck on a path analysis problem and I don't know if what I'm attempting to do is reccomended or even possible. Categorical features must be encoded as non-negative integers (int) less than Int32. If your data is in a different form, it must be prepared into the expected format. use a machine learning model which handle categorical features, the greatest of solutions! in the categorical variable. In your independent variables list, you have a categorical variable with 4 categories (or levels). See general information about how to correct material in RePEc. in the dataset. TextExplainer, tabular explainers need a training set. When requesting a correction, please mention this item's handle: RePEc:vua:wpaper:1981-22. One way to represent a categorical variable is to code the categories 0 and 1 as follows:. 513 test set RMSLE. Assessing Factorial Invariance in Ordered-Categorical Measures Roger E. Categorical data might not have a logical order. Variables can be classified as categorical (aka, qualitative) or quantitative (aka, numerical). CatBoost can use categorical features directly and is scalable in nature. Nowadays, it steals the spotlight in gradient boosting machines. Convert categorical variable into dummy/indicator variables. A categorical variable that can take on exactly two values is termed a binary variable or a dichotomous variable; an important special case is the Bernoulli variable. 136 113th CONGRESS 1st Session H. Categorical data¶ This is an introduction to pandas categorical data type, including a short comparison with R's factor. In this interview, Alexey. based on categorical variables. Categorical variables take on values that are names or labels. But before that it's good to brush up on some basic knowledge about Spark. Each variable has an associated label attribute to help with interpretation. This still leaves approximately 20 categorical variables that I would be dropping from the dataset because my machine can't handle them. The regression can only use numerical variable as its inputs data. This is the proper representation of a categorical variable for xgboost or any other machine learning tool. Numeric variables are represented with a linear spline basis. Of the 79 variables provided, 51 were categorical and 28 were continuous. This term emanates from digital circuit language, where it means an array of binary signals and only legal values are 0s and 1s. Examples are given for entry of small datasets, recoding of categorical variables, and quantile-based or similar binning of counted or measured variables. Load data Access to a vector (variable) within a data frame. Often, the term “categorical” is used to refer to nominal variables that consist of unordered categories. this functions deals with categorical variables in the test set which are not present in the training set. This is important because sometimes it is difficult to encode these categorical variables into numerical values. Therefore one has to perform various encodings like label encoding, mean encoding or one-hot encoding before supplying categorical data to XGBoost. In reality, the categorical values start from 0 goes all the way up to N-1 categories.