Companies in many industries are turning their focus to data mining and predictive-modelling technology, and data analytics is becoming a priceless resource for them. As a result, companies are actively hunting for data scientists to innovate their business. At the same time, the job of data scientist is relatively new compared with established occupations. Because data science is not a standalone industry and is normally added to other industries, companies in each industry need to consider whether data science will put them at an advantage or a disadvantage.
Assuming that these companies have already decided to bring data science into their existing business, they now need to hire the right person, a data scientist who can help them achieve their goals effectively. Because the data scientist position has emerged only in recent years, we cannot judge data scientists’ competence by the criteria used for existing job titles. Setting appropriate criteria for hiring the right person in data science is hard, and the cost of missing the right person or hiring the wrong person can be enormous.
Kaggle is a website that acts mainly as a medium between data-analysis and prediction competitions and their participants. To predict whether a candidate has the potential to be hired, we take roughly 16 thousand rows of survey data from the users of this website. The process is described in detail later. To exclude bias arising from differences between countries, such as the unequal purchasing power of currencies, we select only the data from the U.S. Accordingly, the actual data size that can be used shrinks to around 1,300 rows. The data includes data-scientist-related information along with personal information such as participants’ skills, age, education status, compensation, previous job information, and current job.
The compensation in the survey data is the best available indicator for comparing data scientists. Our model will be trained on the survey data using several data-analysis methods. Consequently, the predicted compensation will be based on over 70 columns, which consist mainly of the dimensions described above. Nonetheless, using compensation directly as the output variable is not ideal, since it provides no comparison with existing industries. To solve the business problem described above, the output variable must be binary, with two outcomes: accepted or unaccepted. Compensation will therefore be used only as a reference to be compared with the average compensation in the existing industries.
● Evaluation of accuracy, the cost of missing a potential hire, and the cost of hiring the wrong person
● Working culture in different industries and countries is not considered
● The model needs to be renewed after a period of time
● Some components of compensation can’t be evaluated in this model (such as basic salary, working KPI, bonus, and welfare)
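As a rough illustration of how the binary accepted/unaccepted outcome described above might be derived, the sketch below keeps only U.S. respondents and compares each compensation figure with an industry average. The field names, the sample records, the industry averages, and the "at or above the average means accepted" rule are all assumptions made for this example; the text does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Respondent:
    country: str
    industry: str
    compensation: float  # annual compensation in USD (assumed unit)


# Hypothetical industry averages, invented purely for illustration.
INDUSTRY_AVERAGE = {"Finance": 90_000.0, "Retail": 60_000.0}


def label_accepted(respondents, industry_average, country="United States"):
    """Return (respondent, label) pairs for one country.

    label is 1 ("accepted") when compensation is at or above the
    industry average and 0 ("unaccepted") otherwise. This threshold
    rule is an assumption, not a rule stated in the report.
    """
    labelled = []
    for r in respondents:
        if r.country != country:
            continue  # drop other countries to avoid currency bias
        if r.industry not in industry_average:
            continue  # no reference value to compare against
        label = int(r.compensation >= industry_average[r.industry])
        labelled.append((r, label))
    return labelled


sample = [
    Respondent("United States", "Finance", 120_000.0),
    Respondent("United States", "Retail", 45_000.0),
    Respondent("Germany", "Finance", 80_000.0),
]
labels = [label for _, label in label_accepted(sample, INDUSTRY_AVERAGE)]
```

In this sketch the German respondent is filtered out before labelling, which mirrors the country restriction described in the text.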
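The evaluation criteria listed above (accuracy, the cost of missing a potential hire, and the cost of hiring the wrong person) can be sketched as a simple cost-weighted count of prediction errors. The function name, the per-error cost values, and the sample labels below are hypothetical; actual costs would have to come from the hiring company.

```python
def evaluate(y_true, y_pred, cost_missed=1.0, cost_wrong_hire=1.0):
    """Compare true and predicted labels (1 = accepted, 0 = unaccepted).

    A missed potential hire is a false negative; a wrong hire is a
    false positive. The cost weights are placeholders for illustration.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    correct = missed = wrong_hires = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            correct += 1
        elif truth == 1:
            missed += 1       # good candidate predicted as unaccepted
        else:
            wrong_hires += 1  # weak candidate predicted as accepted

    total = len(y_true)
    return {
        "accuracy": correct / total if total else 0.0,
        "missed": missed,
        "wrong_hires": wrong_hires,
        "total_cost": missed * cost_missed + wrong_hires * cost_wrong_hire,
    }


# Made-up labels: one missed candidate and one wrong hire out of five.
result = evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1],
                  cost_missed=3.0, cost_wrong_hire=5.0)
```

Weighting the two error types separately lets a company that fears a wrong hire more than a missed candidate (or the reverse) compare models on total cost rather than on accuracy alone.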