Our primary stakeholders are theater managers, who play an important role in a theater because they must decide how many weeks and which halls to allot to each new movie. They therefore need to know how new movies will perform at the box office. However, there is a gap between box office revenues in Taipei and both US box office revenues and other movie features, which makes prediction difficult. Hence, the business goal of this project is to let managers know in advance how new movies will perform at the box office in Taipei.
To achieve this business goal with data mining, we first translate it into a data mining goal: predicting box office revenues in Taipei. The outcome managers receive is the predicted Taipei box office revenue of each movie. This is an ongoing project, meaning that managers can reuse the model whenever they have a new movie record.
Our data consist of movie features such as budget, movie type, IMDB rating, US release date, and US box office revenues. The data cover 2010 to 2015 and initially comprise 2,632 movies. After handling missing values and outliers, around 560 usable records remain. We preprocessed certain variables (e.g. creating dummy variables for movie type and movie rating) and then partitioned the data. All of these steps, taken before building the model, were aimed at making the predictions more accurate. We chose XLMiner and R as our data mining tools and linear regression as our data mining method. With the final linear regression model, our client obtains the predicted Taipei box office revenue of a movie and can judge from it whether the movie is likely to do well.
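The preprocessing and partitioning steps can be illustrated with a minimal sketch. It is written in plain Python rather than the XLMiner and R tools used in the project, and the column names, the sample rows, the 60/40 split, and the helper names `one_hot` and `partition` are all assumptions made for illustration, not the project's actual data or settings.

```python
import random

def one_hot(records, column):
    """Replace a categorical column with 0/1 dummy columns."""
    levels = sorted({r[column] for r in records})
    encoded = []
    for r in records:
        row = {k: v for k, v in r.items() if k != column}
        # Drop the first level as the reference category so the dummies
        # are not perfectly collinear with the regression intercept.
        for level in levels[1:]:
            row[f"{column}_{level}"] = 1 if r[column] == level else 0
        encoded.append(row)
    return encoded

def partition(records, train_fraction=0.6, seed=1):
    """Shuffle the records and split them into training and validation sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Made-up rows (hypothetical columns and values), for illustration only.
movies = [
    {"budget": 150.0, "genre": "Action", "us_gross": 400.0},
    {"budget": 20.0, "genre": "Drama", "us_gross": 35.0},
    {"budget": 60.0, "genre": "Comedy", "us_gross": 90.0},
    {"budget": 90.0, "genre": "Action", "us_gross": 210.0},
    {"budget": 10.0, "genre": "Drama", "us_gross": 12.0},
]

encoded = one_hot(movies, "genre")
train, valid = partition(encoded, train_fraction=0.6)
```

Dropping one level per categorical variable is a common choice for linear regression; whether the project did so is not stated here.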
Although the model's predictions have a large error rate, they are still much better than the naive prediction that uses the average revenue for every movie. We therefore consider the important predictors identified in this report to be credible.
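A comparison against the average prediction can be sketched as follows, assuming that "average prediction" means predicting the mean training revenue for every movie and that error is measured by root mean squared error (RMSE); the report may have used a different metric. All numbers are invented for illustration and are not the project's results.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

# Invented values (not project results): Taipei revenues in arbitrary units.
train_revenue = [10.0, 20.0, 30.0, 40.0]
valid_actual = [12.0, 35.0, 50.0]
valid_model = [15.0, 30.0, 42.0]  # hypothetical regression predictions

# Naive baseline: predict the training mean for every validation movie.
baseline = sum(train_revenue) / len(train_revenue)
baseline_rmse = rmse(valid_actual, [baseline] * len(valid_actual))
model_rmse = rmse(valid_actual, valid_model)

print(f"baseline RMSE: {baseline_rmse:.2f}, model RMSE: {model_rmse:.2f}")
```

A model is only worth keeping if its validation error is clearly below that of this baseline.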
For future work, we suggest that theater managers collect more usable data records and more valuable data dimensions, such as the weeks and halls allotted to past releases, since the biggest weakness of our project is the small data size. We also suggest that further studies take environmental changes and the number of people who rated each movie into account. In addition, further studies could try classification as the data mining method if our client requires only the level of box office revenue rather than exact numbers.
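If classification is tried, the numeric revenue first has to be turned into levels. The sketch below shows one way to do this; the function name `revenue_level`, the three labels, and the cutoff values are hypothetical and would need to be chosen together with the client.

```python
def revenue_level(revenue, cutoffs=(10.0, 50.0)):
    """Map a numeric revenue to a level using two hypothetical cutoffs."""
    low, high = cutoffs
    if revenue < low:
        return "low"
    if revenue < high:
        return "medium"
    return "high"

# Invented revenues in arbitrary units, for illustration only.
revenues = [3.5, 10.0, 49.9, 120.0]
levels = [revenue_level(r) for r in revenues]
```

The resulting labels would replace the numeric revenue as the target variable of a classifier.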