Task Overview: Bike Rental Demand Prediction

The objective of this assignment is to analyse a dataset concerning bike rentals. The dataset can be downloaded from Blackboard. It is based on real data from the Capital Bikeshare company, which maintains a bike rental network in Washington DC. The dataset has one row for each hour of each day in 2011 and 2012, for a total of 17,379 rows. It contains features of the day (workday, holiday) as well as weather parameters such as temperature and humidity. The range of hourly bike rentals is from 1 to 977. The bike usage is stored in the field 'cnt'.

Our task is to develop a prediction model for the number of bike rentals such that Capital Bikeshare can predict the bike usage in advance. You need to write a report that discusses how you complete the task, and go into sufficient depth to demonstrate knowledge and critical understanding of the relevant processes involved. 100% of the available marks are awarded through the completion of the written report, with clear and separate marking criteria for each required report section. Notably, a distinct and significant report section discussing and critiquing the analysis and implementation processes you carried out for your data solution is required.

Attribute/feature information in the dataset:

instant: record index
dteday: date
season: season (1: spring, 2: summer, 3: fall, 4: winter)
yr: year (0: 2011, 1: 2012)
mnth: month (1 to 12)
hr: hour (0 to 23)
holiday: whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holidayschedules)
weekday: day of the week
workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
weathersit:
– 1: Clear, Few clouds, Partly cloudy
– 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
– 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
– 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp: Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
hum: Normalized humidity. The values are divided by 100 (max)
windspeed: Normalized wind speed. The values are divided by 67 (max)
casual: count of casual users
registered: count of registered users
cnt: count of total rental bikes, including both casual and registered

Section 1: Data Summary, Preprocessing and Visualisation (5%)

As a first step, you need to load the dataset from the .csv file into Microsoft Azure Machine Learning Studio. You then provide a summary of the dataset and proceed with data preprocessing. For example: What is the size of the data? How many features are there? Which data entries are redundant and can be skipped? Are there any NAs? Which features are categorical but may be marked as numeric? Are there any features that need to be normalized (where appropriate)? A summary of this kind is sketched below.

For data visualisation, you need to generate several plots using Python or R. For example, generate trellis plots (one possible plotting sketch also follows this section). As categorical features of the plot, use the 'season' and 'weathersit' features, which categorize the season of the year and the current weather situation (sun, rain, etc.). Always use the target values for the y-axis, and for the x-axis test the fields 'temp' (temperature), 'atemp' (feeling temperature), 'hum' (humidity) and 'windspeed'. What are your findings? What relationships can you see? You need to report the most interesting plots and interpret your results! Note, the information you report here should be useful for your model development!
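To illustrate the kind of dataset summary asked for above, a minimal pandas sketch; the file name 'hour.csv' is a placeholder for the file you download from Blackboard:

    import pandas as pd

    # Load the hourly bike-rental data (file name is a placeholder).
    df = pd.read_csv("hour.csv")

    # Size of the data and number of features.
    print(df.shape)        # expected: (17379, 17)

    # Data types: categorical fields such as 'season' and 'weathersit'
    # are stored as integers and may need to be treated as categories.
    print(df.dtypes)

    # Check for missing values (NAs) in each column.
    print(df.isna().sum())

    # Basic statistics, e.g. to confirm 'cnt' ranges from 1 to 977.
    print(df.describe())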
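For the trellis plots themselves, one possible sketch in Python, assuming pandas, seaborn and matplotlib are installed (swap 'temp' for 'atemp', 'hum' or 'windspeed' to test the other x-axes):

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("hour.csv")  # placeholder file name

    # Trellis plot: one panel per (season, weathersit) combination,
    # with the target 'cnt' on the y-axis and 'temp' on the x-axis.
    g = sns.FacetGrid(df, row="weathersit", col="season", margin_titles=True)
    g.map(plt.scatter, "temp", "cnt", s=5, alpha=0.3)
    g.set_axis_labels("temp (normalized)", "cnt")
    plt.show()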
Section 2: Comparison of Algorithms (7.5%)

You need to test different algorithms on this data. Split the dataset into a 75% training set and a 25% test set (i.e., the test set method). Train a linear regression model and evaluate its performance on the test set. Please use the mean absolute error (MAE) when reporting the performance of the algorithm. Report your Azure graph (i.e., the plots generated within Azure ML Studio through its built-in functions or your own R/Python scripts) and your performance.

Using the data visualisation, can you find some polynomial feature expansion that improves the performance? Report your steps and your results. If your results do not improve, explain why.

Train a boosted decision tree regression model using the same data split. Use the default parameters. Does the prediction performance improve? Repeat the same step with the decision forest algorithm, again with default parameters.

Section 3: Model Selection (15%)

Regardless of the result of the previous section, you will now use the boosted trees (for computation time reasons). You want to understand its parameters a bit better. To do so, you will use the parameter range option of the tree module and start with the 'Minimum number of samples per leaf node' parameter, for which you will use the following values: [1,2,3,4,6,8,10,15,20,40,80]. The other parameters will be set to 32 (maximum number of leaves), 0.4 (learning rate) and 100 (number of trees). Using the tune hyper-parameters module, show with a plot how the performance depends on the 'minimum number of samples' parameter. Interpret your results: what is the best parameter value? For which range can you see overfitting, and for which underfitting? Exemplify your conclusion by referring to the lecture material.

So far, you have done the model selection with the test set method. However, as a good data scientist you know this is not always a safe option. To be sure, you resort to 10-fold cross validation using the 'partition and sample' module. Redo your evaluation. Report your plots. Do you come to a different conclusion? Explain your results, also by discussing the qualitative differences between cross validation and the test set method.

Repeat the process for the other parameters of the algorithm. As validation method, use the method of your choice, but justify your choice. Always leave the other parameters fixed and use parameter values of your choice for the parameter you vary to generate these results. Report your most interesting findings and explain them by referring to the material you have learned in the lecture.

Section 4: Time Series Modelling (15%)

You happily present your great results to the CEO of Bikeshare and he wants to immediately test them on the data of the new year 2013. Surprisingly, your algorithm works significantly worse on this data. The CEO is not amused and asks you for the reasons. What do you answer?

In order to test the scenario with the new-year data, you will from now on only train on data from the year 2011 (yr = 0) and test on data from the year 2012 (yr = 1). Use the relative expression option of the 'split data' module to do so; a sketch of the equivalent split outside Azure is given below. Repeat the training of the linear model and the regression forests (using the best parameters found). Can you confirm the findings of the CEO? What is your performance?
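A minimal sketch of this year-based split and MAE evaluation outside Azure, assuming scikit-learn is available (in Azure itself, the 'split data' module's relative expression, e.g. something like \"yr" == 0, performs the same partition; check the module documentation for the exact syntax):

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error

    df = pd.read_csv("hour.csv")  # placeholder file name

    features = ["season", "mnth", "hr", "holiday", "weekday",
                "workingday", "weathersit", "temp", "atemp",
                "hum", "windspeed"]

    # Train on 2011 (yr == 0), test on 2012 (yr == 1).
    train, test = df[df["yr"] == 0], df[df["yr"] == 1]

    model = LinearRegression().fit(train[features], train["cnt"])
    pred = model.predict(test[features])
    print("MAE:", mean_absolute_error(test["cnt"], pred))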
After 3 days of sleeping badly, you have a brilliant idea of how to fix this problem. As you have experienced, you cannot directly predict the new year's values. However, if you know the number of rented bikes in the last 12 hours, you might be able to predict the bike usage for the next hour. You want to test this hypothesis. To do so, add 12 new features to the dataset using Python or R code (an illustrative sketch is given at the end of this brief). The features should indicate the bike usage 1 to 12 hours before the actual entry. Remove the data of the first 12 hours, as they do not have a history that is long enough. Report your code snippets. Retrain the regression forest. What performance do you obtain?

In the shower, you have another brilliant idea. Maybe it also helps to add the progress of bike rentals for the last 12 days (using the same time of day as the current entry). Again, use Python or R code to implement these 12 additional features. Remove the first 12 * 24 rows, as the history of these entries is again not long enough. Compare the performance of the original approach, using the 12 hours before as additional features, using the 12 days before as features, and using both the 12 hours and the 12 days. Which results will you report back to the CEO?

Also compare the decision forest algorithm to the tree boosting algorithm. Make sure that the comparison is fair (i.e., they use close-to-optimal parameter settings). Again, report your steps and your findings.

Section 5: Time Series Prediction (7.5%)

The CEO of Bikeshare is now happy with the best results that you have. However, using the last 12 hours as features only allows for prediction of the next hour. This is too short notice for Bikeshare to make use of the prediction. The CEO is asking you how much the performance would decrease if you were to predict 2, 3, 4 and 5 hours ahead. Can you provide him with these results? Create a plot with the prediction horizon on the x-axis and the performance on the y-axis. Again, report your code snippets, diagrams or plots (a sketch of the horizon shift is also given at the end of this brief).

Important Information

The report should be a maximum of 4500 words. A presentation penalty of 5% will be strictly applied if you exceed the 4500-word maximum limit (10% leeway applies). Keep in mind that:

– The report must contain your name, student number, module name and code;
– The report must be in PDF and no more than 4500 words (with ~10% leeway), including the cover page (if you have one), table of contents, appendices (if you have any) and references (if you have any; the word count does not apply to references);
– The report must be formatted in single line spacing and use an 11pt font;
– The report does not include this briefing document;
– You describe and justify each step that is needed to reproduce your results by using code snippets, screenshots and plots;
– You interpret the results of your data analysis and model development;
– Please explain trends, characteristics or even outliers when you summarise and describe data;
– Whenever you need to set a random seed for an Azure component, use your student ID as the seed. This should prevent you from reporting exactly the same results as your fellow students;
– When using screenshots or plots generated from Azure, Python or R, please make sure they are clearly readable but also do not take up a vast amount of space if this is not necessary;
– Always refer to the mean absolute error (MAE) when reporting the 'performance' of the algorithm.
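As promised above, a minimal Python sketch of the lag-feature construction from Section 4, assuming pandas and that the rows are in chronological order (the real data has a few missing hours, which a submission would need to handle):

    import pandas as pd

    df = pd.read_csv("hour.csv")  # placeholder file name

    # 12 hourly lags: bike usage 1 to 12 hours before each entry.
    # shift(k) assumes consecutive rows are consecutive hours.
    for k in range(1, 13):
        df[f"cnt_lag_{k}h"] = df["cnt"].shift(k)

    # 12 daily lags: usage at the same hour 1 to 12 days earlier.
    for k in range(1, 13):
        df[f"cnt_lag_{k}d"] = df["cnt"].shift(24 * k)

    # Drop the first 12 * 24 rows, whose history is not long enough
    # (drop only the first 12 rows if you use the hourly lags alone).
    df = df.iloc[12 * 24:].reset_index(drop=True)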
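And for Section 5, one way to obtain the h-hours-ahead setting is to shift the target rather than the features; a sketch under the same assumptions, with scikit-learn's RandomForestRegressor standing in for Azure's decision forest and the lag features built as above:

    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error

    horizons, maes = [1, 2, 3, 4, 5], []
    for h in horizons:
        tmp = df.copy()  # df with lag features, as constructed above
        # Predict 'cnt' h hours ahead: shift the target h-1 steps back,
        # so the newest known usage lies h hours before the predicted hour.
        tmp["target"] = tmp["cnt"].shift(-(h - 1))
        tmp = tmp.dropna(subset=["target"])
        train, test = tmp[tmp["yr"] == 0], tmp[tmp["yr"] == 1]
        feats = [c for c in tmp.columns if c.startswith("cnt_lag_")]
        # Per the brief, use your student ID as the seed; 0 is a placeholder.
        model = RandomForestRegressor(random_state=0)
        model.fit(train[feats], train["target"])
        maes.append(mean_absolute_error(test["target"],
                                        model.predict(test[feats])))

    plt.plot(horizons, maes, marker="o")
    plt.xlabel("prediction horizon (hours)")
    plt.ylabel("MAE")
    plt.show()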