Projects

Olist

Forecasting hourly Uber rides by neighborhood in New York City

Online job listing classification using NLP

Property sale price predictive modelling

Comparing market reach of global retailers

Olist is a Brazilian online retail marketplace.

Featured sellers

Sellers were grouped into three tiers based on their customer acquisition rate, transaction volume and review scores.

Review Score Insights

Leverage visibility across the order process with vouchers/rewards, and request a review once the customer has been served.

Customer acquisition

Listing optimization

Markets

Markets were segmented using geographical hierarchies such as the "Areas of influence" published by Instituto Brasileiro de Geografia e Estatística (IBGE).

Transaction forecasting

Independent variables - Total number of listings, products and sellers.
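
A minimal sketch of how such a forecast could be framed as a regression, assuming a hypothetical monthly aggregate table; the file and column names (listings, products, sellers, transactions) are illustrative only:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical monthly aggregates; file and column names are illustrative only.
df = pd.read_csv("olist_monthly.csv")  # columns: listings, products, sellers, transactions

X = df[["listings", "products", "sellers"]]  # independent variables
y = df["transactions"]                       # target

model = LinearRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_)))     # contribution of each driver
```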

Forecasting hourly Uber rides by neighborhood and borough in New York City

The purpose of this project was to develop a model that could predict hourly demand for Uber rides by location in New York City. It used Uber rides data covering the time and geocoordinates of rides in New York City from April to September 2014. Rides were segmented by neighborhood and borough using the GIS Neighborhood Tabulation Areas published by the NYC Department of City Planning. Prophet, a time-series forecasting procedure developed by Facebook's core data science team, was then used to generate hourly ride forecasts for each neighborhood and borough with high accuracy.
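
A minimal sketch of this per-neighborhood workflow, assuming a rides table with a pickup timestamp and a neighborhood label (the file and column names are illustrative, and the `prophet` package is assumed):

```python
import pandas as pd
from prophet import Prophet

# Hypothetical input: one row per ride with a pickup timestamp and NTA label.
rides = pd.read_csv("uber_rides_2014.csv", parse_dates=["pickup_datetime"])

forecasts = {}
for nta, group in rides.groupby("neighborhood"):
    # Aggregate rides to hourly counts, the granularity used for forecasting.
    hourly = (group.set_index("pickup_datetime")
                   .resample("H").size()
                   .reset_index(name="y")
                   .rename(columns={"pickup_datetime": "ds"}))

    m = Prophet(daily_seasonality=True, weekly_seasonality=True)
    m.fit(hourly)

    future = m.make_future_dataframe(periods=24, freq="H")  # next 24 hours
    forecasts[nta] = m.predict(future)[["ds", "yhat"]]
```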

Heatmaps were used to visualise the geographic distribution of historic and forecast ride volumes.

Time, date and borough filters allow ride volume trends at specific times and dates to be compared across locations.

Identify market gaps and demand clusters to leverage volume.

Line charts displaying hourly results across segments provided granularity for close analysis. These can be used to identify bottlenecks and unusual events impacting demand, helping to close gaps and maximise efficiency.

The Prophet forecasting procedure fits a growth function together with daily, weekly and yearly seasonal components and holiday effects.

Forecast = Growth + Seasonality + Holidays + Error

The growth function can change its rate at points in time where shifts in the data outpace what its linear or logistic trend equations would otherwise capture; Prophet calls these moments 'changepoints'. This was considered important to enable the model to adjust to demand shifts resulting from pricing changes and the introduction of new services by Uber and/or its competitors.
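
Changepoint behaviour is configurable in Prophet; a brief illustration of those settings (the values shown are illustrative, not the ones used in the project):

```python
from prophet import Prophet

# More candidate changepoints and a larger prior scale let the trend
# react faster to shifts such as pricing or new-service changes.
m = Prophet(
    n_changepoints=50,            # candidate changepoints placed across the history
    changepoint_range=0.9,        # fraction of the history where they may occur
    changepoint_prior_scale=0.1,  # higher values allow a more flexible trend
)
```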

DBSCAN (density-based spatial clustering of applications with noise) and K-means were used to analyse rides grouped closely together in location and time.

K-means is a centroid-based clustering algorithm that segments observations into a target number of clusters (k). The silhouette coefficient (SC) indicates the separation distance between clusters: an SC close to 1 indicates clusters that are far from neighboring clusters and thus more distinct. K-means was applied to every neighborhood for every hour between 00:00 and 23:00, for target k values from 2 to 15, to find the best number of clusters for grouping the rides.
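
A minimal sketch of that search, using a synthetic stand-in for the pickup coordinates of a single neighborhood-hour:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for ride pickup coordinates (lat, lon) in one neighborhood-hour.
coords = np.random.default_rng(0).random((500, 2))

best_k, best_sc = None, -1.0
for k in range(2, 16):  # target k values from 2 to 15
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords)
    sc = silhouette_score(coords, labels)
    if sc > best_sc:
        best_k, best_sc = k, sc

print(f"best k = {best_k}, silhouette = {best_sc:.2f}")
```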

ARMA, ARIMA and SARIMAX models were tested during EDA, with autocorrelation and partial autocorrelation plots used to investigate changes in the series.
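
A brief sketch of that diagnostic step, assuming an hourly ride-count series (file and column names are illustrative, as are the model orders):

```python
import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical hourly ride counts for one borough.
series = pd.read_csv("hourly_rides.csv", index_col="ds", parse_dates=True)["y"]

plot_acf(series, lags=48)   # autocorrelation across two days of hourly lags
plot_pacf(series, lags=48)  # partial autocorrelation

# SARIMAX with a daily (24-hour) seasonal component.
model = SARIMAX(series, order=(1, 0, 1), seasonal_order=(1, 0, 1, 24)).fit(disp=False)
print(model.summary())
```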

The report could be used by a company similar to Uber to introduce new products and services. For example, a partnership in which airlines advertise discounted Uber airport and hotel transfers to inbound passengers when they book flights online could make those flights more attractive to consumers while giving Uber a reliable source of business volume: Uber would know where and when its drivers need to be to service those bookings and could factor that into its planning. It could also serve as a useful driver of customer acquisition and retention for both Uber and the partnering airlines. Finally, the report could be used to identify events that cause unexpected demand fluctuations.

Classifying online job listing industry, category and salary

A tool that identifies the features of data-related online job listings that characterize job industry, category, title and salary. Key Data Science tools include Logistic Regression, Random Forest classifiers, Gradient Boosting and Principal Component Analysis (PCA).

The purpose of this analysis was to develop a Python procedure to scrape data from a job listing website and apply classification and regression algorithms to determine the features of data scientist and data analyst related job listings that distinguish industry, category, title and salary. To do this, a scraper was built in Python with the Scrapy framework to collect over 20,000 job listings from Seek.com.
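
A stripped-down sketch of what such a Scrapy spider might look like; the start URL, CSS selectors and field names are hypothetical rather than Seek.com's actual page structure:

```python
import scrapy

class JobSpider(scrapy.Spider):
    name = "jobs"
    # Hypothetical listing URL; real scraping must respect the site's terms and robots.txt.
    start_urls = ["https://www.seek.com.au/data-analyst-jobs"]

    def parse(self, response):
        for job in response.css("article"):
            yield {
                "title": job.css("h3 a::text").get(),
                "company": job.css("span.company::text").get(),
                "description": " ".join(job.css("p::text").getall()),
            }
        next_page = response.css("a[rel='next']::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```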

Job description data was then cleansed, and stemmers and lemmatizers reduced words to their root forms. Listings for non-data-related roles were removed, leaving over 1,000 data analyst related listings for analysis.
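
A small illustration of that normalisation step using NLTK (one of several libraries that could be used for this):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

tokens = ["analysing", "analyses", "analysed"]
print([stemmer.stem(t) for t in tokens])          # crude suffix-stripped roots
print([lemmatizer.lemmatize(t) for t in tokens])  # dictionary-based root forms
```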

Over 25 models were trained to determine the words and phrases in job listings that distinguish the industry, category, salary and title of Data Scientist and Data Analyst related roles.
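
A minimal sketch of one such model, assuming cleaned descriptions and industry labels as inputs (the two example rows below are placeholders):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder inputs; in the project these were the ~1,000 cleaned listings.
descriptions = ["senior data analyst sql reporting dashboards",
                "data scientist machine learning python modelling"]
industries = ["finance", "technology"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(descriptions, industries)

# The largest coefficients point to the words/phrases that distinguish each class.
```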

PCA and Mean Decrease in Impurity were used to investigate the accuracy benefits that might be gained from dimensionality reduction.
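
A brief sketch of both checks, with synthetic data standing in for the real feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the TF-IDF feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = rng.integers(0, 2, size=200)

# Mean Decrease in Impurity: importances from a fitted random forest.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = np.argsort(forest.feature_importances_)[::-1][:10]
print("top features by MDI:", top)

# PCA: how many components are needed to retain 95% of the variance.
pca = PCA(n_components=0.95).fit(X)
print("components for 95% variance:", pca.n_components_)
```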

Model configurations and results were displayed in table form for efficient evaluation and decision making.

A model like this could be used by people seeking employment as Data Scientists to help distinguish the features and requirements of Data Scientist and other data-related roles by industry, category, title and salary.

Property investment profitability drivers

A report that leverages linear regression and regularization techniques to quantify the impact that changes in property features have on sale price and to predict sale price. The dataset covers property sales from 2007 to 2010, and each property has over 80 features which are analysed. Key Data Science tools include: Linear Regression and Lasso, Ridge and Elastic Net regularization techniques.

The purpose of this analysis was to identify the unfixed property features with the greatest influence on sale price, using the Ames property dataset. It involved two parts. The first involved training a model to predict property sale prices using only fixed variables (e.g. neighborhood); to do this, a model was trained on pre-2010 property sale prices to predict 2010 property sale prices. The second part involved determining the value of unfixed property features unexplained by the fixed ones; to do this, several models were trained using only unfixed property characteristics to predict the residuals from the first model (i.e. the variance in price unexplained by the fixed characteristics). Data Science tools applied: Linear Regression, Lasso, Ridge, Elastic Net regularization and Gradient Boosting.
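
A simplified sketch of the two-stage setup, using synthetic data and illustrative column names in place of the Ames variables:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the Ames data; the fixed/unfixed split is illustrative.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((500, 4)),
                  columns=["lot_area", "neighborhood_code",        # fixed
                           "kitchen_quality", "paint_condition"])  # unfixed
df["sale_price"] = rng.random(500) * 100_000

fixed = ["lot_area", "neighborhood_code"]
unfixed = ["kitchen_quality", "paint_condition"]

# Part 1: model sale price on fixed features only.
m1 = LinearRegression().fit(df[fixed], df["sale_price"])
residuals = df["sale_price"] - m1.predict(df[fixed])

# Part 2: model the unexplained variance (the residuals) on unfixed features.
m2 = LinearRegression().fit(df[unfixed], residuals)
print(dict(zip(unfixed, m2.coef_)))
```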

Lasso, Ridge and Elastic Net regression were used to estimate the sale price change resulting from a unit increase in fixed and unfixed property features. Among the 80 variables in the dataset, 50 fixed and unfixed features had a meaningful impact on price.
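
A compact illustration of fitting the three regularised models and comparing how many coefficients each retains (synthetic data stands in for the encoded property features):

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardised property feature matrix and sale prices.
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.random((500, 80)))
y = rng.random(500) * 100_000

for name, model in [("lasso", LassoCV(cv=5)),
                    ("ridge", RidgeCV()),
                    ("elastic net", ElasticNetCV(cv=5))]:
    model.fit(X, y)
    kept = int(np.sum(model.coef_ != 0))
    print(f"{name}: {kept} non-zero coefficients")
```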


Gradient Boosting models were developed to optimise accuracy by building an ensemble of weaker predictive models in a stage-wise fashion, using different loss functions.
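
A short sketch comparing stage-wise ensembles under different loss functions; the data is synthetic, the settings are illustrative, and the loss names follow recent scikit-learn versions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the property feature matrix and sale prices.
rng = np.random.default_rng(0)
X = rng.random((500, 80))
y = rng.random(500) * 100_000

# Each model adds weak learners stage by stage; the loss function changes
# how errors are penalised at each stage.
for loss in ("squared_error", "absolute_error", "huber"):
    gbr = GradientBoostingRegressor(loss=loss, n_estimators=300, random_state=0)
    score = cross_val_score(gbr, X, y, cv=5, scoring="r2").mean()
    print(f"{loss}: mean CV R^2 = {score:.3f}")
```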

Impurity-based feature importances and permutation methods were used to compare the impact of feature cardinality and to validate results.
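
A small demonstration of the cardinality effect those two methods were used to check, with one low-cardinality and one high-cardinality synthetic feature:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic stand-in: only the low-cardinality feature actually drives the target.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "low_card": rng.integers(0, 3, 1000),     # e.g. a quality rating
    "high_card": rng.integers(0, 500, 1000),  # e.g. a near-unique identifier
})
y = X["low_card"] * 10_000 + rng.normal(0, 1_000, 1000)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print("impurity importances:  ", dict(zip(X.columns, forest.feature_importances_.round(3))))

# Permutation importance is less biased toward high-cardinality features.
perm = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
print("permutation importances:", dict(zip(X.columns, perm.importances_mean.round(3))))
```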

Several methods were used to examine the data from different angles during EDA, including correlation matrices and line charts.
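
For example, a correlation matrix of the numeric features can be rendered as a heatmap (the file name below is illustrative):

```python
import pandas as pd
import seaborn as sns

# Hypothetical Ames dataframe; correlations between numeric features only.
ames = pd.read_csv("ames.csv")
corr = ames.select_dtypes("number").corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
```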

Data cleansing involved labelling, tagging and addressing null values, which was necessary to achieve the analysis objectives.

A report like this could be used by individuals or companies to help identify property investment opportunities.

Global retail brands

Exploratory data

About

I'm Hugo, an aspiring Data Scientist. I love solving problems; it's my passion. I also love technology and writing code. I'm here to help you wade through complex information and piece it together so you can make the best decisions possible.
No matter how difficult your business problem, I will find a way of organising the information so it all makes sense. Aside from that, I also love cooking.

Contact

If you would like to get in touch with me about any business opportunities, or if you would like some personal cooking advice, feel free to reach out via the email or LinkedIn links below.