Applying Regression to stock returns in R
Introduction.
When faced with the challenge to outperform the market (that is, to obtain abnormal returns consistently through time, not explained by the systematic risk taken), the paths that can be taken are plenty and very diverse in nature. If the market we are exploring is the equities market in particular, outperforming the market could be simplified as picking the companies that will generate the greater returns. I’m a firm believer that an exhaustive fundamental analysis of each of these companies would provide the investor with the knowledge to better allocate her resources, effectively obtaining this abnormal returns (also known in the industry as Alpha).
Fundamental analysis requires a great amount of expertise, as you must understand the industry, the economic enviroment of both the country and the company under analysis, and of course the idiosyncratic business model of every player in the sector. As you might have guessed, this is time consuming and can’t (yet) be automated, so assigning an analyst to focus on a particular company and thoroughly study an investment recommendation should be done efficiently, i.e., not analysing every single company but some that may promise achiving greater returns.
This work is a follow up of our previous Markowitz Efficient Frontier and some of our insights shown there will continue to be used here, so i would recomend you to read it quickly before moving forward.
The Data: Structure and manipulation.
Out data set comes from kaggle: New York Stock Exchange dataset again, but we are going to explore the different tables that are available, in particular, the dataset containing variables from financial statements and data related to the securities. Let us load these tables.
fundamentales <- read_delim("fundamentals.csv",delim = ",")
securities <- read_delim("securities.csv",delim = ",")
names(securities) <-c("ticker","security","SECFilings","GICS","industry","address","firstAdded","CIK")
str(securities)
When working with data like these datasets it is important to get familiar with the information contained and the structure of it. The following lines of code is an example of the different tests and summary tables that i was creating along the way of finding what kind of data we had and if any relevant data was missing (in the particular case of the fundamentals, the variable for Year was very important and had missings) and in case of finding missings, could they be filled with enough confidence as to maintain the integrity of the data (my analysis concluded that the answer was YES!). This last point is very important, as keeping a couple of extra registers can have relevant impacts when the datasets are small.
#Analizamos la informacion que disponemos para seleccionar empresas
str(fundamentales)
head(fundamentales,6)
nombres <- names(fundamentales)
nombres[c(1,2,3)] <- c("obs","ticker","period")
nombres
names(fundamentales)<-nombres
#Exploramos el dataset de EEFF y observamos que existen 448 tickers únicos
group_by(fundamentales,ticker) %>%
summarise(numero = n()) %>%
arrange(ticker)
#Las fechas únicas de los EEFF son 162
group_by(fundamentales,period) %>%
summarise(numero = n()) %>%
arrange(desc(period))
#Pero la variable que nos permite identificar el año de los EEFF es For Year
group_by(fundamentales,`For Year`) %>%
summarise(numero = n()) %>%
arrange(desc(`For Year`))
fundamentales <- mutate(fundamentales,forYear = `For Year`)
dup <- group_by(fundamentales,ticker,forYear) %>%
summarise(numero = n()) %>%
arrange(desc(ticker,forYear)) %>%
ungroup() %>%
filter(!is.na(forYear)) %>%
filter(numero >= max(numero))
#lo que hemos detectado es que son muy pocas las empresas que no tienen info financiera
#y son aún menos las que tienen años repetidos (en realidad son 6)
veamos_dup <- filter(fundamentales,ticker %in% c((dup$ticker))) %>%
arrange(ticker,forYear)
as.data.frame(veamos_dup[,c("ticker","period","forYear")])
#corregimos el error detectado
fundamentales[fundamentales$ticker %in% c((dup$ticker)),]$forYear <- year(fundamentales[fundamentales$ticker %in% c((dup$ticker)),]$period)
fundamentales[fundamentales$ticker %in% c((dup$ticker)),c("ticker","period","forYear")]
fundamentales <- mutate(fundamentales,year=year(period),month=month(period))
missing <- group_by(fundamentales,ticker,forYear) %>%
summarise(numero = n()) %>%
arrange(desc(ticker,forYear)) %>%
ungroup() %>%
filter(is.na(forYear))
veamos_missing <- filter(fundamentales,ticker %in% c((missing$ticker))) %>%
arrange(ticker,forYear)
as_tibble(as.data.frame(list(veamos_missing[,c("ticker","period","forYear")],year=year(veamos_missing$period)))) %>%
group_by(ticker,year) %>%
summarise(n=n()) %>%
ungroup() %>%
filter(n >= 2)
corregir <- veamos_missing[!(veamos_missing$ticker %in% c("CERN","HBI","SNA","SWK")),]$ticker
corregir <- unique(corregir)
#corregimos el error detectado
fundamentales[fundamentales$ticker %in% c((corregir)),]$forYear <- year(fundamentales[fundamentales$ticker %in% c((corregir)),]$period)
missing <- group_by(fundamentales,ticker,forYear) %>%
summarise(numero = n()) %>%
arrange(desc(ticker,forYear)) %>%
ungroup() %>%
filter(is.na(forYear))
veamos_missing <- filter(fundamentales,ticker %in% c((missing$ticker))) %>%
arrange(ticker,forYear)
veamos_missing[,c("ticker","period","forYear")]
#corregimos el error detectado
fundamentales[(fundamentales$ticker %in% c("CERN","HBI","SWK") & is.na(fundamentales$forYear)),]$forYear <-2016
fundamentales[fundamentales$ticker %in% c("SNA"),]$forYear <- c(2013,2014,2015,2016)
sum(is.na(fundamentales$forYear))
#Comenzamos el análisis
fund_2015 <- group_by(fundamentales,ticker,forYear)
names(fund_2015)
#2,3,10,11,13,14,15,19,23,25,26,31,32,33,34,38,39,40,41,42,47,49,59,59,61,62,64,67,69,70,71,72,73,74,75,77,78,79,80,81
fund_2015 <- fund_2015[,(c(2,3,10,11,13,14,15,19,23,25,26,31,32,33,34,38,39,40,41,42,47,49,59,61,62,64,67,69,70,71,72,73,74,75,77,78,79,80,81))]
names(fund_2015)
comp <- cbind(names(fund_2015),c("ticker","period","CashRatio","Cash","CommonStocks","CostOfRevenue","CurrentRatio","EBIT","FixedAssets","GrossMargin","GrossProfit",
"Investments","Liabilities","LongTermDebt","LongTermInvestments","NetCashFlow","NCFOperations", "NCFFinancing","NCFInvesting","NetIncome","NonRItems","OperatingMargin","PreTaxMargin","ProfitMargin",
"QuickRatio","RetainedEarnings","STBtoCPLTB","TotalAssets","CurrentAssets","CurrentLiabilities","Equity","TotalLiabilities","LiabPlusEquity","Revenue","oldFY","EPS","SharesOutstanding","forYear","year"))
comp
names(fund_2015) <- c("ticker","period","CashRatio","Cash","CommonStocks","CostOfRevenue","CurrentRatio","EBIT","FixedAssets","GrossMargin","GrossProfit",
"Investments","Liabilities","LongTermDebt","LongTermInvestments","NetCashFlow","NCFOperations", "NCFFinancing","NCFInvesting","NetIncome","NonRItems","OperatingMargin","PreTaxMargin","ProfitMargin",
"QuickRatio","RetainedEarnings","STBtoCPLTB","TotalAssets","CurrentAssets","CurrentLiabilities","Equity","TotalLiabilities","LiabPlusEquity","Revenue","oldFY","EPS","SharesOutstanding","forYear","year")
names(fund_2015)
info <- inner_join(fund_2015,securities,by="ticker")
print(info)
str(info)
info <- mutate(info,GICS = factor(GICS),industry = factor(industry),SECFilings = factor(SECFilings),CIK=factor(CIK))
str(info)
Another important step in our data preparation was the renaming of relevant variables. If you plan to continue the analysis of this datasets you can obtain very different results just by adding variables that i decided to remove. The reason I’m keeping these subset only is because my plan is to build ratios rather than input the raw data, as it is common in the investment industry to do so to relativize the size of the financial statement accounts this way.
The next lines of code are related to our prices dataset. We need to estimate returns from these prices on each ticker and different periods. The rationale that we are going to apply to do this is the following:
- Our financial statements are for periods ending December year 20XX
- If we would use this information to decide on what companies to invest, we would look at the following year’s return, thus:
- We will estimate our returns for periods of 1 year, starting January 1st and ending December 31st for every year at out disposal, as well as their respective risk (standard deviation).
- When including these variables into our dataset, we will also add our “leadReturn” which corresponds to the performance of a specific company on the year following the financial statements, i.e, our Independent Variable.
precios <- read_delim("prices-split-adjusted.csv",delim = ",")
## Parsed with column specification:
## cols(
## date = col_date(format = ""),
## symbol = col_character(),
## open = col_double(),
## close = col_double(),
## low = col_double(),
## high = col_double(),
## volume = col_double()
## )
str(precios)
head(precios)
names(precios)[2] <- "ticker"
precios <- arrange(precios,ticker,date)
precio_ant <- c(0,precios$close[1:(dim(precios)[1]-1)])
ticker_ant <- c("0",precios$ticker[1:(dim(precios)[1]-1)])
ticker_sig <- c(precios$ticker[2:(dim(precios)[1])],"0")
nuevos_datos <- as_tibble(list(precio_ant=precio_ant,ticker_ant=ticker_ant,ticker = precios$ticker, ticker_sig = ticker_sig))
nuevos_datos<- nuevos_datos %>%
mutate(marca = ifelse((ticker_ant==ticker & ticker == ticker_sig),0,ifelse(ticker_ant==ticker,1,-1))) %>%
select(precio_ant,marca)
precios_n <- cbind(precios,nuevos_datos)
precios_n <- mutate(precios_n,return = ifelse(marca >=0,log(close/precio_ant),0),year = year(date),
month=month(date),day=day(date))
index<-100
for (i in 1:(dim(precios_n)[1])){
if (precios_n$marca[i]==-1){
index[i] <- 100
}
else{index[i]<- index[i-1]*(1+precios_n$return[i])}
}
precios_n <- cbind(precios_n,index)
head(precios_n)
precios_y <- group_by(precios_n,year,ticker) %>%
mutate(inicio = first(index),final = last(index)) %>%
summarise(stddev = sd(return)*sqrt(252),inicio = min(inicio), final = max(final)) %>%
mutate(return = log(final/inicio)) %>%
ungroup()
lead_return <- dplyr::select(precios_y,year,ticker,return) %>%
dplyr::rename(forYear=year, leadReturn=return) %>%
mutate(yearLeadReturn=forYear,forYear = forYear - 1)
dplyr::rename(precios_y,yearReturn = return,yearStdDev = stddev)
info <- mutate(info,year_eeff =forYear)
base <- inner_join(info,precios_y,by=(c("ticker"="ticker","forYear"="year")))
base <- inner_join(base,lead_return,by=(c("ticker"="ticker","forYear"="forYear")))
print(base)
table(base$year_eeff)
The model: Generalized Linear Regression with a classification approach.
After all the hard work of processing the data, at last we have arrived at a stage of finding a good model for our data. This next section will focus on the design, analysis and visualization of the variables that we want to introduce in the model.
Specifically, everytime I say model I have something like this in mind:
\(R_{j,year_{t+1}} = \hat{\beta_0} + \sum_i^n\hat{\beta_i}X_{i,j}\)
Where each \(R_{j,year_{t+1}}\) corresponds to the return next year for company \(j\) and \(X_{i,j}\) represents a feature of company \(j\) in year \(t\), or previous year.
Exploring the features
There are many differents graphs and tests that should be made to explore the features to include in your model, I’m going to give a few examples here but this part of the modeling is more an art than science: it takes time and inspiration.
Our first step is splitting the sample in a training sample, from which we will obtain the parameters of our model through estimation; and a testing sample, on which we will test our model’s performance. It is worth noting that, statistically speaking, our features are not independent (because we are sampling from the same companies over different years). This will be addressed on further submissions, but bear with me on this as the purpose of this exercise is to show steps to create a model.
Aditionally to sampling, we’ll construct some features as ratios coming from the financial statements and we will winsorize them, which is explained [here] (https://en.wikipedia.org/wiki/Winsorizing).
training <- base[base$year_eeff %in% c(2012,2013),]
testing <- base[!(base$year_eeff %in% c(2012,2013)),]
#Vamos a ir iterando nuestro algoritmo sobre distintas ventanas de tiempo
#Para ver qué sucede con el modelo
training <- base[base$year_eeff %in% c(2012,2013),]
testing <- base[!(base$year_eeff %in% c(2012,2013)),]
colSums(is.na(training))/dim(training)[1]*100
## ticker period CashRatio
## 0.000000 0.000000 18.639053
## Cash CommonStocks CostOfRevenue
## 0.000000 0.000000 0.000000
## CurrentRatio EBIT FixedAssets
## 18.639053 0.000000 0.000000
## GrossMargin GrossProfit Investments
## 0.000000 0.000000 0.000000
## Liabilities LongTermDebt LongTermInvestments
## 0.000000 0.000000 0.000000
## NetCashFlow NCFOperations NCFFinancing
## 0.000000 0.000000 0.000000
## NCFInvesting NetIncome NonRItems
## 0.000000 0.000000 0.000000
## OperatingMargin PreTaxMargin ProfitMargin
## 0.000000 0.000000 0.000000
## QuickRatio RetainedEarnings STBtoCPLTB
## 18.639053 0.000000 0.000000
## TotalAssets CurrentAssets CurrentLiabilities
## 0.000000 0.000000 0.000000
## Equity TotalLiabilities LiabPlusEquity
## 0.000000 0.000000 0.000000
## Revenue oldFY EPS
## 0.000000 2.958580 4.881657
## SharesOutstanding forYear year
## 4.881657 0.000000 0.000000
## security SECFilings GICS
## 0.000000 0.000000 0.000000
## industry address firstAdded
## 0.000000 0.000000 39.792899
## CIK year_eeff stddev
## 0.000000 0.000000 0.000000
## inicio final return
## 0.000000 0.000000 0.000000
## leadReturn yearLeadReturn
## 0.000000 0.000000
#Tratemos de construir algunas variables que nos hagan sentido para explicar los retornos futuros, en particular,
#Buscaremos
training <- mutate(training,leverage= TotalAssets/TotalLiabilities,debtToEquity = TotalLiabilities/Equity,
EBITtoAssets = EBIT/TotalAssets, EBITtoEquity = EBIT/Equity, NCFtoEquity = NetCashFlow/Equity,
InvestToRetainedEarnings = Investments/RetainedEarnings)
names(training)
## [1] "ticker" "period"
## [3] "CashRatio" "Cash"
## [5] "CommonStocks" "CostOfRevenue"
## [7] "CurrentRatio" "EBIT"
## [9] "FixedAssets" "GrossMargin"
## [11] "GrossProfit" "Investments"
## [13] "Liabilities" "LongTermDebt"
## [15] "LongTermInvestments" "NetCashFlow"
## [17] "NCFOperations" "NCFFinancing"
## [19] "NCFInvesting" "NetIncome"
## [21] "NonRItems" "OperatingMargin"
## [23] "PreTaxMargin" "ProfitMargin"
## [25] "QuickRatio" "RetainedEarnings"
## [27] "STBtoCPLTB" "TotalAssets"
## [29] "CurrentAssets" "CurrentLiabilities"
## [31] "Equity" "TotalLiabilities"
## [33] "LiabPlusEquity" "Revenue"
## [35] "oldFY" "EPS"
## [37] "SharesOutstanding" "forYear"
## [39] "year" "security"
## [41] "SECFilings" "GICS"
## [43] "industry" "address"
## [45] "firstAdded" "CIK"
## [47] "year_eeff" "stddev"
## [49] "inicio" "final"
## [51] "return" "leadReturn"
## [53] "yearLeadReturn" "leverage"
## [55] "debtToEquity" "EBITtoAssets"
## [57] "EBITtoEquity" "NCFtoEquity"
## [59] "InvestToRetainedEarnings"
#Windsorize at 5%
training <- mutate(training,
leverage= max(min(quantile(training$leverage,.95),leverage),quantile(training$leverage,.05)),
debtToEquity= max(min(quantile(training$debtToEquity,.95),debtToEquity),quantile(training$debtToEquity,.05)),
EBITtoAssets= max(min(quantile(training$EBITtoAssets,.95),EBITtoAssets),quantile(training$EBITtoAssets,.05)),
EBITtoEquity= max(min(quantile(training$EBITtoEquity,.95),EBITtoEquity),quantile(training$EBITtoEquity,.05)),
NCFtoEquity= max(min(quantile(training$NCFtoEquity,.95),NCFtoEquity),quantile(training$NCFtoEquity,.05)),
InvestToRetainedEarnings= max(min(quantile(training$InvestToRetainedEarnings,.95),InvestToRetainedEarnings),quantile(training$InvestToRetainedEarnings,.05))
)
Different exploration tools are available in R to visualize data. What we are looking for are features that have any relationship with our independent variable and also have familiar shapes. To observe relationships between variables we can always use a featurePlot.
featurePlot(x=training[,c("leverage","debtToEquity","EBITtoAssets","EBITtoEquity")],y=training$leadReturn,plot="pairs")
Another example is a scatterplot with fitted linear models in it. In this case I’m splitting the training sample per year to observe if the relationship is the same both periods.
qq <- qplot(leverage,leadReturn,color = factor(forYear),data=training)
qq + geom_smooth(method = 'lm', formula = y~x)
This has to be done for differents features, I will spare you the code but not the plots!
Next, we are going to explore our features standarized, as we will input them in the model this way:
escalas <- preProcess(as.data.frame(training[,c(10,22,23,24,54:59)]),method=c("center","scale"))
training_esc <- predict(escalas,as.data.frame(training[,c(10,22,23,24,54:59)]))
training_esc <- as_tibble(training_esc)
Another useful R function to explore data is Desc:
DescTools::Desc(training_esc)
## -------------------------------------------------------------------------
## Describe training_esc (tbl_df, tbl, data.frame):
##
## data.frame: 676 obs. of 10 variables
##
## Nr ColName Class NAs Levels
## 1 GrossMargin numeric .
## 2 OperatingMargin numeric .
## 3 PreTaxMargin numeric .
## 4 ProfitMargin numeric .
## 5 leverage numeric .
## 6 debtToEquity numeric .
## 7 EBITtoAssets numeric .
## 8 EBITtoEquity numeric .
## 9 NCFtoEquity numeric .
## 10 InvestToRetainedEarnings numeric .
##
##
## -------------------------------------------------------------------------
## 1 - GrossMargin (numeric)
##
## length n NAs unique
## 676 676 0 95
## 100.0% 0.0%
##
## .05 .10 .25 median
## -1.6303112 -1.2441339 -0.6302112 -0.1549162
##
## range sd vcoef mad
## 3.9607919 1.0000000 -5'285'474'562'677'872.0000000 0.9395632
##
## 0s mean meanCI
## 0 -0.0000000 -0.0755186
## 0.0% 0.0755186
##
## .75 .90 .95
## 0.6768501 1.3897927 2.1027352
##
## IQR skew kurt
## 1.3070613 0.2979358 -0.4663468
##
## lowest : -1.8580567 (28), -1.7392329, -1.6996250 (3), -1.6600171 (2), -1.6204092
## highest: 1.8254798 (2), 1.9046956 (2), 1.9443035, 2.0235194, 2.1027352 (38)
## -------------------------------------------------------------------------
## 2 - OperatingMargin (numeric)
##
## length n NAs unique
## 676 676 0 57
## 100.0% 0.0%
##
## .05 .10 .25 median
## -1.3663751 -1.1121774 -0.6885146 -0.1801193
##
## range sd vcoef mad
## 7.3717319 1.0000000 4'786'844'886'953'547.0000000 0.8793714
##
## 0s mean meanCI
## 0 0.0000000 -0.0755186
## 0.0% 0.0755186
##
## .75 .90 .95
## 0.4977411 1.1756015 1.7687293
##
## IQR skew kurt
## 1.1862557 1.3482455 3.6692752
##
## lowest : -1.4511076 (31), -1.3663751 (8), -1.2816425 (12), -1.1969100 (13), -1.1121774 (9)
## highest: 3.5481129, 3.717578 (2), 4.480171, 5.3274965, 5.9206243
## -------------------------------------------------------------------------
## 3 - PreTaxMargin (numeric)
##
## length n NAs unique
## 676 676 0 56
## 100.0% 0.0%
##
## .05 .10 .25 median
## -1.2133039 -1.0406134 -0.6952324 -0.1771610
##
## range sd vcoef mad
## 9.0662499 1.0000000 17'700'193'884'316'598.0000000 0.8961081
##
## 0s mean meanCI
## 0 0.0000000 -0.0755186
## 0.0% 0.0755186
##
## .75 .90 .95
## 0.5136009 1.2043628 1.8087794
##
## IQR skew kurt
## 1.2088333 1.8759263 7.6466022
##
## lowest : -1.3859943 (4), -1.2996491 (17), -1.2133039 (19), -1.1269586 (14), -1.0406134 (22)
## highest: 3.5356842 (2), 3.8810651, 3.9674104, 6.2123865, 7.6802556
## -------------------------------------------------------------------------
## 4 - ProfitMargin (numeric)
##
## length n NAs unique
## 676 676 0 50
## 100.0% 0.0%
##
## .05 .10 .25 median
## -1.0507575 -0.9518283 -0.6550405 -0.2593234
##
## range sd vcoef mad
## 7.8154114 1.0000000 -23'063'889'000'776'176.0000000 0.7333626
##
## 0s mean meanCI
## 0 -0.0000000 -0.0755186
## 0.0% 0.0755186
##
## .75 .90 .95
## 0.4331814 1.1256862 1.7192617
##
## IQR skew kurt
## 1.0882218 2.3074586 8.9616830
##
## lowest : -1.2486160 (8), -1.1496868 (18), -1.0507575 (28), -0.9518283 (21), -0.8528990 (32)
## highest: 4.2914225 (2), 4.6871395, 6.2700076, 6.3689369, 6.5667954
## -------------------------------------------------------------------------
## 5 - leverage (numeric)
##
## length n NAs unique
## 676 676 0 608
## 100.0% 0.0%
##
## .05 .10 .25 median
## -1.1053582 -1.0271489 -0.7327368 -0.2544312
##
## range sd vcoef mad
## 3.7876534 1.0000000 -9'513'854'212'820'172.0000000 0.8103055
##
## 0s mean meanCI
## 0 -0.0000000 -0.0755186
## 0.0% 0.0755186
##
## .75 .90 .95
## 0.3835375 1.5029613 2.6721803
##
## IQR skew kurt
## 1.1162743 1.2644435 1.0027729
##
## lowest : -1.1063814 (34), -1.1050172, -1.1022435, -1.1019083, -1.1015889
## highest: 2.6065516, 2.6074547, 2.6287892, 2.6691498, 2.681272 (34)
## -------------------------------------------------------------------------
## 6 - debtToEquity (numeric)
##
## length n NAs unique
## 676 676 0 608
## 100.0% 0.0%
##
## .05 .10 .25 median
## -0.9056427 -0.8271874 -0.6470017 -0.3883097
##
## range sd vcoef mad
## 3.6328807 1.0000000 -3'397'805'076'007'205.0000000 0.5011195
##
## 0s mean meanCI
## 0 -0.0000000 -0.0755186
## 0.0% 0.0755186
##
## .75 .90 .95
## 0.1846233 1.7729387 2.7172606
##
## IQR skew kurt
## 0.8316250 1.5904281 1.4941713
##
## lowest : -0.9057203 (34), -0.9056168, -0.9049976, -0.9028295, -0.9007881
## highest: 2.6438227, 2.672031, 2.7099758, 2.7139607, 2.7271604 (34)
## -------------------------------------------------------------------------
## 7 - EBITtoAssets (numeric)
##
## length n NAs unique
## 676 676 0 607
## 100.0% 0.0%
##
## .05 .10 .25 median
## -1.2970600 -1.1764572 -0.7756573 -0.2563173
##
## range sd vcoef mad
## 3.4916684 1.0000000 -1'534'492'614'970'995.7500000 0.9386732
##
## 0s mean meanCI
## 0 -0.0000000 -0.0755186
## 0.0% 0.0755186
##
## .75 .90 .95
## 0.6291145 1.6182670 2.1945937
##
## IQR skew kurt
## 1.4047717 0.7016487 -0.4668732
##
## lowest : -1.2970747 (34), -1.2970550, -1.2969816, -1.2961346 (2), -1.2911819
## highest: 2.1046146, 2.1262922, 2.1642992, 2.1705146, 2.1945937 (35)
## -------------------------------------------------------------------------
## 8 - EBITtoEquity (numeric)
##
## length n NAs unique
## 676 676 0 608
## 100.0% 0.0%
##
## .05 .10 .25 median
## -1.2848187 -1.0420162 -0.6928489 -0.2403940
##
## range sd vcoef mad
## 4.0795072 1.0000000 -2'973'079'441'506'304.5000000 0.7503793
##
## 0s mean meanCI
## 0 -0.0000000 -0.0755186
## 0.0% 0.0755186
##
## .75 .90 .95
## 0.3870715 1.4219286 2.7883251
##
## IQR skew kurt
## 1.0799204 1.2851701 1.3096574
##
## lowest : -1.2857167 (34), -1.2845193, -1.2777760, -1.2744496, -1.2623503
## highest: 2.4217691, 2.5060169, 2.6798201, 2.7865033, 2.7937905 (34)
## -------------------------------------------------------------------------
## 9 - NCFtoEquity (numeric)
##
## length n NAs unique
## 676 676 0 607
## 100.0% 0.0%
##
## .05 .10 .25 median
## -1.8637925 -1.2805628 -0.4517470 -0.1372020
##
## range sd vcoef mad
## 4.3422467 1.0000000 9'513'854'212'820'172.0000000 0.6205196
##
## 0s mean meanCI
## 0 0.0000000 -0.0755186
## 0.0% 0.0755186
##
## .75 .90 .95
## 0.3995718 1.3369580 2.4692807
##
## IQR skew kurt
## 0.8513188 0.5882077 0.6839576
##
## lowest : -1.8688189 (34), -1.8621170, -1.8512264, -1.7916917, -1.7459722
## highest: 2.3658195, 2.410999, 2.4434364, 2.4678983, 2.4734278 (34)
## -------------------------------------------------------------------------
## 10 - InvestToRetainedEarnings (numeric)
##
## length n NAs unique
## 676 676 0 461
## 100.0% 0.0%
##
## .05 .10 .25 median
## -3.3171136 -0.9874606 0.1141631 0.2866717
##
## range sd vcoef mad
## 4.7796193 1.0000000 19'027'708'425'640'356.0000000 0.0992086
##
## 0s mean meanCI
## 0 0.0000000 -0.0755186
## 0.0% 0.0755186
##
## .75 .90 .95
## 0.2955877 0.6192831 1.4181126
##
## IQR skew kurt
## 0.1814246 -2.1494602 4.7339715
##
## lowest : -3.3303971 (34), -3.3126858, -3.2513266, -3.1765685, -2.9812440
## highest: 1.230836, 1.2343786, 1.3865519, 1.4077427, 1.4492223 (34)
Clearly, Investment to Earnings has issues on the distribution, further testing shows that it can’t be incorporated in the model (very little variability). These and other conclusions can be made from these plots to understand what we are doing in the modeling stage.
When analyzing a continous response variable, sometimes it is also useful to define groups of it to study different distributions of the features for levels of the dependent variable.
class_return <- cut2(training$leadReturn,g=3)
training <- cbind(training,class_return=class_return)
featurePlot(x = training[, c(10,22,23,24)],
y = training$class_return,
plot = "density",
## Pass in options to xyplot() to
## make it prettier
scales = list(x = list(relation="free"),
y = list(relation="free")),
adjust = 1.5,
pch = "|",
layout = c(4, 1),
auto.key = list(columns = 3))
featurePlot(x = training[, c(54:57)],
y = training$class_return,
plot = "box",
## Pass in options to xyplot() to
## make it prettier
scales = list(x = list(relation="free"),
y = list(relation="free")),
adjust = 1.5,
pch = "|",
layout = c(4, 1),
auto.key = list(columns = 3))
featurePlot(x = training[, c(58:59)],
y = training$class_return,
plot = "box",
## Pass in options to xyplot() to
## make it prettier
scales = list(x = list(relation="free"),
y = list(relation="free")),
adjust = 1.5,
pch = "|",
layout = c(2, 1),
auto.key = list(columns = 3))
Calibrating the model
If you’re still reading this, at last it’s time to model!
We are modeling a linear relationship to the data, the features included in the model come from our exploratory analysis. Our first attempt will only include financial data and GICS classification. Let’s see how it comes out:
modFit_glm <- train(leadReturn~GrossMargin+OperatingMargin+PreTaxMargin+ProfitMargin+
leverage+ debtToEquity+EBITtoAssets+EBITtoEquity+NCFtoEquity+GICS,method="glm",preProcess=c("center","scale"),data=training)
summary(modFit_glm)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.89292 -0.11895 0.00192 0.11747 0.93630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1456050 0.0076077 19.139 < 2e-16
## GrossMargin 0.0226350 0.0107456 2.106 0.035545
## OperatingMargin -0.0407913 0.0168545 -2.420 0.015783
## PreTaxMargin 0.0020055 0.0182208 0.110 0.912392
## ProfitMargin 0.0083558 0.0144956 0.576 0.564519
## leverage -0.0184628 0.0110557 -1.670 0.095399
## debtToEquity 0.0083975 0.0145760 0.576 0.564731
## EBITtoAssets 0.0070850 0.0152655 0.464 0.642716
## EBITtoEquity -0.0036929 0.0145612 -0.254 0.799873
## NCFtoEquity 0.0039894 0.0077732 0.513 0.607963
## `GICSConsumer Staples` -0.0018329 0.0087399 -0.210 0.833954
## GICSEnergy -0.0348353 0.0094653 -3.680 0.000252
## GICSFinancials -0.0059113 0.0131853 -0.448 0.654067
## `GICSHealth Care` 0.0199842 0.0095221 2.099 0.036224
## GICSIndustrials -0.0019687 0.0094827 -0.208 0.835597
## `GICSInformation Technology` 0.0082773 0.0098383 0.841 0.400471
## GICSMaterials -0.0229906 0.0088043 -2.611 0.009227
## `GICSReal Estate` -0.0062013 0.0112853 -0.550 0.582848
## `GICSTelecommunications Services` -0.0122138 0.0081564 -1.497 0.134758
## GICSUtilities 0.0003061 0.0095970 0.032 0.974566
##
## (Intercept) ***
## GrossMargin *
## OperatingMargin *
## PreTaxMargin
## ProfitMargin
## leverage .
## debtToEquity
## EBITtoAssets
## EBITtoEquity
## NCFtoEquity
## `GICSConsumer Staples`
## GICSEnergy ***
## GICSFinancials
## `GICSHealth Care` *
## GICSIndustrials
## `GICSInformation Technology`
## GICSMaterials **
## `GICSReal Estate`
## `GICSTelecommunications Services`
## GICSUtilities
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.03912492)
##
## Null deviance: 28.289 on 675 degrees of freedom
## Residual deviance: 25.666 on 656 degrees of freedom
## AIC: -250.81
##
## Number of Fisher Scoring iterations: 2
Not bad in any way, as some variables are statistically significant. The sign of the coefficients is open to discussion and should be taken to experts to understand why some are counter-intuitive (we prefer companies with lower Operating Margins, for example) but it isn’t all lost, as sometimes asset managers invest in companies that aren’t performing well in the short-term to profit big long-term.
Let us think for a minute what kind of model we are trying to obtain: as I said before, we aren’t really interested in fitting our data to predict future returns explicitly, but rather use this predicted return as a ranking, relative to the rest of the stocks considered in our analysis, to filter the companies that our best analysts will focus on.
Either way, our next steps always should be to test if the model is fitting the data correctly. To do this, I will attach some code but i won’t dig deeper into it, it’s left to the reader to understand (you can ask me, of course).
#Revisamos el ajuste del modelo, primero analizamos los puntos de high leverage
lev <- hat(model.matrix(modFit_glm$finalModel))
#se detectan un par, pero no muchos
plot(lev)
#A continuación analizamos los residuos estandarizados
res_student <- rstudent(modFit_glm$finalModel)
cook = cooks.distance(modFit_glm$finalModel)
plot(cook,ylab="Cooks distances")
points(which(lev>.2),cook[which(lev>.2)],col='red')
par(mfrow=c(1,3))
plot(training$GrossMargin, res_student, xlab="Gross Margin", ylab="Studentized Residuals")
plot(training$OperatingMargin, res_student,xlab="Operating Margin", ylab="Studentized Residuals")
plot(modFit_glm$finalModel$fitted.values, res_student,xlab="Valores Ajustados", ylab="Studentized Residuals")
par(mfrow=c(1,2))
qqnorm(res_student)
qqline(res_student)
hist(res_student, freq=FALSE,
main="Distribution of Residuals")
xfit<-seq(min(res_student),max(res_student),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)
spreadLevelPlot(modFit_glm$finalModel)
Being said that, what we really want is use this model to predict a subset of companies that outperform. Let us explore the effectiveness of this model to find companies that, on average, outperform the others.
#Ahora probemos cómo funciona en el futuro:
testing <- mutate(testing,leverage= TotalAssets/TotalLiabilities,debtToEquity = TotalLiabilities/Equity,
EBITtoAssets = EBIT/TotalAssets, EBITtoEquity = EBIT/Equity, NCFtoEquity = NetCashFlow/Equity,
InvestToRetainedEarnings = Investments/RetainedEarnings)
names(testing)
## [1] "ticker" "period"
## [3] "CashRatio" "Cash"
## [5] "CommonStocks" "CostOfRevenue"
## [7] "CurrentRatio" "EBIT"
## [9] "FixedAssets" "GrossMargin"
## [11] "GrossProfit" "Investments"
## [13] "Liabilities" "LongTermDebt"
## [15] "LongTermInvestments" "NetCashFlow"
## [17] "NCFOperations" "NCFFinancing"
## [19] "NCFInvesting" "NetIncome"
## [21] "NonRItems" "OperatingMargin"
## [23] "PreTaxMargin" "ProfitMargin"
## [25] "QuickRatio" "RetainedEarnings"
## [27] "STBtoCPLTB" "TotalAssets"
## [29] "CurrentAssets" "CurrentLiabilities"
## [31] "Equity" "TotalLiabilities"
## [33] "LiabPlusEquity" "Revenue"
## [35] "oldFY" "EPS"
## [37] "SharesOutstanding" "forYear"
## [39] "year" "security"
## [41] "SECFilings" "GICS"
## [43] "industry" "address"
## [45] "firstAdded" "CIK"
## [47] "year_eeff" "stddev"
## [49] "inicio" "final"
## [51] "return" "leadReturn"
## [53] "yearLeadReturn" "leverage"
## [55] "debtToEquity" "EBITtoAssets"
## [57] "EBITtoEquity" "NCFtoEquity"
## [59] "InvestToRetainedEarnings"
dim(testing)
## [1] 877 59
length(predict(modFit_glm,testing))
## [1] 877
testing[is.na(testing$InvestToRetainedEarnings),]$InvestToRetainedEarnings <- c(0,0)
port_predict <- data.frame(list(testing,return_pred = predict(modFit_glm,testing)))
port_predict <- as.tbl(port_predict)
port_predict <- group_by(port_predict,forYear) %>%
mutate(ranking = rank(return_pred),class = ifelse(ranking <=170,0,ifelse(ranking<=340,1,2))) %>%
ungroup()
port_predict$class <- factor(port_predict$class,labels=c("LP","MP","HP"))
#No funciona muy bien, ya que en el 2014 sí nos consigue el mejor portafolio pero no en el 2015.
port_predict %>% group_by(forYear,class) %>%
summarise(return = mean(leadReturn),minimo=min(leadReturn),maximo=max(leadReturn), std=sd(leadReturn))
## # A tibble: 6 x 6
## # Groups: forYear [?]
## forYear class return minimo maximo std
## <dbl> <fctr> <dbl> <dbl> <dbl> <dbl>
## 1 2014 LP -0.159823424 -1.7142051 0.4078223 0.3433383
## 2 2014 MP -0.058765544 -1.0253847 0.6162113 0.2376850
## 3 2014 HP -0.009692326 -0.9032718 0.7316621 0.2357729
## 4 2015 LP 0.076845838 -0.8448250 0.7135035 0.2015338
## 5 2015 MP 0.051966839 -1.5537908 1.0982486 0.2273443
## 6 2015 HP 0.053958743 -0.6042448 0.4683819 0.1997241
That last table summarises our findings: We have ordered the stocks using financial data for years 2014 and 2015, and we have estimated the returns of these stocks for 2015 and 2016 respectively. What the table shows is that, for 2014, we are able to predict the Highest performing subset of the three, but in 2015 we do quite poorly.
After a lot of hard work (not shown here) we can conclude that we are missing a very important feature to our model: the past return of the stock. This past return isn’t strong to predict future returns (Efficient Market Hypothesis guys) but it can help discriminate among stocks.
modFit_glm2 <- train(leadReturn~return+GrossMargin+OperatingMargin+PreTaxMargin+ProfitMargin+
leverage+ debtToEquity+EBITtoAssets+EBITtoEquity+NCFtoEquity
+GICS,method="glm",preProcess=c("center","scale"),data=training)
summary(modFit_glm2)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8911 -0.1168 0.0059 0.1158 0.7850
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1456050 0.0074783 19.470 < 2e-16
## return -0.0393134 0.0080417 -4.889 1.28e-06
## GrossMargin 0.0247096 0.0105714 2.337 0.01972
## OperatingMargin -0.0447707 0.0165878 -2.699 0.00713
## PreTaxMargin 0.0027164 0.0179114 0.152 0.87950
## ProfitMargin 0.0063055 0.0142552 0.442 0.65840
## leverage -0.0234910 0.0109162 -2.152 0.03177
## debtToEquity 0.0064875 0.0143334 0.453 0.65098
## EBITtoAssets 0.0063796 0.0150066 0.425 0.67089
## EBITtoEquity -0.0015149 0.0143204 -0.106 0.91579
## NCFtoEquity 0.0050150 0.0076439 0.656 0.51200
## `GICSConsumer Staples` -0.0049449 0.0086148 -0.574 0.56617
## GICSEnergy -0.0387982 0.0093396 -4.154 3.70e-05
## GICSFinancials -0.0039668 0.0129671 -0.306 0.75977
## `GICSHealth Care` 0.0209543 0.0093623 2.238 0.02555
## GICSIndustrials -0.0007739 0.0093246 -0.083 0.93388
## `GICSInformation Technology` 0.0105159 0.0096818 1.086 0.27781
## GICSMaterials -0.0257273 0.0086727 -2.966 0.00312
## `GICSReal Estate` -0.0128507 0.0111764 -1.150 0.25064
## `GICSTelecommunications Services` -0.0169121 0.0080751 -2.094 0.03661
## GICSUtilities -0.0078384 0.0095797 -0.818 0.41352
##
## (Intercept) ***
## return ***
## GrossMargin *
## OperatingMargin **
## PreTaxMargin
## ProfitMargin
## leverage *
## debtToEquity
## EBITtoAssets
## EBITtoEquity
## NCFtoEquity
## `GICSConsumer Staples`
## GICSEnergy ***
## GICSFinancials
## `GICSHealth Care` *
## GICSIndustrials
## `GICSInformation Technology`
## GICSMaterials **
## `GICSReal Estate`
## `GICSTelecommunications Services` *
## GICSUtilities
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.03780523)
##
## Null deviance: 28.289 on 675 degrees of freedom
## Residual deviance: 24.762 on 655 degrees of freedom
## AIC: -273.04
##
## Number of Fisher Scoring iterations: 2
Always analyze the fitted model.
#Revisamos el ajuste del modelo, primero analizamos los puntos de high leverage
lev <- hat(model.matrix(modFit_glm2$finalModel))
#se detectan un par, pero no muchos
plot(lev)
#A continuación analizamos los residuos estandarizados
res_student <- rstudent(modFit_glm2$finalModel)
cook = cooks.distance(modFit_glm2$finalModel)
plot(cook,ylab="Cooks distances")
points(which(lev>.2),cook[which(lev>.2)],col='red')
par(mfrow=c(1,3))
plot(training$GrossMargin, res_student, xlab="Gross Margin", ylab="Studentized Residuals")
plot(training$OperatingMargin, res_student,xlab="Operating Margin", ylab="Studentized Residuals")
plot(modFit_glm2$finalModel$fitted.values, res_student,xlab="Valores Ajustados", ylab="Studentized Residuals")
par(mfrow=c(1,2))
qqnorm(res_student)
qqline(res_student)
hist(res_student, freq=FALSE,
main="Distribution of Residuals")
xfit<-seq(min(res_student),max(res_student),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)
ggplot(aes(x=return,y=predict(modFit_glm2,testing),color=year_eeff),data=testing)+
geom_point() +
geom_smooth(method="lm") +
geom_abline(slope=1,intercept=0)
port_predict2 <- data.frame(list(testing,return_pred = predict(modFit_glm2,testing)))
port_predict2 <- as.tbl(port_predict2)
port_predict2 <- group_by(port_predict2,forYear) %>%
mutate(ranking = rank(return_pred),class = ifelse(ranking <=170,0,ifelse(ranking<=340,1,2))) %>%
ungroup()
port_predict2$class <- factor(port_predict2$class,labels=c("LP","MP","HP"))
as.data.frame(port_predict2 %>% group_by(forYear,class) %>%
summarise(n=n(),return = mean(leadReturn),minimo=min(leadReturn),maximo=max(leadReturn), std=sd(leadReturn), pseudo_Sharpe= return/std))
## forYear class n return minimo maximo std
## 1 2014 LP 170 -0.13012437 -1.7142051 0.4078223 0.3296853
## 2 2014 MP 170 -0.07659858 -1.4849681 0.6162113 0.2665800
## 3 2014 HP 94 -0.03115214 -0.6586459 0.7316621 0.2371732
## 4 2015 LP 170 0.04911236 -1.5537908 1.0982486 0.2275010
## 5 2015 MP 170 0.06254855 -0.6687296 0.5010547 0.1971311
## 6 2015 HP 103 0.08226747 -0.4226786 0.7135035 0.2063685
## pseudo_Sharpe
## 1 -0.3946927
## 2 -0.2873381
## 3 -0.1313476
## 4 0.2158776
## 5 0.3172942
## 6 0.3986435
Eureka! This model can filter some stocks nicely. In a sense, in 2014 we managed to find 94 stocks than, on average, had a sharpe that was twice and trice as big as the other 2 groups. For 2014 we didn’t do as well yet we managed to double the worst performing group’s pseudo Sharpe (no risk free included).
Hope you guys enjoyed this work, if any questions arise let me know.