Madrid real estate renting market data analysis

Madrid real estate renting market data analysis

Data analysis

Today I’m working with the data that I gathered from “habitaclia.com”, related to the flat renting market in Madrid. Find this data here(https://github.com/artikblue/datasets-analyses)
In this post I will walk you a little bit through the data I gathered:

Basic data analysis

As the dataset is stored in a mongo db the first step is connecting to it and retrieving the collection.

> con <- mongo("realestate_renting", url = "mongodb://127.0.0.1:27017/habitaclia")
> con$count()
[1] 6114

We can see that our dataset is composed by a total of 6114 flat renting offers.
With the dataset loaded we can start by doing some basic descriptive statistics. As it is important to know about what data we are doing our analysis we can print the urban areas and the villages were the offers we just retrieved belong.

> mydata <- con$find('{}')
> print("Studied areas")
[1] "Studied areas"
> unique(mydata["subzone"])
                            subzone
1    Cuenca del Alberche-Guadarrama
14           Cuenca del Tajo-Tajuña
31                           Madrid
366                        Zona Sur
1743                  Zona Noroeste
4243           Corredor del Henares
4797                     Zona Norte
> print("Studied villages")
[1] "Studied villages"
> unique(mydata["village"])
                                               village
1                                      Zona El Pijorro
2                                   Zona Casco Antiguo
3                                     Sevilla la Nueva
4                                              Brunete
5                                           Álamo (El)
6                                           Chapinería
7                                 Zona El Pinar-Dehesa
9                               Cadalso de los Vidrios
10                                 Pelayos de la Presa
[...]
5213                                  Zona Los Arroyos
5302                                        Valdelagua
5397                                            Cobeña
5437                                            Madrid
5671                  Zona El Cañaveral-Los Berrocales
5989  

As we are doing our analysis on the real estate renting market one obviously interesting thing to know is the avg renting price and the avg surface. As we are dealing with a lot of offers and some of them (specially the luxuriy properties) can have a very high price, using the median here is interesting and more revealing.

> print("general means")
[1] "general means"
> # GENERAL MEANS (relevant data)
> sapply(mydata["price"],mean)
   price 
2117.328
> sapply(mydata["price"],median)
price 
 1300 
> sapply(mydata["surface"],mean)
 surface 
128.5034 

And done that, we can notice an interesting difference between the median and the mean.
Box plots are also quite interesting and somehow confirm the theory I just presented. The boxplot related to the price shows that almost all values are between a specific range and then some outliers appear in the upper sector, so those are luxury properties for sure. priceboxplot Regarding to the surface something similar happens but not as extreme, probably because expensive properties are more about luxury than just “space”. surfaceboxplot Those two other boxplots I think are quite less relevant and follow the same logic.

roomsboxplot

toiletsboxplot

We can also generate the density graphs for each feat to quickly identify if they follow a normal distribution along all of the elements.
In the first graph, related to the price we can see that almost all of the values are concentrated at the very beggining and so the outliers have a big effect on “breaking” the normality.

pricedensity

Regarding to the surface all of the values are located between about 30 and 200 which is somehow expected, then we may have some very big properties but not to many.

surfacedensity The rooms and toilets follow the same logic, we see that the graphic is fluctuating because we have few “integer” values of each like 1,2,3… and all of the elements match at least in one category. roomsdensity

toiletsdensity

As a conclusion of this first stage of the analysis a summary of the data can complete this big picture for us:

> print("data summaries")
[1] "data summaries"
> # SUMMARIES
> summary(mydata$price)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    200.0     936.2    1300.0    2117.3    2000.0 1800000.0 
> summary(mydata$surface)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0    65.0    92.0   128.5   142.8   992.0 
> summary(mydata$numpics)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   11.00   18.00   20.32   28.00  123.00 
> summary(mydata$rooms)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    2.00    2.00    2.61    3.00   24.00 
> summary(mydata$toilets)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   2.000   1.997   2.000  26.000 

From the data summary we can basically extract that prices range from 200EUR to 1800000EUR the avg house/flat may have a cost I would say between 1300EUR and 2000EUR and it would have a surface between 92m2 and 100m2 and would have 1 toilet and 2 or 3 rooms. Most of the offer posts include 18 pics. There is nothing much more to extract.
We can also try to identify some correlation between features by generating the pairs graph. basicpairs One can guess that price and surface may have correlation but price may depend on other factors such as location or luxury. In other terms price and surface have correlation but at the very beggining of the scale, then other factors come to play .Of course surface and rooms and toilets have correlation.

Analysis by categories

As the offers belong to specif categories such as zone or or village, we can try groping our data by those and exploring the differences between each.

> villages <- mydata %>%
+   group_by(village) %>%
+   summarize(price_mean=mean(price), price_deviation=sd(price), price_var=var(price),
+             price_median=median(price), total_val=length(price),price_max=max(price),
+             price_min=min(price),price_range=max(price)-min(price), rooms_med=mean(rooms), 
+             toilets_med=mean(toilets), npics_med=mean(numpics),
+             surface_max=max(surface), surface_min=min(surface), surface_mean=mean(surface),
+             surface_median=median(surface), surface_sd=sd(surface), surface_varr=var(surface),
+             surface_range=max(surface)-min(surface), price_meter=sum(price)/sum(surface))
> zones <- mydata %>%
+   group_by(subzone) %>%
+   summarize(price_mean=mean(price), price_deviation=sd(price), price_var=var(price),
+             price_median=median(price), total_val=length(price),price_max=max(price),
+             price_min=min(price),price_range=max(price)-min(price), rooms_med=mean(rooms), 
+             toilets_med=mean(toilets), npics_med=mean(numpics),
+             surface_max=max(surface), surface_min=min(surface), surface_mean=mean(surface),
+             surface_median=median(surface), surface_sd=sd(surface), surface_varr=var(surface),
+             surface_range=max(surface)-min(surface), price_meter=sum(price)/sum(surface))

Then we can analyze the zones who have more offers, the more expensive zones and the zones who offer larger spaces:

> select(zones %>% arrange(-total_vals), zone, total_vals)
# A tibble: 7 x 2
  zone                           total_vals
  <chr>                                  <int>
1 Madrid                                  4455
2 Zona Noroeste                            718
3 Zona Sur                                 341
4 Zona Norte                               309
5 Corredor del Henares                     171
6 Cuenca del Tajo-Tajuña                    84
7 Cuenca del Alberche-Guadarrama            36
> select(zones %>% arrange(-price_avg), zone, price_avg)
# A tibble: 7 x 2
  zone                           price_avg
  <chr>                                 <dbl>
1 Zona Norte                            3549.
2 Madrid                                2240.
3 Zona Noroeste                         1817.
4 Cuenca del Tajo-Tajuña                 952.
5 Zona Sur                               902.
6 Corredor del Henares                   866.
7 Cuenca del Alberche-Guadarrama         760.
> select(zonas %>% arrange(price_avg), zone, price_avg)
# A tibble: 7 x 2
  zone                           price_avg
  <chr>                                 <dbl>
1 Cuenca del Alberche-Guadarrama         760.
2 Corredor del Henares                   866.
3 Zona Sur                               902.
4 Cuenca del Tajo-Tajuña                 952.
5 Zona Noroeste                         1817.
6 Madrid                                2240.
7 Zona Norte                            3549.
> select(zonas %>% arrange(-surface_avg), zone, surface_avg)
# A tibble: 7 x 2
  zone                           surface_avg
  <chr>                                     <dbl>
1 Zona Norte                                192. 
2 Zona Noroeste                             188. 
3 Cuenca del Tajo-Tajuña                    129. 
4 Cuenca del Alberche-Guadarrama            121. 
5 Madrid                                    117. 
6 Zona Sur                                  107. 
7 Corredor del Henares                       97.2

Not a big surpise that most of the offers are located in the metropolital area (aka the big city). Another interesting fact here is that living in small villages outside the metropilitan area can be very cheap.
If we go by village (or quarter in the big city) we see that most of the offers are located in the very city centre.

> select(villages %>% arrange(-total_vals), village, total_vals)
# A tibble: 428 x 2
   village                      total_vals
   <chr>                                <int>
 1 Zona Castellana                        178
 2 Zona Recoletos                         173
 3 Zona Universidad-Malasaña              167
 4 Zona Embajadores-Lavapiés              130
 5 Zona Justicia-Chueca                   130
 6 Zona Almagro                           119
 7 Zona Lista                             117
 8 Zona Argüelles                         107
 9 Zona Goya                              100
10 Zona Hispanoamérica-Bernabéu            96
# … with 418 more rows
> select(villages %>% arrange(-price_avg), village, price_avg)
# A tibble: 428 x 2
   village                              price_avg
   <chr>                                       <dbl>
 1 Pedrezuela                                112950 
 2 Zona Nueva España                          23240.
 3 Zona La Moraleja                            8307.
 4 Zona La Finca                               8259.
 5 Zona Monteclaro                             6675 
 6 Zona El Plantío                             6125.
 7 Carabaña                                    6000 
 8 Zona La Pizarra                             6000 
 9 Zona Urbanización Este-Montepríncipe        5663.
10 Zona Montealina                             5221.
# … with 418 more rows
> select(villages %>% arrange(price_avg), village, price_avg)
# A tibble: 428 x 2
   village             price_avg
   <chr>                      <dbl>
 1 Cervera de Buitrago         340 
 2 Tielmes                     348.
 3 Valdeavero                  400 
 4 Bustarviejo                 425 
 5 Corpa                       425 
 6 Batres                      450 
 7 Pelayos de la Presa         540 
 8 Puentes Viejas              550 
 9 Titulcia                    550 
10 Redueña                     568.
# … with 418 more rows
> select(villages %>% arrange(-surface_avg), village, surface_avg)
# A tibble: 428 x 2
   village                              surface_avg
   <chr>                                           <dbl>
 1 Zona Club de Golf                                796 
 2 Cabanillas de la Sierra                          750 
 3 Ciudalcampo                                      608.
 4 Zona La Finca                                    600.
 5 Madrid                                           600 
 6 Zona Urbanización Este-Montepríncipe             534 
 7 Zona Las Lomas                                   525.
 8 Zona Bonanza                                     524 
 9 Zona La Pizarra                                  500 
10 Zona Los Robles                                  452 
# … with 418 more rows

We can also note that the most expensive properties are located in the peripheria of the big city in luxury districts.
The other interesting category here is the one related to companies:

> select(companies %>% arrange(-total_vals), company, total_vals)
# A tibble: 1,036 x 2
   company                                          total_vals
   <chr>                                                    <int>
 1 NA                                                         488
 2 SERVICHECK (VALLECAS)                                      488
 3 INMOBILIARIA EMMANUEL                                      235
 4 aProperties                                                188
 5 Consultoría Inmobiliaria Internacional de Madrid           142
 6 DEAL INMOBILIARIA                                          117
 7 OUTLETDEVIVIENDAS                                          111
 8 RENTA GARANTIZADA - UMBER                                   90
 9 TESTA RESIDENCIAL                                           80
10 SOLFAI CONSULTING                                           77
# … with 1,026 more rows
> select(companies %>% arrange(-price_avg), company, precio_media)
# A tibble: 1,036 x 2
   company                                  price_avg
   <chr>                                           <dbl>
 1 REDPISO LAS TABLAS                            113432.
 2 INMOBILIARIA EMMANUEL                          10105.
 3 COLDWELL BANKER GLOBAL LUXURY ZAROSAN           9280 
 4 Vive Home Style                                 7371.
 5 ARRAS CONSULTORIA INMOBILIARIA                  7300 
 6 MADRID TEAM SL                                  7057.
 7 VOHOME CENTRAL                                  6900 
 8 GESTION MADRID                                  6850 
 9 ALFEREZ REAL ESTATE                             6250 
10 Berkashire Hathaway Home Services Larvia        6112.
# … with 1,026 more rows
> select(companies %>% arrange(price_avg), company, price_avg)
# A tibble: 1,036 x 2
   company                         price_avg
   <chr>                                  <dbl>
 1 INMOBILIARIA PULPON                     200 
 2 HOUSE FM                                300 
 3 DP 2020 GESTIONES INMOBILIARIAS         320 
 4 DISTRITO GETAFE I                       350 
 5 RED IN                                  390 
 6 ABANTOS ( ISABEL BEATRIZ )              412.
 7 CARIHUELA SOL                           450 
 8 GCI                                     450 
 9 CASA SERVICIOS INMOBILIARIOS            488.
10 BARCHAN                                 500 
# … with 1,026 more rows
> select(companies %>% arrange(-surface_avg), company, surface_avg)
# A tibble: 1,036 x 2
   company                                  surface_avg
   <chr>                                               <dbl>
 1 LÍNEA DE GESTIÓN 2, S.A.                             600 
 2 MIRANDA SERVICIOS INMOBILIARIOS.                     546.
 3 MIRAMADRID GRUPO                                     544 
 4 Rester Iberia                                        500.
 5 GESTION MADRID                                       500 
 6 Berkashire Hathaway Home Services Larvia             495.
 7 DOMUS VENDI                                          480 
 8 HOUSING4YOU                                          455 
 9 Mg Grupo Inmobiliario                                452 
10 2MP                                                  450 
# … with 1,026 more rows

We can see that SERVICHECK and EMMANUEL are the top offering companies and we can also identify potential luxury companies such as MIRANDA, LINDEA DE GESTIÓN and REDPISO LAS TABLAS.
Seen that, we can end our general analysis by looking at the most common values on each category:

> print("most common values")
[1] "most common values"
> # PRICE
> sort(table(mydata["price"]),decreasing=TRUE)[1:5]

1200 1100  900 1300  850 
 258  232  214  200  190 
> # REAL ESTATE COMPANIES
> sort(table(mydata["company"]),decreasing=TRUE)[1:5] # NA = independent owner renting his property(ies)

                                              NA                            SERVICHECK (VALLECAS) 
                                             488                                              488 
                           INMOBILIARIA EMMANUEL                                      aProperties 
                                             235                                              188 
Consultoría Inmobiliaria Internacional de Madrid 
                                             142 
> # SURFACE
> sort(table(mydata["surface"]),decreasing=TRUE)[1:5] # squared meters

 60  70  90  80 100 
262 255 221 208 192 
> # ROOMS
> sort(table(mydata["rooms"]),decreasing=TRUE)[1:3]

   2    3    1 
1811 1651 1388 

As we saw before, most of the prices range from 900 to 1300 and surfaces go from 60 to 100.

Cluster analysis

As we saw in the first stage of this analysis, it is clear that we have different “categories” of properties (ex: rural vs urban, luxury vs affordable etc). In these situations running a cluster analysis can be helpful.

> ds <- select(mydata, 4,5,6,7,9) # all of the numeric feats
> # KMEANS
> dcluster <- kmeans(ds, 6, nstart = 1)
> dcluster
K-means clustering with 6 clusters of sizes 1301, 19, 293, 1, 4499, 1

Cluster means:
        price   surface    rooms  toilets  numpics
1    2742.738 195.87087 3.450423 2.860876 26.18370
2   16500.000 289.31579 7.684211 7.894737 36.73684
3    5831.676 402.96928 5.051195 4.935154 32.50853
4 1800000.000 350.00000 5.000000 4.000000 14.00000
5    1134.665  90.35074 2.185819 1.530118 17.76039
6  450000.000 437.00000 4.000000 3.000000 58.00000

In this case a cluster analysis is a bit revealing as it shows multiple categories. From those categories one of them (number 5) clearly corresponds to the category we just defined in the previous stage. Other categories correspond to (probably) other significant market sectors such as categories 1 and 3. The rest of the categories correspond to highly expensive properties I can’t even dream of.

So having those identified, we can extract them from the general dataset.

> group1 <- filter(mydata, price >250 & price < 1600)
> mean(data.matrix(group1["price"]))
[1] 1044.116
> sd(data.matrix(group1["price"]))
[1] 268.9029

And repeat our analyses focused on each group: basicpairs

> shapiro.test(data.matrix(group1["price"]))

	Shapiro-Wilk normality test

data:  data.matrix(group1["price"])
W = 0.97499, p-value < 2.2e-16

So we can see that we don’t have a normal distribution in terms of the price eventhough there is more “normality” here than in the general group.

basicpairs It is very interesting to note here that as we move to this general group (let’s say the avg property for the avg family) we can appreciate how in here price and surface do really correlate somehow.

basicpairs

> regresion = lm(data.matrix(group1["price"]) ~ data.matrix(group1["surface"])) 
> plot(price ~ surface, group1)
> abline (regresion, lwd=1, col ="red" )    
> regresion

Call:
lm(formula = data.matrix(group1["price"]) ~ data.matrix(group1["surface"]))

Coefficients:
                   (Intercept)  data.matrix(group1["surface"])  
                       891.924                           1.807  

> cor.test(data.matrix(group1["surface"]), data.matrix(group1["price"]), method=c("pearson", "kendall", "spearman"))

	Pearson's product-moment correlation

data:  data.matrix(group1["surface"]) and data.matrix(group1["price"])
t = 20.224, df = 3902, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2793456 0.3361389
sample estimates:
      cor 
0.3080166 

Anyway correlation is still weak here.

Let’s go for the second group, expensive properties.

> group2 <- filter(mydata, price > 1600 & price < 3200)
> mean(data.matrix(group2["price"]))
[1] 2234.894
> sd(data.matrix(group2["price"]))
[1] 419.5886

basicpairs We have a small degree of normality here but most of the properties range from 1500 to 2000.

> shapiro.test(data.matrix(group2["price"]))

	Shapiro-Wilk normality test

data:  data.matrix(group2["price"])
W = 0.93823, p-value < 2.2e-16

Anyway, we can see that we don’t have a normal distribution here either.

basicpairs

basicpairs

> pairs(~price + surface + rooms + toilets,data=group2,
+       main="correlation matrix") #Matríz de correlaciones
> regresion = lm(data.matrix(group2["price"]) ~ data.matrix(group2["surface"])) ## Construyo una ecuación de regresión lineal
> plot(price ~ surface, group2)
> abline (regresion, lwd=1, col ="red" )    ### Dibujo la línea de regresión
> regresion

Call:
lm(formula = data.matrix(group2["price"]) ~ data.matrix(group2["surface"]))

Coefficients:
                   (Intercept)  data.matrix(group2["surface"])  
                      1959.305                           1.724  

> cor.test(data.matrix(group2["price"]), data.matrix(group2["price"]), method=c("pearson", "kendall", "spearman"))

	Pearson's product-moment correlation

data:  data.matrix(group2["price"]) and data.matrix(group2["price"])
t = Inf, df = 1414, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 1 1
sample estimates:
cor 
  1 

We still have a bit more regression here than in the general group, but its more weird than the previous group.

And at the end we can categorize our elements and see how each group represents a % of the total.

> mydata$price_category<-ifelse(mydata$price <500, "VERYCHEAP", ifelse(mydata$price <1000,"NORMAL", ifelse(mydata$price < 5000,"EXPENSIVE","HIGHEXPENSIVE")))
> nrow(subset(mydata,price_category == "VERYCHEAP")) / nrow(mydata)
[1] 0.007360157
> nrow(subset(mydata,price_category == "NORMAL")) / nrow(mydata)
[1] 0.2924436
> nrow(subset(mydata,price_category == "EXPENSIVE")) / nrow(mydata)
[1] 0.6640497
> nrow(subset(mydata,price_category == "HIGHEXPENSIVE")) / nrow(mydata)
[1] 0.03614655

And we see how most of the properties belong to the “EXPENSIVE” category.

Predictive analysis

Can we predict the price of a property based on its features? Well, let’s be honest, probably NOT. But we can try that anyway.

Neuralnet package offers a nice way to do that. We can select all of the numerical feats of our dataset, split the dataset into 60-40 for training and validation and run some tests.

library(neuralnet)
ds <- select(mydata, 4,5,6,7,9) 
dvals <- ds
samplesize <-0.60 * nrow(dvals)
set.seed(80)
index = sample(seq_len(nrow(ds)),size = samplesize)

datatrain = dvals[ index, ]
datatest = dvals[ -index, ]

max = apply(dvals , 2 , max)
min = apply(dvals, 2 , min)
scaled = as.data.frame(scale(dvals, center = min, scale = max - min))

trainNN = scaled[index , ]
testNN = scaled[-index , ]

# fit neural network price
set.seed(2)
NN = neuralnet(price ~ toilets + surface + rooms , trainNN, hidden = c(4,3,4) , linear.output = T )

# plot neural network
plot(NN)

We can even plot the neuralnetwork. And as we see the error is so high.
basicpairs And we can plot the regression as well:

predict_testNN = compute(NN, testNN[,c(1:4)])
predict_testNN = (predict_testNN$net.result * (max(dvals$price) - min(dvals$price))) + min(dvals$price)

plot(datatest$price, predict_testNN, col='blue', pch=16, ylab = "predicted price NN", xlab = "real price")

abline(0,1)

It looks like we have a straight regression but… note that due to that outlier the scale moves a lot… so…. basicpairs
We can calculate the RMSE:

> # Calculate Root Mean Square Error (RMSE)
> RMSE.NN = (sum((datatest$price - predict_testNN)^2) / nrow(datatest)) ^ 0.5
> RMSE.NN
[1] 9128.876

And it shows that the error is really high…!
I will update this post soon with more interesting conclusions :)

Madrid real estate renting market data analysis
Older post

Embedded systems programming with atmel328 - 1

Newer post

Embedded systems programming with atmel328 - 3 (microservos)

Madrid real estate renting market data analysis