Friday, November 25, 2016

R: enabling the big data revolution?


This week, I am blogging from the Biological Control and Spatial Ecology group at the Université libre de Bruxelles, where I am spending about eight days learning how to optimize my code to automate the creation of databases.


Sounds like gibberish? Attentive followers will have noticed the many blog posts about big data lately. If big data means bringing many different unorthodox data sets together to explore correlations, then we need someone who can integrate the different data sources into one database for analysis...
So let us look at my challenge: I want to extract DHS (Demographic and Health Survey) data from about 50 countries. The surveys contain similar variables that have been coded slightly differently, which means that if you simply extract the data you cannot put it together easily: one country will write Poorest, another poorest, and a third might spell it differently still... So is my only choice to do everything manually, i.e. repeat the procedure 50 times (which would be a source of error)? Or are there tricks and tips for automating data extraction and integration into one big database?
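To make this concrete, here is a minimal R sketch of the kind of harmonization step I mean. The wealth-quintile labels and the "lowest" spelling variant are made up for illustration, not the actual DHS coding:

harmonise_wealth <- function(x) {
  x <- tolower(trimws(x))                  # "Poorest", " poorest " -> "poorest"
  x[x == "lowest"] <- "poorest"            # map a hypothetical spelling variant
  valid <- c("poorest", "poorer", "middle", "richer", "richest")
  x[!x %in% valid] <- NA                   # flag anything unexpected for manual review
  x
}

harmonise_wealth(c("Poorest", "poorest", "LOWEST", "Middle", "???"))
# [1] "poorest" "poorest" "poorest" "middle"  NA

The nice part is that once such a function is written, the same cleaning rule applies to every country file, instead of being redone 50 times by hand.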

After my first week, the conclusion is that there are steps you can automate and others where you have no choice but to dig into the data manually... but you can optimize the whole process a lot, and write your code so that the next time you need similar data, it becomes much easier... The R software, with its whole suite of tools in RStudio, offers a very flexible and efficient coding environment to implement this. I started coding in R about two years ago, yet the coding tricks I am discovering here are amazing!
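As a rough sketch of what such an automated pipeline might look like (the folder name, file format, and column names here are assumptions, not the real DHS layout): read every country file, apply the same cleaning function, and stack the results into one database.

# list all country exports, assumed to be CSV files in a "dhs_exports" folder
files <- list.files("dhs_exports", pattern = "\\.csv$", full.names = TRUE)

clean_one <- function(f) {
  d <- read.csv(f, stringsAsFactors = FALSE)
  d$wealth  <- harmonise_wealth(d$wealth)              # from the sketch above
  d$country <- tools::file_path_sans_ext(basename(f))  # keep track of the source file
  d
}

# one big database, built the same way every time
all_countries <- do.call(rbind, lapply(files, clean_one))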




Wanna follow my R journey? Follow the R! page here on my blog. It will collect all the small and big tips on using R for big and small data!
