Tips and tricks for learning a new code base
As a consultant, I approach new code bases every day. As exciting as that sounds, working on many different projects can be really tough. On many occasions we are not given all the details we need; we are just asked to fix something that is broken.
What makes these tasks difficult is that the data and code structures can be very large and not very intuitive. It might also be unclear who holds the knowledge about them, and in some cases that knowledge is lost altogether, for example because it was held by people who have left the company.
And the code does not always explain itself. A considerable share of the problems raised by clients are not easily reproducible. They are manifestations of inconsistencies that were generated at some point in the past and that can only be detected by analyzing the code thoroughly, working backwards from the symptom and making guesses.
These are some tricks I’ve been using to cope with this kind of situation in a SQL, Java and Git based environment:
One very handy tool is an Entity Relationship (ER) diagram generator, which gives a visual overview of the tables. There are several tools that do the trick. I have been using SQuirreL SQL; it works well and it’s free. To create the ER graph, select all the tables, right-click on the selection and choose ‘Add to graph’ from the menu.
Another thing that has helped me a lot is having a way to find out where a particular field is stored in the db. For instance, I recently had to fix a migration process from an old system to a new one without knowing either system’s database, each of which had around a hundred tables. It was very useful to know where in the old database a given row in the new database was getting its information from, so that I could understand the mapping and the data flow.
The way I did this was to export the old db as a SQL script and then search the file:
grep ",field," db_dump.sql | sed -e 's/INSERT INTO \([^ ]*\).*/\1/' | sort -u
This command returns the names of all the tables that contain an instance of that field. There might be neater commands that do the same thing, but the concept is the same.
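As a slightly extended sketch of the same idea, the search can be wrapped in a small shell function that also counts how many INSERT lines of each table contain the value, which helps to spot the main source table. The function name tables_with_value is hypothetical, and the sketch assumes a dump in which every INSERT statement starts at the beginning of a line:

```shell
#!/bin/sh
# tables_with_value VALUE DUMP_FILE   (hypothetical helper name)
# Prints "<count> <table>" for every table whose INSERT lines contain ,VALUE,
# with the most frequent table first. Counts are lines, not rows: a dump that
# uses multi-row INSERT statements puts many rows on one line.
tables_with_value() {
    value=$1
    dump=$2
    # -F treats the value as a fixed string, so regex characters in it are harmless.
    grep -F -- ",$value," "$dump" |
        sed -n 's/^INSERT INTO \([^ ]*\).*/\1/p' |
        sort | uniq -c | sort -rn
}
```

For example, tables_with_value 42 db_dump.sql lists the tables whose inserted data contains the value 42 between two commas.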
Even more important than this, however, is to see the way the data flows. I always recommend having query logging set up. If you are using MySQL, the setup is very easy: just make sure that the [mysqld] section of the file /etc/mysql/my.cnf includes these lines:
general_log = 1
general_log_file = /var/log/mysql/mysql.log
Restart the server. Then run small chunks of the application locally while following the log in a shell window with:
tail -f /var/log/mysql/mysql.log
Copy the logged queries and check what they do in your SQL editor. This can be a huge hint, as it tells us exactly what data we are modifying or selecting. I’ve been using MyBatis and Hibernate. Especially with the latter it is not easy to work out which query was executed, and in both cases, without the log, you would have to analyze the code deeply to determine which query was triggered each time.
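To make the copying step less tedious, the statements can be pulled out of the log without the leading columns. This is a minimal sketch, assuming each logged statement sits on a single line laid out as timestamp, connection id, the word Query, and then the SQL text; the function name extract_queries is hypothetical, and statements that span several lines would be cut after their first line:

```shell
#!/bin/sh
# extract_queries < general.log   (hypothetical helper name)
# Keeps only the lines logged as "Query" and strips everything up to the SQL text.
# Other entries (Connect, Quit, ...) are dropped.
extract_queries() {
    sed -n 's/^.*[[:space:]]Query[[:space:]]\{1,\}//p'
}
```

For example, extract_queries < /var/log/mysql/mysql.log > queries.sql produces a file that can be opened directly in the SQL editor.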
One particular use I have made of the log is to filter the insertions and updates to get a list of the tables that the process under analysis writes to. For example, for tables whose names are lowercase words separated by underscores:
sed -nE 's/.*(into|update) +([a-z_]+).*/write \2/p' /var/log/mysql/mysql.log | sort | uniq
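A variation on this filter, sketched under the same assumptions (lowercase SQL keywords, lowercase table names with underscores, and one statement per line preceded by the word Query), also separates inserts, updates and deletes and counts them per table. The function name write_summary is hypothetical:

```shell
#!/bin/sh
# write_summary < general.log   (hypothetical helper name)
# Prints "<count> <operation> <table>" for every write statement found in the log.
# Anchoring on "Query" means only the start of each statement is inspected,
# so the word "update" inside a string value is not mistaken for a write.
write_summary() {
    sed -nE 's/^.*[[:space:]]Query[[:space:]]+(insert into|update|delete from)[[:space:]]+`?([a-z_]+).*$/\1 \2/p' |
        sort | uniq -c
}
```

Running write_summary < /var/log/mysql/mysql.log after exercising one feature of the application gives a quick picture of which tables that feature touches and how.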
The other really useful thing in this scenario is a tool that shows the commit history. Right now I’m using IntelliJ as my IDE, which has built-in annotate functionality. It gathers all the commits on a file and shows, for each line, the last commit that modified it. From this we can understand a couple of things:
1) Who worked on the code. This is very important because we often reach a point where we have a way to fix what was requested but are not sure whether it might break something else. Knowing who wrote the code gives us a chance to ask why it was written that way and whether the fix would work.
2) When the code was committed. If, for example, we suspect that a snippet of code is responsible for a bug and we see that the code is very old and has been stable for a long time, this casts some doubt on our guess. On the other hand, fresh code is more likely to have corner cases that have not been resolved yet.
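The same two questions can also be answered from the command line when no IDE is at hand. The following is a rough sketch using standard git commands; the function names are hypothetical wrappers and the output format strings are just one possible choice:

```shell
#!/bin/sh
# Hypothetical wrappers around git blame and git log.

# blame_range FILE START END
# Shows, for each line in the range, the last commit, its author and its date.
blame_range() {
    git blame -L "$2,$3" --date=short -- "$1"
}

# file_history FILE
# Lists every commit that touched the file, following it across renames.
file_history() {
    git log --follow --date=short --format='%h %ad %an %s' -- "$1"
}

# commits_touching_text STRING
# Lists the commits that added or removed an occurrence of STRING.
commits_touching_text() {
    git log "-S$1" --date=short --format='%h %ad %an %s'
}
```

For example, blame_range src/Foo.java 40 60 answers "who and when" for a suspicious snippet, where src/Foo.java is a placeholder path, and commits_touching_text helps to find the commit that introduced a particular call or column name.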
I hope these tips and tricks help you the next time you are trying to decipher a new code base.