Sustainable Code: Increasing the shelf-life of your repository
A guide to crafting and maintaining a repository
We all feel good when someone starts exploring our repository, forks it, or raises an issue. It shows that the work you’ve shared with the community is being acknowledged and appreciated. However, published code isn’t always pleasant to read, even when its functionality is admirable. This could be due to missing documentation, limited applicability, or poor comprehensibility. These aspects do not directly affect what your code does. Rather, they determine how well your work is presented, making it easier for others to understand, interpret, extend, and integrate. Sometimes even revisiting your own earlier work takes a considerable amount of time before you can follow what is going on in the code. In my view, all of this comes down to writing unsustainable code.
Everyone may have a different point of view on what sustainable code is. In this article, sustainable code covers both practical and technical aspects, from the way the code is written to how it is presented. It means code that other contributors can extend, which requires comprehensibility and clarity, and code that is tested to provide robustness.
Your Contribution to the Community
You may come from a specific domain and assume that the subject you are working on can be easily understood by anyone within the same domain or community. This is often true, but not always. There is a real difference between a piece of code that takes an hour to comprehend and one that takes a day. Moreover, while your contributions are intended for your community, your potential users or followers might come from different areas. They could include developers seeking to integrate the machine learning model you’ve been working on into their products. Therefore, it is advisable to target a broader audience when it comes to the clarity and comprehensibility of your work.
You might consider your work as an implementation that showcases an example of a specific problem in your domain, for example a notebook that demonstrates a machine learning model on a specific dataset. However, this doesn’t imply that it won’t be used by others. Potential users may not adopt your entire example, but they could extract the parts that interest them and refer to your work in their own implementations. In short, every well-presented piece of code is potentially useful to others.
These days, virtually every one of us has an account on platforms such as GitHub, GitLab, or other sites to share our work. This is an excellent opportunity to showcase our contributions and efforts. The tips and suggestions in this article are aimed at enhancing the visibility of your work on these platforms. Let’s preview the topics that will be discussed to improve the sustainability of your codebase:
- Coding style and conventions.
- Documentation, encompassing comments, docstrings, and appropriate naming of methods and variables.
- Type annotations for variables and arguments, especially for Python users.
- Comprehensive testing of your library.
- Effective utilization of Git version control.
- Clearly defining your development and production environments.
- Application of suitable licensing.
The following sections cover each of these topics in more detail.
Coding Style and Convention
Every programming language has its own coding conventions for naming files, methods, and modules, structuring folders and files, limiting line length, and using built-in constructs such as for loops or argument assignment. You may have your own conventions that you are accustomed to, but they may not be helpful if you intend to share your code with the community of the programming language you are using. Therefore, before starting to write code, please consider following the coding conventions for that language. For example, you can follow PEP-8 for Python, the tidyverse style guide for R, the Scala style guide for Scala, or a JavaScript style guide for JS.
Please see the following example from PEP-8, which shows correct and incorrect ways to write module imports and use whitespace.
# Correct:
import os
import sys
# Wrong:
import sys, os
# White spaces:
# Correct:
spam(ham[1], {eggs: 2})
# Wrong:
spam( ham[ 1 ], { eggs: 2 } )
There are also tools that check whether a repository respects a coding convention. Flake8 is a handy tool for checking Python code, as is scalastyle for Scala. These tools also let you customize the convention rules as you prefer, and they can be added to your IDE as plugins.
Documentation
Documentation is perhaps the most important part of creating a comprehensible repository. It enables anyone interested in your work to understand it easily. However, please note that documentation is not limited to adding comments to code. It includes docstrings, tagged comments, and even proper variable naming.
Docstrings
Every programming language has its own docstring style, such as placing the docstring right after the function or method signature (as in Python) or just before the definition (as in Scala). Therefore, please ensure that you follow the guidelines for the programming language that you’re using.
Docstrings can be parsed and used as API references, which is the common practice for most published libraries. For instance, if you visit the Scikit-Learn documentation webpage, you will find that examples and descriptions are actually written as docstrings where modules, classes, and methods are defined. Preparing your docstrings in this manner will be quite beneficial if you plan to publish a documentation page for your library.
A docstring should primarily explain the purpose of the method, class, or module. Typically, the first line provides a concise description, while subsequent lines can be used for detailed explanations, examples, and comments. It also includes information about the function itself, such as its arguments, return values, and the exceptions or errors it may raise.
Here is an example docstring in Scala:
/** Takes in a number d, returns the square of d
  *
  * Here is some detailed explanation about how useful I am
  *
  * @param d the Double to square
  * @return the result of squaring d
  */
def square(d: Double): Double = d * d
Here is the equivalent in Python:
def square(n):
    '''Takes in a number n, returns the square of n

    Args:
        n (int): integer to square

    Returns:
        int: the result of squaring n
    '''
    return n**2
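Because a docstring is attached to the object it describes, tools can read it at runtime, which is what documentation generators rely on. The sketch below is illustrative only: it assumes a Google-style docstring with an `Example` section added for this demonstration, then uses the standard library's `inspect.getdoc` to retrieve the text and `doctest` to run the embedded example as a check.

```python
import doctest
import inspect


def square(n: int) -> int:
    """Return the square of n.

    Args:
        n (int): integer to square

    Returns:
        int: the result of squaring n

    Example:
        >>> square(4)
        16
    """
    return n ** 2


# inspect.getdoc returns the docstring with its indentation cleaned up.
summary_line = inspect.getdoc(square).splitlines()[0]
print(summary_line)

# doctest collects the ">>> square(4)" example from the docstring and runs it.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(square, name="square", globs={"square": square}):
    runner.run(test)
results = runner.summarize(verbose=False)
print(f"attempted={results.attempted}, failed={results.failed}")
```

Keeping a small runnable example in the docstring means the documentation is checked whenever the doctests run, so it is less likely to drift from the implementation.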
Comments
Comments are often confused with docstrings regarding their level of detail. Ideally, comments should be short and concise. They serve to clarify the purpose or provide notes/warnings for specific lines of code, highlighting potential issues. Adding lengthy and detailed comments to every line of code can be inconvenient as it hampers code readability. Occasionally, I come across repositories that have an excessive number of unnecessary comments, causing one to get lost in the code. If you find yourself needing to add more comments to your code, it is likely because your implementation is complex or the naming of your methods or variables is unclear.
In software development, there are common tags that can be used in comments, such as “FIXME” and “TODO”. You can also define further tags like “PENDING”, “DOCME”, and “CHECKME”. These tags are useful for highlighting specific issues or improvements. For example, a “FIXME” tag indicates a potential issue that needs to be fixed, while a “TODO” tag marks a part of the code that still requires implementation. These tags serve as reminders for you or other contributors regarding future needs and implementations.
Please find an example of a “TODO” tag below.
def say_hello(to: str):
    # TODO: Add UTF-8 encoding
    print("Hello " + to)
Naming variables
I believe that naming variables and objects is an essential part of documentation since it provides significant hints about their purpose. For example, using “num_customers” or “n_customers” to represent the number of customers would be a better choice compared to “n_c” or “num_c”. Additionally, if you have the same variable in different methods, it is preferable to use the same name instead of different variations. For instance, if you define the number of customers as “n_customers” in one method, using “num_customers” in another method would introduce inconsistencies. The same principles apply to naming classes, functions, and other elements as well.
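The difference is easiest to see side by side. In the sketch below, all function and variable names are hypothetical and chosen only for illustration; the first function uses terse names, while the other two use descriptive names and keep `n_customers` consistent across methods.

```python
# Unclear: the reader has to guess what t and n_c stand for.
def avg(t, n_c):
    return t / n_c


# Clearer: the names state their purpose, and the same name (n_customers)
# is reused wherever the same quantity appears.
def average_revenue_per_customer(total_revenue: float, n_customers: int) -> float:
    return total_revenue / n_customers


def customers_per_region(n_customers: int, n_regions: int) -> float:
    return n_customers / n_regions


print(average_revenue_per_customer(total_revenue=100.0, n_customers=4))
```

Both versions compute the same result; only the second tells the reader what is being computed without needing a comment.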
Use Typing
Typing is crucial for identifying interconnections between different objects within a library or with third-party libraries. While typing is mandatory in some programming languages, it is optional or not supported in others. Python is a prime example, where adding types is possible but not required. The Python interpreter does not enforce type annotations at runtime, so unless you check your code with a static type checker such as “mypy”, adding types to variables, arguments, or return values does not change how the program behaves. Nonetheless, incorporating typing is highly beneficial for creating a well-structured and robust codebase, ensuring clarity regarding method inputs, outputs, and variable definitions. It also greatly aids other contributors in understanding the purpose of a method, its expected return type, and the types of input it accepts.
Typing helps clarify your code in the following ways:
1. It documents your code. Developers no longer need to inspect the implementation to determine a variable’s type. For instance, if a class takes an argument without a type annotation, it becomes difficult for others to discern which module or class that argument belongs to. With typing, external contributors can see the connections among your modules and classes, which improves comprehension and maintainability.
2. It contributes to creating a cleaner architecture where every part of your implementation is well defined.
3. It makes it easier to validate your work as it helps IDEs and linters perform better. For example, your IDE or linter can issue warnings if a method returns an unexpected value of a different type.
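The points above can be illustrated with a minimal sketch. The `Customer` class and `total_orders` function are hypothetical names invented for this example. The annotations tell a reader (and an IDE or a static checker such as mypy) exactly what goes in and what comes out, while the interpreter itself does not enforce them.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    n_orders: int


def total_orders(customers: list[Customer]) -> int:
    """Return the total number of orders across all customers."""
    return sum(customer.n_orders for customer in customers)


customers = [Customer("Ada", 3), Customer("Grace", 5)]
print(total_orders(customers))  # 8

# Python does not enforce annotations at runtime: the interpreter accepts this
# call and only fails once the body touches a missing attribute. A static
# checker can flag the wrong argument type before the code is ever run.
try:
    total_orders(["Ada", "Grace"])
except AttributeError as error:
    print(f"Runtime failure: {error}")
```

Without the annotations, a contributor would have to read the function body to learn that each element needs an `n_orders` attribute.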
Test your library
If your code has different interconnected modules, adding new functionalities or making changes to the API may potentially break other parts of your code. Even if you are extremely cautious, there will always be some overlooked areas where functionalities might not work as intended. Therefore, it is highly recommended to incorporate testing, such as unit tests or integration tests, into your development process. Testing allows you to identify bugs in your implementation and ensure that every functionality works correctly. Once your tests cover a sufficient portion of your codebase, you can have confidence that any bugs resulting from the changes you’ve made will be detected.
Testing also benefits contributors who fork your repository and introduce new functionalities or bug fixes. Both you and the contributors will be able to determine whether the changes made introduce any bugs in other parts of the code.
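As a minimal sketch of what a unit test can look like, the example below uses Python's built-in `unittest` module to check the `square` function from the docstring section. The test class and test names are illustrative, and the suite is run programmatically here only to keep the example self-contained.

```python
import unittest


def square(n: int) -> int:
    """Return the square of n."""
    return n ** 2


class TestSquare(unittest.TestCase):
    def test_positive_number(self):
        self.assertEqual(square(3), 9)

    def test_negative_number(self):
        self.assertEqual(square(-4), 16)

    def test_zero(self):
        self.assertEqual(square(0), 0)


# Load and run the test case; result records how many tests ran and failed.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSquare)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(f"ran={result.testsRun}, successful={result.wasSuccessful()}")
```

In a real project, tests like these would usually live in their own directory and be run with `python -m unittest`, so that you and any contributor can run them after every change.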
Benefit from Git
Git is sometimes mistaken for a tool solely for storing codebases, but it is primarily a version control system designed to manage code history. Therefore, it is crucial to use Git effectively if your aim is to create a codebase open to collaboration. Git is an excellent tool for preserving the development history, benefiting both you and potential users or contributors of your repository.
To develop smoothly without disrupting the master or main branch that your users rely on, it is recommended to follow a branching model such as Git flow when publishing new functionalities or versions of your repository.
When committing changes, it is essential to provide clear and concise commit messages that describe the introduced changes. This practice facilitates easy review and allows for efficient inspection and reversion of previous changes.
Instead of making a single extensive commit that introduces numerous changes, it is advisable to create smaller commits. This approach enables easier tracking of specific modifications. However, if you wish to avoid having excessive commits, you can squash your smaller commits together when introducing a new functionality.
Define your dependencies
Your program may depend on specific environments, modules, operating systems, and so on. It is crucial to declare these dependencies so that others know the requirements. Most programming languages have package managers, such as npm for JavaScript, sbt for Scala, and pip for Python. Instead of only listing dependencies in your README.md, it is recommended to create a file that can be read by the package manager of the language you are using. For example, you can use `requirements.txt` for pip, where all dependencies are listed along with their versions, or an `environment.yml` file if you are using Python with Conda. Additionally, you can include the installation steps in the “README.md” file, such as the following command for pip:
pip install -r requirements.txt
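To make the idea of pinned versions concrete, here is a small sketch that reads lines of the form `name==version` and compares them with what is installed, using the standard library's `importlib.metadata`. The helper names, the sample file contents, and the package versions shown are assumptions for illustration only; real requirements files support more syntax than this handles.

```python
from importlib import metadata


def parse_pinned_requirements(text: str) -> dict[str, str]:
    """Map package names to pinned versions for lines like name==version."""
    pinned = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pinned[name.strip()] = version.strip()
    return pinned


def find_mismatches(pinned: dict[str, str]) -> dict[str, str]:
    """Return packages whose installed version differs from the pinned one."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = "not installed"
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


# Hypothetical file contents; the packages and versions are examples only.
example = """
numpy==1.26.4
pandas==2.2.2  # data frames
"""
print(parse_pinned_requirements(example))
```

Pinning exact versions in this way lets anyone who clones the repository reproduce the environment the code was developed in.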
Licensing
Licensing is not limited to advanced, well-maintained libraries with a significant user base that are distributed through package managers. Even if your implementation consists of just two lines of code, it is important to apply a license to clarify the conditions under which it can be used, modified, republished, etc. What adds further value to your implementation is its potential usage by other libraries. Other libraries can install your package or utilize parts of its source code by referencing your repository, which in turn increases the visibility of your repository. They may even fork your repository, address bugs, introduce new functionalities, and create pull requests. This not only enhances the technical aspects of your work but also increases its value.
It is crucial to remember that repositories without a license are viewed as unreliable by others, because it is uncertain whether the code can be used or not. Even if people use certain portions of your codebase, they may not reference your repository due to the lack of a clear license. Therefore, it is highly recommended to explore different software licenses and choose one that aligns with your requirements.
Conclusions
Crafting well-presented and sustainable code is a challenging endeavor, requiring a substantial investment of time in its development. Often, we are tempted to resort to shortcuts and write code in a more straightforward manner. Nevertheless, by adhering to the guidelines outlined above, the process of writing code will gradually become more seamless over time. Furthermore, this approach contributes to establishing a well-structured codebase that others can trust and contribute to, ultimately enhancing the value of your product.
This article aims to provide straightforward guidelines to increase the comprehensibility and interoperability of your code, aligning with common standards in software development. The specific details of the topics and implementations may vary based on the programming language you employ or the team you collaborate with. The central goal is to create a unified codebase that can be shared within your team or community.