Maintaining URLs In RPM Metadata
As you probably know, all programs in ROSA Linux are distributed as RPM packages that include the application files themselves (binary executables, libraries, data, etc.) and so-called metadata - package name, summary, description, requirements and so on. In this paper, we will talk about the metadata fields that can contain URLs pointing to external Internet resources.
First of all, this includes the URL field, which points to the program's home page (you can see it among other information in the package manager). Besides that, many package spec files (which contain instructions for rpmbuild on how to build packages from source code) specify the location of the source code on the Internet.
Why specify a source URL in the spec file? I guess the original reason was to simplify the maintainer's life and help them find the source code of a new program version on the Internet (knowing the application home page is sometimes not enough; you can spend quite some time browsing that page looking for the sources). Given the source URL, you can just replace my-app-1.0.tar.gz with my-app-2.0.tar.gz and any downloader will fetch the new upstream version. A long time ago you had to do this manually, but nowadays rpmbuild in ROSA Desktop can download files from the Internet by itself, so all you have to do to build a new version is update the Version field in the spec file (provided that the version macro is used instead of a hardcoded value wherever appropriate).
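To illustrate why the version macro matters, here is a minimal Python sketch of macro substitution in a source URL. It is not how rpmbuild expands macros internally; the expand_macros helper, the example.com host and the URL layout are assumptions made up for this illustration.

```python
import re


def expand_macros(text, macros):
    """Replace %{name}-style macros using a simple lookup table (hypothetical helper)."""
    def substitute(match):
        # Leave unknown macros untouched instead of guessing a value.
        return macros.get(match.group(1), match.group(0))
    return re.sub(r"%\{(\w+)\}", substitute, text)


# Hypothetical source URL template; example.com is a placeholder host.
SOURCE = "http://example.com/downloads/%{name}-%{version}.tar.gz"

# Changing only the version value yields the URL of the new tarball.
old_url = expand_macros(SOURCE, {"name": "my-app", "version": "1.0"})
new_url = expand_macros(SOURCE, {"name": "my-app", "version": "2.0"})
print(old_url)
print(new_url)
```

If the URL contained a hardcoded 1.0 instead of the macro, changing the version value would have no effect on it, and the maintainer would have to edit the URL by hand.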
Another ROSA tool that uses URLs of source code archives is Updates Tracker. It analyzes the source URLs from spec files to decide where to look for new upstream tarballs.
Thus, URLs in package metadata are quite useful and actively used. But like any other Internet resources, they can disappear or change their location from time to time. Package maintainers should detect such situations and update the metadata accordingly; otherwise users will land on dead pages instead of application home sites, and Updates Tracker will look for new application versions in places where those versions will never appear.
For popular packages actively maintained by ROSA developers and the community, the metadata is usually updated manually. However, we have quite a few packages in the Contrib repository that are updated rarely or used by so few people that no maintainer pays much attention to them. In addition, manual metadata updates suffer from the common problems that arise when a human being performs a routine task - one can make a typo, use a wrong URL or just forget to update the data out of laziness or lack of time.
To solve this problem, it is necessary to automate routine tasks, and URL monitoring is definitely a good candidate for automation. For RPM packaging and maintenance, we don't need complex Web crawlers and trackers; instead, we need a specialized tool that analyzes our spec files, finds dead URLs and tries to find replacements for them.
This year we decided that developing such a tool would be a good task for students of the Russian Higher School of Economics during the two weeks they spend at ROSA as part of their practical work. As a result, we got two scripts - one that analyzes spec files and detects broken URLs, and another that finds replacements for them.
The first one, named find_dead_links, acts in a straightforward way - it analyzes a given set of spec files, extracts all URLs from them and checks the availability of every resource. The set of potentially dead URLs is dumped as output.
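The extract-and-check idea can be sketched in a few lines of Python. This is not the actual find_dead_links code; the function names, the URL regular expression and the injectable check parameter are assumptions made for illustration, and the sketch does not expand RPM macros inside URLs.

```python
import re
import urllib.request

# Rough pattern for http, https and ftp URLs (an assumption, not a full URL grammar).
URL_PATTERN = re.compile(r"(?:https?|ftp)://[^\s\"'<>]+")


def extract_urls(spec_text):
    """Return the unique URLs found in spec text, in order of appearance."""
    seen = []
    for url in URL_PATTERN.findall(spec_text):
        if url not in seen:
            seen.append(url)
    return seen


def is_alive(url, timeout=10):
    """Return True if the URL can be opened without a network or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (OSError, ValueError):
        # URLError, HTTPError and socket timeouts are all subclasses of OSError.
        return False


def find_dead(spec_text, check=is_alive):
    """Return the URLs from spec_text for which the check function fails."""
    return [url for url in extract_urls(spec_text) if not check(url)]


# Offline demonstration with a fake checker; example.com URLs are placeholders.
SPEC = (
    "URL: http://example.com/\n"
    "Source0: http://example.com/my-app-1.0.tar.gz\n"
)
dead = find_dead(SPEC, check=lambda url: url.endswith("/"))
print(dead)
```

Passing the checker as a parameter keeps the extraction logic testable without network access; a real run would simply call find_dead(spec_text) and use is_alive.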
The second script, named URLFixer, analyzes the output of find_dead_links and tries to find the new location of every resource. Currently it can't detect new locations of application home pages (though with current Web search technologies this doesn't seem impossible even for machines), but it can successfully detect new locations of source tarballs. As our investigation showed, most dead URLs appear for trivial reasons - upstream developers remove old tarballs from their site (which is a good sign for package maintainers that it is time to update the package) or repack a tarball in another format (e.g., in recent years it has become popular to switch to XZ compression). The tool checks whether one of these cases applies to the tarball under analysis and, if so, tries to determine the new URL.
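The repacking case lends itself to a short Python sketch. This is not URLFixer's actual algorithm; the suffix list and the function names are assumptions, and the sketch covers only the change of archive format, not the search for a newer version after an old tarball was removed.

```python
# Archive suffixes to try; the list and its order are assumptions for this sketch.
COMPRESSION_SUFFIXES = (".tar.gz", ".tar.bz2", ".tar.xz", ".tgz", ".zip")


def repacked_candidates(url):
    """Return the same URL with every other known archive suffix substituted."""
    for suffix in COMPRESSION_SUFFIXES:
        if url.endswith(suffix):
            base = url[: -len(suffix)]
            return [base + other for other in COMPRESSION_SUFFIXES if other != suffix]
    # Not a recognized archive URL: nothing to suggest.
    return []


def find_replacement(url, check):
    """Return the first candidate accepted by check, or None if there is none."""
    for candidate in repacked_candidates(url):
        if check(candidate):
            return candidate
    return None


# Offline demonstration: pretend upstream now ships only .tar.xz archives.
dead_url = "http://example.com/my-app-1.0.tar.gz"  # placeholder URL
fixed = find_replacement(dead_url, check=lambda url: url.endswith(".tar.xz"))
print(fixed)
```

In a real run, check would be a function that actually probes the candidate URL over the network.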
To be sure, these scripts are not complex at all (so it wasn't hard for a couple of students without Linux programming experience to develop them in two weeks), but when run against the ROSA Desktop Fresh repositories, they produced wonderful results. It turned out that 500 of our packages contained broken URLs, and about 80% of them were quickly fixed with the help of URLFixer.