A better way to verify file links!

Post by **zonfeld** » Fri Dec 16, 2022 6:02 pm

I maintain a ToDoList that contains a few hundred file links: concept papers, emails, other ToDoLists, copies of bug reports, spreadsheets, stuff like that. Many of those links point into a directory structure, which I keep synchronized between my local system and a network drive of a company I work with. For a host of reasons, folder names and file names are sometimes changed, mostly by me because I'm "continuously improving" things. To not upset me, the others don't complain.

From time to time, I have to verify the file links to avoid dangling references, which could bite those nice people or myself in the butt. My process is this:

I use XSLT to extract the FILEREFPATHS from the TDL file and store them in a text file that contains all file links (file_list.txt).
Here's the transformation (Extract_all_FILEREFPATH.xslt):

Code: Select all

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
	<xsl:output method="text" version="1.0" encoding="UTF-8"/>
	<xsl:template match="/">
		<xsl:for-each select="//FILEREFPATH">
			<xsl:value-of select="."/>
			<xsl:text>&#xd;</xsl:text>			
		</xsl:for-each>
	</xsl:template>
</xsl:stylesheet>

I run it using Saxon (on Windows 11, although that shouldn't matter), with this command line:

Code: Select all

java.exe -jar «path to saxon-he-11.3.jar» -s:«path to TDL» -o:"file_list.txt" -xsl:"Extract_all_FILEREFPATH.xslt"
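
For comparison, the same extraction can be done without XSLT or Java. This is a minimal sketch using Python's standard-library `xml.etree.ElementTree`; the function name `extract_filerefpaths` and the sample XML are made up for illustration, and only the `FILEREFPATH` element name comes from the stylesheet above.

```python
import xml.etree.ElementTree as ET

def extract_filerefpaths(xml_text):
    """Return the text of every FILEREFPATH element, in document order."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, much like the XPath //FILEREFPATH
    return [el.text for el in root.iter("FILEREFPATH") if el.text]

# Hypothetical, heavily simplified task list for demonstration only
sample = """<TODOLIST>
  <TASK ID="1"><FILEREFPATH>docs/concept.docx</FILEREFPATH></TASK>
  <TASK ID="2"><TASK ID="3"><FILEREFPATH>tdl://other.tdl?12</FILEREFPATH></TASK></TASK>
</TODOLIST>"""

paths = extract_filerefpaths(sample)
print("\n".join(paths))
```

For a real task list, `ET.parse(path).getroot()` would take the place of `ET.fromstring(...)`, so that the parser can honour the encoding declared in the file.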

Afterwards, I use Python to obtain a list of missing references by checking which files are missing from the directory structure. Here's my verify_file_list.py:

Code: Select all

# Check if files, obtained from a list, actually exist
import os

with open('file_list.txt', encoding='UTF-8') as linklist:
    chk_files = [line.rstrip() for line in linklist]

# remove hyperlinks
chk_files = [fname for fname in chk_files if "//" not in fname]

print("List of missing files:")
missing_files = []
for file in chk_files:
    if not os.path.isfile(file) and not os.path.isdir(file):
        missing_files.append(file)
        print(file)

# If there are missing files, save their names in a text file
if missing_files:
    with open("missing_files.txt", 'w', encoding='UTF-8') as output:
        for file in missing_files:
            output.write(file + '\n')

I correct the defective references manually.

Does anybody know a better way?

Just in case: "Stop changing stuff!" is not what I'm looking for. Thank you!

Post by **zonfeld** » Sat Dec 17, 2022 7:09 pm

After posting, I had a few ideas for improvement myself. I determined that it probably would be good to only have one file to run and I didn't want to deal with the command line any more. So I packed everything into a little script. I saw this as a good opportunity to get more familiar with Python, which I had ignored for too long.

Users are presented with a file selection dialog, which lets them choose the ToDoList to check.
If the user cancels the selection, the program terminates.
The working directory is set to the directory of the ToDoList; that makes it easier to check relative links.
An XML parser parses the ToDoList.
All file links and their respective parent's task IDs are extracted into a list.
(I included the ID to make it easier to find the problematic tasks when fixing the ToDoList afterwards.)
One by one the file links are checked.
1. If a file is not found, it is added to the list of missing references.
2. "tdl://" references get special treatment
  1. The protocol identifier is removed.
  2. Direct links within the ToDoList (tdl://1234) are ignored.
  3. For links to other ToDoLists (tdl://test.tdl?1234), potential task references (?1234) are removed.
  4. Escaped characters (tdl://todolist%20test.tdl) are unescaped.
  5. What remains is hopefully a file name, whose existence is then verified.
3. Any other kind of URL reference is ignored.
After processing all entries, a messagebox informs the user about the result.
To facilitate sorting by task ID or by file path, the report is stored in CSV format. (In case the report is opened with Excel, a few precautions are taken.) The report is saved in the same directory as the ToDoList.
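
The `tdl://` handling from step 2 can be isolated as a small function. This is only a sketch of the steps listed above, not the script's actual code; the name `tdl_link_to_path` is hypothetical.

```python
import urllib.parse

def tdl_link_to_path(link):
    """Turn a tdl:// link into a file path, or None for same-list task links."""
    if not link.startswith("tdl://"):
        raise ValueError("not a tdl:// link: " + link)
    rest = link[len("tdl://"):]          # 1. drop the protocol identifier
    if rest.isnumeric():                 # 2. tdl://1234 points into the same list
        return None
    rest = rest.split("?", 1)[0]         # 3. drop a trailing ?1234 task reference
    return urllib.parse.unquote(rest)    # 4. %20 becomes a space, and so on

print(tdl_link_to_path("tdl://1234"))                 # None
print(tdl_link_to_path("tdl://test.tdl?1234"))        # test.tdl
print(tdl_link_to_path("tdl://todolist%20test.tdl"))  # todolist test.tdl
```

Whatever the function returns (other than `None`) is what would then be passed to `os.path.isfile` or `os.path.isdir`.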

Code: Select all

# Check if files that are referenced by ToDoList file links actually exist.
# If not, write the Task IDs and file names to missing_files.csv.
# Hyperlinks are not checked.

import os
from sys import exit
from lxml import etree
import csv
import urllib.parse
import tkinter as tk
from tkinter import filedialog
from tkinter import messagebox

tk_root = tk.Tk()
tk_root.withdraw()

file_path = filedialog.askopenfilename(initialdir=".", title = "Choose ToDoList for verifying file links", filetypes=[("Abstractspoon ToDoList", ".tdl")])

if not os.path.isfile(file_path):
    exit(1)

# Set working directory to make sure that relative file links will be evaluated correctly.
os.chdir(os.path.dirname(file_path))

xml_parser = etree.XMLParser(remove_blank_text=True)
xml_tree = etree.parse(file_path, xml_parser)

# Gather all dangling file references and their task IDs
missing_files = []
refs = xml_tree.xpath("//FILEREFPATH")
for ref in refs:
    id = ref.getparent().attrib['ID']
    file = ref.text
    if not file:
        # Empty FILEREFPATH element, nothing to check
        continue
    # Check tdl:// links
    if file.startswith("tdl://"):
        file = file[6:]
        if file.isnumeric():
            # Link to a task in the same ToDoList, nothing to do
            continue
        # Remove direct task id references for other task lists, such as tdl://test.tdl?1234
        id_pos = file.find("?")
        if id_pos != -1:
            file = file[:id_pos]
        # clean up %20 characters as in tdl://todolist%20test.tdl
        file = urllib.parse.unquote(file)

        if not os.path.isfile(file) and not os.path.isdir(file):
            missing_files.append({"id": id, "file": file})
        # A tdl:// link has been handled completely at this point
        continue
    # All other protocols/hyperlinks are ignored
    if "//" not in file:
        if not os.path.isfile(file) and not os.path.isdir(file):
            missing_files.append({"id": id, "file": file})

if not missing_files:
    messagebox.showinfo(title="Process completed", message="Congratulations! There are no dangling file references.")
    exit(0)

# Save the names and the parent ID in a CSV file
messagebox.showinfo(title="Process completed", message= str(len(missing_files)) + " defective file links were found.")

# newline='' is necessary to prevent Excel from showing empty lines.
# utf-8-sig needs to be chosen as encoding because Excel expects a BOM, 
# otherwise umlauts are not properly displayed.
with open("missing_files.csv", 'w', newline='',encoding='utf-8-sig') as csvfile:
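    # Use the standard double quote as quote character; Excel does not
    # recognize '|' and would split paths that contain a comma.
    csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)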

    # CSV header row
    csvwriter.writerow(["Task ID", "File Link"])

    for entry in missing_files:
        csvwriter.writerow([entry['id'], entry['file']])
exit(0)

Well, that's one way of spending a Saturday afternoon.

Suggestions are still welcome.

Post by **abstr** » Mon Dec 19, 2022 12:12 am

Great job!

I moved this to the 'Tips and Tricks' forum and changed the title from a question to a statement!

Q: Would it also be feasible to add parsing of the 'COMMENTS' field, which is the text version of the (custom) comments for each task?

Also, if you allowed the script to accept a tasklist path as an alternative to browsing, you could set it up as a UDT in the preferences...
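
A minimal sketch of that idea: take the path from the command line when one is given and fall back to browsing otherwise. The function name `choose_tasklist` and the `ask_user` callback are hypothetical; in the real script the callback would be the file dialog.

```python
def choose_tasklist(argv, ask_user):
    """Return the task list path from argv, or fall back to ask_user().

    argv is expected to look like sys.argv; ask_user is any callable that
    returns a path (for example a file dialog) or an empty string on cancel.
    """
    if len(argv) > 2:
        raise ValueError("expected at most one argument: the task list path")
    if len(argv) == 2:
        return argv[1]
    return ask_user()

# Simulated invocations; a real script would pass sys.argv instead
print(choose_tasklist(["script.py", "work.tdl"], lambda: "picked.tdl"))  # work.tdl
print(choose_tasklist(["script.py"], lambda: "picked.tdl"))              # picked.tdl
```

Keeping the dialog behind a callback means the selection logic can be exercised without opening a window.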

Post by **zonfeld** » Mon Dec 19, 2022 9:08 pm

Those are excellent ideas. Thank you very much!

Accepting a task list path as an argument is the easy part. Parsing the COMMENTS element is much harder. While finding the start of a file link (tdl:// or file://) in a bunch of regular text is straightforward, determining where it ends is not as easy as one might think.

For example, when dragging a file path that contains spaces into the comments control, the path is embedded in angle brackets to separate it from the surrounding text. Without the brackets, any whitespace character terminates the file path; with the brackets, spaces may be part of it. Another example: the colon is a legal character in a file path, but it may appear only once (if at all) after the protocol identifier, and it must be preceded by a letter. The list of things to consider when constructing suitable search patterns gets quite long.
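
To make the bracket rule concrete, here is a simplified sketch with two patterns: one for links in angle brackets, where spaces are allowed, and one for bare links that end at the first whitespace. The pattern names and the sample text are made up, and the sketch deliberately ignores the colon rule and other restrictions on legal file names.

```python
import re

# <file://...> or <tdl://...>: everything up to the closing bracket, spaces allowed
BRACKETED = re.compile(r"<((?:file|tdl)://[^<>]*)>")
# Bare link: not directly preceded by "<", ends at whitespace or an angle bracket
BARE = re.compile(r"(?<!<)\b((?:file|tdl)://[^\s<>]+)")

def find_links(comments):
    """Return all file:// and tdl:// links found in a comments string."""
    return BRACKETED.findall(comments) + BARE.findall(comments)

text = "See <file://C:/some folder/some file.txt> and tdl://other.tdl?12 for details."
print(find_links(text))
```

The negative lookbehind `(?<!<)` keeps the second pattern from matching the same bracketed link a second time.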

I'm not sure if I will be able to catch all potential variations of file names but I hope I'll cover the most common occurrences. In any case, this is a fun little project and an unexpected but welcome refresher on regular expressions.

Post by **abstr** » Mon Dec 19, 2022 11:03 pm

I had some more thoughts in the shower this morning:

1. I'd be happy to add the script to the ToDoList_Resources repo and include it with the download
2. If it occurred to you to write other scripts in the future, these could also be added to the repo
3. If you had a GitHub login you would be able to maintain these scripts yourself

No pressure

Post by **zonfeld** » Tue Dec 20, 2022 12:39 am

Thanks a lot. That's very nice of you!

I would like that but the script needs to be a bit more polished for the official repo, I think. It has a bunch of dependencies that need to be installed and I read somewhere that Python users expect a requirements.txt to be provided for automatic processing. And, of course, it hasn't been tested thoroughly.

I do have a GitHub account but it has been dormant. It seems, however, that this would be a good opportunity to dive into GitHub a bit more.

I managed to get the script to work as a user-defined tool. Yay!

Here's the current source code, which does not fail catastrophically (all the time) but still needs a lot of work:

Code: Select all

# Check if files that are referenced by ToDoList file links actually exist.
# If not, write the Task IDs and file names to missing_files.csv.
# Hyperlinks are not checked.

import os
from sys import exit
from sys import argv
from lxml import etree
import csv
import tkinter as tk
from tkinter import filedialog
from tkinter import messagebox
import urllib.parse
import re

# Format tdl:// links so that they can be used with file operations
def format_tdl_protocol(link):
    # Remove the protocol identifier
    link = link[6:]
    # Remove direct task id references for other task lists, such as tdl://test.tdl?1234
    id_pos = link.find("?")
    if id_pos != -1:
        link = link[:id_pos]
    return link

# Verify that a file exists; if it is missing, add it to the list of missing files
def check_and_add(id, link, missing_files):
    # clean up %20 characters as in tdl://todolist%20test.tdl
    link = urllib.parse.unquote(link)

    if not os.path.isfile(link) and not os.path.isdir(link):
        missing_files.append({"id": id, "file": link})
        return True
    return False

# Gather defective links from the FILEREFPATH element
def process_FILEREFPATH(xml_tree):
    missing_files = []
    refpaths = xml_tree.xpath("//FILEREFPATH")
    for refpath in refpaths:
        id = refpath.getparent().attrib['ID']
        link = refpath.text
        if not link:
            # Empty FILEREFPATH element, nothing to check
            continue
        # Check tdl:// links
        if link.startswith("tdl://"):
            link = format_tdl_protocol(link)
            if link.isnumeric():
                # Link to a task in the same ToDoList, nothing to do
                continue
            if check_and_add(id, link, missing_files):
                continue
        # All other protocols/hyperlinks are ignored
        if "//" not in link:
            check_and_add(id, link, missing_files)
    return missing_files, len(refpaths)

# Gather defective links from the COMMENTS element
def process_COMMENTS(xml_tree):
    # Check list of tdl:// links
    def process_tdls(proc_id, matches, missing_files):
        for match in matches:
            link = format_tdl_protocol(match[0])
            if link.isnumeric():
                # Link to a task in the same ToDoList, nothing to do
                continue 
            check_and_add(proc_id, link, missing_files)

    missing_files = []
    checked_links = 0
    refs = xml_tree.xpath("//COMMENTS")
    for ref in refs:
        id = ref.getparent().attrib['ID']
        # An empty COMMENTS element has no text (None); treat it as an empty string
        comments = ref.text or ""

        # Check for tdl:// that includes spaces, marked by a "<" character
        # (raw strings avoid invalid escape sequences such as "\w" in normal strings)
        matches = re.findall(r"<(tdl://([a-zA-Z]:)?[/\.\w\s\-,()_]*)>", comments)
        process_tdls(id, matches, missing_files)
        checked_links += len(matches)

        # Check for tdl:// without spaces; the lookbehind (?<!<) skips bracketed
        # links and, unlike [^<], also matches a link at the very start of the text
        matches = re.findall(r'(?<!<)(tdl://([a-zA-Z]:)?[^\s<>*"|?]*)', comments)
        process_tdls(id, matches, missing_files)
        checked_links += len(matches)

        # Check file links that include spaces, embedded in "<>" characters
        matches = re.findall(r"<file:///?(([a-zA-Z]:)?[/\.\w\s\-,()_]*)>", comments)
        for match in matches:
            check_and_add(id, match[0], missing_files)
        checked_links += len(matches)

        # Check file links without spaces
        matches = re.findall(r"(?<!<)file:///?(([a-zA-Z]:)?([^\s<>]|[/\.\w\-,()_])*)", comments)
        for match in matches:
            check_and_add(id, match[0], missing_files)
        checked_links += len(matches)

    return missing_files, checked_links

# Save the missing files report as "<<ToDoList name>>_missing_files.csv"
def save_csv_report(file_path, missing_files):
    # newline='' is necessary to prevent Excel from showing empty lines.
    # utf-8-sig needs to be chosen as encoding because Excel expects a BOM, 
    # otherwise umlauts are not properly displayed.
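    # splitext removes the extension regardless of its length.
    report_path = os.path.splitext(file_path)[0] + "_missing_files.csv"
    with open(report_path, 'w', newline='', encoding='utf-8-sig') as csvfile:
        # Use the standard double quote as quote character; Excel does not
        # recognize '|' and would split paths that contain a comma.
        csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)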

        # CSV header row
        csvwriter.writerow(["Task ID", "File Link"])

        for entry in missing_files:
            csvwriter.writerow([entry['id'], entry['file']])    

# Execution of the main routine starts here
tk_root = tk.Tk()
tk_root.withdraw()

# Try to load the ToDoList icon for the dialogs
# It is assumed to be in the same directory as the script
icon_path = os.path.join(os.path.dirname(__file__), 'ToDoList_2004.ico')
if os.path.isfile(icon_path):
    tk_root.iconbitmap(icon_path)

# Get the file name for the task list from an argument or
# by means of a file open dialog
if len(argv) > 2:
    messagebox.showinfo(title="Too many arguments", 
    message="Please provide \n* no argument or \n* the file name of the ToDoList as the only argument.")
    exit(1)
elif len(argv) > 1:
    file_path = argv[1]
else:
    file_path = filedialog.askopenfilename(initialdir=".", title = "Choose ToDoList for verifying file links", filetypes=[("Abstractspoon ToDoList", ".tdl")])

if not os.path.isfile(file_path):
    if len(argv) > 1:
        messagebox.showinfo(title="File not found", 
        message = "The file \n" + file_path + "\ncould not be found.")        
    exit(1)

# Set working directory to make sure that relative file links will be evaluated correctly.
# The path is made absolute first so that it stays valid after the directory change.
file_path = os.path.abspath(file_path)
os.chdir(os.path.dirname(file_path))

xml_parser = etree.XMLParser(remove_blank_text=True)
xml_tree = etree.parse(file_path, xml_parser)

# Gather all dangling file references and their task IDs
missing_files = []
num_links = 0
missing_filerefpath_files, num_filerefpath_links = process_FILEREFPATH(xml_tree)
missing_comments_files, num_comments_links = process_COMMENTS(xml_tree)
missing_files = missing_filerefpath_files + missing_comments_files
num_links = num_filerefpath_links + num_comments_links

if not missing_files:
    messagebox.showinfo(title="Process completed", message = str(num_links) + " links checked.\n" + "Congratulations! There are no dangling file references.")
    exit(0)

if len(missing_files) > 1:
    msg_text = " defective file links were found"
else:
    msg_text = " defective file link was found"

messagebox.showinfo(title="Process completed", message = str(num_links) + " links checked.\n" + str(len(missing_files)) + msg_text)

save_csv_report(file_path, missing_files)            
exit(0)

Post by **abstr** » Tue Dec 20, 2022 12:59 am

zonfeld wrote: "I would like that but the script needs to be a bit more polished for the official repo"

I completely understand your desire to get it 'right' (or rather 'not to get it wrong'), though I would add that it's a very safe script which doesn't modify the tasklist itself, so the very worst that can happen is that it doesn't work perfectly, which is always fixable.

zonfeld wrote: "I read somewhere that Python users expect a requirements.txt to be provided for automatic processing"

Yes, that does appear to be the case though, again, I would add that you could start by just documenting the requirements at the top of the script itself.
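
One way to do that, sketched here under the assumption that `lxml` is the only third-party import (as in the script posted above): list the requirement in a comment and check for it at startup. The helper name `missing_modules` is hypothetical.

```python
# Requirements (third-party): lxml   ->  pip install lxml
# The other imports are part of the standard library (tkinter is packaged
# separately on some Linux distributions).
import importlib.util

def missing_modules(names):
    """Return those module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(["lxml"])
if missing:
    print("Please install: " + ", ".join(missing))
```

A check like this gives a readable hint instead of an `ImportError` traceback when the script is started without its dependencies.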

Bear in mind too that the script will probably be sitting in a folder called 'Resources\Scripts\Python' and people will need to go find it, so the take-up will probably be slow at first...

Post by **zonfeld** » Tue Dec 20, 2022 1:15 am

Feel free to add the script to ToDoList any way you want whenever you want. I just don't want to ruin the reputation of your wonderful software with half-baked additions from my first go at Python.

With Christmas around the corner, it seems I'll find some time to clean everything up so that it shouldn't stay too embarrassing for too long. And I'll try to figure out how collaboration on GitHub actually works instead of just pulling stuff from other repos. Thank you for your encouragement!

Post by **zonfeld** » Tue Dec 20, 2022 3:16 pm

I came up with a name for the tool: File Link Verifier for ToDoList

...or a bit shorter: filiverto

I also created a repo on GitHub for better access to source code and documentation: https://github.com/schnodo/filiverto

Post by **zonfeld** » Wed Dec 21, 2022 11:18 pm

I improved the patterns. Accuracy seems acceptable. filiverto is now reporting the full link as it was matched by the regular expression so that it's easier for the user to locate it in the text.

In ToDoList's Introduction.tdl it reports this:

Code: Select all

Task ID,File Link
17,file://C:/some folder/some file.txt
17,file://C:/somefolder/somefile.txt
23,file://todolist.exe%20-cmd%2032828

Considering the fact that in some cases it's extremely hard to distinguish between a file name and a file name followed by command line arguments, that seems okay. Surprising as it may be, "todolist.exe -cmd 32828" is a legal file name.

The report for ToDoListDocumentation.tdl contains the following, which looks mostly reasonable, as well.

Code: Select all

Task ID,File Link
606,file:///c:/boot.ini
732,tdl://420.
751,file://')
752,tdl://'
757,file://C:/boot.ini
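
Entries such as `tdl://420.` and `file://')` look like matches that picked up punctuation from the surrounding sentence. One possible refinement, which is not part of filiverto as posted here, would be to trim trailing punctuation before checking a link. The helper name and the set of trimmed characters are assumptions, and the approach would misjudge file names that really end in one of these characters.

```python
# Characters that often follow a link in running text (an assumed, incomplete set)
TRAILING = ".,;:!?'\")"

def trim_trailing_punctuation(link):
    """Remove punctuation that was probably part of the sentence, not the link."""
    return link.rstrip(TRAILING)

for raw in ["tdl://420.", "file://')", "file:///c:/boot.ini"]:
    print(repr(raw), "->", repr(trim_trailing_punctuation(raw)))
```

After trimming, `tdl://420` would be recognized as a link to a task in the same list, and an empty `file://` could simply be skipped.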

Some real world examples worked fine, too.
