
Rewrite parser in a formal grammar #25

Closed
campb303 opened this issue Apr 9, 2021 · 34 comments
Assignees
Labels: enhancement (Request for a change to existing functionality), question (Something that requires more information before moving forward)

Comments

@campb303
Collaborator

campb303 commented Apr 9, 2021

The parser as it stands is functional but brittle. It needs to be rewritten as library code with exceptions, context managers, and other abstractions.

@campb303 campb303 added enhancement Request for a change to existing functionality high-priority Needs immediate extra focus labels Apr 9, 2021
@campb303 campb303 added this to the production-ready milestone Apr 9, 2021
@campb303 campb303 removed the high-priority Needs immediate extra focus label Apr 28, 2021
@benne238
Collaborator

Rewriting the parser using a formal grammar

While we are able to successfully parse any given item and return an appropriate JSON structure, our current implementation of the ECNQueue parser is over 1,000 lines long and is difficult to read. There are a couple of solutions to this problem:

  1. Change how output is parsed by using a clearly defined set of rules and a formal grammar
  2. Separate out the different functions of the ECNQueue into multiple scripts to make it more readable and more modular

Using a formal grammar

Currently, the parser has no clearly defined, easy-to-read set of rules; it is a series of if statements threaded through the entire program, which hurts readability. The program does, however, follow a simple logic:

  1. Read one line from the item
  2. Do something with that line based on its contents and if it is a delimiter
  3. Read the next line

However, this gets cumbersome as items grow in size because the logic must be repeated for each and every line. Using a formal grammar could alleviate two problems: readability and complexity.
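The three steps above can be sketched as a plain line-by-line loop. This is a hypothetical minimal sketch: the `parse_lines` name and the `===` prefix check are illustrative stand-ins, not the real ECNQueue delimiter rules.

```python
# Hypothetical sketch of the line-by-line logic described above.
# The "===" prefix check stands in for the parser's real delimiter rules.
def parse_lines(lines):
    sections = []
    current = []
    for line in lines:                  # 1. read one line from the item
        if line.startswith("==="):      # 2. act on it if it is a delimiter
            if current:
                sections.append(current)
            current = []
        else:
            current.append(line)        # ...otherwise accumulate it
    if current:                         # 3. repeat until the item is exhausted
        sections.append(current)
    return sections

parse_lines(["=== start ===", "a", "b", "=== end ===", "c"])
# → [["a", "b"], ["c"]]
```

Every new section type means another branch in a loop like this, which is exactly the complexity a formal grammar would move into declarative rules.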

Several things stand out about how PLY (and some of the other parsers) work:

  1. They rely heavily on regex
  2. They don't take input line by line; rather, all input is passed as one large string with newlines included
  3. With PLY specifically, all of the rules/delimiters are defined at the top

For us, this means a series of regular expressions would have to be developed to separate each section of an item. However, this has the potential to greatly reduce the length and complexity of the parser and make it easy to read.
It is worth noting that these regular expressions could also be written without using any of the parser packages in the list above.
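As a rough illustration of the regex-only route (not code from the parser), a single expression with Python's built-in re module could capture the body of each "additional information supplied by user" block between its delimiters:

```python
import re

# Illustrative only: capture the body of each "additional information
# supplied by user" block between its delimiters.
SECTION = re.compile(
    r"=== Additional information supplied by user ===\n"
    r"(.*?)\n"          # the body, matched non-greedily
    r"=+",              # the closing line of equals signs
    re.DOTALL,          # let "." cross newlines inside the body
)

text = (
    "=== Additional information supplied by user ===\n"
    "Hello\n"
    "===============================================\n"
)
SECTION.findall(text)
# → ["Hello"]
```

A real rule set would need one such expression per section type, which is where a parser library starts to pay off in organization.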

Separating the parser into multiple scripts

This step, depending on our implementation, should be done to reduce the amount of code in any one script and increase the readability of the backend. In the parser's current state, it would make sense to split the parser by function (and maybe break the main parser function into multiple functions). Seeing as the parser will be rewritten anyway, it might make more sense not to split it up until a formal grammar is in use.

@campb303 campb303 modified the milestones: production-ready, write-access May 17, 2021
@campb303 campb303 added the high-priority Needs immediate extra focus label May 17, 2021
@campb303 campb303 added the question Something that requires more information before moving forward label May 17, 2021
@campb303
Collaborator Author

Summary of Parsing In Python: Tools And Libraries mentioned above:

Parsers come in one of three variants:

  1. Use an existing library if one exists.
  2. Build a completely custom parser if deep integration is needed.
  3. Use a parser generator by writing a grammar and generating the logic.

Parsers are generally structured as two tools:

  1. A lexer that processes raw input and produces tokens
    ex: 1 + 1 => Lexer => NUM(1), OPERATOR("+"), NUM(1)
  2. A parser that matches patterns of tokens
    ex: ADD = NUM OPERATOR NUM; NUM(1), OPERATOR("+"), NUM(1) => ADD

(Some parsers do not have a lexer and analyze raw input directly. These are called scannerless parsers.)
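The lexer/parser split above can be sketched in a few lines of toy Python. The NUM/OPERATOR/ADD names follow the example; everything else here is illustrative.

```python
# Toy lexer and parser; NUM/OPERATOR/ADD follow the example above,
# everything else is made up for illustration.
def lex(text):
    tokens = []
    for part in text.split():
        if part.isdigit():
            tokens.append(("NUM", int(part)))   # number token
        else:
            tokens.append(("OPERATOR", part))   # operator token
    return tokens

def parse(tokens):
    # matches the single pattern ADD = NUM OPERATOR NUM
    if [kind for kind, _ in tokens] == ["NUM", "OPERATOR", "NUM"]:
        return ("ADD", tokens[0][1], tokens[2][1])
    raise ValueError("no rule matched")

parse(lex("1 + 1"))
# → ("ADD", 1, 1)
```

A scannerless parser would collapse `lex` and `parse` into one pass over the raw text.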

We currently have a custom parser built from scratch. However, it is slow and difficult to maintain. Of the three options above, the first is out because there is likely no pre-written library for this job beyond what we have already written ourselves, and the second is what we already have.

So the operative questions, in order, are:

  1. If we choose to create a formal grammar and write a new parser, what tools should we use?
  2. Do those tools present a significant performance increase?
  3. Do those tools ease the maintenance process?

Related questions include:

  • Are there others at ECN who have written parsers before and could help us?

@campb303
Collaborator Author

Mark Senn is experienced with Perl, a language known for its data processing abilities, and has experience with writing parsers in Perl. He believes that Perl's parsing support would be a great candidate for the job.

A quick Google search shows very simple parsers with very little code. Ideally, I'd like to write a parser in Python to avoid having to (de)serialize data. More research is needed.

@benne238
Collaborator

pyparsing

The pyparsing module looks like a Python-native way to create custom parsing grammars directly in Python.

Simple example:

# https://pyparsing-docs.readthedocs.io/en/latest/HowToUsePyparsing.html#hello-world
import pyparsing as pp

greet = pp.Word(pp.alphas) + "," + pp.Word(pp.alphas) + "!"
for greeting_str in [
            "Hello, World!",
            "Bonjour, Monde!",
            "Hola, Mundo!",
            "Hallo, Welt!",
        ]:
    greeting = greet.parseString(greeting_str)
    print(greeting)

Output:

['Hello', ',', 'World', '!']
['Bonjour', ',', 'Monde', '!']
['Hola', ',', 'Mundo', '!']
['Hallo', ',', 'Welt', '!']

This looks like a relatively easy module to use and should make it easier to create a formal grammar that can parse out the different sections within a given item.
Because the module is written in Python, there is no need to (de)serialize data from an outside parser.

Possible Issues

The documentation is not the greatest, and some of the advanced features we might want to use are poorly documented, so the tool's learning curve will be steepened by its documentation.

After attempting to write a basic program to match the string located between two delimiters, I found this tool difficult to use due to the lack of information available on it. An example of a parser that might separate out all of the "additional information from user" sections from a given item could look like this:

Example.py:

import pyparsing as pp
import string


info_from_user = (
    pp.Literal("=== Additional information supplied by user ===\n") + 
    pp.Word(string.printable, excludeChars="=") + 
    pp.Literal("===============================================\n")
)

reply = ("=== Additional information supplied by user ===\n"+
    "\n"+
    "Subject: stuff\n"+
    "From: someone\n"+
    "Date: some time\n"+
    "X-ECN-Queue-Original-Path: https://google.com\n"+
    "X-ECN-Queue-Original-URL: https://amazon.com\n"+
    "\n"+
    "Thanks.\n"
    "\n"+
    "===============================================\n"+
    "This shouldn't be matched"
)

print(info_from_user.parseString(reply))

One problem with this example, though, is that an equal sign anywhere in the body (outside the delimiters themselves) causes an exception, since excludeChars="=" forbids it. While this approach has potential, it is difficult to use.
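One possible workaround, sketched here as an assumption rather than something tested against real items, is pyparsing's SkipTo: it scans forward until the closing delimiter matches, so a lone "=" in the body is harmless (unlike the excludeChars approach above).

```python
import pyparsing as pp

# Sketch, not tested against real items: SkipTo reads up to the closing
# delimiter, so stray "=" characters in the body do not break the match.
start = pp.Literal("=== Additional information supplied by user ===")
end = pp.Literal("===============================================")
info_from_user = start.suppress() + pp.SkipTo(end)("body") + end.suppress()

sample = (
    "=== Additional information supplied by user ===\n"
    "2 + 2 = 4 would break the excludeChars version\n"
    "===============================================\n"
)
result = info_from_user.parseString(sample)
# result["body"] holds the text between the delimiters
```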

@benne238
Collaborator

Examples of Parsers

After talking with @campb303, there seems to be particular interest in ANTLR due to its popularity. However, to get a sense of all the different parsers and the unique tools each (might) provide, a small write-up with an example will be posted in follow-up comments on this issue.

@campb303
Collaborator Author

Next steps are to create an example that demonstrates a parser's benefits beyond what we already have.

@benne238
Collaborator

benne238 commented May 24, 2021

update

Note: this is barebones, as I am still learning and attempting to use all of the different features of pyparsing, but the current code can look through a given string and return a list of all the "additional info from user" sections in only a few lines of code.

#https://stackoverflow.com/questions/35073566/custom-delimiter-using-pyparsing


import pyparsing as pp

#finds all of the reply from user delimiters
info_from_user_start_delimiter = "=== Additional information supplied by user ==="
info_from_user_end_delimiter = "==============================================="

info_from_user = (
    pp.originalTextFor( # returns the matched text exactly as it appeared before parsing
        pp.nestedExpr( # defines start and end delimiters; everything in between is matched (including the delimiters)
            info_from_user_start_delimiter, 
            info_from_user_end_delimiter
        )
    )
)

item = ("This shouldn't be matched\n" +
    "\n=== Additional information supplied by user ===\n"+
    "\n"+
    "Subject: stuff\n"+
    "From: someone\n"+
    "Date: some time\n"+
    "X-ECN-Queue-Original-Path: https://google.com\n"+
    "X-ECN-Queue-Original-URL: https://amazon.com\n"+
    "\n"+
    "This is a message.\n"+
    "Thanks,\n"+
    "me"+
    "\n"+
    "===============================================\n"+
    "This shouldn't be matched With anything\n" +
    "Neither should this\n"+
    "\n" +
    "*** Status updated by: me at: now ***\n"+
    "no match here either\n" +
    "=== Additional information supplied by user ===\n"+
    "\n"+
    "Subject: more\n"+
    "From: not me\n"+
    "Date: right meow\n"+
    "X-ECN-Queue-Original-Path: https://gogle.com\n"+
    "X-ECN-Queue-Original-URL: https://amzon.com\n"+
    "\n"+
    "This is a message that should be matched too.\n"+
    "Thanks,\n"+
    "me"+
    "\n"+
    "===============================================\n"
)

parsed_item = (info_from_user.searchString(item)).asList()

print(parsed_item)

Output:

[
   [
    '=== Additional information supplied by user ===\n\nSubject: stuff\nFrom: someone\nDate: some time\nX-ECN-Queue-Original-Path: https://google.com\nX-ECN-Queue-Original-URL: https://amazon.com\n\nThis is a message.\nThanks,\nme\n==============================================='
   ], 
   [
     '=== Additional information supplied by user ===\n\nSubject: more\nFrom: not me\nDate: right meow\nX-ECN-Queue-Original-Path: https://gogle.com\nX-ECN-Queue-Original-URL: https://amzon.com\n\nThis is a message that should be matched too.\nThanks,\nme\n==============================================='
   ]
]

@benne238
Collaborator

Pyparsing benefits and drawbacks

This code parses most of the sections of any given item (without error handling). One limitation at the moment is differentiating which section a given list came from: as seen in the output below, nothing indicates which section was parsed except the content of the section itself. I believe there is a simple way to do this, but I'll need to look into it more.

Drawbacks:

  1. poor documentation
  2. performance (as indicated in the pyparsing GitHub wiki), though I will look into how to improve the performance of the code below tomorrow

Benefits:

  1. clear and easy to read
  2. significantly fewer lines of code to do many of the things we are already doing with parsing

Updated code:

import pyparsing as pp

#finds all of the reply_rule from user delimiters
info_from_user_start_delimiter = "=== Additional information supplied by user ==="
info_from_user_end_delimiter = "==============================================="

info_from_user_rule = pp.originalTextFor( # returns the matched text exactly as it appeared before parsing
    pp.nestedExpr( # defines start and end delimiters; everything in between is matched (including the delimiters)
        info_from_user_start_delimiter, 
        info_from_user_end_delimiter
    )
)

# finds all status updates
status_update_rule = pp.originalTextFor(
    pp.Regex(r"(\*{3} Status updated by: )(.*)(at: (.*)\*{3})") +
    pp.SkipTo((pp.LineEnd() + info_from_user_start_delimiter | "***"))
)

# finds all edits
edit_rule = pp.originalTextFor(
    pp.Regex(r"(\*{3} Edited by: )(.*)(at: (.*)\*{3})") +
    pp.SkipTo((pp.LineEnd() + info_from_user_start_delimiter | "***"))
)

#finds all ecn replies
reply_rule = pp.originalTextFor(
    pp.Regex(r"(\*{3} Replied by: )(.*)(at: (.*)\*{3})") +
    pp.SkipTo((pp.LineEnd() + info_from_user_start_delimiter | "***"))
)

# combination of all the defined rules from above
parse_item = (info_from_user_rule | status_update_rule | edit_rule | reply_rule)

item = ("\n=== Additional information supplied by user ===\n"+
    "\n"+
    "Subject: stuff\n"+
    "From: someone\n"+
    "Date: some time\n"+
    "X-ECN-Queue-Original-Path: https://google.com\n"+
    "X-ECN-Queue-Original-URL: https://amazon.com\n"+
    "\n"+
    "This is a message.\n"+
    "Thanks,\n"+
    "me"+
    "\n"+
    "===============================================\n"+
    "*** Status updated by: you at: yesterday ***\n"+
    "more status stuff\n" +
    "\n\n\n\n"+
    "*** Status updated by: that_guy at: tmrw ***\n"+
    "status update\n" +
    "\n"+
    "*** Status updated by: me at: now ***\n"+
    "this is a status update\n" +
    "*** Edited by: someoneelse at: 03/03/21 10:09:52 ***\n" +
    "this is an edit\n"+
    "=== Additional information supplied by user ===\n"+
    "\n"+
    "Subject: more\n"+
    "From: not me\n"+
    "Date: right meow\n"+
    "X-ECN-Queue-Original-Path: https://gogle.com\n"+
    "X-ECN-Queue-Original-URL: https://amzon.com\n"+
    "\n"+
    "This is a message that should be matched too.\n"+
    "Thanks,\n"+
    "me"+
    "\n"+
    "===============================================\n"
)

# prints the output after searching through the item using the parse_item rule
print(parse_item.searchString(item))

Output:

[
   [
      "=== Additional information supplied by user ===\n\nSubject: stuff\nFrom: someone\nDate: some time\nX-ECN-Queue-Original-Path: https://google.com\nX-ECN-Queue-Original-URL: https://amazon.com\n\nThis is a message.\nThanks,\nme\n==============================================="
   ],
   [
      "*** Status updated by: you at: yesterday ***\nmore status stuff"
   ],
   [
      "*** Status updated by: that_guy at: tmrw ***\nstatus update"
   ],
   [
      "*** Status updated by: me at: now ***\nthis is a status update"
   ],
   [
      "*** Edited by: someoneelse at: 03/03/21 10:09:52 ***\nthis is an edit"
   ],
   [
      "=== Additional information supplied by user ===\n\nSubject: more\nFrom: not me\nDate: right meow\nX-ECN-Queue-Original-Path: https://gogle.com\nX-ECN-Queue-Original-URL: https://amzon.com\n\nThis is a message that should be matched too.\nThanks,\nme\n==============================================="
   ]
]

@benne238
Collaborator

Performance notes

After talking with @campb303: the parser above is significantly slower than what we currently have. The current parser handles a specific item in about 0.1 seconds; the new parser above takes about 0.5 seconds on the same item. While the tradeoff is significantly more readable code, we decided that is not worth such a large hit in performance. There may be ways to optimize the pyparsing code above; if not, I should look into other parsers, specifically more performant ones (ANTLR was mentioned as a candidate).

@benne238
Collaborator

Dictionary of replies

After some tinkering, I was able to return a list of dictionaries that identify their respective sections (see output below).
This is still proof-of-concept code and will most likely be refactored; however, it is easy to see that this code outputs data structures very similar to what we have already implemented.

https://stackoverflow.com/questions/29282878/distinguish-matches-in-pyparsing (source for some of the functions below)

import pyparsing as pp

def makeDecoratingParseAction(marker):
    def parse_action_impl(s, l, t): # need to look into how this function works
        #print(t)
        fullcontent = ''.join(t)
        return {"type": marker, 
            "content": fullcontent.split("\n")
            }
    return parse_action_impl

def formatReplyFromUser():
    def parse_action_impl(t):
       # print(t)
        return {
            "type": "reply_from_user",
            "subject": t[2].strip(),
            "from": t[4].strip(),
            "date": t[6].strip(),
            t[7]: t[8].strip(),
            t[9]: t[10].strip(),
            "content": t[11]
        }
    return parse_action_impl

#finds all of the reply_rule from user delimiters
info_from_user_start_delimiter = "=== Additional information supplied by user ==="
info_from_user_end_delimiter = "\n==============================================="

info_from_user_rule = (
    (
        # matches everything between the two info_from_user delimiters
        info_from_user_start_delimiter +
        "\n\n" +
        pp.Regex("Subject: ") + pp.Regex("(.*)\\n") +
        pp.Regex("From: ") + pp.Regex("(.*)\\n") +
        pp.Regex("Date: ") + pp.Regex("(.*)\\n")+
        pp.Regex("X-ECN-Queue-Original-Path: ") + pp.Regex("(.*)\\n") +
        pp.Regex("X-ECN-Queue-Original-URL: ") + pp.Regex("(.*)\\n") +      
        pp.SkipTo(info_from_user_end_delimiter)
    ).setWhitespaceChars('') # ensures all whitespace is captured from the item
).setParseAction(formatReplyFromUser()) # creates a dictionary

status_update_rule = (
    (
        pp.Regex(r"(\*{3} Status updated by: )(.*)(at: (.*)\*{3})") +
        pp.SkipTo((pp.LineEnd() + pp.Regex(info_from_user_start_delimiter) | pp.Regex(r"\*\*\*") ))

    ).setWhitespaceChars("")
).setParseAction(makeDecoratingParseAction("status_update"))

edit_rule = ((
    pp.Regex(r"(\*{3} Edited by: )(.*)(at: (.*)\*{3})") +
    pp.SkipTo((pp.LineEnd() + pp.Regex(info_from_user_start_delimiter) | pp.Regex(r"\*\*\*")))
).setWhitespaceChars("")).setParseAction(makeDecoratingParseAction("edit"))

reply_rule = ((
    pp.Regex(r"(\*{3} Replied by: )(.*)(at: (.*)\*{3})") +
    pp.SkipTo((pp.LineEnd() + pp.Regex(info_from_user_start_delimiter) | pp.Regex(r"\*\*\*")))
).setWhitespaceChars("")).setParseAction(makeDecoratingParseAction("reply"))

parse_item = (info_from_user_rule | status_update_rule | edit_rule | reply_rule) # searches for each of these in the item


item = ("\n=== Additional information supplied by user ===\n"+
    "\n"+
    "Subject: stuff\n"+
    "From: someone\n"+
    "Date: some time\n"+
    "X-ECN-Queue-Original-Path: https://google.com\n"+
    "X-ECN-Queue-Original-URL: https://amazon.com\n"+
    "\n"+
    "This is a message.\n"+
    "Thanks,\n"+
    "me"+
    "\n"+
    "===============================================\n"+
    "*** Status updated by: you at: yesterday ***\n"+
    "more status stuff\n" +
    "\n\n\n\n"+
    "*** Status updated by: that_guy at: tmrw ***\n"+
    "status update\n" +
    "\n"+
    "*** Status updated by: me at: now ***\n"+
    "this is a status update\n" +
    "*** Edited by: someoneelse at: 03/03/21 10:09:52 ***\n" +
    "this is an edit\n"+
    "*** Replied by: no one: ever ***\n" +
    "this is a reply\n"+
    "=== Additional information supplied by user ===\n"+
    "\n"+
    "Subject: more\n"+
    "From: not me\n"+
    "Date: right meow\n"+
    "X-ECN-Queue-Original-Path: https://gogle.com\n"+
    "X-ECN-Queue-Original-URL: https://amzon.com\n"+
    "\n"+
    "This is a message that should be matched too.\n"+
    "Thanks,\n"+
    "me"+
    "\n"+
    "===============================================\n"
)

for tokens, starter, end in parse_item.scanString(item):
    print(tokens[0])

Output:

{
   "type":"reply_from_user",
   "subject":"stuff",
   "from":"someone",
   "date":"some time",
   "X-ECN-Queue-Original-Path: ":"https://google.com",
   "X-ECN-Queue-Original-URL: ":"https://amazon.com",
   "content":"This is a message.\nThanks,\nme"
},
{
   "type":"status_update",
   "content":[
      "*** Status updated by: you at: yesterday ***",
      "more status stuff"
   ]
},
{
   "type":"status_update",
   "content":[
      "*** Status updated by: that_guy at: tmrw ***",
      "status update"
   ]
},
{
   "type":"status_update",
   "content":[
      "*** Status updated by: me at: now ***",
      "this is a status update"
   ]
},
{
   "type":"edit",
   "content":[
      "*** Edited by: someoneelse at: 03/03/21 10:09:52 ***",
      "this is an edit"
   ]
},
{
   "type":"reply_from_user",
   "subject":"more",
   "from":"not me",
   "date":"right meow",
   "X-ECN-Queue-Original-Path: ":"https://gogle.com",
   "X-ECN-Queue-Original-URL: ":"https://amzon.com",
   "content":"This is a message that should be matched too.\nThanks,\nme"
}

@benne238
Collaborator

benne238 commented May 28, 2021

Directory parsing

The code below is now able to locate directory information within an item and (for the most part) delimit it correctly; some work will have to be done next week on the colons within the keys. It is also easy to see that the output looks even more like what we expect on the front end. I would have liked to make more progress on this, but implementing the grammar below is tedious: troubleshooting issues takes longer, and finding solutions to the various problems associated with pyparsing also takes time.

import pyparsing as pp
import json


info_from_user_start_delimiter = "=== Additional information supplied by user ==="
info_from_user_end_delimiter = "\n==============================================="

# reply from user header info
info_from_user_headers_rule = (
    pp.Group("Subject" + pp.Literal(": ").suppress() + pp.SkipTo("\n")) +
    pp.Group("From" + pp.Literal(": ").suppress() + pp.SkipTo("\n")) +
    pp.Group(pp.Optional("Cc" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))) +
    pp.Group("Date" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))+
    pp.Group("X-ECN-Queue-Original-Path" + pp.Literal(": ").suppress() + pp.SkipTo("\n")) +
    pp.Group("X-ECN-Queue-Original-URL" + pp.Literal(": ").suppress() + pp.SkipTo("\n")) 
)

#finds all of the reply_rule from user delimiters
info_from_user_rule = (
    (
        # matches everything between the two info_from_user delimiters
        info_from_user_start_delimiter +
        "\n\n" +
        pp.Dict(
            info_from_user_headers_rule +
            pp.Group(pp.SkipTo(info_from_user_end_delimiter)).setResultsName("content")
        )      
    ).setWhitespaceChars('') # ensures all whitespace is captured from the item
)


status_update_rule = (
    pp.Dict(
        #matches everything from the start delimiter up to the start of another delimiter
        pp.Group(pp.Literal("*** Status updated by: ").suppress() + pp.SkipTo(" at: ") ).setResultsName("by")+  
        pp.Group(pp.Literal("at:").suppress() + pp.SkipTo(" ***\n")).setResultsName("datetime") +
        pp.Group(
            pp.Literal("***\n").suppress() + 
            pp.SkipTo(
                (pp.LineEnd() + (pp.Regex(info_from_user_start_delimiter) | pp.Regex(r"\*\*\*") ))
            )
        ).setResultsName("content")
    ).setWhitespaceChars("")
)

edit_rule = (
    pp.Dict(
        pp.Group(pp.Literal("*** Edited by: ").suppress() + pp.SkipTo(" at: ") ).setResultsName("by")+  
        pp.Group(pp.Literal("at:").suppress() + pp.SkipTo(" ***\n")).setResultsName("datetime") +
        pp.Group(
            pp.Literal("***\n").suppress() + 
            pp.SkipTo(
                (pp.LineEnd() + (pp.Regex(info_from_user_start_delimiter) | pp.Regex(r"\*\*\*") ))
            )
        ).setResultsName("content")
    )
).setWhitespaceChars("")

reply_rule = (
    pp.Dict(
        pp.Group(pp.Literal("*** Replied by: ").suppress() + pp.SkipTo(" at: ") ).setResultsName("by")+  
        pp.Group(pp.Literal("at:").suppress() + pp.SkipTo(" ***\n")).setResultsName("datetime") +
        pp.Group(
            pp.Literal("***\n").suppress() + 
            pp.SkipTo(
                pp.LineEnd() + (pp.Regex(info_from_user_start_delimiter) | pp.Regex(r"\*\*\*"))
            )
        ).setResultsName("content")
    )
).setWhitespaceChars("")

directory_rule = pp.Dict(
    pp.White("\n").suppress() +
    pp.White("\t").suppress() +
    pp.Group(pp.Optional("Name:" + pp.Regex("(.*)(\\n)"))) +
    pp.Group(pp.Optional("Login:" +  pp.Regex("(.*)(\\n)"))) +
    pp.Group(pp.Optional("Computer:" + pp.Regex("(.*)(\\n)"))) +
    pp.Group(pp.Optional("Location:" + pp.Regex("(.*)(\\n)"))) +
    pp.Group(pp.Optional("Email:" + pp.Regex("(.*)(\\n)"))) +
    pp.Group(pp.Optional("Phone:" + pp.Regex("(.*)(\\n)"))) +
    pp.Group(pp.Optional("Office:" + pp.Regex("(.*)(\\n)"))) +
    pp.Group(pp.Optional("UNIX Dir:" + pp.Regex("(.*)(\\n)"))) +
    pp.Group(pp.Optional("Zero Dir:" + pp.Regex("(.*)(\\n)"))) +
    pp.Group(pp.Optional("User ECNDB:" + pp.Regex("(.*)(\\n)"))) +
    pp.Group(pp.Optional("Host ECNDB:" + pp.Regex("(.*)(\\n)"))) +
    pp.Group(pp.Optional("Subject: " + pp.Regex("(.*)(\\n)"))) +
    pp.White("\n").suppress()
).setWhitespaceChars('').parseWithTabs()

item = ("\n\t" +
    "Name: i dont have one\n"
    "Login: tttt\n" + 
    "Computer: 5555\n" +
    "Location: yes\n" +
    "Email: t\n" + 
    "Phone: 5555555555\n" + 
    "Office: yo\n" +
    "UNIX Dir: 45\n" +
    "Zero Dir: 0\n" +
    "User ECNDB: 7\n" +
    "Host ECNDB: 8\n" +
    "Subject: I have no idea\n"+ 
    "\n" +
    "\n=== Additional information supplied by user ===\n"+
    "\n"+
    "Subject: stuff\n"+
    "From: someone\n"+
    "Date: some time\n"+
    "X-ECN-Queue-Original-Path: https://google.com\n"+
    "X-ECN-Queue-Original-URL: https://amazon.com\n"+
    "\n"+
    "This is a message.\n"+
    "Thanks,\n"+
    "me"+
    "\n"+
    "===============================================\n"+
    "*** Status updated by: you at: yesterday ***\n"+
    "more status stuff\n" +
    "\n\n\n\n"+
    "*** Status updated by: that_guy at: tmrw ***\n"+
    "status update\n" +
    "\n"+
    "*** Status updated by: me at: now ***\n"+
    "this is a status update\n" +
    "*** Edited by: someoneelse at: 03/03/21 10:09:52 ***\n" +
    "this is an edit\n"+
    "*** Replied by: no one at: ever ***\n" +
    "this is a reply\n"+
    "=== Additional information supplied by user ===\n"+
    "\n"+
    "Subject: more\n"+
    "From: not me\n"+
    "Cc: \"jacob Bennett\" <me@purdue.edu>\n" +
    "Date: right meow\n"+
    "X-ECN-Queue-Original-Path: https://gogle.com\n"+
    "X-ECN-Queue-Original-URL: https://amzon.com\n"+
    "\n"+
    "This is a message that should be matched too.\n"+
    "Thanks,\n"+
    "me"+
    "\n"+
    "===============================================\n"
)

sections = []

parse_objects = {
    'directory': directory_rule.scanString(item),
    'info_from_user': info_from_user_rule.scanString(item),
    'edit': edit_rule.scanString(item),
    'status_update': status_update_rule.scanString(item),
    'reply_from_ecn': reply_rule.scanString(item)
}

for key in parse_objects.keys():
    for token, start_location, end_location in parse_objects[key]:
        delete_tokens = []
        for token_key in token.keys():
            if token[token_key] == '': delete_tokens.append(token_key)
        for removable_token in delete_tokens:
            del token[removable_token]
        token = token.asDict()
        token["type"] = key
        sections.append(token)

sections = json.dumps(sections)
print(sections)

Output:

[
   {
      "Name:":"i dont have one\n",
      "Login:":"tttt\n",
      "Computer:":"5555\n",
      "Location:":"yes\n",
      "Email:":"t\n",
      "Phone:":"5555555555\n",
      "Office:":"yo\n",
      "UNIX Dir:":"45\n",
      "Zero Dir:":"0\n",
      "User ECNDB:":"7\n",
      "Host ECNDB:":"8\n",
      "Subject: ":"I have no idea\n",
      "type":"directory"
   },
   {
      "content":[
         "This is a message.\nThanks,\nme"
      ],
      "Subject":"stuff",
      "From":"someone",
      "Date":"some time",
      "X-ECN-Queue-Original-Path":"https://google.com",
      "X-ECN-Queue-Original-URL":"https://amazon.com",
      "type":"info_from_user"
   },
   {
      "content":[
         "This is a message that should be matched too.\nThanks,\nme"
      ],
      "Subject":"more",
      "From":"not me",
      "Cc":"\"jacob Bennett\" <me@purdue.edu>",
      "Date":"right meow",
      "X-ECN-Queue-Original-Path":"https://gogle.com",
      "X-ECN-Queue-Original-URL":"https://amzon.com",
      "type":"info_from_user"
   },
   {
      "by":[
         "someoneelse"
      ],
      "datetime":[
         "03/03/21 10:09:52"
      ],
      "content":[
         "this is an edit"
      ],
      "type":"edit"
   },
   {
      "by":[
         "you"
      ],
      "datetime":[
         "yesterday"
      ],
      "content":[
         "more status stuff"
      ],
      "type":"status_update"
   },
   {
      "by":[
         "that_guy"
      ],
      "datetime":[
         "tmrw"
      ],
      "content":[
         "status update"
      ],
      "type":"status_update"
   },
   {
      "by":[
         "me"
      ],
      "datetime":[
         "now"
      ],
      "content":[
         "this is a status update"
      ],
      "type":"status_update"
   },
   {
      "by":[
         "no one"
      ],
      "datetime":[
         "ever"
      ],
      "content":[
         "this is a reply"
      ],
      "type":"reply_from_ecn"
   }
]

@campb303
Collaborator Author

campb303 commented Jun 1, 2021

I'd like to see a link to a particularly good tutorial or a write-up about how you've come to understand and use pyparsing.

@benne238
Collaborator

benne238 commented Jun 1, 2021

How to use Pyparsing

Grammar/Expression Creation

A grammar is a rule or set of rules (rules are often referred to as expressions) used to create a parser. Creating expressions in pyparsing is relatively easy; here are some examples of simpler ones:

import pyparsing as pp

colon_rule = pp.Word(pp.alphas) + ":" + pp.Word(pp.alphas) # matches two words on either side of a colon
skipTo_rule = pp.SkipTo("end") # matches everything up until the word "end"
regex_rule = pp.Regex("...") # matches the first three characters in a string
literal_rule = pp.Literal("Hello") # matches these characters exactly

To use these expressions, one of three functions can be called:

  • searchString(): looks for all possible matches within a given string
  • parseString(): requires that the grammar match the given string exactly (see below)
  • scanString(): similar to searchString(), except the return value retains the location of each match

These functions are called on each rule, and the string to parse is passed as an argument. Using the rules from above, for example:

print(colon_rule.parseString("hello:there")) # Output: ['hello', ':', 'there']
print(skipTo_rule.parseString("I'm just a simple sentenceend")) # Output: ["I'm just a simple sentence"]
print(regex_rule.parseString("This is a sentence")) # Output: ['Thi']
print(literal_rule.searchString("World Hello")) # Output: [['Hello']]

Notice the function searchString() was used on the literal_rule as opposed to parseString() like the three other rules: if you were to change it to parseString(), you would get an error which, in essence, states that "Hello" was expected but "W" was found instead. The grammar must match the string exactly, or parseString() will fail. The difference between parseString() and searchString() is best illustrated by this example:

print(regex_rule.parseString("This is a sentence")) # Output: ['Thi']
print(regex_rule.searchString("This is a sentence")) # Output: [['Thi'], ['s i'], ['s a'], ['sen'], ['ten']]

As seen above, searchString() does not care about placement within the string and will attempt to match anywhere.
Note that searchString() resumes scanning after each match, so in the example above the output is not [['Thi'], ['his'], ['is '], ['s i'], [' is'], ['is '], ['s a'], [' a '], ['a s'], [' se'], ['sen'], etc.] as one might expect if the parser were advancing character by character.

Generally speaking, searchString() should be used to match an expression that appears many times in varying locations within a string, while parseString() should be used when the general structure of the string is known. parseString() throws an exception when there is no match, unlike searchString(), which just returns an empty list.
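To make these failure modes concrete, here is a small sketch using pyparsing's standard API, showing parseString() raising, searchString() returning an empty result, and scanString() preserving match locations:

```python
import pyparsing as pp

literal_rule = pp.Literal("Hello")

# parseString() must match from the start of the string, or it raises ParseException
try:
    literal_rule.parseString("World Hello")
    matched_at_start = True
except pp.ParseException:
    matched_at_start = False

# searchString() scans anywhere; with no match it just returns an empty result
no_match = pp.Literal("Goodbye").searchString("World Hello").asList()

# scanString() yields (tokens, start, end) tuples, preserving match locations
spans = [(tokens.asList(), start, end)
         for tokens, start, end in literal_rule.scanString("World Hello")]

print(matched_at_start)  # False
print(no_match)          # []
print(spans)             # [(['Hello'], 6, 11)]
```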

Complex Grammars

Expressions, as seen above, can be combined to create a grammar. Take this example from this comment:

import pyparsing as pp

info_from_user_headers_rule = (
    pp.Group("Subject" + pp.Literal(": ").suppress() + pp.SkipTo("\n")) +
    pp.Group("From" + pp.Literal(": ").suppress() + pp.SkipTo("\n")) +
    pp.Group(pp.Optional("Cc" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))) +
    pp.Group("Date" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))+
    pp.Group("X-ECN-Queue-Original-Path" + pp.Literal(": ").suppress() + pp.SkipTo("\n")) +
    pp.Group("X-ECN-Queue-Original-URL" + pp.Literal(": ").suppress() + pp.SkipTo("\n")) 
)

This grammar uses several different pyparsing classes and expressions, but the result matches all of the header information in a reply_from_user section.

  • Group: puts all matched output in its own list
  • Optional: marks a match as optional; it is matched if present
  • suppress(): when called on a rule, anything matched by that rule is excluded from the output
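A small sketch of how these change the result shape (standard pyparsing behavior; the header line is just an example string):

```python
import pyparsing as pp

header = "Subject: hi\n"

# Without Group, all tokens land in one flat list
flat_rule = "Subject" + pp.Literal(": ").suppress() + pp.SkipTo("\n")
# With Group, the same tokens are wrapped in their own sub-list
grouped_rule = pp.Group("Subject" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))
# Optional matches if present and silently succeeds if absent
optional_rule = grouped_rule + pp.Optional(
    pp.Group("Cc" + pp.Literal(": ").suppress() + pp.SkipTo("\n")))

print(flat_rule.parseString(header).asList())     # ['Subject', 'hi']
print(grouped_rule.parseString(header).asList())  # [['Subject', 'hi']]
print(optional_rule.parseString(header).asList()) # [['Subject', 'hi']] (no Cc present)
```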

Here is an example of a reply from a user:

print(info_from_user_headers_rule.searchString("""
=== Additional information supplied by user ===

Subject: subject_here
From: Jacob
Date: Tue, 2 Mar 2021 09:46:21 -0500
X-ECN-Queue-Original-Path: path_here
X-ECN-Queue-Original-URL: url_here

I am replying to ECN

Thanks,
Jacob


"""))

# Output: [[['Subject', 'subject_here'], ['From', 'Jacob'], [], ['Date', 'Tue, 2 Mar 2021 09:46:21 -0500'], ['X-ECN-Queue-Original-Path', 'path_here'], ['X-ECN-Queue-Original-URL', 'url_here']]]

@campb303
Collaborator Author

campb303 commented Jun 1, 2021

I see above that some results are simple arrays like

['Thi']

and some results are nested arrays like

[
    ['Thi'], ['s i'], ['s a'], ['sen'], ['ten']
]

and other outputs are nested multiple times like

[
    [
        ['Subject', 'subject_here'], 
        ['From', 'Jacob'], 
        [], 
        ['Date', 'Tue, 2 Mar 2021 09:46:21 -0500'], 
        ['X-ECN-Queue-Original-Path', 'path_here'], 
        ['X-ECN-Queue-Original-URL', 'url_here']
    ]
]

When should I expect nesting or not with results?

@benne238
Collaborator

benne238 commented Jun 1, 2021

Nested Lists

Nested lists can happen because of a couple of different reasons:

  • using the pyparsing.Group class
  • using the searchString() function

searchString() will return a list of lists, with each sub-list representing one match
parseString() will return a single list of tokens from the one match it performs

The pyparsing.Group class explicitly puts each of its matches within its own list

To demonstrate these differences, here is an example:

import pyparsing as pp
string_var = 'key1:value1;key2:value2;key3:value3;'

rule_one = pp.Word(pp.alphanums) + ":" + pp.Word(pp.alphanums) + pp.Literal(";")
rule_two = pp.ZeroOrMore(pp.Group(pp.Word(pp.alphanums) + ":" + pp.Word(pp.alphanums) + pp.Literal(";")))
rule_three = pp.Group(pp.Word(pp.alphanums) + ":" + pp.Word(pp.alphanums) + pp.Literal(";"))

print(rule_one.parseString(string_var))
print(rule_two.parseString(string_var))
print(rule_three.parseString(string_var))
print(rule_one.searchString(string_var))
print(rule_two.searchString(string_var))
print(rule_three.searchString(string_var))

The pyparsing.ZeroOrMore class is used to denote a match that may occur zero or many times.

Output

['key1', ':', 'value1', ';']
[['key1', ':', 'value1', ';'], ['key2', ':', 'value2', ';'], ['key3', ':', 'value3', ';']]
[['key1', ':', 'value1', ';']]
[['key1', ':', 'value1', ';'], ['key2', ':', 'value2', ';'], ['key3', ':', 'value3', ';']]
[[['key1', ':', 'value1', ';'], ['key2', ':', 'value2', ';'], ['key3', ':', 'value3', ';']]]
[[['key1', ':', 'value1', ';']], [['key2', ':', 'value2', ';']], [['key3', ':', 'value3', ';']]]
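As an aside, when each match is a Grouped key/value pair like rule_two above, pyparsing's Dict class (which the later prototypes in this thread use) can turn the nested lists into a plain dictionary via asDict(). A minimal sketch reusing the string from the example:

```python
import pyparsing as pp

string_var = 'key1:value1;key2:value2;key3:value3;'

# Dict treats the first token of each Group as the key and the rest as the value
pair = pp.Group(pp.Word(pp.alphanums) + pp.Suppress(":") +
                pp.Word(pp.alphanums) + pp.Suppress(";"))
dict_rule = pp.Dict(pp.ZeroOrMore(pair))

print(dict_rule.parseString(string_var).asDict())
# {'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}
```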

@campb303
Collaborator Author

campb303 commented Jun 1, 2021

Is there one approach that gives us all the flexibility we'd need so that we can standardize the expected output?

@benne238
Collaborator

benne238 commented Jun 1, 2021

Is there one approach that gives us all the flexibility we'd need so that we can standardize the expected output?

searchString() is inherently more flexible; however, the general structure of items will probably require the use of parseString(). (I was going to write a quick paragraph tomorrow about why parseString() has to be used if I can't find an alternative way of using searchString().) It is possible to get both functions to produce similar output, but parseString()'s output is a little cleaner to begin with. I can also do a write-up on the pros and cons of both functions, specifically as they apply to rewriting the parser.
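One relevant knob for this decision: by default parseString() only anchors at the start of the input, but its parseAll argument makes it require a full match, which is closer to what parsing a whole item needs. A sketch of the standard behavior:

```python
import pyparsing as pp

rule = pp.Word(pp.alphas)

# Default: the match is anchored at the start but may stop early
partial = rule.parseString("hello world").asList()  # ['hello']

# parseAll=True additionally requires the grammar to consume the entire string
try:
    rule.parseString("hello world", parseAll=True)
    consumed_all = True
except pp.ParseException:
    consumed_all = False  # 'world' was left unconsumed

print(partial, consumed_all)
```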

@benne238
Collaborator

benne238 commented Jun 3, 2021

Prototype Pyparsing Parser

The code below will parse all the information in an item, excluding anything in the headers, and return output comparable to the current implementation of the parser. It does so faster, in fewer lines, and in a way that is relatively easy to understand.

import pyparsing as pp
import json
import string

info_from_user_start_delimiter = "=== Additional information supplied by user ==="
info_from_user_end_delimiter = "==============================================="

def addTypeKey(section_type):
    def parse_action_impl(s, l, t): # need to look into how exactly this function gets information
        t = t.asDict()
        unwantedKeys = [emptyKey for emptyKey in t.keys() if emptyKey == ''] # collects keys whose name is empty
        for key in unwantedKeys: del t[key] # removes them
        if len(t.keys()) == 0: return # used for optional sections such as directory info
        t["type"] = section_type
        return t
    return parse_action_impl

# additional information supplied by user rule
info_from_user_rule = (
    (info_from_user_start_delimiter + pp.LineEnd()).suppress() +
    pp.Literal("\n").setWhitespaceChars("").suppress() +
    pp.Group("Subject" + pp.Literal(": ").suppress() + pp.SkipTo(pp.LineEnd())) +
    pp.Group("From" + pp.Literal(": ").suppress() + pp.SkipTo("\n")) +
    pp.Group(pp.Optional("Cc" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))) +
    pp.Group("Date" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))+
    pp.Group(pp.Optional("X-ECN-Queue-Original-Path" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))) +
    pp.Group(pp.Optional("X-ECN-Queue-Original-URL" + pp.Literal(": ").suppress() + pp.SkipTo("\n")))  +
    pp.SkipTo(info_from_user_end_delimiter + pp.LineEnd(), include=True).setResultsName("content")
).setParseAction(addTypeKey("reply_from_user"))

reply_rule = (
    pp.Literal("\n*** Replied by: ").suppress() + 
    pp.Word(pp.alphanums).setResultsName("by")+
    pp.Literal(" at: ").suppress() +
    pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
    (pp.Literal(" ***") + pp.LineEnd()).suppress() +
    pp.Group(
        pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
    ).setResultsName("content") 
).leaveWhitespace().setParseAction(addTypeKey("reply_to_user"))

edit_rule = (
    pp.Literal("\n*** Edited by: ").suppress() + 
    pp.Word(pp.alphanums).setResultsName("by")+
    pp.Literal(" at: ").suppress() +
    pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
    (pp.Literal(" ***") + pp.LineEnd()).suppress() +
    pp.Group(
        pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
    ).setResultsName("content") 
).leaveWhitespace().setParseAction(addTypeKey("edit"))

status_update_rule = (
    pp.Literal("\n*** Status updated by: ").suppress() + 
    pp.Word(pp.alphanums).setResultsName("by")+
    pp.Literal(" at: ").suppress() +
    pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
    (pp.Literal(" ***") + pp.LineEnd()).suppress() +
    pp.Group(
        pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
    ).setResultsName("content") 
).leaveWhitespace().setParseAction(addTypeKey("status"))


directory_rule = pp.Optional(pp.Dict(
    pp.Literal("\n").suppress().setWhitespaceChars("") +
    pp.Optional(pp.Group("Name" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Login" + pp.Literal(":").suppress() +  pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Computer" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Location" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Email" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Phone" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Office" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("UNIX Dir" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Zero Dir" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("User ECNDB" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Host ECNDB" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Subject" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Literal("\n\n").suppress().setWhitespaceChars("")
)).setParseAction(addTypeKey("directory_information"))

initial_message_rule = (
    pp.SkipTo(pp.Regex(info_from_user_start_delimiter) | pp.Regex(r"\n\*\*\*")).leaveWhitespace()
).setResultsName("content").setParseAction(addTypeKey("initial_message"))

headers_rule = pp.Group(pp.SkipTo("\n\n", include=True).setResultsName('headers')).leaveWhitespace()

item_rule = (
    headers_rule.suppress() + # suppresses the headers from the parsed item output
    directory_rule + 
    initial_message_rule + 
    pp.ZeroOrMore(info_from_user_rule | reply_rule | edit_rule | status_update_rule)
)

raw_item = """
<Header information would typically go here>


     Name: Jacob Bennett
    Login: benne238
 Computer: 1.1.1.1
 Location: CARY 123
    Email: benne238@purdue.edu
    Phone: numberhere
   Office: I wish...
 UNIX Dir: dunno
 Zero Dir: dunno thatone either
    
  Subject: I need something from ECN


I am writing because I need something from ECN, 
thanks, Jacob
Bennett

*** Edited by: campb303 at: 01/01/2022 09:00:00 ***

I made an edit here



*** Edited by: campb303 at: 01/01/2022 12:29:38 ***

I also made an edit here


*** Status updated by: someoneelse at: 01/01/2022 12:30:13 ***
I made a status update
*** Edited by: personone at: 01/02/2022 12:31:15 ***

ooo, personone also edited this item
*** Replied by: personone at: 01/02/22 12:34:03 ***

Hello there.... could you be more specific?

Thanks,
personone

*** Edited by: persontwo at: 01/05/22 14:58:03 ***
I made an edit too! (persontwo)

*** Status updated by: personone at: 1/7/2022 15:40:55 ***
Something happened here
*** Edited by: personone at: 04/08/22 15:41:05 ***

i dont even know anymore




=== Additional information supplied by user ===

Subject: Re: I need something from ECN
From: "Bennett, Jacob" <benne238@purdue.edu>
Date: Tue, 3 Dec 2023 14:50:44 +0000
X-ECN-Queue-Original-Path: nothing
X-ECN-Queue-Original-URL: nothing

Hi! Thanks for the quick reply. I dunnno, I was hoping you could help me with that :/

Thanks, Jacob
===============================================
"""
parsed_item = item_rule.parseString(raw_item).asList()

print(json.dumps(parsed_item))

Output:

[
  {
    "Name": "Jacob Bennett",
    "Login": "benne238",
    "Computer": "1.1.1.1",
    "Location": "CARY 123",
    "Email": "benne238@purdue.edu",
    "Phone": "numberhere",
    "Office": "I wish...",
    "UNIX Dir": "dunno",
    "Zero Dir": "dunno thatone either",
    "Subject": "I need something from ECN",
    "type": "directory_information"
  },
  {
    "content": "\nI am writing because I need something from ECN, \nthanks, Jacob\nBennett\n",
    "type": "initial_message"
  },
  {
    "by": "campb303",
    "datetime": "01/01/2022 09:00:00",
    "content": [
      "\nI made an edit here\n\n\n"
    ],
    "type": "edit"
  },
  {
    "by": "campb303",
    "datetime": "01/01/2022 12:29:38",
    "content": [
      "\nI also made an edit here\n\n"
    ],
    "type": "edit"
  },
  {
    "by": "someoneelse",
    "datetime": "01/01/2022 12:30:13",
    "content": [
      "I made a status update"
    ],
    "type": "status"
  },
  {
    "by": "personone",
    "datetime": "01/02/2022 12:31:15",
    "content": [
      "\nooo, personone also edited this item"
    ],
    "type": "edit"
  },
  {
    "by": "personone",
    "datetime": "01/02/22 12:34:03",
    "content": [
      "\nHello there.... could you be more specific?\n\nThanks,\npersonone\n"
    ],
    "type": "reply_to_user"
  },
  {
    "by": "persontwo",
    "datetime": "01/05/22 14:58:03",
    "content": [
      "I made an edit too! (persontwo)\n"
    ],
    "type": "edit"
  },
  {
    "by": "personone",
    "datetime": "1/7/2022 15:40:55",
    "content": [
      "Something happened here"
    ],
    "type": "status"
  },
  {
    "by": "personone",
    "datetime": "04/08/22 15:41:05",
    "content": [
      "\ni dont even know anymore\n\n\n\n"
    ],
    "type": "edit"
  },
  {
    "content": "Hi! Thanks for the quick reply. I dunnno, I was hoping you could help me with that :/\n\nThanks, Jacob\n",
    "type": "reply_from_user"
  }
]

Modifications still need to be made to this script, including:

  • adding keys to some of the sections, specifically information that can only be found in the headers
  • deciding whether parse_errors are something we would like to continue producing
  • formatting date and content output
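On the "need to look into how exactly this function gets information" comment in the code above: pyparsing calls a parse action with (s, l, t), meaning the original string being parsed, the location of the match within it, and the matched tokens. That is what makes line-number recovery via s[:l].count("\n") + 1 possible. A minimal sketch of standard pyparsing behavior:

```python
import pyparsing as pp

seen = []

# Parse actions receive (s, l, t): the original string, the match location,
# and the matched tokens
def record(s, l, t):
    line_num = s[:l].count("\n") + 1
    seen.append((line_num, t.asList()))

marker_rule = pp.Literal("***").setParseAction(record)
marker_rule.searchString("first line\n*** second\n*** third")

print(seen)  # [(2, ['***']), (3, ['***'])]
```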

@campb303
Copy link
Collaborator Author

campb303 commented Jun 4, 2021

Currently, the output of the parser looks good, though it is not functionally complete. We still need:

  • to integrate the parser into the Item class
  • add ParseError to the output
  • add some way to indicate improper formatting

I also have some other concerns:

  • Documentation is difficult to follow. May want to consider writing a domain-specific tutorial for future devs.
  • Need to compare timing of this parser to old parser.

Overall, the code is cleaner than the previous parser. If we can add the lacking functionality and prove the code is as efficient as or faster than the old parser, then we're good to go.

Next Steps:

  • Add error parsing
  • Integrate into Item class
  • Time parsing of all queues
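For the timing comparison, a minimal stdlib sketch could look like the following (time_parser and the parse-function names in the usage comment are placeholders, not code from this repo):

```python
import timeit

def time_parser(parse_fn, raw_item, runs=100):
    """Return average seconds per call of parse_fn(raw_item) over `runs` runs."""
    total = timeit.timeit(lambda: parse_fn(raw_item), number=runs)
    return total / runs

# Usage sketch: compare the old and new parsers on the same item text, e.g.
#   old_avg = time_parser(old_parser_entry_point, raw_item)  # hypothetical name
#   new_avg = time_parser(item_rule.parseString, raw_item)
example_avg = time_parser(len, "some raw item text", runs=10)
print(f"{example_avg:.9f} s/parse")
```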

@benne238
Collaborator

benne238 commented Jun 7, 2021

Update

This version of the pyparsing parser:

  • parses assignments
  • formats any datetime field to the expected datetime format
import pyparsing as pp
import json
import string
from dateutil import parser, tz
from datetime import datetime
import os

info_from_user_start_delimiter = "=== Additional information supplied by user ==="
info_from_user_end_delimiter = "==============================================="

HEADERS = ""

def addTypeKey(section_type):
    def parse_action_impl(s, l, t): # need to look into how exactly this function gets information
        t = t.asDict()
        
        unwantedKeys=[emptyKey for emptyKey in t.keys() if t[emptyKey] == ''] # makes a list of keys with empty values
        for key in unwantedKeys: del t[key] # removes empty keys

        t["type"] = section_type
        if "datetime" in t.keys(): t["datetime"] = getFormattedDate(t["datetime"])
        
        if "content" in t.keys(): 
            t["content"] = t["content"][0].strip()
            t["content"] = t["content"].splitlines(True)

        return t
    return parse_action_impl

def getAssignments() -> list:
    assignment_list = []
    for assignment in assignment_rule.searchString(HEADERS).asList():
        assignment_list.append(assignment[0])
    
    return assignment_list # need to write a blurb about yield statements
        


def storeHeaders():
    def parse_action_impl(s, l, t):

        global HEADERS 
        HEADERS = t[0][0]
        return

    return parse_action_impl

def getFormattedDate(date: str) -> str:
        """Returns the date/time formatted as RFC 8601 YYYY-MM-DDTHH:MM:SS+00:00.
        Returns empty string if the string argument passed to the function is not a datetime.
        See: https://en.wikipedia.org/wiki/ISO_8601

        **Returns:**
        ```        
        str: Properly formatted date/time received or empty string.
        ```
        """
        try:
            # This date is never meant to be used. The default attribute is just to set timezone.
            parsedDate = parser.parse(date, default=datetime(
                1970, 1, 1, tzinfo=tz.gettz('EDT')))
        except (ValueError, OverflowError): # dateutil raises these for unparseable dates
            return ""

        parsedDateString = parsedDate.strftime("%Y-%m-%dT%H:%M:%S%z")

        return parsedDateString

# additional information supplied by user rule
info_from_user_rule = pp.Dict(
    (info_from_user_start_delimiter + pp.LineEnd()).suppress() +
    pp.Literal("\n").setWhitespaceChars("").suppress() +
    pp.Group("Subject" + pp.Literal(": ").suppress() + pp.SkipTo(pp.LineEnd())) +
    pp.Group("From" + pp.Literal(": ").suppress() + pp.SkipTo("\n")) +
    pp.Group(pp.Optional("Cc" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))) +
    pp.Group("Date" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))+
    pp.Group(pp.Optional("X-ECN-Queue-Original-Path" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))) +
    pp.Group(pp.Optional("X-ECN-Queue-Original-URL" + pp.Literal(": ").suppress() + pp.SkipTo("\n")))  +
    pp.Group(pp.SkipTo(info_from_user_end_delimiter + pp.LineEnd())).setResultsName("content")
).setParseAction(addTypeKey("reply_from_user"))

reply_rule = (
    pp.Literal("\n*** Replied by: ").suppress() + 
    pp.Word(pp.alphanums).setResultsName("by")+
    pp.Literal(" at: ").suppress() +
    pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
    (pp.Literal(" ***") + pp.LineEnd()).suppress() +
    pp.Group(
        pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
    ).setResultsName("content") 
).leaveWhitespace().setParseAction(addTypeKey("reply_to_user"))

edit_rule = (
    pp.Literal("\n*** Edited by: ").suppress() + 
    pp.Word(pp.alphanums).setResultsName("by")+
    pp.Literal(" at: ").suppress() +
    pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
    (pp.Literal(" ***") + pp.LineEnd()).suppress() +
    pp.Group(
        pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
    ).setResultsName("content") 
).leaveWhitespace().setParseAction(addTypeKey("edit"))

status_update_rule = (
    pp.Literal("\n*** Status updated by: ").suppress() + 
    pp.Word(pp.alphanums).setResultsName("by")+
    pp.Literal(" at: ").suppress() +
    pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
    (pp.Literal(" ***") + pp.LineEnd()).suppress() +
    pp.Group(
        pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
    ).setResultsName("content") 
).leaveWhitespace().setParseAction(addTypeKey("status"))


directory_rule = pp.Dict(
    pp.Literal("\n").suppress().setWhitespaceChars("") +
    pp.Optional(pp.Group("Name" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Login" + pp.Literal(":").suppress() +  pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Computer" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Location" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Email" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Phone" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Office" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("UNIX Dir" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Zero Dir" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("User ECNDB" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Host ECNDB" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Subject" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Literal("\n\n").suppress().setWhitespaceChars("")
).setParseAction(addTypeKey("directory_information"))

initial_message_rule = pp.Group(
    pp.SkipTo(pp.Regex(info_from_user_start_delimiter) | pp.Regex(r"\n\*\*\*")).leaveWhitespace()
).setResultsName("content").setParseAction(addTypeKey("initial_message"))

headers_rule = pp.Group(pp.SkipTo("\n\n", include=True)).setResultsName('headers').leaveWhitespace()

item_rule = (
    headers_rule.setParseAction(storeHeaders()).suppress() + # suppresses the headers from the parsed item output
    pp.Optional(directory_rule) + 
    initial_message_rule + 
    pp.ZeroOrMore(info_from_user_rule | reply_rule | edit_rule | status_update_rule)
)

assignment_rule = (
        pp.Literal("Assigned-To: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("to") +
        pp.Literal("Assigned-To-Updated-Time: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("datetime") +
        pp.Literal("Assigned-To-Updated-By: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("by")
).setParseAction(addTypeKey("assignment"))

raw_item = """
Assigned-To: not_me
Assigned-To-Updated-Time: Fri, 29 Jan 2021 07:01:40 EST
Assigned-To-Updated-By: me
Assigned-To: you
Assigned-To-Updated-Time: 31 Jan 2021 07:01:40 EST
Assigned-To-Updated-By: not_me


     Name: Jacob Bennett
    Login: benne238
 Computer: 1.1.1.1
 Location: CARY 123
    Email: benne238@purdue.edu
    Phone: numberhere
   Office: I wish...
 UNIX Dir: dunno
 Zero Dir: dunno thatone either
    
  Subject: I need something from ECN


I am writing because I need something from ECN, 
thanks, Jacob
Bennett

*** Edited by: campb303 at: 01/01/2022 09:00:00 ***

I made an edit here



*** Edited by: campb303 at: 01/01/2022 12:29:38 ***

I also made an edit here


*** Status updated by: someoneelse at: 01/01/2022 12:30:13 ***
I made a status update
*** Edited by: personone at: 01/02/2022 12:31:15 ***

ooo, personone also edited this item
*** Replied by: personone at: 01/02/22 12:34:03 ***

Hello there.... could you be more specific?

Thanks,
personone

*** Edited by: persontwo at: 01/05/22 14:58:03 ***
I made an edit too! (persontwo)

*** Status updated by: personone at: 1/7/2022 15:40:55 ***
Something happened here
*** Edited by: personone at: 04/08/22 15:41:05 ***

i dont even know anymore




=== Additional information supplied by user ===

Subject: Re: I need something from ECN
From: "Bennett, Jacob" <benne238@purdue.edu>
Date: Tue, 3 Dec 2023 14:50:44 +0000
X-ECN-Queue-Original-Path: nothing
X-ECN-Queue-Original-URL: nothing

Hi! Thanks for the quick reply. I dunnno, I was hoping you could help me with that :/

Thanks, Jacob
===============================================
"""

parsed_item = item_rule.parseString(raw_item).asList()

for assignment in getAssignments():
    parsed_item.append(assignment)

print(json.dumps(parsed_item, indent=2))

Output:

[
  {
    "Name": "Jacob Bennett",
    "Login": "benne238",
    "Computer": "1.1.1.1",
    "Location": "CARY 123",
    "Email": "benne238@purdue.edu",
    "Phone": "numberhere",
    "Office": "I wish...",
    "UNIX Dir": "dunno",
    "Zero Dir": "dunno thatone either",
    "Subject": "I need something from ECN",
    "type": "directory_information"
  },
  {
    "content": [
      "I am writing because I need something from ECN, \n",
      "thanks, Jacob\n",
      "Bennett"
    ],
    "type": "initial_message"
  },
  {
    "by": "campb303",
    "datetime": "2022-01-01T09:00:00-0500",
    "content": [
      "I made an edit here"
    ],
    "type": "edit"
  },
  {
    "by": "campb303",
    "datetime": "2022-01-01T12:29:38-0500",
    "content": [
      "I also made an edit here"
    ],
    "type": "edit"
  },
  {
    "by": "someoneelse",
    "datetime": "2022-01-01T12:30:13-0500",
    "content": [
      "I made a status update"
    ],
    "type": "status"
  },
  {
    "by": "personone",
    "datetime": "2022-01-02T12:31:15-0500",
    "content": [
      "ooo, personone also edited this item"
    ],
    "type": "edit"
  },
  {
    "by": "personone",
    "datetime": "2022-01-02T12:34:03-0500",
    "content": [
      "Hello there.... could you be more specific?\n",
      "\n",
      "Thanks,\n",
      "personone"
    ],
    "type": "reply_to_user"
  },
  {
    "by": "persontwo",
    "datetime": "2022-01-05T14:58:03-0500",
    "content": [
      "I made an edit too! (persontwo)"
    ],
    "type": "edit"
  },
  {
    "by": "personone",
    "datetime": "2022-01-07T15:40:55-0500",
    "content": [
      "Something happened here"
    ],
    "type": "status"
  },
  {
    "by": "personone",
    "datetime": "2022-04-08T15:41:05-0400",
    "content": [
      "i dont even know anymore"
    ],
    "type": "edit"
  },
  {
    "content": [
      "Hi! Thanks for the quick reply. I dunnno, I was hoping you could help me with that :/\n",
      "\n",
      "Thanks, Jacob"
    ],
    "Subject": "Re: I need something from ECN",
    "From": "\"Bennett, Jacob\" <benne238@purdue.edu>",
    "Date": "Tue, 3 Dec 2023 14:50:44 +0000",
    "X-ECN-Queue-Original-Path": "nothing",
    "X-ECN-Queue-Original-URL": "nothing",
    "type": "reply_from_user"
  },
  {
    "to": "not_me",
    "datetime": "2021-01-29T07:01:40-0500",
    "by": "me",
    "type": "assignment"
  },
  {
    "to": "you",
    "datetime": "2021-01-31T07:01:40-0500",
    "by": "not_me",
    "type": "assignment"
  }
]

@benne238
Collaborator

benne238 commented Jun 9, 2021

Update

The following code will output a parse_error when the expected syntax of an item is not encountered.

import pyparsing as pp
import json
import string
from dateutil import parser, tz
from datetime import datetime
import os

info_from_user_start_delimiter = "=== Additional information supplied by user ==="
info_from_user_end_delimiter = "==============================================="

nested_expression_rule = (
    pp.Literal(info_from_user_start_delimiter) |
    pp.Regex(r"\*\*\* Replied by: (.*) at: (.*) \*\*\*") |
    pp.Regex(r"\*\*\* Edited by: (.*) at: (.*) \*\*\*") |
    pp.Regex(r"\*\*\* Status updated by: (.*) at: (.*) \*\*\*")
)
HEADERS = ""

def errorHandler():
    def error_action_impl(s, l, t):
        location = (s[:l]).count('\n') + 1
        parse_error = {
            "type": "parse_error",
            'datetime': getFormattedDate(str(datetime.now())),
            'expected': f'Did not encounter a reply-from-user ending delimiter for the reply-from-user start delimiter on line {location}',
            'got': '\n',
            'line_num': location
        }
        parsed_item.append(parse_error)
        return

    return error_action_impl


def checkForNested():
    def nested_action_impl(s, l, t):
        errorParse = {}
        nested_expressions_generator = nested_expression_rule.scanString(t[0])
        for token, start, end in nested_expressions_generator:
            errorParse = {
                "type": "parse_error",
                "datetime": getFormattedDate(str(datetime.now())),
                "expected": "Reply from user ending delimiter",
                "got": token[0],
                "line_num": (s[:start + l]).count("\n") + 1
            }
            break
        if len(errorParse.keys()) != 0: parsed_item.append(errorParse)

        return 
    return nested_action_impl

def addTypeKey(section_type):
    def parse_action_impl(s, l, t): # need to look into how exactly this function gets information
        t = t.asDict()
        
        unwantedKeys=[emptyKey for emptyKey in t.keys() if t[emptyKey] == ''] # makes a list of keys with empty values
        for key in unwantedKeys: del t[key] # removes empty keys

        t["type"] = section_type
        if "datetime" in t.keys(): t["datetime"] = getFormattedDate(t["datetime"])
        
        if "content" in t.keys(): 
            t["content"] = t["content"][0].strip()
            t["content"] = t["content"].splitlines(True)

        parsed_item.append(t)
        return
    return parse_action_impl

def getAssignments() -> list:
    assignment_list = []
    for token, start, end in assignment_rule.scanString(HEADERS):
        token_dict = token.asDict()
        token_dict["type"] = "assignment"
        assignment_list.append(token_dict)
    
    return assignment_list
        


def storeHeaders():
    def parse_action_impl(s, l, t):

        global HEADERS 
        HEADERS = t[0][0]
        return

    return parse_action_impl

def getFormattedDate(date: str) -> str:
        """Returns the date/time formatted as RFC 8601 YYYY-MM-DDTHH:MM:SS+00:00.
        Returns empty string if the string argument passed to the function is not a datetime.
        See: https://en.wikipedia.org/wiki/ISO_8601

        **Returns:**
        ```        
        str: Properly formatted date/time received or empty string.
        ```
        """
        try:
            # This date is never meant to be used. The default attribute is just to set timezone.
            parsedDate = parser.parse(date, default=datetime(
                1970, 1, 1, tzinfo=tz.gettz('EDT')))
        except (ValueError, OverflowError): # dateutil raises these for unparseable dates
            return ""

        parsedDateString = parsedDate.strftime("%Y-%m-%dT%H:%M:%S%z")

        return parsedDateString

# additional information supplied by user rule
info_from_user_rule = (pp.Dict(
    (info_from_user_start_delimiter + pp.LineEnd()).suppress() +
    pp.Literal("\n").setWhitespaceChars("").suppress() +
    pp.Group("Subject" + pp.Literal(": ").suppress() + pp.SkipTo(pp.LineEnd())) +
    pp.Group("From" + pp.Literal(": ").suppress() + pp.SkipTo("\n")) +
    pp.Group(pp.Optional("Cc" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))) +
    pp.Group("Date" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))+
    pp.Group(pp.Optional("X-ECN-Queue-Original-Path" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))) +
    pp.Group(pp.Optional("X-ECN-Queue-Original-URL" + pp.Literal(": ").suppress() + pp.SkipTo("\n")))  +
    (pp.Group(pp.SkipTo(info_from_user_end_delimiter + pp.LineEnd()).setParseAction(checkForNested())).setResultsName("content")) +
    (pp.Literal(info_from_user_end_delimiter) + pp.LineEnd()).suppress()
).setParseAction(addTypeKey("reply_from_user")))

reply_rule = (
    pp.Literal("\n*** Replied by: ").suppress() + 
    pp.Word(pp.alphanums).setResultsName("by")+
    pp.Literal(" at: ").suppress() +
    pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
    (pp.Literal(" ***") + pp.LineEnd()).suppress() +
    pp.Group(
        pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
    ).setResultsName("content") 
).leaveWhitespace().setParseAction(addTypeKey("reply_to_user"))

edit_rule = (
    pp.Literal("\n*** Edited by: ").suppress() + 
    pp.Word(pp.alphanums).setResultsName("by")+
    pp.Literal(" at: ").suppress() +
    pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
    (pp.Literal(" ***") + pp.LineEnd()).suppress() +
    pp.Group(
        pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
    ).setResultsName("content") 
).leaveWhitespace().setParseAction(addTypeKey("edit"))

status_update_rule = (
    pp.Literal("\n*** Status updated by: ").suppress() + 
    pp.Word(pp.alphanums).setResultsName("by")+
    pp.Literal(" at: ").suppress() +
    pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
    (pp.Literal(" ***") + pp.LineEnd()).suppress() +
    pp.Group(
        pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
    ).setResultsName("content") 
).leaveWhitespace().setParseAction(addTypeKey("status"))


directory_rule = pp.Dict(
    pp.Literal("\n").suppress().setWhitespaceChars("") +
    pp.Optional(pp.Group("Name" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Login" + pp.Literal(":").suppress() +  pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Computer" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Location" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Email" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Phone" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Office" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("UNIX Dir" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Zero Dir" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("User ECNDB" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Host ECNDB" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Subject" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Literal("\n\n").suppress().setWhitespaceChars("")
).setParseAction(addTypeKey("directory_information"))

initial_message_rule = pp.Group(
pp.SkipTo(pp.Regex(info_from_user_start_delimiter) | pp.Regex('\\n\*\*\*')).leaveWhitespace()
).setResultsName("content").setParseAction(addTypeKey("initial_message"))

headers_rule = pp.Group(pp.SkipTo("\n\n", include=True)).setResultsName('headers').leaveWhitespace()

missing_end_delimiter_rule = pp.Word(string.printable).setParseAction(errorHandler())

item_rule = (
    headers_rule.setParseAction(storeHeaders()).suppress() + #supresses the output of the headers to the parsed item
    pp.Optional(directory_rule) + 
    initial_message_rule + 
    pp.ZeroOrMore(info_from_user_rule | reply_rule | edit_rule | status_update_rule | missing_end_delimiter_rule)
)

assignment_rule = (
        pp.Literal("Assigned-To: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("to") +
        pp.Literal("Assigned-To-Updated-Time: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("datetime") +
        pp.Literal("Assigned-To-Updated-By: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("by")
).setParseAction(addTypeKey("assignment"))

raw_item = """
Assigned-To: not_me
Assigned-To-Updated-Time: Fri, 29 Jan 2021 07:01:40 EST
Assigned-To-Updated-By: me
Assigned-To: you
Assigned-To-Updated-Time: 31 Jan 2021 07:01:40 EST
Assigned-To-Updated-By: not_me


     Name: Jacob Bennett
    Login: benne238
 Computer: 1.1.1.1
 Location: CARY 123
    Email: benne238@purdue.edu
    Phone: numberhere
   Office: I wish...
 UNIX Dir: dunno
 Zero Dir: dunno thatone either
    
  Subject: I need something from ECN


I am writing because I need something from ECN, 
thanks, Jacob
Bennett

*** Edited by: campb303 at: 01/01/2022 09:00:00 ***

I made an edit here



*** Edited by: campb303 at: 01/01/2022 12:29:38 ***

I also made an edit here


*** Status updated by: someoneelse at: 01/01/2022 12:30:13 ***
I made a status update
*** Edited by: personone at: 01/02/2022 12:31:15 ***

ooo, personone also edited this item
*** Replied by: personone at: 01/02/22 12:34:03 ***

Hello there.... could you be more specific?

Thanks,
personone

*** Edited by: persontwo at: 01/05/22 14:58:03 ***
I made an edit too! (persontwo)

*** Status updated by: personone at: 1/7/2022 15:40:55 ***
Something happened here
*** Edited by: personone at: 04/08/22 15:41:05 ***

i dont even know anymore




=== Additional information supplied by user ===

Subject: Re: I need something from ECN
From: "Bennett, Jacob" <benne238@purdue.edu>
Date: Tue, 3 Dec 2023 14:50:44 +0000
X-ECN-Queue-Original-Path: nothing
X-ECN-Queue-Original-URL: nothing

Hi! Thanks for the quick reply. I dunnno, I was hoping you could help me with that :/



*** Edited by: you at: none ***

*** Status updated by: personone at: 1/7/2022 15:40:55 ***
Something happened here

*** Edited by: personone at: 04/08/22 15:41:05 ***

i dont even know anymore


Thanks, Jacob
===============================================
"""

parsed_item = []
item_rule.parseString(raw_item).asList()

for assignment in getAssignments():
    parsed_item.insert(0, assignment)

for count, section in enumerate(parsed_item):
    if section['type'] == "parse_error": 
        parsed_item = parsed_item[:count + 1]
        break

print(json.dumps(parsed_item, indent=2))

Output:

[
  {
    "to": "you",
    "datetime": "31 Jan 2021 07:01:40 EST",
    "by": "not_me",
    "type": "assignment"
  },
  {
    "to": "not_me",
    "datetime": "Fri, 29 Jan 2021 07:01:40 EST",
    "by": "me",
    "type": "assignment"
  },
  {
    "Name": "Jacob Bennett",
    "Login": "benne238",
    "Computer": "1.1.1.1",
    "Location": "CARY 123",
    "Email": "benne238@purdue.edu",
    "Phone": "numberhere",
    "Office": "I wish...",
    "UNIX Dir": "dunno",
    "Zero Dir": "dunno thatone either",
    "Subject": "I need something from ECN",
    "type": "directory_information"
  },
  {
    "content": [
      "I am writing because I need something from ECN, \n",
      "thanks, Jacob\n",
      "Bennett"
    ],
    "type": "initial_message"
  },
  {
    "by": "campb303",
    "datetime": "2022-01-01T09:00:00-0500",
    "content": [
      "I made an edit here"
    ],
    "type": "edit"
  },
  {
    "by": "campb303",
    "datetime": "2022-01-01T12:29:38-0500",
    "content": [
      "I also made an edit here"
    ],
    "type": "edit"
  },
  {
    "by": "someoneelse",
    "datetime": "2022-01-01T12:30:13-0500",
    "content": [
      "I made a status update"
    ],
    "type": "status"
  },
  {
    "by": "personone",
    "datetime": "2022-01-02T12:31:15-0500",
    "content": [
      "ooo, personone also edited this item"
    ],
    "type": "edit"
  },
  {
    "by": "personone",
    "datetime": "2022-01-02T12:34:03-0500",
    "content": [
      "Hello there.... could you be more specific?\n",
      "\n",
      "Thanks,\n",
      "personone"
    ],
    "type": "reply_to_user"
  },
  {
    "by": "persontwo",
    "datetime": "2022-01-05T14:58:03-0500",
    "content": [
      "I made an edit too! (persontwo)"
    ],
    "type": "edit"
  },
  {
    "by": "personone",
    "datetime": "2022-01-07T15:40:55-0500",
    "content": [
      "Something happened here"
    ],
    "type": "status"
  },
  {
    "by": "personone",
    "datetime": "2022-04-08T15:41:05-0400",
    "content": [
      "i dont even know anymore"
    ],
    "type": "edit"
  },
  {
    "type": "parse_error",
    "datetime": "2021-06-09T13:05:29-0400",
    "expected": "Reply from user ending delimiter",
    "got": "*** Edited by: you at: none ***",
    "line_num": 74
  }
]
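The section rules above all share one shape: suppressed literal delimiters, named fields via `setResultsName`, and `SkipTo` for the variable-length parts. A minimal, self-contained sketch of that pattern (the marker text mirrors `reply_rule`; the input string is made up):

```python
import pyparsing as pp

# Minimal sketch of the reply_rule pattern: suppressed literal markers,
# named fields via setResultsName, and SkipTo for variable-length text.
reply_header = (
    pp.Literal("*** Replied by:").suppress()
    + pp.Word(pp.alphanums).setResultsName("by")
    + pp.Literal("at:").suppress()
    + pp.SkipTo("***").setResultsName("datetime")
    + pp.Literal("***").suppress()
)

result = reply_header.parseString("*** Replied by: campb303 at: 01/01/2022 09:00:00 ***")
print(result["by"])                # campb303
print(result["datetime"].strip())  # 01/01/2022 09:00:00
```

Note that the full rules above additionally call `.leaveWhitespace()`, which is what lets them match literals with embedded leading spaces and newlines.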

@benne238
Collaborator

Working pyparsing update

This version of the pyparsing parser does almost everything that our current parser does including formatting and sorting sections by date.

import pyparsing as pp
import json
import string
from dateutil import parser, tz
from datetime import datetime
import os, email.utils

info_from_user_start_delimiter = "=== Additional information supplied by user ==="
info_from_user_end_delimiter = "==============================================="

nested_expression_rule = (
    pp.Literal(info_from_user_start_delimiter) | 
    pp.Regex("\*\*\* Replied by: (.*) at: (.*) \*\*\*") | 
    pp.Regex("\*\*\* Edited by: (.*) at: (.*) \*\*\*") |
    pp.Regex("\*\*\* Status updated by: (.*) at: (.*) \*\*\*")
)

def errorHandler():
    def error_action_impl(s, l, t):
        location = (s[:l]).count('\n') + 1
        message = 'Did not encounter a starting delimiter for any section'

        if t[0][0] == info_from_user_start_delimiter:
            message = "Did not encounter the ending delimiter for additional informtion from user"

        parse_error = {
            "type": "parse_error",
            'datetime': getFormattedDate(str(datetime.now())),
            'expected': message,
            'got': t[0][0],
            'line_num': location
        }

        parsed_item.append(parse_error)
        return

    return error_action_impl


def checkForNested():
    def nested_action_impl(s, l, t):
        errorParse = {}
        nested_expressions_generator = nested_expression_rule.scanString(t[0])
        for token, start, end in nested_expressions_generator:
            errorParse = {
                "type": "parse_error",
                "datetime": getFormattedDate(str(datetime.now())),
                "expected": "Reply from user ending delimiter",
                "got": token[0],
                "line_num": (s[:start + l]).count("\n") + 1
            }
            break
        if errorParse: parsed_item.append(errorParse)

        return 
    return nested_action_impl

def addTypeKey(section_type):
    def parse_action_impl(s, l, t):
        t = t.asDict()
        if section_type == "reply_from_user":
            t["headers"] = str(t["headers"][0]).split("\n")
            for count, header in enumerate(t["headers"]):
                key, value = header.split(": ", maxsplit=1)
                t["headers"][count] = {"type":key, "content":value}
                
            for header in t["headers"]:
                if header["type"] == "Date":
                    t["datetime"] = header["content"]
                if header["type"] == "Subject":
                    t["subject"] = header["content"]
                if header["type"] == "From":
                    user_name, user_email = email.utils.parseaddr(header["content"])
                    t["from_name"] = user_name
                    t["from_email"] = user_email
                if header["type"] == "Cc":
                    ccList = [
                        {"name":user_name, "email":user_email} 
                        for user_name, user_email in email.utils.getaddresses([header["content"]])
                    ]
                    t["cc"] = ccList
                    

        unwantedKeys=[emptyKey for emptyKey in t.keys() if t[emptyKey] == ''] # makes a list of keys with empty values
        for key in unwantedKeys: del t[key] # removes empty keys

        t["type"] = section_type
        if "datetime" in t.keys(): t["datetime"] = getFormattedDate(t["datetime"])
        
        if "content" in t.keys(): 
            t["content"] = t["content"][0].strip()
            t["content"] = t["content"].splitlines(True)

        if t["type"] == "directory_information": 
            global directory_info 
            directory_info = t
            return

        parsed_item.append(t)
        return
    return parse_action_impl

def getAssignments() -> list:
    assignment_list = []
    for token, start, end in assignment_rule.scanString(headers):
        token_dict = token.asDict()
        token_dict["datetime"] = getFormattedDate(token_dict["datetime"])
        token_dict["type"] = "assignment"
        assignment_list.append(token_dict)
    
    return assignment_list
        
def storeHeaders():
    def parse_action_impl(s, l, t):

        global headers 
        headers = t[0][0]
        return

    return parse_action_impl

def getInitialMessageHeaders():
    initialMessageHeaders = {}

    subject = (
        (pp.LineStart() + pp.Literal("Subject: ")).suppress() + 
        pp.SkipTo(pp.LineEnd())
    ).scanString(headers)

    for token, start, end in subject:
        initialMessageHeaders["subject"] = token[0]

    from_email = (
        (pp.LineStart() + pp.Literal("From: ")).suppress() + 
        pp.SkipTo(pp.LineEnd())
    ).scanString(headers)

    for token, start, end in from_email:
        user_name, user_email = email.utils.parseaddr(token[0])
        initialMessageHeaders["from_name"] = user_name
        initialMessageHeaders["from_email"] = user_email

    to = (
        (pp.LineStart() + pp.Literal("To: ")).suppress() + 
        pp.SkipTo(pp.LineEnd())
    ).scanString(headers)

    for token, start, end in to:
        recipientList = [
            {"name":user_name, "email":user_email} 
            for user_name, user_email in email.utils.getaddresses(token)
        ]
        initialMessageHeaders["to"] = recipientList
    cc = (
        (pp.LineStart() + pp.Literal("CC: ")).suppress() + 
        pp.SkipTo(pp.LineEnd())
    ).scanString(headers)

    for token, start, end in cc:
        ccList = [
            {"name":user_name, "email":user_email} 
            for user_name, user_email in email.utils.getaddresses(token)
        ]
        initialMessageHeaders["cc"] = ccList

    datetime = (
        (pp.LineStart() + pp.Literal("Date: ")).suppress() + 
        pp.SkipTo(pp.LineEnd())
    ).scanString(headers)

    for token, start, end in datetime:
        initialMessageHeaders["datetime"] = getFormattedDate(token[0])

    return initialMessageHeaders

def getFormattedDate(date: str) -> str:
        """Returns the date/time formatted as RFC 8601 YYYY-MM-DDTHH:MM:SS+00:00.
        Returns empty string if the string argument passed to the function is not a datetime.
        See: https://en.wikipedia.org/wiki/ISO_8601

        **Returns:**
        ```        
        str: Properly formatted date/time recieved or empty string.
        ```
        """
        try:
            # This date is never meant to be used. The default attribute is just to set timezone.
            parsedDate = parser.parse(date, default=datetime(
                1970, 1, 1, tzinfo=tz.gettz('EDT')))
        except (ValueError, OverflowError):
            return ""

        parsedDateString = parsedDate.strftime("%Y-%m-%dT%H:%M:%S%z")

        return parsedDateString

# additional information supplied by user rule
info_from_user_rule = (pp.Dict(
    (info_from_user_start_delimiter + pp.LineEnd()).suppress() +
    pp.Literal("\n").setWhitespaceChars("").suppress() +
    (pp.Group(pp.SkipTo("\n\n"))).setResultsName("headers")  +
    (pp.Group(pp.SkipTo(info_from_user_end_delimiter + pp.LineEnd()).setParseAction(checkForNested())).setResultsName("content")) +
    (pp.Literal(info_from_user_end_delimiter) + pp.LineEnd()).suppress()
).setParseAction(addTypeKey("reply_from_user")))

reply_rule = (
    pp.Literal("\n*** Replied by: ").suppress() + 
    pp.Word(pp.alphanums).setResultsName("by")+
    pp.Literal(" at: ").suppress() +
    pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
    (pp.Literal(" ***") + pp.LineEnd()).suppress() +
    pp.Group(
        pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
    ).setResultsName("content") 
).leaveWhitespace().setParseAction(addTypeKey("reply_to_user"))

edit_rule = (
    pp.Literal("\n*** Edited by: ").suppress() + 
    pp.Word(pp.alphanums).setResultsName("by")+
    pp.Literal(" at: ").suppress() +
    pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
    (pp.Literal(" ***") + pp.LineEnd()).suppress() +
    pp.Group(
        pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
    ).setResultsName("content") 
).leaveWhitespace().setParseAction(addTypeKey("edit"))

status_update_rule = (
    pp.Literal("\n*** Status updated by: ").suppress() + 
    pp.Word(pp.alphanums).setResultsName("by")+
    pp.Literal(" at: ").suppress() +
    pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
    (pp.Literal(" ***") + pp.LineEnd()).suppress() +
    pp.Group(
        pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
    ).setResultsName("content") 
).leaveWhitespace().setParseAction(addTypeKey("status"))


directory_rule = pp.Dict(
    pp.Literal("\n").suppress().setWhitespaceChars("") +
    pp.Optional(pp.Group("Name" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Login" + pp.Literal(":").suppress() +  pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Computer" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Location" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Email" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Phone" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Office" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("UNIX Dir" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Zero Dir" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("User ECNDB" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Host ECNDB" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Optional(pp.Group("Subject" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
    pp.Literal("\n\n").suppress().setWhitespaceChars("")
).setParseAction(addTypeKey("directory_information"))

initial_message_rule = pp.Group(
pp.SkipTo(pp.Regex(info_from_user_start_delimiter) | pp.Regex('\\n\*\*\*')).leaveWhitespace()
).setResultsName("content").setParseAction(addTypeKey("initial_message"))

headers_rule = pp.Group(pp.SkipTo("\n\n", include=True)).setResultsName('headers').leaveWhitespace()

error_rule = pp.Group(pp.Word(string.printable) + pp.LineEnd()).setParseAction(errorHandler())

item_rule = (
    headers_rule.setParseAction(storeHeaders()).suppress() + #supresses the output of the headers to the parsed item
    pp.Optional(directory_rule) + 
    initial_message_rule + 
    pp.ZeroOrMore(
        (info_from_user_rule | reply_rule | edit_rule | status_update_rule) | 
        (error_rule)
    )
)

assignment_rule = (
        pp.Literal("Assigned-To: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("to") +
        pp.Literal("Assigned-To-Updated-Time: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("datetime") +
        pp.Literal("Assigned-To-Updated-By: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("by")
).setParseAction(addTypeKey("assignment"))

raw_item = """
Assigned-To: not_me
Assigned-To-Updated-Time: Fri, 29 Jan 2021 07:01:40 EST
Assigned-To-Updated-By: me
Assigned-To: you
Assigned-To-Updated-Time: 31 Jan 2021 07:01:40 EST
Assigned-To-Updated-By: not_me
To: hello@purdue.edu
Date: 1/1/1990 12:00:40 EST
CC: not_anyone@gmail.com
Subject: dunno
From: you


     Name: Jacob Bennett
    Login: benne238
 Computer: 1.1.1.1
 Location: CARY 123
    Email: benne238@purdue.edu
    Phone: numberhere
   Office: I wish...
 UNIX Dir: dunno
 Zero Dir: dunno thatone either
    
  Subject: I need something from ECN


I am writing because I need something from ECN, 
thanks, Jacob
Bennett

*** Edited by: campb303 at: 01/01/2022 09:00:00 ***

I made an edit here



*** Edited by: campb303 at: 01/01/2022 12:29:38 ***

I also made an edit here


*** Status updated by: someoneelse at: 01/01/2022 12:30:13 ***
I made a status update
*** Edited by: personone at: 01/02/2022 12:31:15 ***

ooo, personone also edited this item
*** Replied by: personone at: 01/02/22 12:34:03 ***

Hello there.... could you be more specific?

Thanks,
personone

*** Edited by: persontwo at: 01/05/22 14:58:03 ***
I made an edit too! (persontwo)

*** Status updated by: personone at: 1/7/2022 15:40:55 ***
Something happened here
*** Edited by: personone at: 04/08/22 15:41:05 ***

i dont even know anymore




=== Additional information supplied by user ===

Subject: Re: I need something from ECN
From: "Bennett, Jacob" <benne238@purdue.edu>
Date: Tue, 3 Dec 2023 14:50:44 +0000
X-ECN-Queue-Original-Path: nothing
X-ECN-Queue-Original-URL: nothing

Hi! Thanks for the quick reply. I dunnno, I was hoping you could help me with that :/



*** Edited by: you at: none ***

*** Status updated by: personone at: 1/7/2022 15:40:55 ***
Something happened here

*** Edited by: personone at: 04/08/22 15:41:05 ***

i dont even know anymore


Thanks, Jacob
===============================================
"""

parsed_item = []
headers = ""
directory_info = {}
item_rule.parseString(raw_item).asList()
initial_message_headers = getInitialMessageHeaders()


for assignment in getAssignments():
    parsed_item.insert(2, assignment)

for count, section in enumerate(parsed_item):
    if section['type'] == "parse_error": 
        parsed_item = parsed_item[:count + 1]
        break

for section in parsed_item:
    if section['type'] == "initial_message":
        for key in initial_message_headers.keys():
            section[key] = initial_message_headers[key]
        break

parsed_item = sorted(parsed_item, key = lambda dateTimeKey: parser.parse(dateTimeKey['datetime']))
parsed_item.insert(0, directory_info)
print(json.dumps(parsed_item, indent=2))

Output:

[
  {
    "Name": "Jacob Bennett",
    "Login": "benne238",
    "Computer": "1.1.1.1",
    "Location": "CARY 123",
    "Email": "benne238@purdue.edu",
    "Phone": "numberhere",
    "Office": "I wish...",
    "UNIX Dir": "dunno",
    "Zero Dir": "dunno thatone either",
    "Subject": "I need something from ECN",
    "type": "directory_information"
  },
  {
    "content": [
      "I am writing because I need something from ECN, \n",
      "thanks, Jacob\n",
      "Bennett"
    ],
    "type": "initial_message",
    "subject": "dunno",
    "from_name": "",
    "from_email": "you",
    "to": [
      {
        "name": "",
        "email": "hello@purdue.edu"
      }
    ],
    "cc": [
      {
        "name": "",
        "email": "not_anyone@gmail.com"
      }
    ],
    "datetime": "1990-01-01T12:00:40-0500"
  },
  {
    "to": "not_me",
    "datetime": "2021-01-29T07:01:40-0500",
    "by": "me",
    "type": "assignment"
  },
  {
    "to": "you",
    "datetime": "2021-01-31T07:01:40-0500",
    "by": "not_me",
    "type": "assignment"
  },
  {
    "type": "parse_error",
    "datetime": "2021-06-11T14:27:53-0400",
    "expected": "Reply from user ending delimiter",
    "got": "*** Edited by: you at: none ***",
    "line_num": 79
  },
  {
    "by": "campb303",
    "datetime": "2022-01-01T09:00:00-0500",
    "content": [
      "I made an edit here"
    ],
    "type": "edit"
  },
  {
    "by": "campb303",
    "datetime": "2022-01-01T12:29:38-0500",
    "content": [
      "I also made an edit here"
    ],
    "type": "edit"
  },
  {
    "by": "someoneelse",
    "datetime": "2022-01-01T12:30:13-0500",
    "content": [
      "I made a status update"
    ],
    "type": "status"
  },
  {
    "by": "personone",
    "datetime": "2022-01-02T12:31:15-0500",
    "content": [
      "ooo, personone also edited this item"
    ],
    "type": "edit"
  },
  {
    "by": "personone",
    "datetime": "2022-01-02T12:34:03-0500",
    "content": [
      "Hello there.... could you be more specific?\n",
      "\n",
      "Thanks,\n",
      "personone"
    ],
    "type": "reply_to_user"
  },
  {
    "by": "persontwo",
    "datetime": "2022-01-05T14:58:03-0500",
    "content": [
      "I made an edit too! (persontwo)"
    ],
    "type": "edit"
  },
  {
    "by": "personone",
    "datetime": "2022-01-07T15:40:55-0500",
    "content": [
      "Something happened here"
    ],
    "type": "status"
  },
  {
    "by": "personone",
    "datetime": "2022-04-08T15:41:05-0400",
    "content": [
      "i dont even know anymore"
    ],
    "type": "edit"
  }
]
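Both `errorHandler` and `checkForNested` above convert a character offset `l` into a 1-based line number with `s[:l].count('\n') + 1`, counting the newlines that precede the offset. A tiny standalone illustration (the example string is made up):

```python
# Convert a character offset into a 1-based line number by counting the
# newlines that precede it, as errorHandler and checkForNested do.
s = "line one\nline two\nline three"
loc = s.index("line three")
line_num = s[:loc].count("\n") + 1
print(line_num)  # 3
```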

@benne238
Collaborator

Current parser vs pyparsing parser

Here is an example nested delimiter in me5 in the current queue:

...
=== Additional information supplied by user ===

Subject: <subject>
From: <someone>
Date: <sometime>
X-ECN-Queue-Original-Path: <somepath>
X-ECN-Queue-Original-URL: <someurl>

<some content would go here>

*** Replied by: flowersr at: 06/01/21 15:38:19 ***

...

Current Output:

{
    "type": "parse_error", 
    "datetime": "2021-06-11T14:42:37-0400", 
    "file_path": "/home/pier/e/queue/Mail/me/5", 
    "expected": "Did not encounter a reply-from-user ending delimiter", 
    "got": "\n", 
    "line_num": 391
}

Pyparsing Output:

{
    "type": "parse_error",
    "datetime": "2021-06-11T14:40:03-0400",
    "expected": "Reply from user ending delimiter",
    "got": "*** Replied by: flowersr at: 06/01/21 15:38:19 ***",
    "line_num": 392
  }

Changes

As seen above, the three main differences are:

  1. The lack of a file location in the pyparsing output (this can easily be added, but is excluded for the moment because it is not yet implemented in the Item class)
  2. The expected key in the pyparsing output is a little less redundant
  3. The got and line_num keys both point to the line that actually caused the nested delimiter in the pyparsing output, which is more helpful than pointing to a newline as the current parser's output does

The changes in the error-parse section are the only changes thus far that have been implemented.

@benne238
Collaborator

benne238 commented Jul 6, 2021

multiprocessing

The python multiprocessing package allows for the creation of sub processes that can execute asynchronously from each other. This is useful for us when parsing the entire contents of a queue at once as we can now parse multiple items at once as opposed to waiting for each item to be parsed sequentially. The Queue class was modified to call a sub process for each item in the __get_items() function:

    def __get_items(self, headers_only: bool) -> list:
        """Returns a list of items for this Queue

        Args:
                headers_only (bool): If True, loads Item headers.

        Returns:
                list: a list of items for this Queue
        """
        items = []
+       valid_items = []
+       multi_item_processes = multiprocessing.Pool(processes=32)

        for item in os.listdir(self.path):
            item_path = Path(self.path, item)

            is_file = True if os.path.isfile(item_path) else False

            if is_file and is_valid_item_name(item):
-                 items.append(Item(self.name, item, headers_only))
+                 valid_items.append(item)

+         items = multi_item_processes.starmap_async(Item, [(self.name, item, headers_only) for item in valid_items]).get()
+         multi_item_processes.close()
+         multi_item_processes.join()
+
        return items

After making this change, parsing the entire live queue takes about 70 seconds, compared to approximately 130 seconds without multiprocessing. This is a significant improvement; however, other packages exist, such as Ray, that might make parsing entire queues even faster.

@campb303
Collaborator Author

campb303 commented Jul 6, 2021

It appears that multi_item_processes = multiprocessing.Pool(processes=32) is setting the number of processes that can run at the same time. Can the hardcoded 32 be replaced with some system agnostic number that is calculated at run time so we can have whatever number of cores is available to us on the machine?

It appears that instead of loading items sequentially in the for loop, we now generate a list of valid item names and store it in valid_items. After, we run items = multi_item_processes.starmap_async(Item, [(self.name, item, headers_only) for item in valid_items]).get(). Please explain that line more.

  • What does the starmap_async() function do?
  • What are the arguments?
  • What are the results?
  • What does multi_item_processes.close() do?
  • What does multi_item_processes.join() do?
  • What's going on behind the scenes to make this faster? Do you have a diagram or visualization that would help?

@campb303 campb303 removed this from the write-access milestone Jul 6, 2021
@campb303 campb303 removed the high-priority Needs immediate extra focus label Jul 6, 2021
@campb303
Collaborator Author

campb303 commented Jul 6, 2021

Further multiprocessing talk should go in #35 .

@benne238
Collaborator

benne238 commented Jul 6, 2021

It appears that multi_item_processes = multiprocessing.Pool(processes=32) is setting the number of processes that can run at the same time. Can the hardcoded 32 be replaced with some system agnostic number that is calculated at run time so we can have whatever number of cores is available to us on the machine?

Yes, that can easily be changed by using the multiprocessing.cpu_count() function, which returns the number of processors on the machine.
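For reference, a pool sized to the machine might look like this (a sketch, not the exact Queue code):

```python
import multiprocessing

if __name__ == "__main__":
    # Size the pool to the number of logical processors instead of
    # hardcoding a value like 32.
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    pool.close()
    pool.join()
```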


items = multi_item_processes.starmap_async(Item, [(self.name, item, headers_only) for item in valid_items]).get()
This line returns a list of items and stores that list in items.
The multi_item_processes.starmap_async() call creates sub processes to run a function in. It takes 2 arguments:

  1. a function to be run in a sub process (in this case Item)
  2. a list of tuples, each one holding the arguments for one call of that function (in this case built from (self.name, item, headers_only))

However, the list of tuples needs to represent every item in a given queue, which is why a list comprehension creates one tuple of the queue name, the item, and the headers_only argument for every item in the queue. The resulting list looks like [("bidc", 1, False), ("bidc", 2, False), ("bidc", 3, False), ...], and each tuple is unpacked as the arguments to the Item class.

The starmap_async() function runs each call independently (asynchronously) of the others, so each Item finishes as soon as its own work is done, regardless of the order it was "queued" in. (For example, if bidc1 and then bidc2 are submitted in that order, but bidc2 is a smaller item, bidc2 can finish first.)

Finally, starmap_async() does not return the return values of the passed function directly, but rather an AsyncResult object from the multiprocessing package. To get the list of Item return values, the get() method is called on it at the end.


What does multi_item_processes.close() do?
What does multi_item_processes.join() do?

multi_item_processes.close() states that no more "jobs" will be submitted to the multi_item_processes object; it must be called before join() can be used on that same object.

multi_item_processes.join() waits until all of the sub processes created with multi_item_processes have finished executing before advancing to the next line in the Python script.
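Putting these pieces together, the lifecycle described above can be sketched as follows (make_item is a hypothetical stand-in for the real Item class):

```python
import multiprocessing

def make_item(queue, number, headers_only):
    # Hypothetical stand-in for the real Item constructor.
    return {"queue": queue, "number": number, "headers_only": headers_only}

if __name__ == "__main__":
    args = [("bidc", n, False) for n in (1, 2, 3)]
    pool = multiprocessing.Pool(processes=2)
    async_result = pool.starmap_async(make_item, args)  # returns an AsyncResult
    pool.close()                # no more jobs will be submitted
    items = async_result.get()  # blocks until every result is ready
    pool.join()                 # wait for the worker processes to exit
    # get() returns results in argument order, even though execution is async.
    print([item["number"] for item in items])  # [1, 2, 3]
```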


Why using multiprocessing is faster

[Diagram: serial vs. multiprocessing timelines for parsing items]

The way we parsed items without multiprocessing (also called serial processing) means that everything happens sequentially: one item is parsed, then the next, and so on. The total time to create every item in a queue is the sum of the time to create each item individually.

Multiprocessing allows multiple items to be created and parsed at once instead of waiting for one item to finish before starting the next.
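The difference can be seen with a toy workload (slow_parse is a made-up stand-in for item parsing; the 0.2-second sleep simulates parsing time):

```python
import multiprocessing
import time

def slow_parse(item_number):
    time.sleep(0.2)  # stand-in for parsing one item
    return item_number

if __name__ == "__main__":
    items = list(range(8))

    start = time.perf_counter()
    [slow_parse(i) for i in items]  # serial: one item after another
    serial_time = time.perf_counter() - start

    start = time.perf_counter()
    with multiprocessing.Pool(processes=4) as pool:
        pool.map(slow_parse, items)  # parallel: four items at a time
    parallel_time = time.perf_counter() - start

    # Serial takes about 8 * 0.2s; parallel takes roughly a quarter of
    # that, plus the overhead of starting the worker processes.
    print(f"serial: {serial_time:.2f}s, parallel: {parallel_time:.2f}s")
```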

@campb303
Collaborator Author

campb303 commented Jul 6, 2021

All of the above makes sense. According to your timing we're looking at parsing speeds of approx. 2x, yes? Why is this not closer to 32x faster? I understand we won't get a perfect 32x faster because not every item takes the same time to load but only 2x faster seems odd.

@benne238
Collaborator

benne238 commented Jul 7, 2021

Performance and multiprocessing

The lack of performance is partially due to some items taking longer to parse with pyparsing, but a significant hit comes from the way multiprocessing is implemented.

In this comment, multiprocessing is implemented such that multiple items in a queue are processed at once. However, only one single queue is processed at a time. This implementation can parse all the items with content in the live queue in approximately 70 seconds.

To contrast this method of multiprocessing, I decided to implement multiprocessing so that multiple queues are processed at once, while the items within each queue are still parsed sequentially, one at a time:

test.py

import multiprocessing
import webqueue2api.parser.queue

valid_queues = webqueue2api.parser.queue.get_valid_queues()
multi_queue_processes = multiprocessing.Pool(processes=multiprocessing.cpu_count())
items = multi_queue_processes.starmap_async(webqueue2api.parser.queue.Queue, [(queue, False) for queue in valid_queues]).get()
multi_queue_processes.close()
multi_queue_processes.join()

This implementation of multiprocessing was able to parse all of the items with content in the live queue in approximately 60 seconds

As a third way to implement multiprocessing, the name of every valid item in each queue was retrieved and put into a list, so that any item could be processed alongside any other item regardless of which queue it belongs to:
test.py

import multiprocessing
import webqueue2api.parser.queue
import webqueue2api.parser.item
from datetime import datetime

all_valid_items = []
valid_queues = webqueue2api.parser.queue.get_valid_queues()
valid_queues = [webqueue2api.parser.queue.Queue(name=queue, headers_only=True) for queue in valid_queues]

for queue in valid_queues:
    for item in queue.items:
        all_valid_items.append((queue.name, item.number, False))

start_time = datetime.timestamp(datetime.now())
multi_queue_processes = multiprocessing.Pool(processes=multiprocessing.cpu_count())
items = multi_queue_processes.starmap_async(webqueue2api.parser.item.Item, all_valid_items).get()
multi_queue_processes.close()
multi_queue_processes.join()

end_time = datetime.timestamp(datetime.now())

print(f"Time to parse all items with content: {(end_time - start_time)} seconds")

Note: it is possible that the list of all_valid_items in this example becomes outdated and causes errors before every item is processed.

After getting all of the valid item names, this implementation took approximately 20 seconds to parse every item with content.

@campb303
Collaborator Author

campb303 commented Jul 8, 2021

Re: our call today; threading at the item level is currently the most efficient method we have for loading queues in parallel. I'd like to see if we can nest threaded workloads by parallelizing a Queue's Item loading and parallelizing loading of the queues. If we can do this, we can rewrite the load_queues() function with the following signature:

def load_queues(*args: list, headers_only: bool = True) -> list:
    """Load Queues requested.

            Args:
            *args (list): List of strings of Queue names. If only one name is given, loading happens sequentially. If multiple names are passed, loading happens in parallel.
    """
    pass

@benne238
Collaborator

benne238 commented Jul 8, 2021

Allowing child sub processes from a sub process

By default, multiprocessing does not allow a subprocess to create child subprocesses of its own. When this is attempted, an exception is raised and the script exits:

AssertionError: daemonic processes are not allowed to have children

However, according to this stackoverflow answer, it is possible to create a custom class that allows for child sub processes to be created from an already existing sub process.
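A minimal reproduction of the restriction, assuming a worker function that tries to create its own Pool (parse_queue and the queue name are made up for illustration):

```python
import multiprocessing

def parse_queue(queue_name):
    # Creating a Pool inside a worker of another Pool raises
    # "AssertionError: daemonic processes are not allowed to have children"
    with multiprocessing.Pool(processes=2) as inner:
        return inner.map(len, [queue_name])

if __name__ == "__main__":
    with multiprocessing.Pool(processes=2) as outer:
        try:
            outer.map(parse_queue, ["bidc"])
        except AssertionError as error:
            # The worker's exception is re-raised in the parent process.
            print(error)
```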

For this implementation, the changes made to __get_items() in this comment were kept. A script outside of the package was created to make use of loading multiple queues at once:

test.py

import multiprocessing
import multiprocessing.pool
import webqueue2api.api.resources.queue
import webqueue2api.parser.queue
from datetime import datetime

start_time = datetime.timestamp(datetime.now())
# custom class creation based on stackoverflow answer
class NoDaemonProcess(multiprocessing.Process):
    # make 'daemon' attribute always return False
    def _get_daemon(self):
        return False
    def _set_daemon(self, value):
        pass
    daemon = property(_get_daemon, _set_daemon)

# We sub-class multiprocessing.pool.Pool instead of multiprocessing.Pool
# because the latter is only a wrapper function, not a proper class.
class MyPool(multiprocessing.pool.Pool):
    Process = NoDaemonProcess

valid_queues = webqueue2api.parser.queue.get_valid_queues()
headers_only = False
multi_queue_process = MyPool(processes=multiprocessing.cpu_count())

queues = multi_queue_process.starmap_async(webqueue2api.parser.queue.Queue, [(queue, headers_only) for queue in valid_queues]).get()

multi_queue_process.close()
multi_queue_process.join()
end_time = datetime.timestamp(datetime.now())
print(f'Total time to parse {"headers only" if headers_only else "with content"}: {end_time - start_time} seconds')

With this implementation, it takes approximately 36 seconds to parse every item in every queue with content

@benne238
Collaborator

benne238 commented Jul 9, 2021

load_queues() updated function

Taking into account that nested processing is now possible, the load_queues() function was updated from this:

def load_queues() -> list:
    """Return a list of Queues for each queue.

    Returns:
        list: list of Queues for each queue.
    """
    queues = []

    for queue in get_valid_queues():
        queues.append(Queue(queue))

    return queues

To this:

def load_queues(*queues, headers_only: bool = True) -> list:
    """Returns a list of queues

    Example:
            [example]

    Args:
            headers_only (bool, optional): Whether or not the content of items in the queue should be loaded. Defaults to True.
            *queues: List of strings that represent Queue names. 

    Returns:
            list: A list of all the queues that were given as arguments, or all of the queues if no queues were specified
    """    

    # custom class creation based on stackoverflow answer
    class NoDaemonProcess(multiprocessing.Process):
        # make 'daemon' attribute always return False
        def _get_daemon(self):
            return False
        def _set_daemon(self, value):
            pass
        daemon = property(_get_daemon, _set_daemon)

    # We sub-class multiprocessing.pool.Pool instead of multiprocessing.Pool
    # because the latter is only a wrapper function, not a proper class.
    class MyPool(multiprocessing.pool.Pool):
        Process = NoDaemonProcess

    if len(queues) == 0: queues_to_load = get_valid_queues()

    elif len(queues) == 1: return [Queue(name=queues[0], headers_only=headers_only)]

    else: queues_to_load = queues
    
    multi_queue_process = MyPool(processes=multiprocessing.cpu_count())
    loaded_queues = multi_queue_process.starmap_async(Queue, [(queue, headers_only) for queue in queues_to_load]).get()
    multi_queue_process.close()
    multi_queue_process.join()

    return loaded_queues
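The argument handling of the new signature can be illustrated with a toy stand-in (the queue names below are made up, and load() only mirrors the branching of load_queues() without touching any queue data):

```python
def load(*queues, headers_only=True):
    # Mirrors load_queues(): no names -> load every valid queue,
    # one name -> load it sequentially, several names -> load in parallel.
    all_queues = ["bidc", "che", "me"]  # stand-in for get_valid_queues()
    if len(queues) == 0:
        names = all_queues
    elif len(queues) == 1:
        return [(queues[0], headers_only)]
    else:
        names = list(queues)
    return [(name, headers_only) for name in names]

print(load())                                   # every queue, headers only
print(load("bidc"))                             # a single queue, sequential
print(load("bidc", "che", headers_only=False))  # several queues, parallel
```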

@benne238
Collaborator

Summary of changes made to the api with pyparsing and multiprocessing

pyparsing changes

The parser is now a formal grammar, built with pyparsing, that uses a series of rules to extract information from an item and format it into a JSON structure that the frontend can understand.
Some specific changes made to the parser include:

  • Making more use of the different features in the email package, such as the raise_on_defect policy, which raises an error if a header is malformed. (see 8289ae7)
  • More explicit rules. The rules for some sections were modified slightly so that each section must fit an explicit format, and anything outside that format raises an error. (see 66b65d7, 5fbc373, and a3fafba)
  • General bug fixes

multiprocessing

To enhance the speed of the new pyparsing-based parser, which is slower than our original parser, multiprocessing was implemented so that multiple items and multiple queues can be parsed at once instead of waiting for each item in each queue sequentially.
Some specific changes made with multiprocessing:

  • Allowing multiple queues to be parsed at once along with having multiple items in each queue be parsed at once
    • By default subprocesses aren't allowed to create subprocesses of their own, so a custom class, based on the existing multiprocessing package, was created to allow this behavior (see 81e28e5)
  • Modified the load_queues() function to accept arguments, including which queues to parse and whether they should be loaded with headers only (see 55cb9ad)

Still needs to be done

For the correct implementation of multiprocessing to work, I think some changes need to be made to this file to properly make use of the new load_queues() function; I will confer with @campb303 on this.

@campb303
Collaborator Author

campb303 commented Aug 2, 2021

Closed by #41

@campb303 campb303 closed this as completed Aug 2, 2021