Rewrite parser in a formal grammar #25
Rewriting the parser using a formal grammar

While we are able to successfully parse any given item and return an appropriate JSON structure, our current implementation of the ECNQueue parser is over 1,000 lines long and is difficult to read. There are a couple of solutions to this problem:
Using a formal grammar

Currently, there is not a clearly defined, easy-to-read set of rules in the parser; it is a series of
However, this gets cumbersome as items grow in size because this logic must be repeated for each and every line. Using a formal grammar could alleviate two problems: readability and complexity.
There are several things that stand out with how PLY (and some of the other parsers) work:
What this means for us is that a series of regular expressions would have to be developed to separate each section of an item. However, this approach has the potential to greatly reduce the length and complexity of the parser to something that is easy to read.

Separating the parser into multiple scripts

This step, depending on our implementation, should be done to reduce the amount of code in one script and increase the readability of the backend. In the parser's current state, it would make sense to separate the parser by function (and maybe split the main parser function into multiple functions). Seeing how there will be a rewrite of the parser anyway, it might make more sense not to separate out the parser until a formal grammar is being used |
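As a rough sketch of that regular-expression approach, hypothetical delimiter patterns and the standard library's `re.split` could carve an item into sections (the delimiter strings here are guesses based on the examples later in this thread, not the final grammar):

```python
import re

# Hypothetical delimiter patterns; one alternation per section type.
# re.MULTILINE anchors ^/$ to line boundaries so only whole delimiter
# lines split the item.
DELIMITERS = re.compile(
    r"^(=== Additional information supplied by user ===|"
    r"\*\*\* (?:Status updated|Edited|Replied) by: .* \*\*\*)$",
    re.MULTILINE,
)

item = (
    "initial message\n"
    "*** Edited by: someone at: now ***\n"
    "an edit\n"
)

# The capturing group keeps the delimiter lines in the result, so each
# section can later be typed by the delimiter that introduced it.
parts = DELIMITERS.split(item)
```

Here `parts` alternates between section bodies and the delimiter lines that separate them, which is the shape a sectioning pass would want.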
Summary of the Parsing In Python: Tools And Libraries article mentioned above: parsers come in one of three variants:
Parsers have a general structure of two tools:
(Some parsers do not have a lexer and analyze raw input directly. These are called scannerless parsers.) We currently have a custom parser built from scratch. However, it is slow and difficult to maintain. Of the three options above, the first is not viable because there is likely no pre-written library for this exact job, and the second is what we already have. So the operative questions, in order, are:
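The two-tool structure above (a lexer feeding a parser) can be sketched with the standard library; this is a toy illustration with made-up token rules, not the ECNQueue grammar:

```python
import re

# Lexer: turns raw text into (type, value) tokens using ordered rules.
TOKEN_SPEC = [
    ("HEADER", r"=== .+? ==="),  # a section delimiter
    ("WORD", r"\S+"),            # any other whitespace-separated chunk
    ("SKIP", r"\s+"),            # whitespace between tokens
]

def lex(text):
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            match = re.compile(pattern).match(text, pos)
            if match:
                if name != "SKIP":
                    yield (name, match.group())
                pos = match.end()
                break

# Parser: consumes the token stream and builds a structure from it.
def parse(tokens):
    sections = []
    for kind, value in tokens:
        if kind == "HEADER":
            sections.append({"header": value, "words": []})
        elif sections:
            sections[-1]["words"].append(value)
    return sections

result = parse(lex("=== Additional information supplied by user ===\nhello world"))
```

The point of the split is that the parser never touches raw text; it only sees typed tokens, which is what makes grammar rules short and readable.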
Related questions include:
|
Mark Senn is experienced with Perl, a language known for its data processing abilities, and has experience writing parsers in Perl. He believes that Perl's parsing support would make it a great candidate for the job. A quick Google search shows very simple parsers with very little code. Ideally, I'd like to write a parser using Python to avoid having to (de)serialize data. More research is needed. |
pyparsing

The pyparsing module looks like a Python-native way to create custom parsing grammars directly in Python. Simple example:

# https://pyparsing-docs.readthedocs.io/en/latest/HowToUsePyparsing.html#hello-world
import pyparsing as pp
greet = pp.Word(pp.alphas) + "," + pp.Word(pp.alphas) + "!"
for greeting_str in [
"Hello, World!",
"Bonjour, Monde!",
"Hola, Mundo!",
"Hallo, Welt!",
]:
    greeting = greet.parseString(greeting_str)
    print(greeting)

Output:

['Hello', ',', 'World', '!']
['Bonjour', ',', 'Monde', '!']
['Hola', ',', 'Mundo', '!']
['Hallo', ',', 'Welt', '!']
This looks like a relatively easy module to use, which should allow for the easier creation of a formal grammar that can parse out the different sections within a given item.

Possible Issues

The documentation is not the greatest, and some of the advanced features that we might want to use are poorly documented, so learning how to use this tool will have a learning curve hindered by that poor documentation. After attempting to code a basic program that would match the string located between two delimiters, I found that using this tool was difficult due to the lack of information available on it. An example of a parser that might be used to separate out all of the "additional information from user" sections from a given item might look like this:

Example.py:

import pyparsing as pp
import string
info_from_user = (
pp.Literal("=== Additional information supplied by user ===\n") +
pp.Word(string.printable, excludeChars="=") +
pp.Literal("===============================================\n")
)
reply = ("=== Additional information supplied by user ===\n"+
"\n"+
"Subject: stuff\n"+
"From: someone\n"+
"Date: some time\n"+
"X-ECN-Queue-Original-Path: https://google.com\n"+
"X-ECN-Queue-Original-URL: https://amazon.com\n"+
"\n"+
"Thanks.\n"
"\n"+
"===============================================\n"+
"This shouldn't be matched"
)
print(info_from_user.parseString(reply))

One problem with this example, though, is that if an equals sign is included anywhere outside of the delimiters, an exception will occur. While this has the potential to be useful, it is difficult to use. |
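For comparison, the same between-two-delimiters match can be written with the standard library's `re` module, where a non-greedy `.*?` avoids the excluded-character problem entirely (an illustrative alternative, not the approach settled on in this thread):

```python
import re

start = "=== Additional information supplied by user ==="
end = "==============================================="

# re.DOTALL lets "." cross newlines; the non-greedy ".*?" stops at the
# first end delimiter, so "=" characters inside the body are harmless.
section_re = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)

reply = (
    "=== Additional information supplied by user ===\n"
    "Date: some time\n"
    "A body with an = sign in it.\n"
    "===============================================\n"
    "This shouldn't be matched"
)

body = section_re.search(reply).group(1)
```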
Next steps are to create an example that demonstrates a parser's benefits beyond what we already have. |
Update

Note: this is barebones, as I am still learning and attempting to use all of the different features of pyparsing, but this code can look through a given string and return a list of all the "additional info from user" sections in only a few lines of code.

#https://stackoverflow.com/questions/35073566/custom-delimiter-using-pyparsing
import string
import pyparsing as pp
import os
#finds all of the reply from user delimiters
info_from_user_start_delimiter = "=== Additional information supplied by user ==="
info_from_user_end_delimiter = "==============================================="
info_from_user = (
    pp.originalTextFor( # returns the matched text exactly as it was before parsing
        pp.nestedExpr( # defines a start and end delimiter; everything in between is matched (including the delimiters)
            info_from_user_start_delimiter,
            info_from_user_end_delimiter
        )
    )
)
item = ("This shouldn't be matched\n" +
"\n=== Additional information supplied by user ===\n"+
"\n"+
"Subject: stuff\n"+
"From: someone\n"+
"Date: some time\n"+
"X-ECN-Queue-Original-Path: https://google.com\n"+
"X-ECN-Queue-Original-URL: https://amazon.com\n"+
"\n"+
"This is a message.\n"+
"Thanks,\n"+
"me"+
"\n"+
"===============================================\n"+
"This shouldn't be matched With anything\n" +
"Neither should this\n"+
"\n" +
"*** Status updated by: me at: now ***\n"+
"no match here either\n" +
"=== Additional information supplied by user ===\n"+
"\n"+
"Subject: more\n"+
"From: not me\n"+
"Date: right meow\n"+
"X-ECN-Queue-Original-Path: https://gogle.com\n"+
"X-ECN-Queue-Original-URL: https://amzon.com\n"+
"\n"+
"This is a message that should be matched too.\n"+
"Thanks,\n"+
"me"+
"\n"+
"===============================================\n"
)
parsed_item = (info_from_user.searchString(item)).asList()
print(parsed_item)

Output:
|
Pyparsing benefits and drawbacks

This code parses most of the sections of any given item (without error parsing). One limitation at the moment is differentiating which section was parsed within a given list (as seen in the output below, there is nothing indicating which section was parsed except for the content of the section itself). I believe there is a simple way to do this, but I'll need to look into it more.

Drawbacks:
Benefits:
Updated code:

import pyparsing as pp
#finds all of the reply from user delimiters
info_from_user_start_delimiter = "=== Additional information supplied by user ==="
info_from_user_end_delimiter = "==============================================="
info_from_user_rule = pp.originalTextFor( # returns the matched text exactly as it was before parsing
    pp.nestedExpr( # defines a start and end delimiter; everything in between is matched (including the delimiters)
        info_from_user_start_delimiter,
        info_from_user_end_delimiter
    )
)
# finds all status updates
status_update_rule = pp.originalTextFor(
    pp.Regex(r"(\*{3} Status updated by: )(.*)(at: (.*)\*{3})") +
    pp.SkipTo((pp.LineEnd() + info_from_user_start_delimiter | "***"))
)
# finds all edits
edit_rule = pp.originalTextFor(
    pp.Regex(r"(\*{3} Edited by: )(.*)(at: (.*)\*{3})") +
    pp.SkipTo((pp.LineEnd() + info_from_user_start_delimiter | "***"))
)
# finds all ECN replies
reply_rule = pp.originalTextFor(
    pp.Regex(r"(\*{3} Replied by: )(.*)(at: (.*)\*{3})") +
    pp.SkipTo((pp.LineEnd() + info_from_user_start_delimiter | "***"))
)
# combination of all the defined rules from above
parse_item = (info_from_user_rule | status_update_rule | edit_rule | reply_rule)
item = ("\n=== Additional information supplied by user ===\n"+
"\n"+
"Subject: stuff\n"+
"From: someone\n"+
"Date: some time\n"+
"X-ECN-Queue-Original-Path: https://google.com\n"+
"X-ECN-Queue-Original-URL: https://amazon.com\n"+
"\n"+
"This is a message.\n"+
"Thanks,\n"+
"me"+
"\n"+
"===============================================\n"+
"*** Status updated by: you at: yesterday ***\n"+
"more status stuff\n" +
"\n\n\n\n"+
"*** Status updated by: that_guy at: tmrw ***\n"+
"status update\n" +
"\n"+
"*** Status updated by: me at: now ***\n"+
"this is a status update\n" +
"*** Edited by: someoneelse at: 03/03/21 10:09:52 ***\n" +
"this is an edit\n"+
"=== Additional information supplied by user ===\n"+
"\n"+
"Subject: more\n"+
"From: not me\n"+
"Date: right meow\n"+
"X-ECN-Queue-Original-Path: https://gogle.com\n"+
"X-ECN-Queue-Original-URL: https://amzon.com\n"+
"\n"+
"This is a message that should be matched too.\n"+
"Thanks,\n"+
"me"+
"\n"+
"===============================================\n"
)
# prints the output after searching through the item using the parse_item rule
print(parse_item.searchString(item))

Output:

[
[
"=== Additional information supplied by user ===\n\nSubject: stuff\nFrom: someone\nDate: some time\nX-ECN-Queue-Original-Path: https://google.com\nX-ECN-Queue-Original-URL: https://amazon.com\n\nThis is a message.\nThanks,\nme\n==============================================="
],
[
"*** Status updated by: you at: yesterday ***\nmore status stuff"
],
[
"*** Status updated by: that_guy at: tmrw ***\nstatus update"
],
[
"*** Status updated by: me at: now ***\nthis is a status update"
],
[
"*** Edited by: someoneelse at: 03/03/21 10:09:52 ***\nthis is an edit"
],
[
"=== Additional information supplied by user ===\n\nSubject: more\nFrom: not me\nDate: right meow\nX-ECN-Queue-Original-Path: https://gogle.com\nX-ECN-Queue-Original-URL: https://amzon.com\n\nThis is a message that should be matched too.\nThanks,\nme\n==============================================="
]
] |
Performance notes

After talking with @campb303, the parser above is significantly slower than what we currently have: the current parser handles a specific item in about 0.1 seconds, while the new parser above parses the same item in about 0.5 seconds. While the tradeoff is significantly more readable code, we decided that it is not worth such a significant hit in performance. There may be ways to optimize the pyparsing code above; if it can't be done, then I should look into other parsers, specifically ones that are more performant (ANTLR was mentioned due to its basis in C). |
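If optimization is attempted, a small `timeit` harness makes comparisons like the 0.1 s vs 0.5 s figures reproducible. The two parser functions below are hypothetical placeholders for the hand-written parser and the pyparsing one:

```python
import timeit

# Hypothetical stand-ins; in a real benchmark these would be the current
# hand-written parser and the pyparsing-based parser, fed the same item.
def current_parser(item):
    return item.splitlines()

def candidate_parser(item):
    return [line for line in item.splitlines()]

item = "line of an item\n" * 1000

# repeat=3 and min() reduce timing noise from other processes.
current_time = min(timeit.repeat(lambda: current_parser(item), number=100, repeat=3))
candidate_time = min(timeit.repeat(lambda: candidate_parser(item), number=100, repeat=3))
print(f"current: {current_time:.4f}s candidate: {candidate_time:.4f}s "
      f"slowdown: {candidate_time / current_time:.2f}x")
```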
Dictionary of replies

After some tinkering, I was able to return a list of dictionaries that point to their respective sections (see output below).

https://stackoverflow.com/questions/29282878/distinguish-matches-in-pyparsing (source for some of the functions below)

import pyparsing as pp
def makeDecoratingParseAction(marker):
    def parse_action_impl(s, l, t): # need to look into how this function works
        #print(t)
        fullcontent = ''.join(t)
        return {"type": marker,
                "content": fullcontent.split("\n")
               }
    return parse_action_impl
def formatReplyFromUser():
    def parse_action_impl(t):
        # print(t)
        return {
            "type": "reply_from_user",
            "subject": t[2].strip(),
            "from": t[4].strip(),
            "date": t[6].strip(),
            t[7]: t[8].strip(),
            t[9]: t[10].strip(),
            "content": t[11]
        }
    return parse_action_impl
#finds all of the reply from user delimiters
info_from_user_start_delimiter = "=== Additional information supplied by user ==="
info_from_user_end_delimiter = "\n==============================================="
info_from_user_rule = (
(
# matches everything between the two info_from_user delimiters
info_from_user_start_delimiter +
"\n\n" +
pp.Regex("Subject: ") + pp.Regex("(.*)\\n") +
pp.Regex("From: ") + pp.Regex("(.*)\\n") +
pp.Regex("Date: ") + pp.Regex("(.*)\\n")+
pp.Regex("X-ECN-Queue-Original-Path: ") + pp.Regex("(.*)\\n") +
pp.Regex("X-ECN-Queue-Original-URL: ") + pp.Regex("(.*)\\n") +
pp.SkipTo(info_from_user_end_delimiter)
).setWhitespaceChars('') # ensures all whitespace is captured from the item
).setParseAction(formatReplyFromUser()) # creates a dictionary
status_update_rule = (
    (
        pp.Regex(r"(\*{3} Status updated by: )(.*)(at: (.*)\*{3})") +
        pp.SkipTo((pp.LineEnd() + pp.Regex(info_from_user_start_delimiter) | pp.Regex(r"\*\*\*")))
    ).setWhitespaceChars("")
).setParseAction(makeDecoratingParseAction("status_update"))
edit_rule = (
    (
        pp.Regex(r"(\*{3} Edited by: )(.*)(at: (.*)\*{3})") +
        pp.SkipTo((pp.LineEnd() + pp.Regex(info_from_user_start_delimiter) | pp.Regex(r"\*\*\*")))
    ).setWhitespaceChars("")
).setParseAction(makeDecoratingParseAction("edit"))
reply_rule = (
    (
        pp.Regex(r"(\*{3} Replied by: )(.*)(at: (.*)\*{3})") +
        pp.SkipTo((pp.LineEnd() + pp.Regex(info_from_user_start_delimiter) | pp.Regex(r"\*\*\*")))
    ).setWhitespaceChars("")
).setParseAction(makeDecoratingParseAction("reply"))
parse_item = (info_from_user_rule | status_update_rule | edit_rule | reply_rule) # searches for each of these in the item
item = ("\n=== Additional information supplied by user ===\n"+
"\n"+
"Subject: stuff\n"+
"From: someone\n"+
"Date: some time\n"+
"X-ECN-Queue-Original-Path: https://google.com\n"+
"X-ECN-Queue-Original-URL: https://amazon.com\n"+
"\n"+
"This is a message.\n"+
"Thanks,\n"+
"me"+
"\n"+
"===============================================\n"+
"*** Status updated by: you at: yesterday ***\n"+
"more status stuff\n" +
"\n\n\n\n"+
"*** Status updated by: that_guy at: tmrw ***\n"+
"status update\n" +
"\n"+
"*** Status updated by: me at: now ***\n"+
"this is a status update\n" +
"*** Edited by: someoneelse at: 03/03/21 10:09:52 ***\n" +
"this is an edit\n"+
"*** Replied by: no one: ever ***\n" +
"this is a reply\n"+
"=== Additional information supplied by user ===\n"+
"\n"+
"Subject: more\n"+
"From: not me\n"+
"Date: right meow\n"+
"X-ECN-Queue-Original-Path: https://gogle.com\n"+
"X-ECN-Queue-Original-URL: https://amzon.com\n"+
"\n"+
"This is a message that should be matched too.\n"+
"Thanks,\n"+
"me"+
"\n"+
"===============================================\n"
)
for tokens, starter, end in parse_item.scanString(item):
    print(tokens[0])

Output:
|
Directory parsing

The code below is now able to locate directory information within an item and (for the most part) delimit it correctly; some work will have to be done next week on the colons within the keys. It is also easy to see that the output looks even more similar to what we expect on the front end. I would have liked to make more progress on this, but implementing the grammar below is tedious in that troubleshooting issues takes longer, and finding solutions to various problems associated with pyparsing also takes time.

import pyparsing as pp
import json
info_from_user_start_delimiter = "=== Additional information supplied by user ==="
info_from_user_end_delimiter = "\n==============================================="
# reply from user header info
info_from_user_headers_rule = (
pp.Group("Subject" + pp.Literal(": ").suppress() + pp.SkipTo("\n")) +
pp.Group("From" + pp.Literal(": ").suppress() + pp.SkipTo("\n")) +
pp.Group(pp.Optional("Cc" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))) +
pp.Group("Date" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))+
pp.Group("X-ECN-Queue-Original-Path" + pp.Literal(": ").suppress() + pp.SkipTo("\n")) +
pp.Group("X-ECN-Queue-Original-URL" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))
)
#finds all of the reply from user delimiters
info_from_user_rule = (
(
# matches everything between the two info_from_user delimiters
info_from_user_start_delimiter +
"\n\n" +
pp.Dict(
info_from_user_headers_rule +
pp.Group(pp.SkipTo(info_from_user_end_delimiter)).setResultsName("content")
)
).setWhitespaceChars('') # ensures all whitespace is captured from the item
)
status_update_rule = (
pp.Dict(
#matches everything from the start delimiter up to the start of another delimiter
pp.Group(pp.Literal("*** Status updated by: ").suppress() + pp.SkipTo(" at: ") ).setResultsName("by")+
pp.Group(pp.Literal("at:").suppress() + pp.SkipTo(" ***\n")).setResultsName("datetime") +
pp.Group(
pp.Literal("***\n").suppress() +
pp.SkipTo(
(pp.LineEnd() + (pp.Regex(info_from_user_start_delimiter) | pp.Regex(r"\*\*\*")))
)
).setResultsName("content")
).setWhitespaceChars("")
)
edit_rule = (
pp.Dict(
pp.Group(pp.Literal("*** Edited by: ").suppress() + pp.SkipTo(" at: ") ).setResultsName("by")+
pp.Group(pp.Literal("at:").suppress() + pp.SkipTo(" ***\n")).setResultsName("datetime") +
pp.Group(
pp.Literal("***\n").suppress() +
pp.SkipTo(
(pp.LineEnd() + (pp.Regex(info_from_user_start_delimiter) | pp.Regex(r"\*\*\*")))
)
).setResultsName("content")
)
).setWhitespaceChars("")
reply_rule = (
pp.Dict(
pp.Group(pp.Literal("*** Replied by: ").suppress() + pp.SkipTo(" at: ") ).setResultsName("by")+
pp.Group(pp.Literal("at:").suppress() + pp.SkipTo(" ***\n")).setResultsName("datetime") +
pp.Group(
pp.Literal("***\n").suppress() +
pp.SkipTo(
pp.LineEnd() + (pp.Regex(info_from_user_start_delimiter) | pp.Regex(r"\*\*\*"))
)
).setResultsName("content")
)
).setWhitespaceChars("")
directory_rule = pp.Dict(
pp.White("\n").suppress() +
pp.White("\t").suppress() +
pp.Group(pp.Optional("Name:" + pp.Regex("(.*)(\\n)"))) +
pp.Group(pp.Optional("Login:" + pp.Regex("(.*)(\\n)"))) +
pp.Group(pp.Optional("Computer:" + pp.Regex("(.*)(\\n)"))) +
pp.Group(pp.Optional("Location:" + pp.Regex("(.*)(\\n)"))) +
pp.Group(pp.Optional("Email:" + pp.Regex("(.*)(\\n)"))) +
pp.Group(pp.Optional("Phone:" + pp.Regex("(.*)(\\n)"))) +
pp.Group(pp.Optional("Office:" + pp.Regex("(.*)(\\n)"))) +
pp.Group(pp.Optional("UNIX Dir:" + pp.Regex("(.*)(\\n)"))) +
pp.Group(pp.Optional("Zero Dir:" + pp.Regex("(.*)(\\n)"))) +
pp.Group(pp.Optional("User ECNDB:" + pp.Regex("(.*)(\\n)"))) +
pp.Group(pp.Optional("Host ECNDB:" + pp.Regex("(.*)(\\n)"))) +
pp.Group(pp.Optional("Subject: " + pp.Regex("(.*)(\\n)"))) +
pp.White("\n").suppress()
).setWhitespaceChars('').parseWithTabs()
item = ("\n\t" +
"Name: i dont have one\n"
"Login: tttt\n" +
"Computer: 5555\n" +
"Location: yes\n" +
"Email: t\n" +
"Phone: 5555555555\n" +
"Office: yo\n" +
"UNIX Dir: 45\n" +
"Zero Dir: 0\n" +
"User ECNDB: 7\n" +
"Host ECNDB: 8\n" +
"Subject: I have no idea\n"+
"\n" +
"\n=== Additional information supplied by user ===\n"+
"\n"+
"Subject: stuff\n"+
"From: someone\n"+
"Date: some time\n"+
"X-ECN-Queue-Original-Path: https://google.com\n"+
"X-ECN-Queue-Original-URL: https://amazon.com\n"+
"\n"+
"This is a message.\n"+
"Thanks,\n"+
"me"+
"\n"+
"===============================================\n"+
"*** Status updated by: you at: yesterday ***\n"+
"more status stuff\n" +
"\n\n\n\n"+
"*** Status updated by: that_guy at: tmrw ***\n"+
"status update\n" +
"\n"+
"*** Status updated by: me at: now ***\n"+
"this is a status update\n" +
"*** Edited by: someoneelse at: 03/03/21 10:09:52 ***\n" +
"this is an edit\n"+
"*** Replied by: no one at: ever ***\n" +
"this is a reply\n"+
"=== Additional information supplied by user ===\n"+
"\n"+
"Subject: more\n"+
"From: not me\n"+
"Cc: \"jacob Bennett\" <me@purdue.edu>\n" +
"Date: right meow\n"+
"X-ECN-Queue-Original-Path: https://gogle.com\n"+
"X-ECN-Queue-Original-URL: https://amzon.com\n"+
"\n"+
"This is a message that should be matched too.\n"+
"Thanks,\n"+
"me"+
"\n"+
"===============================================\n"
)
sections = []
parse_objects = {
'directory': directory_rule.scanString(item),
'info_from_user': info_from_user_rule.scanString(item),
'edit': edit_rule.scanString(item),
'status_update': status_update_rule.scanString(item),
'reply_from_ecn': reply_rule.scanString(item)
}
for key in parse_objects.keys():
    for token, start_location, end_location in parse_objects[key]:
        delete_tokens = []
        for token_key in token.keys():
            if token[token_key] == '': delete_tokens.append(token_key)
        for removable_token in delete_tokens:
            del token[removable_token]
        token = token.asDict()
        token["type"] = key
        sections.append(token)
sections = json.dumps(sections)
print(sections)

Output:

[
{
"Name:":"i dont have one\n",
"Login:":"tttt\n",
"Computer:":"5555\n",
"Location:":"yes\n",
"Email:":"t\n",
"Phone:":"5555555555\n",
"Office:":"yo\n",
"UNIX Dir:":"45\n",
"Zero Dir:":"0\n",
"User ECNDB:":"7\n",
"Host ECNDB:":"8\n",
"Subject: ":"I have no idea\n",
"type":"directory"
},
{
"content":[
"This is a message.\nThanks,\nme"
],
"Subject":"stuff",
"From":"someone",
"Date":"some time",
"X-ECN-Queue-Original-Path":"https://google.com",
"X-ECN-Queue-Original-URL":"https://amazon.com",
"type":"info_from_user"
},
{
"content":[
"This is a message that should be matched too.\nThanks,\nme"
],
"Subject":"more",
"From":"not me",
"Cc":"\"jacob Bennett\" <me@purdue.edu>",
"Date":"right meow",
"X-ECN-Queue-Original-Path":"https://gogle.com",
"X-ECN-Queue-Original-URL":"https://amzon.com",
"type":"info_from_user"
},
{
"by":[
"someoneelse"
],
"datetime":[
"03/03/21 10:09:52"
],
"content":[
"this is an edit"
],
"type":"edit"
},
{
"by":[
"you"
],
"datetime":[
"yesterday"
],
"content":[
"more status stuff"
],
"type":"status_update"
},
{
"by":[
"that_guy"
],
"datetime":[
"tmrw"
],
"content":[
"status update"
],
"type":"status_update"
},
{
"by":[
"me"
],
"datetime":[
"now"
],
"content":[
"this is a status update"
],
"type":"status_update"
},
{
"by":[
"no one"
],
"datetime":[
"ever"
],
"content":[
"this is a reply"
],
"type":"reply_from_ecn"
}
] |
I'd like to see a link to a particularly good tutorial or a writeup about how you've come to understand and use PyParser. |
How to use Pyparsing

Grammar/Expression Creation

A grammar is a rule or set of rules (rules are often referred to as expressions) used to create a parser. Creating expressions in pyparsing is relatively easy; here are some examples of simpler ones:

import pyparsing as pp
colon_rule = pp.Word(pp.alphas) + ":" + pp.Word(pp.alphas) # matches two words on either side of a colon
skipTo_rule = pp.SkipTo("end") # matches everything up until the word "end"
regex_rule = pp.Regex("...") # matches the first three characters in a string
literal_rule = pp.Literal("Hello") # matches these characters exactly

To use these expressions, there are three functions that can be called:
These functions are called on a rule, with the string to parse passed as an argument. Using the rules from above, for example:

print(colon_rule.parseString("hello:there")) # Output: ['hello', ':', 'there']
print(skipTo_rule.parseString("I'm just a simple sentenceend")) # Output: ["I'm just a simple sentence"]
print(regex_rule.parseString("This is a sentence")) # Output: ['Thi']
print(literal_rule.searchString("World Hello")) # Output: [['Hello']]

Notice the difference between parseString and searchString:

print(regex_rule.parseString("This is a sentence")) # Output: ['Thi']
print(regex_rule.searchString("This is a sentence")) # Output: [['Thi'], ['s i'], ['s a'], ['sen'], ['ten']]

As seen above, searchString does not care about string or character placement and will attempt to match anything within the string.

Complex Grammars

Expressions, as seen above, can be combined to create a grammar. Take this example from this comment:

import pyparsing as pp
info_from_user_headers_rule = (
pp.Group("Subject" + pp.Literal(": ").suppress() + pp.SkipTo("\n")) +
pp.Group("From" + pp.Literal(": ").suppress() + pp.SkipTo("\n")) +
pp.Group(pp.Optional("Cc" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))) +
pp.Group("Date" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))+
pp.Group("X-ECN-Queue-Original-Path" + pp.Literal(": ").suppress() + pp.SkipTo("\n")) +
pp.Group("X-ECN-Queue-Original-URL" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))
) This grammar uses several different pyparsing classes and expressions, but the result matches all of the header information in a reply_from_user section.
Here is an example of a reply from a user:

print(info_from_user_headers_rule.searchString("""
=== Additional information supplied by user ===
Subject: subject_here
From: Jacob
Date: Tue, 2 Mar 2021 09:46:21 -0500
X-ECN-Queue-Original-Path: path_here
X-ECN-Queue-Original-URL: url_here
I am replying to ECN
Thanks,
Jacob
"""))
# Output: [[['Subject', 'subject_here'], ['From', 'Jacob'], [], ['Date', 'Tue, 2 Mar 2021 09:46:21 -0500'], ['X-ECN-Queue-Original-Path', 'path_here'], ['X-ECN-Queue-Original-URL', 'url_here']]] |
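For contrast, the same header lines can be pulled out with a single standard-library regex; this loses pyparsing's grammar composition but shows what the grammar above is doing (a sketch, with the header names taken from the example):

```python
import re

# One "Key: value" header per line; Cc is simply absent when not present.
HEADER_RE = re.compile(
    r"^(Subject|From|Cc|Date|X-ECN-Queue-Original-Path|X-ECN-Queue-Original-URL)"
    r": (.*)$",
    re.MULTILINE,
)

reply = """\
=== Additional information supplied by user ===

Subject: subject_here
From: Jacob
Date: Tue, 2 Mar 2021 09:46:21 -0500
X-ECN-Queue-Original-Path: path_here
X-ECN-Queue-Original-URL: url_here

I am replying to ECN
"""

# findall with two groups yields (key, value) tuples, so dict() gives
# a header mapping directly.
headers = dict(HEADER_RE.findall(reply))
```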
I see above that some results are simple arrays like ['Thi'] and some results are nested arrays like [
['Thi'], ['s i'], ['s a'], ['sen'], ['ten']
] and other outputs are nested multiple times like [
[
['Subject', 'subject_here'],
['From', 'Jacob'],
[],
['Date', 'Tue, 2 Mar 2021 09:46:21 -0500'],
['X-ECN-Queue-Original-Path', 'path_here'],
['X-ECN-Queue-Original-URL', 'url_here']
]
]

When should I expect nesting or not with results? |
Nested Lists

Nested lists can happen for a couple of different reasons:
To demonstrate these differences, here is an example:

import pyparsing as pp
string_var = 'key1:value1;key2:value2;key3:value3;'
rule_one = pp.Word(pp.alphanums) + ":" + pp.Word(pp.alphanums) + pp.Literal(";")
rule_two = pp.ZeroOrMore(pp.Group(pp.Word(pp.alphanums) + ":" + pp.Word(pp.alphanums) + pp.Literal(";")))
rule_three = pp.Group(pp.Word(pp.alphanums) + ":" + pp.Word(pp.alphanums) + pp.Literal(";"))
print(rule_one.parseString(string_var))
print(rule_two.parseString(string_var))
print(rule_three.parseString(string_var))
print(rule_one.searchString(string_var))
print(rule_two.searchString(string_var))
print(rule_three.searchString(string_var))

The Output:
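A loose standard-library analogy for the grouping behavior, using `re` capture groups (an analogy only; pyparsing's semantics differ in the details):

```python
import re

string_var = 'key1:value1;key2:value2;key3:value3;'

# With a single capture group, findall returns a flat list of strings,
# similar to an ungrouped pyparsing expression.
flat = re.findall(r"(\w+):\w+;", string_var)      # ['key1', 'key2', 'key3']

# With multiple groups, each match becomes its own tuple, the way
# pp.Group turns each match into its own sublist.
nested = re.findall(r"(\w+):(\w+);", string_var)  # [('key1', 'value1'), ...]
```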
|
Is there one approach that gives us all the flexibility we'd need so that we can standardize the expected output? |
|
Prototype Pyparsing Parser

The code below will parse all the information located in an item, excluding any information located in the headers, and return output comparable to the current implementation of the parser. In addition, it does so faster, in fewer lines, and in a way that is relatively easy to understand.

import pyparsing as pp
import json
import string
info_from_user_start_delimiter = "=== Additional information supplied by user ==="
info_from_user_end_delimiter = "==============================================="
def addTypeKey(type):
    def parse_action_impl(s, l, t): # need to look into how exactly this function gets information
        t = t.asDict()
        unwantedKeys = [emptyKey for emptyKey in t.keys() if emptyKey == ''] # makes a list of keys with empty values
        for key in unwantedKeys: del t[key] # removes empty keys
        if len(t.keys()) == 0: return # used for optional sections such as directory info
        t["type"] = type
        return t
    return parse_action_impl
# additional information supplied by user rule
info_from_user_rule = (
(info_from_user_start_delimiter + pp.LineEnd()).suppress() +
pp.Literal("\n").setWhitespaceChars("").suppress() +
pp.Group("Subject" + pp.Literal(": ").suppress() + pp.SkipTo(pp.LineEnd())) +
pp.Group("From" + pp.Literal(": ").suppress() + pp.SkipTo("\n")) +
pp.Group(pp.Optional("Cc" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))) +
pp.Group("Date" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))+
pp.Group(pp.Optional("X-ECN-Queue-Original-Path" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))) +
pp.Group(pp.Optional("X-ECN-Queue-Original-URL" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))) +
pp.SkipTo(info_from_user_end_delimiter + pp.LineEnd(), include=True).setResultsName("content")
).setParseAction(addTypeKey("reply_from_user"))
reply_rule = (
pp.Literal("\n*** Replied by: ").suppress() +
pp.Word(pp.alphanums).setResultsName("by")+
pp.Literal(" at: ").suppress() +
pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
(pp.Literal(" ***") + pp.LineEnd()).suppress() +
pp.Group(
pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
).setResultsName("content")
).leaveWhitespace().setParseAction(addTypeKey("reply_to_user"))
edit_rule = (
pp.Literal("\n*** Edited by: ").suppress() +
pp.Word(pp.alphanums).setResultsName("by")+
pp.Literal(" at: ").suppress() +
pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
(pp.Literal(" ***") + pp.LineEnd()).suppress() +
pp.Group(
pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
).setResultsName("content")
).leaveWhitespace().setParseAction(addTypeKey("edit"))
status_update_rule = (
pp.Literal("\n*** Status updated by: ").suppress() +
pp.Word(pp.alphanums).setResultsName("by")+
pp.Literal(" at: ").suppress() +
pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
(pp.Literal(" ***") + pp.LineEnd()).suppress() +
pp.Group(
pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
).setResultsName("content")
).leaveWhitespace().setParseAction(addTypeKey("status"))
directory_rule = pp.Optional(pp.Dict(
pp.Literal("\n").suppress().setWhitespaceChars("") +
pp.Optional(pp.Group("Name" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Login" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Computer" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Location" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Email" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Phone" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Office" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("UNIX Dir" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Zero Dir" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("User ECNDB" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Host ECNDB" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Subject" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Literal("\n\n").suppress().setWhitespaceChars("")
)).setParseAction(addTypeKey("directory_information"))
initial_message_rule = (
pp.SkipTo(pp.Regex(info_from_user_start_delimiter) | pp.Regex(r'\n\*\*\*')).leaveWhitespace()
).setResultsName("content").setParseAction(addTypeKey("initial_message"))
headers_rule = pp.Group(pp.SkipTo("\n\n", include=True).setResultsName('headers')).leaveWhitespace()
item_rule = (
headers_rule.suppress() + #supresses the output of the headers to the parsed item
directory_rule +
initial_message_rule +
pp.ZeroOrMore(info_from_user_rule | reply_rule | edit_rule | status_update_rule)
)
raw_item = """
<Header information would typically go here>
Name: Jacob Bennett
Login: benne238
Computer: 1.1.1.1
Location: CARY 123
Email: benne238@purdue.edu
Phone: numberhere
Office: I wish...
UNIX Dir: dunno
Zero Dir: dunno thatone either
Subject: I need something from ECN
I am writing because I need something from ECN,
thanks, Jacob
Bennett
*** Edited by: campb303 at: 01/01/2022 09:00:00 ***
I made an edit here
*** Edited by: campb303 at: 01/01/2022 12:29:38 ***
I also made an edit here
*** Status updated by: someoneelse at: 01/01/2022 12:30:13 ***
I made a status update
*** Edited by: personone at: 01/02/2022 12:31:15 ***
ooo, personone also edited this item
*** Replied by: personone at: 01/02/22 12:34:03 ***
Hello there.... could you be more specific?
Thanks,
personone
*** Edited by: persontwo at: 01/05/22 14:58:03 ***
I made an edit too! (persontwo)
*** Status updated by: personone at: 1/7/2022 15:40:55 ***
Something happened here
*** Edited by: personone at: 04/08/22 15:41:05 ***
i dont even know anymore
=== Additional information supplied by user ===
Subject: Re: I need something from ECN
From: "Bennett, Jacob" <benne238@purdue.edu>
Date: Tue, 3 Dec 2023 14:50:44 +0000
X-ECN-Queue-Original-Path: nothing
X-ECN-Queue-Original-URL: nothing
Hi! Thanks for the quick reply. I dunnno, I was hoping you could help me with that :/
Thanks, Jacob
===============================================
"""
parsed_item = item_rule.parseString(raw_item).asList()
print(json.dumps(parsed_item))
Output:
[
{
"Name": "Jacob Bennett",
"Login": "benne238",
"Computer": "1.1.1.1",
"Location": "CARY 123",
"Email": "benne238@purdue.edu",
"Phone": "numberhere",
"Office": "I wish...",
"UNIX Dir": "dunno",
"Zero Dir": "dunno thatone either",
"Subject": "I need something from ECN",
"type": "directory_information"
},
{
"content": "\nI am writing because I need something from ECN, \nthanks, Jacob\nBennett\n",
"type": "initial_message"
},
{
"by": "campb303",
"datetime": "01/01/2022 09:00:00",
"content": [
"\nI made an edit here\n\n\n"
],
"type": "edit"
},
{
"by": "campb303",
"datetime": "01/01/2022 12:29:38",
"content": [
"\nI also made an edit here\n\n"
],
"type": "edit"
},
{
"by": "someoneelse",
"datetime": "01/01/2022 12:30:13",
"content": [
"I made a status update"
],
"type": "status"
},
{
"by": "personone",
"datetime": "01/02/2022 12:31:15",
"content": [
"\nooo, personone also edited this item"
],
"type": "edit"
},
{
"by": "personone",
"datetime": "01/02/22 12:34:03",
"content": [
"\nHello there.... could you be more specific?\n\nThanks,\npersonone\n"
],
"type": "reply_to_user"
},
{
"by": "persontwo",
"datetime": "01/05/22 14:58:03",
"content": [
"I made an edit too! (persontwo)\n"
],
"type": "edit"
},
{
"by": "personone",
"datetime": "1/7/2022 15:40:55",
"content": [
"Something happened here"
],
"type": "status"
},
{
"by": "personone",
"datetime": "04/08/22 15:41:05",
"content": [
"\ni dont even know anymore\n\n\n\n"
],
"type": "edit"
},
{
"content": "Hi! Thanks for the quick reply. I dunnno, I was hoping you could help me with that :/\n\nThanks, Jacob\n",
"type": "reply_from_user"
}
]
Modifications still need to be made to this script, including:
Currently, the output of the parser looks good, though it is not functionally complete. We still need:
I also have some other concerns:
Overall, the code is cleaner than the previous parser. If we can add the lacking functionality and prove the code is as efficient as or faster than the old parser, then we're good to go. Next Steps:
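For the efficiency comparison, `timeit` could be used; a minimal micro-benchmark sketch (the sample line, rule, and both functions here are hypothetical stand-ins, not the real parser):

```python
import timeit

import pyparsing as pp

# Hypothetical sample line and rule, standing in for a real item and grammar.
sample = "*** Edited by: campb303 at: 01/01/2022 09:00:00 ***"

edit_rule = (
    pp.Literal("*** Edited by: ").suppress() +
    pp.Word(pp.alphanums).setResultsName("by") +
    pp.Literal(" at: ").suppress() +
    pp.SkipTo(" ***").setResultsName("datetime")
).leaveWhitespace()

def with_pyparsing():
    return edit_rule.parseString(sample).asDict()

def with_manual_split():
    # Rough equivalent of what a hand-written parser does for one line.
    body = sample.strip("* ").replace("Edited by: ", "")
    by, at = body.split(" at: ")
    return {"by": by, "datetime": at}

for fn in (with_pyparsing, with_manual_split):
    seconds = timeit.timeit(fn, number=1000)
    print(f"{fn.__name__}: {seconds:.4f}s per 1000 calls")
```

Running both approaches against a representative set of real items would show whether the pyparsing grammar is acceptably fast compared to the current parser.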
Update
This version of the pyparsing parser:
from typing_extensions import Literal
import pyparsing as pp
import json
import string
from dateutil import parser, tz
from datetime import datetime
import os
info_from_user_start_delimiter = "=== Additional information supplied by user ==="
info_from_user_end_delimiter = "==============================================="
HEADERS = ""
def addTypeKey(section_type):
def parse_action_impl(s, l, t): # need to look into how exactly this function gets information
t = t.asDict()
unwantedKeys=[emptyKey for emptyKey in t.keys() if t[emptyKey] == ''] # makes a list of keys with empty values
for key in unwantedKeys: del t[key] # removes empty keys
t["type"] = section_type
if "datetime" in t.keys(): t["datetime"] = getFormattedDate(t["datetime"])
if "content" in t.keys():
t["content"] = t["content"][0].strip()
t["content"] = t["content"].splitlines(True)
return t
return parse_action_impl
def getAssignments() -> list:
assignment_list = []
for assignment in assignment_rule.searchString(HEADERS).asList():
assignment_list.append(assignment[0])
return assignment_list # need to write a blurb about yield statements
def storeHeaders():
def parse_action_impl(s, l, t):
global HEADERS
HEADERS = t[0][0]
return
return parse_action_impl
def getFormattedDate(date: str) -> str:
"""Returns the date/time formatted as RFC 8601 YYYY-MM-DDTHH:MM:SS+00:00.
Returns empty string if the string argument passed to the function is not a datetime.
See: https://en.wikipedia.org/wiki/ISO_8601
**Returns:**
```
str: Properly formatted date/time received or empty string.
```
"""
try:
# This date is never meant to be used. The default attribute is just to set timezone.
parsedDate = parser.parse(date, default=datetime(
1970, 1, 1, tzinfo=tz.gettz('EDT')))
except:
return ""
parsedDateString = parsedDate.strftime("%Y-%m-%dT%H:%M:%S%z")
return parsedDateString
# additional information supplied by user rule
info_from_user_rule = pp.Dict(
(info_from_user_start_delimiter + pp.LineEnd()).suppress() +
pp.Literal("\n").setWhitespaceChars("").suppress() +
pp.Group("Subject" + pp.Literal(": ").suppress() + pp.SkipTo(pp.LineEnd())) +
pp.Group("From" + pp.Literal(": ").suppress() + pp.SkipTo("\n")) +
pp.Group(pp.Optional("Cc" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))) +
pp.Group("Date" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))+
pp.Group(pp.Optional("X-ECN-Queue-Original-Path" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))) +
pp.Group(pp.Optional("X-ECN-Queue-Original-URL" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))) +
pp.Group(pp.SkipTo(info_from_user_end_delimiter + pp.LineEnd())).setResultsName("content")
).setParseAction(addTypeKey("reply_from_user"))
reply_rule = (
pp.Literal("\n*** Replied by: ").suppress() +
pp.Word(pp.alphanums).setResultsName("by")+
pp.Literal(" at: ").suppress() +
pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
(pp.Literal(" ***") + pp.LineEnd()).suppress() +
pp.Group(
pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
).setResultsName("content")
).leaveWhitespace().setParseAction(addTypeKey("reply_to_user"))
edit_rule = (
pp.Literal("\n*** Edited by: ").suppress() +
pp.Word(pp.alphanums).setResultsName("by")+
pp.Literal(" at: ").suppress() +
pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
(pp.Literal(" ***") + pp.LineEnd()).suppress() +
pp.Group(
pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
).setResultsName("content")
).leaveWhitespace().setParseAction(addTypeKey("edit"))
status_update_rule = (
pp.Literal("\n*** Status updated by: ").suppress() +
pp.Word(pp.alphanums).setResultsName("by")+
pp.Literal(" at: ").suppress() +
pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
(pp.Literal(" ***") + pp.LineEnd()).suppress() +
pp.Group(
pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
).setResultsName("content")
).leaveWhitespace().setParseAction(addTypeKey("status"))
directory_rule = pp.Dict(
pp.Literal("\n").suppress().setWhitespaceChars("") +
pp.Optional(pp.Group("Name" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Login" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Computer" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Location" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Email" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Phone" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Office" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("UNIX Dir" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Zero Dir" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("User ECNDB" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Host ECNDB" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Subject" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Literal("\n\n").suppress().setWhitespaceChars("")
).setParseAction(addTypeKey("directory_information"))
initial_message_rule = pp.Group(
pp.SkipTo(pp.Regex(info_from_user_start_delimiter) | pp.Regex(r'\n\*\*\*')).leaveWhitespace()
).setResultsName("content").setParseAction(addTypeKey("initial_message"))
headers_rule = pp.Group(pp.SkipTo("\n\n", include=True)).setResultsName('headers').leaveWhitespace()
item_rule = (
headers_rule.setParseAction(storeHeaders()).suppress() + # suppresses the headers in the parsed output
pp.Optional(directory_rule) +
initial_message_rule +
pp.ZeroOrMore(info_from_user_rule | reply_rule | edit_rule | status_update_rule)
)
assignment_rule = (
pp.Literal("Assigned-To: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("to") +
pp.Literal("Assigned-To-Updated-Time: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("datetime") +
pp.Literal("Assigned-To-Updated-By: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("by")
).setParseAction(addTypeKey("assignment"))
raw_item = """
Assigned-To: not_me
Assigned-To-Updated-Time: Fri, 29 Jan 2021 07:01:40 EST
Assigned-To-Updated-By: me
Assigned-To: you
Assigned-To-Updated-Time: 31 Jan 2021 07:01:40 EST
Assigned-To-Updated-By: not_me
Name: Jacob Bennett
Login: benne238
Computer: 1.1.1.1
Location: CARY 123
Email: benne238@purdue.edu
Phone: numberhere
Office: I wish...
UNIX Dir: dunno
Zero Dir: dunno thatone either
Subject: I need something from ECN
I am writing because I need something from ECN,
thanks, Jacob
Bennett
*** Edited by: campb303 at: 01/01/2022 09:00:00 ***
I made an edit here
*** Edited by: campb303 at: 01/01/2022 12:29:38 ***
I also made an edit here
*** Status updated by: someoneelse at: 01/01/2022 12:30:13 ***
I made a status update
*** Edited by: personone at: 01/02/2022 12:31:15 ***
ooo, personone also edited this item
*** Replied by: personone at: 01/02/22 12:34:03 ***
Hello there.... could you be more specific?
Thanks,
personone
*** Edited by: persontwo at: 01/05/22 14:58:03 ***
I made an edit too! (persontwo)
*** Status updated by: personone at: 1/7/2022 15:40:55 ***
Something happened here
*** Edited by: personone at: 04/08/22 15:41:05 ***
i dont even know anymore
=== Additional information supplied by user ===
Subject: Re: I need something from ECN
From: "Bennett, Jacob" <benne238@purdue.edu>
Date: Tue, 3 Dec 2023 14:50:44 +0000
X-ECN-Queue-Original-Path: nothing
X-ECN-Queue-Original-URL: nothing
Hi! Thanks for the quick reply. I dunnno, I was hoping you could help me with that :/
Thanks, Jacob
===============================================
"""
parsed_item = item_rule.parseString(raw_item).asList()
for assignment in getAssignments():
parsed_item.append(assignment)
print(json.dumps(parsed_item, indent=2))
Output:
[
{
"Name": "Jacob Bennett",
"Login": "benne238",
"Computer": "1.1.1.1",
"Location": "CARY 123",
"Email": "benne238@purdue.edu",
"Phone": "numberhere",
"Office": "I wish...",
"UNIX Dir": "dunno",
"Zero Dir": "dunno thatone either",
"Subject": "I need something from ECN",
"type": "directory_information"
},
{
"content": [
"I am writing because I need something from ECN, \n",
"thanks, Jacob\n",
"Bennett"
],
"type": "initial_message"
},
{
"by": "campb303",
"datetime": "2022-01-01T09:00:00-0500",
"content": [
"I made an edit here"
],
"type": "edit"
},
{
"by": "campb303",
"datetime": "2022-01-01T12:29:38-0500",
"content": [
"I also made an edit here"
],
"type": "edit"
},
{
"by": "someoneelse",
"datetime": "2022-01-01T12:30:13-0500",
"content": [
"I made a status update"
],
"type": "status"
},
{
"by": "personone",
"datetime": "2022-01-02T12:31:15-0500",
"content": [
"ooo, personone also edited this item"
],
"type": "edit"
},
{
"by": "personone",
"datetime": "2022-01-02T12:34:03-0500",
"content": [
"Hello there.... could you be more specific?\n",
"\n",
"Thanks,\n",
"personone"
],
"type": "reply_to_user"
},
{
"by": "persontwo",
"datetime": "2022-01-05T14:58:03-0500",
"content": [
"I made an edit too! (persontwo)"
],
"type": "edit"
},
{
"by": "personone",
"datetime": "2022-01-07T15:40:55-0500",
"content": [
"Something happened here"
],
"type": "status"
},
{
"by": "personone",
"datetime": "2022-04-08T15:41:05-0400",
"content": [
"i dont even know anymore"
],
"type": "edit"
},
{
"content": [
"Hi! Thanks for the quick reply. I dunnno, I was hoping you could help me with that :/\n",
"\n",
"Thanks, Jacob"
],
"Subject": "Re: I need something from ECN",
"From": "\"Bennett, Jacob\" <benne238@purdue.edu>",
"Date": "Tue, 3 Dec 2023 14:50:44 +0000",
"X-ECN-Queue-Original-Path": "nothing",
"X-ECN-Queue-Original-URL": "nothing",
"type": "reply_from_user"
},
{
"to": "not_me",
"datetime": "2021-01-29T07:01:40-0500",
"by": "me",
"type": "assignment"
},
{
"to": "you",
"datetime": "2021-01-31T07:01:40-0500",
"by": "not_me",
"type": "assignment"
}
]
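The assignment handling above works because `scanString` finds every non-overlapping match of a rule anywhere in a string, which is how repeated `Assigned-To` blocks in the suppressed headers each become one section. A self-contained sketch of the same idea (header text copied from the sample item):

```python
import pyparsing as pp

# Header text shaped like the Assigned-To blocks in the sample item.
headers = """Assigned-To: not_me
Assigned-To-Updated-Time: Fri, 29 Jan 2021 07:01:40 EST
Assigned-To-Updated-By: me
Assigned-To: you
Assigned-To-Updated-Time: 31 Jan 2021 07:01:40 EST
Assigned-To-Updated-By: not_me
"""

assignment_rule = (
    pp.Literal("Assigned-To: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("to") +
    pp.Literal("Assigned-To-Updated-Time: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("datetime") +
    pp.Literal("Assigned-To-Updated-By: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("by")
)

# scanString yields (tokens, start, end) for each match, so repeated
# assignment blocks each produce one dict.
assignments = [tokens.asDict() for tokens, start, end in assignment_rule.scanString(headers)]
print(assignments)
```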
Update
The following code will output an error when parsing fails:
import pyparsing as pp
import json
import string
from dateutil import parser, tz
from datetime import datetime
import os
info_from_user_start_delimiter = "=== Additional information supplied by user ==="
info_from_user_end_delimiter = "==============================================="
nested_expression_rule = (
pp.Literal(info_from_user_start_delimiter) |
pp.Regex("\*\*\* Replied by: (.*) at: (.*) \*\*\*") |
pp.Regex("\*\*\* Edited by: (.*) at: (.*) \*\*\*") |
pp.Regex("\*\*\* Status updated by: (.*) at: (.*) \*\*\*")
)
HEADERS = ""
def errorHandler():
def error_action_impl(s, l, t):
location = (s[:l]).count('\n') + 1
parse_error = {
"type": "parse_error",
'datetime': getFormattedDate(str(datetime.now())),
'expected': f'Did not encounter a reply-from-user ending delimiter for the reply-from-user start delimiter on line {location}',
'got': '\n',
'line_num': location
}
parsed_item.append(parse_error)
return
return error_action_impl
def checkForNested():
def nested_action_impl(s, l, t):
errorParse = {}
nested_expressions_generator = nested_expression_rule.scanString(t[0])
for token, start, end in nested_expressions_generator:
errorParse = {
"type": "parse_error",
"datetime": getFormattedDate(str(datetime.now())),
"expected": "Reply from user ending delimiter",
"got": token[0],
"line_num": (s[:start + l]).count("\n") + 1
}
break
if len(errorParse.keys()) != 0: parsed_item.append(errorParse)
return
return nested_action_impl
def addTypeKey(section_type):
def parse_action_impl(s, l, t): # need to look into how exactly this function gets information
t = t.asDict()
unwantedKeys=[emptyKey for emptyKey in t.keys() if t[emptyKey] == ''] # makes a list of keys with empty values
for key in unwantedKeys: del t[key] # removes empty keys
t["type"] = section_type
if "datetime" in t.keys(): t["datetime"] = getFormattedDate(t["datetime"])
if "content" in t.keys():
t["content"] = t["content"][0].strip()
t["content"] = t["content"].splitlines(True)
parsed_item.append(t)
return
return parse_action_impl
def getAssignments() -> list:
assignment_list = []
for token, start, end in assignment_rule.scanString(HEADERS):
token_dict = token.asDict()
token_dict["type"] = "assignment"
assignment_list.append(token_dict)
return assignment_list
def storeHeaders():
def parse_action_impl(s, l, t):
global HEADERS
HEADERS = t[0][0]
return
return parse_action_impl
def getFormattedDate(date: str) -> str:
"""Returns the date/time formatted as RFC 8601 YYYY-MM-DDTHH:MM:SS+00:00.
Returns empty string if the string argument passed to the function is not a datetime.
See: https://en.wikipedia.org/wiki/ISO_8601
**Returns:**
```
str: Properly formatted date/time received or empty string.
```
"""
try:
# This date is never meant to be used. The default attribute is just to set timezone.
parsedDate = parser.parse(date, default=datetime(
1970, 1, 1, tzinfo=tz.gettz('EDT')))
except:
return ""
parsedDateString = parsedDate.strftime("%Y-%m-%dT%H:%M:%S%z")
return parsedDateString
# additional information supplied by user rule
info_from_user_rule = (pp.Dict(
(info_from_user_start_delimiter + pp.LineEnd()).suppress() +
pp.Literal("\n").setWhitespaceChars("").suppress() +
pp.Group("Subject" + pp.Literal(": ").suppress() + pp.SkipTo(pp.LineEnd())) +
pp.Group("From" + pp.Literal(": ").suppress() + pp.SkipTo("\n")) +
pp.Group(pp.Optional("Cc" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))) +
pp.Group("Date" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))+
pp.Group(pp.Optional("X-ECN-Queue-Original-Path" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))) +
pp.Group(pp.Optional("X-ECN-Queue-Original-URL" + pp.Literal(": ").suppress() + pp.SkipTo("\n"))) +
(pp.Group(pp.SkipTo(info_from_user_end_delimiter + pp.LineEnd()).setParseAction(checkForNested())).setResultsName("content")) +
(pp.Literal(info_from_user_end_delimiter) + pp.LineEnd()).suppress()
).setParseAction(addTypeKey("reply_from_user")))
reply_rule = (
pp.Literal("\n*** Replied by: ").suppress() +
pp.Word(pp.alphanums).setResultsName("by")+
pp.Literal(" at: ").suppress() +
pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
(pp.Literal(" ***") + pp.LineEnd()).suppress() +
pp.Group(
pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
).setResultsName("content")
).leaveWhitespace().setParseAction(addTypeKey("reply_to_user"))
edit_rule = (
pp.Literal("\n*** Edited by: ").suppress() +
pp.Word(pp.alphanums).setResultsName("by")+
pp.Literal(" at: ").suppress() +
pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
(pp.Literal(" ***") + pp.LineEnd()).suppress() +
pp.Group(
pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
).setResultsName("content")
).leaveWhitespace().setParseAction(addTypeKey("edit"))
status_update_rule = (
pp.Literal("\n*** Status updated by: ").suppress() +
pp.Word(pp.alphanums).setResultsName("by")+
pp.Literal(" at: ").suppress() +
pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
(pp.Literal(" ***") + pp.LineEnd()).suppress() +
pp.Group(
pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
).setResultsName("content")
).leaveWhitespace().setParseAction(addTypeKey("status"))
directory_rule = pp.Dict(
pp.Literal("\n").suppress().setWhitespaceChars("") +
pp.Optional(pp.Group("Name" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Login" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Computer" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Location" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Email" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Phone" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Office" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("UNIX Dir" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Zero Dir" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("User ECNDB" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Host ECNDB" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Subject" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Literal("\n\n").suppress().setWhitespaceChars("")
).setParseAction(addTypeKey("directory_information"))
initial_message_rule = pp.Group(
pp.SkipTo(pp.Regex(info_from_user_start_delimiter) | pp.Regex(r'\n\*\*\*')).leaveWhitespace()
).setResultsName("content").setParseAction(addTypeKey("initial_message"))
headers_rule = pp.Group(pp.SkipTo("\n\n", include=True)).setResultsName('headers').leaveWhitespace()
missing_end_delimiter_rule = pp.Word(string.printable).setParseAction(errorHandler())
item_rule = (
headers_rule.setParseAction(storeHeaders()).suppress() + # suppresses the headers in the parsed output
pp.Optional(directory_rule) +
initial_message_rule +
pp.ZeroOrMore(info_from_user_rule | reply_rule | edit_rule | status_update_rule | missing_end_delimiter_rule)
)
assignment_rule = (
pp.Literal("Assigned-To: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("to") +
pp.Literal("Assigned-To-Updated-Time: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("datetime") +
pp.Literal("Assigned-To-Updated-By: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("by")
).setParseAction(addTypeKey("assignment"))
raw_item = """
Assigned-To: not_me
Assigned-To-Updated-Time: Fri, 29 Jan 2021 07:01:40 EST
Assigned-To-Updated-By: me
Assigned-To: you
Assigned-To-Updated-Time: 31 Jan 2021 07:01:40 EST
Assigned-To-Updated-By: not_me
Name: Jacob Bennett
Login: benne238
Computer: 1.1.1.1
Location: CARY 123
Email: benne238@purdue.edu
Phone: numberhere
Office: I wish...
UNIX Dir: dunno
Zero Dir: dunno thatone either
Subject: I need something from ECN
I am writing because I need something from ECN,
thanks, Jacob
Bennett
*** Edited by: campb303 at: 01/01/2022 09:00:00 ***
I made an edit here
*** Edited by: campb303 at: 01/01/2022 12:29:38 ***
I also made an edit here
*** Status updated by: someoneelse at: 01/01/2022 12:30:13 ***
I made a status update
*** Edited by: personone at: 01/02/2022 12:31:15 ***
ooo, personone also edited this item
*** Replied by: personone at: 01/02/22 12:34:03 ***
Hello there.... could you be more specific?
Thanks,
personone
*** Edited by: persontwo at: 01/05/22 14:58:03 ***
I made an edit too! (persontwo)
*** Status updated by: personone at: 1/7/2022 15:40:55 ***
Something happened here
*** Edited by: personone at: 04/08/22 15:41:05 ***
i dont even know anymore
=== Additional information supplied by user ===
Subject: Re: I need something from ECN
From: "Bennett, Jacob" <benne238@purdue.edu>
Date: Tue, 3 Dec 2023 14:50:44 +0000
X-ECN-Queue-Original-Path: nothing
X-ECN-Queue-Original-URL: nothing
Hi! Thanks for the quick reply. I dunnno, I was hoping you could help me with that :/
*** Edited by: you at: none ***
*** Status updated by: personone at: 1/7/2022 15:40:55 ***
Something happened here
*** Edited by: personone at: 04/08/22 15:41:05 ***
i dont even know anymore
Thanks, Jacob
===============================================
"""
parsed_item = []
item_rule.parseString(raw_item).asList() # return value discarded; parse actions append each section to parsed_item
for assignment in getAssignments():
parsed_item.insert(0, assignment)
for count, section in enumerate(parsed_item):
if section['type'] == "parse_error":
parsed_item = parsed_item[:count + 1]
break
print(json.dumps(parsed_item, indent=2))
Output:
[
{
"to": "you",
"datetime": "31 Jan 2021 07:01:40 EST",
"by": "not_me",
"type": "assignment"
},
{
"to": "not_me",
"datetime": "Fri, 29 Jan 2021 07:01:40 EST",
"by": "me",
"type": "assignment"
},
{
"Name": "Jacob Bennett",
"Login": "benne238",
"Computer": "1.1.1.1",
"Location": "CARY 123",
"Email": "benne238@purdue.edu",
"Phone": "numberhere",
"Office": "I wish...",
"UNIX Dir": "dunno",
"Zero Dir": "dunno thatone either",
"Subject": "I need something from ECN",
"type": "directory_information"
},
{
"content": [
"I am writing because I need something from ECN, \n",
"thanks, Jacob\n",
"Bennett"
],
"type": "initial_message"
},
{
"by": "campb303",
"datetime": "2022-01-01T09:00:00-0500",
"content": [
"I made an edit here"
],
"type": "edit"
},
{
"by": "campb303",
"datetime": "2022-01-01T12:29:38-0500",
"content": [
"I also made an edit here"
],
"type": "edit"
},
{
"by": "someoneelse",
"datetime": "2022-01-01T12:30:13-0500",
"content": [
"I made a status update"
],
"type": "status"
},
{
"by": "personone",
"datetime": "2022-01-02T12:31:15-0500",
"content": [
"ooo, personone also edited this item"
],
"type": "edit"
},
{
"by": "personone",
"datetime": "2022-01-02T12:34:03-0500",
"content": [
"Hello there.... could you be more specific?\n",
"\n",
"Thanks,\n",
"personone"
],
"type": "reply_to_user"
},
{
"by": "persontwo",
"datetime": "2022-01-05T14:58:03-0500",
"content": [
"I made an edit too! (persontwo)"
],
"type": "edit"
},
{
"by": "personone",
"datetime": "2022-01-07T15:40:55-0500",
"content": [
"Something happened here"
],
"type": "status"
},
{
"by": "personone",
"datetime": "2022-04-08T15:41:05-0400",
"content": [
"i dont even know anymore"
],
"type": "edit"
},
{
"type": "parse_error",
"datetime": "2021-06-09T13:05:29-0400",
"expected": "Reply from user ending delimiter",
"got": "*** Edited by: you at: none ***",
"line_num": 74
}
]
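The `checkForNested` logic above can be seen in isolation: scan a reply-from-user body for delimiters that should not appear before the ending delimiter and report the first one with its line number, recovered by counting newlines before the match. A minimal sketch (the body text is a made-up fragment):

```python
import pyparsing as pp

# Delimiters that must not appear inside a reply-from-user body.
nested_expression_rule = (
    pp.Literal("=== Additional information supplied by user ===") |
    pp.Regex(r"\*\*\* Replied by: (.*) at: (.*) \*\*\*") |
    pp.Regex(r"\*\*\* Edited by: (.*) at: (.*) \*\*\*") |
    pp.Regex(r"\*\*\* Status updated by: (.*) at: (.*) \*\*\*")
)

# Made-up reply body with a nested edit delimiter on its second line.
body = """Hi! Thanks for the quick reply.
*** Edited by: you at: none ***
Thanks, Jacob
"""

findings = []
for tokens, start, end in nested_expression_rule.scanString(body):
    # Line numbers are recovered by counting newlines before the match.
    line_num = body[:start].count("\n") + 1
    findings.append({"got": tokens[0], "line_num": line_num})
    break  # only the first nested delimiter is reported

print(findings)
```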
Working pyparsing update
This version of the pyparsing parser does almost everything that our current parser does, including formatting and sorting sections by date.
import pyparsing as pp
import json
import string
from dateutil import parser, tz
from datetime import datetime
import os, email.utils
info_from_user_start_delimiter = "=== Additional information supplied by user ==="
info_from_user_end_delimiter = "==============================================="
nested_expression_rule = (
pp.Literal(info_from_user_start_delimiter) |
pp.Regex("\*\*\* Replied by: (.*) at: (.*) \*\*\*") |
pp.Regex("\*\*\* Edited by: (.*) at: (.*) \*\*\*") |
pp.Regex("\*\*\* Status updated by: (.*) at: (.*) \*\*\*")
)
def errorHandler():
def error_action_impl(s, l, t):
location = (s[:l]).count('\n') + 1
message = 'Did not encounter a starting delimiter for any section'
if t[0][0] == info_from_user_start_delimiter:
message = "Did not encounter the ending delimiter for additional informtion from user"
parse_error = {
"type": "parse_error",
'datetime': getFormattedDate(str(datetime.now())),
'expected': message,
'got': t[0][0],
'line_num': location
}
parsed_item.append(parse_error)
return
return error_action_impl
def checkForNested():
def nested_action_impl(s, l, t):
errorParse = {}
nested_expressions_generator = nested_expression_rule.scanString(t[0])
for token, start, end in nested_expressions_generator:
errorParse = {
"type": "parse_error",
"datetime": getFormattedDate(str(datetime.now())),
"expected": "Reply from user ending delimiter",
"got": token[0],
"line_num": (s[:start + l]).count("\n") + 1
}
break
if errorParse: parsed_item.append(errorParse)
return
return nested_action_impl
def addTypeKey(section_type):
def parse_action_impl(s, l, t):
t = t.asDict()
if section_type == "reply_from_user":
t["headers"] = str(t["headers"][0]).split("\n")
for count, header in enumerate(t["headers"]):
key, value = header.split(": ", maxsplit=1)
t["headers"][count] = {"type":key, "content":value}
for header in t["headers"]:
if header["type"] == "Date":
t["datetime"] = header["content"]
if header["type"] == "Subject":
t["subject"] = header["content"]
if header["type"] == "From":
user_name, user_email = email.utils.parseaddr(header["content"])
t["from_name"] = user_name
t["from_email"] = user_email
if header["type"] == "Cc":
ccList = [
{"name":user_name, "email":user_email}
for user_name, user_email in email.utils.getaddresses([header["content"]])
]
t["cc"] = ccList
unwantedKeys=[emptyKey for emptyKey in t.keys() if t[emptyKey] == ''] # makes a list of keys with empty values
for key in unwantedKeys: del t[key] # removes empty keys
t["type"] = section_type
if "datetime" in t.keys(): t["datetime"] = getFormattedDate(t["datetime"])
if "content" in t.keys():
t["content"] = t["content"][0].strip()
t["content"] = t["content"].splitlines(True)
if t["type"] == "directory_information":
global directory_info
directory_info = t
return
parsed_item.append(t)
return
return parse_action_impl
def getAssignments() -> list:
assignment_list = []
for token, start, end in assignment_rule.scanString(headers):
token_dict = token.asDict()
token_dict["datetime"] = getFormattedDate(token_dict["datetime"])
token_dict["type"] = "assignment"
assignment_list.append(token_dict)
return assignment_list
def storeHeaders():
def parse_action_impl(s, l, t):
global headers
headers = t[0][0]
return
return parse_action_impl
def getInitialMessageHeaders():
initialMessageHeaders = {}
subject = (
(pp.LineStart() + pp.Literal("Subject: ")).suppress() +
pp.SkipTo(pp.LineEnd())
).scanString(headers)
for token, start, end in subject:
initialMessageHeaders["subject"] = token[0]
from_email = (
(pp.LineStart() + pp.Literal("From: ")).suppress() +
pp.SkipTo(pp.LineEnd())
).scanString(headers)
for token, start, end in from_email:
user_name, user_email = email.utils.parseaddr(token[0])
initialMessageHeaders["from_name"] = user_name
initialMessageHeaders["from_email"] = user_email
to = (
(pp.LineStart() + pp.Literal("To: ")).suppress() +
pp.SkipTo(pp.LineEnd())
).scanString(headers)
for token, start, end in to:
recipientList = [
{"name":user_name, "email":user_email}
for user_name, user_email in email.utils.getaddresses(token)
]
initialMessageHeaders["to"] = recipientList
cc = (
(pp.LineStart() + pp.Literal("CC: ")).suppress() +
pp.SkipTo(pp.LineEnd())
).scanString(headers)
for token, start, end in cc:
ccList = [
{"name":user_name, "email":user_email}
for user_name, user_email in email.utils.getaddresses(token)
]
initialMessageHeaders["cc"] = ccList
datetime = (
(pp.LineStart() + pp.Literal("Date: ")).suppress() +
pp.SkipTo(pp.LineEnd())
).scanString(headers)
for token, start, end in datetime:
initialMessageHeaders["datetime"] = getFormattedDate(token[0])
return initialMessageHeaders
def getFormattedDate(date: str) -> str:
"""Returns the date/time formatted as RFC 8601 YYYY-MM-DDTHH:MM:SS+00:00.
Returns empty string if the string argument passed to the function is not a datetime.
See: https://en.wikipedia.org/wiki/ISO_8601
**Returns:**
```
str: Properly formatted date/time received or empty string.
```
"""
try:
# This date is never meant to be used. The default attribute is just to set timezone.
parsedDate = parser.parse(date, default=datetime(
1970, 1, 1, tzinfo=tz.gettz('EDT')))
except:
return ""
parsedDateString = parsedDate.strftime("%Y-%m-%dT%H:%M:%S%z")
return parsedDateString
# additional information supplied by user rule
info_from_user_rule = (pp.Dict(
(info_from_user_start_delimiter + pp.LineEnd()).suppress() +
pp.Literal("\n").setWhitespaceChars("").suppress() +
(pp.Group(pp.SkipTo("\n\n"))).setResultsName("headers") +
(pp.Group(pp.SkipTo(info_from_user_end_delimiter + pp.LineEnd()).setParseAction(checkForNested())).setResultsName("content")) +
(pp.Literal(info_from_user_end_delimiter) + pp.LineEnd()).suppress()
).setParseAction(addTypeKey("reply_from_user")))
reply_rule = (
pp.Literal("\n*** Replied by: ").suppress() +
pp.Word(pp.alphanums).setResultsName("by")+
pp.Literal(" at: ").suppress() +
pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
(pp.Literal(" ***") + pp.LineEnd()).suppress() +
pp.Group(
pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
).setResultsName("content")
).leaveWhitespace().setParseAction(addTypeKey("reply_to_user"))
edit_rule = (
pp.Literal("\n*** Edited by: ").suppress() +
pp.Word(pp.alphanums).setResultsName("by")+
pp.Literal(" at: ").suppress() +
pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
(pp.Literal(" ***") + pp.LineEnd()).suppress() +
pp.Group(
pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
).setResultsName("content")
).leaveWhitespace().setParseAction(addTypeKey("edit"))
status_update_rule = (
pp.Literal("\n*** Status updated by: ").suppress() +
pp.Word(pp.alphanums).setResultsName("by")+
pp.Literal(" at: ").suppress() +
pp.SkipTo(" ***" + pp.LineEnd()).setResultsName("datetime") +
(pp.Literal(" ***") + pp.LineEnd()).suppress() +
pp.Group(
pp.SkipTo(pp.LineEnd() + (pp.Literal(info_from_user_start_delimiter) | pp.Literal("***"))) | pp.Word(string.printable)
).setResultsName("content")
).leaveWhitespace().setParseAction(addTypeKey("status"))
directory_rule = pp.Dict(
pp.Literal("\n").suppress().setWhitespaceChars("") +
pp.Optional(pp.Group("Name" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Login" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Computer" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Location" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Email" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Phone" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Office" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("UNIX Dir" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Zero Dir" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("User ECNDB" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Host ECNDB" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Optional(pp.Group("Subject" + pp.Literal(":").suppress() + pp.SkipTo(pp.LineEnd()))) +
pp.Literal("\n\n").suppress().setWhitespaceChars("")
).setParseAction(addTypeKey("directory_information"))
initial_message_rule = pp.Group(
pp.SkipTo(pp.Regex(info_from_user_start_delimiter) | pp.Regex(r'\n\*\*\*')).leaveWhitespace()
).setResultsName("content").setParseAction(addTypeKey("initial_message"))
headers_rule = pp.Group(pp.SkipTo("\n\n", include=True)).setResultsName('headers').leaveWhitespace()
error_rule = pp.Group(pp.Word(string.printable) + pp.LineEnd()).setParseAction(errorHandler())
item_rule = (
headers_rule.setParseAction(storeHeaders()).suppress() + # suppresses the headers' output from the parsed item
pp.Optional(directory_rule) +
initial_message_rule +
pp.ZeroOrMore(
(info_from_user_rule | reply_rule | edit_rule | status_update_rule) |
(error_rule)
)
)
assignment_rule = (
pp.Literal("Assigned-To: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("to") +
pp.Literal("Assigned-To-Updated-Time: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("datetime") +
pp.Literal("Assigned-To-Updated-By: ").suppress() + pp.SkipTo(pp.LineEnd()).setResultsName("by")
).setParseAction(addTypeKey("assignment"))
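The edit, status-update, and reply rules above all match the same delimiter shape. As a plain-regex illustration of that shape (for explanation only; this is not part of the pyparsing implementation), the delimiter line can be captured like this:

```python
import re

# Illustration only: the "*** <action> by: <login> at: <datetime> ***" delimiter
# shape that the edit/status/reply rules above encode in pyparsing.
DELIMITER = re.compile(r"^\*\*\* (Edited|Replied|Status updated) by: (\S+) at: (.+?) \*\*\*$")

match = DELIMITER.match("*** Edited by: campb303 at: 01/01/2022 09:00:00 ***")
print(match.groups())  # -> ('Edited', 'campb303', '01/01/2022 09:00:00')
```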
raw_item = """
Assigned-To: not_me
Assigned-To-Updated-Time: Fri, 29 Jan 2021 07:01:40 EST
Assigned-To-Updated-By: me
Assigned-To: you
Assigned-To-Updated-Time: 31 Jan 2021 07:01:40 EST
Assigned-To-Updated-By: not_me
To: hello@purdue.edu
Date: 1/1/1990 12:00:40 EST
CC: not_anyone@gmail.com
Subject: dunno
From: you
Name: Jacob Bennett
Login: benne238
Computer: 1.1.1.1
Location: CARY 123
Email: benne238@purdue.edu
Phone: numberhere
Office: I wish...
UNIX Dir: dunno
Zero Dir: dunno thatone either
Subject: I need something from ECN
I am writing because I need something from ECN,
thanks, Jacob
Bennett
*** Edited by: campb303 at: 01/01/2022 09:00:00 ***
I made an edit here
*** Edited by: campb303 at: 01/01/2022 12:29:38 ***
I also made an edit here
*** Status updated by: someoneelse at: 01/01/2022 12:30:13 ***
I made a status update
*** Edited by: personone at: 01/02/2022 12:31:15 ***
ooo, personone also edited this item
*** Replied by: personone at: 01/02/22 12:34:03 ***
Hello there.... could you be more specific?
Thanks,
personone
*** Edited by: persontwo at: 01/05/22 14:58:03 ***
I made an edit too! (persontwo)
*** Status updated by: personone at: 1/7/2022 15:40:55 ***
Something happened here
*** Edited by: personone at: 04/08/22 15:41:05 ***
i dont even know anymore
=== Additional information supplied by user ===
Subject: Re: I need something from ECN
From: "Bennett, Jacob" <benne238@purdue.edu>
Date: Tue, 3 Dec 2023 14:50:44 +0000
X-ECN-Queue-Original-Path: nothing
X-ECN-Queue-Original-URL: nothing
Hi! Thanks for the quick reply. I dunnno, I was hoping you could help me with that :/
*** Edited by: you at: none ***
*** Status updated by: personone at: 1/7/2022 15:40:55 ***
Something happened here
*** Edited by: personone at: 04/08/22 15:41:05 ***
i dont even know anymore
Thanks, Jacob
===============================================
"""
parsed_item = []
headers = ""
directory_info = {}
item_rule.parseString(raw_item).asList()
initial_message_headers = getInitialMessageHeaders()
for assignment in getAssignments():
parsed_item.insert(2, assignment)
for count, section in enumerate(parsed_item):
if section['type'] == "parse_error":
parsed_item = parsed_item[:count + 1]
break
for section in parsed_item:
if section['type'] == "initial_message":
for key in initial_message_headers.keys():
section[key] = initial_message_headers[key]
break
parsed_item = sorted(parsed_item, key = lambda dateTimeKey: parser.parse(dateTimeKey['datetime']))
parsed_item.insert(0, directory_info)
print(json.dumps(parsed_item, indent=2))
Output:
[
{
"Name": "Jacob Bennett",
"Login": "benne238",
"Computer": "1.1.1.1",
"Location": "CARY 123",
"Email": "benne238@purdue.edu",
"Phone": "numberhere",
"Office": "I wish...",
"UNIX Dir": "dunno",
"Zero Dir": "dunno thatone either",
"Subject": "I need something from ECN",
"type": "directory_information"
},
{
"content": [
"I am writing because I need something from ECN, \n",
"thanks, Jacob\n",
"Bennett"
],
"type": "initial_message",
"subject": "dunno",
"from_name": "",
"from_email": "you",
"to": [
{
"name": "",
"email": "hello@purdue.edu"
}
],
"cc": [
{
"name": "",
"email": "not_anyone@gmail.com"
}
],
"datetime": "1990-01-01T12:00:40-0500"
},
{
"to": "not_me",
"datetime": "2021-01-29T07:01:40-0500",
"by": "me",
"type": "assignment"
},
{
"to": "you",
"datetime": "2021-01-31T07:01:40-0500",
"by": "not_me",
"type": "assignment"
},
{
"type": "parse_error",
"datetime": "2021-06-11T14:27:53-0400",
"expected": "Reply from user ending delimiter",
"got": "*** Edited by: you at: none ***",
"line_num": 79
},
{
"by": "campb303",
"datetime": "2022-01-01T09:00:00-0500",
"content": [
"I made an edit here"
],
"type": "edit"
},
{
"by": "campb303",
"datetime": "2022-01-01T12:29:38-0500",
"content": [
"I also made an edit here"
],
"type": "edit"
},
{
"by": "someoneelse",
"datetime": "2022-01-01T12:30:13-0500",
"content": [
"I made a status update"
],
"type": "status"
},
{
"by": "personone",
"datetime": "2022-01-02T12:31:15-0500",
"content": [
"ooo, personone also edited this item"
],
"type": "edit"
},
{
"by": "personone",
"datetime": "2022-01-02T12:34:03-0500",
"content": [
"Hello there.... could you be more specific?\n",
"\n",
"Thanks,\n",
"personone"
],
"type": "reply_to_user"
},
{
"by": "persontwo",
"datetime": "2022-01-05T14:58:03-0500",
"content": [
"I made an edit too! (persontwo)"
],
"type": "edit"
},
{
"by": "personone",
"datetime": "2022-01-07T15:40:55-0500",
"content": [
"Something happened here"
],
"type": "status"
},
{
"by": "personone",
"datetime": "2022-04-08T15:41:05-0400",
"content": [
"i dont even know anymore"
],
"type": "edit"
}
]
Current parser vs pyparsing parser
Here is an example nested delimiter in
Current Output:
{
"type": "parse_error",
"datetime": "2021-06-11T14:42:37-0400",
"file_path": "/home/pier/e/queue/Mail/me/5",
"expected": "Did not encounter a reply-from-user ending delimiter",
"got": "\n",
"line_num": 391
}
Pyparsing Output:
{
"type": "parse_error",
"datetime": "2021-06-11T14:40:03-0400",
"expected": "Reply from user ending delimiter",
"got": "*** Replied by: flowersr at: 06/01/21 15:38:19 ***",
"line_num": 392
}
Changes
As seen above, the three main differences are:
1. the pyparsing output does not include a file_path key
2. the "got" value shows the offending delimiter line itself rather than "\n"
3. the reported line_num points at that delimiter line (392) instead of the line before it (391)
The changes in the
multiprocessing
The Python multiprocessing module was used to parse multiple items at once. The diff below shows the changes to __get_items:
def __get_items(self, headers_only: bool) -> list:
"""Returns a list of items for this Queue
Args:
headers_only (bool): If True, loads Item headers.
Returns:
list: a list of items for this Queue
"""
items = []
+ valid_items = []
+ multi_item_processes = multiprocessing.Pool(processes=32)
for item in os.listdir(self.path):
item_path = Path(self.path, item)
is_file = True if os.path.isfile(item_path) else False
if is_file and is_valid_item_name(item):
- items.append(Item(self.name, item, headers_only))
+ valid_items.append(item)
+ items = multi_item_processes.starmap_async(Item, [(self.name, item, headers_only) for item in valid_items]).get()
+ multi_item_processes.close()
+ multi_item_processes.join()
+
return items
After making this change, the time to parse the entire live queue is about 70 seconds, compared to approximately 130 seconds without multiprocessing. This is a significant improvement; however, other packages exist, such as Ray, that might make parsing entire queues even faster.
It appears that instead of loading items sequentially in the for loop, we now generate a list of valid item names and store it in valid_items.
Further multiprocessing talk should go in #35.
It appears that multi_item_processes = multiprocessing.Pool(processes=32) is setting the number of processes that can run at the same time. Can the hardcoded 32 be replaced with some system agnostic number that is calculated at run time so we can have whatever number of cores is available to us on the machine?
Yes, that can easily be changed by using the multiprocessing.cpu_count() function.
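For example, a minimal sketch of sizing the pool from the machine at run time (an illustration, not the project's code):

```python
import multiprocessing

# Ask the OS for the core count at run time instead of hardcoding 32.
workers = multiprocessing.cpu_count()
# The pool would then be created as:
# multi_item_processes = multiprocessing.Pool(processes=workers)
print(workers >= 1)  # -> True
```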
However, the list of tuples needs to represent every item in a given queue, so that is why there is a list comprehension that creates a tuple representing the queue, the item, and the headers_only argument for every item in a queue. So the list of tuples looks more like The Finally, the What does multi_item_processes.close() do?
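A self-contained sketch of the starmap_async pattern described above; it uses the thread-backed multiprocessing.dummy.Pool (which shares the Pool API) and a hypothetical load_item stand-in for the Item constructor so the example runs anywhere:

```python
from multiprocessing.dummy import Pool  # thread-backed, but same API as multiprocessing.Pool

def load_item(queue_name, item_number, headers_only):
    # Hypothetical stand-in for the Item constructor
    return f"{queue_name}/{item_number} headers_only={headers_only}"

# One tuple per item, like [(self.name, item, headers_only) for item in valid_items]
args = [("me", number, False) for number in (1, 2, 3)]

pool = Pool(processes=4)
items = pool.starmap_async(load_item, args).get()  # each tuple is unpacked into load_item's arguments
pool.close()  # close(): no further tasks may be submitted to the pool
pool.join()   # join(): wait for the workers to finish
print(items)  # -> ['me/1 headers_only=False', 'me/2 headers_only=False', 'me/3 headers_only=False']
```

Results from starmap_async come back in the same order as the argument list, regardless of which worker finished first.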
Why using multiprocessing is faster
The way we parsed items without the multiprocessing module was strictly sequential: each item had to finish parsing before the next one could begin. With a pool of workers, many items are parsed at the same time, so the total time approaches the cost of the slowest batch rather than the sum of every item.
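A toy timing comparison makes the point: with a pool, slow per-item work overlaps instead of queueing one after another. This is an assumed illustration only, using a thread-backed pool and time.sleep as a stand-in for parsing work; it is not the project's code:

```python
import time
from multiprocessing.dummy import Pool  # thread-backed; same API as multiprocessing.Pool

def parse_item(number):
    time.sleep(0.05)  # stand-in for the real per-item parsing work
    return number

start = time.perf_counter()
sequential_results = [parse_item(n) for n in range(8)]  # one at a time: ~0.4s
sequential_time = time.perf_counter() - start

start = time.perf_counter()
with Pool(processes=8) as pool:
    parallel_results = pool.map(parse_item, range(8))   # overlapped: ~0.05s
parallel_time = time.perf_counter() - start

print(parallel_time < sequential_time)  # -> True
```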
All of the above makes sense. According to your timing we're looking at parsing speeds of approx. 2x, yes? Why is this not closer to 32x faster? I understand we won't get a perfect 32x faster because not every item takes the same time to load but only 2x faster seems odd.
Performance and multiprocessing
The reason for the lack of performance is partially due to the fact that some items take longer to parse with pyparsing, but a significant hit in performance is due to the way multiprocessing is implemented. In this comment, multiprocessing is implemented such that multiple items in a queue are processed at once; however, only one queue is processed at a time. This implementation can parse all the items with content in the live queue in approximately 70 seconds. To contrast this method, I decided to implement multiprocessing so that multiple queues were processed at once, but each item in a queue was still parsed sequentially, one at a time:
test.py
import multiprocessing
import webqueue2api.parser.queue
valid_queues = webqueue2api.parser.queue.get_valid_queues()
multi_queue_processes = multiprocessing.Pool(processes=multiprocessing.cpu_count())
items = multi_queue_processes.starmap_async(webqueue2api.parser.queue.Queue, [(queue, False) for queue in valid_queues]).get()
multi_queue_processes.close()
multi_queue_processes.join()
This implementation of multiprocessing was able to parse all of the items with content in the live queue in approximately 60 seconds.
As a third way to implement multiprocessing, the name of every valid item in each queue was retrieved and put into a list, so that any item could be processed alongside any other item, regardless of which queue the items belong to:
import multiprocessing
import webqueue2api.parser.queue
import webqueue2api.parser.item
from datetime import datetime
all_valid_items = []
valid_queues = webqueue2api.parser.queue.get_valid_queues()
valid_queues = [webqueue2api.parser.queue.Queue(name=queue, headers_only=True) for queue in valid_queues]
for queue in valid_queues:
for item in queue.items:
all_valid_items.append((queue.name, item.number, False))
start_time = datetime.timestamp(datetime.now())
multi_queue_processes = multiprocessing.Pool(processes=multiprocessing.cpu_count())
items = multi_queue_processes.starmap_async(webqueue2api.parser.item.Item, [(queue, number, header) for (queue, number, header) in all_valid_items]).get()
multi_queue_processes.close()
multi_queue_processes.join()
end_time = datetime.timestamp(datetime.now())
print(f"Time to parse all items with content: {(end_time - start_time)} seconds")
Note: it is possible that the list of all_valid_items in this example can become outdated and cause errors before the processing of each item can finish.
Re: our call today; threading at the item level is currently the most efficient method we have for loading queues in parallel. I'd like to see if we can nest threaded workloads by parallelizing a Queue's Item loading and parallelizing loading of the queues. If we can do this, we can rewrite the load_queues function:
def load_queues(*args: list, headers_only: bool = True) -> list:
"""Load Queues requested.
Args:
*args (list): List of strings of Queue names. If only one name exists, loading happens sequentially. If multiple names are passed, loading happens in parallel
"""
pass
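A hedged sketch of what that load_queues could look like: sequential for a single name, pooled for many. The Queue class here is a dummy stand-in for webqueue2api.parser.queue.Queue, the pool is thread-backed so the sketch is self-contained, and the queue names are placeholders; none of this is the final implementation:

```python
import multiprocessing
from multiprocessing.dummy import Pool  # thread-backed; same API as multiprocessing.Pool

class Queue:
    """Dummy stand-in for webqueue2api.parser.queue.Queue, just for this sketch."""
    def __init__(self, name, headers_only=True):
        self.name = name
        self.headers_only = headers_only

def load_queues(*args: str, headers_only: bool = True) -> list:
    """Load the named Queues: sequentially for one name, in parallel for many."""
    if len(args) <= 1:
        return [Queue(name, headers_only) for name in args]
    with Pool(processes=multiprocessing.cpu_count()) as pool:
        return pool.starmap(Queue, [(name, headers_only) for name in args])

queues = load_queues("queue_a", "queue_b", "queue_c", headers_only=False)
print([q.name for q in queues])  # -> ['queue_a', 'queue_b', 'queue_c']
```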
Allowing child sub processes from a sub process
By default, multiprocessing does not allow a sub process to create child sub processes of its own. When this is attempted, an exception is raised and the script exits:
AssertionError: daemonic processes are not allowed to have children
However, according to this stackoverflow answer, it is possible to create a custom class that allows child sub processes to be created from an already existing sub process. For this implementation, the following changes were made to test.py:
import multiprocessing
import multiprocessing.pool
import webqueue2api.api.resources.queue
import webqueue2api.parser.queue
from datetime import datetime
start_time = datetime.timestamp(datetime.now())
# custom class creation based on stackoverflow answer
class NoDaemonProcess(multiprocessing.Process):
# make 'daemon' attribute always return False
def _get_daemon(self):
return False
def _set_daemon(self, value):
pass
daemon = property(_get_daemon, _set_daemon)
# We sub-class multiprocessing.pool.Pool instead of multiprocessing.Pool
# because the latter is only a wrapper function, not a proper class.
class MyPool(multiprocessing.pool.Pool):
Process = NoDaemonProcess
valid_queues = webqueue2api.parser.queue.get_valid_queues()
headers_only = False
multi_queue_process = MyPool(processes=multiprocessing.cpu_count())
queues = multi_queue_process.starmap_async(webqueue2api.parser.queue.Queue, [(queue, headers_only) for queue in valid_queues]).get()
multi_queue_process.close()
multi_queue_process.join()
end_time = datetime.timestamp(datetime.now())
print(f'Total time to parse with{"out" if headers_only else ""} headers: {end_time - start_time}')
With this implementation, it takes approximately 36 seconds to parse every item in every queue with content.
Summary of changes made to the api with pyparsing and multiprocessing
pyparsing changes
The parser is now a formal grammar, built with pyparsing, that uses a series of rules to extract information from an item and format it into a json structure that the frontend can understand.
multiprocessing
To enhance the speed of the new pyparsing parser, which is slower than our original parser, multiprocessing was implemented so that multiple items and multiple queues can be parsed at once, as opposed to waiting for each item in each queue sequentially.
Still needs to be done
For the correct implementation of multiprocessing to work, I think some changes need to be made to this file to properly make use of the new
Closed by #41
The parser as it stands is functional but brittle. It needs to be rewritten as library code with exceptions, context managers, and other abstractions.