blog

New attempt to make markdown converter work

And that's eum... Pretty hard. I don't know why, but I can only imagine my regexes become so fucking ugly, that I can't do this properly in Python without creating a burden of problems and bugs. Putting this in the commit for historic reference.

Author: Maarten 'Vngngdn' Vangeneugden
Date: Sept. 28, 2017, 11:50 p.m.
Hash: 5a08d014792d6a8417ef684c8e0a22eb6e6ac351
Parent: 772aedaf0099886a369576cc8f8ab9a6860ced2e
Modified files: markdown.py, models.py

markdown.py

22 additions and 15 deletions.

1
1
import pygments
import re
2
2
3
3
""" So welcome to my Markdown module. Since the markdown library in PyPI is
4
4
fucking shit, I've decided to write my own implementation. Contrary to the one in
5
5
PyPI, my version handles **all** cases, and is a **full implementation** of the
6
6
reference.
7
7
8
8
Oh, and just so you know: You don't need an entire shitty object oriented system
9
9
to make something decent. Sometimes the solution is a function. Period.
10
10
"""
11
11
12
12
"""
13
13
Checklist about all shit that must be implemented:
14
14
    - headers need to have their ID's be the same as the title. BUT! id's
15
15
      mustn't have spaces, and need to be unique. The latter isn't that big of a
16
16
      deal, but spaces in the header title must be converted to dashes.
17
17
    - HTML code needs to be escaped; & must become &amp;, < and > become &lt;
18
18
      and &gt; and so on. This isn't necessary for UTF-8 symbols such as ©,
19
19
      which can be put in place as is, instead of converting to &copy;.
20
20
    - Some elements have to be placed in the tag itself, such as links in <a />.
21
21
      This is noted with the {#} tags. The context in which they are used in the
22
22
      defaults should give a good explanation on what number points to what.
23
23
    - Remember to support 2 trailing spaces as <br />!
24
24
    - There are also "closing ATX headers": "# title" is the same as 
25
25
      "# title ####" and "# title #". (So it's purely cosmetic, remove the
26
26
      trailing whitespace in these cases)
27
27
    - When code is used, call Pygments to markup the code properly. If a code
28
28
      tag is provided (e.g. "Python", "C", ...), tell that to Pygments as well,
29
29
      so it can do a better job. If nothing is provided, leave it as is. When
30
30
      it's an inline code block (`CODE`), leave that always as is.
31
31
      Look how to do it at
32
32
      <http://docs.getpelican.com/en/stable/content.html#syntax-highlighting>.
33
33
34
34
Future expansions:
35
35
    - Allow nesting of more elements. For example: Headers cannot be nested in
36
36
      blockquotes, but this is a nice thing to have.
37
37
    - Allow headers to follow a line wrapping, if the next line is preceded by
38
38
      the same amount of hashtags (=> same header level).
39
39
    - Allow the special p "Perseverance porn" stories, about how someone walks 10 miles to work every day, have the effect of normalizing the big disadvantages from society that make people do hard labor that society should not need.
40
-
41
-
Marriage
42
-
7 August 2017aragraph blockquote style:
43
-
      https://daringfireball.net/projects/markdown/syntax#blockquote
44
40
    """
45
41
46
42
47
-
def toHTML(
48
43
        text,
49
44
        emphasis = r"<em>{text}</em>",
50
45
        strong = r"<strong>{text}</strong>",
51
46
        unordered_list = r"<ul>{items}</ul>",
52
47
        ordered_list = r"<ol>{items}</ol>",
53
48
        list_item = r"<li>{text}</li>",
54
49
        hyperlink = r'<a href="{link}" title="{title}">{text}</a>',
55
50
        image = r'<img src="{link}" alt="{alt}" title="{title}" />',
56
51
        paragraph = r"<p>{text}</p>",
57
52
        blockquote = r"<blockquote>{text}</blockquote>",
58
53
        header1 = r'<h1 id="{link}">{text}</h1>',
59
54
        header2 = r'<h2 id="{link}">{text}</h2>',
60
55
        header3 = r'<h3 id="{link}">{text}</h3>',
61
56
        header4 = r'<h4 id="{link}">{text}</h4>',
62
57
        header5 = r'<h5 id="{link}">{text}</h5>',
63
58
        header6 = r'<h6 id="{link}">{text}</h6>',
64
59
        inline_code = r'<code>\g<code></code>',
65
-
        incorrect = r"<s>{text}</s>",
+
60
        code = r'<code lang="\g<language>">\g<code></code>',
+
61
        incorrect = r"<s>{text}</s>",
66
62
        line_break = r"<br />",
67
63
        horizontal_rule = r"<hr />",
68
64
        ):
69
65
    """ Translates Markdown code to HTML code.
70
66
71
67
    This is a pure function.
72
68
73
69
    This function will translate given Markdown code to HTML code.
74
70
    It follows the specification as closely as possible, with a few custom additions:
75
71
    - Incorrect text can be marked with "~" around a text block.
76
72
77
73
    The default parameters have sane defaults, but can be customized if you wish
78
74
    to do so. Pay attention to the tags, as your custom value must also
79
75
    incorporate these.
80
76
81
77
    The function works in a simple way:
82
78
    1. Replace all redundant content with only 1 unique part
83
79
    1.1. For example: 5 blank lines mean the same as 2; a line with only spaces
84
80
         and tabs means the same as an empty line; hashtags at the end of a header
85
81
         line are meaningless; ...
86
82
    2. Handle blockquotes. Blockquotes have the highest precedence and can contain
87
83
       any other element, thus it's easiest to just handle these as soon as possible.
88
84
    3. Replace Setext with atx-style headers, to provide consistency for header handling.
89
85
    4. Handle block elements (paragraphs, code, ...).
90
86
    5. In all block elements, handle span elements (links, emphasis, ...).
91
87
    """
92
88
93
89
    # Replacing some shit:
94
90
    text = re.sub(r"^[ \t]+$", "", text, flags=re.MULTILINE)  # Make all blank lines consistent
95
91
    text = re.sub(r"\n{3,}", "\n\n", text)  # Collapse runs of blank lines into a single blank line
96
92
97
93
    # XXX: Blockquotes have the highest precedence: **ANYTHING** can be nested
98
94
    # in a blockquote. So, handle these first, and convert them up front to
99
95
    # make it easier to handle the other text.
100
96
101
97
102
98
    """ About handling blockquotes:
103
99
    Every line that starts with "> " is a blockquote. As long as the next line
104
100
    starts in the same way, it's considered part of the same blockquote.
105
101
    **However**, there is 1 exception to this rule:
106
102
    paragraphs that are hard-wrapped only need 1 > for their first line, but can
107
103
    then be hard wrapped, and even start without prior spacing.
108
104
    """
109
105
    blockquotes_left = True
110
106
    while blockquotes_left:
111
107
         blockquote = re.compile(r"(^> .+\n)+", flags=re.MULTILINE)
112
108
         quote = blockquote.search(text)
113
109
         if quote is None:
114
110
             blockquotes_left = False
115
111
         else:
116
112
             begin, end = quote.span()
117
113
             reworked = "<blockquote>\n" + re.sub(r"^> ", "", text[begin:end], flags=re.MULTILINE) + "</blockquote>\n"
118
114
             text = text[:begin] + reworked + text[end:]
119
115
120
116
    # All blockquotes are now removed
121
117
122
118
    # Converting setext to atx headers
123
119
    text = re.sub(r"^(?P<title>.+)\n=+$", r"# \g<title>", text, flags=re.MULTILINE)
124
120
    text = re.sub(r"^(?P<title>.+)\n-+$", r"## \g<title>", text, flags=re.MULTILINE)
125
121
    # All are now converted to atx style headers
126
122
    # Transforming headers:
127
123
    for i in range(1,7):
128
124
        header = r"^#{"+str(i)+r"} (?P<title>.+)$"
129
125
        match = re.search(header, text, flags=re.MULTILINE)
130
126
        while match is not None:
131
127
            future_id = match['title'].lower()
132
128
            future_id = re.sub(r"[ _,.!]", r"-", future_id)
133
-
            dictionary = match.groupdict()
+
129
            future_id = re.sub(r" ", r"-", future_id)
+
130
            dictionary = match.groupdict()
134
131
            dictionary['link'] = future_id
135
132
            replacement = (r'<h'+str(i)+r' id="{link}">{title}</h'+str(i)+r'>').format_map(dictionary)
136
133
            text = text[:match.start()] + replacement + text[match.end():]
137
134
            match = re.search(header, text, flags=re.MULTILINE)
138
135
139
136
    # All headers transformed
140
137
141
138
    # Paragraphs
142
139
    text = re.sub(r"(?P<text>(?:^(?!<).+\n)+)", r"<p>\n\g<text></p>", text, flags=re.MULTILINE)
143
140
144
141
145
142
    # Doing inline hyperlinks
146
143
    text = re.sub(r"\[(?P<text>[^\]]+)\]\((?P<url>[^)\s]+) \"(?P<title>[^\"]+)\"\)", r'<a href="\g<url>" title="\g<title>">\g<text></a>', text)
147
-
    text = re.sub(r"\[(?P<text>.+)\]\((?P<url>.+)\)", r'<a href="\g<url>">\g<text></a>', text)
148
-
149
-
    # Doing emphasis and strongs
150
-
    text = re.sub(r"\*\*(?P<text>[^*.]*)\*\*", r"<strong>\g<text></strong>", text)
151
-
    text = re.sub(r"__(?P<text>[^\_.]*)__", r"<strong>\g<text></strong>", text)
152
-
    text = re.sub(r"\*(?P<text>[^\*.]*)\*", r"<em>\g<text></em>", text)
153
-
    text = re.sub(r"_(?P<text>[^\_.]*)_", r"<em>\g<text></em>", text)
154
-
+
144
    text = re.sub(r"\[(?P<text>.+?)\]\((?P<url>.+?)\)", r'<a href="\g<url>">\g<text></a>', text, flags=re.S)
+
145
+
146
    # Doing strongs
+
147
    text = re.sub(r"\*\*(?P<text>.+?)\*\*", r"<strong>\g<text></strong>", text, flags=re.S)
+
148
    text = re.sub(r"__(?P<text>.+?)__", r"<strong>\g<text></strong>", text, flags=re.S)
+
149
    # Doing emphasis
+
150
    text = re.sub(r"\*(?P<text>.+?)\*", r"<em>\g<text></em>", text, flags=re.S)
+
151
    text = re.sub(r"_(?P<text>.+?)_", r"<em>\g<text></em>", text, flags=re.S)
+
152
    # Code blocks
+
153
    text = re.sub(r"^```(?P<language>.+?)\n(?P<code>.+?)\n```$", code, text, flags=re.S | re.M)
+
154
    # Doing inline code
+
155
    text = re.sub(r"``(?P<code>.+?)``", inline_code, text, flags=re.S)
+
156
    text = re.sub(r"`(?P<code>.+?)`", inline_code, text, flags=re.S)
+
157
    # Horizontal rules
+
158
    text = re.sub(r"^((\*|_|-) *){3,}$", horizontal_rule, text, flags=re.M)
+
159
    # Line breaks
+
160
    text = re.sub(r"  $", line_break, text, flags=re.M)
+
161
155
162
    return text
156
163
"""
157
164
158
165
159
166
160
167
161
168
162
169
    block_elements_table = {
163
170
        "code": r"```(?P<language>\w+)\n(    .*\n)+",
164
171
        "blockquote": r"^> (?P<text>.+)
165
172
        "paragraph": r"(?P<text>(^.+\n)+)",
166
173
        "header": r"^#{1,6} (?P<title>(\w+ ?)+ *) ?#*$",
167
174
168
175
169
176
    element_table = {
170
177
        "emphasis": (r"\*(?P<text>[^\*.]*)\*|_(?P<text>[^\_.]*)_", emphasis, emphasis_end),
171
178
        "strong": (r"\*\*(?P<text>[^*.]*)\*\*|__(?P<text>[^\_.]*)__", strong, strong_end),
172
179
        "unordered list": (r"")
173
180
        "inline link": (r"\[(\w\s)+\]\(
174
181
175
182
176
183
    def translate(text, begin, end, parameters):
177
184
178
185
        if alpha:  # If this contains no more nested elements:
179
186
            return begin.format(parameters) + text + end
180
187
        elif beta:  # text contains nested elements:
181
188
            # Find parameters or something IDK
182
189
            return begin.format(parameters) + _
183
190
        translate(text[alpha:beta], begin_tag, end_tag, found_parameters) + _
184
191
        end
185
192
186
193
    # Zoom zoom insert magic code here
187
194
188
195
    # NOTE: Hyperlinks are handled specially in Markdown. Check the syntax page
189
196
    # for more information. That said, it's imperative to **first** collect all
190
197
    # information about hyperlinks, and remove it, so it can be used when
191
198
    # parsing hyperlinks.
192
199
193
200
    # Table of all elements and their respective regular expression:
194
201
    elements = {
195
202
        paragraph: r"",
196
203
        ordered_list_item: r""
197
204
        hyperlink:
198
205
        header1: r"^# [*\(\n)] \n"
199
206
    }
200
207
201
208
"""
202
209
""" The reason the length is stored instead of the end, is because it is
203
210
    less error prone; if a parent node is updated, only the begin needs to be
204
211
    updated, as the length is still the same for the node. The begin can be
205
212
    relative to the parent node, so even that won't have to be updated. """
206
213
"""
207
214
    node = {
208
215
            "type": block_type,
209
216
            "begin": begin,
210
217
            "length": length,
211
218
            "children": children,
212
219
            }
213
220
214
221
    return markdown_code
215
222
"""
216
223
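The blockquote rule described in the docstring above (a quote starts at a line beginning with "> " and may continue with hard-wrapped lines that carry no marker) can be sketched roughly as follows. This is a minimal sketch, not the code from this commit: the names `BLOCKQUOTE` and `wrap_blockquotes` are made up for illustration, the input is assumed to end with a newline, and only one level of quoting is handled.

```python
import re

# A quote starts at a line beginning with "> " and runs until the next blank
# line, so hard-wrapped continuation lines without a marker are included.
# (Hypothetical helper names; assumes the text ends with a newline.)
BLOCKQUOTE = re.compile(r"^> .*\n(?:.+\n)*", flags=re.MULTILINE)


def wrap_blockquotes(text):
    """Wrap every blockquote in <blockquote> tags and drop the "> " markers."""
    def replace(match):
        # Strip one level of quoting from each line that carries a marker.
        inner = re.sub(r"^> ?", "", match.group(0), flags=re.MULTILINE)
        return "<blockquote>\n" + inner + "</blockquote>\n"
    return BLOCKQUOTE.sub(replace, text)
```

Using a replacement function with `re.sub` avoids the manual search-and-splice loop, and `re.MULTILINE` is what lets `^` match at the start of every line rather than only at the start of the text.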

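The checklist in the module docstring asks for header ids that contain no spaces and are unique, and notes that closing ATX hashes are purely cosmetic. One possible way to do that is sketched below. The function name `header_id` and the `seen` dictionary are assumptions made for this example, and the "-1", "-2" suffix scheme is just one way to get uniqueness.

```python
import re


def header_id(title, seen):
    """Derive a unique, space-free id from a header title.

    `seen` is a dict that counts how often each id was handed out already.
    (Hypothetical helper, not part of the commit.)
    """
    # Drop a closing ATX sequence: "title ####" means the same as "title".
    title = re.sub(r"\s+#+\s*$", "", title.strip())
    # Lowercase, then collapse whitespace and punctuation into single dashes.
    slug = re.sub(r"[\s_,.!]+", "-", title.lower()).strip("-")
    # Append a counter when the same id was generated before.
    count = seen.get(slug, 0)
    seen[slug] = count + 1
    return slug if count == 0 else "{}-{}".format(slug, count)
```

Called twice with "My First Post!" and the same `seen` dict, this returns "my-first-post" and then "my-first-post-1". A title that literally ends in "-1" could still collide with a generated suffix, so this is not a complete solution.
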
models.py

21 additions and 0 deletions.

1
1
from django.db import models
2
2
import datetime
3
3
import os
4
4
5
5
def post_title_directory(instance, filename):
6
6
    """ Files will be uploaded to MEDIA_ROOT/blog/<year of publishing>/<blog
7
7
    title>
8
8
    The blog title is determined by the text before the first period (".") in
9
9
    the filename. So if the file has the name "Trains are bæ.en.md", the file
10
10
    will be stored in "blog/<this year>/Trains are bæ". Name your files
11
11
    properly!
12
12
    It should also be noted that all files are stored in the same folder if they
13
13
    belong to the same blogpost, regardless of language. The titles that are
14
14
    displayed to the user however, should be the titles of the files themselves,
15
15
    which should be in the native language. So if a blog post is titled
16
16
    "Universities of Belgium", its Dutch counterpart should be titled
17
17
    "Universiteiten van België", so the correct title can be derived from the
18
18
    filename.
19
19
20
20
    Recommended way to name the uploaded file: "<name of blog post in language
21
21
    it's written>.md". This removes the maximum amount of redundancy (e.g. the
22
22
    language of the file can be derived from the title, no ".fr.md" or something
23
23
    like that necessary), and can directly be used for the end user (the title
24
24
    is what should be displayed).
25
25
    """
26
26
    english_file_name = os.path.basename(instance.english_file.name) # TODO: Test if this returns the file name!
27
27
    english_title = english_file_name.rpartition(".")[0] 
28
28
    year = datetime.date.today().year
29
29
30
30
    return "blog/{0}/{1}/{2}".format(year, english_title, filename)
31
31
32
32
class Post(models.Model):
33
33
    """ Represents a blog post. The title of the blog post is determined by the name
34
34
    of the files.
35
35
    A blog post can be in 5 different languages: German, Spanish, English, French,
36
36
    and Dutch. For all these languages, a separate field exists. Thus, a
37
37
    translated blog post has a separate file for each translation, and is
38
38
    separated from Django's internationalization/localization system.
39
39
    Only the English field is mandatory. The others may contain a value if a
40
40
    translated version exists, which will be displayed accordingly.
41
41
    """
42
42
    published = models.DateTimeField(auto_now_add=True)
43
43
    english_file = models.FileField(upload_to=post_title_directory, unique=True, blank=False)
44
44
    dutch_file = models.FileField(upload_to=post_title_directory, blank=True)
45
45
    french_file = models.FileField(upload_to=post_title_directory, blank=True)
46
46
    german_file = models.FileField(upload_to=post_title_directory, blank=True)
47
47
    spanish_file = models.FileField(upload_to=post_title_directory, blank=True)
48
48
    # Only the English file can be unique, because apparently, there can't be
49
49
    # two blank fields in a unique column. Okay then.
50
50
51
51
    def __str__(self):
52
52
        return os.path.basename(self.english_file.name).rpartition(".")[0]
53
53
+
54
class Comment(models.Model):
+
55
    """ Represents a comment on a blog post.
+
56
+
57
    Comments are not linked to an account or anything, I'm trusting the
+
58
    commenter that he is honest with his credentials. That being said:
+
59
    XXX: Remember to put up a notification that comments are not checked for
+
60
    identity, and, unless verified by a trustworthy source, cannot be seen as
+
61
    being an actual statement from the commenter.
+
62
    Comments are linked to a blogpost, and are not filtered by language. (So a
+
63
    comment made by someone reading the article in Dutch, that's written in
+
64
    Dutch, will show up (unedited) for somebody who's reading the Spanish
+
65
    version.)
+
66
    XXX: Remember to notify (tiny footnote or something) that comments showing
+
67
    up in a foreign language is by design, and not a bug.
+
68
    """
+
69
    date = models.DateTimeField(auto_now_add=True)
+
70
    name = models.TextField()
+
71
    mail = models.EmailField()
+
72
    post = models.ForeignKey(Post, on_delete=models.CASCADE) # TODO: Finish this class and the shit...
+
73
+
74
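The docstring of `post_title_directory` says the title is the text before the first period in the filename, while the code uses `rpartition`, which splits on the last period. The sketch below shows the difference on the docstring's own example. Both helper names are hypothetical and nothing here touches Django; it only illustrates the string handling.

```python
import os


def title_before_first_period(name):
    """Title as the docstring describes it: text before the first period."""
    return os.path.basename(name).partition(".")[0]


def title_before_last_period(name):
    """Title as the code computes it: text before the last period."""
    return os.path.basename(name).rpartition(".")[0]
```

For "Trains are bæ.en.md" the first helper gives "Trains are bæ" and the second gives "Trains are bæ.en". With the recommended naming scheme ("<title>.md") both agree, but the two also differ on names without any period: `partition` returns the whole name, `rpartition` returns an empty string.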