Doc To MediaWiki Converter

From Physics 111-Lab Wiki

Jump to: navigation, search

Contents

Our Converter File

Our .doc to MediaWiki converter is based on scripts found all over the web. We've combined several of them. Primarily, we started with Word2Wiki.bas. This script handles formatting, tables, lists, and also images. The PasteInHere.doc file contains the ConvertToMediaWiki macro - just paste in your materials into the .doc file, save the file as something descriptive, and then run the macro. The macro will produce two things:

  • a .txt file. The content of this file can be pasted directly into MediaWiki edit box.
  • a folder containing the images. Each image has a prefix derived from the name of the .doc file - this helps greatly if you are uploading numerous pages.

Download: PasteInHere.doc

For safety, the macro does not touch your .doc file.

Steps in Converting a Word File to Wiki

  1. Consider how the Wiki version of the document will be organized. Consider breaking up long documents (greater than 10 pages) into multiple Wiki pages in some logical structure. Sections that might have been called Appendix A, B, . . . could be given more meaningful names. You may want to convert the Word file into a single Wiki page initially, then cut and paste sections into new Wiki pages as needed.
  2. Set up and apply consistent heading styles in the Word document (Heading 1, Heading 2, . . .)
  3. Remove headers and footers
  4. Any pictures within a frame need to be moved out of the frame; also, compare all images in the .doc file to their .pdf version: if the .pdf version is higher quality, replace the .doc version with it
  5. Copy-Paste the .doc text into the conversion document(PasteInHere.doc), and save the conversion document as something corresponding to the lab you are working on. i.e. BRA for Beta Ray Spectroscopy
  6. Run the conversion macro
  7. Remove Backslashes added around brackets by conversion program
  8. Check to make sure headings have been converted correctly (the heading level in wiki text is related to the number of equal signs i.e. =This Section= would have a level of Heading 1). Sometimes it puts the right amount, but spaces it so that the wiki doesn’t detect it.
    1. One thing it does wrong is add a carriage return, putting the right hand set of equal signs on the line below; wiki doesn’t realize this! You need to fix it.
    2. The other thing it will do is start the section text on the same line as the section heading. This will cause wiki to think the section heading is part of the text, causing it to not be detected.
  9. Fix numbered lists and any text related to them. All lettered lists are turned into numbers by wiki.
    1. There can’t be any empty lines between the separate numbered sections or it will reset
  10. Equations are converted to images that are difficult to edit. Replace equations, greek letters, and other symbols with TeX code.
    1. Put carriage returns around equations to space them correctly and add few colons in front to indent them.
  11. Add Images
    1. that image spacing is much more easily done in the wiki via the markup code.
    2. Set the image size in the wiki image markup tag to approximately the same size as the length or width (set it to the larger of the two)
  12. Check all super and sub scripts



Source Code for Word Document to Wiki converter

Here is the source code in case you are interested:


Private myCopy  As Document
Private fileName As String
Private filePath As String

'version 1.1

Sub ConvertToMediaWiki()

    On Error Resume Next
    
    Application.ScreenUpdating = False
    
    'manipulate only the copy for safety
    MakeACopy
       
    'clean it up a bit before we begin
    LeftAlign
    ReplaceQuotes
    MediaWikiEscapeChars
    CleanFormattingParagraphEndingsBold
    CleanFormattingParagraphEndingsItalic
    
    'begin converting
    ConvertHyperlinks
    MediaWikiConvertH1
    MediaWikiConvertH2
    MediaWikiConvertH3
    MediaWikiConvertH4
    MediaWikiConvertH5
    
    MediaWikiConvertBoldItalic
    MediaWikiConvertItalic
    MediaWikiConvertBold
    
    MediaWikiConvertSuperscript
    MediaWikiConvertSubscript
    MediaWikiConvertLists
    MediaWikiConvertTables
           
    'fix and save
    FixParagraphEndings
    EmitImagesAndSaveasHTMLandTXT
       
    myCopy.Close
    
    Application.ScreenUpdating = True
End Sub

Private Sub MakeACopy()

    filePath = ActiveDocument.Path + "\"
    
    sBuffer = Trim$(ActiveDocument.Name)
    
    If Len(sBuffer) > 0 Then
        nPosn = InStrRev(sBuffer, ".", , vbTextCompare)
        If nPosn > 0 Then
            sBuffer = Left$(sBuffer, nPosn - 1) & sSuffix
        End If
    End If
  
    fileName = sBuffer
    
    Set myCopy = Documents.Add(ActiveDocument.FullName)

End Sub

Private Sub LeftAlign()

    myCopy.Select
    
    Selection.WholeStory
    
    With Selection.ParagraphFormat
        .LeftIndent = InchesToPoints(0)
        .SpaceBeforeAuto = False
        .SpaceAfterAuto = False
    End With
    
    With Selection.ParagraphFormat
        .FirstLineIndent = InchesToPoints(0)
        .SpaceBeforeAuto = False
        .SpaceAfterAuto = False
    End With

    Selection.ParagraphFormat.Alignment = wdAlignParagraphLeft

End Sub
Private Sub ConvertHyperlinks()
    Dim link As Hyperlink
    For Each link In myCopy.Hyperlinks
        If Len(link.TextToDisplay) < 1 Then
            link.Range.InsertBefore link.Address
        Else
            link.Range.InsertBefore "[" & link.Address & " " & link.TextToDisplay & "]"
        End If
    Next link
    While myCopy.Hyperlinks.Count > 0
    For Each link In myCopy.Hyperlinks
        link.Range.Delete
    Next link
    Wend
End Sub

Private Sub MediaWikiConvertH1()
    ReplaceHeading wdStyleHeading1, "="
End Sub

Private Sub MediaWikiConvertH2()
    ReplaceHeading wdStyleHeading2, "=="
End Sub

Private Sub MediaWikiConvertH3()
    ReplaceHeading wdStyleHeading3, "==="
End Sub

Private Sub MediaWikiConvertH4()
    ReplaceHeading wdStyleHeading4, "===="
End Sub

Private Sub MediaWikiConvertH5()
    ReplaceHeading wdStyleHeading5, "====="
End Sub

Private Sub MediaWikiConvertBoldItalic()
    
    myCopy.Select
    
    With Selection.Find

        .ClearFormatting
        .Font.Bold = True
        .Font.Italic = True
        .Text = ""
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False

        .Forward = True
        .Wrap = wdFindContinue

        Do While .Execute
            With Selection
                If Len(.Text) > 1 And InStr(1, .Text, vbCr) Then
                    ' Just process the chunk before any newline characters
                    ' We'll pick-up the rest with the next search
                    .Collapse
                    .MoveEndUntil vbCr
                End If

                ' Don't bother to markup newline characters (prevents a loop, as well)
                If Not .Text = vbCr Then
                    .InsertBefore "'''''"
                    .InsertAfter "'''''"
                End If

                .Style = myCopy.Styles("Default Paragraph Font")
                .Font.Bold = False
                .Font.Italic = False
            End With
        Loop
    End With

End Sub

Private Sub MediaWikiConvertBold()
    
    myCopy.Select
    
    With Selection.Find

        .ClearFormatting
        .Font.Bold = True
        .Text = ""

        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False

        .Forward = True
        .Wrap = wdFindContinue

        Do While .Execute
            With Selection
                If Len(.Text) > 1 And InStr(1, .Text, vbCr) Then
                    ' Just process the chunk before any newline characters
                    ' We'll pick-up the rest with the next search
                    .Collapse
                    .MoveEndUntil vbCr
                End If

                ' Don't bother to markup newline characters (prevents a loop, as well)
                If Not .Text = vbCr Then
                    .InsertBefore "'''"
                    .InsertAfter "'''"
                End If

                .Style = myCopy.Styles("Default Paragraph Font")
                .Font.Bold = False
            End With
        Loop
    End With

End Sub

Private Sub MediaWikiConvertItalic()

    myCopy.Select

    With Selection.Find
    
        .ClearFormatting
        .Font.Italic = True
        .Text = ""
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
        .Forward = True
        .Wrap = wdFindContinue
        
        Do While .Execute
            With Selection
                If Len(.Text) > 1 And InStr(1, .Text, vbCr) Then
                    ' Just process the chunk before any newline characters
                    ' We'll pick-up the rest with the next search
                    .Collapse
                    .MoveEndUntil vbCr
                End If

                ' Don't bother to markup newline characters (prevents a loop, as well)
                If Not .Text = vbCr Then
                    .InsertBefore "''"
                    .InsertAfter "''"
                End If

                .Style = myCopy.Styles("Default Paragraph Font")
                .Font.Italic = False
            End With
        Loop
    End With
End Sub

' Replace all smart quotes with their dumb equivalents

Private Sub ReplaceQuotes()
    Dim quotes As Boolean
    quotes = Options.AutoFormatAsYouTypeReplaceQuotes
    Options.AutoFormatAsYouTypeReplaceQuotes = False
    ReplaceString ChrW(8220), """"
    ReplaceString ChrW(8221), """"
    ReplaceString "‘", "'"
    ReplaceString "’", "'"
    Options.AutoFormatAsYouTypeReplaceQuotes = quotes
End Sub



Private Function ReplaceHeading(styleHeading As String, headerPrefix As String)

    Dim normalStyle As Style

    Set normalStyle = myCopy.Styles(wdStyleNormal)
    
    myCopy.Select

    With Selection.Find
    
        .ClearFormatting
        .Style = myCopy.Styles(styleHeading)
        .Text = ""

        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False

        .Forward = True
        .Wrap = wdFindContinue

        Do While .Execute
            With Selection
                If InStr(1, .Text, vbCr) Then
                    ' Just process the chunk before any newline characters
                    ' We'll pick-up the rest with the next search
                    .Collapse
                    .MoveEndUntil vbCr
                End If

                ' Don't bother to markup newline characters (prevents a loop, as well)
                If Not .Text = vbCr Then
                    .InsertBefore headerPrefix
                    .InsertBefore vbCr
                    .InsertAfter headerPrefix
                End If
                .Style = normalStyle
            End With
        Loop
    End With
End Function

Private Function EscapeCharacter(char As String)
    ReplaceString char, "\" & char
End Function

Private Sub MediaWikiEscapeChars()
    EscapeCharacter "*"
    EscapeCharacter "#"
    'EscapeCharacter "_"
    'EscapeCharacter "-"
    'EscapeCharacter "+"
    EscapeCharacter "{"
    EscapeCharacter "}"
    EscapeCharacter "["
    EscapeCharacter "]"
    'EscapeCharacter "~"
    'EscapeCharacter "^^"
    EscapeCharacter "|"
    'EscapeCharacter "'"
End Sub

Private Function ReplaceString(findStr As String, replacementStr As String)
    Selection.Find.ClearFormatting
    Selection.Find.Replacement.ClearFormatting
    With Selection.Find
        .Text = findStr
        .Replacement.Text = replacementStr
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Function

Private Sub MediaWikiConvertSuperscript()

    myCopy.Select

    With Selection.Find

        .ClearFormatting
        .Font.Superscript = True
        .Text = ""
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False

        .Forward = True
        .Wrap = wdFindContinue

        Do While .Execute
            With Selection
                .Text = Trim(.Text)
                If Len(.Text) > 1 And InStr(1, .Text, vbCr) Then
                    ' Just process the chunk before any newline characters
                    ' We'll pick-up the rest with the next search
                    .Collapse
                    .MoveEndUntil vbCr
                End If

                ' Don't bother to markup newline characters (prevents a loop, as well)
                If Not .Text = vbCr Then
                    .InsertBefore "<sup>"
                    .InsertAfter "</sup>"
                End If

                .Style = myCopy.Styles("Default Paragraph Font")
                .Font.Superscript = False
            End With
        Loop
    End With

End Sub

Private Sub MediaWikiConvertSubscript()

    myCopy.Select

    With Selection.Find
    
        .ClearFormatting
        .Font.Subscript = True
        .Text = ""
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
        .Forward = True

        .Wrap = wdFindContinue

        Do While .Execute
            With Selection
                .Text = Trim(.Text)
                If Len(.Text) > 1 And InStr(1, .Text, vbCr) Then
                    ' Just process the chunk before any newline characters
                    ' We'll pick-up the rest with the next search
                    .Collapse
                    .MoveEndUntil vbCr
                End If

                ' Don't bother to markup newline characters (prevents a loop, as well)
                If Not .Text = vbCr Then
                    .InsertBefore "<sub>"
                    .InsertAfter "</sub>"
                End If

                .Style = myCopy.Styles("Default Paragraph Font")
                .Font.Subscript = False
            End With
        Loop
    End With

End Sub

Private Sub MediaWikiConvertLists()
    Dim para As Paragraph
    For Each para In myCopy.ListParagraphs
        With para.Range
            .InsertBefore " "
            For i = 1 To .ListFormat.ListLevelNumber
                If .ListFormat.ListType = wdListBullet Then
                    .InsertBefore "*"
                Else
                    .InsertBefore "#"
                End If
            Next i
            .ListFormat.RemoveNumbers
        End With
    Next para

End Sub

Private Sub MediaWikiConvertTables()
    Dim thisTable As Table
    For Each thisTable In myCopy.Tables
        With thisTable
            For Each arow In thisTable.Rows
                With arow
                For Each acell In arow.Cells
                    With acell
                        acell.Range.InsertBefore "|"
                    End With
                Next acell
                .Range.InsertAfter vbCrLf + "|-"
                End With
            Next arow
        .Range.InsertBefore "{|" + vbCrLf
        .Range.InsertAfter vbCrLf + "|}"
        .ConvertToText "|"
        End With
    Next thisTable
End Sub

Private Sub CleanFormattingParagraphEndingsBold()
    Selection.Find.ClearFormatting
    With Selection.Find
        .Text = "^p"
        .Font.Bold = True
        .Replacement.ClearFormatting
        .Replacement.Font.Bold = False
        .Replacement.Text = "^p"
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = False
        .MatchWholeWord = True
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub

Private Sub CleanFormattingParagraphEndingsItalic()
    Selection.Find.ClearFormatting
    With Selection.Find
        .Text = "^p"
        .Font.Italic = True
        .Replacement.ClearFormatting
        .Replacement.Font.Italic = False
        .Replacement.Text = "^p"
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = False
        .MatchWholeWord = True
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub

Private Sub FixParagraphEndings()
    
    ReplaceString "^p", "^p^p"
    
    'lists are strange - NO ^p before tolerated
    ReplaceString "^p^p#", "^p#"
    ReplaceString "^p^p*", "^p*"
    
End Sub

Private Sub EmitImagesAndSaveasHTMLandTXT()

   Dim ImageFolder As String
   Dim s As Shape
   
    For Each s In myCopy.Shapes
        If s.Type = msoPicture Then
            s.ConvertToInlineShape
        End If
    Next

    FolderName = filePath + fileName + "_files"

    myCopy.SaveAs fileName:=filePath + fileName + ".htm", _
                  FileFormat:=wdFormatFilteredHTML, LockComments:=False, Password:="", _
                  AddToRecentFiles:=True, WritePassword:="", ReadOnlyRecommended:=False, _
                  EmbedTrueTypeFonts:=False, SaveNativePictureFormat:=False, SaveFormsData _
                  :=False, SaveAsAOCELetter:=False
                  
    Set fs = CreateObject("Scripting.FileSystemObject")
    
    'make a new folder for the images
    Set fs1 = CreateObject("Scripting.FileSystemObject")
    ImageFolder = filePath + fileName + " Images"
    If fs1.FolderExists(ImageFolder) Then
        'do nothing
    Else
        Set f = fs1.CreateFolder(ImageFolder)
    End If
    'ok - image folder is done
    
    Set objFSO = CreateObject("Scripting.FileSystemObject")
    
    Dim imgname As String
    
    If fs.FolderExists(FolderName) Then
        
        Dim fo: Set fo = fs.GetFolder(FolderName)

        Dim iShape As InlineShape
        
        Dim Folder: Set Folder = fo.Files
        
        i = 1
        For Each File In Folder
            If i <= myCopy.InlineShapes.Count Then
                Set iShape = myCopy.InlineShapes.Item(i)
                iShape.Range.InsertBefore "[[Image:" & fileName & File.Name & "|thumb|250px|Title and Description]]"
                i = i + 1
            End If
                    
        Set objFile = objFSO.GetFile(FolderName + "\" + File.Name)
        
        ' Copy the file to its destination
        objFile.Copy (filePath + fileName + " Images\" + fileName + " " + File.Name)
        
        Next
        
        myCopy.SaveAs fileName:=filePath + fileName + ".txt", FileFormat:=wdFormatText
        
        Set filesys = CreateObject("Scripting.FileSystemObject")
        Set htmlfile = filesys.GetFile(filePath + fileName + ".htm")
        htmlfile.Delete
        Set htmlfolder = filesys.GetFolder(FolderName)
        htmlfolder.Delete
        
    End If

End Sub

Other converters

There are many other converters but they are typically more difficult to use and modify. Word2MediaWikiPlus is a very powerful one, albeit complicated. The Word2MediaWikiPlus instructions are here and the package can be downloaded here.

Personal tools