Repurposing data - stripping links
Posted on Mar 23, 2005
I often find that my specific job responsibilities have me working on things that I feel may never be useful to anyone outside of that particular project. I am sure this is not uncommon - but, by the same token, if I have to do it, somebody else out there may have to as well. This is one of those cases. I had to combine and repurpose two existing news applications into a single application - yeah, that sounds like it might be easy but...well, in fact, most of it was not that hard (and I did it using Matt Woodward's OOP presentation to create my first CF OOP app). However, the problems arose (as they often do) when I was importing the prior data into the new tables.
The basis of the problem I am going to walk through here is that the news app will now be a little more logical and fully compatible with creating RSS feeds. In the old data, however, the title actually contained the link the news item referred to, wrapped in an A tag. The content also contained various HTML tags. One complication was that I could not remove HTML that was specific to formatting (e.g. em, strong). Another complication I ran across was that not all the values of the href attribute were direct links - some were javascript function calls, which needed to be retained as is.
Well, let's get to the code...but first, you will need the safetext function by Nathan Dintenfass and Lena Aleksandrova, available at cflib.org (if you want to strip all the HTML rather than just some, swap this for stripHTML by Raymond Camden). You will need to modify the safetext function as follows:
- Change var mode = "escape" to var mode = "strip"
- Add tags like FONT,A,LI,P,U,SPAN to the badTags list (or whatever other tags you may have to remove)
Obviously you also need a query of the values you want to change (called qryHeadlines below), but I will skip those specifics as they should be simple...
<!--- create the regular expression to strip url href's --->
<!--- This regex would get the value of the href properly, but did not work for href="javascript"
<cfsavecontent variable="regex">href=[\"\'](http:\/\/|\.\/|\/)?\w+(\.\w+)*(\/\w+(\.\w+)?)*(\/|\?\w*=\w*(&\w*=\w*)*)?[\"\']</cfsavecontent>--->
<!--- This regex worked but was not locating the end of the href properly
<cfsavecontent variable="regex">href=[\"\']?((?:[^>]|[^\s]|[^"]|[^'])+)[\"\']?</cfsavecontent> --->
<cfsavecontent variable="regex">href[\s]*=[\s]*"[^\n"]*"</cfsavecontent>
<cfoutput>
<cfloop from="1" to="#qryHeadlines.recordCount#" index="i">
<cfset querySetCell(qryHeadlines,"newsID",createUUID(),i)>
<!--- strip out the first occurrence of a url link (we don't care about multiple occurrences)
to find all, you would loop until it doesn't have any more matches --->
<cfset stFindLink = REFindNoCase(regex, qryHeadlines.title[i], 1, true)>
<!--- set the cell equal to the first occurrence without the href and quotes --->
<cfif stFindLink.pos[1] GT 0>
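<!--- the regex above is case-insensitive and allows whitespace around the equals sign, so strip the attribute name and quotes with a regex too --->
<cfset hrefAttr = Mid(qryHeadlines.title[i],stFindLink.pos[1],stFindLink.len[1])>
<cfset querySetCell(qryHeadlines,"url",REReplaceNoCase(hrefAttr,'^href\s*=\s*"|"$',"","all"),i)>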
</cfif>
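<!--- strip unnecessary HTML and characters from the title, then replace chr(172) and non-breaking spaces
(assumed here to be stored as the &nbsp; entity) with standard spaces --->
<cfset cleanTitle = safetext(qryHeadlines.title[i])>
<cfset cleanTitle = replace(cleanTitle, chr(172), " ", "all")>
<cfset cleanTitle = replace(cleanTitle, "&nbsp;", " ", "all")>
<cfset querySetCell(qryHeadlines,"title",trim(cleanTitle),i)>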
</cfloop>
</cfoutput>
<cfdump var="#qryHeadlines#">
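The comment in the loop mentions that finding every link, rather than just the first, means looping until there are no more matches. Here is a minimal sketch of that idea, assuming the same regex as above; sampleTitle, hrefRegex, allLinks and startPos are made-up names for illustration and are not part of the import code.

```cfml
<!--- Hypothetical sample title containing two links (not taken from the real data) --->
<cfset sampleTitle = '<a href="http://example.com/a">First</a> and <a href="javascript:openWin(1)">second</a>'>
<cfset hrefRegex = 'href\s*=\s*"[^\n"]*"'>
<cfset allLinks = arrayNew(1)>
<cfset startPos = 1>

<!--- find the first match, then keep searching from the end of each match --->
<cfset stMatch = REFindNoCase(hrefRegex, sampleTitle, startPos, true)>
<cfloop condition="stMatch.pos[1] GT 0">
    <!--- pull out the whole href attribute, then drop the attribute name and the quotes --->
    <cfset hrefAttr = mid(sampleTitle, stMatch.pos[1], stMatch.len[1])>
    <cfset arrayAppend(allLinks, REReplaceNoCase(hrefAttr, '^href\s*=\s*"|"$', "", "all"))>
    <!--- move the start position past this match so the loop always advances --->
    <cfset startPos = stMatch.pos[1] + stMatch.len[1]>
    <cfset stMatch = REFindNoCase(hrefRegex, sampleTitle, startPos, true)>
</cfloop>

<cfdump var="#allLinks#">
```

Because every match is at least as long as href="", the start position always moves forward and the loop ends once REFindNoCase returns a position of 0. For the sample title, the dump should list both the http link and the javascript call.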